Arnav Yayavaram, Siddharth Yayavaram, Simran Khanuja, Michael Saxon, Graham Neubig
As text-to-image models become increasingly prevalent, ensuring their
equitable performance across diverse cultural contexts is critical. Efforts to
mitigate cross-cultural biases have been hampered by trade-offs, including a
loss in performance, factual inaccuracies, or offensive outputs. Despite
widespread recognition of these challenges, an inability to reliably measure
these biases has stalled progress. To address this gap, we introduce CAIRe, a
novel evaluation metric that assesses the degree of cultural relevance of an
image, given a user-defined set of labels. Our framework grounds entities and
concepts in the image to a knowledge base and uses factual information to give
independent graded judgments for each culture label. On a manually curated
dataset of culturally salient but rare items built using language models, CAIRe
surpasses all baselines by 28% F1 points. Additionally, we construct two
datasets for culturally universal concepts, one comprising T2I-generated
outputs and another retrieved from naturally occurring data. CAIRe achieves
Pearson's correlations of 0.56 and 0.66 with human ratings on these sets, based
on a 5-point Likert scale of cultural relevance. This demonstrates its strong
alignment with human judgment across diverse image sources.
Authors' comments: Preprint, under review
Fan Xu, Luis A. Leiva
Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another modality as input. This challenge can be effectively framed as a feature alignment problem. For example, given a sentence encoded by a language model, retrieve the most semantically aligned image based on features produced by an image encoder, or vice versa. In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks. Our findings indicate that the Wasserstein distance can serve as an informative measure of the modality gap, while cosine similarity consistently outperforms alternative metrics in feature alignment tasks. Furthermore, we observe that conventional architectures such as multilayer perceptrons are insufficient for capturing the complex interactions between image and text representations. Our study offers novel insights and practical considerations for researchers working in multimodal information retrieval, particularly in real-world, cross-modal applications.
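As a minimal sketch of the cosine-similarity retrieval step described above, the following ranks candidate image embeddings against one text embedding. The vectors are made-up toy values and the function names are hypothetical; the paper's actual encoders and alignment procedure are not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def retrieve(query_vec, candidate_vecs):
    """Return the index of the candidate most similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (illustrative only, not taken from any real encoder).
text_query = [1.0, 0.0, 1.0]
image_bank = [
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [2.0, 0.1, 1.9],    # nearly parallel to the query
    [-1.0, 0.0, -1.0],  # opposite direction
]
best = retrieve(text_query, image_bank)
```

Because cosine similarity ignores vector length, the second candidate wins even though its norm differs from the query's.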
Ke Wang, Bo Pan, Yingchaojie Feng, Yuwei Wu, Jieyi Chen, Minfeng Zhu, Wei Chen
Graph-based Retrieval-Augmented Generation (RAG) has shown great capability
in enhancing the answers of Large Language Models (LLMs) with an external knowledge
base. Compared to traditional RAG, it introduces a graph as an intermediate
representation to better capture structured relational knowledge in the corpus,
elevating the precision and comprehensiveness of generation results. However,
developers usually face challenges in analyzing the effectiveness of GraphRAG
on their dataset due to GraphRAG's complex information processing pipeline and
the overwhelming number of LLM invocations involved during graph construction
and querying, which limits GraphRAG's interpretability and accessibility. This
research proposes a visual analysis framework that helps RAG developers
identify critical recalls of GraphRAG and trace these recalls through the
GraphRAG pipeline. Based on this framework, we develop XGraphRAG, a prototype
system incorporating a set of interactive visualizations to facilitate users'
analysis process, streamlining the collection of failure cases and the identification of
improvement opportunities. Our evaluation demonstrates the effectiveness and
usability of our approach. Our work is open-sourced and available at
https://github.com/Gk0Wk/XGraphRAG.
Authors' comments: Accepted to IEEE Pacific Visualization Conference 2025
Animesh Bhandari
Fusion frames are extensively studied due to their effectiveness in
recovering signals from large-scale data. They are applicable in distributed
processing, wireless sensor networks, and packet encoding systems due to their
robustness and redundancy. Motivated by the foundational work of Bemrose et
al.\cite{Be16} and Balan\cite{Ba13}, this paper investigates the theoretical
properties and characterizations of phase retrievable weaving fusion frames.
These frames offer enhanced redundancy and stability in signal reconstruction.
We present key results that deepen the understanding of their structure and
behaviour. Lastly, an application involving probabilistic erasure is explored
to demonstrate their practical utility.
Authors' comments: arXiv admin note: text overlap with arXiv:2409.01288
Abdellah Ghassel, Ian Robinson, Gabriel Tanase, Hal Cooper, Bryan Thompson, Zhen Han, Vassilis N. Ioannidis, Soji Adeshina et al.
Retrieval-Augmented Generation (RAG) grounds large language models in
external evidence, yet it still falters when answers must be pieced together
across semantically distant documents. We close this gap with the Hierarchical
Lexical Graph (HLG), a three-tier index that (i) traces every atomic
proposition to its source, (ii) clusters propositions into latent topics, and
(iii) links entities and relations to expose cross-document paths. On top of
HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG,
which performs fine-grained entity-aware beam search over propositions for
high-precision factoid questions, and TopicGraphRAG, which selects coarse
topics before expanding along entity links to supply broad yet relevant context
for exploratory queries. Additionally, existing benchmarks lack the complexity
required to rigorously evaluate multi-hop summarization systems, often focusing
on single-document queries or limited datasets. To address this, we introduce a
synthetic dataset generation pipeline that curates realistic, multi-document
question-answer pairs, enabling robust evaluation of multi-hop retrieval
systems. Extensive experiments across five datasets demonstrate that our
methods outperform naive chunk-based RAG, achieving an average relative
improvement of 23.1% in retrieval recall and correctness. An open-source Python
library is available at https://github.com/awslabs/graphrag-toolkit.
Authors' comments: KDD '25
Jinbao Zhu, Xiaohu Tang
The problem of $T$-colluding private information retrieval (PIR) enables the
user to retrieve one out of $M$ files from a distributed storage system with
$N$ servers without revealing anything about the index of the desired file to
any group of up to $T$ colluding servers. In the considered storage system, the
$M$ files are stored across the $N$ distributed servers in an $X$-secure
$K$-coded manner such that any group of up to $X$ colluding servers learns
nothing about the files; the storage overhead at each server is reduced by a
factor of $\frac{1}{K}$ compared to the total size of the files; and the files
can be reconstructed from any $K+X$ servers. However, in practical scenarios,
when the user retrieves the desired file from the distributed system, some
servers may respond to the user very slowly or not respond at all. These
servers are referred to as \emph{stragglers}, and particularly their identities
and numbers are unknown in advance and may change over time. This paper
considers the adaptive PIR problem, in which the scheme must tolerate the
presence of a varying number of stragglers. We propose a general coding method
for designing adaptive PIR schemes by introducing the concept of a
\emph{feasible PIR coding framework}. We demonstrate that any \emph{feasible
PIR coding framework} over a finite field $\mathbb{F}_q$ with size $q$ can be
used to construct an adaptive PIR scheme that achieves a retrieval rate of
$1-\frac{K+X+T-1}{N-S}$ simultaneously for all numbers of stragglers $0\leq
S\leq N-(K+X+T)$ over the same finite field. Additionally, we provide an
implementation of the \emph{feasible PIR coding framework}, ensuring that the
adaptive PIR scheme operates over any finite field $\mathbb{F}_q$ with size
$q\geq N+\max\{K, N-(K+X+T-1)\}$.
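To make the stated rate and field-size expressions concrete, here is a small worked example that evaluates $1-\frac{K+X+T-1}{N-S}$ and $N+\max\{K, N-(K+X+T-1)\}$ for one parameter choice. The parameter values and function names are illustrative assumptions, not taken from the paper, and the code only evaluates the formulas; it does not implement a PIR scheme.

```python
from fractions import Fraction


def adaptive_pir_rate(N, K, X, T, S):
    """Retrieval rate 1 - (K+X+T-1)/(N-S) for S stragglers."""
    max_stragglers = N - (K + X + T)
    if not 0 <= S <= max_stragglers:
        raise ValueError("S must satisfy 0 <= S <= N - (K + X + T)")
    return 1 - Fraction(K + X + T - 1, N - S)


def field_size_lower_bound(N, K, X, T):
    """Lower bound N + max{K, N-(K+X+T-1)} on the field size q."""
    return N + max(K, N - (K + X + T - 1))


# Illustrative parameters: 10 servers, K=2, X=1, T=1.
N, K, X, T = 10, 2, 1, 1
rates = {S: adaptive_pir_rate(N, K, X, T, S)
         for S in range(N - (K + X + T) + 1)}
q_bound = field_size_lower_bound(N, K, X, T)
```

With these values the rate falls from 7/10 with no stragglers to 1/4 with six stragglers, and the bound evaluates to 17. Since $q$ is the size of a finite field, it must also be a prime power.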
Authors' comments: Accepted by IEEE TIT
Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang
We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm
that enhances image generation by autoregressively incorporating k-nearest
neighbor retrievals at the patch level. Unlike prior methods that perform a
single, static retrieval before generation and condition the entire generation
on fixed reference images, AR-RAG performs context-aware retrievals at each
generation step, using prior-generated patches as queries to retrieve and
incorporate the most relevant patch-level visual references, enabling the model
to respond to evolving generation needs while avoiding limitations (e.g.,
over-copying, stylistic bias, etc.) prevalent in existing methods. To realize
AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in
Decoding (DAiD), a training-free plug-and-use decoding strategy that directly
merges the distribution of model-predicted patches with the distribution of
retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a
parameter-efficient fine-tuning method that progressively smooths the features
of retrieved patches via multi-scale convolution operations and leverages them
to augment the image generation process. We validate the effectiveness of
AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and
DPG-Bench, demonstrating significant performance gains over state-of-the-art
image generation models.
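The abstract does not specify how DAiD merges the two distributions. As a rough sketch under stated assumptions, the following builds a retrieval distribution from nearest-neighbor distances (softmax over negative distance) and linearly interpolates it with the model's next-patch distribution. The interpolation weight `lam`, the `temperature`, the toy vocabulary of four patch tokens, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_distribution(neighbors, vocab_size, temperature=1.0):
    """Turn (token_id, distance) neighbors into a distribution over tokens.

    Closer neighbors receive more probability mass (assumed weighting).
    """
    weights = softmax([-dist / temperature for _, dist in neighbors])
    probs = [0.0] * vocab_size
    for (token_id, _), w in zip(neighbors, weights):
        probs[token_id] += w
    return probs


def merge(model_probs, retrieved_probs, lam=0.3):
    """Convex combination of model and retrieval distributions (assumed rule)."""
    return [(1 - lam) * p + lam * r
            for p, r in zip(model_probs, retrieved_probs)]


# Toy example: 4 possible patch tokens, 3 retrieved neighbors.
model_probs = [0.5, 0.3, 0.2, 0.0]
neighbors = [(2, 0.1), (2, 0.2), (1, 1.5)]  # (token_id, distance)
merged = merge(model_probs, retrieval_distribution(neighbors, 4))
```

Because both inputs are valid distributions and the combination is convex, the result sums to one without renormalization. No training is involved, which matches the training-free property the abstract claims.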
Authors' comments: Image Generation, Retrieval Augmented Generation
David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Online video web content is richly multimodal: a single video blends vision,
speech, ambient audio, and on-screen text. Retrieval systems typically treat
these modalities as independent retrieval sources, which can lead to noisy and
subpar retrieval. We explore multimodal video content retrieval, where
relevance can be scored from one particular modality or jointly across multiple
modalities simultaneously. Consequently, an effective retriever must
dynamically choose which modality (or set of modalities) best addresses the
query. We introduce CLaMR, a multimodal, late-interaction retriever that
jointly indexes 4 modalities: video frames, transcribed speech, on-screen text,
and metadata. CLaMR jointly encodes all modalities with a unified multimodal
backbone for improved contextualization and is trained to enhance dynamic
modality selection via two key innovations. First, given the lack of training
data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale
synthetic training dataset built on MultiVENT 2.0 (event-centric videos in
various languages paired with queries) with modality-targeted queries. Next, we
propose a modality-aware loss that jointly trains according to a standard
contrastive objective alongside an objective for learning correct modality
usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation
strategies, such as averaging similarities for baseline retrievers, degrade
performance by introducing noise from irrelevant modalities. In contrast, CLaMR
consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR
improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4
over the best multi-modality retriever. We illustrate CLaMR's downstream
utility on long-video QA, retrieving relevant frames and obtaining a 3.50%
boost over LanguageBind on Video-MME and 1.42% over dense sampling on
LongVideoBench.
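For readers unfamiliar with the nDCG@10 metric reported above, here is a minimal sketch of how it is commonly computed from graded relevance labels. It assumes the ranked list contains every judged item for the query, so the ideal ordering can be derived from the list itself; the function names are hypothetical and this is not the paper's evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items: score defined as 0 in this sketch
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels of retrieved videos, in the order the retriever ranked them.
perfect = ndcg_at_k([1, 0, 0])
second_place = ndcg_at_k([0, 1])
```

A relevant item at rank 1 gives a score of 1.0, while the same item at rank 2 is discounted by a factor of 1/log2(3).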
Authors' comments: 18 pages. Code and data: https://github.com/meetdavidwan/clamr
Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.
Jing Liu, Haiye Huo
Classical phase retrieval refers to the recovery of an unknown signal
from its Fourier magnitudes, which is widely used in fields such as quantum
mechanics, signal processing, optics, etc. The offset linear canonical
transform (OLCT) is a more general type of linear integral transform that
includes the Fourier transform (FT), fractional Fourier transform (FrFT), and
linear canonical transform (LCT) as special cases. Hence, in this paper, we
focus on the uniqueness problem of phase retrieval in the framework of OLCT.
First, we prove that all the nontrivial ambiguities in continuous OLCT phase
retrieval can be represented by convolution operators, and demonstrate that a
continuous compactly supported signal can be uniquely determined up to a global
phase from its multiple magnitude-only OLCT measurements. Moreover, we
investigate the nontrivial ambiguities in the discrete OLCT phase retrieval
case. Furthermore, we demonstrate that a nonseparable function can be uniquely
recovered from its magnitudes of short-time OLCT (STOLCT) up to a global phase.
Finally, we show that signals which are bandlimited in the FT or OLCT domain can be
reconstructed from their sampled STOLCT magnitude measurements, up to a global
phase, provided the ambiguity function of the window function satisfies some mild
conditions.
Authors' comments: 21 pages
Mengxi Xiao, Mang Ye, Ben Liu, Xiaofen Zong, He Li, Jimin Huang, Qianqian Xie, Min Peng
The application of AI in psychiatric diagnosis faces significant challenges,
including the subjective nature of mental health assessments, symptom overlap
across disorders, and privacy constraints limiting data availability. To
address these issues, we present MoodAngels, the first specialized multi-agent
framework for mood disorder diagnosis. Our approach combines granular-scale
analysis of clinical assessments with a structured verification process,
enabling more accurate interpretation of complex psychiatric data.
Complementing this framework, we introduce MoodSyn, an open-source dataset of
1,173 synthetic psychiatric cases that preserves clinical validity while
ensuring patient privacy. Experimental results demonstrate that MoodAngels
outperforms conventional methods, with our baseline agent achieving 12.3%
higher accuracy than GPT-4o on real-world cases, and our full multi-agent
system delivering further improvements. Evaluation of the MoodSyn dataset
demonstrates exceptional fidelity, accurately reproducing both the core
statistical patterns and complex relationships present in the original data
while maintaining strong utility for machine learning applications. Together,
these contributions provide both an advanced diagnostic tool and a critical
research resource for computational psychiatry, bridging important gaps in
AI-assisted mental health assessment.
Authors' comments: 40 pages, 11 figures
Xiwei Xu, Hans Weytjens, Dawen Zhang, Qinghua Lu, Ingo Weber, Liming Zhu
Recent studies show that 60% of LLM-based compound systems in enterprise environments leverage some form of retrieval-augmented generation (RAG), which enhances the relevance and accuracy of LLM (or other genAI) outputs by retrieving relevant information from external data sources. LLMOps involves the practices and techniques for managing the lifecycle and operations of LLM compound systems in production environments. It supports enhancing LLM systems through continuous operations and feedback evaluation. RAGOps extends LLMOps by incorporating a strong focus on data management to address the continuous changes in external data sources. This necessitates automated methods for evaluating and testing data operations, enhancing retrieval relevance and generation quality. In this paper, we (1) characterize the generic architecture of RAG applications based on the 4+1 model view for describing software architectures, (2) outline the lifecycle of RAG systems, which integrates the management lifecycles of both the LLM and the data, (3) define the key design considerations of RAGOps across different stages of the RAG lifecycle and quality trade-off analyses, (4) highlight the overarching research challenges around RAGOps, and (5) present two use cases of RAG applications and the corresponding RAGOps considerations.
Katherine Thai, Mohit Iyyer
How well do modern long-context language models understand literary fiction?
We explore this question via the task of literary evidence retrieval,
repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark
where the entire text of a primary source (e.g., The Great Gatsby) is provided
to an LLM alongside literary criticism with a missing quotation from that work.
This setting, in which the model must generate the missing quotation, mirrors
the human process of literary analysis by requiring models to perform both
global narrative reasoning and close textual examination. We curate a
high-quality subset of 292 examples through extensive filtering and human
verification. Our experiments show that recent reasoning models, such as Gemini
Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In
contrast, the best open-weight model achieves only 29.1% accuracy, highlighting
a wide gap in interpretive reasoning between open and closed-weight models.
Despite their speed and apparent accuracy, even the strongest models struggle
with nuanced literary signals and overgeneration, signaling open challenges for
applying LLMs to literary analysis. We release our dataset and evaluation code
to encourage future work in this direction.
Authors' comments: ACL 2025
Yuxuan Wu, Le Wang, Sanping Zhou, Mengnan Liu, Gang Hua, Haoxiang Li
Controllable layout generation aims to create plausible visual arrangements
of element bounding boxes within a graphic design according to certain optional
constraints, such as the type or position of a specific component. While recent
diffusion or flow-matching models have achieved considerable advances in
multifarious conditional generation tasks, there remains considerable room for
generating optimal arrangements under given conditions. In this work, we
propose to carry out layout generation through retrieving by conditions and
reference-guided generation. Specifically, we retrieve appropriate layout
templates according to given conditions as references. The references are then
utilized to guide the denoising or flow-based transport process. By retrieving
layouts compatible with the given conditions, we can uncover the potential
information not explicitly provided in the given condition. Such an approach
offers more effective guidance to the model during the generation process, in
contrast to previous models that feed the condition to the model and let the
model infer the unprovided layout attributes directly. Meanwhile, we design a
condition-modulated attention that selectively absorbs retrieval knowledge,
adapting to the difference between retrieved templates and given conditions.
Extensive experimental results show that our method successfully produces
high-quality layouts that meet the given conditions and outperforms existing
state-of-the-art models. Code will be released upon acceptance.
Authors' comments: 12 pages, 5 figures
Yingying Zhuang, Aman Gupta, Anurag Beniwal
Multilingual information retrieval has emerged as a powerful tool for
expanding knowledge sharing across languages. However, high-quality
knowledge bases are often scarce and available in only a few languages.
An effective embedding model that maps sentences from different
languages into the same feature vector space as the knowledge base language
therefore becomes the key ingredient for cross-language knowledge sharing, especially for
transferring knowledge available in high-resource languages to low-resource ones.
In this paper we propose a novel strategy to fine-tune multilingual embedding
models with weighted sampling for contrastive learning, enabling multilingual
information retrieval with a monolingual knowledge base. We demonstrate that
the weighted sampling strategy produces performance gains compared to standard
ones by up to 31.03% in MRR and up to 33.98% in Recall@3. Additionally, our
proposed methodology is language agnostic and applicable to both multilingual
and code-switching use cases.
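The two metrics reported above, MRR and Recall@3, can be sketched as follows. The query and document IDs are invented for illustration and the function names are hypothetical; this is the common textbook definition of each metric, not the authors' evaluation script.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Two toy queries against a monolingual knowledge base (made-up IDs).
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d9", "d4"], {"d2", "d5"}),  # first relevant hit at rank 1
]
mrr = mean_reciprocal_rank(runs)
recall_q2 = recall_at_k(*runs[1], k=3)
```

Here the first query contributes 1/2 and the second contributes 1, so the MRR is 0.75. The second query retrieves one of its two relevant documents in the top 3, giving Recall@3 of 0.5.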
Authors' comments: 6 pages, accepted at GENNEXT@SIGIR25
Cristian-Ioan Blaga, Paul Suganthan, Sahil Dua, Krishna Srinivasan, Enrique Alfonseca, Peter Dornbach, Tom Duerig, Imed Zitouni et al.
Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: https://github.com/google-research-datasets/wit-retrieval.
Daniel Szelogowski
Despite substantial research into the biological basis of memory, the precise
mechanisms by which experiences are encoded, stored, and retrieved in the brain
remain incompletely understood. A growing body of evidence supports the engram
theory, which posits that sparse populations of neurons undergo lasting
physical and biochemical changes to support long-term memory. Yet, a
comprehensive computational framework that integrates biological findings with
mechanistic models remains elusive. This work synthesizes insights from
cellular neuroscience and computational modeling to address key challenges in
engram research: how engram neurons are identified and manipulated; how
synaptic plasticity mechanisms contribute to stable memory traces; and how
sparsity promotes efficient, interference-resistant representations. Relevant
computational approaches -- such as sparse regularization, engram gating, and
biologically inspired architectures like Sparse Distributed Memory and spiking
neural networks -- are also examined. Together, these findings suggest that
memory efficiency, capacity, and stability emerge from the interaction of
plasticity and sparsity constraints. By integrating neurobiological and
computational perspectives, this paper provides a comprehensive theoretical
foundation for engram research and proposes a roadmap for future inquiry into
the mechanisms underlying memory, with implications for the diagnosis and
treatment of memory-related disorders.
Authors' comments: 18 pages, 7 figures, 3 tables
Chihiro Maru, Shoetsu Sato
Inspired by the success of large language models (LLMs) in natural language processing, recent research has explored the building of time series foundation models and applied them to tasks such as forecasting, classification, and anomaly detection. However, their performance varies between different domains and tasks. In LLM-based approaches, test-time adaptation using example-based prompting has become common, owing to the high cost of retraining. In the context of anomaly detection, which is the focus of this study, providing normal examples from the target domain can also be effective. However, time series foundation models do not naturally acquire the ability to interpret or utilize examples or instructions, because the nature of time series data used during training does not encourage such capabilities. To address this limitation, we propose a retrieval augmented time series foundation model (RATFM), which enables pretrained time series foundation models to incorporate examples for test-time adaptation. We show that RATFM achieves performance comparable to that of in-domain fine-tuning while avoiding domain-dependent fine-tuning. Experiments on the UCR Anomaly Archive, a multi-domain dataset including nine domains, confirm the effectiveness of the proposed approach.
Mojtaba Nayyeri, Athish A Yogi, Nadeen Fathallah, Ratan Bahadur Thapa, Hans-Michael Tautenhahn, Anton Schnurpel, Steffen Staab
Transforming relational databases into knowledge graphs with enriched
ontologies enhances semantic interoperability and unlocks advanced graph-based
learning and reasoning over data. However, previous approaches either demand
significant manual effort to derive an ontology from a database schema or
produce only a basic ontology. We present RIGOR, Retrieval-augmented Iterative
Generation of RDB Ontologies, an LLM-driven approach that turns relational
schemas into rich OWL ontologies with minimal human effort. RIGOR combines
three sources via RAG (the database schema and its documentation, a repository
of domain ontologies, and a growing core ontology) to prompt a generative LLM
to produce successive, provenance-tagged delta ontology fragments. Each
fragment is refined by a judge-LLM before being merged into the core ontology,
and the process iterates table-by-table following foreign key constraints until
coverage is complete. Applied to real-world databases, our approach outputs
ontologies that score highly on standard quality dimensions such as accuracy,
completeness, conciseness, adaptability, clarity, and consistency, while
substantially reducing manual effort.
Authors' comments: Under review
Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng
Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms. Code: https://github.com/fudan-birlab/LSHN.
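For background on the attractor-based recall the abstract builds on, here is a sketch of a classical discrete Hopfield network with Hebbian weights. This is the traditional model that the abstract contrasts LSHN against, not LSHN itself: there is no encoder, decoder, continuous dynamics, or gradient-based training. The patterns and function names are illustrative assumptions.

```python
def train_hebbian(patterns):
    """Build a symmetric weight matrix from +1/-1 patterns (Hebbian rule)."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, state, max_sweeps=10):
    """Asynchronously update units until the state stops changing."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break  # reached a fixed point (an attractor)
    return state


# Two orthogonal toy memories of 8 units each.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hebbian(stored)
corrupted = [-1, 1, 1, 1, -1, -1, -1, -1]  # first memory with one unit flipped
restored = recall(weights, corrupted)
```

The corrupted input converges back to the first stored pattern, which is the associative-recall behavior that attractor dynamics provide.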