Adriano Fragomeni, Dima Damen, Michael Wray
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags -- automatically extracted from foundation models -- to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts, derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, and so are able to distinguish concepts from one another. We conduct extensive experiments on five diverse datasets: MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across the other two.
Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng
Selective retrieval improves retrieval-augmented generation (RAG) by reducing
distractions from low-quality retrievals and improving efficiency. However,
existing approaches under-utilize the inherent knowledge of large language
models (LLMs), leading to suboptimal retrieval decisions and degraded
generation performance. To bridge this gap, we propose Self-Routing RAG
(SR-RAG), a novel framework that binds selective retrieval with knowledge
verbalization. SR-RAG enables an LLM to dynamically decide between external
retrieval and verbalizing its own parametric knowledge. To this end, we design
a multi-task objective that jointly optimizes an LLM on knowledge source
selection, knowledge verbalization, and response generation. We further
introduce dynamic knowledge source inference via nearest neighbor search to
improve the accuracy of knowledge source decision under domain shifts.
Fine-tuning three LLMs with SR-RAG significantly improves both their response
accuracy and inference latency. Compared to the strongest selective retrieval
baseline, SR-RAG reduces retrievals by 29% while improving the performance by
5.1%.
Authors' comments: Work in Progress
Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu
Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLM for advanced retrieval applications. The full training dataset, code, and models in a wide range of sizes are available at https://github.com/BwLiu01/IDMR.
Enrico Palumbo, Gustavo Penha, Andreas Damianou, José Luis Redondo García, Timothy Christopher Heath, Alice Wang, Hugues Bouchard, Mounia Lalmas
In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. "Can you recommend some old classics for slow dancing?"). In this setup, the recommended tracks are predicted by the LLM in an autoregressive way, i.e. the LLM generates the track titles one token at a time. While intuitive, this approach has several limitations. First, it is based on a general purpose tokenization that is optimized for words rather than for track titles. Second, it necessitates an additional entity resolution layer that matches the track title to the actual track identifier. Third, the number of decoding steps scales linearly with the length of the track title, slowing down inference. In this paper, we propose to address the task of prompt-based music recommendation as a generative retrieval task. Within this setting, we introduce novel, effective, and efficient representations of track identifiers that significantly outperform commonly used strategies. We introduce Text2Tracks, a generative retrieval model that learns a mapping from a user's music recommendation prompt to the relevant track IDs directly. Through an offline evaluation on a dataset of playlists with language inputs, we find that (1) the strategy to create IDs for music tracks is the most important factor for the effectiveness of Text2Tracks, and semantic IDs significantly outperform commonly used strategies that rely on song titles as identifiers, and (2) provided with the right choice of track identifiers, Text2Tracks outperforms sparse and dense retrieval solutions trained to retrieve tracks from language prompts.
Yuqiao Tan, Shizhu He, Huanxuan Liao, Jun Zhao, Kang Liu
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by
retrieving relevant documents from external sources and incorporating them into
the context. While it improves reliability by providing factual texts, it
significantly increases inference costs as context length grows and introduces
the challenging issue of RAG hallucination, primarily caused by the lack of
corresponding parametric knowledge in LLMs. An efficient solution is to enhance
the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by
embedding documents into LLM parameters to perform test-time knowledge
enhancement, effectively reducing inference costs through offline training.
However, its high training and storage costs, along with limited generalization
ability, significantly restrict its practical adoption. To address these
challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that
leverages a lightweight parameter translator model to efficiently convert
documents into parametric knowledge. DyPRAG not only reduces inference,
training, and storage costs but also dynamically generates parametric
knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge
conflicts in a plug-and-play manner at test-time. Extensive experiments on
multiple datasets demonstrate the effectiveness and generalization capabilities
of DyPRAG, offering a powerful and practical RAG paradigm which enables
superior knowledge fusion and mitigates RAG hallucination in real-world
applications. Our code is available at https://github.com/Trae1ounG/DyPRAG.
Authors' comments: preprint. Code is available at https://github.com/Trae1ounG/DyPRAG
Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Li Sun, Yi Xin et al.
Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating episode summaries, SCORE uses a Retrieval-Augmented Generation (RAG) approach, incorporating TF-IDF and cosine similarity to identify related episodes and enhance the overall story structure. Results from testing multiple LLM-generated stories demonstrate that SCORE significantly improves the consistency and stability of narrative coherence compared to baseline GPT models, providing a more robust method for evaluating and refining AI-generated narratives.
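The retrieval step mentioned in the SCORE abstract above (TF-IDF plus cosine similarity over episode summaries) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the names `tfidf_vectors`, `cosine`, and `most_related` are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF, one common variant among several (assumption).
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_related(summaries, query_index, top_k=2):
    """Indices of the top_k summaries most similar to summaries[query_index]."""
    vectors = tfidf_vectors(summaries)
    scored = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]


# Hypothetical episode summaries, invented for this example.
episodes = [
    "the knight finds the silver key",
    "the dragon sleeps in the cave",
    "the knight loses the silver key in the river",
]
related = most_related(episodes, query_index=0, top_k=1)
```

In this toy example the first and third summaries share the terms about the knight and the key, so the third episode is the one retrieved as related to the first.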
Xinyu Wang, Linrui Ma, Jerry Huang, Peng Lu, Prasanna Parthasarathi, Xiao-Wen Chang, Boxing Chen, Yufei Cui
Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce __Resona__, a simple and scalable framework for augmenting linear recurrent models with retrieval. __Resona__ augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that __Resona__-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.
Philippe Jaming
In this note we present several questions about the phase retrieval problem for the Schrödinger equation. Some partial answers are given as well as some of the heuristics behind these questions.
Haoran Luo, Haihong E, Guanting Chen, Yandan Zheng, Xiaobao Wu, Yikai Guo, Qika Lin, Yu Feng et al.
Standard Retrieval-Augmented Generation (RAG) relies on chunk-based
retrieval, whereas GraphRAG advances this approach by graph-based knowledge
representation. However, existing graph-based RAG approaches are constrained by
binary relations, as each edge in an ordinary graph connects only two entities,
limiting their ability to represent the n-ary relations (n >= 2) in real-world
knowledge. In this work, we propose HyperGraphRAG, a novel hypergraph-based RAG
method that represents n-ary relational facts via hyperedges, and consists of
knowledge hypergraph construction, retrieval, and generation. Experiments
across medicine, agriculture, computer science, and law demonstrate that
HyperGraphRAG outperforms both standard RAG and previous graph-based RAG
methods in answer accuracy, retrieval efficiency, and generation quality.
Authors' comments: Preprint
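The contrast drawn in the HyperGraphRAG abstract above, between binary edges and hyperedges that link any number of entities, can be made concrete with a small data-structure sketch. It is not the paper's construction or retrieval pipeline: the `Hyperedge` and `KnowledgeHypergraph` classes, their methods, and the placeholder facts are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """One n-ary fact: a text description plus every entity it connects."""
    description: str
    entities: frozenset


class KnowledgeHypergraph:
    """Toy hypergraph store with an entity -> hyperedge index (illustrative only)."""

    def __init__(self):
        self.edges = []
        self.by_entity = defaultdict(set)  # entity name -> indices into self.edges

    def add_fact(self, description, entities):
        edge = Hyperedge(description, frozenset(entities))
        self.edges.append(edge)
        for entity in edge.entities:
            self.by_entity[entity].add(len(self.edges) - 1)
        return edge

    def facts_about(self, *entities):
        """Return the hyperedges that contain all of the given entities."""
        if not entities:
            return []
        index_sets = [self.by_entity.get(entity, set()) for entity in entities]
        common = set.intersection(*index_sets)
        return [self.edges[i] for i in sorted(common)]


# Placeholder facts, invented for this example.
graph = KnowledgeHypergraph()
graph.add_fact(
    "Drug X at dose D is used for condition Y in adults",
    ["Drug X", "dose D", "condition Y", "adults"],
)
graph.add_fact("Drug X interacts with Drug Z", ["Drug X", "Drug Z"])
adult_facts = graph.facts_about("Drug X", "adults")
```

A single hyperedge keeps the four-entity fact intact. An ordinary graph would have to split it into pairwise edges and lose the information that the dose, condition, and population belong to the same statement.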
Kota Dohi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Effectively searching time-series data is essential for system analysis; however, traditional methods often require domain expertise to define search criteria. Recent advancements have enabled natural language-based search, but these methods struggle to handle differences between time-series data. To address this limitation, we propose a natural language query-based approach for retrieving pairs of time-series data based on differences specified in the query. Specifically, we define six key characteristics of differences, construct a corresponding dataset, and develop a contrastive learning-based model to align differences between time-series data with query texts. Experimental results demonstrate that our model achieves an overall mAP score of 0.994 in retrieving time-series pairs.
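For readers unfamiliar with the mAP metric reported in the abstract above, the following sketch shows one common way to compute mean average precision over ranked retrieval results. The paper's exact evaluation protocol is not given in the abstract, so the definition used here (precision averaged over the ranks of relevant items, then averaged over queries) and the function names are assumptions.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of average_precision over queries; inputs are parallel lists."""
    if not rankings:
        return 0.0
    total = sum(average_precision(r, rel) for r, rel in zip(rankings, relevance))
    return total / len(rankings)


# Two hypothetical queries over hypothetical pair identifiers.
rankings = [["pair_a", "pair_b"], ["pair_b", "pair_a"]]
relevance = [{"pair_a"}, {"pair_a"}]
score = mean_average_precision(rankings, relevance)
```

Here the first query ranks its relevant pair first (average precision 1.0) and the second ranks it second (0.5), giving a mean of 0.75.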
Heejin Kook, Junyoung Kim, Seongmin Park, Jongwuk Lee
Conversational recommender systems (CRSs) are designed to suggest the target
item that the user is likely to prefer through multi-turn conversations. Recent
studies stress that capturing sentiments in user conversations improves
recommendation accuracy. However, they employ a single user representation,
which may fail to distinguish between contrasting user intentions, such as
likes and dislikes, potentially leading to suboptimal performance. To this end,
we propose a novel conversational recommender model, called COntrasting user
pReference expAnsion and Learning (CORAL). Firstly, CORAL extracts the user's
hidden preferences through contrasting preference expansion using the reasoning
capacity of the LLMs. Based on the potential preference, CORAL explicitly
differentiates the contrasting preferences and leverages them into the
recommendation process via preference-aware learning. Extensive experiments
show that CORAL significantly outperforms existing methods in three benchmark
datasets, improving up to 99.72% in Recall@10. The code and datasets are
available at https://github.com/kookeej/CORAL
Authors' comments: NAACL 2025
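The Recall@10 metric cited in the CORAL abstract above can be sketched as follows. This is a generic definition, not the authors' evaluation code; the function name `recall_at_k` and the example item identifiers are hypothetical.

```python
def recall_at_k(ranked_items, target_items, k=10):
    """Fraction of target items that appear among the top-k ranked items."""
    targets = set(target_items)
    if not targets:
        return 0.0
    top_k = set(ranked_items[:k])
    return len(top_k & targets) / len(targets)


# Hypothetical ranking of twelve items for one conversation.
ranking = [f"item_{i}" for i in range(12)]
hit = recall_at_k(ranking, {"item_3"}, k=10)    # target is inside the top 10
miss = recall_at_k(ranking, {"item_11"}, k=10)  # target is ranked 12th
```

With a single target item per conversation, as is common in conversational recommendation benchmarks, Recall@10 reduces to a hit rate: 1.0 if the target is in the top ten, 0.0 otherwise.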
Karanbir Singh, William Ngu
Information retrieval has advanced faster in the last few years than in the decades since the internet's creation. Search engines, like Google, have been the number one way to find relevant data. They have always relied on the user's ability to find the best information among the billions of links and sources at everybody's fingertips. The advent of large language models (LLMs) has completely transformed the field of information retrieval. LLMs excel not only at retrieving relevant knowledge but also at summarizing it effectively, making information more accessible and consumable for users. On top of this, the rise of AI agents has introduced another aspect of information retrieval, namely dynamic information retrieval, which enables the integration of real-time data, such as weather forecasts and financial data, with the knowledge base to curate context-aware knowledge. However, despite these advancements, the agents remain susceptible to issues of bias and fairness, challenges deeply rooted within the knowledge base and training of LLMs. This study introduces a novel approach to bias-aware knowledge retrieval by leveraging an agentic framework and the innovative use of bias detectors as tools to identify and highlight inherent biases in the retrieved content. By empowering users with transparency and awareness, this approach aims to foster more equitable information systems and promote the development of responsible AI.
Yoav Rotman, Luis Welbanks, Michael R. Line, Peter McGill, Michael Radica, Matthew C. Nixon
Atmospheric retrievals are essential tools for interpreting exoplanet
transmission and eclipse spectra, enabling quantitative constraints on the
chemical composition, aerosol properties, and thermal structure of planetary
atmospheres. The James Webb Space Telescope (JWST) offers unprecedented
spectral precision, resolution, and wavelength coverage, unlocking
transformative insights into the formation, evolution, climate, and potential
habitability of planetary systems. However, this opportunity is accompanied by
challenges: modeling assumptions and unaccounted-for noise or signal sources
can bias retrieval outcomes and their interpretation. To address these
limitations, we introduce a Gaussian Process (GP)-aided atmospheric retrieval
framework that flexibly accounts for unmodeled features in exoplanet spectra,
whether global or localized. We validate this method on synthetic JWST
observations and show that GP-aided retrievals reduce bias in inferred
abundances and better capture model-data mismatches than traditional
approaches. We also introduce the concept of mean squared error to quantify the
trade-off between bias and variance, arguing that this metric more accurately
reflects retrieval performance than bias alone. We then reanalyze the
NIRISS/SOSS JWST transmission spectrum of WASP-96 b, finding that GP-aided
retrievals yield broader constraints on CO$_2$ and H$_2$O, alleviating tension
between previous retrieval results and equilibrium predictions. Our GP
framework provides precise and accurate constraints while highlighting regions
where models fail to explain the data. As JWST matures and future facilities
come online, a deeper understanding of the limitations of both data and models
will be essential, and GP-enabled retrievals like the one presented here offer
a principled path forward.
Authors' comments: Submitted to AAS Journals, 25 pages, 13 figures
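The bias-variance trade-off that the abstract above quantifies with mean squared error follows the standard decomposition shown below. The notation is an assumption made for illustration: $\theta$ stands for the true value of a retrieved parameter (for example, a molecular abundance) and $\hat{\theta}$ for its retrieved estimate. The paper may define its metric differently.

```latex
\begin{align}
\mathrm{MSE}(\hat{\theta})
  &= \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right] \\
  &= \underbrace{\left(\mathbb{E}[\hat{\theta}] - \theta\right)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\!\left[\left(\hat{\theta} - \mathbb{E}[\hat{\theta}]\right)^2\right]}_{\text{variance}}
\end{align}
```

Under this reading, a retrieval that widens its posterior (higher variance) can still achieve a lower MSE if doing so removes a large enough bias.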
Sichun Luo, Jian Xu, Xiaojie Zhang, Linrong Wang, Sicong Liu, Hanxu Hou, Linqi Song
Large Language Models (LLMs) have been integrated into recommender systems to
enhance user behavior comprehension. The Retrieval Augmented Generation (RAG)
technique is further incorporated into these systems to retrieve more relevant
items and improve system performance. However, existing RAG methods have two
shortcomings. (i) In the retrieval stage, they rely primarily
on textual semantics and often fail to incorporate the most relevant items,
thus constraining system effectiveness. (ii) In the
generation stage, they lack explicit chain-of-thought reasoning,
further limiting their potential.
In this paper, we propose Representation learning and Reasoning
empowered retrieval-Augmented Large Language model
Recommendation (RALLRec+). Specifically, for the retrieval stage, we
prompt LLMs to generate detailed item descriptions and perform joint
representation learning, combining textual and collaborative signals extracted
from the LLM and recommendation models, respectively. To account for the
time-varying nature of user interests, we propose a simple yet effective
reranking method to capture preference dynamics. For the generation phase, we
first evaluate reasoning LLMs on recommendation tasks, uncovering valuable
insights. Then we introduce a knowledge-injected prompting and consistency-based
merging approach to integrate reasoning LLMs with general-purpose LLMs,
enhancing overall performance. Extensive experiments on three real-world
datasets validate our method's effectiveness.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2502.06101
Fumian Chen, Hui Fang
Information retrieval systems such as open web search and recommendation
systems are ubiquitous and significantly impact how people receive and consume
online information. Previous research has shown the importance of fairness in
information retrieval systems to combat the issue of echo chambers and mitigate
the rich-get-richer effect. Therefore, various fairness-aware information
retrieval methods have been proposed. Score-based fairness-aware information
retrieval algorithms, focusing on statistical parity, are interpretable but
could be mathematically infeasible and lack generalizability. In contrast,
learning-to-rank-based fairness-aware information retrieval algorithms using
fairness-aware loss functions demonstrate strong performance but lack
interpretability. In this study, we propose a novel and interpretable
framework that recursively refines query keywords to retrieve documents from
underrepresented groups and achieve group fairness. Documents retrieved using
refined queries are re-ranked to ensure relevance. Our method not only
shows promising retrieval results regarding relevance and fairness but also
preserves interpretability by showing refined keywords used at each iteration.
Authors' comments: This is a preprint of our paper accepted at ECIR 2025
Yedan Shen, Kaixin Wu, Yuechen Ding, Jingyuan Wen, Hong Liu, Mingjie Zhong, Zhouhan Lin, Jia Xu et al.
Generative retrieval (GR) has revolutionized document retrieval with the
advent of large language models (LLMs), and LLM-based GR is gradually being
adopted by the industry. Despite its remarkable advantages and potential,
LLM-based GR suffers from hallucination and generates documents that are
irrelevant to the query in some instances, severely challenging its credibility
in practical applications. We thereby propose an optimized GR framework
designed to alleviate retrieval hallucination, which integrates knowledge
distillation reasoning in model training and incorporates a decision agent to
further improve retrieval precision. Specifically, we employ LLMs to assess and
reason about GR-retrieved query-document (q-d) pairs, and then distill the reasoning
data as transferred knowledge to the GR model. Moreover, we utilize a decision
agent as post-processing to extend the GR-retrieved documents through a retrieval
model and select the most relevant ones from multiple perspectives as the final
generative retrieval result. Extensive offline experiments on real-world
datasets and online A/B tests on Fund Search and Insurance Search in Alipay
demonstrate our framework's superiority and effectiveness in improving search
quality and conversion gains.
Authors' comments: Accepted by SIGIR 2025
Andrei Niculae, Adrian Cosma, Emilian Radoi
Accurately mapping medical procedure names from healthcare providers to
standardized terminology used by insurance companies is a crucial yet complex
task. Inconsistencies in naming conventions lead to misclassified procedures,
causing administrative inefficiencies and insurance claim problems in private
healthcare settings. Many companies still use human resources for manual
mapping, while there is a clear opportunity for automation. This paper proposes
a retrieval-based architecture leveraging sentence embeddings for medical name
matching in the Romanian healthcare system. This challenge is significantly
more difficult in underrepresented languages such as Romanian, where existing
pretrained language models lack domain-specific adaptation to medical text. We
evaluate multiple embedding models, including Romanian, multilingual, and
medical-domain-specific representations, to identify the most effective
solution for this task. Our findings contribute to the broader field of medical
NLP for low-resource languages such as Romanian.
Authors' comments: Accepted at BIONLP 2025 and Shared Tasks, ACL 2025
Anja Reusch, Yonatan Belinkov
Generative Information Retrieval (GenIR) is a novel paradigm in which a transformer encoder-decoder model predicts document rankings based on a query in an end-to-end fashion. These GenIR models have received significant attention due to their simple retrieval architecture while maintaining high retrieval effectiveness. However, in contrast to established retrieval architectures like cross-encoders or bi-encoders, their internal computations remain largely unknown. Therefore, this work studies the internal retrieval process of GenIR models by applying methods based on mechanistic interpretability, such as patching and vocabulary projections. By replacing the GenIR encoder with one trained on fewer documents, we demonstrate that the decoder is the primary component responsible for successful retrieval. Our patching experiments reveal that not all components in the decoder are crucial for the retrieval process. More specifically, we find that a pass through the decoder can be divided into three stages: (I) the priming stage, which contributes important information for activating subsequent components in later layers; (II) the bridging stage, where cross-attention is primarily active to transfer query information from the encoder to the decoder; and (III) the interaction stage, where predominantly MLPs are active to predict the document identifier. Our findings indicate that interaction between query and document information occurs only in the last stage. We hope our results promote a better understanding of GenIR models and foster future research to overcome the current challenges associated with these models.
Nengbo Wang, Xiaotian Han, Jagdip Singh, Jing Ma, Vipin Chaudhary
Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.
Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images
based on a multimodal query. Typical training data consists of triplets
containing a reference image, a textual description of desired modifications,
and the target image, which are expensive and time-consuming to acquire. The
scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic
triplets or leveraging vision-language models (VLMs) with ubiquitous
web-crawled image-caption pairs. However, these methods have significant
limitations: synthetic triplets suffer from limited scale, lack of diversity,
and unnatural modification text, while image-caption pairs hinder joint
embedding learning of the multimodal query due to the absence of triplet data.
Moreover, existing approaches struggle with complex and nuanced modification
texts that demand sophisticated fusion and understanding of vision and language
modalities. We present CoLLM, a one-stop framework that effectively addresses
these limitations. Our approach generates triplets on-the-fly from
image-caption pairs, enabling supervised training without manual annotation. We
leverage Large Language Models (LLMs) to generate joint embeddings of reference
images and modification texts, facilitating deeper multimodal fusion.
Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset
comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and
Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate
that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks
and settings. MTCIR yields competitive results, with up to 15% performance
improvement. Our refined benchmarks provide more reliable evaluation metrics
for CIR models, contributing to the advancement of this important field.
Authors' comments: CVPR 2025. Project page: https://collm-cvpr25.github.io/