Peilin Wu, Xinlu Zhang, Wenhao Yu, Xingyu Liu, Xinya Du, Zhiyu Zoey Chen
Recent advancements in Retrieval-Augmented Language Models (RALMs) have
demonstrated their efficacy in knowledge-intensive tasks. However, existing
evaluation benchmarks often assume a single optimal approach to leveraging
retrieved information, failing to account for varying user needs. This paper
introduces a novel evaluation framework that systematically assesses RALMs
under three user need cases (Context-Exclusive, Context-First, and
Memory-First) across three distinct context settings: Context Matching,
Knowledge Conflict, and Information Irrelevant. By varying both user
instructions and the nature of retrieved information, our approach captures the
complexities of real-world applications where models must adapt to diverse user
requirements. Through extensive experiments on multiple QA datasets, including
HotpotQA, DisentQA, and our newly constructed synthetic URAQ dataset, we find
that restricting memory usage improves robustness in adversarial retrieval
conditions but decreases peak performance with ideal retrieval results, and
that model family dominates behavioral differences. Our findings highlight the
necessity of user-centric evaluations in the development of retrieval-augmented
systems and provide insights into optimizing model performance across varied
retrieval contexts. We will release our code and URAQ dataset upon acceptance
of the paper.
Authors' comments: Updated the motivation, data selection and creation process, and
terminology of some keywords for better writing. The updates are mainly in the
introduction and experiment sections.
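As a rough illustration of the evaluation design described in the abstract above, the three user need cases can be crossed with the three context settings to enumerate every condition a model is tested under. This is a minimal sketch: the case and setting names come from the abstract, while the function name `evaluation_grid` is an assumption made for illustration and not part of the authors' code.

```python
from itertools import product

# Names taken from the abstract; everything else here is illustrative.
USER_NEEDS = ["Context-Exclusive", "Context-First", "Memory-First"]
CONTEXT_SETTINGS = ["Context Matching", "Knowledge Conflict", "Information Irrelevant"]


def evaluation_grid():
    """Return every (user need, context setting) pair to evaluate."""
    return list(product(USER_NEEDS, CONTEXT_SETTINGS))


grid = evaluation_grid()
for need, setting in grid:
    print(f"{need:18s} x {setting}")
```

Each of the nine pairs would correspond to one combination of user instruction and retrieved-context type.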
Tongfei Chen, Ankita Sharma, Adam Pauls, Benjamin Van Durme
Generative retrieval employs sequence models for conditional generation of document IDs based on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this has led to improved performance in zero-shot retrieval, it remains a challenge to support documents not seen during training. We identify that the performance of generative retrieval lies in contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE achieves results superior to those of generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing the easy addition and removal of documents in the index.
Feibo Jiang, Wanyun Zhu, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan, Octavia A. Dobre
Large Language Models (LLMs) possess human-level cognitive and decision-making capabilities, making them a key technology for 6G. However, applying LLMs to the communication domain faces three major challenges: 1) Inadequate communication data; 2) Restricted input modalities; and 3) Difficulty in knowledge retrieval. To overcome these issues, we propose CommGPT, a multimodal foundation model designed specifically for communications. First, we create high-quality pretraining and fine-tuning datasets tailored to communication, enabling the LLM to engage in further pretraining and fine-tuning with communication concepts and knowledge. Then, we design a multimodal encoder to understand and process information from various input modalities. Next, we construct a Graph and Retrieval-Augmented Generation (GRG) framework, efficiently coupling Knowledge Graph (KG) with Retrieval-Augmented Generation (RAG) for multi-scale learning. Finally, we demonstrate the feasibility and effectiveness of CommGPT through experimental validation.
Sam Pastoriza, Iman Yousfi, Christopher Redino, Marc Vucovich, Abdul Rahman, Sal Aguinaga, Dhruv Nandakumar
We propose a novel mechanism for real-time (human-in-the-loop) feedback
focused on false positive reduction to enhance anomaly detection models. It was
designed for the lightweight deployment of a behavioral network anomaly
detection model. This methodology can be easily integrated into similar domains
that require a premium on throughput while maintaining high precision. In this
paper, we introduce Retrieval Augmented Anomaly Detection, a novel method
taking inspiration from Retrieval Augmented Generation. Human-annotated
examples are sent to a vector store, which can modify model outputs on the very
next processed batch for model inference. To demonstrate the generalization of
this technique, we benchmarked several different model architectures and
multiple data modalities, including images, text, and graph-based data.
Authors' comments: 6 pages, 3 figures, 2 tables, accepted at ISDFS 2025
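The feedback loop described in the abstract above (human-annotated examples go to a vector store and can change outputs on the next batch) can be sketched as follows. This is only an illustration under stated assumptions: the class `FeedbackStore`, the cosine-similarity rule, and the 0.95 threshold are hypothetical choices, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FeedbackStore:
    """Hypothetical in-memory vector store of analyst-labelled false positives."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed similarity cut-off
        self.false_positives = []

    def add_false_positive(self, embedding):
        self.false_positives.append(list(embedding))

    def is_known_false_positive(self, embedding):
        return any(cosine(embedding, fp) >= self.threshold
                   for fp in self.false_positives)


def apply_feedback(batch, store):
    """batch: list of (embedding, model_flag) pairs. Returns adjusted flags."""
    return [flag and not store.is_known_false_positive(emb)
            for emb, flag in batch]


store = FeedbackStore(threshold=0.95)
batch_1 = [([1.0, 0.0], True), ([0.0, 1.0], True)]
flags_1 = apply_feedback(batch_1, store)   # nothing stored yet
store.add_false_positive([1.0, 0.0])       # analyst marks the first alert as benign
batch_2 = [([0.99, 0.01], True), ([0.0, 1.0], True)]
flags_2 = apply_feedback(batch_2, store)   # the near-duplicate alert is suppressed
print(flags_1, flags_2)
```

The underlying detector is left unchanged in this sketch; only its flags are post-filtered, which is one way the annotation could take effect on the very next batch without retraining.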
Jiaxing Li, Lin Jiang, Zeqi Ma, Kaihang Jiang, Xiaozhao Fang, Jie Wen
Deep online cross-modal hashing has gained much attention from researchers
recently, owing to its promising applications with low storage requirements, fast
retrieval efficiency, and cross-modality adaptivity. However, there still
exist some technical hurdles that hinder its applications, e.g., 1) how to
extract the coexistent semantic relevance of cross-modal data, 2) how to
achieve competitive performance when handling the real time data streams, 3)
how to transfer the knowledge learned from offline to online training in a
lightweight manner. To address these problems, this paper proposes
lightweight contrastive distilled hashing (LCDH) for cross-modal retrieval,
which bridges offline and online cross-modal hashing through similarity
matrix approximation in a knowledge distillation framework. Specifically, in
the teacher network, LCDH first extracts the cross-modal features by the
contrastive language-image pre-training (CLIP), which are further fed into an
attention module for representation enhancement after feature fusion. Then, the
output of the attention module is fed into an FC layer to obtain hash codes for
aligning the sizes of similarity matrices for online and offline training. In
the student network, LCDH extracts the visual and textual features by
lightweight models, and then the features are fed into an FC layer to generate
binary codes. Finally, by approximating the similarity matrices, the
performance of online hashing in the lightweight student network can be
enhanced by the supervision of coexistent semantic relevance that is distilled
from the teacher network. Experimental results on three widely used datasets
demonstrate that LCDH outperforms some state-of-the-art methods.
Authors' comments: Accepted by AAAI 2025
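The idea of distilling by approximating similarity matrices, as described in the abstract above, can be illustrated with a small sketch. It assumes that each network yields sign-binarized codes and that the loss is a mean squared difference between teacher and student cross-modal similarity matrices; the abstract does not specify these details, and all function names are hypothetical. Note that the two matrices have the same shape even when teacher and student code lengths differ.

```python
def sign(x):
    """Map a real value to a binary code bit in {-1, +1}."""
    return 1 if x >= 0 else -1


def binarize(features):
    """Turn real-valued feature rows into binary hash codes."""
    return [[sign(v) for v in row] for row in features]


def similarity_matrix(image_codes, text_codes):
    """Cross-modal similarity: inner product of codes scaled to [-1, 1]."""
    k = len(image_codes[0])
    return [[sum(a * b for a, b in zip(img, txt)) / k for txt in text_codes]
            for img in image_codes]


def approximation_loss(teacher_sim, student_sim):
    """Mean squared difference between two same-shaped similarity matrices."""
    n = len(teacher_sim) * len(teacher_sim[0])
    return sum((t - s) ** 2
               for t_row, s_row in zip(teacher_sim, student_sim)
               for t, s in zip(t_row, s_row)) / n


# Toy batch of two image-text pairs: 4-bit teacher codes, 2-bit student codes.
teacher_sim = similarity_matrix(
    binarize([[0.9, -0.2, 0.4, -0.7], [-0.5, 0.3, -0.8, 0.6]]),
    binarize([[0.8, -0.1, 0.5, -0.9], [-0.4, 0.2, -0.6, 0.7]]),
)
student_sim = similarity_matrix(
    binarize([[0.3, -0.6], [-0.2, 0.1]]),
    binarize([[0.5, -0.4], [0.7, 0.9]]),
)
loss = approximation_loss(teacher_sim, student_sim)
print(teacher_sim, student_sim, loss)
```

In actual training, a differentiable relaxation of the sign function would be needed; the hard binarization here is only to keep the example readable.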
Nayoung Choi, Grace Byun, Andrew Chung, Ellie S. Paek, Shinsun Lee, Jinho D. Choi
Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address these, we propose a RAG-QA framework for internal enterprise use, consisting of: (1) a data pipeline that converts raw multi-modal documents into a structured corpus and QA pairs, (2) a fully on-premise, privacy-preserving architecture, and (3) a lightweight reference matcher that links answer segments to supporting content. Applied to the automotive domain, our system improves factual correctness (+1.79, +1.94), informativeness (+1.33, +1.16), and helpfulness (+1.08, +1.67) over a non-RAG baseline, based on 1-5 scale ratings from both human and LLM judges.
Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, Benjamin Van Durme
We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (e.g., OpenAI's o1, DeepSeek's R1) for distillation in order to rapidly improve the performance of a smaller model. We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO. Models trained on this dataset: (1) achieve state-of-the-art performance on advanced reasoning and instruction following datasets; (2) work remarkably well out of distribution due to the ability to respond to user-input prompts; and (3) have explainable reasoning chains that can be given to users or RAG-based systems. Further, we demonstrate that quantized versions of these models retain strong performance while using less compute/memory. Overall, Rank1 shows that test-time compute allows for a fundamentally new type of explainable and performant reranker model for search.
Haoyang Wen, Jiang Guo, Yi Zhang, Jiarong Jiang, Zhiguo Wang
This paper investigates synthetic data generation strategies in developing generative retrieval models for domain-specific corpora, thereby addressing the scalability challenges inherent in manually annotating in-domain queries. We study the data strategies for a two-stage training framework: in the first stage, which focuses on learning to decode document identifiers from queries, we investigate LLM-generated queries across multiple granularities (e.g., chunks, sentences) and domain-relevant search constraints that can better capture nuanced relevancy signals. In the second stage, which aims to refine document ranking through preference learning, we explore the strategies for mining hard negatives based on the initial model's predictions. Experiments on public datasets over diverse domains demonstrate the effectiveness of our synthetic data generation and hard negative sampling approach.
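The second-stage idea in the abstract above, mining hard negatives from the initial model's own predictions, can be sketched roughly as follows. The assumption here is that the highest-ranked incorrect document identifiers serve as hard negatives and are paired with the gold identifier for preference learning; the function names, the record layout, and the example identifiers are hypothetical.

```python
def mine_hard_negatives(ranked_doc_ids, gold_doc_id, num_negatives=2):
    """Take the highest-ranked wrong predictions as hard negatives."""
    negatives = [doc_id for doc_id in ranked_doc_ids if doc_id != gold_doc_id]
    return negatives[:num_negatives]


def build_preference_pairs(query, ranked_doc_ids, gold_doc_id, num_negatives=2):
    """Pair the gold document (chosen) with each mined negative (rejected)."""
    return [
        {"query": query, "chosen": gold_doc_id, "rejected": negative}
        for negative in mine_hard_negatives(ranked_doc_ids, gold_doc_id, num_negatives)
    ]


# Hypothetical ranking produced by a first-stage model for one synthetic query.
pairs = build_preference_pairs(
    query="return policy for opened items",
    ranked_doc_ids=["d7", "d3", "d9", "d1"],
    gold_doc_id="d3",
)
print(pairs)
```

Documents the model already ranks highly but that are wrong tend to be more informative negatives than randomly sampled ones, which is the usual motivation for this kind of mining.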
Anna Lueber, Konstantin Karchev, Chloe Fisher, Matthias Heim, Roberto Trotta, Kevin Heng
In the era of the James Webb Space Telescope (JWST), the dramatic improvement
in the spectra of exoplanetary atmospheres demands a corresponding leap forward
in our ability to analyze them: atmospheric retrievals need to be performed on
thousands of spectra, applying to each large ensembles of models (that explore
atmospheric chemistry, thermal profiles and cloud models) to identify the best
one(s). In this limit, traditional Bayesian inference methods such as nested
sampling become prohibitively expensive. We introduce FASTER (Fast Amortized
Simulation-based Transiting Exoplanet Retrieval), a neural-network based method
for performing atmospheric retrieval and Bayesian model comparison at a
fraction of the computational cost of classical techniques. We demonstrate that
the marginal posterior distributions of all parameters within a model as well
as the posterior probabilities of the models we consider match those computed
using nested sampling both on mock spectra, and for the real NIRSpec PRISM
spectrum of WASP-39b. The true power of the FASTER framework comes from its
amortized nature, which allows the trained networks to perform practically
instantaneous Bayesian inference and model comparison over ensembles of spectra
-- real or simulated -- at minimal additional computational cost. This offers
valuable insight into the expected results of model comparison (e.g.,
distinguishing cloudy from cloud-free and isothermal from non-isothermal
models), as well as their dependence on the underlying parameters, which is
computationally unfeasible with nested sampling. This approach will constitute
as large a leap in spectral analysis as the original retrieval methods based on
Markov Chain Monte Carlo have proven to be.
Authors' comments: 15 pages, 7 figures, 1 table. Accepted by Astrophysical Journal
Letters
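For readers unfamiliar with Bayesian model comparison as mentioned in the abstract above, the posterior probability of each model follows from its evidence and prior by normalization. The sketch below shows only that standard calculation; it is not the FASTER method, which estimates such quantities with neural networks. The function name and the numerical log-evidence values are made up for illustration.

```python
import math


def posterior_model_probabilities(log_evidences, log_priors=None):
    """P(M_i | data) is proportional to evidence_i * prior_i, in log space."""
    if log_priors is None:
        log_priors = [0.0] * len(log_evidences)  # equal prior weight per model
    scores = [le + lp for le, lp in zip(log_evidences, log_priors)]
    max_score = max(scores)                      # subtract the max for stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values only: a "cloudy" model with three times the evidence
# of a "cloud-free" model.
log_z_cloudy = -1000.0
log_z_cloud_free = -1000.0 - math.log(3.0)
probs = posterior_model_probabilities([log_z_cloudy, log_z_cloud_free])
print(probs)
```

Working in log space matters because evidences for real spectra are typically far too small to exponentiate directly.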
Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Yu Gu, Ge Yu et al.
This paper introduces Multi-Modal Retrieval-Augmented Generation (M^2RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs) in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as input context for RAG modeling. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT), an instruction tuning method that optimizes MLLMs within multi-modal contexts. Our experiments show that MM-RAIT improves the performance of RAG systems by enabling them to effectively learn from multi-modal contexts. All data and code are available at https://github.com/NEUIR/M2RAG.
Haris Riaz, Ellen Riloff, Mihai Surdeanu
We propose a simple, unsupervised method that injects pragmatic principles in
retrieval-augmented generation (RAG) frameworks such as Dense Passage Retrieval
to enhance the utility of retrieved contexts. Our approach first identifies
which sentences in a pool of documents retrieved by RAG are most relevant to
the question at hand, cover all the topics addressed in the input question and
no more, and then highlights these sentences within their context, before they
are provided to the LLM, without truncating or altering the context in any
other way. We show that this simple idea brings consistent improvements in
experiments on three question answering tasks (ARC-Challenge, PubHealth and
PopQA) using five different LLMs. It notably enhances relative accuracy by up
to 19.7% on PubHealth and 10% on ARC-Challenge compared to a conventional RAG
system.
Authors' comments: 16 pages, 2 figures, 8 tables. Preprint
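The highlighting step described in the abstract above (mark the relevant sentences while leaving the rest of the context untouched) might look roughly like the sketch below. The sentence selection here is a plain word-overlap heuristic used as a stand-in; it is not the authors' pragmatics-based selection. The marker strings, the threshold, and the function names are assumptions.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def highlight_relevant(question, context, min_overlap=2, markers=("<<", ">>")):
    """Wrap sentences sharing enough words with the question in markers.

    No sentence is dropped or reworded; only markers are added.
    """
    question_tokens = tokenize(question)
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    marked = []
    for sentence in sentences:
        if len(question_tokens & tokenize(sentence)) >= min_overlap:
            marked.append(f"{markers[0]}{sentence}{markers[1]}")
        else:
            marked.append(sentence)
    return " ".join(marked)


context = ("The Moon orbits the Earth. "
           "Ocean tides are caused mainly by the Moon's gravity. "
           "Many coastal towns rely on fishing.")
result = highlight_relevant("What causes ocean tides?", context)
print(result)
```

Removing the markers from the result returns the original context, which mirrors the abstract's point that the context is not truncated or otherwise altered.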
Tianhui Zhang, Yi Zhou, Danushka Bollegala
Retrieval Augmented Generation (RAG) has gained popularity as a method for
conveniently incorporating novel facts that were not seen during the
pre-training stage in Large Language Model (LLM)-based Natural Language
Generation (NLG) systems. However, LLMs are known to encode significant levels
of unfair social biases. The modulation of these biases by RAG in NLG systems
is not well understood. In this paper, we systematically study the relationship
between the different components of a RAG system and the social biases
presented in the text generated across three languages (i.e. English, Japanese
and Chinese) and four social bias types (i.e. gender, race, age and religion).
Specifically, using the Bias Question Answering (BBQ) benchmark datasets, we
evaluate the social biases in RAG responses from document collections with
varying levels of stereotypical biases, employing multiple LLMs as
generators. We find that the biases in document collections are often amplified
in the generated responses, even when the generating LLM exhibits a low level
of bias. Our findings raise concerns about the use of RAG as a technique for
injecting novel facts into NLG systems and call for careful evaluation of
potential social biases in RAG applications before their real-world deployment.
Authors' comments: 18 pages
Yinan Zhou, Yaxiong Wang, Haokun Lin, Chen Ma, Li Zhu, Zhedong Zheng
Composed Image Retrieval (CIR) aims to search an image of interest using a
combination of a reference image and modification text as the query. Despite
recent advancements, this task remains challenging due to limited training data
and laborious triplet annotation processes. To address this issue, this paper
proposes to synthesize the training triplets to augment the training resource
for the CIR problem. Specifically, we commence by training a modification text
generator exploiting large-scale multimodal models and scale up the CIR
learning throughout both the pretraining and fine-tuning stages. During
pretraining, we leverage the trained generator to directly create Modification
Text-oriented Synthetic Triplets (MTST) conditioned on pairs of images. For
fine-tuning, we first synthesize reverse modification text to connect the
target image back to the reference image. Subsequently, we devise a two-hop
alignment strategy to incrementally close the semantic gap between the
multimodal pair and the target image. We initially learn an implicit prototype
utilizing both the original triplet and its reversed version in a cycle manner,
followed by combining the implicit prototype feature with the modification text
to facilitate accurate alignment with the target image. Extensive experiments
validate the efficacy of the generated triplets and confirm that our proposed
methodology attains competitive recall on both the CIRR and FashionIQ
benchmarks.
Authors' comments: 12 pages, 8 figures
Peng Shen, Xugang Lu, Hisashi Kawai
Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts for the LLM decoder to improve speech recognition performance. Benefiting from the advantages of the RAG retrieval mechanism, our approach efficiently accesses locally available domain-specific documents, ensuring a convenient and effective process for solving domain mismatch problems. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on the CSJ dataset, even without relying on the full training data.
Hansi Zeng, Julian Killingback, Hamed Zamani
Scaling large language models (LLMs) has shown great potential for improving retrieval model performance; however, previous studies have mainly focused on dense retrieval trained with contrastive loss (CL), neglecting the scaling behavior of other retrieval paradigms and optimization techniques, such as sparse retrieval and knowledge distillation (KD). In this work, we conduct a systematic comparative study on how different retrieval paradigms (sparse vs. dense) and fine-tuning objectives (CL vs. KD vs. their combination) affect retrieval performance across different model scales. Using MSMARCO passages as the training dataset, decoder-only LLMs (Llama-3 series: 1B, 3B, 8B), and a fixed compute budget, we evaluate various training configurations on both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks. Our key findings reveal that: (1) Scaling behaviors emerge clearly only with CL, where larger models achieve significant performance gains, whereas KD-trained models show minimal improvement, performing similarly across the 1B, 3B, and 8B scales. (2) Sparse retrieval models consistently outperform dense retrieval across both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks, and they demonstrate greater robustness to imperfect supervised signals. (3) We successfully scale sparse retrieval models with the combination of CL and KD losses at 8B scale, achieving state-of-the-art (SOTA) results in all evaluation sets.
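To make the two fine-tuning objectives compared in the abstract above concrete, the sketch below writes a contrastive loss (CL) and a knowledge distillation loss (KD) over the relevance scores of one query's candidate passages. It assumes the common InfoNCE-style form for CL and a KL divergence between teacher and student score distributions for KD; the abstract does not give the exact formulations, temperatures, or loss weights, so those and all function names are assumptions.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    max_score = max(scores)
    exps = [math.exp((s - max_score) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(student_scores, positive_index, temperature=1.0):
    """InfoNCE-style loss: negative log-probability of the positive passage."""
    probs = softmax(student_scores, temperature)
    return -math.log(probs[positive_index])


def distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the candidate passages."""
    p_teacher = softmax(teacher_scores)
    p_student = softmax(student_scores)
    return sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)


def combined_loss(student_scores, teacher_scores, positive_index, kd_weight=1.0):
    """CL plus weighted KD; the equal default weighting is an assumption."""
    return (contrastive_loss(student_scores, positive_index)
            + kd_weight * distillation_loss(student_scores, teacher_scores))


# One query with a positive passage (index 0) and two negatives.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
print(contrastive_loss(student, 0),
      distillation_loss(student, teacher),
      combined_loss(student, teacher, 0))
```

CL uses only the hard positive label, whereas KD matches the teacher's full score distribution, so noisy teacher scores affect the two objectives differently.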
Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman
The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang, Jimmy Lin, Vivek Srikumar
This survey examines the evolution of model architectures in information retrieval (IR), focusing on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. The review intentionally separates architectural considerations from training methodologies to provide a focused analysis of structural innovations in IR systems. We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs). We conclude by discussing emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal, multilingual data, and adaptation to novel application domains beyond traditional search paradigms.
Haya Nachimovsky, Moshe Tennenholtz, Oren Kurland
The rise of large language models (LLMs) has introduced a new era in information retrieval (IR), where queries and documents that were once assumed to be generated exclusively by humans can now also be created by automated agents. These agents can formulate queries, generate documents, and perform ranking. This shift challenges some long-standing IR paradigms and calls for a reassessment of both theoretical frameworks and practical methodologies. We advocate for a multi-agent perspective to better capture the complex interactions between query agents, document agents, and ranker agents. Through empirical exploration of various multi-agent retrieval settings, we reveal the significant impact of these interactions on system performance. Our findings underscore the need to revisit classical IR paradigms and develop new frameworks for more effective modeling and evaluation of modern retrieval systems.
Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao
Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG's unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
Jiachen Zhu, Congmin Zheng, Jianghao Lin, Kounianhua Du, Ying Wen, Yong Yu, Jun Wang, Weinan Zhang
While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies key OOD issues, including step OOD, caused by differences in reasoning patterns across model types and sizes, and question OOD, which arises from dataset shifts between training data and real-world problems. To address these issues, we introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework built on a two-stage retrieval-enhanced mechanism. RetrievalPRM retrieves semantically similar questions and steps as a warmup, enhancing the PRM's ability to evaluate target steps and improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.