Alireza Salemi, Hamed Zamani
Evaluating retrieval-augmented generation (RAG) presents challenges, particularly for retrieval models within these systems. Traditional end-to-end evaluation methods are computationally expensive. Furthermore, evaluation of the retrieval model's performance based on query-document relevance labels shows a small correlation with the RAG system's downstream performance. We propose a novel evaluation approach, eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system. The output generated for each document is then evaluated based on the downstream task ground truth labels. In this manner, the downstream performance for each document serves as its relevance label. We employ various downstream task metrics to obtain document-level annotations and aggregate them using set-based or ranking metrics. Extensive experiments on a wide range of datasets demonstrate that eRAG achieves a higher correlation with downstream RAG performance compared to baseline methods, with improvements in Kendall's $\tau$ correlation ranging from 0.168 to 0.494. Additionally, eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
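The per-document labelling idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy generator standing in for the LLM, and the containment-based metric are all assumptions made for this example.

```python
from statistics import mean
from typing import Callable, List, Sequence


def erag_document_labels(
    query: str,
    documents: Sequence[str],
    generate: Callable[[str, Sequence[str]], str],
    task_metric: Callable[[str, str], float],
    ground_truth: str,
) -> List[float]:
    """Label each document by the downstream score of the output generated from it alone."""
    return [task_metric(generate(query, [doc]), ground_truth) for doc in documents]


def precision_at_k(labels: Sequence[float], k: int) -> float:
    """Set-based aggregation: mean document-level label over the top-k positions."""
    top = list(labels[:k])
    return mean(top) if top else 0.0


# Hypothetical stand-in for the LLM: it simply echoes the single document it is given.
def toy_generate(query: str, docs: Sequence[str]) -> str:
    return docs[0]


# Hypothetical downstream metric: 1.0 if the gold answer appears in the output.
def contains_answer(output: str, truth: str) -> float:
    return 1.0 if truth.lower() in output.lower() else 0.0


retrieved = [
    "Paris is the capital of France.",
    "Berlin is in Germany.",
    "The capital of France is Paris.",
]
labels = erag_document_labels(
    "What is the capital of France?", retrieved, toy_generate, contains_answer, "Paris"
)
score = precision_at_k(labels, k=3)
```

Any ranking metric could replace `precision_at_k` here; the abstract only states that document-level labels are aggregated with set-based or ranking metrics.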
Shiguang Wu, Wenda Wei, Mengqi Zhang, Zhumin Chen, Jun Ma, Zhaochun Ren, Maarten de Rijke, Pengjie Ren
Generative retrieval generates identifiers of relevant documents in an
end-to-end manner using a sequence-to-sequence architecture for a given query.
The relation between generative retrieval and other retrieval methods,
especially those based on matching within dense retrieval models, is not yet
fully comprehended. Prior work has demonstrated that generative retrieval with
atomic identifiers is equivalent to single-vector dense retrieval. Accordingly,
generative retrieval exhibits behavior analogous to hierarchical search within
a tree index in dense retrieval when using hierarchical semantic identifiers.
However, prior work focuses solely on the retrieval stage without considering
the deep interactions within the decoder of generative retrieval.
In this paper, we fill this gap by demonstrating that generative retrieval
and multi-vector dense retrieval share the same framework for measuring the
relevance of a document to a query. Specifically, we examine the attention
layer and prediction head of generative retrieval, revealing that generative
retrieval can be understood as a special case of multi-vector dense retrieval.
Both methods compute relevance as a sum of products of query and document
vectors and an alignment matrix. We then explore how generative retrieval
applies this framework, employing distinct strategies for computing document
token vectors and the alignment matrix. We have conducted experiments to verify
our conclusions and show that both paradigms exhibit commonalities of term
matching in their alignment matrix.
Authors' comments: 12 pages, 5 figures, 8 tables, accepted at SIGIR 2024
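The shared scoring form mentioned in the abstract above (relevance as a sum of products of query vectors, document vectors, and an alignment matrix) can be sketched as below. The one-hot "best match" alignment is an assumption chosen for illustration, in the spirit of MaxSim-style multi-vector retrieval. It is not claimed to be the alignment strategy derived in the paper, and all names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relevance(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector],
              alignment: Sequence[Sequence[float]]) -> float:
    """rel(q, d) = sum over i, j of alignment[i][j] * <q_i, d_j>."""
    return sum(
        alignment[i][j] * dot(q, d)
        for i, q in enumerate(query_vecs)
        for j, d in enumerate(doc_vecs)
    )


def best_match_alignment(query_vecs: Sequence[Vector],
                         doc_vecs: Sequence[Vector]) -> List[List[float]]:
    """Assumed alignment: each query vector is aligned to its highest-scoring document vector."""
    rows = []
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        best = sims.index(max(sims))
        rows.append([1.0 if j == best else 0.0 for j in range(len(doc_vecs))])
    return rows


query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
alignment = best_match_alignment(query_vecs, doc_vecs)
rel = relevance(query_vecs, doc_vecs, alignment)
```

Changing only how `alignment` and the document vectors are produced, while keeping `relevance` fixed, is the sense in which different retrieval paradigms can share one scoring framework.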
Benjamin Reichman, Larry Heck
Dense passage retrieval (DPR) is the first step in the retrieval augmented generation (RAG) paradigm for improving the performance of large language models (LLMs). DPR fine-tunes pre-trained networks to enhance the alignment of the embeddings between queries and relevant textual data. A deeper understanding of DPR fine-tuning will be required to fundamentally unlock the full potential of this approach. In this work, we explore DPR-trained models mechanistically by using a combination of probing, layer activation analysis, and model editing. Our experiments show that DPR training decentralizes how knowledge is stored in the network, creating multiple access pathways to the same information. We also uncover a limitation in this training style: the internal knowledge of the pre-trained model bounds what the retrieval model can retrieve. These findings suggest a few possible directions for dense retrieval: (1) expose the DPR training process to more knowledge so more can be decentralized, (2) inject facts as decentralized representations, (3) model and incorporate knowledge uncertainty in the retrieval process, and (4) directly map internal model knowledge to a knowledge base.
Man Luo, Arindam Mitra, Tejas Gokhale, Chitta Baral
Information retrieval (IR) is essential in search engines and dialogue
systems as well as natural language processing tasks such as open-domain
question answering. IR serves an important function in the biomedical domain,
where content and sources of scientific knowledge may evolve rapidly. Although
neural retrievers have surpassed traditional IR approaches such as TF-IDF and
BM25 in standard open-domain question answering tasks, they are still found
lacking in the biomedical domain. In this paper, we seek to improve information
retrieval (IR) using neural retrievers (NR) in the biomedical domain, and
achieve this goal using a three-pronged approach. First, to tackle the relative
lack of data in the biomedical domain, we propose a template-based question
generation method that can be leveraged to train neural retriever models.
Second, we develop two novel pre-training tasks that are closely aligned to the
downstream task of information retrieval. Third, we introduce the ``Poly-DPR''
model which encodes each context into multiple context vectors. Extensive
experiments and analysis on the BioASQ challenge suggest that our proposed
method leads to large gains over existing neural approaches and beats BM25 in
the small-corpus setting. We show that BM25 and our method can complement each
other, and a simple hybrid model leads to further gains in the large corpus
setting.
Authors' comments: Accepted at AAAI 2022
Hang Zhang, Yeyun Gong, Yelong Shen, Jiancheng Lv, Nan Duan, Weizhu Chen
Current dense text retrieval models face two typical challenges. First, they
adopt a siamese dual-encoder architecture to encode queries and documents
independently for fast indexing and searching, while neglecting the
finer-grained term-wise interactions. This results in sub-optimal recall
performance. Second, their model training relies heavily on a negative sampling
technique to build up the negative documents in their contrastive losses. To
address these challenges, we present Adversarial Retriever-Ranker (AR2), which
consists of a dual-encoder retriever plus a cross-encoder ranker. The two
models are jointly optimized according to a minimax adversarial objective: the
retriever learns to retrieve negative documents to cheat the ranker, while the
ranker learns to rank a collection of candidates including both the
ground-truth and the retrieved ones, as well as providing progressive direct
feedback to the dual-encoder retriever. Through this adversarial game, the
retriever gradually produces harder negative documents to train a better
ranker, whereas the cross-encoder ranker provides progressive feedback to
improve the retriever. We evaluate AR2 on three benchmarks. Experimental results
show that AR2 consistently and significantly outperforms existing dense
retriever methods and achieves new state-of-the-art results on all of them.
This includes improvements on Natural Questions R@5 to 77.9% (+2.1%),
TriviaQA R@5 to 78.2% (+1.4%), and MS-MARCO MRR@10 to 39.5% (+1.3%). Code and
models are available at https://github.com/microsoft/AR2.
Authors' comments: ICLR 2022
Jinhyuk Lee, Alexander Wettig, Danqi Chen
Dense retrieval methods have shown great promise over sparse retrieval
methods in a range of NLP problems. Among them, dense phrase retrieval (the most
fine-grained retrieval unit) is appealing because phrases can be directly used
as the output for question answering and slot filling tasks. In this work, we
follow the intuition that retrieving phrases naturally entails retrieving
larger text blocks and study whether phrase retrieval can serve as the basis
for coarse-level retrieval including passages and documents. We first observe
that a dense phrase-retrieval system, without any retraining, already achieves
better passage retrieval accuracy (+3-5% in top-5 accuracy) compared to passage
retrievers, which also helps achieve superior end-to-end QA performance with
fewer passages. Then, we provide an interpretation for why phrase-level
supervision helps learn better fine-grained entailment compared to
passage-level supervision, and also show that phrase retrieval can be improved
to achieve competitive performance in document-retrieval tasks such as entity
linking and knowledge-grounded dialogue. Finally, we demonstrate how phrase
filtering and vector quantization can reduce the size of our index by 4-10x,
making dense phrase retrieval a practical and versatile solution in
multi-granularity retrieval.
Authors' comments: EMNLP 2021. Code available at
https://github.com/princeton-nlp/DensePhrases
Hung-Yu Tseng, Hsin-Ying Lee, Lu Jiang, Ming-Hsuan Yang, Weilong Yang
Image generation from scene description is a cornerstone technique for
controlled generation, which is beneficial to applications such as content
creation and image editing. In this work, we aim to synthesize images from
scene description with retrieved patches as reference. We propose a
differentiable retrieval module. With the differentiable retrieval module, we
can (1) make the entire pipeline end-to-end trainable, enabling the learning of
better feature embedding for retrieval; (2) encourage the selection of mutually
compatible patches with additional objective functions. We conduct extensive
quantitative and qualitative experiments to demonstrate that the proposed
method can generate realistic and diverse images, where the retrieved patches
are reasonable and mutually compatible.
Authors' comments: ECCV 2020
Peter G. Casazza, Dorsa Ghoreishi, Shani Jose, Janet C. Tremain
We make a detailed study of norm retrieval. We give several classification theorems for norm retrieval and give a large number of examples to go with the theory. One consequence is a new result about Parseval frames: If a Parseval frame is divided into two subsets with spans $W_1,W_2$ and $W_1 \cap W_2=\{0\}$, then $W_1 \perp W_2$.
Yingqi Zhao, Vasilis Efthymiou, Jyrki Nummenmaa, Kostas Stefanidis
Retrieval-Augmented Generation (RAG) improves reliability of large language models by incorporating external knowledge, but the retrieval process can introduce bias that propagates to generated outputs. This issue is particularly challenging in top-k settings, where multiple documents jointly influence generation. We propose a fairness-aware retrieval framework that models and controls this bias. Our approach combines controlled bias injection via reranking, a position-aware model of bias propagation, and an optimization formulation that balances relevance and fairness. We further introduce a scalable solution based on Quadratic Fairness via Dual Hyperplane Approximation (FARO), which enables efficient optimization through problem decomposition. Experimental results show that our method effectively mitigates generation bias while preserving relevance. This work provides a principled approach for fairness-aware retrieval in RAG systems.
Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava
Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, and results in unnecessary retrieval rounds, increased latency, and poor recall. We introduce \textit{SuperIntelligent Retrieval Agent} (SIRA), which defines \emph{superintelligence} in retrieval as the ability to compress multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask what terms are relevant to the query; it asks which terms are likely to separate the desired evidence from corpus-level confusers. On the corpus side, an LLM enriches each document offline with missing search vocabulary; on the query side, it predicts evidence vocabulary omitted by the query; and it uses document-frequency statistics as a tool call to filter proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final retrieval step is a single weighted BM25 call combining the original query with the validated expansion. Across ten BEIR benchmarks and downstream question-answering tasks, SIRA achieves significantly superior performance, outperforming dense retrievers and state-of-the-art multi-round agentic baselines, demonstrating that one well-formed lexical query, guided by LLM cognition and lightweight corpus statistics, can exceed substantially more expensive multi-round search while remaining interpretable, training-free, and efficient.
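The document-frequency filtering of proposed expansion terms described in the abstract above might look roughly like this. It is a sketch under stated assumptions: the function names and the `max_df_ratio` threshold are hypothetical, and only the "absent" and "overly common" checks are shown. The abstract's third criterion (terms unlikely to create retrieval margin) is not modelled here.

```python
from typing import List, Sequence, Set


def document_frequency(term: str, corpus_tokens: Sequence[Set[str]]) -> int:
    """Number of documents whose token set contains the term."""
    return sum(1 for doc in corpus_tokens if term in doc)


def filter_expansion_terms(candidates: Sequence[str],
                           corpus_tokens: Sequence[Set[str]],
                           max_df_ratio: float = 0.5) -> List[str]:
    """Keep candidate terms that occur in the corpus but not in too many documents."""
    n_docs = len(corpus_tokens)
    kept = []
    for term in candidates:
        df = document_frequency(term, corpus_tokens)
        if df == 0:
            continue  # absent from the corpus: cannot match any document
        if df / n_docs > max_df_ratio:
            continue  # overly common: unlikely to discriminate between documents
        kept.append(term)
    return kept


corpus_tokens = [
    {"invoice", "payment", "the"},
    {"refund", "payment", "the"},
    {"shipping", "delay", "the"},
    {"invoice", "dispute", "the"},
]
proposed = ["the", "invoice", "chargeback", "refund"]
validated = filter_expansion_terms(proposed, corpus_tokens)
```

In this toy corpus, "the" is dropped as too common and "chargeback" as absent, while "invoice" and "refund" survive and could be appended to the original query before a single lexical retrieval call.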
Tong Zhao, Yutao Zhu, Yucheng Tian, Zhicheng Dou
Retrieval-augmented generation (RAG) has become a cornerstone for knowledge-intensive tasks. However, the efficacy of RAG is often bottlenecked by the ``one-size-fits-all'' retrieval paradigm, as different queries exhibit distinct preferences for different retrievers. While recent routing techniques attempt to select the optimal retriever dynamically, they typically operate under a ``single and static capability'' assumption, selecting retrievers solely based on semantic relevance. This overlooks a critical distinction in RAG: a retrieved document must not only be relevant but also effectively support the generator in producing correct answers. To address this limitation, we propose R$^3$AG, a novel routing framework that explicitly models the dynamic alignment between queries and retriever capabilities. Unlike previous approaches, R$^3$AG decomposes retriever capability into two learnable dimensions: retrieval quality and generation utility. We employ a contrastive learning objective that leverages complementary supervision signals, \textit{i.e.}, document assessments and downstream answer correctness, to capture query-specific preference shifts. Extensive experiments on several knowledge-intensive tasks show that R$^3$AG consistently outperforms both the best individual retrievers and state-of-the-art static routing methods.
Deniz Qian, Hung-Ting Chen, Eunsol Choi
Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.
Authors' comments: 18 pages, 12 figures, 12 tables
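The multi-round procedure in the abstract above (retrieve, verify, then retrieve again with the query augmented by verified documents) can be sketched as a simple loop. The function names, the string-concatenation form of query augmentation, and the toy retriever and verifier are assumptions for illustration only.

```python
from typing import Callable, List, Sequence


def retrieve_verify_retrieve(
    query: str,
    retrieve: Callable[[str], Sequence[str]],
    verify: Callable[[str, Sequence[str]], Sequence[str]],
    rounds: int = 2,
) -> List[str]:
    """Run several retrieval rounds, feeding verified documents back into the query."""
    verified: List[str] = []
    current_query = query
    for _ in range(rounds):
        candidates = retrieve(current_query)
        # The verifier judges candidates against the original query.
        for doc in verify(query, candidates):
            if doc not in verified:
                verified.append(doc)
        # Assumed augmentation: append the verified documents to the original query.
        current_query = query + " " + " ".join(verified)
    return verified


corpus = [
    "Nile river Africa",
    "Amazon river America",
    "Sahara desert Africa",
    "Danube river Europe",
]


# Hypothetical retriever: returns the first two documents not already present in the query.
def toy_retrieve(q: str) -> Sequence[str]:
    return [d for d in corpus if d not in q][:2]


# Hypothetical verifier: keeps only documents that mention a river.
def toy_verify(original_query: str, candidates: Sequence[str]) -> Sequence[str]:
    return [d for d in candidates if "river" in d]


found = retrieve_verify_retrieve("river", toy_retrieve, toy_verify, rounds=2)
```

The second round surfaces a document the first round missed, which is the coverage effect the abstract attributes to conditioning later rounds on previously verified documents.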
Hiren Madhu, Ngoc Bui, Ali Maatouk, Leandros Tassiulas, Smita Krishnaswamy, Menglin Yang, Sukanta Ganguly, Kiran Srinivasan et al.
Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems.
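For readers unfamiliar with the Lorentz model named in the abstract above, the following sketch shows its standard inner product and geodesic distance for curvature -1. The `lift_to_hyperboloid` helper is a simple assumed way of placing a Euclidean vector on the hyperboloid. It is not claimed to be the projection used by HyTE-H, and the Outward Einstein Midpoint is not shown.

```python
import math
from typing import List, Sequence


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum of the products of the remaining coordinates."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift_to_hyperboloid(v: Sequence[float]) -> List[float]:
    """Assumed lifting: map v to (sqrt(1 + |v|^2), v), which satisfies <x, x>_L = -1."""
    return [math.sqrt(1.0 + sum(a * a for a in v))] + list(v)


def lorentz_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance on the unit hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1.0 to guard against tiny floating-point undershoot.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
distance = lorentz_distance(origin, point)
```

Points farther from the origin have a larger time-like coordinate `x[0]`, which gives one way to read the abstract's remark about norm-based separation between general and specific concepts.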
Fangzheng Tian, Debasis Ganguly, Craig Macdonald
The quality of answers generated by large language models (LLMs) in retrieval-augmented generation (RAG) is largely influenced by the contextual information contained in the retrieved documents. A key challenge for improving RAG is to predict both the utility of retrieved documents -- quantified as the performance gain from using context over generation without context -- and the quality of the final answers in terms of correctness and relevance. In this paper, we define two prediction tasks within RAG. The first is retrieval performance prediction (RPP), which estimates the utility of retrieved documents. The second is generation performance prediction (GPP), which estimates the final answer quality. We hypothesise that in RAG, the topical relevance of retrieved documents correlates with their utility, suggesting that query performance prediction (QPP) approaches can be adapted for RPP and GPP. Beyond these retriever-centric signals, we argue that reader-centric features, such as the LLM's perplexity of the retrieved context conditioned on the input query, can further enhance prediction accuracy for both RPP and GPP. Finally, we propose that features reflecting query-agnostic document quality and readability can also provide useful signals to the predictions. We train linear regression models with the above categories of predictors for both RPP and GPP. Experiments on the Natural Questions (NQ) dataset show that combining predictors from multiple feature categories yields the most accurate estimates of RAG performance.
Authors' comments: 18 pages (including references), 3 figures, 2 tables, 61 references; this paper has been accepted by ECIR'26 as a full paper
Xin Sun, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.
Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, Trevor Adriaanse
Retrieval models are key components of Retrieval-Augmented Generation (RAG) systems, which generate search queries, process the documents returned, and generate a response. RAG systems are often dynamic and may involve multiple rounds of retrieval. While many state-of-the-art retrieval methods are available through academic IR platforms, these platforms are typically designed for the Cranfield paradigm in which all queries are known up front and can be batch processed offline. This simplification accelerates research but leaves state-of-the-art retrieval models unable to support downstream applications that require online services, such as arbitrary dynamic RAG pipelines that involve looping, feedback, or even self-organizing agents. In this work, we introduce RoutIR, a Python package that provides a simple and efficient HTTP API that wraps arbitrary retrieval methods, including first stage retrieval, reranking, query expansion, and result fusion. By providing a minimal JSON configuration file specifying the retrieval models to serve, RoutIR can be used to construct and query retrieval pipelines on-the-fly using any permutation of available models (e.g., fusing the results of several first-stage retrieval methods followed by reranking). The API automatically performs asynchronous query batching and caches results by default. While many state-of-the-art retrieval methods are already supported by the package, RoutIR is also easily expandable by implementing the Engine abstract class. The package is open-sourced and publicly available on GitHub: http://github.com/hltcoe/routir.
Authors' comments: 17 pages, 1 figure
Rishita Agarwal, Himanshu Singhal, Peter Baile Chen, Manan Roy Choudhury, Dan Roth, Vivek Gupta
Answering natural language queries over relational data often requires
retrieving and reasoning over multiple tables, yet most retrievers optimize
only for query-table relevance and ignore table-table compatibility. We
introduce REAR (Retrieve, Expand and Refine), a three-stage, LLM-free framework
that separates semantic relevance from structural joinability for efficient,
high-fidelity multi-table retrieval. REAR (i) retrieves query-aligned tables,
(ii) expands these with structurally joinable tables via fast, precomputed
column-embedding comparisons, and (iii) refines them by pruning noisy or weakly
related candidates. Empirically, REAR is retriever-agnostic and consistently
improves dense/sparse retrievers on complex table QA datasets (BIRD, MMQA, and
Spider) by improving both multi-table retrieval quality and downstream SQL
execution. Despite being LLM-free, it delivers performance competitive with
state-of-the-art LLM-augmented retrieval systems (e.g., ARM) while achieving
much lower latency and cost. Ablations confirm complementary gains from
expansion and refinement, underscoring REAR as a practical, scalable building
block for table-based downstream tasks (e.g., Text-to-SQL).
Authors' comments: 13 pages, 2 figures, 8 tables
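The three stages named in the abstract above (retrieve, expand, refine) can be sketched as below. Everything specific here is an assumption for illustration: the cosine-similarity scoring, the two thresholds, the rule that a table is joinable when any pair of column embeddings is sufficiently similar, and the pruning rule in the refine stage. The abstract does not specify these details.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def retrieve_expand_refine(query_vec: Vector,
                           table_vecs: Dict[str, Vector],
                           column_vecs: Dict[str, List[Vector]],
                           top_k: int = 1,
                           join_threshold: float = 0.9,
                           keep_threshold: float = 0.2) -> List[str]:
    # (i) Retrieve: rank tables by similarity to the query and keep the top_k.
    ranked = sorted(table_vecs, key=lambda t: cosine(query_vec, table_vecs[t]), reverse=True)
    retrieved = ranked[:top_k]
    # (ii) Expand: add tables with a column embedding close to a retrieved table's column.
    expanded = list(retrieved)
    for table in table_vecs:
        if table in expanded:
            continue
        if any(cosine(c1, c2) >= join_threshold
               for r in retrieved for c1 in column_vecs[r] for c2 in column_vecs[table]):
            expanded.append(table)
    # (iii) Refine: prune expanded tables that are only weakly related to the query.
    return [t for t in expanded
            if t in retrieved or cosine(query_vec, table_vecs[t]) >= keep_threshold]


query_vec = [1.0, 0.0]
table_vecs = {"orders": [1.0, 0.1], "customers": [0.5, 0.5], "logs": [0.0, 1.0]}
column_vecs = {
    "orders": [[1.0, 0.0, 0.0]],
    "customers": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "logs": [[1.0, 0.0, 0.0]],
}
selected = retrieve_expand_refine(query_vec, table_vecs, column_vecs)
```

In this toy setup both `customers` and `logs` look joinable with `orders`, but `logs` is pruned in the refine stage because it is unrelated to the query, so semantic relevance and structural joinability are checked separately.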
Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.
Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for
enhancing large language models (LLMs) by retrieving relevant documents from an
external corpus. However, existing RAG systems primarily focus on unimodal text
documents, and often fall short in real-world scenarios where both queries and
documents may contain mixed modalities (such as text and images). In this
paper, we address the challenge of Universal Retrieval-Augmented Generation
(URAG), which involves retrieving and reasoning over mixed-modal information to
improve vision-language generation. To this end, we propose Nyx, a unified
mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate
the scarcity of realistic mixed-modal data, we introduce a four-stage automated
pipeline for generation and filtering, leveraging web documents to construct
NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that
better reflect real-world information needs. Building on this high-quality
dataset, we adopt a two-stage training framework for Nyx: we first perform
pre-training on NyxQA along with a variety of open-source retrieval datasets,
followed by supervised fine-tuning using feedback from downstream
vision-language models (VLMs) to align retrieval outputs with generative
preferences. Experimental results demonstrate that Nyx not only performs
competitively on standard text-only RAG benchmarks, but also excels in the more
general and realistic URAG setting, significantly improving generation quality
in vision-language tasks.
Authors' comments: This work is in progress
Helia Hashemi, Victor Rühle, Saravan Rajmohan
Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.