Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Guihai Chen
Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.
Baolei Zhang, Haoran Xin, Jiatong Li, Dongzhe Zhang, Minghong Fang, Zhuqing Liu, Lihai Nie, Zheli Liu
Retrieval-Augmented Generation (RAG) has proven effective in mitigating hallucinations in large language models by incorporating external knowledge during inference. However, this integration introduces new security vulnerabilities, particularly to poisoning attacks. Although prior work has explored various poisoning strategies, a thorough assessment of their practical threat to RAG systems remains missing. To address this gap, we propose the first comprehensive benchmark framework for evaluating poisoning attacks on RAG. Our benchmark covers 5 standard question answering (QA) datasets and 10 expanded variants, along with 13 poisoning attack methods and 7 defense mechanisms, representing a broad spectrum of existing techniques. Using this benchmark, we conduct a comprehensive evaluation of all included attacks and defenses across the full dataset spectrum. Our findings show that while existing attacks perform well on standard QA datasets, their effectiveness drops significantly on the expanded versions. Moreover, our results demonstrate that various advanced RAG architectures, such as sequential, branching, conditional, and loop RAG, as well as multi-turn conversational RAG, multimodal RAG systems, and RAG-based LLM agent systems, remain susceptible to poisoning attacks. Notably, current defense techniques fail to provide robust protection, underscoring the pressing need for more resilient and generalizable defense strategies.
Debrup Das, Sam O' Nuallain, Razieh Rahimi
We propose RaDeR, a set of reasoning-based dense retrieval models trained
with data derived from mathematical problem solving using large language models
(LLMs). Our method leverages retrieval-augmented reasoning trajectories of an
LLM and self-reflective relevance evaluation, enabling the creation of both
diverse and hard-negative samples for reasoning-intensive relevance. RaDeR
retrievers, trained for mathematical reasoning, effectively generalize to
diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently
outperforming strong baselines in overall performance. Notably, RaDeR achieves
significantly higher performance than baselines on the Math and Coding splits.
In addition, RaDeR presents the first dense retriever that outperforms BM25
when queries are Chain-of-Thought reasoning steps, underscoring the critical
role of reasoning-based retrieval to augment reasoning language models.
Furthermore, RaDeR achieves comparable or superior performance while using only
2.5% of the training data used by the concurrent work REASONIR, highlighting
the quality of our synthesized training data.
Authors' comments: 26 pages
Wei Liu, Sony Trenous, Leonardo F. R. Ribeiro, Bill Byrne, Felix Hieber
We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.
Xingyu Ji, Parker Glenn, Aditya G. Parameswaran, Madelon Hulsebos
The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, either through conversational interfaces or agentic components, in structured data through retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective here than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers to various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at https://target-benchmark.github.io.
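For readers unfamiliar with the BM25 baseline mentioned above, the following is a minimal sketch of BM25 scoring over tables that have been flattened into token lists. The flattening (title and column names as tokens), the function name `bm25_scores`, the parameter defaults, and the Lucene-style IDF variant are all assumptions made for illustration; they are not taken from TARGET.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document (here: a flattened table) against the query with BM25.

    Assumes docs_tokens is a non-empty list of token lists.
    Uses the Lucene-style IDF, log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes long documents.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical tables, represented only by title and column-name tokens.
tables = [
    ["customer", "orders", "order_id", "amount"],
    ["employee", "salary", "department"],
    ["weather", "station", "temperature"],
]
scores = bm25_scores(["total", "amount", "orders"], tables)
best_table = max(range(len(tables)), key=scores.__getitem__)
```

Because BM25 only rewards exact token overlap, a table whose title or headers are missing or phrased differently from the query receives no credit, which is one plausible reason for the gap the abstract reports.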
Owen Kwon, Abraham George, Alison Bartsch, Amir Barati Farimani
This paper introduces RT-cache, a novel trajectory-memory pipeline that
accelerates real-world robot inference by leveraging big-data retrieval and
learning from experience. While modern Vision-Language-Action (VLA) models can
handle diverse robotic tasks, they often incur high per-step inference costs,
resulting in significant latency, sometimes minutes per task. In contrast,
RT-cache stores a large-scale Memory of previously successful robot
trajectories and retrieves relevant multistep motion snippets, drastically
reducing inference overhead. By integrating a Memory Builder with a Trajectory
Retrieval, we develop an efficient retrieval process that remains tractable
even for extremely large datasets. RT-cache flexibly accumulates real-world
experiences and replays them whenever the current scene matches past states,
adapting quickly to new or unseen environments with only a few additional
samples. Experiments on the Open-X Embodiment Dataset and other real-world data
demonstrate that RT-cache completes tasks both faster and more successfully
than a baseline lacking retrieval, suggesting a practical, data-driven solution
for real-time manipulation.
Authors' comments: 9 pages, 5 figures. Submitted to an IEEE robotics conference
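The retrieve-and-replay idea described above can be sketched as follows. This is a toy illustration only: the class `TrajectoryMemory`, the cosine-similarity matching, the `min_similarity` threshold, and the fallback to a slower policy are all assumptions, and the linear scan stands in for whatever indexing the authors use to stay tractable on large datasets.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

class TrajectoryMemory:
    """Hypothetical store of (scene embedding, multistep action snippet) pairs."""

    def __init__(self, min_similarity=0.9):
        self.entries = []
        self.min_similarity = min_similarity

    def add(self, embedding, actions):
        self.entries.append((list(embedding), list(actions)))

    def retrieve(self, embedding):
        """Return the snippet of the most similar stored scene, or None if no match."""
        best_actions, best_sim = None, self.min_similarity
        for stored, actions in self.entries:
            sim = cosine(embedding, stored)
            if sim >= best_sim:
                best_actions, best_sim = actions, sim
        return best_actions

def next_actions(memory, embedding, slow_policy):
    """Replay cached actions on a match; otherwise call the (assumed) slow policy."""
    cached = memory.retrieve(embedding)
    return cached if cached is not None else slow_policy(embedding)

memory = TrajectoryMemory()
memory.add([1.0, 0.0], ["reach", "grasp", "lift"])
hit = next_actions(memory, [0.99, 0.05], lambda e: ["model_step"])
miss = next_actions(memory, [0.0, 1.0], lambda e: ["model_step"])
```

On a hit, several actions are returned from a single lookup, which is where the latency saving would come from under these assumptions.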
Xianrui Zhong, Bowen Jin, Siru Ouyang, Yanzhen Shen, Qiao Jin, Yin Fang, Zhiyong Lu, Jiawei Han
Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain -- achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data are available at https://chemrag.github.io.
Tianxing Yang, Huigen Ye, Hua Xu
Mixed-Integer Linear Programming (MILP) is widely used in fields such as scheduling, logistics, and planning. Enhancing the performance of MILP solvers, particularly learning-based solvers, requires substantial amounts of high-quality data. However, existing methods for MILP instance generation typically necessitate training a separate model for each problem class and are computationally intensive when generating new instances. To address these limitations, we reformulate the MILP Instance Generation task as a MILP Code Generation task, enabling efficient, flexible, and interpretable instance generation through code. Since MILP instances generated from code can vary significantly in scale, we introduce MILP-EmbedSim, a new similarity metric that accurately measures the similarity between instances of varying sizes within the same problem class. Leveraging this metric, we propose MILP-Retrieval, a pipeline that retrieves generation code from a library to produce MILP instances highly similar to a target instance. MILP-Retrieval outperforms baselines in both MILP Code Generation and Instance Generation tasks, provides a novel perspective on MILP instance generation, and opens new possibilities for learning-based solvers.
Sean MacAvaney
Sharing artifacts -- such as trained models, pre-built indexes, and the code
to use them -- aids in reproducibility efforts by allowing researchers to
validate intermediate steps and improves the sustainability of research by
allowing multiple groups to build off one another's prior computational work.
Although there are de facto consensuses on how to share research code (through
a git repository linked to from publications) and trained models (via
HuggingFace Hub), there is no consensus for other types of artifacts, such as
built indexes. Given the practical utility of using shared indexes, researchers
have resorted to self-hosting these resources or performing ad hoc file
transfers upon request, ultimately limiting the artifacts' discoverability and
reuse. This demonstration introduces a flexible and interoperable way to share
artifacts for Information Retrieval research, improving both their
accessibility and usability.
Authors' comments: SIGIR 2025 (demo)
Mario Ceresa, Lorenzo Bertolini, Valentin Comte, Nicholas Spadaro, Barbara Raffael, Brigitte Toussaint, Sergio Consoli, Amalia Muñoz Piñeiro et al.
Safe and trustworthy use of Large Language Models (LLMs) in the processing of healthcare documents and scientific papers could substantially help clinicians, scientists and policymakers in overcoming information overload and focusing on the most relevant information at a given moment. Retrieval Augmented Generation (RAG) is a promising method to leverage the potential of LLMs while enhancing the accuracy of their outcomes. This report assesses the potentials and shortcomings of such approaches in the automatic knowledge synthesis of different types of documents in the health domain. To this end, it describes: (1) an internally developed proof-of-concept pipeline, called RAGEv (Retrieval Augmented Generation Evaluation), that employs state-of-the-art practices to deliver safe and trustable analysis for healthcare documents and scientific papers; (2) a set of evaluation tools for LLM-based document retrieval and generation; (3) a benchmark dataset, called RAGEv-Bench, to verify the accuracy and veracity of the results. It concludes that careful implementations of RAG techniques could minimize most of the common problems in the use of LLMs for document processing in the health domain, obtaining very high scores both on short yes/no answers and long answers. There is a high potential for incorporating it into the day-to-day work of policy support tasks, but additional efforts are required to obtain a consistent and trustworthy tool.
Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson, Daniel E. Ho
As the legal community increasingly examines the use of large language models
(LLMs) for various legal applications, legal AI developers have turned to
retrieval-augmented LLMs ("RAG" systems) to improve system performance and
robustness. An obstacle to the development of specialized RAG systems is the
lack of realistic legal RAG benchmarks which capture the complexity of both
legal retrieval and downstream legal question-answering. To address this, we
introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA.
Our tasks correspond to real-world legal research tasks, and were produced
through annotation processes which resemble legal research. We describe the
construction of these benchmarks and the performance of existing retriever
pipelines. Our results suggest that legal RAG remains a challenging
application, thus motivating future research.
Authors' comments: CS&Law 2025. For data, see
https://reglab.github.io/legal-rag-benchmarks/
Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min et al.
We present ReasonIR-8B, the first retriever specifically trained for general
reasoning tasks. Existing retrievers have shown limited gains on reasoning
tasks, in part because existing training datasets focus on short factual
queries tied to documents that straightforwardly answer them. We develop a
synthetic data generation pipeline that, for each document, creates a
challenging and relevant query, along with a plausibly related but
ultimately unhelpful hard negative. By training on a mixture of our synthetic
data and existing public data, ReasonIR-8B achieves a new state-of-the-art of
29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a
widely-used reasoning-intensive information retrieval (IR) benchmark. When
applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4%
and 22.6% respectively, relative to the closed-book baseline, outperforming
other retrievers and search engines. In addition, ReasonIR-8B uses test-time
compute more effectively: on BRIGHT, its performance consistently increases
with longer and more information-rich rewritten queries; it continues to
outperform other retrievers when combined with an LLM reranker. Our training
recipe is general and can be easily extended to future LLMs; to this end, we
open-source our code, data, and model.
Authors' comments: Our code is released at
https://github.com/facebookresearch/ReasonIR
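The nDCG@10 figures quoted above are on a 0-100 scale. As a reminder of what the metric measures, here is a minimal sketch that returns values in [0, 1]. It assumes linear gain (the relevance grade itself); some evaluation tools use an exponential gain instead, and the two coincide for binary judgments. The function names are illustrative, and this is not the BRIGHT evaluation code.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the first k relevance grades, in rank order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_relevances: grades of the retrieved documents, in system order.
    all_relevances: grades of every judged document for the query; if omitted,
    the ideal ranking is built from the retrieved documents only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# A relevant document at ranks 1 and 3, an irrelevant one at rank 2.
score = ndcg_at_k([1, 0, 1])
```

Placing a relevant document lower in the ranking shrinks its contribution logarithmically, so the example scores below 1.0 even though both relevant documents were retrieved.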
Jingfen Qiao, Thong Nguyen, Evangelos Kanoulas, Andrew Yates
Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. With the advent of large-scale pre-trained language models, their capability to generate sparse representations for retrieval tasks across different transformer-based architectures, including encoder-only, decoder-only, and encoder-decoder models, remains largely unexplored. This study investigates the effectiveness of LSR across these architectures, exploring various sparse representation heads and model scales. Our results highlight the limitations of using large language models to create effective sparse representations in zero-shot settings, identifying challenges such as inappropriate term expansions and reduced performance due to the lack of expansion. We find that the encoder-decoder architecture with a multi-token decoding approach achieves the best performance among the three backbones. While the decoder-only model performs worse than the encoder-only model, it demonstrates the potential to outperform it when scaled to a high number of parameters.
Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Large language model (LLM) agents are increasingly employing
retrieval-augmented generation (RAG) to improve the factuality of their
responses. However, in practice, these systems often need to handle ambiguous
user queries and potentially conflicting information from multiple sources
while also suppressing inaccurate information from noisy or irrelevant
documents. Prior work has generally studied and addressed these challenges in
isolation, considering only one aspect at a time, such as handling ambiguity or
robustness to noise and misinformation. We instead consider multiple factors
simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and
Misinformation in Documents), a new dataset that simulates complex and
realistic scenarios for conflicting evidence for a user query, including
ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent
approach in which LLM agents debate over the merits of an answer over multiple
rounds, allowing an aggregator to collate responses corresponding to
disambiguated entities while discarding misinformation and noise, thereby
handling diverse sources of conflict jointly. We demonstrate the effectiveness
of MADAM-RAG using both closed and open-source models on AmbigDocs -- which
requires presenting all valid answers for ambiguous queries -- improving over
strong RAG baselines by up to 11.40% and on FaithEval -- which requires
suppressing misinformation -- where we improve by up to 15.80% (absolute) with
Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for
existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match
score). While MADAM-RAG begins to address these conflicting factors, our
analysis indicates that a substantial gap remains especially when increasing
the level of imbalance in supporting evidence and misinformation.
Authors' comments: Our data and code are available at:
https://github.com/HanNight/RAMDocs
Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, Shichao Kan
Composed Image Retrieval (CIR) retrieves target images using a multi-modal
query that combines a reference image with text describing desired
modifications. The primary challenge is effectively fusing this visual and
textual information. Current cross-modal feature fusion approaches for CIR
exhibit an inherent bias in intention interpretation. These methods tend to
disproportionately emphasize either the reference image features
(visual-dominant fusion) or the textual modification intent (text-dominant
fusion through image-to-text conversion). Such an imbalanced representation
often fails to accurately capture and reflect the actual search intent of the
user in the retrieval results. To address this challenge, we propose TMCIR, a
novel framework that advances composed image retrieval through two key
innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP
encoders contrastively using intent-reflecting pseudo-target images,
synthesized from reference images and textual descriptions via a diffusion
model. This step enhances the text encoder's ability to capture nuanced
intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune
all encoders contrastively by comparing adaptive token-fusion features with the
target image. This mechanism dynamically balances visual and textual
representations within the contrastive learning pipeline, optimizing the
composed feature for retrieval. Extensive experiments on Fashion-IQ and CIRR
datasets demonstrate that TMCIR significantly outperforms state-of-the-art
methods, particularly in capturing nuanced user intent.
Authors' comments: arXiv admin note: text overlap with arXiv:2310.05473 by other authors
Hanmeng Zhong, Linqing Chen, Weilei Wang, Wentao Wu
Recently, the application of retrieval-augmented Large Language Models (LLMs) in specific domains has gained significant attention, especially in biopharmaceuticals. However, in this context, there is no benchmark specifically designed for biopharmaceuticals to evaluate LLMs. In this paper, we introduce the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored for evaluating LLMs' Query and Reference Understanding Capability (QRUC) in the biopharmaceutical domain, available in English, French, German and Chinese. In addition, traditional Question-Answering (QA) metrics like accuracy and exact match fall short in open-ended retrieval-augmented QA scenarios. To address this, we propose a citation-based classification method to evaluate the QRUC of LLMs to understand the relationship between queries and references. We apply this method to evaluate the mainstream LLMs on BRAGE. Experimental results show that there is a significant gap in the biopharmaceutical QRUC of mainstream LLMs, and their QRUC needs to be improved.
Baolei Zhang, Yuxi Chen, Minghong Fang, Zhuqing Liu, Lihai Nie, Tong Li, Zheli Liu
Large language models (LLMs) have demonstrated impressive natural language processing abilities but face challenges such as hallucination and outdated knowledge. Retrieval-Augmented Generation (RAG) has emerged as a state-of-the-art approach to mitigate these issues. While RAG enhances LLM outputs, it remains vulnerable to poisoning attacks. Recent studies show that injecting poisoned text into the knowledge database can compromise RAG systems, but most existing attacks assume that the attacker can insert a sufficient number of poisoned texts per query to outnumber correct-answer texts in retrieval, an assumption that is often unrealistic. To address this limitation, we propose CorruptRAG, a practical poisoning attack against RAG systems in which the attacker injects only a single poisoned text, enhancing both feasibility and stealth. Extensive experiments across multiple datasets demonstrate that CorruptRAG achieves higher attack success rates compared to existing baselines.
Fengxia Liu, Zhiyong Zheng, Kun Tian, Yi Zhang, Heng Guo, Zhe Hu, Oleksiy Zhedanov, Zixian Gong
This paper introduces a novel lower bound on communication complexity using
quantum relative entropy and mutual information, refining previous classical
entropy-based results. By leveraging Uhlmann's lemma and quantum Pinsker
inequalities, the authors establish tighter bounds for information-theoretic
security, demonstrating that quantum protocols inherently outperform classical
counterparts in balancing privacy and efficiency. The paper also explores symmetric
Quantum Private Information Retrieval (QPIR) protocols that achieve sub-linear
communication complexity while ensuring robustness against specious
adversaries: a post-quantum-cryptography-based protocol that can be
authenticated for the specious server; a ring-LWE-based protocol for
post-quantum security in a single-server setting, ensuring robustness against
quantum attacks; and a multi-server protocol optimized for hardware practicality,
reducing implementation overhead while maintaining sub-linear efficiency. These
protocols address critical gaps in secure database queries, offering
exponential communication improvements over classical linear-complexity
methods. The work also analyzes security trade-offs under quantum specious
adversaries, providing theoretical guarantees for privacy and correctness.
Authors' comments: 11 pages, 1 figure
Sean MacAvaney, Antonio Mallia, Nicola Tonellotto
Multi-vector retrieval methods, exemplified by the ColBERT architecture, have
shown substantial promise for retrieval by providing strong trade-offs in terms
of retrieval latency and effectiveness. However, they come at a high cost in
terms of storage since a (potentially compressed) vector needs to be stored for
every token in the input collection. To overcome this issue, we propose
encoding documents to a fixed number of vectors, which are no longer
necessarily tied to the input tokens. Beyond reducing the storage costs, our
approach has the advantage that document representations become of a fixed size
on disk, allowing for better OS paging management. Through experiments using
the MSMARCO passage corpus and BEIR with the ColBERT-v2 architecture, a
representative multi-vector ranking model architecture, we find that passages
can be effectively encoded into a fixed number of vectors while retaining most
of the original effectiveness.
Authors' comments: ECIR 2025
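To make the storage argument above concrete, the sketch below shows ColBERT-style late-interaction (MaxSim) scoring when every document is represented by the same fixed number of vectors. It assumes the standard MaxSim operator is kept unchanged; how the fixed vectors are produced is not described in the abstract and is not modeled here. The vectors and names are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query vector is matched to its best document
    vector, and the per-query-vector maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query-token vectors and two documents, each stored as exactly 2 vectors
# (a fixed budget per document, independent of document length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 1.0]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
ranking = sorted(["doc_a", "doc_b"], key={"doc_a": score_a, "doc_b": score_b}.get, reverse=True)
```

With a fixed budget of m vectors of dimension d, every document occupies m × d values regardless of its token count, which is what gives the fixed on-disk record size the abstract refers to.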
Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang
Recently, building retrieval-augmented generation (RAG) systems to enhance
the capability of large language models (LLMs) has become a common practice.
Especially in the legal domain, previous judicial decisions play a significant
role under the doctrine of stare decisis which emphasizes the importance of
making decisions based on (retrieved) prior documents. However, the overall
performance of a RAG system depends on many components: (1) retrieval corpora,
(2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation
metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of
RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces
to facilitate seamless experiments and investigate how changes in the
aforementioned five components affect the overall accuracy. We validated LRAGE
using multilingual legal benchmarks including Korean (KBL), English (LegalBench),
and Chinese (LawBench) by demonstrating how the overall accuracy changes when
varying the five components mentioned above. The source code is available at
https://github.com/hoorangyee/LRAGE.
Authors' comments: 12 pages