Hyunsun Hong, Jongmoon Baik
Automated code review comment generation (RCG) aims to assist developers by automatically producing natural language feedback for code changes. Existing approaches are primarily either generation-based, using pretrained language models, or information retrieval-based (IR), reusing comments from similar past examples. While generation-based methods leverage code-specific pretraining on large code-natural language corpora to learn semantic relationships between code and natural language, they often struggle to generate low-frequency but semantically important tokens due to their probabilistic nature. In contrast, IR-based methods excel at recovering such rare tokens by copying from existing examples but lack flexibility in adapting to new code contexts-for example, when input code contains identifiers or structures not found in the retrieval database. To bridge the gap between generation-based and IR-based methods, this work proposes to leverage retrieval-augmented generation (RAG) for RCG by conditioning pretrained language models on retrieved code-review exemplars. By providing relevant examples that illustrate how similar code has been previously reviewed, the model is better guided to generate accurate review comments. Our evaluation on the Tufano et al. benchmark shows that RAG-based RCG outperforms both generation-based and IR-based RCG. It achieves up to +1.67% higher exact match and +4.25% higher BLEU scores compared to generation-based RCG. It also improves the generation of low-frequency ground-truth tokens by up to 24.01%. We additionally find that performance improves as the number of retrieved exemplars increases.
Xubo Qin, Jun Bai, Jiaqi Li, Zixia Jia, Zilong Zheng
Traditional information retrieval (IR) methods excel at textual and semantic matching but struggle in reasoning-intensive retrieval tasks that require multi-hop inference or complex semantic understanding between queries and documents. One promising solution is to explicitly rewrite or augment queries using large language models (LLMs) to elicit reasoning-relevant content prior to retrieval. However, the widespread use of large-scale language models like GPT-4 or LLaMA3-70B remains impractical due to their high inference cost and limited deployability in real-world systems. In this work, we introduce TongSearch QR (Previously Known as "TongSearch Reasoner"), a family of small-scale language models for query reasoning and rewriting in reasoning-intensive retrieval. With a novel semi-rule-based reward function, we employ reinforcement learning approaches enabling smaller language models, e,g, Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve query reasoning performance rivaling large-scale language models without their prohibitive inference costs. Experiment results on BRIGHT benchmark show that with BM25 as retrievers, both TongSearch QR-7B and TongSearch QR-1.5B models significantly outperform existing baselines, including prompt-based query reasoners and some latest dense retrievers trained for reasoning-intensive retrieval tasks, offering superior adaptability for real-world deployment.
Amit Jaspal, Qian Dang, Ajantha Ramineni
Modern large-scale recommender systems employ multi-stage ranking funnel (Retrieval, Pre-ranking, Ranking) to balance engagement and computational constraints (latency, CPU). However, the initial retrieval stage, often relying on efficient but less precise methods like K-Nearest Neighbors (KNN), struggles to effectively surface the most engaging items from billion-scale catalogs, particularly distinguishing highly relevant and engaging candidates from merely relevant ones. We introduce Recall Augmentation through Deferred Asynchronous Retrieval (RADAR), a novel framework that leverages asynchronous, offline computation to pre-rank a significantly larger candidate set for users using the full complexity ranking model. These top-ranked items are stored and utilized as a high-quality retrieval source during online inference, bypassing online retrieval and pre-ranking stages for these candidates. We demonstrate through offline experiments that RADAR significantly boosts recall (2X Recall@200 vs DNN retrieval baseline) by effectively combining a larger retrieved candidate set with a more powerful ranking model. Online A/B tests confirm a +0.8% lift in topline engagement metrics, validating RADAR as a practical and effective method to improve recommendation quality under strict online serving constraints.
Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, Yiqun Liu
Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.
Jifei Luo, Wenzheng Wu, Hantao Yao, Lu Yu, Changsheng Xu
Diffusion-based re-ranking methods are effective in modeling the data
manifolds through similarity propagation in affinity graphs. However, positive
signals tend to diminish over several steps away from the source, reducing
discriminative power beyond local regions. To address this issue, we introduce
the Locality Preserving Markovian Transition (LPMT) framework, which employs a
long-term thermodynamic transition process with multiple states for accurate
manifold distance measurement. The proposed LPMT first integrates diffusion
processes across separate graphs using Bidirectional Collaborative Diffusion
(BCD) to establish strong similarity relationships. Afterwards, Locality State
Embedding (LSE) encodes each instance into a distribution for enhanced local
consistency. These distributions are interconnected via the Thermodynamic
Markovian Transition (TMT) process, enabling efficient global retrieval while
maintaining local effectiveness. Experimental results across diverse tasks
confirm the effectiveness of LPMT for instance retrieval.
Authors' comments: This paper has been accepted by ICML2025
Qiang Fu, Zonglei Jing, Zonghao Ying, Xiaoqian Li
The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception-Retrieval-Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.
Wenzheng Zhang, Xi Victoria Lin, Karl Stratos, Wen-tau Yih, Mingda Chen
Retrieval-Augmented Generation (RAG) systems traditionally treat retrieval and generation as separate processes, requiring explicit textual queries to connect them. This separation can limit the ability of models to generalize across diverse tasks. In this work, we propose a query-free RAG system, named ImpRAG, which integrates retrieval and generation into a unified model. ImpRAG allows models to implicitly express their information needs, eliminating the need for human-specified queries. By dividing pretrained decoder-only language models into specialized layer groups, ImpRAG optimizes retrieval and generation tasks simultaneously. Our approach employs a two-stage inference process, using the same model parameters and forward pass for both retrieval and generation, thereby minimizing the disparity between retrievers and language models. Experiments on 8 knowledge-intensive tasks demonstrate that ImpRAG achieves 3.6-11.5 improvements in exact match scores on unseen tasks with diverse formats, highlighting its effectiveness in enabling models to articulate their own information needs and generalize across tasks. Our analysis underscores the importance of balancing retrieval and generation parameters and leveraging generation perplexities as retrieval training objectives for enhanced performance.
Xiaochen Wang, Zongyu Wu, Yuan Zhong, Xiang Zhang, Suhang Wang, Fenglong Ma
Graph retrieval-augmented generation (GRAG) places high demands on
graph-specific retrievers. However, existing retrievers often rely on language
models pretrained on plain text, limiting their effectiveness due to domain
misalignment and structure ignorance. To address these challenges, we propose
GPR, a graph-based retriever pretrained directly on knowledge graphs. GPR
aligns natural language questions with relevant subgraphs through LLM-guided
graph augmentation and employs a structure-aware objective to learn
fine-grained retrieval strategies. Experiments on two datasets, three LLM
backbones, and five baselines show that GPR consistently improves both
retrieval quality and downstream generation, demonstrating its effectiveness as
a robust retrieval solution for GRAG.
Authors' comments: Short paper submitted to EMNLP'25
Wenhao Ding, Sushant Veer, Yuxiao Chen, Yulong Cao, Chaowei Xiao, Marco Pavone
Learning-based planners generate natural human-like driving behaviors by learning to reason about nuanced interactions from data, overcoming the rigid behaviors that arise from rule-based planners. Nonetheless, data-driven approaches often struggle with rare, safety-critical scenarios and offer limited controllability over the generated trajectories. To address these challenges, we propose RealDrive, a Retrieval-Augmented Generation (RAG) framework that initializes a diffusion-based planning policy by retrieving the most relevant expert demonstrations from the training dataset. By interpolating between current observations and retrieved examples through a denoising process, our approach enables fine-grained control and safe behavior across diverse scenarios, leveraging the strong prior provided by the retrieved scenario. Another key insight we produce is that a task-relevant retrieval model trained with planning-based objectives results in superior planning performance in our framework compared to a task-agnostic retriever. Experimental results demonstrate improved generalization to long-tail events and enhanced trajectory diversity compared to standard learning-based planners -- we observe a 40% reduction in collision rate on the Waymo Open Motion dataset with RAG.
Fabio Fehr, Prabhu Teja Sivaprasad, Luca Franceschi, Giovanni Zappella
In this paper, we introduce CoRet, a dense retrieval model designed for
code-editing tasks that integrates code semantics, repository structure, and
call graph dependencies. The model focuses on retrieving relevant portions of a
code repository based on natural language queries such as requests to implement
new features or fix bugs. These retrieved code chunks can then be presented to
a user or to a second code-editing model or agent. To train CoRet, we propose a
loss function explicitly designed for repository-level retrieval. On SWE-bench
and Long Code Arena's bug localisation datasets, we show that our model
substantially improves retrieval recall by at least 15 percentage points over
existing models, and ablate the design choices to show their importance in
achieving these results.
Authors' comments: ACL 2025
Chunxu Liu, Chi Xie, Xiaxu Chen, Wei Li, Feng Zhu, Rui Zhao, Limin Wang
Text-to-Image Retrieval (T2IR) is a highly valuable task that aims to match a
given textual query to images in a gallery. Existing benchmarks primarily focus
on textual queries describing overall image semantics or foreground salient
objects, possibly overlooking inconspicuous small objects, especially in
complex environments. Such small object retrieval is crucial, as in real-world
applications, the targets of interest are not always prominent in the image.
Thus, we introduce SORCE (Small Object Retrieval in Complex Environments), a
new subfield of T2IR, focusing on retrieving small objects in complex images
with textual queries. We propose a new benchmark, SORCE-1K, consisting of
images with complex environments and textual queries describing less
conspicuous small objects with minimal contextual cues from other salient
objects. Preliminary analysis on SORCE-1K finds that existing T2IR methods
struggle to capture small objects and encode all the semantics into a single
embedding, leading to poor retrieval performance on SORCE-1K. Therefore, we
propose to represent each image with multiple distinctive embeddings. We
leverage Multimodal Large Language Models (MLLMs) to extract multiple
embeddings for each image instructed by a set of Regional Prompts (ReP).
Experimental results show that our multi-embedding approach through MLLM and
ReP significantly outperforms existing T2IR methods on SORCE-1K. Our
experiments validate the effectiveness of SORCE-1K for benchmarking SORCE
performances, highlighting the potential of multi-embedding representation and
text-customized MLLM features for addressing this task.
Authors' comments: Project Page: https://github.com/MCG-NJU/SORCE
Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Guihai Chen
Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.
Baolei Zhang, Haoran Xin, Jiatong Li, Dongzhe Zhang, Minghong Fang, Zhuqing Liu, Lihai Nie, Zheli Liu
Retrieval-Augmented Generation (RAG) has proven effective in mitigating hallucinations in large language models by incorporating external knowledge during inference. However, this integration introduces new security vulnerabilities, particularly to poisoning attacks. Although prior work has explored various poisoning strategies, a thorough assessment of their practical threat to RAG systems remains missing. To address this gap, we propose the first comprehensive benchmark framework for evaluating poisoning attacks on RAG. Our benchmark covers 5 standard question answering (QA) datasets and 10 expanded variants, along with 13 poisoning attack methods and 7 defense mechanisms, representing a broad spectrum of existing techniques. Using this benchmark, we conduct a comprehensive evaluation of all included attacks and defenses across the full dataset spectrum. Our findings show that while existing attacks perform well on standard QA datasets, their effectiveness drops significantly on the expanded versions. Moreover, our results demonstrate that various advanced RAG architectures, such as sequential, branching, conditional, and loop RAG, as well as multi-turn conversational RAG, multimodal RAG systems, and RAG-based LLM agent systems, remain susceptible to poisoning attacks. Notably, current defense techniques fail to provide robust protection, underscoring the pressing need for more resilient and generalizable defense strategies.
Debrup Das, Sam O' Nuallain, Razieh Rahimi
We propose RaDeR, a set of reasoning-based dense retrieval models trained
with data derived from mathematical problem solving using large language models
(LLMs). Our method leverages retrieval-augmented reasoning trajectories of an
LLM and self-reflective relevance evaluation, enabling the creation of both
diverse and hard-negative samples for reasoning-intensive relevance. RaDeR
retrievers, trained for mathematical reasoning, effectively generalize to
diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently
outperforming strong baselines in overall performance.Notably, RaDeR achieves
significantly higher performance than baselines on the Math and Coding splits.
In addition, RaDeR presents the first dense retriever that outperforms BM25
when queries are Chain-of-Thought reasoning steps, underscoring the critical
role of reasoning-based retrieval to augment reasoning language models.
Furthermore, RaDeR achieves comparable or superior performance while using only
2.5% of the training data used by the concurrent work REASONIR, highlighting
the quality of our synthesized training data.
Authors' comments: 26 pages
Wei Liu, Sony Trenous, Leonardo F. R. Ribeiro, Bill Byrne, Felix Hieber
We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.
Xingyu Ji, Parker Glenn, Aditya G. Parameswaran, Madelon Hulsebos
The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, either through conversational interfaces or agentic components, in structured data through retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline which is less effective than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers across various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at https://target-benchmark.github.io.
Owen Kwon, Abraham George, Alison Bartsch, Amir Barati Farimani
This paper introduces RT-cache, a novel trajectorymemory pipeline that
accelerates real-world robot inference by leveraging big-data retrieval and
learning from experience. While modern Vision-Language-Action (VLA) models can
handle diverse robotic tasks, they often incur high per-step inference costs,
resulting in significant latency, sometimes minutes per task. In contrast,
RT-cache stores a large-scale Memory of previously successful robot
trajectories and retrieves relevant multistep motion snippets, drastically
reducing inference overhead. By integrating a Memory Builder with a Trajectory
Retrieval, we develop an efficient retrieval process that remains tractable
even for extremely large datasets. RT-cache flexibly accumulates real-world
experiences and replays them whenever the current scene matches past states,
adapting quickly to new or unseen environments with only a few additional
samples. Experiments on the Open-X Embodiment Dataset and other real-world data
demonstrate that RT-cache completes tasks both faster and more successfully
than a baseline lacking retrieval, suggesting a practical, data-driven solution
for real-time manipulation.
Authors' comments: 9 pages, 5 figures. Submitted to an IEEE robotics conference
Xianrui Zhong, Bowen Jin, Siru Ouyang, Yanzhen Shen, Qiao Jin, Yin Fang, Zhiyong Lu, Jiawei Han
Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain -- achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data is available at https://chemrag.github.io.
Tianxing Yang, Huigen Ye, Hua Xu
Mixed-Integer Linear Programming (MILP) is widely used in fields such as scheduling, logistics, and planning. Enhancing the performance of MILP solvers, particularly learning-based solvers, requires substantial amounts of high-quality data. However, existing methods for MILP instance generation typically necessitate training a separate model for each problem class and are computationally intensive when generating new instances. To address these limitations, we reformulate the MILP Instance Generation task as MILP Code Generation task, enabling efficient, flexible, and interpretable instance generation through code. Since MILP instances generated from code can vary significantly in scale, we introduce MILP-EmbedSim, a new similarity metric that accurately measures the similarity between instances of varying sizes within the same problem class. Leveraging this metric, we propose MILP-Retrieval, a pipeline that retrieves generation code from library to produce MILP instances highly similar to target instance. MILP-Retrieval outperforms baselines in both MILP Code Generation and Instance Generation tasks, provides a novel perspective on MILP instance generation and opens new possibilities for learning-based solvers.
Sean MacAvaney
Sharing artifacts -- such as trained models, pre-built indexes, and the code
to use them -- aids in reproducibility efforts by allowing researchers to
validate intermediate steps and improves the sustainability of research by
allowing multiple groups to build off one another's prior computational work.
Although there are de facto consensuses on how to share research code (through
a git repository linked to from publications) and trained models (via
HuggingFace Hub), there is no consensus for other types of artifacts, such as
built indexes. Given the practical utility of using shared indexes, researchers
have resorted to self-hosting these resources or performing ad hoc file
transfers upon request, ultimately limiting the artifacts' discoverability and
reuse. This demonstration introduces a flexible and interoperable way to share
artifacts for Information Retrieval research, improving both their
accessibility and usability.
Authors' comments: SIGIR 2025 (demo)