Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Generative retrieval (GR) has emerged as a new paradigm in neural information retrieval, offering an alternative to dense retrieval (DR) by directly generating identifiers of relevant documents. In this paper, we theoretically and empirically investigate how GR fundamentally diverges from DR in both learning objectives and representational capacity. GR performs globally normalized maximum-likelihood optimization and encodes corpus and relevance information directly in the model parameters, whereas DR adopts locally normalized objectives and represents the corpus with external embeddings before computing similarity via a bilinear interaction. Our analysis suggests that, under scaling, GR can overcome the inherent limitations of DR, yielding two major benefits. First, with larger corpora, GR avoids the sharp performance degradation caused by the optimization drift induced by DR's local normalization. Second, with larger models, GR's representational capacity scales with parameter size, unconstrained by the global low-rank structure that limits DR. We validate these theoretical insights through controlled experiments on the Natural Questions and MS MARCO datasets, across varying negative sampling strategies, embedding dimensions, and model scales. But despite its theoretical advantages, GR does not universally outperform DR in practice. We outline directions to bridge the gap between GR's theoretical potential and practical performance, providing guidance for future research in scalable and robust generative retrieval.
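The contrast between global and local normalization can be illustrated with a toy example. The sketch below compares a softmax negative log-likelihood normalized over the whole corpus with one normalized over the positive plus a few sampled negatives. The scores, the function name `log_softmax_loss`, and the sampled indices are illustrative assumptions, not the paper's experimental setup.

```python
import math

def log_softmax_loss(scores, positive_idx, candidate_idxs):
    """Negative log-likelihood of the positive, normalized over candidate_idxs."""
    # The positive must be part of the normalization set.
    assert positive_idx in candidate_idxs
    max_s = max(scores[i] for i in candidate_idxs)  # subtract max for stability
    log_z = max_s + math.log(sum(math.exp(scores[i] - max_s) for i in candidate_idxs))
    return log_z - scores[positive_idx]

# Hypothetical relevance scores for a 6-document corpus; document 0 is relevant.
scores = [2.0, 1.5, 0.3, -0.2, 1.9, 0.1]

# Global normalization: the partition function covers every document.
global_loss = log_softmax_loss(scores, 0, list(range(len(scores))))

# Local normalization: only the positive and two sampled negatives are seen,
# so the hard negative at index 4 never contributes to the loss.
local_loss = log_softmax_loss(scores, 0, [0, 2, 3])

print(f"global loss: {global_loss:.4f}")
print(f"local loss:  {local_loss:.4f}")
```

Because the local partition function sums over a subset of the corpus, the local loss can never exceed the global one, and any hard negative left out of the sample is invisible to the optimizer.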
Kaishuai Xu, Wenjun Hou, Yi Cheng, Wenjie Li
Large Language Models (LLMs) have shown promising performance on diverse medical benchmarks, highlighting their potential in supporting real-world clinical tasks. Retrieval-Augmented Generation (RAG) has emerged as a key approach for mitigating knowledge gaps and hallucinations by incorporating external medical information. However, RAG still struggles with complex medical questions that require intensive reasoning, as surface-level input often fails to reflect the true knowledge needs of the task. Existing methods typically focus on refining queries without explicitly modeling the reasoning process, limiting their ability to retrieve and integrate clinically relevant knowledge. In this work, we propose RAR$^2$, a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning. RAR$^2$ constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. We build a training dataset of mixed preference pairs and apply Direct Preference Optimization (DPO) to train the model. Moreover, we design two test-time scaling strategies to explore the boundaries of our framework. Experiments demonstrate the effectiveness of RAR$^2$ across several biomedical question answering datasets, outperforming RAG baselines with or without fine-tuning.
Authors' comments: Accepted by EMNLP 2025 Findings
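For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair, computed from sequence log-probabilities under the policy and a frozen reference model. The function name, argument names, and the `beta=0.1` default are assumptions for illustration; the abstract does not specify how RAR$^2$ configures DPO.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Hypothetical log-probabilities: the policy has moved toward the chosen answer.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; it falls below that value as the policy favors the chosen response more strongly than the reference does.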
Julian Killingback, Hamed Zamani
Large language models (LLMs) are incredible and versatile tools for text-based tasks that have enabled countless, previously unimaginable, applications. Retrieval models, in contrast, have not yet seen such capable general-purpose models emerge. To achieve this goal, retrieval models must be able to perform complex retrieval tasks, where queries contain multiple parts, constraints, or requirements in natural language. These tasks represent a natural progression from the simple, single-aspect queries that are used in the vast majority of existing, commonly used evaluation sets. Complex queries naturally arise as people expect search systems to handle more specific and often ambitious information requests, as is demonstrated by how people use LLM-based information systems. Despite the growing desire for retrieval models to expand their capabilities in complex retrieval tasks, there exist limited resources to assess the ability of retrieval models on a comprehensive set of diverse complex tasks. The few resources that do exist feature a limited scope and often lack realistic settings, making it hard to know the true capabilities of retrieval models on complex real-world retrieval tasks. To address this shortcoming and spur innovation in next-generation retrieval models, we construct a diverse and realistic set of complex retrieval tasks and benchmark a representative set of state-of-the-art retrieval models. Additionally, we explore the impact of LLM-based query expansion and rewriting on retrieval quality. Our results show that even the best models struggle to produce high-quality retrieval results, with the highest average nDCG@10 of only 0.346 and R@100 of only 0.587 across all tasks. Although LLM augmentation can help weaker models, the strongest model has decreased performance across all metrics with all rewriting techniques.
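The two reported metrics, nDCG@10 and R@100, can be computed as in the minimal sketch below. It assumes the linear-gain form of DCG (some evaluation tools use an exponential gain instead) and a `relevance` dictionary mapping document IDs to graded labels; the function names and toy data are illustrative and are not taken from the benchmark's evaluation code.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Toy example: two relevant documents, one of them ranked third.
relevance = {"d1": 1, "d3": 1}
ranking = ["d1", "d2", "d3"]
print(ndcg_at_k(ranking, relevance, k=10))
print(recall_at_k(ranking, relevance, k=2))
```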
Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
Authors' comments: ACM Multimedia 2025
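Reciprocal Rank Fusion, used above to combine outputs from multiple configurations, scores each item by summing 1 / (k + rank) over the input rankings. The sketch below is a generic implementation with the commonly used constant k = 60; the function name, the tie-breaking rule, and the toy rankings are assumptions and do not reflect the authors' released code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one by summing 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Three hypothetical retrieval configurations ranking the same image pool.
runs = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
print(reciprocal_rank_fusion(runs))
```

Items that appear near the top of several lists accumulate the largest scores, so the fusion rewards agreement between configurations without needing their raw scores to be comparable.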
Leqian Li, Dianxi Shi, Jialu Zhou, Xinyu Wei, Mingyue Yang, Songchang Jin, Shaowu Yang
Large Language Models (LLMs) have shown remarkable capabilities across diverse tasks, yet they face inherent limitations such as constrained parametric knowledge and high retraining costs. Retrieval-Augmented Generation (RAG) augments the generation process by retrieving externally stored knowledge absent from the model's internal parameters. However, RAG methods suffer from information loss and redundant retrievals during multi-round queries, along with difficulty in precisely characterizing knowledge gaps for complex tasks. To address these problems, we propose Retrieval Feedback and Memory Retrieval Augmented Generation (RFM-RAG), which transforms the stateless retrieval of previous methods into stateful continuous knowledge management by constructing a dynamic evidence pool. Specifically, our method generates refined queries describing the model's knowledge gaps using relational triples from questions and evidence from the dynamic evidence pool; retrieves critical external knowledge to iteratively update this evidence pool; and employs an R-Feedback Model to evaluate evidence completeness until convergence. Compared to traditional RAG methods, our approach enables persistent storage of retrieved passages and effectively distills key information from passages to construct clear new queries. Experiments on three public QA benchmarks demonstrate that RFM-RAG outperforms previous methods and improves overall system accuracy.
Julien Guinot, Elio Quinton, György Fazekas
Multimodal contrastive models have achieved strong performance in text-audio
retrieval and zero-shot settings, but improving joint embedding spaces remains
an active research area. Less attention has been given to making these systems
controllable and interactive for users. In text-music retrieval, the ambiguity
of freeform language creates a many-to-many mapping, often resulting in
inflexible or unsatisfying results.
We introduce Generative Diffusion Retriever (GDR), a novel framework that
leverages diffusion models to generate queries in a retrieval-optimized latent
space. This enables controllability through generative tools such as negative
prompting and denoising diffusion implicit models (DDIM) inversion, opening a
new direction in retrieval control. GDR improves retrieval performance over
contrastive teacher models and supports retrieval in audio-only latent spaces
using non-jointly trained encoders. Finally, we demonstrate that GDR enables
effective post-hoc manipulation of retrieval behavior, enhancing interactive
control for text-music retrieval tasks.
Authors' comments: Accepted to ISMIR 2025
Chaeeun Kim, Seungone Kim
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in
multi-step reasoning and calling search engines at appropriate steps. However,
existing retrieval-augmented reasoning approaches rely on separate retrieval
models, limiting the LRM's role in retrieval to deciding when to retrieve and
how to query. This separation not only increases hardware and operational costs
but also leads to errors in the retrieval process due to the representation
bottleneck, a phenomenon where the retriever's embedding space is not
expressive enough to meet the generator's requirements. To address this, we
shift our perspective from sequence-to-sequence matching to locating the
answer-containing paths within the corpus, and propose a novel framework called
FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables
LRMs to retrieve relevant knowledge on their own by acting as both a generator
and retriever. To achieve this, we introduce a variant of the MCTS algorithm
specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing
Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus
toward answer-containing regions. Our results on five open-domain QA
benchmarks, including single-hop and multi-hop questions, show that FREESON
achieves an average improvement of 14.4% in EM and F1 over four multi-step
reasoning models with a separate retriever, and it also performs comparably to
the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
Authors' comments: Work In Progress
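EM and F1, the two answer-quality metrics reported above, are typically computed per question as exact string match and token-overlap F1. The sketch below uses a simplified normalization (lowercasing and whitespace tokenization only); standard QA evaluation scripts usually also strip punctuation and articles, so the function names and normalization here are assumptions for illustration.

```python
from collections import Counter

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris ", "paris"))
print(token_f1("the eiffel tower", "eiffel tower"))
```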
Arthur Satouf, Gabriel Ben Zenou, Benjamin Piwowarski, Habiboulaye Amadou Boubacar, Pablo Piantanida
Current sparse neural information retrieval (IR) methods, and to a lesser
extent more traditional models such as BM25, do not take into account the
document collection and the complex interplay between different term weights
when representing a single document. In this paper, we show how the Rational
Speech Acts (RSA), a linguistics framework used to minimize the number of
features to be communicated when identifying an object in a set, can be adapted
to the IR case -- and in particular to the high number of potential features
(here, tokens). RSA dynamically modulates token-document interactions by
considering the influence of other documents in the dataset, better contrasting
document representations. Experiments show that incorporating RSA consistently
improves multiple sparse retrieval models and achieves state-of-the-art
performance on out-of-domain datasets from the BEIR benchmark.
https://github.com/arthur-75/Rational-Retrieval-Acts
Authors' comments: 6 pages - 2 figures - conference: accepted at SIGIR 2025
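The listener and speaker steps of a generic RSA recursion can be sketched on a toy token-document weight matrix. This is a simplified illustration with a uniform document prior and no utterance cost; the function names, the `alpha` rationality parameter, and the toy weights are assumptions, and the paper's exact adaptation to sparse retrieval may differ.

```python
def normalize(values):
    """Scale non-negative values to sum to 1 (all zeros stay zero)."""
    total = sum(values)
    return [v / total if total > 0 else 0.0 for v in values]

def rsa_reweight(weights, alpha=1.0):
    """weights[d][t] >= 0 is the base weight of token t in document d.

    Returns speaker[d][t], a per-document distribution over tokens that
    favors tokens which single out document d among all documents.
    """
    n_docs, n_tokens = len(weights), len(weights[0])
    # Literal listener: for each token, a distribution over documents.
    listener = [
        normalize([weights[d][t] for d in range(n_docs)])
        for t in range(n_tokens)
    ]
    # Pragmatic speaker: for each document, a distribution over tokens,
    # proportional to how well each token identifies that document.
    return [
        normalize([listener[t][d] ** alpha for t in range(n_tokens)])
        for d in range(n_docs)
    ]

# Token 0 occurs in both documents; token 1 occurs only in document 0.
weights = [
    [1.0, 1.0],
    [1.0, 0.0],
]
speaker = rsa_reweight(weights)
print(speaker)
```

In this toy case document 0 shifts weight toward token 1, the token that distinguishes it from document 1, even though both tokens started with equal base weight.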
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye et al.
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
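The intuition behind prefetching for IVF search can be shown with a toy inverted file index: a search only touches the clusters whose centroids are closest to the query, so if an early version of the query probes the same clusters as the final query, those clusters can be moved ahead of time. The sketch below is a pure-Python illustration under that assumption; the function names, the toy vectors, and the "preliminary query" are hypothetical and do not describe TeleRAG's actual implementation.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector index to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists

def probe_clusters(query, centroids, nprobe):
    """IDs of the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    return order[:nprobe]

def ivf_search(query, vectors, centroids, lists, nprobe=1, top_k=1):
    """Search only the vectors stored in the probed clusters."""
    candidates = [i for c in probe_clusters(query, centroids, nprobe) for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:top_k]

centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
vectors = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0], [1.0, 9.0]]
lists = build_ivf(vectors, centroids)

# Hypothetical preliminary query (available before generation finishes)
# and the final query; both probe the same cluster, so it could be prefetched.
prefetched = probe_clusters([8.0, 9.0], centroids, nprobe=1)
needed = probe_clusters([9.0, 9.5], centroids, nprobe=1)
print(prefetched, needed, ivf_search([9.0, 9.5], vectors, centroids, lists))
```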
Fengshuo Bai, Yu Li, Jie Chu, Tawei Chou, Runchuan Zhu, Ying Wen, Yaodong Yang, Yuanpei Chen
Retrieving objects buried beneath multiple objects is not only challenging but also time-consuming. Performing manipulation in such environments presents significant difficulty due to complex contact relationships. Existing methods typically address this task by sequentially grasping and removing each occluding object, resulting in lengthy execution times and requiring impractical grasping capabilities for every occluding object. In this paper, we present a dexterous arm-hand system for efficient object retrieval in multi-object stacked environments. Our approach leverages large-scale parallel reinforcement learning within diverse and carefully designed cluttered environments to train policies. These policies demonstrate emergent manipulation skills (e.g., pushing, stirring, and poking) that efficiently clear occluding objects to expose sufficient surface area of the target object. We conduct extensive evaluations across a set of over 10 household objects in diverse clutter configurations, demonstrating superior retrieval performance and efficiency for both trained and unseen objects. Furthermore, we successfully transfer the learned policies to a real-world dexterous multi-fingered robot system, validating their practical applicability in real-world scenarios. Videos can be found on our project website https://ChangWinde.github.io/RetrDex.
Jinyuan Fang, Zaiqiao Meng, Craig Macdonald
Iterative retrieval-augmented generation (iRAG) models offer an effective approach for multi-hop question answering (QA). However, their retrieval process faces two key challenges: (1) it can be disrupted by irrelevant documents or factually inaccurate chain-of-thoughts; (2) their retrievers are not designed to dynamically adapt to the evolving information needs in multi-step reasoning, making it difficult to identify and retrieve the missing information required at each iterative step. Therefore, we propose KiRAG, which uses a knowledge-driven iterative retriever model to enhance the retrieval process of iRAG. Specifically, KiRAG decomposes documents into knowledge triples and performs iterative retrieval with these triples to enable a factually reliable retrieval process. Moreover, KiRAG integrates reasoning into the retrieval process to dynamically identify and retrieve knowledge that bridges information gaps, effectively adapting to the evolving information needs. Empirical results show that KiRAG significantly outperforms existing iRAG models, with an average improvement of 9.40% in R@3 and 5.14% in F1 on multi-hop QA.
Yanhao Jia, Xinyi Wu, Hao Li, Qinglin Zhang, Yuxiao Hu, Shuai Zhao, Wenqi Fan
In AI-facilitated teaching, leveraging various query styles to interpret abstract text descriptions is crucial for ensuring high-quality teaching. However, current retrieval models primarily focus on natural text-image retrieval, making them insufficiently tailored to educational scenarios due to the ambiguities in the retrieval process. In this paper, we propose a diverse expression retrieval task tailored to educational scenarios, supporting retrieval based on multiple query styles and expressions. We introduce the STEM Education Retrieval Dataset (SER), which contains over 24,000 query pairs of different styles, and Uni-Retrieval, an efficient and style-diversified retrieval vision-language model based on prompt tuning. Uni-Retrieval extracts query style features as prototypes and builds a continuously updated Prompt Bank containing prompt tokens for diverse queries. This bank can be updated during test time to represent domain-specific knowledge for different subject retrieval scenarios. Our framework demonstrates scalability and robustness by dynamically retrieving prompt tokens based on prototype similarity, effectively facilitating learning for unknown queries. Experimental results indicate that Uni-Retrieval outperforms existing retrieval models in most retrieval tasks. This advancement provides a scalable and precise solution for diverse educational needs.
Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik
Large Language Models (LLMs) have revolutionized artificial intelligence with
capabilities in reasoning, coding, and communication, driving innovation across
industries. Their true potential depends on effective alignment to ensure
correct, trustworthy and ethical behavior, addressing challenges like
misinformation, hallucinations, bias and misuse. While existing Reinforcement
Learning (RL)-based alignment methods are notoriously complex, direct
optimization approaches offer a simpler alternative. In this work, we introduce
a novel direct optimization approach for LLM alignment by drawing on
established Information Retrieval (IR) principles. We present a systematic
framework that bridges LLM alignment and IR methodologies, mapping LLM
generation and reward models to IR's retriever-reranker paradigm. Building on
this foundation, we propose LLM Alignment as Retriever Preference Optimization
(LarPO), a new alignment method that enhances overall alignment quality.
Extensive experiments validate LarPO's effectiveness with 38.9% and 13.7%
averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work
opens new avenues for advancing LLM alignment by integrating IR foundations,
offering a promising direction for future research.
Authors' comments: 26 pages
Shi-Qi Yan, Zhen-Hua Ling
While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce Retrieval Preference Optimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, removing the need for the additional retrieval-quality assessment procedure that previous methods require. Notably, RPO is the only RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization.
Arshia Hemmat, Kianoosh Vadaei, Mohammad Hassan Heydari, Afsaneh Fatemi
This paper introduces an innovative approach using Retrieval-Augmented
Generation (RAG) pipelines with Large Language Models (LLMs) to enhance
information retrieval and query response systems for university-related
question answering. By systematically extracting data from the university's
official webpage and employing advanced prompt engineering techniques, we
generate accurate, contextually relevant responses to user queries.
We developed a comprehensive university benchmark, UniversityQuestionBench
(UQB), to rigorously evaluate our system's performance, based on common key
metrics in the field of RAG pipelines, assessing accuracy and reliability
through various metrics and real-world scenarios. Our experimental results
demonstrate significant improvements in the precision and relevance of
generated responses, enhancing user experience and reducing the time required
to obtain relevant answers. In summary, this paper presents a novel application
of RAG pipelines and LLMs, supported by a meticulously prepared university
benchmark, offering valuable insights into advanced AI techniques for academic
data retrieval and setting the stage for future research in this domain.
Authors' comments: 6 pages, 2 figures, 1 table, Submitted to 15th IKT conference
Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Chanyoung Park
While inorganic retrosynthesis planning is essential in the field of chemical
science, the application of machine learning in this area has been notably less
explored compared to organic retrosynthesis planning. In this paper, we propose
Retrieval-Retro for inorganic retrosynthesis planning, which implicitly
extracts the precursor information of reference materials that are retrieved
from the knowledge base regarding domain expertise in the field. Specifically,
instead of directly employing the precursor information of reference materials,
we propose implicitly extracting it with various attention layers, which
enables the model to learn novel synthesis recipes more effectively. Moreover,
during retrieval, we consider the thermodynamic relationship between target
material and precursors, which is essential domain expertise in identifying the
most probable precursor set among various options. Extensive experiments
demonstrate the superiority of Retrieval-Retro in retrosynthesis planning,
especially in discovering novel synthesis recipes, which is crucial for
materials discovery. The source code for Retrieval-Retro is available at
https://github.com/HeewoongNoh/Retrieval-Retro.
Authors' comments: NeurIPS 2024
Atula Tejaswi, Yoonsang Lee, Sujay Sanghavi, Eunsol Choi
We investigate whether in-context examples, widely used in decoder-only language models (LLMs), can improve embedding model performance in retrieval tasks. Unlike in LLMs, naively prepending in-context examples (query-document pairs) to the target query at inference time does not work out of the box. We introduce a simple approach to enable retrievers to use in-context examples. Our approach, RARe, finetunes a pre-trained model with in-context examples whose query is semantically similar to the target query. This can be applied to adapt various base architectures (i.e., decoder-only language models, retriever models) and consistently achieves performance gains of up to +2.72% nDCG across various open-domain retrieval datasets (BeIR, RAR-b). In particular, we find RARe exhibits stronger out-of-domain generalization compared to models using queries without in-context examples, similar to what is seen for in-context learning in LLMs. We further provide analysis on the design choices of in-context example augmentation and lay the foundation for future work in this space.
Ruobing Wang, Daren Zha, Shi Yu, Qingfei Zhao, Yuxuan Chen, Yixuan Wang, Shuo Wang, Yukun Yan et al.
Retrieval-Augmented Generation (RAG) mitigates the factual errors
and hallucinated outputs generated by Large Language Models (LLMs) in
open-domain question-answering tasks (OpenQA) by introducing external
knowledge. For complex QA, however, existing RAG methods use LLMs to actively
predict retrieval timing and directly use the retrieved information for
generation, regardless of whether the retrieval timing accurately reflects the
actual information needs, or sufficiently considers prior retrieved knowledge,
which may result in insufficient information gathering and interaction,
yielding low-quality answers. To address these, we propose a generic RAG
approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA
tasks, which includes the iterative information collector, adaptive memory
reviewer, and task-oriented generator, while following a new
Retriever-and-Memory paradigm. Specifically, Adaptive-Note introduces an
overarching view of knowledge growth, iteratively gathering new information in
the form of notes and updating them into the existing optimal knowledge
structure, enhancing high-quality knowledge interactions. In addition, we
employ an adaptive, note-based stop-exploration strategy to decide "what to
retrieve and when to stop" to encourage sufficient knowledge exploration. We
conduct extensive experiments on five complex QA datasets, and the results
demonstrate the superiority and effectiveness of our method and its components.
The code and data are at https://github.com/thunlp/Adaptive-Note.
Authors' comments: 15 pages, 2 figures
Junpeng Yue, Xinrun Xu, Börje F. Karlsson, Zongqing Lu
MLLM agents demonstrate potential for complex embodied tasks by retrieving
multimodal task-relevant trajectory data. However, current retrieval methods
primarily focus on surface-level similarities of textual or visual cues in
trajectories, neglecting their effectiveness for the specific task at hand. To
address this issue, we propose a novel method, MLLM As ReTriever (MART), which
enhances the performance of embodied agents by utilizing interaction data to
fine-tune an MLLM retriever based on preference learning, such that the
retriever fully considers the effectiveness of trajectories and prioritizes
them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism
that leverages MLLMs' summarization capabilities to represent trajectories with
fewer tokens while preserving key information, enabling agents to better
comprehend milestones in the trajectory. Experimental results across various
environments demonstrate our method significantly improves task success rates
in unseen scenes compared to baseline methods. This work presents a new
paradigm for multimodal retrieval in embodied agents, by fine-tuning a
general-purpose MLLM as the retriever to assess trajectory effectiveness. All
the code for benchmark tasks, simulator modifications, and the MLLM retriever
is available at https://github.com/PKU-RL/MART.
Authors' comments: ICLR 2025