Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye et al.
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
Fengshuo Bai, Yu Li, Jie Chu, Tawei Chou, Runchuan Zhu, Ying Wen, Yaodong Yang, Yuanpei Chen
Retrieving objects buried beneath multiple objects is not only challenging but also time-consuming. Performing manipulation in such environments presents significant difficulty due to complex contact relationships. Existing methods typically address this task by sequentially grasping and removing each occluding object, resulting in lengthy execution times and requiring impractical grasping capabilities for every occluding object. In this paper, we present a dexterous arm-hand system for efficient object retrieval in multi-object stacked environments. Our approach leverages large-scale parallel reinforcement learning within diverse and carefully designed cluttered environments to train policies. These policies demonstrate emergent manipulation skills (e.g., pushing, stirring, and poking) that efficiently clear occluding objects to expose sufficient surface area of the target object. We conduct extensive evaluations across a set of over 10 household objects in diverse clutter configurations, demonstrating superior retrieval performance and efficiency for both trained and unseen objects. Furthermore, we successfully transfer the learned policies to a real-world dexterous multi-fingered robot system, validating their practical applicability in real-world scenarios. Videos can be found on our project website https://ChangWinde.github.io/RetrDex.
Jinyuan Fang, Zaiqiao Meng, Craig Macdonald
Iterative retrieval-augmented generation (iRAG) models offer an effective approach for multi-hop question answering (QA). However, their retrieval process faces two key challenges: (1) it can be disrupted by irrelevant documents or factually inaccurate chain-of-thoughts; (2) their retrievers are not designed to dynamically adapt to the evolving information needs in multi-step reasoning, making it difficult to identify and retrieve the missing information required at each iterative step. Therefore, we propose KiRAG, which uses a knowledge-driven iterative retriever model to enhance the retrieval process of iRAG. Specifically, KiRAG decomposes documents into knowledge triples and performs iterative retrieval with these triples to enable a factually reliable retrieval process. Moreover, KiRAG integrates reasoning into the retrieval process to dynamically identify and retrieve knowledge that bridges information gaps, effectively adapting to the evolving information needs. Empirical results show that KiRAG significantly outperforms existing iRAG models, with an average improvement of 9.40% in R@3 and 5.14% in F1 on multi-hop QA.
Yanhao Jia, Xinyi Wu, Hao Li, Qinglin Zhang, Yuxiao Hu, Shuai Zhao, Wenqi Fan
In AI-facilitated teaching, leveraging various query styles to interpret abstract text descriptions is crucial for ensuring high-quality teaching. However, current retrieval models primarily focus on natural text-image retrieval, making them insufficiently tailored to educational scenarios due to the ambiguities in the retrieval process. In this paper, we propose a diverse expression retrieval task tailored to educational scenarios, supporting retrieval based on multiple query styles and expressions. We introduce the STEM Education Retrieval Dataset (SER), which contains over 24,000 query pairs of different styles, and the Uni-Retrieval, an efficient and style-diversified retrieval vision-language model based on prompt tuning. Uni-Retrieval extracts query style features as prototypes and builds a continuously updated Prompt Bank containing prompt tokens for diverse queries. This bank can updated during test time to represent domain-specific knowledge for different subject retrieval scenarios. Our framework demonstrates scalability and robustness by dynamically retrieving prompt tokens based on prototype similarity, effectively facilitating learning for unknown queries. Experimental results indicate that Uni-Retrieval outperforms existing retrieval models in most retrieval tasks. This advancement provides a scalable and precise solution for diverse educational needs.
Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik
Large Language Models (LLMs) have revolutionized artificial intelligence with
capabilities in reasoning, coding, and communication, driving innovation across
industries. Their true potential depends on effective alignment to ensure
correct, trustworthy and ethical behavior, addressing challenges like
misinformation, hallucinations, bias and misuse. While existing Reinforcement
Learning (RL)-based alignment methods are notoriously complex, direct
optimization approaches offer a simpler alternative. In this work, we introduce
a novel direct optimization approach for LLM alignment by drawing on
established Information Retrieval (IR) principles. We present a systematic
framework that bridges LLM alignment and IR methodologies, mapping LLM
generation and reward models to IR's retriever-reranker paradigm. Building on
this foundation, we propose LLM Alignment as Retriever Preference Optimization
(LarPO), a new alignment method that enhances overall alignment quality.
Extensive experiments validate LarPO's effectiveness with 38.9 % and 13.7 %
averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work
opens new avenues for advancing LLM alignment by integrating IR foundations,
offering a promising direction for future research.
Authors' comments: 26 pages
Shi-Qi Yan, Zhen-Hua Ling
While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce the Retrieval Preference Optimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, solving the problem that previous methods necessitate the additional procedure to assess the retrieval quality. Notably, RPO is the only RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization.
Arshia Hemmat, Kianoosh Vadaei, Mohammad Hassan Heydari, Afsaneh Fatemi
This paper introduces an innovative approach using Retrieval-Augmented
Generation (RAG) pipelines with Large Language Models (LLMs) to enhance
information retrieval and query response systems for university-related
question answering. By systematically extracting data from the university
official webpage and employing advanced prompt engineering techniques, we
generate accurate, contextually relevant responses to user queries.
We developed a comprehensive university benchmark, UniversityQuestionBench
(UQB), to rigorously evaluate our system performance, based on common key
metrics in the filed of RAG pipelines, assessing accuracy and reliability
through various metrics and real-world scenarios. Our experimental results
demonstrate significant improvements in the precision and relevance of
generated responses, enhancing user experience and reducing the time required
to obtain relevant answers. In summary, this paper presents a novel application
of RAG pipelines and LLMs, supported by a meticulously prepared university
benchmark, offering valuable insights into advanced AI techniques for academic
data retrieval and setting the stage for future research in this domain.
Authors' comments: 6 pages, 2 figures, 1 table, Submitted to 15th IKT conference
Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Chanyoung Park
While inorganic retrosynthesis planning is essential in the field of chemical
science, the application of machine learning in this area has been notably less
explored compared to organic retrosynthesis planning. In this paper, we propose
Retrieval-Retro for inorganic retrosynthesis planning, which implicitly
extracts the precursor information of reference materials that are retrieved
from the knowledge base regarding domain expertise in the field. Specifically,
instead of directly employing the precursor information of reference materials,
we propose implicitly extracting it with various attention layers, which
enables the model to learn novel synthesis recipes more effectively. Moreover,
during retrieval, we consider the thermodynamic relationship between target
material and precursors, which is essential domain expertise in identifying the
most probable precursor set among various options. Extensive experiments
demonstrate the superiority of Retrieval-Retro in retrosynthesis planning,
especially in discovering novel synthesis recipes, which is crucial for
materials discovery. The source code for Retrieval-Retro is available at
https://github.com/HeewoongNoh/Retrieval-Retro.
Authors' comments: NeurIPS 2024
Atula Tejaswi, Yoonsang Lee, Sujay Sanghavi, Eunsol Choi
We investigate whether in-context examples, widely used in decoder-only language models (LLMs), can improve embedding model performance in retrieval tasks. Unlike in LLMs, naively prepending in-context examples (query-document pairs) to the target query at inference time does not work out of the box. We introduce a simple approach to enable retrievers to use in-context examples. Our approach, RARe, finetunes a pre-trained model with in-context examples whose query is semantically similar to the target query. This can be applied to adapt various base architectures (i.e., decoder-only language models, retriever models) and consistently achieves performance gains of up to +2.72% nDCG across various open-domain retrieval datasets (BeIR, RAR-b). In particular, we find RARe exhibits stronger out-of-domain generalization compared to models using queries without in-context examples, similar to what is seen for in-context learning in LLMs. We further provide analysis on the design choices of in-context example augmentation and lay the foundation for future work in this space.
Ruobing Wang, Daren Zha, Shi Yu, Qingfei Zhao, Yuxuan Chen, Yixuan Wang, Shuo Wang, Yukun Yan et al.
Retrieval-Augmented Generation (RAG) mitigates issues of the factual errors
and hallucinated outputs generated by Large Language Models (LLMs) in
open-domain question-answering tasks (OpenQA) via introducing external
knowledge. For complex QA, however, existing RAG methods use LLMs to actively
predict retrieval timing and directly use the retrieved information for
generation, regardless of whether the retrieval timing accurately reflects the
actual information needs, or sufficiently considers prior retrieved knowledge,
which may result in insufficient information gathering and interaction,
yielding low-quality answers. To address these, we propose a generic RAG
approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA
tasks, which includes the iterative information collector, adaptive memory
reviewer, and task-oriented generator, while following a new
Retriever-and-Memory paradigm. Specifically, Adaptive-Note introduces an
overarching view of knowledge growth, iteratively gathering new information in
the form of notes and updating them into the existing optimal knowledge
structure, enhancing high-quality knowledge interactions. In addition, we
employ an adaptive, note-based stop-exploration strategy to decide "what to
retrieve and when to stop" to encourage sufficient knowledge exploration. We
conduct extensive experiments on five complex QA datasets, and the results
demonstrate the superiority and effectiveness of our method and its components.
The code and data are at https://github.com/thunlp/Adaptive-Note.
Authors' comments: 15 pages, 2 figures
Junpeng Yue, Xinrun Xu, Börje F. Karlsson, Zongqing Lu
MLLM agents demonstrate potential for complex embodied tasks by retrieving
multimodal task-relevant trajectory data. However, current retrieval methods
primarily focus on surface-level similarities of textual or visual cues in
trajectories, neglecting their effectiveness for the specific task at hand. To
address this issue, we propose a novel method, MLLM As ReTriever (MART), which
enhances the performance of embodied agents by utilizing interaction data to
fine-tune an MLLM retriever based on preference learning, such that the
retriever fully considers the effectiveness of trajectories and prioritizes
them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism
that leverages MLLMs' summarization capabilities to represent trajectories with
fewer tokens while preserving key information, enabling agents to better
comprehend milestones in the trajectory. Experimental results across various
environments demonstrate our method significantly improves task success rates
in unseen scenes compared to baseline methods. This work presents a new
paradigm for multimodal retrieval in embodied agents, by fine-tuning a
general-purpose MLLM as the retriever to assess trajectory effectiveness. All
the code for benchmark tasks, simulator modifications, and the MLLM retriever
is available at https://github.com/PKU-RL/MART.
Authors' comments: ICLR 2025
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang et al.
Transformer-based Large Language Models (LLMs) have become increasingly
important. However, due to the quadratic time complexity of attention
computation, scaling LLMs to longer contexts incurs extremely slow inference
latency and high GPU memory consumption for caching key-value (KV) vectors.
This paper proposes RetrievalAttention, a training-free approach to both
accelerate attention computation and reduce GPU memory consumption. By
leveraging the dynamic sparsity of attention mechanism, RetrievalAttention
proposes to use approximate nearest neighbor search (ANNS) indexes for KV
vectors in CPU memory and retrieves the most relevant ones with vector search
during generation. Unfortunately, we observe that the off-the-shelf ANNS
indexes are often ineffective for such retrieval tasks due to the
out-of-distribution (OOD) between query vectors and key vectors in attention
mechanism. RetrievalAttention addresses the OOD challenge by designing an
attention-aware vector search algorithm that can adapt to the distribution of
query vectors. Our evaluation shows that RetrievalAttention only needs to
access 1--3% of data while maintaining high model accuracy. This leads to
significant reduction in the inference cost of long-context LLMs with much
lower GPU memory footprint. In particular, RetrievalAttention only needs a
single NVIDIA RTX4090 (24GB) for serving 128K tokens in LLMs with 8B
parameters, which is capable of generating one token in 0.188 seconds.
Authors' comments: 16 pages
Hanqi Zhang, Chong Chen, Lang Mei, Qi Liu, Jiaxin Mao
In the information retrieval (IR) area, dense retrieval (DR) models use deep learning techniques to encode queries and passages into embedding space to compute their semantic relations. It is important for DR models to balance both efficiency and effectiveness. Pre-trained language models (PLMs), especially Transformer-based PLMs, have been proven to be effective encoders of DR models. However, the self-attention component in Transformer-based PLM results in a computational complexity that grows quadratically with sequence length, and thus exhibits a slow inference speed for long-text retrieval. Some recently proposed non-Transformer PLMs, especially the Mamba architecture PLMs, have demonstrated not only comparable effectiveness to Transformer-based PLMs on generative language tasks but also better efficiency due to linear time scaling in sequence length. This paper implements the Mamba Retriever to explore whether Mamba can serve as an effective and efficient encoder of DR model for IR tasks. We fine-tune the Mamba Retriever on the classic short-text MS MARCO passage ranking dataset and the long-text LoCoV0 dataset. Experimental results show that (1) on the MS MARCO passage ranking dataset and BEIR, the Mamba Retriever achieves comparable or better effectiveness compared to Transformer-based retrieval models, and the effectiveness grows with the size of the Mamba model; (2) on the long-text LoCoV0 dataset, the Mamba Retriever can extend to longer text length than its pre-trained length after fine-tuning on retrieval task, and it has comparable or better effectiveness compared to other long-text retrieval models; (3) the Mamba Retriever has superior inference speed for long-text retrieval. In conclusion, Mamba Retriever is both effective and efficient, making it a practical model, especially for long-text retrieval.
Hanzhen Lu, Zhongxin Liu
Code comment generation aims to generate high-quality comments from source code automatically and has been studied for years. Recent studies proposed to integrate information retrieval techniques with neural generation models to tackle this problem, i.e., Retrieval-Augmented Comment Generation (RACG) approaches, and achieved state-of-the-art results. However, the retrievers in previous work are built independently of their generators. This results in that the retrieved exemplars are not necessarily the most useful ones for generating comments, limiting the performance of existing approaches. To address this limitation, we propose a novel training strategy to enable the retriever to learn from the feedback of the generator and retrieve exemplars for generation. Specifically, during training, we use the retriever to retrieve the top-k exemplars and calculate their retrieval scores, and use the generator to calculate a generation loss for the sample based on each exemplar. By aligning high-score exemplars retrieved by the retriever with low-loss exemplars observed by the generator, the retriever can learn to retrieve exemplars that can best improve the quality of the generated comments. Based on this strategy, we propose a novel RACG approach named JOINTCOM and evaluate it on two real-world datasets, JCSD and PCSD. The experimental results demonstrate that our approach surpasses the state-of-the-art baselines by 7.3% to 30.0% in terms of five metrics on the two datasets. We also conduct a human evaluation to compare JOINTCOM with the best-performing baselines. The results indicate that JOINTCOM outperforms the baselines, producing comments that are more natural, informative, and useful.
Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Weichuan Liu, Lei Hou, Juanzi Li
This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts self-aware uncertainty of LLMs from their internal states. SeaKR activates retrieval when the LLMs present high self-aware uncertainty for generation. To effectively integrate retrieved knowledge snippets, SeaKR re-ranks them based on LLM's self-aware uncertainty to preserve the snippet that reduces their uncertainty to the utmost. To facilitate solving complex tasks that require multiple retrievals, SeaKR utilizes their self-aware uncertainty to choose among different reasoning strategies. Our experiments on both complex and simple Question Answering datasets show that SeaKR outperforms existing adaptive RAG methods. We release our code at https://github.com/THU-KEG/SeaKR.
Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen
Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs are burdened with the task of distinguishing these documents using their inherent knowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill this gap by incorporating Retrieval information into Retrieval Augmented Generation. Specifically, R$^2$AG utilizes the nuanced features from the retrievers and employs a R$^2$-Former to capture retrieval information. Then, a retrieval-aware prompting strategy is designed to integrate retrieval information into LLMs' generation. Notably, R$^2$AG suits low-source scenarios where LLMs and retrievers are frozen. Extensive experiments across five datasets validate the effectiveness, robustness, and efficiency of R$^2$AG. Our analysis reveals that retrieval information serves as an anchor to aid LLMs in the generation process, thereby filling the semantic gap.
Wenyan Li, Jiaang Li, Rita Ramos, Raphael Tang, Desmond Elliott
Recent advances in retrieval-augmented models for image captioning highlight
the benefit of retrieving related captions for efficient, lightweight models
with strong domain-transfer capabilities. While these models demonstrate the
success of retrieval augmentation, retrieval models are still far from perfect
in practice: the retrieved information can sometimes mislead the model,
resulting in incorrect generation and worse performance. In this paper, we
analyze the robustness of a retrieval-augmented captioning model SmallCap. Our
analysis shows that the model is sensitive to tokens that appear in the
majority of the retrieved captions, and the input attribution shows that those
tokens are likely copied into the generated output. Given these findings, we
propose to train the model by sampling retrieved captions from more diverse
sets. This decreases the chance that the model learns to copy majority tokens,
and improves both in-domain and cross-domain performance.
Authors' comments: 9 pages, long paper at ACL 2024
Tiziano Labruna, Jon Ander Campos, Gorka Azkune
In this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM's parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token, <RET>, when they do not know the answer to a question. Our evaluation of the Adaptive Retrieval LLM (Adapt-LLM) on the PopQA dataset showcases improvements over the same LLM under three configurations: (i) retrieving information for all the questions, (ii) using always the parametric memory of the LLM, and (iii) using a popularity threshold to decide when to use a retriever. Through our analysis, we demonstrate that Adapt-LLM is able to generate the <RET> token when it determines that it does not know how to answer a question, indicating the need for IR, while it achieves notably high accuracy levels when it chooses to rely only on its parametric memory.
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, Dong Yu
Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.
Piotr Rybak, Maciej Ogrodniczuk
Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present Silver Retriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. Silver Retriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.