Yu Xia, Junda Wu, Sungchul Kim, Tong Yu, Ryan A. Rossi, Haoliang Wang, Julian McAuley
Large language models (LLMs) have been used to generate query expansions
augmenting original queries for improving information search. Recent studies
also explore providing LLMs with initial retrieval results to generate query
expansions better grounded in the document corpus. However, these methods mostly
focus on enhancing textual similarities between search queries and target
documents, overlooking document relations. For queries like "Find me a highly
rated camera for wildlife photography compatible with my Nikon F-Mount lenses",
existing methods may generate expansions that are semantically similar but
structurally unrelated to user intents. To handle such semi-structured queries
with both textual and relational requirements, in this paper we propose a
knowledge-aware query expansion framework, augmenting LLMs with structured
document relations from a knowledge graph (KG). To further address the limitation
of entity-based scoring in existing KG-based methods, we leverage document
texts as rich KG node representations and use document-based relation filtering
for our Knowledge-Aware Retrieval (KAR). Extensive experiments on three
datasets of diverse domains show the advantages of our method compared against
state-of-the-art baselines on textual and relational semi-structured retrieval.
Authors' comments: NAACL 2025
Ingeol Baek, Hwan Chang, Byeongjeong Kim, Jimin Lee, Hwanhee Lee
Retrieval-Augmented Generation (RAG) enhances language models by retrieving
and incorporating relevant external knowledge. However, traditional
retrieve-and-generate processes may not be optimized for real-world scenarios,
where queries might require multiple retrieval steps or none at all. In this
paper, we propose Probing-RAG, which utilizes the hidden state
representations from the intermediate layers of language models to adaptively
determine the necessity of additional retrievals for a given query. By
employing a pre-trained prober, Probing-RAG effectively captures the model's
internal cognition, enabling reliable decision-making about retrieving external
documents. Experimental results across five open-domain QA datasets demonstrate
that Probing-RAG outperforms previous methods while reducing the number of
redundant retrieval steps.
Authors' comments: NAACL 2025 Findings
Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, Amin Ahmad
Traditional retrieval-augmented generation (RAG) benchmarks evaluate systems
using heuristic-based metrics, but these require human preferences as the
ground truth for reference. In contrast, arena-based benchmarks, where systems
compete against each other, require an expensive large language model (LLM) as
a judge for a reliable evaluation. We present a simple, efficient technique to
combine the best of both worlds. The idea is to train a surrogate judge using
heuristic metrics as input, to output the LLM as a judge prediction. In our
work, we develop MIRAGE-Bench, a synthetic arena-based RAG benchmark for 18
diverse languages on Wikipedia focused on multilingual answer generation
evaluation. It extensively couples both heuristic features and LLM as a judge
for evaluation. We benchmark 19 multilingual LLMs and observe a high
correlation (Kendall Tau ($\tau$) = 0.909) between our surrogate judge and
GPT-4o as a teacher, using the Bradley-Terry framework. Our results show
proprietary and large open-source LLMs currently dominate on MIRAGE-Bench. Our
code and datasets are made publicly available here:
https://github.com/vectara/mirage-bench.
Authors' comments: Accepted at NAACL 2025 (Main Conference)
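The abstract above names two standard tools: the Bradley-Terry model for turning pairwise comparisons into a ranking, and Kendall's tau for comparing two rankings. The following is a minimal sketch of both, not the MIRAGE-Bench implementation. The function names, the iterative update scheme, and the toy win counts are assumptions made for illustration.

```python
def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times system i beat system j.
    Uses the classic iterative (minorization-maximization) update.
    Assumption: every system has at least one win, so no strength hits zero.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]  # keep strengths summing to 1
    return p


def kendall_tau(a, b):
    """Kendall's tau-a between two equally long score lists (ties count as neither)."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy example: three hypothetical systems A, B, C.
toy_wins = [
    [0, 8, 9],  # A beat B 8 times and C 9 times
    [2, 0, 7],  # B beat A 2 times and C 7 times
    [1, 3, 0],  # C beat A once and B 3 times
]
strengths = bradley_terry(toy_wins)
```

In this sketch, a surrogate judge and a reference judge would each yield one strength vector, and `kendall_tau` on the two vectors gives a single agreement number.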
Yuki Hou, Haruki Tamoto, Qinghua Zhao, Homei Miyashita
Existing retrieval methods in Large Language Models show degradation in
accuracy when handling temporally distributed conversations, primarily due to
their reliance on simple similarity-based retrieval. Unlike existing memory
retrieval methods that rely solely on semantic similarity, we propose
SynapticRAG, which uniquely combines temporal association triggers with
biologically-inspired synaptic propagation mechanisms. Our approach uses
temporal association triggers and synaptic-like stimulus propagation to
identify relevant dialogue histories. A dynamic leaky integrate-and-fire
mechanism then selects the most contextually appropriate memories. Experiments
on four datasets of English, Chinese and Japanese show that compared to
state-of-the-art memory retrieval methods, SynapticRAG achieves consistent
improvements across multiple metrics of up to 14.66 percentage points. This work bridges the
gap between cognitive science and language model development, providing a new
framework for memory management in conversational systems.
Authors' comments: Accepted to ACL 2025 Findings
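For readers unfamiliar with the leaky integrate-and-fire idea mentioned above, the following is a minimal discrete-time sketch of the textbook mechanism. It is not SynapticRAG's dynamic variant: the fixed `decay` and `threshold` values, the reset-to-zero rule, and the function name are all assumptions made for illustration.

```python
def leaky_integrate_and_fire(inputs, decay=0.9, threshold=1.0):
    """Return the time steps at which a single LIF unit fires.

    At each step the stored potential leaks (is multiplied by `decay`),
    the new stimulus is added, and the unit fires and resets to zero
    once the potential reaches `threshold`.
    """
    potential = 0.0
    fired_at = []
    for step, stimulus in enumerate(inputs):
        potential = potential * decay + stimulus
        if potential >= threshold:
            fired_at.append(step)
            potential = 0.0  # reset after firing (an assumed reset rule)
    return fired_at


# Two stimuli close together in time push the unit over the threshold;
# an isolated stimulus of the same size does not.
spikes = leaky_integrate_and_fire([0.6, 0.6, 0.0, 0.0, 0.6])
```

In a memory-retrieval setting, one could imagine each stored dialogue turn as such a unit, with stimuli arriving from related turns. That mapping is only one possible reading of the abstract.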
Ambuje Gupta, Mrinal Rawat, Andreas Stolcke, Roberto Pieraccini
Retrieval augmented generation (RAG) pipelines are commonly used in tasks
such as question-answering (QA), relying on retrieving relevant documents from
a vector store computed using a pretrained embedding model. However, if the
retrieved context is inaccurate, the answers generated using the large language
model (LLM) may contain errors or hallucinations. Although pretrained embedding
models have advanced, adapting them to new domains remains challenging.
Fine-tuning is a potential solution, but industry settings often lack the
necessary fine-tuning data. To address these challenges, we propose REFINE, a
novel technique that generates synthetic data from available documents and then
uses a model fusion approach to fine-tune embeddings for improved retrieval
performance in new domains, while preserving out-of-domain capability. We
conducted experiments on two public datasets, SQUAD and RAG-12000, and a
proprietary TOURISM dataset. Results demonstrate that even the standard
fine-tuning with the proposed data augmentation technique outperforms the
vanilla pretrained model. Furthermore, when combined with model fusion, the
proposed approach achieves superior performance, with a 5.76% improvement in
recall on the TOURISM dataset, and 6.58% and 0.32% enhancement on SQUAD and
RAG-12000, respectively.
Authors' comments: Accepted in AJCAI'24
Jintao Liu, Ruixue Ding, Linhao Zhang, Pengjun Xie, Fie Huang
Retrieval-Augmented Generation (RAG) aims to enhance large language models (LLMs) to generate more accurate and reliable answers with the help of the retrieved context from external knowledge sources, thereby reducing the incidence of hallucinations. Despite the advancements, evaluating these systems remains a crucial research area due to the following issues: (1) Limited data diversity: The insufficient diversity of knowledge sources and query types constrains the applicability of RAG systems; (2) Obscure problems location: Existing evaluation methods have difficulty in locating the stage of the RAG pipeline where problems occur; (3) Unstable retrieval evaluation: These methods often fail to effectively assess retrieval performance, particularly when the chunking strategy changes. To tackle these challenges, we propose a Comprehensive Full-chain Evaluation (CoFE-RAG) framework to facilitate thorough evaluation across the entire RAG pipeline, including chunking, retrieval, reranking, and generation. To effectively evaluate the first three phases, we introduce multi-granularity keywords, including coarse-grained and fine-grained keywords, to assess the retrieved context instead of relying on the annotation of golden chunks. Moreover, we release a holistic benchmark dataset tailored for diverse data scenarios covering a wide range of document formats and query types. We demonstrate the utility of the CoFE-RAG framework by conducting experiments to evaluate each stage of RAG systems. Our evaluation method provides unique insights into the effectiveness of RAG systems in handling diverse data scenarios, offering a more nuanced understanding of their capabilities and limitations.
Hai-Long Nguyen, Tan-Minh Nguyen, Duc-Minh Nguyen, Thi-Hai-Yen Vuong, Ha-Thanh Nguyen, Xuan-Hieu Phan
Statutory law retrieval is a typical problem in legal language processing
that has various practical applications in law engineering. Modern deep
learning-based retrieval methods have achieved significant results for this
problem. However, retrieval systems relying on semantic and lexical
correlations often exhibit limitations, particularly when handling queries that
involve real-life scenarios, or use vocabulary that is not specific to the
legal domain. In this work, we focus on overcoming these weaknesses by utilizing
the logical reasoning capabilities of large language models (LLMs) to identify
relevant legal terms and facts related to the situation mentioned in the query.
The proposed retrieval system integrates additional information from the
term-based expansion and query reformulation to improve the retrieval
accuracy. The experiments on COLIEE 2022 and COLIEE 2023 datasets show that
extra knowledge from LLMs helps to improve the retrieval result of both lexical
and semantic ranking models. The final ensemble retrieval system outperformed
the highest results among all participating teams in the COLIEE 2022 and 2023
competitions.
Authors' comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)
Wachara Fungwacharakorn, Nguyen Ha Thanh, May Myo Zin, Ken Satoh
This paper presents a novel approach termed Layer-of-Thoughts Prompting
(LoT), which utilizes constraint hierarchies to filter and refine candidate
responses to a given query. By integrating these constraints, our method
enables a structured retrieval process that enhances explainability and
automation. Existing methods have explored various prompting techniques but
often present overly generalized frameworks without delving into the nuances of
prompts in multi-turn interactions. Our work addresses this gap by focusing on
the hierarchical relationships among prompts. We demonstrate that the efficacy
of thought hierarchy plays a critical role in developing efficient and
interpretable retrieval algorithms. Leveraging Large Language Models (LLMs),
LoT significantly improves the accuracy and comprehensibility of information
retrieval tasks.
Authors' comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)
Jiatao Li, Xinyu Hu, Xunjian Yin, Xiaojun Wan
The integration of documents generated by LLMs themselves (Self-Docs)
alongside retrieved documents has emerged as a promising strategy for
retrieval-augmented generation systems. However, previous research primarily
focuses on optimizing the use of Self-Docs, with their inherent properties
remaining underexplored. To bridge this gap, we first investigate the overall
effectiveness of Self-Docs, identifying key factors that shape their
contribution to RAG performance (RQ1). Building on these insights, we develop a
taxonomy grounded in Systemic Functional Linguistics to compare the influence
of various Self-Docs categories (RQ2) and explore strategies for combining them
with external sources (RQ3). Our findings reveal which types of Self-Docs are
most beneficial and offer practical guidelines for leveraging them to achieve
significant improvements in knowledge-intensive question answering tasks.
Authors' comments: Accepted by NAACL 2025 (Findings). (Long Paper)
Ashwin Ram, Yigit Ege Bayiz, Arash Amini, Mustafa Munir, Radu Marculescu
Fake news threatens democracy and exacerbates the polarization and divisions in society; therefore, accurately detecting online misinformation is the foundation of addressing this issue. We present CrediRAG, the first fake news detection model that combines language models with access to a rich external political knowledge base with a dense social network to detect fake news across social media at scale. CrediRAG uses a news retriever to initially assign a misinformation score to each post based on the source credibility of news articles similar to the post's title content. CrediRAG then improves the initial retrieval estimations through a novel weighted post-to-post network connected based on shared commenters and weighted by the average stance of all shared commenters across every pair of posts. We achieve an 11% increase in the F1-score in detecting misinformative posts over state-of-the-art methods. Extensive experiments conducted on curated real-world Reddit data of over 200,000 posts demonstrate the superior performance of CrediRAG over existing baselines. Thus, our approach offers a more accurate and scalable solution to combat the spread of fake news across social media platforms.
Thaina Saraiva, Marco Sousa, Pedro Vieira, António Rodrigues
This paper proposes a Question-Answering (QA) system for the telecom domain using 3rd Generation Partnership Project (3GPP) technical documents. Alongside it, we present Telco-DPR, a hybrid dataset consisting of a curated 3GPP corpus that combines text and tables. Additionally, the dataset includes a set of synthetic question/answer pairs designed to evaluate the retrieval performance of QA systems on this type of data. The retrieval models, including the sparse model Best Matching 25 (BM25) as well as dense models such as Dense Passage Retriever (DPR) and Dense Hierarchical Retrieval (DHR), are evaluated and compared using top-K accuracy and Mean Reciprocal Rank (MRR). The results show that DHR, a retriever model utilising hierarchical passage selection through fine-tuning at both the document and passage levels, outperforms traditional methods in retrieving relevant technical information, achieving a Top-10 accuracy of 86.2%. Additionally, the Retrieval-Augmented Generation (RAG) technique, used in the proposed QA system, is evaluated to demonstrate the benefits of using the hybrid dataset and the DHR. The proposed QA system, using the developed RAG model and the Generative Pretrained Transformer (GPT)-4, achieves a 14% improvement in answer accuracy when compared to a previous benchmark on the same dataset.
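The two retrieval metrics named above, top-K accuracy and Mean Reciprocal Rank, can be sketched as follows. This assumes one gold passage per question; the function names and toy data are hypothetical and not taken from Telco-DPR.

```python
def top_k_accuracy(ranked_lists, gold_ids, k):
    """Fraction of queries whose gold passage appears in the top k results."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


def mean_reciprocal_rank(ranked_lists, gold_ids):
    """Average of 1/rank of the gold passage; a query with no hit contributes 0."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)


# Toy example: three queries, each with a ranked list of passage ids.
toy_rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
toy_gold = ["a", "z", "q"]  # gold found at rank 1, at rank 3, and not at all

acc_at_1 = top_k_accuracy(toy_rankings, toy_gold, k=1)
acc_at_3 = top_k_accuracy(toy_rankings, toy_gold, k=3)
mrr = mean_reciprocal_rank(toy_rankings, toy_gold)
```

Top-K accuracy only asks whether the gold passage is somewhere in the first K results, while MRR also rewards placing it higher.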
Yen-Hsiang Wang, Feng-Dian Su, Tzu-Yu Yeh, Yao-Chung Fan
This paper introduces a cross-lingual statutory article retrieval (SAR) dataset designed to enhance legal information retrieval in multilingual settings. Our dataset features spoken-language-style legal inquiries in English, paired with corresponding Chinese versions and relevant statutes, covering all Taiwanese civil, criminal, and administrative laws. This dataset aims to improve access to legal information for non-native speakers, particularly for foreign nationals in Taiwan. We propose several LLM-based methods as baselines for evaluating retrieval effectiveness, focusing on mitigating translation errors and improving cross-lingual retrieval performance. Our work provides a valuable resource for developing inclusive legal information retrieval systems.
Jiawei Lu, Haoye Wang, Zhongxin Liu, Keyu Liang, Lingfeng Bao, Xiaohu Yang
Recent studies have proposed leveraging large language models (LLMs) with
In-Context Learning (ICL) to handle code intelligence tasks without
fine-tuning. ICL employs task instructions and a set of examples as
demonstrations to guide the model in generating accurate answers without
updating its parameters. While ICL has proven effective for code intelligence
tasks, its performance heavily relies on the selected examples. Previous work
has achieved some success in using BM25 to retrieve examples for code
intelligence tasks. However, existing approaches lack the ability to understand
the semantic and structural information of queries, resulting in less helpful
demonstrations. Moreover, they do not adapt well to the complex and dynamic
nature of user queries in diverse domains. In this paper, we introduce a novel
approach named Instructive Code Retriever (ICR), which is designed to retrieve
examples that enhance model inference across various code intelligence tasks
and datasets. We enable ICR to learn the semantic and structural information of
the corpus by a tree-based loss function. To better understand the correlation
between queries and examples, we incorporate the feedback from LLMs to guide
the training of the retriever. Experimental results demonstrate that our
retriever significantly outperforms state-of-the-art approaches. We evaluate
our model's effectiveness on various tasks, i.e., code summarization, program
synthesis, and bug fixing. Compared to previous state-of-the-art algorithms,
our method achieved improvements of 50.0% and 90.0% in terms of BLEU-4 for two
code summarization datasets, 74.6% CodeBLEU on a program synthesis dataset, and
increases of 3.6 and 3.2 BLEU-4 on two bug fixing datasets.
Authors' comments: to appear at the 39th IEEE/ACM International Conference on Automated
Software Engineering (ASE 2024)
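The BM25 baseline mentioned above, used to pick in-context demonstrations, can be sketched as follows. This is the standard Okapi BM25 scoring formula with a common smoothed IDF, not the ICR retriever itself. The whitespace tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the toy example pool are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical pool of candidate demonstrations, keyed by their task descriptions.
pool = [
    "sort a list in python".split(),
    "read a file line by line".split(),
    "reverse a string".split(),
]
scores = bm25_scores("sort list".split(), pool)
best_index = max(range(len(pool)), key=lambda i: scores[i])
```

Because BM25 matches surface tokens only, a query phrased with different words than the best demonstration would score zero against it. This is the kind of limitation the abstract attributes to existing approaches.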
Zhongwu Chen, Chengjin Xu, Dingmin Wang, Zhen Huang, Yong Dou, Xuhui Jiang, Jian Guo
Retrieval-augmented generation (RAG) has shown promising potential in knowledge-intensive question answering (QA). However, existing approaches only consider the query itself, neither specifying the retrieval preferences for the retrievers nor informing the generators of how to refer to the retrieved documents for the answers, which poses a significant challenge to the QA performance. To address these issues, we propose Rule-guided Retrieval-Augmented Generation with LMs, which explicitly introduces rules for in-context learning (RuleRAG-ICL) to guide retrievers to recall related documents in the directions of rules and uniformly guide generators to reason attributed by the same rules. Moreover, most existing RAG datasets were constructed without considering rules, and Knowledge Graphs (KGs) are recognized as providing high-quality rules. Therefore, we construct five rule-aware RAG benchmarks for QA, RuleQA, based on KGs to stress the significance of retrieval and reasoning with rules. Experiments on RuleQA demonstrate that RuleRAG-ICL improves retrieval quality by +89.2% in Recall@10 and answer accuracy by +103.1% in Exact Match, and RuleRAG-FT yields further enhancement. In addition, experiments on four existing RAG datasets show RuleRAG is also effective by offering rules in RuleQA to them, further proving the generalization of rule guidance in RuleRAG.
Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco et al.
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce $\textbf{MultiVENT 2.0}$, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation.
Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han
Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks (referred to as Streaming Heads), do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both the LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is provided at https://github.com/mit-han-lab/duo-attention.
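The constant-length cache described for streaming heads can be pictured with a small sketch that keeps only the first few tokens (the attention sinks) plus a sliding window of recent tokens. This illustrates the general idea under stated assumptions and is not the DuoAttention code: the class name, the `num_sink` and `num_recent` parameters, and the use of plain Python objects in place of key/value tensors are all hypothetical.

```python
from collections import deque


class StreamingKVCache:
    """Keeps the first `num_sink` entries and the last `num_recent` entries."""

    def __init__(self, num_sink, num_recent):
        self.num_sink = num_sink
        self.sink = []  # earliest tokens, never evicted
        self.recent = deque(maxlen=num_recent)  # sliding window, oldest dropped

    def append(self, kv_entry):
        """Add one token's key/value entry, evicting the oldest recent one if full."""
        if len(self.sink) < self.num_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def contents(self):
        """Entries a streaming head would attend over at the current step."""
        return self.sink + list(self.recent)


# Toy example: integers stand in for per-token key/value tensors.
cache = StreamingKVCache(num_sink=2, num_recent=3)
for token_position in range(10):
    cache.append(token_position)
kept = cache.contents()
```

The cache size is bounded by `num_sink + num_recent` regardless of sequence length, whereas a retrieval head in this scheme would keep every entry.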
Yongxin Xu, Ruizhe Zhang, Xinke Jiang, Yujie Feng, Yuzhen Xiao, Xinyu Ma, Runchuan Zhu, Xu Chu et al.
Retrieval-Augmented Generation (RAG) offers an effective solution to the issues faced by Large Language Models (LLMs) in hallucination generation and knowledge obsolescence by incorporating externally retrieved knowledge. However, due to potential conflicts between internal and external knowledge, as well as retrieval noise, LLMs often struggle to effectively integrate external evidence, leading to a decline in performance. Although existing methods attempt to tackle these challenges, they often struggle to strike a balance between model adherence and robustness, resulting in significant learning variance. Inspired by human cognitive processes, we propose Parenting, a novel framework that decouples adherence and robustness within the parameter space of LLMs. Specifically, Parenting utilizes a key parameter mining method based on forward activation gain to identify and isolate the crucial parameter units that are strongly linked to adherence and robustness. Then, Parenting employs a type-guided tailored tuning strategy, applying specific and appropriate fine-tuning methods to parameter units representing different capabilities, aiming to achieve a balanced enhancement of adherence and robustness. Extensive experiments on various datasets and models validate the effectiveness and generalizability of our methods.
Choi Changin, Lim Sungjun, Rhee Wonjong
Recent advances in audio understanding tasks leverage the reasoning capabilities of LLMs. However, adapting LLMs to learn audio concepts requires massive training data and substantial computational resources. To address these challenges, Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base (KB) and augments them with query audio to generate accurate textual responses. In RAG, the relevance of the retrieved information plays a crucial role in effectively processing the input. In this paper, we analyze how different retrieval methods and knowledge bases impact the relevance of audio-text pairs and the performance of audio captioning with RAG. We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find relevant audio-text pairs to the query audio, thereby improving the relevance and accuracy of retrieved information. Additionally, we refine the large-scale knowledge base to retain only audio-text pairs that align with the contextualized intents. Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD, with detailed ablation studies validating the effectiveness of our retrieval and KB construction methods.
Haozhen Zhang, Tao Feng, Jiaxuan You
Retrieval-augmented generation (RAG) has revitalized Large Language Models
(LLMs) by injecting non-parametric factual knowledge. Compared with
long-context LLMs, RAG is considered an effective summarization tool in a more
concise and lightweight manner, which can interact with LLMs multiple times
using diverse queries to get comprehensive responses. However, the
LLM-generated historical responses, which contain potentially insightful
information, are largely neglected and discarded by existing approaches,
leading to suboptimal results. In this paper, we propose $\textit{graph of
records}$ ($\textbf{GoR}$), which leverages historical responses generated by
LLMs to enhance RAG for long-context global summarization. Inspired by the
$\textit{retrieve-then-generate}$ paradigm of RAG, we construct a graph by
establishing an edge between the retrieved text chunks and the corresponding
LLM-generated response. To further uncover the intricate correlations between
them, GoR features a $\textit{graph neural network}$ and an elaborately
designed $\textit{BERTScore}$-based objective for self-supervised model
training, enabling seamless supervision signal backpropagation between
reference summaries and node embeddings. We comprehensively compare GoR with 12
baselines across four long-context summarization datasets, and the results
indicate that our proposed method reaches the best performance
($\textit{e.g.}$, 15%, 8%, and 19% improvement over retrievers w.r.t. Rouge-L,
Rouge-1, and Rouge-2 on the WCEP dataset). Extensive experiments further
demonstrate the effectiveness of GoR.
Authors' comments: Accepted by ACL 2025 Main. The code is available at
https://github.com/ulab-uiuc/GoR
Saikrishna Sanniboina, Shiv Trivedi, Sreenidhi Vijayaraghavan
Retrieval-based question answering systems often suffer from positional bias, leading to suboptimal answer generation. We propose LoRE (Logit-Ranked Retriever Ensemble), a novel approach that improves answer accuracy and relevance by mitigating positional bias. LoRE employs an ensemble of diverse retrievers, such as BM25 and sentence transformers with FAISS indexing. A key innovation is a logit-based answer ranking algorithm that combines the logit scores from a large language model (LLM) with the retrieval ranks of the passages. Experimental results on NarrativeQA and SQuAD demonstrate that LoRE significantly outperforms existing retrieval-based methods in terms of exact match and F1 scores. On SQuAD, LoRE achieves 14.5%, 22.83%, and 14.95% improvements over the baselines for ROUGE-L, EM, and F1, respectively. Qualitatively, LoRE generates more relevant and accurate answers, especially for complex queries.
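The abstract does not give the formula that combines LLM logit scores with retrieval ranks, so the sketch below shows only one hypothetical way such a combination could look. The additive form, the reciprocal-rank term, the `rank_weight` parameter, and the toy candidates are all assumptions and should not be read as LoRE's actual algorithm.

```python
def rank_answers(candidates, rank_weight=1.0):
    """Order candidate answers by a combined score.

    Each candidate is (answer_text, llm_logit, retrieval_rank), where
    retrieval_rank is 1 for the top-retrieved passage. The combined score
    is llm_logit + rank_weight / retrieval_rank (an assumed formula).
    """
    scored = [
        (answer, logit + rank_weight / rank) for answer, logit, rank in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy example: the answer from the top-ranked passage overtakes one with a
# similar logit that came from a lower-ranked passage.
toy_candidates = [
    ("answer from passage ranked 3", 2.0, 3),
    ("answer from passage ranked 1", 2.1, 1),
]
ordered = rank_answers(toy_candidates)
```

Setting `rank_weight=0` in this sketch reduces the ordering to the LLM logits alone, which makes the contribution of the retrieval rank easy to isolate.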