Fangzheng Tian, Debasis Ganguly, Craig Macdonald
Retrieval Augmented Generation (RAG) is a framework for incorporating
external knowledge, usually in the form of a set of documents retrieved from a
collection, as a part of a prompt to a large language model (LLM) to
potentially improve the performance of a downstream task, such as question
answering. Different from a standard retrieval task's objective of maximising
the relevance of a set of top-ranked documents, a RAG system's objective is
rather to maximise their total utility, where the utility of a document
indicates whether including it as a part of the additional contextual
information in an LLM prompt improves a downstream task. Existing studies
investigate the role of the relevance of a RAG context for knowledge-intensive
language tasks (KILT), where relevance essentially takes the form of answer
containment. In contrast, in our work, relevance corresponds to topical
overlap between a query and a document for an information seeking task.
Specifically, we make use of an IR test collection to empirically investigate
whether a RAG context comprised of topically relevant documents leads to
improved downstream performance. Our experiments lead to the following
findings: (a) there is a small positive correlation between relevance and
utility; (b) this correlation decreases with increasing context sizes (higher
values of k in k-shot); and (c) a more effective retrieval model generally
leads to better downstream RAG performance.
Authors' comments: 18 pages (including reference), 5 figures, 1 table, 48 references;
this paper has been accepted by ECIR'25 as a full paper
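As a rough illustration of finding (a) above, the sketch below computes a Pearson correlation between per-document relevance labels and utility scores. The toy values, the choice of Pearson correlation, and the reading of utility as a change in a downstream score are all assumptions made for illustration; the abstract does not state how the authors measured the correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (not from the paper): graded relevance labels and the
# change in a downstream score when each document is added to the prompt.
relevance = [0, 0, 1, 1, 2, 2, 3, 3]
utility = [0.02, -0.01, 0.00, 0.03, -0.02, 0.04, 0.01, 0.03]

r = pearson(relevance, utility)
print(f"relevance-utility correlation: {r:.3f}")
```

A value near zero would mean that topical relevance says little about whether a document helps the downstream task; the toy data is chosen to give a small positive value.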
Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Lisa Vandenhurk, Joey Chua
Retrieval-augmented generation (RAG)-based applications are gaining
prominence due to their ability to leverage large language models (LLMs). These
systems excel at combining retrieval mechanisms with generative capabilities,
resulting in more accurate, contextually relevant responses that enhance user
experience. In particular, Transurban, a road operation company, is replacing
its rule-based virtual assistant (VA) with a RAG-based VA (RAGVA) to offer more
flexible customer interactions and support a wider range of scenarios. In this
paper, drawing from the experience at Transurban, we present a comprehensive
step-by-step guide for building a conversational application and how to
engineer a RAGVA. These guides aim to serve as references for future
researchers and practitioners. While the engineering processes for traditional
software applications are well-established, the development and evaluation of
RAG-based applications are still in their early stages, with numerous emerging
challenges remaining uncharted. To address this gap, we conduct a focus group
study with Transurban practitioners regarding developing and evaluating their
RAGVA. We identified eight challenges encountered by the engineering team and
proposed eight future directions that should be explored to advance the
development of RAG-based applications. This study contributes to the
foundational understanding of a RAG-based conversational application and the
emerging AI software engineering challenges it presents.
Authors' comments: Under Review at the Journal of Systems and Software (JSS)
Yaochen Zhu, Chao Wan, Harald Steck, Dawen Liang, Yesu Feng, Nathan Kallus, Jundong Li
Conversational recommender systems (CRS) aim to provide personalized
recommendations via interactive dialogues with users. While large language
models (LLMs) enhance CRS with their superior understanding of context-aware
user preferences, they typically struggle to leverage behavioral data, which
have proven to be important for classical collaborative filtering (CF)-based
approaches. For this reason, we propose CRAG, Collaborative Retrieval Augmented
Generation for LLM-based CRS. To the best of our knowledge, CRAG is the first
approach that combines state-of-the-art LLMs with CF for conversational
recommendations. Our experiments on two publicly available movie conversational
recommendation datasets, i.e., a refined Reddit dataset (which we name
Reddit-v2) as well as the Redial dataset, demonstrate the superior item
coverage and recommendation performance of CRAG, compared to several CRS
baselines. Moreover, we observe that the improvements are mainly due to better
recommendation accuracy on recently released movies. The code and data are
available at https://github.com/yaochenzhu/CRAG.
Authors' comments: Accepted by WWW'2025
Karl Elbakian, Samuel Carton
A key aspect of alignment is the proper use of within-document evidence to
construct document-level decisions. We analyze the relationship between the
retrieval and interpretation of within-document evidence for large language
models in a few-shot setting. Specifically, we measure the extent to which model
prediction errors are associated with evidence retrieval errors with respect to
gold-standard human-annotated extractive evidence for five datasets, using two
popular closed proprietary models. We perform two ablation studies to
investigate when both label prediction and evidence retrieval errors can be
attributed to qualities of the relevant evidence. We find that there is a
strong empirical relationship between model prediction and evidence retrieval
error, but that evidence retrieval error is mostly not associated with evidence
interpretation error--a hopeful sign for downstream applications built on this
mechanism.
Authors' comments: 9 pages, 8 figures, Accepted to AAAI 2025 Main Conference (AI
Alignment Track)
Ningze Wang, Anoosheh Heidarzadeh, Alex Sprintson
Private Information Retrieval (PIR) is a fundamental problem in the broader
fields of security and privacy. In recent years, the problem has garnered
significant attention from the research community, leading to achievability
schemes and converse results for many important PIR settings.
This paper focuses on the Multi-message Private Information Retrieval (MPIR)
setting, where a user aims to retrieve \(D\) messages from a database of \(K\)
messages, with identical copies of the database available on \(N\) remote
servers. The user's goal is to maximize the download rate while keeping the
identities of the retrieved messages private. Existing approaches to the MPIR
problem primarily focus on either scalar-linear solutions or vector-linear
solutions, the latter requiring a high degree of subpacketization. Furthermore,
prior scalar-linear solutions are restricted to the special case of \(N =
D+1\). This limitation hinders the practical adoption of these schemes, as
real-world applications demand simple, easily implementable solutions that
support a broad range of scenarios.
In this work, we present a solution for the MPIR problem, which applies to a
broader range of system parameters and requires a limited degree of
subpacketization. In particular, the proposed scheme applies to all values of
\(N=DL+1\) for any integer \(L\geq 1\), and requires a degree of
subpacketization \(L\). Our scheme achieves capacity when \(D\) divides \(K\),
and in all other cases, its performance matches or comes within a small
additive margin of the best-known scheme that requires a high degree of
subpacketization.
Authors' comments: arXiv admin note: text overlap with arXiv:2208.13237
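The parameter conditions stated in the abstract above can be checked mechanically. The sketch below is only a helper for those two conditions (N = DL + 1 for an integer L >= 1, and D dividing K); it is not the retrieval scheme itself, and the function names are hypothetical.

```python
def subpacketization_level(num_servers, num_demanded):
    """Return L with N = D * L + 1 and L >= 1, or None if no such integer exists."""
    if num_servers < 2 or num_demanded < 1:
        raise ValueError("need N >= 2 servers and D >= 1 demanded messages")
    quotient, remainder = divmod(num_servers - 1, num_demanded)
    if remainder == 0 and quotient >= 1:
        return quotient
    return None


def in_capacity_achieving_case(num_messages, num_demanded):
    """True when D divides K, the case in which the abstract claims capacity."""
    if num_demanded < 1 or num_messages < num_demanded:
        raise ValueError("need 1 <= D <= K")
    return num_messages % num_demanded == 0


# N = 5 servers, D = 2 demanded messages: N = 2 * 2 + 1, so L = 2.
print(subpacketization_level(5, 2))
# K = 6 messages, D = 2: D divides K.
print(in_capacity_achieving_case(6, 2))
```

Setting L = 1 recovers the special case N = D + 1 that the abstract attributes to prior scalar-linear solutions.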
Qitao Qin, Yucong Luo, Yihang Lu, Zhibo Chu, Xianwei Meng
Retrieval-Augmented Generation (RAG), by integrating non-parametric knowledge
from external knowledge bases into models, has emerged as a promising approach
to enhancing response accuracy while mitigating factual errors and
hallucinations. This method has been widely applied in tasks such as Question
Answering (QA). However, existing RAG methods struggle with open-domain QA
tasks because they perform independent retrieval operations and directly
incorporate the retrieved information into generation without maintaining a
summarizing memory or using adaptive retrieval strategies, leading to noise
from redundant information and insufficient information integration. To address
these challenges, we propose Adaptive memory-based optimization for enhanced
RAG (Amber) for open-domain QA tasks, which comprises an Agent-based Memory
Updater, an Adaptive Information Collector, and a Multi-granular Content
Filter, working together within an iterative memory updating paradigm.
Specifically, Amber integrates and optimizes the language model's memory
through a multi-agent collaborative approach, ensuring comprehensive knowledge
integration from previous retrieval steps. It dynamically adjusts retrieval
queries and decides when to stop retrieval based on the accumulated knowledge,
enhancing retrieval efficiency and effectiveness. Additionally, it reduces
noise by filtering irrelevant content at multiple levels, retaining essential
information to improve overall model performance. We conduct extensive
experiments on several open-domain QA datasets, and the results demonstrate the
superiority and effectiveness of our method and its components. The source code
is available at https://anonymous.4open.science/r/Amber-B203/.
Authors' comments: 8 pages
Yixing Fan, Qiang Yan, Wenshan Wang, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng
Retrieval-augmented generation (RAG) has emerged as a crucial technique for enhancing large models with real-time and domain-specific knowledge. While numerous improvements and open-source tools have been proposed to refine the RAG framework for accuracy, relatively little attention has been given to improving the trustworthiness of generated results. To address this gap, we introduce TrustRAG, a novel framework that enhances RAG from three perspectives: indexing, retrieval, and generation. Specifically, in the indexing stage, we propose a semantic-enhanced chunking strategy that incorporates hierarchical indexing to supplement each chunk with contextual information, ensuring semantic completeness. In the retrieval stage, we introduce a utility-based filtering mechanism to identify high-quality information, supporting answer generation while reducing input length. In the generation stage, we propose fine-grained citation enhancement, which detects opinion-bearing sentences in responses and infers citation relationships at the sentence level, thereby improving citation accuracy. We open-source the TrustRAG framework and provide a demonstration studio designed for excerpt-based question answering tasks (https://huggingface.co/spaces/golaxy/TrustRAG). With these, we aim to help researchers (1) systematically enhance the trustworthiness of RAG systems and (2) develop their own RAG systems with more reliable outputs.
Ylli Prifti, Alessandro Provetti, Pasquale de Meo
This article introduces the Data Retrieval Web Engine (also referred to as
doctor web), a flexible and modular tool for extracting structured data from
web pages using a simple query language. We discuss the engineering challenges
addressed during its development, such as dynamic content handling and messy
data extraction. Furthermore, we cover the steps for making the DR Web Engine
public, highlighting its open source potential.
Authors' comments: 10 pages, 1 figure, 1 table, 7 listings
Sjoerd Dirksen, Felix Krahmer, Patricia Römer, Palina Salanevich
We consider the problem of phaseless reconstruction from measurements with Poisson or Bernoulli distributed noise. This is of particular interest in biological imaging experiments where a low dose of radiation has to be used to mitigate potential damage to the specimen, resulting in low observed particle counts. We derive recovery guarantees for the spectral method for these noise models in the case of Gaussian measurements. Our results give quantitative insight into the trade-off between the employed radiation dose per measurement and the overall sampling complexity.
Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, Cheng Yang
Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known as graph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: https://github.com/BUPT-GAMMA/PathRAG
Xiaoju Ye, Zhichun Wang, Jingyuan Wang
Limited by the context window size of Large Language Models (LLMs), handling
various tasks with input tokens exceeding the upper limit has been challenging,
whether it is a simple direct retrieval task or a complex multi-hop reasoning
task. Although various methods have been proposed to enhance the long-context
processing capabilities of LLMs, they either incur substantial post-training
costs, require additional tool modules (e.g., RAG), or have not shown
significant improvement in realistic tasks. Our work observes the correlation
between the attention distribution and generated answers across each layer, and
establishes through experiments that attention allocation aligns with
retrieval-augmented capabilities. Drawing on the above insights, we propose a
novel method, InfiniRetri, that leverages the LLM's own attention information to
enable accurate retrieval across inputs of infinite length. Our evaluations
indicate that InfiniRetri achieves 100% accuracy in the
Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model,
surpassing other methods or larger models and setting a new
state-of-the-art (SOTA). Moreover, our method achieves significant performance
improvements on real-world benchmarks, with a maximum 288% improvement. In
addition, InfiniRetri can be applied to any Transformer-based LLM without
additional training and substantially reduces inference latency and compute
overhead in long texts. In summary, our comprehensive studies show
InfiniRetri's potential for practical applications and create a paradigm for
retrieving information using an LLM's own capabilities over inputs of
infinite length. Code will be released in link.
Authors' comments: 21 pages
Tanqiu Jiang, Changjiang Li, Fenglong Ma, Ting Wang
Differentially private diffusion models (DPDMs) harness the remarkable
generative capabilities of diffusion models while enforcing differential
privacy (DP) for sensitive data. However, existing DPDM training approaches
often suffer from significant utility loss, large memory footprint, and
expensive inference cost, impeding their practical uses. To overcome such
limitations, we present RAPID: Retrieval Augmented PrIvate Diffusion model, a
novel approach that integrates retrieval augmented generation (RAG) into DPDM
training. Specifically, RAPID leverages available public data to build a
knowledge base of sample trajectories; when training the diffusion model on
private data, RAPID computes the early sampling steps as queries, retrieves
similar trajectories from the knowledge base as surrogates, and focuses on
training the later sampling steps in a differentially private manner. Extensive
evaluation using benchmark datasets and models demonstrates that, with the same
privacy guarantee, RAPID significantly outperforms state-of-the-art approaches
by large margins in generative quality, memory footprint, and inference cost,
suggesting that retrieval-augmented DP training represents a promising
direction for developing future privacy-preserving generative models. The code
is available at: https://github.com/TanqiuJiang/RAPID
Authors' comments: Published in ICLR 2025
Huaying Yuan, Jian Ni, Zheng Liu, Yueze Wang, Junjie Zhou, Zhengyang Liang, Bo Zhao, Zhao Cao et al.
Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on the end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LMVR), distinguished by the following features. First, it is created based on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels (global-level, event-level, and object-level), covering common tasks such as action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker (https://yhy-2000.github.io/MomentSeeker/) to facilitate future research in this area.
Navve Wasserman, Roi Pony, Oshri Naparstek, Adi Raz Goldfarb, Eli Schwartz, Udi Barzelay, Leonid Karlinsky
Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges with their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations.
Bingyu Wan, Fuxi Zhang, Zhongpeng Qi, Jiayi Ding, Jijun Li, Baoshi Fan, Yijia Zhang, Jun Zhang
Large language models (LLMs) inherently display hallucinations since the precision of generated texts cannot be guaranteed purely by the parametric knowledge they include. Although retrieval-augmented generation (RAG) systems enhance the accuracy and reliability of generative models by incorporating external documents, these retrieved documents often fail to adequately support the model's responses in practical applications. To address this issue, we propose GGatrieval (Fine-Grained Grounded Alignment Retrieval for verifiable generation), which leverages an LLM to dynamically update queries and filter high-quality, reliable retrieval documents. Specifically, we parse the user query into its syntactic components and perform fine-grained grounded alignment with the retrieved documents. For query components that cannot be individually aligned, we propose a dynamic semantic compensation mechanism that iteratively refines and rewrites the query while continuously updating the retrieval results. This iterative process continues until the retrieved documents sufficiently support the query's response. Our approach introduces a novel criterion for filtering retrieved documents, closely emulating human strategies for acquiring targeted information. This ensures that the retrieved content effectively supports and verifies the generated outputs. On the ALCE benchmark, our method significantly surpasses a wide range of baselines, achieving state-of-the-art performance.
Lionel Wong, Ayman Ali, Raymond Xiong, Shannon Zeijang Shen, Yoon Kim, Monica Agrawal
Patients have long sought health information online, and increasingly, they
are turning to generative AI to answer their health-related queries. Given the
high stakes of the medical domain, techniques like retrieval-augmented
generation and citation grounding have been widely promoted as methods to
reduce hallucinations and improve the accuracy of AI-generated responses and
have been widely adopted into search engines. This paper argues that even when
these methods produce literally accurate content drawn from source documents
sans hallucinations, they can still be highly misleading. Patients may derive
significantly different interpretations from AI-generated outputs than they
would from reading the original source material, let alone consulting a
knowledgeable clinician. Through a large-scale query analysis on topics
including disputed diagnoses and procedure safety, we support our argument with
quantitative and qualitative evidence of the suboptimal answers resulting from
current systems. In particular, we highlight how these models tend to
decontextualize facts, omit critical relevant sources, and reinforce patient
misconceptions or biases. We propose a series of recommendations -- such as the
incorporation of communication pragmatics and enhanced comprehension of source
documents -- that could help mitigate these issues and extend beyond the
medical domain.
Authors' comments: Preprint
Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, Wentao Zhang
Retrieval-Augmented Generation (RAG) systems often struggle with imperfect retrieval, as traditional retrievers focus on lexical or semantic similarity rather than logical relevance. To address this, we propose HopRAG, a novel RAG framework that augments retrieval with logical reasoning through graph-structured knowledge exploration. During indexing, HopRAG constructs a passage graph, with text chunks as vertices and logical connections established via LLM-generated pseudo-queries as edges. During retrieval, it employs a retrieve-reason-prune mechanism: starting with lexically or semantically similar passages, the system explores multi-hop neighbors guided by pseudo-queries and LLM reasoning to identify truly relevant ones. Experiments on multiple multi-hop benchmarks demonstrate that HopRAG's retrieve-reason-prune mechanism can expand the retrieval scope based on logical connections and improve final answer quality.
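The retrieve-reason-prune loop described above can be pictured with a small graph walk. This is a minimal sketch under stated assumptions, not the authors' implementation: token overlap stands in for LLM reasoning over pseudo-queries, pruning by arrival count is an assumed criterion, and the graph, passage ids, and function names are hypothetical.

```python
from collections import Counter, deque


def token_overlap(a, b):
    """Stand-in for LLM reasoning: number of shared lowercase tokens."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve_reason_prune(query, graph, seeds, max_hops=2, top_k=2):
    """Walk pseudo-query edges from seed passages and keep the most-visited passages.

    graph maps a passage id to a list of (pseudo_query, neighbor_id) edges.
    """
    visits = Counter(seeds)  # retrieve: each seed passage counts as one arrival
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        passage, hops = frontier.popleft()
        edges = graph.get(passage, [])
        if hops == max_hops or not edges:
            continue
        # Reason: follow the edge whose pseudo-query best matches the user query.
        pseudo_query, neighbor = max(edges, key=lambda e: token_overlap(query, e[0]))
        if token_overlap(query, pseudo_query) == 0:
            continue  # no edge looks useful for this query
        visits[neighbor] += 1
        frontier.append((neighbor, hops + 1))
    # Prune: keep the passages reached most often (assumed criterion).
    return [passage for passage, _ in visits.most_common(top_k)]


# Hypothetical passage graph with pseudo-queries as edge labels.
graph = {
    "p1": [("who founded the company", "p2")],
    "p4": [("which company did the founder start", "p2")],
    "p2": [("where was the founder born", "p3")],
    "p3": [],
}
result = retrieve_reason_prune(
    "where was the founder of the company born", graph, seeds=["p1", "p4"]
)
print(result)
```

In this toy graph both seed passages lead to p2 and then to p3, so those two passages accumulate the most arrivals and survive pruning even though neither was in the initial seed set.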
Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han
Retrieval-augmented language models often struggle with knowledge-intensive
tasks due to inefficient retrieval, unstructured knowledge integration, and
single-pass architectures. We present Retrieval-And-Structuring (RAS), a novel
framework that dynamically constructs and reasons over query-specific knowledge
graphs through iterative retrieval and structuring. RAS introduces four key
technical innovations: (1) a theme-scoped retrieval mechanism that efficiently
narrows the search space while maintaining retrieval quality, (2) an action
planning module that determines knowledge needs and generates focused
sub-queries, (3) a dynamic knowledge structuring approach that converts
retrieved text into an evolving knowledge graph, and (4) a graph-augmented
answering component that leverages the accumulated structured information. Our
framework achieves state-of-the-art performance, surpassing leading baselines
by 6.4% with open-source language models and 7.0% with proprietary models on
seven knowledge-intensive generation datasets across all evaluation metrics.
Detailed ablation studies verify the contribution of each technical component
to the overall system performance.
Authors' comments: under review
Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na
Large Language Models (LLMs) excel across a variety of language tasks yet are
constrained by limited input lengths and high computational costs. Existing
approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and
sliding window mechanisms, partially alleviate these issues but
often require additional training or suffer from performance degradation with
longer inputs. In this paper, we introduce CacheFocus, a
method that enhances length normalization and reduces inference latency without
any further training. Our approach leverages query-independent, offline caching
to efficiently reuse a Context KV Cache Store. We address the problem of
amplified abnormal token distributions by re-positioning cached keys and
introducing Layer-Adaptive Cache Pruning to discard low-relevance caches during
pre-filling. Additionally, our Adaptive Positional Allocation Strategy
dynamically reassigns cache positions to maximize the use of the available
positional encoding range. Experiments on the Natural Questions and TriviaQA
datasets demonstrate that CacheFocus outperforms alternative methods even when
inputs exceed the 4K limit of the LLaMA-2 model, emphasizing its
practical effectiveness for long-context LLMs. Moreover, even with the large
maximum input length of Qwen2, CacheFocus maintains consistent performance as
the number of documents increases, effectively managing long-text generation
without degradation.
Authors' comments: 11 pages (Work in progress)
Yuqi Liu, Yan Zheng
Given the rapid development of Legal AI, a lot of attention has been paid to one of the most important legal AI tasks, similar case retrieval, especially using language models. In our paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank rather than language models. Specifically, we conduct experiments using a pairwise method, RankSVM, as the classifier in place of a fully connected layer, combined with commonly used language models, on the similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We conclude that RankSVM generally helps improve retrieval performance on the LeCaRDv1 and LeCaRDv2 datasets compared with the original classifiers by optimizing the precise ranking. It could also help mitigate overfitting owing to class imbalance. Our code is available at https://github.com/liuyuqi123study/RankSVM_for_SLR
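To make the pairwise idea concrete, the sketch below trains a linear ranker with a pairwise hinge loss, which is the objective behind RankSVM. It is a minimal illustration, not the authors' code: the plain subgradient-descent solver, the hyperparameters, and the two-dimensional feature vectors (standing in for language-model representations of query-candidate pairs) are all assumptions.

```python
def train_pairwise_ranker(features, labels, epochs=50, lr=0.1, reg=0.01):
    """Learn weights w so that w.x_i exceeds w.x_j by a margin when label_i > label_j."""
    w = [0.0] * len(features[0])
    # Every ordered pair in which the first candidate is more relevant than the second.
    pairs = [
        (i, j)
        for i in range(len(labels))
        for j in range(len(labels))
        if labels[i] > labels[j]
    ]
    for _ in range(epochs):
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            # L2 regularisation step (shrinks the weights).
            w = [wk * (1.0 - lr * reg) for wk in w]
            # Hinge-loss subgradient step when the pair violates the margin.
            if margin < 1.0:
                w = [wk + lr * dk for wk, dk in zip(w, diff)]
    return w


def score(w, x):
    """Linear ranking score of one candidate."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Hypothetical features for three candidate cases of one query, with graded labels.
features = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
labels = [2, 1, 0]

w = train_pairwise_ranker(features, labels)
ranking = sorted(range(len(features)), key=lambda i: score(w, features[i]), reverse=True)
print(ranking)
```

Because the loss is defined over pairs of candidates rather than single labels, it only depends on relative order, which is one way to read the abstract's remark about optimizing the precise ranking.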