Jiwoong Sohn, Yein Park, Chanwoong Yoon, Sihyeon Park, Hyeon Hwang, Mujeen Sung, Hyunjae Kim, Jaewoo Kang
Large language models (LLM) hold significant potential for applications in biomedicine, but they struggle with hallucinations and outdated knowledge. While retrieval-augmented generation (RAG) is generally employed to address these issues, it also has its own set of challenges: (1) LLMs are vulnerable to irrelevant or incorrect context, (2) medical queries are often not well-targeted for helpful information, and (3) retrievers are prone to bias toward the specific source corpus they were trained on. In this study, we present RAG$^2$ (RAtionale-Guided RAG), a new framework for enhancing the reliability of RAG in biomedical contexts. RAG$^2$ incorporates three key innovations: a small filtering model trained on perplexity-based labels of rationales, which selectively augments informative snippets of documents while filtering out distractors; LLM-generated rationales as queries to improve the utility of retrieved snippets; a structure designed to retrieve snippets evenly from a comprehensive set of four biomedical corpora, effectively mitigating retriever bias. Our experiments demonstrate that RAG$^2$ improves the state-of-the-art LLMs of varying sizes, with improvements of up to 6.1\%, and it outperforms the previous best medical RAG model by up to 5.6\% across three medical question-answering benchmarks. Our code is available at https://github.com/dmis-lab/RAG2.
Xinke Jiang, Rihong Qiu, Yongxin Xu, Wentao Zhang, Yichen Zhu, Ruizhe Zhang, Yuchen Fang, Xu Chu et al.
Graph Neural Networks (GNNs) have become essential in interpreting relational
data across various domains, yet, they often struggle to generalize to unseen
graph data that differs markedly from training instances. In this paper, we
introduce a novel framework called General Retrieval-Augmented Graph Learning
(RAGraph), which brings external graph data into the general graph foundation
model to improve model generalization on unseen scenarios. On the top of our
framework is a toy graph vector library that we established, which captures key
attributes, such as features and task-specific label information. During
inference, the RAGraph adeptly retrieves similar toy graphs based on key
similarities in downstream tasks, integrating the retrieved data to enrich the
learning context via the message-passing prompting mechanism. Our extensive
experimental evaluations demonstrate that RAGraph significantly outperforms
state-of-the-art graph learning methods in multiple tasks such as node
classification, link prediction, and graph classification across both dynamic
and static datasets. Furthermore, extensive testing confirms that RAGraph
consistently maintains high performance without the need for task-specific
fine-tuning, highlighting its adaptability, robustness, and broad
applicability.
Authors' comments: NeurIPS 2024
Michal Sedlák, Robert Stárek, Nikola Horová, Michal Mičuda, Jaromir Fiurášek, Alessandro Bisio
We address the fundamental task of converting $n$ uses of an unknown unitary
transformation into a quantum state (i.e., storage) and later retrieval of the
transformation. Specifically, we consider the case where the unknown unitary is
selected with equal prior probability from two options. First, we prove that
the optimal storage strategy involves the sequential application of the $n$
uses of the unknown unitary, and it produces the optimal state for
discrimination between the two possible unitaries. Next, we show that
incoherent "measure-and-prepare" retrieval achieves the maximum fidelity
between the retrieved operation and the original (qubit) unitary. We then
identify the retrieval strategy that maximizes the probability of successfully
and perfectly retrieving the unknown transformation. In the regime in which the
fidelity between the two possible unitaries is large the probability of success
scales as $ P_{succ} = 1 - \mathcal{O}(n^{-2} ) $, which is a quadratic
improvement with respect to the case in which the unitaries are drawn from the
entire unitary group $U(d)$ with uniform prior probability. Finally, we present
an optical experiment for this approach and assess the storage and retrieval
quality using quantum tomography of states and processes. The results are
discussed in relation to non-optimal measure-and-prepare strategy, highlighting
the advantages of our protocol.
Authors' comments: 18 pages, 10 figures
Sheryl Hsu, Omar Khattab, Chelsea Finn, Archit Sharma
The hallucinations of large language models (LLMs) are increasingly mitigated by allowing LLMs to search for information and to ground their answers in real sources. Unfortunately, LLMs often struggle with posing the right search queries, especially when dealing with complex or otherwise indirect topics. Observing that LLMs can learn to search for relevant facts by $\textit{trying}$ different queries and learning to up-weight queries that successfully produce relevant results, we introduce $\underline{Le}$arning to $\underline{Re}$trieve by $\underline{T}$rying (LeReT), a reinforcement learning framework that explores search queries and uses preference-based optimization to improve their quality. \methodclass can improve the absolute retrieval accuracy by up to 29\% and the downstream generator evaluations by 17\%. The simplicity and flexibility of LeReT allows it to be applied to arbitrary off-the-shelf retrievers and makes it a promising technique for improving general LLM pipelines. Project website: http://sherylhsu.com/LeReT/.
Yiruo Cheng, Kelong Mao, Ziliang Zhao, Guanting Dong, Hongjin Qian, Yongkang Wu, Tetsuya Sakai, Ji-Rong Wen et al.
Retrieval-Augmented Generation (RAG) has become a powerful paradigm for enhancing large language models (LLMs) through external knowledge retrieval. Despite its widespread attention, existing academic research predominantly focuses on single-turn RAG, leaving a significant gap in addressing the complexities of multi-turn conversations found in real-world applications. To bridge this gap, we introduce CORAL, a large-scale benchmark designed to assess RAG systems in realistic multi-turn conversational settings. CORAL includes diverse information-seeking conversations automatically derived from Wikipedia and tackles key challenges such as open-domain coverage, knowledge intensity, free-form responses, and topic shifts. It supports three core tasks of conversational RAG: passage retrieval, response generation, and citation labeling. We propose a unified framework to standardize various conversational RAG methods and conduct a comprehensive evaluation of these methods on CORAL, demonstrating substantial opportunities for improving existing approaches.
Le Huang, Hengzhi Lan, Zijun Sun, Chuan Shi, Ting Bai
As LLMs exhibit a high degree of human-like capability, increasing attention has been paid to role-playing research areas in which responses generated by LLMs are expected to mimic human replies. This has promoted the exploration of role-playing agents in various applications, such as chatbots that can engage in natural conversations with users and virtual assistants that can provide personalized support and guidance. The crucial factor in the role-playing task is the effective utilization of character memory, which stores characters' profiles, experiences, and historical dialogues. Retrieval Augmented Generation (RAG) technology is used to access the related memory to enhance the response generation of role-playing agents. Most existing studies retrieve related information based on the semantic similarity of memory to maintain characters' personalized traits, and few attempts have been made to incorporate the emotional factor in the retrieval argument generation (RAG) of LLMs. Inspired by the Mood-Dependent Memory theory, which indicates that people recall an event better if they somehow reinstate during recall the original emotion they experienced during learning, we propose a novel emotion-aware memory retrieval framework, termed Emotional RAG, which recalls the related memory with consideration of emotional state in role-playing agents. Specifically, we design two kinds of retrieval strategies, i.e., combination strategy and sequential strategy, to incorporate both memory semantic and emotional states during the retrieval process. Extensive experiments on three representative role-playing datasets demonstrate that our Emotional RAG framework outperforms the method without considering the emotional factor in maintaining the personalities of role-playing agents. This provides evidence to further reinforce the Mood-Dependent Memory theory in psychology.
Yucheng Zhang, Qinfeng Li, Tianyu Du, Xuhong Zhang, Xinkui Zhao, Zhengwen Feng, Jianwei Yin
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge, making them adaptable and cost-effective for various applications. However, the growing reliance on these systems also introduces potential security risks. In this work, we reveal a novel vulnerability, the retrieval prompt hijack attack (HijackRAG), which enables attackers to manipulate the retrieval mechanisms of RAG systems by injecting malicious texts into the knowledge database. When the RAG system encounters target questions, it generates the attacker's pre-determined answers instead of the correct ones, undermining the integrity and trustworthiness of the system. We formalize HijackRAG as an optimization problem and propose both black-box and white-box attack strategies tailored to different levels of the attacker's knowledge. Extensive experiments on multiple benchmark datasets show that HijackRAG consistently achieves high attack success rates, outperforming existing baseline attacks. Furthermore, we demonstrate that the attack is transferable across different retriever models, underscoring the widespread risk it poses to RAG systems. Lastly, our exploration of various defense mechanisms reveals that they are insufficient to counter HijackRAG, emphasizing the urgent need for more robust security measures to protect RAG systems in real-world deployments.
Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, Honguk Woo
Continual Imitation Learning (CiL) involves extracting and accumulating task knowledge from demonstrations across multiple stages and tasks to achieve a multi-task policy. With recent advancements in foundation models, there has been a growing interest in adapter-based CiL approaches, where adapters are established parameter-efficiently for tasks newly demonstrated. While these approaches isolate parameters for specific tasks and tend to mitigate catastrophic forgetting, they limit knowledge sharing among different demonstrations. We introduce IsCiL, an adapter-based CiL framework that addresses this limitation of knowledge sharing by incrementally learning shareable skills from different demonstrations, thus enabling sample-efficient task adaptation using the skills particularly in non-stationary CiL environments. In IsCiL, demonstrations are mapped into the state embedding space, where proper skills can be retrieved upon input states through prototype-based memory. These retrievable skills are incrementally learned on their corresponding adapters. Our CiL experiments with complex tasks in Franka-Kitchen and Meta-World demonstrate robust performance of IsCiL in both task adaptation and sample-efficiency. We also show a simple extension of IsCiL for task unlearning scenarios.
Jeongyeon Hwang, Junyoung Park, Hyejin Park, Sangdon Park, Jungseul Ok
Retrieval-augmented generation (RAG) addresses key limitations of large language models (LLMs), such as hallucinations and outdated knowledge, by incorporating external databases. These databases typically consult multiple sources to encompass up-to-date and various information. However, standard RAG methods often overlook the heterogeneous source reliability in the multi-source database and retrieve documents solely based on relevance, making them prone to propagating misinformation. To address this, we propose Reliability-Aware RAG (RA-RAG) which estimates the reliability of multiple sources and incorporates this information into both retrieval and aggregation processes. Specifically, it iteratively estimates source reliability and true answers for a set of queries with no labelling. Then, it selectively retrieves relevant documents from a few of reliable sources and aggregates them using weighted majority voting, where the selective retrieval ensures scalability while not compromising the performance. We also introduce a benchmark designed to reflect real-world scenarios with heterogeneous source reliability and demonstrate the effectiveness of RA-RAG compared to a set of baselines.
Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang
Instant Messaging is a popular means for daily communication, allowing users to send text and stickers. As the saying goes, "a picture is worth a thousand words", so developing an effective sticker retrieval technique is crucial for enhancing user experience. However, existing sticker retrieval methods rely on labeled data to interpret stickers, and general-purpose Vision-Language Models (VLMs) often struggle to capture the unique semantics of stickers. Additionally, relevant-based sticker retrieval methods lack personalization, creating a gap between diverse user expectations and retrieval results. To address these, we propose the Personalized Sticker Retrieval with Vision-Language Model framework, namely PerSRV, structured into offline calculations and online processing modules. The online retrieval part follows the paradigm of relevant recall and personalized ranking, supported by the offline pre-calculation parts, which are sticker semantic understanding, utility evaluation and personalization modules. Firstly, for sticker-level semantic understanding, we supervised fine-tuned LLaVA-1.5-7B to generate human-like sticker semantics, complemented by textual content extracted from figures and historical interaction queries. Secondly, we investigate three crowd-sourcing metrics for sticker utility evaluation. Thirdly, we cluster style centroids based on users' historical interactions to achieve personal preference modeling. Finally, we evaluate our proposed PerSRV method on a public sticker retrieval dataset from WeChat, containing 543,098 candidates and 12,568 interactions. Experimental results show that PerSRV significantly outperforms existing methods in multi-modal sticker retrieval. Additionally, our fine-tuned VLM delivers notable improvements in sticker semantic understandings.
Nour Jedidi, Yung-Sung Chuang, Leslie Shing, James Glass
Building effective dense retrieval systems remains difficult when relevance supervision is not available. Recent work has looked to overcome this challenge by using a Large Language Model (LLM) to generate hypothetical documents that can be used to find the closest real document. However, this approach relies solely on the LLM to have domain-specific knowledge relevant to the query, which may not be practical. Furthermore, generating hypothetical documents can be inefficient as it requires the LLM to generate a large number of tokens for each query. To address these challenges, we introduce Real Document Embeddings from Relevance Feedback (ReDE-RF). Inspired by relevance feedback, ReDE-RF proposes to re-frame hypothetical document generation as a relevance estimation task, using an LLM to select which documents should be used for nearest neighbor search. Through this re-framing, the LLM no longer needs domain-specific knowledge but only needs to judge what is relevant. Additionally, relevance estimation only requires the LLM to output a single token, thereby improving search latency. Our experiments show that ReDE-RF consistently surpasses state-of-the-art zero-shot dense retrieval methods across a wide range of low-resource retrieval datasets while also making significant improvements in latency per-query.
Dongkyu Kim, Byoungwook Kim, Donggeon Han, Matouš Eibich
Using LLMs (Large Language Models) in conjunction with external documents has
made RAG (Retrieval-Augmented Generation) an essential technology. Numerous
techniques and modules for RAG are being researched, but their performance can
vary across different datasets. Finding RAG modules that perform well on
specific datasets is challenging. In this paper, we propose the AutoRAG
framework, which automatically identifies suitable RAG modules for a given
dataset. AutoRAG explores and approximates the optimal combination of RAG
modules for the dataset. Additionally, we share the results of optimizing a
dataset using AutoRAG. All experimental results and data are publicly available
and can be accessed through our GitHub repository
https://github.com/Marker-Inc-Korea/AutoRAG_ARAGOG_Paper .
Authors' comments: 20 pages
Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, Amit Sharma
We introduce Plan*RAG, a novel framework that enables structured multi-hop
reasoning in retrieval-augmented generation (RAG) through test-time reasoning
plan generation. While existing approaches such as ReAct maintain reasoning
chains within the language model's context window, we observe that this often
leads to plan fragmentation and execution failures. Our key insight is that by
isolating the reasoning plan as a directed acyclic graph (DAG) outside the LM's
working memory, we can enable (1) systematic exploration of reasoning paths,
(2) atomic subqueries enabling precise retrievals and grounding, and (3)
efficiency through parallel execution and bounded context window utilization.
Moreover, Plan*RAG's modular design allows it to be integrated with existing
RAG methods, thus providing a practical solution to improve current RAG
systems. On standard multi-hop reasoning benchmarks, Plan*RAG consistently
achieves improvements over recently proposed methods such as RQ-RAG and
Self-RAG, while maintaining comparable computational costs.
Authors' comments: 19 pages, preprint
Ming Zhong, Zhizhi Wu, Nanako Honda
Dense retrievers have achieved state-of-the-art performance in various
information retrieval tasks, but their robustness against tokenizer poisoning
remains underexplored. In this work, we assess the vulnerability of dense
retrieval systems to poisoned tokenizers by evaluating models such as BERT,
Dense Passage Retrieval (DPR), Contriever, SimCSE, and ANCE. We find that
supervised models like BERT and DPR experience significant performance
degradation when tokenizers are compromised, while unsupervised models like
ANCE show greater resilience. Our experiments reveal that even small
perturbations can severely impact retrieval accuracy, highlighting the need for
robust defenses in critical applications.
Authors' comments: 7 pages
Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu
Medical information retrieval (MIR) is essential for retrieving relevant
medical knowledge from diverse sources, including electronic health records,
scientific literature, and medical databases. However, achieving effective
zero-shot dense retrieval in the medical domain poses substantial challenges
due to the lack of relevance-labeled data. In this paper, we introduce a novel
approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to
tackle this issue. SL-HyDE leverages large language models (LLMs) as generators
to generate hypothetical documents based on a given query. These generated
documents encapsulate key medical context, guiding a dense retriever in
identifying the most relevant documents. The self-learning framework
progressively refines both pseudo-document generation and retrieval, utilizing
unlabeled medical corpora without requiring any relevance-labeled data.
Additionally, we present the Chinese Medical Information Retrieval Benchmark
(CMIRB), a comprehensive evaluation framework grounded in real-world medical
scenarios, encompassing five tasks and ten datasets. By benchmarking ten models
on CMIRB, we establish a rigorous standard for evaluating medical information
retrieval systems. Experimental results demonstrate that SL-HyDE significantly
surpasses existing methods in retrieval accuracy while showcasing strong
generalization and scalability across various LLM and retriever configurations.
CMIRB data and evaluation code are publicly available at:
https://github.com/CMIRB-benchmark/CMIRB.
Authors' comments: 15 pages, 3 figures
Mingrui Liu, Sixiao Zhang, Cheng Long
Retrieval-Augmented Generation (RAG) has been an effective approach to
mitigate hallucinations in large language models (LLMs) by incorporating
up-to-date and domain-specific knowledge. Recently, there has been a trend of
storing up-to-date or copyrighted data in RAG knowledge databases instead of
using it for LLM training. This practice has raised concerns about Membership
Inference Attacks (MIAs), which aim to detect if a specific target document is
stored in the RAG system's knowledge database so as to protect the rights of
data producers. While research has focused on enhancing the trustworthiness of
RAG systems, existing MIAs for RAG systems remain largely insufficient.
Previous work either relies solely on the RAG system's judgment or is easily
influenced by other documents or the LLM's internal knowledge, which is
unreliable and lacks explainability. To address these limitations, we propose a
Mask-Based Membership Inference Attacks (MBA) framework. Our framework first
employs a masking algorithm that effectively masks a certain number of words in
the target document. The masked text is then used to prompt the RAG system, and
the RAG system is required to predict the mask values. If the target document
appears in the knowledge database, the masked text will retrieve the complete
target document as context, allowing for accurate mask prediction. Finally, we
adopt a simple yet effective threshold-based method to infer the membership of
target document by analyzing the accuracy of mask prediction. Our mask-based
approach is more document-specific, making the RAG system's generation less
susceptible to distractions from other documents or the LLM's internal
knowledge. Extensive experiments demonstrate the effectiveness of our approach
compared to existing baseline models.
Authors' comments: This paper is accepted by conference WWW 2025
Han Zhang, Yunjing Jiang, Mingming Li, Haowei Yuan, Wen-Yun Yang
Embedding retrieval aims to learn a shared semantic representation space for both queries and items, thus enabling efficient and effective item retrieval using approximate nearest neighbor (ANN) algorithms. In current industrial practice, retrieval systems typically retrieve a fixed number of items for different queries, which actually leads to insufficient retrieval (low recall) for head queries and irrelevant retrieval (low precision) for tail queries. Mostly due to the trend of frequentist approach to loss function designs, till now there is no satisfactory solution to holistically address this challenge in the industry. In this paper, we move away from the frequentist approach, and take a novel \textbf{p}robabilistic approach to \textbf{e}mbedding \textbf{b}ased \textbf{r}etrieval (namely \textbf{pEBR}) by learning the item distribution for different queries, which enables a dynamic cosine similarity threshold calculated by the probabilistic cumulative distribution function (CDF) value. The experimental results show that our approach improves both the retrieval precision and recall significantly. Ablation studies also illustrate how the probabilistic approach is able to capture the differences between head and tail queries.
Aryo Pradipta Gema, Chen Jin, Ahmed Abdulaal, Tom Diethe, Philip Teare, Beatrice Alex, Pasquale Minervini, Amrutha Saseendran
Large Language Models (LLMs) often hallucinate, producing unfaithful or factually incorrect outputs by misrepresenting the provided context or incorrectly recalling internal knowledge. Recent studies have identified specific attention heads within the Transformer architecture, known as retrieval heads, responsible for extracting relevant contextual information. We hypothesise that masking these retrieval heads can induce hallucinations and that contrasting the outputs of the base LLM and the masked LLM can reduce hallucinations. To this end, we propose Decoding by Contrasting Retrieval Heads (DeCoRe), a novel training-free decoding strategy that amplifies information found in the context and model parameters. DeCoRe mitigates potentially hallucinated responses by dynamically contrasting the outputs of the base LLM and the masked LLM, using conditional entropy as a guide. Our extensive experiments confirm that DeCoRe significantly improves performance on tasks requiring high contextual faithfulness, such as summarisation (XSum by 18.6%), instruction following (MemoTrap by 10.9%), and open-book question answering (NQ-Open by 2.4% and NQ-Swap by 5.5%).
Jingwei Liu, Ling Yang, Hongyan Li, Shenda Hong
While time series diffusion models have received considerable focus from many
recent works, the performance of existing models remains highly unstable.
Factors limiting time series diffusion models include insufficient time series
datasets and the absence of guidance. To address these limitations, we propose
a Retrieval- Augmented Time series Diffusion model (RATD). The framework of
RATD consists of two parts: an embedding-based retrieval process and a
reference-guided diffusion model. In the first part, RATD retrieves the time
series that are most relevant to historical time series from the database as
references. The references are utilized to guide the denoising process in the
second part. Our approach allows leveraging meaningful samples within the
database to aid in sampling, thus maximizing the utilization of datasets.
Meanwhile, this reference-guided mechanism also compensates for the
deficiencies of existing time series diffusion models in terms of guidance.
Experiments and visualizations on multiple datasets demonstrate the
effectiveness of our approach, particularly in complicated prediction tasks.
Authors' comments: NeurIPS 2024
Tanya Chowdhury, Atharva Nijasure, James Allan
Transformer networks, particularly those achieving performance comparable to
GPT models, are well known for their robust feature extraction abilities.
However, the nature of these extracted features and their alignment with
human-engineered ones remain unexplored. In this work, we investigate the
internal mechanisms of state-of-the-art, fine-tuned LLMs for passage reranking.
We employ a probing-based analysis to examine neuron activations in ranking
LLMs, identifying the presence of known human-engineered and semantic features.
Our study spans a broad range of feature categories, including lexical signals,
document structure, query-document interactions, and complex semantic
representations, to uncover underlying patterns influencing ranking decisions.
Through experiments on four different ranking LLMs, we identify statistical
IR features that are prominently encoded in LLM activations, as well as others
that are notably missing. Furthermore, we analyze how these models respond to
out-of-distribution queries and documents, revealing distinct generalization
behaviors. By dissecting the latent representations within LLM activations, we
aim to improve both the interpretability and effectiveness of ranking models.
Our findings offer crucial insights for developing more transparent and
reliable retrieval systems, and we release all necessary scripts and code to
support further exploration.
Authors' comments: 9 pages