Mario Ceresa, Lorenzo Bertolini, Valentin Comte, Nicholas Spadaro, Barbara Raffael, Brigitte Toussaint, Sergio Consoli, Amalia Muñoz Piñeiro et al.
Safe and trustworthy use of Large Language Models (LLMs) in the processing of healthcare documents and scientific papers could substantially help clinicians, scientists and policymakers in overcoming information overload and focusing on the most relevant information at a given moment. Retrieval Augmented Generation (RAG) is a promising method to leverage the potential of LLMs while enhancing the accuracy of their outcomes. This report assesses the potential and shortcomings of such approaches in the automatic knowledge synthesis of different types of documents in the health domain. To this end, it describes: (1) RAGEv (Retrieval Augmented Generation Evaluation), an internally developed proof-of-concept pipeline that employs state-of-the-art practices to deliver safe and trustworthy analysis of healthcare documents and scientific papers; (2) a set of evaluation tools for LLM-based document retrieval and generation; (3) RAGEv-Bench, a benchmark dataset to verify the accuracy and veracity of the results. It concludes that careful implementations of RAG techniques could minimize most of the common problems in the use of LLMs for document processing in the health domain, obtaining very high scores both on short yes/no answers and long answers. There is a high potential for incorporating it into the day-to-day work of policy support tasks, but additional efforts are required to obtain a consistent and trustworthy tool.
Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson, Daniel E. Ho
As the legal community increasingly examines the use of large language models
(LLMs) for various legal applications, legal AI developers have turned to
retrieval-augmented LLMs ("RAG" systems) to improve system performance and
robustness. An obstacle to the development of specialized RAG systems is the
lack of realistic legal RAG benchmarks which capture the complexity of both
legal retrieval and downstream legal question-answering. To address this, we
introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA.
Our tasks correspond to real-world legal research tasks, and were produced
through annotation processes which resemble legal research. We describe the
construction of these benchmarks and the performance of existing retriever
pipelines. Our results suggest that legal RAG remains a challenging
application, thus motivating future research.
Authors' comments: CS&Law 2025. For data, see
https://reglab.github.io/legal-rag-benchmarks/
Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min et al.
We present ReasonIR-8B, the first retriever specifically trained for general
reasoning tasks. Existing retrievers have shown limited gains on reasoning
tasks, in part because existing training datasets focus on short factual
queries tied to documents that straightforwardly answer them. We develop a
synthetic data generation pipeline that, for each document, creates a
challenging and relevant query, along with a plausibly related but
ultimately unhelpful hard negative. By training on a mixture of our synthetic
data and existing public data, ReasonIR-8B achieves a new state-of-the-art of
29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a
widely-used reasoning-intensive information retrieval (IR) benchmark. When
applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4%
and 22.6% respectively, relative to the closed-book baseline, outperforming
other retrievers and search engines. In addition, ReasonIR-8B uses test-time
compute more effectively: on BRIGHT, its performance consistently increases
with longer and more information-rich rewritten queries; it continues to
outperform other retrievers when combined with an LLM reranker. Our training
recipe is general and can be easily extended to future LLMs; to this end, we
open-source our code, data, and model.
Authors' comments: Our code is released at
https://github.com/facebookresearch/ReasonIR
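For readers unfamiliar with the nDCG@10 metric quoted in the abstract above, the following is a minimal sketch of how it is commonly computed. It is not taken from the ReasonIR code. It assumes the linear-gain variant, and it assumes that the list passed in holds the graded relevance labels of all judged documents in ranked order, so that the ideal ranking can be obtained by sorting that same list. The function names are illustrative. Benchmarks typically report the value multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    # Rank positions start at 1, so the discount for list index i is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    # Assumption: the ideal ranking is the same labels sorted in descending order.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A relevant document at rank 2 instead of rank 1 is discounted by 1 / log2(3).
score = ndcg_at_k([0, 1])
```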
Jingfen Qiao, Thong Nguyen, Evangelos Kanoulas, Andrew Yates
Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. With the advent of large-scale pre-trained language models, their capability to generate sparse representations for retrieval tasks across different transformer-based architectures, including encoder-only, decoder-only, and encoder-decoder models, remains largely unexplored. This study investigates the effectiveness of LSR across these architectures, exploring various sparse representation heads and model scales. Our results highlight the limitations of using large language models to create effective sparse representations in zero-shot settings, identifying challenges such as inappropriate term expansions and reduced performance due to the lack of expansion. We find that the encoder-decoder architecture with a multi-token decoding approach achieves the best performance among the three backbones. While the decoder-only model performs worse than the encoder-only model, it demonstrates the potential to outperform it when scaled to a high number of parameters.
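As a rough illustration of what a sparse representation means in the abstract above, the sketch below scores a query against a document when both are represented as term-to-weight mappings. It makes no claim about the models studied in the paper: the weights are invented for the example, and `sparse_dot` is a hypothetical helper name. It does show why term expansion matters, since a document can only match on terms that carry non-zero weight on both sides.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller mapping to keep the loop short.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

# Illustrative weights only; a learned sparse encoder would produce these.
doc = {"heart": 0.5, "cardiac": 0.9}
query_plain = {"heart": 1.2, "attack": 0.8}
query_expanded = {"heart": 1.2, "attack": 0.8, "cardiac": 0.4}  # expansion adds "cardiac"

plain_score = sparse_dot(query_plain, doc)
expanded_score = sparse_dot(query_expanded, doc)
```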
Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Large language model (LLM) agents are increasingly employing
retrieval-augmented generation (RAG) to improve the factuality of their
responses. However, in practice, these systems often need to handle ambiguous
user queries and potentially conflicting information from multiple sources
while also suppressing inaccurate information from noisy or irrelevant
documents. Prior work has generally studied and addressed these challenges in
isolation, considering only one aspect at a time, such as handling ambiguity or
robustness to noise and misinformation. We instead consider multiple factors
simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and
Misinformation in Documents), a new dataset that simulates complex and
realistic scenarios for conflicting evidence for a user query, including
ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent
approach in which LLM agents debate over the merits of an answer over multiple
rounds, allowing an aggregator to collate responses corresponding to
disambiguated entities while discarding misinformation and noise, thereby
handling diverse sources of conflict jointly. We demonstrate the effectiveness
of MADAM-RAG using both closed and open-source models on AmbigDocs -- which
requires presenting all valid answers for ambiguous queries -- improving over
strong RAG baselines by up to 11.40% and on FaithEval -- which requires
suppressing misinformation -- where we improve by up to 15.80% (absolute) with
Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for
existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match
score). While MADAM-RAG begins to address these conflicting factors, our
analysis indicates that a substantial gap remains especially when increasing
the level of imbalance in supporting evidence and misinformation.
Authors' comments: Our data and code are available at:
https://github.com/HanNight/RAMDocs
Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, Shichao Kan
Composed Image Retrieval (CIR) retrieves target images using a multi-modal
query that combines a reference image with text describing desired
modifications. The primary challenge is effectively fusing this visual and
textual information. Current cross-modal feature fusion approaches for CIR
exhibit an inherent bias in intention interpretation. These methods tend to
disproportionately emphasize either the reference image features
(visual-dominant fusion) or the textual modification intent (text-dominant
fusion through image-to-text conversion). Such an imbalanced representation
often fails to accurately capture and reflect the actual search intent of the
user in the retrieval results. To address this challenge, we propose TMCIR, a
novel framework that advances composed image retrieval through two key
innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP
encoders contrastively using intent-reflecting pseudo-target images,
synthesized from reference images and textual descriptions via a diffusion
model. This step enhances the text encoder's ability to capture nuanced
intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune
all encoders contrastively by comparing adaptive token-fusion features with the
target image. This mechanism dynamically balances visual and textual
representations within the contrastive learning pipeline, optimizing the
composed feature for retrieval. Extensive experiments on Fashion-IQ and CIRR
datasets demonstrate that TMCIR significantly outperforms state-of-the-art
methods, particularly in capturing nuanced user intent.
Authors' comments: arXiv admin note: text overlap with arXiv:2310.05473 by other authors
Hanmeng Zhong, Linqing Chen, Weilei Wang, Wentao Wu
Recently, the application of retrieval-augmented Large Language Models (LLMs) in specific domains has gained significant attention, especially in biopharmaceuticals. However, there is no benchmark specifically designed to evaluate LLMs in the biopharmaceutical domain. In this paper, we introduce the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored for evaluating LLMs' Query and Reference Understanding Capability (QRUC) in the biopharmaceutical domain, available in English, French, German and Chinese. In addition, traditional Question-Answering (QA) metrics like accuracy and exact match fall short in open-ended retrieval-augmented QA scenarios. To address this, we propose a citation-based classification method to evaluate how well LLMs understand the relationship between queries and references. We apply this method to evaluate the mainstream LLMs on BRAGE. Experimental results show that there is a significant gap in the biopharmaceutical QRUC of mainstream LLMs, and their QRUC needs to be improved.
Baolei Zhang, Yuxi Chen, Minghong Fang, Zhuqing Liu, Lihai Nie, Tong Li, Zheli Liu
Large language models (LLMs) have demonstrated impressive natural language processing abilities but face challenges such as hallucination and outdated knowledge. Retrieval-Augmented Generation (RAG) has emerged as a state-of-the-art approach to mitigate these issues. While RAG enhances LLM outputs, it remains vulnerable to poisoning attacks. Recent studies show that injecting poisoned text into the knowledge database can compromise RAG systems, but most existing attacks assume that the attacker can insert a sufficient number of poisoned texts per query to outnumber correct-answer texts in retrieval, an assumption that is often unrealistic. To address this limitation, we propose CorruptRAG, a practical poisoning attack against RAG systems in which the attacker injects only a single poisoned text, enhancing both feasibility and stealth. Extensive experiments across multiple datasets demonstrate that CorruptRAG achieves higher attack success rates compared to existing baselines.
Fengxia Liu, Zhiyong Zheng, Kun Tian, Yi Zhang, Heng Guo, Zhe Hu, Oleksiy Zhedanov, Zixian Gong
This paper introduces a novel lower bound on communication complexity using
quantum relative entropy and mutual information, refining previous classical
entropy-based results. By leveraging Uhlmann's lemma and quantum Pinsker
inequalities, the authors establish tighter bounds for information-theoretic
security, demonstrating that quantum protocols inherently outperform classical
counterparts in balancing privacy and efficiency. The paper also explores symmetric
Quantum Private Information Retrieval (QPIR) protocols that achieve sub-linear
communication complexity while ensuring robustness against specious
adversaries: a post-quantum-cryptography-based protocol that can be
authenticated for the specious server; a ring-LWE-based protocol for
post-quantum security in a single-server setting, ensuring robustness against
quantum attacks; and a multi-server protocol optimized for hardware practicality,
reducing implementation overhead while maintaining sub-linear efficiency. These
protocols address critical gaps in secure database queries, offering
exponential communication improvements over classical linear-complexity
methods. The work also analyzes security trade-offs under quantum specious
adversaries, providing theoretical guarantees for privacy and correctness.
Authors' comments: 11 pages, 1 figure
Sean MacAvaney, Antonio Mallia, Nicola Tonellotto
Multi-vector retrieval methods, exemplified by the ColBERT architecture, have
shown substantial promise for retrieval by providing strong trade-offs in terms
of retrieval latency and effectiveness. However, they come at a high cost in
terms of storage since a (potentially compressed) vector needs to be stored for
every token in the input collection. To overcome this issue, we propose
encoding documents to a fixed number of vectors, which are no longer
necessarily tied to the input tokens. Beyond reducing the storage costs, our
approach has the advantage that document representations become of a fixed size
on disk, allowing for better OS paging management. Through experiments using
the MSMARCO passage corpus and BEIR with the ColBERT-v2 architecture, a
representative multi-vector ranking model architecture, we find that passages
can be effectively encoded into a fixed number of vectors while retaining most
of the original effectiveness.
Authors' comments: ECIR 2025
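The late-interaction scoring used by ColBERT-style models can be sketched as follows. This is a simplified illustration in plain Python, not the authors' implementation: vectors are small lists of floats, similarity is a dot product, and the function names are made up for the example. The scoring function is indifferent to where the document vectors come from, which is why a document can be represented by a fixed number of vectors rather than one per token.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query vectors, of the best match among the document vectors."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

# Toy example: two query vectors scored against a document held as two vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, document)
```

With a fixed number of vectors per document, every document occupies the same number of bytes for a given dimension and precision, which is the storage property the abstract refers to.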
Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang
Recently, building retrieval-augmented generation (RAG) systems to enhance
the capability of large language models (LLMs) has become a common practice.
Especially in the legal domain, previous judicial decisions play a significant
role under the doctrine of stare decisis which emphasizes the importance of
making decisions based on (retrieved) prior documents. However, the overall
performance of a RAG system depends on many components: (1) retrieval corpora,
(2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation
metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of
RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces
to facilitate seamless experiments and investigate how changes in the
aforementioned five components affect the overall accuracy. We validated LRAGE
using multilingual legal benches including Korean (KBL), English (LegalBench),
and Chinese (LawBench) by demonstrating how the overall accuracy changes when
varying the five components mentioned above. The source code is available at
https://github.com/hoorangyee/LRAGE.
Authors' comments: 12 pages
Philippe Jaming, Rolando Perez III
The aim of this paper is to get a deeper understanding of the spaces of variable bandwidth introduced by Gröchenig and Klotz (What is variable bandwidth? Comm. Pure Appl. Math., 70 (2017), 2039-2083). In particular, we show that when the variation of the bandwidth is modeled by a step function with a finite number of jumps, the sign retrieval principle applies.
Yanhong Li, David Yunis, David McAllester, Jiawei Zhou
There has recently been considerable interest in incorporating information
retrieval into large language models (LLMs). Retrieval from a dynamically
expanding external corpus of text allows a model to incorporate current events
and can be viewed as a form of episodic memory. Here we demonstrate that
pre-processing the external corpus into semi-structured "atomic facts" makes
retrieval more efficient. More specifically, we demonstrate that our particular
form of atomic facts improves performance on various question answering tasks
when the amount of retrieved text is limited. Limiting the amount of retrieval
reduces the size of the context and improves inference efficiency.
Authors' comments: NAACL 2025 Main Conference
Jiali Cheng, Hadi Amiri
This study finds that existing information retrieval (IR) models show
significant biases based on the linguistic complexity of input queries,
performing well on linguistically simpler (or more complex) queries while
underperforming on linguistically more complex (or simpler) queries. To address
this issue, we propose EqualizeIR, a framework to mitigate linguistic biases in
IR models. EqualizeIR uses a linguistically biased weak learner to capture
linguistic biases in IR datasets and then trains a robust model by regularizing
and refining its predictions using the biased weak learner. This approach
effectively prevents the robust model from overfitting to specific linguistic
patterns in data. We propose four approaches for developing
linguistically-biased models. Extensive experiments on several datasets show
that our method reduces performance disparities across linguistically simple
and complex queries, while improving overall retrieval performance.
Authors' comments: NAACL 2025
Ahmed H. Salamah, Pierre McWhannel, Nicole Yan
Information retrieval systems have traditionally relied on exact term match methods such as BM25 for first-stage retrieval. However, recent advancements in neural network-based techniques have introduced a new method called dense retrieval. This approach uses a dual-encoder to create contextual embeddings that can be indexed and clustered efficiently at run-time, resulting in improved retrieval performance in Open-domain Question Answering systems. In this paper, we apply the dense retrieval technique to conversational search by conducting experiments on the CAsT benchmark dataset. We also propose an end-to-end conversational search system called GPT2QR+DPR, which incorporates various query reformulation strategies to improve retrieval accuracy. Our findings indicate that dense retrieval outperforms BM25 even without extensive fine-tuning. Our work contributes to the growing body of research on neural-based retrieval methods in conversational search, and highlights the potential of dense retrieval in improving retrieval accuracy in conversational search systems.
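For context on the exact term match baseline mentioned in the abstract above, here is a minimal sketch of BM25 scoring. It follows the widely used Okapi BM25 formula with a smoothed, non-negative IDF; the parameter defaults k1=1.2 and b=0.75 are common choices, not values taken from the paper. Documents are assumed to be pre-tokenized, non-empty lists of terms, and the function name is illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # exact match only: unseen terms contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [["dense", "retrieval", "model"], ["cooking", "pasta", "recipe"]]
scores = bm25_scores(["dense", "retrieval"], docs)
```

A document that shares no surface term with the query scores zero here, which is the limitation dense retrieval aims to address with contextual embeddings.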
Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods. The code of our proposed method is available at https://github.com/hhy-huang/HiRAG.
Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen
Text-to-Video Retrieval (TVR) aims to match videos with corresponding textual queries, yet the continual influx of new video content poses a significant challenge for maintaining system performance over time. In this work, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to overcome these limitations. Our analysis reveals that current TVR methods based on pre-trained models struggle to retain plasticity when adapting to new tasks, while existing continual learning approaches experience catastrophic forgetting, resulting in semantic misalignment between historical queries and stored video features. To address these challenges, we propose StableFusion, a novel CTVR framework comprising two main components: the Frame Fusion Adapter (FFA), which captures temporal dynamics in video content while preserving model flexibility, and the Task-Aware Mixture-of-Experts (TAME), which maintains consistent semantic alignment between queries across tasks and the stored video features. Comprehensive evaluations on two benchmark datasets under various task settings demonstrate that StableFusion outperforms existing continual learning and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks in the context of continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR
Qi Xu, Annie Qu
In the era of big data, large-scale, multi-modal datasets are increasingly ubiquitous, offering unprecedented opportunities for predictive modeling and scientific discovery. However, these datasets often exhibit complex heterogeneity, such as covariate shift, posterior drift, and missing modalities, that can hinder the accuracy of existing prediction algorithms. To address these challenges, we propose a novel Representation Retrieval ($R^2$) framework, which integrates a representation learning module (the representer) with a sparsity-induced machine learning model (the learner). Moreover, we introduce the notion of "integrativeness" for representers, characterized by the effective data sources used in learning representers, and propose a Selective Integration Penalty (SIP) to explicitly improve the property. Theoretically, we demonstrate that the $R^2$ framework relaxes the conventional full-sharing assumption in multi-task learning, allowing for partially shared structures, and that SIP can improve the convergence rate of the excess risk bound. Extensive simulation studies validate the empirical performance of our framework, and applications to two real-world datasets further confirm its superiority over existing approaches.
Yang Nan, Huichi Zhou, Xiaodan Xing, Giorgos Papanastasiou, Lei Zhu, Zhifan Gao, Alejandro F Fangi, Guang Yang
As artificial intelligence and digital medicine increasingly permeate healthcare systems, robust governance frameworks are essential to ensure ethical, secure, and effective implementation. In this context, medical image retrieval becomes a critical component of clinical data management, playing a vital role in decision-making and safeguarding patient information. Existing methods usually learn hash functions using bottleneck features, which fail to produce representative hash codes from blended embeddings. Although contrastive hashing has shown superior performance, current approaches often treat image retrieval as a classification task, using category labels to create positive/negative pairs. Moreover, many methods fail to address the out-of-distribution (OOD) issue when models encounter external OOD queries or adversarial attacks. In this work, we propose a novel method to consolidate knowledge of hierarchical features and optimisation functions. We formulate the knowledge consolidation by introducing Depth-aware Representation Fusion (DaRF) and Structure-aware Contrastive Hashing (SCH). DaRF adaptively integrates shallow and deep representations into blended features, and SCH incorporates image fingerprints to enhance the adaptability of positive/negative pairings. These blended features further facilitate OOD detection and content-based recommendation, contributing to a secure AI-driven healthcare environment. Moreover, we present a content-guided ranking to improve the robustness and reproducibility of retrieval results. Our comprehensive assessments demonstrate that the proposed method could effectively recognise OOD samples and significantly outperform existing approaches in medical image retrieval (p<0.05). In particular, our method achieves a 5.6-38.9% improvement in mean Average Precision on the anatomical radiology dataset.
Juseon-Do, Jaesung Hwang, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
This study investigates retrieval-augmented summarization by specifically
examining the impact of exemplar summary lengths under length constraints, not
covered by previous work. We propose a Diverse Length-aware Maximal Marginal
Relevance (DL-MMR) algorithm to better control summary lengths. This algorithm
combines the query relevance with diverse target lengths in retrieval-augmented
summarization. Unlike previous methods that necessitate exhaustive
exemplar-to-exemplar relevance comparisons using MMR, DL-MMR considers the exemplar target
length as well and avoids comparing exemplars to each other, thereby reducing
computational cost and conserving memory during the construction of an exemplar
pool. Experimental results showed the effectiveness of DL-MMR, which considers
length diversity, compared to the original MMR algorithm. DL-MMR additionally
reduced memory use by a factor of 781,513 and computational cost by a factor
of 500,092, while maintaining the same level of
informativeness.
Authors' comments: 12 pages, accepted to NAACL 2025 Findings
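To make the baseline in the abstract above concrete, the sketch below implements the classic greedy Maximal Marginal Relevance (MMR) selection, not DL-MMR itself. It assumes precomputed similarity scores: `query_sim[i]` is the relevance of candidate i to the query, and `pairwise_sim[i][j]` is the similarity between candidates i and j. The names and the trade-off default `lam=0.5` are illustrative. The need for the full pairwise matrix is the exemplar-to-exemplar comparison cost that the abstract says DL-MMR avoids.

```python
def mmr_select(query_sim, pairwise_sim, k, lam=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but distinct.
query_sim = [0.9, 0.85, 0.3]
pairwise_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
diverse = mmr_select(query_sim, pairwise_sim, k=2)
relevance_only = mmr_select(query_sim, pairwise_sim, k=2, lam=1.0)
```

In this toy input, the balanced setting skips the near-duplicate in favor of the distinct candidate, while `lam=1.0` reduces to ranking by relevance alone.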