Aozhu Chen, Hazel Doughty, Xirong Li, Cees G. M. Snoek
Video-text retrieval has seen significant advancements, yet the ability of
models to discern subtle differences in captions still requires verification.
In this paper, we introduce a new approach for fine-grained evaluation. Our
approach can be applied to existing datasets by automatically generating hard
negative test captions with subtle single-word variations across nouns, verbs,
adjectives, adverbs, and prepositions. We perform comprehensive experiments
using four state-of-the-art models across two standard benchmarks (MSR-VTT and
VATEX) and two specially curated datasets enriched with detailed descriptions
(VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our
analyses show that the current evaluation benchmarks fall short in detecting a
model's ability to perceive subtle single-word differences, 2) our fine-grained
evaluation highlights the difficulty models face in distinguishing such subtle
variations. To enhance fine-grained understanding, we propose a new baseline
that can be easily combined with current methods. Experiments on our
fine-grained evaluations demonstrate that this approach enhances a model's
ability to understand fine-grained differences.
Authors' comments: Accepted to ACCV 2024
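As a rough illustration of the single-word hard negatives described in the abstract above, the sketch below swaps one word per caption using a small hand-written table. The table `SWAPS` and the function `hard_negatives` are hypothetical names introduced here; the paper generates its variations automatically and its actual procedure is not specified in the abstract.

```python
# Hypothetical substitution table, one entry per part of speech.
# These pairs are hand-picked for illustration only.
SWAPS = {
    "noun": {"man": "woman"},
    "verb": {"opens": "closes"},
    "adjective": {"red": "blue"},
    "adverb": {"slowly": "quickly"},
    "preposition": {"on": "under"},
}


def hard_negatives(caption: str) -> list[tuple[str, str]]:
    """Return (part_of_speech, negative) pairs that differ from the caption by one word."""
    words = caption.split()
    negatives = []
    for pos, table in SWAPS.items():
        for i, word in enumerate(words):
            if word in table:
                # Replace exactly one word and keep everything else unchanged.
                changed = words[:i] + [table[word]] + words[i + 1:]
                negatives.append((pos, " ".join(changed)))
    return negatives


caption = "a man slowly opens the red door"
for pos, negative in hard_negatives(caption):
    print(pos, "->", negative)
```

A fine-grained test would then check whether a retrieval model scores the original caption above each of these negatives for the same video.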
Amin Abolghasemi, Leif Azzopardi, Seyyed Hadi Hashemi, Maarten de Rijke, Suzan Verberne
Attributing answers to source documents is an approach used to enhance the verifiability of a model's output in retrieval augmented generation (RAG). Prior work has mainly focused on improving and evaluating the attribution quality of large language models (LLMs) in RAG, but this may come at the expense of inducing biases in the attribution of answers. We define and examine two aspects in the evaluation of LLMs in RAG pipelines, namely attribution sensitivity and bias with respect to authorship information. We explicitly inform an LLM about the authors of source documents, instruct it to attribute its answers, and analyze (i) how sensitive the LLM's output is to the author of source documents, and (ii) whether the LLM exhibits a bias towards human-written or AI-generated source documents. We design an experimental setup in which we use counterfactual evaluation to study three LLMs in terms of their attribution sensitivity and bias in RAG pipelines. Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3% to 18%. Moreover, we show that LLMs can have an attribution bias towards explicit human authorship, which can serve as a competing hypothesis for prior findings showing that LLM-generated content may be preferred over human-written content. Our findings indicate that metadata of source documents can influence LLMs' trust, and how they attribute their answers. Furthermore, our research highlights attribution bias and sensitivity as a novel aspect of brittleness in LLMs.
Zerui Xu, Fang Wu, Yuanyuan Zhang, Yue Zhao
Machine learning (ML) exhibits promise in the clinical domain. However, it is constrained by data scarcity and ethical considerations, as the generation of clinical trials presents significant challenges due to stringent privacy regulations, high costs, and the extended duration required for conducting studies with human participants. Despite the advancements of large language models (LLMs) in general generation tasks, their potential in facilitating the generation of synthetic clinical trials is under-explored. To address this gap, we introduce a novel Retrieval-Reasoning few-shot framework that leverages LLMs to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Experiments conducted on real clinical trials from the ClinicalTrials.gov database demonstrate that our synthetic data can effectively augment real datasets. Furthermore, by fine-tuning a pre-trained model as a binary classifier on synthetic clinical trial datasets, we demonstrate that this augmentation enhances model training for downstream tasks such as trial outcome prediction. Our findings suggest that LLMs for synthetic clinical trial generation hold promise for accelerating clinical research and upholding ethical standards for patient privacy. The code is publicly available at https://anonymous.4open.science/r/Retrieval_Reasoning_Clinical_Trial_Generation-3EC4.
Xuan Guo, Rohit Patki, Dante Everaert, Christopher Potts
The rapid introduction of new brand names into everyday language poses a unique challenge for e-commerce spelling correction services, which must distinguish genuine misspellings from novel brand names that use unconventional spelling. We seek to address this challenge via Retrieval Augmented Generation (RAG). In this approach, product names are retrieved from a catalog and incorporated into the context used by a large language model (LLM) that has been fine-tuned to do contextual spelling correction. Through quantitative evaluation and qualitative error analyses, we find that the RAG framework improves spelling correction over a stand-alone LLM. We also demonstrate the value of additional fine-tuning of the LLM to incorporate retrieved context.
Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song et al.
Retrieval-Augmented Generation (RAG) models are designed to incorporate
external knowledge, reducing hallucinations caused by insufficient parametric
(internal) knowledge. However, even with accurate and relevant retrieved
content, RAG models can still produce hallucinations by generating outputs that
conflict with the retrieved information. Detecting such hallucinations requires
disentangling how Large Language Models (LLMs) utilize external and parametric
knowledge. Current detection methods often focus on one of these mechanisms or
fail to decouple their intertwined effects, making accurate detection
difficult. In this paper, we investigate the internal mechanisms behind
hallucinations in RAG scenarios. We discover hallucinations occur when the
Knowledge FFNs in LLMs overemphasize parametric knowledge in the residual
stream, while Copying Heads fail to effectively retain or integrate external
knowledge from retrieved content. Based on these findings, we propose ReDeEP, a
novel method that detects hallucinations by decoupling an LLM's utilization of
external context and parametric knowledge. Our experiments show that ReDeEP
significantly improves RAG hallucination detection accuracy. Additionally, we
introduce AARF, which mitigates hallucinations by modulating the contributions
of Knowledge FFNs and Copying Heads.
Authors' comments: 23 pages
Xinping Zhao, Dongfang Li, Yan Zhong, Boren Hu, Yibin Chen, Baotian Hu, Min Zhang
Recent studies in Retrieval-Augmented Generation (RAG) have investigated
extracting evidence from retrieved passages to reduce computational costs and
enhance the final RAG performance, yet it remains challenging. Existing methods
heavily rely on heuristic-based augmentation, encountering several issues: (1)
Poor generalization due to hand-crafted context filtering; (2) Semantics
deficiency due to rule-based context chunking; (3) Skewed length due to
sentence-wise filter learning. To address these issues, we propose a
model-based evidence extraction learning framework, SEER, optimizing a vanilla
model as an evidence extractor with desired properties through self-aligned
learning. Extensive experiments show that our method largely improves the final
RAG performance, enhances the faithfulness, helpfulness, and conciseness of the
extracted evidence, and reduces the evidence length by a factor of 9.25. The code
will be available at https://github.com/HITsz-TMG/SEER.
Authors' comments: 15 pages, 6 figures, 5 tables. Accepted by EMNLP 2024 (main)
Xiao Peng, Liang Chen
Recently, large language models (LLMs) like ChatGPT, LLaMA, and Claude have
prevailed in countless domains, including legal scenarios. With LLMs' rapid
technological progress, the development of prompt engineering (PE) as an
interface between the LLMs and real-world applications has drawn the attention
of all developers. Various PE methods have been proposed to overcome real-world
challenges, such as few-shot prompting, chain-of-thought, and
retrieval-augmented generation (RAG). However, RAG for legal judgment
prediction (LJP) is still underexplored. To address this, we propose "Athena",
a novel framework that uses RAG as a core preprocessing component to enhance
LLMs' performance on specialized tasks. Athena constructs a knowledge base for
accusations, coupled with a semantic retrieval mechanism based on
vectorization. Our experiments show that Athena significantly improves overall
performance, achieving state-of-the-art results on the CAIL2018
dataset. Our ablation study on the in-context window size parameter further
reproduces LLMs' "lost-in-the-middle" phenomenon with a relative positional
variation. With moderate hyperparameter tuning, accuracy reaches up to 95%.
We also study the impact of query rewriting and data
distribution, providing possible directions for future research based on these
analyses.
Authors' comments: 13 pages, 6 figures
Wei Dai, Peng Fu, Chunjing Gan
In an era marked by robust technological growth and swift information
renewal, furnishing researchers and the populace with top-tier, avant-garde
academic insights spanning various domains has become an urgent necessity. The
KDD Cup 2024 AQA Challenge is geared towards advancing retrieval models to
identify pertinent academic terminologies from suitable papers for scientific
inquiries. This paper introduces the LLM-KnowSimFuser proposed by Robo Space,
which won 2nd place in the competition. Drawing inspiration from the
superior performance of LLMs on multiple tasks, and after careful analysis of the
provided datasets, we first perform fine-tuning and inference using
LLM-enhanced pre-trained retrieval models to introduce the tremendous language
understanding and open-domain knowledge of LLMs into this task, followed by a
weighted fusion based on the similarity matrix derived from the inference
results. Finally, experiments conducted on the competition datasets show the
superiority of our proposal, which achieved a score of 0.20726 on the final
leaderboard.
Authors' comments: The 2nd Place of KDD Cup 2024 OAG-Challenge AQA
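The weighted fusion step mentioned in the abstract above can be sketched as a weighted sum of query-by-paper similarity matrices produced by different retrieval models. This is a minimal sketch under assumptions: the function names, the toy matrices, and the fixed weights are all hypothetical, and the abstract does not say how the actual weights are derived.

```python
def fuse_similarities(matrices, weights):
    """Weighted sum of same-shaped query-by-paper similarity matrices."""
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per similarity matrix")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for matrix, weight in zip(matrices, weights):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += weight * matrix[r][c]
    return fused


def rank_papers(fused_row):
    """Return paper indices for one query, best fused score first."""
    return sorted(range(len(fused_row)), key=lambda c: fused_row[c], reverse=True)


# One query, three candidate papers, two hypothetical retrieval models.
model_a = [[0.9, 0.2, 0.4]]
model_b = [[0.1, 0.8, 0.5]]
fused = fuse_similarities([model_a, model_b], weights=[0.25, 0.75])
print(rank_papers(fused[0]))
```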
Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang
This paper presents EasyRAG, a simple, lightweight, and efficient
retrieval-augmented generation framework for network automated operations. The
advantages of our solution are: 1. Accurate Question Answering: We designed a
straightforward RAG scheme based on (1) a specific data processing workflow, (2)
dual-route sparse retrieval for coarse ranking, (3) an LLM Reranker for reranking, and
(4) LLM answer generation and optimization. This approach achieved first place
in the GLM4 track in the preliminary round and second place in the GLM4 track
in the semifinals. 2. Simple Deployment: Our method primarily consists of BM25
retrieval and BGE-reranker reranking, requires no fine-tuning of any models,
occupies minimal VRAM, and is easy to deploy and highly scalable; we provide a
flexible code library with various search and generation strategies,
facilitating custom process implementation. 3. Efficient Inference: We designed
an efficient inference acceleration scheme for the entire coarse ranking,
reranking, and generation process that significantly reduces the inference
latency of RAG while maintaining a good level of accuracy; each acceleration
scheme can be plugged into any component of the RAG process, consistently
enhancing the efficiency of the RAG system. Our code and data are released at
https://github.com/BUAADreamer/EasyRAG.
Authors' comments: 10 pages, 2 figures
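The coarse-ranking-then-reranking flow described in the EasyRAG abstract above can be sketched with a plain-Python BM25 scorer followed by a pluggable reranker. This is a minimal sketch, not the EasyRAG code: it uses a single sparse route rather than the dual-route setup, the BM25 variant and its `k1` and `b` defaults are assumptions, and `toy_rerank` is a hypothetical stand-in for a real reranking model.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a standard BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_then_rerank(query, docs, rerank_fn, top_k=2):
    """Coarse-rank with BM25, keep top_k, then reorder them with rerank_fn."""
    scores = bm25_scores(query, docs)
    coarse = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(coarse, key=lambda i: rerank_fn(query, docs[i]), reverse=True)


def toy_rerank(query, doc):
    """Hypothetical reranker: counts distinct query words found in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


docs = [
    "reset the router to restore the network",
    "the network cable is unplugged",
    "billing questions and invoices",
]
print(retrieve_then_rerank("router network reset", docs, toy_rerank))
```

Passing the reranker as a function keeps the coarse stage independent of whichever reranking model is used.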
Weiwei Sun, Zhengliang Shi, Jiulong Wu, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin et al.
Recent information retrieval (IR) models are pre-trained and
instruction-tuned on massive datasets and tasks, enabling them to perform well
on a wide range of tasks and potentially generalize to unseen tasks with
instructions. However, existing IR benchmarks focus on a limited scope of
tasks, making them insufficient for evaluating the latest IR models. In this
paper, we propose MAIR (Massive Instructed Retrieval Benchmark), a
heterogeneous IR benchmark that includes 126 distinct IR tasks across 6
domains, collected from existing datasets. We benchmark state-of-the-art
instruction-tuned text embedding models and re-ranking models. Our experiments
reveal that instruction-tuned models generally achieve superior performance
compared to non-instruction-tuned models on MAIR. Additionally, our results
suggest that current instruction-tuned text embedding models and re-ranking
models still lack effectiveness in specific long-tail tasks. MAIR is publicly
available at https://github.com/sunnweiwei/Mair.
Authors' comments: EMNLP 2024
Xinping Zhao, Yan Zhong, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Dongfang Li, Baotian Hu, Min Zhang
Retrieval-Augmented Generation (RAG) prevails in Large Language Models. It
mainly consists of retrieval and generation. The retrieval modules (a.k.a.
retrievers) aim to find useful information used to facilitate the generation
modules (a.k.a. generators). As such, generators' performance largely depends
on the effectiveness and efficiency of retrievers. However, the widely used
retrieval paradigm remains flat. It treats retrieval procedures as a one-off
deal with constant granularity. Despite its effectiveness, we argue that it
suffers from two limitations: (1) flat retrieval exerts a significant burden on
one retriever; (2) constant granularity limits the ceiling of retrieval
performance. In this work, we propose a progressive retrieval paradigm with
coarse-to-fine granularity for RAG, termed FunnelRAG, so as to balance
effectiveness and efficiency. Specifically, FunnelRAG establishes a progressive
retrieval pipeline by combining coarse-to-fine granularity, large-to-small
quantity, and low-to-high capacity, which can relieve the burden on one
retriever and also raise the ceiling of retrieval performance. Extensive
experiments demonstrate that FunnelRAG achieves comparable retrieval performance
while the time overhead is reduced by nearly 40 percent.
Authors' comments: 18 pages, 6 figures, 13 tables. Accepted by NAACL 2025
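The progressive, coarse-to-fine idea in the FunnelRAG abstract above can be sketched as a two-stage funnel: a cheap scorer ranks many large units, then a costlier scorer ranks the passages of the few survivors. Everything here is an assumption made for illustration, including the function names, the two toy scorers, and the choice of two stages; the abstract does not describe the actual retrievers.

```python
def overlap(query, text):
    """Cheap scorer: number of distinct query words present in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def density(query, text):
    """Stand-in for a higher-capacity scorer: overlap normalized by text length."""
    words = text.lower().split()
    return overlap(query, text) / len(words) if words else 0.0


def funnel_retrieve(query, documents, keep_docs=2, keep_passages=1):
    """documents is a list of documents, each given as a list of passages."""
    # Stage 1: coarse granularity, many candidates, cheap scorer.
    ranked_docs = sorted(
        documents, key=lambda doc: overlap(query, " ".join(doc)), reverse=True
    )
    survivors = ranked_docs[:keep_docs]
    # Stage 2: fine granularity, few candidates, costlier scorer.
    passages = [passage for doc in survivors for passage in doc]
    return sorted(passages, key=lambda p: density(query, p), reverse=True)[:keep_passages]


documents = [
    ["the funnel narrows candidates step by step", "retrieval uses coarse clusters first"],
    ["cooking pasta requires boiling water"],
    ["fine passages are ranked last", "weather today is sunny"],
]
print(funnel_retrieve("coarse clusters retrieval", documents))
```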
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang et al.
Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded as an image using a VLM and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20-40% end-to-end performance gain over traditional text-based RAG pipelines. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.
William Kuszmaul, Aaron Putterman, Tingqiang Xu, Hangrui Zhou, Renfei Zhou
Retrieval data structures are data structures that answer key-value queries
without paying the space overhead of explicitly storing keys. The problem can
be formulated in four settings (static, value-dynamic, incremental, or
dynamic), each of which offers different levels of dynamism to the user. In
this paper, we establish optimal bounds for the final two settings (incremental
and dynamic) in the case of a polynomial universe. Our results complete a line
of work that has spanned more than two decades, and also come with a surprise:
the incremental setting, which has long been viewed as essentially equivalent
to the dynamic one, actually has a phase transition, in which, as the value
size $v$ approaches $\log n$, the optimal space redundancy actually begins to
shrink, going from roughly $n \log \log n$ (which has long been thought to be
optimal) all the way down to $\Theta(n)$ (which is the optimal bound even for
the seemingly much-easier value-dynamic setting).
Authors' comments: 29 pages, in SODA 2025
Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, Ji-Rong Wen
Following natural instructions is crucial for the effective application of
Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in
Large Language Models (LLMs), research on assessing and improving
instruction-following (IF) alignment within the RAG domain remains limited. To
address this issue, we propose VIF-RAG, the first automated, scalable, and
verifiable synthetic pipeline for instruction-following alignment in RAG
systems. We start by manually crafting a minimal set of atomic instructions
(<100) and developing combination rules to synthesize and verify complex
instructions for a seed set. We then use supervised models for instruction
rewriting while simultaneously generating code to automate the verification of
instruction quality via a Python executor. Finally, we integrate these
instructions with extensive RAG and general data samples, scaling up to a
high-quality VIF-RAG-QA dataset (>100k) through automated processes. To further
bridge the gap in instruction-following auto-evaluation for RAG systems, we
introduce FollowRAG Benchmark, which includes approximately 3K test samples,
covering 22 categories of general instruction constraints and four
knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG
can seamlessly integrate with different RAG benchmarks. Using FollowRAG and
eight widely-used IF and foundational abilities benchmarks for LLMs, we
demonstrate that VIF-RAG markedly enhances LLM performance across a broad range
of general instruction constraints while effectively leveraging its
capabilities in RAG scenarios. Further analysis offers practical insights for
achieving IF alignment in RAG systems. Our code and datasets are released at
https://FollowRAG.github.io.
Authors' comments: Work in progress
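The code-based verification of instructions mentioned in the VIF-RAG abstract above can be sketched as small check functions that are executed against a model response. This is a minimal sketch under assumptions: the two atomic constraints and the names `max_words`, `must_include`, and `verify` are hypothetical and are not taken from the VIF-RAG pipeline.

```python
def max_words(limit):
    """Build a check that passes when the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def must_include(keyword):
    """Build a check that passes when the response mentions the keyword."""
    return lambda response: keyword.lower() in response.lower()


def verify(response, checks):
    """A combined instruction is satisfied only if every atomic check passes."""
    return all(check(response) for check in checks)


response = "Paris is the capital of France."
print(verify(response, [max_words(10), must_include("Paris")]))
print(verify(response, [max_words(3), must_include("Paris")]))
```

Composing atomic checks this way mirrors the idea of combining atomic instructions into complex ones while keeping each constraint automatically testable.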
David Beauchemin, Zachary Gagnon, Richard Khoury
Large Language Models (LLMs) perform outstandingly in various downstream
tasks, and the use of the Retrieval-Augmented Generation (RAG) architecture has
been shown to improve performance for legal question answering (Nuruzzaman and
Hussain, 2020; Louis et al., 2024). However, there are limited applications to
question answering over insurance documents, a specific type of legal document. This paper
introduces two corpora: the Quebec Automobile Insurance Expertise Reference
Corpus and a set of 82 Expert Answers to Layperson Automobile Insurance
Questions. Our study leverages both corpora to automatically and manually
assess how well GPT-4o, a state-of-the-art LLM, answers Quebec automobile insurance
questions. Our results demonstrate that, on average, using our expertise
reference corpus generates better responses on both automatic and manual
evaluation metrics. However, they also highlight that LLM QA is not reliable
enough for mass utilization in critical areas. Indeed, our results show that
between 5% and 13% of answered questions include a false statement that could
lead to customer misunderstanding.
Authors' comments: Accepted to NLLP 2024 EMNLP workshop
Jinyoung Park, Minseok Joo, Joo-Kyung Kim, Hyunwoo J. Kim
Knowledge graph-grounded dialog generation requires retrieving a
dialog-relevant subgraph from the given knowledge base graph and integrating it
with the dialog history. Previous works typically represent the graph using an
external encoder, such as graph neural networks, and retrieve relevant triplets
based on the similarity between single-vector representations of triplets and
the dialog history. However, these external encoders fail to leverage the rich
knowledge of pretrained language models, and the retrieval process is also
suboptimal due to the information bottleneck caused by the single-vector
abstraction of the dialog history. In this work, we propose Dialog generation
with Generative Subgraph Retrieval (DialogGSR), which retrieves relevant
knowledge subgraphs by directly generating their token sequences on top of
language models. For effective generative subgraph retrieval, we introduce two
key methods: (i) structure-aware knowledge graph linearization with
self-supervised graph-specific tokens and (ii) graph-constrained decoding
utilizing graph structural proximity-based entity informativeness scores for
valid and relevant generative retrieval. DialogGSR achieves state-of-the-art
performance in knowledge graph-grounded dialog generation, as demonstrated on
OpenDialKG and KOMODIS datasets.
Authors' comments: EMNLP (main)
Jingyuan Qi, Zhiyang Xu, Rulin Shao, Yang Chen, Jing Di, Yu Cheng, Qifan Wang, Lifu Huang
Current vision-language models (VLMs) still exhibit inferior performance on knowledge-intensive tasks, primarily due to the challenge of accurately encoding all the associations between visual objects and scenes to their corresponding entities and background knowledge. While retrieval augmentation methods offer an efficient way to integrate external knowledge, extending them to the vision-language domain presents unique challenges in (1) precisely retrieving relevant information from external sources due to the inherent discrepancy within the multimodal queries, and (2) being resilient to the irrelevant, extraneous and noisy information contained in the retrieved multimodal knowledge snippets. In this work, we introduce RORA-VLM, a novel and robust retrieval augmentation framework specifically tailored for VLMs, with two key innovations: (1) a 2-stage retrieval process with image-anchored textual-query expansion to synergistically combine the visual and textual information in the query and retrieve the most relevant multimodal knowledge snippets; and (2) a robust retrieval augmentation method that strengthens the resilience of VLMs against irrelevant information in the retrieved multimodal knowledge by injecting adversarial noise into the retrieval-augmented training process, and filters out extraneous visual information, such as unrelated entities presented in images, via a query-oriented visual token refinement strategy. We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets. Our results demonstrate that with a minimal number of training instances, RORA-VLM enables the base model to achieve significant performance improvement and consistently outperform state-of-the-art retrieval-augmented VLMs on all benchmarks while also exhibiting a novel zero-shot domain transfer capability.
Luyu Gao, Yunyi Zhang, Jamie Callan
Long-context modeling is one of the critical capabilities of language AI for digesting and reasoning over complex information pieces. In practice, long-context capabilities are typically built into a pre-trained language model (LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that the current open-weight generalist long-context models are still lacking in practical long-context processing tasks. While this means perfectly effective long-context modeling demands task-specific data, the cost can be prohibitive. In this paper, we draw inspiration from how humans process a large body of information: a lossy retrieval stage ranks a large set of documents while the reader ends up reading deeply only the top candidates. We build an automatic data synthesis pipeline that mimics this process using short-context LMs. The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities. Similar to how pre-training learns from imperfect data, we hypothesize and further demonstrate that the short-context model can bootstrap over the synthetic data, outperforming not only long-context generalist models but also the retrieval and read pipeline used to synthesize the training data in real-world tasks such as long-context retrieval augmented generation.
Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, Nanyun Peng
Existing multimodal retrieval benchmarks primarily focus on evaluating
whether models can retrieve and utilize external textual knowledge for question
answering. However, there are scenarios where retrieving visual information is
either more beneficial or easier to access than textual data. In this paper, we
introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in
which we systematically identify and categorize scenarios where visually
augmented knowledge is better than textual knowledge, for instance, more images
from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353
human-annotated multiple-choice questions across 9 distinct scenarios. With
MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large
vision-language models (LVLMs). Our results show that all LVLMs exhibit greater
improvements when augmented with images compared to textual knowledge,
confirming that MRAG-Bench is vision-centric. Additionally, we conduct
extensive analysis with MRAG-Bench, which offers valuable insights into
retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces
challenges in effectively leveraging retrieved knowledge, achieving only a
5.82% improvement with ground-truth information, in contrast to a 33.16%
improvement observed in human participants. These findings highlight the
importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability
to utilize retrieved visual knowledge more effectively.
Authors' comments: https://mragbench.github.io
Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, Ian Mackie, Jeff Dalton, Andrew Yates
Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained
transformers, which often split entities into nonsensical fragments. Splitting
entities can reduce retrieval accuracy and limit the model's ability to
incorporate up-to-date world knowledge not included in the training data. In
this work, we enhance the LSR vocabulary with Wikipedia concepts and entities,
enabling the model to resolve ambiguities more effectively and stay current
with evolving knowledge. Central to our approach is a Dynamic Vocabulary (DyVo)
head, which leverages existing entity embeddings and an entity retrieval
component that identifies entities relevant to a query or document. We use the
DyVo head to generate entity weights, which are then merged with word piece
weights to create joint representations for efficient indexing and retrieval
using an inverted index. In experiments across three entity-rich document
ranking datasets, the resulting DyVo model substantially outperforms
state-of-the-art baselines.
Authors' comments: https://github.com/thongnt99/DyVo
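The merging of entity weights with word piece weights described in the DyVo abstract above can be sketched as a union of two sparse weight maps that is then served from an inverted index. This is a minimal sketch under assumptions: the `ENT:` key prefix, the function names, and all weights are hypothetical, and in the actual model the weights would come from the learned heads rather than being written by hand.

```python
from collections import defaultdict


def joint_representation(wordpiece_weights, entity_weights):
    """Merge both sparse maps; entity keys get a prefix to avoid clashing with word pieces."""
    merged = dict(wordpiece_weights)
    merged.update({f"ENT:{name}": weight for name, weight in entity_weights.items()})
    return merged


def build_inverted_index(doc_reps):
    """Map each vocabulary key to the (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, rep in doc_reps.items():
        for key, weight in rep.items():
            index[key].append((doc_id, weight))
    return index


def search(index, query_rep):
    """Accumulate dot-product scores by walking only the query's posting lists."""
    scores = defaultdict(float)
    for key, q_weight in query_rep.items():
        for doc_id, d_weight in index.get(key, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


doc_reps = {
    "A": joint_representation({"apple": 0.9, "stock": 0.7}, {"Apple_Inc.": 1.2}),
    "B": joint_representation({"apple": 1.0, "pie": 0.9}, {"Apple_pie": 1.1}),
}
query_rep = joint_representation({"apple": 1.0, "stock": 0.8}, {"Apple_Inc.": 1.5})
print(search(build_inverted_index(doc_reps), query_rep))
```

In this toy case the shared entity key separates the company document from the dessert document even though both match the word piece "apple".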