Dae Yon Hwang, Bilal Taha, Harshit Pande, Yaroslav Nechaev
Despite the recent advancements in information retrieval (IR), zero-shot IR
remains a significant challenge, especially when dealing with new domains,
languages, and newly-released use cases that lack historical query traffic from
existing users. For such cases, it is common to use query augmentations
followed by fine-tuning pre-trained models on the document data paired with
synthetic queries. In this work, we propose a novel Universal Document Linking
(UDL) algorithm, which links similar documents to enhance synthetic query
generation across multiple datasets with different characteristics. UDL
leverages entropy for the choice of similarity models and named entity
recognition (NER) for the link decision of documents using similarity scores.
Our empirical studies demonstrate the effectiveness and universality of UDL
across diverse datasets and IR models, surpassing state-of-the-art methods in
zero-shot cases. The code for reproducibility is available at
https://github.com/eoduself/UDL
Authors' comments: Accepted for publication at EMNLP 2024 Main Conference
Seong-Il Park, Jay-Yoon Lee
Retrieval Augmented Language Models (RALMs) have gained significant attention
for their ability to generate accurate answers and improve efficiency. However,
RALMs are inherently vulnerable to imperfect information due to their reliance
on an imperfect retriever or knowledge source. Using plausible real-world
examples, we identify three common scenarios (unanswerable, adversarial, and
conflicting) in which retrieved document sets can confuse RALMs. We present the first
comprehensive investigation to assess how well RALMs detect and handle such
problematic scenarios. Among these scenarios, to systematically examine
adversarial robustness we propose a new adversarial attack method, Generative
model-based ADVersarial attack (GenADV) and a novel metric Robustness under
Additional Document (RAD). Our findings reveal that RALMs often fail to
identify the unanswerability or contradiction of a document set, which
frequently leads to hallucinations. Moreover, we show the addition of an
adversary significantly degrades RALM's performance, with the model becoming
even more vulnerable when the two scenarios overlap (adversarial+unanswerable).
Our research identifies critical areas for assessing and enhancing the
robustness of RALMs, laying the foundation for the development of more robust
models.
Authors' comments: Accepted for publication in Transactions of the Association for
Computational Linguistics (TACL)
Cody Clop, Yannick Teglia
Large Language Models (LLMs) have demonstrated remarkable capabilities in
generating coherent text but remain limited by the static nature of their
training data. Retrieval Augmented Generation (RAG) addresses this issue by
combining LLMs with up-to-date information retrieval, but it also expands the
attack surface of the system. This paper investigates prompt injection attacks
on RAG, focusing on malicious objectives beyond misinformation, such as
inserting harmful links, promoting unauthorized services, and initiating
denial-of-service behaviors. We build upon existing corpus poisoning techniques
and propose a novel backdoor attack aimed at the fine-tuning process of the
dense retriever component. Our experiments reveal that corpus poisoning can
achieve significant attack success rates through the injection of a small
number of compromised documents into the retriever corpus. In contrast,
backdoor attacks demonstrate even higher success rates but necessitate a more
complex setup, as the victim must fine-tune the retriever using the attacker's
poisoned dataset.
Authors' comments: 12 pages, 5 figures
Xiangci Li, Jessica Ouyang
Retrieval-augmented generation (RAG) has emerged as a powerful method for enhancing natural language generation by integrating external knowledge into a model's output. While prior work has demonstrated the importance of improving knowledge retrieval for boosting generation quality, the role of knowledge selection remains less clear. In this paper, we perform a comprehensive analysis of how knowledge retrieval and selection influence downstream generation performance in RAG systems. By simulating different retrieval and selection conditions through a controlled mixture of gold and distractor knowledge, we assess the impact of these factors on generation outcomes. Our findings indicate that the downstream generator model's capability, as well as the complexity of the task and dataset, significantly influence the impact of knowledge retrieval and selection on the overall RAG system performance. In typical scenarios, improving the knowledge recall score is key to enhancing generation outcomes, with the knowledge selector providing a limited additional benefit when a strong generator model is used on clear, well-defined tasks. For weaker generator models or more ambiguous tasks and datasets, the knowledge F1 score becomes a critical factor, and the knowledge selector plays a more prominent role in improving overall performance.
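The knowledge recall and knowledge F1 scores discussed in this abstract can be read as set-overlap metrics between the knowledge passed to the generator and the gold knowledge. The following is a minimal sketch under that assumption; the function name `knowledge_scores` and the item IDs are hypothetical and are not taken from the paper.

```python
def knowledge_scores(selected, gold):
    """Return (precision, recall, f1) of selected knowledge against gold knowledge."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# A controlled mixture: both gold items are present, plus two distractors.
precision, recall, f1 = knowledge_scores(["g1", "g2", "d1", "d2"], ["g1", "g2"])
print(precision, recall, round(f1, 3))
```

In this mixture recall is perfect while F1 is pulled down by the distractors, which is the kind of gap a knowledge selector would be expected to close.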
Pengfei Jin, Peng Shu, Sekeun Kim, Qing Xiao, Sifan Song, Cheng Chen, Tianming Liu, Xiang Li et al.
Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.
Bolei He, Nuo Chen, Xinran He, Lingyong Yan, Zhenkai Wei, Jinchang Luo, Zhen-Hua Ling
Retrieval Augmented Generation (RAG) aims to enhance Large Language
Models (LLMs) by incorporating extensive knowledge retrieved from external
sources. However, this approach faces two challenges. First, the
original queries may not be suitable for precise retrieval, resulting in
erroneous contextual knowledge. Second, the language model can easily
generate answers that are inconsistent with external references because of its
knowledge boundary limitations. To address these issues, we propose the
chain-of-verification (CoV-RAG) to enhance the external retrieval correctness
and internal generation consistency. Specifically, we integrate the
verification module into the RAG, engaging in scoring, judgment, and rewriting.
To correct external retrieval errors, CoV-RAG retrieves new knowledge using a
revised query. To correct internal generation errors, we unify QA and
verification tasks with Chain-of-Thought (CoT) reasoning during training. Our
comprehensive experiments across various LLMs demonstrate the effectiveness and
adaptability of CoV-RAG compared with other strong baselines. In particular,
CoV-RAG significantly surpasses state-of-the-art baselines across different LLM
backbones.
Authors' comments: Accepted to EMNLP 2024 Findings. 9 pages, 4 figures, 7 tables
Kasra Hosseini, Thomas Kober, Josip Krapac, Roland Vollgraf, Weiwei Cheng, Ana Peleteiro Ramallo
Evaluating production-level retrieval systems at scale is a crucial yet
challenging task due to the limited availability of a large pool of
well-trained human annotators. Large Language Models (LLMs) have the potential
to address this scaling issue and offer a viable alternative to humans for the
bulk of annotation tasks. In this paper, we propose a framework for assessing
product search engines in a large-scale e-commerce setting, leveraging
Multimodal LLMs for (i) generating tailored annotation guidelines for
individual queries, and (ii) conducting the subsequent annotation task. Our
method, validated through deployment on a large e-commerce platform,
demonstrates comparable quality to human annotations, significantly reduces
time and cost, facilitates rapid problem discovery, and provides an effective
solution for production-level quality control at scale.
Authors' comments: 13 pages, 5 figures, 4 Tables
Benjamin Clavié
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showcasing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, significantly above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints and all data used publicly available.
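The abstract does not say how the checkpoint merging step is carried out. One common way to combine a fine-tuned checkpoint with the checkpoint it started from is uniform parameter averaging. The sketch below illustrates only that generic technique, as an assumption rather than the paper's recipe; checkpoints are modeled as hypothetical dictionaries mapping parameter names to flat lists of floats.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints with identical keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    merged = {}
    for key in checkpoints[0]:
        # zip(*...) walks the same position of this parameter in every checkpoint.
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        merged[key] = [sum(values) / len(checkpoints) for values in columns]
    return merged


# Hypothetical toy checkpoints: the original model and a fine-tuned copy.
original = {"layer.weight": [1.0, 2.0]}
fine_tuned = {"layer.weight": [3.0, 4.0]}
print(average_checkpoints([original, fine_tuned]))
```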
Zeyu Chen, Pengfei Zhang, Kai Ye, Wei Dong, Xin Feng, Yana Zhang
The burgeoning short video industry has accelerated the advancement of
video-music retrieval technology, assisting content creators in selecting
appropriate music for their videos. In self-supervised training for
video-to-music retrieval, the video and music samples in the dataset are
separated from the same video work, so they are all one-to-one matches. This
does not reflect real usage. In reality, a video can use different music
as background music, and a piece of music can be used as background music for different
videos. Many videos and music that are not in a pair may be compatible, leading
to false negative noise in the dataset. A novel inter-intra modal (II) loss is
proposed as a solution. By reducing the variation of feature distribution
within the two modalities before and after the encoder, II loss can reduce the
model's overfitting to such noise without removing it in a costly and laborious
way. The video-music retrieval framework, II-CLVM (Contrastive Learning for
Video-Music Retrieval), incorporating the II Loss, achieves state-of-the-art
performance on the YouTube8M dataset. The framework II-CLVTM shows better
performance when retrieving music using multi-modal video information (such as
text in videos). Experiments are designed to show that II loss can effectively
alleviate the problem of false negative noise in retrieval tasks. Experiments
also show that II loss improves various self-supervised and supervised
uni-modal and cross-modal retrieval tasks, and can obtain good retrieval models
with a small number of training samples.
Authors' comments: 10 pages, 7 figures
Zhiyu An, Xianzhong Ding, Yen-Chun Fu, Cheng-Chung Chu, Yan Li, Wan Du
This paper introduces Golden-Retriever, designed to efficiently navigate vast industrial knowledge bases, overcoming challenges in traditional LLM fine-tuning and RAG frameworks with domain-specific jargon and context interpretation. Golden-Retriever incorporates a reflection-based question augmentation step before document retrieval, which involves identifying jargon, clarifying its meaning based on context, and augmenting the question accordingly. Specifically, our method extracts and lists all jargon and abbreviations in the input question, determines the context against a pre-defined list, and queries a jargon dictionary for extended definitions and descriptions. This comprehensive augmentation ensures the RAG framework retrieves the most relevant documents by providing clear context and resolving ambiguities, significantly improving retrieval accuracy. Evaluations using three open-source LLMs on a domain-specific question-answer dataset demonstrate Golden-Retriever's superior performance, providing a robust solution for efficiently integrating and querying industrial knowledge bases.
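The question augmentation steps described above (list the jargon, determine the context against a pre-defined list, look up definitions, augment the question) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: keyword matching stands in for the reflection steps, and the `JARGON` dictionary, the `CONTEXTS` list, and all function names are hypothetical.

```python
import re

# Hypothetical jargon dictionary; a real deployment would query a curated one.
JARGON = {
    "RAG": "retrieval-augmented generation",
    "KB": "knowledge base",
}

# Hypothetical pre-defined context list, each with trigger keywords.
CONTEXTS = {
    "search": ["retrieve", "query", "index"],
    "training": ["fine-tune", "loss"],
}


def find_jargon(question):
    """List dictionary terms that appear as whole words in the question."""
    return [term for term in JARGON if re.search(rf"\b{re.escape(term)}\b", question)]


def pick_context(question):
    """Return the first context whose keyword occurs in the question."""
    lowered = question.lower()
    for name, keywords in CONTEXTS.items():
        if any(keyword in lowered for keyword in keywords):
            return name
    return "general"


def augment(question):
    """Append the detected context and jargon definitions to the question."""
    augmented = f"{question} [context: {pick_context(question)}]"
    terms = find_jargon(question)
    if terms:
        notes = "; ".join(f"{term} = {JARGON[term]}" for term in terms)
        augmented += f" [jargon: {notes}]"
    return augmented


print(augment("How does RAG query the KB?"))
```

The augmented question, rather than the raw one, would then be passed to document retrieval.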
Nandan Thakur, Luiz Bonifacio, Maik Fröbe, Alexander Bondarenko, Ehsan Kamalloo, Martin Potthast, Matthias Hagen, Jimmy Lin
The zero-shot effectiveness of neural retrieval models is often evaluated on
the BEIR benchmark, a combination of different IR evaluation datasets.
Interestingly, previous studies found that particularly on the BEIR subset
Touché 2020, an argument retrieval task, neural retrieval models are
considerably less effective than BM25. Still, so far, no further investigation
has been conducted on what makes argument retrieval so "special". To more
deeply analyze the respective potential limits of neural retrieval models, we
run a reproducibility study on the Touché 2020 data. In our study, we focus
on two experiments: (i) a black-box evaluation (i.e., no model retraining),
incorporating a theoretical exploration using retrieval axioms, and (ii) a data
denoising evaluation involving post-hoc relevance judgments. Our black-box
evaluation reveals an inherent bias of neural models towards retrieving short
passages from the Touché 2020 data, and we also find that quite a few of the
neural models' results are unjudged in the Touché 2020 data. As many of the
short Touché passages are not argumentative and thus non-relevant per se, and
as the missing judgments complicate fair comparison, we denoise the Touché
2020 data by excluding very short passages (less than 20 words) and by
augmenting the unjudged data with post-hoc judgments following the Touché
guidelines. On the denoised data, the effectiveness of the neural models
improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code
and the augmented Touché 2020 dataset are available at
https://github.com/castorini/touche-error-analysis.
Authors' comments: SIGIR 2024 (Resource & Reproducibility Track)
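The first denoising step described above, excluding passages of fewer than 20 words, can be sketched as a simple corpus filter. The sketch assumes whitespace tokenization and a corpus stored as a dictionary from passage ID to text; both are illustrative assumptions, and the function name `drop_short_passages` is hypothetical rather than taken from the linked repository.

```python
def drop_short_passages(passages, min_words=20):
    """Keep only passages with at least `min_words` whitespace-separated tokens."""
    return {
        passage_id: text
        for passage_id, text in passages.items()
        if len(text.split()) >= min_words
    }


# Hypothetical toy corpus: one very short passage and one 25-word passage.
corpus = {
    "p1": "I agree.",
    "p2": " ".join(["word"] * 25),
}
print(sorted(drop_short_passages(corpus)))
```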
Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset, CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.
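For readers unfamiliar with the recall@1000 figure quoted above, recall@k is the fraction of relevant documents that appear among the top k retrieved results. A minimal sketch follows; the function name and document IDs are hypothetical, and this is not the benchmark's evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k of a ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking: one of the three cited cases is retrieved in the top 2.
ranking = ["case_a", "case_b", "case_c", "case_d"]
cited = ["case_b", "case_d", "case_z"]
print(recall_at_k(ranking, cited, k=2))
```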
Mingda Li, Xinyu Li, Yifan Chen, Wenfeng Xuan, Weinan Zhang
Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their
superiority in terms of factuality, they do not consistently outperform the
original retrieval-free Language Models (LMs). Our experiments reveal that this
example-level performance inconsistency exists not only between
retrieval-augmented and retrieval-free LMs but also among different retrievers.
To understand this phenomenon, we investigate the degeneration behavior of
RALMs and theoretically decompose it into four categories. Further analysis
based on our decomposition reveals that the innate difference in knowledge
sources and the unpredictable degeneration of the reader model contribute most
to the inconsistency. Drawing from our analysis, we introduce Ensemble of
Retrievers (EoR), a trainable framework that can adaptively retrieve from
different knowledge sources and effectively decrease unpredictable reader
errors. Our experiments on Open Domain Question Answering show that EoR
substantially improves performance over the RALM with a single retriever by
considerably reducing inconsistent behaviors.
Authors' comments: ACL 2024 (Findings)
Maya Anderson, Guy Amit, Abigail Goldsteen
Retrieval Augmented Generation (RAG) systems have shown great promise in
natural language processing. However, their reliance on data stored in a
retrieval database, which may contain proprietary or sensitive information,
introduces new privacy concerns. Specifically, an attacker may be able to infer
whether a certain text passage appears in the retrieval database by observing
the outputs of the RAG system, an attack known as a Membership Inference Attack
(MIA). Despite the significance of this threat, MIAs against RAG systems
remain under-explored. This study addresses this gap by introducing an
efficient and easy-to-use method for conducting MIA against RAG systems. We
demonstrate the effectiveness of our attack using two benchmark datasets and
multiple generative models, showing that the membership of a document in the
retrieval database can be efficiently determined through the creation of an
appropriate prompt in both black-box and gray-box settings. Moreover, we
introduce an initial defense strategy based on adding instructions to the RAG
template, which shows high effectiveness for some datasets and models. Our
findings highlight the importance of implementing security countermeasures in
deployed RAG systems and developing more advanced defenses to protect the
privacy and security of retrieval databases.
Authors' comments: 12 pages, 4 figures
Taolin Zhang, Dongyang Li, Qizhou Chen, Chengyu Wang, Longtao Huang, Hui Xue, Xiaofeng He, Jun Huang
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses, aiming to alleviate the hallucination problem. However, existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks without considering the interaction of fine-grained structural semantics between the retrieved documents and the LLMs. This issue is particularly important for accurate response generation as LLMs tend to "lose in the middle" when dealing with input prompts augmented with lengthy documents. In this work, we propose a new pipeline named "Reinforced Retriever-Reorder-Responder" (R^4) to learn document orderings for retrieval-augmented LLMs, thereby further enhancing their generation abilities while the large numbers of parameters of LLMs remain frozen. The reordering learning process is divided into two steps according to the quality of the generated responses: document order adjustment and document representation enhancement. Specifically, document order adjustment aims to organize retrieved document orderings into beginning, middle, and end positions based on graph attention learning, which maximizes the reinforced reward of response quality. Document representation enhancement further refines the representations of retrieved documents for responses of poor quality via document-level gradient adversarial learning. Extensive experiments demonstrate that our proposed pipeline achieves better factual question-answering performance on knowledge-intensive tasks compared to strong baselines across various public datasets. The source codes and trained models will be released upon paper acceptance.
Xin Jiang, Hao Tang, Rui Yan, Jinhui Tang, Zechao Li
Fine-grained image retrieval (FGIR) aims to learn visual representations that distinguish visually similar objects while maintaining generalization. Existing methods propose to generate discriminative features, but rarely consider the particularity of the FGIR task itself. This paper presents a meticulous analysis leading to the proposal of practical guidelines to identify subcategory-specific discrepancies and generate discriminative features to design effective FGIR models. These guidelines include emphasizing the object (G1), highlighting subcategory-specific discrepancies (G2), and employing an effective training strategy (G3). Following G1 and G2, we design a novel Dual Visual Filtering mechanism for the plain visual transformer, denoted as DVF, to capture subcategory-specific discrepancies. Specifically, the dual visual filtering mechanism comprises an object-oriented module and a semantic-oriented module. These components serve to magnify objects and identify discriminative regions, respectively. Following G3, we implement a discriminative model training strategy to improve the discriminability and generalization ability of DVF. Extensive analysis and ablation studies confirm the efficacy of our proposed guidelines. Without bells and whistles, the proposed DVF achieves state-of-the-art performance on three widely-used fine-grained datasets in closed-set and open-set settings.
Peter Baile Chen, Yi Zhang, Dan Roth
Retrieving relevant tables containing the necessary information to accurately
answer a given question over tables is critical to open-domain
question-answering (QA) systems. Previous methods assume the answer to such a
question can be found either in a single table or multiple tables identified
through question decomposition or rewriting. However, neither of these
approaches is sufficient, as many questions require retrieving multiple tables
and joining them through a join plan that cannot be discerned from the user
query itself. If the join plan is not considered in the retrieval stage, the
subsequent steps of reasoning and answering based on those retrieved tables are
likely to be incorrect. To address this problem, we introduce a method that
uncovers useful join relations for any query and database during table
retrieval. We use a novel re-ranking method formulated as a mixed-integer
program that considers not only table-query relevance but also table-table
relevance that requires inferring join relationships. Our method outperforms
the state-of-the-art approaches for table retrieval by up to 9.3% in F1 score
and for end-to-end QA by up to 5.4% in accuracy.
Authors' comments: ACL 2024. Dataset and code are available at
https://peterbaile.github.io/jar
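The idea of re-ranking on both table-query relevance and table-table (join) relevance can be illustrated with a toy objective. The paper formulates this as a mixed-integer program; the sketch below instead enumerates every subset of k tables by brute force, which is only feasible for tiny candidate sets. All scores, table names, the `weight` parameter, and the function name are hypothetical.

```python
from itertools import combinations


def select_tables(query_rel, join_rel, k, weight=1.0):
    """Pick the k tables maximizing query relevance plus pairwise join compatibility."""

    def score(subset):
        relevance = sum(query_rel[table] for table in subset)
        joins = sum(
            join_rel.get(frozenset(pair), 0.0) for pair in combinations(subset, 2)
        )
        return relevance + weight * joins

    # Brute-force search over all k-subsets; a MIP solver replaces this at scale.
    return max(combinations(sorted(query_rel), k), key=score)


# Hypothetical scores: "products" looks more relevant to the query than
# "customers", but "customers" joins well with "orders".
query_rel = {"orders": 0.9, "customers": 0.5, "products": 0.6}
join_rel = {
    frozenset({"orders", "customers"}): 0.8,
    frozenset({"orders", "products"}): 0.1,
}
print(select_tables(query_rel, join_rel, k=2))
```

With `weight=0.0` the selection falls back to the top-k tables by query relevance alone, which makes the contribution of the join term easy to see.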
Shashi Kant Gupta, Aditya Basu, Bradley Taylor, Anai Kothari, Hrituraj Singh
Retrieving information from EHR systems is essential for answering specific
questions about patient journeys and improving the delivery of clinical care.
Despite this fact, most EHR systems still rely on keyword-based searches. With
the advent of generative large language models (LLMs), retrieving information
can lead to better search and summarization capabilities. Such retrievers can
also feed Retrieval-augmented generation (RAG) pipelines to answer any query.
However, the task of retrieving information from real-world clinical data
contained within EHR systems in order to solve several downstream use cases is
challenging due to the difficulty in creating query-document support pairs. We
provide a blueprint for creating such datasets in an affordable manner using
large language models. Our method results in a retriever that is 30-50 F-1
points better than proprietary counterparts such as Ada and Mistral for oncology
data elements. We further compare our model, called Onco-Retriever, against
a fine-tuned PubMedBERT model as well. We conduct an extensive manual evaluation
on real-world EHR data along with latency analysis of the different models and
provide a path forward for healthcare organizations to build domain-specific
retrievers.
Authors' comments: 18 pages
Mingrui Wu, Sheng Cao
Recently, embedding-based retrieval, or dense retrieval, has shown state-of-the-art results compared with traditional sparse or bag-of-words-based approaches. This paper introduces a model-agnostic doc-level embedding framework through large language model (LLM) augmentation. It also improves several important components of the retrieval model training process, such as negative sampling and the loss function. By implementing this LLM-augmented retrieval framework, we have been able to significantly improve the effectiveness of widely-used retriever models such as Bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), thereby achieving state-of-the-art results on LoTTE datasets and BEIR datasets.
Maxime Bouthors, Josep Crego, Francois Yvon
Retrieval-Augmented Neural Machine Translation (RAMT) architectures retrieve examples from memory to guide the generation process. While most works in this trend explore new ways to exploit the retrieved examples, the upstream retrieval step is mostly unexplored. In this paper, we study the effect of varying retrieval methods for several translation architectures, to better understand the interplay between these two processes. We conduct experiments in two language pairs in a multi-domain setting and consider several downstream architectures based on a standard autoregressive model, an edit-based model, and a large language model with in-context learning. Our experiments show that the choice of the retrieval technique impacts the translation scores, with variance across architectures. We also discuss the effects of increasing the number and diversity of examples, which are mostly positive across the board.