Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset, CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models achieve only 48.3% recall@1000.
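The recall@1000 figure can be read through a minimal sketch of recall@k. The function name `recall_at_k` and the toy case identifiers below are illustrative assumptions, not part of any CLERC tooling.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(relevant & top_k) / len(relevant)


# Hypothetical ranking returned by a retriever for one query.
ranked = ["case_7", "case_2", "case_9", "case_4", "case_1"]
# Hypothetical set of precedents actually cited in the analysis.
cited = ["case_2", "case_1", "case_5"]

score_at_3 = recall_at_k(ranked, cited, k=3)  # only case_2 is in the top 3
score_at_5 = recall_at_k(ranked, cited, k=5)  # case_2 and case_1 are found
```

In practice the score is averaged over all queries; a recall@1000 of 48.3% means that, on average, fewer than half of the cited cases appear among the first 1000 retrieved candidates.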
Mingda Li, Xinyu Li, Yifan Chen, Wenfeng Xuan, Weinan Zhang
Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their
superiority in terms of factuality, they do not consistently outperform the
original retrieval-free Language Models (LMs). Our experiments reveal that this
example-level performance inconsistency exists not only between
retrieval-augmented and retrieval-free LM but also among different retrievers.
To understand this phenomenon, we investigate the degeneration behavior of
RALMs and theoretically decompose it into four categories. Further analysis
based on our decomposition reveals that the innate difference in knowledge
sources and the unpredictable degeneration of the reader model contribute most
to the inconsistency. Drawing from our analysis, we introduce Ensemble of
Retrievers (EoR), a trainable framework that can adaptively retrieve from
different knowledge sources and effectively decrease unpredictable reader
errors. Our experiments on Open Domain Question Answering show that EoR
substantially improves performance over the RALM with a single retriever by
considerably reducing inconsistent behaviors.
Authors' comments: ACL 2024 (findings)
Maya Anderson, Guy Amit, Abigail Goldsteen
Retrieval Augmented Generation (RAG) systems have shown great promise in
natural language processing. However, their reliance on data stored in a
retrieval database, which may contain proprietary or sensitive information,
introduces new privacy concerns. Specifically, an attacker may be able to infer
whether a certain text passage appears in the retrieval database by observing
the outputs of the RAG system, an attack known as a Membership Inference Attack
(MIA). Despite the significance of this threat, MIAs against RAG systems have
remained under-explored. This study addresses this gap by introducing an
efficient and easy-to-use method for conducting MIA against RAG systems. We
demonstrate the effectiveness of our attack using two benchmark datasets and
multiple generative models, showing that the membership of a document in the
retrieval database can be efficiently determined through the creation of an
appropriate prompt in both black-box and gray-box settings. Moreover, we
introduce an initial defense strategy based on adding instructions to the RAG
template, which shows high effectiveness for some datasets and models. Our
findings highlight the importance of implementing security countermeasures in
deployed RAG systems and developing more advanced defenses to protect the
privacy and security of retrieval databases.
Authors' comments: 12 pages, 4 figures
Taolin Zhang, Dongyang Li, Qizhou Chen, Chengyu Wang, Longtao Huang, Hui Xue, Xiaofeng He, Jun Huang
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses, aiming to alleviate the hallucination problem. However, existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks without considering the interaction of fine-grained structural semantics between the retrieved documents and the LLMs. This issue is particularly important for accurate response generation as LLMs tend to "lose in the middle" when dealing with input prompts augmented with lengthy documents. In this work, we propose a new pipeline named "Reinforced Retriever-Reorder-Responder" (R$^4$) to learn document orderings for retrieval-augmented LLMs, thereby further enhancing their generation abilities while the large numbers of parameters of LLMs remain frozen. The reordering learning process is divided into two steps according to the quality of the generated responses: document order adjustment and document representation enhancement. Specifically, document order adjustment aims to organize retrieved document orderings into beginning, middle, and end positions based on graph attention learning, which maximizes the reinforced reward of response quality. Document representation enhancement further refines the representations of retrieved documents for responses of poor quality via document-level gradient adversarial learning. Extensive experiments demonstrate that our proposed pipeline achieves better factual question-answering performance on knowledge-intensive tasks compared to strong baselines across various public datasets. The source code and trained models will be released upon paper acceptance.
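As a rough illustration of why document position matters, the sketch below applies a fixed heuristic that places the highest-scoring documents at the beginning and end of the prompt and the weakest in the middle. This is not the R$^4$ method, which learns orderings through graph attention and a reinforced reward; the function name `edge_first_order` and the scores are hypothetical.

```python
def edge_first_order(docs_with_scores):
    """Order documents so the strongest sit at the two ends of the context.

    docs_with_scores: list of (document, relevance_score) pairs.
    Returns the documents only, reordered.
    """
    ranked = sorted(docs_with_scores, key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for i, (doc, _score) in enumerate(ranked):
        # Alternate: rank 1 goes to the front, rank 2 to the back, and so on.
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]


# Hypothetical retrieved documents with relevance scores.
retrieved = [("a", 0.9), ("b", 0.7), ("c", 0.5), ("d", 0.3), ("e", 0.1)]
ordered = edge_first_order(retrieved)
```

Here the best document "a" opens the prompt, the second-best "b" closes it, and the weakest "e" lands in the middle, where models are reported to attend least.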
Xin Jiang, Hao Tang, Rui Yan, Jinhui Tang, Zechao Li
Fine-grained image retrieval (FGIR) aims to learn visual representations that distinguish visually similar objects while maintaining generalization. Existing methods propose to generate discriminative features, but rarely consider the particularity of the FGIR task itself. This paper presents a meticulous analysis leading to the proposal of practical guidelines to identify subcategory-specific discrepancies and generate discriminative features to design effective FGIR models. These guidelines include emphasizing the object (G1), highlighting subcategory-specific discrepancies (G2), and employing an effective training strategy (G3). Following G1 and G2, we design a novel Dual Visual Filtering mechanism for the plain visual transformer, denoted as DVF, to capture subcategory-specific discrepancies. Specifically, the dual visual filtering mechanism comprises an object-oriented module and a semantic-oriented module. These components serve to magnify objects and identify discriminative regions, respectively. Following G3, we implement a discriminative model training strategy to improve the discriminability and generalization ability of DVF. Extensive analysis and ablation studies confirm the efficacy of our proposed guidelines. Without bells and whistles, the proposed DVF achieves state-of-the-art performance on three widely-used fine-grained datasets in closed-set and open-set settings.
Peter Baile Chen, Yi Zhang, Dan Roth
Retrieving relevant tables containing the necessary information to accurately
answer a given question over tables is critical to open-domain
question-answering (QA) systems. Previous methods assume the answer to such a
question can be found either in a single table or multiple tables identified
through question decomposition or rewriting. However, neither of these
approaches is sufficient, as many questions require retrieving multiple tables
and joining them through a join plan that cannot be discerned from the user
query itself. If the join plan is not considered in the retrieval stage, the
subsequent steps of reasoning and answering based on those retrieved tables are
likely to be incorrect. To address this problem, we introduce a method that
uncovers useful join relations for any query and database during table
retrieval. We use a novel re-ranking method formulated as a mixed-integer
program that considers not only table-query relevance but also table-table
relevance that requires inferring join relationships. Our method outperforms
the state-of-the-art approaches for table retrieval by up to 9.3% in F1 score
and for end-to-end QA by up to 5.4% in accuracy.
Authors' comments: ACL 2024. Dataset and code are available at
https://peterbaile.github.io/jar
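One way to picture a re-ranking objective that combines table-query relevance with table-table (join) relevance is the brute-force sketch below. It is a toy under stated assumptions, not the authors' mixed-integer program: the scores, the `weight` parameter, and the function `select_tables` are hypothetical, and exhaustive search stands in for a MIP solver.

```python
from itertools import combinations


def select_tables(query_rel, join_rel, k, weight=1.0):
    """Pick the k tables maximizing query relevance plus pairwise join relevance.

    query_rel: dict mapping table name -> relevance to the query.
    join_rel:  dict mapping frozenset({t1, t2}) -> joinability score.
    """
    best_score, best_subset = float("-inf"), ()
    for subset in combinations(sorted(query_rel), k):
        score = sum(query_rel[t] for t in subset)
        score += weight * sum(
            join_rel.get(frozenset(pair), 0.0)
            for pair in combinations(subset, 2)
        )
        if score > best_score:
            best_score, best_subset = score, subset
    return list(best_subset), best_score


# Hypothetical scores: "logs" looks relevant on its own but joins with nothing.
query_rel = {"orders": 0.9, "customers": 0.4, "products": 0.5, "logs": 0.6}
join_rel = {
    frozenset({"orders", "customers"}): 0.8,
    frozenset({"orders", "products"}): 0.3,
}
chosen, total = select_tables(query_rel, join_rel, k=2)
```

Ranking by query relevance alone would return "orders" and "logs"; adding the join term selects "customers" and "orders" instead, which is the kind of effect the abstract attributes to considering join relationships during retrieval.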
Shashi Kant Gupta, Aditya Basu, Bradley Taylor, Anai Kothari, Hrituraj Singh
Retrieving information from EHR systems is essential for answering specific
questions about patient journeys and improving the delivery of clinical care.
Despite this fact, most EHR systems still rely on keyword-based searches. With
the advent of generative large language models (LLMs), retrieving information
can lead to better search and summarization capabilities. Such retrievers can
also feed Retrieval-augmented generation (RAG) pipelines to answer any query.
However, retrieving information from the real-world clinical data contained
within EHR systems to serve several downstream use cases is challenging
because query-document support pairs are difficult to create. We
provide a blueprint for creating such datasets in an affordable manner using
large language models. Our method results in a retriever that is 30-50 F-1
points better than proprietary counterparts such as Ada and Mistral for oncology
data elements. We further compare our model, called Onco-Retriever, against
a fine-tuned PubMedBERT model. We conduct an extensive manual evaluation
on real-world EHR data along with latency analysis of the different models and
provide a path forward for healthcare organizations to build domain-specific
retrievers.
Authors' comments: 18 pages
Mingrui Wu, Sheng Cao
Recently, embedding-based retrieval, or dense retrieval, has shown state-of-the-art results compared with traditional sparse or bag-of-words-based approaches. This paper introduces a model-agnostic doc-level embedding framework through large language model (LLM) augmentation. It also improves important components of the retrieval model training process, such as negative sampling and the loss function. By implementing this LLM-augmented retrieval framework, we have been able to significantly improve the effectiveness of widely-used retriever models such as Bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), thereby achieving state-of-the-art results on LoTTE datasets and BEIR datasets.
Maxime Bouthors, Josep Crego, Francois Yvon
Retrieval-Augmented Neural Machine Translation (RAMT) architectures retrieve examples from memory to guide the generation process. While most works in this trend explore new ways to exploit the retrieved examples, the upstream retrieval step is mostly unexplored. In this paper, we study the effect of varying retrieval methods for several translation architectures, to better understand the interplay between these two processes. We conduct experiments in two language pairs in a multi-domain setting and consider several downstream architectures based on a standard autoregressive model, an edit-based model, and a large language model with in-context learning. Our experiments show that the choice of the retrieval technique impacts the translation scores, with variance across architectures. We also discuss the effects of increasing the number and diversity of examples, which are mostly positive across the board.
Kunal Sawarkar, Abhilasha Mangal, Shivam Raj Solanki
Retrieval-Augmented Generation (RAG) is a prevalent approach to infuse a
private knowledge base of documents with Large Language Models (LLM) to build
Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes
increasingly challenging as the corpus of documents scales up, with Retrievers
playing an outsized role in the overall RAG accuracy by extracting the most
relevant document from the corpus to provide context to the LLM. In this paper,
we propose the 'Blended RAG' method of leveraging semantic search techniques,
such as Dense Vector indexes and Sparse Encoder indexes, blended with hybrid
query strategies. Our study achieves better retrieval results and sets new
benchmarks for IR (Information Retrieval) datasets like NQ and TREC-COVID
datasets. We further extend such a 'Blended Retriever' to the RAG system to
demonstrate far superior results on Generative Q&A datasets like SQUAD, even
surpassing fine-tuning performance.
Authors' comments: Paper accepted by MIPR and presented at The 7th IEEE International
Conference on Multimedia Information Processing and Retrieval (IEEE-MIPR
2024)
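The abstract does not specify how dense and sparse results are blended. A common fusion technique, swapped in here purely for illustration, is reciprocal rank fusion (RRF); the function name, the constant `k=60`, and the document IDs are assumptions rather than details of the 'Blended RAG' method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are returned by descending fused score (ties broken by ID).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a dense index and a sparse index.
dense_hits = ["d1", "d2", "d3"]
sparse_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents found by both indexes ("d1", "d3") rise above those found by only one, which is the basic motivation for hybrid querying.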
Zihan Zhang, Meng Fang, Ling Chen
Adaptive retrieval-augmented generation (ARAG) aims to dynamically determine
the necessity of retrieval for queries instead of retrieving indiscriminately
to enhance the efficiency and relevance of the sourced information. However,
previous works largely overlook the evaluation of ARAG approaches, leading to
their effectiveness being understudied. This work presents a benchmark,
RetrievalQA, comprising 1,271 short-form questions covering new world and
long-tail knowledge. The knowledge necessary to answer the questions is absent
from LLMs; therefore, external information must be retrieved to answer
correctly. This makes RetrievalQA a suitable testbed to evaluate existing ARAG
methods. We observe that calibration-based methods heavily rely on threshold
tuning, while vanilla prompting is inadequate for guiding LLMs to make reliable
retrieval decisions. Based on our findings, we propose Time-Aware Adaptive
Retrieval (TA-ARE), a simple yet effective method that helps LLMs assess the
necessity of retrieval without calibration or additional training. The dataset
and code will be available at https://github.com/hyintell/RetrievalQA
Authors' comments: preprint
Qiaoyu Tang, Jiawei Chen, Zhuoqun Li, Bowen Yu, Yaojie Lu, Cheng Fu, Haiyang Yu, Hongyu Lin et al.
The rise of large language models (LLMs) has significantly transformed both
the construction and application of information retrieval (IR) systems.
However, current interactions between IR systems and LLMs remain limited, with
LLMs merely serving as components within IR systems, and IR systems
being constructed independently of LLMs. This separated architecture restricts
knowledge sharing and deep collaboration between them. In this paper, we
introduce Self-Retrieval, a novel end-to-end LLM-driven information retrieval
architecture. Self-Retrieval unifies all essential IR functions within a single
LLM, leveraging the inherent capabilities of LLMs throughout the IR process.
Specifically, Self-Retrieval internalizes the retrieval corpus through
self-supervised learning, transforms the retrieval process into sequential
passage generation, and performs relevance assessment for reranking.
Experimental results demonstrate that Self-Retrieval not only outperforms
existing retrieval approaches by a significant margin, but also substantially
enhances the performance of LLM-driven downstream applications like
retrieval-augmented generation.
Authors' comments: NeurIPS 2024 Camera-ready Version. Code:
https://github.com/icip-cas/SelfRetrieval
Seiji Maekawa, Hayate Iso, Sairam Gurajada, Nikita Bhutani
While large language models (LMs) demonstrate remarkable performance, they
encounter challenges in providing accurate responses when queried for
information beyond their pre-trained memorization. Although augmenting them
with relevant external information can mitigate these issues, failure to
consider the necessity of retrieval may adversely affect overall performance.
Previous research has primarily focused on examining how entities influence
retrieval models and knowledge recall in LMs, leaving other aspects relatively
unexplored. In this work, our goal is to offer a more detailed, fact-centric
analysis by exploring the effects of combinations of entities and relations. To
facilitate this, we construct a new question answering (QA) dataset called
WiTQA (Wikipedia Triple Question Answers). This dataset includes questions
about entities and relations of various popularity levels, each accompanied by
a supporting passage. Our extensive experiments with diverse LMs and retrievers
reveal when retrieval does not consistently enhance LMs from the viewpoint of
fact-centric popularity. Confirming earlier findings, we observe that larger LMs
excel in recalling popular facts. However, they notably encounter difficulty
with infrequent entity-relation pairs compared to retrievers. Interestingly,
they can effectively retain popular relations of less common entities. We
demonstrate the efficacy of our finer-grained metric and insights through an
adaptive retrieval system that selectively employs retrieval and recall based
on the frequencies of entities and relations in the question.
Authors' comments: NAACL2024 (main)
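A frequency-based adaptive retrieval rule of the kind described above might look like the sketch below. It encodes one possible reading of the findings (rely on recall when either the entity or the relation is popular, retrieve when both are rare); the thresholds, counts, and function name are hypothetical and are not taken from WiTQA.

```python
def should_retrieve(entity_count, relation_count,
                    entity_threshold=1000, relation_threshold=1000):
    """Return True when the question looks like a long-tail fact.

    entity_count / relation_count: hypothetical corpus frequencies of the
    question's subject entity and relation.
    """
    entity_is_rare = entity_count < entity_threshold
    relation_is_rare = relation_count < relation_threshold
    # Retrieve only when neither signal suggests the LM has memorized the fact.
    return entity_is_rare and relation_is_rare


popular_fact = should_retrieve(entity_count=50_000, relation_count=20_000)
rare_entity_popular_relation = should_retrieve(entity_count=12, relation_count=20_000)
long_tail_fact = should_retrieve(entity_count=12, relation_count=40)
```

Only the last call triggers retrieval; the first two rely on the LM's own recall.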
Shiyu Ni, Keping Bi, Jiafeng Guo, Xueqi Cheng
Large Language Models (LLMs) have been found to have difficulty knowing they do not possess certain knowledge and tend to provide specious answers in such cases. Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations. However, due to the extra overhead and unassured quality of retrieval, it may not be optimal to conduct RA all the time. A straightforward idea is to only conduct retrieval when LLMs are uncertain about a question. This motivates us to enhance the LLMs' ability to perceive their knowledge boundaries to help RA. In this paper, we first quantitatively measure this ability of LLMs and confirm their overconfidence. Then, we study how LLMs' certainty about a question correlates with their dependence on external retrieved information. We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence. Additionally, equipped with these methods, LLMs can achieve comparable or even better RA performance with far fewer retrieval calls.
Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng
Hallucinations pose a significant challenge for the practical implementation of large language models (LLMs). The utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of LLMs, potentially resulting in internal hallucinations. While incorporating external information can help fill knowledge gaps, it also introduces the risk of irrelevant information, thereby increasing the likelihood of external hallucinations. A careful and balanced integration of the parametric knowledge within LLMs with external information is crucial to alleviate hallucinations. In this study, we present Rowen, a novel approach that enhances LLMs with a selective retrieval augmentation process tailored to address hallucinated outputs. This process is governed by a multilingual semantic-aware detection module, which evaluates the consistency of the perturbed responses across various languages for the same queries. Upon detecting inconsistencies indicative of hallucinations, Rowen activates the retrieval of external information to rectify the model outputs. Rowen adeptly harmonizes the intrinsic parameters in LLMs with external knowledge sources, effectively mitigating hallucinations by ensuring a balanced integration of internal reasoning and external evidence. Through a comprehensive empirical analysis, we demonstrate that Rowen surpasses the current state-of-the-art in both detecting and mitigating hallucinated content within the outputs of LLMs.
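The idea of triggering retrieval only when responses disagree can be sketched as follows. This is a simplified stand-in, not Rowen's detection module: it assumes the responses to perturbed or translated queries have already been mapped back to a common language, and it uses exact string agreement where the paper describes a semantic-aware check. The function names and the threshold are hypothetical.

```python
from collections import Counter


def consistency_score(answers):
    """Share of answers that agree with the most frequent answer."""
    if not answers:
        raise ValueError("answers must not be empty")
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


def needs_retrieval(answers, threshold=0.75):
    """Trigger retrieval when agreement falls below the threshold."""
    return consistency_score(answers) < threshold


# Hypothetical answers to the same question asked in four languages.
stable = ["Paris", "paris ", "Paris", "Lyon"]
unstable = ["Paris", "Lyon", "Marseille", "Paris"]

retrieve_for_stable = needs_retrieval(stable)
retrieve_for_unstable = needs_retrieval(unstable)
```

The first set agrees on three of four answers and is left alone; the second agrees on only two and would be routed to retrieval.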
Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng Chua
The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters. Given a user query for visual content, the MLLM is anticipated to "recall" the relevant image from its parameters as the response. Achieving this target presents notable challenges, including inbuilt visual memory and visual recall schemes within MLLMs. To address these challenges, we introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images and involves two training steps: learning to memorize and learning to retrieve. The first step focuses on training the MLLM to memorize the association between images and their respective identifiers. The latter step teaches the MLLM to generate the corresponding identifier of the target image, given the textual query input. By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches. The experiments demonstrate that the generative paradigm performs effectively and efficiently even with large-scale image candidate sets.
Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, Bryan Hooi
Given a graph with textual attributes, we enable users to "chat with their graph": that is, to ask questions about the graph using a conversational interface. In response to a user's questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop our Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever approach, which integrates the strengths of GNNs, LLMs, and Retrieval-Augmented Generation (RAG), and can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM's context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and resists hallucination. (Our code and datasets are available at: https://github.com/XiaoxinHe/G-Retriever.)
Minbyul Jeong, Jiwoong Sohn, Mujeen Sung, Jaewoo Kang
Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generations. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed by searching documents from the knowledge corpus and appending them unconditionally or selectively to the input of LLMs for generation. However, when applying existing methods to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a framework reliable for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG, which can assess its generated explanations with customized reflective tokens. Our work proves that domain-specific components, such as a retriever, domain-related document corpus, and instruction sets are necessary for adhering to domain-related instructions. Using three major medical question-answering benchmark datasets, experimental results of Self-BioRAG demonstrate significant performance gains by achieving a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Overall, our analysis shows that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge as a medical expert does. We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains.
Lei Li, Jianxun Lian, Xiao Zhou, Xing Xie
Retrieval models aim at selecting a small set of item candidates which match
the preference of a given user. They play a vital role in large-scale
recommender systems since subsequent models such as rankers highly depend on
the quality of item candidates. However, most existing retrieval models employ
a single-round inference paradigm, which may not adequately capture the dynamic
nature of user preferences and may get stuck in one area of the item space. In this
paper, we propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for
recommender systems that iteratively refines user representations to better
capture potential candidates in the full item space. Ada-Retrieval comprises
two key modules: the item representation adapter and the user representation
adapter, designed to inject context information into items' and users'
representations. The framework maintains a model-agnostic design, allowing
seamless integration with various backbone models such as RNNs or Transformers.
We perform experiments on three widely used public datasets, incorporating five
powerful sequential recommenders as backbone models. Our results demonstrate
that Ada-Retrieval significantly enhances the performance of various base
models, with consistent improvements observed across different datasets. Our
code and data are publicly available at:
https://github.com/ll0ruc/Ada-Retrieval.
Authors' comments: 9 pages, Accepted to AAAI2024
Weicong Qin, Zelin Cao, Weijie Yu, Zihua Si, Sirui Chen, Jun Xu
Legal case retrieval and judgment prediction are crucial components in intelligent legal systems. In practice, determining whether two cases share the same charges through legal judgment prediction is essential for establishing their relevance in case retrieval. However, current studies on legal case retrieval merely focus on the semantic similarity between paired cases, ignoring their charge-level consistency. This separation leads to a lack of context and potential inaccuracies in the case retrieval that can undermine trust in the system's decision-making process. Given the guiding role of laws in both tasks and inspired by the success of generative retrieval, in this work, we propose to incorporate judgment prediction into legal case retrieval, achieving a novel law-aware Generative legal case retrieval method called Gear. Specifically, Gear first extracts rationales (key circumstances and key elements) for legal cases according to the definition of charges in laws, ensuring a shared and informative representation for both tasks. Then, in accordance with the inherent hierarchy of laws, we construct a law structure constraint tree and assign law-aware semantic identifier(s) to each case based on this tree. These designs enable a unified traversal from the root, through intermediate charge nodes, to case-specific leaf nodes, which correspond to the two tasks, respectively. Additionally, during training, we introduce a revision loss that jointly minimizes the discrepancy between the identifiers of predicted and labeled charges as well as retrieved cases, improving the accuracy and consistency for both tasks. Extensive experiments on two datasets demonstrate that Gear consistently outperforms state-of-the-art methods in legal case retrieval while maintaining competitive judgment prediction performance.