Kunal Sawarkar, Abhilasha Mangal, Shivam Raj Solanki
Retrieval-Augmented Generation (RAG) is a prevalent approach to combining a
private knowledge base of documents with Large Language Models (LLMs) to build
Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes
increasingly challenging as the corpus of documents scales up, with Retrievers
playing an outsized role in the overall RAG accuracy by extracting the most
relevant document from the corpus to provide context to the LLM. In this paper,
we propose the 'Blended RAG' method of leveraging semantic search techniques,
such as Dense Vector indexes and Sparse Encoder indexes, blended with hybrid
query strategies. Our study achieves better retrieval results and sets new
benchmarks for IR (Information Retrieval) datasets like NQ and TREC-COVID.
We further extend such a 'Blended Retriever' to the RAG system to
demonstrate far superior results on Generative Q&A datasets like SQUAD, even
surpassing fine-tuning performance.
Authors' comments: Paper accepted by MIPR and presented at The 7th IEEE International
Conference on Multimedia Information Processing and Retrieval (IEEE-MIPR
2024)
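To make the idea of blending dense and sparse retrieval concrete, the following is a minimal sketch of one common fusion scheme: min-max normalize each retriever's scores, then take a weighted sum. The abstract does not say how Blended RAG combines its indexes, so the function names, the weight `alpha`, and the normalization choice are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def blend_scores(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    alpha: float = 0.5,  # hypothetical weight on the dense retriever
) -> List[Tuple[str, float]]:
    """Fuse two score maps into one ranking (highest fused score first)."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    docs = set(dense_n) | set(sparse_n)
    # A document missing from one retriever's results contributes 0 there.
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in docs
    }
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Normalizing first matters because dense similarity scores and sparse (for example BM25-style) scores live on different scales; without it, one retriever would dominate the sum regardless of `alpha`.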
Zihan Zhang, Meng Fang, Ling Chen
Adaptive retrieval-augmented generation (ARAG) aims to dynamically determine
the necessity of retrieval for queries instead of retrieving indiscriminately
to enhance the efficiency and relevance of the sourced information. However,
previous works largely overlook the evaluation of ARAG approaches, leading to
their effectiveness being understudied. This work presents a benchmark,
RetrievalQA, comprising 1,271 short-form questions covering new world and
long-tail knowledge. The knowledge necessary to answer the questions is absent
from LLMs; therefore, external information must be retrieved to answer
correctly. This makes RetrievalQA a suitable testbed to evaluate existing ARAG
methods. We observe that calibration-based methods heavily rely on threshold
tuning, while vanilla prompting is inadequate for guiding LLMs to make reliable
retrieval decisions. Based on our findings, we propose Time-Aware Adaptive
Retrieval (TA-ARE), a simple yet effective method that helps LLMs assess the
necessity of retrieval without calibration or additional training. The dataset
and code will be available at https://github.com/hyintell/RetrievalQA
Authors' comments: preprint
Qiaoyu Tang, Jiawei Chen, Zhuoqun Li, Bowen Yu, Yaojie Lu, Cheng Fu, Haiyang Yu, Hongyu Lin et al.
The rise of large language models (LLMs) has significantly transformed both
the construction and application of information retrieval (IR) systems.
However, current interactions between IR systems and LLMs remain limited, with
LLMs merely serving as components within IR systems, and IR systems
being constructed independently of LLMs. This separated architecture restricts
knowledge sharing and deep collaboration between them. In this paper, we
introduce Self-Retrieval, a novel end-to-end LLM-driven information retrieval
architecture. Self-Retrieval unifies all essential IR functions within a single
LLM, leveraging the inherent capabilities of LLMs throughout the IR process.
Specifically, Self-Retrieval internalizes the retrieval corpus through
self-supervised learning, transforms the retrieval process into sequential
passage generation, and performs relevance assessment for reranking.
Experimental results demonstrate that Self-Retrieval not only outperforms
existing retrieval approaches by a significant margin, but also substantially
enhances the performance of LLM-driven downstream applications like
retrieval-augmented generation.
Authors' comments: NeurIPS 2024 Camera-ready Version. Code:
https://github.com/icip-cas/SelfRetrieval
Seiji Maekawa, Hayate Iso, Sairam Gurajada, Nikita Bhutani
While large language models (LMs) demonstrate remarkable performance, they
encounter challenges in providing accurate responses when queried for
information beyond their pre-trained memorization. Although augmenting them
with relevant external information can mitigate these issues, failure to
consider the necessity of retrieval may adversely affect overall performance.
Previous research has primarily focused on examining how entities influence
retrieval models and knowledge recall in LMs, leaving other aspects relatively
unexplored. In this work, our goal is to offer a more detailed, fact-centric
analysis by exploring the effects of combinations of entities and relations. To
facilitate this, we construct a new question answering (QA) dataset called
WiTQA (Wikipedia Triple Question Answers). This dataset includes questions
about entities and relations of various popularity levels, each accompanied by
a supporting passage. Our extensive experiments with diverse LMs and retrievers
reveal when retrieval does not consistently enhance LMs from the viewpoint of
fact-centric popularity. Confirming earlier findings, we observe that larger LMs
excel in recalling popular facts. However, they notably encounter difficulty
with infrequent entity-relation pairs compared to retrievers. Interestingly,
they can effectively retain popular relations of less common entities. We
demonstrate the efficacy of our finer-grained metric and insights through an
adaptive retrieval system that selectively employs retrieval and recall based
on the frequencies of entities and relations in the question.
Authors' comments: NAACL2024 (main)
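As a rough illustration of what a frequency-based adaptive retrieval rule could look like, the sketch below decides between recall and retrieval from entity and relation counts. It encodes one possible reading of the findings above (popular facts, and popular relations of rare entities, can be recalled; the rest are retrieved). The thresholds and the function name are hypothetical and are not taken from the paper.

```python
def should_retrieve(
    entity_freq: int,
    relation_freq: int,
    entity_threshold: int = 1_000,     # hypothetical cut-off for a "popular" entity
    relation_threshold: int = 10_000,  # hypothetical cut-off for a "popular" relation
) -> bool:
    """Return True when the question should be routed to the retriever.

    Assumption: the LM is trusted to recall the fact when either the entity
    or the relation is popular; only rare entity-relation pairs are retrieved.
    """
    entity_is_popular = entity_freq >= entity_threshold
    relation_is_popular = relation_freq >= relation_threshold
    return not (entity_is_popular or relation_is_popular)
```

In practice the two thresholds would have to be tuned per model, since the abstract notes that larger LMs recall more of the popular facts.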
Shiyu Ni, Keping Bi, Jiafeng Guo, Xueqi Cheng
Large Language Models (LLMs) have been found to have difficulty recognizing when they do not possess certain knowledge and tend to provide specious answers in such cases. Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations. However, due to the extra overhead and unassured quality of retrieval, it may not be optimal to conduct RA all the time. A straightforward idea is to only conduct retrieval when LLMs are uncertain about a question. This motivates us to enhance the LLMs' ability to perceive their knowledge boundaries to help RA. In this paper, we first quantitatively measure this ability of LLMs and confirm their overconfidence. Then, we study how LLMs' certainty about a question correlates with their dependence on external retrieved information. We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence. Additionally, equipped with these methods, LLMs can achieve comparable or even better RA performance with far fewer retrieval calls.
Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng
Hallucinations pose a significant challenge for the practical implementation of large language models (LLMs). The utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of LLMs, potentially resulting in internal hallucinations. While incorporating external information can help fill knowledge gaps, it also introduces the risk of irrelevant information, thereby increasing the likelihood of external hallucinations. A careful and balanced integration of the parametric knowledge within LLMs with external information is crucial to alleviate hallucinations. In this study, we present Rowen, a novel approach that enhances LLMs with a selective retrieval augmentation process tailored to address hallucinated outputs. This process is governed by a multilingual semantic-aware detection module, which evaluates the consistency of the perturbed responses across various languages for the same queries. Upon detecting inconsistencies indicative of hallucinations, Rowen activates the retrieval of external information to rectify the model outputs. Rowen adeptly harmonizes the intrinsic parameters in LLMs with external knowledge sources, effectively mitigating hallucinations by ensuring a balanced integration of internal reasoning and external evidence. Through a comprehensive empirical analysis, we demonstrate that Rowen surpasses the current state-of-the-art in both detecting and mitigating hallucinated content within the outputs of LLMs.
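The consistency-triggered retrieval described above can be sketched as a simple agreement check. This is only a stand-in: Rowen's detection module is described as semantic-aware, whereas the sketch compares normalized strings, and it assumes the per-language answers have already been translated into one common language. The function name and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict


def needs_retrieval(
    answers_by_language: Dict[str, str],
    min_agreement: float = 0.75,  # hypothetical agreement threshold
) -> bool:
    """Flag a query for retrieval when answers across languages disagree.

    Assumption: answers are already mapped to a common language, so exact
    matching after lowercasing and trimming approximates semantic agreement.
    """
    normalized = [a.strip().lower() for a in answers_by_language.values()]
    if not normalized:
        return True  # no answer at all: fall back to retrieval
    most_common_count = Counter(normalized).most_common(1)[0][1]
    agreement = most_common_count / len(normalized)
    return agreement < min_agreement
```

When the function returns `False`, the model's own answer is kept and no retrieval call is made, which is where the selective scheme saves cost.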
Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng Chua
The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters. Given a user query for visual content, the MLLM is anticipated to "recall" the relevant image from its parameters as the response. Achieving this target presents notable challenges, including inbuilt visual memory and visual recall schemes within MLLMs. To address these challenges, we introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images and involves two training steps: learning to memorize and learning to retrieve. The first step focuses on training the MLLM to memorize the association between images and their respective identifiers. The latter step teaches the MLLM to generate the corresponding identifier of the target image, given the textual query input. By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches. The experiments demonstrate that the generative paradigm performs effectively and efficiently even with large-scale image candidate sets.
Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, Bryan Hooi
Given a graph with textual attributes, we enable users to 'chat with their graph': that is, to ask questions about the graph using a conversational interface. In response to a user's questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop our Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever approach, which integrates the strengths of GNNs, LLMs, and Retrieval-Augmented Generation (RAG), and can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM's context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and resists hallucination. (Our code and datasets are available at: https://github.com/XiaoxinHe/G-Retriever.)
Minbyul Jeong, Jiwoong Sohn, Mujeen Sung, Jaewoo Kang
Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generations. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed by searching documents from the knowledge corpus and appending them unconditionally or selectively to the input of LLMs for generation. However, when applying existing methods to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a framework reliable for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG that can assess its generated explanations with customized reflective tokens. Our work proves that domain-specific components, such as a retriever, domain-related document corpus, and instruction sets are necessary for adhering to domain-related instructions. Using three major medical question-answering benchmark datasets, experimental results of Self-BioRAG demonstrate significant performance gains by achieving a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Overall, we analyze that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge as a medical expert does. We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains.
Lei Li, Jianxun Lian, Xiao Zhou, Xing Xie
Retrieval models aim at selecting a small set of item candidates which match
the preference of a given user. They play a vital role in large-scale
recommender systems since subsequent models such as rankers highly depend on
the quality of item candidates. However, most existing retrieval models employ
a single-round inference paradigm, which may not adequately capture the dynamic
nature of user preferences and may get stuck in one area of the item space. In this
paper, we propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for
recommender systems that iteratively refines user representations to better
capture potential candidates in the full item space. Ada-Retrieval comprises
two key modules: the item representation adapter and the user representation
adapter, designed to inject context information into items' and users'
representations. The framework maintains a model-agnostic design, allowing
seamless integration with various backbone models such as RNNs or Transformers.
We perform experiments on three widely used public datasets, incorporating five
powerful sequential recommenders as backbone models. Our results demonstrate
that Ada-Retrieval significantly enhances the performance of various base
models, with consistent improvements observed across different datasets. Our
code and data are publicly available at:
https://github.com/ll0ruc/Ada-Retrieval.
Authors' comments: 9 pages, Accepted to AAAI2024
Weicong Qin, Zelin Cao, Weijie Yu, Zihua Si, Sirui Chen, Jun Xu
Legal case retrieval and judgment prediction are crucial components in intelligent legal systems. In practice, determining whether two cases share the same charges through legal judgment prediction is essential for establishing their relevance in case retrieval. However, current studies on legal case retrieval merely focus on the semantic similarity between paired cases, ignoring their charge-level consistency. This separation leads to a lack of context and potential inaccuracies in the case retrieval that can undermine trust in the system's decision-making process. Given the guidance role of laws to both tasks and inspired by the success of generative retrieval, in this work, we propose to incorporate judgment prediction into legal case retrieval, achieving a novel law-aware Generative legal case retrieval method called Gear. Specifically, Gear first extracts rationales (key circumstances and key elements) for legal cases according to the definition of charges in laws, ensuring a shared and informative representation for both tasks. Then in accordance with the inherent hierarchy of laws, we construct a law structure constraint tree and assign law-aware semantic identifier(s) to each case based on this tree. These designs enable a unified traversal from the root, through intermediate charge nodes, to case-specific leaf nodes, which respectively correspond to two tasks. Additionally, in the training, we also introduce a revision loss that jointly minimizes the discrepancy between the identifiers of predicted and labeled charges as well as retrieved cases, improving the accuracy and consistency for both tasks. Extensive experiments on two datasets demonstrate that Gear consistently outperforms state-of-the-art methods in legal case retrieval while maintaining competitive judgment prediction performance.
Anni Yue, Stephen L. Smith
Robotic-based compact storage and retrieval systems provide high-density
storage in distribution center and warehouse applications. In the system, items
are stored in bins, and the bins are organized inside a three-dimensional grid.
Robots move on top of the grid to retrieve and deliver bins. To retrieve a bin,
a robot removes all bins above one by one with its gripper, called bin digging.
The closer the target bin is to the top of the grid, the less digging is
required to retrieve the bin. In this paper, we propose a policy to optimally
arrange the bins in the grid while processing bin requests so that the most
frequently accessed bins remain near the top of the grid. This improves the
performance of the system and makes it responsive to changes in bin demand. Our
solution approach identifies the optimal bin arrangement in the storage
facility, initiates a transition to this optimal set-up, and subsequently
ensures the ongoing maintenance of this arrangement for optimal performance. We
perform extensive simulations on a custom-built discrete event model of the
system. Our simulation results show that under the proposed policy more than
half of the bins requested are located on top of the grid, reducing bin digging
compared to existing policies. Compared to existing approaches, the proposed
policy reduces the retrieval time of the requested bins by over 30% and the
number of bin requests that exceed certain time thresholds by nearly 50%.
Authors' comments: 35 pages, 16 figures, submitted to Transportation Science (INFORMS)
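The link between bin placement and digging effort can be illustrated with a deliberately simplified model: a single stack, static request probabilities, and a digging cost equal to the number of bins above the target. The paper's policy operates on a three-dimensional grid and rearranges bins while serving requests, which this sketch does not capture; all names below are hypothetical.

```python
from typing import Dict, List


def expected_digging(stack_top_to_bottom: List[str], request_prob: Dict[str, float]) -> float:
    """Expected number of bins removed per request for one stack.

    A bin at depth d (0 = top) requires digging out the d bins above it.
    """
    return sum(depth * request_prob[b] for depth, b in enumerate(stack_top_to_bottom))


def arrange_by_demand(request_prob: Dict[str, float]) -> List[str]:
    """Order bins top-to-bottom by descending request probability."""
    return sorted(request_prob, key=lambda b: (-request_prob[b], b))


# Toy demand profile (assumed values, chosen only for illustration).
demand = {"A": 0.125, "B": 0.5, "C": 0.25, "D": 0.125}
arbitrary_order = ["A", "B", "C", "D"]
demand_order = arrange_by_demand(demand)
```

With this toy profile, the demand-sorted stack places `B` on top, and `expected_digging` drops from 1.375 bins per request for the arbitrary order to 0.875 for the sorted one.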
Haoran Tang, Jieren Deng, Zhihong Pan, Hao Tian, Pratik Chaudhari, Xin Zhou
Diffusion-based methods have demonstrated remarkable capabilities in generating a diverse array of high-quality images, sparking interest in styled avatars, virtual try-on, and more. Previous methods use the same reference image as the target. An overlooked aspect is the leakage of the target's spatial information, style, etc. from the reference, harming the generated diversity and causing shortcuts. However, this approach continues as widely available datasets usually consist of single images not grouped by identities, and it is expensive to recollect large-scale same-identity data. Moreover, existing metrics adopt decoupled evaluation on text alignment and identity preservation, which fail at distinguishing between balanced outputs and those that over-fit to one aspect. In this paper, we propose a multi-level, same-identity dataset RetriBooru, which groups anime characters by both face and cloth identities. RetriBooru enables adopting reference images of the same character and outfits as the target, while keeping flexible gestures and actions. We benchmark previous methods on our dataset, and demonstrate the effectiveness of training with a reference image different from the target (but of the same identity). We introduce a new concept composition task, where the conditioning encoder learns to retrieve different concepts from several reference images, and modify a baseline network RetriNet for the new task. Finally, we introduce a novel class of metrics named Similarity Weighted Diversity (SWD), to measure the overlooked diversity and better evaluate the alignment between similarity and diversity.
So Kuroki, Mai Nishimura, Tadashi Kozuno
Due to the complex interactions between agents, learning a multi-agent control
policy often requires a prohibitive amount of data. This paper aims to enable
multi-agent systems to effectively utilize past memories to adapt to novel
collaborative tasks in a data-efficient fashion. We propose the Multi-Agent
Coordination Skill Database, a repository for storing a collection of
coordinated behaviors associated with key vectors distinctive to them. Our
Transformer-based skill encoder effectively captures spatio-temporal
interactions that contribute to coordination and provides a unique skill
representation for each coordinated behavior. By leveraging only a small number
of demonstrations of the target task, the database enables us to train the
policy using a dataset augmented with the retrieved demonstrations.
Experimental evaluations demonstrate that our method achieves a significantly
higher success rate in push manipulation tasks compared with baseline methods
like few-shot imitation learning. Furthermore, we validate the effectiveness of
our retrieve-and-learn framework in a real environment using a team of wheeled
robots.
Authors' comments: Published in the 2024 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2024)
Anupam Purwar, Rahul Sundar
Retrieving answers quickly and at low cost, without hallucinations, from a combination of structured and unstructured data using language models is a major hurdle that prevents the use of language models in knowledge retrieval automation. This becomes more pronounced when one wants to integrate a speech interface. Besides, for commercial search and chatbot applications, complete reliance on commercial large language models (LLMs) like GPT 3.5 can be very costly. In this work, the authors address this problem by first developing a keyword-based search framework that augments discovery of the context to be provided to the large language model. The keywords are generated by the LLM and cached for comparison with keywords generated by the LLM for the query raised. This significantly reduces the time and cost to find the context within documents. Once the context is set, the LLM uses it to provide answers based on a prompt tailored for Q&A. This research work demonstrates that the use of keywords in context identification reduces the overall inference time and cost of information retrieval. Given this reduction in inference time and cost with the keyword-augmented retrieval framework, a speech-based interface for user input and response readout was integrated. This allowed seamless interaction with the language model.
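The keyword-matching step can be sketched as a lookup against a cache of per-document keyword sets. In the described system the keywords come from an LLM; here they are passed in as plain sets, and the Jaccard overlap used for comparison is an assumption, since the abstract does not name a similarity measure. All identifiers are hypothetical.

```python
from typing import Dict, Optional, Set


def best_context(
    query_keywords: Set[str],
    cached_keywords: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the id of the cached document whose keywords best match the query.

    Uses Jaccard overlap (an assumed measure); returns None when no cached
    document shares any keyword with the query.
    """
    best_doc: Optional[str] = None
    best_score = 0.0
    # Iterate in sorted order so ties resolve deterministically.
    for doc_id in sorted(cached_keywords):
        keywords = cached_keywords[doc_id]
        union = query_keywords | keywords
        score = len(query_keywords & keywords) / len(union) if union else 0.0
        if score > best_score:
            best_doc, best_score = doc_id, score
    return best_doc
```

Because the document keywords are computed once and cached, only the query keywords need a fresh LLM call at question time, which is the source of the time and cost savings the abstract reports.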
Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from existing knowledge bases to answer visually-grounded
questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong
framework to tackle KB-VQA, first retrieves related documents with Dense
Passage Retrieval (DPR) and then uses them to answer questions. This paper
proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which
significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major
limitations in RA-VQA's retriever: (1) the image representations obtained via
image-to-text transforms can be incomplete and inaccurate and (2) relevance
scores between queries and documents are computed with one-dimensional
embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes
these limitations by obtaining image representations that complement those from
the image-to-text transforms using a vision model aligned with an existing
text-based retriever through a simple alignment network. FLMR also encodes
images and questions using multi-dimensional embeddings to capture
finer-grained relevance between queries and documents. FLMR significantly
improves the original RA-VQA retriever's PRRecall@5 by approximately 8%.
Finally, we equipped RA-VQA with two state-of-the-art large
multi-modal/language models to achieve ~61% VQA score in the OK-VQA
dataset.
Authors' comments: To appear at NeurIPS 2023. This is a submission version, and the
camera-ready version will be updated soon
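The difference between scoring with a single embedding and scoring with per-token embeddings can be shown with a small sketch. It uses the ColBERT-style MaxSim rule (sum over query tokens of the best-matching document token), which is a common form of late interaction; whether FLMR uses exactly this rule is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query: Vector, doc: Vector) -> float:
    """Relevance from one embedding per query and per document."""
    return dot(query, doc)


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """MaxSim: each query token is matched to its most similar document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)
```

For a query with token vectors `[1, 0]` and `[0, 1]`, a document containing both directions scores 2.0, while a document that repeats only `[1, 0]` scores 1.0, so the token-level score separates documents that cover every query aspect from those that cover only one.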
Seongha Eom, Namgyu Ho, Jaehoon Oh, Se-Young Yun
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream tasks, but these have inadvertently led to performance degradation on unseen classes, thus harming zero-shot generalization. This paper aims to address this challenge by leveraging readily available image-text pairs from an external dataset for cross-modal guidance during inference. To this end, we propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query image, we harness the power of CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset. Then, we assign higher weights to the more reliable modality between the original query image and retrieved text, contributing to the final prediction. X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training, showcasing the effectiveness of utilizing cross-modal features to maximize CLIP's zero-shot ability.
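A confidence-weighted ensemble of two modalities can be sketched as follows. The abstract does not state how X-MoRe measures modal confidence, so this sketch assumes the maximum class probability of each modality as its confidence; the function name is hypothetical.

```python
from typing import List, Sequence


def confidence_weighted_ensemble(
    image_probs: Sequence[float],
    text_probs: Sequence[float],
) -> List[float]:
    """Blend two class distributions, weighting the more confident one higher.

    Assumption: confidence is the largest class probability of each modality.
    """
    if len(image_probs) != len(text_probs):
        raise ValueError("both modalities must score the same label set")
    image_conf = max(image_probs)
    text_conf = max(text_probs)
    total = image_conf + text_conf
    if total == 0:
        raise ValueError("at least one modality must assign non-zero probability")
    w_image = image_conf / total
    w_text = text_conf / total
    return [w_image * pi + w_text * pt for pi, pt in zip(image_probs, text_probs)]
```

If the image-based prediction is uniform over four classes and the retrieved-text prediction puts 0.75 on the second class, the text receives three times the weight of the image and the fused prediction selects the second class.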
Nils Böhne, Mark Berger, Ronald van Velzen
Nowadays, one of the critical challenges in forensics is analyzing the
enormous amounts of unstructured digital evidence, such as images. Often,
unstructured digital evidence contains precious information for forensic
investigations. Therefore, a retrieval system that can effectively identify
forensically relevant images is paramount. In this work, we explored the
effectiveness of interactive learning in improving image retrieval performance
in the forensic domain by proposing Excalibur - a zero-shot cross-modal image
retrieval system extended with interactive learning. Excalibur was evaluated
using both simulations and a user study. The simulations reveal that
interactive learning is highly effective in improving retrieval performance in
the forensic domain. Furthermore, user study participants could effectively
leverage the power of interactive learning. Finally, they considered Excalibur
effective and straightforward to use and expressed interest in using it in
their daily practice.
Authors' comments: Submitted to the AAAI22 conference
Jiahao Zhang, Haiyang Zhang, Dongmei Zhang, Yong Liu, Shen Huang
Multi-hop QA involves finding multiple relevant passages and step-by-step
reasoning to answer complex questions. While previous approaches have developed
retrieval modules for selecting relevant passages, they face challenges in
scenarios beyond two hops, owing to the limited performance of one-step methods
and the failure of two-step methods when selecting irrelevant passages in
earlier stages. In this work, we introduce Beam Retrieval, a general end-to-end
retrieval framework for multi-hop QA. This approach maintains multiple partial
hypotheses of relevant passages at each step, expanding the search space and
reducing the risk of missing relevant passages. Moreover, Beam Retrieval
jointly optimizes an encoder and two classification heads by minimizing the
combined loss across all hops. To establish a complete QA system, we
incorporate a supervised reader or a zero-shot GPT-3.5. Experimental results
demonstrate that Beam Retrieval achieves a nearly 50% improvement compared with
baselines on challenging MuSiQue-Ans, and it also surpasses all previous
retrievers on HotpotQA and 2WikiMultiHopQA. Providing high-quality context,
Beam Retrieval helps our supervised reader achieve new state-of-the-art
performance and substantially improves (up to 28.8 points) the QA performance
of zero-shot GPT-3.5.
Authors' comments: Code is available at https://github.com/canghongjian/beam_retriever
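The idea of keeping several partial passage hypotheses per hop can be sketched as a generic beam search over passage chains. The scoring function is supplied by the caller and stands in for the trained encoder and classification heads, which are not reproduced here; all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Chain = Tuple[str, ...]


def beam_retrieve(
    passages: Sequence[str],
    score_chain: Callable[[Chain], float],  # hypothetical stand-in for the learned scorer
    num_hops: int,
    beam_size: int,
) -> List[Tuple[Chain, float]]:
    """Return the top `beam_size` passage chains after `num_hops` expansions."""
    beams: List[Tuple[Chain, float]] = [((), 0.0)]
    for _ in range(num_hops):
        candidates: List[Tuple[Chain, float]] = []
        for chain, _score in beams:
            for passage in passages:
                if passage in chain:
                    continue  # do not select the same passage twice
                new_chain = chain + (passage,)
                candidates.append((new_chain, score_chain(new_chain)))
        if not candidates:
            break  # fewer passages than hops: stop early
        # Keep only the highest-scoring partial hypotheses for the next hop.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

With `beam_size=1` this reduces to greedy selection, which cannot recover from a poor first hop; a larger beam keeps alternative first passages alive so that a chain with a weaker start but a stronger continuation can still win.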
Qian Dong, Yiding Liu, Qingyao Ai, Haitao Li, Shuaiqiang Wang, Yiqun Liu, Dawei Yin, Shaoping Ma
Passage retrieval is a fundamental task in many information systems, such as
web search and question answering, where both efficiency and effectiveness are
critical concerns. In recent years, neural retrievers based on pre-trained
language models (PLM), such as dual-encoders, have achieved huge success. Yet,
studies have found that the performance of dual-encoders is often limited due
to the neglect of the interaction information between queries and candidate
passages. Therefore, various interaction paradigms have been proposed to
improve the performance of vanilla dual-encoders. Particularly, recent
state-of-the-art methods often introduce late-interaction during the model
inference process. However, such late-interaction based methods usually incur
extensive computation and storage costs on large corpora. Despite their
effectiveness, the concern of efficiency and space footprint is still an
important factor that limits the application of interaction-based neural
retrieval models. To tackle this issue, we incorporate implicit interaction
into dual-encoders, and propose the I^3 retriever. In particular, our implicit
interaction paradigm leverages generated pseudo-queries to simulate
query-passage interaction, which is jointly optimized with the query and passage
encoders in an end-to-end manner. It can be fully pre-computed and cached, and
its inference process only involves a simple dot product of the query
vector and passage vector, which makes it as efficient as the vanilla dual
encoders. We conduct comprehensive experiments on MSMARCO and TREC2019 Deep
Learning Datasets, demonstrating the I^3 retriever's superiority in terms of
both effectiveness and efficiency. Moreover, the proposed implicit interaction
is compatible with special pre-training and knowledge distillation for passage
retrieval, which brings a new state-of-the-art performance.
Authors' comments: 10 pages