Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang et al.
Transformer-based Large Language Models (LLMs) have become increasingly
important. However, due to the quadratic time complexity of attention
computation, scaling LLMs to longer contexts incurs extremely slow inference
latency and high GPU memory consumption for caching key-value (KV) vectors.
This paper proposes RetrievalAttention, a training-free approach to both
accelerate attention computation and reduce GPU memory consumption. By
leveraging the dynamic sparsity of the attention mechanism, RetrievalAttention
proposes to use approximate nearest neighbor search (ANNS) indexes for KV
vectors in CPU memory and retrieves the most relevant ones with vector search
during generation. Unfortunately, we observe that the off-the-shelf ANNS
indexes are often ineffective for such retrieval tasks due to the
out-of-distribution (OOD) gap between query vectors and key vectors in the
attention mechanism. RetrievalAttention addresses the OOD challenge by designing an
attention-aware vector search algorithm that can adapt to the distribution of
query vectors. Our evaluation shows that RetrievalAttention only needs to
access 1-3% of data while maintaining high model accuracy. This leads to
significant reduction in the inference cost of long-context LLMs with much
lower GPU memory footprint. In particular, RetrievalAttention only needs a
single NVIDIA RTX4090 (24GB) for serving 128K tokens in LLMs with 8B
parameters, which is capable of generating one token in 0.188 seconds.
Authors' comments: 16 pages
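The retrieval idea in the abstract above can be illustrated with a minimal sketch. It assumes single-head dot-product attention without the usual scaling factor, and it uses an exact top-k selection as a stand-in for the ANNS index lookup; the function names are hypothetical and this is not the authors' attention-aware search algorithm.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(query, keys, values):
    # Reference: attend over every cached key-value pair.
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def topk_attention(query, keys, values, k):
    # Attend only over the k keys with the largest dot product.
    # Exact top-k here is an assumption standing in for an ANNS index.
    scores = [dot(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
```

When the attention weights are concentrated on a few keys, `topk_attention` with a small `k` stays close to `full_attention` while touching only a fraction of the cached vectors.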
Hanqi Zhang, Chong Chen, Lang Mei, Qi Liu, Jiaxin Mao
In the information retrieval (IR) area, dense retrieval (DR) models use deep learning techniques to encode queries and passages into an embedding space to compute their semantic relations. It is important for DR models to balance both efficiency and effectiveness. Pre-trained language models (PLMs), especially Transformer-based PLMs, have been proven to be effective encoders of DR models. However, the self-attention component in Transformer-based PLMs results in a computational complexity that grows quadratically with sequence length, and thus exhibits a slow inference speed for long-text retrieval. Some recently proposed non-Transformer PLMs, especially the Mamba architecture PLMs, have demonstrated not only comparable effectiveness to Transformer-based PLMs on generative language tasks but also better efficiency due to linear time scaling in sequence length. This paper implements the Mamba Retriever to explore whether Mamba can serve as an effective and efficient encoder for DR models in IR tasks. We fine-tune the Mamba Retriever on the classic short-text MS MARCO passage ranking dataset and the long-text LoCoV0 dataset. Experimental results show that (1) on the MS MARCO passage ranking dataset and BEIR, the Mamba Retriever achieves comparable or better effectiveness compared to Transformer-based retrieval models, and the effectiveness grows with the size of the Mamba model; (2) on the long-text LoCoV0 dataset, the Mamba Retriever can extend to longer text length than its pre-trained length after fine-tuning on retrieval task, and it has comparable or better effectiveness compared to other long-text retrieval models; (3) the Mamba Retriever has superior inference speed for long-text retrieval. In conclusion, Mamba Retriever is both effective and efficient, making it a practical model, especially for long-text retrieval.
Hanzhen Lu, Zhongxin Liu
Code comment generation aims to generate high-quality comments from source code automatically and has been studied for years. Recent studies proposed to integrate information retrieval techniques with neural generation models to tackle this problem, i.e., Retrieval-Augmented Comment Generation (RACG) approaches, and achieved state-of-the-art results. However, the retrievers in previous work are built independently of their generators. As a result, the retrieved exemplars are not necessarily the most useful ones for generating comments, which limits the performance of existing approaches. To address this limitation, we propose a novel training strategy to enable the retriever to learn from the feedback of the generator and retrieve exemplars for generation. Specifically, during training, we use the retriever to retrieve the top-k exemplars and calculate their retrieval scores, and use the generator to calculate a generation loss for the sample based on each exemplar. By aligning high-score exemplars retrieved by the retriever with low-loss exemplars observed by the generator, the retriever can learn to retrieve exemplars that can best improve the quality of the generated comments. Based on this strategy, we propose a novel RACG approach named JOINTCOM and evaluate it on two real-world datasets, JCSD and PCSD. The experimental results demonstrate that our approach surpasses the state-of-the-art baselines by 7.3% to 30.0% in terms of five metrics on the two datasets. We also conduct a human evaluation to compare JOINTCOM with the best-performing baselines. The results indicate that JOINTCOM outperforms the baselines, producing comments that are more natural, informative, and useful.
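One plausible way to read the alignment step described above is as a divergence between two distributions over the top-k exemplars. The sketch below assumes a KL divergence between a softmax over negative generation losses and a softmax over retrieval scores; the abstract does not state the exact objective, so the loss form and the name `alignment_loss` are illustrative assumptions only.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(retrieval_scores, generation_losses):
    # Retriever's preference over the top-k exemplars.
    p_retriever = softmax(retrieval_scores)
    # Generator's preference: a lower generation loss means a more useful
    # exemplar, so losses are negated before the softmax (assumed form).
    p_generator = softmax([-loss for loss in generation_losses])
    # KL(generator || retriever): zero when both rank exemplars identically.
    return sum(g * math.log(g / r) for g, r in zip(p_generator, p_retriever))
```

Under this assumed objective, the loss shrinks as the retriever assigns its highest scores to the exemplars on which the generator incurs the lowest loss.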
Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Weichuan Liu, Lei Hou, Juanzi Li
This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts self-aware uncertainty of LLMs from their internal states. SeaKR activates retrieval when the LLMs present high self-aware uncertainty for generation. To effectively integrate retrieved knowledge snippets, SeaKR re-ranks them based on LLM's self-aware uncertainty to preserve the snippet that reduces their uncertainty to the utmost. To facilitate solving complex tasks that require multiple retrievals, SeaKR utilizes their self-aware uncertainty to choose among different reasoning strategies. Our experiments on both complex and simple Question Answering datasets show that SeaKR outperforms existing adaptive RAG methods. We release our code at https://github.com/THU-KEG/SeaKR.
Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen
Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs are burdened with the task of distinguishing these documents using their inherent knowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill this gap by incorporating Retrieval information into Retrieval Augmented Generation. Specifically, R$^2$AG utilizes the nuanced features from the retrievers and employs an R$^2$-Former to capture retrieval information. Then, a retrieval-aware prompting strategy is designed to integrate retrieval information into LLMs' generation. Notably, R$^2$AG suits low-source scenarios where LLMs and retrievers are frozen. Extensive experiments across five datasets validate the effectiveness, robustness, and efficiency of R$^2$AG. Our analysis reveals that retrieval information serves as an anchor to aid LLMs in the generation process, thereby filling the semantic gap.
Wenyan Li, Jiaang Li, Rita Ramos, Raphael Tang, Desmond Elliott
Recent advances in retrieval-augmented models for image captioning highlight
the benefit of retrieving related captions for efficient, lightweight models
with strong domain-transfer capabilities. While these models demonstrate the
success of retrieval augmentation, retrieval models are still far from perfect
in practice: the retrieved information can sometimes mislead the model,
resulting in incorrect generation and worse performance. In this paper, we
analyze the robustness of a retrieval-augmented captioning model SmallCap. Our
analysis shows that the model is sensitive to tokens that appear in the
majority of the retrieved captions, and the input attribution shows that those
tokens are likely copied into the generated output. Given these findings, we
propose to train the model by sampling retrieved captions from more diverse
sets. This decreases the chance that the model learns to copy majority tokens,
and improves both in-domain and cross-domain performance.
Authors' comments: 9 pages, long paper at ACL 2024
Tiziano Labruna, Jon Ander Campos, Gorka Azkune
In this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM's parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token, <RET>, when they do not know the answer to a question. Our evaluation of the Adaptive Retrieval LLM (Adapt-LLM) on the PopQA dataset showcases improvements over the same LLM under three configurations: (i) retrieving information for all the questions, (ii) always using the parametric memory of the LLM, and (iii) using a popularity threshold to decide when to use a retriever. Through our analysis, we demonstrate that Adapt-LLM is able to generate the <RET> token when it determines that it does not know how to answer a question, indicating the need for IR, while it achieves notably high accuracy levels when it chooses to rely only on its parametric memory.
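The inference-time behaviour described above can be sketched as a two-pass control flow. The prompt templates and the `generate` and `retrieve` callables below are hypothetical placeholders for an LLM and an IR system; only the `<RET>` token itself comes from the abstract.

```python
RET_TOKEN = "<RET>"


def answer_with_adaptive_retrieval(question, generate, retrieve):
    """Return (answer, used_retrieval) for one question.

    `generate` maps a prompt string to model output and `retrieve` maps a
    question to a context passage; both are assumed callables.
    """
    # First pass: let the model answer from parametric memory.
    first_pass = generate(f"Question: {question}\nAnswer:")
    if RET_TOKEN not in first_pass:
        return first_pass.strip(), False
    # The model signalled that it needs external context: retrieve and retry.
    context = retrieve(question)
    second_pass = generate(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return second_pass.strip(), True
```

The retriever is called only when the first pass emits `<RET>`, so questions the model can answer on its own cost a single generation call.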
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, Dong Yu
Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.
Piotr Rybak, Maciej Ogrodniczuk
Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present Silver Retriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. Silver Retriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.
Ohad Rubin, Jonathan Berant
Retrieval-augmented language models (LMs) have received much attention
recently. However, typically the retriever is not trained jointly as a native
component of the LM, but added post-hoc to an already-pretrained LM, which
limits the ability of the LM and the retriever to adapt to one another. In this
work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture
and training procedure for jointly training a retrieval-augmented LM from
scratch and apply it to the task of modeling long texts. Given a recently
generated text chunk in a long document, the LM computes query representations,
which are then used to retrieve earlier chunks in the document, located
potentially tens of thousands of tokens before. Information from retrieved
chunks is fused into the LM representations to predict the next target chunk.
We train the retriever component with a semantic objective, where the goal is
to retrieve chunks that increase the probability of the next chunk, according
to a reference LM. We evaluate RPT on four long-range language modeling tasks,
spanning books, code, and mathematical writing, and demonstrate that RPT
improves retrieval quality and subsequently perplexity across the board
compared to strong baselines.
Authors' comments: Accepted to TACL 2024
Zhiyu Chen, Jason Choi, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi
Customers interacting with product search engines are increasingly
formulating information-seeking queries. Frequently Asked Question (FAQ)
retrieval aims to retrieve common question-answer pairs for a user query with
question intent. Integrating FAQ retrieval in product search can not only
empower users to make more informed purchase decisions, but also enhance user
retention through efficient post-purchase support. Determining when an FAQ
entry can satisfy a user's information need within product search, without
disrupting their shopping experience, represents an important challenge. We
propose an intent-aware FAQ retrieval system consisting of (1) an intent
classifier that predicts when a user's information need can be answered by an
FAQ; (2) a reformulation model that rewrites a query into a natural question.
Offline evaluation demonstrates that our approach improves Hit@1 by 13% on
retrieving ground-truth FAQs, while reducing latency by 95% compared to
baseline systems. These improvements are further validated by real user
feedback, where 71% of displayed FAQs on top of product search results received
explicit positive user feedback. Overall, our findings show promising
directions for integrating FAQ retrieval into product search at scale.
Authors' comments: ACL 2023 Industry Track
Ehsan Doostmohammadi, Tobias Norlund, Marco Kuhlmann, Richard Johansson
Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
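For readers unfamiliar with the surface-level method mentioned above, here is a minimal BM25 sketch applied in a re-ranking setting. It assumes pre-tokenized inputs, the common defaults k1 = 1.2 and b = 0.75, a Lucene-style smoothed IDF, and collection statistics computed over the candidate list only; none of these choices are taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    # Score every tokenized document against the tokenized query.
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Length normalisation penalises documents longer than average.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def rerank(query_tokens, candidates_tokens):
    # Return candidate indices ordered from highest to lowest BM25 score.
    scores = bm25_scores(query_tokens, candidates_tokens)
    return sorted(range(len(candidates_tokens)), key=lambda i: scores[i], reverse=True)
```

Re-ranking a short candidate list this way avoids scoring the whole collection, which is the cost trade-off the abstract alludes to.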
Viktoriia Chekalina, Alexander Panchenko
In this paper, we present a submission to the Touche lab's Task 2 on Argument Retrieval for Comparative Questions. Our team Katana supplies several approaches based on decision tree ensemble algorithms to rank comparative documents in accordance with their relevance and argumentative support. We use the PyTerrier library to apply ensemble models to a ranking problem, considering statistical text features and features based on comparative structures. We also employ large contextualized language modelling techniques, such as BERT, to solve the proposed ranking task. To merge this technique with ranking modelling, we leverage the neural ranking library OpenNIR. Our systems substantially outperform the proposed baseline and scored first in relevance and second in quality according to the official metrics of the competition (NDCG@5). The presented models could help to improve the performance of processing comparative queries in information retrieval and dialogue systems.
Jinhyuk Lee, Zhuyun Dai, Sai Meher Karthik Duddu, Tao Lei, Iftekhar Naim, Ming-Wei Chang, Vincent Y. Zhao
Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020]
allow token-level interactions between queries and documents, and hence achieve
state of the art on many information retrieval benchmarks. However, their
non-linear scoring function cannot be scaled to millions of documents,
necessitating a three-stage process for inference: retrieving initial
candidates via token retrieval, accessing all token vectors, and scoring the
initial candidate documents. The non-linear scoring function is applied over
all token vectors of each candidate document, making the inference process
complicated and slow. In this paper, we aim to simplify the multi-vector
retrieval by rethinking the role of token retrieval. We present XTR,
ConteXtualized Token Retriever, which introduces a simple, yet novel, objective
function that encourages the model to retrieve the most important document
tokens first. The improvement to token retrieval allows XTR to rank candidates
only using the retrieved tokens rather than all tokens in the document, and
enables a newly designed scoring stage that is two-to-three orders of magnitude
cheaper than that of ColBERT. On the popular BEIR benchmark, XTR advances the
state-of-the-art by 2.8 nDCG@10 without any distillation. Detailed analysis
confirms our decision to revisit the token retrieval stage, as XTR demonstrates
much better recall of the token retrieval stage compared to ColBERT.
Authors' comments: NeurIPS 2023. Code available at
https://github.com/google-deepmind/xtr
Dwaipayan Roy, Zeljko Carevic, Philipp Mayr
Retrievability measures the influence a retrieval system has on the access to
information in a given collection of items. This measure can help in making an
evaluation of the search system based on which insights can be drawn. In this
paper, we investigate the retrievability in an integrated search system
consisting of items from various categories, particularly focussing on
datasets, publications, and variables in a real-life Digital Library
(DL). The traditional metrics, that is, the Lorenz curve and Gini coefficient,
are employed to visualize the diversity in retrievability scores of the
three retrievable document types (specifically datasets, publications,
and variables). Our results show a significant popularity bias with certain
items being retrieved more often than others. Particularly, it has been shown
that certain datasets are more likely to be retrieved than other datasets in
the same category. In contrast, the retrievability scores of items from the
variable or publication category are more evenly distributed. We have observed
that the distribution of document retrievability is more diverse for datasets
as compared to publications and variables.
Authors' comments: To appear in International Journal on Digital Libraries (IJDL). arXiv
admin note: substantial text overlap with arXiv:2205.00937
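The two inequality measures named in the abstract above can be computed as follows. This is a minimal sketch assuming each item's retrievability score is a non-negative number with a positive total; the function names are illustrative and the scores in the usage note are made up.

```python
def lorenz_curve(scores):
    # Points (fraction of items, fraction of total retrievability),
    # with items ordered from least to most retrievable.
    ordered = sorted(scores)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, score in enumerate(ordered, start=1):
        running += score
        points.append((i / len(ordered), running / total))
    return points


def gini(scores):
    # Gini coefficient via the rank-weighted sum over sorted scores:
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with i starting at 1.
    ordered = sorted(scores)
    n = len(ordered)
    total = sum(ordered)
    weighted = sum((2 * i - n - 1) * s for i, s in enumerate(ordered, start=1))
    return weighted / (n * total)
```

For example, `gini([1, 1, 1, 1])` is 0 (every item equally retrievable), while `gini([0, 0, 0, 1])` is 0.75, the maximum for four items, reflecting a single item that absorbs all retrievals.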
Kai Zhang, Chongyang Tao, Tao Shen, Can Xu, Xiubo Geng, Binxing Jiao, Daxin Jiang
Retrieval models based on dense representations in semantic space have become
an indispensable branch for first-stage retrieval. These retrievers benefit
from surging advances in representation learning towards compressive global
sequence-level embeddings. However, they are prone to overlook local salient
phrases and entity mentions in texts, which usually play pivotal roles in
first-stage retrieval. To mitigate this weakness, we propose to make a dense
retriever align with a well-performing lexicon-aware representation model. The
alignment is achieved by weakened knowledge distillations to enlighten the
retriever via two aspects -- 1) a lexicon-augmented contrastive objective to
challenge the dense encoder and 2) a pair-wise rank-consistent regularization
to make the dense model's behavior incline toward that of the lexicon-aware model. We evaluate our model on
three public benchmarks, which shows that with a comparable lexicon-aware
retriever as the teacher, our proposed dense one can bring consistent and
significant improvements, and even outdo its teacher. In addition, we found our
improvement on the dense retriever is complementary to the standard ranker
distillation, which can further lift state-of-the-art performance.
Authors' comments: 14 pages, 6 tables, 4 figures. WWW 2023
Yihan Wu, Hongyang Zhang, Heng Huang
Recent research works have shown that image retrieval models are vulnerable
to adversarial attacks, where slightly modified test inputs could lead to
problematic retrieval results. In this paper, we aim to design a provably
robust image retrieval model which keeps the most important evaluation metric
Recall@1 invariant to adversarial perturbation. We propose the first 1-nearest
neighbor (NN) image retrieval algorithm, RetrievalGuard, which is provably
robust against adversarial perturbations within an $\ell_2$ ball of calculable
radius. The challenge is to design a provably robust algorithm that takes into
consideration the 1-NN search and the high-dimensional nature of the embedding
space. Algorithmically, given a base retrieval model and a query sample, we
build a smoothed retrieval model by carefully analyzing the 1-NN search
procedure in the high-dimensional embedding space. We show that the smoothed
retrieval model has bounded Lipschitz constant and thus the retrieval score is
invariant to $\ell_2$ adversarial perturbations. Experiments on image retrieval
tasks validate the robustness of our RetrievalGuard method.
Authors' comments: accepted by ICML 2022
Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Kai Zhang, Daxin Jiang
Large-scale retrieval aims to recall relevant documents from a huge collection
given a query. It relies on representation learning to embed documents and
queries into a common semantic encoding space. According to the encoding space,
recent retrieval methods based on pre-trained language models (PLM) can be
coarsely categorized into either dense-vector or lexicon-based paradigms. These
two paradigms unveil the PLMs' representation capability in different
granularities, i.e., global sequence-level compression and local word-level
contexts, respectively. Inspired by their complementary global-local
contextualization and distinct representing views, we propose a new learning
framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in
one model with a dual-representing capability. Experiments on passage retrieval
benchmarks verify its effectiveness in both paradigms. A uni-retrieval scheme
is further presented with even better retrieval quality. We lastly evaluate the
model on BEIR benchmark to verify its transferability.
Authors' comments: To appear at KDD ADS 2023
Yifan Gao, Qingyu Yin, Zheng Li, Rui Meng, Tong Zhao, Bing Yin, Irwin King, Michael R. Lyu
Keyphrase generation is the task of automatically predicting keyphrases given
a piece of long text. Despite its recent flourishing, keyphrase generation for
non-English languages has not been widely investigated. In this paper, we call
attention to a new setting named multilingual keyphrase generation and we
contribute two new datasets, EcommerceMKP and AcademicMKP, covering six
languages. Technically, we propose a retrieval-augmented method for
multilingual keyphrase generation to mitigate the data shortage problem in
non-English languages. The retrieval-augmented model leverages keyphrase
annotations in English datasets to facilitate generating keyphrases in
low-resource languages. Given a non-English passage, a cross-lingual dense
passage retrieval module finds relevant English passages. Then the associated
English keyphrases serve as external knowledge for keyphrase generation in the
current language. Moreover, we develop a retriever-generator iterative training
algorithm to mine pseudo parallel passage pairs to strengthen the cross-lingual
passage retriever. Comprehensive experiments and ablations show that the
proposed approach outperforms all baselines.
Authors' comments: NAACL 2022 (Findings)
Canwen Xu, Daya Guo, Nan Duan, Julian McAuley
In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever
that does not require any supervised data for training. Specifically, we first
present Iterative Contrastive Learning (ICoL) that iteratively trains the query
and document encoders with a cache mechanism. ICoL not only enlarges the number
of negative instances but also keeps representations of cached examples in the
same hidden space. We then propose Lexicon-Enhanced Dense Retrieval (LEDR) as a
simple yet effective way to enhance dense retrieval with lexical matching. We
evaluate LaPraDoR on the recently proposed BEIR benchmark, including 18
datasets of 9 zero-shot text retrieval tasks. Experimental results show that
LaPraDoR achieves state-of-the-art performance compared with supervised dense
retrieval models, and further analysis reveals the effectiveness of our
training strategy and objectives. Compared to re-ranking, our lexicon-enhanced
approach can be run in milliseconds (22.5x faster) while achieving superior
performance.
Authors' comments: ACL 2022 (Findings)