Ohad Rubin, Jonathan Berant
Retrieval-augmented language models (LMs) have received much attention
recently. However, typically the retriever is not trained jointly as a native
component of the LM, but added post-hoc to an already-pretrained LM, which
limits the ability of the LM and the retriever to adapt to one another. In this
work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture
and training procedure for jointly training a retrieval-augmented LM from
scratch and apply it to the task of modeling long texts. Given a recently
generated text chunk in a long document, the LM computes query representations,
which are then used to retrieve earlier chunks in the document, located
potentially tens of thousands of tokens before. Information from retrieved
chunks is fused into the LM representations to predict the next target chunk.
We train the retriever component with a semantic objective, where the goal is
to retrieve chunks that increase the probability of the next chunk, according
to a reference LM. We evaluate RPT on four long-range language modeling tasks,
spanning books, code, and mathematical writing, and demonstrate that RPT
improves retrieval quality and subsequently perplexity across the board
compared to strong baselines.
Authors' comments: Accepted to TACL 2024
Zhiyu Chen, Jason Choi, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi
Customers interacting with product search engines are increasingly
formulating information-seeking queries. Frequently Asked Question (FAQ)
retrieval aims to retrieve common question-answer pairs for a user query with
question intent. Integrating FAQ retrieval in product search can not only
empower users to make more informed purchase decisions, but also enhance user
retention through efficient post-purchase support. Determining when an FAQ
entry can satisfy a user's information need within product search, without
disrupting their shopping experience, represents an important challenge. We
propose an intent-aware FAQ retrieval system consisting of (1) an intent
classifier that predicts when a user's information need can be answered by an
FAQ; (2) a reformulation model that rewrites a query into a natural question.
Offline evaluation demonstrates that our approach improves Hit@1 by 13% on
retrieving ground-truth FAQs, while reducing latency by 95% compared to
baseline systems. These improvements are further validated by real user
feedback, where 71% of displayed FAQs on top of product search results received
explicit positive user feedback. Overall, our findings show promising
directions for integrating FAQ retrieval into product search at scale.
Authors' comments: ACL 2023 Industry Track
Ehsan Doostmohammadi, Tobias Norlund, Marco Kuhlmann, Richard Johansson
Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
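As a rough illustration of the surface-level retrieval discussed above, the following minimal sketch scores candidate chunks against a query chunk with the textbook Okapi BM25 formula. It is not the authors' implementation: whitespace tokenization, the toy chunks, the function name `bm25_scores`, and the parameter values `k1=1.2` and `b=0.75` are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy usage (hypothetical chunks): rank neighbors by token overlap.
chunks = [
    "the cat sat on the mat".split(),
    "dogs chase the cat".split(),
    "stock prices fell sharply".split(),
]
query = "cat on the mat".split()
scores = bm25_scores(query, chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
```

Because the score depends only on shared tokens, a chunk with no overlap receives exactly zero, which is the kind of surface-level signal the abstract contrasts with dense semantic similarity.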
Viktoriia Chekalina, Alexander Panchenko
In this paper, we present a submission to the Touche lab's Task 2 on Argument Retrieval for Comparative Questions. Our team, Katana, supplies several approaches based on decision tree ensemble algorithms to rank comparative documents according to their relevance and argumentative support. We use the PyTerrier library to apply ensemble models to a ranking problem, considering statistical text features and features based on comparative structures. We also employ large contextualized language models, such as BERT, to solve the proposed ranking task. To merge this technique with ranking modelling, we leverage the neural ranking library OpenNIR. Our systems substantially outperform the proposed baseline and scored first in relevance and second in quality according to the official metric of the competition (NDCG@5). The presented models could help to improve the processing of comparative queries in information retrieval and dialogue systems.
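For readers unfamiliar with the NDCG@5 measure mentioned above, here is a minimal sketch of how NDCG@k can be computed from graded relevance labels listed in ranked order. It assumes linear gain and a log2 discount, and it builds the ideal ranking from the same list of labels; the competition's official evaluation may differ on both points, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy usage: relevance labels of the documents in the order a system ranked them.
perfect = ndcg_at_k([3, 2, 1, 0, 0], k=5)
swapped = ndcg_at_k([0, 1], k=5)
```

A ranking already sorted by relevance scores 1.0, and pushing a relevant document down the list lowers the score through the logarithmic discount.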
Jinhyuk Lee, Zhuyun Dai, Sai Meher Karthik Duddu, Tao Lei, Iftekhar Naim, Ming-Wei Chang, Vincent Y. Zhao
Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020]
allow token-level interactions between queries and documents, and hence achieve
state of the art on many information retrieval benchmarks. However, their
non-linear scoring function cannot be scaled to millions of documents,
necessitating a three-stage process for inference: retrieving initial
candidates via token retrieval, accessing all token vectors, and scoring the
initial candidate documents. The non-linear scoring function is applied over
all token vectors of each candidate document, making the inference process
complicated and slow. In this paper, we aim to simplify the multi-vector
retrieval by rethinking the role of token retrieval. We present XTR,
ConteXtualized Token Retriever, which introduces a simple, yet novel, objective
function that encourages the model to retrieve the most important document
tokens first. The improvement to token retrieval allows XTR to rank candidates
only using the retrieved tokens rather than all tokens in the document, and
enables a newly designed scoring stage that is two-to-three orders of magnitude
cheaper than that of ColBERT. On the popular BEIR benchmark, XTR advances the
state-of-the-art by 2.8 nDCG@10 without any distillation. Detailed analysis
confirms our decision to revisit the token retrieval stage, as XTR demonstrates
much better recall of the token retrieval stage compared to ColBERT.
Authors' comments: NeurIPS 2023. Code available at
https://github.com/google-deepmind/xtr
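The token-level scoring that the abstract attributes to ColBERT-style models is commonly described as a sum-of-max over token similarities. The sketch below shows that generic scoring only, under the assumption of dot-product similarity and toy two-dimensional vectors; it is not XTR's training objective or its cheaper scoring stage, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sum_of_max(query_vecs, doc_vecs):
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy usage: two query token vectors against two document token vectors.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.5, 0.5]]
score = sum_of_max(query_vecs, doc_vecs)
```

The inner `max` runs over every token vector of the candidate document, which is the cost the abstract refers to when it says scoring requires accessing all token vectors.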
Dwaipayan Roy, Zeljko Carevic, Philipp Mayr
Retrievability measures the influence a retrieval system has on the access to
information in a given collection of items. This measure supports an
evaluation of the search system from which insights can be drawn. In this
paper, we investigate the retrievability in an integrated search system
consisting of items from various categories, particularly focussing on
datasets, publications, and variables in a real-life Digital Library
(DL). The traditional metrics, that is, the Lorenz curve and Gini coefficient,
are employed to visualize the diversity in retrievability scores of the
three retrievable document types (specifically datasets, publications,
and variables). Our results show a significant popularity bias with certain
items being retrieved more often than others. Particularly, it has been shown
that certain datasets are more likely to be retrieved than other datasets in
the same category. In contrast, the retrievability scores of items from the
variable or publication category are more evenly distributed. We have observed
that the distribution of document retrievability is more diverse for datasets
as compared to publications and variables.
Authors' comments: To appear in International Journal on Digital Libraries (IJDL). arXiv
admin note: substantial text overlap with arXiv:2205.00937
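The Lorenz curve and Gini coefficient named in the abstract above can be computed from a list of per-item retrievability scores as in the following minimal sketch. The input scores are made-up numbers and the function names are hypothetical; how the retrievability scores themselves are obtained is not shown here.

```python
def lorenz_curve(scores):
    """Points (fraction of items, cumulative fraction of total score), items sorted ascending."""
    ordered = sorted(scores)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, s in enumerate(ordered, start=1):
        running += s
        share = running / total if total > 0 else 0.0
        points.append((i / len(ordered), share))
    return points


def gini(scores):
    """Gini coefficient: 0 means perfectly even scores, values near 1 mean high inequality."""
    ordered = sorted(scores)
    n = len(ordered)
    total = sum(ordered)
    if total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * s for i, s in enumerate(ordered, start=1))
    return weighted / (n * total)


# Toy usage with invented retrievability scores for two item categories.
even_category = gini([1, 1, 1, 1])
skewed_category = gini([0, 0, 0, 4])
```

A category in which a few items collect most of the retrievals yields a Gini coefficient close to 1, while an evenly retrieved category yields a value close to 0.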
Kai Zhang, Chongyang Tao, Tao Shen, Can Xu, Xiubo Geng, Binxing Jiao, Daxin Jiang
Retrieval models based on dense representations in semantic space have become
an indispensable branch for first-stage retrieval. These retrievers benefit
from surging advances in representation learning towards compressive global
sequence-level embeddings. However, they are prone to overlook local salient
phrases and entity mentions in texts, which usually play pivotal roles in
first-stage retrieval. To mitigate this weakness, we propose to make a dense
retriever align with a well-performing lexicon-aware representation model. The
alignment is achieved by weakened knowledge distillations to enlighten the
retriever via two aspects -- 1) a lexicon-augmented contrastive objective to
challenge the dense encoder and 2) a pair-wise rank-consistent regularization
to make the dense model's behavior incline toward that of the lexicon-aware model. We evaluate our model on
three public benchmarks, which shows that with a comparable lexicon-aware
retriever as the teacher, our proposed dense one can bring consistent and
significant improvements, and even outdo its teacher. In addition, we found our
improvement on the dense retriever is complementary to the standard ranker
distillation, which can further lift state-of-the-art performance.
Authors' comments: 14 pages, 6 tables, 4 figures. WWW 2023
Yihan Wu, Hongyang Zhang, Heng Huang
Recent research works have shown that image retrieval models are vulnerable
to adversarial attacks, where slightly modified test inputs could lead to
problematic retrieval results. In this paper, we aim to design a provably
robust image retrieval model which keeps the most important evaluation metric
Recall@1 invariant to adversarial perturbation. We propose the first 1-nearest
neighbor (NN) image retrieval algorithm, RetrievalGuard, which is provably
robust against adversarial perturbations within an $\ell_2$ ball of calculable
radius. The challenge is to design a provably robust algorithm that takes into
consideration the 1-NN search and the high-dimensional nature of the embedding
space. Algorithmically, given a base retrieval model and a query sample, we
build a smoothed retrieval model by carefully analyzing the 1-NN search
procedure in the high-dimensional embedding space. We show that the smoothed
retrieval model has bounded Lipschitz constant and thus the retrieval score is
invariant to $\ell_2$ adversarial perturbations. Experiments on image retrieval
tasks validate the robustness of our RetrievalGuard method.
Authors' comments: accepted by ICML 2022
Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Kai Zhang, Daxin Jiang
Large-scale retrieval aims to recall relevant documents from a huge collection
given a query. It relies on representation learning to embed documents and
queries into a common semantic encoding space. According to the encoding space,
recent retrieval methods based on pre-trained language models (PLM) can be
coarsely categorized into either dense-vector or lexicon-based paradigms. These
two paradigms unveil the PLMs' representation capability in different
granularities, i.e., global sequence-level compression and local word-level
contexts, respectively. Inspired by their complementary global-local
contextualization and distinct representing views, we propose a new learning
framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in
one model with a dual-representing capability. Experiments on passage retrieval
benchmarks verify its effectiveness in both paradigms. A uni-retrieval scheme
is further presented with even better retrieval quality. We lastly evaluate the
model on the BEIR benchmark to verify its transferability.
Authors' comments: To appear at KDD ADS 2023
Yifan Gao, Qingyu Yin, Zheng Li, Rui Meng, Tong Zhao, Bing Yin, Irwin King, Michael R. Lyu
Keyphrase generation is the task of automatically predicting keyphrases given
a piece of long text. Despite its recent flourishing, keyphrase generation for
non-English languages has not been widely investigated. In this paper, we call
attention to a new setting named multilingual keyphrase generation and we
contribute two new datasets, EcommerceMKP and AcademicMKP, covering six
languages. Technically, we propose a retrieval-augmented method for
multilingual keyphrase generation to mitigate the data shortage problem in
non-English languages. The retrieval-augmented model leverages keyphrase
annotations in English datasets to facilitate generating keyphrases in
low-resource languages. Given a non-English passage, a cross-lingual dense
passage retrieval module finds relevant English passages. Then the associated
English keyphrases serve as external knowledge for keyphrase generation in the
current language. Moreover, we develop a retriever-generator iterative training
algorithm to mine pseudo parallel passage pairs to strengthen the cross-lingual
passage retriever. Comprehensive experiments and ablations show that the
proposed approach outperforms all baselines.
Authors' comments: NAACL 2022 (Findings)
Canwen Xu, Daya Guo, Nan Duan, Julian McAuley
In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever
that does not require any supervised data for training. Specifically, we first
present Iterative Contrastive Learning (ICoL) that iteratively trains the query
and document encoders with a cache mechanism. ICoL not only enlarges the number
of negative instances but also keeps representations of cached examples in the
same hidden space. We then propose Lexicon-Enhanced Dense Retrieval (LEDR) as a
simple yet effective way to enhance dense retrieval with lexical matching. We
evaluate LaPraDoR on the recently proposed BEIR benchmark, including 18
datasets of 9 zero-shot text retrieval tasks. Experimental results show that
LaPraDoR achieves state-of-the-art performance compared with supervised dense
retrieval models, and further analysis reveals the effectiveness of our
training strategy and objectives. Compared to re-ranking, our lexicon-enhanced
approach can be run in milliseconds (22.5x faster) while achieving superior
performance.
Authors' comments: ACL 2022 (Findings)
Zhongping Zhang, Yiwen Gu, Bryan A. Plummer
Article comprehension is an important challenge in natural language
processing with many applications such as article generation or
image-to-article retrieval. Prior work typically encodes all tokens in articles
uniformly using pretrained language models. However, in many applications, such
as understanding news stories, these articles are based on real-world events
and may reference many named entities that are difficult to accurately
recognize and predict by language models. To address this challenge, we propose
an ENtity-aware article GeneratIoN and rEtrieval (ENGINE) framework, to
explicitly incorporate named entities into language models. ENGINE has two main
components: a named-entity extraction module to extract named entities from
both metadata and embedded images associated with articles, and an entity-aware
mechanism that enhances the model's ability to recognize and predict entity
names. We conducted experiments on three public datasets: GoodNews, VisualNews,
and WikiText, where our results demonstrate that our model can boost both
article generation and article retrieval performance, with a 4-5 perplexity
improvement in article generation and a 3-4% boost in recall@1 in article
retrieval. We release our implementation at
https://github.com/Zhongping-Zhang/ENGINE .
Authors' comments: Accepted at EMNLP 2023 Findings
Yue Yang, Joongwon Kim, Artemis Panagopoulou, Mark Yatskar, Chris Callison-Burch
Schemata are structured representations of complex tasks that can aid artificial intelligence by allowing models to break down complex tasks into intermediate steps. We propose a novel system that induces schemata from web videos and generalizes them to capture unseen tasks with the goal of improving video retrieval performance. Our system proceeds in three major phases: (1) Given a task with related videos, we construct an initial schema for a task using a joint video-text model to match video segments with text representing steps from wikiHow; (2) We generalize schemata to unseen tasks by leveraging language models to edit the text within existing schemata. Through generalization, we can allow our schemata to cover a more extensive range of tasks with a small amount of learning data; (3) We conduct zero-shot instructional video retrieval with the unseen task names as the queries. Our schema-guided approach outperforms existing methods for video retrieval, and we demonstrate that the schemata induced by our system are better than those generated by other models.
Chenxin An, Ming Zhong, Zhichao Geng, Jianqiang Yang, Xipeng Qiu
Existing summarization systems mostly generate summaries relying purely on the content of the source document. However, even humans usually need some references or exemplars to fully understand the source document and write summaries in a particular format. How to find high-quality exemplars and incorporate them into summarization systems remains challenging and worth exploring. In this paper, we propose RetrievalSum, a novel retrieval-enhanced abstractive summarization framework consisting of a dense Retriever and a Summarizer. First, several closely related exemplars are retrieved as supplementary input to help the generation model understand the text more comprehensively. Furthermore, retrieved exemplars can also guide the model to capture the writing style of a specific corpus. We validate our method on a wide range of summarization datasets across multiple domains and two backbone models: BERT and BART. Results show that our framework improves ROUGE-1 by 1.38 to 4.66 points compared with the powerful pre-trained models, and achieves a new state of the art on BillSum. Human evaluation demonstrates that our retrieval-enhanced model can better capture the domain-specific writing style.
Peng Shi, Rui Zhang, He Bai, Jimmy Lin
Dense retrieval has shown great success in passage ranking in English. However, its effectiveness in document retrieval for non-English languages remains unexplored due to the limitation in training resources. In this work, we explore different transfer techniques for document ranking from English annotations to multiple non-English languages. Our experiments on the test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families reveal that zero-shot model-based transfer using mBERT improves the search quality in non-English mono-lingual retrieval. Also, we find that weakly-supervised target language transfer yields competitive performances against the generation-based target language transfer that requires external translators and query generators.
Min Jin Chong, Wen-Sheng Chu, Abhishek Kumar, David Forsyth
We present Retrieve in Style (RIS), an unsupervised framework for facial
feature transfer and retrieval on real images. Recent work shows capabilities
of transferring local facial features by capitalizing on the disentanglement
property of the StyleGAN latent space. RIS improves existing art on the
following: 1) Introducing more effective feature disentanglement to allow for
challenging transfers (i.e., hair, pose) that were not shown possible in SoTA
methods. 2) Eliminating the need for per-image hyperparameter tuning, and for
computing a catalog over a large batch of images. 3) Enabling fine-grained face
retrieval using disentangled facial features (e.g., eyes). To the best of our knowledge,
this is the first work to retrieve face images at this fine level. 4)
Demonstrating robust, natural editing on real images. Our qualitative and
quantitative analyses show RIS achieves both high-fidelity feature transfers
and accurate fine-grained retrievals on real images. We also discuss the
responsible applications of RIS.
Authors' comments: Code is here https://github.com/mchong6/RetrieveInStyle
Weihao Gao, Xiangjun Fan, Chong Wang, Jiankai Sun, Kai Jia, Wenzhi Xiao, Ruofan Ding, Xingyan Bin et al.
One of the core problems in large-scale recommendations is to retrieve top
relevant candidates accurately and efficiently, preferably in sub-linear time.
Previous approaches are mostly based on a two-step procedure: first learn an
inner-product model, and then use some approximate nearest neighbor (ANN)
search algorithm to find top candidates. In this paper, we present Deep
Retrieval (DR), to learn a retrievable structure directly with user-item
interaction data (e.g. clicks) without resorting to the Euclidean space
assumption in ANN algorithms. DR's structure encodes all candidate items into a
discrete latent space. Those latent codes for the candidates are model
parameters and learnt together with other neural network parameters to maximize
the same objective function. With the model learnt, a beam search over the
structure is performed to retrieve the top candidates for reranking.
Empirically, we first demonstrate that DR, with sub-linear computational
complexity, can achieve almost the same accuracy as the brute-force baseline on
two public datasets. Moreover, we show that, in a live production
recommendation system, a deployed DR approach significantly outperforms a
well-tuned ANN baseline in terms of engagement metrics. To the best of our
knowledge, DR is among the first non-ANN algorithms successfully deployed at
the scale of hundreds of millions of items for industrial recommendation
systems.
Authors' comments: 9 pages, 6 figures
Deguang Han, Ted Juste, Youfa Li, Wenchang Sun
An exact phase-retrievable frame $\{f_{i}\}_{i=1}^{N}$ for an $n$-dimensional
Hilbert space is a phase-retrievable frame that fails to be phase-retrievable
if any one element is removed from the frame. Such a frame could have different
lengths. We shall prove that for the real Hilbert space case, an exact
phase-retrievable frame of length $N$ exists for every $2n-1\leq N\leq
n(n+1)/2$. For arbitrary frames we introduce the concept of redundancy with
respect to its phase-retrievability and the concept of frames with exact
PR-redundancy. We investigate the phase-retrievability by studying its maximal
phase-retrievable subspaces with respect to a given frame which is not
necessarily phase-retrievable. These maximal PR-subspaces could have different
dimensions. We are able to identify the one with the largest dimension, which
can be considered as a generalization of the characterization for
phase-retrievable frames. In the basis case, we prove that if $M$ is a
$k$-dimensional PR-subspace, then $|supp(x)| \geq k$ for every nonzero vector
$x\in M$. Moreover, if $1\leq k< [(n+1)/2]$, then a $k$-dimensional PR-subspace
is maximal if and only if there exists a vector $x\in M$ such that $|supp(x) |
= k$.
Authors' comments: 21 pages
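To make the length range in the abstract above concrete, the following worked instances simply evaluate the stated bound $2n-1\leq N\leq n(n+1)/2$ for a few small dimensions. They illustrate the arithmetic only and add no claim beyond the stated result.

```latex
% Worked instances of the bound 2n - 1 \le N \le n(n+1)/2 (real Hilbert space case).
\begin{align*}
n = 2:&\quad 2n-1 = 3,\quad \tfrac{n(n+1)}{2} = 3 \;\Rightarrow\; N = 3,\\
n = 3:&\quad 2n-1 = 5,\quad \tfrac{n(n+1)}{2} = 6 \;\Rightarrow\; N \in \{5, 6\},\\
n = 4:&\quad 2n-1 = 7,\quad \tfrac{n(n+1)}{2} = 10 \;\Rightarrow\; N \in \{7, 8, 9, 10\}.
\end{align*}
```

The range of admissible lengths widens with the dimension, since the upper bound grows quadratically in $n$ while the lower bound grows linearly.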
Georgios Pantazopoulos, Malvina Nikandrou, Ioannis Konstas, Alessandro Suglia
Transformers excel at in-context retrieval but suffer from quadratic complexity with sequence length, while State Space Models (SSMs) offer efficient linear-time processing but have limited retrieval capabilities. We investigate whether hybrid architectures combining Transformers and SSMs can achieve the best of both worlds on two synthetic in-context retrieval tasks. The first task, n-gram retrieval, requires the model to identify and reproduce an n-gram that succeeds the query within the input sequence. The second task, position retrieval, presents the model with a single query token and requires it to perform a two-hop associative lookup: first locating the corresponding element in the sequence, and then outputting its positional index. Under controlled experimental conditions, we assess data efficiency, length generalization, robustness to out-of-domain training examples, and learned representations across Transformers, SSMs, and hybrid architectures. We find that hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense context retrieval. However, Transformers maintain superiority in position retrieval tasks. Through representation analysis, we discover that SSM-based models develop locality-aware embeddings where tokens representing adjacent positions become neighbors in embedding space, forming interpretable structures. This emergent property, absent in Transformers, explains both the strengths and limitations of SSMs and hybrids for different retrieval tasks. Our findings provide principled guidance for architecture selection based on task requirements and reveal fundamental differences in how Transformers, SSMs, and hybrid models learn positional associations.
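The two synthetic tasks described above can be pictured with a small data-generation sketch. The exact sequence format used in the paper is not given in the abstract, so everything here is an assumption for illustration: integer token ids, distinct tokens within a sequence so that the query is unambiguous, a single-token query, and the hypothetical helper names.

```python
import random


def make_ngram_example(vocab_size=50, seq_len=12, n=3, seed=0):
    """Build (sequence, query token, target n-gram) for the n-gram retrieval task."""
    rng = random.Random(seed)
    # Distinct tokens, so the query token occurs exactly once in the sequence.
    seq = rng.sample(range(vocab_size), seq_len)
    # Choose the query position so that n tokens still follow it.
    start = rng.randrange(0, seq_len - n)
    query = seq[start]
    target = seq[start + 1 : start + 1 + n]
    return seq, query, target


def solve_ngram(seq, query, n):
    """Reference answer: the n tokens that follow the query token."""
    i = seq.index(query)
    return seq[i + 1 : i + 1 + n]


def solve_position(seq, query):
    """Reference answer for position retrieval: index of the query token."""
    return seq.index(query)


seq, query, target = make_ngram_example()
```

In this framing the n-gram task asks the model to copy content located by a match, whereas the position task asks it to output where the match occurred, which is the two-hop lookup the abstract describes.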
Yifan Wang, Mingxuan Jiang, Zhihao Sun, Yixin Cao, Yicun Liu, Keyang Chen, Guangnan Ye, Hongfeng Chai
Retrieval-Augmented Generation (RAG) grounds large language models with external evidence, but many implementations rely on pre-built indices that remain static after construction. Related queries therefore repeat similar multi-hop traversal, increasing latency and compute. Motivated by schema-based learning in cognitive neuroscience, we propose GAM-RAG, a training-free framework that accumulates retrieval experience from recurring or related queries and updates retrieval memory over time. GAM-RAG builds a lightweight, relation-free hierarchical index whose links capture potential co-occurrence rather than fixed semantic relations. During inference, successful retrieval episodes provide sentence-level feedback, updating sentence memories so evidence useful for similar reasoning types becomes easier to activate later. To balance stability and adaptability under noisy feedback, we introduce an uncertainty-aware, Kalman-inspired gain rule that jointly updates memory states and perplexity-based uncertainty estimates. It applies fast updates for reliable novel signals and conservative refinement for stable or noisy memories. We provide a theoretical analysis of the update dynamics, and empirically show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%. Our code and datasets are available at: https://anonymous.4open.science/r/GAM_RAG-2EF6.
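The abstract describes the gain rule only as "Kalman-inspired", so the following is a sketch of the textbook scalar Kalman update rather than GAM-RAG's actual rule. The names `state`, `variance`, `observation`, and `obs_noise` are hypothetical, and how the paper derives uncertainty from perplexity is not modeled here.

```python
def kalman_update(state, variance, observation, obs_noise):
    """One scalar Kalman step: blend a memory state with a noisy feedback signal."""
    # Gain is high when the memory is uncertain relative to the feedback noise.
    gain = variance / (variance + obs_noise)
    new_state = state + gain * (observation - state)
    # Uncertainty shrinks after every update, so later updates become more conservative.
    new_variance = (1.0 - gain) * variance
    return new_state, new_variance, gain


# Toy usage: the same feedback signal under reliable versus noisy conditions.
_, _, gain_reliable = kalman_update(0.0, 1.0, 1.0, obs_noise=1.0)
_, _, gain_noisy = kalman_update(0.0, 1.0, 1.0, obs_noise=9.0)
```

With this rule, an uncertain memory that receives reliable feedback moves quickly toward it, while noisy feedback or an already stable memory produces only a small adjustment, which matches the qualitative behavior stated in the abstract.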