Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin
Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose Thoth, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design Thoth, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. Thoth proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement Thoth and evaluate it on vLLM, a state-of-the-art LLM inference system, and Faiss, a state-of-the-art vector database. The experimental results show that Thoth reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
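The caching idea in the abstract above can be illustrated with a minimal sketch. It flattens the knowledge tree into a dictionary keyed by ordered tuples of document IDs, because in a causal LLM the intermediate states of a document depend on everything that precedes it. The class name `PrefixStateCache`, the string placeholders for cached states, and the LRU eviction rule are all assumptions for illustration; the abstract does not specify the replacement policy in enough detail to reproduce it here.

```python
from collections import OrderedDict


class PrefixStateCache:
    """Toy cache keyed by ordered tuples of document IDs.

    Each key stands for the intermediate states of that document sequence.
    LRU eviction is a stand-in assumption, not the policy from the abstract.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # prefix tuple -> cached state

    def longest_cached_prefix(self, doc_ids):
        """Return the longest prefix of doc_ids whose state is cached."""
        for end in range(len(doc_ids), 0, -1):
            prefix = tuple(doc_ids[:end])
            if prefix in self.entries:
                self.entries.move_to_end(prefix)  # mark as recently used
                return prefix
        return ()

    def insert(self, doc_ids, state):
        """Cache a state, evicting least recently used entries when full."""
        prefix = tuple(doc_ids)
        self.entries[prefix] = state
        self.entries.move_to_end(prefix)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


cache = PrefixStateCache(capacity=2)
cache.insert(["d1"], "states(d1)")
cache.insert(["d1", "d2"], "states(d1,d2)")

# A request that retrieved d1, d2, d3 can reuse the states of (d1, d2)
# and only needs to compute the states for d3 and the question.
hit = cache.longest_cached_prefix(["d1", "d2", "d3"])
```

A real tree would also avoid evicting a parent before its children, which this flat sketch does not enforce.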
Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
Embedding models play a pivotal role in modern NLP applications such as IR and
RAG. While the context limit of LLMs has been pushed beyond 1 million tokens,
embedding models are still confined to a narrow context window not exceeding 8k
tokens, which keeps them out of application scenarios requiring long inputs such as
legal contracts. This paper explores context window extension of existing
embedding models, pushing the limit to 32k without requiring additional
training. First, we examine the performance of current embedding models for
long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed
comprises two synthetic tasks and four carefully chosen real-world tasks,
featuring documents of varying length and dispersed target information.
Benchmarking results underscore huge room for improvement in these models.
Based on this, comprehensive experiments show that training-free context window
extension strategies like position interpolation can effectively extend the
context window of existing embedding models several-fold, regardless of
their original context being 512 or beyond 4k. Furthermore, for models
employing absolute position encoding (APE), we show the possibility of further
fine-tuning to harvest notable performance gains while strictly preserving
original behavior for short inputs. For models using rotary position embedding
(RoPE), significant enhancements are observed when employing RoPE-specific
methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for
context window extension. To facilitate future research, we release E5-Base-4k
and E5-RoPE-Base, along with the LongEmbed benchmark.
Authors' comments: EMNLP 2024 Camera Ready
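As a rough illustration of the position interpolation strategy named in the abstract above, the sketch below rescales the positions of a long input so that they fall inside the original context window, then reads a learned position-embedding table at the resulting fractional positions by linear interpolation. The function names, the toy one-dimensional table, and the choice to leave short inputs unscaled are assumptions made for this example, not details taken from the paper.

```python
def interpolated_position_ids(seq_len, original_window):
    """Rescale positions 0..seq_len-1 so they fit inside [0, original_window).

    Inputs no longer than the original window keep their integer positions.
    """
    if seq_len <= original_window:
        return [float(i) for i in range(seq_len)]
    scale = original_window / seq_len
    return [i * scale for i in range(seq_len)]


def interpolate_embedding(table, position):
    """Linearly interpolate a position-embedding table at a fractional position."""
    low = int(position)
    high = min(low + 1, len(table) - 1)  # clamp at the last trained position
    weight = position - low
    return [(1 - weight) * a + weight * b for a, b in zip(table[low], table[high])]


# Toy table: 4 trained positions, embedding dimension 1.
table = [[0.0], [1.0], [2.0], [3.0]]

# An 8-token input is squeezed into the 4-position window.
positions = interpolated_position_ids(seq_len=8, original_window=4)
embeddings = [interpolate_embedding(table, p) for p in positions]
```

Every rescaled position stays below the original window size, so the model never sees a position index it was not trained on.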
Truman Hickok, Dhireesha Kudithipudi
One of the most widely used approaches in continual learning is referred to as replay. Replay methods support interleaved learning by storing past experiences in a replay buffer. Although there are methods for selectively constructing the buffer and reprocessing its contents, there is limited exploration of the problem of selectively retrieving samples from the buffer. Current solutions have been tested in limited settings and, more importantly, in isolation. Existing work has also not explored the impact of duplicate replays on performance. In this work, we propose a framework for evaluating selective retrieval strategies, categorized by simple, independent class- and sample-selective primitives. We evaluated several combinations of existing strategies for selective retrieval and present their performances. Furthermore, we propose a set of strategies to prevent duplicate replays and explore whether new samples with low loss values can be learned without replay. In an effort to match our problem setting to a realistic continual learning pipeline, we restrict our experiments to a setting involving a large, pre-trained, open vocabulary object detection model, which is fully fine-tuned on a sequence of 15 datasets.
Aman Sinha, Priyanshu Raj Mall, Dwaipayan Roy
Quantifying bias in retrieval functions through document retrievability
scores is vital for assessing recall-oriented retrieval systems. However, many
studies investigating retrieval model bias lack validation of their query
generation methods as accurate representations of retrievability for real users
and their queries. This limitation results from the absence of established
criteria for query generation in retrievability assessments. Typically,
researchers resort to using frequent collocations from document corpora when no
query log is available. In this study, we address the issue of reproducibility
and seek to validate query generation methods by comparing retrievability
scores generated from artificially generated queries to those derived from
query logs. Our findings demonstrate a minimal or negligible correlation
between retrievability scores from artificial queries and those from query
logs. This suggests that artificially generated queries may not accurately
reflect retrievability scores as derived from query logs. We further explore
alternative query generation techniques, uncovering a variation that exhibits
the highest correlation. This alternative approach holds promise for improving
reproducibility when query logs are unavailable.
Authors' comments: Accepted at ECIR 2024
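The comparison described in the abstract above can be sketched as follows, assuming the common cumulative definition of retrievability: a document's score is the number of queries that rank it within a cutoff c. The toy rankings, the cutoff, and the use of Pearson correlation are assumptions for illustration; the abstract does not state which correlation measure the authors use.

```python
from collections import Counter


def retrievability_scores(rankings, cutoff):
    """Count, for each document, the queries that rank it in the top `cutoff`."""
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ranked lists: one set from generated queries, one from a query log.
artificial = {"q1": ["a", "b", "c"], "q2": ["a", "c", "b"], "q3": ["b", "a", "c"]}
logged = {"u1": ["c", "b", "a"], "u2": ["c", "a", "b"], "u3": ["b", "c", "a"]}

r_artificial = retrievability_scores(artificial, cutoff=2)
r_logged = retrievability_scores(logged, cutoff=2)

docs = sorted(set(r_artificial) | set(r_logged))
correlation = pearson([r_artificial[d] for d in docs], [r_logged[d] for d in docs])
```

In this deliberately extreme toy case the two query sets favor opposite documents, so the scores are negatively correlated.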
Tianyu Zhu, Myong Chol Jung, Jesse Clark
Contrastive learning has gained widespread adoption for retrieval tasks due to its minimal requirement for manual annotations. However, popular training frameworks typically learn from binary (positive/negative) relevance, making them ineffective at incorporating desired rankings. As a result, the poor ranking performance of these models forces systems to employ a re-ranker, which increases complexity, maintenance effort and inference time. To address this, we introduce Generalized Contrastive Learning (GCL), a training framework designed to learn from continuous ranking scores beyond binary relevance. GCL encodes both relevance and ranking information into a unified embedding space by applying ranking scores to the loss function. This enables a single-stage retrieval system. In addition, during our research, we identified a lack of public multi-modal datasets that benchmark both retrieval and ranking capabilities. To facilitate this and future research for ranked retrieval, we curated a large-scale MarqoGS-10M dataset using GPT-4 and Google Shopping, providing ranking scores for each of the 10 million query-document pairs. Our results show that GCL achieves a 29.3% increase in NDCG@10 for in-domain evaluations and 6.0% to 10.0% increases for cold-start evaluations compared to the finetuned CLIP baseline with MarqoGS-10M. Additionally, we evaluated GCL offline on proprietary user interaction data. GCL shows an 11.2% gain for in-domain evaluations. The dataset and the method are available at: https://github.com/marqo-ai/GCL.
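One plausible reading of "applying ranking scores to the loss function" is a contrastive loss in which each query-document pair contributes in proportion to a weight derived from its ranking score. The sketch below shows that idea with an InfoNCE-style loss over in-batch negatives. The function name, the weighting scheme, and the temperature are assumptions for illustration and may differ from the loss actually used in GCL.

```python
import math


def weighted_info_nce(similarity, weights, temperature=0.1):
    """Weighted average of per-query cross-entropy losses.

    similarity[i][j] is the similarity of query i and document j; the
    matching document for query i sits on the diagonal. weights[i] is a
    non-negative weight derived from the ranking score of pair i.
    """
    total = 0.0
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += weights[i] * (log_denominator - logits[i])
    return total / sum(weights)


similarity = [
    [0.9, 0.1],  # query 0: clear margin over the negative
    [0.2, 0.8],  # query 1: smaller margin, so a larger per-query loss
]
uniform = weighted_info_nce(similarity, weights=[1.0, 1.0])
emphasize_second = weighted_info_nce(similarity, weights=[1.0, 3.0])
```

With uniform weights the function reduces to the usual mean contrastive loss; raising the weight of a pair makes its error count for more in the average.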
Patrice Béchard, Orlando Marquez Ayala
A common and fundamental limitation of Generative AI (GenAI) is its
propensity to hallucinate. While large language models (LLMs) have taken the
world by storm, without eliminating or at least reducing hallucinations,
real-world GenAI systems may face challenges in user adoption. In the process
of deploying an enterprise application that produces workflows based on natural
language requirements, we devised a system leveraging Retrieval Augmented
Generation (RAG) to greatly improve the quality of the structured output that
represents such workflows. Thanks to our implementation of RAG, our proposed
system significantly reduces hallucinations in the output and improves the
generalization of our LLM in out-of-domain settings. In addition, we show that
using a small, well-trained retriever encoder can reduce the size of the
accompanying LLM, thereby making deployments of LLM-based systems less
resource-intensive.
Authors' comments: To be presented at NAACL 2024. 11 pages and 4 figures
Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira
Despite Portuguese being one of the most spoken languages in the world, there
is a lack of high-quality information retrieval datasets in that language. We
present Quati, a dataset specifically designed for the Brazilian Portuguese
language. It comprises a collection of queries formulated by native speakers
and a curated set of documents sourced from a selection of high-quality
Brazilian Portuguese websites. These websites are more likely to be frequented by
real users than randomly scraped ones, ensuring a more representative
and relevant corpus. To label the query-document pairs, we use a
state-of-the-art LLM, which shows inter-annotator agreement levels comparable
to human performance in our assessments. We provide a detailed description of
our annotation methodology to enable others to create similar datasets for
other languages, providing a cost-effective way of creating high-quality IR
datasets with an arbitrary number of labeled documents per query. Finally, we
evaluate a diverse range of open-source and commercial retrievers to serve as
baseline systems. Quati is publicly available at
https://huggingface.co/datasets/unicamp-dl/quati and all scripts at
https://github.com/unicamp-dl/quati .
Authors' comments: 22 pages
Alessandro Stolfo
We present an empirical study of groundedness in long-form question answering
(LFQA) by retrieval-augmented large language models (LLMs). In particular, we
evaluate whether every generated sentence is grounded in the retrieved
documents or the model's pre-training data. Across 3 datasets and 4 model
families, our findings reveal that a significant fraction of generated
sentences are consistently ungrounded, even when those sentences contain
correct ground-truth answers. Additionally, we examine the impacts of factors
such as model size, decoding strategy, and instruction tuning on groundedness.
Our results show that while larger models tend to ground their outputs more
effectively, a significant portion of correct answers remains compromised by
hallucinations. This study provides novel insights into the groundedness
challenges in LFQA and underscores the necessity for more robust mechanisms in
LLMs to mitigate the generation of ungrounded content.
Authors' comments: NAACL 2024 (Findings)
Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi
Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive for deployment in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, *superposition prompting*, which can be directly applied to pre-trained transformer-based LLMs *without the need for fine-tuning*. At a high level, superposition prompting allows the LLM to process input documents in parallel *prompt paths*, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a 93x reduction in compute time while *improving* accuracy by 43% on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
Alireza Salemi, Surya Kallumadi, Hamed Zamani
This paper studies retrieval-augmented approaches for personalizing large language models (LLMs), which potentially have a substantial impact on various applications and domains. We propose the first attempt to optimize the retrieval models that deliver a limited number of personal documents to large language models for the purpose of personalized generation. We develop two optimization algorithms that solicit feedback from the downstream personalized generation tasks for retrieval optimization--one based on reinforcement learning whose reward function is defined using any arbitrary metric for personalized generation and another based on knowledge distillation from the downstream LLM to the retrieval model. This paper also introduces a pre- and post-generation retriever selection model that decides what retriever to choose for each LLM input. Extensive experiments on diverse tasks from the language model personalization (LaMP) benchmark reveal statistically significant improvements in six out of seven datasets.
Megha Rayer, Charul Rajput, B. Sundar Rajan
In Pliable Private Information Retrieval (PPIR) with a single server,
messages are partitioned into $\Gamma$ non-overlapping classes \cite{ref5}. The
user wants to retrieve a message from its desired class without revealing the
identity of the desired class to the server. In \cite{ref6}, Obead et al.
consider the problem of PPIR with Side Information (PPIR-SI), where the user
now has side information. The user wants to retrieve any new message (not
included in the side information) from its desired class without revealing the
identity of the desired class and its side information. A scheme for the
PPIR-SI is given in \cite{ref6} for the case when the user's side information is
unidentified, and this case is referred to as PPIR with Unidentifiable SI
(PPIR-USI). In this paper, we study the problem of PPIR for the single server
case when the side information is partially identifiable, and we term this case
as PPIR with Identifiable Side Information (PPIR-ISI). The user is well aware
of the identity of the side information belonging to $\eta$ number of classes,
where $1\leq \eta \leq \Gamma$. We give a scheme for PPIR-ISI, and we prove
that having identifiable side information is advantageous by comparing the rate
of the proposed scheme to the rate of the PPIR-USI scheme given in \cite{ref6}
for some cases. Further, we extend the problem of PPIR-ISI to the multi-user case,
where users can collaboratively generate the query sets, and we give a scheme
for this problem.
Authors' comments: 10 pages and 3 figures
Jinpeng Wang, Bin Chen, Qiang Zhang, Zaiqiao Meng, Shangsong Liang, Shu-Tao Xia
Deep quantization methods have shown high efficiency on large-scale image
retrieval. However, current models heavily rely on ground-truth information,
hindering the application of quantization in label-hungry scenarios. A more
realistic demand is to learn from inexhaustible uploaded images that are
associated with informal tags provided by amateur users. Though such sketchy
tags do not obviously reveal the labels, they actually contain useful semantic
information for supervising deep quantization. To this end, we propose
Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first
work to learn deep quantization from weakly tagged images. Specifically, 1) we
use word embeddings to represent the tags and enhance their semantic
information based on a tag correlation graph. 2) To better preserve semantic
information in quantization codes and reduce quantization error, we jointly
learn semantics-preserving embeddings and supervised quantizer on hypersphere
by employing a well-designed fusion layer and tailor-made loss functions.
Extensive experiments show that WSDHQ can achieve state-of-the-art performance on
weakly-supervised compact coding. Code is available at
https://github.com/gimpong/AAAI21-WSDHQ.
Authors' comments: In proceedings of AAAI 2021. Code and data are available
Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma
Temporal Action Detection (TAD) focuses on detecting pre-defined actions,
while Moment Retrieval (MR) aims to identify the events described by open-ended
natural language within untrimmed videos. Although they focus on different
events, we observe they have a significant connection. For instance, most
descriptions in MR involve multiple actions from TAD. In this paper, we aim to
investigate the potential synergy between TAD and MR. Firstly, we propose a
unified architecture, termed Unified Moment Detection (UniMD), for both TAD and
MR. It transforms the inputs of the two tasks, namely actions for TAD or events
for MR, into a common embedding space, and utilizes two novel query-dependent
decoders to generate a uniform output of classification score and temporal
segments. Secondly, we explore the efficacy of two task fusion learning
approaches, pre-training and co-training, in order to enhance the mutual
benefits between TAD and MR. Extensive experiments demonstrate that the
proposed task fusion learning scheme enables the two tasks to help each other
and outperform the separately trained counterparts. Impressively, UniMD
achieves state-of-the-art results on three paired datasets Ego4D, Charades-STA,
and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
Authors' comments: ECCV2024
Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley, Lina Yao
Retrieval-augmented large language models (LLMs) have been remarkably competent in various NLP tasks. However, it was observed by previous works that retrieval is not always helpful, especially when the LLM is already knowledgeable on the query to answer. Motivated by this, Adaptive Retrieval-Augmented Generation (ARAG) studies retrieving only when the knowledge asked by the query is absent in the LLM. Previous works of ARAG either require accessing the pre-training corpus or prompting with additional model inferences. Aiming to avoid such drawbacks, we propose to determine whether the model is knowledgeable on a query via inspecting the (contextualized) pre-trained token embeddings of LLMs. We hypothesize that such embeddings capture rich information on the model's intrinsic knowledge base, which enables an efficient way of judging the necessity to retrieve from an external corpus. Extensive experiments demonstrate our ARAG approach's superior performance across various benchmarks.
Dennis Wu, Jerry Yao-Chieh Hu, Teng-Yun Hsiao, Han Liu
We propose a two-stage memory retrieval dynamics for modern Hopfield models,
termed $\mathtt{U\text{-}Hop}$, with enhanced memory capacity. Our key
contribution is a learnable feature map $\Phi$ which transforms the Hopfield
energy function into kernel space. This transformation ensures convergence
between the local minima of energy and the fixed points of retrieval dynamics
within the kernel space. Consequently, the kernel norm induced by $\Phi$ serves
as a novel similarity measure. It utilizes the stored memory patterns as
learning data to enhance memory capacity across all modern Hopfield models.
Specifically, we accomplish this by constructing a separation loss
$\mathcal{L}_\Phi$ that separates the local minima of kernelized energy by
separating stored memory patterns in kernel space. Methodologically,
$\mathtt{U\text{-}Hop}$ memory retrieval process consists of: (Stage I)
minimizing separation loss for a more uniform memory (local minimum)
distribution, followed by (Stage II) standard Hopfield energy minimization for
memory retrieval. This results in a significant reduction of possible
metastable states in the Hopfield energy function, thus enhancing memory
capacity by preventing memory confusion. Empirically, with real-world datasets,
we demonstrate that $\mathtt{U\text{-}Hop}$ outperforms all existing modern
Hopfield models and state-of-the-art similarity measures, achieving substantial
improvements in both associative memory retrieval and deep learning tasks. Code
is available at https://github.com/MAGICS-LAB/UHop ; future updates are on
arXiv:2404.03827
Authors' comments: Accepted at ICML 2024; v3 added a note on follow-up UHop+
(arXiv:2410.23126); v2 updated to camera-ready version; Code available at
https://github.com/MAGICS-LAB/UHop
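For orientation, the sketch below shows only the standard modern Hopfield retrieval step that the abstract above refers to as Stage II: the query is replaced by a softmax-weighted average of the stored patterns, with weights given by scaled dot products. It does not include the learnable feature map or the Stage I separation loss. The function names, the toy patterns, and the value of `beta` are assumptions chosen for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_update(patterns, query, beta):
    """One retrieval step: softmax-weighted average of the stored patterns."""
    scores = [
        beta * sum(p_k * q_k for p_k, q_k in zip(pattern, query))
        for pattern in patterns
    ]
    weights = softmax(scores)
    return [
        sum(w * pattern[k] for w, pattern in zip(weights, patterns))
        for k in range(len(query))
    ]


# Two stored memories and a noisy query close to the first one.
patterns = [[1.0, 0.0], [0.0, 1.0]]
retrieved = hopfield_update(patterns, query=[0.9, 0.1], beta=8.0)
```

A single step already moves the noisy query almost onto the first stored pattern, because the softmax concentrates nearly all weight on the closest memory. When stored patterns are poorly separated, the weights spread across several memories, which is the confusion the abstract aims to reduce.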
Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations are used to compute the relevance of a passage w.r.t. a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.
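The bit-vector pre-filtering step mentioned in the abstract above can be sketched roughly as follows. For each query term, a bit vector marks the centroids whose similarity to that term exceeds a threshold; a passage, represented by the centroid IDs of its tokens, is then scored by how many query terms have at least one marked centroid in it. The function names, the threshold, and the toy scores are assumptions for illustration, and Python integers stand in for the optimized bit vectors of the actual system.

```python
def build_bit_vectors(centroid_scores, threshold):
    """For each query term, set bit j when centroid j scores above the threshold."""
    bit_vectors = []
    for term_scores in centroid_scores:
        bits = 0
        for centroid_id, score in enumerate(term_scores):
            if score > threshold:
                bits |= 1 << centroid_id
        bit_vectors.append(bits)
    return bit_vectors


def prefilter_score(bit_vectors, passage_centroids):
    """Count query terms that have at least one close centroid in the passage."""
    passage_mask = 0
    for centroid_id in passage_centroids:
        passage_mask |= 1 << centroid_id
    return sum(1 for bits in bit_vectors if bits & passage_mask)


centroid_scores = [
    [0.9, 0.1, 0.2, 0.3],  # query term 0 against centroids 0..3
    [0.2, 0.8, 0.1, 0.7],  # query term 1 against centroids 0..3
]
bit_vectors = build_bit_vectors(centroid_scores, threshold=0.5)

# Each passage is described by the centroid IDs assigned to its tokens.
passages = {"p1": [0, 3], "p2": [2], "p3": [1]}
scores = {pid: prefilter_score(bit_vectors, ids) for pid, ids in passages.items()}
```

Passages with a low score can be discarded before the more expensive centroid interaction and late interaction stages run.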
Zhe Xu, Daoyuan Chen, Jiayi Kuang, Zihao Yi, Yaliang Li, Ying Shen
Emotional Support Conversation (ESC) systems are pivotal in providing
empathetic interactions, aiding users through negative emotional states by
understanding and addressing their unique experiences. In this paper, we tackle
two key challenges in ESC: enhancing contextually relevant and empathetic
response generation through dynamic demonstration retrieval, and advancing
cognitive understanding to grasp implicit mental states comprehensively. We
introduce Dynamic Demonstration Retrieval and Cognitive-Aspect Situation
Understanding, a novel approach that synergizes these elements to
improve the quality of support provided in ESCs. By leveraging in-context
learning and persona information, we introduce an innovative retrieval
mechanism that selects informative and personalized demonstration pairs. We
also propose a cognitive understanding module that utilizes four cognitive
relationships from the ATOMIC knowledge source to deepen situational awareness
of help-seekers' mental states. Our supportive decoder integrates information
from diverse knowledge sources, underpinning response generation that is both
empathetic and cognitively aware. The effectiveness of our approach is demonstrated
through extensive automatic and human evaluations, revealing substantial
improvements over numerous state-of-the-art models, with up to 13.79%
enhancement in overall performance across ten metrics. Our code is available for
public access to facilitate further research and development.
Authors' comments: Accepted by SIGIR 2024
Yushen Li, Jinpeng Wang, Tao Dai, Jieming Zhu, Jun Yuan, Rui Zhang, Shu-Tao Xia
Predicting click-through rates (CTR) is a fundamental task for Web
applications, where a key issue is to devise effective models for feature
interactions. Current methodologies predominantly concentrate on modeling
feature interactions within an individual sample, while overlooking the
potential cross-sample relationships that can serve as a reference context to
enhance the prediction. To make up for such deficiency, this paper develops a
Retrieval-Augmented Transformer (RAT), aiming to acquire fine-grained feature
interactions within and across samples. By retrieving similar samples, we
construct augmented input for each target sample. We then build Transformer
layers with cascaded attention to capture both intra- and cross-sample feature
interactions, facilitating comprehensive reasoning for improved CTR prediction
while retaining efficiency. Extensive experiments on real-world datasets
substantiate the effectiveness of RAT and suggest its advantage in long-tail
scenarios. The code has been open-sourced at
https://github.com/YushenLi807/WWW24-RAT.
Authors' comments: Accepted to The ACM Web Conference 2024 (WWW'24, short paper). Data
and code are available
Zhuo Chen, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Kewei Tu
In the era of large language models, applying techniques such as Retrieval
Augmented Generation can better address Open-Domain Question-Answering
problems. Due to constraints including model sizes and computing resources, the
length of context is often limited, and it becomes challenging to empower the
model to cover overlong contexts while answering questions from open domains.
This paper proposes a general and convenient method for covering longer contexts
in Open-Domain Question-Answering tasks. It leverages a small encoder language
model that effectively encodes contexts, and the encoding applies
cross-attention with the original inputs. With our method, the original language models
can cover several times longer contexts while keeping the computing
requirements close to the baseline. Our experiments demonstrate that after
fine-tuning, there is improved performance across two held-in datasets, four
held-out datasets, and also in two In Context Learning settings.
Authors' comments: ACL2023 Findings
Chull Hwan Song, Jooyoung Yoon, Taebaek Hwang, Shunghyun Choi, Yeong Hyeon Gu, Yannis Avrithis
How important is it for training and evaluation sets to not have class
overlap in image retrieval? We revisit Google Landmarks v2 clean, the most
popular training set, by identifying and removing class overlap with Revisited
Oxford and Paris [34], the most popular evaluation set. By comparing the
original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art
methods, our findings are striking. Not only is there a dramatic drop in
performance, but it is inconsistent across methods, changing the ranking. What
does it take to focus on objects of interest and ignore background clutter when
indexing? Do we need to train an object detector and the representation
separately? Do we need location supervision? We introduce Single-stage
Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect
objects of interest and extract a global image representation. We outperform
previous state-of-the-art on both existing training sets and the new
RGLDv2-clean. Our dataset is available at
https://github.com/dealicious-inc/RGLDv2-clean.
Authors' comments: CVPR2024 Accepted