Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, Fei Huang
In-context learning (ICL) has been instrumental in adapting Large Language Models (LLMs) to downstream tasks using correct input-output examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Principles (RICP), a novel teacher-student framework. In RICP, the teacher model analyzes mistakes from the student model to generate reasons and insights for preventing similar mistakes. These mistakes are clustered based on their underlying reasons for developing task-level principles, enhancing the error coverage of principles. During inference, the most relevant mistakes for each question are retrieved to create question-level principles, improving the customization of the provided guidance. RICP is orthogonal to existing prompting methods and does not require intervention from the teacher model during inference. Experimental results across seven reasoning benchmarks reveal that RICP effectively enhances performance when applied to various prompting strategies.
Jiaxin Ge, Xueying Jia, Vijay Viswanathan, Hongyin Luo, Graham Neubig
One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs' output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and results show that our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
Jacob Trzaska, Amit Ashok
We consider the problem of determining the spatial phase profile of a single-mode electromagnetic field. Our attention is on input states that are a statistical mixture of displaced and squeezed number states, a superset of Gaussian states. In particular, we derive the quantum Fisher information matrix (QFIM) for estimating the expansion coefficients of the wavefront in an orthonormal basis, finding that it is diagonal. Moreover, we show that a measurement saturating the QFIM always exists, and point to an adaptive strategy capable of implementing it. We then construct the optimal measurements for three particular states: mixtures of photon number, coherent, and single-mode squeezed vacuum states. Sensitivity of the measurements to nuisance parameters is explored.
João Rodrigues, António Branco
Retrieval-augmented generation draws on content retrieved from external sources to improve the performance of large language models in downstream tasks. However, the excessive volume of retrieved content, the possible dispersion of its parts, or their being out of focus may end up having a detrimental rather than a beneficial effect. To mitigate this issue and improve retrieval-augmented generation, we propose a method to refine the retrieved content before it is included in the prompt by resorting to meta-prompting optimization. Put to empirical test with the demanding multi-hop question answering task from the StrategyQA dataset, the evaluation results indicate that this method outperforms a similar retrieval-augmented system without this method by over 30%.
Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina
Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components, and with which adjustments, are needed to build a well-performing mRAG pipeline that can be used as a strong baseline in future works. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for the multilingual setting, to account for variations in the spelling of named entities. The main limitations to be addressed in future works include frequent code-switching in non-Latin alphabet languages, occasional fluency errors, wrong reading of the provided documents, or irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at https://github.com/naver/bergen.
Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju
The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they are limited by the need for resource-intensive pretraining, additional parameter requirements, unaddressed modality prioritization, and a lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly +3% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
Personalized dialogue generation, focusing on generating highly tailored
responses by leveraging persona profiles and dialogue context, has gained
significant attention in conversational AI applications. However, persona
profiles, a prevalent setting in current personalized dialogue datasets,
typically composed of merely four to five sentences, may not offer
comprehensive descriptions of the agent's persona, making it challenging
to generate truly personalized dialogues. To handle this problem, we propose
$\textbf{L}$earning Retrieval $\textbf{A}$ugmentation for
$\textbf{P}$ersonalized $\textbf{D}$ial$\textbf{O}$gue $\textbf{G}$eneration
($\textbf{LAPDOG}$), which studies the potential of leveraging external
knowledge for persona dialogue generation. Specifically, the proposed LAPDOG
model consists of a story retriever and a dialogue generator. The story
retriever uses a given persona profile as queries to retrieve relevant
information from the story document, which serves as a supplementary context to
augment the persona profile. The dialogue generator utilizes both the dialogue
history and the augmented persona profile to generate personalized responses.
For optimization, we adopt a joint training framework that collaboratively
learns the story retriever and dialogue generator, where the story retriever is
optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for
the dialogue generator to generate personalized responses. Experiments
conducted on the CONVAI2 dataset with ROCStory as a supplementary data source
show that the proposed LAPDOG method substantially outperforms the baselines,
indicating the effectiveness of the proposed method. The LAPDOG model code is
publicly available for further exploration.
https://github.com/hqsiswiliam/LAPDOG
Authors' comments: Accepted to EMNLP-2023
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong C. Park
Information retrieval models that aim to search for the documents relevant to the given query have shown many successes and have been applied to diverse tasks. However, the query provided by the user is oftentimes very short, which challenges the retrievers to correctly fetch relevant documents. To tackle this, existing studies have proposed expanding the query with a couple of additional (user-related) features related to the query. Yet such features may be insufficient to augment the query effectively, even though plenty of information is available in a relational database to augment it. Motivated by this, we present a novel retrieval framework called Database-Augmented Query representation (DAQu), which augments the original query with various (query-related) metadata across multiple tables. In addition, as the number of features in the metadata can be very large and there is no order among them, we encode them with our graph-based set encoding strategy, which considers hierarchies of features in the database without order. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database, demonstrating that our approach significantly enhances overall retrieval performance compared to existing query augmentation methods.
Tassallah Abdullahi, Ritambhara Singh, Carsten Eickhoff
Zero-shot text learning enables text classifiers to handle unseen classes
efficiently, alleviating the need for task-specific training data. A simple
approach often relies on comparing embeddings of query (text) to those of
potential classes. However, the embeddings of a simple query sometimes lack
rich contextual information, which hinders the classification performance.
Traditionally, this has been addressed by improving the embedding model with
expensive training. We introduce QZero, a novel training-free knowledge
augmentation approach that reformulates queries by retrieving supporting
categories from Wikipedia to improve zero-shot text classification performance.
Our experiments across six diverse datasets demonstrate that QZero enhances
performance for state-of-the-art static and contextual embedding models without
the need for retraining. Notably, in News and medical topic classification
tasks, QZero improves the performance of even the largest OpenAI embedding
model by at least 5% and 3%, respectively. Acting as a knowledge amplifier,
QZero enables small word embedding models to achieve performance levels
comparable to those of larger contextual models, offering the potential for
significant computational savings. Additionally, QZero offers meaningful
insights that illuminate query context and verify topic relevance, aiding in
understanding model predictions. Overall, QZero improves embedding-based
zero-shot classifiers while maintaining their simplicity. This makes it
particularly valuable for resource-constrained environments and domains with
constantly evolving information.
Authors' comments: Proceedings of the 2024 ACM SIGIR International Conference on the
Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington DC,
DC, USA
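The query reformulation idea in the QZero abstract above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the two-dimensional word vectors, the tiny `TOY_INDEX` standing in for Wikipedia, the token-overlap retriever, and all function names are hypothetical and chosen only to make the example self-contained.

```python
import math

# Hypothetical 2-d word vectors (axis 0 ~ sports, axis 1 ~ health).
WORD_VECS = {
    "sports": [1.0, 0.0],
    "health": [0.0, 1.0],
    "football": [0.9, 0.1],
    "cardiology": [0.1, 0.9],
}

# Hypothetical stand-in for a Wikipedia index: (title, text, categories).
TOY_INDEX = [
    ("Lionel Messi", "messi argentina forward goal", ["football"]),
    ("Statin", "statin cholesterol drug heart", ["cardiology"]),
]

def embed(text):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [WORD_VECS[t] for t in text.lower().split() if t in WORD_VECS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(2)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def retrieve_categories(query, k=1):
    """Return categories of the top-k entries by token overlap with the query."""
    q_tokens = set(query.lower().split())
    scored = [(len(q_tokens & set(text.split())), cats)
              for _, text, cats in TOY_INDEX]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [c for _, cats in scored[:k] for c in cats]

def reformulate(query):
    """Replace the query with retrieved categories, if any were found."""
    cats = retrieve_categories(query)
    return " ".join(cats) if cats else query

def classify(text, labels):
    """Pick the label whose embedding is closest to the text embedding."""
    q = embed(text)
    return max(labels, key=lambda label: cosine(q, embed(label)))

labels = ["health", "sports"]
print(classify(reformulate("messi goal last night"), labels))
print(classify(reformulate("new statin drug"), labels))
```

In this toy setup none of the raw query words has a vector, so the unreformulated query embeds to zero and carries no signal; the retrieved category name supplies the context without any retraining of the embedding model.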
Dung Ngoc Thai, Victor Ardulov, Jose Ulises Mena, Simran Tiwari, Gleb Erofeev, Ramy Eskander, Karim Tarabishy, Ravi B Parikh et al.
Identifying patient cohorts is fundamental to numerous healthcare tasks, including clinical trial recruitment and retrospective studies. Current cohort retrieval methods in healthcare organizations rely on automated queries of structured data combined with manual curation, which are time-consuming, labor-intensive, and often yield low-quality results. Recent advancements in large language models (LLMs) and information retrieval (IR) offer promising avenues to revolutionize these systems. Major challenges include managing extensive eligibility criteria and handling the longitudinal nature of unstructured Electronic Medical Records (EMRs) while ensuring that the solution remains cost-effective for real-world application. This paper introduces a new task, Automatic Cohort Retrieval (ACR), and evaluates the performance of LLMs and commercial, domain-specific neuro-symbolic approaches. We provide a benchmark task, a query dataset, an EMR dataset, and an evaluation framework. Our findings underscore the necessity for efficient, high-quality ACR systems capable of longitudinal reasoning across extensive patient databases.
Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried
While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks, including basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts especially with limited lexical overlap, and generators fail to improve with limited context lengths or abilities to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai
Existing methods for long video understanding primarily focus on videos only
lasting tens of seconds, with limited exploration of techniques for handling
longer videos. The increased number of frames in longer videos presents two
main challenges: difficulty in locating key information and performing
long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based
system designed for long video understanding. Our key idea is to convert the
long-video understanding problem into a long-document understanding task so as
to effectively leverage the power of large language models. Specifically,
DrVideo transforms a long video into a text-based long document to initially
retrieve key frames and augment the information of these frames, which is used
as the system's starting point. It then employs an agent-based iterative
loop to continuously search for missing information, augment relevant data, and
provide final predictions in a chain-of-thought manner once sufficient
question-related information is gathered. Extensive experiments on long video
benchmarks confirm the effectiveness of our method. DrVideo outperforms
existing state-of-the-art methods with +3.8 accuracy on the EgoSchema benchmark (3
minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode
(10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).
Authors' comments: 11 pages
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
In the real world, documents are organized in different formats and varied
modalities. Traditional retrieval pipelines require tailored document parsing
techniques and content extraction modules to prepare input for indexing. This
process is tedious, prone to errors, and loses information. To this end, we
propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that
regards document screenshots as a unified input format, which does not require
any content extraction preprocess and preserves all the information in a
document (e.g., text, image and layout). DSE leverages a large vision-language
model to directly encode document screenshots into dense representations for
retrieval. To evaluate our method, we first craft Wiki-SS, a dataset of
1.3M Wikipedia web page screenshots, as the corpus to answer the questions from
the Natural Questions dataset. In such a text-intensive document retrieval
setting, DSE shows competitive effectiveness compared to other text retrieval
methods relying on parsing. For example, DSE outperforms BM25 by 17 points in
top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide
retrieval, DSE significantly outperforms OCR text retrieval methods by over 15
points in nDCG@10. These experiments show that DSE is an effective document
retrieval paradigm for diverse types of documents. Model checkpoints, code, and
Wiki-SS collection will be released.
Authors' comments: EMNLP2024 main
Haike Xu, Zongyu Lin, Yizhou Sun, Kai-Wei Chang, Piotr Indyk
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. For retrieving arguments that contradict the query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit significant limitations. The former struggles to capture the essence of contradiction due to its inherent nature of favoring similarity, while the latter suffers from computational inefficiency, especially when the size of corpora is large. To address these challenges, we introduce a novel approach, SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We validate our model using the Arguana dataset, a benchmark dataset specifically geared towards contradiction retrieval, as well as synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval with more than 30% accuracy improvements on MSMARCO and HotpotQA across different model architectures but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.
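The combined metric mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the sparsity function or how the two terms are combined, so the following are assumptions made only for illustration: the Hoyer measure applied to the difference of two embeddings, a simple weighted sum with a hypothetical weight `alpha`, and invented toy vectors in place of trained sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hoyer_sparsity(x):
    """Hoyer measure in [0, 1]: 1 for a one-hot vector, 0 for a flat one."""
    n = len(x)
    l1 = sum(abs(a) for a in x)
    l2 = math.sqrt(sum(a * a for a in x))
    if n < 2 or l2 == 0.0:
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def contradiction_score(query, doc, alpha=1.0):
    """Assumed combination: similarity plus weighted sparsity of the difference."""
    diff = [a - b for a, b in zip(query, doc)]
    return cosine(query, doc) + alpha * hoyer_sparsity(diff)

def rank(query, docs, alpha=1.0):
    """Indices of docs sorted by descending contradiction score."""
    return sorted(range(len(docs)),
                  key=lambda i: contradiction_score(query, docs[i], alpha),
                  reverse=True)

query = [1.0, 1.0, 1.0, 1.0]
docs = [
    [0.5, 0.5, 0.5, 0.5],      # same direction: similar, dense (zero-sparsity) difference
    [1.0, 1.0, 1.0, -1.0],     # differs in one coordinate: sparse difference
    [-1.0, -1.0, -1.0, -1.0],  # opposite everywhere: dissimilar, dense difference
]
print(rank(query, docs))  # [1, 0, 2]
```

Because each score needs only a dot product and two norms per document, ranking a corpus reduces to vector arithmetic over precomputed embeddings, which is the efficiency argument the abstract makes against pairwise cross-encoder scoring.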
Yuxuan Mu, Shihao Zou, Kangning Yin, Zheng Tian, Li Cheng, Weinan Zhang, Jun Wang
In computer animation, driving a simulated character with lifelike motion is
challenging. Current generative models, though able to generalize to diverse
motions, often pose challenges to the responsiveness of end-user control. To
address these issues, we introduce RACon: Retrieval-Augmented Simulated
Character Locomotion Control. Our end-to-end hierarchical reinforcement
learning method utilizes a retriever and a motion controller. The retriever
searches motion experts from a user-specified database in a task-oriented
fashion, which boosts the responsiveness to the user's control. The selected
motion experts and the manipulation signal are then transferred to the
controller to drive the simulated character. In addition, a retrieval-augmented
discriminator is designed to stabilize the training process. Our method
surpasses existing techniques in both quality and quantity in locomotion
control, as demonstrated in our empirical study. Moreover, by switching
extensive databases for retrieval, it can adapt to distinctive motion types at
run time.
Authors' comments: Accepted in ICME2024 for oral presentation
Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang
Audio-text retrieval is a challenging task, requiring the search for an audio
clip or a text caption within a database. The predominant focus of existing
research on English descriptions poses a limitation on the applicability of
such models, given the abundance of non-English content in real-world data. To
address these linguistic disparities, we propose a language enhancement (LE),
using a multilingual text encoder (SONAR) to encode the text data with
language-specific information. Additionally, we optimize the audio encoder
through the application of consistent ensemble distillation (CED), enhancing
support for variable-length audio-text retrieval. Our methodology excels in
English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance
on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the
approach exhibits proficiency in retrieving content in seven other languages
with only 10% of additional language-enhanced training data, yielding promising
results. The source code is publicly available at
https://github.com/zyyan4/ml-clap.
Authors' comments: interspeech2024
Genta Indra Winata, Ruochen Zhang, David Ifeoluwa Adelani
Words have been represented in a high-dimensional vector space that encodes
their semantic similarities, enabling downstream applications such as
retrieving synonyms, antonyms, and relevant contexts. However, despite recent
advances in multilingual language models (LMs), the effectiveness of these
models' representations in semantic retrieval contexts has not been
comprehensively explored. To fill this gap, this paper introduces MINERS, a
benchmark designed to evaluate the ability of multilingual LMs in semantic
retrieval tasks, including bitext mining and classification via
retrieval-augmented contexts. We create a comprehensive framework to assess the
robustness of LMs in retrieving samples across over 200 diverse languages,
including extremely low-resource languages in challenging cross-lingual and
code-switching settings. Our results demonstrate that solely retrieving
semantically similar embeddings yields performance competitive with
state-of-the-art approaches, without requiring any fine-tuning.
Authors' comments: Accepted by EMNLP 2024 Findings
Matteo Gabburo, Nicolaas Paul Jedema, Siddhant Garg, Leonardo F. R. Ribeiro, Alessandro Moschitti
In this paper, we investigate which questions are challenging for
retrieval-based Question Answering (QA). We (i) propose retrieval complexity
(RC), a novel metric conditioned on the completeness of retrieved documents,
which measures the difficulty of answering questions, and (ii) propose an
unsupervised pipeline to measure RC given an arbitrary retrieval system. Our
proposed pipeline measures RC more accurately than alternative estimators,
including LLMs, on six challenging QA benchmarks. Further investigation reveals
that RC scores strongly correlate with both QA performance and expert judgment
across five of the six studied benchmarks, indicating that RC is an effective
measure of question difficulty. Subsequent categorization of high-RC questions
shows that they span a broad set of question shapes, including multi-hop,
compositional, and temporal QA, indicating that RC scores can categorize a new
subset of complex questions. Our system can also have a major impact on
retrieval-based systems by helping to identify more challenging questions on
existing datasets.
Authors' comments: Accepted to ACL 2024 (findings)
Jifei Luo, Hantao Yao, Changsheng Xu
Diffusion-based re-ranking is a common method used for retrieving instances
by performing similarity propagation in a nearest neighbor graph. However,
existing techniques that construct the affinity graph based on pairwise
instances can lead to the propagation of misinformation from outliers and other
manifolds, resulting in inaccurate results. To overcome this issue, we propose
a novel Cluster-Aware Similarity (CAS) diffusion for instance retrieval. The
primary concept of CAS is to conduct similarity diffusion within local
clusters, which can reduce the influence from other manifolds explicitly. To
obtain a symmetrical and smooth similarity matrix, our Bidirectional Similarity
Diffusion strategy introduces an inverse constraint term to the optimization
objective of local cluster diffusion. Additionally, we have optimized a
Neighbor-guided Similarity Smoothing approach to ensure similarity consistency
among the local neighbors of each instance. Evaluations in instance retrieval
and object re-identification validate the effectiveness of the proposed CAS.
Our code is publicly available.
Authors' comments: This paper has been accepted by ICML2024
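For readers unfamiliar with the baseline that the abstract above builds on, the following is a minimal sketch of plain similarity diffusion on a nearest-neighbor graph. It is not the proposed CAS method: it omits the cluster-aware, bidirectional, and smoothing components, and the similarity matrix, `alpha`, `k`, and iteration count are all assumed values chosen for illustration.

```python
import math

def knn_affinity(sim, k):
    """Keep each row's k largest off-diagonal similarities, then symmetrize."""
    n = len(sim)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]:
            w[i][j] = sim[i][j]
    return [[(w[i][j] + w[j][i]) / 2 for j in range(n)] for i in range(n)]

def normalize(w):
    """Symmetric normalization S = D^(-1/2) W D^(-1/2)."""
    n = len(w)
    deg = [sum(row) for row in w]
    return [[w[i][j] / math.sqrt(deg[i] * deg[j])
             if deg[i] > 0 and deg[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

def diffuse(s, query_idx, alpha=0.8, iters=50):
    """Iterate f <- alpha * S f + (1 - alpha) * y from a one-hot query vector y."""
    n = len(s)
    y = [1.0 if i == query_idx else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

# Toy similarities: item 2 is close to item 1, which is close to the query
# (item 0); item 3 is only weakly related to everything.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.8, 0.1],
    [0.1, 0.8, 1.0, 0.15],
    [0.2, 0.1, 0.15, 1.0],
]
scores = diffuse(normalize(knn_affinity(sim, k=1)), query_idx=0)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2, 3]
```

In this toy example the direct similarity to the query ranks item 3 above item 2, while diffusion promotes item 2 because it is reachable through item 1. The same propagation mechanism is what lets outliers leak scores across manifolds, which is the failure mode the abstract says CAS is designed to reduce.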
Tzu-Lin Kuo, Tzu-Wei Chiu, Tzung-Sheng Lin, Sheng-Yang Wu, Chao-Wei Huang, Yun-Nung Chen
Generative Retrieval (GR) is an emerging paradigm in information retrieval that leverages generative models to directly map queries to relevant document identifiers (DocIDs) without the need for traditional query processing or document reranking. This survey provides a comprehensive overview of GR, highlighting key developments, indexing and retrieval strategies, and challenges. We discuss various document identifier strategies, including numerical and string-based identifiers, and explore different document representation methods. Our primary contribution lies in outlining future research directions that could profoundly impact the field: improving the quality of query generation, exploring learnable document identifiers, enhancing scalability, and integrating GR with multi-task learning frameworks. By examining state-of-the-art GR techniques and their applications, this survey aims to provide a foundational understanding of GR and inspire further innovations in this transformative approach to information retrieval. We also make complementary materials, such as the paper collection, publicly available at https://github.com/MiuLab/GenIR-Survey/