Tassallah Abdullahi, Ritambhara Singh, Carsten Eickhoff
Zero-shot text learning enables text classifiers to handle unseen classes
efficiently, alleviating the need for task-specific training data. A simple
approach often relies on comparing embeddings of the query text to those of
potential classes. However, the embeddings of a simple query sometimes lack
rich contextual information, which hinders the classification performance.
Traditionally, this has been addressed by improving the embedding model with
expensive training. We introduce QZero, a novel training-free knowledge
augmentation approach that reformulates queries by retrieving supporting
categories from Wikipedia to improve zero-shot text classification performance.
Our experiments across six diverse datasets demonstrate that QZero enhances
performance for state-of-the-art static and contextual embedding models without
the need for retraining. Notably, in news and medical topic classification
tasks, QZero improves the performance of even the largest OpenAI embedding
model by at least 5% and 3%, respectively. Acting as a knowledge amplifier,
QZero enables small word embedding models to achieve performance levels
comparable to those of larger contextual models, offering the potential for
significant computational savings. Additionally, QZero offers meaningful
insights that illuminate query context and verify topic relevance, aiding in
understanding model predictions. Overall, QZero improves embedding-based
zero-shot classifiers while maintaining their simplicity. This makes it
particularly valuable for resource-constrained environments and domains with
constantly evolving information.
Authors' comments: Proceedings of the 2024 ACM SIGIR International Conference on the
Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington, DC,
USA
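A minimal sketch of the embedding-comparison classifier and the query reformulation idea described in the QZero abstract above. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_retriever` is a hypothetical stub in place of a Wikipedia category search, and replacing the query with the retrieved categories is only one possible way to reformulate it.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(text, labels):
    """Pick the label whose embedding is closest to the text embedding."""
    scores = {label: cosine(embed(text), embed(label)) for label in labels}
    return max(scores, key=scores.get), scores

def reformulate(query, retrieve_categories):
    """Replace the query with the categories a retriever returns for it."""
    return " ".join(retrieve_categories(query))

def fake_retriever(query):
    """Hypothetical retriever; a real system would search a Wikipedia index."""
    return ["association football", "sports competitions"]

labels = ["medicine", "sports"]
query = "the striker scored late"

# The raw query shares no vocabulary with either label, so all scores are 0.
plain_label, plain_scores = classify(query, labels)

# The reformulated query carries category words that overlap with "sports".
new_label, new_scores = classify(reformulate(query, fake_retriever), labels)
```

In this toy setting the raw query gives the classifier nothing to work with, while the reformulated query scores above zero for "sports". No embedding model is trained or changed at any point.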
Dung Ngoc Thai, Victor Ardulov, Jose Ulises Mena, Simran Tiwari, Gleb Erofeev, Ramy Eskander, Karim Tarabishy, Ravi B Parikh et al.
Identifying patient cohorts is fundamental to numerous healthcare tasks, including clinical trial recruitment and retrospective studies. Current cohort retrieval methods in healthcare organizations rely on automated queries of structured data combined with manual curation, which are time-consuming, labor-intensive, and often yield low-quality results. Recent advancements in large language models (LLMs) and information retrieval (IR) offer promising avenues to revolutionize these systems. Major challenges include managing extensive eligibility criteria and handling the longitudinal nature of unstructured Electronic Medical Records (EMRs) while ensuring that the solution remains cost-effective for real-world application. This paper introduces a new task, Automatic Cohort Retrieval (ACR), and evaluates the performance of LLMs and commercial, domain-specific neuro-symbolic approaches. We provide a benchmark task, a query dataset, an EMR dataset, and an evaluation framework. Our findings underscore the necessity for efficient, high-quality ACR systems capable of longitudinal reasoning across extensive patient databases.
Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried
While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks, including basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts especially with limited lexical overlap, and generators fail to improve with limited context lengths or abilities to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai
Existing methods for long video understanding primarily focus on videos
lasting only tens of seconds, with limited exploration of techniques for handling
longer videos. The increased number of frames in longer videos presents two
main challenges: difficulty in locating key information and performing
long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based
system designed for long video understanding. Our key idea is to convert the
long-video understanding problem into a long-document understanding task so as
to effectively leverage the power of large language models. Specifically,
DrVideo transforms a long video into a text-based long document to initially
retrieve key frames and augment the information of these frames, which serves
as the system's starting point. It then employs an agent-based iterative
loop to continuously search for missing information, augment relevant data, and
provide final predictions in a chain-of-thought manner once sufficient
question-related information is gathered. Extensive experiments on long video
benchmarks confirm the effectiveness of our method. DrVideo outperforms
existing state-of-the-art methods with +3.8 accuracy on the EgoSchema benchmark (3
minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode
(10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).
Authors' comments: 11 pages
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
In the real world, documents are organized in different formats and varied
modalities. Traditional retrieval pipelines require tailored document parsing
techniques and content extraction modules to prepare input for indexing. This
process is tedious, error-prone, and loses information. To this end, we
propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that
regards document screenshots as a unified input format, which does not require
any content extraction preprocessing and preserves all the information in a
document (e.g., text, image and layout). DSE leverages a large vision-language
model to directly encode document screenshots into dense representations for
retrieval. To evaluate our method, we first construct Wiki-SS, a dataset of
1.3M Wikipedia web page screenshots, as the corpus to answer the questions from
the Natural Questions dataset. In such a text-intensive document retrieval
setting, DSE shows competitive effectiveness compared to other text retrieval
methods relying on parsing. For example, DSE outperforms BM25 by 17 points in
top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide
retrieval, DSE significantly outperforms OCR text retrieval methods by over 15
points in nDCG@10. These experiments show that DSE is an effective document
retrieval paradigm for diverse types of documents. Model checkpoints, code, and
Wiki-SS collection will be released.
Authors' comments: EMNLP2024 main
Haike Xu, Zongyu Lin, Yizhou Sun, Kai-Wei Chang, Piotr Indyk
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. To retrieve arguments that contradict the query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit significant limitations. The former struggles to capture the essence of contradiction due to its inherent nature of favoring similarity, while the latter suffers from computational inefficiency, especially when the corpus is large. To address these challenges, we introduce a novel approach, SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We validate our model using the Arguana dataset, a benchmark dataset specifically geared towards contradiction retrieval, as well as synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval, with more than 30% accuracy improvements on MSMARCO and HotpotQA across different model architectures, but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.
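The combined metric mentioned in the SparseCL abstract above could look roughly like the sketch below. The abstract does not say which sparsity function is used or how the two terms are combined, so the Hoyer sparsity measure, the choice to apply it to the difference of the two embeddings, and the weight `alpha` are all assumptions made here for illustration, as are the toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hoyer_sparsity(x):
    """Hoyer measure: 1.0 for a one-hot vector, 0.0 for a uniform vector."""
    d = len(x)
    l1 = sum(abs(a) for a in x)
    l2 = math.sqrt(sum(a * a for a in x))
    if l2 == 0 or d < 2:
        return 0.0
    return (math.sqrt(d) - l1 / l2) / (math.sqrt(d) - 1)

def contradiction_score(query_vec, doc_vec, alpha=1.0):
    """Assumed combination: similarity plus sparsity of the embedding difference."""
    diff = [a - b for a, b in zip(query_vec, doc_vec)]
    return cosine(query_vec, doc_vec) + alpha * hoyer_sparsity(diff)

# Toy 4-d vectors (hypothetical values, not real sentence embeddings).
query = [1.0, 1.0, 1.0, 1.0]
differs_in_one_dim = [1.0, 1.0, 1.0, -1.0]   # difference is sparse
scaled_copy = [0.5, 0.5, 0.5, 0.5]           # difference is dense

score_sparse = contradiction_score(query, differs_in_one_dim)
score_dense = contradiction_score(query, scaled_copy)
```

Both terms are plain vector operations, which matches the abstract's point that scoring reduces to simple vector calculations. In this toy case the document that differs from the query in a single dimension outscores the scaled copy, even though the copy has the higher cosine similarity.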
Yuxuan Mu, Shihao Zou, Kangning Yin, Zheng Tian, Li Cheng, Weinan Zhang, Jun Wang
In computer animation, driving a simulated character with lifelike motion is
challenging. Current generative models, though able to generalize to diverse
motions, often pose challenges to the responsiveness of end-user control. To
address these issues, we introduce RACon: Retrieval-Augmented Simulated
Character Locomotion Control. Our end-to-end hierarchical reinforcement
learning method utilizes a retriever and a motion controller. The retriever
searches motion experts from a user-specified database in a task-oriented
fashion, which boosts the responsiveness to the user's control. The selected
motion experts and the manipulation signal are then transferred to the
controller to drive the simulated character. In addition, a retrieval-augmented
discriminator is designed to stabilize the training process. Our method
surpasses existing techniques both qualitatively and quantitatively in locomotion
control, as demonstrated in our empirical study. Moreover, by switching
extensive databases for retrieval, it can adapt to distinctive motion types at
run time.
Authors' comments: Accepted in ICME2024 for oral presentation
Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang
Audio-text retrieval is a challenging task, requiring the search for an audio
clip or a text caption within a database. The predominant focus of existing
research on English descriptions poses a limitation on the applicability of
such models, given the abundance of non-English content in real-world data. To
address these linguistic disparities, we propose language enhancement (LE),
using a multilingual text encoder (SONAR) to encode the text data with
language-specific information. Additionally, we optimize the audio encoder
through the application of consistent ensemble distillation (CED), enhancing
support for variable-length audio-text retrieval. Our methodology excels in
English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance
on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the
approach exhibits proficiency in retrieving content in seven other languages
with only 10% of additional language-enhanced training data, yielding promising
results. The source code is publicly available at
https://github.com/zyyan4/ml-clap.
Authors' comments: interspeech2024
Genta Indra Winata, Ruochen Zhang, David Ifeoluwa Adelani
Words have been represented in a high-dimensional vector space that encodes
their semantic similarities, enabling downstream applications such as
retrieving synonyms, antonyms, and relevant contexts. However, despite recent
advances in multilingual language models (LMs), the effectiveness of these
models' representations in semantic retrieval contexts has not been
comprehensively explored. To fill this gap, this paper introduces MINERS, a
benchmark designed to evaluate the ability of multilingual LMs in semantic
retrieval tasks, including bitext mining and classification via
retrieval-augmented contexts. We create a comprehensive framework to assess the
robustness of LMs in retrieving samples across over 200 diverse languages,
including extremely low-resource languages in challenging cross-lingual and
code-switching settings. Our results demonstrate that solely retrieving
semantically similar embeddings yields performance competitive with
state-of-the-art approaches, without requiring any fine-tuning.
Authors' comments: Accepted by EMNLP 2024 Findings
Matteo Gabburo, Nicolaas Paul Jedema, Siddhant Garg, Leonardo F. R. Ribeiro, Alessandro Moschitti
In this paper, we investigate which questions are challenging for
retrieval-based Question Answering (QA). We (i) propose retrieval complexity
(RC), a novel metric conditioned on the completeness of retrieved documents,
which measures the difficulty of answering questions, and (ii) propose an
unsupervised pipeline to measure RC given an arbitrary retrieval system. Our
proposed pipeline measures RC more accurately than alternative estimators,
including LLMs, on six challenging QA benchmarks. Further investigation reveals
that RC scores strongly correlate with both QA performance and expert judgment
across five of the six studied benchmarks, indicating that RC is an effective
measure of question difficulty. Subsequent categorization of high-RC questions
shows that they span a broad set of question shapes, including multi-hop,
compositional, and temporal QA, indicating that RC scores can categorize a new
subset of complex questions. Our system can also have a major impact on
retrieval-based systems by helping to identify more challenging questions on
existing datasets.
Authors' comments: Accepted to ACL 2024 (findings)
Jifei Luo, Hantao Yao, Changsheng Xu
Diffusion-based re-ranking is a common method used for retrieving instances
by performing similarity propagation in a nearest neighbor graph. However,
existing techniques that construct the affinity graph based on pairwise
instances can lead to the propagation of misinformation from outliers and other
manifolds, resulting in inaccurate results. To overcome this issue, we propose
a novel Cluster-Aware Similarity (CAS) diffusion for instance retrieval. The
primary concept of CAS is to conduct similarity diffusion within local
clusters, which can reduce the influence from other manifolds explicitly. To
obtain a symmetrical and smooth similarity matrix, our Bidirectional Similarity
Diffusion strategy introduces an inverse constraint term to the optimization
objective of local cluster diffusion. Additionally, we have optimized a
Neighbor-guided Similarity Smoothing approach to ensure similarity consistency
among the local neighbors of each instance. Evaluations in instance retrieval
and object re-identification validate the effectiveness of the proposed CAS.
Our code is publicly available.
Authors' comments: This paper has been accepted by ICML2024
Tzu-Lin Kuo, Tzu-Wei Chiu, Tzung-Sheng Lin, Sheng-Yang Wu, Chao-Wei Huang, Yun-Nung Chen
Generative Retrieval (GR) is an emerging paradigm in information retrieval that leverages generative models to directly map queries to relevant document identifiers (DocIDs) without the need for traditional query processing or document reranking. This survey provides a comprehensive overview of GR, highlighting key developments, indexing and retrieval strategies, and challenges. We discuss various document identifier strategies, including numerical and string-based identifiers, and explore different document representation methods. Our primary contribution lies in outlining future research directions that could profoundly impact the field: improving the quality of query generation, exploring learnable document identifiers, enhancing scalability, and integrating GR with multi-task learning frameworks. By examining state-of-the-art GR techniques and their applications, this survey aims to provide a foundational understanding of GR and inspire further innovations in this transformative approach to information retrieval. We also make complementary materials, such as the paper collection, publicly available at https://github.com/MiuLab/GenIR-Survey/
Akshat Mohan Dasula, Hrushitha Tigulla, Preethika Bhukya
Traditionally in the domain of legal research, the retrieval of pertinent
citations from intricate case descriptions has demanded manual effort and
keyword-based search applications that mandate expertise in understanding legal
jargon. Legal case descriptions hold pivotal information for legal
professionals and researchers, necessitating more efficient and automated
approaches. We propose a methodology that combines natural language processing
(NLP) and machine learning techniques to enhance the organization and
utilization of legal case descriptions. This approach revolves around the
creation of textual embeddings with the help of state-of-the-art embedding models.
Our methodology addresses two primary objectives: unsupervised clustering and
supervised citation retrieval, both designed to automate the citation
extraction process. Although the proposed methodology can be used for any
dataset, we employed the Supreme Court of the United States (SCOTUS) dataset,
yielding remarkable results. Our methodology achieved an impressive accuracy
rate of 90.9%. By automating labor-intensive processes, we pave the way for a
more efficient, time-saving, and accessible landscape in legal research,
benefiting legal professionals, academics, and researchers.
Authors' comments: 14 pages, 16 images, Submitted to Multimedia Tools and Applications
Springer journal
Yu Wang, Nedim Lipka, Ruiyi Zhang, Alexa Siu, Yuying Zhao, Bo Ni, Xin Wang, Ryan Rossi et al.
Despite the impressive advancements of Large Language Models (LLMs) in generating text, they are often limited by the knowledge contained in the input and prone to producing inaccurate or hallucinated content. To tackle these issues, Retrieval-augmented Generation (RAG) is employed as an effective strategy to enhance the available knowledge base and anchor the responses in reality by pulling additional texts from external databases. In real-world applications, texts are often linked through entities within a graph, such as citations in academic papers or comments in social networks. This paper exploits these topological relationships to guide the retrieval process in RAG. Specifically, we explore two kinds of topological connections: proximity-based, focusing on closely connected nodes, and role-based, which looks at nodes sharing similar subgraph structures. Our empirical research confirms their relevance to text relationships, leading us to develop a Topology-aware Retrieval-augmented Generation framework. This framework includes a retrieval module that selects texts based on their topological relationships and an aggregation module that integrates these texts into prompts to stimulate LLMs for text generation. We have curated established text-attributed networks and conducted comprehensive experiments to validate the effectiveness of this framework, demonstrating its potential to enhance RAG with topological awareness.
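As a rough illustration of the proximity-based connection described in the abstract above, the sketch below ranks texts by hop distance from a target node in a small citation-style graph. The abstract does not specify the proximity measure, so hop distance via breadth-first search is an assumption here; the graph, the node names, and the `max_hops` and `top_k` parameters are all hypothetical.

```python
from collections import deque

def hop_distances(graph, source, max_hops):
    """Breadth-first search that stops expanding at max_hops from the source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand nodes already at the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def retrieve_by_proximity(graph, texts, source, max_hops=2, top_k=2):
    """Return the texts of the top_k nodes closest to the source node."""
    dist = hop_distances(graph, source, max_hops)
    candidates = sorted((d, n) for n, d in dist.items() if n != source)
    return [texts[n] for _, n in candidates[:top_k]]

# Toy undirected chain p1 - p2 - p3 - p4 (hypothetical data).
graph = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3"]}
texts = {"p1": "target paper", "p2": "direct citation",
         "p3": "two hops away", "p4": "three hops away"}

context = retrieve_by_proximity(graph, texts, "p1")
```

The returned texts would then be placed into the prompt by an aggregation step, which this sketch leaves out.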
Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, Prateek Mittal
Retrieval-augmented generation (RAG) has been shown to be vulnerable to retrieval corruption attacks: an attacker can inject malicious passages into retrieval results to induce inaccurate responses. In this paper, we propose RobustRAG as the first defense framework against retrieval corruption attacks. The key insight of RobustRAG is an isolate-then-aggregate strategy: we get LLM responses from each passage in isolation and then securely aggregate these isolated responses. To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses. Notably, RobustRAG can achieve certifiable robustness: we can formally prove and certify that, for certain queries, RobustRAG can always return accurate responses, even when the attacker has full knowledge of our defense and can arbitrarily inject a small number of malicious passages. We evaluate RobustRAG on open-domain QA and long-form text generation datasets and demonstrate its effectiveness and generalizability across various tasks and datasets.
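The isolate-then-aggregate strategy with keyword-based aggregation can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the keyword extraction, the stopword list, the `min_support` threshold, and the `fake_llm` stub (which simply echoes the passage instead of calling a model) are assumptions introduced here.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "in", "was"}

def keywords(response):
    """Lowercased content words of a response, as a set (one vote per word)."""
    words = {w.strip(".,").lower() for w in response.split()}
    return words - STOPWORDS - {""}

def isolate_then_aggregate(passages, answer_from_passage, min_support):
    """Answer from each passage separately, then keep well-supported keywords."""
    # Isolation: each response sees exactly one passage.
    responses = [answer_from_passage(p) for p in passages]
    # Aggregation: count in how many isolated responses each keyword appears.
    counts = Counter()
    for response in responses:
        counts.update(keywords(response))
    return sorted(w for w, c in counts.items() if c >= min_support)

def fake_llm(passage):
    """Hypothetical stand-in for an LLM call; echoes the passage."""
    return passage

passages = [
    "Paris is the capital of France.",
    "The capital is Paris.",
    "The capital is Berlin.",  # plays the role of an injected passage
]

kept = isolate_then_aggregate(passages, fake_llm, min_support=2)
```

Because each passage is processed in isolation, a single injected passage can add at most one vote per keyword, so "berlin" falls below the threshold while "capital" and "paris" survive. The sketch stops at the filtered keyword set and does not turn it into a final answer.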
Bill Psomas, Ioannis Kakogeorgiou, Nikos Efthymiadis, Giorgos Tolias, Ondrej Chum, Yannis Avrithis, Konstantinos Karantzalos
This work introduces composed image retrieval to remote sensing. It allows
querying a large image archive by image examples altered by a textual
description, enriching the descriptive power over unimodal queries, either
visual or textual. Various attributes can be modified by the textual part, such
as shape, color, or context. A novel method fusing image-to-image and
text-to-image similarity is introduced. We demonstrate that a vision-language
model possesses sufficient descriptive power and no further learning step or
training data are necessary. We present a new evaluation benchmark focused on
color, context, density, existence, quantity, and shape modifications. Our work
not only sets the state-of-the-art for this task, but also serves as a
foundational step in addressing a gap in the field of remote sensing image
retrieval. Code at: https://github.com/billpsomas/rscir
Authors' comments: Accepted for ORAL presentation at the 2024 IEEE International
Geoscience and Remote Sensing Symposium
Laura Dietz
This resource paper addresses the challenge of evaluating Information
Retrieval (IR) systems in the era of autoregressive Large Language Models
(LLMs). Traditional methods relying on passage-level judgments are no longer
effective due to the diversity of responses generated by LLM-based systems. We
provide a workbench to explore several alternative evaluation approaches to
judge the relevance of a system's response that incorporate LLMs: 1. Asking an
LLM whether the response is relevant; 2. Asking the LLM which set of nuggets
(i.e., relevant key facts) is covered in the response; 3. Asking the LLM to
answer a set of exam questions with the response.
This workbench aims to facilitate the development of new, reusable test
collections. Researchers can manually refine sets of nuggets and exam
questions, observing their impact on system evaluation and leaderboard
rankings.
Resource available at https://github.com/TREMA-UNH/autograding-workbench
Authors' comments: 10 pages. To appear in the Resource & Reproducibility Track of SIGIR
2024
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara
The objective of image captioning models is to bridge the gap between the
visual and linguistic modalities by generating natural language descriptions
that accurately reflect the content of input images. In recent years,
researchers have leveraged deep learning-based models and made advances in the
extraction of visual features and the design of multimodal connections to
tackle this task. This work presents a novel approach towards developing image
captioning models that utilize an external kNN memory to improve the generation
process. Specifically, we propose two model variants that incorporate a
knowledge retriever component that is based on visual similarities, a
differentiable encoder to represent input images, and a kNN-augmented language
model to predict tokens based on contextual cues and text retrieved from the
external memory. We experimentally validate our approach on COCO and nocaps
datasets and demonstrate that incorporating an explicit external memory can
significantly enhance the quality of captions, especially with a larger
retrieval corpus. This work provides valuable insights into retrieval-augmented
captioning models and opens up new avenues for improving image captioning at a
larger scale.
Authors' comments: ACM Transactions on Multimedia Computing, Communications and
Applications (2024)
Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang
With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further enhance the performance. We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance. The results show that our system, DocReLM, achieves a Top 10 accuracy of 44.12% in computer vision, compared to Google Scholar's 15.69%, and 36.21% in quantum physics, compared to Google Scholar's 12.96%.
Jean-Jacques Godeme, Jalal Fadili, Claude Amra, Myriam Zerrad
In this paper, we aim to reconstruct an n-dimensional real vector from m phaseless measurements corrupted by additive noise. We extend the noiseless framework developed in [15], based on mirror descent (or Bregman gradient descent), to deal with noisy measurements and prove that the procedure is stable to (small enough) additive noise. In the deterministic case, we show that mirror descent converges to a critical point of the phase retrieval problem, and if the algorithm is well initialized and the noise is small enough, the critical point is near the true vector up to a global sign change. When the measurements are i.i.d. Gaussian and the signal-to-noise ratio is large enough, we provide global convergence guarantees that ensure that, with high probability, mirror descent converges to a global minimizer near the true vector (up to a global sign change), as soon as the number of measurements m is large enough. The sample complexity bound can be improved if a spectral method is used to provide a good initial guess. We complement our theoretical study with several numerical results showing that mirror descent is both a computationally and statistically efficient scheme to solve the phase retrieval problem.