Jianqiang Shen, Yuchin Juan, Shaobo Zhang, Ping Liu, Wen Pu, Sriram Vasudevan, Qingquan Song, Fedor Borisyuk et al.
Web-scale search systems typically tackle the scalability challenge with a two-step paradigm: retrieval and ranking. The retrieval step, also known as candidate selection, often involves extracting standardized entities, creating an inverted index, and performing term matching for retrieval. Such traditional methods require manual and time-consuming development of query models. In this paper, we discuss applying learning-to-retrieve technology to enhance LinkedIn's job search and recommendation systems. In the realm of promoted jobs, the key objective is to improve the quality of applicants, thereby delivering value to recruiter customers. To achieve this, we leverage confirmed hire data to construct a graph that evaluates a seeker's qualification for a job, and utilize learned links for retrieval. Our learned model is easy to explain, debug, and adjust. On the other hand, the focus for organic jobs is to optimize seeker engagement. We accomplished this by training embeddings for personalized retrieval, fortified by a set of rules derived from the categorization of member feedback. In addition to a solution based on a conventional inverted index, we developed an on-GPU solution capable of supporting both KNN and term matching efficiently.
Guangzhi Xiong, Qiao Jin, Zhiyong Lu, Aidong Zhang
While large language models (LLMs) have achieved state-of-the-art performance
on a wide range of medical question answering (QA) tasks, they still face
challenges with hallucinations and outdated knowledge. Retrieval-augmented
generation (RAG) is a promising solution and has been widely adopted. However,
a RAG system can involve multiple flexible components, and there is a lack of
best practices regarding the optimal RAG setting for various medical purposes.
To systematically evaluate such systems, we propose the Medical Information
Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind
benchmark including 7,663 questions from five medical QA datasets. Using
MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt
tokens on 41 combinations of different corpora, retrievers, and backbone LLMs
through the MedRAG toolkit introduced in this work. Overall, MedRAG improves
the accuracy of six different LLMs by up to 18% over chain-of-thought
prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our
results show that the combination of various medical corpora and retrievers
achieves the best performance. In addition, we discovered a log-linear scaling
property and the "lost-in-the-middle" effects in medical RAG. We believe our
comprehensive evaluations can serve as practical guidelines for implementing
RAG systems for medicine.
Authors' comments: Homepage: https://teddy-xionggz.github.io/benchmark-medical-rag/
Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, Tao Yu
Recently, retrieval-augmented generation (RAG) has been successfully
applied in code generation. However, existing pipelines for retrieval-augmented
code generation (RACG) employ static knowledge bases with a single source,
limiting the adaptation capabilities of Large Language Models (LLMs) to domains
they have insufficient knowledge of. In this work, we develop a novel pipeline,
EVOR, that employs the synchronous evolution of both queries and diverse
knowledge bases. On two realistic settings where the external knowledge is
required to solve code generation tasks, we compile four new datasets
associated with frequently updated libraries and long-tail programming
languages, named EVOR-BENCH. Extensive experiments demonstrate that EVOR
achieves two to four times the execution accuracy of other methods such
as Reflexion (Shinn et al., 2024), DocPrompting (Zhou et al., 2023), etc. We
demonstrate that EVOR is flexible and can be easily combined with them to
achieve further improvement. Further analysis reveals that EVOR benefits from
the synchronous evolution of queries and documents and the diverse information
sources in the knowledge base. We hope that our studies will inspire more
insights into the design of advanced RACG pipelines in future research. Our
model, code, and data are available at https://arks-codegen.github.io.
Authors' comments: Retrieval-augmented code generation
Thong Nguyen, Mariya Hendriksen, Andrew Yates
Learned Sparse Retrieval (LSR) is a group of neural methods designed to
encode queries and documents into sparse lexical vectors. These vectors can be
efficiently indexed and retrieved using an inverted index. While LSR has shown
promise in text retrieval, its potential in multi-modal retrieval remains
largely unexplored. Motivated by this, in this work, we explore the application
of LSR in the multi-modal domain, i.e., we focus on Multi-Modal Learned Sparse
Retrieval (MLSR). We conduct experiments using several MLSR model
configurations and evaluate the performance on the image suggestion task. We
find that solving the task solely based on the image content is challenging.
Enriching the image content with its caption improves the model performance
significantly, implying the importance of image captions to provide
fine-grained concepts and context information of images. Our approach presents
a practical and effective solution for training LSR retrieval models in
multi-modal settings.
Authors' comments: 5 pages, TREC 2023
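The inverted-index retrieval over sparse lexical vectors described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the term weights, document ids, and the function names `build_inverted_index` and `search` are hypothetical, and the weights stand in for whatever a learned sparse encoder would produce.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:  # only non-zero weights need a posting
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, top_k=3):
    """Rank documents by the dot product of sparse query and document vectors."""
    scores = defaultdict(float)
    for term, query_weight in query_vector.items():
        # Only documents that share a term with the query are touched.
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical term weights (assumed, not taken from the paper).
docs = {
    "img1": {"dog": 1.5, "beach": 0.5},
    "img2": {"cat": 2.0, "sofa": 1.0},
    "img3": {"dog": 0.5, "park": 1.0},
}
index = build_inverted_index(docs)
results = search(index, {"dog": 1.0, "park": 2.0})
print(results)  # [('img3', 2.5), ('img1', 1.5)]
```

Because scoring only walks the postings of query terms, documents with no overlapping term (here `img2`) are never scored, which is what makes sparse vectors cheap to serve from an inverted index.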
Oron Nir, Idan Vidra, Avi Neeman, Barak Kinarti, Ariel Shamir
Streamlining content discovery within media archives requires integrating advanced data representations and effective visualization techniques for clear communication of video topics to users. The proposed system addresses the challenge of efficiently navigating large video collections by exploiting a fusion of visual, audio, and textual features to accurately index and categorize video content through a text-based method. Additionally, semantic embeddings are employed to provide contextually relevant information and recommendations to users, resulting in an intuitive and engaging exploratory experience over our topics ontology map using OpenAI GPT-4.
Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Guido Zuccon
This paper introduces a novel unsupervised technique that utilizes large language models (LLMs) to determine the most suitable dense retriever for a specific test (target) corpus. Selecting the appropriate dense retriever is vital for numerous IR applications that employ these retrievers, trained on public datasets, to encode or conduct searches within a new private target corpus. The effectiveness of a dense retriever can significantly diminish when applied to a target corpus that diverges in domain or task from the original training set. The problem becomes more pronounced in cases where the target corpus is unlabeled, e.g., in zero-shot scenarios, rendering direct evaluation of the model's effectiveness on the target corpus unattainable. Therefore, the unsupervised selection of an optimally pre-trained dense retriever, especially under conditions of domain shift, emerges as a critical challenge. Existing methodologies for ranking dense retrievers fall short in addressing these domain shift scenarios. To tackle this, our method capitalizes on LLMs to create pseudo-relevant queries, labels, and reference lists by analyzing a subset of documents from the target corpus. This allows for the ranking of dense retrievers based on their performance with these pseudo-relevant signals. Significantly, this strategy is the first to depend exclusively on the target corpus data, removing the necessity for training data and test labels. We assessed the effectiveness of our approach by compiling a comprehensive pool of cutting-edge dense retrievers and comparing our method against traditional dense retriever selection benchmarks. The findings reveal that our proposed solution surpasses the existing benchmarks in both the selection and ranking of dense retrievers.
Jingxi Xu, Yinsen Jia, Dongxiao Yang, Patrick Meng, Xinyue Zhu, Zihan Guo, Shuran Song, Matei Ciocarlie
We introduce GEOTACT, a robotic manipulation method capable of retrieving objects buried in granular media. This is a challenging task due to the need to interact with granular media, and doing so based exclusively on tactile feedback, since a buried object can be completely hidden from vision. Tactile feedback is in itself challenging in this context, due to ubiquitous contact with the surrounding media, and the inherent noise level induced by the tactile readings. To address these challenges, we use a learning method trained end-to-end with simulated sensor noise. We show that our problem formulation leads to the natural emergence of learned pushing behaviors that the manipulator uses to reduce uncertainty and funnel the object to a stable grasp despite spurious and noisy tactile readings. We also introduce a training curriculum that enables learning these behaviors in simulation, followed by zero-shot transfer to real hardware. To the best of our knowledge, GEOTACT is the first method to reliably retrieve a number of different objects from a granular environment, doing so on real hardware and with integrated tactile sensing. Videos and additional information can be found at https://jxu.ai/geotact.
EuiYul Song, Philhoon Oh, Sangryul Kim, James Thorne
Modern deterministic retrieval pipelines prioritize achieving
state-of-the-art performance but often lack interpretability in
decision-making. These models face challenges in assessing uncertainty, leading
to overconfident predictions. To overcome these limitations, we integrate
uncertainty calibration and interpretability into a retrieval pipeline.
Specifically, we introduce Bayesian methodologies and multi-perspective
retrieval to calibrate uncertainty within a retrieval pipeline. We incorporate
techniques such as LIME and SHAP to analyze the behavior of a black-box
reranker model. The importance scores derived from these explanation
methodologies serve as supplementary relevance scores to enhance the base
reranker model. We evaluate the resulting performance enhancements achieved
through uncertainty calibration and interpretable reranking on Question
Answering and Fact Checking tasks. Our methods demonstrate substantial
performance improvements across three KILT datasets.
Authors' comments: 15 pages, 7 figures
EuiYul Song, Sangryul Kim, Haeju Lee, Joonkee Kim, James Thorne
Generative retrieval models encode pointers to information in a corpus as an
index within the model's parameters. These models serve as part of a larger
pipeline, where retrieved information conditions generation for
knowledge-intensive NLP tasks. However, we identify two limitations. First,
generative retrieval does not account for contextual information. Second, the
retrieval cannot be tuned for the downstream readers, as decoding the page title
is a non-differentiable operation. This paper introduces Re3val, trained with
generative reranking and reinforcement learning using limited data. Re3val
leverages context acquired via Dense Passage Retrieval to rerank the retrieved
page titles and utilizes REINFORCE to maximize rewards generated by constrained
decoding. Additionally, we generate questions from our pre-training dataset to
mitigate epistemic uncertainty and bridge the domain gap between the
pre-training and fine-tuning datasets. Subsequently, we extract and rerank
contexts from the KILT database using the reranked page titles. Upon grounding
the top five reranked contexts, Re3val achieves the top-1 KILT scores
among all generative retrieval models across five KILT datasets.
Authors' comments: 17 pages, 4 figures, Findings of the Association for Computational
Linguistics: EACL 2023
Siwei Wu, Yizhi Li, Kang Zhu, Ge Zhang, Yiming Liang, Kaijing Ma, Chenghao Xiao, Haoran Zhang et al.
Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at https://github.com/Wusiwei0410/SciMMIR.
Mathias Vast, Yuxuan Zong, Basile Van Cooten, Benjamin Piwowarski, Laure Soulier
In Information Retrieval, and more generally in Natural Language Processing,
adapting models to specific domains is conducted through fine-tuning. Despite
the successes achieved by this method and its versatility, the need for
human-curated and labeled data makes it impractical to transfer to new tasks,
domains, and/or languages when training data doesn't exist. Using the model
without training (zero-shot) is another option, but it incurs an
effectiveness cost, especially in the case of first-stage retrievers. Numerous
research directions have emerged to tackle these issues, most of them in the
context of adapting to a task or a language. However, the literature is scarcer
for domain (or topic) adaptation. In this paper, we address this issue of
cross-topic discrepancy for a sparse first-stage retriever by transposing a
method initially designed for language adaptation. By leveraging pre-training
on the target data to learn domain-specific knowledge, this technique
alleviates the need for annotated data and expands the scope of domain
adaptation. Despite their relatively good generalization ability, we show that
even sparse retrievers can benefit from our simple domain adaptation method.
Authors' comments: Accepted at ECIR 2024
Stéphanie Juneau, Alice Jacques, Steve Pothier, Adam S. Bolton, Benjamin A. Weaver, Ragadeepika Pucha, Sean McManus, Robert Nikutta et al.
SPectra Analysis and Retrievable Catalog Lab (SPARCL) at NOIRLab's Astro Data
Lab was created to efficiently serve large optical and infrared spectroscopic
datasets. It consists of services, tools, example workflows and currently
contains spectra for over 7.5 million stars, galaxies and quasars from the
Sloan Digital Sky Survey (SDSS) and the Dark Energy Spectroscopic Instrument
(DESI) survey. We aim to eventually support the broad range of spectroscopic
datasets that will be hosted at NOIRLab and beyond. Major elements of SPARCL
include capabilities to discover and query for spectra based on parameters of
interest, a fast web service that delivers desired spectra either individually
or in bulk as well as documentation and example Jupyter Notebooks to empower
users in their research. More information is available on the SPARCL website
(https://astrosparcl.datalab.noirlab.edu).
Authors' comments: 4 pages, 1 figure, Conference Proceedings for ADASS 2023
(Astronomical Data Analysis Software & Systems XXXIII). Revised figure 1
(text is unchanged)
Arian Askari, Zihui Yang, Zhaochun Ren, Suzan Verberne
The task of answer retrieval in the legal domain aims to help users to seek
relevant legal advice from massive amounts of professional responses. Two main
challenges hinder applying existing answer retrieval approaches in other
domains to the legal domain: (1) a huge knowledge gap between lawyers and
non-professionals; and (2) a mix of informal and formal content on legal QA
websites. To tackle these challenges, we propose CE_FS, a novel cross-encoder
(CE) re-ranker based on the fine-grained structured inputs. CE_FS uses
additional structured information in the CQA data to improve the effectiveness
of cross-encoder re-rankers. Furthermore, we propose LegalQA: a real-world
benchmark dataset for evaluating answer retrieval in the legal domain.
Experiments conducted on LegalQA show that our proposed method significantly
outperforms strong cross-encoder re-rankers fine-tuned on MS MARCO. Our novel
finding is that adding the question tags of each question besides the question
description and title into the input of cross-encoder re-rankers structurally
boosts the rankers' effectiveness. While we study our proposed method in the
legal domain, we believe that our method can be applied in similar applications
in other domains.
Authors' comments: accepted at ECIR 2024
Nikhilesh Bhatnagar, Ashok Urlana, Vandan Mujadia, Pruthwik Mishra, Dipti Misra Sharma
Cross-lingual summarization involves summarizing text written in one
language into a different one. There is a body of research addressing
cross-lingual summarization from English to other European languages. In this
work, we aim to perform cross-lingual summarization from English to Hindi. We
propose that pairing up the coverage of newsworthy events in textual and video
format can be helpful for data acquisition for cross-lingual
summarization. We analyze the data and propose methods to match articles to
video descriptions that serve as document and summary pairs. We also outline
filtering methods over reasonable thresholds to ensure the correctness of the
summaries. Further, we make available 28,583 mono and cross-lingual
article-summary pairs at https://github.com/tingc9/Cross-Sum-News-Aligned. We also
build and analyze multiple baselines on the collected data and report error
analysis.
Authors' comments: 6 pages, 6 tables, 2 figures, conference: ICON 2023
Raviteja Anantha, Bortik Bandyopadhyay, Anirudh Kashi, Sayantan Mahinder, Andrew W Hill, Srinivas Chappidi
Large language models (LLMs) are increasingly employed for complex multi-step
planning tasks, where the tool retrieval (TR) step is crucial for achieving
successful outcomes. Two prevalent approaches for TR are single-step retrieval,
which utilizes the complete query, and sequential retrieval using task
decomposition (TD), where a full query is segmented into discrete atomic
subtasks. While single-step retrieval lacks the flexibility to handle
"inter-tool dependency," the TD approach necessitates maintaining "subtask-tool
atomicity alignment," as the toolbox can evolve dynamically. To address these
limitations, we introduce the Progressive Tool retrieval to Improve Planning
(ProTIP) framework. ProTIP is a lightweight, contrastive learning-based
framework that implicitly performs TD without the explicit requirement of
subtask labels, while simultaneously maintaining subtask-tool atomicity. On the
ToolBench dataset, ProTIP outperforms the ChatGPT task decomposition-based
approach by a remarkable margin, achieving a 24% improvement in Recall@K=10 for
TR and a 41% enhancement in tool accuracy for plan generation.
Authors' comments: preprint version
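For readers unfamiliar with the Recall@K metric reported in the abstract above, a minimal sketch follows. The tool names are hypothetical, and the definition used here (fraction of the relevant items found in the top K) is the common one; the paper may compute the metric with a different convention.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear among the top-k ranked items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention assumed here for queries with no relevant item
    # A set intersection avoids double counting duplicated entries.
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical retrieved tools and ground-truth tools for one query.
ranked_tools = ["search", "calendar", "email", "maps"]
relevant_tools = {"email", "maps", "weather"}

print(recall_at_k(ranked_tools, relevant_tools, k=3))  # one of three relevant tools found
print(recall_at_k(ranked_tools, relevant_tools, k=4))  # two of three relevant tools found
```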
Hangfei Lin, Li Miao, Amir Ziai
Few-shot image classification is the task of classifying unseen images into one of N mutually exclusive classes, using only a small number of training examples for each class. The limited availability of these examples (denoted as K) presents a significant challenge to classification accuracy in some cases. To address this, we have developed a method for augmenting the set of K with an additional set of A retrieved images. We call this system Retrieval-Augmented Few-shot Image Classification (RAFIC). Through a series of experiments, we demonstrate that RAFIC markedly improves performance of few-shot image classification across two challenging datasets. RAFIC consists of two main components: (a) a retrieval component which uses CLIP, LAION-5B, and faiss, in order to efficiently retrieve images similar to the supplied images, and (b) retrieval meta-learning, which learns to judiciously utilize the retrieved images. Code and data are available at github.com/amirziai/rafic.
Xuechen Liu, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi
In this study, we introduce a novel cross-modal retrieval task involving
speaker descriptions and their corresponding audio samples. Utilizing
pre-trained speaker and text encoders, we present a simple learning framework
based on contrastive learning. Additionally, we explore the impact of
incorporating speaker labels into the training process. Our findings establish
the effectiveness of linking speaker and text information for the task in both
English and Japanese, across diverse data configurations. Additional
visual analysis unveils potential nuanced associations between speaker
clustering and retrieval performance.
Authors' comments: Submitted to IEEE Signal Processing Letters
Raviteja Anantha, Tharun Bethi, Danil Vodianik, Srinivas Chappidi
Large language models (LLMs) have the remarkable ability to solve new tasks
with just a few examples, but they need access to the right tools. Retrieval
Augmented Generation (RAG) addresses this problem by retrieving a list of
relevant tools for a given task. However, RAG's tool retrieval step requires
all the required information to be explicitly present in the query. This is a
limitation, as semantic search, the widely adopted tool retrieval method, can
fail when the query is incomplete or lacks context. To address this limitation,
we propose Context Tuning for RAG, which employs a smart context retrieval
system to fetch relevant information that improves both tool retrieval and plan
generation. Our lightweight context retrieval model uses numerical,
categorical, and habitual usage signals to retrieve and rank context items. Our
empirical results demonstrate that context tuning significantly enhances
semantic search, achieving a 3.5-fold and 1.5-fold improvement in Recall@K for
context retrieval and tool retrieval tasks respectively, and resulting in an
11.6% increase in LLM-based planner accuracy. Additionally, we show that our
proposed lightweight model using Reciprocal Rank Fusion (RRF) with LambdaMART
outperforms GPT-4 based retrieval. Moreover, we observe that context augmentation
at plan generation, even after tool retrieval, reduces hallucination.
Authors' comments: preprint version
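Reciprocal Rank Fusion, mentioned in the abstract above, combines several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below shows only that fusion step under assumptions: the constant k = 60 is the value commonly used in the literature, the input rankings are hypothetical, and the LambdaMART component of the authors' model is not reproduced here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):  # ranks are 1-based
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings from three different signals (assumed for illustration).
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "d"],
])
print(fused)  # ['b', 'a', 'c', 'd']
```

Items missing from a list simply contribute nothing for that list, so the lists do not need to cover the same candidates.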
Susav Shrestha, Narasimha Reddy, Zongwang Li
Recent advances in large language models have demonstrated remarkable
effectiveness in information retrieval (IR) tasks. While many neural IR systems
encode queries and documents into single-vector representations, multi-vector
models elevate the retrieval quality by producing multi-vector representations
and facilitating similarity searches at the granularity of individual tokens.
However, these models significantly amplify memory and storage requirements for
retrieval indices by an order of magnitude. This escalation in index size
renders the scalability of multi-vector IR models progressively challenging due
to their substantial memory demands. We introduce Embedding from Storage
Pipelined Network (ESPN) where we offload the entire re-ranking embedding
tables to SSDs and reduce the memory requirements by 5-16x. We design a
software prefetcher with hit rates exceeding 90%, improving SSD-based retrieval
by up to 6.4x, and demonstrate that we can maintain near-memory levels of query
latency even for large query batch sizes.
Authors' comments: 10 pages, 10 figures
Yujie Qian, Zhening Li, Zhengkai Tu, Connor W. Coley, Regina Barzilay
This paper focuses on using natural language descriptions to enhance
predictive models in the chemistry field. Conventionally, chemoinformatics
models are trained with extensive structured data manually extracted from the
literature. In this paper, we introduce TextReact, a novel method that directly
augments predictive chemistry with texts retrieved from the literature.
TextReact retrieves text descriptions relevant for a given chemical reaction,
and then aligns them with the molecular representation of the reaction. This
alignment is enhanced via an auxiliary masked LM objective incorporated in the
predictor training. We empirically validate the framework on two chemistry
tasks: reaction condition recommendation and one-step retrosynthesis. By
leveraging text retrieval, TextReact significantly outperforms state-of-the-art
chemoinformatics models trained solely on molecular data.
Authors' comments: EMNLP 2023