Rose E. Wang, Pawan Wirawarn, Omar Khattab, Noah Goodman, Dorottya Demszky
Many online content portals allow users to ask questions to supplement their
understanding (e.g., of lectures). While information retrieval (IR) systems may
provide answers for such user queries, they do not directly help content
creators -- such as lecturers who want to improve their content -- identify
segments that _caused_ a user to ask those questions. We introduce the task of
backtracing, in which systems retrieve the text segment that most likely caused
a user query. We formalize three real-world domains for which backtracing is
important in improving content delivery and communication: understanding the
cause of (a) student confusion in the Lecture domain, (b) reader curiosity in
the News Article domain, and (c) user emotion in the Conversation domain. We
evaluate the zero-shot performance of popular information retrieval methods and
language modeling methods, including bi-encoder, re-ranking and
likelihood-based methods and ChatGPT. While traditional IR systems retrieve
semantically relevant information (e.g., details on "projection matrices" for a
query "does projecting multiple times still lead to the same point?"), they
often miss the causally relevant context (e.g., the lecturer states "projecting
twice gets me the same answer as one projection"). Our results show that there
is room for improvement on backtracing and it requires new retrieval
approaches. We hope our benchmark serves to improve future retrieval systems
for backtracing, spawning systems that refine content generation and identify
linguistic triggers influencing user queries. Our code and data are
open-sourced: https://github.com/rosewang2008/backtracing.
Authors' comments: Code: https://github.com/rosewang2008/backtracing; EACL 2024
Findings, Long Paper
Antonio Francesco Mello, Guglielmo Lami, Mario Collura
Quantum computing's promise lies in its intrinsic complexity, with
entanglement initially heralded as its hallmark. However, the quest for quantum
advantage extends beyond entanglement, encompassing the realm of nonstabilizer
(magic) states. Despite their significance, quantifying and characterizing
these states pose formidable challenges. Here, we introduce a novel approach
leveraging Convolutional Neural Networks (CNNs) to classify quantum states
based on their magic content. Without relying on a complete knowledge of the
state, we utilize partial information acquired from measurement snapshots to
train the CNN in distinguishing between stabilizer and nonstabilizer states.
Importantly, our methodology circumvents the limitations of full state
tomography, offering a practical solution for real-world quantum experiments.
In addition, we unveil a theoretical connection between Stabilizer Rényi
Entropies (SREs) and the expectation value of Pauli matrices for pure quantum
states. Our findings pave the way for experimental applications, providing a
robust and accessible tool for deciphering the intricate landscape of quantum
resources.
Authors' comments: 7 pages, 4 figures
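To make the link between magic and Pauli expectation values concrete, here is a minimal sketch using the standard definition of the 2-Rényi stabilizer entropy for a pure state, M_2 = -log((1/d) * sum over Pauli strings P of <P>^4), with d = 2^N. It is not the theoretical connection derived in the paper, and it is restricted to a single qubit described by its Bloch vector. The function name and the example states are illustrative assumptions.

```python
import math

def stabilizer_renyi_entropy_2(bloch):
    """2-Renyi stabilizer entropy of a single-qubit pure state.

    `bloch` is (<X>, <Y>, <Z>) and is assumed to have unit length
    (pure state). Uses the natural logarithm.
    """
    x, y, z = bloch
    d = 2  # Hilbert-space dimension of one qubit
    # Fourth powers of all Pauli expectation values; <I> is always 1.
    fourth_moment = 1.0 + x**4 + y**4 + z**4
    return -math.log(fourth_moment / d)

# |0> is a stabilizer state: <Z> = 1, <X> = <Y> = 0.
zero_state = (0.0, 0.0, 1.0)
# (|0> + e^{i*pi/4}|1>)/sqrt(2) is a nonstabilizer (magic) state.
t_state = (1 / math.sqrt(2), 1 / math.sqrt(2), 0.0)

sre_zero = stabilizer_renyi_entropy_2(zero_state)
sre_t = stabilizer_renyi_entropy_2(t_state)
```

Under this definition the stabilizer state has zero entropy, while the second state has a strictly positive value, which is the kind of distinction the classifier is trained to detect from partial measurement data.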
Hui Wu, Min Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li
In asymmetric retrieval systems, models with different capacities are deployed on platforms with different computational and storage resources. Despite the great progress, existing approaches still suffer from a dilemma between retrieval efficiency and asymmetric accuracy due to the limited capacity of the lightweight query model. In this work, we propose an Asymmetric Feature Fusion (AFF) paradigm, which advances existing asymmetric retrieval systems by considering the complementarity among different features just at the gallery side. Specifically, it first embeds each gallery image into various features, e.g., local features and global features. Then, a dynamic mixer is introduced to aggregate these features into a compact embedding for efficient search. On the query side, only a single lightweight model is deployed for feature extraction. The query model and dynamic mixer are jointly trained by sharing a momentum-updated classifier. Notably, the proposed paradigm boosts the accuracy of asymmetric retrieval without introducing any extra overhead to the query side. Exhaustive experiments on various landmark retrieval datasets demonstrate the superiority of our paradigm.
Tom Hosking, Hao Tang, Mirella Lapata
We propose a method for unsupervised abstractive opinion summarization that
combines the attributability and scalability of extractive approaches with the
coherence and fluency of Large Language Models (LLMs). Our method, HIRO, learns
an index structure that maps sentences to a path through a semantically
organized discrete hierarchy. At inference time, we populate the index and use
it to identify and retrieve clusters of sentences containing popular opinions
from input reviews. Then, we use a pretrained LLM to generate a readable
summary that is grounded in these extracted evidential clusters. The modularity
of our approach allows us to evaluate its efficacy at each stage. We show that
HIRO learns an encoding space that is more semantically structured than prior
work, and generates summaries that are more representative of the opinions in
the input reviews. Human evaluation confirms that HIRO generates significantly
more coherent, detailed and accurate summaries.
Authors' comments: Accepted to TACL; Pre MIT Press version
Pierre Erbacher, Jian-Yun Nie, Philippe Preux, Laure Soulier
Conversational systems have made significant progress in generating natural language responses. However, their potential as conversational search systems is currently limited due to their passive role in the information-seeking process. One major limitation is the scarcity of datasets that provide labelled ambiguous questions along with a supporting corpus of documents and relevant clarifying questions. This work aims to tackle the challenge of generating relevant clarifying questions by taking into account the inherent ambiguities present in both user queries and documents. To achieve this, we propose PAQA, an extension to the existing AmbiNQ dataset, incorporating clarifying questions. We then evaluate various models and assess how passage retrieval impacts ambiguity detection and the generation of clarifying questions. By addressing this gap in conversational search systems, we aim to provide additional supervision to enhance their active participation in the information-seeking process and provide users with more accurate results.
Jianyou Wang, Kaicheng Wang, Xiaoyue Wang, Weili Cao, Ramamohan Paturi, Leon Bergen
Effective information retrieval (IR) in settings with limited training data,
particularly for complex queries, remains a challenging task. This paper
introduces IR2, Information Regularization for Information Retrieval, a
technique for reducing overfitting during synthetic data generation. This
approach, representing a novel application of regularization techniques in
synthetic data creation for IR, is tested on three recent IR tasks
characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook.
Experimental results indicate that our regularization techniques not only
outperform previous synthetic query generation methods on the tasks considered
but also reduce cost by up to 50%. Furthermore, this paper categorizes and
explores three regularization methods at different stages of the query
synthesis pipeline (input, prompt, and output), each offering varying degrees of
performance improvement compared to models where no regularization is applied.
This provides a systematic approach for optimizing synthetic data generation in
data-limited, complex-query IR scenarios. All code, prompts and synthetic data
are available at
https://github.com/Info-Regularization/Information-Regularization.
Authors' comments: Accepted by LREC-COLING 2024 - The 2024 Joint International
Conference on Computational Linguistics, Language Resources and Evaluation
Seraphina Goldfarb-Tarrant, Pedro Rodriguez, Jane Dwivedi-Yu, Patrick Lewis
Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of what information is lost versus preserved, and how it affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers compared to the language models they are based on (e.g., BERT versus Contriever). We use 25 MultiBert checkpoints as randomized initialisations to train MultiContrievers, a set of 25 contriever models. We test whether specific pieces of information -- such as gender and occupation -- can be extracted from contriever vectors of Wikipedia-like documents. We measure this extractability via information theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1) contriever models have significantly increased extractability, but extractability usually correlates poorly with benchmark performance; (2) gender bias is present, but is not caused by the contriever representations; and (3) there is high sensitivity to both random initialisation and to data shuffle, suggesting that future retrieval research should test across a wider spread of both.
Danyang Hou, Liang Pang, Huawei Shen, Xueqi Cheng
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task
focused on identifying a specific moment within a vast corpus of untrimmed
videos using a natural language query. Existing methods for VCMR typically
rely on frame-aware video retrieval, calculating similarities between the query
and video frames to rank videos based on maximum frame similarity. However, this
approach overlooks the semantic structure embedded within the information
between frames, namely, the event, a crucial element for human comprehension of
videos. Motivated by this, we propose EventFormer, a model that explicitly
utilizes events within videos as fundamental units for video retrieval. The
model extracts event representations through event reasoning and hierarchical
event encoding. The event reasoning module groups consecutive and visually
similar frame representations into events, while the hierarchical event
encoding encodes information at both the frame and event levels. We also
introduce anchor multi-head self-attention to encourage the Transformer to capture
the relevance of adjacent content in the video. The training of EventFormer is
conducted by two-branch contrastive learning and dual optimization for two
sub-tasks of VCMR. Extensive experiments on TVR, ANetCaps, and DiDeMo
benchmarks show the effectiveness and efficiency of EventFormer in VCMR,
achieving new state-of-the-art results. Additionally, the effectiveness of
EventFormer is validated on the partially relevant video retrieval task.
Authors' comments: 11 pages, 5 figures, 9 tables
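The idea of grouping consecutive, visually similar frames into events can be illustrated with a small sketch. This is not EventFormer's event reasoning module; it is a simple threshold rule, assumed here for illustration, that starts a new event whenever the cosine similarity between adjacent frame vectors drops below a cutoff. The function names, the 0.9 threshold, and the toy 2-D frame vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_frames_into_events(frames, threshold=0.9):
    """Group consecutive frames into events.

    A frame joins the current event when it is similar enough to the
    frame immediately before it; otherwise it opens a new event.
    Returns a list of events, each a list of frame indices.
    """
    events = []
    for i, frame in enumerate(frames):
        if events and cosine(frames[i - 1], frame) >= threshold:
            events[-1].append(i)
        else:
            events.append([i])
    return events

# Toy frame representations: two similar frames, a scene change,
# two more similar frames, then another change.
frames = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.1)]
events = group_frames_into_events(frames)
```

On the toy input, the five frames fall into three events. A retrieval model could then score the query against one representation per event rather than against every frame.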
Jianqiang Shen, Yuchin Juan, Shaobo Zhang, Ping Liu, Wen Pu, Sriram Vasudevan, Qingquan Song, Fedor Borisyuk et al.
Web-scale search systems typically tackle the scalability challenge with a two-step paradigm: retrieval and ranking. The retrieval step, also known as candidate selection, often involves extracting standardized entities, creating an inverted index, and performing term matching for retrieval. Such traditional methods require manual and time-consuming development of query models. In this paper, we discuss applying learning-to-retrieve technology to enhance LinkedIn's job search and recommendation systems. In the realm of promoted jobs, the key objective is to improve the quality of applicants, thereby delivering value to recruiter customers. To achieve this, we leverage confirmed hire data to construct a graph that evaluates a seeker's qualification for a job, and utilize learned links for retrieval. Our learned model is easy to explain, debug, and adjust. On the other hand, the focus for organic jobs is to optimize seeker engagement. We accomplished this by training embeddings for personalized retrieval, fortified by a set of rules derived from the categorization of member feedback. In addition to a solution based on a conventional inverted index, we developed an on-GPU solution capable of supporting both KNN and term matching efficiently.
Guangzhi Xiong, Qiao Jin, Zhiyong Lu, Aidong Zhang
While large language models (LLMs) have achieved state-of-the-art performance
on a wide range of medical question answering (QA) tasks, they still face
challenges with hallucinations and outdated knowledge. Retrieval-augmented
generation (RAG) is a promising solution and has been widely adopted. However,
a RAG system can involve multiple flexible components, and there is a lack of
best practices regarding the optimal RAG setting for various medical purposes.
To systematically evaluate such systems, we propose the Medical Information
Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind
benchmark including 7,663 questions from five medical QA datasets. Using
MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt
tokens on 41 combinations of different corpora, retrievers, and backbone LLMs
through the MedRAG toolkit introduced in this work. Overall, MedRAG improves
the accuracy of six different LLMs by up to 18% over chain-of-thought
prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our
results show that the combination of various medical corpora and retrievers
achieves the best performance. In addition, we discovered a log-linear scaling
property and the "lost-in-the-middle" effects in medical RAG. We believe our
comprehensive evaluations can serve as practical guidelines for implementing
RAG systems for medicine.
Authors' comments: Homepage: https://teddy-xionggz.github.io/benchmark-medical-rag/
Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, Tao Yu
Recently, retrieval-augmented generation (RAG) has been successfully
applied in code generation. However, existing pipelines for retrieval-augmented
code generation (RACG) employ static knowledge bases with a single source,
limiting the adaptation capabilities of Large Language Models (LLMs) to domains
they have insufficient knowledge of. In this work, we develop a novel pipeline,
EVOR, that employs the synchronous evolution of both queries and diverse
knowledge bases. In two realistic settings where external knowledge is
required to solve code generation tasks, we compile four new datasets
associated with frequently updated libraries and long-tail programming
languages, named EVOR-BENCH. Extensive experiments demonstrate that EVOR
achieves two to four times the execution accuracy of other methods such
as Reflexion (Shinn et al., 2024) and DocPrompting (Zhou et al., 2023). We
demonstrate that EVOR is flexible and can be easily combined with them to
achieve further improvement. Further analysis reveals that EVOR benefits from
the synchronous evolution of queries and documents and the diverse information
sources in the knowledge base. We hope that our studies will inspire more
insights into the design of advanced RACG pipelines in future research. Our
model, code, and data are available at https://arks-codegen.github.io.
Authors' comments: Retrieval-augmented code generation
Thong Nguyen, Mariya Hendriksen, Andrew Yates
Learned Sparse Retrieval (LSR) is a group of neural methods designed to
encode queries and documents into sparse lexical vectors. These vectors can be
efficiently indexed and retrieved using an inverted index. While LSR has shown
promise in text retrieval, its potential in multi-modal retrieval remains
largely unexplored. Motivated by this, in this work, we explore the application
of LSR in the multi-modal domain, i.e., we focus on Multi-Modal Learned Sparse
Retrieval (MLSR). We conduct experiments using several MLSR model
configurations and evaluate the performance on the image suggestion task. We
find that solving the task solely based on the image content is challenging.
Enriching the image content with its caption improves the model performance
significantly, implying the importance of image captions to provide
fine-grained concepts and context information of images. Our approach presents
a practical and effective solution for training LSR retrieval models in
multi-modal settings.
Authors' comments: 5 pages, TREC 2023
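The statement that sparse lexical vectors can be indexed and retrieved with an inverted index can be sketched as follows. This is a minimal illustration, not the MLSR models from the paper: the term weights are hand-written stand-ins for what a learned encoder would produce for images or captions, and the function names and document ids are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:  # only non-zero entries are stored
                index[term].append((doc_id, weight))
    return index

def search(index, query_vector, top_k=3):
    """Score documents by the sparse dot product with the query."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        # Only posting lists of query terms are visited.
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Assumed sparse vectors for three images (term -> weight).
docs = {
    "img1": {"dog": 1.2, "beach": 0.8},
    "img2": {"cat": 1.5, "sofa": 0.6},
    "img3": {"dog": 0.5, "park": 1.0},
}
index = build_inverted_index(docs)
results = search(index, {"dog": 1.0, "beach": 0.5})
```

Because only the posting lists for the query's non-zero terms are touched, documents that share no term with the query (here `img2`) are never scored, which is what makes this kind of retrieval efficient.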
Oron Nir, Idan Vidra, Avi Neeman, Barak Kinarti, Ariel Shamir
Streamlining content discovery within media archives requires integrating advanced data representations and effective visualization techniques for clear communication of video topics to users. The proposed system addresses the challenge of efficiently navigating large video collections by exploiting a fusion of visual, audio, and textual features to accurately index and categorize video content through a text-based method. Additionally, semantic embeddings are employed to provide contextually relevant information and recommendations to users, resulting in an intuitive and engaging exploratory experience over our topics ontology map using OpenAI GPT-4.
Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Guido Zuccon
This paper introduces a novel unsupervised technique that utilizes large language models (LLMs) to determine the most suitable dense retriever for a specific test (target) corpus. Selecting the appropriate dense retriever is vital for numerous IR applications that employ these retrievers, trained on public datasets, to encode or conduct searches within a new private target corpus. The effectiveness of a dense retriever can significantly diminish when applied to a target corpus that diverges in domain or task from the original training set. The problem becomes more pronounced in cases where the target corpus is unlabeled, e.g., in zero-shot scenarios, rendering direct evaluation of the model's effectiveness on the target corpus unattainable. Therefore, the unsupervised selection of an optimally pre-trained dense retriever, especially under conditions of domain shift, emerges as a critical challenge. Existing methodologies for ranking dense retrievers fall short in addressing these domain shift scenarios. To tackle this, our method capitalizes on LLMs to create pseudo-relevant queries, labels, and reference lists by analyzing a subset of documents from the target corpus. This allows for the ranking of dense retrievers based on their performance with these pseudo-relevant signals. Significantly, this strategy is the first to depend exclusively on the target corpus data, removing the necessity for training data and test labels. We assessed the effectiveness of our approach by compiling a comprehensive pool of cutting-edge dense retrievers and comparing our method against traditional dense retriever selection benchmarks. The findings reveal that our proposed solution surpasses the existing benchmarks in both the selection and ranking of dense retrievers.
Jingxi Xu, Yinsen Jia, Dongxiao Yang, Patrick Meng, Xinyue Zhu, Zihan Guo, Shuran Song, Matei Ciocarlie
We introduce GEOTACT, a robotic manipulation method capable of retrieving objects buried in granular media. This is a challenging task due to the need to interact with granular media, and doing so based exclusively on tactile feedback, since a buried object can be completely hidden from vision. Tactile feedback is in itself challenging in this context, due to ubiquitous contact with the surrounding media, and the inherent noise level induced by the tactile readings. To address these challenges, we use a learning method trained end-to-end with simulated sensor noise. We show that our problem formulation leads to the natural emergence of learned pushing behaviors that the manipulator uses to reduce uncertainty and funnel the object to a stable grasp despite spurious and noisy tactile readings. We also introduce a training curriculum that enables learning these behaviors in simulation, followed by zero-shot transfer to real hardware. To the best of our knowledge, GEOTACT is the first method to reliably retrieve a number of different objects from a granular environment, doing so on real hardware and with integrated tactile sensing. Videos and additional information can be found at https://jxu.ai/geotact.
EuiYul Song, Philhoon Oh, Sangryul Kim, James Thorne
Modern deterministic retrieval pipelines prioritize achieving
state-of-the-art performance but often lack interpretability in
decision-making. These models face challenges in assessing uncertainty, leading
to overconfident predictions. To overcome these limitations, we integrate
uncertainty calibration and interpretability into a retrieval pipeline.
Specifically, we introduce Bayesian methodologies and multi-perspective
retrieval to calibrate uncertainty within a retrieval pipeline. We incorporate
techniques such as LIME and SHAP to analyze the behavior of a black-box
reranker model. The importance scores derived from these explanation
methodologies serve as supplementary relevance scores to enhance the base
reranker model. We evaluate the resulting performance enhancements achieved
through uncertainty calibration and interpretable reranking on Question
Answering and Fact Checking tasks. Our methods demonstrate substantial
performance improvements across three KILT datasets.
Authors' comments: 15 pages, 7 figures
EuiYul Song, Sangryul Kim, Haeju Lee, Joonkee Kim, James Thorne
Generative retrieval models encode pointers to information in a corpus as an
index within the model's parameters. These models serve as part of a larger
pipeline, where retrieved information conditions generation for
knowledge-intensive NLP tasks. However, we identify two limitations. First,
generative retrieval does not account for contextual information. Second, the
retrieval cannot be tuned for the downstream readers, as decoding the page title
is a non-differentiable operation. This paper introduces Re3val, trained with
generative reranking and reinforcement learning using limited data. Re3val
leverages context acquired via Dense Passage Retrieval to rerank the retrieved
page titles and utilizes REINFORCE to maximize rewards generated by constrained
decoding. Additionally, we generate questions from our pre-training dataset to
mitigate epistemic uncertainty and bridge the domain gap between the
pre-training and fine-tuning datasets. Subsequently, we extract and rerank
contexts from the KILT database using the reranked page titles. Upon grounding
the top five reranked contexts, Re3val achieves the top-1 KILT scores
compared to all other generative retrieval models across five KILT datasets.
Authors' comments: 17 pages, 4 figures, Findings of the Association for Computational
Linguistics: EACL 2023
Siwei Wu, Yizhi Li, Kang Zhu, Ge Zhang, Yiming Liang, Kaijing Ma, Chenghao Xiao, Haoran Zhang et al.
Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at https://github.com/Wusiwei0410/SciMMIR.
Mathias Vast, Yuxuan Zong, Basile Van Cooten, Benjamin Piwowarski, Laure Soulier
In Information Retrieval, and more generally in Natural Language Processing,
adapting models to specific domains is conducted through fine-tuning. Despite
the successes achieved by this method and its versatility, the need for
human-curated and labeled data makes it impractical to transfer to new tasks,
domains, and/or languages when training data doesn't exist. Using the model
without training (zero-shot) is another option, but it suffers an
effectiveness cost, especially in the case of first-stage retrievers. Numerous
research directions have emerged to tackle these issues, most of them in the
context of adapting to a task or a language. However, the literature is scarcer
for domain (or topic) adaptation. In this paper, we address this issue of
cross-topic discrepancy for a sparse first-stage retriever by transposing a
method initially designed for language adaptation. By leveraging pre-training
on the target data to learn domain-specific knowledge, this technique
alleviates the need for annotated data and expands the scope of domain
adaptation. Despite their relatively good generalization ability, we show that
even sparse retrievers can benefit from our simple domain adaptation method.
Authors' comments: Accepted at ECIR 2024
Stéphanie Juneau, Alice Jacques, Steve Pothier, Adam S. Bolton, Benjamin A. Weaver, Ragadeepika Pucha, Sean McManus, Robert Nikutta et al.
SPectra Analysis and Retrievable Catalog Lab (SPARCL) at NOIRLab's Astro Data
Lab was created to efficiently serve large optical and infrared spectroscopic
datasets. It consists of services, tools, and example workflows, and currently
contains spectra for over 7.5 million stars, galaxies and quasars from the
Sloan Digital Sky Survey (SDSS) and the Dark Energy Spectroscopic Instrument
(DESI) survey. We aim to eventually support the broad range of spectroscopic
datasets that will be hosted at NOIRLab and beyond. Major elements of SPARCL
include capabilities to discover and query for spectra based on parameters of
interest, a fast web service that delivers desired spectra either individually
or in bulk, as well as documentation and example Jupyter Notebooks to empower
users in their research. More information is available on the SPARCL website
(https://astrosparcl.datalab.noirlab.edu).
Authors' comments: 4 pages, 1 figure, Conference Proceedings for ADASS 2023
(Astronomical Data Analysis Software & Systems XXXIII). Revised figure 1
(text is unchanged)