Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
Chats emerge as an effective user-friendly approach for information
retrieval, and are successfully employed in many domains, such as customer
service, healthcare, and finance. However, existing image retrieval approaches
typically address the case of a single query-to-image round, and the use of
chats for image retrieval has been mostly overlooked. In this work, we
introduce ChatIR: a chat-based image retrieval system that engages in a
conversation with the user to elicit information, in addition to an initial
query, in order to clarify the user's search intent. Motivated by the
capabilities of today's foundation models, we leverage Large Language Models to
generate follow-up questions to an initial image description. These questions
form a dialog with the user in order to retrieve the desired image from a large
corpus. In this study, we explore the capabilities of such a system tested on a
large dataset and reveal that engaging in a dialog yields significant gains in
image retrieval. We start by building an evaluation pipeline from an existing
manually generated dataset and explore different modules and training
strategies for ChatIR. Our comparison includes strong baselines derived from
related applications trained with Reinforcement Learning. Our system is capable
of retrieving the target image from a pool of 50K images with over 78% success
rate after 5 dialogue rounds, compared to 75% when questions are asked by
humans, and 64% for a single shot text-to-image retrieval. Extensive
evaluations reveal the strong capabilities and examine the limitations of
CharIR under different settings. Project repository is available at
https://github.com/levymsn/ChatIR.
Authors' comments: Camera Ready version for NeurIPS 2023
Thong Nguyen, Sean MacAvaney, Andrew Yates
Learned sparse retrieval (LSR) is a family of neural retrieval methods that
transform queries and documents into sparse weight vectors aligned with a
vocabulary. While LSR approaches like Splade work well for short passages, it
is unclear how well they handle longer documents. We investigate existing
aggregation approaches for adapting LSR to longer documents and find that
proximal scoring is crucial for LSR to handle long documents. To leverage this
property, we proposed two adaptations of the Sequential Dependence Model (SDM)
to LSR: ExactSDM and SoftSDM. ExactSDM assumes only exact query term
dependence, while SoftSDM uses potential functions that model the dependence of
query terms and their expansion terms (i.e., terms identified using a
transformer's masked language modeling head).
Experiments on the MSMARCO Document and TREC Robust04 datasets demonstrate
that both ExactSDM and SoftSDM outperform existing LSR aggregation approaches
for different document length constraints. Surprisingly, SoftSDM does not
provide any performance benefits over ExactSDM. This suggests that soft
proximity matching is not necessary for modeling term dependence in LSR.
Overall, this study provides insights into handling long documents with LSR,
proposing adaptations that improve its performance.
Authors' comments: SIGIR 2023
Po-Ya Angela Wang, Pin-Er Chen, Hsin-Yu Chou, Yu-Hsiang Tseng, Shu-Kai Hsieh
Multimodal corpora have become an essential language resource for language science and grounded natural language processing (NLP) systems due to the growing need to understand and interpret human communication across various channels. In this paper, we first present our efforts in building the first Multimodal Corpus for Languages in Taiwan (MultiMoco). Based on the corpus, we conduct a case study investigating the Lexical Retrieval Hypothesis (LRH), specifically examining whether the hand gestures co-occurring with speech constants facilitate lexical retrieval or serve other discourse functions. With detailed annotations on eight parliamentary interpellations in Taiwan Mandarin, we explore the co-occurrence between speech constants and non-verbal features (i.e., head movement, face movement, hand gesture, and function of hand gesture). Our findings suggest that while hand gestures do serve as facilitators for lexical retrieval in some cases, they also serve the purpose of information emphasis. This study highlights the potential of the MultiMoco Corpus to provide an important resource for in-depth analysis and further research in multimodal communication studies.
Michael Tang, Shunyu Yao, John Yang, Karthik Narasimhan
We propose Referral-Augmented Retrieval (RAR), a simple technique that concatenates document indices with referrals, i.e. text from other documents that cite or link to the given document, to provide significant performance gains for zero-shot information retrieval. The key insight behind our method is that referrals provide a more complete, multi-view representation of a document, much like incoming page links in algorithms like PageRank provide a comprehensive idea of a webpage's importance. RAR works with both sparse and dense retrievers, and outperforms generative text expansion techniques such as DocT5Query and Query2Doc a 37% and 21% absolute improvement on ACL paper retrieval Recall@10 -- while also eliminating expensive model training and inference. We also analyze different methods for multi-referral aggregation and show that RAR enables up-to-date information retrieval without re-training.
Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, Danqi Chen
Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts, by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly $k$NN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that $k$NN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks. When privacy information is targeted and readily detected in the text, we find that a simple sanitization step would completely eliminate the risks, while decoupling query and key encoders achieves an even better utility-privacy trade-off. Otherwise, we consider strategies of mixing public and private data in both datastore and encoder training. While these methods offer modest improvements, they leave considerable room for future work. Together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based LMs. Our code is available at: https://github.com/Princeton-SysML/kNNLM_privacy .
Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, Vincent Y Zhao
In-context learning (ICL), teaching a large language model (LLM) to perform a task with few-shot demonstrations rather than adjusting the model parameters, has emerged as a strong paradigm for using LLMs. While early studies primarily used a fixed or random set of demonstrations for all test queries, recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance. This work expands the applicability of retrieval-based ICL approaches by demonstrating that even simple word-overlap similarity measures such as BM25 outperform randomly selected demonstrations. Furthermore, we extend the success of retrieval-based ICL to instruction-finetuned LLMs as well as Chain-of-Thought (CoT) prompting. For instruction-finetuned LLMs, we find that although a model has already seen the training data at training time, retrieving demonstrations from the training data at test time yields better results compared to using no demonstrations or random demonstrations. Last but not least, we train a task-specific demonstration retriever that outperforms off-the-shelf retrievers.
Peitian Zhang, Zheng Liu, Yujia Zhou, Zhicheng Dou, Fangchao Liu, Zhao Cao
Recently, generative retrieval emerges as a promising alternative to traditional retrieval paradigms. It assigns each document a unique identifier, known as DocID, and employs a generative model to directly generate the relevant DocID for the input query. A common choice for DocID is one or several natural language sequences, e.g. the title or n-grams, so that the pre-trained knowledge of the generative model can be utilized. However, a sequence is generated token by token, where only the most likely candidates are kept and the rest are pruned at each decoding step, thus, retrieval fails if any token within the relevant DocID is falsely pruned. What's worse, during decoding, the model can only perceive preceding tokens in DocID while being blind to subsequent ones, hence is prone to make such errors. To address this problem, we present a novel framework for generative retrieval, dubbed Term-Set Generation (TSGen). Instead of sequences, we use a set of terms as DocID, which are automatically selected to concisely summarize the document's semantics and distinguish it from others. On top of the term-set DocID, we propose a permutation-invariant decoding algorithm, with which the term set can be generated in any permutation yet will always lead to the corresponding document. Remarkably, TSGen perceives all valid terms rather than only the preceding ones at each decoding step. Given the constant decoding space, it can make more reliable decisions due to the broader perspective. TSGen is also resilient to errors: the relevant DocID will not be pruned as long as the decoded term belongs to it. Lastly, we design an iterative optimization procedure to incentivize the model to generate the relevant term set in its favorable permutation. We conduct extensive experiments on popular benchmarks, which validate the effectiveness, the generalizability, the scalability, and the efficiency of TSGen.
Ilias Chalkidis, Yova Kementchedjhieva
Multi-label text classification (MLC) is a challenging task in settings of large label sets, where label support follows a Zipfian distribution. In this paper, we address this problem through retrieval augmentation, aiming to improve the sample efficiency of classification models. Our approach closely follows the standard MLC architecture of a Transformer-based encoder paired with a set of classification heads. In our case, however, the input document representation is augmented through cross-attention to similar documents retrieved from the training set and represented in a task-specific manner. We evaluate this approach on four datasets from the legal and biomedical domains, all of which feature highly skewed label distributions. Our experiments show that retrieval augmentation substantially improves model performance on the long tail of infrequent labels especially so for lower-resource training scenarios and more challenging long-document data scenarios.
Zhiqi Huang, Hansi Zeng, Hamed Zamani, James Allan
In this work, we explore a Multilingual Information Retrieval (MLIR) task, where the collection includes documents in multiple languages. We demonstrate that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance. This is due to the heterogeneous and imbalanced nature of multilingual collections -- some languages are better represented in the collection and some benefit from large-scale training data. To address this issue, we present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space. To address the challenges of data scarcity and imbalance, we introduce a knowledge distillation strategy. The teacher model is trained on rich English retrieval data, and by leveraging bi-text data, our distillation framework transfers its retrieval knowledge to the multilingual document encoder. Therefore, our approach does not require any multilingual retrieval training data. Extensive experiments on three MLIR datasets with a total of 15 languages demonstrate that KD-SPD significantly outperforms competitive baselines in all cases. We conduct extensive analyses to show that our method has less language bias and better zero-shot transfer ability towards new languages.
Matúš Pikuliak, Ivan Srba, Robert Moro, Timo Hromadka, Timotej Smolen, Martin Melisek, Ivan Vykopal, Jakub Simko et al.
Fact-checkers are often hampered by the sheer amount of online content that
needs to be fact-checked. NLP can help them by retrieving already existing
fact-checks relevant to the content being investigated. This paper introduces a
new multilingual dataset -- MultiClaim -- for previously fact-checked claim
retrieval. We collected 28k posts in 27 languages from social media, 206k
fact-checks in 39 languages written by professional fact-checkers, as well as
31k connections between these two groups. This is the most extensive and the
most linguistically diverse dataset of this kind to date. We evaluated how
different unsupervised methods fare on this dataset and its various dimensions.
We show that evaluating such a diverse dataset has its complexities and proper
care needs to be taken before interpreting the results. We also evaluated a
supervised fine-tuning approach, improving upon the unsupervised method
significantly.
Authors' comments: Accepted at EMNLP 2023
Orion Weller, Dawn Lawrie, Benjamin Van Durme
Negation is a common everyday phenomena and has been a consistent area of
weakness for language models (LMs). Although the Information Retrieval (IR)
community has adopted LMs as the backbone of modern IR architectures, there has
been little to no research in understanding how negation impacts neural IR. We
therefore construct a straightforward benchmark on this theme: asking IR models
to rank two documents that differ only by negation. We show that the results
vary widely according to the type of IR architecture: cross-encoders perform
best, followed by late-interaction models, and in last place are bi-encoder and
sparse neural architectures. We find that most information retrieval models
(including SOTA ones) do not consider negation, performing the same or worse
than a random ranking. We show that although the obvious approach of continued
fine-tuning on a dataset of contrastive documents containing negations
increases performance (as does model size), there is still a large gap between
machine and human performance.
Authors' comments: Accepted to EACL 2024
Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-Hermelo, Mehdi Rezagholizadeh, Jimmy Lin
The ever-increasing size of language models curtails their widespread
availability to the community, thereby galvanizing many companies into offering
access to large language models through APIs. One particular type, suitable for
dense retrieval, is a semantic embedding service that builds vector
representations of input text. With a growing number of publicly available
APIs, our goal in this paper is to analyze existing offerings in realistic
retrieval scenarios, to assist practitioners and researchers in finding
suitable services according to their needs. Specifically, we investigate the
capabilities of existing semantic embedding APIs on domain generalization and
multilingual retrieval. For this purpose, we evaluate these services on two
standard benchmarks, BEIR and MIRACL. We find that re-ranking BM25 results
using the APIs is a budget-friendly approach and is most effective in English,
in contrast to the standard practice of employing them as first-stage
retrievers. For non-English retrieval, re-ranking still improves the results,
but a hybrid model with BM25 works best, albeit at a higher cost. We hope our
work lays the groundwork for evaluating semantic embedding APIs that are
critical in search and more broadly, for information access.
Authors' comments: ACL 2023 Industry Track
Yiqing Xie, Xiao Liu, Chenyan Xiong
In this work, we present an unsupervised retrieval method with contrastive
learning on web anchors. The anchor text describes the content that is
referenced from the linked page. This shows similarities to search queries that
aim to retrieve pertinent information from relevant documents. Based on their
commonalities, we train an unsupervised dense retriever, Anchor-DR, with a
contrastive learning task that matches the anchor text and the linked document.
To filter out uninformative anchors (such as ``homepage'' or other functional
anchors), we present a novel filtering technique to only select anchors that
contain similar types of information as search queries. Experiments show that
Anchor-DR outperforms state-of-the-art methods on unsupervised dense retrieval
by a large margin (e.g., by 5.3% NDCG@10 on MSMARCO). The gain of our method is
especially significant for search and question answering tasks. Our analysis
further reveals that the pattern of anchor-document pairs is similar to that of
search query-document pairs. Code available at
https://github.com/Veronicium/AnchorDR.
Authors' comments: SIGIR'23 Short
Marco Peer, Florian Kleber, Robert Sablatnig
This paper presents an unsupervised approach for writer retrieval based on clustering SIFT descriptors detected at keypoint locations resulting in pseudo-cluster labels. With those cluster labels, a residual network followed by our proposed NetRVLAD, an encoding layer with reduced complexity compared to NetVLAD, is trained on 32x32 patches at keypoint locations. Additionally, we suggest a graph-based reranking algorithm called SGR to exploit similarities of the page embeddings to boost the retrieval performance. Our approach is evaluated on two historical datasets (Historical-WI and HisIR19). We include an evaluation of different backbones and NetRVLAD. It competes with related work on historical datasets without using explicit encodings. We set a new State-of-the-art on both datasets by applying our reranking scheme and show that our approach achieves comparable performance on a modern dataset as well.
Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xiaoling Wang et al.
In-context learning is a new learning paradigm where a language model
conditions on a few input-output pairs (demonstrations) and a test input, and
directly outputs the prediction. It has been shown highly dependent on the
provided demonstrations and thus promotes the research of demonstration
retrieval: given a test input, relevant examples are retrieved from the
training set to serve as informative demonstrations for in-context learning.
While previous works focus on training task-specific retrievers for several
tasks separately, these methods are often hard to transfer and scale on various
tasks, and separately trained retrievers incur a lot of parameter storage and
deployment cost. In this paper, we propose Unified Demonstration Retriever
(\textbf{UDR}), a single model to retrieve demonstrations for a wide range of
tasks. To train UDR, we cast various tasks' training signals into a unified
list-wise ranking formulation by language model's feedback. Then we propose a
multi-task list-wise ranking training framework, with an iterative mining
strategy to find high-quality candidates, which can help UDR fully incorporate
various tasks' signals. Experiments on 30+ tasks across 13 task families and
multiple data domains show that UDR significantly outperforms baselines.
Further analyses show the effectiveness of each proposed component and UDR's
strong ability in various scenarios including different LMs (1.3B - 175B),
unseen datasets, varying demonstration quantities, etc.
Authors' comments: ACL 2023 camera ready version
Nishant Balepur, Jie Huang, Kevin Chen-Chuan Chang
Expository documents are vital resources for conveying complex information to
readers. Despite their usefulness, writing expository text by hand is a
challenging process that requires careful content planning, obtaining facts
from multiple sources, and the ability to clearly synthesize these facts. To
ease these burdens, we propose the task of expository text generation, which
seeks to automatically generate an accurate and stylistically consistent
expository text for a topic by intelligently searching a knowledge source. We
solve our task by developing IRP, a framework that overcomes the limitations of
retrieval-augmented models and iteratively performs content planning, fact
retrieval, and rephrasing. Through experiments on three diverse,
newly-collected datasets, we show that IRP produces factual and organized
expository texts that accurately inform readers.
Authors' comments: Accepted to EMNLP 2023 Main Conference
Xiaoyang Chen, Yanjiang Liu, Ben He, Le Sun, Yingfei Sun
The Differentiable Search Index (DSI) is a novel information retrieval (IR)
framework that utilizes a differentiable function to generate a sorted list of
document identifiers in response to a given query. However, due to the
black-box nature of the end-to-end neural architecture, it remains to be
understood to what extent DSI possesses the basic indexing and retrieval
abilities. To mitigate this gap, in this study, we define and examine three
important abilities that a functioning IR framework should possess, namely,
exclusivity, completeness, and relevance ordering. Our analytical
experimentation shows that while DSI demonstrates proficiency in memorizing the
unidirectional mapping from pseudo queries to document identifiers, it falls
short in distinguishing relevant documents from random ones, thereby negatively
impacting its retrieval effectiveness. To address this issue, we propose a
multi-task distillation approach to enhance the retrieval quality without
altering the structure of the model and successfully endow it with improved
indexing abilities. Through experiments conducted on various datasets, we
demonstrate that our proposed method outperforms previous DSI baselines.
Authors' comments: Accepted to Findings of ACL 2023
James Mayfield, Eugene Yang, Dawn Lawrie, Samuel Barham, Orion Weller, Marc Mason, Suraj Nair, Scott Miller
A key stumbling block for neural cross-language information retrieval (CLIR)
systems has been the paucity of training data. The appearance of the MS MARCO
monolingual training set led to significant advances in the state of the art in
neural monolingual retrieval. By translating the MS MARCO documents into other
languages using machine translation, this resource has been made useful to the
CLIR community. Yet such translation suffers from a number of problems. While
MS MARCO is a large resource, it is of fixed size; its genre and domain of
discourse are fixed; and the translated documents are not written in the
language of a native speaker of the language, but rather in translationese. To
address these problems, we introduce the JH-POLO CLIR training set creation
methodology. The approach begins by selecting a pair of non-English passages. A
generative large language model is then used to produce an English query for
which the first passage is relevant and the second passage is not relevant. By
repeating this process, collections of arbitrary size can be created in the
style of MS MARCO but using naturally-occurring documents in any desired genre
and domain of discourse. This paper describes the methodology in detail, shows
its use in creating new CLIR training sets, and describes experiments using the
newly created training data.
Authors' comments: 11 pages, 4 figures
Hamed Zamani, Michael Bendersky
Dense retrieval models use bi-encoder network architectures for learning
query and document representations. These representations are often in the form
of a vector representation and their similarities are often computed using the
dot product function. In this paper, we propose a new representation learning
framework for dense retrieval. Instead of learning a vector for each query and
document, our framework learns a multivariate distribution and uses negative
multivariate KL divergence to compute the similarity between distributions. For
simplicity and efficiency reasons, we assume that the distributions are
multivariate normals and then train large language models to produce mean and
variance vectors for these distributions. We provide a theoretical foundation
for the proposed framework and show that it can be seamlessly integrated into
the existing approximate nearest neighbor algorithms to perform retrieval
efficiently. We conduct an extensive suite of experiments on a wide range of
datasets, and demonstrate significant improvements compared to competitive
dense retrieval models.
Authors' comments: Accepted for publication at SIGIR 2023
Aleksei Shabanov, Aleksei Tarasov, Sergey Nikolenko
Current metric learning approaches for image retrieval are usually based on
learning a space of informative latent representations where simple approaches
such as the cosine distance will work well. Recent state of the art methods
such as HypViT move to more complex embedding spaces that may yield better
results but are harder to scale to production environments. In this work, we
first construct a simpler model based on triplet loss with hard negatives
mining that performs at the state of the art level but does not have these
drawbacks. Second, we introduce a novel approach for image retrieval
postprocessing called Siamese Transformer for Image Retrieval (STIR) that
reranks several top outputs in a single forward pass. Unlike previously
proposed Reranking Transformers, STIR does not rely on global/local feature
extraction and directly compares a query image and a retrieved candidate on
pixel level with the usage of attention mechanism. The resulting approach
defines a new state of the art on standard image retrieval datasets: Stanford
Online Products and DeepFashion In-shop. We also release the source code at
https://github.com/OML-Team/open-metric-learning/tree/main/pipelines/postprocessing/
and an interactive demo of our approach at
https://dapladoc-oml-postprocessing-demo-srcappmain-pfh2g0.streamlit.app/
Authors' comments: 14 pages, 3 figures