Wenzheng Zhang, Chenyan Xiong, Karl Stratos, Arnold Overwijk
In multitask retrieval, a single retriever is trained to retrieve relevant
contexts for multiple tasks. Despite its practical appeal, naive multitask
retrieval lags behind task-specific retrieval in which a separate retriever is
trained for each task. We show that it is possible to train a multitask
retriever that outperforms task-specific retrievers by promoting task
specialization. The main ingredients are: (1) a better choice of pretrained
model (one that is explicitly optimized for multitasking) along with compatible
prompting, and (2) a novel adaptive learning method that encourages each
parameter to specialize in a particular task. The resulting multitask retriever
is highly performant on the KILT benchmark. Upon analysis, we find that the
model indeed learns parameters that are more task-specialized compared to naive
multitasking without prompting or adaptive learning.
Authors' comments: TACL 2023
Iain Mackie, Shubham Chatterjee, Sean MacAvaney, Jeffrey Dalton
Despite considerable progress in neural relevance ranking techniques, search engines still struggle to process complex queries effectively, both in terms of precision and recall. Sparse and dense Pseudo-Relevance Feedback (PRF) approaches have the potential to overcome limitations in recall, but are only effective with high precision in the top ranks. In this work, we tackle the problem of search over complex queries using three complementary techniques. First, we demonstrate that applying a strong neural re-ranker before sparse or dense PRF can improve the retrieval effectiveness by 5-8%. This improvement in PRF effectiveness can be attributed directly to improving the precision of the feedback set. Second, we propose an enhanced expansion model, Latent Entity Expansion (LEE), which applies fine-grained word and entity-based relevance modelling incorporating localized features. Specifically, we find that including both words and entities for expansion achieves a further 2-8% improvement in NDCG. Our analysis also demonstrates that LEE is largely robust to its parameters across datasets and performs well on entity-centric queries. Third, we include an 'adaptive' component in the retrieval process, which iteratively refines the re-ranking pool during scoring using the expansion model and avoids re-ranking additional documents. We find that this combination of techniques achieves the best NDCG, MAP and R@1000 results on the TREC Robust 2004 and CODEC document datasets, demonstrating a significant advancement in expansion effectiveness.
William Yang, Noah Bergam, Arnav Jain, Nima Sheikhoslami
In this paper, we consider the extent to which the transformer-based Dense Passage Retrieval (DPR) algorithm, developed by Karpukhin et al. (2020), can be optimized without further pre-training. Our method involves two particular insights: we apply the DPR context encoder at various phrase lengths (e.g. one-sentence versus five-sentence segments), and we take a confidence-calibrated ensemble prediction over all of these different segmentations. This somewhat exhaustive approach achieves state-of-the-art results on benchmark datasets such as Google NQ and SQuAD. We also apply our method to domain-specific datasets, and the results suggest how different granularities are optimal for different domains.
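The two ideas in the abstract above (segmenting at several sentence granularities, then taking a calibrated ensemble) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `score_fn` is a hypothetical stand-in for the query-to-segment similarity from the DPR encoders, and the sigmoid calibration with per-granularity `scale` and `offset` is an assumed form of calibration.

```python
import math
import re

def sentence_windows(text, size):
    """Group the sentences of `text` into consecutive windows of `size` sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

def calibrated(score, scale, offset):
    """Sigmoid calibration of a raw score; scale and offset are assumed values."""
    return 1.0 / (1.0 + math.exp(-(scale * score + offset)))

def ensemble_top(text, calibration, score_fn):
    """Return (segment, confidence, size) with the highest calibrated confidence.

    calibration maps window size -> (scale, offset).
    score_fn(segment) is a hypothetical raw relevance score for one segment.
    """
    best = None
    for size, (scale, offset) in calibration.items():
        for segment in sentence_windows(text, size):
            confidence = calibrated(score_fn(segment), scale, offset)
            if best is None or confidence > best[1]:
                best = (segment, confidence, size)
    return best
```

Calibrating each granularity separately is what makes scores from one-sentence and five-sentence segments comparable before the maximum is taken.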
Yongqi Li, Nan Yang, Liang Wang, Furu Wei, Wenjie Li
Generative retrieval stands out as a promising new paradigm in text retrieval
that aims to generate identifier strings of relevant passages as the retrieval
target. This generative paradigm taps into powerful generative language models,
distinct from traditional sparse or dense retrieval methods. However, only
learning to generate is insufficient for generative retrieval. Generative
retrieval learns to generate identifiers of relevant passages as an
intermediate goal and then converts predicted identifiers into the final
passage rank list. The disconnect between the learning objective of
autoregressive models and the desired passage ranking target leads to a
learning gap. To bridge this gap, we propose a learning-to-rank framework for
generative retrieval, dubbed LTRGR. LTRGR enables generative retrieval to learn
to rank passages directly, optimizing the autoregressive model toward the final
passage ranking target via a rank loss. This framework only requires an
additional learning-to-rank training phase to enhance current generative
retrieval systems and does not add any burden to the inference stage. We
conducted experiments on three public benchmarks, and the results demonstrate
that LTRGR achieves state-of-the-art performance among generative retrieval
methods. The code and checkpoints are released at
https://github.com/liyongqi67/LTRGR.
Authors' comments: AAAI 2024
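A rank loss of the kind the LTRGR abstract describes can be illustrated with a pairwise margin (hinge) formulation. The abstract does not specify the exact form, so the margin loss below is an assumption, and the scores are placeholders for whatever passage-level score the generative retriever produces.

```python
def margin_rank_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that is zero once the relevant passage outscores the
    irrelevant one by at least `margin` (assumed form of the rank loss)."""
    return max(0.0, margin - (pos_score - neg_score))

def batch_rank_loss(pairs, margin=1.0):
    """Mean margin loss over (relevant_score, irrelevant_score) pairs."""
    return sum(margin_rank_loss(p, n, margin) for p, n in pairs) / len(pairs)
```

Because the loss is computed on passage scores rather than on generated tokens, it targets the final ranking directly and adds nothing to inference.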
Erik Malm
Phase retrieval is the numerical procedure of recovering a complex-valued signal from knowledge about its amplitude and some additional information. Here, an indirect registration procedure, based on the large deformation diffeomorphic metric mapping (LDDMM) formalism, is investigated as a phase retrieval method for coherent diffractive imaging. The method attempts to find a deformation which transforms an initial, template image to match an unknown target image by comparing the diffraction pattern to the data. The exterior calculus framework is used to treat different types of deformations in a unified and coordinate-free way. The algorithm performance with respect to measurement noise, image topology, and particular action is explored through numerical examples.
Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke
Current large language models (LLMs) can exhibit near-human levels of performance on many natural language tasks, including open-domain question answering. Unfortunately, they also convincingly hallucinate incorrect answers, so that responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report a simple experiment to automatically verify generated answers against a corpus. After presenting a question to an LLM and receiving a generated answer, we query the corpus with the combination of the question + generated answer. We then present the LLM with the combination of the question + generated answer + retrieved answer, prompting it to indicate if the generated answer can be supported by the retrieved answer. We base our experiment on questions and passages from the MS MARCO (V1) test collection, exploring three retrieval approaches ranging from standard BM25 to a full question answering stack, including a reader based on the LLM. For a large fraction of questions, we find that an LLM is capable of verifying its generated answer if appropriate supporting material is provided. However, with an accuracy of 70-80%, this approach cannot be fully relied upon to detect hallucinations.
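The verification loop described in the abstract above can be sketched as three steps. The callables `generate`, `retrieve`, and `judge` are hypothetical placeholders for the LLM, the retrieval stack, and the LLM-based support check; none of them corresponds to a specific library API.

```python
def verify_answer(question, generate, retrieve, judge):
    """Generate an answer, retrieve evidence for it, and ask whether it is supported.

    generate(question) -> answer string
    retrieve(query) -> passage string from the corpus
    judge(question, answer, passage) -> bool (True if the passage supports the answer)
    All three are caller-supplied stand-ins.
    """
    answer = generate(question)
    # The retrieval query combines the question with the generated answer.
    passage = retrieve(question + " " + answer)
    supported = judge(question, answer, passage)
    return {"answer": answer, "passage": passage, "supported": supported}
```

Given the reported 70-80% accuracy, the `supported` flag is best treated as a signal rather than a guarantee.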
Michael Glass, Xueqing Wu, Ankita Rajaram Naik, Gaetano Rossiello, Alfio Gliozzo
Data preparation, also called data wrangling, is considered one of the most
expensive and time-consuming steps when performing analytics or building
machine learning models. Preparing data typically involves collecting and
merging data from complex, heterogeneous, and often large-scale data sources,
such as data lakes. In this paper, we introduce a novel approach toward
automatic data wrangling in an attempt to alleviate the effort of end-users,
e.g. data analysts, in structuring dynamic views from data lakes in the form of
tabular data. We aim to address table augmentation tasks, including row/column
population and data imputation. Given a corpus of tables, we propose a
retrieval augmented self-trained transformer model. Our self-learning strategy
consists in randomly ablating tables from the corpus and training the
retrieval-based model to reconstruct the original values or headers given the
partial tables as input. We adopt this strategy to first train the dense neural
retrieval model encoding table-parts to vectors, and then the end-to-end model
trained to perform table augmentation tasks. We test on EntiTables, the
standard benchmark for table augmentation, as well as introduce a new benchmark
to advance further research: WebTables. Our model consistently and
substantially outperforms both supervised statistical methods and the current
state-of-the-art transformer-based models.
Authors' comments: Findings of ACL 2023
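The self-training strategy in the abstract above (ablate part of a table, then train a model to reconstruct it) can be illustrated with a small data-generation sketch. The table layout (a dict with `header` and `rows`) and the choice to ablate a single cell are assumptions made for illustration only.

```python
import random

def ablate_cell(table, rng):
    """Blank one randomly chosen cell and return (partial_table, position, target).

    table is assumed to be {"header": [...], "rows": [[...], ...]}.
    The original table is left unmodified.
    """
    row = rng.randrange(len(table["rows"]))
    col = rng.randrange(len(table["header"]))
    partial_rows = [list(r) for r in table["rows"]]
    target = partial_rows[row][col]
    partial_rows[row][col] = None  # the value the model must reconstruct
    partial = {"header": list(table["header"]), "rows": partial_rows}
    return partial, (row, col), target
```

Each call yields one training example: the partial table is the input and `target` is the label, so no manual annotation is needed.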
Panuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong
Dense retrieval is a basic building block of information retrieval
applications. One of the main challenges of dense retrieval in real-world
settings is the handling of queries containing misspelled words. A popular
approach for handling misspelled queries is minimizing the representation
discrepancy between misspelled queries and their pristine ones. Unlike the
existing approaches, which only focus on the alignment between misspelled and
pristine queries, our method also improves the contrast between each misspelled
query and its surrounding queries. To assess the effectiveness of our proposed
method, we compare it against the existing competitors using two benchmark
datasets and two base encoders. Our method outperforms the competitors in all
cases with misspelled queries. Our code and models are available at
https://github.com/panuthept/DST-DenseRetrieval.
Authors' comments: 5 pages, 2 figures
Sijie Zhao, Yixiao Ge, Zhongang Qi, Lin Song, Xiaohan Ding, Zehua Xie, Ying Shan
Stickers have become a ubiquitous part of modern-day communication, conveying complex emotions through visual imagery. To facilitate the development of more powerful algorithms for analyzing stickers, we propose a large-scale Chinese sticker dataset, namely Sticker820K, which consists of 820k image-text pairs. Each sticker has rich and high-quality textual annotations, including descriptions, optical characters, emotional labels, and style classifications. Although vision-language tasks in the domain of natural images have been well studied, directly applying those models, such as CLIP, to sticker data is not an optimal solution due to the discrepancy between natural and emotive image data. Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset. For the text-to-image retrieval task, our StickerCLIP demonstrates strong superiority over CLIP, achieving an absolute gain of 66.0% in mean recall on the Sticker820K test set. Additionally, we endeavor to extend the recently popularized LLM by means of prompt tuning, integrating its ability for sticker retrieval and allowing users to retrieve stickers through instructions. We validate the feasibility of this method, demonstrating the immense potential of prompt tuning in expanding LLM abilities while not affecting the quality of upstream tasks.
Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes.
Yikun Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang, Weidi Xie
In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism, to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure, and the effectiveness of core components in TransAgg; (iv) when evaluated on publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly conducting inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/
Shufang Xie, Rui Yan, Junliang Guo, Yingce Xia, Lijun Wu, Tao Qin
Retrosynthesis, which predicts the reactants of a given target molecule, is
an essential task for drug discovery. In recent years, machine learning
based retrosynthesis methods have achieved promising results. In this work, we
introduce RetroKNN, a local reaction template retrieval method to further boost
the performance of template-based systems with non-parametric retrieval. We
first build an atom-template store and a bond-template store that contain the
local templates in the training data, then retrieve from these templates with a
k-nearest-neighbor (KNN) search during inference. The retrieved templates are
combined with neural network predictions as the final output. Furthermore, we
propose a lightweight adapter to adjust the weights when combining neural network
and KNN predictions conditioned on the hidden representation and the retrieved
templates. We conduct comprehensive experiments on two widely used benchmarks,
USPTO-50K and USPTO-MIT. In particular, we improve top-1 accuracy by
7.1% on the USPTO-50K dataset and 12.0% on the USPTO-MIT dataset. These results
demonstrate the effectiveness of our method.
Authors' comments: AAAI-2023 camera ready
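The combination of KNN-retrieved templates with neural predictions described in the RetroKNN abstract can be sketched as a weighted mixture of two distributions. The distance-to-probability mapping (a softmax over negative distances) and the fixed `knn_weight` and `temperature` are assumptions; in the paper these weights are adjusted by a learned adapter, which is not modeled here.

```python
import math
from collections import defaultdict

def knn_distribution(neighbors, temperature=1.0):
    """Turn (template_id, distance) neighbors into a probability distribution.

    Closer neighbors get more mass; repeated templates accumulate weight.
    """
    weights = defaultdict(float)
    for template, distance in neighbors:
        weights[template] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {template: w / total for template, w in weights.items()}

def combine(neural_probs, knn_probs, knn_weight=0.5):
    """Interpolate neural and KNN template probabilities with a fixed weight."""
    templates = set(neural_probs) | set(knn_probs)
    return {
        t: knn_weight * knn_probs.get(t, 0.0) + (1.0 - knn_weight) * neural_probs.get(t, 0.0)
        for t in templates
    }
```

With `knn_weight=0.0` the output reduces to the neural prediction, so the retrieval component can only change the ranking where it carries weight.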
Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
Chats have emerged as an effective user-friendly approach for information
retrieval, and are successfully employed in many domains, such as customer
service, healthcare, and finance. However, existing image retrieval approaches
typically address the case of a single query-to-image round, and the use of
chats for image retrieval has been mostly overlooked. In this work, we
introduce ChatIR: a chat-based image retrieval system that engages in a
conversation with the user to elicit information, in addition to an initial
query, in order to clarify the user's search intent. Motivated by the
capabilities of today's foundation models, we leverage Large Language Models to
generate follow-up questions to an initial image description. These questions
form a dialog with the user in order to retrieve the desired image from a large
corpus. In this study, we explore the capabilities of such a system tested on a
large dataset and reveal that engaging in a dialog yields significant gains in
image retrieval. We start by building an evaluation pipeline from an existing
manually generated dataset and explore different modules and training
strategies for ChatIR. Our comparison includes strong baselines derived from
related applications trained with Reinforcement Learning. Our system is capable
of retrieving the target image from a pool of 50K images with over 78% success
rate after 5 dialogue rounds, compared to 75% when questions are asked by
humans, and 64% for a single shot text-to-image retrieval. Extensive
evaluations reveal the strong capabilities and examine the limitations of
ChatIR under different settings. The project repository is available at
https://github.com/levymsn/ChatIR.
Authors' comments: Camera Ready version for NeurIPS 2023
Thong Nguyen, Sean MacAvaney, Andrew Yates
Learned sparse retrieval (LSR) is a family of neural retrieval methods that
transform queries and documents into sparse weight vectors aligned with a
vocabulary. While LSR approaches like Splade work well for short passages, it
is unclear how well they handle longer documents. We investigate existing
aggregation approaches for adapting LSR to longer documents and find that
proximal scoring is crucial for LSR to handle long documents. To leverage this
property, we propose two adaptations of the Sequential Dependence Model (SDM)
to LSR: ExactSDM and SoftSDM. ExactSDM assumes only exact query term
dependence, while SoftSDM uses potential functions that model the dependence of
query terms and their expansion terms (i.e., terms identified using a
transformer's masked language modeling head).
Experiments on the MSMARCO Document and TREC Robust04 datasets demonstrate
that both ExactSDM and SoftSDM outperform existing LSR aggregation approaches
for different document length constraints. Surprisingly, SoftSDM does not
provide any performance benefits over ExactSDM. This suggests that soft
proximity matching is not necessary for modeling term dependence in LSR.
Overall, this study provides insights into handling long documents with LSR,
proposing adaptations that improve its performance.
Authors' comments: SIGIR 2023
Po-Ya Angela Wang, Pin-Er Chen, Hsin-Yu Chou, Yu-Hsiang Tseng, Shu-Kai Hsieh
Multimodal corpora have become an essential language resource for language science and grounded natural language processing (NLP) systems due to the growing need to understand and interpret human communication across various channels. In this paper, we first present our efforts in building the first Multimodal Corpus for Languages in Taiwan (MultiMoco). Based on the corpus, we conduct a case study investigating the Lexical Retrieval Hypothesis (LRH), specifically examining whether the hand gestures co-occurring with speech constants facilitate lexical retrieval or serve other discourse functions. With detailed annotations on eight parliamentary interpellations in Taiwan Mandarin, we explore the co-occurrence between speech constants and non-verbal features (i.e., head movement, face movement, hand gesture, and function of hand gesture). Our findings suggest that while hand gestures do serve as facilitators for lexical retrieval in some cases, they also serve the purpose of information emphasis. This study highlights the potential of the MultiMoco Corpus to provide an important resource for in-depth analysis and further research in multimodal communication studies.
Michael Tang, Shunyu Yao, John Yang, Karthik Narasimhan
We propose Referral-Augmented Retrieval (RAR), a simple technique that concatenates document indices with referrals, i.e. text from other documents that cite or link to the given document, to provide significant performance gains for zero-shot information retrieval. The key insight behind our method is that referrals provide a more complete, multi-view representation of a document, much like incoming page links in algorithms like PageRank provide a comprehensive idea of a webpage's importance. RAR works with both sparse and dense retrievers, and outperforms generative text expansion techniques such as DocT5Query and Query2Doc, with a 37% and 21% absolute improvement on ACL paper retrieval Recall@10, while also eliminating expensive model training and inference. We also analyze different methods for multi-referral aggregation and show that RAR enables up-to-date information retrieval without re-training.
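The concatenation step that the RAR abstract describes can be sketched in a few lines. The dictionary layout and the toy word-overlap scorer are illustrative assumptions; any sparse or dense retriever could index the augmented text instead.

```python
def augment_with_referrals(documents, referrals):
    """Append each document's referral texts to its own text before indexing.

    documents: {doc_id: text}; referrals: {doc_id: [text citing or linking to it]}.
    """
    return {
        doc_id: " ".join([text, *referrals.get(doc_id, [])])
        for doc_id, text in documents.items()
    }

def overlap_score(query, text):
    """Toy relevance score: number of distinct query words present in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))
```

Since augmentation only rewrites the indexed text, new referrals can be added later by re-indexing, with no model training involved.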
Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, Danqi Chen
Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts, by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly kNN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that kNN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks. When privacy information is targeted and readily detected in the text, we find that a simple sanitization step would completely eliminate the risks, while decoupling query and key encoders achieves an even better utility-privacy trade-off. Otherwise, we consider strategies of mixing public and private data in both datastore and encoder training. While these methods offer modest improvements, they leave considerable room for future work. Together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based LMs. Our code is available at: https://github.com/Princeton-SysML/kNNLM_privacy .
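For the case the abstract calls targeted and readily detected private information, a sanitization step might look like the sketch below. Treating email addresses as the targeted information, the regular expression, and the `[REDACTED]` placeholder are all assumptions chosen for illustration; the paper's own sanitization procedure may differ.

```python
import re

# Assumed pattern for one kind of targeted private information (email addresses).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def sanitize(text, pattern=EMAIL, placeholder="[REDACTED]"):
    """Replace every match of `pattern` in `text` with `placeholder`."""
    return pattern.sub(placeholder, text)

def sanitize_datastore(entries, pattern=EMAIL):
    """Sanitize each text entry before it is added to the retrieval datastore."""
    return [sanitize(entry, pattern) for entry in entries]
```

This only helps when the private content matches a known pattern, which is why the abstract treats untargeted private text as a separate, harder case.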
Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, Vincent Y Zhao
In-context learning (ICL), teaching a large language model (LLM) to perform a task with few-shot demonstrations rather than adjusting the model parameters, has emerged as a strong paradigm for using LLMs. While early studies primarily used a fixed or random set of demonstrations for all test queries, recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance. This work expands the applicability of retrieval-based ICL approaches by demonstrating that even simple word-overlap similarity measures such as BM25 outperform randomly selected demonstrations. Furthermore, we extend the success of retrieval-based ICL to instruction-finetuned LLMs as well as Chain-of-Thought (CoT) prompting. For instruction-finetuned LLMs, we find that although a model has already seen the training data at training time, retrieving demonstrations from the training data at test time yields better results compared to using no demonstrations or random demonstrations. Last but not least, we train a task-specific demonstration retriever that outperforms off-the-shelf retrievers.
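Selecting demonstrations with BM25, as mentioned in the abstract above, can be sketched with a small self-contained scorer. The whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are common choices assumed here, not details taken from the paper.

```python
import math
from collections import Counter

def bm25_rank(query, pool, k1=1.5, b=0.75):
    """Return indices of `pool` sorted from most to least BM25-similar to `query`."""
    docs = [text.lower().split() for text in pool]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

def select_demonstrations(query, pool, k):
    """Pick the k demonstrations from the pool that best match the test query."""
    return [pool[i] for i in bm25_rank(query, pool)[:k]]
```

The selected demonstrations would then be placed in the prompt ahead of the test query, in place of a fixed or random set.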
Peitian Zhang, Zheng Liu, Yujia Zhou, Zhicheng Dou, Fangchao Liu, Zhao Cao
Recently, generative retrieval has emerged as a promising alternative to traditional retrieval paradigms. It assigns each document a unique identifier, known as DocID, and employs a generative model to directly generate the relevant DocID for the input query. A common choice for DocID is one or several natural language sequences, e.g. the title or n-grams, so that the pre-trained knowledge of the generative model can be utilized. However, a sequence is generated token by token, where only the most likely candidates are kept and the rest are pruned at each decoding step; thus, retrieval fails if any token within the relevant DocID is falsely pruned. What's worse, during decoding, the model can only perceive preceding tokens in the DocID while being blind to subsequent ones, and hence is prone to making such errors. To address this problem, we present a novel framework for generative retrieval, dubbed Term-Set Generation (TSGen). Instead of sequences, we use a set of terms as DocID, which are automatically selected to concisely summarize the document's semantics and distinguish it from others. On top of the term-set DocID, we propose a permutation-invariant decoding algorithm, with which the term set can be generated in any permutation yet will always lead to the corresponding document. Remarkably, TSGen perceives all valid terms rather than only the preceding ones at each decoding step. Given the constant decoding space, it can make more reliable decisions due to the broader perspective. TSGen is also resilient to errors: the relevant DocID will not be pruned as long as the decoded term belongs to it. Lastly, we design an iterative optimization procedure to incentivize the model to generate the relevant term set in its favorable permutation. We conduct extensive experiments on popular benchmarks, which validate the effectiveness, the generalizability, the scalability, and the efficiency of TSGen.
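The permutation-invariant idea in the TSGen abstract can be illustrated by tracking which documents remain consistent with the terms decoded so far, regardless of their order. This is a simplified sketch of the constraint only: it omits the generative model and its scoring, and the `term_sets` mapping is a hypothetical example of term-set DocIDs.

```python
def decoding_state(term_sets, decoded):
    """Return (candidate_docs, valid_next_terms) after decoding `decoded` terms.

    term_sets: {doc_id: set of terms forming that document's DocID}.
    A document stays a candidate if its term set contains every decoded term,
    so the order in which terms were decoded does not matter.
    """
    prefix = set(decoded)
    candidates = sorted(doc for doc, terms in term_sets.items() if prefix <= terms)
    valid = set()
    for doc in candidates:
        valid |= term_sets[doc] - prefix
    return candidates, sorted(valid)
```

At every step the valid set contains all remaining terms of each candidate, which is the broader view the abstract contrasts with left-to-right sequence decoding.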
Ilias Chalkidis, Yova Kementchedjhieva
Multi-label text classification (MLC) is a challenging task in settings of large label sets, where label support follows a Zipfian distribution. In this paper, we address this problem through retrieval augmentation, aiming to improve the sample efficiency of classification models. Our approach closely follows the standard MLC architecture of a Transformer-based encoder paired with a set of classification heads. In our case, however, the input document representation is augmented through cross-attention to similar documents retrieved from the training set and represented in a task-specific manner. We evaluate this approach on four datasets from the legal and biomedical domains, all of which feature highly skewed label distributions. Our experiments show that retrieval augmentation substantially improves model performance on the long tail of infrequent labels, especially in lower-resource training scenarios and more challenging long-document data scenarios.