Anirudh Khatry, Yasharth Bajpai, Priyanshu Gupta, Sumit Gulwani, Ashish Tiwari
Information retrieval involves selecting artifacts from a corpus that are
most relevant to a given search query. The flavor of retrieval typically used
in classical applications can be termed as homogeneous and relaxed, where
queries and corpus elements are both natural language (NL) utterances
(homogeneous) and the goal is to pick the most relevant elements from the corpus in
the Top-K, where K is large, such as 10, 25, 50 or even 100 (relaxed).
Recently, retrieval has been used extensively in preparing prompts for large
language models (LLMs) to enable LLMs to perform targeted tasks. These new
applications of retrieval are often heterogeneous and strict -- the queries and
the corpus contain different kinds of entities, such as NL and code, and there
is a need for improving retrieval at Top-K for small values of K, such as K=1
or 3 or 5. Current dense retrieval techniques based on pretrained embeddings
provide a general-purpose and powerful approach for retrieval, but they are
oblivious to task-specific notions of similarity of heterogeneous artifacts. We
introduce Adapted Dense Retrieval, a mechanism to transform embeddings to
enable improved task-specific, heterogeneous and strict retrieval. Adapted
Dense Retrieval works by learning a low-rank residual adaptation of the
pretrained black-box embedding. We empirically validate our approach by showing
improvements over the state-of-the-art general-purpose embeddings-based
baseline.
Authors' comments: 14 pages
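The low-rank residual adaptation mentioned in the abstract above can be pictured as adding a learned low-rank correction to a frozen embedding. The following is a minimal sketch, assuming the adapted embedding has the form e + B(Ae); the names `adapt`, `A`, and `B`, and the zero initialization of `B`, are illustrative assumptions and are not taken from the paper.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def adapt(e, A, B):
    """Return e + B(Ae): the frozen embedding plus a rank-r correction.

    A is r x d (projects down) and B is d x r (projects back up).
    Both are hypothetical learned parameters; e itself is never modified.
    """
    low_rank = matvec(B, matvec(A, e))
    return [e_i + delta_i for e_i, delta_i in zip(e, low_rank)]


# Toy example with embedding dimension d = 3 and rank r = 1.
e = [1.0, 2.0, 3.0]
A = [[1.0, 0.0, 0.0]]            # 1 x 3
B_zero = [[0.0], [0.0], [0.0]]   # 3 x 1; zero B leaves the embedding unchanged
B = [[0.0], [0.5], [0.0]]        # 3 x 1; a non-zero correction

unchanged = adapt(e, A, B_zero)
shifted = adapt(e, A, B)
```

With `B` set to zero the adapter returns the pretrained embedding unchanged, so in this sketch training would start from the black-box model's behavior and only move away from it as `A` and `B` are updated.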
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and
effective audio captioning system that generates captions conditioned on an
input audio and other captions similar to the audio retrieved from a datastore.
Additionally, our proposed method can transfer to any domain without the need
for any additional fine-tuning. To generate a caption for an audio sample, we
leverage an audio-text model CLAP to retrieve captions similar to it from a
replaceable datastore, which are then used to construct a prompt. Next, we feed
this prompt to a GPT-2 decoder and introduce cross-attention layers between the
CLAP encoder and GPT-2 to condition the audio for caption generation.
Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP
achieves competitive performance in in-domain settings and significant
improvements in out-of-domain settings. Additionally, due to its capability to
exploit a large text-captions-only datastore in a training-free
fashion, RECAP shows unique capabilities of captioning novel audio events never
seen during training and compositional audios with multiple events. To promote
research in this space, we also release 150,000+ new weakly labeled captions
for AudioSet, AudioCaps, and Clotho.
Authors' comments: Code and data soon here: https://github.com/Sreyan88/RECAP
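The retrieve-then-prompt step described in the abstract above can be sketched as a nearest-neighbor lookup followed by string assembly. This is only an illustration: the toy vectors stand in for CLAP embeddings, and the function names and prompt template are hypothetical rather than the ones used by RECAP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(audio_emb, datastore, k=2):
    """Return the k captions whose embeddings are closest to audio_emb."""
    ranked = sorted(datastore, key=lambda item: cosine(audio_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


def build_prompt(captions):
    """Join retrieved captions into a text prompt (hypothetical template)."""
    return "Similar sounds: " + "; ".join(captions) + ". Caption:"


# Toy datastore of (caption, embedding) pairs; the vectors are made up.
datastore = [
    ("a dog barks", [1.0, 0.0]),
    ("rain falls on a roof", [0.0, 1.0]),
    ("a puppy yelps", [0.9, 0.1]),
]
audio_emb = [1.0, 0.05]

top = retrieve_captions(audio_emb, datastore, k=2)
prompt = build_prompt(top)
```

Because the datastore is just a list here, swapping it for captions from another domain requires no retraining, which is the property the abstract calls a replaceable datastore.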
Haokun Wen, Xian Zhang, Xuemeng Song, Yinwei Wei, Liqiang Nie
Composed image retrieval (CIR) is a new and flexible image retrieval paradigm, which can retrieve the target image for a multimodal query, including a reference image and its corresponding modification text. Although existing efforts have achieved compelling success, they overlook the conflict relationship modeling between the reference image and the modification text for improving the multimodal query composition and the adaptive matching degree modeling for promoting the ranking of the candidate images that could present different levels of matching degrees with the given query. To address these two limitations, in this work, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). In particular, TG-CIR first extracts the unified global and local attribute features for the reference/target image and the modification text with the contrastive language-image pre-training model (CLIP) as the backbone, where an orthogonal regularization is introduced to promote the independence among the attribute features. Then TG-CIR designs a target-query relationship-guided multimodal query composition module, comprising a target-free student composition branch and a target-based teacher composition branch, where the target-query relationship is injected into the teacher branch for guiding the conflict relationship modeling of the student branch. Last, apart from the conventional batch-based classification loss, TG-CIR additionally introduces a batch-based target similarity-guided matching degree regularization to promote the metric learning process. Extensive experiments on three benchmark datasets demonstrate the superiority of our proposed method.
F. Javadi, M. J. Mehdipour
In this paper, we study phase retrievable sequences and give a characterization of phase retrievability of a sequence of bounded linear operators on a Hilbert space $H$; in particular, for $H=\ell_2^d(\Bbb{C})$. We also give several approaches for constructing phase retrievable sequences. Then, we investigate the property of phase retrieval for $g$-frames and frames.
Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive
video-text data on the Internet. A plethora of work characterized by using a
two-stream Vision-Language model architecture that learns a joint
representation of video-text pairs has become a prominent approach for the VTR
task. However, these models operate under the assumption of bijective
video-text correspondences and neglect a more practical scenario where video
content usually encompasses multiple events, while texts like user queries or
webpage metadata tend to be specific and correspond to single events. This
establishes a gap between the previous training objective and real-world
applications, leading to the potential performance degradation of earlier
models during inference. In this study, we introduce the Multi-event Video-Text
Retrieval (MeVTR) task, addressing scenarios in which each video contains
multiple different events, as a niche scenario of the conventional Video-Text
Retrieval Task. We present a simple model, Me-Retriever, which incorporates key
event video representation and a new MeVTR loss for the MeVTR task.
Comprehensive experiments show that this straightforward framework outperforms
other models in the Video-to-Text and Text-to-Video tasks, effectively
establishing a robust baseline for the MeVTR task. We believe this work serves
as a strong foundation for future studies. Code is available at
https://github.com/gengyuanmax/MeVTR.
Authors' comments: accepted to ICCV2023
Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks. However, they have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
Jiaqi Zhai, Zhaojie Gong, Yueming Wang, Xiao Sun, Zheng Yan, Fu Li, Xing Liu
Retrieval finds a small number of relevant candidates from a large corpus for
information retrieval and recommendation applications. A key component of
retrieval is to model (user, item) similarity, which is commonly represented as
the dot product of two learned embeddings. This formulation permits efficient
inference, commonly known as Maximum Inner Product Search (MIPS). Despite its
popularity, dot products cannot capture complex user-item interactions, which
are multifaceted and likely high rank. We hence examine non-dot-product
retrieval settings on accelerators, and propose mixture of logits
(MoL), which models (user, item) similarity as an adaptive composition of
elementary similarity functions. This new formulation is expressive, capable of
modeling high rank (user, item) interactions, and further generalizes to the
long tail. When combined with a hierarchical retrieval strategy,
h-indexer, we are able to scale up MoL to 100M corpus on a single GPU
with latency comparable to MIPS baselines. On public datasets, our approach
leads to uplifts of up to 77.3% in hit rate (HR). Experiments on a large
recommendation surface at Meta showed strong metric gains and reduced
popularity bias, validating the proposed approach's performance and improved
generalization.
Authors' comments: To appear in the 29th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining (KDD 2023)
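The idea of modeling similarity as an adaptive composition of elementary similarity functions, as described in the abstract above, can be sketched as a gated weighted sum of several dot products. This is a simplified illustration under stated assumptions: the k-th user embedding is paired only with the k-th item embedding, and the gating logits are passed in directly instead of being computed from the (user, item) pair. All names are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product, the elementary similarity used in this sketch."""
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def mol_similarity(user_embs, item_embs, gate_logits):
    """Weighted sum of component dot products with softmax gating weights."""
    component_logits = [dot(u, x) for u, x in zip(user_embs, item_embs)]
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, component_logits))


# Two components with dot products 1.0 and 3.0, weighted equally.
user_embs = [[1.0, 0.0], [0.0, 3.0]]
item_embs = [[1.0, 0.0], [0.0, 1.0]]
score = mol_similarity(user_embs, item_embs, gate_logits=[0.0, 0.0])
```

With a single component and a constant gate this reduces to an ordinary dot product, which is one way to see why the mixture is at least as expressive as the MIPS formulation.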
Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen et al.
We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the
Information Retrieval Experiment Platform (TIREx) to promote more standardized,
reproducible, scalable, and even blinded retrieval experiments. Standardization
is achieved when a retrieval approach implements PyTerrier's interfaces and the
input and output of an experiment are compatible with ir_datasets and
ir_measures. However, none of this is a must for reproducibility and
scalability, as TIRA can run any dockerized software locally or remotely in a
cloud-native execution environment. Version control and caching ensure
efficient (re)execution. TIRA allows for blind evaluation when an experiment
runs on a remote server or cloud not under the control of the experimenter. The
test data and ground truth are then hidden from public access, and the
retrieval software has to process them in a sandbox that prevents data leaks.
We currently host an instance of TIREx with 15 corpora (1.9 billion
documents) on which 32 shared retrieval tasks are based. Using Docker images of
50 standard retrieval approaches, we automatically evaluated all approaches on
all tasks (50 × 32 = 1,600 runs) in less than a week on a midsize cluster
(1,620 CPU cores and 24 GPUs). This instance of TIREx is open for submissions
and will be integrated with the IR Anthology, as well as released open source.
Authors' comments: 11 pages. To be published in the proceedings of SIGIR 2023
Yongqi Li, Nan Yang, Liang Wang, Furu Wei, Wenjie Li
Instead of simply matching a query to pre-existing passages, generative
retrieval generates identifier strings of passages as the retrieval target. At
a cost, the identifier must be distinctive enough to represent a passage.
Current approaches use either a numeric ID or a text piece (such as a title or
substrings) as the identifier. However, these identifiers cannot cover a
passage's content well. As such, we are motivated to propose a new type of
identifier, synthetic identifiers, that are generated based on the content of a
passage and could integrate contextualized information that text pieces lack.
Furthermore, we simultaneously consider multiview identifiers, including
synthetic identifiers, titles, and substrings. These views of identifiers
complement each other and facilitate the holistic ranking of passages from
multiple perspectives. We conduct a series of experiments on three public
datasets, and the results indicate that our proposed approach performs the best
in generative retrieval, demonstrating its effectiveness and robustness.
Authors' comments: ACL 2023 Main Conference
Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay et al.
Modern recommender systems perform large-scale retrieval by first embedding
queries and item candidates in the same unified space, followed by approximate
nearest neighbor search to select top candidates given a query embedding. In
this paper, we propose a novel generative retrieval approach, where the
retrieval model autoregressively decodes the identifiers of the target
candidates. To that end, we create a semantically meaningful tuple of codewords
to serve as a Semantic ID for each item. Given Semantic IDs for items in a user
session, a Transformer-based sequence-to-sequence model is trained to predict
the Semantic ID of the next item that the user will interact with. To the best
of our knowledge, this is the first Semantic ID-based generative model for
recommendation tasks. We show that recommender systems trained with the
proposed paradigm significantly outperform the current SOTA models on various
datasets. In addition, we show that incorporating Semantic IDs into the
sequence-to-sequence model enhances its ability to generalize, as evidenced by
the improved retrieval performance observed for items with no prior interaction
history.
Authors' comments: To appear in The 37th Conference on Neural Information Processing
Systems (NeurIPS 2023)
Kazuma Kobayashi, Lin Gu, Ryuichiro Hataya, Takaaki Mizuno, Mototaka Miyake, Hirokazu Watanabe, Masamichi Takahashi, Yasuyuki Takamizawa et al.
The amount of medical images stored in hospitals is increasing faster than ever; however, utilizing the accumulated medical images has been limited. This is because existing content-based medical image retrieval (CBMIR) systems usually require example images to construct query vectors; nevertheless, example images cannot always be prepared. Besides, there can be images with rare characteristics that make it difficult to find similar example images, which we call isolated samples. Here, we introduce a novel sketch-based medical image retrieval (SBMIR) system that enables users to find images of interest without example images. The key idea lies in feature decomposition of medical images, whereby the entire feature of a medical image can be decomposed into and reconstructed from normal and abnormal features. By extending this idea, our SBMIR system provides an easy-to-use two-step graphical user interface: users first select a template image to specify a normal feature and then draw a semantic sketch of the disease on the template image to represent an abnormal feature. Subsequently, it integrates the two kinds of input to construct a query vector and retrieves reference images with the closest reference vectors. Using two datasets, ten healthcare professionals with various clinical backgrounds participated in the user test for evaluation. As a result, our SBMIR system enabled users to overcome previous challenges, including image retrieval based on fine-grained image characteristics, image retrieval without example images, and image retrieval for isolated samples. Our SBMIR system achieves flexible medical image retrieval on demand, thereby expanding the utility of medical image databases.
Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer et al.
Recent multimodal models such as DALL-E and CM3 have achieved remarkable
progress in text-to-image and image-to-text generation. However, these models
store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the
model parameters, requiring increasingly larger models and training data to
capture more knowledge. To integrate knowledge in a more scalable and modular
way, we propose a retrieval-augmented multimodal model, which enables a base
multimodal model (generator) to refer to relevant text and images fetched by a
retriever from external memory (e.g., documents on the web). Specifically, for
the retriever, we use a pretrained CLIP, and for the generator, we train a CM3
Transformer on the LAION dataset. Our resulting model, named
Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can
retrieve and generate both text and images. We show that RA-CM3 significantly
outperforms baseline multimodal models such as DALL-E and CM3 on both image and
caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while
requiring much less compute for training (<30% of DALL-E). Moreover, we show
that RA-CM3 exhibits novel capabilities, such as faithful image generation and
multimodal in-context learning (e.g., image generation from demonstrations).
Authors' comments: Published at ICML 2023. Blog post available at
https://cs.stanford.edu/~myasu/blog/racm3/
Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, Wen-tau Yih
We study the problem of retrieval with instructions, where users of a
retrieval system explicitly describe their intent along with their queries. We
aim to develop a general-purpose task-aware retrieval system using multi-task
instruction tuning, which can follow human-written instructions to find the
best documents for a given query. We introduce the first large-scale collection
of approximately 40 retrieval datasets with instructions, BERRI, and present
TART, a multi-task retrieval system trained on BERRI with instructions. TART
shows strong capabilities to adapt to a new retrieval task via instructions and
advances the state of the art on two zero-shot retrieval benchmarks, BEIR and
LOTTE, outperforming models up to three times larger. We further introduce a
new evaluation setup, X^2-Retrieval, to better reflect real-world scenarios,
where diverse domains and tasks are pooled and a system needs to find documents
aligning with users' intents. In this setup, TART significantly outperforms
competitive baselines, further demonstrating the effectiveness of guiding
retrieval with instructions.
Authors' comments: Code, data and pretrained model checkpoints are available at
https://github.com/facebookresearch/tart
Kundana Mandapaka
Search algorithms are applied where data must be retrieved according to given
specifications. The motivation behind developing search algorithms
in Functional Object-Oriented Networks (FOON) is that most of the time, a certain
recipe needs to be retrieved or the ingredients for a certain recipe need to be
determined. As the introduction notes, there are times when the execution of an
entire recipe is not available to a robot, thus prompting the need to retrieve
a certain recipe or ingredients. With a quality FOON, robots can decipher a
task goal, find the correct objects at the required states on which to operate
and output a sequence of proper manipulation motions. This paper shows several
proposed weighted FOON and task planning algorithms that allow a robot and a
human to successfully complete complicated tasks together with higher success
rates than a human doing them alone.
Authors' comments: 3 pages, 1 figure, and 2 tables; modified references
Zecheng Wang, Yik-Cheung Tam
Causal language modeling (LM) uses word history to predict the next word.
BERT, on the other hand, makes use of bi-directional word information in a
sentence to predict words at masked positions. While BERT is effective in
sequence encoding, it is non-causal by nature and is not designed for sequence
generation. In this paper, we propose a novel language model, SUffix
REtrieval-Augmented LM (SUREALM), that simulates a bi-directional contextual
effect in an autoregressive manner. SUREALM employs an embedding retriever to
search for training sentences in a data store that share similar word history
during sequence generation. In particular, the suffix portions of the retrieved
sentences mimic the "future" context. We evaluated our proposed model on the
DSTC9 spoken dialogue corpus and showed promising word perplexity reduction on
the validation and test sets compared to competitive baselines.
Authors' comments: 5 pages, 1 figure. Submitted to ICASSP 2023
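The suffix-retrieval idea in the abstract above can be sketched as follows: find stored sentences whose prefix resembles the current word history and return their suffixes as a stand-in for future context. The abstract describes an embedding retriever; this sketch swaps in a simple Jaccard token overlap purely for illustration, and the datastore layout and function names are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap; a stand-in for the embedding similarity in the abstract."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve_suffixes(history, datastore, k=1):
    """Return suffixes of the k stored sentences whose prefix best matches history.

    datastore is a list of (prefix_tokens, suffix_tokens) pairs.
    """
    ranked = sorted(datastore, key=lambda pair: jaccard(history, pair[0]), reverse=True)
    return [suffix for _, suffix in ranked[:k]]


# Toy datastore of training sentences, already split into prefix and suffix.
datastore = [
    (["i", "would", "like"], ["a", "table", "for", "two"]),
    (["the", "weather", "is"], ["nice", "today"]),
]
history = ["i", "would", "like"]

future_context = retrieve_suffixes(history, datastore, k=1)
```

The retrieved suffix only uses words from other training sentences, so generation stays autoregressive while still seeing a plausible continuation.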
Avishek Anand, Lijun Lyu, Maximilian Idahl, Yumeng Wang, Jonas Wallat, Zijian Zhang
Explainable information retrieval is an emerging research area aiming to make
transparent and trustworthy information retrieval systems. Given the increasing
use of complex machine learning models in search systems, explainability is
essential in building and auditing responsible information retrieval models.
This survey fills a vital gap in the otherwise topically diverse literature of
explainable information retrieval. It categorizes and discusses recent
explainability methods developed for different application domains in
information retrieval, providing a common framework and unifying perspectives.
In addition, it reflects on the common concern of evaluating explanations and
highlights open challenges and opportunities.
Authors' comments: 35 pages, 10 figures. Under review
Kshitij Alwadhi, Rohan Sharma, Siddhant Sharma
A MIDI-based approach for music recognition is proposed and implemented in this paper. Our Clarinet music retrieval system is designed to search piano MIDI files with high recall and speed. We design a novel melody extraction algorithm that improves recall results by more than 10%. We also implement three algorithms for retrieval: two self-designed (RSA Note and RSA Time), and a modified version of the Mongeau Sankoff Algorithm. Algorithms to achieve tempo and scale invariance are also discussed in this paper. The paper also contains detailed experimentation and benchmarks with four different metrics. Clarinet achieves recall scores of more than 94%.
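One common way to make melody matching insensitive to key and tempo is to compare pitch intervals and duration ratios instead of absolute values. The sketch below illustrates that general idea only; it assumes "scale invariance" refers to transposition, and it is not necessarily the method used in Clarinet. The function name and note representation are hypothetical.

```python
def invariant_features(notes):
    """Convert (midi_pitch, duration) notes into (interval, duration_ratio) pairs.

    Intervals are unchanged by transposition, and duration ratios are
    unchanged by a uniform change of tempo.
    """
    return [
        (p2 - p1, d2 / d1)
        for (p1, d1), (p2, d2) in zip(notes, notes[1:])
    ]


# The same three-note melody, then transposed up a fourth and played twice as fast.
melody = [(60, 1.0), (62, 1.0), (64, 2.0)]
variant = [(65, 0.5), (67, 0.5), (69, 1.0)]

melody_features = invariant_features(melody)
variant_features = invariant_features(variant)
```

Both sequences map to the same features, so an edit-distance style matcher running on them would treat the two performances as identical.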
Hyunji Lee, Jaeyoung Kim, Hoyeon Chang, Hanseok Oh, Sohee Yang, Vlad Karpukhin, Yi Lu, Minjoon Seo
Because the generative retrieval model depends solely on the information encoded
in its model parameters without external memory, its information capacity is
limited and fixed. To overcome the limitation, we propose Nonparametric
Decoding (Np Decoding) which can be applied to existing generative retrieval
models. Np Decoding uses nonparametric contextualized vocab embeddings
(external memory) rather than vanilla vocab embeddings as decoder vocab
embeddings. By leveraging the contextualized vocab embeddings, the generative
retrieval model is able to utilize both the parametric and nonparametric space.
Evaluation over 9 datasets (8 single-hop and 1 multi-hop) in the document
retrieval task shows that applying Np Decoding to generative retrieval models
significantly improves the performance. We also show that Np Decoding is data-
and parameter-efficient, and shows high performance in the zero-shot setting.
Authors' comments: published at Findings of ACL 2023
Baoyu Jing, Si Zhang, Yada Zhu, Bin Peng, Kaiyu Guan, Andrew Margenot, Hanghang Tong
Time series data appears in a variety of applications such as smart
transportation and environmental monitoring. One of the fundamental problems
for time series analysis is time series forecasting. Despite the success of
recent deep time series forecasting methods, they require sufficient
observation of historical values to make accurate forecasts. In other words,
the ratio of the output length (or forecasting horizon) to the sum of the input
and output lengths should be low enough (e.g., 0.3). As the ratio increases
(e.g., to 0.8), the uncertainty for the forecasting accuracy increases
significantly. In this paper, we show both theoretically and empirically that
the uncertainty could be effectively reduced by retrieving relevant time series
as references. In the theoretical analysis, we first quantify the uncertainty
and show its connections to the Mean Squared Error (MSE). Then we prove that
models with references are easier to learn than models without references since
the retrieved references could reduce the uncertainty. To empirically
demonstrate the effectiveness of the retrieval based time series forecasting
models, we introduce a simple yet effective two-stage method, called ReTime,
consisting of a relational retrieval and a content synthesis. We also show that
ReTime can be easily adapted to the spatial-temporal time series and time
series imputation settings. Finally, we evaluate ReTime on real-world datasets
to demonstrate its effectiveness.
Authors' comments: CIKM'22 AMLTS
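The ratio discussed in the abstract above is the output length divided by the sum of the input and output lengths. A small worked example, with hypothetical function and parameter names, shows how the quoted values of 0.3 and 0.8 arise:

```python
def forecast_ratio(input_len, output_len):
    """Output length divided by the sum of the input and output lengths."""
    return output_len / (input_len + output_len)


# 70 observed steps and a 30-step horizon: the "low enough" regime.
low = forecast_ratio(input_len=70, output_len=30)

# 20 observed steps and an 80-step horizon: the high-uncertainty regime.
high = forecast_ratio(input_len=20, output_len=80)
```

The ratio rises when either the history shrinks or the horizon grows, and it is this high-ratio regime that the abstract targets by retrieving reference series.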
Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
Multi-modal retrieval has seen tremendous progress with the development of
vision-language models. However, further improving these models requires
additional labelled data, which takes a huge manual effort. In this paper, we
propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual
model to boost the performance of multi-modal (image and video) retrieval. We
first use state-of-the-art machine translation models to construct pseudo
ground-truth multilingual visual-text pairs. We then use this data to learn a
joint vision-text representation where English and non-English text queries are
represented in a common embedding space based on pretrained multilingual
models. We evaluate our proposed approach on a diverse set of retrieval
datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades,
and MSRVTT multilingual) and two image retrieval datasets (Flickr30k and
Multi30k). Experimental results demonstrate that our approach achieves
state-of-the-art results on all video retrieval datasets, outperforming previous
models. Additionally, our framework MuMUR significantly outperforms other
approaches on the multilingual video retrieval dataset. We also observe that MuMUR exhibits
strong performance on image retrieval. This demonstrates the universal ability
of MuMUR to perform retrieval across all visual inputs (image and video) and
text inputs (monolingual and multilingual).
Authors' comments: This is an extension of the previous MKTVR paper (for which you can
find a reference here:
https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42 or in a previous
version on arXiv). This version was published in the Information Retrieval
Journal