F. Javadi, M. J. Mehdipour
In this paper, we study phase retrievable sequences and give a characterization of phase retrievability of a sequence of bounded linear operators on a Hilbert space $H$; in particular, for $H=\ell_2^d(\Bbb{C})$. We also give several approaches for constructing phase retrievable sequences. Then, we investigate the property of phase retrieval for $g$-frames and frames.
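For readers new to the topic, the property studied in the abstract above can be stated as follows. This is the usual formulation from the phase retrieval literature, recalled here as an assumption about the paper's setting; the symbols $\{T_i\}_{i \in I}$, $\{f_i\}_{i \in I}$, $x$, $y$, and $\lambda$ are illustrative notation and do not come from the abstract.

```latex
A sequence $\{T_i\}_{i \in I}$ of bounded linear operators on $H$ is
phase retrievable if, for all $x, y \in H$,
\[
  \|T_i x\| = \|T_i y\| \quad \text{for all } i \in I
  \;\Longrightarrow\;
  x = \lambda y \ \text{ for some } \lambda \in \Bbb{C} \text{ with } |\lambda| = 1 .
\]
For a frame $\{f_i\}_{i \in I}$ of $H$, the condition reads
$|\langle x, f_i \rangle| = |\langle y, f_i \rangle|$ for all $i \in I$.
```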
Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive
video-text data on the Internet. A plethora of work characterized by using a
two-stream Vision-Language model architecture that learns a joint
representation of video-text pairs has become a prominent approach for the VTR
task. However, these models operate under the assumption of bijective
video-text correspondences and neglect a more practical scenario where video
content usually encompasses multiple events, while texts like user queries or
webpage metadata tend to be specific and correspond to single events. This
establishes a gap between the previous training objective and real-world
applications, leading to the potential performance degradation of earlier
models during inference. In this study, we introduce the Multi-event Video-Text
Retrieval (MeVTR) task, addressing scenarios in which each video contains
multiple different events, as a niche scenario of the conventional Video-Text
Retrieval Task. We present a simple model, Me-Retriever, which incorporates key
event video representation and a new MeVTR loss for the MeVTR task.
Comprehensive experiments show that this straightforward framework outperforms
other models in the Video-to-Text and Text-to-Video tasks, effectively
establishing a robust baseline for the MeVTR task. We believe this work serves
as a strong foundation for future studies. Code is available at
https://github.com/gengyuanmax/MeVTR.
Authors' comments: accepted to ICCV2023
Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks. However, they have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
Jiaqi Zhai, Zhaojie Gong, Yueming Wang, Xiao Sun, Zheng Yan, Fu Li, Xing Liu
Retrieval finds a small number of relevant candidates from a large corpus for
information retrieval and recommendation applications. A key component of
retrieval is to model (user, item) similarity, which is commonly represented as
the dot product of two learned embeddings. This formulation permits efficient
inference, commonly known as Maximum Inner Product Search (MIPS). Despite its
popularity, dot products cannot capture complex user-item interactions, which
are multifaceted and likely high rank. We hence examine non-dot-product
retrieval settings on accelerators, and propose \textit{mixture of logits}
(MoL), which models (user, item) similarity as an adaptive composition of
elementary similarity functions. This new formulation is expressive, capable of
modeling high rank (user, item) interactions, and further generalizes to the
long tail. When combined with a hierarchical retrieval strategy,
\textit{h-indexer}, we are able to scale up MoL to 100M corpus on a single GPU
with latency comparable to MIPS baselines. On public datasets, our approach
leads to uplifts of up to 77.3\% in hit rate (HR). Experiments on a large
recommendation surface at Meta showed strong metric gains and reduced
popularity bias, validating the proposed approach's performance and improved
generalization.
Authors' comments: To appear in the 29th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining (KDD 2023)
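To make the contrast in the abstract above concrete, the following is a minimal sketch of a similarity that mixes several elementary dot products with adaptive weights, compared with a single dot product. It is not the authors' implementation: the function names, the softmax gating, the pairing of the k-th user embedding with the k-th item embedding, and the fact that gate logits are passed in rather than learned are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def mol_similarity(user_embs, item_embs, gate_logits):
    """Hypothetical mixture of K elementary dot-product similarities.

    user_embs, item_embs: K embeddings each (assumed pairing k <-> k).
    gate_logits: K unnormalized gating scores (assumed given, not learned).
    """
    logits = [dot(u, v) for u, v in zip(user_embs, item_embs)]
    weights = softmax(gate_logits)
    return sum(w * l for w, l in zip(weights, logits))


# Two components with equal gating: the result is the mean of the two logits.
example = mol_similarity([[1, 0], [0, 1]], [[2, 0], [0, 4]], [0.0, 0.0])
```

With K = 1 the function reduces to an ordinary dot product, which is the MIPS setting the abstract contrasts against.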
Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen et al.
We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the
Information Retrieval Experiment Platform (TIREx) to promote more standardized,
reproducible, scalable, and even blinded retrieval experiments. Standardization
is achieved when a retrieval approach implements PyTerrier's interfaces and the
input and output of an experiment are compatible with ir_datasets and
ir_measures. However, none of this is a must for reproducibility and
scalability, as TIRA can run any dockerized software locally or remotely in a
cloud-native execution environment. Version control and caching ensure
efficient (re)execution. TIRA allows for blind evaluation when an experiment
runs on a remote server or cloud not under the control of the experimenter. The
test data and ground truth are then hidden from public access, and the
retrieval software has to process them in a sandbox that prevents data leaks.
We currently host an instance of TIREx with 15 corpora (1.9 billion
documents) on which 32 shared retrieval tasks are based. Using Docker images of
50 standard retrieval approaches, we automatically evaluated all approaches on
all tasks (50 $\cdot$ 32 = 1,600~runs) in less than a week on a midsize cluster
(1,620 CPU cores and 24 GPUs). This instance of TIREx is open for submissions
and will be integrated with the IR Anthology, as well as released open source.
Authors' comments: 11 pages. To be published in the proceedings of SIGIR 2023
Yongqi Li, Nan Yang, Liang Wang, Furu Wei, Wenjie Li
Instead of simply matching a query to pre-existing passages, generative
retrieval generates identifier strings of passages as the retrieval target. At
a cost, the identifier must be distinctive enough to represent a passage.
Current approaches use either a numeric ID or a text piece (such as a title or
substrings) as the identifier. However, these identifiers cannot cover a
passage's content well. As such, we are motivated to propose a new type of
identifier, synthetic identifiers, that are generated based on the content of a
passage and could integrate contextualized information that text pieces lack.
Furthermore, we simultaneously consider multiview identifiers, including
synthetic identifiers, titles, and substrings. These views of identifiers
complement each other and facilitate the holistic ranking of passages from
multiple perspectives. We conduct a series of experiments on three public
datasets, and the results indicate that our proposed approach performs the best
in generative retrieval, demonstrating its effectiveness and robustness.
Authors' comments: ACL 2023 Main Conference
Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay et al.
Modern recommender systems perform large-scale retrieval by first embedding
queries and item candidates in the same unified space, followed by approximate
nearest neighbor search to select top candidates given a query embedding. In
this paper, we propose a novel generative retrieval approach, where the
retrieval model autoregressively decodes the identifiers of the target
candidates. To that end, we create a semantically meaningful tuple of codewords
to serve as a Semantic ID for each item. Given Semantic IDs for items in a user
session, a Transformer-based sequence-to-sequence model is trained to predict
the Semantic ID of the next item that the user will interact with. To the best
of our knowledge, this is the first Semantic ID-based generative model for
recommendation tasks. We show that recommender systems trained with the
proposed paradigm significantly outperform the current SOTA models on various
datasets. In addition, we show that incorporating Semantic IDs into the
sequence-to-sequence model enhances its ability to generalize, as evidenced by
the improved retrieval performance observed for items with no prior interaction
history.
Authors' comments: To appear in The 37th Conference on Neural Information Processing
Systems (NeurIPS 2023)
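The abstract above does not say how the tuple of codewords is built. One way to obtain such a tuple, assumed here purely for illustration, is residual quantization of an item embedding against a sequence of fixed codebooks. The function names and the toy codebooks below are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx


def semantic_id(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Toy codebooks: a coarse level followed by a fine level (illustrative values).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [1.0, -1.0]],
]
item_code = semantic_id([11.0, 9.0], codebooks)
```

In this sketch, items whose embeddings are close share a prefix of the tuple, which is one reason such identifiers can be called semantically meaningful.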
Kazuma Kobayashi, Lin Gu, Ryuichiro Hataya, Takaaki Mizuno, Mototaka Miyake, Hirokazu Watanabe, Masamichi Takahashi, Yasuyuki Takamizawa et al.
The number of medical images stored in hospitals is increasing faster than ever; however, utilization of the accumulated medical images has been limited. This is because existing content-based medical image retrieval (CBMIR) systems usually require example images to construct query vectors; nevertheless, example images cannot always be prepared. Besides, there can be images with rare characteristics that make it difficult to find similar example images, which we call isolated samples. Here, we introduce a novel sketch-based medical image retrieval (SBMIR) system that enables users to find images of interest without example images. The key idea lies in feature decomposition of medical images, whereby the entire feature of a medical image can be decomposed into and reconstructed from normal and abnormal features. By extending this idea, our SBMIR system provides an easy-to-use two-step graphical user interface: users first select a template image to specify a normal feature and then draw a semantic sketch of the disease on the template image to represent an abnormal feature. Subsequently, it integrates the two kinds of input to construct a query vector and retrieves reference images with the closest reference vectors. Using two datasets, ten healthcare professionals with various clinical backgrounds participated in the user test for evaluation. As a result, our SBMIR system enabled users to overcome previous challenges, including image retrieval based on fine-grained image characteristics, image retrieval without example images, and image retrieval for isolated samples. Our SBMIR system achieves flexible medical image retrieval on demand, thereby expanding the utility of medical image databases.
Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer et al.
Recent multimodal models such as DALL-E and CM3 have achieved remarkable
progress in text-to-image and image-to-text generation. However, these models
store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the
model parameters, requiring increasingly larger models and training data to
capture more knowledge. To integrate knowledge in a more scalable and modular
way, we propose a retrieval-augmented multimodal model, which enables a base
multimodal model (generator) to refer to relevant text and images fetched by a
retriever from external memory (e.g., documents on the web). Specifically, for
the retriever, we use a pretrained CLIP, and for the generator, we train a CM3
Transformer on the LAION dataset. Our resulting model, named
Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can
retrieve and generate both text and images. We show that RA-CM3 significantly
outperforms baseline multimodal models such as DALL-E and CM3 on both image and
caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while
requiring much less compute for training (<30% of DALL-E). Moreover, we show
that RA-CM3 exhibits novel capabilities, such as faithful image generation and
multimodal in-context learning (e.g., image generation from demonstrations).
Authors' comments: Published at ICML 2023. Blog post available at
https://cs.stanford.edu/~myasu/blog/racm3/
Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, Wen-tau Yih
We study the problem of retrieval with instructions, where users of a
retrieval system explicitly describe their intent along with their queries. We
aim to develop a general-purpose task-aware retrieval system using multi-task
instruction tuning, which can follow human-written instructions to find the
best documents for a given query. We introduce the first large-scale collection
of approximately 40 retrieval datasets with instructions, BERRI, and present
TART, a multi-task retrieval system trained on BERRI with instructions. TART
shows strong capabilities to adapt to a new retrieval task via instructions and
advances the state of the art on two zero-shot retrieval benchmarks, BEIR and
LOTTE, outperforming models up to three times larger. We further introduce a
new evaluation setup, X^2-Retrieval, to better reflect real-world scenarios,
where diverse domains and tasks are pooled and a system needs to find documents
aligning with users' intents. In this setup, TART significantly outperforms
competitive baselines, further demonstrating the effectiveness of guiding
retrieval with instructions.
Authors' comments: Code, data and pretrained model checkpoints are available at
https://github.com/facebookresearch/tart
Kundana Mandapaka
Search algorithms are applied where data meeting given specifications must be
retrieved. The motivation behind developing search algorithms
in Functional Object-Oriented Networks (FOON) is that, most of the time, a certain
recipe needs to be retrieved or the ingredients for a certain recipe need to be
determined. According to the introduction, there are times when the execution of an
entire recipe is not available to a robot, thus prompting the need to retrieve
a certain recipe or its ingredients. With a quality FOON, robots can decipher a
task goal, find the correct objects at the required states on which to operate
and output a sequence of proper manipulation motions. This paper shows several
proposed weighted FOON and task planning algorithms that allow a robot and a
human to successfully complete complicated tasks together with higher success
rates than a human doing them alone.
Authors' comments: 3 pages, 1 figure, and 2 tables; modified references
Zecheng Wang, Yik-Cheung Tam
Causal language modeling (LM) uses word history to predict the next word.
BERT, on the other hand, makes use of bi-directional word information in a
sentence to predict words at masked positions. While BERT is effective in
sequence encoding, it is non-causal by nature and is not designed for sequence
generation. In this paper, we propose a novel language model, SUffix
REtrieval-Augmented LM (SUREALM), that simulates a bi-directional contextual
effect in an autoregressive manner. SUREALM employs an embedding retriever to
search for training sentences in a data store that share similar word history
during sequence generation. In particular, the suffix portions of the retrieved
sentences mimic the "future" context. We evaluated our proposed model on the
DSTC9 spoken dialogue corpus and showed promising word perplexity reduction on
the validation and test set compared to competitive baselines.
Authors' comments: 5 pages, 1 figure. Submitted to ICASSP 2023
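The retrieval step described in the abstract above can be sketched as follows. This is a toy illustration only: the paper uses an embedding retriever, whereas this sketch swaps in Jaccard token overlap so that it runs without any model. The function names and the tiny data store are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap, used here as a stand-in for embedding similarity."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_suffix(history, store):
    """Find the stored prefix most similar to the word history and
    return the suffix that follows it, mimicking "future" context."""
    best_score, best_suffix = -1.0, []
    for sentence in store:
        # Try every split point that leaves a non-empty prefix and suffix.
        for cut in range(1, len(sentence)):
            score = jaccard(history, sentence[:cut])
            if score > best_score:
                best_score, best_suffix = score, sentence[cut:]
    return best_suffix


store = [
    ["i", "want", "a", "cheap", "hotel"],
    ["book", "a", "table", "for", "two"],
]
suffix = retrieve_suffix(["i", "want", "a"], store)
```

A language model could then condition on both the word history and the retrieved suffix when predicting the next word; how that conditioning is done is not specified in the abstract and is left out of the sketch.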
Avishek Anand, Lijun Lyu, Maximilian Idahl, Yumeng Wang, Jonas Wallat, Zijian Zhang
Explainable information retrieval is an emerging research area aiming to make
transparent and trustworthy information retrieval systems. Given the increasing
use of complex machine learning models in search systems, explainability is
essential in building and auditing responsible information retrieval models.
This survey fills a vital gap in the otherwise topically diverse literature of
explainable information retrieval. It categorizes and discusses recent
explainability methods developed for different application domains in
information retrieval, providing a common framework and unifying perspectives.
In addition, it reflects on the common concern of evaluating explanations and
highlights open challenges and opportunities.
Authors' comments: 35 pages, 10 figures. Under review
Kshitij Alwadhi, Rohan Sharma, Siddhant Sharma
A MIDI-based approach for music recognition is proposed and implemented in this paper. Our Clarinet music retrieval system is designed to search piano MIDI files with high recall and speed. We design a novel melody extraction algorithm that improves recall results by more than 10%. We also implement three algorithms for retrieval: two self-designed (RSA Note and RSA Time) and a modified version of the Mongeau-Sankoff algorithm. Algorithms to achieve tempo and scale invariance are also discussed in this paper. The paper also contains detailed experimentation and benchmarks with four different metrics. Clarinet achieves recall scores of more than 94%.
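As an illustration of the invariance ideas mentioned in the abstract above, the sketch below shows one common way to compare melodies independently of key and tempo. It assumes that "scale invariance" refers to transposition to a different key, and it uses plain Levenshtein distance rather than the paper's RSA algorithms or the Mongeau-Sankoff algorithm. All function names are hypothetical.

```python
def intervals(pitches):
    """Successive pitch differences in semitones; unchanged by transposition."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def duration_ratios(durations):
    """Successive duration ratios; unchanged by a uniform tempo change."""
    return [b / a for a, b in zip(durations, durations[1:])]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


melody = [60, 62, 64, 65]       # MIDI note numbers
transposed = [65, 67, 69, 70]   # same melody, five semitones higher
same_tune = edit_distance(intervals(melody), intervals(transposed))
```

Comparing interval sequences instead of raw note numbers gives a distance of zero for transposed copies, while genuinely different melodies still differ.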
Hyunji Lee, Jaeyoung Kim, Hoyeon Chang, Hanseok Oh, Sohee Yang, Vlad Karpukhin, Yi Lu, Minjoon Seo
Because the generative retrieval model depends solely on the information encoded in
its model parameters, without external memory, its information capacity is
limited and fixed. To overcome the limitation, we propose Nonparametric
Decoding (Np Decoding) which can be applied to existing generative retrieval
models. Np Decoding uses nonparametric contextualized vocab embeddings
(external memory) rather than vanilla vocab embeddings as decoder vocab
embeddings. By leveraging the contextualized vocab embeddings, the generative
retrieval model is able to utilize both the parametric and nonparametric space.
Evaluation over 9 datasets (8 single-hop and 1 multi-hop) in the document
retrieval task shows that applying Np Decoding to generative retrieval models
significantly improves the performance. We also show that Np Decoding is data-
and parameter-efficient, and shows high performance in the zero-shot setting.
Authors' comments: published at Findings of ACL 2023
Baoyu Jing, Si Zhang, Yada Zhu, Bin Peng, Kaiyu Guan, Andrew Margenot, Hanghang Tong
Time series data appears in a variety of applications such as smart
transportation and environmental monitoring. One of the fundamental problems
for time series analysis is time series forecasting. Despite the success of
recent deep time series forecasting methods, they require sufficient
observation of historical values to make accurate forecasts. In other words,
the ratio of the output length (or forecasting horizon) to the sum of the input
and output lengths should be low enough (e.g., 0.3). As the ratio increases
(e.g., to 0.8), the uncertainty for the forecasting accuracy increases
significantly. In this paper, we show both theoretically and empirically that
the uncertainty could be effectively reduced by retrieving relevant time series
as references. In the theoretical analysis, we first quantify the uncertainty
and show its connections to the Mean Squared Error (MSE). Then we prove that
models with references are easier to learn than models without references since
the retrieved references could reduce the uncertainty. To empirically
demonstrate the effectiveness of the retrieval based time series forecasting
models, we introduce a simple yet effective two-stage method called ReTime,
consisting of a relational retrieval stage and a content synthesis stage. We also show that
ReTime can be easily adapted to the spatial-temporal time series and time
series imputation settings. Finally, we evaluate ReTime on real-world datasets
to demonstrate its effectiveness.
Authors' comments: CIKM'22 AMLTS
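The ratio described in the abstract above can be written out explicitly. The symbols $L_{\mathrm{in}}$, $L_{\mathrm{out}}$, and $r$ and the example lengths are illustrative assumptions; only the values 0.3 and 0.8 come from the abstract.

```latex
\[
  r = \frac{L_{\mathrm{out}}}{L_{\mathrm{in}} + L_{\mathrm{out}}},
  \qquad
  (L_{\mathrm{in}}, L_{\mathrm{out}}) = (70, 30) \;\Rightarrow\; r = \frac{30}{100} = 0.3,
  \qquad
  (L_{\mathrm{in}}, L_{\mathrm{out}}) = (20, 80) \;\Rightarrow\; r = \frac{80}{100} = 0.8 .
\]
```

In the second case the model must forecast four times as many steps as it observes, which is the regime where the abstract says uncertainty grows.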
Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
Multi-modal retrieval has seen tremendous progress with the development of
vision-language models. However, further improving these models requires
additional labelled data, which demands a huge manual effort. In this paper, we
propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual
model to boost the performance of multi-modal (image and video) retrieval. We
first use state-of-the-art machine translation models to construct pseudo
ground-truth multilingual visual-text pairs. We then use this data to learn a
joint vision-text representation where English and non-English text queries are
represented in a common embedding space based on pretrained multilingual
models. We evaluate our proposed approach on a diverse set of retrieval
datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades
and MSRVTT multilingual) and two image retrieval datasets (Flickr30k and
Multi30k). Experimental results demonstrate that our approach achieves
state-of-the-art results on all video retrieval datasets, outperforming previous
models. Additionally, our framework MuMUR significantly outperforms other
methods on the multilingual video retrieval dataset. We also observe that MuMUR exhibits
strong performance on image retrieval. This demonstrates the universal ability
of MuMUR to perform retrieval across all visual inputs (image and video) and
text inputs (monolingual and multilingual).
Authors' comments: This is an extension of the previous MKTVR paper (for which you can
find a reference here:
https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42 or in a previous
version on arXiv). This version was published in the Information Retrieval
Journal
Zichao Wang, Weili Nie, Zhuoran Qiao, Chaowei Xiao, Richard Baraniuk, Anima Anandkumar
Generating new molecules with specified chemical and biological properties
via generative models has emerged as a promising direction for drug discovery.
However, existing methods require extensive training/fine-tuning with a large
dataset, often unavailable in real-world generation tasks. In this work, we
propose a new retrieval-based framework for controllable molecule generation.
We use a small set of exemplar molecules, i.e., those that (partially) satisfy
the design criteria, to steer the pre-trained generative model towards
synthesizing molecules that satisfy the given design criteria. We design a
retrieval mechanism that retrieves and fuses the exemplar molecules with the
input molecule, which is trained by a new self-supervised objective that
predicts the nearest neighbor of the input molecule. We also propose an
iterative refinement process to dynamically update the generated molecules and
retrieval database for better generalization. Our approach is agnostic to the
choice of generative models and requires no task-specific fine-tuning. On
various tasks ranging from simple design criteria to a challenging real-world
scenario for designing lead compounds that bind to the SARS-CoV-2 main
protease, we demonstrate our approach extrapolates well beyond the retrieval
database, and achieves better performance and wider applicability than previous
methods. Code is available at https://github.com/NVlabs/RetMol.
Authors' comments: ICLR 2023
Shijie Wang, Jianlong Chang, Zhihui Wang, Haojie Li, Wanli Ouyang, Qi Tian
Fine-grained object retrieval aims to learn discriminative representation to
retrieve visually similar objects. However, existing top-performing works
usually impose pairwise similarities on the semantic embedding spaces or design
a localization sub-network to continually fine-tune the entire model in limited
data scenarios, thus resulting in convergence to suboptimal solutions. In this
paper, we develop Fine-grained Retrieval Prompt Tuning (FRPT), which steers a
frozen pre-trained model to perform the fine-grained retrieval task from the
perspectives of sample prompting and feature adaptation. Specifically, FRPT
only needs to learn fewer parameters in the prompt and adaptation instead of
fine-tuning the entire model, thus solving the issue of convergence to
suboptimal solutions caused by fine-tuning the entire model. Technically, a
discriminative perturbation prompt (DPP) is introduced and regarded as a sample
prompting process, which amplifies and even exaggerates some discriminative
elements contributing to category prediction via a content-aware inhomogeneous
sampling operation. In this way, DPP can make the fine-grained retrieval task
aided by the perturbation prompts close to the solved task during the original
pre-training. Thereby, it preserves the generalization and discrimination of
representation extracted from input samples. Besides, a category-specific
awareness head is proposed and regarded as feature adaptation, which removes
the species discrepancies in features extracted by the pre-trained model using
category-guided instance normalization. As a result, the optimized
features include only the discrepancies among subcategories. Extensive
experiments demonstrate that our FRPT with fewer learnable parameters achieves
the state-of-the-art performance on three widely-used fine-grained datasets.
Authors' comments: Accepted by AAAI 2023
Shervin Ardeshir, Nagendra Kamath, Hossein Taghavi
We explore retrieving character-focused video frames as candidates for being
video thumbnails. To evaluate each frame of the video based on the character(s)
present in it, characters (faces) are evaluated in two aspects:
(1) Facial expression: We train a CNN model to measure whether a face has an
acceptable facial expression for being in a video thumbnail. This model is
trained to distinguish faces extracted from artworks/thumbnails from faces
extracted from random frames of videos.
(2) Prominence and interactions: Character(s) in the thumbnail should be
important character(s) in the video, to
prevent the algorithm from suggesting non-representative frames as candidates.
We use face clustering to identify the characters in the video, and form a
graph in which the prominence (frequency of appearance) of the character(s),
and their interactions (co-occurrence) are captured. We use this graph to infer
the relevance of the characters present in each candidate frame. Once every
face is scored based on the two criteria above, we infer frame level scores by
combining the scores for all the faces within a frame.
Authors' comments: International Conference on Machine Learning. Machine Learning for
Media Discovery (ML4MD) Workshop 2020
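The final scoring step described in the abstract above can be sketched as follows. The abstract does not say how the two criteria or the per-face scores are combined, so the product and sum used here are assumptions, as are the function names. Prominence is taken as the fraction of frames in which a character appears, and character interactions (co-occurrence) are left out for brevity.

```python
from collections import Counter


def prominence(frames):
    """Fraction of frames in which each character id appears.

    frames: list of lists of character ids (e.g. face-cluster labels).
    """
    counts = Counter(cid for frame in frames for cid in set(frame))
    total = len(frames)
    return {cid: n / total for cid, n in counts.items()}


def frame_score(faces, prom):
    """Combine per-face scores into one frame-level score.

    faces: list of (character_id, expression_score) pairs for one frame.
    Assumed combination: expression score times prominence, summed over faces.
    """
    return sum(expr * prom.get(cid, 0.0) for cid, expr in faces)


# Character "a" appears in 3 of 4 frames, "b" in 2 of 4.
video = [["a"], ["a", "b"], ["a"], ["b"]]
prom = prominence(video)
score = frame_score([("a", 0.8), ("b", 0.4)], prom)
```

Candidate frames can then be ranked by this score; a different aggregation, such as a maximum or a mean over faces, would fit the abstract's description equally well.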