Kailash A. Hambarde, Hugo Proenca
In this paper, we provide a detailed overview of the models used for information retrieval in the first and second stages of the typical processing chain. We discuss the current state-of-the-art models, including term-based, semantic retrieval, and neural methods. Additionally, we delve into the key topics related to the learning process of these models. In this way, the survey offers a comprehensive understanding of the field and is of interest to researchers and practitioners entering or working in the information retrieval domain.
Bo Fang, Wenhao Wu, Chang Liu, Yu Zhou, Yuxin Song, Weiping Wang, Xiangbo Shu, Xiangyang Ji et al.
With the explosive growth of web videos and emerging large-scale
vision-language pre-training models, e.g., CLIP, retrieving videos of interest
with text instructions has attracted increasing attention. A common practice is
to transfer text-video pairs to the same embedding space and craft cross-modal
interactions with certain entities in specific granularities for semantic
correspondence. Unfortunately, the intrinsic uncertainties of optimal entity
combinations in appropriate granularities for cross-modal queries are
understudied, which is especially critical for modalities with hierarchical
semantics, e.g., video, text, etc. In this paper, we propose an
Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models
each look-up as a distribution matching procedure. Concretely, we add
additional learnable tokens in the encoders to adaptively aggregate
multi-grained semantics for flexible high-level reasoning. In the refined
embedding space, we represent text-video pairs as probabilistic distributions
where prototypes are sampled for matching evaluation. Comprehensive experiments
on four benchmarks justify the superiority of our UATVR, which achieves new
state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and
DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
Authors' comments: To appear at ICCV2023
Malavika Vasist, François Rozet, Olivier Absil, Paul Mollière, Evert Nasedkin, Gilles Louppe
Retrieving the physical parameters from spectroscopic observations of
exoplanets is key to understanding their atmospheric properties. Exoplanetary
atmospheric retrievals are usually based on approximate Bayesian inference and
rely on sampling-based approaches to compute parameter posterior distributions.
Accurate or repeated retrievals, however, can result in very long computation
times due to the sequential nature of sampling-based algorithms. We aim to
amortize exoplanetary atmospheric retrieval using neural posterior estimation
(NPE), a simulation-based inference algorithm based on variational inference
and normalizing flows. In this way, we aim (i) to strongly reduce inference
time, (ii) to scale inference to complex simulation models with many nuisance
parameters or intractable likelihood functions, and (iii) to enable the
statistical validation of the inference results. We evaluate NPE on a radiative
transfer model for exoplanet spectra petitRADTRANS, including the effects of
scattering and clouds. We train a neural autoregressive flow to quickly
estimate posteriors and compare against retrievals computed with MultiNest. NPE
produces accurate posterior approximations while reducing inference time down
to a few seconds. We demonstrate the computational faithfulness of our
posterior approximations using inference diagnostics including posterior
predictive checks and coverage, taking advantage of the quasi-instantaneous
inference time of NPE. Our analysis confirms the reliability of the approximate
posteriors produced by NPE. The accuracy and reliability of the inference
results produced by NPE establish it as a promising approach for atmospheric
retrievals. Amortization of the posterior inference makes repeated inference on
several observations computationally inexpensive since it does not require
on-the-fly simulations, making the retrieval efficient, scalable, and testable.
Authors' comments: The paper has been submitted to A&A after a final revision
Fahimeh Arabyani-Neyshaburi, Ali Akbar Arefijamaal, Rajab Ali Kamyabi-Gol
Recovering a signal up to a unimodular constant from the magnitudes of linear measurements has been popular and well studied in recent years. However, numerous unsolved problems regarding phase retrieval still exist. Given a phase retrieval frame, can the family of phase retrieval dual frames be classified? Is such a family dense in the set of dual frames? Can we present equivalent conditions for a family of vectors to do weak phase retrieval in the complex Hilbert space case? What is the connection between phase, weak phase and norm retrieval? In this context, we aim to deal with these open problems concerning phase retrieval dual frames and weak phase retrieval frames, and especially to investigate equivalent conditions for identifying these features. We provide some characterizations of alternate dual frames of a phase retrieval frame which yield phase retrieval in finite dimensional Hilbert spaces. Moreover, for some classes of frames, we show that the family of phase retrieval dual frames is open and dense in the set of dual frames. Then, we study the weak phase retrieval problem. Among other things, we obtain some equivalent conditions on a family of vectors to do phase retrieval in terms of weak phase retrieval.
Nam Le Hai, Thomas Gerald, Thibault Formal, Jian-Yun Nie, Benjamin Piwowarski, Laure Soulier
Conversational search is a difficult task as it aims at retrieving documents
based not only on the current user query but also on the full conversation
history. Most of the previous methods have focused on a multi-stage ranking
approach relying on query reformulation, a critical intermediate step that
might lead to a sub-optimal retrieval. Other approaches have tried to use a
fully neural IR first-stage, but are either zero-shot or rely on full
learning-to-rank based on a dataset with pseudo-labels. In this work,
leveraging the CANARD dataset, we propose an innovative lightweight learning
technique to train a first-stage ranker based on SPLADE. By relying on SPLADE
sparse representations, we show that, when combined with a second-stage ranker
based on T5Mono, the results are competitive on the TREC CAsT 2020 and 2021
tracks.
Authors' comments: Accepted at ECIR 2023
Wedad Alharbi, Daniel Freeman, Dorsa Ghoreishi, Claire Lois, Shanea Sebastian
A frame $(x_j)_{j\in J}$ for a Hilbert space $H$ is said to do phase
retrieval if for all distinct vectors $x,y\in H$ the magnitude of the frame
coefficients $(|\langle x, x_j\rangle|)_{j\in J}$ and $(|\langle y,
x_j\rangle|)_{j\in J}$ distinguish $x$ from $y$ (up to a unimodular scalar). A
frame which does phase retrieval is said to do $C$-stable phase retrieval if
the recovery of any vector $x\in H$ from the magnitude of the frame
coefficients is $C$-Lipschitz. It is known that if a frame does stable phase
retrieval then any sufficiently small perturbation of the frame vectors will do
stable phase retrieval, though with a slightly worse stability constant. We
provide new quantitative bounds on how the stability constant for phase
retrieval is affected by a small perturbation of the frame vectors. These
bounds are significant in that they are independent of the dimension of the
Hilbert space and the number of vectors in the frame.
Authors' comments: 14 pages
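The "up to a unimodular scalar" qualifier in the definition above can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the toy frame for C^2, the test vector, and the helper names `inner` and `magnitudes` are all illustrative assumptions. It shows that x and c*x with |c| = 1 produce identical frame-coefficient magnitudes, which is why recovery can only ever be defined up to such a scalar.

```python
import cmath

def inner(x, y):
    """Inner product <x, y> on C^n, linear in the first argument."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

def magnitudes(x, frame):
    """Magnitudes of the frame coefficients (|<x, x_j>|)_j."""
    return [abs(inner(x, xj)) for xj in frame]

# Hypothetical toy frame for C^2, chosen only for illustration.
frame = [(1, 0), (0, 1), (1, 1), (1, 1j)]

x = (1 + 2j, 3 - 1j)
c = cmath.exp(0.7j)              # unimodular scalar, |c| = 1
cx = tuple(c * a for a in x)     # same vector up to a global phase

m_x = magnitudes(x, frame)
m_cx = magnitudes(cx, frame)

# The two magnitude lists agree up to floating-point rounding.
print(all(abs(a - b) < 1e-12 for a, b in zip(m_x, m_cx)))
```

A vector that differs from x by more than a global phase would, for a frame that does phase retrieval, change at least one of these magnitudes.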
Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Guodong Long, Can Xu, Daxin Jiang
Long document retrieval aims to fetch query-relevant documents from a
large-scale collection, where knowledge distillation has become the de facto way to
improve a retriever by mimicking a heterogeneous yet powerful cross-encoder.
However, in contrast to passages or sentences, retrieval on long documents
suffers from the scope hypothesis that a long document may cover multiple
topics. This maximizes their structure heterogeneity and poses a
granular-mismatch issue, leading to an inferior distillation efficacy. In this
work, we propose a new learning framework, fine-grained distillation (FGD), for
long-document retrievers. While preserving the conventional dense retrieval
paradigm, it first produces global-consistent representations crossing
different fine granularity and then applies multi-granular aligned distillation
merely during training. In experiments, we evaluate our framework on two
long-document retrieval benchmarks, which show state-of-the-art performance.
Authors' comments: 13 pages, 5 figures, 5 tables
Dong Li, Yelong Shen, Ruoming Jin, Yi Mao, Kuan Wang, Weizhu Chen
Pre-trained language models have achieved promising success in code retrieval tasks, where a natural language documentation query is given to find the most relevant existing code snippet. However, existing models focus only on optimizing the documentation-code pairs by embedding them into a latent space, without the association of external knowledge. In this paper, we propose a generation-augmented query expansion framework. Inspired by the human retrieval process of sketching an answer before searching, we utilize a powerful code generation model to benefit the code retrieval task. Specifically, we demonstrate that rather than merely retrieving the target code snippet according to the documentation query, it is helpful to augment the documentation query with its generation counterpart, that is, code snippets generated by the code generation model. To the best of our knowledge, this is the first attempt to leverage a code generation model to enhance the code retrieval task. We achieve new state-of-the-art results on the CodeSearchNet benchmark and surpass the baselines significantly.
Jing Lu, Keith Hall, Ji Ma, Jianmo Ni
We present Hybrid Infused Reranking for Passages Retrieval (HYRR), a framework for training rerankers based on a hybrid of BM25 and neural retrieval models. Retrievers based on hybrid models have been shown to outperform both BM25 and neural models alone. Our approach exploits this improved performance when training a reranker, leading to a robust reranking model. The reranker, a cross-attention neural model, is shown to be robust to different first-stage retrieval systems, achieving better performance than rerankers simply trained upon the first-stage retrievers in the multi-stage systems. We present evaluations on a supervised passage retrieval task using MS MARCO and zero-shot retrieval tasks using BEIR. The empirical results show strong performance on both evaluations.
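To make the idea of a hybrid of BM25 and neural retrieval concrete, here is a minimal Python sketch of one common way to fuse two score lists. The abstract does not say how HYRR combines the two retrievers, so the min-max normalization, the interpolation weight `alpha`, the function names, and the toy scores are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_rank(bm25, dense, alpha=0.5):
    """Rank documents by a weighted sum of normalized BM25 and dense scores.

    Documents missing from one retriever's list get a score of 0 for it.
    """
    b, n = min_max(bm25), min_max(dense)
    docs = set(b) | set(n)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * n.get(d, 0.0)
             for d in docs}
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical first-stage scores for one query.
bm25_scores = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
dense_scores = {"d1": 0.2, "d2": 0.9, "d4": 0.5}

ranking = hybrid_rank(bm25_scores, dense_scores)
print(ranking)
```

In a multi-stage setup, a candidate list such as `ranking` is what would be handed to the reranker; other fusion rules, such as reciprocal rank fusion, would fit the same slot.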
David Uthus, Jianmo Ni
Evaluating automatically-generated text summaries is a challenging task. While there have been many interesting approaches, they still fall short of human evaluations. We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval. RISE is first trained as a retrieval task using a dual-encoder retrieval setup, and can then be used for evaluating a generated summary given an input document, without gold reference summaries. RISE is especially well suited when working on new datasets where one may not have reference summaries available for evaluation. We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) and the results show that RISE has higher correlation with human evaluations compared to many past approaches to summarization evaluation. Furthermore, RISE also demonstrates data-efficiency and generalizability across languages.
Riya Gupta, C. V. Jawahar
Extracting the relevant information out of a large number of documents is a
challenging and tedious task. The quality of results generated by the
traditionally available full-text search engine and text-based image retrieval
systems is not optimal. Information retrieval (IR) tasks become more
challenging with the nontraditional language scripts, as in the case of Indic
scripts. The authors have developed OCR (Optical Character Recognition) Search
Engine to make an Information Retrieval & Extraction (IRE) system that
replicates the current state-of-the-art methods using the IRE and Natural
Language Processing (NLP) techniques. Here we have presented the study of the
methods used for performing search and retrieval tasks. The details of this
system, along with the statistics of the dataset (source: National Digital
Library of India or NDLI), are also presented. Additionally, the ideas to
further explore and add value to research in IRE are also discussed.
Authors' comments: 6 pages including references, 5 figures, and 1 table. For project
page see
https://cvit.iiit.ac.in/research/projects/cvit-projects/retrieval-from-large-document-image-collections
Yookoon Park, Mahmoud Azab, Bo Xiong, Seungwhan Moon, Florian Metze, Gourab Kundu, Kirmani Ahmed
Cross-modal contrastive learning has led the recent advances in multimodal
retrieval with its simplicity and effectiveness. In this work, however, we
reveal that cross-modal contrastive learning suffers from incorrect
normalization of the sum retrieval probabilities of each text or video
instance. Specifically, we show that many test instances are either over- or
under-represented during retrieval, significantly hurting the retrieval
performance. To address this problem, we propose Normalized Contrastive
Learning (NCL) which utilizes the Sinkhorn-Knopp algorithm to compute the
instance-wise biases that properly normalize the sum retrieval probabilities of
each instance so that every text and video instance is fairly represented
during cross-modal retrieval. Empirical study shows that NCL brings consistent
and significant gains in text-video retrieval on different model architectures,
with new state-of-the-art multimodal retrieval metrics on the ActivityNet,
MSVD, and MSR-VTT datasets without any architecture engineering.
Authors' comments: Published in EMNLP 2022
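The normalization issue described above can be illustrated with the generic Sinkhorn-Knopp iteration. The Python sketch below alternately rescales the rows and columns of a positive matrix until both sum to one, so that no text or video instance is over- or under-represented. It is only a sketch of the classical algorithm on a small square matrix: the abstract does not specify how NCL derives its instance-wise biases, and the similarity values, function name, and iteration count are assumptions.

```python
import math

def sinkhorn_knopp(sim, n_iters=200):
    """Scale exp(sim) toward a doubly stochastic matrix.

    sim is a square list of lists of text-video similarities.
    """
    n = len(sim)
    p = [[math.exp(v) for v in row] for row in sim]
    for _ in range(n_iters):
        # Make every row sum to 1.
        for i in range(n):
            row_sum = sum(p[i])
            p[i] = [v / row_sum for v in p[i]]
        # Make every column sum to 1.
        for j in range(n):
            col_sum = sum(p[i][j] for i in range(n))
            for i in range(n):
                p[i][j] /= col_sum
    return p

# Hypothetical similarities; column 0 dominates before normalization.
sim = [[2.0, 0.5, 0.1],
       [1.8, 0.3, 0.2],
       [1.5, 0.4, 1.0]]

p = sinkhorn_knopp(sim)
row_sums = [sum(row) for row in p]
col_sums = [sum(p[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

Before normalization, the first video would be the top match for every text; after it, each row and each column carries the same total probability mass.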
Gabriel Laverghetta
Robotic agents often perform tasks that transform sets of input objects into
output objects through functional motions. This work describes the FOON
knowledge representation model for robotic tasks. We define the structure and
key components of FOON and describe the process we followed to create our
universal FOON dataset. The paper describes various search algorithms and
heuristic functions we used to search for objects within the FOON. We performed
multiple searches on our universal FOON using these algorithms and discussed
the effectiveness of each algorithm.
Authors' comments: 4 pages, 3 figures, 3 tables
SeungHeon Doh, Minz Won, Keunwoo Choi, Juhan Nam
This paper introduces effective design choices for text-to-music retrieval systems. An ideal text-based retrieval system would support various input queries such as pre-defined tags, unseen tags, and sentence-level descriptions. In reality, most previous works mainly focused on a single query type (tag or sentence) which may not generalize to another input type. Hence, we review recent text-based music retrieval systems using our proposed benchmark in two main aspects: input text representation and training objectives. Our findings enable a universal text-to-music retrieval system that achieves comparable retrieval performances in both tag- and sentence-level inputs. Furthermore, the proposed multimodal representation generalizes to 9 different downstream music classification tasks. We present the code and demo online.
Ziyan Chen, Heng Wu, Jing Cheng
The current ghost imaging phase reconstruction schemes require either complex optical systems, Fourier transform steps, or iterative algorithms, which may increase the difficulty of system design, cause phase retrieval error or take too much time. To address this problem, we propose a five-step phase-shifting method in which no complex optical systems, Fourier transform steps, or iterative algorithms are needed. With five designed incoherent sources, one can obtain five different corresponding ghost imaging patterns, then the phase information of the object can be calculated from those five speckle patterns. The applicability of this theoretical proposal is demonstrated via numerical simulations with two kinds of complicated objects, and the results illustrate the phase information of the complicated object can be reconstructed successfully and quantitatively.
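For readers unfamiliar with phase shifting, the following Python sketch shows the classical five-step formula from interferometry, in which five intensities recorded with known phase shifts give the phase in closed form, without Fourier transforms or iteration. This is an assumption-laden illustration only: the abstract does not give the authors' formula or their source design, and the shift values (-pi, -pi/2, 0, pi/2, pi), the intensity model I_k = a + b*cos(phi + delta_k), and the function names are not taken from the paper.

```python
import math

# Assumed phase shifts of the five measurements.
SHIFTS = [-math.pi, -math.pi / 2, 0.0, math.pi / 2, math.pi]

def simulate(phi, a=1.0, b=0.5):
    """Five intensities I_k = a + b*cos(phi + delta_k) for one pixel."""
    return [a + b * math.cos(phi + d) for d in SHIFTS]

def recover_phase(i1, i2, i3, i4, i5):
    """Closed-form phase from five phase-shifted intensities.

    Under the model above, i2 - i4 = 2*b*sin(phi) and
    2*i3 - i1 - i5 = 4*b*cos(phi), so atan2 returns phi for b > 0.
    """
    return math.atan2(2.0 * (i2 - i4), 2.0 * i3 - i1 - i5)

true_phases = [-2.5, -1.0, 0.3, 2.0]
recovered = [recover_phase(*simulate(phi)) for phi in true_phases]
print(recovered)
```

The background term a and the modulation depth b cancel out of the ratio, which is why the recovery is quantitative rather than merely qualitative.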
Daniel Reich, Felix Putze, Tanja Schultz
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest an improved scene understanding. While VQA accuracy performances have seen impressive gains over the past few years, explicit improvements to VG performance and evaluation thereof have often taken a back seat on the road to overall accuracy improvements. A cause of this originates in the predominant choice of learning paradigm for VQA systems, which consists of training a discriminative classifier over a predetermined set of answer options. In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task. As such, the developed system directly ties VG into its core search procedure. Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question. We give a detailed analysis of our approach and discuss its distinctive properties and limitations. Our approach achieves the strongest VG performance among examined systems and exhibits exceptional generalization capabilities in a number of scenarios.
Xunjian Yin, Xinyu Hu, Jin Jiang, Xiaojun Wan
Chinese Spelling Check (CSC) aims to detect and correct error tokens in
Chinese contexts, which has a wide range of applications. However, it is
confronted with the challenges of insufficient annotated data and the issue
that previous methods may actually not fully leverage the existing datasets. In
this paper, we introduce our plug-and-play retrieval method with error-robust
information for Chinese Spelling Check (RERIC), which can be directly applied
to existing CSC models. The datastore for retrieval is built completely based
on the training data, with elaborate designs according to the characteristics
of CSC. Specifically, we employ multimodal representations that fuse phonetic,
morphologic, and contextual information in the calculation of query and key
during retrieval to enhance robustness against potential errors. Furthermore,
in order to better judge the retrieved candidates, the n-gram surrounding the
token to be checked is regarded as the value and utilized for specific
reranking. The experiment results on the SIGHAN benchmarks demonstrate that our
proposed method achieves substantial improvements over existing work.
Authors' comments: 11 pages, 3 figures
Arusarka Bose, Zili Zhou, Guandong Xu
The increasing volume of COVID-19 research literature poses new challenges for effective literature screening and COVID-19 domain-knowledge-aware information retrieval. To tackle these challenges, we demonstrate two tasks along with solutions: COVID-19 literature retrieval and question answering. The COVID-19 literature retrieval task screens matching COVID-19 literature documents for a textual user query, and the COVID-19 question answering task predicts proper text fragments from a text corpus as the answer to specific COVID-19-related questions. Based on transformer neural networks, we provide solutions to implement the tasks on the CORD-19 dataset, and we display some examples to show the effectiveness of our proposed solutions.
Tal Peer, Simon Welker, Timo Gerkmann
Diffusion probabilistic models have been recently used in a variety of tasks,
including speech enhancement and synthesis. As a generative approach, diffusion
models have been shown to be especially suitable for imputation problems, where
missing data is generated based on existing data. Phase retrieval is inherently
an imputation problem, where phase information has to be generated based on the
given magnitude. In this work we build upon previous work in the speech domain,
adapting a speech enhancement diffusion model specifically for STFT phase
retrieval. Evaluation using speech quality and intelligibility metrics shows
the diffusion approach is well-suited to the phase retrieval task, with
performance surpassing both classical and modern methods.
Authors' comments: Accepted by ICASSP 2023
Naseem Shaik
Robots can complete all human-performed tasks, but due to their current lack of knowledge, some tasks still cannot be completed by them with a high degree of success. However, with the right knowledge, these tasks can be completed by robots with a high degree of success, reducing the amount of human effort required to complete daily tasks. In this paper, the FOON, which describes the robot action success rate, is discussed. The functional object-oriented network (FOON) is a knowledge representation for symbolic task planning that takes the shape of a graph. To demonstrate the adaptability of FOON in developing a novel and adaptive method of solving a problem using knowledge obtained from various sources, a graph retrieval methodology is shown to produce manipulation motion sequences from the FOON to accomplish a desired aim. The outcomes are illustrated using motion sequences created by the FOON to complete the desired objectives in a simulated environment.