Mohamed Trabelsi, Zhiyu Chen, Brian D. Davison, Jeff Heflin
Ranking models are the main components of information retrieval systems.
Several approaches to ranking are based on traditional machine learning
algorithms using a set of hand-crafted features. Recently, researchers have
leveraged deep learning models in information retrieval. These models are
trained end-to-end to extract features from the raw data for ranking tasks, so
that they overcome the limitations of hand-crafted features. A variety of deep
learning models have been proposed, and each model presents a set of neural
network components to extract features that are used for ranking. In this
paper, we compare the proposed models in the literature along different
dimensions in order to understand the major contributions and limitations of
each model. In our discussion of the literature, we analyze the promising
neural components, and propose future research directions. We also show the
analogy between document retrieval and other retrieval tasks where the items to
be ranked are structured documents, answers, images and videos.
Authors' comments: Published in the Information Retrieval Journal (2021)
Rong Fu, Yimin Liu, Tianyao Huang, Yonina C. Eldar
Learned iterative shrinkage thresholding algorithm (LISTA), which adopts deep
learning techniques to learn optimal algorithm parameters from labeled training
data, can be successfully applied to small-scale multidimensional harmonic
retrieval (MHR) problems. However, LISTA is computationally demanding for
large-scale MHR problems because the matrix size of the learned mutual
inhibition matrix exhibits quadratic growth with the signal length. These large
matrices consume costly memory/computation resources and require a huge amount
of labeled data for training, restricting the applicability of the LISTA
method. In this paper, we show that the mutual inhibition matrix of an MHR
problem naturally has a Toeplitz structure, which means that the degrees of
freedom (DoF) of the matrix can be reduced from a quadratic order to a linear
order. By exploiting this characteristic, we propose a structured
LISTA-Toeplitz network, which imposes a Toeplitz structure restriction on the
mutual inhibition matrices and applies linear convolution instead of the
matrix-vector multiplication involved in the traditional LISTA network. Both
simulations and a field test for air target detection with radar are carried out
to validate the performance of the proposed network. For small-scale MHR
problems, LISTA-Toeplitz exhibits close or even better recovery accuracy than
traditional LISTA, while the former significantly reduces the network
complexity and requires much less training data. For large-scale MHR problems,
where LISTA is difficult to implement due to the huge size of the mutual
inhibition matrices, our proposed LISTA-Toeplitz still enjoys desirable
recovery performance.
Authors' comments: 13 pages, 13 figures, 50 references
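The Toeplitz observation in the abstract above can be illustrated with a small sketch. It is not the authors' network: it only checks, for a real-valued one-dimensional case chosen here for simplicity, that multiplying by an n x n Toeplitz matrix equals a slice of a linear convolution, so 2n - 1 values suffice instead of n^2. All function names are hypothetical.

```python
def toeplitz_dense(c, n):
    """Build the n x n Toeplitz matrix T[i][j] = c[i - j + n - 1]."""
    return [[c[i - j + n - 1] for j in range(n)] for i in range(n)]


def matvec(T, x):
    """Plain matrix-vector product, costing n^2 multiplications."""
    return [sum(t * v for t, v in zip(row, x)) for row in T]


def linear_convolution(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def toeplitz_matvec_conv(c, x):
    """Same product, computed from the 2n - 1 defining values only."""
    n = len(x)
    full = linear_convolution(c, x)
    # Entry i of T @ x is entry i + n - 1 of the full convolution.
    return full[n - 1 : 2 * n - 1]


n = 4
c = [0.5, -1.0, 2.0, 3.0, 1.5, -0.5, 0.25]  # 2n - 1 = 7 free parameters
x = [1.0, 2.0, -1.0, 0.5]

dense_result = matvec(toeplitz_dense(c, n), x)
conv_result = toeplitz_matvec_conv(c, x)
```

The two results agree up to rounding. In a learned network the 2n - 1 entries of `c` would be the trainable parameters, and the convolution would typically be computed with an FFT rather than the quadratic loop used here.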
Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Product, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
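As a minimal sketch of the kind of objective described above, the following combines a margin-based contrastive loss on unit-norm descriptors with an entropy-style regularizer. The regularizer shown is the Kozachenko-Leonenko nearest-neighbour estimator, which is an assumption about what "differential entropy regularizer" refers to. The margin value and all function names are illustrative, not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(z, labels, margin=0.5):
    """Pull same-label pairs together; penalise different-label pairs
    whose cosine similarity exceeds the margin (z is assumed unit-norm)."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            s = dot(z[i], z[j])
            if labels[i] == labels[j]:
                total += 1.0 - s
            else:
                total += max(0.0, s - margin)
    return total / n


def entropy_regularizer(z, eps=1e-8):
    """Negative mean log nearest-neighbour distance: small when the
    descriptors are spread out, large when they collapse together."""
    n = len(z)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(z[i], z[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n


def on_circle(degrees):
    return [[math.cos(math.radians(d)), math.sin(math.radians(d))] for d in degrees]


spread = on_circle([0, 90, 180, 270])
clustered = on_circle([0, 1, 2, 3])
reg_spread = entropy_regularizer(spread)
reg_clustered = entropy_regularizer(clustered)

toy_loss = contrastive_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0, 0, 1])
```

A training objective would add the two terms with a weighting coefficient. The clustered descriptors receive a larger regularization penalty than the spread ones.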
Robert Beinert, Marzieh Hasannasab
In this paper we consider the nonlinear inverse problem of phase retrieval in the context of dynamical sampling. Whereas phase retrieval deals with the recovery of signals and images from phaseless measurements, dynamical sampling was introduced by Aldroubi et al. in 2015 as a tool to recover diffusion fields from spatiotemporal samples. Considering finite-dimensional signals evolving in time under the action of a known matrix, our aim is to recover the signal up to global phase in a stable way from the absolute value of certain space-time measurements. First, we state necessary conditions for the dynamical system of sampling vectors to make the recovery of the unknown signal possible. The conditions deal with the spectrum of the given matrix and the initial sampling vector. Then, assuming that we have access to a specific set of further measurements related to aligned sampling vectors, we provide a feasible procedure to recover almost every signal up to global phase using polarization techniques. Moreover, we show that by adding extra conditions like full spark, the recovery of all signals is possible without exceptions.
Wei Chen, Yu Liu, Weiping Wang, Erwin Bakker, Theodoros Georgiou, Paul Fieguth, Li Liu, Michael S. Lew
In recent years a vast amount of visual content has been generated and shared
from many fields, such as social media platforms, medical imaging, and
robotics. This abundance of content creation and sharing has introduced new
challenges, particularly that of searching databases for similar content,
known as Content Based Image Retrieval (CBIR), a long-established research area
in which improved efficiency and accuracy are needed for real-time retrieval.
Artificial intelligence has made progress in CBIR and has significantly
facilitated the process of instance search. In this survey we review recent
instance retrieval works that are developed based on deep learning algorithms
and techniques, with the survey organized by deep network architecture types,
deep features, feature embedding and aggregation methods, and network
fine-tuning strategies. Our survey considers a wide variety of recent methods:
we identify milestone work, reveal connections among various methods,
present the commonly used benchmarks, evaluation results, and common
challenges, and propose promising future directions.
Authors' comments: IEEE Transactions on Pattern Analysis and Machine Intelligence
Seunghoan Song, Masahito Hayashi
Quantum private information retrieval (QPIR) for quantum messages is the protocol in which a user retrieves one of multiple quantum states from one or multiple servers without revealing which state is retrieved. We consider QPIR in two different settings: the blind setting, in which the servers contain one copy of the message states, and the visible setting, in which the servers contain the description of the message states. One trivial solution in both settings is downloading all states from the servers, and the main goal of this paper is to find more efficient QPIR protocols. First, we prove that the trivial solution is optimal for one-server QPIR in the blind setting. In one-round protocols, the same optimality holds even in the visible setting. On the other hand, when the user and the server share entanglement, we prove that there exists an efficient one-server QPIR protocol in the blind setting. Furthermore, in the visible setting, we prove that it is possible to construct symmetric QPIR protocols in which the user obtains no information about the non-targeted messages. We construct three two-server symmetric QPIR protocols for pure states. Note that symmetric classical PIR is impossible without shared randomness unknown to the user.
Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, Diane Larlus
Cross-modal retrieval methods build a common representation space for samples
from multiple modalities, typically from the vision and the language domains.
For images and their captions, the multiplicity of the correspondences makes
the task particularly challenging. Given an image (respectively a caption),
there are multiple captions (respectively images) that equally make sense. In
this paper, we argue that deterministic functions are not sufficiently powerful
to capture such one-to-many correspondences. Instead, we propose to use
Probabilistic Cross-Modal Embedding (PCME), where samples from the different
modalities are represented as probabilistic distributions in the common
embedding space. Since common benchmarks such as COCO suffer from
non-exhaustive annotations for cross-modal matches, we propose to additionally
evaluate retrieval on the CUB dataset, a smaller yet clean database where all
possible image-caption pairs are annotated. We extensively ablate PCME and
demonstrate that it not only improves the retrieval performance over its
deterministic counterpart but also provides uncertainty estimates that render
the embeddings more interpretable. Code is available at
https://github.com/naver-ai/pcme
Authors' comments: Accepted to CVPR 2021
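To make the idea of probabilistic embeddings concrete, here is a minimal sketch, not the released PCME code. Each item is assumed to be a diagonal Gaussian in the embedding space, and a match probability is estimated by Monte Carlo sampling with a sigmoid of the negative scaled distance. The scale `a`, the offset `b`, the sample count, and the function names are illustrative assumptions.

```python
import math
import random


def sample_gaussian(mean, std, rng):
    """Draw one sample from a diagonal Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def match_probability(mean_a, std_a, mean_b, std_b,
                      a=5.0, b=5.0, num_samples=8, seed=0):
    """Average sigmoid(-a * distance + b) over all pairs of samples."""
    rng = random.Random(seed)
    samples_a = [sample_gaussian(mean_a, std_a, rng) for _ in range(num_samples)]
    samples_b = [sample_gaussian(mean_b, std_b, rng) for _ in range(num_samples)]
    total = 0.0
    for za in samples_a:
        for zb in samples_b:
            logit = -a * math.dist(za, zb) + b
            total += 1.0 / (1.0 + math.exp(-logit))
    return total / (num_samples ** 2)


# Hypothetical 2-D embeddings: one image and two candidate captions.
image = ([0.0, 0.0], [0.1, 0.1])
near_caption = ([0.05, 0.0], [0.1, 0.1])
far_caption = ([3.0, 3.0], [0.1, 0.1])

p_near = match_probability(*image, *near_caption)
p_far = match_probability(*image, *far_caption)
```

The standard deviations play the role of the uncertainty estimates mentioned in the abstract: a larger `std` spreads the samples and flattens the match probabilities.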
Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, Gargi Ghosh
Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.
Kai Wan, Hua Sun, Mingyue Ji, Daniela Tuninetti, Giuseppe Caire
Coded Caching, proposed by Maddah-Ali and Niesen (MAN), has the potential to
reduce network traffic by pre-storing content in the users' local memories when
the network is underutilized and transmitting coded multicast messages that
simultaneously benefit many users during peak hours. This paper
considers the linear function retrieval version of the original coded caching
setting, where users are interested in retrieving a number of linear
combinations of the data points stored at the server, as opposed to a single
file. This extends the scope of the authors' past work that only considered the
class of linear functions that operate element-wise over the files. On
observing that the existing cache-aided scalar linear function retrieval scheme
does not work in the proposed setting, this paper designs a novel coded caching
scheme that outperforms uncoded caching schemes that either use unicast
transmissions or let each user recover all files in the library.
Authors' comments: 21 pages, 4 figures, published in Entropy 2021, 23(1), 25
Naveed Naimipour, Shahin Khobahi, Mojtaba Soltanalian
Phase retrieval has intrigued researchers for decades because it appears in a wide range of applications. The task of a phase retrieval algorithm is typically to recover a signal from linear phaseless measurements. In this paper, we approach the problem by proposing a hybrid model-based data-driven deep architecture, referred to as Unfolded Phase Retrieval (UPR), that exhibits significant potential in improving the performance of state-of-the-art data-driven and model-based phase retrieval algorithms. The proposed method benefits from the versatility and interpretability of well-established model-based algorithms, while simultaneously benefiting from the expressive power of deep neural networks. In particular, our proposed model-based deep architecture is applied to the conventional phase retrieval problem (via the incremental reshaped Wirtinger flow algorithm) and the sparse phase retrieval problem (via the sparse truncated amplitude flow algorithm), showing immense promise in both cases. Furthermore, we consider a joint design of the sensing matrix and the signal processing algorithm and utilize the deep unfolding technique in the process. Our numerical results illustrate the effectiveness of such hybrid model-based and data-driven frameworks and showcase the untapped potential of data-aided methodologies to enhance the existing phase retrieval algorithms.
Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas
Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to mention a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at http://europe.naverlabs.com/stacmr
Daniel Heestermans Svendsen, Pablo Morales-Álvarez, Rafael Molina, Gustau Camps-Valls
This paper introduces deep Gaussian processes (DGPs) for geophysical
parameter retrieval. Unlike the standard full GP model, the DGP accounts for
complicated (modular, hierarchical) processes, provides an efficient solution
that scales well to large datasets, and improves prediction accuracy over
standard full and sparse GP models. We give empirical evidence of performance
for estimation of surface dew point temperature from infrared sounding data.
Authors' comments: Preprint. Paper published in IGARSS 2018 - 2018 IEEE International
Geoscience and Remote Sensing Symposium, Valencia, 2018, pp. 6175-6178
Noé Pion, Martin Humenberger, Gabriela Csurka, Yohann Cabon, Torsten Sattler
Visual localization, i.e., camera pose estimation in a known scene, is a core
component of technologies such as autonomous driving and augmented reality.
State-of-the-art localization approaches often rely on image retrieval
techniques for one of two tasks: (1) provide an approximate pose estimate or
(2) determine which parts of the scene are potentially visible in a given query
image. It is common practice to use state-of-the-art image retrieval algorithms
for these tasks. These algorithms are often trained for the goal of retrieving
the same landmark under a large range of viewpoint changes. However, robustness
to viewpoint changes is not necessarily desirable in the context of visual
localization. This paper focuses on understanding the role of image retrieval
for multiple visual localization tasks. We introduce a benchmark setup and
compare state-of-the-art retrieval representations on multiple datasets. We
show that retrieval performance on classical landmark retrieval/recognition
tasks correlates with localization performance for only some of these tasks.
This indicates a need for retrieval approaches specifically designed for
localization tasks. Our benchmark and evaluation protocols are available at
https://github.com/naver/kapture-localization.
Authors' comments: International Conference on 3D Vision, 2020
Pranav Aggarwal, Ajinkya Kale
There has been a recent spike in interest in multi-modal Language and Vision problems. On the language side, most of these models primarily focus on English since most multi-modal datasets are monolingual. We try to bridge this gap with a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side. We present a simple yet practical approach for building a cross-lingual image retrieval model which trains on a monolingual training dataset but can be used in a zero-shot cross-lingual fashion during inference. We also introduce a new objective function which tightens the text embedding clusters by pushing dissimilar texts away from each other. Finally, we introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages that we collected using a crowdsourcing platform. We use this as the test set for evaluating zero-shot model performance across languages. The XTD10 dataset is publicly available at https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10
Haoyu Dong, Ze Wang, Qiang Qiu, Guillermo Sapiro
Image retrieval relies heavily on the quality of the data modeling and the distance measurement in the feature space. Building on the concept of image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are now defined by the geodesic distance between images, represented as graph vertices or manifold samples. When limited images are available, this manifold is sparsely sampled, making the geodesic computation and the corresponding retrieval harder. To address this, we augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. In addition to extensive results on standard datasets illustrating the power of text to help in image retrieval, a new public dataset based on CLEVR is introduced to quantify the semantic similarity between visual data and text data. The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval given only an image and a textual instruction on the desired modifications over the image.
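The notion of geodesic distance on a feature graph can be sketched as follows. This is a generic illustration rather than the authors' pipeline: toy 2-D points stand in for learned image features, a symmetric k-nearest-neighbour graph approximates the manifold, and Dijkstra's algorithm gives the shortest-path (geodesic) distance. The function names and the value of `k` are assumptions.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distance(graph, source, target):
    """Shortest-path length between two vertices (Dijkstra)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # target not reachable


# Toy "manifold": 11 points along a unit semicircle.
points = [(math.cos(i * math.pi / 10), math.sin(i * math.pi / 10)) for i in range(11)]
graph = knn_graph(points, k=2)

euclidean = math.dist(points[0], points[10])
geodesic = geodesic_distance(graph, 0, 10)
```

The two endpoints are 2 apart in a straight line, but the path along the sampled arc is longer and approaches pi as the sampling gets denser. With fewer samples the graph can become disconnected or take shortcuts, which is the sparsity problem the abstract describes.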
Arvind Srinivasan, Aprameya Bharadwaj, Aveek Saha, Subramanyam Natarajan
Large scale video retrieval is a field of study with a lot of ongoing
research. Most of the work in the field is on video retrieval through text
queries using techniques such as VSE++. However, there is little research done
on video retrieval through image queries, and the work that has been done in
this field either uses image queries from within the video dataset or iterates
through videos frame by frame. These approaches are not generalized for queries
from outside the dataset and do not scale well for large video datasets. To
overcome these issues, we propose a new approach for video retrieval through
image queries where an undirected graph is constructed from the combined set of
frames from all videos to be searched. The node features of this graph are used
in the task of video retrieval. Experimentation is done on the MSR-VTT dataset
by using query images from outside the dataset. To evaluate this novel approach,
P@5, P@10 and P@20 metrics are calculated. Two different ResNet models, namely
ResNet-152 and ResNet-50, are used in this study.
Authors' comments: 6 pages, 6 figures, 7 tables
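For readers unfamiliar with the P@k metrics mentioned above, the following is a minimal sketch of precision at k under its usual definition: the fraction of the top k retrieved items that are relevant. The video identifiers are made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k ranked items that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k


# Hypothetical ranking returned for one image query.
ranked = ["v3", "v1", "v7", "v2", "v9"]
relevant = {"v1", "v2", "v4"}

p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_5 = precision_at_k(ranked, relevant, 5)
```

In an evaluation such as the one described, the per-query values would be averaged over all query images.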
Alexei Novikov, Stephen White
We consider the \textit{phase retrieval} problem of recovering a sparse signal $\mathbf{x}$ in $\mathbb{R}^d$ from intensity-only measurements in dimension $d \geq 2$. Phase retrieval can be equivalently formulated as the problem of recovering a signal from its autocorrelation, which is in turn directly related to the combinatorial problem of recovering a set from its pairwise differences. In one spatial dimension, this problem is well studied and known as the \textit{turnpike problem}. In this work, we present MISTR (Multidimensional Intersection Sparse supporT Recovery), an algorithm which exploits this formulation to recover the support of a multidimensional signal from magnitude-only measurements. MISTR takes advantage of the structure of multiple dimensions to provably achieve the same accuracy as the best one-dimensional algorithms in dramatically less time. We prove theoretically that MISTR correctly recovers the support of signals distributed as a Gaussian point process with high probability as long as sparsity is at most $\mathcal{O}\left(n^{d\theta}\right)$ for any $\theta < 1/2$, where $n^d$ represents pixel size in a fixed image window. In the case that magnitude measurements are corrupted by noise, we provide a thresholding scheme with theoretical guarantees for sparsity at most $\mathcal{O}\left(n^{d\theta}\right)$ for $\theta < 1/4$ that obviates the need for MISTR to explicitly handle noisy autocorrelation data. Detailed and reproducible numerical experiments demonstrate the effectiveness of our algorithm, showing that in practice MISTR enjoys time complexity which is nearly linear in the size of the input.
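The link between autocorrelation and pairwise differences can be checked on a toy example. The sketch below is not MISTR. It assumes a sparse signal with positive entries on a 2-D integer grid, computes its autocorrelation directly, and confirms that the autocorrelation's support equals the set of pairwise differences of the signal's support. With signed entries, cancellations could shrink the support. All names are hypothetical.

```python
from itertools import product


def autocorrelation(signal):
    """Autocorrelation of a sparse signal given as {(row, col): value}."""
    acorr = {}
    for (p, xp), (q, xq) in product(signal.items(), repeat=2):
        shift = (q[0] - p[0], q[1] - p[1])
        acorr[shift] = acorr.get(shift, 0.0) + xp * xq
    return acorr


def difference_set(support):
    """All pairwise differences q - p of support points."""
    return {(q[0] - p[0], q[1] - p[1]) for p in support for q in support}


# Three nonzero pixels of a hypothetical sparse image.
signal = {(0, 0): 1.0, (1, 3): 2.0, (4, 2): 0.5}

acorr = autocorrelation(signal)
acorr_support = {shift for shift, value in acorr.items() if value != 0.0}
differences = difference_set(signal.keys())
```

In practice the autocorrelation is obtained from the measurements as the inverse Fourier transform of the squared Fourier magnitudes. Recovering the support from `differences` is the combinatorial step that the abstract relates to the turnpike problem.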
Jiapeng Liu, Xiao Zhang, Dan Goldwasser, Xiao Wang
Cross-lingual document search is an information retrieval task in which the
queries' language differs from the documents' language. In this paper, we study
the instability of neural document search models and propose a novel end-to-end
robust framework that achieves improved performance in cross-lingual search
with different documents' languages. This framework includes a novel measure of
the relevance, smooth cosine similarity, between queries and documents, and a
novel loss function, Smooth Ordinal Search Loss, as the objective. We further
provide a theoretical guarantee on the generalization error bound for the
proposed framework. We conduct experiments to compare our approach with other
document search models, and observe significant gains under commonly used
ranking metrics on the cross-lingual document retrieval task in a variety of
languages.
Authors' comments: COLING 2020
Leonid Boytsov, Eric Nyberg
Our objective is to introduce to the NLP community an existing k-NN search library NMSLIB, a new retrieval toolkit FlexNeuART, as well as their integration capabilities. NMSLIB, while being one of the fastest k-NN search libraries, is quite generic and supports a variety of distance/similarity functions. Because the library relies on distance-based structure-agnostic algorithms, it can be further extended by adding new distances. FlexNeuART is a modular, extendible and flexible toolkit for candidate generation in IR and QA applications, which supports mixing of classic and neural ranking signals. FlexNeuART can efficiently retrieve mixed dense and sparse representations (with weights learned from training data), which is achieved by extending NMSLIB. In contrast, other retrieval systems work with purely sparse representations (e.g., Lucene), purely dense representations (e.g., FAISS and Annoy), or only perform mixing at the re-ranking stage.
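The idea of scoring with mixed dense and sparse representations can be sketched in a few lines. This is a brute-force illustration in plain Python and does not use the NMSLIB or FlexNeuART APIs. The weights, field names, and documents are hypothetical, and a real system would learn the weights and use a k-NN index instead of scanning every document.

```python
def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def dense_dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def hybrid_score(query, doc, w_sparse, w_dense):
    """Weighted sum of a sparse (lexical) and a dense (neural) similarity."""
    return (w_sparse * sparse_dot(query["sparse"], doc["sparse"])
            + w_dense * dense_dot(query["dense"], doc["dense"]))


def rank(query, docs, w_sparse, w_dense):
    scored = [(hybrid_score(query, d, w_sparse, w_dense), name) for name, d in docs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


query = {"sparse": {"neural": 1.0, "ranking": 1.0}, "dense": [1.0, 0.0]}
docs = {
    "d1": {"sparse": {"neural": 2.0, "ranking": 1.0}, "dense": [0.0, 1.0]},
    "d2": {"sparse": {"cooking": 3.0}, "dense": [1.0, 0.0]},
}

lexical_heavy = rank(query, docs, w_sparse=0.5, w_dense=1.0)
dense_heavy = rank(query, docs, w_sparse=0.1, w_dense=1.0)
```

Changing the weights flips the order of the two documents, which is why the mixing weights matter at candidate generation time and not only during re-ranking.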
Jing Lu, Gustavo Hernandez Abrego, Ji Ma, Jianmo Ni, Yinfei Yang
In this paper we explore the effects of negative sampling in dual encoder models used to retrieve passages for automatic question answering. We explore four negative sampling strategies that complement the straightforward random sampling of negatives, typically used to train dual encoder models. Out of the four strategies, three are based on retrieval and one on heuristics. Our retrieval-based strategies are based on the semantic similarity and the lexical overlap between questions and passages. We train the dual encoder models in two stages: pre-training with synthetic data and fine tuning with domain-specific data. We apply negative sampling to both stages. The approach is evaluated in two passage retrieval tasks. Even though it is not evident that there is one single sampling strategy that works best in all the tasks, it is clear that our strategies contribute to improving the contrast between the response and all the other passages. Furthermore, mixing the negatives from different strategies achieves performance on par with the best performing strategy in all tasks. Our results establish a new state-of-the-art level of performance on two of the open-domain question answering datasets that we evaluated.
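A common way to train dual encoders with negatives is a softmax cross-entropy over dot-product scores, where each question's own passage is the positive and every other candidate is a negative. The sketch below assumes that setup, which the abstract does not spell out. It shows how appending a hard negative (for example one found by retrieval) changes the loss. The vectors and function names are illustrative only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_nll(scores, positive_index):
    """Negative log-likelihood of the positive under a softmax over scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


def dual_encoder_loss(question_vecs, positive_vecs, hard_negative_vecs=()):
    """In-batch negatives plus optional extra hard negatives.

    Passage i is the positive for question i; all other candidates,
    including the hard negatives, act as negatives.
    """
    candidates = list(positive_vecs) + list(hard_negative_vecs)
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in candidates]
        losses.append(softmax_nll(scores, i))
    return sum(losses) / len(losses)


questions = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.9, 0.1]]  # close to the first question, but not its answer

loss_random_only = dual_encoder_loss(questions, positives)
loss_with_hard = dual_encoder_loss(questions, positives, hard_negatives)
```

The hard negative raises the loss because it competes with the positive for the first question, which gives the encoder a stronger training signal than easy in-batch negatives alone.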