Kai Wan, Hua Sun, Mingyue Ji, Daniela Tuninetti, Giuseppe Caire
Coded Caching, proposed by Maddah-Ali and Niesen (MAN), has the potential to
reduce network traffic by pre-storing content in the users' local memories when
the network is underutilized and transmitting coded multicast messages that
simultaneously benefit many users during peak hours. This paper
considers the linear function retrieval version of the original coded caching
setting, where users are interested in retrieving a number of linear
combinations of the data points stored at the server, as opposed to a single
file. This extends the scope of the Authors' past work that only considered the
class of linear functions that operate element-wise over the files. On
observing that the existing cache-aided scalar linear function retrieval scheme
does not work in the proposed setting, this paper designs a novel coded caching
scheme that outperforms uncoded caching schemes that either use unicast
transmissions or let each user recover all files in the library.
Authors' comments: 21 pages, 4 figures, published in Entropy 2021, 23(1), 25
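The coded multicast idea mentioned in the abstract above can be illustrated with the classic two-user, two-file example of the original MAN file-retrieval setting. This is only a sketch of that baseline idea, not of the linear function retrieval scheme the paper proposes; the file contents, the cache layout, and the helper name `xor_bytes` are assumptions chosen for illustration.

```python
def xor_bytes(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two files at the server, each split into two halves (illustrative contents).
file_a = b"AAAAaaaa"
file_b = b"BBBBbbbb"
a1, a2 = file_a[:4], file_a[4:]
b1, b2 = file_b[:4], file_b[4:]

# Placement phase: each user caches one half of every file.
cache_user1 = {"a1": a1, "b1": b1}
cache_user2 = {"a2": a2, "b2": b2}

# Delivery phase: user 1 requests file A, user 2 requests file B.
# One coded multicast message of half a file serves both users.
multicast = xor_bytes(a2, b1)

# User 1 cancels b1 from the message to obtain the missing half a2.
recovered_a = cache_user1["a1"] + xor_bytes(multicast, cache_user1["b1"])
# User 2 cancels a2 from the message to obtain the missing half b1.
recovered_b = xor_bytes(multicast, cache_user2["a2"]) + cache_user2["b2"]

print(recovered_a == file_a, recovered_b == file_b)
```

Without coding, the server would have to unicast a2 and b1 separately, which is twice the transmitted load of the single XOR message in this toy case.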
Naveed Naimipour, Shahin Khobahi, Mojtaba Soltanalian
Phase retrieval has intrigued researchers for decades, owing to its appearance in a wide range of applications. The task of a phase retrieval algorithm is typically to recover a signal from linear phaseless measurements. In this paper, we approach the problem by proposing a hybrid model-based data-driven deep architecture, referred to as Unfolded Phase Retrieval (UPR), that exhibits significant potential in improving the performance of state-of-the-art data-driven and model-based phase retrieval algorithms. The proposed method benefits from the versatility and interpretability of well-established model-based algorithms, while simultaneously benefiting from the expressive power of deep neural networks. In particular, our proposed model-based deep architecture is applied to the conventional phase retrieval problem (via the incremental reshaped Wirtinger flow algorithm) and the sparse phase retrieval problem (via the sparse truncated amplitude flow algorithm), showing immense promise in both cases. Furthermore, we consider a joint design of the sensing matrix and the signal processing algorithm and utilize the deep unfolding technique in the process. Our numerical results illustrate the effectiveness of such hybrid model-based and data-driven frameworks and showcase the untapped potential of data-aided methodologies to enhance the existing phase retrieval algorithms.
Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas
Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to name a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at http://europe.naverlabs.com/stacmr
Daniel Heestermans Svendsen, Pablo Morales-Álvarez, Rafael Molina, Gustau Camps-Valls
This paper introduces deep Gaussian processes (DGPs) for geophysical
parameter retrieval. Unlike the standard full GP model, the DGP accounts for
complicated (modular, hierarchical) processes, provides an efficient solution
that scales well to large datasets, and improves prediction accuracy over
standard full and sparse GP models. We give empirical evidence of performance
for estimation of surface dew point temperature from infrared sounding data.
Authors' comments: Preprint, Paper published in IGARSS 2018 - 2018 IEEE International
Geoscience and Remote Sensing Symposium, Valencia, 2018, pp. 6175-6178
Noé Pion, Martin Humenberger, Gabriela Csurka, Yohann Cabon, Torsten Sattler
Visual localization, i.e., camera pose estimation in a known scene, is a core
component of technologies such as autonomous driving and augmented reality.
State-of-the-art localization approaches often rely on image retrieval
techniques for one of two tasks: (1) provide an approximate pose estimate or
(2) determine which parts of the scene are potentially visible in a given query
image. It is common practice to use state-of-the-art image retrieval algorithms
for these tasks. These algorithms are often trained for the goal of retrieving
the same landmark under a large range of viewpoint changes. However, robustness
to viewpoint changes is not necessarily desirable in the context of visual
localization. This paper focuses on understanding the role of image retrieval
for multiple visual localization tasks. We introduce a benchmark setup and
compare state-of-the-art retrieval representations on multiple datasets. We
show that retrieval performance on classical landmark retrieval/recognition
tasks correlates only for some but not all tasks to localization performance.
This indicates a need for retrieval approaches specifically designed for
localization tasks. Our benchmark and evaluation protocols are available at
https://github.com/naver/kapture-localization.
Authors' comments: International Conference on 3D Vision, 2020
Pranav Aggarwal, Ajinkya Kale
There has been a recent spike in interest in multi-modal Language and Vision problems. On the language side, most of these models primarily focus on English since most multi-modal datasets are monolingual. We try to bridge this gap with a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side. We present a simple yet practical approach for building a cross-lingual image retrieval model which trains on a monolingual training dataset but can be used in a zero-shot cross-lingual fashion during inference. We also introduce a new objective function which tightens the text embedding clusters by pushing dissimilar texts away from each other. Finally, we introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages that we collected using a crowdsourcing platform. We use this as the test set for evaluating zero-shot model performance across languages. The XTD10 dataset is made publicly available here: https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10
Haoyu Dong, Ze Wang, Qiang Qiu, Guillermo Sapiro
Image retrieval relies heavily on the quality of the data modeling and the distance measurement in the feature space. Building on the concept of image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are now defined by the geodesic distance between images, represented as graph vertices or manifold samples. When only a limited number of images is available, this manifold is sparsely sampled, making the geodesic computation and the corresponding retrieval harder. To address this, we augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. In addition to extensive results on standard datasets illustrating the power of text to help in image retrieval, a new public dataset based on CLEVR is introduced to quantify the semantic similarity between visual data and text data. The experimental results show that the joint embedding manifold is a robust representation, making it a better basis for performing image retrieval given only an image and a textual instruction on the desired modifications over the image.
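The notion of geodesic distance on a graph of feature vectors, as used in the abstract above, can be sketched as shortest-path distance over a k-nearest-neighbor graph. This is a minimal illustration under assumptions: the graph construction (symmetric k-NN with Euclidean edge weights), the function names, and the toy 2-D points are all hypothetical and are not taken from the paper.

```python
import heapq
import math

def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbors (Euclidean weights)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def geodesic_distances(graph, source):
    """Shortest-path (geodesic) distances from source to all reachable vertices (Dijkstra)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy "manifold": five samples along an L-shaped curve.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, source=0)
print(geo[4], math.dist(points[0], points[4]))
```

In this toy case the geodesic distance between the two end points follows the curve and is larger than the straight-line Euclidean distance, which is the distinction the graph representation is meant to capture.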
Arvind Srinivasan, Aprameya Bharadwaj, Aveek Saha, Subramanyam Natarajan
Large scale video retrieval is a field of study with a lot of ongoing
research. Most of the work in the field is on video retrieval through text
queries using techniques such as VSE++. However, there is little research done
on video retrieval through image queries, and the work that has been done in
this field either uses image queries from within the video dataset or iterates
through videos frame by frame. These approaches are not generalized for queries
from outside the dataset and do not scale well for large video datasets. To
overcome these issues, we propose a new approach for video retrieval through
image queries where an undirected graph is constructed from the combined set of
frames from all videos to be searched. The node features of this graph are used
in the task of video retrieval. Experimentation is done on the MSR-VTT dataset
by using query images from outside the dataset. To evaluate this novel approach,
the P@5, P@10 and P@20 metrics are calculated. Two different ResNet models, namely
ResNet-152 and ResNet-50, are used in this study.
Authors' comments: 6 pages, 6 figures, 7 tables
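For reference, the P@k metrics named in the abstract above follow the standard definition of precision at k: the fraction of the top-k retrieved items that are relevant. A minimal sketch, with hypothetical video identifiers and a hypothetical function name:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k

# Hypothetical ranking returned for one image query, and its relevant videos.
ranked = ["v3", "v7", "v1", "v9", "v4"]
relevant = {"v3", "v1", "v8"}
print(precision_at_k(ranked, relevant, 5))  # 2 relevant items among the top 5
```

Note that this sketch divides by k even when fewer than k items are returned; other conventions exist, and the abstract does not say which one the authors use.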
Alexei Novikov, Stephen White
We consider the \textit{phase retrieval} problem of recovering a sparse signal $\mathbf{x}$ in $\mathbb{R}^d$ from intensity-only measurements in dimension $d \geq 2$. Phase retrieval can be equivalently formulated as the problem of recovering a signal from its autocorrelation, which is in turn directly related to the combinatorial problem of recovering a set from its pairwise differences. In one spatial dimension, this problem is well studied and known as the \textit{turnpike problem}. In this work, we present MISTR (Multidimensional Intersection Sparse supporT Recovery), an algorithm which exploits this formulation to recover the support of a multidimensional signal from magnitude-only measurements. MISTR takes advantage of the structure of multiple dimensions to provably achieve the same accuracy as the best one-dimensional algorithms in dramatically less time. We prove theoretically that MISTR correctly recovers the support of signals distributed as a Gaussian point process with high probability as long as sparsity is at most $\mathcal{O}\left(n^{d\theta}\right)$ for any $\theta < 1/2$, where $n^d$ represents pixel size in a fixed image window. In the case that magnitude measurements are corrupted by noise, we provide a thresholding scheme with theoretical guarantees for sparsity at most $\mathcal{O}\left(n^{d\theta}\right)$ for $\theta < 1/4$ that obviates the need for MISTR to explicitly handle noisy autocorrelation data. Detailed and reproducible numerical experiments demonstrate the effectiveness of our algorithm, showing that in practice MISTR enjoys time complexity which is nearly linear in the size of the input.
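The link between autocorrelation and pairwise differences stated in the abstract above can be checked on a small example: the support of the autocorrelation of a sparse signal is contained in the set of pairwise differences of its support, and the two coincide when no cancellation occurs. The sketch below only illustrates that relationship and is not the MISTR algorithm; the sparse-dictionary representation, the function names, and the sample values are assumptions.

```python
from itertools import product

def difference_set(support):
    """All pairwise differences p - q between points of the support."""
    return {tuple(a - b for a, b in zip(p, q)) for p, q in product(support, repeat=2)}

def autocorrelation(signal):
    """Autocorrelation of a sparse real signal given as {position: value}."""
    acf = {}
    for (p, xp), (q, xq) in product(signal.items(), repeat=2):
        shift = tuple(a - b for a, b in zip(p, q))
        acf[shift] = acf.get(shift, 0.0) + xp * xq
    return acf

# A sparse 2-D signal with three nonzero entries (positive values, so no cancellation).
signal = {(0, 0): 1.0, (1, 2): 2.0, (3, 1): 0.5}
acf = autocorrelation(signal)
acf_support = {shift for shift, value in acf.items() if value != 0.0}

print(acf_support == difference_set(signal.keys()))
print(sorted(acf_support))
```

Recovering the support from the difference set is the combinatorial problem the abstract refers to; the code above only computes the forward direction.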
Jiapeng Liu, Xiao Zhang, Dan Goldwasser, Xiao Wang
Cross-lingual document search is an information retrieval task in which the
queries' language differs from the documents' language. In this paper, we study
the instability of neural document search models and propose a novel end-to-end
robust framework that achieves improved performance in cross-lingual search
across different document languages. This framework includes a novel measure of
the relevance, smooth cosine similarity, between queries and documents, and a
novel loss function, Smooth Ordinal Search Loss, as the objective. We further
provide a theoretical guarantee on the generalization error bound for the
proposed framework. We conduct experiments to compare our approach with other
document search models, and observe significant gains under commonly used
ranking metrics on the cross-lingual document retrieval task in a variety of
languages.
Authors' comments: COLING 2020
Leonid Boytsov, Eric Nyberg
Our objective is to introduce to the NLP community an existing k-NN search library NMSLIB, a new retrieval toolkit FlexNeuART, as well as their integration capabilities. NMSLIB, while being one of the fastest k-NN search libraries, is quite generic and supports a variety of distance/similarity functions. Because the library relies on distance-based, structure-agnostic algorithms, it can be further extended by adding new distances. FlexNeuART is a modular, extensible and flexible toolkit for candidate generation in IR and QA applications, which supports mixing of classic and neural ranking signals. FlexNeuART can efficiently retrieve mixed dense and sparse representations (with weights learned from training data), which is achieved by extending NMSLIB. In contrast, other retrieval systems work with purely sparse representations (e.g., Lucene), purely dense representations (e.g., FAISS and Annoy), or only perform mixing at the re-ranking stage.
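One simple way to picture a mixed dense and sparse similarity, as described in the abstract above, is a weighted sum of a sparse inner product and a dense inner product. The sketch below is an assumption made for illustration only: it does not use the NMSLIB or FlexNeuART APIs, the function names and data layout are hypothetical, and the weights are fixed by hand rather than learned from training data.

```python
def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[t] for t, w in a.items() if t in b)

def dense_dot(a, b):
    """Inner product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def mixed_score(query, doc, w_sparse, w_dense):
    """Weighted combination of the sparse and dense similarities (hypothetical form)."""
    return (w_sparse * sparse_dot(query["sparse"], doc["sparse"])
            + w_dense * dense_dot(query["dense"], doc["dense"]))

query = {"sparse": {"phase": 1.0, "retrieval": 2.0}, "dense": [1.0, 0.0]}
doc = {"sparse": {"retrieval": 1.5, "signal": 1.0}, "dense": [0.5, 0.5]}
print(mixed_score(query, doc, w_sparse=0.5, w_dense=2.0))
```

Because the combined score is itself a single similarity function, it can in principle be plugged into a distance-agnostic k-NN index, which is the kind of extension the abstract attributes to NMSLIB.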
Jing Lu, Gustavo Hernandez Abrego, Ji Ma, Jianmo Ni, Yinfei Yang
In this paper we explore the effects of negative sampling in dual encoder models used to retrieve passages for automatic question answering. We explore four negative sampling strategies that complement the straightforward random sampling of negatives, typically used to train dual encoder models. Out of the four strategies, three are based on retrieval and one on heuristics. Our retrieval-based strategies are based on the semantic similarity and the lexical overlap between questions and passages. We train the dual encoder models in two stages: pre-training with synthetic data and fine tuning with domain-specific data. We apply negative sampling to both stages. The approach is evaluated in two passage retrieval tasks. Even though it is not evident that there is one single sampling strategy that works best in all the tasks, it is clear that our strategies contribute to improving the contrast between the response and all the other passages. Furthermore, mixing the negatives from different strategies achieves performance on par with the best performing strategy in all tasks. Our results establish a new state-of-the-art level of performance on two of the open-domain question answering datasets that we evaluated.
Sohee Yang, Minjoon Seo
The state of the art in open-domain question answering (QA) relies on an efficient retriever that drastically reduces the search space for the expensive reader. A rather overlooked question in the community is the relationship between the retriever and the reader, and in particular, whether the whole purpose of the retriever is just to be a fast approximation of the reader. Our empirical evidence indicates that the answer is no, and that the reader and the retriever are complementary to each other even in terms of accuracy only. We make a careful conjecture that the architectural constraint of the retriever, which was originally intended for enabling approximate search, seems to also make the model more robust in large-scale search. We then propose to distill the reader into the retriever so that the retriever absorbs the strength of the reader while keeping its own benefit. Experimental results show that our method can enhance the document recall rate as well as the end-to-end QA accuracy of off-the-shelf retrievers in open-domain QA tasks.
Aashish Kumar Misraa, Ajinkya Kale, Pranav Aggarwal, Ali Aminian
Most real-world applications of image retrieval such as Adobe Stock, which is a marketplace for stock photography and illustrations, need a way for users to find images which are both visually (i.e. aesthetically) and conceptually (i.e. containing the same salient objects) similar to a query image. Learning visual-semantic representations from images is a well studied problem for image retrieval. Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g. on textual tags) or by re-ranking after an initial visual embedding based retrieval. In this paper, we learn a joint vision and concept embedding in the same high-dimensional space. This joint model gives the user fine-grained control over the semantics of the result set, allowing them to explore the catalog of images more rapidly. We model the visual and concept relationships as a graph structure, which captures the rich information through node neighborhood. This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks. We also introduce a novel inference time control, based on selective neighborhood connectivity, allowing the user control over the retrieval algorithm. We evaluate these multi-modal embeddings quantitatively on the downstream relevance task of image retrieval on the MS-COCO dataset and qualitatively on MS-COCO and an Adobe Stock dataset.
Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by jointly attending to information from different representation subspaces at different positions. In this paper, we present an approach to learning a hard retrieval attention where an attention head only attends to one token in the sentence rather than all tokens. The matrix multiplication between attention probabilities and the value sequence in the standard scaled dot-product attention can thus be replaced by a simple and efficient retrieval operation. We show that our hard retrieval attention mechanism is 1.43 times faster in decoding, while preserving translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
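The replacement of the probability-times-value matrix product by a retrieval operation, as described in the abstract above, can be sketched for a single head at inference time. The sketch assumes that the attended token is the one with the highest dot-product score (an argmax); how the selection is trained is not covered here, and the function name and toy inputs are hypothetical.

```python
def hard_retrieval_attention(queries, keys, values):
    """For each query, return the value row whose key has the highest dot-product score."""
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Index of the single attended token; dividing the scores by sqrt(d)
        # would not change which index attains the maximum.
        best = max(range(len(keys)), key=scores.__getitem__)
        # Retrieval (indexing) replaces the weighted sum over all value rows.
        outputs.append(list(values[best]))
    return outputs

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
values = [[10.0, 11.0], [20.0, 21.0], [30.0, 31.0]]
print(hard_retrieval_attention(queries, keys, values))
```

This is equivalent to multiplying a one-hot attention matrix by the value sequence, but it only needs an index lookup per query instead of a full matrix product.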
Xuanang Chen, Ben He, Kai Hui, Le Sun, Yingfei Sun
Despite the effectiveness of utilizing the BERT model for document ranking,
the high computational cost of such approaches limits their uses. To this end,
this paper first empirically investigates the effectiveness of two knowledge
distillation models on the document ranking task. In addition, on top of the
recently proposed TinyBERT model, two simplifications are proposed. Evaluations
on two different and widely-used benchmarks demonstrate that Simplified
TinyBERT with the proposed simplifications not only boosts TinyBERT, but also
significantly outperforms BERT-Base while providing a 15$\times$ speedup.
Authors' comments: Accepted at ECIR 2021 (short paper)
Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, Meng Wang
This paper attacks the challenging problem of video retrieval by text. In
such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc
queries described exclusively in the form of a natural-language sentence, with
no visual example provided. Given videos as sequences of frames and queries as
sequences of words, an effective sequence-to-sequence cross-modal matching is
crucial. To that end, the two modalities need to be first encoded into
real-valued vectors and then projected into a common space. In this paper we
achieve this by proposing a dual deep encoding network that encodes videos and
queries into powerful dense representations of their own. Our novelty is
two-fold. First, different from prior art that resorts to a specific
single-level encoder, the proposed network performs multi-level encoding that
represents the rich content of both modalities in a coarse-to-fine fashion.
Second, different from a conventional common space learning algorithm which is
either concept based or latent space based, we introduce hybrid space learning
which combines the high performance of the latent space and the good
interpretability of the concept space. Dual encoding is conceptually simple,
practically effective and end-to-end trained with hybrid space learning.
Extensive experiments on four challenging video datasets show the viability of
the new method.
Authors' comments: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence. Code and data will be available at
https://github.com/danieljf24/hybrid_space. Conference version:
arXiv:1809.06181
Sarvesh Soni, Kirk Roberts
We apply deep learning-based language models to the task of patient cohort
retrieval (CR) with the aim of assessing their efficacy. The task of CR requires
the extraction of relevant documents from the electronic health records (EHRs)
on the basis of a given query. Given the recent advancements in the field of
document retrieval, we map the task of CR to a document retrieval task and
apply various deep neural models implemented for the general domain tasks. In
this paper, we propose a framework for retrieving patient cohorts using neural
language models without the need for explicit feature engineering and domain
expertise. We find that a majority of our models outperform the BM25 baseline
method on various evaluation metrics.
Authors' comments: Accepted at the AMIA Annual Symposium 2020
Xinli Yu, Mohsen Malmir, Cynthia He, Yue Liu, Rex Wu
In this paper, we propose a novel method for video moment retrieval (VMR)
that achieves state-of-the-art (SOTA) performance on R@1 metrics and
surpasses the SOTA on the high IoU metric (R@1, IoU=0.7).
First, we propose to use a multi-head self-attention mechanism, and further a
cross-attention scheme to capture video/query interaction and long-range query
dependencies from video context. The attention-based methods can develop
frame-to-query interaction and query-to-frame interaction at arbitrary
positions and the multi-head setting ensures a sufficient understanding of
complicated dependencies. Our model has a simple architecture, which enables
faster training and inference while maintaining .
Second, we also propose a multi-task training objective consisting of a
moment segmentation task, a start/end distribution prediction task, and a start/end
location regression task. We have verified that start/end predictions are noisy
due to annotator disagreement and joint training with moment segmentation task
can provide richer information since frames inside the target clip are also
utilized as positive training examples.
Third, we propose to use an early fusion approach, which achieves better
performance at the cost of inference time. However, the inference time will not
be a problem for our model since our model has a simple architecture which
enables efficient training and inference.
Authors' comments: needs internal approval
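The metric "R@1, IoU=0.7" named in the abstract above is commonly computed as the fraction of queries whose top-ranked predicted segment overlaps the ground-truth segment with a temporal intersection-over-union of at least 0.7. The sketch below follows that common definition; the function names and the example segments (in seconds) are assumptions, and the abstract does not spell out the authors' exact evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time segments given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, iou_threshold=0.7):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, ground_truths))
    return hits / len(ground_truths)

# Hypothetical top-1 predictions and annotations for two queries.
preds = [(2.0, 10.0), (0.0, 5.0)]
gts = [(4.0, 10.0), (5.0, 10.0)]
print(temporal_iou(preds[0], gts[0]))  # overlap 6 s, union 8 s
print(recall_at_1(preds, gts, iou_threshold=0.7))
```

A high threshold such as 0.7 rewards precise boundaries: the first prediction above passes with an IoU of 0.75, while a prediction that merely touches the annotated segment scores zero.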
Qihao Zhu, Zeyu Sun, Xiran Liang, Yingfei Xiong, Lu Zhang
Code retrieval helps developers reuse code snippets from open-source projects. Given a natural language description, code retrieval aims to find the most relevant code among a set of code snippets. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., "message" and "msg"), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may potentially be related. To address these problems, we propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps: the first embeds identifiers by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix to represent the degrees of overlap between each natural language word and each identifier. The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches and achieves 13.1% to 22.3% improvements. Moreover, we also conducted several in-depth experiments to help understand the performance of different components in OCoR.