Sohee Yang, Minjoon Seo
The state of the art in open-domain question answering (QA) relies on an efficient retriever that drastically reduces the search space for the expensive reader. A rather overlooked question in the community is the relationship between the retriever and the reader, and in particular, if the whole purpose of the retriever is just a fast approximation for the reader. Our empirical evidence indicates that the answer is no, and that the reader and the retriever are complementary to each other even in terms of accuracy only. We make a careful conjecture that the architectural constraint of the retriever, which has been originally intended for enabling approximate search, seems to also make the model more robust in large-scale search. We then propose to distill the reader into the retriever so that the retriever absorbs the strength of the reader while keeping its own benefit. Experimental results show that our method can enhance the document recall rate as well as the end-to-end QA accuracy of off-the-shelf retrievers in open-domain QA tasks.
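As an illustration of what distilling a reader into a retriever could look like, the sketch below computes a KL-divergence loss between the reader's and the retriever's score distributions over the same candidate passages. The choice of KL divergence, the temperature parameter, the function names, and the scores are all assumptions made for this example; the abstract does not specify the distillation objective.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw relevance scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(reader_scores, retriever_scores, temperature=1.0):
    """KL(reader || retriever) over one question's candidate passages.

    The reader acts as the teacher and the retriever as the student.
    This loss is an assumed stand-in, not the paper's stated objective.
    """
    teacher = softmax(reader_scores, temperature)
    student = softmax(retriever_scores, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


# Hypothetical scores for three candidate passages of one question.
reader_scores = [4.0, 1.0, 0.5]
retriever_scores = [2.0, 2.5, 0.5]
loss = distillation_loss(reader_scores, retriever_scores)
```

Minimizing such a loss with respect to the retriever's parameters would pull its ranking toward the reader's, while the retriever keeps its own architecture.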
Aashish Kumar Misraa, Ajinkya Kale, Pranav Aggarwal, Ali Aminian
Most real-world applications of image retrieval such as Adobe Stock, which is a marketplace for stock photography and illustrations, need a way for users to find images which are both visually (i.e. aesthetically) and conceptually (i.e. containing the same salient objects) similar to a query image. Learning visual-semantic representations from images is a well studied problem for image retrieval. Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g. on textual tags) or by re-ranking after an initial visual embedding based retrieval. In this paper, we learn a joint vision and concept embedding in the same high-dimensional space. This joint model gives the user fine-grained control over the semantics of the result set, allowing them to explore the catalog of images more rapidly. We model the visual and concept relationships as a graph structure, which captures the rich information through node neighborhood. This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks. We also introduce a novel inference time control, based on selective neighborhood connectivity, allowing the user control over the retrieval algorithm. We evaluate these multi-modal embeddings quantitatively on the downstream relevance task of image retrieval on the MS-COCO dataset and qualitatively on MS-COCO and an Adobe Stock dataset.
Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by jointly attending to information from different representation subspaces at different positions. In this paper, we present an approach to learning a hard retrieval attention where an attention head only attends to one token in the sentence rather than all tokens. The matrix multiplication between attention probabilities and the value sequence in the standard scaled dot-product attention can thus be replaced by a simple and efficient retrieval operation. We show that our hard retrieval attention mechanism is 1.43 times faster in decoding, while preserving translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
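The replacement of the weighted sum by a retrieval operation can be sketched as follows. This is a minimal inference-time illustration in plain Python; the function names and toy vectors are assumptions, and the sketch does not show how the paper trains through the hard selection.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hard_retrieval_attention(queries, keys, values):
    """For each query, return the value of the single best-matching key.

    Standard attention would compute softmax(scores) and a weighted sum of
    all values; here the weighted sum becomes an index lookup.
    """
    outputs = []
    for q in queries:
        scores = [dot(q, k) for k in keys]
        best = max(range(len(keys)), key=scores.__getitem__)
        outputs.append(values[best])  # retrieval instead of matrix multiply
    return outputs


# Toy example: two queries attending over three tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
values = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
attended = hard_retrieval_attention(queries, keys, values)
```

Scaling the scores by a positive constant, as in scaled dot-product attention, would not change which token is selected, so the scaling is omitted here.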
Xuanang Chen, Ben He, Kai Hui, Le Sun, Yingfei Sun
Despite the effectiveness of utilizing the BERT model for document ranking,
the high computational cost of such approaches limits their use. To this end,
this paper first empirically investigates the effectiveness of two knowledge
distillation models on the document ranking task. In addition, on top of the
recently proposed TinyBERT model, two simplifications are proposed. Evaluations
on two different and widely-used benchmarks demonstrate that Simplified
TinyBERT with the proposed simplifications not only boosts TinyBERT, but also
significantly outperforms BERT-Base while providing a 15$\times$ speedup.
Authors' comments: Accepted at ECIR 2021 (short paper)
Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, Meng Wang
This paper attacks the challenging problem of video retrieval by text. In
such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc
queries described exclusively in the form of a natural-language sentence, with
no visual example provided. Given videos as sequences of frames and queries as
sequences of words, an effective sequence-to-sequence cross-modal matching is
crucial. To that end, the two modalities need to be first encoded into
real-valued vectors and then projected into a common space. In this paper we
achieve this by proposing a dual deep encoding network that encodes videos and
queries into powerful dense representations of their own. Our novelty is
two-fold. First, different from prior art that resorts to a specific
single-level encoder, the proposed network performs multi-level encoding that
represents the rich content of both modalities in a coarse-to-fine fashion.
Second, different from a conventional common space learning algorithm which is
either concept based or latent space based, we introduce hybrid space learning
which combines the high performance of the latent space and the good
interpretability of the concept space. Dual encoding is conceptually simple,
practically effective and end-to-end trained with hybrid space learning.
Extensive experiments on four challenging video datasets show the viability of
the new method.
Authors' comments: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence. Code and data will be available at
https://github.com/danieljf24/hybrid_space. Conference version:
arXiv:1809.06181
Sarvesh Soni, Kirk Roberts
We apply deep learning-based language models to the task of patient cohort
retrieval (CR) with the aim of assessing their efficacy. The task of CR requires
the extraction of relevant documents from the electronic health records (EHRs)
on the basis of a given query. Given the recent advancements in the field of
document retrieval, we map the task of CR to a document retrieval task and
apply various deep neural models implemented for the general domain tasks. In
this paper, we propose a framework for retrieving patient cohorts using neural
language models without the need for explicit feature engineering and domain
expertise. We find that a majority of our models outperform the BM25 baseline
method on various evaluation metrics.
Authors' comments: Accepted at the AMIA Annual Symposium 2020
Xinli Yu, Mohsen Malmir, Cynthia He, Yue Liu, Rex Wu
In this paper, we propose a novel method for video moment retrieval (VMR)
that achieves state-of-the-art (SOTA) performance on R@1 metrics and
surpasses the SOTA on the high IoU metric (R@1, IoU=0.7).
First, we propose to use a multi-head self-attention mechanism, and further a
cross-attention scheme to capture video/query interaction and long-range query
dependencies from video context. The attention-based methods can develop
frame-to-query interaction and query-to-frame interaction at arbitrary
positions and the multi-head setting ensures sufficient understanding of
complicated dependencies. Our model has a simple architecture, which enables
faster training and inference while maintaining .
Second, we also propose a multi-task training objective consisting of a
moment segmentation task, start/end distribution prediction, and a start/end
location regression task. We have verified that start/end predictions are noisy
due to annotator disagreement and joint training with moment segmentation task
can provide richer information since frames inside the target clip are also
utilized as positive training examples.
Third, we propose to use an early fusion approach, which achieves better
performance at the cost of inference time. However, the inference time will not
be a problem for our model since our model has a simple architecture which
enables efficient training and inference.
Authors' comments: needs internal approval
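For readers unfamiliar with the metric named above, the following sketch shows how R@1 at a temporal IoU threshold is commonly computed for moment retrieval. The function names and the example intervals are made up for illustration and do not come from the paper.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, truths, iou_threshold=0.7):
    """Fraction of queries whose top-ranked moment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, t) >= iou_threshold for p, t in zip(top1_preds, truths)
    )
    return hits / len(truths)


# Two hypothetical queries: the first prediction is tight, the second is not.
top1_preds = [(2.0, 10.0), (0.0, 4.0)]
truths = [(3.0, 10.0), (2.0, 8.0)]
score = recall_at_1(top1_preds, truths, iou_threshold=0.7)
```

Here the first pair has IoU 7/8 and counts as a hit, while the second has IoU 2/8 and does not, so R@1 at IoU=0.7 is 0.5.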
Qihao Zhu, Zeyu Sun, Xiran Liang, Yingfei Xiong, Lu Zhang
Code retrieval helps developers reuse code snippets from open-source projects. Given a natural language description, code retrieval aims to search for the most relevant code among a set of code snippets. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., "message" and "msg"), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may potentially be related. To address these problems, we propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps: the first embeds identifiers by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix to represent the degrees of overlap between each natural language word and each identifier. The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches and achieves 13.1% to 22.3% improvements. Moreover, we also conducted several in-depth experiments to help understand the performance of different components in OCoR.
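To make the idea of an overlap matrix concrete, the sketch below scores each (word, identifier) pair by the length of their longest common subsequence, normalized by the shorter string. This particular overlap measure, the function names, and the example tokens are assumptions chosen for illustration; the abstract does not define how OCoR computes its overlap degrees.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_matrix(words, identifiers):
    """Rows are description words, columns are code identifiers.

    Each entry is an assumed overlap degree in [0, 1]. Tokens are
    assumed to be non-empty.
    """
    return [
        [
            lcs_length(w.lower(), i.lower()) / min(len(w), len(i))
            for i in identifiers
        ]
        for w in words
    ]


words = ["send", "message"]
identifiers = ["msg", "sendMsg"]
matrix = overlap_matrix(words, identifiers)
```

With this measure, "message" and "msg" receive the maximal score because every character of "msg" appears in order inside "message", which matches the abbreviation example given in the abstract.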
Yuxiang Lu, Zhuqing Jia, Syed A. Jafar
Double blind $T$-private information retrieval (DB-TPIR) enables two users,
each of whom specifies an index ($\theta_1, \theta_2$, resp.), to efficiently
retrieve a message $W(\theta_1,\theta_2)$ labeled by the two indices, from a
set of $N$ servers that store all messages $W(k_1,k_2),
k_1\in\{1,2,\cdots,K_1\}, k_2\in\{1,2,\cdots,K_2\}$, such that the two users'
indices are kept private from any set of up to $T_1,T_2$ colluding servers,
respectively, as well as from each other. A DB-TPIR scheme based on
cross-subspace alignment is proposed in this paper, and shown to be
capacity-achieving in the asymptotic setting of a large number of messages and
bounded latency. The scheme is then extended to $M$-way blind $X$-secure
$T$-private information retrieval (MB-XS-TPIR) with multiple ($M$) indices,
each belonging to a different user, arbitrary privacy levels for each index
($T_1, T_2,\cdots, T_M$), and arbitrary level of security ($X$) of data
storage, so that the message $W(\theta_1,\theta_2,\cdots, \theta_M)$ can be
efficiently retrieved while the stored data is held secure against collusion
among up to $X$ colluding servers, the $m^{th}$ user's index is private against
collusion among up to $T_m$ servers, and each user's index $\theta_m$ is
private from all other users. The general scheme relies on a tensor-product
based extension of cross-subspace alignment and retrieves
$1-(X+T_1+\cdots+T_M)/N$ bits of desired message per bit of download.
Authors' comments: Accepted for publication in IEEE Journal on Selected Areas in
Information Theory (JSAIT)
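The rate expression $1-(X+T_1+\cdots+T_M)/N$ can be evaluated for concrete parameters as in the sketch below. The function name and the sample values are illustrative, and the guard requiring $N > X + T_1 + \cdots + T_M$ is an assumption added here so that the rate stays positive.

```python
from fractions import Fraction


def mb_xs_tpir_rate(num_servers, security_level, privacy_levels):
    """Evaluate 1 - (X + T_1 + ... + T_M) / N as an exact fraction.

    num_servers is N, security_level is X, privacy_levels is [T_1, ..., T_M].
    """
    used = security_level + sum(privacy_levels)
    if used >= num_servers:
        # Assumed validity condition: otherwise the expression is not positive.
        raise ValueError("need N > X + T_1 + ... + T_M for a positive rate")
    return 1 - Fraction(used, num_servers)


# Hypothetical setting: N = 10 servers, X = 2, two users with T_1 = 1, T_2 = 2.
rate = mb_xs_tpir_rate(num_servers=10, security_level=2, privacy_levels=[1, 2])
```

In this hypothetical setting the expression gives 1 - 5/10 = 1/2, i.e. half a bit of desired message per downloaded bit.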
Xueya Zhang, Tong Zhang, Xiaobin Hong, Zhen Cui, Jian Yang
Movie graphs play an important role in bridging heterogeneous modalities of videos and texts in human-centric retrieval. In this work, we propose Graph Wasserstein Correlation Analysis (GWCA) to deal with the core issue therein, i.e., cross heterogeneous graph comparison. Spectral graph filtering is introduced to encode graph signals, which are then embedded as probability distributions in a Wasserstein space, called graph Wasserstein metric learning. Such a seamless integration of graph signal filtering together with metric learning results in a surprising consistency on both learning processes, in which the goal of metric learning is just to optimize signal filters or vice versa. Further, we derive the solution of the graph comparison model as a classic generalized eigenvalue decomposition problem, which has an exact closed-form solution. Finally, GWCA together with movie/text graphs generation are unified into the framework of movie retrieval to evaluate our proposed method. Extensive experiments on the MovieGraphs dataset demonstrate the effectiveness of our GWCA as well as the entire framework.
Alexander Lebedev, Andrei Khrennikov
Recently people started to understand that applications of the mathematical formalism of quantum theory are not limited to physics. Nowadays, this formalism is widely used outside of quantum physics, in particular in cognition, psychology, decision making, and information processing, especially information retrieval. The latter is very promising. The aim of this brief introductory review is to stimulate research in this exciting area of information science. This paper does not aim to present a complete review of the state of the art in quantum information retrieval.
Ming-Wei Li, Qing-Yuan Jiang, Wu-Jun Li
Due to its low storage cost and fast query speed, hashing has been widely
used in large-scale image retrieval tasks. Hash bucket search returns data
points within a given Hamming radius to each query, which can enable search at
a constant or sub-linear time cost. However, existing hashing methods cannot
achieve satisfactory retrieval performance for hash bucket search in complex
scenarios, since they learn only one hash code for each image. More
specifically, by using one hash code to represent one image, existing methods
might fail to put similar image pairs into buckets with a small Hamming
distance to the query when the semantic information of images is complex. As a
result, a large number of hash buckets need to be visited for retrieving
similar images, based on the learned codes. This will deteriorate the
efficiency of hash bucket search. In this paper, we propose a novel hashing
framework, called multiple code hashing (MCH), to improve the performance of
hash bucket search. The main idea of MCH is to learn multiple hash codes for
each image, with each code representing a different region of the image.
Furthermore, we propose a deep reinforcement learning algorithm to learn the
parameters in MCH. To the best of our knowledge, this is the first work that
proposes to learn multiple hash codes for each image in image retrieval.
Experiments demonstrate that MCH can achieve a significant improvement in hash
bucket search, compared with existing methods that learn only one hash code for
each image.
Authors' comments: 12 pages, 9 figures, 3 tables
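Hash bucket search within a Hamming radius, with several codes per image, can be sketched as below. The 4-bit codes, image names, and function names are invented for illustration, and the sketch covers only the lookup side; it does not show how MCH learns its codes.

```python
from itertools import combinations


def buckets_within_radius(code, num_bits, radius):
    """Yield every bucket id whose Hamming distance to `code` is <= radius."""
    for r in range(radius + 1):
        for positions in combinations(range(num_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip one bit of the query code
            yield flipped


def build_index(image_codes):
    """Map each hash bucket to the set of images stored in it.

    An image with several codes is stored in several buckets.
    """
    index = {}
    for image_id, codes in image_codes.items():
        for c in codes:
            index.setdefault(c, set()).add(image_id)
    return index


def bucket_search(index, query_code, num_bits, radius):
    """Collect all images found in buckets within the Hamming radius."""
    hits = set()
    for bucket in buckets_within_radius(query_code, num_bits, radius):
        hits |= index.get(bucket, set())
    return hits


# Hypothetical 4-bit codes; "cat_dog" has one code per depicted concept.
image_codes = {"cat_dog": [0b0011, 0b1100], "cat": [0b0010], "car": [0b1111]}
index = build_index(image_codes)
exact = bucket_search(index, 0b1100, num_bits=4, radius=0)
near = bucket_search(index, 0b0011, num_bits=4, radius=1)
```

In this toy index, the image with two codes is found at radius 0 from either of its codes, whereas a single code could sit close to only one of the two query codes. The number of buckets visited grows quickly with the radius, which is why keeping the radius small matters for efficiency.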
Te-Yuan Liu, Ata Mahjoubfar, Daniel Prusinski, Luis Stevens
Neuromorphic computing mimics the neural activity of the brain through emulating spiking neural networks. In numerous machine learning tasks, neuromorphic chips are expected to provide superior solutions in terms of cost and power efficiency. Here, we explore the application of Loihi, a neuromorphic computing chip developed by Intel, for the computer vision task of image retrieval. We evaluated the functionalities and the performance metrics that are critical in content-based visual search and recommender systems using deep-learning embeddings. Our results show that the neuromorphic solution is about 2.5 times more energy-efficient compared with an ARM Cortex-A72 CPU and 12.5 times more energy-efficient compared with an NVIDIA T4 GPU for inference by a lightweight convolutional neural network without batching while maintaining the same level of matching accuracy. The study validates the potential of neuromorphic computing in low-power image retrieval, as a complementary paradigm to the existing von Neumann architectures.
Rakib Hyder, Zikui Cai, M. Salman Asif
Fourier phase retrieval is a classical problem that deals with the recovery
of an image from the amplitude measurements of its Fourier coefficients.
Conventional methods solve this problem via iterative (alternating)
minimization by leveraging some prior knowledge about the structure of the
unknown image. The inherent ambiguities about shift and flip in the Fourier
measurements make this problem especially difficult; and most of the existing
methods use several random restarts with different permutations. In this paper,
we assume that a known (learned) reference is added to the signal before
capturing the Fourier amplitude measurements. Our method is inspired by the
principle of adding a reference signal in holography. To recover the signal, we
implement an iterative phase retrieval method as an unrolled network. Then we
use back propagation to learn the reference that provides us the best
reconstruction for a fixed number of phase retrieval iterations. We performed a
number of simulations on a variety of datasets under different conditions and
found that our proposed method for phase retrieval via unrolled network and
learned reference provides near-perfect recovery at fixed (small) computational
cost. We compared our method with standard Fourier phase retrieval methods and
observed significant performance enhancement using the learned reference.
Authors' comments: Accepted to ECCV 2020. Code is available at
https://github.com/CSIPlab/learnPR_reference
Craig Macdonald, Nicola Tonellotto
The advent of deep machine learning platforms such as TensorFlow and PyTorch, developed in expressive high-level languages such as Python, has allowed more expressive representations of deep neural network architectures. We argue that such a powerful formalism is missing in information retrieval (IR), and propose a framework called PyTerrier that allows advanced retrieval pipelines to be expressed, and evaluated, in a declarative manner close to their conceptual design. Like the aforementioned frameworks that compile deep learning experiments into primitive GPU operations, our framework targets IR platforms as backends in order to execute and evaluate retrieval pipelines. Further, we can automatically optimise the retrieval pipelines to increase their efficiency to suit a particular IR platform backend. Our experiments, conducted on TREC Robust and ClueWeb09 test collections, demonstrate the efficiency benefits of these optimisations for retrieval pipelines involving both the Anserini and Terrier IR platforms.
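The idea of expressing a retrieval pipeline declaratively, close to its conceptual design, can be illustrated with a toy example in plain Python. This sketch does not use PyTerrier itself: the Stage class, the meaning given to the `>>` and `%` operators, the tiny corpus, and both stage functions are assumptions made up for illustration.

```python
# Toy corpus standing in for an indexed collection.
DOCS = {
    "d1": "neural ranking models",
    "d2": "ranking with BM25",
    "d3": "cooking pasta",
}


class Stage:
    """Hypothetical pipeline stage mapping (query, results) to results."""

    def __init__(self, fn, name):
        self.fn, self.name = fn, name

    def __call__(self, query, results=None):
        return self.fn(query, results or [])

    def __rshift__(self, other):
        # a >> b: feed the output of a into b.
        return Stage(
            lambda q, r: other(q, self(q, r)), f"({self.name} >> {other.name})"
        )

    def __mod__(self, k):
        # a % k: keep only the top k results of a.
        return Stage(lambda q, r: self(q, r)[:k], f"({self.name} % {k})")


def term_match(query, _results):
    """First-stage retrieval: score documents by number of shared terms."""
    terms = set(query.lower().split())
    scored = [
        (doc_id, len(terms & set(text.lower().split())))
        for doc_id, text in DOCS.items()
    ]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])


def prefer_short(_query, results):
    """Toy re-ranker: order candidates by document length."""
    return sorted(results, key=lambda s: len(DOCS[s[0]]))


pipeline = (Stage(term_match, "match") % 2) >> Stage(prefer_short, "short")
ranking = pipeline("neural ranking")
```

Because the composed pipeline is itself an object with a readable structure (here exposed through `pipeline.name`), a framework could inspect and rewrite it before execution, which is the kind of optimisation opportunity the abstract describes.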
Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
The task of retrieving video content relevant to natural language queries
plays a critical role in effectively handling internet-scale datasets. Most of
the existing methods for this caption-to-video retrieval problem do not fully
exploit cross-modal cues present in video. Furthermore, they aggregate
per-frame visual features with limited or no temporal information. In this
paper, we present a multi-modal transformer to jointly encode the different
modalities in video, which allows each of them to attend to the others. The
transformer architecture is also leveraged to encode and model the temporal
information. On the natural language side, we investigate the best practices to
jointly optimize the language embedding together with the multi-modal
transformer. This novel framework allows us to establish state-of-the-art
results for video retrieval on three datasets. More details are available at
http://thoth.inrialpes.fr/research/MMT.
Authors' comments: ECCV 2020 (spotlight paper)
Mikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk
We study the image retrieval problem at the wireless edge, where an edge device captures an image, which is then used to retrieve similar images from an edge server. These can be images of the same person or a vehicle taken from other cameras at different times and locations. Our goal is to maximize the accuracy of the retrieval task under power and bandwidth constraints over the wireless link. Due to the stringent delay constraint of the underlying application, sending the whole image at a sufficient quality is not possible. We propose two alternative schemes based on digital and analog communications, respectively. In the digital approach, we first propose a deep neural network (DNN) aided retrieval-oriented image compression scheme, whose output bit sequence is transmitted over the channel using conventional channel codes. In the analog joint source and channel coding (JSCC) approach, the feature vectors are directly mapped into channel symbols. We evaluate both schemes on image based re-identification (re-ID) tasks under different channel conditions, including both static and fading channels. We show that the JSCC scheme significantly increases the end-to-end accuracy, speeds up the encoding process, and provides graceful degradation with channel conditions. The proposed architecture is evaluated through extensive simulations on different datasets and channel conditions, as well as through ablation studies.
Hsuan-Yin Lin, Siddhartha Kumar, Eirik Rosnes, Alexandre Graell i Amat, Eitan Yaakobi
Private information retrieval (PIR) protocols ensure that a user can download
a file from a database without revealing any information on the identity of the
requested file to the servers storing the database. While existing protocols
strictly impose that no information is leaked on the file's identity, this work
initiates the study of the tradeoffs that can be achieved by relaxing the
perfect privacy requirement. We refer to such protocols as weakly-private
information retrieval (WPIR) protocols. In particular, for the case of multiple
noncolluding replicated servers, we study how the download rate, the upload
cost, and the access complexity can be improved when relaxing the full privacy
constraint. To quantify the information leakage on the requested file's
identity we consider mutual information (MI), worst-case information leakage,
and maximal leakage (MaxL). We present two WPIR schemes, denoted by Scheme A
and Scheme B, based on two recent PIR protocols and show that the download rate
of the former can be optimized by solving a convex optimization problem. We
also show that Scheme A achieves an improved download rate compared to the
recently proposed scheme by Samy et al. under the so-called $\epsilon$-privacy
metric. Additionally, a family of schemes based on partitioning is presented.
Moreover, we provide an information-theoretic converse bound for the maximum
possible download rate for the MI and MaxL privacy metrics under a practical
restriction on the alphabet size of queries and answers. For two servers and
two files, the bound is tight under the MaxL metric, which settles the WPIR
capacity in this particular case. Finally, we compare the performance of the
proposed schemes and their gap to the converse bound.
Authors' comments: To appear in IEEE Transactions on Information Theory. arXiv admin
note: text overlap with arXiv:1901.06730
Xiao Wang, Craig Macdonald, Iadh Ounis
Query reformulations have long been a key mechanism to alleviate the
vocabulary-mismatch problem in information retrieval, for example by expanding
the queries with related query terms or by generating paraphrases of the
queries. In this work, we propose a deep reinforced query reformulation (DRQR)
model to automatically generate new reformulations of the query. To encourage
the model to generate queries which can achieve high performance when
performing the retrieval task, we incorporate query performance prediction into
our reward function. In addition, to evaluate the quality of the reformulated
query in the context of information retrieval, we first train our DRQR model,
then apply the retrieval ranking model on the obtained reformulated query.
Experiments are conducted on the TREC 2020 Deep Learning track MSMARCO document
ranking dataset. Our results show that our proposed model outperforms several
query reformulation model baselines when performing the retrieval task. In
addition, improvements are also observed when combining with various retrieval
models, such as query expansion and BERT.
Authors' comments: 10 pages, 4 figures
Golsa Tahmasebzadeh, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
Content-based information retrieval is based on the information contained in
documents rather than using metadata such as keywords. Most information
retrieval methods are either based on text or image. In this paper, we
investigate the usefulness of multimodal features for cross-lingual news search
in various domains: politics, health, environment, sport, and finance. To this
end, we consider five feature types for image and text and compare the
performance of the retrieval system using different combinations. Experimental
results show that retrieval results can be improved when considering both
visual and textual information. In addition, it is observed that among textual
features entity overlap outperforms word embeddings, while geolocation
embeddings achieve better performance among visual features in the retrieval
task.
Authors' comments: CLEOPATRA Workshop co-located with ESWC 2020