Luyu Gao, Jamie Callan
Pre-trained Transformer language models (LM) have become go-to text
representation encoders. Prior research fine-tunes deep LMs to encode text
sequences such as sentences and passages into single dense vector
representations for efficient text comparison and retrieval. However, dense
encoders require a lot of data and sophisticated techniques to effectively
train and suffer in low-data situations. This paper finds a key reason:
standard LMs' internal attention structure is not ready-to-use for dense
encoders, which need to aggregate text information into the dense
representation. We propose to pre-train towards dense encoders with a novel
Transformer architecture, Condenser, where LM prediction CONditions on DENSE
Representation. Our experiments show Condenser improves over standard LM by
large margins on various text retrieval and similarity tasks.
Authors' comments: EMNLP 2021
Revanth Gangi Reddy, Vikas Yadav, Md Arafat Sultan, Martin Franz, Vittorio Castelli, Heng Ji, Avirup Sil
Recent work has shown that commonly available machine reading comprehension (MRC) datasets can be used to train high-performance neural information retrieval (IR) systems. However, the evaluation of neural IR has so far been limited to standard supervised learning settings, where neural IR systems have outperformed traditional term matching baselines. We conduct in-domain and out-of-domain evaluations of neural IR, and seek to improve its robustness across different scenarios, including zero-shot settings. We show that synthetic training examples generated using a sequence-to-sequence generator can be effective towards this goal: in our experiments, pre-training with synthetic examples improves retrieval performance in both in-domain and out-of-domain evaluation on five different test sets.
Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie
Product quantization (PQ) is a widely used technique for ad-hoc retrieval.
Recent studies propose supervised PQ, where the embedding and quantization
models can be jointly trained with supervised learning. However, there is a
lack of appropriate formulation of the joint training objective; thus, the
improvements over previous non-supervised baselines are limited in reality. In
this work, we propose the Matching-oriented Product Quantization (MoPQ), where
a novel objective Multinoulli Contrastive Loss (MCL) is formulated. With the
minimization of MCL, we are able to maximize the matching probability of query
and ground-truth key, which contributes to the optimal retrieval accuracy.
Given that the exact computation of MCL is intractable due to the demand of
vast contrastive samples, we further propose the Differentiable Cross-device
Sampling (DCS), which significantly augments the contrastive samples for
precise approximation of MCL. We conduct extensive experimental studies on four
real-world datasets, whose results verify the effectiveness of MoPQ. The code
is available at https://github.com/microsoft/MoPQ.
Authors' comments: Accepted by EMNLP2021
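As background for the abstract above, here is a minimal sketch of plain, unsupervised product quantization: a vector is split into sub-vectors and each sub-vector is replaced by the index of its nearest codeword. It does not implement MoPQ, the Multinoulli Contrastive Loss, or DCS. The function names `pq_encode` and `pq_decode` and the toy codebooks are hypothetical and chosen only for illustration.

```python
def pq_encode(vector, codebooks):
    """Return one codeword index per sub-space (nearest codeword by squared L2)."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        # Squared Euclidean distance from this sub-vector to every codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(sub, cw)) for cw in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the selected codewords."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out


# Toy setup (assumed): a 4-d vector, 2 sub-spaces, 2 codewords per sub-space.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.2, 0.8], codebooks)
reconstruction = pq_decode(codes, codebooks)
```

In supervised PQ, the codebooks and the embedding model would be trained jointly rather than fixed as they are here.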
Xinyi Dai, Jianghao Lin, Weinan Zhang, Shuai Li, Weiwen Liu, Ruiming Tang, Xiuqiang He, Jianye Hao et al.
Modern information retrieval systems, including web search, ads placement,
and recommender systems, typically rely on learning from user feedback. Click
models, which study how users interact with a ranked list of items, provide a
useful understanding of user feedback for learning ranking models. Constructing
"right" dependencies is the key to any successful click model. However,
probabilistic graphical models (PGMs) have to rely on manually assigned
dependencies, and oversimplify user behaviors. Existing neural network based
methods improve on PGMs by enhancing expressive ability and allowing flexible
dependencies, but still suffer from exposure bias and inferior estimation. In
this paper, we propose a novel framework, Adversarial Imitation Click Model
(AICM), based on imitation learning. Firstly, we explicitly learn the reward
function that recovers users' intrinsic utility and underlying intentions.
Secondly, we model user interactions with a ranked list as a dynamic system
instead of one-step click prediction, alleviating the exposure bias problem.
Finally, we minimize the JS divergence through adversarial training and learn a
stable distribution of click sequences, which makes AICM generalize well across
different distributions of ranked lists. A theoretical analysis has indicated
that AICM reduces the exposure bias from $O(T^2)$ to $O(T)$. Our studies on a
public web search dataset show that AICM not only outperforms state-of-the-art
models in traditional click metrics but also achieves superior performance in
addressing the exposure bias and recovering the underlying patterns of click
sequences.
Authors' comments: Accepted to WWW 2021
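The Jensen-Shannon (JS) divergence mentioned in the abstract above can be computed directly for two discrete distributions. The sketch below shows only the quantity itself, not AICM's adversarial training procedure. The function names `kl` and `js_divergence` are illustrative.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The divergence is zero for identical distributions and reaches its maximum, log 2 in nats, when the two distributions have disjoint support.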
Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, Hal Daumé
Complex question answering often requires finding a reasoning chain that
consists of multiple evidence pieces. Current approaches incorporate the
strengths of structured knowledge and unstructured text, assuming that text
corpora are semi-structured. Building on dense retrieval methods, we propose a new
multi-step retrieval approach (BeamDR) that iteratively forms an evidence chain
through beam search in dense representations. When evaluated on multi-hop
question answering, BeamDR is competitive with state-of-the-art systems, without
using any semi-structured information. Through query composition in dense
space, BeamDR captures the implicit relationships between evidence in the
reasoning chain. The code is available at
https://github.com/henryzhao5852/BeamDR.
Authors' comments: NAACL 2021
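The idea of building an evidence chain by beam search over dense vectors can be sketched as follows. This is a toy illustration under stated assumptions, not the BeamDR implementation: scores are plain dot products, and the paper's learned query composition is replaced by simple vector addition. The names `dot` and `beam_retrieve` and the three toy passages are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def beam_retrieve(query, passages, beam_size=2, hops=2):
    """Return the top chains as (passage id list, cumulative score) pairs."""
    # Each beam entry: (cumulative score, chain of passage ids, current query vector).
    beams = [(0.0, [], query)]
    for _ in range(hops):
        candidates = []
        for score, chain, q in beams:
            for pid, vec in passages.items():
                if pid in chain:
                    continue  # do not revisit a passage within one chain
                # Assumed composition: add the passage vector to the query vector.
                new_q = [a + b for a, b in zip(q, vec)]
                candidates.append((score + dot(q, vec), chain + [pid], new_q))
        # Keep only the beam_size highest-scoring partial chains.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return [(chain, score) for score, chain, _ in beams]


passages = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
results = beam_retrieve([1.0, 0.0], passages)
```

With a learned, non-additive composition function, the order in which passages are added to a chain would generally affect the score, which it does not in this simplified version.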
Dian Yu, Luheng He, Yuan Zhang, Xinya Du, Panupong Pasupat, Qi Li
Few-shot learning arises in important practical scenarios, such as when a
natural language understanding system needs to learn new semantic labels for an
emerging, resource-scarce domain. In this paper, we explore retrieval-based
methods for intent classification and slot filling tasks in few-shot settings.
Retrieval-based methods make predictions based on labeled examples in the
retrieval index that are similar to the input, and thus can adapt to new
domains simply by changing the index without having to retrain the model.
However, it is non-trivial to apply such methods on tasks with a complex label
space like slot filling. To this end, we propose a span-level retrieval method
that learns similar contextualized representations for spans with the same
label via a novel batch-softmax objective. At inference time, we use the labels
of the retrieved spans to construct the final structure with the highest
aggregated score. Our method outperforms previous systems in various few-shot
settings on the CLINC and SNIPS benchmarks.
Authors' comments: To appear at NAACL 2021
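A generic in-batch softmax objective, of the kind the abstract above alludes to, can be sketched as below. The paper's exact span-level formulation is not given in the abstract, so this is an assumed, simplified form: `targets[i]` is treated as the positive for `anchors[i]`, every other target in the batch is a negative, and similarity is a dot product. The function name `batch_softmax_loss` is illustrative.

```python
import math


def batch_softmax_loss(anchors, targets):
    """Mean cross-entropy over the batch, with targets[i] as the positive for anchors[i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        scores = [sum(x * y for x, y in zip(a, t)) for t in targets]
        max_s = max(scores)  # subtract the max for numerical stability
        log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
        total += log_z - scores[i]  # negative log-probability of the positive
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = batch_softmax_loss(anchors, [[5.0, 0.0], [0.0, 5.0]])
shuffled = batch_softmax_loss(anchors, [[0.0, 5.0], [5.0, 0.0]])
```

The loss is small when each anchor is most similar to its own target (`aligned`) and large when the positives are mismatched (`shuffled`).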
Wei Chen, Yu Liu, Erwin M. Bakker, Michael S. Lew
Accurately matching visual and textual data in cross-modal retrieval has been
widely studied in the multimedia community. To address the challenges posed
by the heterogeneity gap and the semantic gap, we propose integrating Shannon
information theory and adversarial learning. In terms of the heterogeneity gap,
we integrate modality classification and information entropy maximization
adversarially. For this purpose, a modality classifier (as a discriminator) is
built to distinguish the text and image modalities according to their different
statistical properties. This discriminator uses its output probabilities to
compute Shannon information entropy, which measures the uncertainty of the
modality classification it performs. Moreover, feature encoders (as a
generator) project uni-modal features into a commonly shared space and attempt
to fool the discriminator by maximizing its output information entropy. Thus,
maximizing information entropy gradually reduces the distribution discrepancy
of cross-modal features, thereby achieving a domain confusion state where the
discriminator cannot classify two modalities confidently. To reduce the
semantic gap, Kullback-Leibler (KL) divergence and bi-directional triplet loss
are used to associate the intra- and inter-modality similarity between features
in the shared space. Furthermore, a regularization term based on KL-divergence
with temperature scaling is used to calibrate the biased label classifier
caused by the data imbalance issue. Extensive experiments with four deep models
on four benchmarks are conducted to demonstrate the effectiveness of the
proposed approach.
Authors' comments: Accepted by Pattern Recognition
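The entropy signal described in the abstract above can be illustrated with a few lines of code. The sketch computes the Shannon entropy of a modality classifier's output probabilities. It does not reproduce the paper's training loop, and the function name `shannon_entropy` and the example probabilities are assumptions made for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Discriminator is sure the feature came from one modality: low entropy.
confident = shannon_entropy([0.99, 0.01])
# Discriminator cannot tell text from image: maximal entropy for two classes.
confused = shannon_entropy([0.5, 0.5])
```

Feature encoders that maximize this entropy push the discriminator towards the `confused` case, which corresponds to the domain confusion state the abstract describes.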
Hao Wang, Xiang Bai, Mingkun Yang, Shenggao Zhu, Jing Wang, Wenyu Liu
Scene text retrieval aims to localize and search all text instances from an
image gallery that are the same as or similar to a given query text. Such a task
is usually realized by matching a query text to the recognized words output
by an end-to-end scene text spotter. In this paper, we address this problem by
directly learning a cross-modal similarity between a query text and each text
instance from natural images. Specifically, we establish an end-to-end
trainable network, jointly optimizing the procedures of scene text detection
and cross-modal similarity learning. In this way, scene text retrieval can be
simply performed by ranking the detected text instances with the learned
similarity. Experiments on three benchmark datasets demonstrate our method
consistently outperforms the state-of-the-art scene text spotting/retrieval
approaches. In particular, the proposed framework of joint detection and
similarity learning achieves significantly better performance than separated
methods. Code is available at: https://github.com/lanfeng4659/STR-TDSL.
Authors' comments: Accepted to CVPR 2021. Code is available at:
https://github.com/lanfeng4659/STR-TDSL
Yueqi Cao, Athanasios Vlontzos, Luca Schmidtke, Bernhard Kainz, Anthea Monod
Appropriately representing elements in a database so that queries may be
accurately matched is a central task in information retrieval; recently, this
has been achieved by embedding the graphical structure of the database into a
manifold in a hierarchy-preserving manner using a variety of metrics.
Persistent homology is a tool commonly used in topological data analysis that
is able to rigorously characterize a database in terms of both its hierarchy
and connectivity structure. Computing persistent homology on a variety of
embedded datasets reveals that some commonly used embeddings fail to preserve
the connectivity. We show that those embeddings which successfully retain the
database topology coincide in persistent homology by introducing two
dilation-invariant comparative measures to capture this effect: in particular,
they address the issue of metric distortion on manifolds. We provide an
algorithm for their computation that exhibits greatly reduced time complexity
over existing methods. We use these measures to perform the first instance of
topology-based information retrieval and demonstrate its increased performance
over the standard bottleneck distance for persistent homology. We showcase our
approach on databases of different data varieties including text, videos, and
medical images.
Authors' comments: 29 pages, 10 figures, 4 tables
Zongyu Li, Kenneth Lange, Jeffrey A. Fessler
This paper discusses phase retrieval algorithms for maximum likelihood (ML)
estimation from measurements following independent Poisson distributions in
very low-count regimes, e.g., 0.25 photon per pixel. To maximize the
log-likelihood of the Poisson ML model, we propose a modified Wirtinger flow
(WF) algorithm using a step size based on the observed Fisher information. This
approach eliminates all parameter tuning except the number of iterations. We
also propose a novel curvature for majorize-minimize (MM) algorithms with a
quadratic majorizer. We show theoretically that our proposed curvature is
sharper than the curvature derived from the supremum of the second derivative
of the Poisson ML cost function. We compare the proposed algorithms (WF, MM)
with existing optimization methods, including WF using other step-size schemes,
quasi-Newton methods such as LBFGS and alternating direction method of
multipliers (ADMM) algorithms, under a variety of experimental settings.
Simulation experiments with a random Gaussian matrix, a canonical DFT matrix, a
masked DFT matrix and an empirical transmission matrix demonstrate the
following. 1) As expected, algorithms based on the Poisson ML model
consistently produce higher quality reconstructions than algorithms derived
from Gaussian noise ML models when applied to low-count data. 2) For
unregularized cases, our proposed WF algorithm with Fisher information for step
size converges faster than other WF methods, e.g., WF with empirical step size,
backtracking line search, and optimal step size for the Gaussian noise model;
it also converges faster than the LBFGS quasi-Newton method. 3) In regularized
cases, our proposed WF algorithm converges faster than WF with backtracking
line search, LBFGS, MM and ADMM.
Authors' comments: 14 pages
Alex M. Wilhelm, David D. Schmidt, Daniel E. Adams, Charles G. Durfee
We present a phase retrieval algorithm for dispersion scan (d-scan), inspired
by ptychography, which is capable of characterizing multiple
mutually-incoherent ultrafast pulses (or modes) in a pulse train simultaneously
from a single d-scan trace. In addition, a form of Newton's method is employed
as a solution to the square root problem commonly encountered in second
harmonic pulse measurement techniques. Simulated and experimental phase
retrievals of both single-mode and multi-mode d-scan traces are shown to
demonstrate the accuracy and robustness of the algorithm.
Authors' comments: 17 pages, 8 figures
Yawar Siddiqui, Justus Thies, Fangchang Ma, Qi Shan, Matthias Nießner, Angela Dai
3D reconstruction of large scenes is a challenging problem due to the
high-complexity nature of the solution space, in particular for generative
neural networks. In contrast to traditional generative learned models which
encode the full generative process into a neural network and can struggle with
maintaining local details at the scene level, we introduce a new method that
directly leverages scene geometry from the training database. First, we learn
to synthesize an initial estimate for a 3D scene, constructed by retrieving a
top-k set of volumetric chunks from the scene database. These candidates are
then refined to a final scene generation with an attention-based refinement
that can effectively select the most consistent set of geometry from the
candidates and combine them together to create an output scene, facilitating
transfer of coherent structures and local detail from train scene geometry. We
demonstrate our neural scene reconstruction with a database for the tasks of 3D
super resolution and surface reconstruction from sparse point clouds, showing
that our approach enables generation of more coherent, accurate 3D scenes,
improving on average by over 8% in IoU over state-of-the-art scene
reconstruction.
Authors' comments: Project Page: https://nihalsid.github.io/retrieval-fuse/
Aneeshan Sain, Ayan Kumar Bhunia, Yongxin Yang, Tao Xiang, Yi-Zhe Song
Sketch-based image retrieval (SBIR) is a cross-modal matching problem which
is typically solved by learning a joint embedding space where the semantic
content shared between photo and sketch modalities is preserved. However, a
fundamental challenge in SBIR has been largely ignored so far, that is,
sketches are drawn by humans and considerable style variations exist amongst
different users. An effective SBIR model needs to explicitly account for this
style diversity, crucially, to generalise to unseen user styles. To this end, a
novel style-agnostic SBIR model is proposed. Different from existing models, a
cross-modal variational autoencoder (VAE) is employed to explicitly disentangle
each sketch into a semantic content part shared with the corresponding photo,
and a style part unique to the sketcher. Importantly, to make our model
dynamically adaptable to any unseen user styles, we propose to meta-train our
cross-modal VAE by adding two style-adaptive components: a set of feature
transformation layers to its encoder and a regulariser to the disentangled
semantic content latent code. With this meta-learning framework, our model can
not only disentangle the cross-modal shared semantic content for SBIR, but can
adapt the disentanglement to any unseen user style as well, making the SBIR
model truly style-agnostic. Extensive experiments show that our style-agnostic
model yields state-of-the-art performance for both category-level and
instance-level SBIR.
Authors' comments: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021
Xinping Liu, Zehong Cao
Electroencephalography (EEG) signals recorded while people read natural language are commonly used as a cognitive method to interpret human language understanding in neuroscience and psycholinguistics. Previous studies have demonstrated that human fixation and activation in word reading are associated with some brain regions, but it is not clear when and how to measure the brain dynamics across time and frequency domains. In this study, we propose the first analysis of event-related brain potentials (ERPs) and event-related spectral perturbations (ERSPs) on benchmark datasets that consist of sentence-level simultaneous EEG and related eye-tracking data recorded during natural reading experiment tasks. Our results show peaks evoked at around 162 ms after the stimulus (starting to read each sentence) in the occipital area, indicating that the brain retrieves lexical and semantic visual information, with processing approaching 200 ms from sentence onset. Furthermore, the occipital ERP around 200 ms presents negative power and positive power in short and long reaction times. In addition, the occipital ERSP around 200 ms shows increased high gamma and decreased low beta and low gamma power relative to the baseline. Our results imply that most of the semantic-perception responses occur around 200 ms in the alpha, beta and gamma bands of EEG signals. Our findings may also help promote the evaluation of cognitive natural language processing models from EEG dynamics.
Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, Zhongyuan Wang
Video-Text Retrieval has been a hot research topic with the growth of multimedia data on the internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) exploitation of the transformer architecture, where different layers have different feature characteristics, is limited; 2) the end-to-end training mechanism limits negative sample interactions in a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions on-the-fly, which contributes to the generation of more precise and discriminative representations. Experimental results on the three major Video-Text Retrieval benchmark datasets demonstrate the advantages of our method.
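The two MoCo-style ingredients the abstract above builds on, a momentum-updated key encoder and a fixed-size queue of negatives, can be sketched generically. This is not HiT's implementation: parameters are plain lists of floats, the queue stores placeholder strings instead of embeddings, and the names `momentum_update` and `negative_queue` are hypothetical.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly towards the query encoder: k = m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# A bounded FIFO queue: appending beyond maxlen silently drops the oldest entry,
# so negatives from earlier mini-batches stay available for a while.
negative_queue = deque(maxlen=4)
for step in range(6):
    negative_queue.append(f"batch-{step}")

updated = momentum_update([1.0], [0.0], m=0.9)
```

Because the queue outlives a single mini-batch, the number of negatives seen per step is decoupled from the batch size, which is the property the abstract refers to as large-scale negative sample interactions.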
Jianlin Su, Jiarun Cao, Weijie Liu, Yangyiwen Ou
Pre-trained models such as BERT have achieved great success in many natural
language processing tasks. However, how to obtain better sentence
representations from these pre-trained models is still worth exploring.
Previous work has shown that the anisotropy problem is a critical bottleneck
for BERT-based sentence representations, hindering the model from fully utilizing
the underlying semantic features. Therefore, some attempts to boost the
isotropy of the sentence distribution, such as flow-based models, have been applied
to sentence representations and achieved some improvement. In this paper, we
find that the whitening operation in traditional machine learning can similarly
enhance the isotropy of sentence representations and achieve competitive
results. Furthermore, the whitening technique is also capable of reducing the
dimensionality of the sentence representation. Our experimental results show
that it can not only achieve promising performance but also significantly
reduce the storage cost and accelerate the model retrieval speed.
Authors' comments: The source code of this paper is available at
https://github.com/bojone/BERT-whitening
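The whitening operation described in the abstract above is a standard linear transform, and a minimal NumPy sketch is given here. It is written from the textbook definition rather than taken from the authors' repository, and the function names `fit_whitening` and `apply_whitening` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def fit_whitening(embeddings, k=None):
    """Estimate the mean and a whitening matrix; keep the top-k directions if k is set."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings - mu, rowvar=False)
    # For a symmetric PSD matrix, SVD gives cov = U diag(s) U^T, s sorted descending.
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    if k is not None:
        w = w[:, :k]  # dimensionality reduction: keep the k largest-variance directions
    return mu, w


def apply_whitening(embeddings, mu, w):
    """Center the embeddings and project them with the whitening matrix."""
    return (embeddings - mu) @ w


# Synthetic, correlated "sentence embeddings" (assumed data, 200 vectors of size 4).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4)) * np.array([1.0, 2.0, 3.0, 4.0])
x[:, 1] += x[:, 0]

mu, w = fit_whitening(x)
z = apply_whitening(x, mu, w)

mu2, w2 = fit_whitening(x, k=2)
z_small = apply_whitening(x, mu2, w2)
```

After the transform, the covariance of `z` is the identity matrix, and passing `k` yields shorter vectors, which is how whitening can reduce storage cost and speed up retrieval.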
Marco Faltelli, Giacomo Belocchi, Francesco Quaglia, Salvatore Pontarelli, Giuseppe Bianchi
The increasing performance requirements of modern applications place a significant burden on software-based packet processing. Most of today's software input/output accelerations achieve high performance at the expense of reserving CPU resources dedicated to continuously polling the Network Interface Card. This is specifically the case with DPDK (Data Plane Development Kit), probably the most widely used framework for software-based packet processing today. The approach presented in this paper, descriptively called Metronome, has the dual goals of providing CPU utilization proportional to the load, and allowing flexible sharing of CPU resources between I/O tasks and applications. Metronome replaces DPDK's continuous polling with an intermittent sleep&wake mode, and revolves around a new multi-threaded operation, which improves service continuity. Since the proposed operation trades CPU usage for buffering delay, we propose an analytical model devised to dynamically adapt the sleep&wake parameters to the actual traffic load while providing a target average latency. Our experimental results show a significant reduction in CPU cycles, improvements in power usage, and robustness to CPU sharing even when challenged with CPU-intensive applications.
Florian Boudin, Ygor Gallina
Neural keyphrase generation models have recently attracted much interest due
to their ability to output absent keyphrases, that is, keyphrases that do not
appear in the source text. In this paper, we discuss the usefulness of absent
keyphrases from an Information Retrieval (IR) perspective, and show that the
commonly drawn distinction between present and absent keyphrases is not made
explicit enough. We introduce a finer-grained categorization scheme that sheds
more light on the impact of absent keyphrases on scientific document retrieval.
Under this scheme, we find that only a fraction (around 20%) of the words that
make up keyphrases actually serves as document expansion, but that this small
fraction of words is behind much of the gains observed in retrieval
effectiveness. We also discuss how the proposed scheme can offer a new angle to
evaluate the output of neural keyphrase generation models.
Authors' comments: Accepted at NAACL 2021
Jonathan Herzig, Thomas Müller, Syrine Krichene, Julian Martin Eisenschlos
Recent advances in open-domain QA have led to strong models based on dense
retrieval, but these have focused only on retrieving textual passages. In this work, we
tackle open-domain QA over tables for the first time, and show that retrieval
can be improved by a retriever designed to handle tabular context. We present
an effective pre-training procedure for our retriever and improve retrieval
quality with mined hard negatives. As relevant datasets are missing, we extract
a subset of Natural Questions (Kwiatkowski et al., 2019) into a Table QA
dataset. We find that our retriever improves retrieval results from 72.0 to
81.1 recall@10 and end-to-end QA results from 33.8 to 37.7 exact match, over a
BERT based retriever.
Authors' comments: NAACL 2021 camera ready
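The two metrics reported in the abstract above, recall@10 and exact match, can be sketched per question as follows. Definitions vary between papers, so this is an assumed common variant: recall@k counts a hit when any relevant item appears in the top k, and exact match compares answers after lowercasing and whitespace normalization. The paper's own normalization is not stated in the abstract, and the function names and table ids are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant id appears among the top-k retrieved ids, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    def normalize(text):
        return " ".join(text.lower().split())
    return float(normalize(prediction) in {normalize(g) for g in gold_answers})


hit = recall_at_k(["t7", "t3", "t9"], {"t3"}, k=2)
miss = recall_at_k(["t7", "t3", "t9"], {"t9"}, k=2)
em = exact_match("  Barack  Obama ", ["barack obama"])
```

Averaging these per-question values over a test set gives corpus-level figures of the kind quoted in the abstract.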
Onifade Olufade, Arise Abiola, Ogboo Chisom
Getting relevant information from search engines has been at the heart of research in information retrieval. Query expansion is a retrieval technique that has been studied and shown to improve relevance. Users are required to express their queries as a short list of words, sentences, or questions. With this short format, a large amount of information is lost in translating the information need into the actual query, since users cannot convey all their thoughts in a few words. This often leads to poor query representation, which contributes to poor retrieval effectiveness. This loss of information has made query expansion a strong area of study. This research work focuses on two methods of retrieval for both tweet-length queries and sentence-length queries. Two algorithms have been proposed, and the implementation is expected to produce a better relevance retrieval model than most state-of-the-art relevance models.