Yohan Jo, Haneul Yoo, JinYeong Bak, Alice Oh, Chris Reed, Eduard Hovy
Finding counterevidence to statements is key to many tasks, including
counterargument generation. We build a system that, given a statement,
retrieves counterevidence from diverse sources on the Web. At the core of this
system is a natural language inference (NLI) model that determines whether a
candidate sentence is valid counterevidence or not. Most NLI models to date,
however, lack proper reasoning abilities necessary to find counterevidence that
involves complex inference. Thus, we present a knowledge-enhanced NLI model
that aims to handle causality- and example-based inference by incorporating
knowledge graphs. Our NLI model outperforms baselines for NLI tasks, especially
for instances that require the targeted inference. In addition, this NLI model
further improves the counterevidence retrieval system, notably finding complex
counterevidence better.
Authors' comments: To appear in Findings of EMNLP 2021
Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, Danqi Chen
Open-domain question answering has exploded in popularity recently due to the
success of dense retrieval models, which have surpassed sparse models using
only a few supervised training examples. However, in this paper, we demonstrate
current dense models are not yet the holy grail of retrieval. We first
construct EntityQuestions, a set of simple, entity-rich questions based on
facts from Wikidata (e.g., "Where was Arve Furset born?"), and observe that
dense retrievers drastically underperform sparse methods. We investigate this
issue and uncover that dense retrievers can only generalize to common entities
unless the question pattern is explicitly observed during training. We discuss
two simple solutions towards addressing this critical problem. First, we
demonstrate that data augmentation is unable to fix the generalization problem.
Second, we argue a more robust passage encoder helps facilitate better question
adaptation using specialized question encoders. We hope our work can shed light
on the challenges in creating a robust, universal dense retriever that works
well across different input distributions.
Authors' comments: EMNLP 2021. The code and data are publicly available at
https://github.com/princeton-nlp/EntityQuestions
Ying Wang, Tingzhen Liu, Zepeng Bu, Yuhui Huang, Lizhong Gao, Qiao Wang
In large-scale image retrieval, many indexing methods have been proposed to
narrow down the searching scope of retrieval. The features extracted from
images usually are of high dimensions or unfixed sizes due to the existence of
key points. Most existing index structures suffer from the curse of
dimensionality, the unfixed feature size, and/or the loss of semantic
similarity. In this paper,
a new classification-based indexing structure, called Semantic Indexing
Structure (SIS), is proposed, in which we utilize the semantic categories
rather than clustering centers to create database partitions, such that the
proposed index SIS can be combined with feature extractors without the
restriction of dimensions. In addition, we observe that the size of each
semantic partition is positively correlated with the semantic distribution of
the database. Building on this observation, we found that when the partition
number is normalized to five, the proposed algorithm performed very well in all
the tests. Compared with state-of-the-art models, SIS achieves outstanding
performance.
Authors' comments: 12 pages, 6 figures
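The idea of partitioning a database by semantic category rather than by clustering centers can be illustrated with a minimal sketch. It is not the authors' SIS implementation: the names `build_semantic_index` and `search` are hypothetical, the category labels are assumed to come from some upstream classifier, and cosine similarity is an assumed ranking function.

```python
import math
from collections import defaultdict


def build_semantic_index(items):
    """Group (item_id, category, feature) triples into one partition per category."""
    index = defaultdict(list)
    for item_id, category, feature in items:
        index[category].append((item_id, feature))
    return index


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query_category, query_feature, top_k=3):
    """Rank only the items stored in the query's predicted category partition."""
    # Assumption: query_category was predicted by a classifier outside this sketch.
    candidates = index.get(query_category, [])
    ranked = sorted(candidates, key=lambda entry: cosine(query_feature, entry[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:top_k]]


items = [
    ("a", "cat", [1.0, 0.0]),
    ("b", "cat", [0.6, 0.8]),
    ("c", "dog", [1.0, 0.0]),
]
index = build_semantic_index(items)
print(search(index, "cat", [1.0, 0.0], top_k=2))
```

Because the partition key is a label rather than a centroid in feature space, the features stored inside a partition can have any dimensionality the extractor produces, which is the property the abstract highlights.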
Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge
Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang
Software developers write a lot of source code and documentation during
software development. Intrinsically, developers often recall parts of source
code or code summaries that they had written in the past while implementing
software or documenting them. To mimic developers' code or summary generation
behavior, we propose a retrieval augmented framework, REDCODER, that retrieves
relevant code or summaries from a retrieval database and provides them as a
supplement to code generation or summarization models. REDCODER is unique in
two ways. First, it extends the state-of-the-art dense retrieval technique to
search for relevant code or summaries. Second, it can work with retrieval
databases that include unimodal (only code or natural language description) or
bimodal instances (code-description pairs). We conduct experiments and
extensive analysis on two benchmark datasets of code generation and
summarization in Java and Python, and the promising results endorse the
effectiveness of our proposed retrieval augmented framework.
Authors' comments: accepted in EMNLP-Findings 2021
Antoine Louis, Gerasimos Spanakis
Statutory article retrieval is the task of automatically retrieving law
articles relevant to a legal question. While recent advances in natural
language processing have sparked considerable interest in many legal tasks,
statutory article retrieval remains primarily untouched due to the scarcity of
large-scale and high-quality annotated datasets. To address this bottleneck, we
introduce the Belgian Statutory Article Retrieval Dataset (BSARD), which
consists of 1,100+ French native legal questions labeled by experienced jurists
with relevant articles from a corpus of 22,600+ Belgian law articles. Using
BSARD, we benchmark several state-of-the-art retrieval approaches, including
lexical and dense architectures, both in zero-shot and supervised setups. We
find that fine-tuned dense retrieval models significantly outperform other
systems. Our best performing baseline achieves 74.8% R@100, which is promising
for the feasibility of the task and indicates there is still room for
improvement. Given the specificity of the domain and the task addressed, BSARD
presents a unique challenge for future research on legal information
retrieval. Our dataset and source code are publicly available.
Authors' comments: ACL 2022. Code and dataset are available at
https://github.com/maastrichtlawtech/bsard
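For readers unfamiliar with the R@100 figure quoted above, recall at cutoff k can be computed as in the sketch below. The function names are hypothetical, and averaging recall over questions (macro-averaging) is an assumption; the abstract does not state the exact evaluation protocol.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant articles that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per question."""
    # Assumption: per-question recall values are averaged with equal weight.
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


runs = [
    (["a", "b", "c"], ["a", "c", "d"]),  # two of three relevant articles within the top 3
    (["x", "y", "z"], ["y"]),            # the single relevant article is at rank 2
]
print(mean_recall_at_k(runs, k=3))
```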
Nicola Tonellotto, Craig Macdonald
Recent advances in dense retrieval techniques have offered the promise of being able not just to re-rank documents using contextualised language models such as BERT, but also to use such models to identify documents from the collection in the first place. However, when using dense retrieval approaches that use multiple embedded representations for each query, a large number of documents can be retrieved for each query, hindering the efficiency of the method. Hence, this work is the first to consider efficiency improvements in the context of a dense retrieval approach (namely ColBERT), by pruning query term embeddings that are estimated not to be useful for retrieving relevant documents. Our proposed query embedding pruning reduces the cost of the dense retrieval operation, as well as the number of documents that are retrieved and hence need to be fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, when reducing the number of query embeddings used from 32 to 3 based on the collection frequency of the corresponding tokens, query embedding pruning results in no statistically significant differences in effectiveness, while reducing the number of documents retrieved by 70%. In terms of mean response time for the end-to-end system, this results in a 2.65x speedup.
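A minimal sketch of pruning query positions by collection frequency follows. It is an illustration, not the authors' code: the function name `select_query_positions` is hypothetical, and keeping the tokens with the lowest collection frequency (the rarest ones) is an assumption, since the abstract only says that pruning is based on collection frequency.

```python
def select_query_positions(query_tokens, collection_freq, keep=3):
    """Return the positions of the `keep` query tokens to retain, in query order."""
    # Assumption: rarer tokens (lower collection frequency) are the more useful ones.
    # Tokens missing from the table are treated as frequency 0, i.e. rarest.
    by_rarity = sorted(
        range(len(query_tokens)),
        key=lambda i: (collection_freq.get(query_tokens[i], 0), i),
    )
    return sorted(by_rarity[:keep])


tokens = ["what", "is", "the", "capital", "of", "france"]
freq = {"what": 900, "is": 5000, "the": 9000, "capital": 40, "of": 8000, "france": 25}
positions = select_query_positions(tokens, freq, keep=3)
print([tokens[i] for i in positions])
```

In a multi-embedding retriever, the returned positions would select which per-token query embeddings are sent to the nearest-neighbour search, so fewer candidate documents are gathered before full scoring.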
Yuhao Zhou, Huanhuan Fan, Shuang Gao, Yuchen Yang, Xudong Zhang, Jijunnan Li, Yandong Guo
Accurate visual re-localization is critical to many artificial
intelligence applications, such as augmented reality, virtual reality, robotics
and autonomous driving. To accomplish this task, we propose an integrated
visual re-localization method called RLOCS by combining image retrieval,
semantic consistency and geometry verification to achieve accurate estimations.
The localization pipeline is designed as a coarse-to-fine paradigm. In the
retrieval part, we cascade the architecture of ResNet101-GeM-ArcFace and employ
DBSCAN followed by spatial verification to obtain a better initial coarse pose.
We design a module called observation constraints, which combines geometry
information and semantic consistency for filtering outliers. Comprehensive
experiments are conducted on open datasets, including retrieval on R-Oxford5k
and R-Paris6k, semantic segmentation on Cityscapes, localization on Aachen
Day-Night and InLoc. By creatively modifying separate modules in the total
pipeline, our method achieves many performance improvements on the challenging
localization benchmarks.
Authors' comments: Accepted by the 2021 International Conference on Robotics and
Automation (ICRA2021)
Qinghong Lin, Xiaojun Chen, Qin Zhang, Shangxuan Tian, Yudong Chen
Hashing technology has been widely used in image retrieval due to its
computational and storage efficiency. Recently, deep unsupervised hashing
methods have attracted increasing attention due to the high cost of human
annotations in the real world and the superiority of deep learning technology.
However, most deep unsupervised hashing methods pre-compute a
similarity matrix to model the pairwise relationship in the pre-trained feature
space. This similarity matrix is then used to guide hash learning, in
which most of the data pairs are treated equivalently. This process has the
following defects: 1) The pre-computed similarity matrix is
inalterable and disconnected from the hash learning process, which cannot
explore the underlying semantic information. 2) The informative data pairs may
be buried by the large number of less-informative data pairs. To solve the
aforementioned problems, we propose a Deep Self-Adaptive Hashing (DSAH) model
to adaptively capture the semantic information with two special designs:
Adaptive Neighbor Discovery (AND) and Pairwise Information Content (PIC).
Firstly, we adopt the AND to initially construct a neighborhood-based
similarity matrix, and then refine this initial similarity matrix with a novel
update strategy to further investigate the semantic structure behind the
learned representation. Secondly, we measure the priorities of data pairs with
PIC and assign adaptive weights to them, which relies on the assumption that
more dissimilar data pairs contain more discriminative information for hash
learning. Extensive experiments on several datasets demonstrate that the above
two techniques help the deep hashing model achieve superior
performance.
Authors' comments: 10 pages, 11 figures, 4 tables
Paul Grundmann, Sebastian Arnold, Alexander Löser
Retrieving answer passages from long documents is a complex task requiring semantic understanding of both discourse and document context. We approach this challenge specifically in a clinical scenario, where doctors retrieve cohorts of patients based on diagnoses and other latent medical aspects. We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching. In addition, we contribute a novel retrieval dataset based on clinical notes to simulate this scenario on a large corpus of clinical notes. We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders. From our extensive evaluation on MIMIC-III and three other healthcare datasets, we report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages. This makes the model powerful especially in zero-shot scenarios where only limited training data is available.
Jie Lei, Tamara L. Berg, Mohit Bansal
We introduce mTVR, a large-scale multilingual video moment retrieval dataset,
containing 218K English and Chinese queries from 21.8K TV show video clips. The
dataset is collected by extending the popular TVR dataset (in English) with
paired Chinese queries and subtitles. Compared to existing moment retrieval
datasets, mTVR is multilingual, larger, and comes with diverse annotations. We
further propose mXML, a multilingual moment retrieval model that learns and
operates on data from both languages, via encoder parameter sharing and
language neighborhood constraints. We demonstrate the effectiveness of mXML on
the newly collected mTVR dataset, where mXML outperforms strong monolingual
baselines while using fewer parameters. In addition, we provide detailed
dataset analyses and model ablations. Data and code are publicly available at
https://github.com/jayleicn/mTVRetrieval
Authors' comments: ACL 2021 (9 pages, 4 figures)
Gia Dvali
We present certain universal bounds on the capacity of quantum information
storage and on the time scale of its retrieval for a generic quantum field
theoretic system. The capacity, quantified by the microstate entropy, is
bounded from above by the surface area of the object measured in units of a
Goldstone decay constant. The Goldstone bosons are universally present due to
the spontaneous breaking of Poincare and internal symmetries by the
information-storing object. Applied to a black hole, the bound reproduces the
Bekenstein-Hawking entropy. However, the relation goes beyond gravity. The
minimal time-scale required for retrieving the quantum information from a
system is equal to its volume measured in units of the same Goldstone scale.
For a black hole this reproduces the Page time as well as the quantum
break-time. The same expression for the information retrieval time is shared by
non-gravitational saturated states in gauge theories, including QCD. The
saturated objects exhibit some universal signatures such as the emission of
ultra-soft radiation. Similar bounds apply to non-relativistic many-body
systems.
Authors' comments: 5 pages
Yinqiong Cai, Yixing Fan, Jiafeng Guo, Ruqing Zhang, Yanyan Lan, Xueqi Cheng
Similar question retrieval is a core task in community-based question
answering (CQA) services. To balance the effectiveness and efficiency, the
question retrieval system is typically implemented as multi-stage rankers: The
first-stage ranker aims to recall potentially relevant questions from a large
repository, and the latter stages attempt to re-rank the retrieved results.
Most existing works on question retrieval mainly focused on the re-ranking
stages, leaving the first-stage ranker to some traditional term-based methods.
However, term-based methods often suffer from the vocabulary mismatch problem,
especially on short texts, which may block the re-rankers from relevant
questions at the very beginning. An alternative is to employ embedding-based
methods for the first-stage ranker, which compress texts into dense vectors to
enhance the semantic matching. However, these methods often lack the
discriminative power of term-based methods, and thus introduce noise during
retrieval and hurt the recall performance. In this work, we aim to tackle the
dilemma of the first-stage ranker, and propose a discriminative semantic
ranker, namely DenseTrans, for high-recall retrieval. Specifically, DenseTrans
is a densely connected Transformer, which learns semantic embeddings for texts
based on Transformer layers. Meanwhile, DenseTrans promotes low-level features
through dense connections to keep the discriminative power of the learned
representations. DenseTrans is inspired by DenseNet in computer vision (CV),
but poses a new way to use the dense connectivity which is totally different
from its original design purpose. Experimental results over two question
retrieval benchmark datasets show that our model can obtain significant gain on
recall against strong term-based methods as well as state-of-the-art
embedding-based methods.
Authors' comments: ICTIR'21
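The multi-stage setup described above, and the vocabulary mismatch problem of a term-based first stage, can be sketched as follows. This is a toy illustration under stated assumptions, not DenseTrans: `first_stage`, `rerank`, and the term-overlap scoring are hypothetical stand-ins, and the re-ranking score function is supplied by the caller.

```python
def first_stage(query, corpus, recall_size):
    """Term-based recall: keep the questions sharing the most terms with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap:  # questions with no shared term can never be recalled
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:recall_size]]


def rerank(query, candidates, corpus, score_fn, top_k):
    """Later stage: re-order only the recalled candidates with a costlier scorer."""
    ranked = sorted(candidates, key=lambda doc_id: score_fn(query, corpus[doc_id]), reverse=True)
    return ranked[:top_k]


corpus = {
    "q1": "how to reset my password",
    "q2": "password recovery steps",
    "q3": "best pizza in town",
}
recalled = first_stage("reset password", corpus, recall_size=2)
print(recalled)
# Vocabulary mismatch: a paraphrase with no shared terms recalls nothing,
# so no re-ranker downstream can recover the relevant questions.
print(first_stage("forgot login", corpus, recall_size=2))
```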
Young Kyun Jang, Nam Ik Cho
Face image retrieval, which searches for images of the same identity from the
query input face image, is drawing more attention as the size of the image
database increases rapidly. In order to conduct fast and accurate retrieval,
compact hash code-based methods have been proposed, and recently, deep face
image hashing methods with supervised classification training have shown
outstanding performance. However, the classification-based scheme has a
disadvantage in that it cannot incorporate complex similarities between face
images into the hash code learning. In this paper, we attempt to improve the face
image retrieval quality by proposing a Similarity Guided Hashing (SGH) method,
which gently considers self and pairwise-similarity simultaneously. SGH employs
various data augmentations designed to explore elaborate similarities between
face images, solving both intra and inter identity-wise difficulties. Extensive
experimental results on the protocols with existing benchmarks and an
additionally proposed large scale higher resolution face image dataset
demonstrate that our SGH delivers state-of-the-art retrieval performance.
Authors' comments: 10 pages, 9 figures
Amit Kumar Nath, Andy Wang
The number of photographs taken worldwide is growing rapidly and steadily. While a small subset of these images is annotated and shared by users through social media platforms, due to the sheer number of images in personal photo repositories (shared or not shared), finding specific images remains challenging. This survey explores existing image retrieval techniques as well as photo-organizer applications to highlight their relative strengths in addressing this challenge.
Hong-Gyu Yoon, Pilwon Kim
Spike-timing-dependent plasticity (STDP) is a biological process in which the
precise order and timing of neuronal spikes affect the degree of synaptic
modification. While there have been numerous studies focusing on the role of
STDP in neural coding, the functional implications of STDP at the macroscopic
level in the brain have not been fully explored yet. In this work, we propose a
neurodynamical model based on STDP that renders storage and retrieval of a
group of associative memories. We showed that the function of STDP at the
macroscopic level is to form a "memory plane" in the neural state space which
dynamically encodes high dimensional data. We derived the analytic relation
between the input, the memory plane, and the induced macroscopic neural
oscillations around the memory plane. Such a plane produces a limit cycle in
reaction to a similar memory cue, which can be used for retrieval of the
original input.
Authors' comments: 7 pages of main article, 12 pages of appendices
Alison Wong, Benjamin Pope, Louis Desdoigts, Peter Tuthill, Barnaby Norris, Chris Betters
The principal limitation in many areas of astronomy, especially for directly imaging exoplanets, arises from instability in the point spread function (PSF) delivered by the telescope and instrument. To understand the transfer function, it is often necessary to infer a set of optical aberrations given only the intensity distribution on the sensor - the problem of phase retrieval. This can be important for post-processing of existing data, or for the design of optical phase masks to engineer PSFs optimized to achieve high contrast, angular resolution, or astrometric stability. By exploiting newly efficient and flexible technology for automatic differentiation, which in recent years has undergone rapid development driven by machine learning, we can perform both phase retrieval and design in a way that is systematic, user-friendly, fast, and effective. By using modern gradient descent techniques, this converges efficiently and is easily extended to incorporate constraints and regularization. We illustrate the wide-ranging potential for this approach using our new package, Morphine. Challenging applications performed with this code include precise phase retrieval for both discrete and continuous phase distributions, even where information has been censored such as heavily-saturated sensor data. We also show that the same algorithms can optimize continuous or binary phase masks that are competitive with existing best solutions for two example problems: an Apodizing Phase Plate (APP) coronagraph for exoplanet direct imaging, and a diffractive pupil for narrow-angle astrometry. The Morphine source code and examples are available open-source, with a similar interface to the popular physical optics package Poppy.
Florian Boudin, Ygor Gallina, Akiko Aizawa
Sequence-to-sequence models have led to significant progress in keyphrase
generation, but it remains unknown whether they are reliable enough to be
beneficial for document retrieval. This study provides empirical evidence that
such models can significantly improve retrieval performance, and introduces a
new extrinsic evaluation framework that allows for a better understanding of
the limitations of keyphrase generation models. Using this framework, we point
out and discuss the difficulties encountered with supplementing documents with
keyphrases that are not present in the text, and with generalizing models
across domains.
Our code is available at https://github.com/boudinfl/ir-using-kg
Authors' comments: Accepted at ACL 2020
Sen Li, Fuyu Lv, Taiwei Jin, Guli Lin, Keping Yang, Xiaoyi Zeng, Xiao-Ming Wu, Qianli Ma
Nowadays, the product search service of e-commerce platforms has become a
vital shopping channel in people's lives. The retrieval phase of products
determines the search system's quality and gradually attracts researchers'
attention. Retrieving the most relevant products from a large-scale corpus
while preserving personalized user characteristics remains an open question.
Recent approaches in this domain have mainly focused on embedding-based
retrieval (EBR) systems. However, after a long period of practice on Taobao, we
find that the performance of the EBR system is dramatically degraded due to
its: (1) low relevance with a given query and (2) discrepancy between the
training and inference phases. Therefore, we propose a novel and practical
embedding-based product retrieval model, named Multi-Grained Deep Semantic
Product Retrieval (MGDSPR). Specifically, we first identify the inconsistency
between the training and inference stages, and then use the softmax
cross-entropy loss as the training objective, which achieves better performance
and faster convergence. Two efficient methods are further proposed to improve
retrieval relevance, including smoothing noisy training data and generating
relevance-improving hard negative samples without requiring extra knowledge and
training procedures. We evaluate MGDSPR on Taobao Product Search and observe
significant metric gains in offline experiments and online A/B tests.
MGDSPR has been successfully deployed to the existing multi-channel retrieval
system in Taobao Search. We also introduce the online deployment scheme and
share practical lessons of our retrieval system to contribute to the community.
Authors' comments: 9 pages, accepted by KDD2021
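The softmax cross-entropy objective mentioned above can be sketched for a single query as follows. This is a generic formulation, not MGDSPR's training code: the function name is hypothetical, and the scores are assumed to be query-item similarities for one clicked (positive) item and a set of sampled negatives. Any temperature or sampling scheme used in the paper is not reflected here.

```python
import math


def softmax_cross_entropy(scores, positive_index):
    """Negative log-probability of the positive item under a softmax over all scores."""
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[positive_index]


# Hypothetical similarity scores: index 0 is the positive item, the rest are negatives.
print(softmax_cross_entropy([2.0, 0.5, -1.0], positive_index=0))
```

The loss shrinks as the positive item's score rises relative to the negatives, so training scores items against each other in the same way retrieval ranks them at inference time.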
Ruoyuan Gao, Yingqiang Ge, Chirag Shah
With the emerging need to create fairness-aware solutions for search and
recommendation systems, evaluating such solutions poses a daunting
challenge. While many of the traditional information retrieval (IR) metrics can
capture the relevance, diversity, and novelty for the utility with respect to
users, they are not suitable for inferring whether the presented results are
fair from the perspective of responsible information exposure. On the other
hand, existing fairness metrics do not account for user utility or do not
measure it adequately. To address this problem, we propose a new metric called
FAIR. By unifying standard IR metrics and fairness measures into an integrated
metric, this metric offers a new perspective for evaluating fairness-aware
ranking results. Based on this metric, we developed an effective ranking
algorithm that jointly optimized user utility and fairness. The experimental
results showed that our FAIR metric could highlight results with good user
utility and fair information exposure. We showed how FAIR related to a set of
existing utility and fairness metrics and demonstrated the effectiveness of our
FAIR-based algorithm. We believe our work opens up a new direction of pursuing
a metric for evaluating and implementing the FAIR systems.
Authors' comments: Published in The Journal of the Association for Information Science
and Technology