Qinghong Lin, Xiaojun Chen, Qin Zhang, Shangxuan Tian, Yudong Chen
Hashing technology has been widely used in image retrieval due to its
computational and storage efficiency. Recently, deep unsupervised hashing
methods have attracted increasing attention due to the high cost of human
annotations in the real world and the superiority of deep learning technology.
However, most deep unsupervised hashing methods usually pre-compute a
similarity matrix to model the pairwise relationship in the pre-trained feature
space. Then this similarity matrix would be used to guide hash learning, in
which most of the data pairs are treated equivalently. The above process is
confronted with the following defects: 1) The pre-computed similarity matrix is
inalterable and disconnected from the hash learning process, which cannot
explore the underlying semantic information. 2) The informative data pairs may
be buried by the large number of less-informative data pairs. To solve the
aforementioned problems, we propose a Deep Self-Adaptive Hashing (DSAH) model
to adaptively capture the semantic information with two special designs:
Adaptive Neighbor Discovery (AND) and Pairwise Information Content (PIC).
Firstly, we adopt the AND to initially construct a neighborhood-based
similarity matrix, and then refine this initial similarity matrix with a novel
update strategy to further investigate the semantic structure behind the
learned representation. Secondly, we measure the priorities of data pairs with
PIC and assign adaptive weights to them, which relies on the assumption that
more dissimilar data pairs contain more discriminative information for hash
learning. Extensive experiments on several datasets demonstrate that the above
two technologies facilitate the deep hashing model to achieve superior
performance.
Authors' comments: 10 pages, 11 figures, 4 tables
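To make the baseline pipeline criticized in the abstract above concrete, here is a minimal sketch of a pre-computed, fixed neighborhood similarity matrix together with the Hamming distance used to compare hash codes. It is an illustration under assumptions, not the DSAH method: the function names, the cosine measure, and the value of k are hypothetical, and the adaptive update (AND) and pair weighting (PIC) are not implemented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_similarity(features, k):
    """Mark each item's k nearest neighbours (by cosine) as similar.

    Assumption: this fixed 0/1 matrix stands in for the pre-computed
    similarity matrix that the abstract says is usually left unchanged.
    """
    n = len(features)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        scores = [(cosine(features[i], features[j]), j) for j in range(n) if j != i]
        scores.sort(reverse=True)
        for _, j in scores[:k]:
            sim[i][j] = 1
    return sim

def hamming(code_a, code_b):
    """Number of positions in which two binary hash codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))

# Toy pre-trained features: two clusters of two points each (hypothetical data).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
S = knn_similarity(features, k=1)
```

In this toy setting every pair has the same weight in the matrix, which is the "treated equivalently" behaviour that the abstract identifies as a defect.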
Paul Grundmann, Sebastian Arnold, Alexander Löser
Retrieving answer passages from long documents is a complex task requiring semantic understanding of both discourse and document context. We approach this challenge specifically in a clinical scenario, where doctors retrieve cohorts of patients based on diagnoses and other latent medical aspects. We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching. In addition, we contribute a novel retrieval dataset based on clinical notes to simulate this scenario on a large corpus of clinical notes. We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders. From our extensive evaluation on MIMIC-III and three other healthcare datasets, we report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages. This makes the model powerful especially in zero-shot scenarios where only limited training data is available.
Jie Lei, Tamara L. Berg, Mohit Bansal
We introduce mTVR, a large-scale multilingual video moment retrieval dataset,
containing 218K English and Chinese queries from 21.8K TV show video clips. The
dataset is collected by extending the popular TVR dataset (in English) with
paired Chinese queries and subtitles. Compared to existing moment retrieval
datasets, mTVR is multilingual, larger, and comes with diverse annotations. We
further propose mXML, a multilingual moment retrieval model that learns and
operates on data from both languages, via encoder parameter sharing and
language neighborhood constraints. We demonstrate the effectiveness of mXML on
the newly collected mTVR dataset, where mXML outperforms strong monolingual
baselines while using fewer parameters. In addition, we also provide detailed
dataset analyses and model ablations. Data and code are publicly available at
https://github.com/jayleicn/mTVRetrieval
Authors' comments: ACL 2021 (9 pages, 4 figures)
Gia Dvali
We present certain universal bounds on the capacity of quantum information
storage and on the time scale of its retrieval for a generic quantum field
theoretic system. The capacity, quantified by the microstate entropy, is
bounded from above by the surface area of the object measured in units of a
Goldstone decay constant. The Goldstone bosons are universally present due to
the spontaneous breaking of Poincare and internal symmetries by the
information-storing object. Applied to a black hole, the bound reproduces the
Bekenstein-Hawking entropy. However, the relation goes beyond gravity. The
minimal time-scale required for retrieving the quantum information from a
system is equal to its volume measured in units of the same Goldstone scale.
For a black hole this reproduces the Page time as well as the quantum
break-time. The same expression for the information retrieval time is shared by
non-gravitational saturated states in gauge theories, including QCD. The
saturated objects exhibit some universal signatures such as the emission of
ultra-soft radiation. Similar bounds apply to non-relativistic many-body
systems.
Authors' comments: 5 pages
Yinqiong Cai, Yixing Fan, Jiafeng Guo, Ruqing Zhang, Yanyan Lan, Xueqi Cheng
Similar question retrieval is a core task in community-based question
answering (CQA) services. To balance the effectiveness and efficiency, the
question retrieval system is typically implemented as multi-stage rankers: The
first-stage ranker aims to recall potentially relevant questions from a large
repository, and the latter stages attempt to re-rank the retrieved results.
Most existing works on question retrieval mainly focused on the re-ranking
stages, leaving the first-stage ranker to some traditional term-based methods.
However, term-based methods often suffer from the vocabulary mismatch problem,
especially on short texts, which may block the re-rankers from relevant
questions at the very beginning. An alternative is to employ embedding-based
methods for the first-stage ranker, which compress texts into dense vectors to
enhance the semantic matching. However, these methods often lose the
discriminative power of term-based methods, thus introducing noise during
retrieval and hurting the recall performance. In this work, we aim to tackle the
dilemma of the first-stage ranker, and propose a discriminative semantic
ranker, namely DenseTrans, for high-recall retrieval. Specifically, DenseTrans
is a densely connected Transformer, which learns semantic embeddings for texts
based on Transformer layers. Meanwhile, DenseTrans promotes low-level features
through dense connections to keep the discriminative power of the learned
representations. DenseTrans is inspired by DenseNet in computer vision (CV),
but poses a new way to use the dense connectivity which is totally different
from its original design purpose. Experimental results over two question
retrieval benchmark datasets show that our model can obtain significant gain on
recall against strong term-based methods as well as state-of-the-art
embedding-based methods.
Authors' comments: ICTIR'21
Young Kyun Jang, Nam Ik Cho
Face image retrieval, which searches for images of the same identity from the
query input face image, is drawing more attention as the size of the image
database increases rapidly. In order to conduct fast and accurate retrieval,
compact hash code-based methods have been proposed, and recently, deep face
image hashing methods with supervised classification training have shown
outstanding performance. However, the classification-based scheme has a
disadvantage in that it cannot incorporate complex similarities between face
images into the hash code learning. In this paper, we attempt to improve the face
image retrieval quality by proposing a Similarity Guided Hashing (SGH) method,
which gently considers self and pairwise-similarity simultaneously. SGH employs
various data augmentations designed to explore elaborate similarities between
face images, solving both intra and inter identity-wise difficulties. Extensive
experimental results on the protocols with existing benchmarks and an
additionally proposed large scale higher resolution face image dataset
demonstrate that our SGH delivers state-of-the-art retrieval performance.
Authors' comments: 10 pages, 9 figures
Amit Kumar Nath, Andy Wang
The number of photographs taken worldwide is growing rapidly and steadily. While a small subset of these images is annotated and shared by users through social media platforms, due to the sheer number of images in personal photo repositories (shared or not shared), finding specific images remains challenging. This survey explores existing image retrieval techniques as well as photo-organizer applications to highlight their relative strengths in addressing this challenge.
Hong-Gyu Yoon, Pilwon Kim
Spike-timing-dependent plasticity (STDP) is a biological process in which the
precise order and timing of neuronal spikes affect the degree of synaptic
modification. While there have been numerous studies focusing on the role of
STDP in neural coding, the functional implications of STDP at the macroscopic
level in the brain have not been fully explored yet. In this work, we propose a
neurodynamical model based on STDP that renders storage and retrieval of a
group of associative memories. We showed that the function of STDP at the
macroscopic level is to form a "memory plane" in the neural state space which
dynamically encodes high dimensional data. We derived the analytic relation
between the input, the memory plane, and the induced macroscopic neural
oscillations around the memory plane. Such a plane produces a limit cycle in
reaction to a similar memory cue, which can be used for retrieval of the
original input.
Authors' comments: 7 pages of main article, 12 pages of appendices
Alison Wong, Benjamin Pope, Louis Desdoigts, Peter Tuthill, Barnaby Norris, Chris Betters
The principal limitation in many areas of astronomy, especially for directly imaging exoplanets, arises from instability in the point spread function (PSF) delivered by the telescope and instrument. To understand the transfer function, it is often necessary to infer a set of optical aberrations given only the intensity distribution on the sensor - the problem of phase retrieval. This can be important for post-processing of existing data, or for the design of optical phase masks to engineer PSFs optimized to achieve high contrast, angular resolution, or astrometric stability. By exploiting newly efficient and flexible technology for automatic differentiation, which in recent years has undergone rapid development driven by machine learning, we can perform both phase retrieval and design in a way that is systematic, user-friendly, fast, and effective. By using modern gradient descent techniques, this converges efficiently and is easily extended to incorporate constraints and regularization. We illustrate the wide-ranging potential for this approach using our new package, Morphine. Challenging applications performed with this code include precise phase retrieval for both discrete and continuous phase distributions, even where information has been censored such as heavily-saturated sensor data. We also show that the same algorithms can optimize continuous or binary phase masks that are competitive with existing best solutions for two example problems: an Apodizing Phase Plate (APP) coronagraph for exoplanet direct imaging, and a diffractive pupil for narrow-angle astrometry. The Morphine source code and examples are available open-source, with a similar interface to the popular physical optics package Poppy.
Florian Boudin, Ygor Gallina, Akiko Aizawa
Sequence-to-sequence models have led to significant progress in keyphrase
generation, but it remains unknown whether they are reliable enough to be
beneficial for document retrieval. This study provides empirical evidence that
such models can significantly improve retrieval performance, and introduces a
new extrinsic evaluation framework that allows for a better understanding of
the limitations of keyphrase generation models. Using this framework, we point
out and discuss the difficulties encountered with supplementing documents with
keyphrases that are not present in the text, and with generalizing models across domains.
Our code is available at https://github.com/boudinfl/ir-using-kg
Authors' comments: Accepted at ACL 2020
Sen Li, Fuyu Lv, Taiwei Jin, Guli Lin, Keping Yang, Xiaoyi Zeng, Xiao-Ming Wu, Qianli Ma
Nowadays, the product search service of e-commerce platforms has become a
vital shopping channel in people's lives. The retrieval phase of products
determines the search system's quality and gradually attracts researchers'
attention. Retrieving the most relevant products from a large-scale corpus
while preserving personalized user characteristics remains an open question.
Recent approaches in this domain have mainly focused on embedding-based
retrieval (EBR) systems. However, after a long period of practice on Taobao, we
find that the performance of the EBR system is dramatically degraded due to
its: (1) low relevance to a given query and (2) discrepancy between the
training and inference phases. Therefore, we propose a novel and practical
embedding-based product retrieval model, named Multi-Grained Deep Semantic
Product Retrieval (MGDSPR). Specifically, we first identify the inconsistency
between the training and inference stages, and then use the softmax
cross-entropy loss as the training objective, which achieves better performance
and faster convergence. Two efficient methods are further proposed to improve
retrieval relevance, including smoothing noisy training data and generating
relevance-improving hard negative samples without requiring extra knowledge and
training procedures. We evaluate MGDSPR on Taobao Product Search with
significant metrics gains observed in offline experiments and online A/B tests.
MGDSPR has been successfully deployed to the existing multi-channel retrieval
system in Taobao Search. We also introduce the online deployment scheme and
share practical lessons of our retrieval system to contribute to the community.
Authors' comments: 9 pages, accepted by KDD2021
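The abstract above names softmax cross-entropy as the training objective for embedding-based retrieval. The following is a minimal sketch of that loss for one query scored against a set of candidate items. It is not the MGDSPR implementation: the dot-product scoring, the `temperature` parameter, and the function names are assumptions made for illustration.

```python
import math

def dot(u, v):
    """Inner product used here as the query-item relevance score (an assumption)."""
    return sum(a * b for a, b in zip(u, v))

def softmax_cross_entropy(query, items, positive_index, temperature=1.0):
    """Negative log-probability of the positive item under a softmax over all items.

    `temperature` is a hypothetical scaling knob, not taken from the abstract.
    """
    logits = [dot(query, item) / temperature for item in items]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[positive_index]

# Toy example: the first item is aligned with the query, the second is not.
query = [1.0, 0.0]
items = [[1.0, 0.0], [0.0, 1.0]]
loss_when_clicked_item_matches = softmax_cross_entropy(query, items, 0)
loss_when_clicked_item_differs = softmax_cross_entropy(query, items, 1)
```

Because the softmax normalizes over all candidates at once, training scores an item against the whole candidate set, which resembles how retrieval ranks items at inference time.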
Ruoyuan Gao, Yingqiang Ge, Chirag Shah
With the emerging needs of creating fairness-aware solutions for search and
recommendation systems, a daunting challenge exists of evaluating such
solutions. While many of the traditional information retrieval (IR) metrics can
capture the relevance, diversity, and novelty for the utility with respect to
users, they are not suitable for inferring whether the presented results are
fair from the perspective of responsible information exposure. On the other
hand, existing fairness metrics do not account for user utility or do not
measure it adequately. To address this problem, we propose a new metric called
FAIR. By unifying standard IR metrics and fairness measures into an integrated
metric, this metric offers a new perspective for evaluating fairness-aware
ranking results. Based on this metric, we developed an effective ranking
algorithm that jointly optimized user utility and fairness. The experimental
results showed that our FAIR metric could highlight results with good user
utility and fair information exposure. We showed how FAIR related to a set of
existing utility and fairness metrics and demonstrated the effectiveness of our
FAIR-based algorithm. We believe our work opens up a new direction of pursuing
a metric for evaluating and implementing the FAIR systems.
Authors' comments: Published in The Journal of the Association for Information Science
and Technology
Xun Yang, Fuli Feng, Wei Ji, Meng Wang, Tat-Seng Chua
We tackle the task of video moment retrieval (VMR), which aims to localize a
specific moment in a video according to a textual query. Existing methods
primarily model the matching relationship between query and moment by complex
cross-modal interactions. Despite their effectiveness, current models mostly
exploit dataset biases while ignoring the video content, thus leading to poor
generalizability. We argue that the issue is caused by the hidden confounder in
VMR, i.e., the temporal location of moments, that spuriously correlates the model
input and prediction. How to design robust matching models against the temporal
location biases is crucial but, as far as we know, has not been studied yet for
VMR.
To fill the research gap, we propose a causality-inspired VMR framework that
builds a structural causal model to capture the true effect of query and video
content on the prediction. Specifically, we develop a Deconfounded Cross-modal
Matching (DCM) method to remove the confounding effects of moment location. It
first disentangles moment representation to infer the core feature of visual
content, and then applies causal intervention on the disentangled multimodal
input based on backdoor adjustment, which forces the model to fairly
incorporate each possible location of the target into consideration. Extensive
experiments clearly show that our approach can achieve significant improvement
over the state-of-the-art methods in terms of both accuracy and generalization
(Code: https://github.com/Xun-Yang/Causal_Video_Moment_Retrieval).
Authors' comments: This work has been accepted by SIGIR 2021
Han Wang, Yang Liu, Chenguang Zhu, Linjun Shou, Ming Gong, Yichong Xu, Michael Zeng
Commonsense generation is a challenging task of generating a plausible
sentence describing an everyday scenario using provided concepts. Its
requirement of reasoning over commonsense knowledge and compositional
generalization ability even puzzles strong pre-trained language generation
models. We propose a novel framework using retrieval methods to enhance both
the pre-training and fine-tuning for commonsense generation. We retrieve
prototype sentence candidates by concept matching and use them as auxiliary
input. For fine-tuning, we further boost its performance with a trainable
sentence retriever. We demonstrate experimentally on the large-scale CommonGen
benchmark that our approach achieves new state-of-the-art results.
Authors' comments: Findings of ACL-IJCNLP 2021
Josef Knapp, Alexander Paulus, Jonas Kornprobst, Uwe Siart, Thomas F. Eibert
Phase retrieval problems in antenna measurements arise when a reference phase
cannot be provided to all measurement locations. Phase retrieval algorithms
require sufficiently many independent measurement samples of the radiated
fields to be successful. Larger amounts of independent data may improve the
reconstruction of the phase information from magnitude-only measurements. We
show how the knowledge of relative phases among the spectral components of a
modulated signal at the individual measurement locations may be employed to
reconstruct the relative phases between different measurement locations at all
frequencies. Projection matrices map the estimated phases onto the space of
fields possibly generated by equivalent antenna under test (AUT) sources at all
frequencies. In this way, the phase of the reconstructed solution is not only
restricted by the measurement samples at one frequency, but by the samples at
all frequencies simultaneously. The proposed method can increase the amount of
independent phase information even if all probes are located in the far field
of the AUT.
Authors' comments: 14 pages, 29 figures, 1 table, published in IEEE Transactions on
Antennas and Propagation
Conghui Hu, Yongxin Yang, Yunpeng Li, Timothy M. Hospedales, Yi-Zhe Song
The practical value of existing supervised sketch-based image retrieval (SBIR) algorithms is largely limited by the requirement for intensive data collection and labeling. In this paper, we present the first attempt at unsupervised SBIR to remove the labeling cost (both category annotations and sketch-photo pairings) that is conventionally needed for training. Existing single-domain unsupervised representation learning methods perform poorly in this application, due to the unique cross-domain (sketch and photo) nature of the problem. We therefore introduce a novel framework that simultaneously performs sketch-photo domain alignment and semantic-aware representation learning. Technically this is underpinned by introducing joint distribution optimal transport (JDOT) to align data from different domains, which we extend with trainable cluster prototypes and feature memory banks to further improve scalability and efficacy. Extensive experiments show that our framework achieves excellent performance in the new unsupervised setting, and performs comparably or better than state-of-the-art in the zero-shot setting.
Shah Riya Chiragkumar
Music Information Retrieval (MIR) is a collaborative scientific study that
helps to build innovative information research themes, novel frameworks, and
connected delivery mechanisms, in addition to making the world's
massive collection of music open to everyone. For modern rock music, tempo
proved difficult to estimate and chord recognition did not work. All of the
findings indicate that modern rock and metal music can be analysed, despite its
complexity, but that further research is needed in this area to make it useful.
Using a neural network has been one of the simplest ways of dealing with it.
The pitch class profile vector is used in the neural network method. Because
the vector only contains 12 elements of semi-tone values, it is enough for
chord recognition. Of course, there are other ways of achieving this; most
of them depend on pitch class profiling to transform the chord into a form that
can be recognised, but the recognition process is time-consuming because it
relies on extremely complicated and memory-intensive methods.
Authors' comments: work in progress
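The 12-element pitch class profile mentioned in the abstract above can be illustrated with a short sketch. Note the swap: instead of the neural network the abstract describes, this example uses simple template matching against major and minor triads, which is enough to show why 12 semitone bins suffice to separate basic chords. The MIDI-note input, the function names, and the triad templates are assumptions for illustration.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_profile(midi_notes):
    """Fold MIDI note numbers into a normalized 12-bin semitone histogram."""
    pcp = [0.0] * 12
    for note in midi_notes:
        pcp[note % 12] += 1.0
    total = sum(pcp)
    return [x / total for x in pcp] if total else pcp

def chord_templates():
    """Binary 12-bin templates for the 24 major and minor triads."""
    templates = {}
    for root in range(12):
        for quality, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            template = [0.0] * 12
            for interval in intervals:
                template[(root + interval) % 12] = 1.0
            templates[NOTE_NAMES[root] + quality] = template
    return templates

def recognise(pcp):
    """Return the triad whose template has the largest inner product with the profile."""
    templates = chord_templates()
    return max(templates, key=lambda name: sum(p * t for p, t in zip(pcp, templates[name])))

# MIDI notes 60, 64, 67 are C4, E4, G4.
print(recognise(pitch_class_profile([60, 64, 67])))  # Cmaj
```

A trained classifier would replace `recognise` while keeping the same 12-element input.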
Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh
Given a collection of untrimmed and unsegmented videos, video corpus moment
retrieval (VCMR) is to retrieve a temporal moment (i.e., a fraction of a video)
that semantically corresponds to a given text query. As video and text are from
two distinct feature spaces, there are two general approaches to address VCMR:
(i) to separately encode each modality, then align the two
modality representations for query processing, and (ii) to adopt fine-grained
cross-modal interaction to learn multi-modal representations for query
processing. While the second approach often leads to better retrieval accuracy,
the first approach is far more efficient. In this paper, we propose a Retrieval
and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We
adopt the first approach and introduce two contrastive learning objectives to
refine video encoder and text encoder to learn video and text representations
separately but with better alignment for VCMR. The video contrastive learning
(VideoCL) is to maximize mutual information between query and candidate video
at video-level. The frame contrastive learning (FrameCL) aims to highlight the
moment region that corresponds to the query at frame-level, within a video.
Experimental results show that, although ReLoCLNet encodes text and video
separately for efficiency, its retrieval accuracy is comparable with baselines
adopting cross-modal interaction learning.
Authors' comments: 11 pages, 7 figures and 6 tables. Accepted by SIGIR 2021
Sanya B. Taneja, Richard D. Boyce, William T. Reynolds, Denis Newman-Griffis
Introducing biomedical informatics (BMI) students to natural language
processing (NLP) requires balancing technical depth with practical know-how to
address application-focused needs. We developed a set of three activities
introducing entry-level BMI students to information retrieval with NLP,
covering document representation strategies and language models from TF-IDF to
BERT. These activities provide students with hands-on experience targeted
towards common use cases, and introduce fundamental components of NLP workflows
for a wide variety of applications.
Authors' comments: To appear in the Proceedings of the Fifth Workshop on Teaching NLP @
NAACL
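As an illustration of the simplest document representation named in the abstract above, here is a minimal TF-IDF retrieval sketch using only the Python standard library. It is not taken from the authors' teaching activities: the whitespace tokenizer, the smoothed IDF formula, the function names, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for the documents and the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed IDF variant: log(N / df) + 1, so terms found in every document keep weight 1.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, docs):
    """Rank document indices by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    # Query terms unseen in the collection get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in counts.items()}
    return sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)

docs = [
    "patient reports chest pain",
    "patient has a broken arm",
    "chest x-ray shows no fracture",
]
ranking = search("chest pain", docs)
```

The second document shares no terms with the query and so scores zero, which is the vocabulary-mismatch limitation that motivates moving from TF-IDF towards contextual models such as BERT.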
Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie
We consider the task of retrieving audio using free-form natural language
queries. To study this problem, which has received limited attention in the
existing literature, we introduce challenging new benchmarks for text-based
audio retrieval using text annotations sourced from the Audiocaps and Clotho
datasets. We then employ these benchmarks to establish baselines for
cross-modal audio retrieval, where we demonstrate the benefits of pre-training
on diverse audio tasks. We hope that our benchmarks will inspire further
research into cross-modal text-based audio retrieval with free-form text
queries.
Authors' comments: Accepted at INTERSPEECH 2021