Sowmya Kamath S, Karthik K
Medical Image Retrieval is a challenging field in Visual information
retrieval, due to the multi-dimensional and multi-modal context of the
underlying content. Traditional models often fail to take the intrinsic
characteristics of data into consideration, and have thus achieved limited
accuracy when applied to medical images. The Bag of Visual Words (BoVW) is a
technique that can be used to effectively represent intrinsic image features in
vector space, so that applications like image classification and similar-image
search can be optimized. In this paper, we present a MedIR approach based on
the BoVW model for content-based medical image retrieval. As medical images are
multi-dimensional, they exhibit underlying cluster and manifold information
which enhances semantic relevance and allows for label uniformity. Hence, the
BoVW features extracted for each image are used to train a supervised machine
learning classifier based on positive and negative training images, for
extending content based image retrieval. During experimental validation, the
proposed model performed very well, achieving a Mean Average Precision of
88.89% during top-3 image retrieval experiments.
Authors' comments: In the proceedings of the 7th International Engineering Symposium
(IES 2018), Kumamoto University, Kumamoto, Japan, Mar 7-9, 2018
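The BoVW representation described in the abstract above can be illustrated with a minimal sketch. It assumes a codebook of visual words already exists (typically obtained by clustering local descriptors); the toy 2-D descriptors and the function names `nearest_word` and `bovw_histogram` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))

def bovw_histogram(descriptors, codebook):
    """Quantise local descriptors against the codebook; return an L1-normalised histogram."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts.get(i, 0) / total if total else 0.0 for i in range(len(codebook))]

# Toy 2-D "descriptors" (assumption); real systems use high-dimensional local features.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
image_descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bovw_histogram(image_descriptors, codebook)
print(hist)
```

The resulting fixed-length histogram is the kind of vector that could then be passed to a supervised classifier, as the abstract describes.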
Samarth Rawal
Information Retrieval (IR) is the task of obtaining pieces of data (such as
documents) that are relevant to a particular query or need from a large
repository of information. IR is a valuable component of several downstream
Natural Language Processing (NLP) tasks. Practically, IR is at the heart of
many widely-used technologies like search engines. While probabilistic ranking
functions like the Okapi BM25 function have been utilized in IR systems since
the 1970s, modern neural approaches offer certain advantages over their
classical counterparts. In particular, the release of BERT (Bidirectional
Encoder Representations from Transformers) has had a significant impact in the
NLP community by demonstrating how the use of a Masked Language Model trained
on a large corpus of data can improve a variety of downstream NLP tasks,
including sentence classification and passage re-ranking. IR systems are also
important in the biomedical and clinical domains. Given the increasing amount
of scientific literature across the biomedical domain, the ability to find answers to
specific clinical queries from a repository of millions of articles is a matter
of practical value to medical professionals. Moreover, there are
domain-specific challenges present, including handling clinical jargon and
evaluating the similarity or relatedness of various medical symptoms when
determining the relevance between a query and a sentence. This work presents
contributions to several aspects of the Biomedical Semantic Information
Retrieval domain. First, it introduces Multi-Perspective Sentence Relevance, a
novel methodology of utilizing BERT-based models for contextual IR. The system
is evaluated using the BioASQ Biomedical IR Challenge. Finally, practical
contributions in the form of a live IR system for medics and a proposed
challenge on the Living Systematic Review clinical task are provided.
Authors' comments: Masters thesis, Arizona State Univ (May, 2020)
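For readers unfamiliar with the Okapi BM25 ranking function mentioned in the abstract above, the following is a minimal sketch of one common formulation. The IDF variant (with `+ 1` inside the logarithm to keep it non-negative), the default values of `k1` and `b`, and the toy documents are assumptions chosen for illustration; they are not taken from the thesis.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the tokenised query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

docs = [
    "fever and cough in adults".split(),
    "treatment of chronic cough".split(),
    "exoplanet transmission spectra".split(),
]
scores = bm25_scores("chronic cough".split(), docs)
print(scores)
```

The second document matches both query terms and is shorter, so it receives the highest score; the third shares no terms and scores zero.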
Christopher Thomas, Adriana Kovashka
The abundance of multimodal data (e.g. social media posts) has inspired interest in cross-modal retrieval methods. Popular approaches rely on a variety of metric learning losses, which prescribe what the proximity of image and text should be, in the learned space. However, most prior methods have focused on the case where image and text convey redundant information; in contrast, real-world image-text pairs convey complementary information with little overlap. Further, images in news articles and media portray topics in a visually diverse fashion; thus, we need to take special care to ensure a meaningful image representation. We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces, which does not necessarily align with visual coherency. Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed. Our approach improves the results of cross-modal retrieval on four datasets compared to five baselines.
Yang Feng, Yubao Liu, Jiebo Luo
Medical Image Retrieval (MIR) helps doctors quickly find similar patients'
data, which can considerably aid the diagnosis process. MIR is becoming
increasingly helpful due to the wide use of digital imaging modalities and the
growth of the medical image repositories. However, the popularity of various
digital imaging modalities in hospitals also poses several challenges to MIR.
Usually, one image retrieval model is only trained to handle images from one
modality or one source. When medical images must be retrieved from
several sources or domains, multiple retrieval models need to be maintained,
which is not cost-effective. In this paper, we study an important but unexplored
task: how to train one MIR model that is applicable to medical images from
multiple domains? Simply fusing the training data from multiple domains cannot
solve this problem because some domains become over-fit sooner when trained
together using existing methods. Therefore, we propose to distill the knowledge
in multiple specialist MIR models into a single multi-domain MIR model via
universal embedding to solve this problem. Using skin disease, x-ray, and
retina image datasets, we validate that our proposed universal model can
effectively accomplish multi-domain MIR.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2003.03701
Simon Jordan, Mathias Seuret, Pavel Král, Ladislav Lenc, Jiří Martínek, Barbara Wiermann, Tobias Schwinger, Andreas Maier et al.
Automatic writer identification is a common problem in document analysis. State-of-the-art methods typically focus on the feature extraction step with traditional or deep-learning-based techniques. In retrieval problems, re-ranking is a commonly used technique to improve the results. Re-ranking refines an initial ranking result by using the knowledge contained in the ranked result, e.g., by exploiting nearest neighbor relations. To the best of our knowledge, re-ranking has not been used for writer identification/retrieval. A possible reason might be that publicly available benchmark datasets contain only a few samples per writer, which makes re-ranking less promising. We show that a re-ranking step based on k-reciprocal nearest neighbor relationships is advantageous for writer identification, even if only a few samples per writer are available. We use these reciprocal relationships in two ways: encode them into new vectors, as originally proposed, or integrate them in terms of query-expansion. We show that both techniques outperform the baseline results in terms of mAP on three writer identification datasets.
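The k-reciprocal nearest neighbor relationship named in the abstract above can be sketched as follows. This shows only the basic set definition (j is kept if it is among the k nearest neighbors of the query and the query is among the k nearest neighbors of j); the Euclidean distance, the toy points, and the function names are assumptions, and the encoding and query-expansion steps of the paper are not reproduced.

```python
import math

def knn(index, points, k):
    """Indices of the k nearest neighbours of points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]

def k_reciprocal_neighbours(index, points, k):
    """Neighbours j of `index` such that `index` is also among the k nearest of j."""
    return [j for j in knn(index, points, k) if index in knn(j, points, k)]

# Three nearby samples and one outlier (toy data, assumption).
points = [(0.0, 0.0), (0.1, 0.0), (0.25, 0.0), (5.0, 5.0)]
print(k_reciprocal_neighbours(0, points, 2))
print(k_reciprocal_neighbours(3, points, 2))
```

The outlier has nearest neighbors like every other point, but none of them lists it in return, so its k-reciprocal set is empty. This asymmetry is what re-ranking exploits.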
Mark Hamilton, Stephanie Fu, Mindren Lu, Johnny Bui, Darius Bopp, Zhenbang Chen, Felix Tran, Margaret Wang et al.
We introduce MosAIc, an interactive web app that allows users to find pairs of semantically related artworks that span different cultures, media, and millennia. To create this application, we introduce Conditional Image Retrieval (CIR) which combines visual similarity search with user supplied filters or "conditions". This technique allows one to find pairs of similar images that span distinct subsets of the image corpus. We provide a generic way to adapt existing image retrieval data-structures to this new domain and provide theoretical bounds on our approach's efficiency. To quantify the performance of CIR systems, we introduce new datasets for evaluating CIR methods and show that CIR performs non-parametric style transfer. Finally, we demonstrate that our CIR data-structures can identify "blind spots" in Generative Adversarial Networks (GAN) where they fail to properly model the true data distribution.
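The idea of combining similarity search with user-supplied conditions can be illustrated with a brute-force sketch. It is not the data-structure adaptation the abstract refers to: it simply filters the corpus by a predicate and then ranks the survivors by distance. The corpus layout, metadata keys, and the name `conditional_nearest` are hypothetical.

```python
import math

def conditional_nearest(query_vec, corpus, condition, k=1):
    """Return ids of the k nearest items whose metadata satisfies `condition`."""
    candidates = [item for item in corpus if condition(item["meta"])]
    candidates.sort(key=lambda item: math.dist(query_vec, item["vec"]))
    return [item["id"] for item in candidates[:k]]

# Toy feature vectors and metadata (assumptions for illustration only).
corpus = [
    {"id": "a", "vec": (0.0, 1.0), "meta": {"culture": "Dutch", "medium": "painting"}},
    {"id": "b", "vec": (0.1, 0.9), "meta": {"culture": "Dutch", "medium": "ceramic"}},
    {"id": "c", "vec": (0.3, 0.8), "meta": {"culture": "Chinese", "medium": "ceramic"}},
    {"id": "d", "vec": (2.0, 2.0), "meta": {"culture": "Chinese", "medium": "painting"}},
]
match = conditional_nearest((0.0, 1.0), corpus, lambda m: m["culture"] == "Chinese")
print(match)
```

Filtering before ranking costs time linear in the corpus size per query, which is the cost an index-based approach would aim to avoid.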
Sagar Uprety, Dimitris Gkoumas, Dawei Song
Since 2004, researchers have been using the mathematical framework of Quantum
Theory (QT) in Information Retrieval (IR). QT offers a generalized probability
and logic framework. Such a framework has been shown capable of unifying the
representation, ranking and user cognitive aspects of IR, and helpful in
developing more dynamic, adaptive and context-aware IR systems. Although
Quantum-inspired IR is still a growing area, a wide array of work in different
aspects of IR has been done and produced promising results. This paper presents
a survey of the research done in this area, aiming to show the landscape of the
field and draw a road-map of future directions.
Authors' comments: Accepted for publication at ACM Computing Surveys on May 20, 2020
Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, Tat-Seng Chua
The rapid growth of user-generated videos on the Internet has intensified the
need for text-based video retrieval systems. Traditional methods mainly favor
the concept-based paradigm on retrieval with simple queries, which are usually
ineffective for complex queries that carry far more complex semantics.
Recently, the embedding-based paradigm has emerged as a popular approach. It aims
to map the queries and videos into a shared embedding space where
semantically-similar texts and videos are much closer to each other. Despite
its simplicity, it forgoes the exploitation of the syntactic structure of text
queries, making it suboptimal for modeling complex queries.
To facilitate video retrieval with complex queries, we propose a
Tree-augmented Cross-modal Encoding method by jointly learning the linguistic
structure of queries and the temporal representation of videos. Specifically,
given a complex user query, we first recursively compose a latent semantic tree
to structurally describe the text query. We then design a tree-augmented query
encoder to derive structure-aware query representation and a temporal attentive
video encoder to model the temporal characteristics of videos. Finally, both
the query and videos are mapped into a joint embedding space for matching and
ranking. In this approach, we have a better understanding and modeling of the
complex queries, thereby achieving a better video retrieval performance.
Extensive experiments on large scale video retrieval benchmark datasets
demonstrate the effectiveness of our approach.
Authors' comments: Accepted for the 43rd International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR 2020)
Hai Wang, David McAllester
Here we experiment with the use of information retrieval as an augmentation
for pre-trained language models. The text corpus used in information retrieval
can be viewed as a form of episodic memory which grows over time. By augmenting
GPT 2.0 with information retrieval we achieve a zero-shot 15% relative
reduction in perplexity on the Gigaword corpus without any re-training. We also
validate our IR augmentation on an event co-reference task.
Authors' comments: ACL 2020 NUSE Workshop
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk
Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel with the learning process to select more realistic negative training instances. This fundamentally resolves the discrepancy between the data distribution used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. It nearly matches the accuracy of sparse-retrieval-and-BERT-reranking using dot-product in the ANCE-learned representation space and provides almost 100x speed-up.
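The negative-selection idea in the abstract above can be sketched in a simplified form: under the current embeddings, the highest-scoring documents that are not labeled relevant are taken as training negatives. This sketch substitutes an exhaustive dot-product search for the ANN index and omits the asynchronous index refresh; the embeddings and the name `hard_negatives` are hypothetical.

```python
def hard_negatives(query_emb, doc_embs, relevant_ids, n_neg=2):
    """Pick the highest-scoring non-relevant documents under the current embeddings."""
    def score(doc_id):
        return sum(q * d for q, d in zip(query_emb, doc_embs[doc_id]))
    candidates = [doc_id for doc_id in doc_embs if doc_id not in relevant_ids]
    candidates.sort(key=score, reverse=True)
    return candidates[:n_neg]

# Toy 2-D embeddings (assumption); a real encoder would produce these vectors.
doc_embs = {"pos": [1.0, 1.0], "near": [0.9, 0.8], "mid": [0.5, 0.1], "far": [-1.0, -1.0]}
negatives = hard_negatives([1.0, 1.0], doc_embs, relevant_ids={"pos"})
print(negatives)
```

Documents such as "far" are easy to separate from the query and contribute little training signal, which is why negatives close to the query are preferred.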
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma
Although exact term match between queries and documents is the dominant
method to perform first-stage retrieval, we propose a different approach,
called RepBERT, to represent documents and queries with fixed-length
contextualized embeddings. The inner products of query and document embeddings
are regarded as relevance scores. On the MS MARCO Passage Ranking task, RepBERT
achieves state-of-the-art results among all initial retrieval techniques. Its
efficiency is comparable to bag-of-words methods.
Authors' comments: For corresponding code and data, see
https://github.com/jingtaozhan/RepBERT-Index
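Scoring by inner product, as described in the abstract above, can be sketched without any model: given fixed-length embeddings for a query and for each document, the relevance score is their dot product, and documents are ranked by that score. The embeddings below are made-up toy vectors, and `rank_by_inner_product` is a hypothetical helper rather than part of the linked repository.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_by_inner_product(query_emb, doc_embs):
    """Return (doc_id, score) pairs sorted by descending inner product."""
    scored = [(doc_id, inner_product(query_emb, emb)) for doc_id, emb in doc_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumption); a contextual encoder would produce these in practice.
doc_embs = {"d1": [1.0, 0.0, 2.0], "d2": [0.0, 1.0, 0.0], "d3": [2.0, 1.0, 0.5]}
ranking = rank_by_inner_product([1.0, 0.0, 1.0], doc_embs)
print(ranking)
```

Because document embeddings do not depend on the query, they can be computed once offline, leaving only the inner products to evaluate at query time.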
Thao Nguyen, Nakul Gopalan, Roma Patel, Matt Corsaro, Ellie Pavlick, Stefanie Tellex
Natural language object retrieval is a highly useful yet challenging task for robots in human-centric environments. Previous work has primarily focused on commands specifying the desired object's type such as "scissors" and/or visual attributes such as "red," thus limiting the robot to only known object classes. We develop a model to retrieve objects based on descriptions of their usage. The model takes in a language command containing a verb, for example "Hand me something to cut," and RGB images of candidate objects and selects the object that best satisfies the task specified by the verb. Our model directly predicts an object's appearance from the object's use specified by a verb phrase. We do not need to explicitly specify an object's class label. Our approach allows us to predict high level concepts like an object's utility based on the language query. Based on contextual information present in the language commands, our model can generalize to unseen object classes and unknown nouns in the commands. Our model correctly selects objects out of sets of five candidates to fulfill natural language commands, and achieves an average accuracy of 62.3% on a held-out test set of unseen ImageNet object classes and 53.0% on unseen object classes and unknown nouns. Our model also achieves an average accuracy of 54.7% on unseen YCB object classes, which have a different image distribution from ImageNet objects. We demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language descriptions of their usage. We also present a new dataset of 655 verb-object pairs denoting object usage over 50 verbs and 216 object classes.
Michiel Min, Chris W. Ormel, Katy Chubb, Christiane Helling, Yui Kawashima
Aims: ARCiS, a novel code for the analysis of exoplanet transmission and
emission spectra, is presented. The aim of the modelling framework is to provide
a tool able to link observations to physical models of exoplanet atmospheres.
Methods: The modelling philosophy chosen in this paper is to use physical and
chemical models to constrain certain parameters while keeping free the parts
where our physical understanding is still more limited. This approach, in
between full physical modelling and full parameterisation, allows us to use the
processes we understand well and parameterise those less understood. A Bayesian
retrieval framework is implemented and applied to the transit spectra of a set
of 10 hot Jupiters. The code contains chemistry and cloud formation and has the
option for self-consistent temperature structure computations. Results: The
code presented is fast and flexible enough to be used for retrieval and for
target list simulations for e.g. JWST or the ESA Ariel missions. We present
results for the retrieval of elemental abundance ratios using the physical
retrieval framework and compare this to results obtained using a parameterised
retrieval setup. Conclusions: We conclude that for most of the targets
considered the current dataset is not constraining enough to reliably pin down
the elemental abundance ratios. We find no significant correlations between
different physical parameters. We confirm that planets in our sample with a
strong slope in the optical transmission spectrum are the planets where we find
cloud formation to be most active. Finally, we conclude that with ARCiS we have
a computationally efficient tool to analyse exoplanet observations in the
context of physical and chemical models.
Authors' comments: Accepted for publication in A&A
Cheng Cheng, Ingrid Daubechies, Nadav Dym, Jianfeng Lu
This paper is concerned with stable phase retrieval for a family of phase retrieval models we name "locally stable and conditionally connected" (LSCC) measurement schemes. For every signal $f$, we associate a corresponding weighted graph $G_f$, defined by the LSCC measurement scheme, and show that the phase retrievability of the signal $f$ is determined by the connectivity of $G_f$. We then characterize the phase retrieval stability of the signal $f$ by two measures that are commonly used in graph theory to quantify graph connectivity: the Cheeger constant of $G_f$ for real valued signals, and the algebraic connectivity of $G_f$ for complex valued signals. We use our results to study the stability of two phase retrieval models that can be cast as LSCC measurement schemes, and focus on understanding for which signals the "curse of dimensionality" can be avoided. The first model we discuss is a finite-dimensional model for locally supported measurements such as the windowed Fourier transform. For signals "without large holes", we show the stability constant exhibits only a mild polynomial growth in the dimension, in stark contrast with the exponential growth which uniform stability constants tend to suffer from; more precisely, in $R^d$ the constant grows proportionally to $d^{1/2}$, while in $C^d$ it grows proportionally to $d$. We also show the growth of the constant in the complex case cannot be reduced, suggesting that complex phase retrieval is substantially more difficult than real phase retrieval. The second model we consider is an infinite-dimensional phase retrieval problem in a principal shift invariant space. We show that despite the infinite dimensionality of this model, signals with monotone exponential decay will have a finite stability constant. In contrast, the stability bound provided by our results will be infinite if the signal's decay is polynomial.
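To make the graph-connectivity measure mentioned in the abstract above concrete, the following sketch computes a Cheeger constant by brute force for a small weighted graph. It assumes one common definition, the minimum over vertex subsets S of the cut weight divided by min(|S|, |S^c|); the paper may use a different normalisation (for example by volume), and the construction of $G_f$ from a signal is not reproduced here.

```python
from itertools import combinations

def cheeger_constant(weights):
    """Brute-force min over vertex subsets S of cut(S, S^c) / min(|S|, |S^c|)."""
    n = len(weights)
    best = float("inf")
    # Only subsets of size <= n // 2 are needed, so min(|S|, |S^c|) equals |S|.
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            inside = set(subset)
            cut = sum(weights[i][j] for i in inside for j in range(n) if j not in inside)
            best = min(best, cut / size)
    return best

# Path graph 0-1-2-3 with unit weights (toy example, assumption).
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# The same graph with the middle edge removed, so it is disconnected.
split = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
print(cheeger_constant(path), cheeger_constant(split))
```

Under this definition the constant is zero exactly when the graph is disconnected, which mirrors the abstract's statement that retrievability is determined by the connectivity of the graph. The enumeration is exponential in the number of vertices and is only meant for small examples.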
Muhammad Umer Anwaar, Egor Labintcev, Martin Kleinsteuber
In this paper, we investigate the problem of retrieving images from a
database based on a multi-modal (image-text) query. Specifically, the query
text prompts some modification in the query image and the task is to retrieve
images with the desired modifications. For instance, a user of an E-Commerce
platform is interested in buying a dress, which should look similar to her
friend's dress, but the dress should be of white color with a ribbon sash. In
this case, we would like the algorithm to retrieve some dresses with desired
modifications in the query dress. We propose an autoencoder based model,
ComposeAE, to learn the composition of image and text query for retrieving
images. We adopt a deep metric learning approach and learn a metric that pushes
composition of source image and text query closer to the target images. We also
propose a rotational symmetry constraint on the optimization problem. Our
approach is able to outperform the state-of-the-art method TIRG on
three benchmark datasets, namely MIT-States, Fashion200k and Fashion IQ. In
order to ensure fair comparison, we introduce strong baselines by enhancing
the TIRG method. To ensure reproducibility of the results, we publish our code
here: https://github.com/ecom-research/ComposeAE.
Authors' comments: Published at IEEE WACV 2021
Y. Katherina Feng, Michael R. Line, Jonathan J. Fortney
Spectroscopic phase curves provide unique access to the three-dimensional
properties of transiting exoplanet atmospheres. However, a modeling framework
must be developed to deliver accurate inferences of atmospheric properties for
these complex data sets. Here, we develop an approach to retrieve temperature
structures and molecular abundances from phase curve spectra at any orbital
phase. In the context of a representative hot Jupiter with a large day-night
temperature contrast, we examine the biases in typical one-dimensional (1D)
retrievals as a function of orbital phase/geometry, compared to two-dimensional
(2D) models that appropriately capture the disk-integrated phase geometry. We
guide our intuition by applying our new framework on a simulated HST+Spitzer
phase curve data set in which the "truth" is known, followed by an application
to the spectroscopic phase curve of the canonical hot Jupiter, WASP-43b. We
also demonstrate the retrieval framework on simulated JWST phase curve
observations. We apply our new geometric framework to a joint-fit of all
spectroscopic phases, assuming longitudinal molecular abundance homogeneity,
resulting in a factor of 2 improvement in abundance precision when compared
to individual phase constraints. With a 1D retrieval model on simulated
HST+Spitzer data, we find strongly biased molecular abundances for CH$_4$ and
CO$_2$ at most orbital phases. With 2D, the day and night profiles retrieved
from WASP-43b remain consistent throughout the orbit. JWST retrievals show that
a 2D model is strongly favored at all orbital phases. Based on our new 2D
retrieval implementation, we provide recommendations on when 1D models are
appropriate and when more complex phase geometries involving multiple TP
profiles are required to obtain an unbiased view of tidally locked planetary
atmospheres.
Authors' comments: Accepted at AJ. 23 pages, 30 figures - see figure 2 for quick
takeaway
Rima Alaifari, Matthias Wellershoff
We consider the problem of phase retrieval from magnitudes of short-time
Fourier transform (STFT) measurements. It is well-known that signals are
uniquely determined (up to global phase) by their STFT magnitude when the
underlying window has an ambiguity function that is nowhere vanishing. It is
less clear, however, what can be said in terms of unique phase-retrievability
when the ambiguity function of the underlying window vanishes on some of the
time-frequency plane. In this short note, we demonstrate that by considering
signals in Paley-Wiener spaces, it is possible to prove new uniqueness results
for STFT phase retrieval. Among those, we establish a first uniqueness theorem
for STFT phase retrieval from magnitude-only samples in a real-valued setting.
Authors' comments: 13 pages, 5 figures
Kun Liu, Huadong Ma, Chuang Gan
We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between vision and linguistic domains. Existing methods independently extract the features of videos and sentences and purely utilize the sentence embedding in the multi-modal fusion stage, which does not make full use of the potential of language. In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval. In the first feature extraction stage, we propose to jointly learn visual and language features to capture the powerful visual information which can cover the complex semantics in the sentence query. Specifically, the early modulation unit is designed to modulate the visual feature extractor's feature maps by a linguistic embedding. Then we adopt a multi-modal fusion module in the second fusion stage. Finally, to get a precise localizer, the sentence information is utilized to guide the process of predicting temporal positions. Specifically, the late guidance module is developed to linearly transform the output of localization networks via the channel attention mechanism. The experimental results on two popular datasets demonstrate the superior performance of our proposed method on moment retrieval (improving by 5.8% in terms of Rank1@IoU0.5 on Charades-STA and 5.2% on TACoS). The source code for the complete system will be publicly available.
Chau Tran, Yuqing Tang, Xian Li, Jiatao Gu
Recent studies have demonstrated the cross-lingual alignment ability of multilingual pretrained language models. In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. We utilized these findings to develop a new approach -- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time. Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks.
Zerun Feng, Zhimin Zeng, Caili Guo, Zheng Li
Video retrieval is a challenging research topic bridging the vision and
language areas and has attracted broad attention in recent years. Previous
works have been devoted to representing videos by directly encoding from
frame-level features. In fact, videos consist of various and abundant semantic
relations to which existing methods pay less attention. To address this issue,
we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit
reasoning between frame regions. Specifically, we consider frame regions as
vertices and construct a fully-connected semantic correlation graph. Then, we
perform reasoning by novel random walk rule-based graph convolutional networks
to generate region features involved with semantic relations. With the benefit
of reasoning, semantic interactions between regions are considered, while the
impact of redundancy is suppressed. Finally, the region features are aggregated
to form frame-level features for further encoding to measure video-text
similarity. Extensive experiments on two public benchmark datasets validate the
effectiveness of our method by achieving state-of-the-art performance due to
the powerful semantic reasoning.
Authors' comments: Accepted by IJCAI 2020. SOLE copyright holder is IJCAI (International
Joint Conferences on Artificial Intelligence), all rights reserved.
http://static.ijcai.org/2020-accepted_papers.html