Shrikant Temburwar, Bulla Rajesh, Mohammed Javed
Content-based image retrieval (CBIR) systems on pixel domain use low-level
features, such as colour, texture and shape, to retrieve images. In this
context, two types of image representations i.e. local and global image
features have been studied in the literature. Extracting these features from
pixel images and comparing them with images from the database is very
time-consuming. Therefore, in recent years, there has been some effort to
accomplish image analysis directly in the compressed domain with fewer
computations. Furthermore, most of the images in our daily transactions are
stored in the JPEG compressed format. Therefore, it would be ideal if we could
retrieve features directly from the partially decoded or compressed data and
use them for retrieval. Here, we propose a unified model for image retrieval
which takes DCT coefficients as input and efficiently extracts global and local
features directly in the JPEG compressed domain for accurate image retrieval.
The experimental findings indicate that our proposed model performed similarly
to the current DELG model which takes RGB features as an input with reference
to mean average precision while having a faster training and retrieval speed.
Authors' comments: Accepted in MISP2021
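
As a rough illustration of the DCT coefficients that the abstract above takes as input, the following sketch computes an orthonormal 8x8 block DCT-II with NumPy. It is only an assumption-laden toy: real JPEG decoding also involves level shifting, quantization tables and entropy decoding, none of which are shown, and the function names `dct_matrix` and `blockwise_dct` are hypothetical and not taken from the paper.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)        # DC row uses a different scale factor
    return c

def blockwise_dct(image, block=8):
    """Apply the 2-D DCT independently to each block x block tile."""
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image sides must be multiples of the block size")
    c = dct_matrix(block)
    out = np.empty((h, w), dtype=float)
    for r in range(0, h, block):
        for s in range(0, w, block):
            tile = image[r:r + block, s:s + block]
            out[r:r + block, s:s + block] = c @ tile @ c.T
    return out

# A flat image has energy only in the DC coefficient of each block.
flat = np.full((16, 16), 10.0)
coeffs = blockwise_dct(flat)
```

With this orthonormal scaling, a constant block of value v yields a DC coefficient of 8v and zeros elsewhere, which is why low-frequency coefficients are a compact summary of block content.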
Ming Zhang, Xuefei Zhe, Hong Yan
Existing deep quantization methods provided an efficient solution for
large-scale image retrieval. However, the significant intra-class variations
like pose, illumination, and expressions in face images, still pose a challenge
for face image retrieval. In light of this, face image retrieval requires
sufficiently powerful learning metrics, which are absent in current deep
quantization works. Moreover, to tackle the growing unseen identities in the
query stage, face image retrieval drives more demands regarding model
generalization and system scalability than general image retrieval tasks. This
paper integrates product quantization with orthonormal constraints into an
end-to-end deep learning framework to effectively retrieve face images.
Specifically, a novel scheme that uses predefined orthonormal vectors as
codewords is proposed to enhance the quantization informativeness and reduce
codewords' redundancy. A tailored loss function maximizes discriminability
among identities in each quantization subspace for both the quantized and
original features. An entropy-based regularization term is imposed to reduce
the quantization error. Experiments are conducted on four commonly-used face
datasets under both seen and unseen identities retrieval settings. Our method
outperforms all the compared state-of-the-art deep hashing/quantization methods under
both settings. Results validate the effectiveness of the proposed orthonormal
codewords in improving models' standard retrieval performance and
generalization ability. Combined with further experiments on two general image
datasets, these results demonstrate the broad superiority of our method for
scalable image retrieval.
Authors' comments: Published in Pattern Recognition; supplementary material can be found
on the GitHub project page
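
To make the idea of orthonormal codewords in product quantization more concrete, here is a minimal sketch. It assumes a single codebook shared across all subspaces, hard assignment by maximum inner product, and codewords obtained from a QR decomposition of a random matrix; these choices, and the names `orthonormal_codebook` and `quantize`, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def orthonormal_codebook(dim, num_codewords, seed=0):
    """Return num_codewords mutually orthonormal vectors of length dim (as rows)."""
    if num_codewords > dim:
        raise ValueError("cannot have more orthonormal codewords than dimensions")
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q[:, :num_codewords].T

def quantize(feature, codebook, num_subspaces):
    """Split feature into subvectors and assign each to its best-matching codeword."""
    sub = feature.reshape(num_subspaces, -1)        # (M, dim)
    codes = (sub @ codebook.T).argmax(axis=1)       # hard assignment per subspace
    reconstructed = codebook[codes].reshape(-1)     # concatenated codewords
    return codes, reconstructed

codebook = orthonormal_codebook(dim=4, num_codewords=3)
feature = np.concatenate([codebook[2], codebook[0]])
codes, reconstructed = quantize(feature, codebook, num_subspaces=2)
```

Because the codewords are orthonormal, the inner product of a codeword with the codebook is a one-hot vector, so no two codewords carry overlapping information in this toy setting.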
Nikos Voskarides, Edgar Meij, Sabrina Sauer, Maarten de Rijke
Writers such as journalists often use automatic tools to find relevant
content to include in their narratives. In this paper, we focus on supporting
writers in the news domain to develop event-centric narratives. Given an
incomplete narrative that specifies a main event and a context, we aim to
retrieve news articles that discuss relevant events that would enable the
continuation of the narrative. We formally define this task and propose a
retrieval dataset construction procedure that relies on existing news articles
to simulate incomplete narratives and relevant articles. Experiments on two
datasets derived from this procedure show that state-of-the-art lexical and
semantic rankers are not sufficient for this task. We show that combining those
with a ranker that ranks articles by reverse chronological order outperforms
those rankers alone. We also perform an in-depth quantitative and qualitative
analysis of the results that sheds light on the characteristics of this task.
Authors' comments: ICTIR 2021
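
One simple way to combine a relevance ranker with a reverse-chronological ranker, as described in the abstract above, is rank fusion. The sketch below uses reciprocal rank fusion with the conventional constant k = 60; this fusion method, the toy articles and the field names are assumptions for illustration and are not necessarily what the authors used.

```python
from datetime import date

def reverse_chronological(articles):
    """Rank article ids from newest to oldest."""
    ordered = sorted(articles, key=lambda a: a["published"], reverse=True)
    return [a["id"] for a in ordered]

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings; each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

articles = [
    {"id": "a", "published": date(2021, 1, 5)},
    {"id": "b", "published": date(2021, 3, 1)},
    {"id": "c", "published": date(2021, 2, 10)},
]
lexical_ranking = ["a", "c", "b"]          # hypothetical output of a lexical ranker
recency_ranking = reverse_chronological(articles)
fused = reciprocal_rank_fusion([lexical_ranking, recency_ranking])
```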
Margaret Lawson, William Gropp, Jay Lofstead
Despite the critical role that range queries play in analysis and
visualization for HPC applications, there has been no comprehensive analysis of
indices that are designed to accelerate range queries and the extent to which
they are viable in an HPC setting. In this state-of-the-practice paper, we
present the first such evaluation, examining 20 open-source C and C++ libraries
that support range queries. Contributions of this paper include answering the
following questions: which of the implementations are viable in an HPC setting,
how do these libraries compare in terms of build time, query time, memory
usage, and scalability, what are other trade-offs between these
implementations, is there a single overall best solution, and when does a brute
force solution offer the best performance? We also share key insights learned
during this process that can assist both HPC application scientists and spatial
index developers.
Authors' comments: Added references
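
For reference, the brute-force baseline mentioned in the abstract above can be as simple as a linear scan over all points. The following is a minimal sketch for axis-aligned box queries; the function name and the toy data are illustrative only, and the libraries evaluated in the paper are C and C++ implementations rather than Python.

```python
def brute_force_range_query(points, lower, upper):
    """Return indices of points inside the axis-aligned box [lower, upper]."""
    return [
        i
        for i, p in enumerate(points)
        if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))
    ]

points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
hits = brute_force_range_query(points, lower=(1.0, 1.0), upper=(3.0, 2.0))
```

A scan like this has no build cost and no index memory overhead, which is one reason it can be competitive when only a few queries are issued against a dataset.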
Jegug Ih, Eliza M. -R. Kempton
Retrieval of exoplanetary atmospheric properties from their transmission
spectra commonly assumes that the errors in the data are Gaussian and
independent. However, non-Gaussian noise can occur due to instrumental or
stellar systematics and merging discrete datasets. We investigate the effect of
correlated noise and constrain the potential biases incurred in the retrieved
posteriors. We simulate multiple noise instances of synthetic data and perform
retrievals to obtain statistics of goodness-of-retrieval for varying noise
models. We find that correlated noise allows for overfitting the spectrum,
thereby yielding better goodness-of-fit on average but degrading the overall
accuracy of retrievals. In particular, correlated noise can manifest as an
apparent non-Rayleigh slope in the optical range, leading to an incorrect
estimate of cloud/haze parameters. We also find that higher precision causes
correlated results to be further off from the input values in terms of
estimated errors. As such, we emphasize that caution must be taken in analyzing
retrieved posteriors and that estimated parameter uncertainties are best
understood as lower limits. Finally, we show that while correlated noise cannot
be reliably distinguished with HST observations, inferring its presence and
strength may be possible with JWST observations.
Authors' comments: 25 pages, 26 figures, Submitted to AJ
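
To illustrate what a correlated noise instance for a spectrum might look like, the sketch below draws Gaussian noise whose covariance decays exponentially with wavelength separation. The exponential kernel, the parameter values and the function name `correlated_noise` are assumptions chosen for illustration; the abstract above does not specify the noise models used.

```python
import numpy as np

def correlated_noise(wavelengths, sigma, corr_length, rng):
    """Draw one noise realisation with covariance sigma^2 * exp(-|dw| / corr_length)."""
    diff = np.abs(wavelengths[:, None] - wavelengths[None, :])
    cov = sigma**2 * np.exp(-diff / corr_length)
    chol = np.linalg.cholesky(cov)                 # cov = chol @ chol.T
    noise = chol @ rng.standard_normal(len(wavelengths))
    return noise, cov

rng = np.random.default_rng(0)
wavelengths = np.linspace(1.0, 2.0, 20)            # hypothetical grid, arbitrary units
noise, cov = correlated_noise(wavelengths, sigma=50e-6, corr_length=0.1, rng=rng)
```

Letting `corr_length` shrink toward zero makes the covariance approach a diagonal matrix, which corresponds to the usual independent-Gaussian assumption.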
Zhipeng Wang, Hao Wang, Jiexi Yan, Aming Wu, Cheng Deng
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a novel cross-modal
retrieval task, where abstract sketches are used as queries to retrieve natural
images under a zero-shot scenario. Most existing methods regard ZS-SBIR as a
traditional classification problem and employ a cross-entropy or triplet-based
loss to achieve retrieval, which neglects the problems of the domain gap between
sketches and natural images and the large intra-class diversity in sketches.
Toward this end, we propose a novel Domain-Smoothing Network (DSN) for ZS-SBIR.
Specifically, a cross-modal contrastive method is proposed to learn generalized
representations to smooth the domain gap by mining relations with additional
augmented samples. Furthermore, a category-specific memory bank with sketch
features is explored to reduce intra-class diversity in the sketch domain.
Extensive experiments demonstrate that our approach notably outperforms the
state-of-the-art methods in both Sketchy and TU-Berlin datasets. Our source
code is publicly available at https://github.com/haowang1992/DSN.
Authors' comments: Accepted to IJCAI 2021
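
A generic cross-modal contrastive objective of the kind alluded to in the abstract above can be sketched as an InfoNCE loss between paired sketch and image embeddings. This is a standard formulation written under the assumption that row i of each matrix forms a positive pair; it is not the DSN loss itself, and the temperature value is arbitrary.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where row i of anchors matches row i of positives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

sketch_embs = np.eye(4)                        # toy sketch embeddings
image_embs = np.eye(4)                         # perfectly aligned image embeddings
aligned_loss = info_nce(sketch_embs, image_embs)
shuffled_loss = info_nce(sketch_embs, image_embs[::-1])
```

The loss is small when each sketch is closest to its own image and grows when the pairing is broken, which is the behaviour a domain-gap-reducing objective relies on.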
Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models,
have shown the usefulness of expanding and reweighting the users' initial
queries using information occurring in an initial set of retrieved documents,
known as the pseudo-relevant set. Recently, dense retrieval -- through the use
of neural contextual language models such as BERT for analysing the documents'
and queries' contents and computing their relevance scores -- has shown a
promising performance on several information retrieval tasks still relying on
the traditional inverted index for identifying documents relevant to a query.
Two different dense retrieval families have emerged: the use of single embedded
representations for each passage and query (e.g. using BERT's [CLS] token), or
via multiple representations (e.g. using an embedding for each token of the
query and document). In this work, we conduct the first study into the
potential for multiple representation dense retrieval to be enhanced using
pseudo-relevance feedback. In particular, based on the pseudo-relevant set of
documents identified using a first-pass dense retrieval, we extract
representative feedback embeddings (using KMeans clustering) -- while ensuring
that these embeddings discriminate among passages (based on IDF) -- which are
then added to the query representation. These additional feedback embeddings
are shown to enhance the effectiveness of both a reranking and an
additional dense retrieval operation. Indeed, experiments on the MSMARCO
passage ranking dataset show that MAP can be improved by up to 26% on the TREC
2019 query set and 10% on the TREC 2020 query set by the application of our
proposed ColBERT-PRF method on a ColBERT dense retrieval approach.
Authors' comments: 10 pages
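
The feedback step described in the abstract above (cluster the token embeddings of the pseudo-relevant passages, keep the most discriminative centroids, append them to the query) can be sketched as follows. The plain Lloyd's k-means, the `centroid_idf` callback and all parameter names are simplifying assumptions; in particular, how a centroid is mapped to an IDF value is left as a hypothetical function supplied by the caller.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, initialised from the first k points."""
    centroids = points[:k].astype(float).copy()
    for _ in range(iters):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):                      # keep old centroid if cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def expand_query(query_embs, feedback_embs, centroid_idf, k=2, num_expansion=1):
    """Append the num_expansion centroids with the highest (hypothetical) IDF."""
    centroids = kmeans(feedback_embs, k)
    idf = np.array([centroid_idf(c) for c in centroids])
    keep = np.argsort(-idf)[:num_expansion]
    return np.vstack([query_embs, centroids[keep]]), idf[keep]

query_embs = np.array([[1.0, 0.0]])
feedback_embs = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]])
# Toy stand-in for an IDF lookup: larger first coordinate means rarer token.
expanded, weights = expand_query(query_embs, feedback_embs, lambda c: float(c[0]))
```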
Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework with co-learning of image-text relations and enhancement of temporal relations between video frames and video-text, respectively, which makes it trainable on comparatively small datasets. Specifically, based on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
Tobias Uelwer, Tobias Hoffmann, Stefan Harmeling
Fourier phase retrieval is the problem of reconstructing a signal given only
the magnitude of its Fourier transformation. Optimization-based approaches,
like the well-established Gerchberg-Saxton or the hybrid input output
algorithm, struggle at reconstructing images from magnitudes that are not
oversampled. This motivates the application of learned methods, which allow
reconstruction from non-oversampled magnitude measurements after a learning
phase. In this paper, we want to push the limits of these learned methods by
means of a deep neural network cascade that reconstructs the image successively
on different resolutions from its non-oversampled Fourier magnitude. We
evaluate our method on four different datasets (MNIST, EMNIST, Fashion-MNIST,
and KMNIST) and demonstrate that it yields improved performance over other
non-iterative methods and optimization-based methods.
Authors' comments: Accepted at the 30th International Conference on Artificial Neural
Networks (ICANN 2021)
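
For context, a classical optimization-style baseline of the kind named in the abstract above can be sketched as an error-reduction loop that alternates between enforcing the measured Fourier magnitude and a real, non-negative image constraint. This is a simplified Gerchberg-Saxton-style iteration written for illustration, not the learned cascade proposed in the paper; the image size and iteration count are arbitrary.

```python
import numpy as np

def error_reduction(magnitude, init, iters=100):
    """Alternate projections: measured Fourier magnitude, then non-negativity."""
    x = init.astype(float)
    for _ in range(iters):
        spectrum = np.fft.fft2(x)
        phase = np.exp(1j * np.angle(spectrum))       # keep the current phase estimate
        x = np.fft.ifft2(magnitude * phase).real      # impose the measured magnitude
        x = np.clip(x, 0.0, None)                     # impose non-negativity
    return x

rng = np.random.default_rng(0)
truth = rng.random((8, 8))
magnitude = np.abs(np.fft.fft2(truth))               # non-oversampled measurement
estimate = error_reduction(magnitude, init=rng.random((8, 8)))
```

Without oversampling, many images share the same Fourier magnitude, so a random start is not expected to recover `truth`; the true image is, however, a fixed point of the iteration.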
Yeon Seonwoo, Sang-Woo Lee, Ji-Hoon Kim, Jung-Woo Ha, Alice Oh
In multi-hop QA, answering complex questions entails iterative document
retrieval for finding the missing entity of the question. The main steps of
this process are sub-question detection, document retrieval for the
sub-question, and generation of a new query for the final document retrieval.
However, building a dataset that contains complex questions with sub-questions
and their corresponding documents requires costly human annotation. To address
the issue, we propose a new method for weakly supervised multi-hop retriever
pre-training without human effort. Our method includes 1) a pre-training task
for generating vector representations of complex questions, 2) a scalable data
generation method that produces the nested structure of question and
sub-question as weak supervision for pre-training, and 3) a pre-training model
structure based on dense encoders. We conduct experiments to compare the
performance of our pre-trained retriever with several state-of-the-art models
on end-to-end multi-hop QA as well as document retrieval. The experimental
results show that our pre-trained retriever is effective and also robust on
limited data and computational resources.
Authors' comments: ACL-Findings 2021
Thomas Friedrich, Chu-Ping Yu, Johan Verbeek, Timothy Pennycook, Sandra Van Aert
We present a computational imaging mode for large scale electron microscopy
data, which retrieves a complex wave from noisy/sparse intensity recordings
using a deep learning approach and subsequently reconstructs an image of the
specimen from the Convolutional Neural Network (CNN) predicted exit waves. We
demonstrate that an appropriate forward model in combination with open data
frameworks can be used to generate large synthetic datasets for training. In
combination with augmenting the data with Poisson noise corresponding to
varying dose-values, we effectively eliminate overfitting issues. The U-NET
based architecture of the CNN is adapted to the task at hand and performs well
while maintaining a relatively small size and fast performance. The validity of
the approach is confirmed by comparing the reconstruction to well-established
methods using simulated, as well as real electron microscopy data. The proposed
method is shown to be effective particularly in the low dose range, as evidenced by
strong suppression of noise, good spatial resolution, and sensitivity to
different atom types, enabling the simultaneous visualisation of light and
heavy elements and making different atomic species distinguishable. Since the
method acts on a very local scale and is comparatively fast it bears the
potential to be used for near-real-time reconstruction during data acquisition.
Authors' comments: Accepted conference paper of IEEE ICIP 2021
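
The dose-dependent Poisson augmentation mentioned in the abstract above could look roughly like the sketch below. It assumes that the dose is expressed as the total expected number of counts per recorded pattern, which is an illustrative convention and not something stated in the abstract; the function name `apply_dose` is likewise hypothetical.

```python
import numpy as np

def apply_dose(intensity, dose, rng):
    """Sample Poisson counts from a noise-free intensity pattern at a given dose."""
    expected_counts = intensity / intensity.sum() * dose
    return rng.poisson(expected_counts)

rng = np.random.default_rng(0)
clean = rng.random((32, 32)) + 0.1       # stand-in for a simulated intensity pattern
low_dose = apply_dose(clean, dose=1e3, rng=rng)
high_dose = apply_dose(clean, dose=1e6, rng=rng)
```

Sampling a different dose for every training example exposes the network to both sparse, noisy recordings and nearly noise-free ones.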
Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, Rishabh Iyer
Semi-supervised learning (SSL) algorithms have had great success in recent
years in limited labeled data regimes. However, the current state-of-the-art
SSL algorithms are computationally expensive and entail significant compute
time and energy requirements. This can prove to be a huge limitation for many
smaller companies and academic groups. Our main insight is that training on a
subset of unlabeled data instead of entire unlabeled data enables the current
SSL algorithms to converge faster, significantly reducing computational costs.
In this work, we propose RETRIEVE, a coreset selection framework for efficient
and robust semi-supervised learning. RETRIEVE selects the coreset by solving a
mixed discrete-continuous bi-level optimization problem such that the selected
coreset minimizes the labeled set loss. We use a one-step gradient
approximation and show that the discrete optimization problem is approximately
submodular, enabling simple greedy algorithms to obtain the coreset. We
empirically demonstrate on several real-world datasets that existing SSL
algorithms like VAT, Mean-Teacher, FixMatch, when used with RETRIEVE, achieve
a) faster training times, b) better performance when unlabeled data consists of
Out-of-Distribution (OOD) data and imbalance. More specifically, we show that
with minimal accuracy degradation, RETRIEVE achieves a speedup of around
$3\times$ in the traditional SSL setting and achieves a speedup of $5\times$
compared to state-of-the-art (SOTA) robust SSL algorithms in the case of
imbalance and OOD data. RETRIEVE is available as a part of the CORDS toolkit:
https://github.com/decile-team/cords.
Authors' comments: To appear in NeurIPS21
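
The greedy selection enabled by approximate submodularity, as described in the abstract above, follows a generic pattern: repeatedly add the element with the largest marginal gain. The sketch below shows that pattern with a facility-location objective on 1-D points as a stand-in; this objective is an assumption for illustration and is not RETRIEVE's bi-level labeled-loss objective.

```python
def facility_location(selected, points):
    """Sum over all points of the similarity to the closest selected point."""
    if not selected:
        return 0.0
    return sum(
        max(1.0 / (1.0 + abs(p - points[j])) for j in selected) for p in points
    )

def greedy_select(points, budget):
    """Pick indices one at a time, each maximising the marginal gain."""
    selected = []
    remaining = list(range(len(points)))
    for _ in range(min(budget, len(points))):
        base = facility_location(selected, points)
        best = max(
            remaining,
            key=lambda j: facility_location(selected + [j], points) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected

points = [0.0, 0.1, 0.2, 5.0, 5.1]
coreset = greedy_select(points, budget=2)
```

On this toy data the first pick is the centre of the larger cluster and the second pick comes from the other cluster, since a second point near the first adds little marginal gain.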
J. A. Pollock, K. S. Morgan, L. C. P. Croton, M. K. Croughan, G. Ruben, N. Yagi, H. Sekiguchi, M. J. Kitchen
The ill-posed problem of phase retrieval in optics, using one or more intensity measurements, has a multitude of applications using electromagnetic or matter waves. Many phase retrieval algorithms are computed on pixel arrays using discrete Fourier transforms due to their high computational efficiency. However, the mathematics underpinning these algorithms is typically formulated using continuous mathematics, which can result in a loss in spatial resolution in the reconstructed images. Herein we investigate how phase retrieval algorithms for propagation-based phase-contrast X-ray imaging can be rederived using discrete mathematics and result in more precise retrieval for single- and multi-material objects and for spectral image decomposition. We validate this theory through experimental measurements of spatial resolution using computed tomography (CT) reconstructions of plastic phantoms and biological tissue, using detectors with a range of imaging system point spread functions (PSFs). We demonstrate that if the PSF substantially suppresses high spatial frequencies, the potential improvement from utilising the discrete derivation is limited. However, with detectors characterised by a single pixel PSF (e.g. direct, photon-counting X-ray detectors), a significant improvement in spatial resolution can be obtained, demonstrated here at up to 17%.
Abin Jose, Daniel Filbert, Christian Rohlfing, Jens-Rainer Ohm
In this paper, we propose an approach for learning binary hash codes for
image retrieval. Canonical Correlation Analysis (CCA) is used to design two
loss functions for training a neural network such that the correlation between
the two views to CCA is maximized. The first loss maximizes the correlation
between the hash centers and learned hash codes. The second loss maximizes the
correlation between the class labels and classification scores. A novel
weighted mean and thresholding based hash center update scheme is proposed for
adapting the hash centers in each epoch. The training loss reaches the
theoretical lower bound of the proposed loss functions, showing that the
correlation coefficients are maximized during training and substantiating the
formation of an efficient feature space for image retrieval. The measured mean
average precision shows that the proposed approach outperforms other
state-of-the-art approaches in both single-labeled and multi-labeled image
datasets.
Authors' comments: Submitted to ICCV 2021
Dmitry S. Filonov, Egor I. Kretov, Sergei A. Kurdjumov, Viacheslav A. Ivanov, Pavel Ginzburg
Material susceptibilities govern interactions between electromagnetic waves and matter and are of crucial importance for basic understanding of natural phenomena and for tailoring practical applications. Here we present a new calibration-free method for relative complex permittivity retrieval, which allows the use of accessible and cheap apparatus and simplifies the measurement process. The method combines advantages of resonant and non-resonant techniques, allowing parameters of liquids and solids to be extracted over a broad frequency range where the material's loss tangent is less than 0.5. The essence of the method is based on exciting magnetic dipole resonance in a spherical sample with variable dimensions. Size-dependent resonant frequencies and quality factors of magnetic dipolar modes are mapped onto the real and imaginary parts of the permittivity by employing Mie theory. Samples consist of liquid solutions enclosed in stretchable covers, which allows the dimensions to be changed continuously. This approach allows tuning the magnetic dipolar resonance over a wide frequency range, effectively making the resonance retrieval method broadband. The technique can be extended to powder and solid materials, depending on their physical parameters, such as granularity and processability.
Kristen Moore, Shenjun Zhong, Zhen He, Torsten Rudolf, Nils Fisher, Brandon Victor, Neha Jindal
In this paper we present the results of our experiments in training and deploying a self-supervised retrieval-based chatbot trained with contrastive learning for assisting customer support agents. In contrast to most existing research papers in this area where the focus is on solving just one component of a deployable chatbot, we present an end-to-end set of solutions to take the reader from unlabelled chatlogs to a deployed chatbot. This set of solutions includes creating a self-supervised dataset and a weakly labelled dataset from chatlogs, as well as a systematic approach to selecting a fixed list of canned responses. We present a hierarchical-based RNN architecture for the response selection model, chosen for its ability to cache intermediate utterance embeddings, which helped to meet deployment inference speed requirements. We compare the performance of this architecture across 3 different learning objectives: self-supervised contrastive learning, binary classification, and multi-class classification. We find that using a self-supervised contrastive learning model outperforms training the binary and multi-class classification models on a weakly labelled dataset. Our results validate that the self-supervised contrastive learning approach can be effectively used for a real-world chatbot scenario.
Kshitij Tayal, Raunak Manekar, Zhong Zhuang, David Yang, Vipin Kumar, Felix Hofmann, Ju Sun
Several deep learning methods for phase retrieval exist, but most of them fail on realistic data without precise support information. We propose a novel method based on single-instance deep generative prior that works well on complex-valued crystal data.
Sanjeev Kumar
Phase can be reliably estimated from a single diffracted intensity image if faithful prior information about the object is available. Examples include amplitude bounds, object support, sparsity in the spatial or a transform domain, the deep image prior, and priors learnt from labelled datasets by a deep neural network. Deep learning facilitates state-of-the-art reconstruction quality but requires a large labelled dataset (ground truth-measurement pairs acquired under the same experimental conditions) for training. To alleviate this data requirement, this letter proposes a zero-shot learning method. The letter demonstrates that the object prior learnt by a deep neural network while being trained for a denoising task can also be utilized for phase retrieval if the diffraction physics is effectively enforced on the network output. The letter additionally demonstrates that incorporating total variation in the proposed zero-shot framework facilitates reconstruction of similar quality in less time (e.g. ~8.5 fold, for a test reported in this letter).
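
The total variation term mentioned in the abstract above is, in its simplest anisotropic form, the sum of absolute differences between neighbouring pixels. The sketch below shows that definition only; whether the letter uses the anisotropic or isotropic variant, and how the term is weighted, are not stated in the abstract and are left as assumptions.

```python
import numpy as np

def total_variation(image):
    """Anisotropic total variation: sum of absolute horizontal and vertical differences."""
    dx = np.abs(np.diff(image, axis=1)).sum()
    dy = np.abs(np.diff(image, axis=0)).sum()
    return float(dx + dy)

flat = np.ones((4, 4))
step = np.array([[0.0, 0.0, 1.0, 1.0],
                 [0.0, 0.0, 1.0, 1.0]])
tv_flat = total_variation(flat)
tv_step = total_variation(step)
```

Penalising this quantity favours piecewise-constant reconstructions, which is one plausible reason it can speed up convergence toward a clean estimate.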
Yifei Yuan, Wai Lam
We study the task of conversational fashion image retrieval via multiturn
natural language feedback. Most previous studies are based on single-turn
settings. Existing models on multiturn conversational fashion image retrieval
have limitations, such as relying on traditional models, which leads to
ineffective performance. We propose a novel framework that can effectively
handle conversational fashion image retrieval with multiturn natural language
feedback texts. One characteristic of the framework is that it searches for
candidate images based on exploitation of the encoded reference image and
feedback text information together with the conversation history. Furthermore,
the image fashion attribute information is leveraged via a mutual attention
strategy. Since there is no existing fashion dataset suitable for the multiturn
setting of our task, we derive a large-scale multiturn fashion dataset via
additional manual annotation efforts on an existing single-turn dataset. The
experiments show that our proposed model significantly outperforms existing
state-of-the-art methods.
Authors' comments: Accepted by SIGIR 2021
Yunhao Li, Yunyi Yang, Xiaojun Quan, Jianxing Yu
Dialogue policy learning, a subtask that determines the content of system
response generation and then the degree of task completion, is essential for
task-oriented dialogue systems. However, the unbalanced distribution of system
actions in dialogue datasets often causes difficulty in learning to generate
desired actions and responses. In this paper, we propose a
retrieve-and-memorize framework to enhance the learning of system actions.
Specifically, we first design a neural context-aware retrieval module to retrieve
multiple candidate system actions from the training set given a dialogue
context. Then, we propose a memory-augmented multi-decoder network to generate
the system actions conditioned on the candidate actions, which allows the
network to adaptively select key information in the candidate actions and
ignore noise. We conduct experiments on the large-scale multi-domain
task-oriented dialogue dataset MultiWOZ 2.0 and MultiWOZ 2.1. Experimental
results show that our method achieves competitive performance among several
state-of-the-art models in the context-to-response generation task.
Authors' comments: Accepted to ACL 2021 Findings