Zakaria Laskar, Juho Kannala
Recent advances in deep learning have led to rapid developments in the field
of image retrieval. However, the best performing architectures incur
significant computational cost. Recent approaches tackle this issue using
knowledge distillation to transfer knowledge from a deeper and heavier
architecture to a much smaller network. In this paper we address knowledge
distillation for metric learning problems. Unlike previous approaches, our
proposed method jointly addresses the following constraints: i) limited queries
to the teacher model, ii) a black-box teacher model with access only to the final
output representation, and iii) a small fraction of the original training data
without any ground-truth labels. In addition, the distillation method does not
require the student and teacher to have the same dimensionality. Addressing these
constraints reduces computation requirements and dependency on large-scale
training datasets, and addresses practical scenarios of limited or partial access
to private data such as teacher models or the corresponding training data/labels. The key idea
is to augment the original training set with additional samples by performing
linear interpolation in the final output representation space. Distillation is
then performed in the joint space of original and augmented teacher-student
sample representations. Results demonstrate that our approach can match
baseline models trained with full supervision. In low training sample settings,
our approach outperforms the fully supervised approach on two challenging image
retrieval datasets, ROxford5k and RParis6k \cite{Roxf} with the least possible
teacher supervision.
Authors' comments: 10 pages, 2 figures. Edited figure 7
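The interpolation idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the mixing weight `lam`, the random pairing, and in particular the similarity-matching loss, which is one common way to compare embeddings of different dimensionality.

```python
import numpy as np

def augment_by_interpolation(emb, partner, lam):
    """Append linearly interpolated rows lam*a + (1-lam)*b to the original rows."""
    mixed = lam * emb + (1.0 - lam) * emb[partner]
    return np.vstack([emb, mixed])

def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def relational_loss(teacher_emb, student_emb):
    """Hypothetical loss: mean squared gap between the two similarity matrices.
    It only compares pairwise similarities, so the dimensions may differ."""
    diff = cosine_similarity_matrix(teacher_emb) - cosine_similarity_matrix(student_emb)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))  # stand-in for teacher outputs (dim 16)
student = rng.normal(size=(8, 4))   # stand-in for student outputs (dim 4)

# Use the same pairing and weight in both spaces so the augmented rows correspond.
partner = rng.permutation(len(teacher))
lam = 0.5
teacher_joint = augment_by_interpolation(teacher, partner, lam)
student_joint = augment_by_interpolation(student, partner, lam)
loss = relational_loss(teacher_joint, student_joint)
```

Note that the augmented teacher rows are computed from outputs already in hand, so in this sketch they cost no additional teacher queries.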
Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas
People from different parts of the globe describe objects and concepts in distinct manners. Visual appearance can thus vary across different geographic locations, which makes location relevant contextual information when analysing visual data. In this work, we address the task of image retrieval related to a given tag conditioned on a certain location on Earth. We present LocSens, a model that learns to rank triplets of images, tags and coordinates by plausibility, and two training strategies to balance the location influence in the final ranking. LocSens learns to fuse textual and location information of multimodal queries to retrieve related images at different levels of location granularity, and successfully utilizes location information to improve image tagging.
Emory Hufbauer, Hana Khamfroush
Twitter, like many social media and data brokering companies, makes its data available through a search API (application programming interface). In addition to filtering results by date and location, researchers can search for tweets with specific content with a boolean text query, using {\it AND}, {\it OR}, and {\it NOT} operators to select the combinations of phrases which must, or must not, appear in matching tweets. This boolean text search system is not at all unique to Twitter and is found in many different contexts, including academic, legal, and medical databases; however, it is stretched to its limits in Twitter's use case because of the relative volume and brevity of tweets. In addition, the semi-automated use of such systems was well studied under the topic of Information Retrieval during the 1980s and 1990s; however, the study of such systems has greatly declined since that time. As such, we propose updated methods for automatically selecting and refining complex boolean search queries that can isolate relevant results with greater specificity and completeness. Furthermore, we present preliminary results of using an optimized query to collect a sample of traffic-incident-related tweets, along with the results of manually classifying and analyzing them.
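To make the boolean query semantics concrete, here is a small sketch of evaluating an AND/OR/NOT query against short texts. The nested-tuple query representation, the `matches` function, and the sample tweets are all hypothetical; this is not Twitter's API or the authors' query optimizer.

```python
def matches(query, text):
    """Evaluate a nested boolean query against one text.

    A query is either a phrase (str) or a tuple (operator, operand, ...),
    where the operator is "AND", "OR", or "NOT".
    """
    if isinstance(query, str):
        # Phrase match: case-insensitive substring test.
        return query.lower() in text.lower()
    op, *args = query
    if op == "AND":
        return all(matches(a, text) for a in args)
    if op == "OR":
        return any(matches(a, text) for a in args)
    if op == "NOT":
        (arg,) = args  # NOT takes exactly one operand
        return not matches(arg, text)
    raise ValueError(f"unknown operator: {op!r}")

tweets = [
    "Crash on I-75 north, two lanes blocked",
    "Market crash wipes out gains",
    "Accident near Main St, avoid the area",
]
# (crash OR accident) AND NOT market
query = ("AND", ("OR", "crash", "accident"), ("NOT", "market"))
selected = [t for t in tweets if matches(query, t)]
```

Here the NOT clause removes the finance-related tweet, showing how exclusion terms can raise the specificity of a query.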
Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano et al.
Search in social networks such as Facebook poses different challenges than in
classical web search: besides the query text, it is important to take into
account the searcher's context to provide relevant results. Their social graph
is an integral part of this context and is a unique aspect of Facebook search.
While embedding-based retrieval (EBR) has been applied in web search engines for
years, Facebook search was still mainly based on a Boolean matching model. In
this paper, we discuss the techniques for applying EBR to a Facebook Search
system. We introduce the unified embedding framework developed to model
semantic embeddings for personalized search, and the system to serve
embedding-based retrieval in a typical search system based on an inverted
index. We discuss various tricks and experiences on end-to-end optimization of
the whole system, including ANN parameter tuning and full-stack optimization.
Finally, we present our progress on two selected advanced topics about
modeling. We evaluated EBR on verticals for Facebook Search with significant
metric gains observed in online A/B experiments. We believe this paper will
provide useful insights and experiences to help people develop
embedding-based retrieval systems in search engines.
Authors' comments: 9 pages, 3 figures, 3 tables, to be published in KDD '20
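As a rough illustration of what embedding-based retrieval computes, the sketch below ranks documents by cosine similarity to a query embedding using exact brute-force search. A production system would substitute an approximate nearest neighbor (ANN) index for the exhaustive scan; the function name and toy vectors are assumptions, not details from the paper.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k):
    """Return indices and scores of the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity per document
    order = np.argsort(-scores)[:k]   # highest scores first
    return order, scores[order]

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy document embeddings
query = np.array([1.0, 0.1])                           # toy query embedding
indices, scores = top_k_cosine(query, docs, k=2)
```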
Seyedehsara Nayer, Namrata Vaswani
This work studies the Low Rank Phase Retrieval (LRPR) problem: recover an $n
\times q$ rank-$r$ matrix $X^*$ from $y_k = |A_k^\top x^*_k|$, $k = 1, 2, \dots, q$,
when each $y_k$ is an $m$-length vector containing independent phaseless linear
projections of $x^*_k$. The different matrices $A_k$ are i.i.d. and each
contains i.i.d. standard Gaussian entries. We obtain an improved guarantee for
AltMinLowRaP, which is an Alternating Minimization solution to LRPR that was
introduced and studied in our recent work. As long as the right singular
vectors of $X^*$ satisfy the incoherence assumption, we can show that the
AltMinLowRaP estimate converges geometrically to $X^*$ if the total number of
measurements $mq \gtrsim nr^2 (r + \log(1/\epsilon))$. In addition, we also
need $m \gtrsim \max(r, \log q, \log n)$ because of the specific asymmetric
nature of our problem. Compared to our recent work, we improve the sample
complexity of the AltMin iterations by a factor of $r^2$, and that of the
initialization by a factor of $r$. We also extend our result to the noisy case;
we prove stability to corruption by small additive noise.
Authors' comments: Revised for IEEE Trans. Info. Th
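The measurement model stated in the abstract above can be simulated in a few lines. This sketch only generates data $y_k = |A_k^\top x^*_k|$ for a random rank-$r$ matrix; it does not implement AltMinLowRaP, and the function name and problem sizes are arbitrary choices for illustration.

```python
import numpy as np

def lrpr_measurements(X, m, rng):
    """Generate phaseless measurements y_k = |A_k^T x_k| for each column x_k of X.

    Returns A with shape (q, n, m), holding one i.i.d. standard Gaussian
    matrix per column, and Y with shape (m, q).
    """
    n, q = X.shape
    A = rng.standard_normal((q, n, m))
    # Y[:, k] = |A[k].T @ X[:, k]|
    Y = np.abs(np.einsum("knm,nk->mk", A, X))
    return A, Y

rng = np.random.default_rng(0)
n, q, r, m = 20, 10, 2, 15
# A rank-r matrix built as a product of n x r and r x q Gaussian factors.
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, q))
A, Y = lrpr_measurements(X_star, m, rng)
```

Because only magnitudes are observed, each column $x^*_k$ is identifiable at best up to a sign: flipping the sign of any column leaves $Y$ unchanged.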
Jie Yang, J. Pedro F. Nunes, Kathryn Ledbetter, Elisa Biasin, Martin Centurion, Zhijiang Chen, Amy A. Cordones, Christ Crissman et al.
Electron scattering on liquid samples has been enabled recently by the development of ultrathin liquid sheet technologies. The data treatment of liquid-phase electron scattering has been mostly reliant on methodologies developed for gas electron diffraction, in which theoretical inputs and empirical fittings are often needed to account for the atomic form factor and remove the inelastic scattering background. The accuracy and impact of these theoretical and empirical inputs have not been benchmarked for liquid-phase electron scattering data. In this work, we present an alternative data treatment method that requires neither theoretical inputs nor empirical fittings. The merits of this new method are illustrated through the retrieval of real-space molecular structure from experimental electron scattering patterns of liquid water, carbon tetrachloride, chloroform, and dichloromethane.
Fan Wu, Patrick Rebeschini
We consider the problem of reconstructing an $n$-dimensional $k$-sparse signal from a set of noiseless magnitude-only measurements. Formulating the problem as an unregularized empirical risk minimization task, we study the sample complexity performance of gradient descent with Hadamard parametrization, which we call Hadamard Wirtinger flow (HWF). Provided knowledge of the signal sparsity $k$, we prove that a single step of HWF is able to recover the support from $k(x^*_{max})^{-2}$ (modulo a logarithmic term) samples, where $x^*_{max}$ is the largest component of the signal in magnitude. This support recovery procedure can be used to initialize existing reconstruction methods and yields algorithms with total runtime proportional to the cost of reading the data and improved sample complexity, which is linear in $k$ when the signal contains at least one large component. We numerically investigate the performance of HWF at convergence and show that, while not requiring any explicit form of regularization or knowledge of $k$, HWF adapts to the signal sparsity and reconstructs sparse signals with fewer measurements than existing gradient-based methods.
Ashlee Milton, Maria Soledad Pera
Evaluation of information retrieval systems (IRS) is a prominent topic among
information retrieval researchers--mainly directed at a general population.
Children require unique IRS and, by extension, different ways to evaluate these
systems, but despite being a large population of IRS users they have largely been
ignored on the evaluation front. In this position paper, we explore many
perspectives that must be considered when evaluating IRS; we specifically discuss
problems faced by researchers who work with IRS for children, including lack of
evaluation frameworks, limitations of data, and lack of user judgment understanding.
Authors' comments: Accepted at the 4th International and Interdisciplinary Perspectives
on Children & Recommender and Information Retrieval Systems (KidRec '20),
co-located with the 19th ACM International Conference on Interaction Design
and Children (IDC '20), https://kidrec.github.io/
Seito Kasai, Yuchi Ishikawa, Masaki Hayashi, Yoshimitsu Aoki, Kensho Hara, Hirokatsu Kataoka
In this paper, we present a framework that jointly retrieves and
spatiotemporally highlights actions in videos by enhancing current deep
cross-modal retrieval methods. Our work takes on the novel task of action
highlighting, which visualizes where and when actions occur in an untrimmed
video setting. Action highlighting is a fine-grained task, compared to
conventional action recognition tasks which focus on classification or
window-based localization. Leveraging weak supervision from annotated captions,
our framework acquires spatiotemporal relevance maps and generates local
embeddings which relate to the nouns and verbs in captions. Through
experiments, we show that our model generates various maps conditioned on
different actions, whereas conventional visual reasoning methods only go as
far as to show a single deterministic saliency map. Also, our model improves
retrieval recall over our baseline without alignment by 2-3% on the MSR-VTT
dataset.
Authors' comments: Accepted to ICIP 2020
Wenhao Yu, Lingfei Wu, Qingkai Zeng, Shu Tao, Yu Deng, Meng Jiang
Answer retrieval is the task of finding the most aligned answer from a large set of
candidates given a question. Learning vector representations of
questions/answers is the key factor. Question-answer alignment and
question/answer semantics are two important signals for learning the
representations. Existing methods learned semantic representations with dual
encoders or dual variational auto-encoders. The semantic information was
learned from language models or question-to-question (answer-to-answer)
generative processes. However, the alignment and semantics were too separate to
capture the aligned semantics between question and answer. In this work, we
propose to cross variational auto-encoders by generating questions with aligned
answers and generating answers with aligned questions. Experiments show that
our method outperforms the state-of-the-art answer retrieval method on SQuAD.
Authors' comments: Accepted to ACL 2020
Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, Jamie Callan
This paper presents CLEAR, a retrieval model that seeks to complement
classical lexical exact-match models such as BM25 with semantic matching
signals from a neural embedding matching model. CLEAR explicitly trains the
neural embedding to encode language structures and semantics that lexical
retrieval fails to capture with a novel residual-based embedding learning
method. Empirical evaluations demonstrate the advantages of CLEAR over
state-of-the-art retrieval models, and that it can substantially improve the
end-to-end accuracy and efficiency of reranking pipelines.
Authors' comments: ECIR 2021
Zhuolin Jiang, Amro El-Jaroudi, William Hartmann, Damianos Karakos, Lingjun Zhao
Multiple neural language models have been developed recently, e.g., BERT and XLNet, and have achieved impressive results in various NLP tasks including sentence classification, question answering and document ranking. In this paper, we explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents in the task of cross-lingual information retrieval. A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision, using home-made CLIR training data derived from parallel corpora. Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar
Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge. The first challenge concerns scalability: with a large number of labels, standard losses are difficult to optimise even on a single example. The second challenge concerns uniformity: one ideally wants good performance on each subpopulation. While several solutions have been proposed to address the first challenge, the second challenge has received relatively less attention. In this paper, we propose doubly-stochastic mining (S2M), a stochastic optimization technique that addresses both challenges. In each iteration of S2M, we compute a per-example loss based on a subset of hardest labels, and then compute the minibatch loss based on the hardest examples. We show theoretically and empirically that by focusing on the hardest examples, S2M ensures that all data subpopulations are modelled well.
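The two-level mining step described above can be sketched as follows. This is an illustrative reading of the abstract only: it assumes a precomputed matrix of per-label losses and uses plain averages over the hardest entries, whereas the paper's actual loss and sampling scheme are not specified here. The names `label_losses`, `k_labels`, and `k_examples` are hypothetical.

```python
import numpy as np

def hardest_mining_loss(label_losses, k_labels, k_examples):
    """Two-level hard mining over a (batch, labels) matrix of losses.

    Step 1: per-example loss = mean over the k_labels largest label losses.
    Step 2: minibatch loss  = mean over the k_examples largest per-example losses.
    """
    hardest_labels = np.sort(label_losses, axis=1)[:, -k_labels:]
    per_example = hardest_labels.mean(axis=1)
    hardest_examples = np.sort(per_example)[-k_examples:]
    return float(hardest_examples.mean()), per_example

label_losses = np.array([
    [0.1, 0.9, 0.5],
    [0.2, 0.1, 0.3],
    [1.0, 0.0, 2.0],
])
batch_loss, per_example = hardest_mining_loss(label_losses, k_labels=2, k_examples=2)
```

In this toy batch the second example is the easiest, so it does not contribute to the minibatch loss.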
Jiawang Bai, Bin Chen, Yiming Li, Dongxian Wu, Weiwei Guo, Shu-tao Xia, En-hui Yang
The deep hashing based retrieval method is widely adopted in large-scale
image and video retrieval. However, there is little investigation on its
security. In this paper, we propose a novel method, dubbed deep hashing
targeted attack (DHTA), to study the targeted attack on such retrieval.
Specifically, we first formulate the targeted attack as a point-to-set
optimization, which minimizes the average distance between the hash code of an
adversarial example and those of a set of objects with the target label. Then
we design a novel component-voting scheme to obtain an anchor code as the
representative of the set of hash codes of objects with the target label, whose
optimality guarantee is also theoretically derived. To balance the performance
and perceptibility, we propose to minimize the Hamming distance between the
hash code of the adversarial example and the anchor code under the
$\ell^\infty$ restriction on the perturbation. Extensive experiments verify
that DHTA is effective in attacking both deep hashing based image retrieval and
video retrieval.
Authors' comments: Accepted by ECCV 2020 as Oral
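A bitwise majority vote is one natural way to picture the anchor code described above; the sketch below is an assumption-laden illustration rather than the authors' exact component-voting scheme. Codes are taken to be vectors in $\{-1, +1\}^K$, and ties are broken towards $+1$, which is an arbitrary choice.

```python
import numpy as np

def component_vote(codes):
    """Pick each bit of the anchor by majority vote over the rows of codes."""
    votes = codes.sum(axis=0)
    return np.where(votes >= 0, 1, -1)  # ties go to +1 (arbitrary)

def hamming(a, b):
    """Number of positions where two codes differ."""
    return int(np.sum(a != b))

# Toy hash codes of three objects that share the target label.
codes = np.array([
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1,  1,  1, -1],
])
anchor = component_vote(codes)
total_distance = sum(hamming(anchor, c) for c in codes)
```

Since Hamming distance decomposes over bits, choosing the majority value for each bit minimizes the summed (and hence average) distance to the set.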
Stefan Steinerberger
Phase retrieval is concerned with recovering a function $f$ from the absolute value of its Fourier transform $|\widehat{f}|$. We study the stability properties of this problem in Lebesgue spaces. Our main result shows that $$ \| f-g\|_{L^2(\mathbb{R}^n)} \leq 2\cdot \| |\widehat{f}| - |\widehat{g}| \|_{L^2(\mathbb{R}^n)} + h_f\left( \|f-g\|_{L^p(\mathbb{R}^n)}\right) + J(\widehat{f}, \widehat{g}),$$ where $1 \leq p < 2$, $h_f$ is an explicit nonlinear function depending on the smoothness of $f$ and $J$ is an explicit term capturing the invariance under translations. A noteworthy aspect is that the stability is phrased in terms of $L^p$ for $1 \leq p < 2$ although, usually, $L^p$ cannot be used to control $L^2$; the stability estimate thus has the flavor of an inverse H\"older inequality. It seems conceivable that the estimate is optimal up to constants.
Mikaela Angelina Uy, Jingwei Huang, Minhyuk Sung, Tolga Birdal, Leonidas Guibas
We introduce a new problem of retrieving 3D models that are deformable to a
given query shape and present a novel deep deformation-aware embedding to solve
this retrieval task. 3D model retrieval is a fundamental operation for
recovering a clean and complete 3D model from a noisy and partial 3D scan.
However, given a finite collection of 3D shapes, even the closest model to a
query may not be satisfactory. This motivates us to apply 3D model deformation
techniques to adapt the retrieved model so as to better fit the query. Yet,
certain restrictions are enforced in most 3D deformation techniques to preserve
important features of the original model that prevent a perfect fitting of the
deformed model to the query. This gap between the deformed model and the query
induces asymmetric relationships among the models, which cannot be handled by
typical metric learning techniques. Thus, to retrieve the best models for
fitting, we propose a novel deep embedding approach that learns the asymmetric
relationships by leveraging location-dependent egocentric distance fields. We
also propose two strategies for training the embedding network. We demonstrate
that both of these approaches outperform other baselines in our experiments
with both synthetic and real data. Our project page can be found at
https://deformscan2cad.github.io/.
Authors' comments: Accepted for publication at ECCV 2020. Project page under
https://deformscan2cad.github.io
Joanna K. Barstow, Kevin Heng
Spectral retrieval has long been a powerful tool for interpreting planetary
remote sensing observations. Flexible, parameterised, agnostic models are
coupled with inversion algorithms in order to infer atmospheric properties
directly from observations, with minimal reliance on physical assumptions. This
approach, originally developed for application to Earth satellite data and
subsequently extended to observations of other Solar System planets, has been recently and
successfully applied to transit, eclipse and phase curve spectra of transiting
exoplanets. In this review, we present the current state-of-the-art in terms of
our ability to accurately retrieve information about atmospheric chemistry,
temperature, clouds and spatial variability; we discuss the limitations of
this, both in the available data and modelling strategies used; and we
recommend approaches for future improvement.
Authors' comments: 30 pages, 6 figures. Accepted by Space Science Reviews
Niloofar Tavakolian, Azadeh Nazemi, Donal Fitzpatrick
Information is frequently retrieved from valid personal ID cards by the
authorised organisation for different purposes. Successful
information retrieval (IR) depends on the accuracy and speed of the process. A
process which necessitates a long time to respond is frustrating for both sides
in the exchange of data. This paper aims to propose a series of
state-of-the-art methods for the journey of an Identification card (ID) from
the scanning or capture phase to the point before Optical character recognition
(OCR). The key factors for this proposal are the accuracy and speed of the
process during the journey. The experimental results of this research show
that utilising methods based on deep learning, such as the Efficient and
Accurate Scene Text (EAST) detector and a Deep Neural Network (DNN) for face
detection, instead of traditional methods increases efficiency considerably.
Authors' comments: 6 pages, 10 figures, conference
Yujie Zhong, Relja Arandjelović, Andrew Zisserman
The objective of this work is to learn a compact embedding of a set of
descriptors that is suitable for efficient retrieval and ranking, whilst
maintaining discriminability of the individual descriptors. We focus on a
specific example of this general problem -- that of retrieving images
containing multiple faces from a large scale dataset of images. Here the set
consists of the face descriptors in each image, and given a query for multiple
identities, the goal is then to retrieve, in order, images which contain all
the identities, all but one, etc.
To this end, we make the following contributions: first, we propose a CNN
architecture -- {\em SetNet} -- to achieve the objective: it learns face
descriptors and their aggregation over a set to produce a compact fixed length
descriptor designed for set retrieval, and the score of an image is a count of
the number of identities that match the query; second, we show that this
compact descriptor has minimal loss of discriminability up to two faces per
image, and degrades slowly after that -- far exceeding a number of baselines;
third, we explore the speed vs.\ retrieval quality trade-off for set retrieval
using this compact descriptor; and, finally, we collect and annotate a large
dataset of images containing varying numbers of celebrities, which we use for
evaluation and release publicly.
Authors' comments: 20 pages
Yaotian Wang, Xiaohang Sun, Jason W. Fleischer
Recovering a signal from its Fourier intensity underlies many important applications, including lensless imaging and imaging through scattering media. Conventional algorithms for retrieving the phase suffer when noise is present but display global convergence when given clean data. Neural networks have been used to improve algorithm robustness, but efforts to date are sensitive to initial conditions and give inconsistent performance. Here, we combine iterative methods from phase retrieval with image statistics from deep denoisers, via regularization-by-denoising. The resulting methods inherit the advantages of each approach and outperform other noise-robust phase retrieval algorithms. Our work paves the way for hybrid imaging methods that integrate machine-learned constraints in conventional algorithms.