Ahmed Taha, Yi-Ting Chen, Teruhisa Misu, Abhinav Shrivastava, Larry Davis
We introduce an unsupervised formulation to estimate heteroscedastic uncertainty in retrieval systems. We propose an extension to triplet loss that models data uncertainty for each input. Besides improving performance, our formulation models local noise in the embedding space. It quantifies input uncertainty and thus enhances interpretability of the system. This helps identify noisy observations in query and search databases. Evaluation on both image and video retrieval applications highlights the utility of our approach. We highlight our efficiency in modeling local noise using two real-world datasets: the Clothing1M and Honda Driving datasets. Qualitative results illustrate our ability to identify confusing scenarios in various domains. Uncertainty learning also enables data cleaning by detecting noisy training labels.
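The abstract above does not give the loss itself, so the following is only a minimal sketch of one common way to attach a per-input uncertainty term to a triplet loss. The attenuation form (scaling by the inverse predicted variance plus a log-variance penalty), the function names, the margin value and the toy vectors are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on squared Euclidean distances."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(d_pos - d_neg + margin, 0.0)

def attenuated_triplet_loss(anchor, positive, negative, log_var, margin=0.2):
    """Hypothetical uncertainty-attenuated variant (an assumption, not the paper's loss).

    log_var stands in for a per-input log-variance that a network would predict.
    A large variance down-weights the triplet term but pays a 0.5 * log_var penalty.
    """
    base = triplet_loss(anchor, positive, negative, margin)
    return 0.5 * float(np.exp(-log_var)) * base + 0.5 * log_var

# Toy "hard" triplet: the positive is far from the anchor, the negative is close.
a = np.array([0.0, 0.0])
p = np.array([2.0, 0.0])
n = np.array([0.1, 0.0])
confident = attenuated_triplet_loss(a, p, n, log_var=0.0)
uncertain = attenuated_triplet_loss(a, p, n, log_var=1.0)
print(confident, uncertain)
```

For this hard triplet, declaring the input uncertain lowers the loss, which is the mechanism by which such a term can flag noisy observations.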
Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar
Cross-modal retrieval methods have been significantly improved in recent years
with the use of deep neural networks and large-scale annotated datasets such as
ImageNet and Places. However, collecting and annotating such datasets requires
a tremendous amount of human effort and, besides, their annotations are usually
limited to discrete sets of popular visual classes that may not be
representative of the richer semantics found on large-scale cross-modal
retrieval datasets. In this paper, we present a self-supervised cross-modal
retrieval framework that leverages as training data the correlations between
images and text on the entire set of Wikipedia articles. Our method consists of
training a CNN to predict: (1) the semantic context of the article in which an
image is most likely to appear as an illustration (global context), and (2)
the semantic context of its caption (local context). Our experiments
demonstrate that the proposed method is not only capable of learning
discriminative visual representations for solving vision tasks like image
classification and object detection, but that the learned representations are
better for cross-modal retrieval when compared to supervised pre-training of
the network on the ImageNet dataset.
Authors' comments: arXiv admin note: text overlap with arXiv:1807.02110
Ming Zhang, Xuefei Zhe, Le Ou-Yang, Shifeng Chen, Hong Yan
Deep hashing models have been proposed as an efficient method for large-scale similarity search. However, most existing deep hashing methods only utilize fine-level labels for training while ignoring the natural semantic hierarchy structure. This paper presents an effective method that preserves the classwise similarity of full-level semantic hierarchy for large-scale image retrieval. Experiments on two benchmark datasets show that our method helps improve the fine-level retrieval performance. Moreover, with the help of the semantic hierarchy, it can produce significantly better binary codes for hierarchical retrieval, which indicates its potential of providing more user-desired retrieval results.
Ahmed Taha, Yi-Ting Chen, Xitong Yang, Teruhisa Misu, Larry Davis
We cast visual retrieval as a regression problem by posing triplet loss as a regression loss. This enables epistemic uncertainty estimation using dropout as a Bayesian approximation framework in retrieval. Accordingly, Monte Carlo (MC) sampling is leveraged to boost retrieval performance. Our approach is evaluated on two applications: person re-identification and autonomous car driving. Results comparable to the state of the art are achieved on multiple datasets for the former application. We leverage the Honda driving dataset (HDD) for the autonomous car driving application. It provides multiple modalities and similarity notions for ego-motion action understanding. Hence, we present a multi-modal conditional retrieval network. It disentangles embeddings into separate representations to encode different similarities. This form of joint learning eliminates the need to train multiple independent networks without any performance degradation. Quantitative evaluation highlights the competence of our approach, achieving a 6% improvement in a highly uncertain environment.
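The MC sampling idea mentioned above can be sketched as follows: dropout stays active at inference time, several stochastic embeddings are drawn for the same input, their mean serves as the retrieval embedding and their spread as an uncertainty estimate. The single linear layer, the function name `mc_dropout_embed` and all sizes are assumptions for illustration only; the abstract does not describe the actual network.

```python
import numpy as np

def mc_dropout_embed(x, weights, rng, n_samples=20, drop_prob=0.5):
    """Average several dropout-perturbed embeddings of one input.

    Returns the mean embedding and a scalar uncertainty, taken here as the
    per-dimension variance across samples, averaged over dimensions.
    """
    samples = []
    for _ in range(n_samples):
        keep = rng.random(x.shape) >= drop_prob      # random keep-mask
        dropped = x * keep / (1.0 - drop_prob)       # inverted dropout scaling
        samples.append(dropped @ weights)            # toy linear "network"
    samples = np.stack(samples)
    return samples.mean(axis=0), float(samples.var(axis=0).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=32)               # stand-in for an input feature vector
w = rng.normal(size=(32, 8))          # stand-in for trained weights
embedding, uncertainty = mc_dropout_embed(x, w, rng)
print(embedding.shape, uncertainty)
```

With `drop_prob=0.0` every sample is identical, so the estimated uncertainty collapses to zero, which is a quick sanity check on the sketch.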
Kai Bai, Zhen Peng, Hong-Gang Luo, Jun-Hong An
Quantum metrology employs quantum effects to attain a measurement precision
surpassing the limit achievable in classical physics. However, it was
previously found that the precision returns to the shot-noise limit (SNL) from the
ideal Zeno limit (ZL) due to the photon loss in quantum metrology based on
the Mach-Zehnder interferometer. Here, we find that not only can the SNL be beaten,
but the ZL can also be asymptotically recovered in the long-encoding-time condition
when the photon dissipation is exactly studied in its inherent non-Markovian
manner. Our analysis reveals that it is due to the formation of a bound state
of the photonic system and its dissipative noise. Highlighting the microscopic
mechanism of the dissipative noise on the quantum optical metrology, our result
supplies a guideline to realize the ultrasensitive measurement in practice by
forming the bound state in the setting of reservoir engineering.
Authors' comments: To appear in Phys. Rev. Lett. 6 pages and 3 figures in the main text.
3 pages and 1 figure in the supplemental material
Icaro Cavalcante Dourado, Daniel Carlos Guimarães Pedronette, Ricardo da Silva Torres
This paper presents a robust and comprehensive graph-based rank aggregation approach, used to combine results of isolated ranker models in retrieval tasks. The method follows an unsupervised scheme, which is independent of how the isolated ranks are formulated. Our approach is able to combine arbitrary models, defined in terms of different ranking criteria, such as those based on textual, image or hybrid content representations. We reformulate the ad-hoc retrieval problem as a document retrieval based on fusion graphs, which we propose as a new unified representation model capable of merging multiple ranks and expressing inter-relationships of retrieval results automatically. By doing so, we claim that the retrieval system can benefit from learning the manifold structure of datasets, thus leading to more effective results. Another contribution is that our graph-based aggregation formulation, unlike existing approaches, allows for encapsulating contextual information encoded from multiple ranks, which can be directly used for ranking, without further computations and post-processing steps over the graphs. Based on the graphs, a novel similarity retrieval score is formulated using an efficient computation of minimum common subgraphs. Finally, another benefit over existing approaches is the absence of hyperparameters. A comprehensive experimental evaluation was conducted considering diverse well-known public datasets, composed of textual, image, and multimodal documents. Performed experiments demonstrate that our method reaches top performance, yielding better effectiveness scores than state-of-the-art baseline methods and promoting large gains over the rankers being fused, thus demonstrating the successful capability of the proposal in representing queries based on a unified graph-based model of rank fusions.
Zhihao Cao, Shaomin Mu, Yongyu Xu, Mengping Dong
An image retrieval method based on a convolutional neural network and dimension
reduction is proposed in this paper. A convolutional neural network is used to
extract high-level features of images, and to solve the problem that the
extracted feature dimensions are too high and have strong correlation,
multilinear principal component analysis is used to reduce the dimension of
features. The features after dimension reduction are binary hash coded for fast
image retrieval. Experiments show that the method proposed in this paper has
better retrieval effect than the retrieval method based on principal component
analysis on the e-commerce image datasets.
Authors' comments: 2018 International Conference on Security, Pattern Analysis, and
Cybernetics (SPAC 2018)
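The pipeline described in this abstract (feature extraction, dimension reduction, binary hash coding, fast retrieval) can be sketched as below. Two substitutions are assumptions made to keep the example self-contained: random vectors stand in for CNN features, and ordinary PCA via SVD replaces the multilinear PCA used in the paper. Sign thresholding as the hash function is likewise an assumption, since the abstract does not specify the coding rule.

```python
import numpy as np

def fit_pca(features, n_bits):
    """Return the feature mean and the top n_bits principal directions."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_bits]

def hash_codes(features, mean, components):
    """Project onto the principal directions and binarize by sign."""
    projected = (features - mean) @ components.T
    return (projected > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable"), dists

rng = np.random.default_rng(0)
db_features = rng.normal(size=(200, 64))   # stand-in for CNN features
mean, comps = fit_pca(db_features, n_bits=16)
codes = hash_codes(db_features, mean, comps)
order, dists = hamming_rank(codes[3], codes)
print(order[:5], dists[order[:5]])
```

Comparing short binary codes by Hamming distance is what makes the retrieval step fast; the reduction step controls the code length.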
Rima Alaifari, Matthias Wellershoff
Phase retrieval refers to the problem of recovering some signal (which is
often modelled as an element of a Hilbert space) from phaseless measurements.
It has been shown that in the deterministic setting phase retrieval from frame
coefficients is always unstable in infinite-dimensional Hilbert spaces [7] and
possibly severely ill-conditioned in finite-dimensional Hilbert spaces [7].
Recently, it has also been shown that phase retrieval from measurements
induced by the Gabor transform with Gaussian window function is stable under a
more relaxed semi-global phase recovery regime based on atoll functions [1].
In finite dimensions, we present first evidence that this semi-global
reconstruction regime allows one to do phase retrieval from measurements of
bandlimited signals induced by the discrete Gabor transform in such a way that
the corresponding stability constant only scales like a low order polynomial in
the space dimension. To this end, we utilise reconstruction formulae which have
become common tools in recent years [6,12,18,20].
Authors' comments: 24 pages, 4 figures; Some small corrections of typos and minor
mathematical errors. Added Examples 3.8 and 3.9
Rishab Sharma, Anirudha Vishvakarma
In this paper, we propose a deep convolutional neural network for learning
the embeddings of images in order to capture the notion of visual similarity.
We present a deep siamese architecture that, when trained on positive and
negative pairs of images, learns an embedding that accurately approximates the
ranking of images in order of visual similarity. We also implement a
novel loss calculation method using an angular loss metric based on the
problem's requirements. The final embedding of the image is a combined
representation of the lower and top-level embeddings. We used a fractional
distance matrix to calculate the distance between the learned embeddings in
n-dimensional space. In the end, we compare our architecture with other
existing deep architectures and go on to demonstrate the superiority of our
solution in terms of image retrieval by testing the architecture on four
datasets. We also show how our suggested network is better than the other
traditional deep CNNs used for capturing fine-grained image similarities by
learning an optimum embedding.
Authors' comments: 9 pages, 5 figures
Umberto Martínez-Peñas
We consider information-theoretical private information retrieval (PIR) from a coded database with colluding servers. We target, for the first time, locally repairable storage codes (LRCs). We consider any number of local groups $ g $, locality $ r $, local distance $ \delta $ and dimension $ k $. Our main contribution is a PIR scheme for maximally recoverable (MR) LRCs based on linearized Reed--Solomon codes, which achieve the smallest field sizes among MR-LRCs for many parameter regimes. In our scheme, nodes are identified with codeword symbols and servers are identified with local groups of nodes. Only locally non-redundant information is downloaded from each server, that is, only $ r $ nodes (out of $ r+\delta-1 $) are downloaded per server. The PIR scheme achieves the (download) rate $ R = (N - k - rt + 1)/N $, where $ N = gr $ is the length of the MDS code obtained after removing the local parities, and for any $ t $ colluding servers such that $ k + rt \leq N $. For an unbounded number of stored files, the obtained rate is strictly larger than those of known PIR schemes that work for any MDS code. Finally, the obtained PIR scheme can also be adapted when communication between the user and each server is performed via linear network coding, achieving the same rate as previous PIR schemes for this scenario but with polynomial finite field sizes, instead of exponential. Our rates are equal to those of PIR schemes for Reed--Solomon codes, but Reed--Solomon codes are incompatible with the MR-LRC property or linear network coding, thus our PIR scheme is less restrictive in its applications.
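As a worked instance of the rate formula stated above, $ R = (N - k - rt + 1)/N $ with $ N = gr $ and the condition $ k + rt \leq N $, the short sketch below evaluates the rate exactly. The function name `pir_rate` and the parameter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

def pir_rate(g: int, r: int, k: int, t: int) -> Fraction:
    """Download rate R = (N - k - r*t + 1) / N with N = g*r, as in the abstract."""
    n = g * r
    if k + r * t > n:
        raise ValueError("the formula is stated only for k + r*t <= N")
    return Fraction(n - k - r * t + 1, n)

# Hypothetical parameters: 5 local groups, locality 2, dimension 4, 2 colluding servers.
rate = pir_rate(g=5, r=2, k=4, t=2)
print(rate)  # 3/10, since N = 10 and N - k - r*t + 1 = 10 - 4 - 4 + 1 = 3
```

Each additional colluding server costs $ r/N $ of rate, which is visible directly by incrementing `t` in the call.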
Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas
Self-supervised learning from multimodal image and text data allows deep
neural networks to learn powerful features without the need for human-annotated
data. Web and Social Media platforms provide a virtually unlimited amount of
this multimodal data. In this work we propose to exploit this freely available
data to learn a multimodal image and text embedding, aiming to leverage the
semantic knowledge learnt in the text domain and transfer it to a visual model
for semantic image retrieval. We demonstrate that the proposed pipeline can
learn from images with associated text without supervision and analyze the
semantic structure of the learnt joint image and text embedding space. We
perform a thorough analysis and performance comparison of five different state
of the art text embeddings in three different benchmarks. We show that the
embeddings learnt with Web and Social Media data perform competitively
with supervised methods in the text-based image retrieval task, and we clearly
outperform the state of the art on the MIRFlickr dataset when training on the
target data. Further, we demonstrate how semantic multimodal image retrieval
can be performed using the learnt embeddings, going beyond classical
instance-level retrieval problems. Finally, we present a new dataset,
InstaCities1M, composed of Instagram images and their associated texts that can
be used for fair comparison of image-text embeddings.
Authors' comments: Submitted to Multi-Modal Scene Understanding. arXiv admin note:
substantial text overlap with arXiv:1808.06368
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel tower, and ask the system to find images which are visually similar but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we learn a similarity metric between a target image and a source image plus source text, i.e., an embedding and composition function such that the target image feature is close to the source image plus text composition feature. We propose a new way to combine image and text using such a function that is designed for the retrieval task. We show this outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset we create based on CLEVR. We also show that our approach can be used to classify input queries, in addition to image retrieval.
Keping Bi, Qingyao Ai, W. Bruce Croft
As more and more search traffic comes from mobile phones, intelligent assistants, and smart-home devices, new challenges (e.g., limited presentation space) and opportunities come up in information retrieval. Previously, an effective technique, relevance feedback (RF), has rarely been used in real search scenarios due to the overhead of collecting users' relevance judgments. However, since users tend to interact more with the search results shown on the new interfaces, it becomes feasible to obtain users' assessments on a few results during each interaction. This makes iterative relevance feedback (IRF) techniques look promising today. IRF has not been studied systematically in the new search scenarios and its effectiveness is mostly unknown. In this paper, we re-visit IRF and extend it with RF models proposed in recent years. We conduct extensive experiments to analyze and compare IRF with the standard top-k RF framework on document and passage retrieval. Experimental results show that IRF is at least as effective as the standard top-k RF framework for documents and much more effective for passages. This indicates that IRF for passage retrieval has huge potential.
Tomasz Szołdra, Krzysztof Sacha, Arkadiusz Kosior
Ultracold atoms in optical lattices form a clean quantum simulator platform
which can be utilized to examine topological phenomena and test exotic
topological materials. Here we propose an experimental scheme to measure the
Chern numbers of two-dimensional multiband topological insulators with bosonic
atoms. We show how to extract the topological invariants out of a sequence of
time-of-flight images by applying a phase retrieval algorithm to matter waves.
We illustrate advantages of using bosonic atoms as well as efficiency and
robustness of the method with two prominent examples: the Harper-Hofstadter
model with an arbitrary commensurate magnetic flux and the Haldane model on a
brick-wall lattice.
Authors' comments: Version accepted for publication in Phys. Rev. A (11 pages, 8
figures)
Ahmad S. Tarawneh, Ahmad B. A. Hassanat, Ceyhun Celik, Dmitry Chetverikov, M. Sohel Rahman, Chaman Verma
Facial image retrieval is a challenging task since faces have many similar features (areas), which makes it difficult for retrieval systems to distinguish faces of different people. With the advent of deep learning, deep networks are often applied to extract powerful features that are used in many areas of computer vision. This paper investigates the application of different deep learning models for face image retrieval, namely, Alexlayer6, Alexlayer7, VGG16layer6, VGG16layer7, VGG19layer6, and VGG19layer7, with two types of dictionary learning techniques, namely $K$-means and $K$-SVD. We also investigate some coefficient learning techniques such as the Homotopy, Lasso, Elastic Net and SSF and their effect on the face retrieval system. The comparative results of the experiments conducted on three standard face image datasets show that the best performers for face image retrieval are Alexlayer7 with $K$-means and SSF, Alexlayer6 with $K$-SVD and SSF, and Alexlayer6 with $K$-means and SSF. The APR and ARR of these methods were further compared to some of the state-of-the-art methods based on local descriptors. The experimental results show that deep learning outperforms most of those methods and therefore can be recommended for practical use in face image retrieval.
Alex Brandsen, Anne Dirkson, Wessel Kraaij, Wout Lamers, Suzan Verberne, Hugo de Vos, Gineke Wiggers
This volume contains the papers presented at DIR 2018: 17th Dutch-Belgian Information Retrieval Workshop (DIR) held on November 23, 2018 in Leiden. DIR aims to serve as an international platform (with a special focus on the Netherlands and Belgium) for exchange and discussions on research & applications in the field of information retrieval and related fields. The committee accepted 4 short papers presenting novel work, 3 demo proposals, and 8 compressed contributions (summaries of papers recently published in international journals and conferences). Each submission was reviewed by at least 3 programme committee members.
Nandana Sengupta, Nati Srebro, James Evans
In the last decade, the use of simple rating and comparison surveys has proliferated on social and digital media platforms to fuel recommendations. These simple surveys and their extrapolation with machine learning algorithms shed light on user preferences over large and growing pools of items, such as movies, songs and ads. Social scientists have a long history of measuring perceptions, preferences and opinions, often over smaller, discrete item sets with exhaustive rating or ranking surveys. This paper introduces simple surveys for social science application. We ran experiments to compare the predictive accuracy of both individual and aggregate comparative assessments using four types of simple surveys: pairwise comparisons and ratings on 2, 5 and continuous point scales in three distinct contexts: perceived Safety of Google Streetview Images, Likeability of Artwork, and Hilarity of Animal GIFs. Across contexts, we find that continuous scale ratings best predict individual assessments but consume the most time and cognitive effort. Binary choice surveys are quick and perform best to predict aggregate assessments, useful for collective decision tasks, but poorly predict personalized preferences, for which they are currently used by Netflix to recommend movies. Pairwise comparisons, by contrast, perform well to predict personal assessments, but poorly predict aggregate assessments despite being widely used to crowdsource ideas and collective preferences. We demonstrate how findings from these surveys can be visualized in a low-dimensional space that reveals distinct respondent interpretations of questions asked in each context. We conclude by reflecting on differences between sparse, incomplete simple surveys and their traditional survey counterparts in terms of efficiency, information elicited and settings in which knowing less about more may be critical for social science.
George-Sebastian Pirtoaca, Traian Rebedea, Stefan Ruseti
Question answering is one of the most important and difficult applications at
the border of information retrieval and natural language processing, especially
when we talk about complex science questions which require some form of
inference to determine the correct answer. In this paper, we present a two-step
method that combines information retrieval techniques optimized for question
answering with deep learning models for natural language inference in order to
tackle multi-choice question answering in the science domain. For each
question-answer pair, we use standard retrieval-based models to find relevant
candidate contexts and decompose the main problem into two different
sub-problems. First, we assign correctness scores to each candidate answer based
on the context using retrieval models from Lucene. Second, we use deep learning
architectures to compute if a candidate answer can be inferred from some
well-chosen context consisting of sentences retrieved from the knowledge base.
In the end, all these solvers are combined using a simple neural network to
predict the correct answer. This proposed two-step model outperforms the best
retrieval-based solver by over 3% in absolute accuracy.
Authors' comments: 8 pages, 2 figures, 8 tables, accepted at IJCNN 2019
Netanel Raviv, Itzhak Tamo, Eitan Yaakobi
In a Private Information Retrieval (PIR) protocol, a user can download a file from a database without revealing the identity of the file to each individual server. A PIR protocol is called $t$-private if the identity of the file remains concealed even if $t$ of the servers collude. Graph based replication is a simple technique, which is prevalent in both theory and practice, for achieving erasure robustness in storage systems. In this technique each file is replicated on two or more storage servers, giving rise to a (hyper-)graph structure. In this paper we study private information retrieval protocols in graph based replication systems. The main interest of this work is maximizing the parameter $t$, and in particular, understanding the structure of the colluding sets which emerge in a given graph. Our main contribution is a $2$-replication scheme which guarantees perfect privacy from acyclic sets in the graph, and guarantees partial-privacy in the presence of cycles. Furthermore, by providing an upper bound, it is shown that the PIR rate of this scheme is at most a factor of two from its optimal value for an important family of graphs. Lastly, we extend our results to larger replication factors and to graph-based coding, which is a similar technique with smaller storage overhead and larger PIR rate.
Devraj Mandal, Pramod Rao, Soma Biswas
Due to the abundance of data from multiple modalities, cross-modal retrieval
tasks with image-text, audio-image, etc. are gaining increasing importance. Of
the different approaches proposed, supervised methods usually give significant
improvement over their unsupervised counterparts at the additional cost of
labeling or annotation of the training data. Semi-supervised methods have
recently become popular as they provide an elegant framework to balance the
conflicting requirements of labeling cost and accuracy. In this work, we propose
a novel deep semi-supervised framework which can seamlessly handle both labeled
as well as unlabeled data. The network has two important components: (a) the
label prediction component predicts the labels for the unlabeled portion of the
data and then (b) a common modality-invariant representation is learned for
cross-modal retrieval. The two parts of the network are trained sequentially
one after the other. Extensive experiments on three standard benchmark
datasets, Wiki, Pascal VOC and NUS-WIDE, demonstrate that the proposed framework
outperforms the state-of-the-art for both supervised and semi-supervised
settings.
Authors' comments: Updated Version of the Paper has been accepted in IEEE Transactions
on Multimedia {https://ieeexplore.ieee.org/document/8907496/}