Osman Tursun, Simon Denman, Sabesan Sivapalan, Sridha Sridharan, Clinton Fookes, Sandra Mau
The demand for large-scale trademark retrieval (TR) systems has significantly
increased to combat the rise in international trademark infringement.
Unfortunately, the ranking accuracy of current approaches using either
hand-crafted or pre-trained deep convolutional neural network (DCNN) features is
inadequate for large-scale deployments. We show in this paper that the ranking
accuracy of TR systems can be significantly improved by incorporating hard and
soft attention mechanisms, which direct attention to critical information such
as figurative elements and reduce attention given to distracting and
uninformative elements such as text and background. Our proposed approach
achieves state-of-the-art results on a challenging large-scale trademark
dataset.
Authors' comments: Fix typos related to authors' information
Mohsen Yavartanoo, Eu Young Kim, Kyoung Mu Lee
We propose an efficient Stereographic Projection Neural Network (SPNet) for learning representations of 3D objects. We first transform a 3D input volume into a 2D planar image using stereographic projection. We then present a shallow 2D convolutional neural network (CNN) to estimate the object category, followed by a view ensemble, which combines the responses from multiple views of the object to further enhance the predictions. Specifically, the proposed approach consists of four stages: (1) stereographic projection of a 3D object, (2) view-specific feature learning, (3) view selection, and (4) view ensemble. The proposed approach performs comparably to state-of-the-art methods while using substantially less GPU memory and fewer network parameters. Despite its lightness, experiments on 3D object classification and shape retrieval demonstrate the high performance of the proposed method.
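As a rough illustration of the first stage only, the textbook stereographic projection maps a point on the unit sphere to the plane z = 0 by projecting from the north pole. The abstract does not say how SPNet samples a 3D volume or rasterizes the result into an image, so the function name and the unit-sphere setup below are assumptions made for this sketch.

```python
import numpy as np

def stereographic_project(points):
    """Project unit-sphere points of shape (N, 3) onto the plane z = 0
    from the north pole (0, 0, 1). Hypothetical helper, not SPNet code."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    denom = 1.0 - z  # zero only at the north pole, which has no finite image
    return np.stack([x / denom, y / denom], axis=1)

# The south pole maps to the origin; equator points stay on the unit circle.
sphere_points = np.array([[0.0, 0.0, -1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
plane_points = stereographic_project(sphere_points)
```

Points near the north pole map far from the origin, so a practical pipeline would need to bound or resample the plane before feeding it to a 2D CNN.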
Kaihui Liu, Jiayi Wang, Zhengli Xing, Linxiao Yang, Jun Fang
In this paper, we consider the problem of low-rank phase retrieval whose objective is to estimate a complex low-rank matrix from magnitude-only measurements. We propose a hierarchical prior model for low-rank phase retrieval, in which a Gaussian-Wishart hierarchical prior is placed on the underlying low-rank matrix to promote the low-rankness of the matrix. Based on the proposed hierarchical model, a variational expectation-maximization (EM) algorithm is developed. The proposed method is less sensitive to the choice of the initialization point and works well with random initialization. Simulation results are provided to illustrate the effectiveness of the proposed algorithm.
Ayan Kumar Bhunia, Ankan Kumar Bhunia, Shuvozit Ghose, Abhirup Das, Partha Pratim Roy, Umapada Pal
Logo detection in real-world scene images is an important problem with
applications in advertisement and marketing. Existing general-purpose object
detection methods require large training data with annotations for every logo
class. These methods do not satisfy the incremental demand of logo classes
necessary for practical deployment since it is practically impossible to have
such annotated data for new, unseen logos. In this work, we develop an
easy-to-implement query-based logo detection and localization system by
employing a one-shot learning technique. Given an image of a query logo, our
model searches for it within a given target image and predicts the possible
location of the logo by estimating a binary segmentation mask. The proposed
model consists of a conditional branch and a segmentation branch. The former
gives a conditional latent representation of the given query logo which is
combined with feature maps of the segmentation branch at multiple scales in
order to find the matching position of the query logo in a target image, should
it be present. Feature matching between the latent query representation and
the multi-scale feature maps of the segmentation branch, using a simple
concatenation operation followed by a 1x1 convolution layer, makes our model
scale-invariant. Despite its simplicity, our query-based logo retrieval
framework achieved superior performance on the FlickrLogos-32 and TopLogos-10
datasets over different existing baselines.
Authors' comments: Accepted in Pattern Recognition, Elsevier (2019)
Ayan Kumar Bhunia, Perla Sai Raj Kishore, Pranay Mukherjee, Abhirup Das, Partha Pratim Roy
With the large-scale explosion of images and videos over the internet,
efficient hashing methods have been developed to facilitate memory and time
efficient retrieval of similar images. However, none of the existing works uses
hashing to address texture image retrieval mostly because of the lack of
sufficiently large texture image databases. Our work addresses this problem by
developing a novel deep learning architecture that generates binary hash codes
for input texture images. For this, we first pre-train a Texture Synthesis
Network (TSN) which takes a texture patch as input and outputs an enlarged view
of the texture by injecting newer texture content. This indicates that the
TSN encodes the learnt texture-specific information in its intermediate layers.
In the next stage, a second network gathers the multi-scale feature
representations from the TSN's intermediate layers using channel-wise
attention, combines them progressively into a dense continuous
representation which is finally converted into a binary hash code with the help
of individual and pairwise label information. The new enlarged texture patches
also help in data augmentation to alleviate the problem of insufficient texture
data and are used to train the second stage of the network. Experiments on
three public texture image retrieval datasets indicate the superiority of our
texture synthesis guided hashing approach over current state-of-the-art
methods.
Authors' comments: IEEE Winter Conference on Applications of Computer Vision (WACV)
2019. Video presentation: https://www.youtube.com/watch?v=tXaXTGhzaJo
Zhiwen Tang, Grace Hui Yang
Most neural Information Retrieval (Neu-IR) models derive query-to-document ranking scores based on term-level matching. Inspired by TileBars, a classical term distribution visualization method, in this paper, we propose a novel Neu-IR model that handles query-to-document matching at the subtopic and higher levels. Our system first splits the documents into topical segments, "visualizes" the matchings between the query and the segments, and then feeds an interaction matrix into a Neu-IR model, DeepTileBars, to obtain the final ranking scores. DeepTileBars models the relevance signals occurring at different granularities in a document's topic hierarchy. It better captures the discourse structure of a document and thus the matching patterns. Although its design and implementation are light-weight, DeepTileBars outperforms other state-of-the-art Neu-IR models on benchmark datasets including the Text REtrieval Conference (TREC) 2010-2012 Web Tracks and LETOR 4.0.
Adel Rahimi, Mohammad Bahrani
In this paper, we propose a new method for query expansion, which uses
FarsNet (Persian WordNet) to find similar tokens related to the query and
expand the semantic meaning of the query. For this purpose, we use synonymy
relations in FarsNet and extract the related synonyms to query words. This
algorithm is used to enhance information retrieval systems and improve search
results. The overall evaluation of this system in comparison to the baseline
method (without using query expansion) shows an improvement of about 9 percent
in Mean Average Precision (MAP).
Authors' comments: 4 pages
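The synonym-based expansion described in the abstract above can be sketched as follows. The dictionary here is a toy English stand-in, not FarsNet, and `expand_query` and `TOY_SYNONYMS` are hypothetical names introduced only for illustration; the abstract does not describe tokenization or how expanded terms are weighted.

```python
def expand_query(query, synonyms):
    """Return the query tokens followed by any synonyms not already present."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for token in tokens:
        for synonym in synonyms.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

# Toy synonym table standing in for a WordNet-style resource (assumption).
TOY_SYNONYMS = {"car": ["automobile", "vehicle"], "fast": ["quick"]}
expanded = expand_query("fast car", TOY_SYNONYMS)
```

The expanded token list would then be passed to the retrieval system in place of the original query.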
Yinzheng Gu, Chuanpeng Li, Jinbin Xie
It has been shown that image descriptors extracted by convolutional neural
networks (CNNs) achieve remarkable results for retrieval problems. In this
paper, we apply an attention mechanism to CNNs, which aims at enhancing more
relevant features that correspond to important keypoints in the input image.
The generated attention-aware features are then aggregated by the previous
state-of-the-art generalized mean (GeM) pooling followed by normalization to
produce a compact global descriptor, which can be efficiently compared to other
image descriptors by the dot product. An extensive comparison of our proposed
approach with state-of-the-art methods is performed on the new challenging
ROxford5k and RParis6k retrieval benchmarks. Results indicate significant
improvement over previous work. In particular, our attention-aware GeM (AGeM)
descriptor outperforms the state-of-the-art method on ROxford5k under the `Hard'
evaluation protocol.
Authors' comments: Shortened version for submission
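For readers unfamiliar with generalized mean (GeM) pooling, a minimal NumPy sketch of the pooling, normalization, and dot-product comparison steps is given here. It omits the attention mechanism entirely, uses random arrays in place of CNN activations, and the exponent `p = 3.0` is an illustrative assumption rather than a value taken from the abstract.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Pool a (C, H, W) map of non-negative activations into an
    L2-normalized (C,) descriptor using the generalized mean."""
    x = np.clip(feature_map, eps, None)  # avoid zeros before taking powers
    pooled = np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)
    return pooled / np.linalg.norm(pooled)

# Random stand-ins for the feature maps of two images (assumption).
rng = np.random.default_rng(0)
fmap_a = rng.random((8, 4, 4))
fmap_b = rng.random((8, 4, 4))
desc_a, desc_b = gem_pool(fmap_a), gem_pool(fmap_b)
similarity = float(desc_a @ desc_b)  # cosine similarity, since both are unit-norm
```

With `p = 1` the operation reduces to average pooling, and as `p` grows it approaches max pooling.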
Jing Yu, Chenghao Yang, Zengchang Qin, Zhuoqian Yang, Yue Hu, Weifeng Zhang
Feature modeling of different modalities is a basic problem in current
research of cross-modal information retrieval. Existing models typically
project texts and images into one embedding space, in which semantically
similar information will have a shorter distance. Semantic modeling of textual
relationships is notoriously difficult. In this paper, we propose an approach
to model texts using a featured graph by integrating multi-view textual
relationships including semantic relations, statistical co-occurrence, and
prior relations in the knowledge base. A dual-path neural network is adopted to
learn multi-modal representations of information and cross-modal similarity
measure jointly. We use a Graph Convolutional Network (GCN) for generating
relation-aware text representations, and use a Convolutional Neural Network
(CNN) with non-linearities for image representations. The cross-modal
similarity measure is learned by distance metric learning. Experimental results
show that, by leveraging the rich relational semantics in texts, our model can
outperform the state-of-the-art models by 3.4% and 6.3% in accuracy on two
benchmark datasets.
Authors' comments: To appear in KSEM 2019
Tolgahan Cakaloglu, Christian Szegedy, Xiaowei Xu
Text embedding representing natural language documents in a semantic vector
space can be used for document retrieval using nearest neighbor lookup. In
order to study the feasibility of neural models specialized for retrieval in a
semantically meaningful way, we suggest the use of the Stanford Question
Answering Dataset (SQuAD) in an open-domain question answering context, where
the first task is to find paragraphs useful for answering a given question.
First, we compare the quality of various text-embedding methods in terms of
retrieval performance and give an extensive empirical comparison of
various non-augmented base embeddings with and without IDF
weighting. Our main result is that training deep residual neural models
specifically for retrieval purposes can yield significant gains when they are
used to augment existing embeddings. We also establish that deeper models are
superior for this task. The best baseline embeddings augmented by our
learned neural approach improve the top-1 paragraph recall of the system by
14%.
Authors' comments: 12 pages, 7 figures
M. Benjelloun, E. W. Dadi, E. M. Daoudi
In this paper, we present an improvement of our previously proposed technique for 3D shape retrieval in classified databases [2], which is based on representatives of classes. Instead of systematically matching the query object with all 3D models in the database, the idea presented in [2] is, for a classified database, to represent each class by one representative that is used to direct the retrieval process to the right class (the class expected to contain 3D models similar to the query). To increase the chance of landing in the right class, our idea in this work is to represent each class by more than one representative. In this case, instead of using only one representative to decide which class is the right one, we use a set of representatives, which should improve the relevance of the retrieval results. The experimental results show that the relevance is significantly improved.
Joshin P. Krishnan, José M. Bioucas-Dias, Vladimir Katkovnik
This paper proposes a novel algorithm for image phase retrieval, i.e., for recovering complex-valued images from the amplitudes of noisy linear combinations (often the Fourier transform) of the sought complex images. The algorithm is developed using the alternating projection framework and aims to obtain high performance for heavily noisy (Poissonian or Gaussian) observations. The estimation of the target images is reformulated as a sparse regression, often termed sparse coding, in the complex domain. This is accomplished by learning a complex domain dictionary from the data it represents via matrix factorization with sparsity constraints on the code (i.e., the regression coefficients). Our algorithm, termed dictionary learning phase retrieval (DLPR), jointly learns this dictionary and reconstructs the unknown target image. The effectiveness of DLPR is illustrated through experiments conducted on complex images, simulated and real, where it shows noticeable advantages over the state-of-the-art competitors.
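To make the alternating projection framework concrete, here is a minimal sketch of the classical scheme on a noiseless toy problem. It is not DLPR: there is no dictionary learning, no sparsity constraint, and no noise model, and the random measurement matrix `A` and all function names are assumptions introduced for illustration.

```python
import numpy as np

def project_onto_magnitudes(z, amplitudes):
    """Keep the phase of z and replace its modulus with the measured amplitudes."""
    return amplitudes * np.exp(1j * np.angle(z))

def alternating_projections(A, amplitudes, x0, n_iter=100):
    """Alternate between the magnitude constraint and the range of A."""
    A_pinv = np.linalg.pinv(A)
    x = x0
    for _ in range(n_iter):
        z = project_onto_magnitudes(A @ x, amplitudes)
        x = A_pinv @ z  # least-squares fit of x to the phase-corrected data
    return x

def residual(A, x, amplitudes):
    """Mismatch between the predicted and the measured amplitudes."""
    return float(np.linalg.norm(np.abs(A @ x) - amplitudes))

# Toy noiseless problem with a random complex measurement matrix (assumption).
rng = np.random.default_rng(0)
A = rng.normal(size=(64, 8)) + 1j * rng.normal(size=(64, 8))
x_true = rng.normal(size=8) + 1j * rng.normal(size=8)
amplitudes = np.abs(A @ x_true)
x0 = rng.normal(size=8) + 1j * rng.normal(size=8)
x_hat = alternating_projections(A, amplitudes, x0)
```

The amplitude residual of this scheme does not increase from one iteration to the next, but the iteration can stall in a local minimum, and any solution is only determined up to a global phase factor.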
Mustafa Hajij, Paul Rosen
The Reeb graph of a scalar function defined on a domain gives a topologically
meaningful summary of that domain. Reeb graphs have been shown in the past
decade to be of great importance in geometric processing, image processing,
computer graphics, and computational topology. The demand for analyzing large
data sets has increased in the last decade. Hence the parallelization of
topological computations needs to be more fully considered. We propose a
parallel augmented Reeb graph algorithm on triangulated meshes with and without
a boundary. That is, in addition to our parallel algorithm for computing a Reeb
graph, we describe a method for extracting the original manifold data from the
Reeb graph structure. We demonstrate the running time of our algorithm on
standard datasets. As an application, we show how our algorithm can be utilized
in mesh segmentation algorithms.
Authors' comments: 30 pages, 25 figures
Samy Wu Fung, Zichao Di
Ptychography is a popular imaging technique that combines diffractive imaging
with scanning microscopy. The technique consists of a coherent beam that is
scanned across an object in a series of overlapping positions, leading to
reliable and improved reconstructions. Ptychographic microscopes allow for
large fields to be imaged at high resolution at the cost of additional
computational expense. In this work, we propose a multigrid-based optimization
framework to reduce the computational burdens of large-scale ptychographic
phase retrieval. Our proposed method exploits the inherent hierarchical
structures in ptychography through tailored restriction and prolongation
operators for the object and data domains. Our numerical results show that our
proposed scheme accelerates the convergence of its underlying solver and
outperforms the Ptychographic Iterative Engine (PIE), a workhorse in the optics
community.
Authors' comments: 21 pages, 7 figures
Gengchen Mai, Krzysztof Janowicz, Cheng He, Sumang Liu, Ni Lao
Many services that perform information retrieval for Points of Interest (POI) utilize a Lucene-based setup with spatial filtering. While this type of system is easy to implement, it does not make use of semantics but relies on direct word matches between a query and reviews, leading to a loss in both precision and recall. To study the challenging task of semantically enriching POIs from unstructured data in order to support open-domain search and question answering (QA), we introduce a new dataset, POIReviewQA. It consists of 20k questions (e.g., "is this restaurant dog friendly?") for 1022 Yelp business types. For each question we sampled 10 reviews and annotated each sentence in the reviews as to whether it answers the question and what the corresponding answer is. To test a system's ability to understand the text, we adopt an information retrieval evaluation by ranking all the review sentences for a question based on the likelihood that they answer this question. We build a Lucene-based baseline model, which achieves 77.0% AUC and 48.8% MAP. A sentence embedding-based model achieves 79.2% AUC and 41.8% MAP, indicating that the dataset presents a challenging problem for future research by the GIR community. The resulting technology can help exploit the thematic content of web documents and social media for characterisation of locations.
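Since the evaluation above reports Mean Average Precision (MAP) over ranked review sentences, a short sketch of the standard computation may help. It assumes binary relevance labels and that every relevant sentence appears somewhere in the ranking, which holds when all sentences for a question are ranked; the function names and toy labels are illustrative and not taken from the dataset.

```python
def average_precision(relevance):
    """Average precision for one ranking.
    relevance: 0/1 labels in ranked order, best-ranked first."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-question average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two toy questions with made-up relevance labels (assumption).
example_map = mean_average_precision([[1, 0, 1], [0, 1]])
```

Here the first ranking scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.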
Qijun Zhu, Dandan Li, Dik Lun Lee
As the web expands in data volume and in geographical distribution, centralized search methods become inefficient, leading to increasing interest in cooperative information retrieval, e.g., federated text retrieval (FTR). Unlike existing centralized information retrieval (IR) methods, in which search is done on a logically centralized document collection, FTR is composed of a number of peers, each of which is a complete search engine by itself. To process a query, FTR requires first the identification of promising peers that host the relevant documents and second the retrieval of the most relevant documents from the selected peers. Most of the existing methods only apply traditional IR techniques that treat each text collection as a single large document and utilize term matching to rank the collections. In this paper, we formalize the problem, identify the properties of FTR, and analyze the feasibility of extending LSI with clustering to adapt to FTR, based on which a novel approach called Cluster-based Distributed Latent Semantic Indexing (C-DLSI) is proposed. C-DLSI distinguishes the topics of a peer with clustering, captures the local LSI spaces within the clusters, and considers the relations among these LSI spaces, thus providing a more precise characterization of the peer. Accordingly, novel descriptors of the peers and a compatible local text retrieval method are proposed. The experimental results show that C-DLSI outperforms existing methods.
Björn Barz, Joachim Denzler
Deep neural networks trained for classification have been found to learn
powerful image representations, which are also often used for other tasks such
as comparing images w.r.t. their visual similarity. However, visual similarity
does not imply semantic similarity. In order to learn semantically
discriminative features, we propose to map images onto class embeddings whose
pair-wise dot products correspond to a measure of semantic similarity between
classes. Such an embedding does not only improve image retrieval results, but
could also facilitate integrating semantics for other tasks, e.g., novelty
detection or few-shot learning. We introduce a deterministic algorithm for
computing the class centroids directly based on prior world-knowledge encoded
in a hierarchy of classes such as WordNet. Experiments on CIFAR-100, NABirds,
and ImageNet show that our learned semantic image embeddings improve the
semantic consistency of image retrieval results by a large margin.
Authors' comments: Accepted at WACV 2019. Source code:
https://github.com/cvjena/semantic-embeddings
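The requirement that pairwise dot products of class embeddings equal a given semantic similarity can be illustrated with a small NumPy sketch. It uses a plain eigendecomposition of the similarity matrix, which is a swapped-in technique and not the deterministic algorithm the abstract refers to; the 3-class similarity matrix is made up and is not derived from WordNet.

```python
import numpy as np

def embeddings_from_similarity(S):
    """Return E with one row per class such that E @ E.T reproduces S,
    assuming S is symmetric positive semi-definite."""
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return eigvecs * np.sqrt(eigvals)      # scale each eigenvector column

# Made-up similarities: classes 0 and 1 are related, class 2 is unrelated.
S = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
E = embeddings_from_similarity(S)
```

Because the diagonal of `S` is 1, every class embedding has unit norm, so the dot product between two rows of `E` is also their cosine similarity.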
Asha Anoosheh, Torsten Sattler, Radu Timofte, Marc Pollefeys, Luc Van Gool
Visual localization is a key step in many robotics pipelines, allowing the
robot to (approximately) determine its position and orientation in the world.
An efficient and scalable approach to visual localization is to use image
retrieval techniques. These approaches identify the image most similar to a
query photo in a database of geo-tagged images and approximate the query's pose
via the pose of the retrieved database image. However, image retrieval across
drastically different illumination conditions, e.g. day and night, is still a
problem with unsatisfactory results, even in this age of powerful neural
models. This is due to a lack of a suitably diverse dataset with true
correspondences to perform end-to-end learning. A recent class of neural models
allows for realistic translation of images among visual domains with relatively
little training data and, most importantly, without ground-truth pairings. In
this paper, we explore the task of accurately localizing images captured from
two traversals of the same area in both day and night. We propose ToDayGAN - a
modified image-translation model to alter nighttime driving images to a more
useful daytime representation. We then compare the daytime and translated night
images to obtain a pose estimate for the night image using the known 6-DOF
position of the closest day image. Our approach improves localization
performance by over 250% compared to the current state of the art, in the context
of standard metrics in multiple categories.
Authors' comments: Published in ICRA 2019
Hiren Galiyawala, Kenil Shah, Vandit Gajjar, Mehul S. Raval
A person is commonly described by attributes like height, build, clothing color,
clothing type, and gender. Such attributes are known as soft biometrics. They
bridge the semantic gap between human description and person retrieval in
surveillance video. The paper proposes a deep learning-based linear filtering
approach for person retrieval using height, clothing color, and gender. The
proposed approach uses Mask R-CNN for pixel-wise person segmentation. It
removes background clutter and provides a precise boundary around the person.
Color and gender models are fine-tuned using AlexNet, and the algorithm is
tested on the SoftBioSearch dataset. It achieves good accuracy for person
retrieval using the semantic query in challenging conditions.
Authors' comments: 6 Pages, 6 Figures, Accepted to Semantic Person Retrieval in
Surveillance Using Soft Biometrics challenge in Conjunction with AVSS-2018
Georgios-Ioannis Brokos, Polyvios Liosis, Ryan McDonald, Dimitris Pappas, Ion Androutsopoulos
We present AUEB's submissions to the BioASQ 6 document and snippet retrieval
tasks (parts of Task 6b, Phase A). Our models use novel extensions to deep
learning architectures that operate solely over the text of the query and
candidate document/snippets. Our systems scored at the top or near the top for
all batches of the challenge, highlighting the effectiveness of deep learning
for these tasks.
Authors' comments: In Proceedings of the workshop BioASQ: Large-scale Biomedical
Semantic Indexing and Question Answering, at the Conference on Empirical
Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, 2018.
arXiv admin note: text overlap with arXiv:1809.01682