Xin Zhang, Ning Jia, Ioannis Ivrissimtzis
We present a deep neural network based method for the retrieval of watermarks from images of 3D printed objects. To deal with the variability of all possible 3D printing and image acquisition settings, we train the network with synthetic data. The main simulator parameters, such as texture, illumination and camera position, are dynamically randomized in non-realistic ways, forcing the neural network to learn the intrinsic features of the 3D printed watermarks. At the end of the pipeline, the watermark, in the form of a two-dimensional bit array, is retrieved through a series of simple image processing and statistical operations applied to the confidence map generated by the neural network. The results demonstrate that the inclusion of synthetic domain-randomized (DR) data in the training set increases the generalization power of the network, which performs better on images from previously unseen 3D printed objects. We conclude that in our application domain of information retrieval from 3D printed objects, where access to the exact CAD files of the printed objects can be assumed, one can use inexpensive synthetic data to enhance neural network training, reducing the need for the labour-intensive process of creating large amounts of hand-labelled real data or the need to generate photorealistic synthetic data.
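The randomization of simulator parameters described above can be pictured with a minimal Python sketch. The parameter names, value ranges and the function `sample_scene_params` are hypothetical and chosen only for illustration; the abstract does not specify the simulator or its settings.

```python
import random


def sample_scene_params(rng):
    """Draw one set of deliberately non-realistic rendering parameters.

    All keys and ranges are illustrative assumptions, not the paper's values.
    """
    return {
        # Random RGB texture colour, each channel in [0, 1).
        "texture_rgb": tuple(rng.random() for _ in range(3)),
        # Light intensity and direction sampled over wide ranges.
        "light_intensity": rng.uniform(0.2, 3.0),
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        # Camera position relative to the object.
        "camera_distance": rng.uniform(0.5, 2.0),
        "camera_elevation_deg": rng.uniform(10.0, 80.0),
    }


# A fresh draw per rendered training image gives a different scene each time.
rng = random.Random(0)
batch = [sample_scene_params(rng) for _ in range(4)]
```

Drawing new parameters for every synthetic image, rather than fixing one realistic setup, is what is meant to push the network toward features of the watermark itself.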
Kai Zhu, Wei Zhai, Zheng-Jun Zha, Yang Cao
In this paper, we tackle one-shot texture retrieval: given an example of a
new reference texture, detect and segment all the pixels of the same texture
category within an arbitrary image. To address this problem, we present an
OS-TR network that encodes both the reference and the query image to segment
the texture of the reference category. Unlike the existing
texture encoding methods that integrate CNN with orderless pooling, we propose
a directionality-aware module to capture the texture variations at each
direction, resulting in a spatially invariant representation. To segment new
categories given only a few examples, we incorporate a self-gating mechanism into
a relation network to exploit global context information for adjusting
per-channel modulation weights of local relation features. Extensive
experiments on benchmark texture datasets and real scenarios demonstrate the
above-par segmentation performance and robust cross-domain generalization of
our proposed method.
Authors' comments: ijcai2019-lastest
Xinxun Xu, Hao Wang, Leida Li, Cheng Deng
Zero-shot sketch-based image retrieval (ZS-SBIR) is a specific cross-modal
retrieval task for retrieving natural images with free-hand sketches under a
zero-shot scenario. Previous works mostly focus on modeling the correspondence
between images and sketches or synthesizing image features with sketch
features. However, both of them ignore the large intra-class variance of
sketches, thus resulting in unsatisfactory retrieval performance. In this
paper, we propose a novel end-to-end semantic adversarial approach for ZS-SBIR.
Specifically, we devise a semantic adversarial module to maximize the
consistency between learned semantic features and category-level word vectors.
Moreover, to preserve the discriminability of synthesized features within each
training category, a triplet loss is employed for the generative module.
Additionally, the proposed model is trained end-to-end to
exploit better semantic features suitable for ZS-SBIR. Extensive experiments
conducted on two popular large-scale datasets demonstrate that our proposed
approach remarkably outperforms state-of-the-art approaches by more than 12%
on the Sketchy dataset and about 3% on the TU-Berlin dataset in retrieval.
Authors' comments: There is a big problem with the paper and I hope it can be retracted
Tao Yao, Xiangwei Kong, Lianshan Yan, Wenjing Tang, Qi Tian
Supervised cross-modal hashing has gained increasing research interest for large-scale retrieval tasks owing to its satisfactory performance and efficiency. However, several challenging issues remain to be studied: 1) most methods fail to preserve the semantic correlations well in hash codes because of the large heterogeneous gap; 2) most methods relax the discrete constraint on hash codes, leading to large quantization error and consequently low performance; 3) most methods suffer from relatively high memory cost and computational complexity during training, which makes them unscalable. In this paper, to address the above issues, we propose a supervised cross-modal hashing method based on matrix factorization, dubbed Efficient Discrete Supervised Hashing (EDSH). Specifically, collective matrix factorization on heterogeneous features and semantic embedding with class labels are seamlessly integrated to learn hash codes. Therefore, both the feature-based similarities and the semantic correlations can be preserved in hash codes, which makes the learned hash codes more discriminative. Then an efficient discrete optimization algorithm is proposed to handle the scalability issue. Instead of learning hash codes bit by bit, the hash code matrix can be obtained directly, which is more efficient. Extensive experimental results on three public real-world datasets demonstrate that EDSH achieves superior accuracy and scalability compared with some existing cross-modal hashing methods.
Paul Sheridan, Mikael Onsjö, Janna Hastings
Literary theme identification and interpretation is a focal point of literary
studies scholarship. Classical forms of literary scholarship, such as close
reading, have flourished with scarcely any need for commonly defined literary
themes. However, the rise in popularity of collaborative and algorithmic
analyses of literary themes in works of fiction, together with a requirement
for computational searching and indexing facilities for large corpora, creates
the need for a collection of shared literary themes to ensure common
terminology and definitions. To address this need, we here introduce a first
draft of the Literary Theme Ontology. Inspired by a traditional framing from
literary theory, the ontology comprises literary themes drawn from the authors'
own analyses, reference books, and online sources. The ontology is available at
https://github.com/theme-ontology/lto under a Creative Commons Attribution 4.0
International license (CC BY 4.0).
Authors' comments: 12 pages, 2 figures, 1 tables, minor revisions
Ali Ahmed, Alireza Aghasi, Paul Hand
We consider the task of recovering two real or complex $m$-vectors from
phaseless Fourier measurements of their circular convolution. Our method is a
novel convex relaxation that is based on a lifted matrix recovery formulation
that allows a nontrivial convex relaxation of the bilinear measurements from
convolution. We prove that if the two signals belong to known random subspaces
of dimensions $k$ and $n$, then they can be recovered up to the inherent
scaling ambiguity with $m \gg (k+n) \log^2 m$ phaseless measurements. Our
method provides the first theoretical recovery guarantee for this problem by a
computationally efficient algorithm and does not require a solution estimate to
be computed for initialization. Our proof is based on Rademacher complexity
estimates. Additionally, we provide an alternating direction method of
multipliers (ADMM) implementation and provide numerical experiments that verify
the theory.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:1806.08091
Noa Garcia, Benjamin Renoust, Yuta Nakashima
In computer vision, visual arts are often studied from a purely aesthetic perspective, mostly by analysing the visual appearance of an artistic reproduction to infer its style, its author, or its representative features. In this work, however, we explore art from both a visual and a language perspective. Our aim is to bridge the gap between the visual appearance of an artwork and its underlying meaning by jointly analysing its aesthetics and its semantics. We introduce the use of multi-modal techniques in the field of automatic art analysis by 1) collecting a multi-modal dataset with fine-art paintings and comments, and 2) exploring robust visual and textual representations in artistic images.
Wing Hong Wong, Yifei Lou, Stefano Marchesini, Tieyong Zeng
Recovering a signal from its Fourier magnitude is referred to as phase
retrieval, which occurs in different fields of engineering and applied physics.
This paper gives a new characterization of the phase retrieval problem.
Particularly useful is the analysis revealing that the common gradient-based
regularization does not restrict the set of solutions to a smaller set.
Specifically focusing on binary signals, we show that a box relaxation is
equivalent to the binary constraint for Fourier-type phase retrieval. We
further prove that binary signals can be recovered uniquely up to trivial
ambiguities under certain conditions. Finally, we use the characterization
theorem to develop an efficient denoising algorithm.
Authors' comments: 25 pages, 11 figures
Meng Huang, Zhiqiang Xu
Suppose that $\mathbf{y}=\lvert A\mathbf{x_0}\rvert+\eta$ where $\mathbf{x_0}
\in \mathbb{R}^d$ is the target signal and $\eta\in \mathbb{R}^m$ is a noise
vector. The aim of phase retrieval is to estimate $\mathbf{x_0}$ from
$\mathbf{y}$. A popular model for estimating $\mathbf{x_0}$ is the nonlinear
least squares $\widehat{\mathbf{x}}:={\rm argmin}_{\mathbf{x}} \| \lvert A
\mathbf{x}\rvert-\mathbf{y}\|_2$. Many efficient algorithms have already been
developed for solving the model, such as the seminal error reduction
algorithm. In this paper, we analyze the estimation performance of the model
by proving that $\|\widehat{\mathbf{x}}-\mathbf{x_0} \|\lesssim
{\|\eta\|_2}/{\sqrt{m}}$ under the assumption that $A$ is a Gaussian random
matrix. We also prove that the reconstruction error ${\|\eta\|_2}/{\sqrt{m}}$ is
sharp. For the case where $\mathbf{x_0}$ is sparse, we study the estimation
performance of both the nonlinear Lasso of phase retrieval and its
unconstrained version. Our results are non-asymptotic, and we do not assume any
distribution on the noise $\eta$. To the best of our knowledge, our results
represent the first theoretical guarantee for the nonlinear least squares and
for the nonlinear Lasso of phase retrieval.
Authors' comments: 22 pages
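As a small illustration of the least squares model in the abstract above, the following Python sketch evaluates the objective $\| \lvert A\mathbf{x}\rvert-\mathbf{y}\|_2$ for a real matrix. The toy matrix, signal, noise vector and function names are made up for this example and are not taken from the paper; no solver is implemented here.

```python
import math


def phaseless_measurements(A, x):
    """Return |A x| for a real matrix A given as a list of rows."""
    return [abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x))) for row in A]


def least_squares_objective(A, x, y):
    """Return || |A x| - y ||_2, the objective of the nonlinear least squares model."""
    residuals = [m - y_i for m, y_i in zip(phaseless_measurements(A, x), y)]
    return math.sqrt(sum(r * r for r in residuals))


# Hypothetical toy problem: 3 measurements of a 2-dimensional signal.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x0 = [1.0, -2.0]
eta = [0.1, -0.1, 0.0]  # arbitrary noise vector, no distribution assumed
y = [m + e for m, e in zip(phaseless_measurements(A, x0), eta)]

# x0 and -x0 produce identical phaseless measurements, so the objective
# cannot tell them apart: recovery is only possible up to a global sign.
obj_plus = least_squares_objective(A, x0, y)
obj_minus = least_squares_objective(A, [-v for v in x0], y)
```

At the true signal the objective equals $\|\eta\|_2$, which is why error bounds for this model are naturally stated in terms of the noise norm.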
Subarna Tripathi, Sharath Nittur Sridhar, Sairam Sundaresan, Hanlin Tang
Structured representations such as scene graphs serve as an efficient and
compact representation that can be used for downstream rendering or retrieval
tasks. However, existing efforts to generate realistic images from scene graphs
perform poorly on scene composition for cluttered or complex scenes. We propose
two contributions to improve the scene composition. First, we enhance the scene
graph representation with heuristic-based relations, which add minimal storage
overhead. Second, we use extreme points representation to supervise the
learning of the scene composition network. These methods achieve significantly
higher performance than existing work (69.0% vs 51.2% on the relation score
metric). We additionally demonstrate how scene graphs can be used to retrieve
pose-constrained image patches that are semantically similar to the source
query. Improving structured scene graph representations for rendering or
retrieval is an important step towards realistic image generation.
Authors' comments: To appear in CVPRW 2019 (CEFRL)
Vinay Kumar Verma, Aakansha Mishra, Ashish Mishra, Piyush Rai
We present a probabilistic model for Sketch-Based Image Retrieval (SBIR)
where, at retrieval time, we are given sketches from novel classes that were
not present at training time. Existing SBIR methods, most of which rely on
learning class-wise correspondences between sketches and images, typically work
well only for previously seen sketch classes, and result in poor retrieval
performance on novel classes. To address this, we propose a generative model
that learns to generate images, conditioned on a given novel class sketch. This
enables us to reduce the SBIR problem to a standard image-to-image search
problem. Our model is based on an inverse auto-regressive flow based
variational autoencoder, with a feedback mechanism to ensure robust image
generation. We evaluate our model on two very challenging datasets, Sketchy
and TU Berlin, with a novel train-test split. The proposed approach significantly
outperforms various baselines on both datasets.
Authors' comments: Accepted at CVPR-Workshop 2019
Ushasi Chaudhuri, Partha Bhowmick, Jayanta Mukhopadhyay
Searching for similar logos in the registered logo database is a very important and tedious task at the trademark office. Speed and accuracy are two aspects that one must attend to while developing a system for retrieval of logos. In this paper, we propose a rough-set based method to quantify the structural information in a logo image that can be used to efficiently index an image. A logo is split into a number of polygons, and for each polygon, we compute the tight upper and lower approximations based on the principles of a rough set. This representation is used for forming feature vectors for retrieval of logos. Experimentation on a standard data set shows the usefulness of the proposed technique. It is computationally efficient and also provides retrieval results at high accuracy.
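The idea of lower and upper approximations can be sketched in Python on a pixel grid. This is only a generic rough-set illustration under simplifying assumptions: the shape is a set of integer pixel coordinates rather than a polygon, the equivalence classes are square grid cells, and the function names are hypothetical. The paper's actual polygon-based computation is not described in the abstract.

```python
def rough_approximations(shape_pixels, cell_size):
    """Approximate a pixel set by square grid cells of side `cell_size`.

    Lower approximation: cells lying entirely inside the shape.
    Upper approximation: cells containing at least one pixel of the shape.
    `shape_pixels` must be a set of (x, y) integer pairs (no duplicates).
    """
    covered = {}
    for x, y in shape_pixels:
        cell = (x // cell_size, y // cell_size)
        covered[cell] = covered.get(cell, 0) + 1
    upper = set(covered)
    full = cell_size * cell_size
    lower = {cell for cell, count in covered.items() if count == full}
    return lower, upper


def roughness(lower, upper):
    """Pawlak roughness: 0 when the shape is exactly a union of cells."""
    return 1.0 - len(lower) / len(upper)


# A 3x2 rectangle of pixels on a grid of 2x2 cells.
shape = {(x, y) for x in range(3) for y in range(2)}
lower, upper = rough_approximations(shape, cell_size=2)
score = roughness(lower, upper)
```

Here one cell is fully covered and a second is only partly covered, so the lower approximation has one cell, the upper has two, and the roughness is 0.5. Quantities of this kind could serve as entries of a feature vector.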
Luis Welbanks, Nikku Madhusudhan
Accurate estimations of atmospheric properties of exoplanets from
transmission spectra require understanding of degeneracies between model
parameters and observations that can resolve them. We conduct a systematic
investigation of such degeneracies using a combination of detailed atmospheric
retrievals and a range of model assumptions, focusing on H$_2$-rich
atmospheres. As a case study, we consider the well-studied hot Jupiter HD
209458 b. We perform extensive retrievals with models ranging from simple
isothermal and isobaric atmospheres to those with full pressure-temperature
profiles, inhomogeneous cloud/haze coverage, multiple molecular species, and
data in the optical-infrared wavelengths. Our study reveals four key insights.
First, we find that a combination of models with minimal assumptions and
broadband transmission spectra with current facilities allows precise estimates
of chemical abundances. In particular, high-precision optical and infrared
spectra along with models including variable cloud coverage and prominent
opacity sources, Na and K being important in the optical, provide joint constraints
on cloud/haze properties and chemical abundances. Second, we show that the
degeneracy between planetary radius and its reference pressure is well
characterised and has little effect on abundance estimates, contrary to
previous claims using semi-analytic models. Third, collision induced absorption
due to H$_2$-H$_2$ and H$_2$-He interactions plays a critical role in correctly
estimating atmospheric abundances. Finally, our results highlight the
inadequacy of simplified semi-analytic models with isobaric assumptions for
reliable retrievals of transmission spectra. Transmission spectra obtained with
current facilities such as HST and VLT can provide strong constraints on
atmospheric abundances of exoplanets.
Authors' comments: Accepted for publication in ApJ
Kenta Hama, Takashi Matsubara, Kuniaki Uehara, Jianfei Cai
With the wide development of black-box machine learning algorithms, particularly deep neural networks (DNNs), the practical demand for reliability assessment is rapidly rising. On the basis of the concept that `Bayesian deep learning knows what it does not know,' the uncertainty of DNN outputs has been investigated as a reliability measure for classification and regression tasks. However, in the image-caption retrieval task, well-known samples are not always easy-to-retrieve samples. This study investigates two aspects of image-caption embedding-and-retrieval systems. On the one hand, we quantify feature uncertainty by considering image-caption embedding as a regression task, and use it for model averaging, which can improve the retrieval performance. On the other hand, we further quantify posterior uncertainty by considering the retrieval as a classification task, and use it as a reliability measure, which can greatly improve the retrieval performance by rejecting uncertain queries. The consistent performance of the two uncertainty measures is observed with different datasets (MS COCO and Flickr30k), different deep learning architectures (dropout and batch normalization), and different similarity functions.
Anirban Santara, Jayeeta Datta, Sourav Sarkar, Ankur Garg, Kirti Padia, Pabitra Mitra
Hyperspectral images of land-cover captured by airborne or satellite-mounted
sensors provide a rich source of information about the chemical composition of
the materials present in a given place. This makes hyperspectral imaging an
important tool for earth sciences, land-cover studies, and military and
strategic applications. However, the scarcity of labeled training examples and
spatial variability of spectral signature are two of the biggest challenges
faced by hyperspectral image classification. In order to address these issues,
we aim to develop a framework for material-agnostic information retrieval in
hyperspectral images based on Positive-Unlabelled (PU) classification. Given a
hyperspectral scene, the user labels some positive samples of a material he/she
is looking for and our goal is to retrieve all the remaining instances of the
query material in the scene. Additionally, we require the system to work
equally well for any material in any scene without the user having to disclose
the identity of the query material. This material-agnostic nature of the
framework provides it with superior generalization abilities. We explore two
alternative approaches to solve the hyperspectral image classification problem
within this framework. The first approach is an adaptation of non-negative risk
estimation based PU learning for hyperspectral data. The second approach is
based on one-versus-all positive-negative classification where the negative
class is approximately sampled using a novel spectral-spatial retrieval model.
We propose two annotator models - uniform and blob - that represent the
labelling patterns of a human annotator. We compare the performances of the
proposed algorithms for each annotator model on three benchmark hyperspectral
image datasets - Indian Pines, Pavia University and Salinas.
Authors' comments: 9 pages, under review at ACMMM-2019
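For readers unfamiliar with non-negative risk estimation in PU learning, the following Python sketch computes the commonly used non-negative PU risk with a sigmoid loss. It is the generic estimator, not the authors' adaptation to hyperspectral data, and the classifier scores and the class prior value are assumed inputs made up for the example.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid loss for a real-valued score and a label in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk from scores on positive and unlabelled samples.

    `prior` is the assumed fraction of positives among the unlabelled data.
    """
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # The negative-class risk is estimated from unlabelled data minus the
    # positive contribution; clamping at zero keeps the estimate non-negative.
    negative_part = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + negative_part


# Uninformative classifier: every score is zero, so every loss is 0.5.
risk_flat = nn_pu_risk([0.0], [0.0], prior=0.5)
# Confident classifier where the unclamped negative part would be below zero.
risk_clamped = nn_pu_risk([10.0], [-10.0], prior=0.9)
```

Without the clamp, the second case would yield a negative risk estimate, which is the overfitting symptom that the non-negative correction is designed to avoid.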
Kojima Yusuke, Tanaka Kanji, Yang Naiming, Hirota Yuji
We present a novel scalable framework for image change detection (ICD) from
an on-board 3D imagery system. We argue that existing ICD systems are
constrained by the time required to align a given query image with individual
reference image coordinates. We utilize an invariant coordinate system (ICS) to
replace the time-consuming image alignment with an offline pre-processing
procedure. Our key contribution is an extension of the traditional image
comparison-based ICD tasks to setups of the image retrieval (IR) task. We
replace each component of the 3D ICD system, i.e., (1) image modeling, (2)
image alignment, and (3) image differencing, with significantly more efficient
variants from the bag-of-words (BoW) IR paradigm. Further, we train a deep 3D
feature extractor in an unsupervised manner using an unsupervised Siamese
network and automatically collected training data. We conducted experiments on
a challenging cross-season ICD task using a publicly available dataset and
thereby validate the efficacy of the proposed approach.
Authors' comments: 5 pages, 1 figure, technical report
Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, Yi-Zhe Song
In this paper, we investigate the problem of zero-shot sketch-based image
retrieval (ZS-SBIR), where human sketches are used as queries to conduct
retrieval of photos from unseen categories. We advance prior art
by proposing a novel ZS-SBIR scenario that represents a firm step forward in
its practical application. The new setting uniquely recognizes two important
yet often neglected challenges of practical ZS-SBIR, (i) the large domain gap
between amateur sketch and photo, and (ii) the necessity for moving towards
large-scale retrieval. We first contribute to the community a novel ZS-SBIR
dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000
photos spanning across 110 categories. Highly abstract amateur human sketches
are purposefully sourced to maximize the domain gap, instead of ones included
in existing datasets that can often be semi-photorealistic. We then formulate a
ZS-SBIR framework to jointly model sketches and photos into a common embedding
space. A novel strategy to mine the mutual information among domains is
specifically engineered to alleviate the domain gap. External semantic
knowledge is further embedded to aid semantic transfer. We show that, rather
surprisingly, retrieval performance that significantly outperforms the
state of the art on existing datasets can already be achieved using a
reduced version of our model. We further demonstrate the superior performance
of our full model by comparing with a number of alternatives on the newly
proposed dataset. The new dataset, plus all training and testing code of our
model, will be publicly released to facilitate future research.
Authors' comments: Oral paper in CVPR 2019
Yadan Luo, Ziwei Wang, Zi Huang, Yang Yang, Huimin Lu
With the increasing number of online stores, there is a pressing need for intelligent search systems that understand the item photos snapped by customers and search large-scale product databases to find the desired items. However, it is challenging for conventional retrieval systems to match the item photos captured by customers with the ones officially released by stores, especially for garment images. To bridge customer-provided and store-provided garment photos, existing studies have widely exploited clothing attributes (e.g., black) and landmarks (e.g., collar) to learn a common embedding space for garment representations. Unfortunately, they omit the sequential correlation of attributes and require a large amount of human labor to label the landmarks. In this paper, we propose a deep multi-task cross-domain hashing method, termed DMCH, in which cross-domain embedding and sequential attribute learning are modeled simultaneously. Sequential attribute learning not only provides semantic guidance for the embedding, but also generates rich attention on discriminative local details (e.g., black buttons) of clothing items without requiring extra landmark labels. This leads to promising performance and a 306$\times$ boost in efficiency compared with state-of-the-art models, as demonstrated through rigorous experiments on two public fashion datasets.
Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury
A few recent methods have been proposed for text to video moment
retrieval using natural language queries, but they require full supervision during
training. However, acquiring a large number of training videos with temporal
boundary annotations for each text description is extremely time-consuming and
often not scalable. In order to cope with this issue, in this work, we
introduce the problem of learning from weak labels for the task of text to
video moment retrieval. The weak nature of the supervision is because, during
training, we only have access to the video-text pairs rather than the temporal
extent of the video to which different text descriptions relate. We propose a
joint visual-semantic embedding based framework that learns the notion of
relevant segments from video using only video-level sentence descriptions.
Specifically, our main idea is to utilize latent alignment between video frames
and sentence descriptions using Text-Guided Attention (TGA). TGA is then used
during the test phase to retrieve relevant moments. Experiments on two
benchmark datasets demonstrate that our method achieves comparable performance
to state-of-the-art fully supervised approaches.
Authors' comments: Revised Table 1 in Page 6, A small bug related to rounding resulted
in a slightly improved score in the previous version. Our conclusion remains
the same after the update
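One plausible reading of text-guided attention over video frames is a softmax over frame-sentence similarities followed by weighted pooling, sketched below in Python. The abstract does not give the exact formulation, so the cosine similarity, the softmax weighting and all names here are assumptions; frame and sentence features are assumed to be non-zero vectors in the same joint embedding space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_guided_attention(frame_feats, sentence_feat):
    """Return per-frame attention weights and the attention-pooled video feature."""
    sims = [cosine(f, sentence_feat) for f in frame_feats]
    peak = max(sims)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_feat)
    pooled = [sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)]
    return weights, pooled


# Hypothetical 2-dimensional features for three frames and one sentence.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence = [1.0, 0.0]
weights, pooled = text_guided_attention(frames, sentence)
```

Under this reading, frames with high weights at test time indicate the moment most relevant to the query sentence, even though no temporal boundaries were used in training.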
Cheng Deng, Zhaojia Chen, Xianglong Liu, Xinbo Gao, Dacheng Tao
Given the benefits of its low storage requirements and high retrieval efficiency, hashing has recently received increasing attention. In particular, cross-modal hashing has been widely and successfully used in multimedia similarity search applications. However, almost all existing cross-modal hashing methods cannot obtain powerful hash codes because they ignore the relative similarity between heterogeneous data, which contains richer semantic information, leading to unsatisfactory retrieval performance. In this paper, we propose a triplet-based deep hashing (TDH) network for cross-modal retrieval. First, we utilize triplet labels, which describe the relative relationships among three instances, as supervision in order to capture more general semantic correlations between cross-modal instances. We then establish a loss function from the inter-modal view and the intra-modal view to boost the discriminative abilities of the hash codes. Finally, graph regularization is introduced into our proposed TDH method to preserve the original semantic similarity between hash codes in Hamming space. Experimental results show that our proposed method outperforms several state-of-the-art approaches on two popular cross-modal datasets.
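The notion of triplet supervision on hash codes can be illustrated with a generic margin-based triplet loss in Hamming space, sketched below in Python. This is a textbook hinge formulation, not necessarily the TDH loss; the codes, the margin and the function names are illustrative assumptions.

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def triplet_hinge_loss(query, positive, negative, margin=2):
    """Zero when the positive is closer to the query than the negative by `margin` bits."""
    gap = hamming_distance(query, positive) - hamming_distance(query, negative)
    return max(0, gap + margin)


# Hypothetical 4-bit codes: an image query and two text candidates.
image_code = [1, -1, 1, 1]
text_similar = [1, -1, 1, -1]     # differs from the query in 1 bit
text_dissimilar = [-1, 1, -1, 1]  # differs from the query in 3 bits

loss_ok = triplet_hinge_loss(image_code, text_similar, text_dissimilar)
loss_bad = triplet_hinge_loss(image_code, text_dissimilar, text_similar)
```

Only the relative order of the two distances matters here, which is the sense in which triplet labels carry relative rather than absolute similarity information.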