Gaurav Sharma, Bernt Schiele
We propose a novel algorithm for the task of supervised discriminative
distance learning by nonlinearly embedding vectors into a low dimensional
Euclidean space. We work in the challenging setting where supervision is with
constraints on similar and dissimilar pairs while training. The proposed method
is derived by an approximate kernelization of a linear Mahalanobis-like
distance metric learning algorithm and can also be seen as a kernel neural
network. The number of model parameters and test time evaluation complexity of
the proposed method are O(dD) where D is the dimensionality of the input
features and d is the dimension of the projection space - this is in contrast
to the usual kernelization methods as, unlike them, the complexity does not
scale linearly with the number of training examples. We propose a stochastic
gradient based learning algorithm which makes the method scalable (w.r.t. the
number of training examples), while being nonlinear. We train the method with
up to half a million training pairs of 4096 dimensional CNN features. We give
empirical comparisons with relevant baselines on seven challenging datasets for
the task of low dimensional semantic category based image retrieval.
Authors' comments: ICCV 2015 preprint
Bharat Singh, Xintong Han, Zhe Wu, Vlad I. Morariu, Larry S. Davis
Complex event retrieval is a challenging research problem, especially when no training videos are available. An alternative to collecting training videos is to train a large semantic concept bank a priori. Given a text description of an event, event retrieval is performed by selecting concepts linguistically related to the event description and fusing the concept responses on unseen videos. However, defining an exhaustive concept lexicon and pre-training it requires vast computational resources. Therefore, recent approaches automate concept discovery and training by leveraging large amounts of weakly annotated web data. Compact visually salient concepts are automatically obtained by the use of concept pairs or, more generally, n-grams. However, not all visually salient n-grams are necessarily useful for an event query--some combinations of concepts may be visually compact but irrelevant--and this drastically affects performance. We propose an event retrieval algorithm that constructs pairs of automatically discovered concepts and then prunes those concepts that are unlikely to be helpful for retrieval. Pruning depends both on the query and on the specific video instance being evaluated. Our approach also addresses calibration and domain adaptation issues that arise when applying concept detectors to unseen videos. We demonstrate large improvements over other vision based systems on the TRECVID MED 13 dataset.
Enfu Liu, Kanji Tanaka
Map retrieval, the problem of similarity search over a large collection of 2D
pointset maps previously built by mobile robots, is crucial for autonomous
navigation in indoor and outdoor environments. Bag-of-words (BoW) methods
constitute a popular approach to map retrieval; however, these methods have
extremely limited descriptive ability because they ignore the spatial layout
information of the local features. The main contribution of this paper is an
extension of the bag-of-words map retrieval method to enable the use of spatial
information from local features. Our strategy is to explicitly model a unique
viewpoint of an input local map; the pose of the local feature is defined with
respect to this unique viewpoint, and can be viewed as an additional invariant
feature for discriminative map retrieval. Specifically, we wish to determine a
unique viewpoint that is invariant to moving objects, clutter, occlusions, and
actual viewpoints. Hence, we perform scene parsing to analyze the scene
structure, and consider the "center" of the scene structure to be the unique
viewpoint. Our scene parsing is based on a Manhattan world grammar that imposes
a quasi-Manhattan world constraint to enable the robust detection of a scene
structure that is invariant to clutter and moving objects. Experimental results
using the publicly available radish dataset validate the efficacy of the
proposed approach.
Authors' comments: Technical Report, 8 pages, 9 figures
Mahyar Najibi, Mohammad Rastegari, Larry S. Davis
The growing amount of data available in modern-day datasets makes the need to efficiently search and retrieve information. To make large-scale search feasible, Distance Estimation and Subset Indexing are the main approaches. Although binary coding has been popular for implementing both techniques, n-ary coding (known as Product Quantization) is also very effective for Distance Estimation. However, their relative performance has not been studied for Subset Indexing. We investigate whether binary or n-ary coding works better under different retrieval strategies. This leads to the design of a new n-ary coding method, "Linear Subspace Quantization (LSQ)" which, unlike other n-ary encoders, can be used as a similarity-preserving embedding. Experiments on image retrieval show that when Distance Estimation is used, n-ary LSQ outperforms other methods. However, when Subset Indexing is applied, interestingly, binary codings are more effective and binary LSQ achieves the best accuracy.
H. Ouahi, K. Afdel, M. Machkour
In the medical field, images are increasingly used to facilitate diagnosis of diseases. These images are stored in multimedia databases accompanied by doctor s prescriptions and other information related to patients.Search for medical images has become for clinical applications an essential tool to bring effective aid in diagnosis. Content Based Image Retrieval (CBIR) is one of the possible solutions to effectively manage these databases. Our contribution is to define a relevant descriptor to retrieve images based on multiresolution analysis of texture using Local Binary Pattern LBP. This descriptor once calculated and information s relating to the patient; will be placed in the image using the technique of reversible watermarking. Thereby, the image, descriptor of its contents, the BFILE locator and patientrelated information become a single entity, so even the administrator cannot have access to the patient private data.
Arsalan Mousavian, Jana Kosecka
Several recent approaches showed how the representations learned by Convolutional Neural Networks can be repurposed for novel tasks. Most commonly it has been shown that the activation features of the last fully connected layers (fc7 or fc6) of the network, followed by a linear classifier outperform the state-of-the-art on several recognition challenge datasets. Instead of recognition, this paper focuses on the image retrieval problem and proposes a examines alternative pooling strategies derived for CNN features. The presented scheme uses the features maps from an earlier layer 5 of the CNN architecture, which has been shown to preserve coarse spatial information and is semantically meaningful. We examine several pooling strategies and demonstrate superior performance on the image retrieval task (INRIA Holidays) at the fraction of the computational cost, while using a relatively small memory requirements. In addition to retrieval, we see similar efficiency gains on the SUN397 scene categorization dataset, demonstrating wide applicability of this simple strategy. We also introduce and evaluate a novel GeoPlaces5K dataset from different geographical locations in the world for image retrieval that stresses more dramatic changes in appearance and viewpoint.
Kishore Jaganathan, Yonina C. Eldar, Babak Hassibi
The problem of recovering a signal from its Fourier magnitude is of paramount
importance in various fields of engineering and applied physics. Due to the
absence of Fourier phase information, some form of additional information is
required in order to be able to uniquely, efficiently and robustly identify the
underlying signal. Inspired by practical methods in optical imaging, we
consider the problem of signal reconstruction from the Short-Time Fourier
Transform (STFT) magnitude. We first develop conditions under which the STFT
magnitude is an almost surely unique signal representation. We then consider a
semidefinite relaxation-based algorithm (STliFT) and provide recovery
guarantees. Numerical simulations complement our theoretical analysis and
provide directions for future work.
Authors' comments: IEEE Journal of Selected Topics in Signal Processing 2016
Michael Kech
We explicitly give a frame of cardinality $5n-6$ such that every signal in
$\mathbb{C}^n$ can be recovered up to a phase from its associated intensity
measurements via the PhaseLift approach. Furthermore, we give explicit linear
measurements with $4r(n-r)+n-2r$ outcomes that enable the recovery of every
positive semidefinite $n\times n$ matrix of rank at most $r$.
Authors' comments: Applied and Computational Harmonic Analysis, 2016
Yanshan Wang, In-Chan Choi, Hongfang Liu
A generalized ensemble model (gEnM) for document ranking is proposed in this paper. The gEnM linearly combines basis document retrieval models and tries to retrieve relevant documents at high positions. In order to obtain the optimal linear combination of multiple document retrieval models or rankers, an optimization program is formulated by directly maximizing the mean average precision. Both supervised and unsupervised learning algorithms are presented to solve this program. For the supervised scheme, two approaches are considered based on the data setting, namely batch and online setting. In the batch setting, we propose a revised Newton's algorithm, gEnM.BAT, by approximating the derivative and Hessian matrix. In the online setting, we advocate a stochastic gradient descent (SGD) based algorithm---gEnM.ON. As for the unsupervised scheme, an unsupervised ensemble model (UnsEnM) by iteratively co-learning from each constituent ranker is presented. Experimental study on benchmark data sets verifies the effectiveness of the proposed algorithms. Therefore, with appropriate algorithms, the gEnM is a viable option in diverse practical information retrieval applications.
Sohail Bahmani, Justin Romberg
We propose a robust and efficient approach to the problem of compressive
phase retrieval in which the goal is to reconstruct a sparse vector from the
magnitude of a number of its linear measurements. The proposed framework relies
on constrained sensing vectors and a two-stage reconstruction method that
consists of two standard convex programs that are solved sequentially.
In recent years, various methods are proposed for compressive phase
retrieval, but they have suboptimal sample complexity or lack robustness
guarantees. The main obstacle has been that there is no straightforward convex
relaxations for the type of structure in the target. Given a set of
underdetermined measurements, there is a standard framework for recovering a
sparse matrix, and a standard framework for recovering a low-rank matrix.
However, a general, efficient method for recovering a jointly sparse and
low-rank matrix has remained elusive.
Deviating from the models with generic measurements, in this paper we show
that if the sensing vectors are chosen at random from an incoherent subspace,
then the low-rank and sparse structures of the target signal can be effectively
decoupled. We show that a recovery algorithm that consists of a low-rank
recovery stage followed by a sparse recovery stage will produce an accurate
estimate of the target when the number of measurements is
$\mathsf{O}(k\,\log\frac{d}{k})$, where $k$ and $d$ denote the sparsity level
and the dimension of the input signal. We also evaluate the algorithm through
numerical simulation.
Authors' comments: Accepted for the 29th Annual Conference on Neural Information
Processing Systems (NIPS), 2015
Souleymen Sahnoun, El-Hadi Djermoune, David Brie, Pierre Comon
In this paper, a sparse-based method for the estimation of the parameters of
multidimensional ($R$-D) modal (harmonic or damped) complex signals in noise is
presented. The problem is formulated as $R$ simultaneous sparse approximations
of multiple 1-D signals. To get a method able to handle large size signals
while maintaining a sufficient resolution, a multigrid dictionary refinement
technique is associated with the simultaneous sparse approximation problem. The
refinement procedure is proved to converge in the single $R$-D mode case. Then,
for the general multiple modes $R$-D case, the signal tensor model is
decomposed in order to handle each mode separately in an iterative scheme. The
proposed method does not require an association step since the estimated modes
are automatically "paired". We also derive the Cram\'er-Rao lower bounds of the
parameters of modal $R$-D signals. The expressions are given in compact form in
the single $R$-D mode case. Finally, numerical simulations are conducted to
demonstrate the effectiveness of the proposed method.
Authors' comments: 14 pages
Nadine Dulisch, Andreas Oskar Kempf, Philipp Schaer
In recent years, the importance of research data and the need to archive and
to share it in the scientific community have increased enormously. This
introduces a whole new set of challenges for digital libraries. In the social
sciences typical research data sets consist of surveys and questionnaires. In
this paper we focus on the use case of social science survey question reuse and
on mechanisms to support users in the query formulation for data sets. We
describe and evaluate thesaurus- and co-occurrence-based approaches for query
expansion to improve retrieval quality in digital libraries and research data
archives. The challenge here is to translate the information need and the
underlying sociological phenomena into proper queries. As we can show retrieval
quality can be improved by adding related terms to the queries. In a direct
comparison automatically expanded queries using extracted co-occurring terms
can provide better results than queries manually reformulated by a domain
expert and better results than a keyword-based BM25 baseline.
Authors' comments: to appear in Proceedings of 19th International Conference on Theory
and Practice of Digital Libraries 2015 (TPDL 2015)
Matthijs Douze, Jérôme Revaud, Jakob Verbeek, Hervé Jégou, Cordelia Schmid
We address the problem of specific video event retrieval. Given a query video of a specific event, e.g., a concert of Madonna, the goal is to retrieve other videos of the same event that temporally overlap with the query. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to efficiently compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. The descriptors can be compressed in the frequency domain with a product quantizer adapted to complex numbers. In this case, video retrieval is performed without decompressing the descriptors. We also consider the temporal alignment of a set of videos. We exploit the matching confidence and an estimate of the temporal offset computed for all pairs of videos by our retrieval approach. Our robust algorithm aligns the videos on a global timeline by maximizing the set of temporally consistent matches. The global temporal alignment enables synchronous playback of the videos of a given scene.
Thanh The Van, Thanh Manh Le
This chapter approaches the image retrieval system on the base of the colors
of image. It creates fuzzy signature to describe the color of image on color
space HSV and builds fuzzy Hamming distance (FHD) to evaluate the similarity
between the images. In order to reduce the storage space and speed up the
search of similar images, it aims to create S-tree to store fuzzy signature
relies on FHD and builds image retrieval algorithm on S-tree. Then, it provides
the content-based image retrieval (CBIR) and an image retrieval method on FHD
and S-tree. Last but not least, based on this theory, it also presents an
application and experimental assessment of the process of querying similar
image on the database system over 10,000 images.
Authors' comments: 9 pages, 4 figures
Thanh The Van, Thanh Manh Le
In this paper, we introduce an optimum approach for querying similar images
on large digital-image databases. Our work is based on RBIR (region-based image
retrieval) method which uses multiple regions as the key to retrieval images.
This method significantly improves the accuracy of queries. However, this also
increases the cost of computing. To reduce this expensive computational cost,
we implement binary signature encoder which maps an image to its identification
in binary. In order to fasten the lookup, binary signatures of images are
classified by the help of S-kGraph. Finally, our work is evaluated on COREL's
images.
Authors' comments: 17 pages, 9 figures
Hamid R. Tizhoosh
This paper proposes to generate and to use barcodes to annotate medical
images and/or their regions of interest such as organs, tumors and tissue
types. A multitude of efficient feature-based image retrieval methods already
exist that can assign a query image to a certain image class. Visual
annotations may help to increase the retrieval accuracy if combined with
existing feature-based classification paradigms. Whereas with annotations we
usually mean textual descriptions, in this paper barcode annotations are
proposed. In particular, Radon barcodes (RBC) are introduced. As well, local
binary patterns (LBP) and local Radon binary patterns (LRBP) are implemented as
barcodes. The IRMA x-ray dataset with 12,677 training images and 1,733 test
images is used to verify how barcodes could facilitate image retrieval.
Authors' comments: To be published in proceedings of The IEEE International Conference
on Image Processing (ICIP 2015), September 27-30, 2015, Quebec City, Canada
Patryk Najgebauer, Janusz Rygal, Tomasz Nowak, Jakub Romanowski, Leszek Rutkowski, Sviatoslav Voloshynovskiy, Rafal Scherer
This paper describes a method for searching for common sets of descriptors
between collections of images. The presented method operates on local interest
keypoints, which are generated using the SURF algorithm. The use of a
dictionary of descriptors allowed achieving good performance of the
content-based image retrieval. The method can be used to initially determine a
set of similar pairs of keypoints between images. For this purpose, we use a
certain level of tolerance between values of descriptors, as values of feature
descriptors are almost never equal but similar between different images. After
that, the method compares the structure of rotation and location of interest
points in one image with the point structure in other images. Thus, we were
able to find similar areas in images and determine the level of similarity
between them, even when images contain different scenes.
Authors' comments: Accepted for the 14th International Conference on Artificial
Intelligence and Soft Computing, ICAISC, June 14-18, 2015, Zakopane, Poland,
http://www.icaisc.eu/
Joe Yue-Hei Ng, Fan Yang, Larry S. Davis
Deep convolutional neural networks have been successfully applied to image
classification tasks. When these same networks have been applied to image
retrieval, the assumption has been made that the last layers would give the
best performance, as they do in classification. We show that for instance-level
image retrieval, lower layers often perform better than the last layers in
convolutional neural networks. We present an approach for extracting
convolutional features from different layers of the networks, and adopt VLAD
encoding to encode features into a single vector for each image. We investigate
the effect of different layers and scales of input images on the performance of
convolutional features using the recent deep networks OxfordNet and GoogLeNet.
Experiments demonstrate that intermediate layers or higher layers with finer
scales produce better results for image retrieval, compared to the last layer.
When using compressed 128-D VLAD descriptors, our method obtains
state-of-the-art results and outperforms other VLAD and CNN based approaches on
two out of three test datasets. Our work provides guidance for transferring
deep networks trained on image classification to image retrieval tasks.
Authors' comments: CVPR DeepVision Workshop 2015
Longkun Guo, Hong Shen, Wenxing Zhu
In a mobile network, wireless data broadcast over $m$ channels (frequencies) is a powerful means for distributed dissemination of data to clients who access the channels through multi-antennae equipped on their mobile devices. The $\delta$-antennae largest weight data retrieval ($\delta$ALWDR) problem is to compute a schedule for downloading a subset of data items that has a maximum total weight using $\delta$ antennae in a given time interval. In this paper, we propose a ratio $1-\frac{1}{e}-\epsilon$ approximation algorithm for the $\delta$-antennae largest weight data retrieval ($\delta$ALWDR) problem that has the same ratio as the known result but a significantly improved time complexity of $O(2^{\frac{1}{\epsilon}}\frac{1}{\epsilon}m^{7}T^{3.5}L)$ from $O(\epsilon^{3.5}m^{\frac{3.5}{\epsilon}}T^{3.5}L)$ when $\delta=1$ \cite{lu2014data}. To our knowledge, our algorithm represents the first ratio $1-\frac{1}{e}-\epsilon$ approximation solution to $\delta$ALWDR for the general case of arbitrary $\delta$. To achieve this, we first give a ratio $1-\frac{1}{e}$ algorithm for the $\gamma$-separated $\delta$ALWDR ($\delta$A$\gamma$LWDR) with runtime $O(m^{7}T^{3.5}L)$, under the assumption that every data item appears at most once in each segment of $\delta$A$\gamma$LWDR, for any input of maximum length $L$ on $m$ channels in $T$ time slots. Then, we show that we can retain the same ratio for $\delta$A$\gamma$LWDR without this assumption at the cost of increased time complexity to $O(2^{\gamma}m^{7}T^{3.5}L)$. This result immediately yields an approximation solution of same ratio and time complexity for $\delta$ALWDR, presenting a significant improvement of the known time complexity of ratio $1-\frac{1}{e}-\epsilon$ approximation to the problem.
Fang Wang, Le Kang, Yi Li
Retrieving 3D models from 2D human sketches has received considerable
attention in the areas of graphics, image retrieval, and computer vision.
Almost always in state of the art approaches a large amount of "best views" are
computed for 3D models, with the hope that the query sketch matches one of
these 2D projections of 3D models using predefined features.
We argue that this two stage approach (view selection -- matching) is
pragmatic but also problematic because the "best views" are subjective and
ambiguous, which makes the matching inputs obscure. This imprecise nature of
matching further makes it challenging to choose features manually. Instead of
relying on the elusive concept of "best views" and the hand-crafted features,
we propose to define our views using a minimalism approach and learn features
for both sketches and views. Specifically, we drastically reduce the number of
views to only two predefined directions for the whole dataset. Then, we learn
two Siamese Convolutional Neural Networks (CNNs), one for the views and one for
the sketches. The loss function is defined on the within-domain as well as the
cross-domain similarities. Our experiments on three benchmark datasets
demonstrate that our method is significantly better than state of the art
approaches, and outperforms them in all conventional metrics.
Authors' comments: CVPR 2015