Kaustubh D. Dhole
Task-oriented dialogue systems generally employ intent detection systems to map user queries to a set of pre-defined intents. However, user queries in natural language can easily be ambiguous, so such a direct mapping might not be straightforward, harming intent detection and eventually the overall performance of a dialogue system. Moreover, acquiring domain-specific clarification questions is costly. In order to disambiguate queries which are ambiguous between two intents, we propose a novel method of generating discriminative questions using a simple rule-based system which can take advantage of any question generation system without requiring annotated data of clarification questions. Our approach aims at discrimination between two intents but can be easily extended to clarification over multiple intents. Seeking clarification from the user to classify user intents not only helps understand the user intent effectively, but also makes the conversation less robotic and the interaction considerably more natural.
Philipp Grohs, Lukas Liehr
We establish novel uniqueness results for the Gabor phase retrieval problem:
if $\mathcal{G} : L^2(\mathbb{R}) \to L^2(\mathbb{R}^2)$ denotes the Gabor
transform then every $f \in L^4[-\tfrac{c}{2},\tfrac{c}{2}]$ is determined up
to a global phase by the values $|\mathcal{G}f(x,\omega)|$ where $(x,\omega)$
are points on the lattice $b^{-1}\mathbb{Z} \times (2c)^{-1}\mathbb{Z}$ and
$b>0$ is an arbitrary positive constant. This shows for the first time that
compactly-supported, complex-valued functions can be uniquely reconstructed
from lattice samples of their spectrogram. Moreover, by making use of recent
developments related to sampling in shift-invariant spaces by Gr\"ochenig,
Romero and St\"ockler, we prove analogous uniqueness results for functions in
shift-invariant spaces with Gaussian generator. Generalizations to nonuniform
sampling are also presented. Finally, we compare our results to the situation
where the considered signals are assumed to be real-valued.
Authors' comments: 26 pages, 1 figure, Section 3.2 and Section 3.4 added
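For readers unfamiliar with the setting, the statement above can be written out as follows. The abstract does not define the Gabor transform, so the Gaussian-window definition below is the standard one and is an assumption here; normalization constants vary between references.

```latex
% Gabor transform with Gaussian window (standard definition; normalization assumed)
\[
  \mathcal{G}f(x,\omega) = \int_{\mathbb{R}} f(t)\, e^{-\pi (t-x)^2}\, e^{-2\pi i \omega t}\, dt ,
  \qquad (x,\omega) \in \mathbb{R}^2 .
\]
% Uniqueness up to a global phase from lattice samples of the spectrogram:
\[
  f, h \in L^4\!\left[-\tfrac{c}{2},\tfrac{c}{2}\right], \quad
  |\mathcal{G}f(x,\omega)| = |\mathcal{G}h(x,\omega)|
  \ \ \forall\, (x,\omega) \in b^{-1}\mathbb{Z} \times (2c)^{-1}\mathbb{Z}
  \;\Longrightarrow\;
  h = e^{i\alpha} f \ \text{ for some } \alpha \in \mathbb{R}.
\]
```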
Andrey Kurenkov, Joseph Taglic, Rohun Kulkarni, Marcus Dominguez-Kuhne, Animesh Garg, Roberto Martín-Martín, Silvio Savarese
When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way, fully reveal the object of interest, and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcomes is extremely hard, if not impossible. In problems where accurate (forward) models are lacking, Deep Reinforcement Learning (RL) has been shown to be a viable solution to map observations (e.g. images) to good interactions in the form of closed-loop visuomotor policies. However, Deep RL is sample inefficient and fails when applied directly to the problem of unoccluding objects based on images. In this work we present a novel Deep RL procedure that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects. Our experiments show that our approach trains faster and converges to more efficient uncovering solutions than baselines and ablations, and that our uncovering policies lead to an average improvement in the graspability of the target object, facilitating downstream retrieval applications.
Raul Gomez, Yahui Liu, Marco De Nadai, Dimosthenis Karatzas, Bruno Lepri, Nicu Sebe
Image to image translation aims to learn a mapping that transforms an image
from one visual domain to another. Recent works assume that image descriptors
can be disentangled into a domain-invariant content representation and a
domain-specific style representation. Thus, translation models seek to preserve
the content of source images while changing the style to a target visual
domain. However, synthesizing new images is extremely challenging especially in
multi-domain translations, as the network has to compose content and style to
generate reliable and diverse images in multiple domains. In this paper we
propose the use of an image retrieval system to assist the image-to-image
translation task. First, we train an image-to-image translation model to map
images to multiple domains. Then, we train an image retrieval model using real
and generated images to find images similar to a query one in content but in a
different domain. Finally, we exploit the image retrieval system to fine-tune
the image-to-image translation model and generate higher quality images. Our
experiments show the effectiveness of the proposed solution and highlight the
contribution of the retrieval network, which can benefit from additional
unlabeled data and help image-to-image translation models in the presence of
scarce data.
Authors' comments: Submitted to ACM MM '20, October 12-16, 2020, Seattle, WA, USA
Kuan Fang, Long Zhao, Zhan Shen, RuiXing Wang, RiKang Zhour, LiWen Fan
Search engines have become a fundamental component of various web and mobile
applications. Retrieving relevant documents from massive datasets is
challenging for a search engine system, especially when faced with verbose or
tail queries. In this paper, we explore a vector space search framework for
document retrieval. Specifically, we trained a deep semantic matching model so
that each query and document can be encoded as a low dimensional embedding. Our
model was built on the BERT architecture. We deployed a fast
k-nearest-neighbor index service for online serving. Both offline and online
metrics demonstrate that our method improved retrieval performance and search
quality considerably, particularly for tail
Authors' comments: 9 pages
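The vector-space retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' system: the embeddings are hand-written toy vectors rather than outputs of a BERT-based encoder, an exhaustive cosine-similarity scan stands in for the fast k-nearest-neighbor index service, and the names `normalize` and `top_k` are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query.

    doc_vecs maps a document id to its embedding. Similarity is the cosine
    of the angle between query and document embeddings. This is a brute-force
    scan; a production system would use an approximate index instead.
    """
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(d))), doc_id)
        for doc_id, d in doc_vecs.items()
    )
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Toy two-dimensional embeddings (assumed values, for illustration only).
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
nearest = top_k([1.0, 0.0], docs, k=2)
```

Because both queries and documents live in the same low-dimensional space, a verbose or tail query can still match documents that share no exact terms with it, which is the motivation given in the abstract.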
Nisarg Raval, Manisha Verma
Adversarial examples, generated by applying small perturbations to input features, are widely used to fool classifiers and measure their robustness to noisy inputs. However, little work has been done to evaluate the robustness of ranking models through adversarial examples. In this work, we present a systematic approach of leveraging adversarial examples to measure the robustness of popular ranking models. We explore a simple method to generate adversarial examples that forces a ranker to incorrectly rank the documents. Using this approach, we analyze the robustness of various ranking models and the quality of perturbations generated by the adversarial attacker across two datasets. Our findings suggest that with very few token changes (1-3), the attacker can yield semantically similar perturbed documents that can fool different rankers into changing a document's score, lowering its rank by several positions.
Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue
The current research focus on Content-Based Video Retrieval requires higher-level video representations describing the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames of a video as individual images or short clips, making the modeling of long-range semantic dependencies difficult. In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism. To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval datasets, such as CC_WEB_VIDEO, FIVR-200K, and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference time compared with frame-level features.
Samer B. Nashed
As mobile robot capabilities improve and deployment times increase, tools to
analyze the growing volume of data are becoming necessary. Current
state-of-the-art logging, playback, and exploration systems are insufficient
for practitioners seeking to discover systemic points of failure in robotic
systems. This paper presents a suite of algorithms for similarity-based queries
of robotic perception data and implements a system for storing 2D LiDAR data
from many deployments cheaply and evaluating top-k queries for complete or
partial scans efficiently. We generate compressed representations of laser
scans via a convolutional variational autoencoder and store them in a database,
where a light-weight dense network for distance function approximation is run
at query time. Our query evaluator leverages the local continuity of the
embedding space to generate evaluation orders that, in expectation, dominate
full linear scans of the database. The accuracy, robustness, scalability, and
efficiency of our system are tested on real-world data gathered from dozens of
deployments and synthetic data generated by corrupting real data. We find our
system accurately and efficiently identifies similar scans across a number of
episodes where the robot encountered the same location, or similar indoor
structures or objects.
Authors' comments: 6 pages
Anjali A. A. Piette, Nikku Madhusudhan
Isolated brown dwarfs provide remarkable laboratories for understanding
atmospheric physics in the low-irradiation regime, and can be observed more
precisely than exoplanets. As such, they provide a glimpse into the future of
high-SNR observations of exoplanets. In this work, we investigate several new
considerations that are important for atmospheric retrievals of high-quality
thermal emission spectra of sub-stellar objects. We pursue this using an
adaptation of the HyDRA atmospheric retrieval code. We propose a parametric
pressure-temperature (P-T) profile for brown dwarfs consisting of multiple
atmospheric layers, parameterised by the temperature change across each layer.
This model allows the steep temperature gradient of brown dwarf atmospheres to
be accurately retrieved while avoiding commonly-encountered numerical
artefacts. The P-T model is especially flexible in the photosphere, which can
reach a few tens of bar for T-dwarfs. We demonstrate an approach to include
model uncertainties in the retrieval, focusing on uncertainties introduced by
finite spectral and vertical resolution in the atmospheric model used for
retrieval (~8\% in the present case). We validate our retrieval framework by
applying it to a simulated data set and then apply it to the HST/WFC3 spectrum
of the T-dwarf 2MASS J2339+1352. We retrieve sub-solar abundances of H2O and
CH4 in the object at ~0.1 dex precision. Additionally, we constrain the
temperature structure to within ~100 K in the photosphere. Our results
demonstrate the promise of high-SNR spectra to provide high-precision abundance
estimates of sub-stellar objects.
Authors' comments: Accepted for publication in MNRAS, 21 pages, 15 figures
Pratik Patil, Mikael Kuusela, Jonathan Hobbs
The steadily increasing amount of atmospheric carbon dioxide (CO$_2$) is
affecting the global climate system and threatening the long-term
sustainability of Earth's ecosystem. In order to better understand the sources
and sinks of CO$_2$, NASA operates the Orbiting Carbon Observatory-2 & 3
satellites to monitor CO$_2$ from space. These satellites make passive radiance
measurements of the sunlight reflected off the Earth's surface in different
spectral bands, which are then inverted in an ill-posed inverse problem to
obtain estimates of the atmospheric CO$_2$ concentration. In this work, we
propose a new CO$_2$ retrieval method that uses known physical constraints on
the state variables and direct inversion of the target functional of interest
to construct well-calibrated frequentist confidence intervals based on convex
programming. We compare the method with the current operational retrieval
procedure, which uses prior knowledge in the form of probability distributions
to regularize the problem. We demonstrate that the proposed intervals
consistently achieve the desired frequentist coverage, while the operational
uncertainties are poorly calibrated in a frequentist sense both at individual
locations and over a spatial region in a realistic simulation experiment. We
also study the influence of specific nuisance state variables on the length of
the proposed intervals and identify certain key variables that can greatly
reduce the final uncertainty given additional deterministic or probabilistic
constraints, and develop a principled framework to incorporate such information
into our method.
Authors' comments: 33 pages, 6 figures
Martin Humenberger, Yohann Cabon, Nicolas Guerin, Julien Morat, Vincent Leroy, Jérôme Revaud, Philippe Rerole, Noé Pion et al.
Visual localization tackles the challenge of estimating the camera pose from images by using correspondence analysis between query images and a map. This task is computation- and data-intensive, which makes thorough evaluation of methods on various datasets challenging. However, in order to advance the field further, we claim that robust visual localization algorithms should be evaluated on multiple datasets covering a broad variety of domains. To facilitate this, we introduce kapture, a new, flexible, unified data format and toolbox for visual localization and structure-from-motion (SFM). It enables easy usage of different datasets as well as efficient and reusable data processing. To demonstrate this, we present a versatile pipeline for visual localization that facilitates the use of different local and global features, 3D data (e.g. depth maps), non-vision sensor data (e.g. IMU, GPS, WiFi), and various processing algorithms. Using multiple configurations of the pipeline, we show the great versatility of kapture in our experiments. Furthermore, we evaluate our methods on eight public datasets where they rank top on all and first on many of them. To foster future research, we release code, models, and all datasets used in this paper as open source in the kapture format under a permissive BSD license. github.com/naver/kapture, github.com/naver/kapture-localization
Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai
Object recognition has seen significant progress in the image domain, with
focus primarily on 2D perception. We propose to leverage existing large-scale
datasets of 3D models to understand the underlying 3D structure of objects seen
in an image by constructing a CAD-based representation of the objects and their
poses. We present Mask2CAD, which jointly detects objects in real-world images
and for each detected object, optimizes for the most similar CAD model and its
pose. We construct a joint embedding space between the detected regions of an
image corresponding to an object and 3D CAD models, enabling retrieval of CAD
models for an input RGB image. This produces a clean, lightweight
representation of the objects in an image; this CAD-based representation
ensures a valid, efficient shape representation for applications such as
content creation or interactive scenarios, and makes a step towards
understanding the transformation of real-world imagery to a synthetic domain.
Experiments on real-world images from Pix3D demonstrate the advantage of our
approach in comparison to state of the art. To facilitate future research, we
additionally propose a new image-to-3D baseline on ScanNet which features
larger shape diversity, real-world occlusions, and challenging image views.
Authors' comments: ECCV 2020 (Spotlight)
Qi Gu, Zhihua Xia, Xingming Sun
Content-Based Image Retrieval (CBIR) techniques have been widely researched
and deployed with the help of cloud computing, as in Google Images. However,
images often contain rich sensitive information. In this case, privacy
protection becomes a big problem, as the cloud cannot always be fully trusted.
Many privacy-preserving image retrieval schemes have been proposed, in which
the image owner can upload encrypted images to the cloud, and the owner
or an authorized user can execute secure retrieval with the help
of the cloud. Nevertheless, few existing studies consider the multi-source
setting, which is more practical. In this paper, we analyze the difficulties in
Multi-Source Privacy-Preserving Image Retrieval (MSPPIR). Then, using images
in JPEG format as an example, we propose a scheme called JES-MSIR, namely a
novel JPEG image Encryption Scheme designed for Multi-Source content-based
Image Retrieval. JES-MSIR supports the requirements of MSPPIR, including
constant-round secure retrieval from multiple sources and the union of
multiple sources for better retrieval services. Experimental results and security
analysis of the proposed scheme show its efficiency, security and accuracy.
Authors' comments: this version adds notations and repairs some mistakes
Jinho Choi
In machine-type communication (MTC), a group of devices or sensors may need
to send their data packets with certain access delay limits for delay-sensitive
applications or real-time Internet-of-Things (IoT) applications. In this case,
2-step random access approaches would be preferable to 4-step random access
approaches that are employed for most MTC standards in cellular systems. While
2-step approaches are efficient in terms of access delay, their access delay is
still dependent on retransmission strategies. Thus, for a low access delay,
fast retrial that allows immediate re-transmissions can be employed as a
re-transmission strategy. In this paper, we study 2-step random access
approaches with fast retrial as a buffered multichannel ALOHA with fast
retrial, and derive an analytical way to obtain the quality-of-service (QoS)
exponent for the distribution of queue length so that key parameters can be
decided to meet QoS requirements in terms of access delay. Simulation results
confirm that the derived analytical approach can provide a good approximation
of the QoS exponent.
Authors' comments: 9 pages, IEEE IoTJ (accepted)
Gaspard Anthoine, Jean-Guillaume Dumas, Michael Hanling, Mélanie de Jonghe, Aude Maignan, Clément Pernet, Daniel Roche
Proofs of Retrievability (PoRs) are protocols which allow a client to store data remotely and to efficiently ensure, via audits, that the entirety of that data is still intact. A dynamic PoR system also supports efficient retrieval and update of any small portion of the data. We propose new, simple protocols for dynamic PoR that are designed for practical efficiency, trading decreased persistent storage for increased server computation, and show in fact that this tradeoff is inherent via a lower bound proof of time-space for any PoR scheme. Notably, ours is the first dynamic PoR which does not require any special encoding of the data stored on the server, meaning it can be trivially composed with any database service or with existing techniques for encryption or redundancy. Our implementation and deployment on Google Cloud Platform demonstrates our solution is scalable: for example, auditing a 1TB file takes just less than 5 minutes and costs less than $0.08 USD. We also present several further enhancements, reducing the amount of client storage, or the communication bandwidth, or allowing public verifiability, wherein any untrusted third party may conduct an audit.
Michael Segundo Ortiz
In this survey I discuss ophthalmic neurophysiology and the experimental considerations that must be made to reduce possible noise in an eye-tracking data stream. I also review the history, experiments, technological benefits and limitations of eye-tracking within the information retrieval field. Finally, I explore the concepts of aware and adaptive user interfaces in a modest attempt to synthesize work from the fields of industrial engineering and psychophysiology with information retrieval.
Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman
Optimising a ranking-based metric, such as Average Precision (AP), is
notoriously challenging due to the fact that it is non-differentiable, and
hence cannot be optimised directly using gradient-descent methods. To this end,
we introduce an objective that optimises instead a smoothed approximation of
AP, coined Smooth-AP. Smooth-AP is a plug-and-play objective function that
allows for end-to-end training of deep networks with a simple and elegant
implementation. We also present an analysis for why directly optimising the
ranking based metric of AP offers benefits over other deep metric learning
losses. We apply Smooth-AP to standard retrieval benchmarks: Stanford Online
Products and VehicleID, and also evaluate on larger-scale datasets: iNaturalist
for fine-grained category retrieval, and VGGFace2 and IJB-C for face retrieval.
In all cases, we improve the performance over the state-of-the-art, especially
for larger-scale datasets, thus demonstrating the effectiveness and scalability
of Smooth-AP to real-world scenarios.
Authors' comments: Accepted at ECCV 2020
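The idea of smoothing Average Precision can be illustrated with a small sketch. The abstract does not give the formula, so the version below assumes the usual relaxation in which the hard indicator "item j is ranked above item i" is replaced by a sigmoid of the score difference with temperature `tau`. The function names and the default `tau` are assumptions, and plain Python floats are used instead of a deep-learning framework, so no gradients are computed here.

```python
import math


def sigmoid(x, tau):
    """Numerically stable sigmoid of x / tau."""
    z = x / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_ap(scores, labels, tau=0.01):
    """Sigmoid-smoothed Average Precision for one query.

    scores: similarity of each retrieved item to the query.
    labels: 1 for relevant items, 0 otherwise.
    The rank of item i is approximated by 1 + sum_j sigmoid(s_j - s_i),
    once among the relevant items only and once among all items.
    """
    positives = [i for i, y in enumerate(labels) if y]
    if not positives:
        raise ValueError("at least one relevant item is required")
    total = 0.0
    for i in positives:
        rank_among_pos = 1.0 + sum(
            sigmoid(scores[j] - scores[i], tau) for j in positives if j != i
        )
        rank_among_all = 1.0 + sum(
            sigmoid(scores[j] - scores[i], tau)
            for j in range(len(scores))
            if j != i
        )
        total += rank_among_pos / rank_among_all
    return total / len(positives)


# Relevant items at ranks 1 and 3: exact AP is (1/1 + 2/3) / 2 = 5/6.
value = smooth_ap([0.9, 0.8, 0.1], [1, 0, 1])
```

As `tau` shrinks, the sigmoid approaches a step function and the smoothed value approaches the exact AP; a larger `tau` gives a smoother objective at the cost of a looser approximation.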
Sadaqat ur Rehman, Muhammad Waqas, Shanshan Tu, Anis Koubaa, Obaid ur Rehman, Jawad Ahmad, Muhammad Hanif, Zhu Han
With the advancement in technology and the expansion of broadcasting,
cross-media retrieval has gained much attention. It plays a significant role in
big data applications and consists in searching and finding data from different
types of media. In this paper, we provide a novel taxonomy according to the
challenges faced by multi-modal deep learning approaches in solving cross-media
retrieval, namely: representation, alignment, and translation. These challenges
are evaluated on deep learning (DL) based methods, which are categorized into
four main groups: 1) unsupervised methods, 2) supervised methods, 3) pairwise
based methods, and 4) rank based methods. Then, we present some well-known
cross-media datasets used for retrieval, considering the importance of these
datasets in the context of deep learning based cross-media retrieval
approaches. Moreover, we also present an extensive review of the
state-of-the-art problems and their corresponding solutions for encouraging deep
learning in cross-media retrieval. The fundamental objective of this work is to
exploit Deep Neural Networks (DNNs) for bridging the "media gap", and provide
researchers and developers with a better understanding of the underlying
problems and the potential solutions of deep learning assisted cross-media
retrieval. To the best of our knowledge, this is the first comprehensive survey
to address cross-media retrieval under deep learning methods.
Authors' comments: arXiv admin note: text overlap with arXiv:1804.09539 by other authors
Bhaskar Mitra, Sebastian Hofstatter, Hamed Zamani, Nick Craswell
The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark, and can be considered an efficient (but slightly less effective) alternative to BERT-based ranking models. In this work, we extend the TK architecture to the full retrieval setting by incorporating the query term independence assumption. Furthermore, to reduce the memory complexity of the Transformer layers with respect to the input sequence length, we propose a new Conformer layer. We show that the Conformer's GPU memory requirement scales linearly with input sequence length, making it a more viable option when ranking long documents. Finally, we demonstrate that incorporating an explicit term matching signal into the model can be particularly useful in the full retrieval setting. We present preliminary results from our work in this paper.
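The query term independence assumption mentioned above can be sketched as follows: the score of a document for a query is taken to be the sum of per-term scores, so term-document scores can be precomputed offline and retrieval reduces to summing entries of an inverted index. The index contents and the function name `retrieve` are hypothetical, and in the paper's setting the per-term scores would come from the neural model rather than being written by hand.

```python
from collections import defaultdict


def retrieve(query, index, k):
    """Rank documents by the sum of precomputed per-term scores.

    index maps each term to a dict of {doc_id: score}. Under query term
    independence, score(query, doc) = sum over query terms of score(term, doc),
    so no document needs to be scored jointly with the full query at query time.
    """
    totals = defaultdict(float)
    for term in query.split():
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    # Sort by descending score, breaking ties by document id.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy precomputed term-document scores (assumed values, for illustration only).
index = {
    "deep": {"d1": 1.0, "d2": 0.5},
    "learning": {"d2": 1.0, "d3": 0.2},
}
ranking = retrieve("deep learning", index, k=2)
```

This is what makes full-collection retrieval feasible for a model that would otherwise be too expensive to run on every document for every query.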
Xiangyang Mou, Mo Yu, Bingsheng Yao, Chenghao Yang, Xiaoxiao Guo, Saloni Potdar, Hui Su
A lot of progress has been made to improve question answering (QA) in recent
years, but the special problem of QA over narrative book stories has not been
explored in-depth. We formulate BookQA as an open-domain QA task given its
similar dependency on evidence retrieval. We further investigate how
state-of-the-art open-domain QA approaches can help BookQA. Besides achieving
state-of-the-art on the NarrativeQA benchmark, our study also reveals the
difficulty of evidence retrieval in books with a wealth of experiments and
analysis - which necessitates future effort on novel solutions for evidence
retrieval in BookQA.
Authors' comments: ACL 2020 NUSE Workshop, 6 pages