Feng Hu
Indoor localization has many applications, such as commercial Location Based Services (LBS), robotic navigation, and assistive navigation for the blind. This paper formulates indoor localization as a multimedia retrieval problem by modeling visual landmarks with a panoramic image feature and calculating a user's location via a GPU-accelerated parallel retrieval algorithm. To solve the scene similarity problem, we apply a multi-image based retrieval strategy and a 2D aggregation method to estimate the final retrieval location. Experiments on real data from a campus building demonstrate real-time responses (14 fps) and robust localization.
Cheng Zhang
Because the number of static human poses is limited, it is hard to retrieve
specific videos using a single pose as the clue. However, with a pose sequence or a
dynamic gesture as the keyword, retrieving specific videos becomes more
feasible. We propose a novel method for querying videos containing a designated
sequence of human poses, whereas previous works only designate a single static
pose. The proposed method takes continuous 3D human poses from the keyword gesture
video and the video candidates, then converts each pose in individual frames into
bone direction descriptors, which describe the direction of each natural
connection in the articulated pose. A temporal pyramid sliding window is then
applied to find matches between the designated gesture and the video candidates, which
ensures that the same gesture performed with different durations can be matched.
Authors' comments: The problem proposed in this article should be classified as "gesture
retrieval" or "gesture detection", and there are already better algorithms to
deal with the proposed problem, for example Dynamic Time Warping (DTW) based
methods. The solution in this work contributes little to the field, so
I decided to withdraw it.
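To make the bone direction descriptor from the abstract above concrete, here is a minimal sketch. It assumes a hypothetical 5-joint skeleton and a `BONES` list of (parent, child) index pairs; neither is taken from the paper. Each bone is reduced to a unit vector, so the descriptor does not change when the pose is translated or uniformly scaled.

```python
import numpy as np

# Hypothetical 5-joint skeleton: (parent, child) joint indices, one pair per bone.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def bone_directions(joints, bones=BONES):
    """Map 3D joints of shape (J, 3) to unit bone directions of shape (B, 3)."""
    joints = np.asarray(joints, dtype=float)
    parents = joints[[p for p, _ in bones]]
    children = joints[[c for _, c in bones]]
    vecs = children - parents
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    # Guard against zero-length bones to avoid division by zero.
    return vecs / np.maximum(norms, 1e-12)

# Toy pose (made-up coordinates): hip, chest, head, left hand, right hand.
pose = np.array([[0, 0, 0], [0, 1, 0], [0, 2, 0], [-1, 1, 0], [1, 1, 0]])
desc = bone_directions(pose)
# The same pose, scaled and shifted, yields the same descriptor.
scaled = bone_directions(3.0 * pose + 5.0)
```

Per-frame descriptors of this kind could then be compared frame by frame inside a sliding window; the matching step itself is not sketched here.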
Xiaopeng Zhang
Hashing has been widely used in approximate nearest neighbor search for large-scale database retrieval because of its computation and storage efficiency. Deep hashing, which uses convolutional neural network architectures to extract the semantic information or features of images, has received increasing attention recently. In this survey, I evaluate several deep supervised hashing methods for image retrieval and identify three main directions for such methods. Several comments are made at the end. Moreover, to break through the bottleneck of existing hashing methods, I propose a tentative Shadow Recurrent Hashing (SRH) method. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function that encourages similar images to be projected close to each other. To this end, I propose a concept: the shadow of the CNN output. During the optimization process, the CNN output and its shadow guide each other so as to approach the optimal solution as closely as possible. Several experiments on the CIFAR-10 dataset show the satisfactory performance of SRH.
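The retrieval step that hashing methods share can be sketched as follows. This is not SRH: the feature vectors are made-up stand-ins for CNN outputs, and thresholding at zero is just one simple binarization choice. The sketch only shows why binary codes are cheap to compare: ranking reduces to counting differing bits.

```python
import numpy as np

def binarize(features):
    """Turn real-valued features into {0, 1} codes by thresholding at zero."""
    return (np.asarray(features) > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Return database indices sorted by Hamming distance, plus the distances."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists

# Made-up 4-dimensional "CNN outputs" for three database images and one query.
db_feats = np.array([[0.9, -0.2, 0.4, -0.7],
                     [-0.5, 0.3, -0.1, 0.8],
                     [0.6, -0.4, -0.3, -0.9]])
query_feat = np.array([0.8, -0.1, 0.5, -0.6])

order, dists = hamming_rank(binarize(query_feat), binarize(db_feats))
```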
Shangqing Liu, Yu Chen, Xiaofei Xie, Jingkai Siow, Yang Liu
Source code summarization aims to generate natural language summaries from structured code snippets to help readers understand code functionality. However, automatic code summarization is challenging due to the complexity of the source code and the language gap between the source code and natural language summaries. Most previous approaches rely on either retrieval-based methods (which can take advantage of similar examples seen in the retrieval database, but have low generalization performance) or generation-based methods (which have better generalization performance, but cannot take advantage of similar examples). This paper proposes a novel retrieval-augmented mechanism to combine the benefits of both worlds. Furthermore, to mitigate the limitation of Graph Neural Networks (GNNs) in capturing global graph structure information of source code, we propose a novel attention-based dynamic graph to complement the static graph representation of the source code, and design a hybrid message passing GNN for capturing both the local and global structural information. To evaluate the proposed approach, we release a new challenging benchmark, crawled from diversified large-scale open-source C projects (95k+ unique functions in total). Our method achieves state-of-the-art performance, improving on existing methods by 1.42, 2.44 and 1.29 in terms of BLEU-4, ROUGE-L and METEOR, respectively.
Antoine Maillard, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
We consider the phase retrieval problem of reconstructing an $n$-dimensional
real or complex signal $\mathbf{X}^{\star}$ from $m$ (possibly noisy)
observations $Y_\mu = | \sum_{i=1}^n \Phi_{\mu i} X^{\star}_i/\sqrt{n}|$, for a
large class of correlated real and complex random sensing matrices
$\mathbf{\Phi}$, in a high-dimensional setting where $m,n\to\infty$ while
$\alpha = m/n=\Theta(1)$. First, we derive sharp asymptotics for the lowest
possible estimation error achievable statistically and we unveil the existence
of sharp phase transitions for the weak- and full-recovery thresholds as a
function of the singular values of the matrix $\mathbf{\Phi}$. This is achieved
by providing a rigorous proof of a result first obtained by the replica method
from statistical mechanics. In particular, the information-theoretic transition
to perfect recovery for full-rank matrices appears at $\alpha=1$ (real case)
and $\alpha=2$ (complex case). Secondly, we analyze the performance of the
best-known polynomial time algorithm for this problem -- approximate
message-passing -- establishing the existence of a statistical-to-algorithmic
gap depending, again, on the spectral properties of $\mathbf{\Phi}$. Our work
provides an extensive classification of the statistical and algorithmic
thresholds in high-dimensional phase retrieval for a broad class of random
matrices.
Authors' comments: 12 pages (main text and references), 26 pages of supplementary
material. v2 matches the final version accepted at NeurIPS 2021
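The observation model in the abstract above, $Y_\mu = | \sum_{i=1}^n \Phi_{\mu i} X^{\star}_i/\sqrt{n}|$, can be simulated in a few lines. In this sketch the sensing matrix is i.i.d. Gaussian purely for illustration (the paper covers a much broader class), the setting is real-valued and noiseless, and the sizes `n` and `alpha` are arbitrary choices. It also shows the sign ambiguity inherent to the real case: $\mathbf{X}^{\star}$ and $-\mathbf{X}^{\star}$ produce identical observations.

```python
import numpy as np

def phase_retrieval_observations(phi, x_star):
    """Noiseless observations Y_mu = |sum_i Phi_{mu i} X*_i / sqrt(n)|."""
    n = x_star.shape[0]
    return np.abs(phi @ x_star / np.sqrt(n))

rng = np.random.default_rng(0)
n, alpha = 50, 2.0            # arbitrary illustrative sizes
m = int(alpha * n)            # number of observations, m = alpha * n
phi = rng.standard_normal((m, n))   # assumption: i.i.d. Gaussian sensing matrix
x_star = rng.standard_normal(n)

y = phase_retrieval_observations(phi, x_star)
# Flipping the sign of the signal leaves every observation unchanged.
y_flipped = phase_retrieval_observations(phi, -x_star)
```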
Zikui Cai, Rakib Hyder, M. Salman Asif
Signal recovery from nonlinear measurements involves solving an iterative optimization problem. In this paper, we present a framework to optimize the sensing parameters to improve the quality of the signal recovered by the given iterative method. In particular, we learn illumination patterns to recover signals from coded diffraction patterns using a fixed-cost alternating minimization-based phase retrieval method. Coded diffraction phase retrieval is a physically realistic system in which the signal is first modulated by a sequence of codes before the sensor records its Fourier amplitude. We represent the phase retrieval method as an unrolled network with a fixed number of layers and minimize the recovery error by optimizing over the measurement parameters. Since the number of iterations/layers is fixed, the recovery incurs a fixed cost. We present extensive simulation results on a variety of datasets under different conditions and a comparison with existing methods. Our results demonstrate that the proposed method provides near-perfect reconstruction using patterns learned with a small number of training images. Our proposed method provides significant improvements over existing methods both in terms of accuracy and speed.
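The measurement model and the fixed-iteration solver described above can be sketched in one dimension. This sketch does not learn the illumination patterns; the codes are random and fixed, and the signal, sizes, and iteration count are made-up illustrative values. It shows the forward model (modulate by each code, then record the Fourier amplitude) and a plain alternating minimization loop that alternates between estimating the missing phases and solving a least-squares problem for the signal.

```python
import numpy as np

def measure(x, codes):
    """Coded diffraction: modulate x by each code, record Fourier amplitudes."""
    return np.abs(np.fft.fft(codes * x, axis=-1, norm="ortho"))

def alt_min(y, codes, x0, num_iters=20):
    """Fixed-cost alternating minimization; returns estimate and residual history."""
    x = x0.astype(complex)
    weight = np.sum(np.abs(codes) ** 2, axis=0)   # diagonal of A^H A
    residuals = []
    for _ in range(num_iters):
        z = np.fft.fft(codes * x, axis=-1, norm="ortho")
        residuals.append(np.linalg.norm(np.abs(z) - y))
        # Step 1: keep the current phases, impose the measured amplitudes.
        z = y * np.exp(1j * np.angle(z))
        # Step 2: exact least-squares update of the signal for those phases.
        back = np.conj(codes) * np.fft.ifft(z, axis=-1, norm="ortho")
        x = back.sum(axis=0) / weight
    return x, residuals

rng = np.random.default_rng(0)
n, num_codes = 32, 4                                # arbitrary sizes
codes = rng.uniform(0.1, 1.0, size=(num_codes, n))  # random, not learned
x_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = measure(x_true, codes)

x0 = rng.standard_normal(n) + 0j
x_hat, residuals = alt_min(y, codes, x0)
```

Because the number of iterations is fixed, the loop can be read as an unrolled network with `num_iters` layers, which is the view that makes the codes trainable by gradient-based optimization.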
Limin Chen, Zhiwen Tang, Grace Hui Yang
Interactive Information Retrieval (IIR) and Reinforcement Learning (RL) share
many commonalities, including an agent that learns while interacting, a long-term
and complex goal, and an algorithm that explores and adapts. To successfully
apply RL methods to IIR, one challenge is to obtain sufficient relevance labels
to train the RL agents, which are notoriously sample inefficient.
However, in a text corpus annotated for a given query, it is not the relevant
documents but the irrelevant documents that predominate. This would cause very
unbalanced training experiences for the agent and prevent it from learning any
policy that is effective. Our paper addresses this issue by using domain
randomization to synthesize more relevant documents for the training. Our
experimental results on the Text REtrieval Conference (TREC) Dynamic Domain
(DD) 2017 Track show that the proposed method is able to boost an RL agent's
learning effectiveness by 22\% in dealing with unseen situations.
Authors' comments: Accepted by SIGIR 2020
Ron Estrin, Yifan Sun, Halyun Jeong, Michael Friedlander
We consider the problem of finding a low rank symmetric matrix satisfying a system of linear equations, as appears in phase retrieval. In particular, we solve the gauge dual formulation, but use a fast approximation of the spectral computations to achieve a noisy solution estimate. This estimate is then used as the initialization of an alternating gradient descent scheme over a nonconvex rank-1 matrix factorization formulation. Numerical results on small problems show consistent recovery, with very low computational cost.
Jun Yu, Guochen Xie, Mengyan Li, Xinlong Hao
Retrieval of family members in the wild aims at finding family members of a
given subject in a dataset, which is useful for finding lost children and
analyzing kinship. However, due to the diversity in age, gender, pose and
illumination of the collected data, this task is challenging. To address
this problem, we propose a solution based on a deep Siamese neural network. Our
solution can be divided into two parts: similarity computation and ranking. In
the training procedure, the Siamese network first takes two candidate images as
input and produces two feature vectors. The similarity between the
two vectors is then computed with several fully connected layers. In the inference
procedure, we try another similarity computation method: we drop the subsequent
fully connected layers and directly compute the cosine similarity of
the two feature vectors. After similarity computation, we use a ranking
algorithm to merge the similarity scores that share the same identity and output an
ordered list according to these similarities. To gain further improvement, we
try different combinations of backbones, training methods and similarity
computation methods. Finally, we submit the best combination as our solution, and
our team (ustc-nelslip) obtains a favorable result in Track 3 of the RFIW2020
challenge, finishing as first runner-up, which verifies the effectiveness of our
method. Our code is available at: https://github.com/gniknoil/FG2020-kinship
Authors' comments: 5 pages, 3 figures
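The inference-time procedure described above (cosine similarity between feature vectors, then merging scores that share an identity and sorting) can be sketched as follows. The feature vectors are made-up 2D stand-ins for network outputs, and averaging is an assumed merge rule; the abstract does not say how scores are merged.

```python
import numpy as np
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_identities(probe, gallery):
    """Merge per-image scores by identity (mean, an assumption) and sort descending."""
    scores = defaultdict(list)
    for identity, feat in gallery:
        scores[identity].append(cosine_similarity(probe, feat))
    merged = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Made-up gallery: identity "A" has two images, "B" and "C" one each.
probe = [1.0, 0.0]
gallery = [("A", [1.0, 0.0]), ("A", [1.0, 1.0]),
           ("B", [0.0, 1.0]), ("C", [-1.0, 0.0])]
ranking = rank_identities(probe, gallery)
```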
Jiarui Qin, Weinan Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, Yong Yu
Click-through rate (CTR) prediction plays a key role in modern online
personalization services. In practice, it is necessary to capture users'
drifting interests by modeling sequential user behaviors to build an accurate
CTR prediction model. However, as users accumulate more and more behavioral
data on the platforms, it becomes non-trivial for sequential models to make
use of the whole behavior history of each user. First, directly feeding the
long behavior sequence will make online inference time and system load
infeasible. Second, such long histories contain much noise, which can cause
sequential model learning to fail. Current industrial solutions mainly truncate the
sequences and feed only recent behaviors to the prediction model, which leads
to the problem that sequential patterns such as periodicity or long-term
dependency are missed, because they are embedded not in the most recent
behaviors but in the far back
history. To tackle these issues, in this paper we approach the problem from the data
perspective instead of just designing more sophisticated yet complicated models,
and propose the User Behavior Retrieval for CTR prediction (UBR4CTR) framework. In
UBR4CTR, the most relevant and appropriate user behaviors are first
retrieved from the entire user history sequence using a learnable search
method. These retrieved behaviors are then fed into a deep model to make the
final prediction instead of simply using the most recent ones. It is highly
feasible to deploy UBR4CTR into an industrial model pipeline at low cost.
Experiments on three real-world large-scale datasets demonstrate the
superiority and efficacy of our proposed framework and models.
Authors' comments: SIGIR 2020 industry track
Yang-Ho Ji, HeeJae Jun, Insik Kim, Jongtack Kim, Youngjoon Kim, Byungsoo Ko, Hyong-Keun Kook, Jingeun Lee et al.
In this paper, we propose an effective pipeline for a clothes retrieval system
that is robust on large-scale real-world fashion data. Our proposed
method consists of three components: detection, retrieval, and post-processing.
We first run a detection task so that retrieval can focus precisely on the target clothes,
then retrieve the corresponding items with a metric learning-based model. To
improve the retrieval robustness against noise and misleading bounding boxes,
we apply post-processing methods such as weighted boxes fusion and feature
concatenation. With the proposed methodology, we achieved 2nd place in the
DeepFashion2 Clothes Retrieval 2020 challenge.
Authors' comments: 2nd place solution on DeepFashion2 clothes retrieval challenge in
CVPR2020 workshop (CVFAD)
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis et al.
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when fine-tuned on
downstream NLP tasks. However, their ability to access and precisely manipulate
knowledge is still limited, and hence on knowledge-intensive tasks, their
performance lags behind task-specific architectures. Additionally, providing
provenance for their decisions and updating their world knowledge remain open
research problems. Pre-trained models with a differentiable access mechanism to
explicit non-parametric memory can overcome this issue, but have so far been
only investigated for extractive downstream tasks. We explore a general-purpose
fine-tuning recipe for retrieval-augmented generation (RAG) -- models which
combine pre-trained parametric and non-parametric memory for language
generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector index
of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG
formulations: one conditions on the same retrieved passages across the
whole generated sequence, while the other can use different passages per token. We
fine-tune and evaluate our models on a wide range of knowledge-intensive NLP
tasks and set the state-of-the-art on three open domain QA tasks, outperforming
parametric seq2seq models and task-specific retrieve-and-extract architectures.
For language generation tasks, we find that RAG models generate more specific,
diverse and factual language than a state-of-the-art parametric-only seq2seq
baseline.
Authors' comments: Accepted at NeurIPS 2020
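The difference between the two RAG formulations in the abstract above comes down to where the sum over retrieved passages sits. The sketch below uses made-up toy probabilities: `doc_probs[z]` stands for the retriever's probability of passage z, and `token_probs[z, i]` for the generator's probability of the i-th target token given passage z. Conditioning on one passage for the whole sequence sums over passages outside the product over tokens; allowing a different passage per token sums inside it.

```python
import numpy as np

def rag_sequence_prob(doc_probs, token_probs):
    """Same passage for the whole sequence: sum_z p(z) * prod_i p(y_i | z)."""
    return float(np.sum(doc_probs * np.prod(token_probs, axis=1)))

def rag_token_prob(doc_probs, token_probs):
    """Passage may differ per token: prod_i sum_z p(z) * p(y_i | z)."""
    return float(np.prod(doc_probs @ token_probs))

# Toy numbers: two passages, two target tokens. Passage 0 supports token 0,
# passage 1 supports token 1.
doc_probs = np.array([0.5, 0.5])
token_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

p_seq = rag_sequence_prob(doc_probs, token_probs)   # 0.09
p_tok = rag_token_prob(doc_probs, token_probs)      # 0.25
```

In this toy case the per-token variant assigns higher probability because each token can draw on the passage that supports it.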
J. Emmanuel Johnson, Valero Laparra, Gustau Camps-Valls
Gaussian processes (GPs) are a class of kernel methods that have been shown to be very useful in geoscience and remote sensing applications for parameter retrieval, model inversion, and emulation. They are widely used because they are simple, flexible, and provide accurate estimates. GPs are based on a Bayesian statistical framework which provides a posterior probability function for each estimation. Therefore, besides the usual prediction (given in this case by the mean function), GPs come equipped with the possibility to obtain a predictive variance (i.e., error bars, confidence intervals) for each prediction. Unfortunately, the GP formulation usually assumes that there is no noise in the inputs, only in the observations. However, this is often not the case in earth observation problems, where an accurate assessment of the measuring instrument error is typically available and where there is huge interest in characterizing the error propagation through the processing pipeline. In this letter, we demonstrate how one can account for input noise estimates using a GP model formulation which propagates the error terms using the derivative of the predictive mean function. We analyze the resulting predictive variance term and show how it more accurately represents the model error in a temperature prediction problem from infrared sounding data.
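One way to read "propagates the error terms using the derivative of the predictive mean function" is a first-order (Taylor) correction, in which the input covariance is pushed through the gradient of the mean and added to the usual predictive variance. The sketch below implements that reading for an RBF kernel with unit signal variance. The exact expression, the kernel, the hyperparameters, and the sine-wave data are all assumptions for illustration, not details confirmed by the abstract.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """RBF kernel matrix between rows of a (Na, d) and b (Nb, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_predict_with_input_noise(X, y, x_star, input_cov, noise=1e-2, length=1.0):
    """Return mean, standard variance, input-noise-corrected variance, mean gradient."""
    K = rbf(X, X, length) + noise * np.eye(len(X))
    k_star = rbf(x_star[None, :], X, length)[0]
    alpha = np.linalg.solve(K, y)
    mean = float(k_star @ alpha)
    var = float(1.0 - k_star @ np.linalg.solve(K, k_star))
    # Gradient of the predictive mean with respect to the test input.
    grad = ((-(x_star - X) / length**2) * (k_star * alpha)[:, None]).sum(axis=0)
    # Assumed first-order correction: grad^T Sigma_x grad.
    var_corrected = var + float(grad @ input_cov @ grad)
    return mean, var, var_corrected, grad

# Made-up 1D regression problem.
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X[:, 0])
x_star = np.array([0.3])
input_cov = np.array([[0.05]])   # assumed known input (instrument) noise variance

mean, var, var_corrected, grad = gp_predict_with_input_noise(X, y, x_star, input_cov)
```

The correction is largest where the predictive mean is steep, which matches the intuition that input errors matter most where the function changes quickly.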
José Luis Romero
We show that a real-valued function $f$ in the shift-invariant space
generated by a totally positive function of Gaussian type is uniquely
determined, up to a sign, by its absolute values $\{|f(\lambda)|: \lambda \in
\Lambda \}$ on any set $\Lambda \subseteq \mathbb{R}$ with lower Beurling
density $D^{-}(\Lambda)>2$.
Authors' comments: 7 pages
Brigit Schroeder, Subarna Tripathi
A structured query can capture the complexity of object interactions (e.g.
'woman rides motorcycle') unlike single objects (e.g. 'woman' or 'motorcycle').
Retrieval using structured queries therefore is much more useful than single
object retrieval, but a much more challenging problem. In this paper we present
a method which uses scene graph embeddings as the basis for an approach to
image retrieval. We examine how visual relationships, derived from scene
graphs, can be used as structured queries. The visual relationships are
directed subgraphs of the scene graph with a subject and object as nodes
connected by a predicate relationship. Notably, we are able to achieve high
recall even on low to medium frequency objects found in the long-tailed
COCO-Stuff dataset, and find that adding a visual relationship-inspired loss
boosts our recall by 10% in the best case.
Authors' comments: Accepted to Diagram Image Retrieval and Analysis (DIRA) Workshop at
CVPR 2020
Sebastian Hofstätter, Hamed Zamani, Bhaskar Mitra, Nick Craswell, Allan Hanbury
Neural networks, particularly Transformer-based architectures, have achieved
significant performance improvements on several retrieval benchmarks. When the
items being retrieved are documents, the time and memory cost of employing
Transformers over a full sequence of document terms can be prohibitive. A
popular strategy involves considering only the first n terms of the document.
This can, however, result in a biased system that under-retrieves longer
documents. In this work, we propose a local self-attention which considers a
moving window over the document terms and for each term attends only to other
terms in the same window. This local attention incurs a fraction of the compute
and memory cost of attention over the whole document. The windowed approach
also leads to more compact packing of padded documents in minibatches resulting
in additional savings. We also employ a learned saturation function and a
two-staged pooling strategy to identify relevant regions of the document. The
Transformer-Kernel pooling model with these changes can efficiently elicit
relevance information from documents with thousands of tokens. We benchmark our
proposed modifications on the document ranking task from the TREC 2019 Deep
Learning track and observe significant improvements in retrieval quality as
well as increased retrieval of longer documents at a moderate increase in compute
and memory costs.
Authors' comments: Accepted at SIGIR 2020 (short paper)
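The local self-attention idea above can be sketched with plain NumPy. This sketch makes simplifying assumptions that are not from the paper: windows are non-overlapping, there is a single head, and no learned query/key/value projections are used. It shows the core property that a term's output depends only on terms in its own window, so the score matrix per window is `window x window` instead of `T x T` for the whole document.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(x, window):
    """x: (T, d) term vectors. Each term attends only within its own window."""
    T, d = x.shape
    out = np.empty_like(x)
    for start in range(0, T, window):
        chunk = x[start:start + window]
        scores = chunk @ chunk.T / np.sqrt(d)   # at most (window, window)
        out[start:start + window] = softmax(scores) @ chunk
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))      # 8 terms, 4-dimensional vectors (toy sizes)
out = local_self_attention(x, window=4)

# Changing a term in the second window leaves the first window's output untouched.
x_changed = x.copy()
x_changed[6] += 10.0
out_changed = local_self_attention(x_changed, window=4)
```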
Javed Qadrud-Din, Ashraf Bah Rabiou, Ryan Walker, Ravi Soni, Martin Gajek, Gabriel Pack, Akhil Rangaraj
Most approaches for similar text retrieval and ranking with long natural
language queries rely at some level on queries and responses having words in
common with each other. Recent applications of transformer-based neural
language models to text retrieval and ranking problems have been very
promising, but still involve a two-step process in which result candidates are
first obtained through bag-of-words-based approaches, and then reranked by a
neural transformer. In this paper, we introduce novel approaches for
effectively applying neural transformer models to similar text retrieval and
ranking without an initial bag-of-words-based step. By eliminating the
bag-of-words-based step, our approach is able to accurately retrieve and rank
results even when they have no non-stopwords in common with the query. We
accomplish this by using bidirectional encoder representations from
transformers (BERT) to create vectorized representations of sentence-length
texts, along with a vector nearest neighbor search index. We demonstrate both
supervised and unsupervised means of using BERT to accomplish this task.
Authors' comments: 5 pages, 2 figures
Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff et al.
TREC-COVID is a community evaluation designed to build a test collection that
captures the information needs of biomedical researchers using the scientific
literature during a pandemic. One of the key characteristics of pandemic search
is the accelerated rate of change: the topics of interest evolve as the
pandemic progresses and the scientific literature in the area explodes. The
COVID-19 pandemic provides an opportunity to capture this progression as it
happens. TREC-COVID, in creating a test collection around COVID-19 literature,
is building infrastructure to support new research and technologies in pandemic
search.
Authors' comments: 10 pages, 5 figures. TREC-COVID web site:
http://ir.nist.gov/covidSubmit/ Will also appear in June 2020 issue of ACM
SIGIR Forum
Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
Our objective in this work is long range understanding of the narrative
structure of movies. Instead of considering the entire movie, we propose to
learn from the `key scenes' of the movie, providing a condensed look at the
full storyline. To this end, we make the following three contributions: (i) We
create the Condensed Movies Dataset (CMD) consisting of the key scenes from
over 3K movies: each key scene is accompanied by a high level semantic
description of the scene, character face-tracks, and metadata about the movie.
The dataset is scalable, obtained automatically from YouTube, and is freely
available for anybody to download and use. It is also an order of magnitude
larger than existing movie datasets in the number of movies; (ii) We provide a
deep network baseline for text-to-video retrieval on our dataset, combining
character, speech and visual cues into a single video embedding; and finally
(iii) We demonstrate how the addition of context from other video clips
improves retrieval performance.
Authors' comments: Appears in: Asian Conference on Computer Vision 2020 (ACCV 2020) -
Oral presentation
Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, Noah Constant
Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus (Ahmad et al., 2019). This paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets. We provide the first systematic retrieval-based evaluation over these datasets using two supervised neural models, based on fine-tuning BERT and USE-QA models respectively, as well as a surprisingly strong information retrieval baseline, BM25. Five of these tasks contain both training and test data, while three contain test data only. Performance on the five tasks with training data shows that while a general model covering all domains is achievable, the best performance is often obtained by training exclusively on in-domain data.
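For readers unfamiliar with the BM25 baseline mentioned above, here is a minimal sketch of one common variant of the scoring function. The whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF form, and the three toy sentences are illustrative assumptions; the abstract does not specify the configuration used in the evaluation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy candidate sentences (made up for illustration).
docs = ["the cat sat on the mat",
        "dogs chase cats in the park",
        "phase retrieval recovers a signal from magnitudes"]
scores = bm25_scores("phase retrieval", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

BM25 only rewards exact term overlap, which is why candidates sharing no words with the question score zero here.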