Xiang Ling, Lingfei Wu, Saizhuo Wang, Gaoning Pan, Tengfei Ma, Fangli Xu, Alex X. Liu, Chunming Wu et al.
Code retrieval aims to find, in a large corpus of source code repositories,
the code snippet that best matches a natural language query.
Recent work mainly uses natural language processing techniques to process both
query texts (i.e., human natural language) and code snippets (i.e., machine
programming language), but neglects the deep structured features of query
texts and source code, both of which contain rich semantic information. In
this paper, we propose an end-to-end deep graph matching and searching (DGMS)
model based on graph neural networks for the task of semantic code retrieval.
To this end, we first represent both natural language query texts and
programming language code snippets as unified graph-structured data, and
then use the proposed graph matching and searching model to retrieve the best
matching code snippet. In particular, DGMS not only captures more structural
information for individual query texts or code snippets but also learns the
fine-grained similarity between them by cross-attention based semantic matching
operations. We evaluate the proposed DGMS model on two public code retrieval
datasets with two representative programming languages (i.e., Java and Python).
Experimental results demonstrate that DGMS outperforms
state-of-the-art baseline models by a large margin on both datasets. Moreover,
our extensive ablation studies systematically investigate and illustrate the
impact of each part of DGMS.
Authors' comments: Accepted by ACM Transactions on Knowledge Discovery from Data (ACM
TKDD)
Akari Asai, Eunsol Choi
Recent pretrained language models "solved" many reading comprehension
benchmarks, where questions are written with access to the evidence document.
However, datasets containing information-seeking queries where evidence
documents are provided after the queries are written independently remain
challenging. We analyze why answering information-seeking queries is more
challenging and where their prevalent unanswerabilities arise, on Natural
Questions and TyDi QA. Our controlled experiments suggest two headrooms --
paragraph selection and answerability prediction, i.e. whether the paired
evidence document contains the answer to the query or not. When provided with a
gold paragraph and knowing when to abstain from answering, existing models
easily outperform a human annotator. However, predicting answerability itself
remains challenging. We manually annotate 800 unanswerable examples across six
languages on what makes them challenging to answer. With this new data, we
conduct per-category answerability prediction, revealing issues in the current
dataset collection as well as task formulation. Together, our study points to
avenues for future research in information-seeking question answering, both for
dataset creation and model development.
Authors' comments: Published as a conference paper at ACL 2021 (long). Our code and
annotated data are publicly available at
https://github.com/AkariAsai/unanswerable_qa
Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, Hannaneh Hajishirzi
Multilingual question answering tasks typically assume answers exist in the
same language as the question. Yet in practice, many languages face both
information scarcity -- where languages have few reference articles -- and
information asymmetry -- where questions reference concepts from other
cultures. This work extends open-retrieval question answering to a
cross-lingual setting enabling questions from one language to be answered via
answer content from another language. We construct a large-scale dataset built
on questions from TyDi QA lacking same-language answers. Our task formulation,
called Cross-lingual Open Retrieval Question Answering (XOR QA), includes 40k
information-seeking questions from across 7 diverse non-English languages.
Based on this dataset, we introduce three new tasks that involve cross-lingual
document retrieval using multi-lingual and English resources. We establish
baselines with state-of-the-art machine translation systems and cross-lingual
pretrained models. Experimental results suggest that XOR QA is a challenging
task that will facilitate the development of novel techniques for multilingual
question answering. Our data and code are available at
https://nlp.cs.washington.edu/xorqa.
Authors' comments: Published as a conference paper at NAACL-HLT 2021 (long)
Mathias Seuret, Anguelos Nicolaou, Dominique Stutzmann, Andreas Maier, Vincent Christlein
This competition continues a line of competitions for writer and style
analysis of historical document images. In particular, we investigate the
performance of large-scale retrieval of historical document fragments in terms
of style and writer identification. The analysis of historic fragments is a
difficult challenge commonly solved by trained humanists. In comparison to
previous competitions, we make the results more meaningful by addressing the
issue of sample granularity and moving from writer to page fragment retrieval.
The two approaches, style and author identification, provide information on
what kind of information each method makes better use of and indirectly
contribute to the interpretability of the participating methods. Therefore, we
created a large dataset consisting of more than 120 000 fragments. Although
most teams submitted methods based on convolutional neural networks, the
winning entry achieves an mAP below 40%.
Authors' comments: ICFHR 2020
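For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of how mean average precision (mAP) is commonly computed for a retrieval task. The function names and the toy relevance lists are illustrative assumptions and are not taken from the competition's evaluation code. The sketch also assumes that every relevant item appears somewhere in the ranked list.

```python
from typing import List


def average_precision(relevance: List[bool]) -> float:
    """Average precision of one ranked result list.

    relevance[i] is True when the item at rank i + 1 is relevant to the query.
    Assumes every relevant item appears somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[List[bool]]) -> float:
    """Mean of the per-query average precisions."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical toy example with two queries.
toy_rankings = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = (1/2) / 1
]
toy_map = mean_average_precision(toy_rankings)
```

In this toy case the two per-query values are 5/6 and 1/2, so the mean is 2/3.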
Markus J. Hofmann, Lara Müller, Andre Rölke, Ralph Radach, Chris Biemann
The corpus from which a predictive language model is trained can be
considered the experience of a semantic system. We recorded everyday reading of
two participants for two months on a tablet, generating individual corpus
samples of 300/500K tokens. Then we trained word2vec models from individual
corpora and a 70 million-sentence newspaper corpus to obtain individual and
norm-based long-term memory structure. To test whether individual corpora can
make better predictions for a cognitive task of long-term memory retrieval, we
generated stimulus materials consisting of 134 sentences with uncorrelated
individual and norm-based word probabilities. For the subsequent eye tracking
study 1-2 months later, our regression analyses revealed that individual, but
not norm-corpus-based word probabilities can account for first-fixation
duration and first-pass gaze duration. Word length additionally affected gaze
duration and total viewing duration. The results suggest that corpora
representative of an individual's long-term memory structure can better explain
reading performance than a norm corpus, and that recently acquired information
is lexically accessed rapidly.
Authors' comments: Proceedings of the 6th workshop on Cognitive Aspects of the Lexicon
(CogALex-VI), Barcelona, Spain, December 12, 2020; accepted manuscript; 11
pages, 2 figures, 4 Tables
Fan Wu, Patrick Rebeschini
We analyze continuous-time mirror descent applied to sparse phase retrieval, which is the problem of recovering sparse signals from a set of magnitude-only measurements. We apply mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and square measurements. We provide a convergence analysis of the algorithm in this non-convex setting and prove that, with the hypentropy mirror map, mirror descent recovers any $k$-sparse vector $\mathbf{x}^\star\in\mathbb{R}^n$ with minimum (in modulus) non-zero entry on the order of $\| \mathbf{x}^\star \|_2/\sqrt{k}$ from $k^2$ Gaussian measurements, modulo logarithmic terms. This yields a simple algorithm which, unlike most existing approaches to sparse phase retrieval, adapts to the sparsity level, without including thresholding steps or adding regularization terms. Our results also provide a principled theoretical understanding for Hadamard Wirtinger flow [58], as Euclidean gradient descent applied to the empirical risk problem with Hadamard parametrization can be recovered as a first-order approximation to mirror descent in discrete time.
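As a rough illustration of the ingredients named in the abstract above, the sketch below writes out the square-loss empirical risk on squared measurements, its gradient, and one discrete-time mirror descent step with the hypentropy mirror map. It relies on the standard fact that the gradient of the hypentropy map is arcsinh(x / beta) coordinate-wise. The problem sizes, the values of `beta` and `step`, the starting point `x0`, and all function names are assumptions made for illustration. The abstract analyzes the continuous-time dynamics, and no claim is made here about initialization, step-size choice, or convergence.

```python
import math
import random
from typing import List

Vector = List[float]


def empirical_risk(x: Vector, A: List[Vector], y: Vector) -> float:
    """F(x) = 1/(4m) * sum_j ((a_j . x)^2 - y_j)^2."""
    total = 0.0
    for a, y_j in zip(A, y):
        inner = sum(a_i * x_i for a_i, x_i in zip(a, x))
        total += (inner * inner - y_j) ** 2
    return total / (4 * len(A))


def risk_gradient(x: Vector, A: List[Vector], y: Vector) -> Vector:
    """Gradient of F: (1/m) * sum_j ((a_j . x)^2 - y_j) * (a_j . x) * a_j."""
    m = len(A)
    grad = [0.0] * len(x)
    for a, y_j in zip(A, y):
        inner = sum(a_i * x_i for a_i, x_i in zip(a, x))
        coeff = (inner * inner - y_j) * inner / m
        for i, a_i in enumerate(a):
            grad[i] += coeff * a_i
    return grad


def hypentropy_mirror_step(x: Vector, grad: Vector, beta: float, step: float) -> Vector:
    """One mirror descent step: x_new = beta * sinh(arcsinh(x / beta) - step * grad)."""
    return [
        beta * math.sinh(math.asinh(x_i / beta) - step * g_i)
        for x_i, g_i in zip(x, grad)
    ]


# Hypothetical toy instance: a 2-sparse signal in R^5 with Gaussian measurements.
random.seed(0)
n, m = 5, 40
x_star = [1.0, 0.0, 0.0, -1.0, 0.0]
A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
y = [sum(a_i * x_i for a_i, x_i in zip(a, x_star)) ** 2 for a in A]

x0 = [0.5, 0.1, -0.2, 0.3, 0.0]  # arbitrary starting point (assumption)
x1 = hypentropy_mirror_step(x0, risk_gradient(x0, A, y), beta=1e-3, step=0.01)
```

The update acts multiplicatively on coordinates that are large relative to `beta` and roughly additively on coordinates near zero, which is one informal way to read the abstract's remark that no thresholding or regularization step is added.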
Hyunseung Chung, Woo-Jeoung Nam, Seong-Whan Lee
Remote sensing image retrieval (RSIR) is the process of ranking database
images depending on the degree of similarity compared to the query image. As
the complexity of RSIR increases due to the diversity in shooting range, angle,
and location of remote sensors, there is an increasing demand for methods to
address these issues and improve retrieval performance. In this work, we
introduce a novel method for retrieving aerial images by merging group
convolution with an attention mechanism and metric learning, resulting in
robustness to rotational variations. For refinement and emphasis on important
features, we applied channel attention in each group convolution stage. By
utilizing the characteristics of group convolution and channel-wise attention,
it is possible to acknowledge the equality among rotated but identically
located images. The training procedure has two main steps: (i) training the
network with the Aerial Image Dataset (AID) for classification, and (ii)
fine-tuning the network with a triplet loss for retrieval with the Google Earth
South Korea and NWPU-RESISC45 datasets. Results show that the proposed method
outperforms other state-of-the-art retrieval methods in both rotated and original
environments. Furthermore, we utilize class activation maps (CAM) to visualize
the distinct difference of main features between our method and baseline,
resulting in better adaptability in rotated environments.
Authors' comments: 8 pages, 5 figures, Accepted in ICPR 2020
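The fine-tuning step in the abstract above uses a triplet loss. The sketch below shows the common hinge form of that loss on plain Python vectors. The Euclidean distance, the margin value, and the toy embeddings are assumptions for illustration; the abstract does not state which distance or margin the authors use.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # negative is far away
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])  # negative as close as positive
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate the margin contribute to training.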
Jie Zhao, Huan Sun
Code retrieval is a key task aiming to match natural and programming
languages. In this work, we propose adversarial learning for code retrieval
that is regularized by question-description relevance. First, we adapt a simple
adversarial learning technique to generate difficult code snippets given the
input question, which can help the learning of code retrieval that faces
bi-modal and data-scarce challenges. Second, we propose to leverage
question-description relevance to regularize adversarial learning, such that a
generated code snippet should contribute more to the code retrieval training
loss, only if its paired natural language description is predicted to be less
relevant to the user-given question. Experiments on large-scale code retrieval
datasets of two programming languages show that our adversarial learning method
is able to improve the performance of state-of-the-art models. Moreover, using
an additional duplicate question prediction model to regularize adversarial
learning further improves the performance, and this is more effective than
using the duplicated questions in strong multi-task learning baselines.
Authors' comments: Accepted to Findings of EMNLP 2020. 11 pages, 2 figures
Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris
In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. For the robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning based on a Convolutional Neural Network (CNN) trained on a large-scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN that captures the temporal structures existing within its content. We train our network following a triplet generation process and optimizing the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust to the retrieval of audio duplicates generated with speed transformations.
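The similarity-matrix step described in the abstract above can be sketched as follows. Each video is represented here by a short list of per-segment descriptor vectors, and entry (i, j) of the matrix compares segment i of one video with segment j of the other. Cosine similarity, the descriptor values, and the function names are assumptions for illustration; the abstract only says that the matrix is derived from the pairwise similarity of the descriptors.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(
    descriptors_a: List[Sequence[float]],
    descriptors_b: List[Sequence[float]],
) -> List[List[float]]:
    """Pairwise similarity between the segment descriptors of two videos."""
    return [[cosine_similarity(u, v) for v in descriptors_b] for u in descriptors_a]


# Hypothetical descriptors: video A has 2 segments, video B has 3.
video_a = [[1.0, 0.0], [0.0, 1.0]]
video_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
sim = similarity_matrix(video_a, video_b)  # 2 x 3 matrix
```

A matrix of this kind is what a downstream network would take as input to look for temporal patterns such as diagonal runs of high similarity.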
Peter G. Casazza, Janet C. Tremain
Edidin [3] proved a fundamental result in phase retrieval: Theorem: A family of orthogonal projections $\{P_i\}_{i=1}^m$ does phase retrieval in $\mathbb{R}^n$ if and only if for every $0\not= x\in \mathbb{R}^n$, the family $\{P_ix\}_{i=1}^m$ spans $\mathbb{R}^n$. The proof of this result relies on Algebraic Geometry and so is inaccessible to many people in the field. We will give an elementary proof of this result without Algebraic Geometry. We will also solve the complex version of this result by showing that the "if" part fails and the "only if" part holds in $\mathbb{C}^n$. Finally, we will show that these techniques can be used to verify two classifications of norm retrieval.
Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, Graham Neubig
Language models (LMs) have proven surprisingly successful at capturing
factual knowledge by completing cloze-style fill-in-the-blank questions such as
"Punta Cana is located in _." However, while knowledge is both written and
queried in many languages, studies on LMs' factual representation ability have
almost invariably been performed on English. To assess factual knowledge
retrieval in LMs in different languages, we create a multilingual benchmark of
cloze-style probes for 23 typologically diverse languages. To properly handle
language variations, we expand probing methods from single- to multi-word
entities, and develop several decoding algorithms to generate multi-token
predictions. Extensive experimental results provide insights about how well (or
poorly) current state-of-the-art LMs perform at this task in languages with
more or fewer available resources. We further propose a code-switching-based
method to improve the ability of multilingual LMs to access knowledge, and
verify its effectiveness on several benchmark languages. Benchmark data and
code have been released at https://x-factr.github.io.
Authors' comments: EMNLP 2020
Bolin Wei, Yongmin Li, Ge Li, Xin Xia, Zhi Jin
Code comment generation, which aims to automatically generate natural language
descriptions for source code, is a crucial task in the field of automatic
software development. Traditional comment generation methods use
manually-crafted templates or information retrieval (IR) techniques to generate
summaries for source code. In recent years, neural network-based methods, which
leverage the acclaimed encoder-decoder deep learning framework to learn comment
generation patterns from a large-scale parallel code corpus, have achieved
impressive results. However, these emerging methods only take code-related
information as input. Software reuse is common in the process of software
development, meaning that comments of similar code snippets are helpful for
comment generation. Inspired by the IR-based and template-based approaches, in
this paper, we propose a neural comment generation approach where we use the
existing comments of similar code snippets as exemplars to guide comment
generation. Specifically, given a piece of code, we first use an IR technique
to retrieve a similar code snippet and treat its comment as an exemplar. Then
we design a novel seq2seq neural network that takes the given code, its AST,
its similar code, and its exemplar as input, and leverages the information from
the exemplar to assist in the target comment generation based on the semantic
similarity between the source code and the similar code. We evaluate our
approach on a large-scale Java corpus, which contains about 2M samples, and
experimental results demonstrate that our model outperforms the
state-of-the-art methods by a substantial margin.
Authors' comments: to be published in the 35th IEEE/ACM International Conference on
Automated Software Engineering (ASE 2020)
Xiao Kang, Xingbo Liu, Xiushan Nie, Yilong Yin
With the development of medical imaging technology and machine learning, computer-assisted diagnosis, which can provide a useful reference to pathologists, attracts extensive research interest. The exponential growth of medical images and the lack of interpretability of traditional classification models have hindered the applications of computer-assisted diagnosis. To address these issues, we propose a novel method for Learning Binary Semantic Embedding (LBSE). Based on the efficient and effective embedding, classification and retrieval are performed to provide interpretable computer-assisted diagnosis for histology images. Furthermore, double supervision, bit uncorrelation and balance constraint, asymmetric strategy and discrete optimization are seamlessly integrated in the proposed method for learning binary embedding. Experiments conducted on three benchmark datasets validate the superiority of LBSE under various scenarios.
Pengfei Fang, Pan Ji, Jieming Zhou, Lars Petersson, Mehrtash Harandi
Full attention, which generates an attention value per element of the input
feature maps, has been successfully demonstrated to be beneficial in visual
tasks. In this work, we propose a fully attentional network, termed
\textit{channel recurrent attention network}, for the task of video pedestrian
retrieval. The main attention unit, \textit{channel recurrent attention},
identifies attention maps at the frame level by jointly leveraging spatial and
channel patterns via a recurrent neural network. This channel recurrent
attention is designed to build a global receptive field by recurrently
receiving and learning the spatial vectors. Then, a \textit{set aggregation}
cell is employed to generate a compact video representation. Experimental
results demonstrate the superior performance of the proposed deep
network, outperforming current state-of-the-art results across standard video
person retrieval benchmarks, and a thorough ablation study shows the
effectiveness of the proposed units.
Authors' comments: To appear in ACCV 2020
Rima Alaifari, Matthias Wellershoff
We consider the recovery of square-integrable signals from discrete,
equidistant samples of their Gabor transform magnitude and show that, in
general, signals cannot be recovered from such samples. In particular, we show
that for any lattice, one can construct functions in $L^2(\mathbb{R})$ which do
not agree up to global phase but whose Gabor transform magnitudes sampled on
the lattice agree. These functions have good concentration in both time and
frequency and can be constructed to be real-valued for rectangular lattices.
Authors' comments: 6 pages, 1 figure; fixed a minor typo
Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang et al.
Term-based sparse representations dominate first-stage text retrieval in industrial applications, due to their advantages in efficiency, interpretability, and exact term matching. In this paper, we study the problem of transferring the deep knowledge of the pre-trained language model (PLM) to term-based sparse representations, aiming to improve the representation capacity of the bag-of-words (BoW) method for semantic-level matching, while still keeping its advantages. Specifically, we propose a novel framework, SparTerm, to directly learn sparse text representations in the full vocabulary space. The proposed SparTerm comprises an importance predictor to predict the importance of each term in the vocabulary, and a gating controller to control the term activation. These two modules cooperatively ensure the sparsity and flexibility of the final text representation, which unifies term weighting and expansion in the same framework. Evaluated on the MSMARCO dataset, SparTerm significantly outperforms traditional sparse methods and achieves state-of-the-art ranking performance among all the PLM-based sparse models.
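One way to picture how an importance predictor and a gating controller could combine into a sparse representation is sketched below. This is an assumption-laden illustration, not the SparTerm implementation: the abstract does not say how the two modules are combined, so the sketch simply keeps a term's importance weight when its gate is open. The scores, gates, vocabulary, and function names are all hypothetical, and in the actual framework both modules would be learned from a pre-trained language model.

```python
from typing import Dict


def sparse_representation(
    importance: Dict[str, float],
    gate: Dict[str, int],
) -> Dict[str, float]:
    """Keep a term's importance weight only where its binary gate is open (1)."""
    return {
        term: weight
        for term, weight in importance.items()
        if gate.get(term, 0) == 1 and weight > 0.0
    }


def sparse_dot(query: Dict[str, float], passage: Dict[str, float]) -> float:
    """Relevance score as a dot product over the terms shared by both vectors."""
    return sum(w * passage[t] for t, w in query.items() if t in passage)


# Hypothetical per-term outputs for a passage about apples. The term "fruit"
# stands in for an expansion term that need not occur in the passage text.
importance = {"apple": 1.2, "fruit": 0.8, "car": 0.3, "the": 0.05}
gate = {"apple": 1, "fruit": 1, "car": 0, "the": 0}

passage_vec = sparse_representation(importance, gate)
score = sparse_dot({"fruit": 1.0, "juice": 0.5}, passage_vec)
```

Because gated-off terms are dropped entirely, the result stays sparse enough for an inverted index, while gated-on expansion terms let a query match a passage on terms the passage text may not contain.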
Ameet Deshpande, Mitesh M. Khapra
Recent advances in Generative Adversarial Networks (GANs) have resulted in their widespread application to multiple domains. A recent model, IRGAN, applies this framework to Information Retrieval (IR) and has gained significant attention over the last few years. In this focused work, we critically analyze multiple components of IRGAN, while providing experimental and theoretical evidence of some of its shortcomings. Specifically, we identify issues with the constant baseline term in the policy gradient optimization and show that the generator harms IRGAN's performance. Motivated by our findings, we propose two models influenced by self-contrastive estimation and co-training which outperform IRGAN on two out of the three tasks considered.
Theban Stanley, Nihar Vanjara, Yanxin Pan, Ekaterina Pirogova, Swagata Chakraborty, Abon Chaudhuri
We present a similar image retrieval (SIR) platform that is used to quickly
discover visually similar products in a catalog of millions. Given the size,
diversity, and dynamism of our catalog, product search poses many challenges.
It can be addressed by building supervised models to tag product images
with labels representing themes and later retrieving them by labels. This
approach suffices for common and perennial themes like "white shirt" or
"lifestyle image of TV". It does not work for new themes such as
"e-cigarettes", hard-to-define ones such as "image with a promotional badge",
or the ones with short relevance span such as "Halloween costumes". SIR is
ideal for such cases because it allows us to search by an example, not a
pre-defined theme. We describe the steps - embedding computation, encoding, and
indexing - that power the approximate nearest neighbor search back-end. We also
highlight two applications of SIR. The first one is related to the detection of
products with various types of potentially objectionable themes. This
application is run with a sense of urgency, hence the typical time frame to
train and bootstrap a model is not permitted. Also, these themes are often
short-lived based on current trends, hence spending resources to build a
lasting model is not justified. The second application is a variant item
detection system where SIR helps discover visual variants that are hard to find
through text search. We analyze the performance of SIR in the context of these
applications.
Authors' comments: Accepted in 13th International Conference on Similarity Search and
Applications, SISAP 2020
Kumiko Hori, Steven M. Tobias, Robert J. Teed
Dynamic mode decomposition (DMD) is utilised to identify the intrinsic
signals arising from planetary interiors. Focusing on axisymmetric
quasi-geostrophic magnetohydrodynamic (MHD) waves, called torsional Alfvén
waves (TW), we examine the utility of DMD in two types of MHD direct numerical
simulations: Boussinesq magnetoconvection and anelastic convection-driven
dynamos in rapidly rotating spherical shells, which model the dynamics in
Earth's core and in Jupiter, respectively. We demonstrate that DMD is capable
of distinguishing internal modes and boundary/interface-related modes from the
timeseries of the internal velocity. Those internal modes may be realised as
free TW, in terms of eigenvalues and eigenfunctions of their normal mode
solutions. Meanwhile it turns out that, in order to account for the details,
the global TW eigenvalue problems in spherical shells need to be further
addressed.
Authors' comments: 7 pages, 11 figures, Proceedings of the Japan Society of Fluid
Mechanics Annual Meeting 2020, Virtual, 18-20 September 2020
Wenhan Xiong, Xiang Lorraine Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Wen-tau Yih et al.
We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER. Contrary to previous work, our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers, and can be applied to any unstructured text corpus. Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time.