Harsh Kohli
Large Scale Question-Answering systems today are widely used in downstream
applications such as chatbots and conversational dialogue agents. Typically,
such systems consist of an Answer Passage retrieval layer coupled with Machine
Comprehension models trained on natural language query-passage pairs. Recent
studies have explored Question Answering over structured data sources such as
web-tables and relational databases. However, architectures such as Seq2SQL
assume that the correct table is known a priori and is input to the model along
with the free-text question. Our proposed method, analogous to a passage retrieval model
in traditional Question-Answering systems, describes an architecture to discern
the correct table pertaining to a given query from amongst a large pool of
candidate tables.
Authors' comments: 11 pages, 3 figures, 2 tables
Xi Shen, Alexei A. Efros, Armand Joulin, Mathieu Aubry
The goal of this work is to efficiently identify visually similar patterns in
images, e.g. identifying an artwork detail copied between an engraving and an
oil painting, or recognizing parts of a night-time photograph visible in its
daytime counterpart. Lack of training data is a key challenge for this
co-segmentation task. We present a simple yet surprisingly effective approach
to overcome this difficulty: we generate synthetic training pairs by selecting
segments in an image and copy-pasting them into another image. We then learn to
predict the repeated region masks. We find that it is crucial to predict the
correspondences as an auxiliary task and to use Poisson blending and style
transfer on the training pairs to generalize to real data. We analyse results
with two deep architectures relevant to our joint image analysis task: a
transformer-based architecture and Sparse Nc-Net, a recent network designed to
predict coarse correspondences using 4D convolutions. We show our approach
provides clear improvements for artwork details retrieval on the Brueghel
dataset and achieves competitive performance on two place recognition
benchmarks, Tokyo247 and Pitts30K. We also demonstrate the potential of our
approach for unsupervised image collection analysis by introducing a spectral
graph clustering approach to object discovery and demonstrating it on the
object discovery dataset of \cite{rubinstein2013unsupervised} and the Brueghel
dataset. Our code and data are available at
http://imagine.enpc.fr/~shenx/SegSwap/.
Authors' comments: add results of unsupervised saliency detection
Ye Liu, Kazuma Hashimoto, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Philip S. Yu
Dense neural text retrieval has achieved promising results on open-domain
Question Answering (QA), where latent representations of questions and passages
are exploited for maximum inner product search in the retrieval process.
However, current dense retrievers require splitting documents into short
passages that usually contain local, partial, and sometimes biased context, and
highly depend on the splitting process. As a consequence, it may yield
inaccurate and misleading hidden representations, thus deteriorating the final
retrieval result. In this work, we propose Dense Hierarchical Retrieval (DHR),
a hierarchical framework that can generate accurate dense representations of
passages by utilizing both macroscopic semantics in the document and
microscopic semantics specific to each passage. Specifically, a document-level
retriever first identifies relevant documents, among which relevant passages
are then retrieved by a passage-level retriever. The ranking of the retrieved
passages will be further calibrated by examining the document-level relevance.
In addition, hierarchical title structure and two negative sampling strategies
(i.e., In-Doc and In-Sec negatives) are investigated. We apply DHR to
large-scale open-domain QA datasets. DHR significantly outperforms the original
dense passage retriever and helps an end-to-end QA system outperform the strong
baselines on multiple open-domain QA benchmarks.
Authors' comments: EMNLP 2021 Findings
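The two-stage retrieval described in the abstract above can be illustrated with a minimal sketch. It assumes precomputed embeddings stored as plain Python lists and a hypothetical interpolation weight `lam` for calibrating passage scores with document scores. The abstract does not state the actual calibration formula, so the linear combination and all function and parameter names below are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, the similarity used for dense retrieval."""
    return sum(x * y for x, y in zip(a, b))


def hierarchical_retrieve(
    query_doc_vec: Vector,
    query_psg_vec: Vector,
    doc_vecs: Dict[str, Vector],
    psg_vecs: Dict[str, Dict[str, Vector]],
    top_docs: int = 2,
    top_psgs: int = 3,
    lam: float = 0.5,  # hypothetical calibration weight, not from the paper
) -> List[Tuple[str, str, float]]:
    """Return (doc_id, passage_id, score) triples, best first."""
    # Stage 1: the document-level retriever keeps the top-scoring documents.
    doc_scores = {d: dot(query_doc_vec, v) for d, v in doc_vecs.items()}
    kept = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_docs]

    # Stage 2: the passage-level retriever scores only passages of kept
    # documents, then the score is calibrated with the document relevance.
    candidates = []
    for d in kept:
        for p, v in psg_vecs[d].items():
            score = dot(query_psg_vec, v) + lam * doc_scores[d]
            candidates.append((d, p, score))

    candidates.sort(key=lambda t: t[2], reverse=True)
    return candidates[:top_psgs]
```

Passages that belong to documents pruned in the first stage are never scored, which is where the efficiency of a hierarchical design would come from in this sketch.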
Tobias Uelwer, Nick Rucks, Stefan Harmeling
Reconstructing images from their Fourier magnitude measurements is a problem
that often arises in different research areas. This process is also referred to
as phase retrieval. In this work, we consider a modified version of the phase
retrieval problem, which allows for a reference image to be added onto the
image before the Fourier magnitudes are measured. We analyze an unrolled
Gerchberg-Saxton (GS) algorithm that can be used to learn a good reference
image from a dataset. Furthermore, we take a closer look at the learned
reference images and propose a simple and efficient heuristic to construct
reference images that, in some cases, yields reconstructions comparable in
quality to approaches that learn references. Our code is available at
https://github.com/tuelwer/reference-learning.
Authors' comments: Accepted at the NeurIPS 2021 Workshop on Deep Learning and Inverse
Problems
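The measurement model and the Gerchberg-Saxton iteration mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions: a 1D real signal with values in [0, 1], a naive DFT to keep the block self-contained, and a fixed, hand-picked reference `u`. The paper unrolls such iterations to learn the reference from data, which this sketch does not do, and all function names are hypothetical.

```python
import cmath
import random
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive discrete Fourier transform (fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec: List[complex]) -> List[complex]:
    """Inverse of dft."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def measure(x: List[float], u: List[float]) -> List[float]:
    """Fourier magnitudes of the signal after adding the reference u."""
    return [abs(c) for c in dft([a + b for a, b in zip(x, u)])]


def residual(x: List[float], u: List[float], mags: List[float]) -> float:
    """Euclidean distance between predicted and measured magnitudes."""
    return sum((m - e) ** 2 for m, e in zip(measure(x, u), mags)) ** 0.5


def gerchberg_saxton(mags: List[float], u: List[float],
                     steps: int = 50, seed: int = 0) -> List[float]:
    """Alternate between the magnitude constraint and the signal constraint."""
    rng = random.Random(seed)
    x = [rng.random() for _ in u]  # random start inside [0, 1]
    for _ in range(steps):
        spectrum = dft([a + b for a, b in zip(x, u)])
        # Fourier step: keep the current phases, impose measured magnitudes.
        projected = [m * cmath.exp(1j * cmath.phase(c))
                     for m, c in zip(mags, spectrum)]
        y = idft(projected)
        # Signal step: remove the known reference, keep the real part,
        # and clip to the assumed value range [0, 1].
        x = [min(1.0, max(0.0, (c - r).real)) for c, r in zip(y, u)]
    return x
```

Because both steps are projections onto their constraint sets, the magnitude residual of this scheme does not increase from one iteration to the next, although the iterate may still stall at a point that is not the true signal.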
Lisai Zhang, Hongfa Wu, Qingcai Chen, Yimeng Deng, Zhonghua Li, Dejiang Kong, Zhao Cao, Joanna Siebert et al.
Cross-modal retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, with powerful representation for pairwise text-image inputs via early interaction, the accuracy of vision-language (VL) transformers has outperformed existing methods for text-image retrieval. However, when the same paradigm is used for inference, the efficiency of the VL transformers is still too low to be applied in a real cross-modal SE. Inspired by the mechanism of human learning and using cross-modal knowledge, this paper presents a novel Vision-Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. With the proposed method, cross-modal retrieval is separated into two stages: the VL transformer learning stage, and the VL decomposition stage. The latter stage plays the role of single modal indexing, which is to some extent like the term indexing of a text SE. The model learns cross-modal knowledge from early-interaction pre-training and is then decomposed into an individual encoder. The decomposition requires only small target datasets for supervision and achieves both $1000+$ times acceleration and less than $0.6$\% average recall drop. VLDeformer also outperforms state-of-the-art visual-semantic embedding methods on COCO and Flickr30k.
Bohong Wu, Zhuosheng Zhang, Jinyuan Wang, Hai Zhao
Training dense passage representations via contrastive learning has been
shown effective for Open-Domain Passage Retrieval (ODPR). Existing studies
focus on further optimization through improved negative sampling strategies or
extra pretraining. However, these studies do not address passages with
internal representation conflicts caused by improper modeling granularity. This work
thus presents a refined model on the basis of a smaller granularity, contextual
sentences, to alleviate the concerned conflicts. In detail, we introduce an
in-passage negative sampling strategy to encourage a diverse generation of
sentence representations within the same passage. Experiments on three
benchmark datasets verify the efficacy of our method, especially on datasets
where conflicts are severe. Extensive experiments further present good
transferability of our method across datasets.
Authors' comments: Accepted by ACL 2022 Main Conference, Long Paper
Zijian Gao, Huanyu Liu, Jingyu Liu
The current state-of-the-art methods for video corpus moment retrieval (VCMR) often use a similarity-based feature alignment approach for the sake of convenience and speed. However, late fusion methods like cosine similarity alignment are unable to make full use of the information from both query texts and videos. In this paper, we combine feature alignment with feature fusion to improve performance on VCMR.
Ji Xin, Chenyan Xiong, Ashwin Srinivasan, Ankita Sharma, Damien Jose, Paul N. Bennett
Dense retrieval (DR) methods conduct text retrieval by first encoding texts in the embedding space and then matching them by nearest neighbor search. This requires strong locality properties from the representation space, i.e., the close allocations of each small group of relevant texts, which are hard to generalize to domains without sufficient training data. In this paper, we aim to improve the generalization ability of DR models from source training domains with rich supervision signals to target domains without any relevant labels, in the zero-shot setting. To achieve that, we propose Momentum adversarial Domain Invariant Representation learning (MoDIR), which introduces a momentum method in the DR training process to train a domain classifier distinguishing source versus target, and then adversarially updates the DR encoder to learn domain invariant representations. Our experiments show that MoDIR robustly outperforms its baselines on 10+ ranking datasets from the BEIR benchmark in the zero-shot setup, with more than 10% relative gains on datasets with enough sensitivity for DR models' evaluation. Source code of this paper will be released.
Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, Hal Daumé III
Open-domain question answering answers a question based on evidence retrieved
from a large corpus. State-of-the-art neural approaches require intermediate
evidence annotations for training. However, such intermediate annotations are
expensive, and methods that rely on them cannot transfer to the more common
setting, where only question-answer pairs are available. This paper
investigates whether models can learn to find evidence from a large corpus,
with only distant supervision from answer labels for model training, thereby
generating no additional annotation cost. We introduce a novel approach
(DistDR) that iteratively improves over a weak retriever by alternately finding
evidence from the up-to-date model and encouraging the model to learn the most
likely evidence. Without using any evidence labels, DistDR is on par with
fully-supervised state-of-the-art methods on both multi-hop and single-hop QA
benchmarks. Our analysis confirms that DistDR finds more accurate evidence over
iterations, which leads to model improvements.
Authors' comments: EMNLP 2021
Qishen Ha, Bo Liu, Hongwei Zhang
We present our solutions to the Google Landmark Challenges 2021, for both the retrieval and the recognition tracks. Both solutions are ensembles of transformers and ConvNet models based on Sub-center ArcFace with dynamic margins. Since the two tracks share the same training data, we used the same pipeline and training approach, but with different model selections for the ensemble and different post-processing. The key improvement over last year is newer state-of-the-art vision architectures, especially transformers, which significantly outperform ConvNets for the retrieval task. We finished in third and fourth place in the retrieval and recognition tracks, respectively.
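For readers unfamiliar with the loss named in the abstract above, the following is a minimal sketch of Sub-center ArcFace logits with a per-class margin. It assumes the common formulation (cosine to the nearest of K sub-centers, an additive angular margin on the target class, then scaling). The margin schedule `a * n ** (-lam) + b` and its constants are assumptions chosen for illustration. The abstract does not specify them, and the usual safeguards for angles near pi are omitted.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def dynamic_margins(class_counts: List[int], a: float = 0.5,
                    lam: float = 0.25, b: float = 0.05) -> List[float]:
    """Hypothetical schedule: classes with fewer images get larger margins."""
    return [a * n ** (-lam) + b for n in class_counts]


def subcenter_arcface_logits(embedding: Vector,
                             centers: List[List[Vector]],
                             label: int,
                             margins: List[float],
                             scale: float = 30.0) -> List[float]:
    """centers[c] holds the K sub-center vectors of class c."""
    logits = []
    for c, subs in enumerate(centers):
        # Sub-center step: a class is represented by its closest sub-center.
        cos_c = max(cosine(embedding, w) for w in subs)
        if c == label:
            # ArcFace step: add the class margin to the target angle.
            theta = math.acos(max(-1.0, min(1.0, cos_c)))
            cos_c = math.cos(theta + margins[c])
        logits.append(scale * cos_c)
    return logits
```

The resulting logits would be passed to a standard softmax cross-entropy loss during training.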
Zhang Yuqi, Xu Xianzhe, Chen Weihua, Wang Yaohua, Zhang Fangyi, Wang Fan, Li Hao
This paper presents the 2nd place solution to the Google Landmark Retrieval
2021 Competition on Kaggle. The solution builds on a baseline with training
tricks from person re-identification. A continent-aware sampling strategy
selects training images according to their country tags, and a
Landmark-Country aware reranking is proposed for the retrieval task. With these
contributions, we achieve 0.52995 mAP@100 on the private leaderboard. Code is
available at
https://github.com/WesleyZhang1991/Google_Landmark_Retrieval_2021_2nd_Place_Solution
Authors' comments: Kaggle Competition, ICCV workshop
Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
Food is significant to human daily life. In this paper, we are interested in
learning structural representations for lengthy recipes, that can benefit the
recipe generation and food cross-modal retrieval tasks. Different from the
common vision-language data, here the food images contain mixed ingredients and
target recipes are lengthy paragraphs, where we do not have annotations on
structure information. To address the above limitations, we propose a novel
method that learns, without supervision, the sentence-level tree structures for the
cooking recipes. Our approach brings together several novel ideas in a
systematic framework: (1) exploiting an unsupervised learning approach to
obtain the sentence-level tree structure labels before training; (2) generating
trees of target recipes from images with the supervision of tree structure
labels learned from (1); and (3) integrating the learned tree structures into
the recipe generation and food cross-modal retrieval procedure. Our proposed
model can produce good-quality sentence-level tree structures and coherent
recipes. We achieve the state-of-the-art recipe generation and food cross-modal
retrieval performance on the benchmark Recipe1M dataset.
Authors' comments: Accepted at IEEE Transactions on Pattern Analysis and Machine
Intelligence. arXiv admin note: substantial text overlap with
arXiv:2009.00944
Elias Ramzi, Nicolas Thome, Clément Rambour, Nicolas Audebert, Xavier Bitot
In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP). In this paper, we introduce a method for robust and decomposable average precision (ROADMAP) addressing two major challenges for end-to-end training of deep neural networks with AP: non-differentiability and non-decomposability. Firstly, we propose a new differentiable approximation of the rank function, which provides an upper bound of the AP loss and ensures robust training. Secondly, we design a simple yet effective loss function to reduce the decomposability gap between the AP in the whole training set and its averaged batch approximation, for which we provide theoretical guarantees. Extensive experiments conducted on three image retrieval datasets show that ROADMAP outperforms several recent AP approximation methods and highlight the importance of our two contributions. Finally, using ROADMAP for training deep models yields very good performance, outperforming state-of-the-art results on the three datasets.
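The non-differentiability issue raised in the abstract above comes from the step function inside the rank. The sketch below computes AP from ranks, either exactly or with a generic sigmoid relaxation of the step. Note the assumption: this is the plain sigmoid relaxation, not ROADMAP's own approximation, and it carries none of the upper-bound guarantee the abstract claims. The temperature `tau` and the function name are illustrative.

```python
import math
from typing import List, Optional


def average_precision(scores: List[float], labels: List[int],
                      tau: Optional[float] = None) -> float:
    """AP written in terms of ranks.

    tau=None uses the exact Heaviside step (not differentiable);
    a positive tau replaces it with a sigmoid of temperature tau.
    """
    def step(d: float) -> float:
        if tau is None:
            return 1.0 if d > 0 else 0.0
        z = d / tau
        # Numerically stable sigmoid.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    positives = [i for i, y in enumerate(labels) if y == 1]
    total = 0.0
    for k in positives:
        # rank(k): 1 + number of items scored above k.
        rank_all = 1.0 + sum(step(scores[j] - scores[k])
                             for j in range(len(scores)) if j != k)
        # rank+(k): same count restricted to positive items.
        rank_pos = 1.0 + sum(step(scores[j] - scores[k])
                             for j in positives if j != k)
        total += rank_pos / rank_all
    return total / len(positives)
```

With a small temperature the relaxed value stays close to the exact AP, while a larger temperature gives smoother gradients at the cost of a looser approximation.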
Jeeho Ahn, ChangHwan Kim, Changjoo Nam
We consider the problem of retrieving a target object from a confined space
by two robotic manipulators where overhand grasps are not allowed. If other
movable obstacles occlude the target, more than one object should be relocated
to clear the path to reach the target object. With two robots, the relocation
could be done efficiently by simultaneously performing relocation tasks.
However, the precedence constraint between the tasks (e.g., some objects at the
front should be removed to manipulate the objects in the back) makes the
simultaneous task execution difficult.
We propose a coordination method that determines which robot relocates which
object so as to perform tasks simultaneously. Given a set of objects to be
relocated, the objective is to maximize the number of turn-takings of the
robots in performing relocation tasks. Thus, one robot can pick an object in
the clutter while the other robot places an object in hand to the outside of
the clutter. However, the object to be relocated may not be accessible to all
robots, so taking turns could not always be achieved. Our method is based on
the optimal uniform-cost search so the number of turn-takings is proven to be
maximized. We also propose a greedy variant whose computation time is shorter.
From experiments, we show that our method reduces the completion time of the
mission by at least 22.9% (at most 27.3%) compared to the methods with no
consideration of turn-taking.
Authors' comments: Submitted to ICRA'22
Radu Balan, Chris B. Dock
The classical phase retrieval problem arises in contexts ranging from speech
recognition to x-ray crystallography and quantum state tomography. The
generalization to matrix frames is natural in the sense that it corresponds to
quantum tomography of impure states. We provide computable global stability
bounds for the quasi-linear analysis map $\beta$ and a path forward for
understanding related problems in terms of the differential geometry of key
spaces. In particular, we manifest a Whitney stratification of the positive
semidefinite matrices of low rank which allows us to ``stratify'' the
computation of the global stability bound. We show that for the impure state
case no such global stability bounds can be obtained for the non-linear
analysis map $\alpha$ with respect to certain natural distance metrics.
Finally, our computation of the global lower Lipschitz constant for the $\beta$
analysis map provides novel conditions for a frame to be generalized phase
retrievable.
Authors' comments: Proofs are in the appendix, main results in the body
Luke Murray, Divya Gopinath, Monica Agrawal, Steven Horng, David Sontag, David R. Karger
Clinical documentation can be transformed by Electronic Health Records, yet
the documentation process remains tedious, time-consuming, and
error-prone. Clinicians are faced with multi-faceted requirements and fragmented
interfaces for information exploration and documentation. These challenges are
only exacerbated in the Emergency Department -- clinicians often see 35
patients in one shift, during which they have to synthesize an often previously
unknown patient's medical records in order to reach a tailored diagnosis and
treatment plan. To better support this information synthesis, clinical
documentation tools must enable rapid contextual access to the patient's
medical record. MedKnowts is an integrated note-taking editor and information
retrieval system which unifies the documentation and search process and
provides concise synthesized concept-oriented slices of the patient's medical
record. MedKnowts automatically captures structured data while still allowing
users the flexibility of natural language. MedKnowts leverages this structure
to enable easier parsing of long notes, auto-populated text, and proactive
information retrieval, easing the documentation burden.
Authors' comments: 15 Pages, 8 figures, UIST 21, October 10-13
Christopher Sciavolino
In open-domain question answering, a model receives a text question as input and searches for the correct answer using a large evidence corpus. The retrieval step is especially difficult as typical evidence corpora have \textit{millions} of documents, each of which may or may not have the correct answer to the question. Very recently, dense models have replaced sparse methods as the de facto retrieval method. Rather than focusing on lexical overlap to determine similarity, dense methods build an encoding function that captures semantic similarity by learning from a small collection of question-answer or question-context pairs. In this paper, we investigate dense retrieval models in the context of open-domain question answering across different input distributions. To do this, first we introduce an entity-rich question answering dataset constructed from Wikidata facts and demonstrate dense models are unable to generalize to unseen input question distributions. Second, we perform analyses aimed at better understanding the source of the problem and propose new training techniques to improve out-of-domain performance on a wide variety of datasets. We encourage the field to further investigate the creation of a single, universal dense retrieval model that generalizes well across all input distributions.
Malihe Alikhani, Fangda Han, Hareesh Ravi, Mubbasir Kapadia, Vladimir Pavlovic, Matthew Stone
Common image-text joint understanding techniques presume that images and the
associated text can universally be characterized by a single implicit model.
However, co-occurring images and text can be related in qualitatively different
ways, and explicitly modeling these relations could improve the performance of
current joint understanding models. In this paper, we train a Cross-Modal Coherence Model for the
text-to-image retrieval task. Our analysis shows that models trained with
image--text coherence relations can retrieve images originally paired with
target text more often than coherence-agnostic models. We also show via human
evaluation that images retrieved by the proposed coherence-aware model are
preferred over a coherence-agnostic baseline by a huge margin. Our findings
provide insights into the ways that different modalities communicate and the
role of coherence relations in capturing commonsense inferences in text and
imagery.
Authors' comments: This paper is published in AAAI-2022
Seonho Park, Maciej Rysz, Kathleen M. Dipple, Panos M. Pardalos
Deep learning-based image retrieval has been emphasized in computer vision. Representation embedding extracted by deep neural networks (DNNs) not only aims at containing semantic information of the image, but also can manage large-scale image retrieval tasks. In this work, we propose a deep learning-based image retrieval approach using homography transformation augmented contrastive learning to perform large-scale synthetic aperture radar (SAR) image search tasks. Moreover, we propose a training method for the DNNs induced by contrastive learning that does not require any labeling procedure. This may enable tractability of large-scale datasets with relative ease. Finally, we verify the performance of the proposed method by conducting experiments on the polarimetric SAR image datasets.
Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant
In neural Information Retrieval (IR), ongoing research is directed towards
improving the first retriever in ranking pipelines. Learning dense embeddings
to conduct retrieval using efficient approximate nearest neighbors methods has
proven to work well. Meanwhile, there has been a growing interest in learning
\emph{sparse} representations for documents and queries, that could inherit
from the desirable properties of bag-of-words models such as the exact matching
of terms and the efficiency of inverted indexes. Introduced recently, the
SPLADE model provides highly sparse representations and competitive results
with respect to state-of-the-art dense and sparse approaches. In this paper, we
build on SPLADE and propose several significant improvements in terms of
effectiveness and/or efficiency. More specifically, we modify the pooling
mechanism, benchmark a model solely based on document expansion, and introduce
models trained with distillation. We also report results on the BEIR benchmark.
Overall, SPLADE is considerably improved with more than $9$\% gains on NDCG@10
on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
Authors' comments: 5 pages. arXiv admin note: substantial text overlap with
arXiv:2107.05720
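The pooling change mentioned in the abstract above can be illustrated with a small sketch. It assumes the commonly described SPLADE formulation, in which each input token produces a logit per vocabulary term, the logits pass through a log-saturated ReLU, and the results are pooled over tokens by sum or by max. The abstract itself does not spell out these details, so treat the formulas and names below as assumptions rather than the authors' exact model.

```python
import math
from typing import List


def splade_weights(token_logits: List[List[float]],
                   pooling: str = "max") -> List[float]:
    """token_logits[i][j]: logit of vocabulary term j predicted at token i.

    Returns one non-negative weight per vocabulary term.
    """
    vocab_size = len(token_logits[0])
    # Log-saturation keeps weights non-negative and dampens large logits;
    # terms with non-positive logits get weight exactly 0 (sparsity).
    saturated = [[math.log1p(max(0.0, w)) for w in row] for row in token_logits]
    if pooling == "max":
        return [max(row[j] for row in saturated) for j in range(vocab_size)]
    if pooling == "sum":
        return [sum(row[j] for row in saturated) for j in range(vocab_size)]
    raise ValueError("pooling must be 'max' or 'sum'")


def relevance(query_weights: List[float], doc_weights: List[float]) -> float:
    """Dot product of sparse term weights, as served by an inverted index."""
    return sum(q * d for q, d in zip(query_weights, doc_weights))
```

Because most weights are exactly zero, the nonzero entries can be stored in an inverted index, which is the property the abstract attributes to sparse representations.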