Hengxin Fun, Sunil Gandhi, Sujith Ravi
Recently, there have been significant advances in neural methods for tackling knowledge-intensive tasks such as open domain question answering (QA). These advances are fueled by combining large pre-trained language models with learnable retrieval of documents. The majority of these models use separate encoders for learning the query representation and the passage representation for the retriever, plus an additional encoder for the downstream task. Using separate encoders for each stage/task occupies a lot of memory and makes it difficult to scale to a large number of tasks. In this paper, we propose a novel Retrieval Optimized Multi-task (ROM) framework for jointly training self-supervised tasks, knowledge retrieval, and extractive question answering. Our ROM approach presents a unified and generalizable framework that enables scaling efficiently to multiple tasks, varying levels of supervision, and optimization choices such as different learning schedules without changing the model architecture. It also provides the flexibility of changing the encoders without changing the architecture of the system. Using our framework, we achieve comparable or better performance than recent methods on QA, while drastically reducing the number of parameters.
Xiangteng He, Yulin Pan, Mingqian Tang, Yiliang Lv
Content-based video retrieval aims to find videos from a large video database that are similar to, or even near-duplicates of, a given query video. Video representation and similarity search algorithms are crucial to any video retrieval system. To derive effective video representations, most video retrieval systems require a large amount of manually annotated data for training, making them costly and inefficient. In addition, most retrieval systems are based on frame-level features for video similarity searching, making them expensive in both storage and search. We propose a novel video retrieval system, termed SVRTN, that effectively addresses the above shortcomings. It first applies self-supervised training to learn video representations from unlabeled data, avoiding the expensive cost of manual annotation. Then, it exploits a transformer structure to aggregate frame-level features into clip-level features, reducing both storage space and search complexity. It can learn complementary and discriminative information from the interactions among clip frames, and it acquires invariance to frame permutation and missing frames, supporting more flexible retrieval. Comprehensive experiments on two challenging video retrieval datasets, namely FIVR-200K and SVD, verify the effectiveness of our proposed SVRTN method, which achieves the best video retrieval performance in both accuracy and efficiency.
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, Jason Weston
Despite showing increasingly human-like conversational abilities, state-of-the-art dialogue models often suffer from factual incorrectness and hallucination of knowledge (Roller et al., 2020). In this work we explore the use of neural-retrieval-in-the-loop architectures - recently shown to be effective in open-domain QA (Lewis et al., 2020b; Izacard and Grave, 2020) - for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses. We study various types of architectures with multiple components - retrievers, rankers, and encoder-decoders - with the goal of maximizing knowledgeability while retaining conversational ability. We demonstrate that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks. The models exhibit open-domain conversational capabilities, generalize effectively to scenarios not within the training data, and, as verified by human evaluations, substantially reduce the well-known problem of knowledge hallucination in state-of-the-art chatbots.
Kai Wang, Luis Herranz, Joost van de Weijer
Multimodal representations and continual learning are two areas closely
related to human intelligence. The former considers the learning of shared
representation spaces where information from different modalities can be
compared and integrated (we focus on cross-modal retrieval between language and
visual representations). The latter studies how to prevent forgetting a
previously learned task when learning a new one. While humans excel in these
two aspects, deep neural networks are still quite limited. In this paper, we
propose a combination of both problems into a continual cross-modal retrieval
setting, where we study how the catastrophic interference caused by new tasks
impacts the embedding spaces and their cross-modal alignment required for
effective retrieval. We propose a general framework that decouples the
training, indexing and querying stages. We also identify and study different
factors that may lead to forgetting, and propose tools to alleviate it. We
found that the indexing stage plays an important role and that simply avoiding
reindexing the database with updated embedding networks can lead to significant
gains. We evaluated our methods on two image-text retrieval datasets, obtaining
significant gains with respect to the fine-tuning baseline.
Authors' comments: 2nd CLVISION workshop in CVPR 2021
Xueguang Ma, Kai Sun, Ronak Pradeep, Jimmy Lin
Text retrieval using learned dense representations has recently emerged as a promising alternative to "traditional" text retrieval using sparse bag-of-words representations. One recent work that has garnered much attention is the dense passage retriever (DPR) technique proposed by Karpukhin et al. (2020) for end-to-end open-domain question answering. We present a replication study of this work, starting with model checkpoints provided by the authors, but otherwise from an independent implementation in our group's Pyserini IR toolkit and PyGaggle neural text ranking library. Although our experimental results largely verify the claims of the original paper, we arrived at two important additional findings that contribute to a better understanding of DPR: First, it appears that the original authors under-report the effectiveness of the BM25 baseline and hence also dense-sparse hybrid retrieval results. Second, by incorporating evidence from the retriever and an improved answer span scoring technique, we are able to improve end-to-end question answering effectiveness using exactly the same models as in the original work.
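The dense-sparse hybrid retrieval mentioned in this abstract can be pictured as a per-document combination of the two scores. The following is a minimal sketch, assuming a linear interpolation with a hypothetical weight `alpha` and a simple fallback for documents that appear in only one result list. The function name `hybrid_scores` and the toy scores are illustrative and are not taken from Pyserini, PyGaggle, or the original DPR paper.

```python
def hybrid_scores(dense, sparse, alpha=1.0):
    """Fuse per-document scores as dense + alpha * sparse.

    Assumption for illustration: a document missing from one result list
    receives that list's minimum score.
    """
    dense_floor = min(dense.values())
    sparse_floor = min(sparse.values())
    doc_ids = set(dense) | set(sparse)
    fused = {
        d: dense.get(d, dense_floor) + alpha * sparse.get(d, sparse_floor)
        for d in doc_ids
    }
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores: dense similarities and BM25-like sparse scores (made up).
dense = {"d1": 0.9, "d2": 0.7, "d3": 0.4}
sparse = {"d2": 12.0, "d3": 15.0, "d4": 9.0}
ranking = hybrid_scores(dense, sparse, alpha=0.05)
```

The weight matters because dense and sparse scores live on different scales, so a weak sparse baseline or a poorly chosen weight changes the hybrid ranking as well.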
Hyounghun Kim, Abhay Zala, Graham Burri, Mohit Bansal
Interest in physical therapy and individual exercises such as yoga/dance has
increased alongside the well-being trend. However, such exercises are hard to
follow without expert guidance (which is impossible to scale for personalized
feedback to every trainee remotely). Thus, automated pose correction systems
are required more than ever, and we introduce a new captioning dataset named
FixMyPose to address this need. We collect descriptions of correcting a
"current" pose to look like a "target" pose (in both English and Hindi). The
collected descriptions have interesting linguistic properties such as
egocentric relations to environment objects, analogous references, etc.,
requiring an understanding of spatial relations and commonsense knowledge about
postures. Further, to avoid ML biases, we maintain a balance across characters
with diverse demographics, who perform a variety of movements in several
interior environments (e.g., homes, offices). From our dataset, we introduce
the pose-correctional-captioning task and its reverse target-pose-retrieval
task. During the correctional-captioning task, models must generate
descriptions of how to move from the current to target pose image, whereas in
the retrieval task, models should select the correct target pose given the
initial pose and correctional description. We present strong cross-attention
baseline models (uni/multimodal, RL, multilingual) and also show that our
baselines are competitive with other models when evaluated on other
image-difference datasets. We also propose new task-specific metrics
(object-match, body-part-match, direction-match) and conduct human evaluation
for more reliable evaluation, and we demonstrate a large human-model
performance gap suggesting room for promising future work. To verify the
sim-to-real transfer of our FixMyPose dataset, we collect a set of real images
and show promising performance on these images.
Authors' comments: AAAI 2021 (18 pages, 16 figures; webpage:
https://fixmypose-unc.github.io/)
Fuwen Tan, Jiangbo Yuan, Vicente Ordonez
Instance-level image retrieval is the task of searching in a large database
for images that match an object in a query image. To address this task, systems
usually rely on a retrieval step that uses global image descriptors, and a
subsequent step that performs domain-specific refinements or reranking by
leveraging operations such as geometric verification based on local features.
In this work, we propose Reranking Transformers (RRTs) as a general model to
incorporate both local and global features to rerank the matching images in a
supervised fashion and thus replace the relatively expensive process of
geometric verification. RRTs are lightweight and can be easily parallelized so
that reranking a set of top matching results can be performed in a single
forward-pass. We perform extensive experiments on the Revisited Oxford and
Paris datasets, and the Google Landmarks v2 dataset, showing that RRTs
outperform previous reranking approaches while using far fewer local
descriptors. Moreover, we demonstrate that, unlike existing approaches, RRTs
can be optimized jointly with the feature extractor, which can lead to feature
representations tailored to downstream tasks and further accuracy improvements.
The code and trained models are publicly available at
https://github.com/uvavision/RerankingTransformer.
Authors' comments: ICCV 2021, Table-3 corrected
Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko
We present a new state of the art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, state-of-the-art results are achieved with a single model on two datasets without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on a combination of datasets can improve test results on each of them. Additionally, we check the intersection between many popular datasets and find that MSRVTT has a significant overlap between its test and train parts, and the same situation is observed for ActivityNet.
Michael Wray, Hazel Doughty, Dima Damen
Current video retrieval efforts all base their evaluation on an
instance-based assumption, that only a single caption is relevant to a query
video and vice versa. We demonstrate that this assumption results in
performance comparisons often not indicative of models' retrieval capabilities.
We propose a move to semantic similarity video retrieval, where (i) multiple
videos/captions can be deemed equally relevant, and their relative ranking does
not affect a method's reported performance and (ii) retrieved videos/captions
are ranked by their similarity to a query. We propose several proxies to
estimate semantic similarities in large-scale retrieval datasets, without
additional annotations. Our analysis is performed on three commonly used video
retrieval datasets (MSR-VTT, YouCook2 and EPIC-KITCHENS).
Authors' comments: Accepted in CVPR 2021. Project Page: https://mwray.github.io/SSVR/
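Property (i) above, that equally relevant items may be swapped without changing the reported score, holds for graded-relevance metrics such as nDCG. The sketch below assumes nDCG as the metric and uses made-up relevance values. It illustrates the property only and does not reproduce the paper's evaluation protocol.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance values listed in ranked order."""
    return sum(r / math.log2(rank + 2) for rank, r in enumerate(relevances))


def ndcg(ranked_relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical semantic relevance of three captions to one query video.
relevance = {"a": 1.0, "b": 1.0, "c": 0.2}


def score(ranking):
    return ndcg([relevance[item] for item in ranking])


score_ab = score(["a", "b", "c"])   # the two equally relevant items first
score_ba = score(["b", "a", "c"])   # same items swapped
score_bad = score(["c", "a", "b"])  # weakly relevant item ranked first
```

Swapping "a" and "b" leaves the score unchanged, while ranking the weakly relevant "c" first lowers it.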
Chen Qu, Liu Yang, Cen Chen, W. Bruce Croft, Kalpesh Krishna, Mohit Iyyer
Recent studies on Question Answering (QA) and Conversational QA (ConvQA)
emphasize the role of retrieval: a system first retrieves evidence from a large
collection and then extracts answers. This open-retrieval ConvQA setting
typically assumes that each question is answerable by a single span of text
within a particular passage (a span answer). The supervision signal is thus
derived from whether or not the system can recover an exact match of this
ground-truth answer span from the retrieved passages. This method is referred
to as span-match weak supervision. However, information-seeking conversations
are challenging for this span-match method since long answers, especially
freeform answers, are not necessarily strict spans of any passage. Therefore,
we introduce a learned weak supervision approach that can identify a
paraphrased span of the known answer in a passage. Our experiments on QuAC and
CoQA datasets show that the span-match weak supervisor can only handle
conversations with span answers, and has less satisfactory results for freeform
answers generated by people. Our method is more flexible as it can handle both
span answers and freeform answers. Moreover, our method can be more powerful
when combined with the span-match method, which shows that the two methods are
complementary. We also conduct in-depth analyses to provide more insights into
open-retrieval ConvQA under a weak supervision setting.
Authors' comments: Accepted to ECIR'21
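The span-match weak supervision described in this abstract can be sketched as a substring test over retrieved passages. This is a minimal sketch, assuming simple lowercasing and whitespace normalisation. The helper names and example texts are hypothetical and not taken from the paper's code.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def span_match_labels(answer, passages):
    """Return one boolean per passage: True if the answer is an exact span of it."""
    target = normalize(answer)
    return [target in normalize(p) for p in passages]


passages = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "It was finished in the late 1880s, just in time for the fair.",
]

span_labels = span_match_labels("completed in 1889", passages)
freeform_labels = span_match_labels("It was done by 1889", passages)
```

The span answer yields a positive label for the first passage, whereas the freeform answer matches no passage and therefore provides no supervision signal. That gap is what a learned weak supervisor is meant to cover.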
William H. B. Smith, Michael Milford, Klaus D. McDonald-Maier, Shoaib Ehsan
Visual navigation localizes a query place image against a reference database
of place images, also known as a `visual map'. Localization accuracy
requirements for specific areas of the visual map, `scene classes', vary
according to the context of the environment and task. State-of-the-art visual
mapping is unable to reflect these requirements by explicitly targeting scene
classes for inclusion in the map. Four different scene classes, including
pedestrian crossings and stations, are identified in each of the Nordland and
St. Lucia datasets. Instead of re-training separate scene classifiers, which
struggle with these overlapping scene classes, we make our first contribution:
defining the problem of `scene retrieval'. Scene retrieval extends image
retrieval to classification of scenes defined at test time by associating a
single query image to reference images of scene classes. Our second
contribution is a triplet-trained convolutional neural network (CNN) to address
this problem which increases scene classification accuracy by up to 7% against
state-of-the-art networks pre-trained for scene recognition. Our third
contribution is an algorithm `DMC' that combines our scene classification with
distance and memorability for visual mapping. Our analysis shows that DMC
includes 64% more images of our chosen scene classes in a visual map than just
using distance interval mapping. State-of-the-art visual place descriptors
AMOS-Net, Hybrid-Net and NetVLAD are finally used to show that DMC improves
scene class localization accuracy by a mean of 3% and localization accuracy of
the remaining map images by a mean of 10% across both datasets.
Authors' comments: 8 page paper on visual place recognition and scene classification
Rita Parada Ramos, Patrícia Pereira, Helena Moniz, Joao Paulo Carvalho, Bruno Martins
Deep neural networks have achieved state-of-the-art results in various vision
and/or language tasks. Despite the use of large training datasets, most models
are trained by iterating over single input-output pairs, discarding the
remaining examples for the current prediction. In this work, we actively
exploit the training data, using the information from nearest training examples
to aid the prediction both during training and testing. Specifically, our
approach uses the target of the most similar training example to initialize the
memory state of an LSTM model, or to guide attention mechanisms. We apply this
approach to image captioning and sentiment analysis, respectively through image
and text retrieval. Results confirm the effectiveness of the proposed approach
for the two tasks, on the widely used Flickr8k and IMDB datasets. Our code is
publicly available at http://github.com/RitaRamo/retrieval-augmentation-nn.
Authors' comments: Accepted at IJCNN 2021
Mohamed Trabelsi, Zhiyu Chen, Brian D. Davison, Jeff Heflin
Ranking models are the main components of information retrieval systems.
Several approaches to ranking are based on traditional machine learning
algorithms using a set of hand-crafted features. Recently, researchers have
leveraged deep learning models in information retrieval. These models are
trained end-to-end to extract features from the raw data for ranking tasks, so
that they overcome the limitations of hand-crafted features. A variety of deep
learning models have been proposed, and each model presents a set of neural
network components to extract features that are used for ranking. In this
paper, we compare the proposed models in the literature along different
dimensions in order to understand the major contributions and limitations of
each model. In our discussion of the literature, we analyze the promising
neural components, and propose future research directions. We also show the
analogy between document retrieval and other retrieval tasks where the items to
be ranked are structured documents, answers, images and videos.
Authors' comments: Published in the Information Retrieval Journal (2021)
Rong Fu, Yimin Liu, Tianyao Huang, Yonina C. Eldar
Learned iterative shrinkage thresholding algorithm (LISTA), which adopts deep
learning techniques to learn optimal algorithm parameters from labeled training
data, can be successfully applied to small-scale multidimensional harmonic
retrieval (MHR) problems. However, LISTA is computationally demanding for
large-scale MHR problems because the matrix size of the learned mutual
inhibition matrix exhibits quadratic growth with the signal length. These large
matrices consume costly memory/computation resources and require a huge amount
of labeled data for training, restricting the applicability of the LISTA
method. In this paper, we show that the mutual inhibition matrix of an MHR
problem naturally has a Toeplitz structure, which means that the degrees of
freedom (DoF) of the matrix can be reduced from a quadratic order to a linear
order. By exploiting this characteristic, we propose a structured
LISTA-Toeplitz network, which imposes a Toeplitz structure restriction on the
mutual inhibition matrices and applies linear convolution instead of the
matrix-vector multiplication involved in the traditional LISTA network. Both
simulations and a field test for air target detection with radar are carried out
to validate the performance of the proposed network. For small-scale MHR
problems, LISTA-Toeplitz exhibits close or even better recovery accuracy than
traditional LISTA, while the former significantly reduces the network
complexity and requires much less training data. For large-scale MHR problems,
where LISTA is difficult to implement due to the huge size of the mutual
inhibition matrices, our proposed LISTA-Toeplitz still enjoys desirable
recovery performance.
Authors' comments: 13 pages, 13 figures, 50 references
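The reduction from quadratic to linear degrees of freedom, and the replacement of a matrix-vector product by a linear convolution, can be checked on a toy example. The sketch below assumes an n x n Toeplitz matrix parameterised by a vector `t` of length 2n - 1 with entries T[i][j] = t[i - j + n - 1]. This indexing convention and the function names are illustrative choices, not the paper's notation.

```python
def toeplitz_matvec(t, x):
    """Multiply the n x n Toeplitz matrix T[i][j] = t[i - j + n - 1] by x."""
    n = len(x)
    assert len(t) == 2 * n - 1
    return [sum(t[i - j + n - 1] * x[j] for j in range(n)) for i in range(n)]


def linear_convolution(a, b):
    """Full linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def toeplitz_via_convolution(t, x):
    """Same product, read off as the middle n entries of the convolution t * x."""
    n = len(x)
    return linear_convolution(t, x)[n - 1 : 2 * n - 1]


t = [5, 4, 3, 2, 1]  # 2n - 1 = 5 free parameters instead of n * n = 9
x = [1, 0, -1]

direct = toeplitz_matvec(t, x)
by_conv = toeplitz_via_convolution(t, x)
```

Both routes give the same vector, which is why a Toeplitz-constrained layer can be implemented as a convolution with 2n - 1 learnable weights rather than a dense n x n matrix.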
Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Product, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
Robert Beinert, Marzieh Hasannasab
In this paper we consider the nonlinear inverse problem of phase retrieval in the context of dynamical sampling. Whereas phase retrieval deals with the recovery of signals and images from phaseless measurements, dynamical sampling was introduced by Aldroubi et al. in 2015 as a tool to recover diffusion fields from spatiotemporal samples. Considering finite-dimensional signals evolving in time under the action of a known matrix, our aim is to recover the signal up to global phase in a stable way from the absolute value of certain space-time measurements. First, we state necessary conditions for the dynamical system of sampling vectors to make the recovery of the unknown signal possible. The conditions deal with the spectrum of the given matrix and the initial sampling vector. Then, assuming that we have access to a specific set of further measurements related to aligned sampling vectors, we provide a feasible procedure to recover almost every signal up to global phase using polarization techniques. Moreover, we show that by adding extra conditions like full spark, the recovery of all signals is possible without exceptions.
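Why recovery can only be expected "up to global phase" is easy to see numerically. The sketch below assumes space-time measurements of the form |<x, A^l phi>| for l = 0, 1, ..., with a known matrix `A` and an initial sampling vector `phi`. This form, the toy values, and the function names are assumptions made for illustration and may differ from the paper's exact setup.

```python
import cmath


def matvec(A, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def inner(x, y):
    """Inner product <x, y>, conjugate-linear in the second argument."""
    return sum(a * b.conjugate() for a, b in zip(x, y))


def phaseless_measurements(x, A, phi, steps):
    """Return |<x, A^l phi>| for l = 0, ..., steps - 1."""
    out, v = [], list(phi)
    for _ in range(steps):
        out.append(abs(inner(x, v)))
        v = matvec(A, v)
    return out


A = [[0.5, 0.2], [0.0, 0.9]]   # toy evolution matrix
phi = [1.0, 1.0]               # toy initial sampling vector
x = [1 + 2j, -0.5j]            # unknown signal
rotated = [cmath.exp(0.7j) * xi for xi in x]  # same signal, global phase shift

m_x = phaseless_measurements(x, A, phi, steps=4)
m_rot = phaseless_measurements(rotated, A, phi, steps=4)
```

The two measurement sequences agree to floating-point precision, so no procedure based on these magnitudes alone can distinguish `x` from its phase-rotated copy.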
Wei Chen, Yu Liu, Weiping Wang, Erwin Bakker, Theodoros Georgiou, Paul Fieguth, Li Liu, Michael S. Lew
In recent years a vast amount of visual content has been generated and shared
from many fields, such as social media platforms, medical imaging, and
robotics. This abundance of content creation and sharing has introduced new
challenges, particularly that of searching databases for similar content,
known as Content Based Image Retrieval (CBIR), a long-established research area
in which improved efficiency and accuracy are needed for real-time retrieval.
Artificial intelligence has made progress in CBIR and has significantly
facilitated the process of instance search. In this survey we review recent
instance retrieval works that are developed based on deep learning algorithms
and techniques, with the survey organized by deep network architecture types,
deep features, feature embedding and aggregation methods, and network
fine-tuning strategies. Our survey considers a wide variety of recent methods,
whereby we identify milestone work, reveal connections among various methods,
present the commonly used benchmarks, evaluation results, and common
challenges, and propose promising future directions.
Authors' comments: IEEE Transactions on Pattern Analysis and Machine Intelligence
Seunghoan Song, Masahito Hayashi
Quantum private information retrieval (QPIR) for quantum messages is the protocol in which a user retrieves one of the multiple quantum states from one or multiple servers without revealing which state is retrieved. We consider QPIR in two different settings: the blind setting, in which the servers contain one copy of the message states, and the visible setting, in which the servers contain the description of the message states. One trivial solution in both settings is downloading all states from the servers and the main goal of this paper is to find more efficient QPIR protocols. First, we prove that the trivial solution is optimal for one-server QPIR in the blind setting. In one-round protocols, the same optimality holds even in the visible setting. On the other hand, when the user and the server share entanglement, we prove that there exists an efficient one-server QPIR protocol in the blind setting. Furthermore, in the visible setting, we prove that it is possible to construct symmetric QPIR protocols in which the user obtains no information of the non-targeted messages. We construct three two-server symmetric QPIR protocols for pure states. Note that symmetric classical PIR is impossible without shared randomness unknown to the user.
Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, Diane Larlus
Cross-modal retrieval methods build a common representation space for samples
from multiple modalities, typically from the vision and the language domains.
For images and their captions, the multiplicity of the correspondences makes
the task particularly challenging. Given an image (respectively a caption),
there are multiple captions (respectively images) that equally make sense. In
this paper, we argue that deterministic functions are not sufficiently powerful
to capture such one-to-many correspondences. Instead, we propose to use
Probabilistic Cross-Modal Embedding (PCME), where samples from the different
modalities are represented as probabilistic distributions in the common
embedding space. Since common benchmarks such as COCO suffer from
non-exhaustive annotations for cross-modal matches, we propose to additionally
evaluate retrieval on the CUB dataset, a smaller yet clean database where all
possible image-caption pairs are annotated. We extensively ablate PCME and
demonstrate that it not only improves the retrieval performance over its
deterministic counterpart but also provides uncertainty estimates that render
the embeddings more interpretable. Code is available at
https://github.com/naver-ai/pcme
Authors' comments: Accepted to CVPR 2021; Code is available at
https://github.com/naver-ai/pcme
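The idea of representing samples as distributions rather than points can be illustrated with a Monte Carlo estimate of how likely two embeddings are to match. The sketch below assumes diagonal Gaussian embeddings and a per-sample match score of the form sigmoid(-scale * distance + shift). The function names, parameters, and toy values are hypothetical and are not taken from the released PCME code.

```python
import math
import random


def sample(mean, std, rng):
    """Draw one point from a diagonal Gaussian embedding."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def match_probability(emb_a, emb_b, n_samples=200, scale=1.0, shift=0.0, seed=0):
    """Monte Carlo estimate of the match probability of two embeddings.

    Each embedding is a (mean, std) pair. The score
    sigmoid(-scale * distance + shift) is an assumed form for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        za = sample(*emb_a, rng)
        zb = sample(*emb_b, rng)
        dist = math.dist(za, zb)
        total += 1.0 / (1.0 + math.exp(scale * dist - shift))
    return total / n_samples


image = ([0.0, 0.0], [0.1, 0.1])          # confident image embedding
caption_near = ([0.1, 0.0], [0.5, 0.5])   # nearby but uncertain caption
caption_far = ([3.0, 3.0], [0.1, 0.1])    # distant caption

p_near = match_probability(image, caption_near)
p_far = match_probability(image, caption_far)
```

The standard deviations act as a simple uncertainty estimate: a vague caption that fits many images can be given a wide distribution, which a single deterministic point cannot express.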
Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, Gargi Ghosh
Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.