Ori Ram, Gal Shachaf, Omer Levy, Jonathan Berant, Amir Globerson
Dense retrievers for open-domain question answering (ODQA) have been shown to
achieve impressive performance by training on large datasets of
question-passage pairs. In this work we ask whether this dependence on labeled
data can be reduced via unsupervised pretraining that is geared towards ODQA.
We show this is in fact possible, via a novel pretraining scheme designed for
retrieval. Our "recurring span retrieval" approach uses recurring spans across
passages in a document to create pseudo examples for contrastive learning. Our
pretraining scheme directly controls for term overlap across pseudo queries and
relevant passages, thus making it possible to model both lexical and semantic relations
between them. The resulting model, named Spider, performs surprisingly well
without any labeled training examples on a wide range of ODQA datasets.
Specifically, it significantly outperforms all other pretrained baselines in a
zero-shot setting, and is competitive with BM25, a strong sparse baseline.
Moreover, a hybrid retriever over Spider and BM25 improves over both, and is
often competitive with DPR models, which are trained on tens of thousands of
examples. Last, notable gains are observed when using Spider as an
initialization for supervised training.
Authors' comments: NAACL 2022
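The pseudo-example construction described in the abstract above can be illustrated with a toy sketch. It is not the authors' implementation: whitespace tokenization, fixed-length n-gram spans, and the function name `recurring_span_pairs` are all assumptions made for illustration, and the abstract does not specify how spans are selected or how queries are transformed.

```python
from collections import defaultdict
from itertools import combinations


def recurring_span_pairs(passages, n=2):
    """Build (span, query_passage, positive_passage) triples from one document.

    ASSUMPTION of this sketch: a span "recurs" if the same n-gram appears
    in at least two different passages of the same document.
    """
    span_to_passages = defaultdict(set)
    for idx, passage in enumerate(passages):
        tokens = passage.lower().split()
        for i in range(len(tokens) - n + 1):
            span_to_passages[" ".join(tokens[i:i + n])].add(idx)

    triples = []
    for span, idxs in sorted(span_to_passages.items()):
        # Every pair of passages sharing the span yields one pseudo example.
        for q, p in combinations(sorted(idxs), 2):
            triples.append((span, passages[q], passages[p]))
    return triples


doc = [
    "The Eiffel Tower was completed in 1889",
    "Millions of tourists visit the Eiffel Tower every year",
    "Paris is the capital of France",
]
pairs = recurring_span_pairs(doc, n=2)
for span, query, positive in pairs:
    print(f"span={span!r}\n  query:    {query}\n  positive: {positive}")
```

Here the first two passages share the spans "eiffel tower" and "the eiffel", so each span produces one query/positive pair; the third passage shares no span and produces none.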
Hui Wu, Min Wang, Wengang Zhou, Yang Hu, Houqiang Li
In image retrieval, deep local features learned in a data-driven manner have
been demonstrated to be effective in improving retrieval performance. To realize
efficient retrieval on large image databases, some approaches quantize deep
local features with a large codebook and match images with an aggregated match
kernel. However, the complexity of these approaches is non-trivial, with a large
memory footprint, which limits their capability to jointly perform feature
learning and aggregation. To generate compact global representations while
maintaining regional matching capability, we propose a unified framework to
jointly learn local feature representation and aggregation. In our framework,
we first extract deep local features using CNNs. Then, we design a tokenizer
module to aggregate them into a few visual tokens, each corresponding to a
specific visual pattern. This helps to remove background noise, and capture
more discriminative regions in the image. Next, a refinement block is
introduced to enhance the visual tokens with self-attention and
cross-attention. Finally, different visual tokens are concatenated to generate
a compact global representation. The whole framework is trained end-to-end with
image-level labels. Extensive experiments are conducted to evaluate our
approach, which outperforms the state-of-the-art methods on the Revisited
Oxford and Paris datasets.
Authors' comments: Our code is available at https://github.com/MCC-WH/Token
Zelu Deng, Yujie Zhong, Sheng Guo, Weilin Huang
This work aims at improving instance retrieval with self-supervision. We find
that fine-tuning using the recently developed self-supervised learning (SSL)
methods, such as SimCLR and MoCo, fails to improve the performance of instance
retrieval. In this work, we identify that the learnt representations for
instance retrieval should be invariant to large variations in viewpoint,
background, etc., whereas the self-augmented positives used by current SSL
methods cannot provide strong enough signals for learning robust
instance-level representations. To overcome this problem, we propose InsCLR, a
new SSL method that builds on instance-level contrast to learn
the intra-class invariance by dynamically mining meaningful pseudo positive
samples from both mini-batches and a memory bank during training. Extensive
experiments demonstrate that InsCLR achieves similar or even better performance
than the state-of-the-art SSL methods on instance retrieval. Code is available
at https://github.com/zeludeng/insclr.
Authors' comments: Accepted by AAAI 2022
Zhenting Luan, Zhenyu Ming, Yuchi Wu, Wei Han, Xiang Chen, Bo Bai, Liping Zhang
Harmonic retrieval (HR) has a wide range of applications in scenarios where signals are modelled as a summation of sinusoids. Past works have developed a number of approaches to recover the original signals. Most of them rely on classical singular value decomposition, which is vulnerable to unexpected outliers. In this paper, we present new decomposition algorithms for third-order complex-valued tensors with $L_1$-principal component analysis ($L_1$-PCA) of complex data and apply them to a novel random access HR model in the presence of outliers. We also develop a novel subcarrier recovery method for the proposed model. Simulations are designed to compare our proposed method with some existing tensor-based algorithms for HR. The results demonstrate the outlier-insensitivity of the proposed method.
Shalev Lifshitz, Abtin Riasatian, H. R. Tizhoosh
Recent advances in digital pathology have led to the need for Histopathology Image Retrieval (HIR) systems that search through databases of biopsy images to find similar cases to a given query image. These HIR systems allow pathologists to effortlessly and efficiently access thousands of previously diagnosed cases in order to exploit the knowledge in the corresponding pathology reports. Since HIR systems may have to deal with millions of gigapixel images, the extraction of compact and expressive image features must be available to allow for efficient and accurate retrieval. In this paper, we propose the application of Gram barcodes as image features for HIR systems. Unlike most feature generation schemes, Gram barcodes are based on high-order statistics that describe tissue texture by summarizing the correlations between different feature maps in layers of convolutional neural networks. We run HIR experiments on three public datasets using a pre-trained VGG19 network for Gram barcode generation and showcase highly competitive results.
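As a rough illustration of the idea in the abstract above, the sketch below computes a Gram matrix of channel correlations from one layer's feature maps and turns it into a binary code. The Gram matrix follows the abstract's description; the binarization rule (thresholding at the mean of the upper-triangle entries) and the Hamming comparison are assumptions of this sketch, since the abstract does not state how the barcode is derived. The function names are hypothetical, and the feature maps are small hand-written lists rather than VGG19 activations.

```python
def gram_matrix(feature_maps):
    """feature_maps: list of C channels, each a flat list of H*W activations."""
    c = len(feature_maps)
    return [
        [sum(a * b for a, b in zip(feature_maps[i], feature_maps[j]))
         for j in range(c)]
        for i in range(c)
    ]


def gram_barcode(feature_maps):
    gram = gram_matrix(feature_maps)
    c = len(gram)
    # The Gram matrix is symmetric, so keep the upper triangle and diagonal.
    values = [gram[i][j] for i in range(c) for j in range(i, c)]
    # ASSUMPTION: binarize by thresholding at the mean entry value.
    threshold = sum(values) / len(values)
    return [1 if v > threshold else 0 for v in values]


def hamming(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))


maps = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
barcode = gram_barcode(maps)
print(barcode)
```

Binary codes of this kind can be compared with a Hamming distance, which is one reason compact barcodes are attractive for searching large image archives.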
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma
Dense Retrieval (DR) reaches state-of-the-art results in first-stage retrieval, but little is known about the mechanisms that contribute to its success. Therefore, in this work, we conduct an interpretation study of recently proposed DR models. Specifically, we first discretize the embeddings output by the document and query encoders. Based on the discrete representations, we analyze the attribution of input tokens. Both qualitative and quantitative experiments are carried out on public test collections. Results suggest that DR models pay attention to different aspects of input and extract various high-level topic representations. Therefore, we can regard the representations learned by DR models as a mixture of high-level topics.
Yixing Fan, Xiaohui Xie, Yinqiong Cai, Jia Chen, Xinyu Ma, Xiangsheng Li, Ruqing Zhang, Jiafeng Guo
The core of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list to respond to the user's information need. In recent years, the resurgence of deep learning has greatly advanced this field and led to a hot topic named NeuIR (i.e., neural information retrieval), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and huge model size, pre-trained models can learn universal language representations from massive textual data, which are beneficial to the ranking task of IR. Recently, a large number of works dedicated to the application of PTMs in IR have been introduced to improve retrieval performance. Considering the rapid progress of this direction, this survey aims to provide a systematic review of pre-training methods in IR. To be specific, we present an overview of PTMs applied in different components of an IR system, including the retrieval component, the re-ranking component, and other components. In addition, we also introduce PTMs specifically designed for IR, and summarize available datasets as well as benchmark leaderboards. Moreover, we discuss some open challenges and highlight several promising directions, with the hope of inspiring and facilitating more works on these topics for future research.
Kamilla Nazirkhanova, Joachim Neu, David Tse
The ability to verifiably retrieve transaction or state data stored off-chain is crucial to blockchain scaling techniques such as rollups or sharding. We formalize the problem and design a storage- and communication-efficient protocol using linear erasure-correcting codes and homomorphic vector commitments. Motivated by application requirements for rollups, our solution Semi-AVID-PR departs from earlier Verifiable Information Dispersal schemes in that we do not require comprehensive termination properties. Compared to Data Availability Oracles, under no circumstance do we fall back to returning empty blocks. Distributing a file of 22 MB among 256 storage nodes, up to 85 of which may be adversarial, requires in total ~70 MB of communication and storage, and ~41 seconds of single-thread runtime (<3 seconds on 16 threads) on an AMD Opteron 6378 processor when using the BLS12-381 curve. Our solution requires no modification to on-chain contracts of Validium rollups such as StarkWare's StarkEx. Additionally, it provides privacy of the dispersed data against honest-but-curious storage nodes. Finally, we discuss an application of our Semi-AVID-PR scheme to data availability verification schemes based on random sampling.
Şeyma Bodur, Edgar Martínez-Moro, Diego Ruano
A Private Information Retrieval (PIR) scheme allows users to retrieve data from a database without disclosing to the server information about the identity of the data retrieved. A coded storage in a distributed storage system with colluding servers is considered in this work, namely the approach in [$t$-private information retrieval schemes using transitive codes, IEEE Trans. Inform. Theory, vol. 65, no. 4, pp. 2107-2118, 2019] which considers a storage and retrieval code with a transitive group and provides binary PIR schemes with the highest possible rate. Reed-Muller codes were considered in [$t$-private information retrieval schemes using transitive codes, IEEE Trans. Inform. Theory, vol. 65, no. 4, pp. 2107-2118, 2019]. In this work, we consider cyclic codes and we show that binary PIR schemes using cyclic codes provide a larger constellation of PIR parameters and they may outperform the ones coming from Reed-Muller codes in some cases.
Alvet Miranda, Shah Jahan Miah
Objective: Our study objective is to design a feasible technology solution for health organizations to remove barriers to evidence-based clinical information retrieval, and improve Evidence-Based Practice. Methods: Literature from 2010 to 2020 was reviewed to define problems in evidence-based clinical information retrieval with recommendations from literature used to define solution objectives. Design Science Research is used to complete three projects in a research stream using cloud services such as Web-Scale Discovery, Content Management System, Federated Access, Global Knowledgebase, and Document Delivery. Design thinking, systems thinking, and user-oriented theory of information need are adopted to construct a design theory. Results: The research stream produced three novel and innovative artefacts: a contextual model, a unified architecture, and a context-aware unified architecture which we evaluate as part of academic reviews, scholarly publications, and conference proceedings in various research stream stages. A fourth artefact or design theory is presented to generalize results as mature knowledge.
Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Pre-training on large scale unlabelled datasets has shown impressive
performance improvements in the fields of computer vision and natural language
processing. Given the advent of large-scale instructional video datasets, a
common strategy for pre-training video encoders is to use the accompanying
speech as weak supervision. However, as speech is used to supervise the
pre-training, it is never seen by the video encoder, which does not learn to
process that modality. We address this drawback of current pre-training
methods, which fail to exploit the rich cues in spoken language. Our proposal
is to pre-train a video encoder using all the available video modalities as
supervision, namely, appearance, sound, and transcribed speech. We mask an
entire modality in the input and predict it using the other two modalities.
This encourages each modality to collaborate with the others, and our video
encoder learns to process appearance and audio as well as speech. We show the
superior performance of our "modality masking" pre-training approach for video
retrieval on the How2R, YouCook2 and Condensed Movies datasets.
Authors' comments: Accepted at WACV 2022
Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, Guillaume Lajoie
Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.
Panupong Pasupat, Yuan Zhang, Kelvin Guu
In practical applications of semantic parsing, we often want to rapidly
change the behavior of the parser, such as enabling it to handle queries in a
new domain, or changing its predictions on certain targeted queries. While we
can introduce new training examples exhibiting the target behavior, a mechanism
for enacting such behavior changes without expensive model re-training would be
preferable. To this end, we propose ControllAble Semantic Parser via Exemplar
Retrieval (CASPER). Given an input query, the parser retrieves related
exemplars from a retrieval index, augments the query with them, and then applies
a generative seq2seq model to produce an output parse. The exemplars act as a
control mechanism over the generic generative model: by manipulating the
retrieval index or how the augmented query is constructed, we can manipulate
the behavior of the parser. On the MTOP dataset, in addition to achieving
state-of-the-art on the standard setup, we show that CASPER can parse queries
in a new domain, adapt the prediction toward the specified patterns, or adapt
to new semantic schemas without having to further re-train the model.
Authors' comments: EMNLP 2021
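The retrieve-then-augment step described in the abstract above can be sketched as follows. This is a toy illustration, not CASPER itself: the token-overlap retriever, the `||` and `=>` separators, the function names, and the parse strings in the index are all hypothetical, and the generative seq2seq model that would consume the augmented query is left out.

```python
def retrieve_exemplars(query, index, k=2):
    """Rank indexed (utterance, parse) pairs by token overlap with the query.

    ASSUMPTION: plain token overlap stands in for a learned retriever.
    """
    q_tokens = set(query.lower().split())

    def overlap(entry):
        return len(q_tokens & set(entry[0].lower().split()))

    return sorted(index, key=overlap, reverse=True)[:k]


def build_augmented_query(query, exemplars):
    """Concatenate the query with its exemplars (hypothetical format)."""
    parts = [query]
    for utterance, parse in exemplars:
        parts.append(f"{utterance} => {parse}")
    return " || ".join(parts)


# Hypothetical index entries; the parse notation is illustrative only.
index = [
    ("set an alarm for 7 am", "[IN:CREATE_ALARM [SL:DATE_TIME for 7 am]]"),
    ("play some jazz music", "[IN:PLAY_MUSIC [SL:GENRE jazz]]"),
    ("wake me up at 6 am", "[IN:CREATE_ALARM [SL:DATE_TIME at 6 am]]"),
]
query = "set an alarm for 6 am"
exemplars = retrieve_exemplars(query, index, k=2)
augmented = build_augmented_query(query, exemplars)
print(augmented)
```

Because the exemplars are chosen at inference time, editing `index` (adding entries for a new domain, or removing unwanted ones) changes what the downstream model sees without retraining it, which is the control mechanism the abstract describes.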
Bhargavi Paranjape, Matthew Lamm, Ian Tenney
Deep NLP models have been shown to learn spurious correlations, leaving them
brittle to input perturbations. Recent work has shown that counterfactual or
contrastive data -- i.e. minimally perturbed inputs -- can reveal these
weaknesses, and that data augmentation using counterfactuals can help
ameliorate them. Proposed techniques for generating counterfactuals rely on
human annotations, perturbations based on simple heuristics, and meaning
representation frameworks. We focus on the task of creating counterfactuals for
question answering, which presents unique challenges related to world
knowledge, semantic diversity, and answerability. To address these challenges,
we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual
evaluation and training data with minimal human supervision. Using an
open-domain QA framework and question generation model trained on original task
data, we create counterfactuals that are fluent, semantically diverse, and
automatically labeled. Data augmentation with RGF counterfactuals improves
performance on out-of-domain and challenging evaluation sets over and above
existing methods, in both the reading comprehension and open-domain QA
settings. Moreover, we find that RGF data leads to significant improvements in
a model's robustness to local perturbations.
Authors' comments: ACL 2022 Camera-ready version
Tian Lan, Deng Cai, Yan Wang, Yixuan Su, Heyan Huang, Xian-Ling Mao
Recent progress in deep learning has continuously improved the accuracy of
dialogue response selection. In particular, sophisticated neural network
architectures are leveraged to capture the rich interactions between dialogue
context and response candidates. While remarkably effective, these models also
bring in a steep increase in computational cost. Consequently, such models can
only be used as a re-rank module in practice. In this study, we present a
solution to directly select proper responses from a large corpus or even a
nonparallel corpus that only consists of unpaired sentences, using a dense
retrieval model. To push the limits of dense retrieval, we design an
interaction layer upon the dense retrieval models and apply a set of
tailor-designed learning strategies. Our model shows superiority over strong
baselines on the conventional re-rank evaluation setting, which is remarkable
given its efficiency. To verify the effectiveness of our approach in realistic
scenarios, we also conduct full-rank evaluation, where the target is to select
proper responses from a full candidate pool that may contain millions of
candidates and evaluate them fairly through human annotations. Our proposed
model notably outperforms pipeline baselines that integrate fast recall and
expressive re-rank modules. Human evaluation results show that enlarging the
candidate pool with nonparallel corpora improves response quality further.
Authors' comments: 11 pages, 4 figures, 6 tables
Sulaiman Adesegun Kukoyi, O. F. W Onifade, Kamorudeen A. Amuda
Voice information retrieval is a technique that gives an information retrieval system the capacity to transcribe spoken queries and use the text output for information search. Collaborative information seeking (CIS) is a field of research that studies the situations, motivations, and methods of people working in a collaborative group on information-seeking projects, as well as the building of systems to support such activities. Humans find it easier to communicate and express ideas via speech. Existing voice search systems, such as Google's and other mainstream services, do not support collaborative search. In our model, spoken queries are passed through an automatic speech recognition (ASR) component, which uses MFCC for feature extraction and an HMM with the Viterbi algorithm for pattern matching. The output of the ASR is then passed as input into the CIS system, whose results are filtered to produce an aggregate result. The simulation results show that our model was able to achieve 81.25% transcription accuracy.
Shiv Ram Dubey, Satish Kumar Singh, Wei-Ta Chu
Deep learning has driven tremendous growth in hashing techniques for image
retrieval. Recently, the Transformer has emerged as a new architecture that
utilizes self-attention without convolution. The Transformer has also been
extended to the Vision Transformer (ViT) for visual recognition, with promising
performance on ImageNet. In this paper, we propose Vision Transformer based
Hashing (VTS) for image retrieval. We utilize the ViT pre-trained on ImageNet as
the backbone network and add a hashing head. The proposed VTS model is
fine-tuned for hashing under six different image retrieval frameworks, including
Deep Supervised Hashing (DSH), HashNet, GreedyHash, Improved Deep Hashing
Network (IDHN), Deep Polarized Network (DPN) and Central Similarity
Quantization (CSQ), with their objective functions. We perform extensive
experiments on the CIFAR10, ImageNet, NUS-Wide, and COCO datasets. The proposed
VTS-based image retrieval outperforms recent state-of-the-art hashing
techniques by a large margin. We also find that the proposed VTS model is a
better backbone network than existing networks such as AlexNet and ResNet. The
code is released at https://github.com/shivram1987/VisionTransformerHashing.
Authors' comments: Accepted in IEEE International Conference on Multimedia and Expo
(ICME), 2022
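To make the role of a hashing head concrete, the sketch below binarizes feature vectors and ranks a small database by Hamming distance. It is a generic illustration rather than the VTS model: the sign-thresholded linear projection, the hand-picked weights, and the function names are assumptions, and the six frameworks named in the abstract each train the head with their own objective, which is not reproduced here.

```python
def hash_head(features, weights):
    """Map a feature vector to a binary code, one bit per projection row.

    ASSUMPTION: a linear projection followed by a sign threshold.
    """
    return [
        1 if sum(w * x for w, x in zip(row, features)) >= 0 else 0
        for row in weights
    ]


def hamming_distance(code_a, code_b):
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


# Hand-picked 4-bit projection over 2-D features (illustrative values only).
weights = [[1, 0], [0, 1], [1, 1], [1, -1]]
database = [[-1.0, -0.5], [1.0, 0.1], [-0.5, 0.8]]
database_codes = [hash_head(f, weights) for f in database]
query_code = hash_head([0.9, 0.2], weights)
ranking = rank_by_hamming(query_code, database_codes)
print(query_code, database_codes, ranking)
```

In a real system the feature vectors would come from the backbone network, and the codes would typically be stored as packed bits so that Hamming distances can be computed with fast bitwise operations.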
Vivek Gupta, Akshat Shrivastava, Adithya Sagar, Armen Aghajanyan, Denis Savenkov
While large pre-trained language models accumulate a lot of knowledge in
their parameters, it has been demonstrated that augmenting them with
non-parametric retrieval-based memory has a number of benefits from accuracy
improvements to data efficiency for knowledge-focused tasks, such as question
answering. In this paper, we apply retrieval-based modeling ideas to the
problem of multi-domain task-oriented semantic parsing for conversational
assistants. Our approach, RetroNLU, extends a sequence-to-sequence model
architecture with a retrieval component, used to fetch existing similar
examples and provide them as an additional input to the model. In particular,
we analyze two settings, where we augment an input with (a) retrieved nearest
neighbor utterances (utterance-nn), and (b) ground-truth semantic parses of
nearest neighbor utterances (semparse-nn). Our technique outperforms the
baseline method by 1.5% absolute macro-F1, especially in the low-resource
setting, matching the baseline model accuracy with only 40% of the data.
Furthermore, we analyze the nearest neighbor retrieval component's quality,
model sensitivity and break down the performance for semantic parses of
different utterance complexity.
Authors' comments: 12 pages, 9 figures, 5 Tables
Giuseppe Ortolano, Pauline Boucher, Ivano Ruo Berchera, Silvania F. Pereira, Marco Genovese
Quantum correlations, such as entanglement and squeezing, have been shown to improve phase estimation in interferometric setups on one side, and non-interferometric imaging schemes for amplitude objects on the other. In the latter case, quantum correlation among a pair of beams leads to a sub-shot-noise readout of the image intensity pattern, where weak details, otherwise hidden in the noise, can be appreciated. In this paper we propose a technique which exploits entanglement to enhance quantitative phase retrieval of an object in a non-interferometric setting, i.e., only measuring the propagated intensity pattern after interaction with the object. The method exploits existing technology; it operates in wide-field mode, so it does not require time-consuming raster scanning, and it can operate with small spatial coherence of the incident field. This protocol can find application in optical microscopy and X-ray imaging, reducing the photon dose necessary to achieve a fixed signal-to-noise ratio.
Ievgeniia Kuzminykh, Dan Shevchuk, Stavros Shiaeles, Bogdan Ghita
Modern streaming services are increasingly labeling videos based on their
visual or audio content. This typically augments the use of technologies such
as AI and ML by allowing natural speech to be used for searching by keywords and
video descriptions. Prior research has successfully provided a number of
solutions for speech-to-text in the case of human speech, but this article
aims to investigate possible solutions to retrieve sound events based on a
natural language query, and estimate how effective and accurate they are. In
this study, we specifically focus on the YamNet, AlexNet, and ResNet-50
pre-trained models to automatically classify audio samples using their
respective melspectrograms into a number of predefined classes. The predefined
classes can represent sounds associated with actions within a video fragment.
Two tests are conducted to evaluate the performance of the models on two
separate problems: audio classification and interval retrieval based on a
natural language query. Results show that the benchmarked models are comparable
in terms of performance, with YamNet slightly outperforming the other two
models. YamNet was able to classify single fixed-size audio samples with 92.7%
accuracy and 68.75% precision, while its average accuracy on interval retrieval
was 71.62% and precision was 41.95%. The investigated method may be embedded
into an automated event marking architecture for streaming services.
Authors' comments: 20th International Conference on Next Generation Teletraffic and
Wired/Wireless Advanced Networks and Systems, NEW2AN 2020 and 13th Conference
on the Internet of Things and Smart Spaces, ruSMART 2020