Weizhe Lin, Bill Byrne
Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA
task that requires retrieval of external knowledge to answer questions about
images. Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve
documents from external knowledge bases, such as Wikipedia, but with DPR
trained separately from answer generation, introducing a potential limit on the
overall system performance. Instead, we propose a joint training scheme which
includes differentiable DPR integrated with answer generation so that the
system can be trained in an end-to-end fashion. Our experiments show that our
scheme outperforms recent OK-VQA systems with strong DPR for retrieval. We also
introduce new diagnostic metrics to analyze how retrieval and generation
interact. The strong retrieval ability of our model significantly reduces the
number of retrieved documents needed in training, yielding significant benefits
in answer quality and computation required for training.
Authors' comments: Accepted to appear at the main conference of EMNLP 2022
Nolan Peard, Kartik Ayyer, Henry N. Chapman
Second-order intensity correlations from incoherent emitters can reveal the Fourier transform modulus of their spatial distribution, but retrieving the phase to enable completely general Fourier inversion to real space remains challenging. Phase retrieval via the third-order intensity correlations has relied on special emitter configurations which simplified an unaddressed sign problem in the computation. Without a complete treatment of this sign problem, the general case of retrieving the Fourier phase from a truly arbitrary configuration of emitters is not possible. In this paper, a general method for ab initio phase retrieval via the intensity triple correlations is described. Simulations demonstrate accurate phase retrieval for clusters of incoherent emitters which could be applied to imaging stars or fluorescent atoms and molecules. With this work, it is now finally tractable to perform Fourier inversion directly and reconstruct images of arbitrary arrays of independent emitters via far-field intensity correlations alone.
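As a rough illustration of the sign problem mentioned in the abstract above, the relations below assume idealized chaotic (thermal) light; this is an assumption of the sketch, not a statement of the authors' derivation. The symbols $g^{(1)}$, $\phi$, $q_1$, and $q_2$ are introduced here for illustration, with $g^{(1)}(q) = |g^{(1)}(q)|\,e^{i\phi(q)}$ the normalized Fourier transform of the emitter distribution.

```latex
% Second order: only the Fourier modulus survives (Siegert relation).
g^{(2)}(q) = 1 + \left|g^{(1)}(q)\right|^{2}

% Third order: the phase enters only through the cosine of the closure phase.
g^{(3)}(q_1, q_2) = 1
  + \left|g^{(1)}(q_1)\right|^{2}
  + \left|g^{(1)}(q_2)\right|^{2}
  + \left|g^{(1)}(q_1 + q_2)\right|^{2}
  + 2\,\left|g^{(1)}(q_1)\right|\left|g^{(1)}(q_2)\right|\left|g^{(1)}(q_1 + q_2)\right|
    \cos\!\left[\phi(q_1) + \phi(q_2) - \phi(q_1 + q_2)\right]
```

Because the cosine is even, the measured triple correlation fixes the magnitude of the closure phase $\phi(q_1) + \phi(q_2) - \phi(q_1 + q_2)$ but not its sign, which is one way to see why a sign has to be resolved before the Fourier phase can be recovered.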
Jon Almazán, Byungsoo Ko, Geonmo Gu, Diane Larlus, Yannis Kalantidis
Strong image search models can be learned for a specific domain, i.e., a set of
labels, provided that some labeled images of that domain are available. A
practical visual search model, however, should be versatile enough to solve
multiple retrieval tasks simultaneously, even if those cover very different
specialized domains. Additionally, it should be able to benefit from even
unlabeled images from these various retrieval tasks. This is the more practical
scenario that we consider in this paper. We address it with the proposed
Grappa, an approach that starts from a strong pretrained model, and adapts it
to tackle multiple retrieval tasks concurrently, using only unlabeled images
from the different task domains. We extend the pretrained model with multiple
independently trained sets of adaptors that use pseudo-label sets of different
sizes, effectively mimicking different pseudo-granularities. We reconcile all
adaptor sets into a single unified model suited for all retrieval tasks by
learning fusion layers that we guide by propagating pseudo-granularity
attentions across neighbors in the feature space. Results on a benchmark
composed of six heterogeneous retrieval tasks show that the unsupervised Grappa
model improves the zero-shot performance of a state-of-the-art self-supervised
learning model, and in some places reaches or improves over a task label-aware
oracle that selects the most fitting pseudo-granularity per task.
Authors' comments: ECCV 2022
Anirudha Rayasam, Siddhartha R Thota, Avinash N Bukkittu, Sowmya Kamath
Efficiently discovering relevant Web services with respect to a specific user query has become a growing challenge owing to the incredible growth in the field of web technologies. In previous works, different clustering models have been used to address these issues. However, most traditional clustering techniques are computationally intensive and fail to address all the problems involved. Also, the current standards fail to incorporate the semantic relatedness of Web services during clustering and retrieval, resulting in decreased performance. In this paper, we propose a framework for Web services retrieval that uses a bottom-up, decentralized, and self-organising approach to cluster available services. It also provides online, dynamic computation of clusters, thus overcoming the drawbacks of traditional clustering methods. We also use the semantic similarity between Web services for the clustering process to enhance the precision and lower the recall.
Rita Ramos, Bruno Martins, Desmond Elliott, Yova Kementchedjhieva
Recent advances in image captioning have focused on scaling the data and
model size, substantially increasing the cost of pre-training and finetuning.
As an alternative to large models, we present SmallCap, which generates a
caption conditioned on an input image and related captions retrieved from a
datastore. Our model is lightweight and fast to train, as the only learned
parameters are in newly introduced cross-attention layers between a pre-trained
CLIP encoder and GPT-2 decoder. SmallCap can transfer to new domains without
additional finetuning and can exploit large-scale data in a training-free
fashion since the contents of the datastore can be readily replaced. Our
experiments show that SmallCap, trained only on COCO, has competitive
performance on this benchmark, and also transfers to other domains without
retraining, solely through retrieval from target-domain data. Further
improvement is achieved through the training-free exploitation of diverse
human-labeled and web data, which proves to be effective for a range of
domains, including the nocaps benchmark, designed to test generalization to
unseen visual concepts.
Authors' comments: Accepted to CVPR 2023
Michelle Chen Huebscher, Christian Buck, Massimiliano Ciaramita, Sascha Rothe
Learning to search is the task of building artificial agents that learn to autonomously use a search box to find information. So far, it has been shown that current language models can learn symbolic query reformulation policies, in combination with traditional term-based retrieval, but fall short of outperforming neural retrievers. We extend the previous learning-to-search setup to a hybrid environment that accepts discrete query refinement operations after a first-pass retrieval step via a dual encoder. Experiments on the BEIR task show that search agents, trained via behavioral cloning, outperform the underlying search system based on a combined dual encoder retriever and cross encoder reranker. Furthermore, we find that simple heuristic Hybrid Retrieval Environments (HRE) can improve baseline performance by several nDCG points. The search agent based on HRE (HARE) matches state-of-the-art performance, balanced in both zero-shot and in-domain evaluations, via interpretable actions, and at twice the speed.
Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos
This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While recently there have been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.
Tao Wu, Tie Luo, Donald Wunsch
Instance-level Image Retrieval (IIR), or simply Instance Retrieval, deals
with the problem of finding all the images within a dataset that contain a
query instance (e.g. an object). This paper makes the first attempt to
tackle this problem using instance-discrimination based contrastive learning
(CL). While CL has shown impressive performance for many computer vision tasks,
similar success has not been found in the field of IIR. In this work, we
approach this problem by exploring the capability of deriving discriminative
representations from pre-trained and fine-tuned CL models. To begin with, we
investigate the efficacy of transfer learning in IIR, by comparing
off-the-shelf features learned by a pre-trained deep neural network (DNN)
classifier with features learned by a CL model. The findings inspired us to
propose a new training strategy that optimizes CL towards learning IIR-oriented
features, by using an Average Precision (AP) loss together with a fine-tuning
method to learn contrastive feature representations that are tailored to IIR.
Our empirical evaluation demonstrates significant performance enhancement over
the off-the-shelf features learned from a pre-trained DNN classifier on the
challenging Oxford and Paris datasets.
Authors' comments: IEEE Symposium Series On Computational Intelligence (SSCI), December 2022. Accepted
Sebastian Hofstätter, Jiecao Chen, Karthik Raman, Hamed Zamani
Retrieval-augmented generation models offer many benefits over standalone language models: besides a textual answer to a given query they provide provenance items retrieved from an updateable knowledge base. However, they are also more complex systems and need to handle long inputs. In this work, we introduce FiD-Light to strongly increase the efficiency of the state-of-the-art retrieval-augmented FiD model, while maintaining the same level of effectiveness. Our FiD-Light model constrains the information flow from the encoder (which encodes passages separately) to the decoder (using concatenated encoded representations). Furthermore, we adapt FiD-Light with re-ranking capabilities through textual source pointers, to improve the top-ranked provenance precision. Our experiments on a diverse set of seven knowledge intensive tasks (KILT) show FiD-Light consistently improves the Pareto frontier between query latency and effectiveness. FiD-Light with source pointing sets substantial new state-of-the-art results on six KILT tasks for combined text generation and provenance retrieval evaluation, while maintaining reasonable efficiency.
Nhat-Minh Pham, Ha-Thanh Nguyen, Trong-Hop Do
This study deals with the problem of information retrieval (IR) for
Vietnamese legal texts. Despite being well researched in many languages,
information retrieval has still not received much attention from the Vietnamese
research community. This is especially true for the case of legal documents,
which are hard to process. This study proposes a new approach for information
retrieval for Vietnamese legal documents using sentence-transformer. Besides,
various experiments are conducted to make comparisons between different
transformer models, ranking scores, syllable-level, and word-level training.
The experiment results show that the proposed model outperforms models used in
current research on information retrieval for Vietnamese documents.
Authors' comments: Presented at PKAW 2022 (arXiv:2211.03888) Report-no: PKAW/2022/01
Zheng Li, Caili Guo, Xin Wang, Zerun Feng, Jenq-Neng Hwang, Zhongtian Du
There are two popular loss functions used for vision-language retrieval,
i.e., triplet loss and contrastive learning loss, both of which essentially
minimize the difference between the similarities of negative pairs and positive
pairs. More specifically, Triplet loss with Hard Negative mining (Triplet-HN),
which is widely used in existing retrieval models to improve the discriminative
ability, easily falls into local minima in training. On the other hand,
Vision-Language Contrastive learning loss (VLC), which is widely used in the
vision-language pre-training, has been shown to achieve significant performance
gains on vision-language retrieval, but the performance of fine-tuning with VLC
on small datasets is not satisfactory. This paper proposes a unified loss of
pair similarity optimization for vision-language retrieval, providing a
powerful tool for understanding existing loss functions. Our unified loss
includes the hard sample mining strategy of VLC and introduces the margin used
by the triplet loss for better similarity separation. It is shown that both
Triplet-HN and VLC are special forms of our unified loss. Compared with
Triplet-HN, our unified loss converges faster. Compared with
VLC, our unified loss is more discriminative and can provide better
generalization in downstream fine-tuning tasks. Experiments on image-text and
video-text retrieval benchmarks show that our unified loss can significantly
improve the performance of the state-of-the-art retrieval models.
Authors' comments: 16 pages, 5 figures
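For readers unfamiliar with the two baseline losses named in the abstract above, the following is a minimal sketch of their standard textbook forms over a query-by-candidate similarity matrix. It does not implement the paper's unified loss, whose exact form is not given here; the function names, the `margin` and `temperature` defaults, and the toy matrix are illustrative assumptions.

```python
import math

def triplet_hn_loss(sim, margin=0.2):
    """Triplet loss with hardest-negative mining (one retrieval direction).

    sim[i][j] is the similarity between query i and candidate j;
    the diagonal entries sim[i][i] are the positive pairs.
    """
    total = 0.0
    for i, row in enumerate(sim):
        # Only the single most similar negative contributes to the loss.
        hardest_neg = max(s for j, s in enumerate(row) if j != i)
        total += max(0.0, margin + hardest_neg - row[i])
    return total / len(sim)

def contrastive_loss(sim, temperature=0.05):
    """InfoNCE-style contrastive loss over the same similarity matrix."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)

# Toy 3x3 similarity matrix (assumed values, for illustration only).
sim = [[0.9, 0.3, 0.1],
       [0.2, 0.8, 0.7],
       [0.4, 0.1, 0.6]]
print(triplet_hn_loss(sim))
print(contrastive_loss(sim))
```

In this sketch the triplet variant looks only at the hardest negative per query and uses a margin, while the contrastive variant weights all negatives through a softmax; these are the two ingredients the abstract says the unified loss combines.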
Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen
Research on text-to-image generation has witnessed significant progress in
generating diverse and photo-realistic images, driven by diffusion and
auto-regressive models trained on large-scale image-text data. Though
state-of-the-art models can generate high-quality images of common entities,
they often have difficulty generating images of uncommon entities, such as
'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the
Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model
that uses retrieved information to produce high-fidelity and faithful images,
even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an
external multi-modal knowledge base to retrieve relevant (image, text) pairs
and uses them as references to generate the image. With this retrieval step,
Re-Imagen is augmented with the knowledge of high-level semantics and low-level
visual details of the mentioned entities, and thus improves its accuracy in
generating the entities' visual appearances. We train Re-Imagen on a
constructed dataset containing (image, text, retrieval) triples to teach the
model to ground on both text prompt and retrieval. Furthermore, we develop a
new sampling strategy to interleave the classifier-free guidance for text and
retrieval conditions to balance the text and retrieval alignment. Re-Imagen
achieves significant gains in FID score on COCO and WikiImage. To further
evaluate the capabilities of the model, we introduce EntityDrawBench, a new
benchmark that evaluates image generation for diverse entities, from frequent
to rare, across multiple object categories including dogs, foods, landmarks,
birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen
can significantly improve the fidelity of generated images, especially on less
frequent entities.
Authors' comments: 9 pages
Maddalena Amendola, Andrea Passarella, Raffaele Perego
Social Search research deals with studying methodologies exploiting social information to better satisfy user information needs in Online Social Media while simplifying the search effort and consequently reducing the time spent and the computational resources utilized. Starting from previous studies, in this work, we analyze the current state of the art of the Social Search area, proposing a new taxonomy and highlighting current limitations and open research directions. We divide the Social Search area into three subcategories, where the social aspect plays a pivotal role: Social Question&Answering, Social Content Search, and Social Collaborative Search. For each subcategory, we present the key concepts and selected representative approaches in the literature in greater detail. We found that, up to now, a large body of studies models users' preferences and their relations by simply combining social features made available by social platforms. This paves the way for significant research to exploit more structured information about users' social profiles and behaviors (as they can be inferred from data available on social platforms) to better serve their information needs.
Chengzhi Lin, Ancong Wu, Junwei Liang, Jun Zhang, Wenhang Ge, Wei-Shi Zheng, Chunhua Shen
Cross-modal retrieval between videos and texts has gained increasing research
interest due to the rapid emergence of videos on the web. Generally, a video
contains rich instance and event information and the query text only describes
a part of the information. Thus, a video can correspond to multiple different
text descriptions and queries. We call this phenomenon the "Video-Text
Correspondence Ambiguity" problem. Current techniques mostly concentrate on
mining local or multi-level alignment between contents of a video and text
(e.g., object to entity and action to verb). It is difficult for these
methods to alleviate the video-text correspondence ambiguity by describing a
video using only one single feature, which is required to be matched with
multiple different text features at the same time. To address this problem, we
propose a Text-Adaptive Multiple Visual Prototype Matching model, which
automatically captures multiple prototypes to describe a video by adaptive
aggregation of video token features. Given a query text, the similarity is
determined by the most similar prototype to find correspondence in the video,
which is termed text-adaptive matching. To learn diverse prototypes for
representing the rich information in videos, we propose a variance loss to
encourage different prototypes to attend to different contents of the video.
Our method outperforms state-of-the-art methods on four public video retrieval
datasets.
Authors' comments: NIPS2022
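The text-adaptive matching rule described in the abstract above (score a video by its prototype most similar to the query text) can be sketched as follows. Cosine similarity, the function names, and the toy vectors are assumptions made for illustration; the abstract does not specify the similarity function or how the prototypes are computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def text_adaptive_similarity(text_feat, video_prototypes):
    """Score a video by the prototype that best matches the query text."""
    return max(cosine(text_feat, p) for p in video_prototypes)

# Two hypothetical prototypes attending to different contents of one video.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
print(text_adaptive_similarity([0.0, 2.0], prototypes))
```

Taking the maximum means that a query describing only part of a video can still match well, since it is compared against whichever prototype covers that part rather than against one averaged feature.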
Cheng-An Hsieh, Cheng-Ping Hsieh, Pu-Jen Cheng
Multimodal learning is a recent challenge that extends unimodal learning by
generalizing its domain to diverse modalities, such as texts, images, or
speech. This extension requires models to process and relate information from
multiple modalities. In Information Retrieval, traditional retrieval tasks
focus on the similarity between unimodal documents and queries, while
image-text retrieval hypothesizes that most texts contain the scene context
from images. This separation has ignored that real-world queries may involve
text content, image captions, or both. To address this, we introduce Multimodal
Retrieval on Representation of ImaGe witH Text (Mr. Right), a novel and
comprehensive dataset for multimodal retrieval. We utilize the Wikipedia
dataset with rich text-image examples and generate three types of text-based
queries with different modality information: text-related, image-related, and
mixed. To validate the effectiveness of our dataset, we provide a multimodal
training paradigm and evaluate previous text retrieval and image retrieval
frameworks. The results show that the proposed multimodal retrieval can improve
retrieval performance, but creating a well-unified document representation with
texts and images is still a challenge. We hope Mr. Right allows us to better
broaden current retrieval systems and contributes to accelerating the
advancement of multimodal learning in Information Retrieval.
Authors' comments: Dataset available at https://github.com/hsiehjackson/Mr.Right
Yufeng Shi, Shujian Yu, Duanquan Xu, Xinge You
Zero-shot cross-modal retrieval (ZS-CMR) deals with the retrieval problem among heterogeneous data from unseen classes. Typically, to guarantee generalization, the pre-defined class embeddings from natural language processing (NLP) models are used to build a common space. In this paper, instead of using an extra NLP model to define a common space beforehand, we consider a totally different way to construct (or learn) a common Hamming space from an information-theoretic perspective. We term our model Information-Theoretic Hashing (ITH), which is composed of two cascading modules: an Adaptive Information Aggregation (AIA) module and a Semantic Preserving Encoding (SPE) module. Specifically, our AIA module takes inspiration from the Principle of Relevant Information (PRI) to construct a common space that adaptively aggregates the intrinsic semantics of different modalities of data and filters out redundant or irrelevant information. On the other hand, our SPE module further generates the hashing codes of different modalities by preserving the similarity of intrinsic semantics with the element-wise Kullback-Leibler (KL) divergence. A total correlation regularization term is also imposed to reduce the redundancy amongst different dimensions of hash codes. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed ITH in ZS-CMR. Source code is available in the supplementary material.
Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu et al.
Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we propose working on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-base Query Generation for Retriever (Promptagator), which leverages large language models (LLM) as a few-shot query generator, and creates task-specific retrievers based on the generated data. Powered by LLM's generalization ability, Promptagator makes it possible to create task-specific end-to-end retrievers solely based on a few examples, without using Natural Questions or MS MARCO to train question generators or dual encoders. Surprisingly, LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO like ColBERT v2 by more than 1.2 nDCG on average on 11 retrieval sets. Further training standard-size re-rankers using the same generated data yields another 5.0 point nDCG improvement. Our studies determine that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.
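Since the abstract above reports its gains in nDCG, a minimal sketch of that metric may help. This is the linear-gain variant with a log2 rank discount; evaluation toolkits differ in details (for example exponential gain, or how the ideal ranking is formed), so the function names and the simplification that `relevances` covers every judged document for the query are assumptions of this example.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking.

    `relevances` lists the graded relevance of every judged document
    for one query, in the order the system ranked them.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# The single relevant document is ranked second instead of first.
print(ndcg_at_k([0, 1, 0], k=3))
```

A score of 1.0 means the ranking is ideal for that query; system-level numbers are obtained by averaging over queries and, in multi-dataset benchmarks, over datasets.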
Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu
As an increasingly popular task in multimedia information retrieval, video
moment retrieval (VMR) aims to localize the target moment from an untrimmed
video according to a given language query. Most previous methods depend heavily
on numerous manual annotations (i.e., moment boundaries), which are extremely
expensive to acquire in practice. In addition, due to the domain gap between
different datasets, directly applying these pre-trained models to an unseen
domain leads to a significant performance drop. In this paper, we focus on a
novel task: cross-domain VMR, where fully-annotated datasets are available in
one domain ("source domain"), but the domain of interest ("target domain")
only contains unannotated datasets. As far as we know, we present the first
study on cross-domain VMR. To address this new task, we propose a novel
Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation
knowledge from the source domain to the target domain. However, due to the
domain discrepancy between the source and target domains and the semantic gap
between videos and queries, directly applying trained models to the target
domain generally leads to a performance drop. To solve this problem, we develop
three novel modules: (i) a domain alignment module is designed to align the
feature distributions between different domains of each modality; (ii) a
cross-modal alignment module aims to map both video and query features into a
joint embedding space and to align the feature distributions between different
modalities in the target domain; (iii) a specific alignment module tries to
obtain the fine-grained similarity between a specific frame and the given query
for optimal localization. By jointly training these three modules, our MMCDA
can learn domain-invariant and semantic-aligned cross-modal representations.
Authors' comments: Accepted by IEEE Transactions on Multimedia
Huang Xie, Samuel Lipping, Tuomas Virtanen
Language-based audio retrieval is a task in which natural language textual
captions are used as queries to retrieve audio signals from a dataset. It was
first introduced in the DCASE 2022 Challenge as Subtask 6B of Task 6, which
aims at developing computational systems to model relationships between audio
signals and free-form textual descriptions. Compared with audio captioning
(Subtask 6A), which is about generating audio captions for audio signals,
language-based audio retrieval (Subtask 6B) focuses on ranking audio signals
according to their relevance to natural language textual captions. In the DCASE
2022 Challenge, the provided baseline system for Subtask 6B was significantly
outperformed, with top performance being 0.276 in mAP@10. This paper presents
the outcome of Subtask 6B in terms of submitted systems' performance and
analysis.
Authors' comments: Update for arXiv:2206.06108 mistakenly submitted as a new article
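The mAP@10 figure quoted in the abstract above can be illustrated with a small sketch of mean average precision truncated at rank 10. Normalization conventions vary between toolkits; dividing by `min(len(relevant_ids), k)` is an assumption of this example, as are the function names and the toy item identifiers, and the challenge's official scorer may differ.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k ranked items for one query."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0

def mean_average_precision_at_k(runs, k=10):
    """Mean of AP@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Two caption queries, each with exactly one matching audio clip.
runs = [(["a", "b", "c"], {"a"}),
        (["a", "b", "c"], {"b"})]
print(mean_average_precision_at_k(runs))
```

When each query has a single relevant item, AP@10 reduces to the reciprocal rank of that item if it appears in the top 10, and zero otherwise.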
Ling Luo, Yulia Gryaditskaya, Yongxin Yang, Tao Xiang, Yi-Zhe Song
The growth of free online 3D shape collections has driven research on 3D retrieval. There has, however, been active debate on (i) what the best input modality is to trigger retrieval, and (ii) the ultimate usage scenario for such retrieval. In this paper, we offer a different perspective towards answering these questions: we study the use of 3D sketches as an input modality and advocate a VR scenario where retrieval is conducted. Thus, the ultimate vision is that users can freely retrieve a 3D model by air-doodling in a VR environment. As a first stab at this new 3D VR-sketch to 3D shape retrieval problem, we make four contributions. First, we code a VR utility to collect 3D VR-sketches and conduct retrieval. Second, we collect the first set of 167 3D VR-sketches on two shape categories from ModelNet. Third, we propose a novel approach to generate a synthetic dataset of human-like 3D sketches of different abstract levels to train deep networks. Finally, we compare the common multi-view and volumetric approaches: we show that, in contrast to 3D shape to 3D shape retrieval, volumetric point-based approaches exhibit superior performance on 3D sketch to 3D shape retrieval due to the sparse and abstract nature of 3D VR-sketches. We believe these contributions will collectively serve as enablers for future attempts at this problem. The VR interface, code and datasets are available at https://tinyurl.com/3DSketch3DV.