Xun Yang, Fuli Feng, Wei Ji, Meng Wang, Tat-Seng Chua
We tackle the task of video moment retrieval (VMR), which aims to localize a
specific moment in a video according to a textual query. Existing methods
primarily model the matching relationship between query and moment by complex
cross-modal interactions. Despite their effectiveness, current models mostly
exploit dataset biases while ignoring the video content, thus leading to poor
generalizability. We argue that the issue is caused by the hidden confounder in
VMR, i.e., the temporal location of moments, that spuriously correlates the model
input and prediction. How to design robust matching models against the temporal
location biases is crucial but, as far as we know, has not been studied yet for
VMR.
To fill the research gap, we propose a causality-inspired VMR framework that
builds a structural causal model to capture the true effect of query and video
content on the prediction. Specifically, we develop a Deconfounded Cross-modal
Matching (DCM) method to remove the confounding effects of moment location. It
first disentangles moment representation to infer the core feature of visual
content, and then applies causal intervention on the disentangled multimodal
input based on backdoor adjustment, which forces the model to fairly
incorporate each possible location of the target into consideration. Extensive
experiments clearly show that our approach can achieve significant improvement
over the state-of-the-art methods in terms of both accuracy and generalization
(Code: https://github.com/Xun-Yang/Causal_Video_Moment_Retrieval).
Authors' comments: This work has been accepted by SIGIR 2021
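For readers unfamiliar with the backdoor adjustment named in the abstract above, its standard textbook form is given here. The symbols are assumptions chosen for illustration and are not taken from the paper: X stands for the query and video input, Y for the prediction, and Z for the confounder (the temporal location of moments).

```latex
% Ordinary conditioning: the confounder is weighted by how often it co-occurs with X.
P(Y \mid X) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z \mid X)

% Backdoor adjustment: each value of the confounder is weighted by its prior instead.
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)
```

The only difference is the weight on each z: replacing P(Z = z | X) with P(Z = z) is what makes every possible location count according to its prior rather than according to its correlation with the input.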
Han Wang, Yang Liu, Chenguang Zhu, Linjun Shou, Ming Gong, Yichong Xu, Michael Zeng
Commonsense generation is a challenging task of generating a plausible
sentence describing an everyday scenario using provided concepts. Its
requirement of reasoning over commonsense knowledge and compositional
generalization ability even puzzles strong pre-trained language generation
models. We propose a novel framework using retrieval methods to enhance both
the pre-training and fine-tuning for commonsense generation. We retrieve
prototype sentence candidates by concept matching and use them as auxiliary
input. For fine-tuning, we further boost its performance with a trainable
sentence retriever. We demonstrate experimentally on the large-scale CommonGen
benchmark that our approach achieves new state-of-the-art results.
Authors' comments: Findings of ACL-IJCNLP 2021
Josef Knapp, Alexander Paulus, Jonas Kornprobst, Uwe Siart, Thomas F. Eibert
Phase retrieval problems in antenna measurements arise when a reference phase
cannot be provided to all measurement locations. Phase retrieval algorithms
require sufficiently many independent measurement samples of the radiated
fields to be successful. Larger amounts of independent data may improve the
reconstruction of the phase information from magnitude-only measurements. We
show how the knowledge of relative phases among the spectral components of a
modulated signal at the individual measurement locations may be employed to
reconstruct the relative phases between different measurement locations at all
frequencies. Projection matrices map the estimated phases onto the space of
fields possibly generated by equivalent antenna under test (AUT) sources at all
frequencies. In this way, the phase of the reconstructed solution is not only
restricted by the measurement samples at one frequency, but by the samples at
all frequencies simultaneously. The proposed method can increase the amount of
independent phase information even if all probes are located in the far field
of the AUT.
Authors' comments: 14 pages, 29 figures, 1 table, published in IEEE Transactions on
Antennas and Propagation
Conghui Hu, Yongxin Yang, Yunpeng Li, Timothy M. Hospedales, Yi-Zhe Song
The practical value of existing supervised sketch-based image retrieval (SBIR) algorithms is largely limited by the requirement for intensive data collection and labeling. In this paper, we present the first attempt at unsupervised SBIR to remove the labeling cost (both category annotations and sketch-photo pairings) that is conventionally needed for training. Existing single-domain unsupervised representation learning methods perform poorly in this application, due to the unique cross-domain (sketch and photo) nature of the problem. We therefore introduce a novel framework that simultaneously performs sketch-photo domain alignment and semantic-aware representation learning. Technically this is underpinned by introducing joint distribution optimal transport (JDOT) to align data from different domains, which we extend with trainable cluster prototypes and feature memory banks to further improve scalability and efficacy. Extensive experiments show that our framework achieves excellent performance in the new unsupervised setting, and performs comparably to or better than the state of the art in the zero-shot setting.
Shah Riya Chiragkumar
Music Information Retrieval (MIR) is a collaborative field of scientific study that
helps to build innovative information research themes, novel frameworks, and
connected delivery mechanisms, in addition to making the world's
massive collection of music open to everyone. For modern rock music, tempo proved
difficult to estimate and chord recognition did not work. All of the
findings indicate that modern rock and metal music can be analysed, despite its
complexity, but that further research is needed in this area to make it useful.
Using a neural network has been one of the simplest ways of dealing with it.
The pitch class profile vector is used in the neural network method. Because
the vector only contains 12 elements of semi-tone values, it is enough for
chord recognition. There are other ways of achieving this; most
of them also depend on pitch class profiling to transform the chord into a form that
can be recognised, but their recognition process is time-consuming because it relies on
extremely complicated and memory-intensive methods.
Authors' comments: work in progress
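The 12-element pitch class profile described in the abstract above can be illustrated with a small sketch. This is not the author's method: the abstract feeds the profile to a neural network, whereas the sketch below swaps in plain template matching against major and minor triads so that it stays self-contained. The function names and the use of MIDI note numbers as input are assumptions made for illustration.

```python
# Illustrative sketch only: template matching stands in for the neural network
# mentioned in the abstract; all names here are hypothetical.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class_profile(midi_notes):
    """Fold MIDI note numbers into a 12-element semitone histogram that sums to 1."""
    profile = [0.0] * 12
    for note in midi_notes:
        profile[note % 12] += 1.0
    total = sum(profile)
    return [value / total for value in profile] if total else profile


def chord_templates():
    """Build binary 12-element templates for the 24 major and minor triads."""
    templates = {}
    for root in range(12):
        for suffix, third in (("", 4), ("m", 3)):
            template = [0.0] * 12
            for interval in (0, third, 7):
                template[(root + interval) % 12] = 1.0
            templates[NOTE_NAMES[root] + suffix] = template
    return templates


def recognise_chord(profile):
    """Return the triad whose template has the largest dot product with the profile."""
    templates = chord_templates()
    return max(
        templates,
        key=lambda name: sum(p * t for p, t in zip(profile, templates[name])),
    )
```

For example, `recognise_chord(pitch_class_profile([60, 64, 67]))` scores the notes C, E and G against all 24 templates and picks the C major triad.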
Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh
Given a collection of untrimmed and unsegmented videos, video corpus moment
retrieval (VCMR) is to retrieve a temporal moment (i.e., a fraction of a video)
that semantically corresponds to a given text query. As video and text are from
two distinct feature spaces, there are two general approaches to address VCMR:
(i) to separately encode each modality's representation, then align the two
modality representations for query processing, and (ii) to adopt fine-grained
cross-modal interaction to learn multi-modal representations for query
processing. While the second approach often leads to better retrieval accuracy,
the first approach is far more efficient. In this paper, we propose a Retrieval
and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We
adopt the first approach and introduce two contrastive learning objectives to
refine the video encoder and text encoder to learn video and text representations
separately but with better alignment for VCMR. The video contrastive learning
(VideoCL) is to maximize mutual information between query and candidate video
at the video level. The frame contrastive learning (FrameCL) aims to highlight the
moment region that corresponds to the query at the frame level, within a video.
Experimental results show that, although ReLoCLNet encodes text and video
separately for efficiency, its retrieval accuracy is comparable with baselines
adopting cross-modal interaction learning.
Authors' comments: 11 pages, 7 figures and 6 tables. Accepted by SIGIR 2021
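The abstract above describes a contrastive objective that maximizes mutual information between a query and its matching video. One widely used objective of this kind is InfoNCE, which maximizes a lower bound on mutual information. The abstract does not state the exact loss used in ReLoCLNet, so the sketch below is only a generic InfoNCE under assumed names (`info_nce_loss`, `temperature`) and with dot-product similarity as an assumed scoring function.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, candidates, positive_index, temperature=0.1):
    """Generic InfoNCE: negative log-softmax of the positive candidate's score.

    query: list of floats; candidates: list of vectors; positive_index: index of
    the matching candidate. The temperature value is an arbitrary illustration.
    """
    logits = [dot(query, candidate) / temperature for candidate in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_sum_exp - logits[positive_index]
```

The loss is small when the query scores its matching candidate well above the other candidates, and large when a non-matching candidate scores higher.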
Sanya B. Taneja, Richard D. Boyce, William T. Reynolds, Denis Newman-Griffis
Introducing biomedical informatics (BMI) students to natural language
processing (NLP) requires balancing technical depth with practical know-how to
address application-focused needs. We developed a set of three activities
introducing beginning BMI students to information retrieval with NLP,
covering document representation strategies and language models from TF-IDF to
BERT. These activities provide students with hands-on experience targeted
towards common use cases, and introduce fundamental components of NLP workflows
for a wide variety of applications.
Authors' comments: To appear in the Proceedings of the Fifth Workshop on Teaching NLP @
NAACL
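As a rough illustration of the simplest document representation named in the abstract above (TF-IDF), the sketch below ranks documents against a query by cosine similarity. It is not taken from the course activities: the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen to keep the example dependency-free.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_tfidf(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    numerator = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    denominator = norm_u * norm_v
    return numerator / denominator if denominator else 0.0


def search(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = build_tfidf(docs)
    counts = Counter(tokenize(query))
    query_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [cosine(query_vec, vec) for vec in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Query terms that never occur in the collection are dropped, so a query made only of unseen words scores zero against every document.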
Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie
We consider the task of retrieving audio using free-form natural language
queries. To study this problem, which has received limited attention in the
existing literature, we introduce challenging new benchmarks for text-based
audio retrieval using text annotations sourced from the Audiocaps and Clotho
datasets. We then employ these benchmarks to establish baselines for
cross-modal audio retrieval, where we demonstrate the benefits of pre-training
on diverse audio tasks. We hope that our benchmarks will inspire further
research into cross-modal text-based audio retrieval with free-form text
queries.
Authors' comments: Accepted at INTERSPEECH 2021
Hengxin Fun, Sunil Gandhi, Sujith Ravi
Recently, there have been significant advances in neural methods for tackling knowledge-intensive tasks such as open domain question answering (QA). These advances are fueled by combining large pre-trained language models with learnable retrieval of documents. The majority of these models use separate encoders for learning query representation, passage representation for the retriever and an additional encoder for the downstream task. Using separate encoders for each stage/task occupies a lot of memory and makes it difficult to scale to a large number of tasks. In this paper, we propose a novel Retrieval Optimized Multi-task (ROM) framework for jointly training self-supervised tasks, knowledge retrieval, and extractive question answering. Our ROM approach presents a unified and generalizable framework that enables scaling efficiently to multiple tasks, varying levels of supervision, and optimization choices such as different learning schedules without changing the model architecture. It also provides the flexibility of changing the encoders without changing the architecture of the system. Using our framework, we achieve comparable or better performance than recent methods on QA, while drastically reducing the number of parameters.
Xiangteng He, Yulin Pan, Mingqian Tang, Yiliang Lv
Content-based video retrieval aims to find videos from a large video database that are similar to or even near-duplicates of a given query video. Video representation and similarity search algorithms are crucial to any video retrieval system. To derive effective video representation, most video retrieval systems require a large amount of manually annotated data for training, making them costly and inefficient. In addition, most retrieval systems are based on frame-level features for video similarity searching, making them expensive in both storage and search. We propose a novel video retrieval system, termed SVRTN, that effectively addresses the above shortcomings. It first applies self-supervised training to effectively learn video representation from unlabeled data to avoid the expensive cost of manual annotation. Then, it exploits a transformer structure to aggregate frame-level features into clip-level features to reduce both storage space and search complexity. It can learn the complementary and discriminative information from the interactions among clip frames, as well as acquire invariance to frame permutation and missing frames to support more flexible retrieval. Comprehensive experiments on two challenging video retrieval datasets, namely FIVR-200K and SVD, verify the effectiveness of our proposed SVRTN method, which achieves the best video retrieval performance in both accuracy and efficiency.
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, Jason Weston
Despite showing increasingly human-like conversational abilities, state-of-the-art dialogue models often suffer from factual incorrectness and hallucination of knowledge (Roller et al., 2020). In this work we explore the use of neural-retrieval-in-the-loop architectures - recently shown to be effective in open-domain QA (Lewis et al., 2020b; Izacard and Grave, 2020) - for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses. We study various types of architectures with multiple components - retrievers, rankers, and encoder-decoders - with the goal of maximizing knowledgeability while retaining conversational ability. We demonstrate that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks. The models exhibit open-domain conversational capabilities, generalize effectively to scenarios not within the training data, and, as verified by human evaluations, substantially reduce the well-known problem of knowledge hallucination in state-of-the-art chatbots.
Kai Wang, Luis Herranz, Joost van de Weijer
Multimodal representations and continual learning are two areas closely
related to human intelligence. The former considers the learning of shared
representation spaces where information from different modalities can be
compared and integrated (we focus on cross-modal retrieval between language and
visual representations). The latter studies how to prevent forgetting a
previously learned task when learning a new one. While humans excel in these
two aspects, deep neural networks are still quite limited. In this paper, we
propose a combination of both problems into a continual cross-modal retrieval
setting, where we study how the catastrophic interference caused by new tasks
impacts the embedding spaces and their cross-modal alignment required for
effective retrieval. We propose a general framework that decouples the
training, indexing and querying stages. We also identify and study different
factors that may lead to forgetting, and propose tools to alleviate it. We
found that the indexing stage plays an important role and that simply avoiding
reindexing the database with updated embedding networks can lead to significant
gains. We evaluated our methods in two image-text retrieval datasets, obtaining
significant gains with respect to the fine tuning baseline.
Authors' comments: 2nd CLVISION workshop in CVPR 2021
Xueguang Ma, Kai Sun, Ronak Pradeep, Jimmy Lin
Text retrieval using learned dense representations has recently emerged as a promising alternative to "traditional" text retrieval using sparse bag-of-words representations. One recent work that has garnered much attention is the dense passage retriever (DPR) technique proposed by Karpukhin et al. (2020) for end-to-end open-domain question answering. We present a replication study of this work, starting with model checkpoints provided by the authors, but otherwise from an independent implementation in our group's Pyserini IR toolkit and PyGaggle neural text ranking library. Although our experimental results largely verify the claims of the original paper, we arrived at two important additional findings that contribute to a better understanding of DPR: First, it appears that the original authors under-report the effectiveness of the BM25 baseline and hence also dense-sparse hybrid retrieval results. Second, by incorporating evidence from the retriever and an improved answer span scoring technique, we are able to improve end-to-end question answering effectiveness using exactly the same models as in the original work.
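The abstract above mentions dense-sparse hybrid retrieval without giving the fusion rule. One simple and common choice is a weighted sum of the dense score and the sparse (for example BM25) score for each document. The sketch below assumes that choice; the function name, the `alpha` weight, and the rule of giving a missing document the lowest score seen in that list are illustrative assumptions rather than details from the paper or from Pyserini.

```python
def hybrid_scores(dense, sparse, alpha=1.0):
    """Fuse two score tables by weighted sum and return (doc_id, score) pairs, best first.

    dense, sparse: dicts mapping document id to retrieval score.
    alpha: weight on the sparse score (an illustrative hyperparameter).
    A document missing from one list is assigned the lowest score in that list.
    """
    doc_ids = set(dense) | set(sparse)
    dense_floor = min(dense.values()) if dense else 0.0
    sparse_floor = min(sparse.values()) if sparse else 0.0
    fused = {
        doc_id: dense.get(doc_id, dense_floor) + alpha * sparse.get(doc_id, sparse_floor)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Because dense and sparse scores usually live on different scales, `alpha` would normally be tuned on held-out data rather than fixed in advance.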
Hyounghun Kim, Abhay Zala, Graham Burri, Mohit Bansal
Interest in physical therapy and individual exercises such as yoga/dance has
increased alongside the well-being trend. However, such exercises are hard to
follow without expert guidance (which is impossible to scale for personalized
feedback to every trainee remotely). Thus, automated pose correction systems
are required more than ever, and we introduce a new captioning dataset named
FixMyPose to address this need. We collect descriptions of correcting a
"current" pose to look like a "target" pose (in both English and Hindi). The
collected descriptions have interesting linguistic properties such as
egocentric relations to environment objects, analogous references, etc.,
requiring an understanding of spatial relations and commonsense knowledge about
postures. Further, to avoid ML biases, we maintain a balance across characters
with diverse demographics, who perform a variety of movements in several
interior environments (e.g., homes, offices). From our dataset, we introduce
the pose-correctional-captioning task and its reverse target-pose-retrieval
task. During the correctional-captioning task, models must generate
descriptions of how to move from the current to target pose image, whereas in
the retrieval task, models should select the correct target pose given the
initial pose and correctional description. We present strong cross-attention
baseline models (uni/multimodal, RL, multilingual) and also show that our
baselines are competitive with other models when evaluated on other
image-difference datasets. We also propose new task-specific metrics
(object-match, body-part-match, direction-match) and conduct human evaluation
for more reliable evaluation, and we demonstrate a large human-model
performance gap suggesting room for promising future work. To verify the
sim-to-real transfer of our FixMyPose dataset, we collect a set of real images
and show promising performance on these images.
Authors' comments: AAAI 2021 (18 pages, 16 figures; webpage:
https://fixmypose-unc.github.io/)
Fuwen Tan, Jiangbo Yuan, Vicente Ordonez
Instance-level image retrieval is the task of searching in a large database
for images that match an object in a query image. To address this task, systems
usually rely on a retrieval step that uses global image descriptors, and a
subsequent step that performs domain-specific refinements or reranking by
leveraging operations such as geometric verification based on local features.
In this work, we propose Reranking Transformers (RRTs) as a general model to
incorporate both local and global features to rerank the matching images in a
supervised fashion and thus replace the relatively expensive process of
geometric verification. RRTs are lightweight and can be easily parallelized so
that reranking a set of top matching results can be performed in a single
forward-pass. We perform extensive experiments on the Revisited Oxford and
Paris datasets, and the Google Landmarks v2 dataset, showing that RRTs
outperform previous reranking approaches while using far fewer local
descriptors. Moreover, we demonstrate that, unlike existing approaches, RRTs
can be optimized jointly with the feature extractor, which can lead to feature
representations tailored to downstream tasks and further accuracy improvements.
The code and trained models are publicly available at
https://github.com/uvavision/RerankingTransformer.
Authors' comments: ICCV 2021, Table-3 corrected
Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko
We present a new state-of-the-art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, state-of-the-art results are achieved with a single model on two datasets without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on different datasets can improve test results of each other. Additionally, we checked the intersection between many popular datasets and found that MSRVTT has a significant overlap between the test and the train parts, and the same situation is observed for ActivityNet.
Michael Wray, Hazel Doughty, Dima Damen
Current video retrieval efforts all found their evaluation on an
instance-based assumption, that only a single caption is relevant to a query
video and vice versa. We demonstrate that this assumption results in
performance comparisons often not indicative of models' retrieval capabilities.
We propose a move to semantic similarity video retrieval, where (i) multiple
videos/captions can be deemed equally relevant, and their relative ranking does
not affect a method's reported performance and (ii) retrieved videos/captions
are ranked by their similarity to a query. We propose several proxies to
estimate semantic similarities in large-scale retrieval datasets, without
additional annotations. Our analysis is performed on three commonly used video
retrieval datasets (MSR-VTT, YouCook2 and EPIC-KITCHENS).
Authors' comments: Accepted in CVPR 2021. Project Page: https://mwray.github.io/SSVR/
Chen Qu, Liu Yang, Cen Chen, W. Bruce Croft, Kalpesh Krishna, Mohit Iyyer
Recent studies on Question Answering (QA) and Conversational QA (ConvQA)
emphasize the role of retrieval: a system first retrieves evidence from a large
collection and then extracts answers. This open-retrieval ConvQA setting
typically assumes that each question is answerable by a single span of text
within a particular passage (a span answer). The supervision signal is thus
derived from whether or not the system can recover an exact match of this
ground-truth answer span from the retrieved passages. This method is referred
to as span-match weak supervision. However, information-seeking conversations
are challenging for this span-match method since long answers, especially
freeform answers, are not necessarily strict spans of any passage. Therefore,
we introduce a learned weak supervision approach that can identify a
paraphrased span of the known answer in a passage. Our experiments on QuAC and
CoQA datasets show that the span-match weak supervisor can only handle
conversations with span answers, and has less satisfactory results for freeform
answers generated by people. Our method is more flexible as it can handle both
span answers and freeform answers. Moreover, our method can be more powerful
when combined with the span-match method, which shows that it is complementary to the
span-match method. We also conduct in-depth analyses to show more insights on
open-retrieval ConvQA under a weak supervision setting.
Authors' comments: Accepted to ECIR'21
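The span-match weak supervision described in the abstract above can be sketched in a few lines: a retrieved passage receives a positive label only if it contains the known answer verbatim. The normalization step (lowercasing and collapsing whitespace) and the function names are assumptions for illustration; the abstract does not specify how matching is normalized.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def span_match_labels(answer, passages):
    """Label each passage 1 if it contains the answer as an exact span, else 0."""
    target = normalize(answer)
    return [1 if target and target in normalize(passage) else 0 for passage in passages]
```

A passage that conveys the same answer in different words gets label 0 under this rule, which is the limitation for freeform answers that the abstract points out.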
William H. B. Smith, Michael Milford, Klaus D. McDonald-Maier, Shoaib Ehsan
Visual navigation localizes a query place image against a reference database
of place images, also known as a 'visual map'. Localization accuracy
requirements for specific areas of the visual map, 'scene classes', vary
according to the context of the environment and task. State-of-the-art visual
mapping is unable to reflect these requirements by explicitly targeting scene
classes for inclusion in the map. Four different scene classes, including
pedestrian crossings and stations, are identified in each of the Nordland and
St. Lucia datasets. Instead of re-training separate scene classifiers, which
struggle with these overlapping scene classes, we make our first contribution:
defining the problem of 'scene retrieval'. Scene retrieval extends image
retrieval to classification of scenes defined at test time by associating a
single query image to reference images of scene classes. Our second
contribution is a triplet-trained convolutional neural network (CNN) to address
this problem which increases scene classification accuracy by up to 7% against
state-of-the-art networks pre-trained for scene recognition. The third
contribution is an algorithm, 'DMC', that combines our scene classification with
distance and memorability for visual mapping. Our analysis shows that DMC
includes 64% more images of our chosen scene classes in a visual map than just
using distance interval mapping. State-of-the-art visual place descriptors
AMOS-Net, Hybrid-Net and NetVLAD are finally used to show that DMC improves
scene class localization accuracy by a mean of 3% and localization accuracy of
the remaining map images by a mean of 10% across both datasets.
Authors' comments: 8-page paper on visual place recognition and scene classification
Rita Parada Ramos, Patrícia Pereira, Helena Moniz, Joao Paulo Carvalho, Bruno Martins
Deep neural networks have achieved state-of-the-art results in various vision
and/or language tasks. Despite the use of large training datasets, most models
are trained by iterating over single input-output pairs, discarding the
remaining examples for the current prediction. In this work, we actively
exploit the training data, using the information from nearest training examples
to aid the prediction both during training and testing. Specifically, our
approach uses the target of the most similar training example to initialize the
memory state of an LSTM model, or to guide attention mechanisms. We apply this
approach to image captioning and sentiment analysis, respectively through image
and text retrieval. Results confirm the effectiveness of the proposed approach
for the two tasks, on the widely used Flickr8 and IMDB datasets. Our code is
publicly available at http://github.com/RitaRamo/retrieval-augmentation-nn.
Authors' comments: Accepted at IJCNN 2021
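The retrieval step described in the abstract above, finding the most similar training example and reusing its target, can be sketched as a nearest-neighbour lookup. The cosine similarity measure, the plain-list vector representation, and the function names are assumptions for illustration; how the retrieved target then initializes the LSTM memory state or guides attention is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    numerator = sum(a * b for a, b in zip(u, v))
    denominator = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return numerator / denominator if denominator else 0.0


def nearest_target(query_vec, train_vecs, train_targets):
    """Return the target attached to the training vector most similar to the query."""
    best = max(range(len(train_vecs)), key=lambda i: cosine(query_vec, train_vecs[i]))
    return train_targets[best]
```

In a real system the linear scan would typically be replaced by an approximate nearest-neighbour index once the training set grows large.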