Niranjan Uma Naresh, Ziyan Jiang, Ankit, Sungjin Lee, Jie Hao, Xing Fan, Chenlei Guo
Conversational understanding is an integral part of modern intelligent
devices. In a large fraction of the global traffic from customers using smart
digital assistants, frictions in dialogues may be attributed to incorrect
understanding of the entities in a customer's query due to factors including
ambiguous mentions, mispronunciation, background noise and faulty on-device
signal processing. Such errors are compounded by two common deficiencies of
intelligent devices, namely (1) the device not being tailored to individual
customers, and (2) the device responses being unaware of the context in the
conversation session. Viewing this problem via the lens of retrieval-based
search engines, we build and evaluate a scalable entity correction system,
PENTATRON. The system leverages a parametric transformer-based language model
to learn patterns from in-session customer-device interactions coupled with a
non-parametric personalized entity index to compute the correct query, which
aids downstream components in reasoning about the best response. In addition to
establishing baselines and demonstrating the value of personalized and
context-aware systems, we use multitasking to learn the domain of the correct
entity. We also investigate the utility of language model prompts. Through
extensive experiments, we show a significant upward movement of the key metric
(Exact Match) by up to 500.97% (relative to the baseline).
Authors' comments: EMNLP 2022
Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder et al.
Sampling proper negatives from a large document pool is vital to effectively
train a dense retrieval model. However, existing negative sampling strategies
suffer from the uninformative or false negative problem. In this work, we
empirically show that according to the measured relevance scores, the negatives
ranked around the positives are generally more informative and less likely to
be false negatives. Intuitively, these negatives are not too hard (may be
false negatives) or too easy (uninformative). They are the ambiguous
negatives and need more attention during training. Thus, we propose a simple
ambiguous negatives sampling method, SimANS, which incorporates a new sampling
probability distribution to sample more ambiguous negatives. Extensive
experiments on four public and one industry datasets show the effectiveness of
our approach. We have made the code and models publicly available at
https://github.com/microsoft/SimXNS.
Authors' comments: 12 pages, accepted by EMNLP 2022
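The idea in the SimANS abstract above, favoring negatives whose relevance score lies close to the positive's score, can be illustrated with a small sketch. The Gaussian-shaped weighting and the names `a`, `b`, `ambiguous_negative_probs` and `sample_negatives` are assumptions made for illustration; the abstract itself does not specify the form of the sampling distribution.

```python
import math
import random
from typing import List, Sequence


def ambiguous_negative_probs(neg_scores: Sequence[float], pos_score: float,
                             a: float = 1.0, b: float = 0.0) -> List[float]:
    """Assumed weighting: the probability peaks where a negative's score is
    about (pos_score + b) and decays on both sides, so very hard negatives
    (possible false negatives) and very easy ones (uninformative) get less mass."""
    weights = [math.exp(-a * (s - pos_score - b) ** 2) for s in neg_scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_negatives(neg_ids: Sequence[str], neg_scores: Sequence[float],
                     pos_score: float, k: int, a: float = 1.0, b: float = 0.0,
                     seed: int = 0) -> List[str]:
    """Draw k negatives (with replacement) according to the weights above."""
    rng = random.Random(seed)
    probs = ambiguous_negative_probs(neg_scores, pos_score, a, b)
    return rng.choices(list(neg_ids), weights=probs, k=k)


# Hypothetical relevance scores for five candidate negatives of one query.
ids = ["n1", "n2", "n3", "n4", "n5"]
scores = [0.95, 0.71, 0.60, 0.30, 0.05]
probs = ambiguous_negative_probs(scores, pos_score=0.70, a=20.0)
picked = sample_negatives(ids, scores, pos_score=0.70, k=3, a=20.0)
```

In this sketch a larger `a` concentrates the sampling more tightly around the positive, while `b` shifts the peak; both are illustrative hyperparameters here.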
Sebastian Bruch, Siyu Gai, Amir Ingber
We study hybrid search in text retrieval where lexical and semantic search are fused together with the intuition that the two are complementary in how they model relevance. In particular, we examine fusion by a convex combination (CC) of lexical and semantic scores, as well as the Reciprocal Rank Fusion (RRF) method, and identify their advantages and potential pitfalls. Contrary to existing studies, we find RRF to be sensitive to its parameters; that the learning of a CC fusion is generally agnostic to the choice of score normalization; that CC outperforms RRF in in-domain and out-of-domain settings; and finally, that CC is sample efficient, requiring only a small set of training examples to tune its only parameter to a target domain.
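The two fusion methods named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the min-max normalization, the treatment of a document missing from one list as scoring 0, the default `eta = 60`, and all function names are assumptions.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; one of several possible normalizations."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def convex_combination(lexical: Dict[str, float], semantic: Dict[str, float],
                       alpha: float = 0.5) -> Dict[str, float]:
    """CC fusion: alpha * semantic + (1 - alpha) * lexical on normalized scores.
    A document absent from one list is assumed to score 0 there."""
    lex, sem = min_max(lexical), min_max(semantic)
    docs = set(lex) | set(sem)
    return {d: alpha * sem.get(d, 0.0) + (1 - alpha) * lex.get(d, 0.0) for d in docs}


def reciprocal_rank_fusion(rankings: List[List[str]], eta: int = 60) -> Dict[str, float]:
    """RRF: sum over rankings of 1 / (eta + rank), with ranks starting at 1."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (eta + rank)
    return fused


# Hypothetical scores for three documents.
lexical = {"d1": 12.0, "d2": 7.0, "d3": 2.0}     # e.g. BM25-style scores
semantic = {"d1": 0.2, "d2": 0.9, "d3": 0.5}     # e.g. embedding similarities
cc = convex_combination(lexical, semantic, alpha=0.5)
rrf = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d1"]])
```

CC uses the score magnitudes and has a single tunable weight `alpha`, whereas RRF discards the scores and depends only on ranks and on `eta`.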
Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, Ying Jin, Yufeng Zhang
Most image-text retrieval work adopts binary labels indicating whether a pair
of image and text matches or not. Such a binary indicator covers only a limited
subset of image-text semantic relations, which is insufficient to represent
relevance degrees between images and texts described by continuous labels such
as image captions. The visual-semantic embedding space obtained by learning
binary labels is incoherent and cannot fully characterize the relevance
degrees. In addition to the use of binary labels, this paper further
incorporates continuous pseudo labels (generally approximated by text
similarity between captions) to indicate the relevance degrees. To learn a
coherent embedding space, we propose an image-text retrieval framework with
Binary and Continuous Label Supervision (BCLS), where binary labels are used to
guide the retrieval model to learn limited binary correlations, and continuous
labels are complementary to the learning of image-text semantic relations. For
the learning of binary labels, we improve the common Triplet ranking loss with
Soft Negative mining (Triplet-SN) to improve convergence. For the learning of
continuous labels, we design Kendall ranking loss inspired by Kendall rank
correlation coefficient (Kendall), which improves the correlation between the
similarity scores predicted by the retrieval model and the continuous labels.
To mitigate the noise introduced by the continuous pseudo labels, we further
design Sliding Window sampling and Hard Sample mining strategy (SW-HS) to
alleviate the impact of noise and reduce the complexity of our framework to the
same order of magnitude as the triplet ranking loss. Extensive experiments on
two image-text retrieval benchmarks demonstrate that our method can improve the
performance of state-of-the-art image-text retrieval models.
Authors' comments: 13 pages, 7 figures
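The Kendall rank correlation coefficient that inspires the ranking loss in the BCLS abstract above can be computed as in the sketch below. It shows only the coefficient itself (the tau-a variant, counting concordant and discordant pairs), not the authors' differentiable loss; the function name and the example values are illustrative.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(predicted: Sequence[float], labels: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs.
    A pair (i, j) is concordant when both sequences order i and j the same way."""
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        product = (predicted[i] - predicted[j]) * (labels[i] - labels[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical similarity scores for three image-text pairs versus
# continuous pseudo labels (e.g. caption-to-caption text similarity).
model_scores = [0.9, 0.1, 0.5]
pseudo_labels = [0.8, 0.6, 0.2]
tau = kendall_tau(model_scores, pseudo_labels)
```

A value of 1 means the predicted similarities order all pairs exactly as the continuous labels do, and -1 means the order is fully reversed.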
Laura Hanu, James Thewlis, Yuki M. Asano, Christian Rupprecht
Multi-modal retrieval is an important problem for many applications, such as
recommendation and search. Current benchmarks and even datasets are often
manually constructed and consist of mostly clean samples where all modalities
are well-correlated with the content. Thus, current video-text retrieval
literature largely focuses on video titles or audio transcripts, while ignoring
user comments, since users often tend to discuss topics only vaguely related to
the video. Despite the ubiquity of user comments online, there are currently no
multi-modal representation learning datasets that include comments. In this
paper, we a) introduce a new dataset of videos, titles and comments; b) present
an attention-based mechanism that allows the model to learn from sometimes
irrelevant data such as comments; c) show that by using comments, our method is
able to learn better, more contextualised representations for image, video and
audio. Project page: https://unitaryai.github.io/vtc-paper.
Authors' comments: Accepted paper at the European Conference on Computer Vision (ECCV)
2022
Tabish Ahmed, Sahan Bulathwela
Extracting useful information from the user history to clearly understand
informational needs is a crucial feature of a proactive information retrieval
system. Regarding understanding information and relevance, Wikipedia can
provide the background knowledge that an intelligent system needs. This work
explores how exploiting the context of a query using Wikipedia concepts can
improve proactive information retrieval on noisy text. We formulate two models
that use entity linking to associate Wikipedia topics with the relevance model.
Our experiments around a podcast segment retrieval task demonstrate that there
is a clear signal of relevance in Wikipedia concepts while a ranking model can
improve precision by incorporating them. We also find Wikifying the background
context of a query can help disambiguate the meaning of the query, further
helping proactive information retrieval.
Authors' comments: To be published at the First Workshop on Proactive and
Agent-Supported Information Retrieval at CIKM 2022
Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Joemon M. Jose
Image-sentence retrieval has attracted extensive research attention in
multimedia and computer vision due to its promising application. The key issue
lies in jointly learning the visual and textual representation to accurately
estimate their similarity. To this end, the mainstream schema adopts an
object-word based attention to calculate their relevance scores and refine
their interactive representations with the attention features, which, however,
neglects the context of the object representation on the inter-object
relationship that matches the predicates in sentences. In this paper, we
propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for
image-sentence retrieval, which correlates the intra- and inter-modal semantics
between objects and words. In particular, we first design the intra-modal
spatial and semantic graphs based reasoning to enhance the semantic
representations of objects guided by the explicit relationships of the objects'
spatial positions and their scene graph. Then the visual and textual semantic
representations are refined jointly via the inter-modal interactive attention
and the cross-modal alignment. To correlate the context of objects with the
textual context, we further refine the visual semantic representation via the
cross-level object-sentence and word-image based interactive attention.
Experimental results on seven standard evaluation metrics show that the
proposed CMSEI outperforms the state-of-the-art and the alternative approaches
on MS-COCO and Flickr30K benchmarks.
Authors' comments: accepted to WACV 2023
Ning Han, Xun Yang, Ee-Peng Lim, Hao Chen, Qianru Sun
Cross-modal video retrieval aims to retrieve the semantically relevant videos given a text as a query, and is one of the fundamental tasks in Multimedia. Most top-performing methods primarily leverage Visual Transformer (ViT) to extract video features [1, 2, 3], and thus suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (say, 4 or 8) of frames from the video (instead of using the whole video) as input to ViT. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames performs better than using 4 frames yet needs more computational resources, resulting in a trade-off. To escape this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization learns a cross-modal video retrieval model whose input is the "compressed frames" learned by the frame-level optimization. In turn, the frame-level optimization is performed through gradient descent using the meta loss of the video retrieval model computed on the whole video. We refer to both this BOP method and the "compressed frames" as Meta-Optimized Frames (MOF). By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation, we conduct extensive experiments of cross-modal video retrieval on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method to boost multiple baseline methods, and can achieve a new state-of-the-art performance.
Sunjae Yoon, Ji Woo Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Chang D. Yoo
Video moment retrieval (VMR) aims to localize target moments in untrimmed
videos pertinent to a given textual query. Existing retrieval systems tend to
rely on retrieval bias as a shortcut and thus fail to sufficiently learn
multi-modal interactions between query and video. This retrieval bias stems
from learning frequent co-occurrence patterns between query and moments, which
spuriously correlate objects (e.g., a pencil) referred to in the query with
moments (e.g., scene of writing with a pencil) where the objects frequently
appear in the video, such that they converge into biased moment predictions.
Although recent debiasing methods have focused on removing this retrieval bias,
we argue that these biased predictions sometimes should be preserved because
there are many queries where biased predictions are rather helpful. To
conjugate this retrieval bias, we propose a Selective Query-guided Debiasing
network (SQuiDNet), which incorporates the following two main properties: (1)
Biased Moment Retrieval that intentionally uncovers the biased moments inherent
in objects of the query and (2) Selective Query-guided Debiasing that performs
selective debiasing guided by the meaning of the query. Our experimental
results on three moment retrieval benchmarks (i.e., TVR, ActivityNet, DiDeMo)
show the effectiveness of SQuiDNet and qualitative analysis shows improved
interpretability.
Authors' comments: 16 pages, 6 figures, Accepted in ECCV 2022
Kosuke Nishida, Naoki Yoshinaga, Kyosuke Nishida
Although named entity recognition (NER) helps us to extract domain-specific
entities from text (e.g., artists in the music domain), it is costly to create
a large amount of training data or a structured knowledge base to perform
accurate NER in the target domain. Here, we propose self-adaptive NER, which
retrieves external knowledge from unstructured text to learn the usages of
entities that have not been learned well. To retrieve useful knowledge for NER,
we design an effective two-stage model that retrieves unstructured knowledge
using uncertain entities as queries. Our model predicts the entities in the
input and then finds those whose predictions are not confident. Then, it
retrieves knowledge by using these uncertain entities as queries and
concatenates the retrieved text to the original input to revise the prediction.
Experiments on CrossNER datasets demonstrated that our model outperforms strong
baselines by 2.35 points in F1 metric.
Authors' comments: EACL2023 (long)
Xiyang Hu, Xinchi Chen, Peng Qi, Deguang Kong, Kunlun Liu, William Yang Wang, Zhiheng Huang
Multilingual information retrieval (IR) is challenging since annotated
training data is costly to obtain in many languages. We present an effective
method to train multilingual IR systems when only English IR training data and
some parallel corpora between English and other languages are available. We
leverage parallel and non-parallel corpora to improve the pretrained
multilingual language models' cross-lingual transfer ability. We design a
semantic contrastive loss to align representations of parallel sentences that
share the same semantics in different languages, and a new language contrastive
loss to leverage parallel sentence pairs to remove language-specific
information in sentence representations from non-parallel corpora. When trained
on English IR data with these losses and evaluated zero-shot on non-English
data, our model demonstrates significant improvement over prior work on retrieval
performance, while it requires much less computational effort. We also
demonstrate the value of our model for a practical setting when a parallel
corpus is only available for a few languages, but a lack of parallel corpora
resources persists for many other low-resource languages. Our model can work
well even with a small number of parallel sentences, and be used as an add-on
module to any backbones and other tasks.
Authors' comments: ACL Findings 2023
Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin
Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a "default" whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse languages in the MrTyDi collection: results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages. In many cases, our approach also improves retrieval effectiveness when combined with existing custom-built tokenizers.
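To make the contrast with whitespace tokenization concrete, the sketch below applies WordPiece-style greedy longest-match-first segmentation with a toy vocabulary. The vocabulary, the function names and the simple lowercasing and whitespace pre-splitting are assumptions for illustration; a real mBERT tokenizer uses a learned vocabulary and additional pre-processing such as punctuation splitting.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest matching vocabulary pieces, left to right.
    Non-initial pieces carry the '##' continuation prefix."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize_text(text: str, vocab: Set[str]) -> List[str]:
    """Whitespace pre-split, then WordPiece on each word."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Toy, hypothetical vocabulary; a real one is learned from unlabeled text.
toy_vocab = {"token", "##ization", "##izer", "retriev", "##al", "##ing"}
index_terms = tokenize_text("Tokenization retrieval", toy_vocab)
```

The resulting subword pieces, rather than whole whitespace-delimited words, would then serve as the indexable terms for a lexical matcher such as BM25.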
Kai Hui, Tao Chen, Zhen Qin, Honglei Zhuang, Fernando Diaz, Mike Bendersky, Don Metzler
Retrieval augmentation has shown promising improvements in different tasks. However, whether such augmentation can assist a large language model based re-ranker remains unclear. We investigate how to augment T5-based re-rankers using high-quality information retrieved from two external corpora -- a commercial web search engine and Wikipedia. We empirically demonstrate how retrieval augmentation can substantially improve the effectiveness of T5-based re-rankers for both in-domain and zero-shot out-of-domain re-ranking tasks.
Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao
Inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during searching, it probes nearby clusters w.r.t. an input query and only evaluates documents within them by subsequent codecs, thus avoiding the expensive cost of exhaustive traversal. However, the clustering is always lossy, which causes relevant documents to be missed in the probed clusters and hence degrades retrieval quality. In contrast, lexical matching, such as overlaps of salient terms, tends to be a strong feature for identifying relevant documents. In this work, we present the Hybrid Inverted Index (HI$^2$), where the embedding clusters and salient terms work collaboratively to accelerate dense retrieval. To make the best of both effectiveness and efficiency, we devise a cluster selector and a term selector to construct compact inverted lists and efficiently search through them. Moreover, we leverage simple unsupervised algorithms as well as end-to-end knowledge distillation to learn these two modules, with the latter further boosting the effectiveness. Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI$^2$ to achieve lossless retrieval quality with competitive efficiency across various index settings. Our code and checkpoint are publicly available at https://github.com/namespace-Pt/Adon/tree/HI2.
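The way cluster lists and term lists can complement each other may be easier to see in a toy sketch. Everything below is a simplified assumption rather than the HI$^2$ implementation: documents go to the nearest centroid by dot product, salient terms are supplied by hand instead of by a learned term selector, and candidates are scored with exact dot products instead of a codec. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


class ToyHybridInvertedIndex:
    def __init__(self, centroids: List[List[float]]):
        self.centroids = centroids
        self.cluster_lists: Dict[int, Set[str]] = defaultdict(set)
        self.term_lists: Dict[str, Set[str]] = defaultdict(set)
        self.embeddings: Dict[str, List[float]] = {}

    def add(self, doc_id: str, embedding: List[float], salient_terms: List[str]) -> None:
        """Put the document in its nearest cluster list and in each term list."""
        self.embeddings[doc_id] = embedding
        nearest = max(range(len(self.centroids)),
                      key=lambda c: dot(embedding, self.centroids[c]))
        self.cluster_lists[nearest].add(doc_id)
        for term in salient_terms:
            self.term_lists[term].add(doc_id)

    def search(self, query_embedding: List[float], query_terms: List[str],
               n_probe: int = 1, top_k: int = 2) -> List[str]:
        """Union the candidates from probed clusters and query terms, then rank."""
        ranked = sorted(range(len(self.centroids)),
                        key=lambda c: dot(query_embedding, self.centroids[c]),
                        reverse=True)
        candidates: Set[str] = set()
        for c in ranked[:n_probe]:
            candidates |= self.cluster_lists.get(c, set())
        for term in query_terms:
            candidates |= self.term_lists.get(term, set())
        return sorted(candidates,
                      key=lambda d: dot(query_embedding, self.embeddings[d]),
                      reverse=True)[:top_k]


index = ToyHybridInvertedIndex(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("a", [0.9, 0.1], ["neural"])
index.add("b", [0.2, 0.8], ["sparse"])
index.add("d", [0.55, 0.6], ["index"])   # relevant, but falls in the other cluster

query = [0.9, 0.3]
clusters_only = index.search(query, query_terms=[])        # misses "d"
hybrid = index.search(query, query_terms=["index"])        # recovers "d" via its term
```

With a single probed cluster, document "d" is missed because it sits just across the cluster boundary; the term list brings it back into the candidate set without probing more clusters.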
Cuong Hoang, Devendra Sachan, Prashant Mathur, Brian Thompson, Marcello Federico
Several recent studies have reported dramatic performance improvements in neural machine translation (NMT) by augmenting translation at inference time with fuzzy matches retrieved from a translation memory (TM). However, these studies all operate under the assumption that the TMs available at test time are highly relevant to the test set. We demonstrate that for existing retrieval augmented translation methods, using a TM with a domain mismatch to the test set can result in substantially worse performance compared to not using a TM at all. We propose a simple method to expose fuzzy-match NMT systems during training and show that it results in a system that is much more tolerant (regaining up to 5.8 BLEU) to inference with TMs with domain mismatch. Also, the model is still competitive with the baseline when fed with suggestions from relevant TMs.
Tanay Dixit, Bhargavi Paranjape, Hannaneh Hajishirzi, Luke Zettlemoyer
Counterfactual data augmentation (CDA) -- i.e., adding minimally perturbed
inputs during training -- helps reduce model reliance on spurious correlations
and improves generalization to out-of-distribution (OOD) data. Prior work on
generating counterfactuals only considered restricted classes of perturbations,
limiting their effectiveness. We present COunterfactual Generation via
Retrieval and Editing (CORE), a retrieval-augmented generation framework for
creating diverse counterfactual perturbations for CDA. For each training
example, CORE first performs a dense retrieval over a task-related unlabeled
text corpus using a learned bi-encoder and extracts relevant counterfactual
excerpts. CORE then incorporates these into prompts to a large language model
with few-shot learning capabilities, for counterfactual editing. Conditioning
language model edits on naturally occurring data results in diverse
perturbations. Experiments on natural language inference and sentiment analysis
benchmarks show that CORE counterfactuals are more effective at improving
generalization to OOD data compared to other DA approaches. We also show that
the CORE retrieval framework can be used to encourage diversity in manually
authored perturbations.
Authors' comments: Findings EMNLP 2022
Adriano Fragomeni, Michael Wray, Dima Damen
In this paper, we re-examine the task of cross-modal clip-sentence retrieval,
where the clip is part of a longer untrimmed video. When the clip is short or
visually ambiguous, knowledge of its local temporal context (i.e. surrounding
video segments) can be used to improve the retrieval performance. We propose
Context Transformer (ConTra), an encoder architecture that models the
interaction between a video clip and its local temporal context in order to
enhance its embedded representations. Importantly, we supervise the context
transformer using contrastive losses in the cross-modal embedding space. We
explore context transformers for video and text modalities. Results
consistently demonstrate improved performance on three datasets: YouCook2,
EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive
ablation studies and context analysis show the efficacy of the proposed method.
Authors' comments: Accepted in ACCV 2022
Yukun Zheng, Jiang Bian, Guanghao Meng, Chao Zhang, Honggang Wang, Zhixuan Zhang, Sen Li, Tao Zhuang et al.
In large-scale e-commerce platforms like Taobao, it is a big challenge to
retrieve products that satisfy users from billions of candidates. This has been
a common concern of academia and industry. Recently, plenty of works in this
domain have achieved significant improvements by enhancing embedding-based
retrieval (EBR) methods, including the Multi-Grained Deep Semantic Product
Retrieval (MGDSPR) model [16] in Taobao search engine. However, we find that
MGDSPR still has problems of poor relevance and weak personalization compared
to other retrieval methods in our online system, such as lexical matching and
collaborative filtering. These problems prompt us to further strengthen the
capabilities of our EBR model in both relevance estimation and personalized
retrieval. In this paper, we propose a novel Multi-Objective Personalized
Product Retrieval (MOPPR) model with four hierarchical optimization objectives:
relevance, exposure, click and purchase. We construct entire-space
multi-positive samples to train MOPPR, rather than the single-positive samples
for existing EBR models. We adopt a modified softmax loss for optimizing
multiple objectives. Results of extensive offline and online experiments show
that MOPPR outperforms the baseline MGDSPR on evaluation metrics of relevance
estimation and personalized retrieval. MOPPR achieves 0.96% transaction and
1.29% GMV improvements in a 28-day online A/B test. Since the Double-11
shopping festival of 2021, MOPPR has been fully deployed in mobile Taobao
search, replacing the previous MGDSPR. Finally, we discuss several advanced
topics of our deeper explorations on multi-objective retrieval and ranking to
contribute to the community.
Authors' comments: 9 pages, 4 figures, submitted to the 28th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining
Mohammed Hammad
Modern-day applications, especially information retrieval web apps that
involve "search" as a use case, are gradually moving towards "answering"
modules. Conversational chatbots, which have proved to be more engaging to
users, use Question Answering as their core. Since precise answering is
computationally expensive, several approaches have been developed to prefetch
the most relevant documents/passages from the database that contain the answer.
We propose a different approach that retrieves the evidence documents
efficiently and accurately, making sure that the relevant document for a given
user query is not missed. We do so by assigning each document (or passage in
our case), a unique identifier and using them to create dense vectors which can
be efficiently indexed. More precisely, we use the identifier to predict
randomly sampled context window words of the relevant question corresponding to
the passage along with the words of the passage itself. This naturally embeds the
passage identifier into the vector space in such a way that the embedding is
closer to the question without compromising the information content. This
approach enables efficient creation of real-time query vectors in ~4
milliseconds.
Authors' comments: Year-2019
Noam Malali, Yosi Keller
We present a deep learning approach for learning the joint semantic
embeddings of images and captions in a Euclidean space, such that the semantic
similarity is approximated by the L2 distances in the embedding space. For
that, we introduce a metric learning scheme that utilizes multitask learning to
learn the embedding of identical semantic concepts using a center loss. By
introducing a differentiable quantization scheme into the end-to-end trainable
network, we derive a semantic embedding of semantically similar concepts in
Euclidean space. We also propose a novel metric learning formulation using an
adaptive margin hinge loss that is refined during the training phase. The
proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets,
and was shown to compare favorably with contemporary state-of-the-art
approaches.
Authors' comments: in IEEE Transactions on Pattern Analysis and Machine Intelligence,
2023