Jingtao Zhan, Qingyao Ai, Yiqun Liu, Jiaxin Mao, Xiaohui Xie, Min Zhang, Shaoping Ma
Recent advances in Dense Retrieval (DR) techniques have significantly improved
the effectiveness of first-stage retrieval. Trained with large-scale supervised
data, DR models can encode queries and documents into a low-dimensional dense
space and conduct effective semantic matching. However, previous studies have
shown that the effectiveness of DR models would drop by a large margin when the
trained DR models are adopted in a target domain that is different from the
domain of the labeled data. One of the possible reasons is that the DR model
has never seen the target corpus and thus might be incapable of mitigating the
difference between the training and target domains. In practice, unfortunately,
training a DR model for each target domain to avoid domain shift is often a
difficult task as it requires additional time, storage, and domain-specific
data labeling, which are not always available. To address this problem, in this
paper, we propose a novel DR framework named Disentangled Dense Retrieval (DDR)
to support effective and flexible domain adaptation for DR models. DDR consists
of a Relevance Estimation Module (REM) for modeling domain-invariant matching
patterns and several Domain Adaption Modules (DAMs) for modeling
domain-specific features of multiple target corpora. By making the REM and DAMs
disentangled, DDR enables a flexible training paradigm in which REM is trained
with supervision once and DAMs are trained with unsupervised data.
Comprehensive experiments in different domains and languages show that DDR
significantly improves ranking performance compared to strong DR baselines and
substantially outperforms traditional retrieval methods in most scenarios.
Authors' comments: Preprint
Nan Jiang, Dhivya Eswaran, Choon Hui Teo, Yexiang Xue, Yesh Dattatreya, Sujay Sanghavi, Vishy Vishwanathan
We consider text retrieval within dense representational space in real-world settings such as e-commerce search where (a) document popularity and (b) diversity of queries associated with a document have a skewed distribution. Most of the contemporary dense retrieval literature presents two shortcomings in these settings. (1) They learn an almost equal number of representations per document, agnostic to the fact that a few head documents are disproportionately more critical to achieving a good retrieval performance. (2) They learn purely semantic document representations inferred from intrinsic document characteristics, which may not contain adequate information to determine the queries for which the document is relevant, especially when the document is short. We propose to overcome these limitations by augmenting semantic document representations learned by bi-encoders with behavioral document representations learned by our proposed approach MVG. To do so, MVG (1) determines how to divide the total budget for behavioral representations by drawing a connection to the Pitman-Yor process, and (2) simply clusters the queries related to a given document (based on user behavior) within the representational space learned by a base bi-encoder, and treats the cluster centers as its behavioral representations. Our central contribution is the finding that such a simple, intuitive, lightweight approach leads to substantial gains in key first-stage retrieval metrics while incurring only a marginal memory overhead. We establish this via extensive experiments over three large public datasets comparing several single-vector and multi-vector bi-encoders, a proprietary e-commerce search dataset compared to a production-quality bi-encoder, and an A/B test.
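The clustering step described in this abstract can be illustrated with a minimal sketch: the embeddings of the queries associated with one document are clustered, and the cluster centers serve as additional document vectors. This is not the MVG implementation. The function name `behavioral_representations`, the plain k-means with deterministic initialization, and the toy 2-D vectors are all assumptions made for illustration, and the per-document budget `k` is passed in by hand rather than derived from a Pitman-Yor process.

```python
from typing import List, Sequence

Vector = List[float]


def _squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def behavioral_representations(
    query_vecs: Sequence[Sequence[float]], k: int, iters: int = 20
) -> List[Vector]:
    """Hypothetical helper: k-means over the query embeddings of one document.

    Returns up to k cluster centers, which play the role of behavioral
    document representations in this sketch.
    """
    if not query_vecs or k <= 0:
        return []
    k = min(k, len(query_vecs))
    # Assumption: deterministic initialization from the first k query vectors.
    centers = [list(v) for v in query_vecs[:k]]
    for _ in range(iters):
        clusters: List[List[Sequence[float]]] = [[] for _ in range(k)]
        for v in query_vecs:
            nearest = min(range(k), key=lambda i: _squared_distance(v, centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            _mean(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

At retrieval time, such centers would be indexed alongside the document's semantic vector, so a head document with many distinct query groups can be given a larger `k` than a tail document.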
Negar Arabzadeh, Mahsa Seifikar, Charles L. A. Clarke
Despite recent progress on conversational systems, they still do not perform smoothly and coherently when faced with ambiguous requests. When questions are unclear, conversational systems should have the ability to ask clarifying questions, rather than assuming a particular interpretation or simply responding that they do not understand. Previous studies have shown that users are more satisfied when asked a clarifying question, rather than receiving an unrelated response. While the research community has paid substantial attention to the problem of predicting query ambiguity in traditional search contexts, researchers have paid relatively little attention to predicting when this ambiguity is sufficient to warrant clarification in the context of conversational systems. In this paper, we propose an unsupervised method for predicting the need for clarification. This method is based on the measured coherency of results from an initial answer retrieval step, under the assumption that a less ambiguous query is more likely to retrieve more coherent results when compared to an ambiguous query. We build a graph from retrieved items based on their context similarity, treating measures of graph connectivity as indicators of ambiguity. We evaluate our approach on two recently released open-domain conversational question answering datasets, ClariQ and AmbigNQ, comparing it with neural and non-neural baselines. Our unsupervised approach performs as well as supervised approaches while providing better generalization.
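A minimal sketch of the idea of treating graph connectivity over retrieved items as an ambiguity signal follows. The abstract does not specify the similarity measure or the connectivity measures, so the cosine similarity over item embeddings, the `threshold` value, and the two features returned (edge density and number of connected components) are assumptions chosen for illustration, and `connectivity_features` is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def connectivity_features(
    item_vecs: Sequence[Sequence[float]], threshold: float = 0.8
) -> Dict[str, float]:
    """Hypothetical helper: build a similarity graph and summarize its connectivity."""
    n = len(item_vecs)
    adjacency: List[List[int]] = [[] for _ in range(n)]
    edges = 0
    # Connect two retrieved items when their similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(item_vecs[i], item_vecs[j]) >= threshold:
                adjacency[i].append(j)
                adjacency[j].append(i)
                edges += 1
    # Count connected components with an iterative depth-first search.
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if not seen[neighbor]:
                    seen[neighbor] = True
                    stack.append(neighbor)
    possible_edges = n * (n - 1) / 2
    density = edges / possible_edges if possible_edges else 0.0
    return {"density": density, "components": components}
```

Under the abstract's assumption, a low density or a high component count among the top retrieved answers would suggest that the query admits several interpretations and may warrant a clarifying question.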
Haoran Wang, Di Xu, Dongliang He, Fu Li, Zhong Ji, Jungong Han, Errui Ding
Video-text retrieval (VTR) is an attractive yet challenging task for
multi-modal understanding, which aims to search for relevant video (text) given
a query (video). Existing methods typically employ completely heterogeneous
visual-textual information to align video and text, whilst lacking the
awareness of homogeneous high-level semantic information residing in both
modalities. To fill this gap, in this work, we propose a novel
visual-linguistic aligning model named HiSE for VTR, which improves the
cross-modal representation by incorporating explicit high-level semantics.
First, we explore the hierarchical property of explicit high-level semantics,
and further decompose it into two levels, i.e. discrete semantics and holistic
semantics. Specifically, for the visual branch, we exploit an off-the-shelf
semantic entity predictor to generate discrete high-level semantics. In
parallel, a trained video captioning model is employed to output holistic
high-level semantics. As for the textual modality, we parse the text into three
parts including occurrence, action and entity. In particular, the occurrence
corresponds to the holistic high-level semantics, meanwhile both action and
entity represent the discrete ones. Then, different graph reasoning techniques
are utilized to promote the interaction between holistic and discrete
high-level semantics. Extensive experiments demonstrate that, with the aid of
explicit high-level semantics, our method achieves superior performance
over state-of-the-art methods on three benchmark datasets, including MSR-VTT,
MSVD and DiDeMo.
Authors' comments: Accepted by ACMMM 2022
Atreyee Saha, Salman S Khan, Sagar Sehrawat, Sanjana S Prabhu, Shanti Bhattacharya, Kaushik Mitra
Fourier Ptychographic Microscopy (FPM) is an imaging procedure that overcomes the traditional limit on Space-Bandwidth Product (SBP) of conventional microscopes through computational means. It utilizes multiple images captured using a low numerical aperture (NA) objective and enables high-resolution phase imaging through frequency domain stitching. Existing FPM reconstruction methods can be broadly categorized into two approaches: iterative optimization based methods, which are based on the physics of the forward imaging model, and data-driven methods, which commonly employ a feed-forward deep learning framework. We propose a hybrid model-driven residual network that combines the knowledge of the forward imaging system with a deep data-driven network. Our proposed architecture, LWGNet, unrolls the traditional Wirtinger flow optimization algorithm into a novel neural network design that enhances the gradient images through complex convolutional blocks. Unlike other conventional unrolling techniques, LWGNet uses fewer stages while performing on par with or even better than existing traditional and deep learning techniques, particularly for low-cost and low dynamic range CMOS sensors. This improvement in performance for low-bit depth and low-cost sensors has the potential to bring down the cost of an FPM imaging setup significantly. Finally, we show consistently improved performance on our collected real data.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin et al.
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model by 3% despite having 50x fewer parameters.
Qi Zhang, Zijian Yang, Yilun Huang, Ze Chen, Zijian Cai, Kangxu Wang, Jiewen Zheng, Jiarong He et al.
This paper mainly describes our winning solution (team name: www) to the Amazon ESCI Challenge of KDD CUP 2022, which achieves an NDCG score of 0.9043 and wins first place on task 1: the query-product ranking track. In this competition, participants are provided with a real-world large-scale multilingual shopping queries data set that contains query-product pairs in English, Japanese and Spanish. Three different tasks are proposed in this competition, including ranking the results list as task 1, classifying the query/product pairs into Exact, Substitute, Complement, or Irrelevant (ESCI) categories as task 2 and identifying substitute products for a given query as task 3. We mainly focus on task 1 and propose a semantic alignment system for multilingual query-product retrieval. Pre-trained multilingual language models (LM) are adopted to get the semantic representation of queries and products. Our models are all first trained with cross-entropy loss to classify the query-product pairs into the four ESCI categories, and then we use a weighted sum of the 4-class probabilities to get the score for ranking. To further boost the model, we also apply elaborate data preprocessing, data augmentation by translation, special handling of English texts with English LMs, adversarial training with AWP and FGM, self distillation, pseudo labeling, label smoothing and ensembling. Finally, our solution outperforms others on both the public and private leaderboards.
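The scoring step mentioned in this abstract, a weighted sum over the four ESCI class probabilities, can be sketched as follows. This is an illustration under stated assumptions rather than the team's code: the abstract does not report the weights, so `ASSUMED_WEIGHTS` is a hypothetical choice, and `ranking_score` and `rank_products` are hypothetical names operating on raw classifier logits.

```python
import math
from typing import Dict, List, Sequence

ESCI_LABELS = ("exact", "substitute", "complement", "irrelevant")
# Assumption: illustrative per-class weights; the abstract does not state them.
ASSUMED_WEIGHTS = (1.0, 0.1, 0.01, 0.0)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranking_score(
    logits: Sequence[float], weights: Sequence[float] = ASSUMED_WEIGHTS
) -> float:
    """Collapse 4-class ESCI probabilities into one scalar ranking score."""
    probs = softmax(logits)
    return sum(w * p for w, p in zip(weights, probs))


def rank_products(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Sort product ids for one query by descending ranking score."""
    return sorted(
        candidates, key=lambda pid: ranking_score(candidates[pid]), reverse=True
    )
```

Training a classifier and ranking by an expectation over its classes keeps a single model usable for both the classification and ranking tasks; the weights control how much a likely Substitute or Complement contributes relative to a likely Exact match.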
Xiao Han, Kam Woh Ng, Sauradip Nag, Zhiyu Qu
Large-scale weakly supervised product retrieval is a practically useful yet
computationally challenging problem. This paper introduces a novel solution for
the eBay Visual Search Challenge (eProduct) held at the Ninth Workshop on
Fine-Grained Visual Categorisation (FGVC9) of CVPR 2022. This
competition presents two challenges: (a) E-commerce is a drastically
fine-grained domain including many products with subtle visual differences; (b)
A lack of target instance-level labels for model training, with only coarse
category labels and product titles available. To overcome these obstacles, we
formulate a strong solution by a set of dedicated designs: (a) Instead of using
text training data directly, we mine thousands of pseudo-attributes from
product titles and use them as the ground truths for multi-label
classification. (b) We incorporate several strong backbones with advanced
training recipes for more discriminative representation learning. (c) We
further introduce a number of post-processing techniques including whitening,
re-ranking and model ensemble for retrieval enhancement. By achieving 71.53%
MAR, our solution "Involution King" achieves the second position on the
leaderboard.
Authors' comments: FGVC9 CVPR2022
Yitong Zhang, Sophia Bano, Ann-Sophie Page, Jan Deprest, Danail Stoyanov, Francisco Vasconcelos
In minimally invasive surgery, surgical workflow segmentation from video
analysis is a well-studied topic. The conventional approach defines it as a
multi-class classification problem, where individual video frames are
attributed a surgical phase label.
We introduce a novel reinforcement learning formulation for offline phase
transition retrieval. Instead of attempting to classify every video frame, we
identify the timestamp of each phase transition. By construction, our model
does not produce spurious and noisy phase transitions, but contiguous phase
blocks. We investigate two different configurations of this model. The first
does not require processing all frames in a video (only <60% and <20% of frames
in 2 different applications), while producing results slightly under the
state-of-the-art accuracy. The second configuration processes all video frames,
and outperforms the state-of-the-art at a comparable computational cost.
We compare our method against the recent top-performing frame-based
approaches TeCNO and Trans-SVNet on the public dataset Cholec80 and also on an
in-house dataset of laparoscopic sacrocolpopexy. We perform both a frame-based
(accuracy, precision, recall and F1-score) and an event-based (event ratio)
evaluation of our algorithms.
Authors' comments: Accepted by MICCAI 2022
Zhongtian Hu, Lifang Wang, Yangqi Chen, Yushuang Liu, Ronghan Li, Meng Zhao, Xinyu Lu, Zejun Jiang
Knowledge-driven dialog systems have recently made remarkable breakthroughs. Compared with general dialog systems, superior knowledge-driven dialog systems can generate more informative and knowledgeable responses with pre-provided knowledge. However, in practical applications, the dialog system cannot be provided with corresponding knowledge in advance because it cannot know in advance how the conversation will develop. Therefore, in order to make the knowledge dialogue system more practical, it is vital to find a way to retrieve relevant knowledge based on the dialogue history. To solve this problem, we design a knowledge-driven dialog system named DRKQG (Dynamically Retrieving Knowledge via Query Generation for informative dialog response). Specifically, the system can be divided into two modules: the query generation module and the dialog generation module. First, a time-aware mechanism is utilized to capture context information, and a query can be generated for retrieving knowledge through a search engine. Then, we integrate the copy mechanism and transformers, which allows the response generation module to produce responses derived from the context and retrieved knowledge. Experimental results at LIC2022, Language and Intelligence Technology Competition, show that our module outperforms the baseline model by a large margin on automatic evaluation metrics, while human evaluation by the Baidu Linguistics team shows that our system achieves impressive results in Factually Correct and Knowledgeable.
Min Sik Oh, Min Sang Kim
Persona and Knowledge dual context open-domain chat is a novel dialogue generation task introduced recently. While Persona and Knowledge are each an interesting context of open-domain dialogue, the combination of both has not been well studied. We tackle Persona-Knowledge identification and response generation tasks in this paper. We design an informed data augmentation strategy that is compatible with neural Q&A retrieval models. With the augmented data, we perform permutative Persona-Knowledge evaluation and successive Persona search fine-tuning. Furthermore, we perform dialogue generation with various decoding techniques and illustrate crucial elements. We achieve SOTA across official metrics with 93.99% Grounding accuracy average and 23.62 SacreBLEU score.
Lijun Wei, Valerie Gouet-Brunet, Anthony Cohn
Location retrieval based on visual information aims to retrieve the location of an agent (e.g. human, robot) or the area they see by comparing the observations with a certain form of representation of the environment. Existing methods generally require precise measurement and storage of the observed environment features, which may not always be robust due to changes of season, viewpoint, occlusion, etc. They are also challenging to scale up and may not be applicable for humans due to the lack of measuring/imaging devices. Considering that humans often use less precise but easily produced qualitative spatial language and high-level semantic landmarks when describing an environment, a qualitative location retrieval method is proposed in this work by describing locations/places using qualitative place signatures (QPS), defined as the perceived spatial relations between ordered pairs of co-visible landmarks from the viewer's perspective. After dividing the space into place cells each with individual signatures attached, a coarse-to-fine location retrieval method is proposed to efficiently identify the possible location(s) of viewers based on their qualitative observations. The usability and effectiveness of the proposed method were evaluated using openly available landmark datasets, together with simulated observations that account for possible perception error.
ZhenHao Tang, XiaoBing Zhang, Zi Long, XiangHua Fu
Recently, a number of works have shown that the performance of neural machine
translation (NMT) can be improved to a certain extent by using visual
information. However, most of these conclusions are drawn from the analysis of
experimental results based on a limited set of bilingual sentence-image pairs,
such as Multi30K. In these kinds of datasets, the content of one bilingual
parallel sentence pair must be well represented by a manually annotated image,
which differs from the actual translation situation. Some previous works
have been proposed to address the problem by retrieving images from existing
sentence-image pairs with a topic model. However, because of the limited
collection of sentence-image pairs they used, their image retrieval method has
difficulty dealing with out-of-vocabulary words, and can hardly prove that
visual information enhances NMT rather than the co-occurrence of images and
sentences. In this paper, we propose an open-vocabulary image retrieval method
to collect descriptive images for a bilingual parallel corpus using an image
search engine. Next, we propose a text-aware attentive visual encoder to filter
incorrectly collected noise images. Experimental results on Multi30K and two
other translation datasets show that our proposed method achieves significant
improvements over strong baselines.
Authors' comments: 9 pages, 5 figures
Kaiyi Luo, Chao Zhang, Huaxiong Li, Xiuyi Jia, Chunlin Chen
In recent years, Cross-Modal Hashing (CMH) has attracted much attention due to its fast query speed and efficient storage. Previous studies have achieved promising results for Cross-Modal Retrieval (CMR) by discovering discriminative hash codes and modality-specific hash functions. Nonetheless, most existing CMR works are subject to some restrictions: 1) It is assumed that data of different modalities are fully paired, which is impractical in real applications due to missing samples and false data alignment, and 2) binary regression targets including the label matrix and binary codes are too rigid to effectively learn semantic-preserving hash codes and hash functions. To address these problems, this paper proposes an Adaptive Marginalized Semantic Hashing (AMSH) method which not only enhances the discrimination of latent representations and hash codes by adaptive margins, but also can be used for both paired and unpaired CMR. As a two-step method, in the first step, AMSH generates semantic-aware modality-specific latent representations with adaptively marginalized labels, which enlarges the distances between different classes, and exploits the labels to preserve the inter-modal and intra-modal semantic similarities in latent representations and hash codes. In the second step, adaptive margin matrices are embedded into the hash codes, and enlarge the gaps between positive and negative bits, which improves the discrimination and robustness of hash functions. On this basis, AMSH generates similarity-preserving hash codes and robust hash functions without a strict one-to-one data correspondence requirement. Experiments are conducted on several benchmark datasets to demonstrate the superiority and flexibility of AMSH over some state-of-the-art CMR methods.
Sitan Yang, Carson Eisenach, Dhruv Madeka
Multi-horizon probabilistic time series forecasting has wide applicability to
real-world tasks such as demand forecasting. Recent work in neural time-series
forecasting mainly focuses on the use of Seq2Seq architectures. For example,
MQTransformer - an improvement of MQCNN - has shown state-of-the-art
performance in probabilistic demand forecasting. In this paper, we consider
incorporating cross-entity information to enhance model performance by adding a
cross-entity attention mechanism along with a retrieval mechanism to select
which entities to attend over. We demonstrate how our new neural architecture,
MQRetNN, leverages the encoded contexts from a pretrained baseline model on the
entire population to improve forecasting accuracy. Using MQCNN as the baseline
model (due to computational constraints, we do not use MQTransformer), we first
show on a small demand forecasting dataset that it is possible to achieve ~3%
improvement in test loss by adding a cross-entity attention mechanism where
each entity attends to all others in the population. We then evaluate the model
with our proposed retrieval methods - as a means of approximating an attention
over a large population - on a large-scale demand forecasting application with
over 2 million products and observe ~1% performance gain over the MQCNN
baseline.
Authors' comments: Accepted at KDD2022 Workshop on Mining and Learning from Time Series
Conghui Hu, Gim Hee Lee
Current supervised cross-domain image retrieval methods can achieve excellent
performance. However, the cost of data collection and labeling imposes an
intractable barrier to practical deployment in real applications. In this
paper, we investigate the unsupervised cross-domain image retrieval task, where
class labels and pairing annotations are no longer a prerequisite for training.
This is an extremely challenging task because there is no supervision for both
in-domain feature representation learning and cross-domain alignment. We
address both challenges by introducing: 1) a new cluster-wise contrastive
learning mechanism to help extract class semantic-aware features, and 2) a
novel distance-of-distance loss to effectively measure and minimize the domain
discrepancy without any external supervision. Experiments on the Office-Home
and DomainNet datasets consistently show the superior image retrieval
accuracies of our framework over state-of-the-art approaches. Our source code
can be found at https://github.com/conghuihu/UCDIR.
Authors' comments: ECCV2022
Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino
The amount of audio data available on public websites is growing rapidly, and
an efficient mechanism for accessing the desired data is necessary. We propose
a content-based audio retrieval method that can retrieve a target audio that is
similar to but slightly different from the query audio by introducing auxiliary
textual information which describes the difference between the query and target
audio. While the range of conventional content-based audio retrieval is limited
to audio that is similar to the query audio, the proposed method can adjust the
retrieval range by adding an embedding of the auxiliary text query-modifier to
the embedding of the query sample audio in a shared latent space. To evaluate
our method, we built a dataset comprising two different audio clips and the
text that describes the difference. The experimental results show that the
proposed method retrieves the paired audio more accurately than the baseline.
We also confirmed based on visualization that the proposed method obtains the
shared latent space in which the audio difference and the corresponding text
are represented as similar embedding vectors.
Authors' comments: Accepted to Interspeech 2022
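The retrieval mechanism described in this abstract, adding the embedding of a text query-modifier to the embedding of the query audio in a shared latent space, can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are given as plain lists rather than produced by trained encoders, cosine similarity is an assumed ranking measure, and `retrieve_with_modifier` is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_with_modifier(
    query_audio_emb: Sequence[float],
    modifier_text_emb: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[str]:
    """Hypothetical helper: rank candidate clips against query + modifier."""
    # Shift the query audio embedding by the text embedding of the difference.
    target = [q + m for q, m in zip(query_audio_emb, modifier_text_emb)]
    return sorted(
        candidates, key=lambda name: cosine(target, candidates[name]), reverse=True
    )
```

With a zero modifier the ranking reduces to conventional content-based retrieval around the query audio, which matches the abstract's description of the modifier as a way to adjust the retrieval range.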
Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Qiuhong Ke, Jun Liu
Early action prediction aims to successfully predict the class label of an
action before it is completely performed. This is a challenging task because
the beginning stages of different actions can be very similar, with only minor
subtle differences for discrimination. In this paper, we propose a novel Expert
Retrieval and Assembly (ERA) module that retrieves and assembles a set of
experts most specialized at using discriminative subtle differences, to
distinguish an input sample from other highly similar samples. To encourage our
model to effectively use subtle differences for early action prediction, we
push experts to discriminate exclusively between samples that are highly
similar, forcing these experts to learn to use subtle differences that exist
between those samples. Additionally, we design an effective Expert Learning
Rate Optimization method that balances the experts' optimization and leads to
better performance. We evaluate our ERA module on four public action datasets
and achieve state-of-the-art performance.
Authors' comments: Accepted to ECCV 2022
Hui Shi, Yupeng Gu, Yitong Zhou, Bo Zhao, Sicun Gao, Jishen Zhao
User embeddings (vectorized representations of a user) are essential in
recommendation systems. Numerous approaches have been proposed to construct a
representation for the user in order to find similar items for retrieval tasks,
and they have been proven effective in industrial recommendation systems as
well. Recently, people have discovered the power of using multiple embeddings to
represent a user, with the hope that each embedding represents the user's
interest in a certain topic. With multi-interest representation, it's important
to model the user's preference over the different topics and how the preference
changes with time. However, existing approaches either fail to estimate the
user's affinity to each interest or unreasonably assume every interest of every
user fades at an equal rate with time, thus hurting the recall of candidate
retrieval. In this paper, we propose the Multi-Interest Preference (MIP) model,
an approach that not only produces multi-interest representations for users by using the user's
sequential engagement more effectively but also automatically learns a set of
weights to represent the preference over each embedding so that the candidates
can be retrieved from each interest proportionally. Extensive experiments have
been done on various industrial-scale datasets to demonstrate the effectiveness
of our approach.
Authors' comments: Accepted by ICML '23
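The idea of retrieving candidates from each interest in proportion to its preference weight can be sketched as a budget split. This is an illustration under stated assumptions, not the MIP model: the weights are passed in rather than learned, the largest-remainder rounding is an assumed way to obtain integer counts, and `allocate_budget` is a hypothetical name.

```python
from typing import List, Sequence


def allocate_budget(weights: Sequence[float], total: int) -> List[int]:
    """Hypothetical helper: split `total` candidate slots across interests.

    Each interest receives a share proportional to its non-negative weight;
    leftover slots go to the interests with the largest fractional remainders.
    """
    weight_sum = sum(weights)
    if total <= 0 or weight_sum <= 0:
        return [0] * len(weights)
    shares = [w * total / weight_sum for w in weights]
    counts = [int(share) for share in shares]  # floor, since shares are >= 0
    leftover = total - sum(counts)
    # Stable sort: ties in the remainder are broken by original interest order.
    by_remainder = sorted(
        range(len(weights)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts
```

Each interest embedding would then be used to fetch its allotted number of nearest items, so a strongly preferred interest contributes more candidates than a fading one.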
Hao Chen, Liqing Xu
Private information retrieval (PIR) schemes (with or without colluding
servers) have been proposed for realistic coded distributed data storage
systems. Star product PIR schemes with colluding servers for general coded
distributed storage systems were constructed over general finite fields by R.
Freij-Hollanti, O. W. Gnilke, C. Hollanti and A. Karpuk in 2017. These star
product PIR schemes with colluding servers are suitable for the storage of
files over small fields and can be constructed for coded distributed storage
systems with a large number of servers. In this paper, the problem of finding
good retrieval codes for an efficient storage code is considered. In general, if
the storage code is a binary Reed-Muller code, the retrieval code need not be a
binary Reed-Muller code. It is proved that when the storage code
contains some special codewords, nonzero retrieval rate star product PIR
schemes with colluding servers can only protect against a small number of
colluding servers. We also give examples to show that when the storage code is
a good cyclic code, the best choice of the retrieval code is not cyclic in
general. Therefore in the design of star product PIR schemes with colluding
servers, the scheme with the storage code and the retrieval code in the same
family of algebraic codes is not always efficient.
Authors' comments: 25 pages. PIR schemes with the storage code and the retrieval code in
the same family of algebraic codes seem not always efficient. arXiv admin
note: text overlap with arXiv:2207.03163