Craig Macdonald, Nicola Tonellotto
Dense retrieval, which describes the use of contextualised language models such as BERT to identify documents from a collection by leveraging approximate nearest neighbour (ANN) techniques, has been increasing in popularity. Two families of approaches have emerged, depending on whether documents and queries are represented by single or multiple embeddings. ColBERT, the exemplar of the latter, uses an ANN index and approximate scores to identify a set of candidate documents for each query embedding, which are then re-ranked using accurate document representations. In this manner, a large number of documents can be retrieved for each query, hindering the efficiency of the approach. In this work, we investigate the use of ANN scores for ranking the candidate documents, in order to decrease the number of candidate documents being fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, by using the approximate scores to cut the candidate set off at only 200 documents, we can still obtain an effective ranking without statistically significant differences in effectiveness, while achieving a 2x speedup.
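A minimal sketch of the pruning idea described above, under stated assumptions. The ANN search is represented by a hypothetical `ann_hits` structure (one list of `(doc_id, approx_score)` pairs per query embedding), approximate scores are aggregated by keeping the best hit per document and summing over query embeddings, and exact scoring is a ColBERT-style MaxSim. The function names, the aggregation rule, and the data layout are illustrative and not taken from the paper; only the cutoff of 200 comes from the abstract.

```python
from collections import defaultdict


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def approx_candidate_scores(ann_hits):
    """Aggregate approximate ANN scores per document (assumed rule).

    ann_hits: one list per query embedding of (doc_id, approx_score) pairs.
    For each query embedding the best approximate score per document is
    kept; these are then summed over all query embeddings.
    """
    totals = defaultdict(float)
    for hits in ann_hits:
        best = {}
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
        for doc_id, score in best.items():
            totals[doc_id] += score
    return dict(totals)


def maxsim(query_embs, doc_embs):
    """Exact late-interaction score: sum over query embeddings of the
    maximum similarity to any document embedding."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


def retrieve(query_embs, ann_hits, doc_store, cutoff=200):
    """Shortlist by approximate score, then fully score only the shortlist."""
    approx = approx_candidate_scores(ann_hits)
    shortlist = sorted(approx, key=approx.get, reverse=True)[:cutoff]
    exact = {doc_id: maxsim(query_embs, doc_store[doc_id]) for doc_id in shortlist}
    return sorted(exact.items(), key=lambda kv: kv[1], reverse=True)
```

The cost saving in this sketch comes from `maxsim` being called at most `cutoff` times per query, regardless of how many documents the ANN stage returned.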
Sourav Dutta, Haytham Assem, Edward Burgin
Frequently-Asked-Question (FAQ) retrieval provides an effective procedure for
responding to users' natural-language queries. Such platforms are
becoming common in enterprise chatbots, product question answering, and
preliminary technical support for customers. However, the challenge in such
scenarios lies in bridging the lexical and semantic gap between varied query
formulations and the corresponding answers, both of which typically have a very
short span. This paper proposes TI-S2S, a novel learning framework combining
TF-IDF based keyword extraction and Word2Vec embeddings for training a
Sequence-to-Sequence (Seq2Seq) architecture. It achieves high precision for FAQ
retrieval by better understanding the underlying intent of a user question
captured via the representative keywords. We further propose a variant with an
additional neural network module for guiding retrieval via relevant candidate
identification based on similarity features. Experiments on a publicly available
dataset show that our approaches provide around 92% precision-at-rank-5,
a nearly 13% improvement over existing approaches.
Authors' comments: 6 pages
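As a rough illustration of the TF-IDF keyword extraction step mentioned in the abstract, the sketch below ranks the terms of each question by a plain TF-IDF score and keeps the top few. The whitespace tokenisation, the specific TF-IDF variant, the alphabetical tie-breaking, and the function name `tfidf_keywords` are assumptions made for this example; the abstract does not describe the exact formulation used by TI-S2S.

```python
import math
from collections import Counter


def tfidf_keywords(questions, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each question.

    Assumed variant: tf = count / question length, idf = ln(N / df).
    Ties are broken alphabetically so the output is deterministic.
    """
    docs = [q.lower().split() for q in questions]
    n = len(docs)
    # Document frequency: number of questions containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords
```

Terms shared by every question (such as "how" or "my" in a typical FAQ set) receive an idf of zero under this variant, so the surviving keywords are the intent-bearing ones.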
Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin
We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual
retrieval in eleven typologically diverse languages, designed to evaluate
ranking with learned dense representations. The goal of this resource is to
spur research in dense retrieval techniques in non-English languages, motivated
by recent observations that existing techniques for representation learning
perform poorly when applied to out-of-distribution data. As a starting point,
we provide zero-shot baselines for this new dataset based on a multi-lingual
adaptation of DPR that we call "mDPR". Experiments show that although the
effectiveness of mDPR is much lower than BM25, dense representations
nevertheless appear to provide valuable relevance signals, improving BM25
results in sparse-dense hybrids. In addition to analyses of our results, we
also discuss future challenges and present a research agenda in multi-lingual
dense retrieval. Mr. TyDi can be downloaded at
https://github.com/castorini/mr.tydi.
Authors' comments: Workshop on Multilingual Representation Learning at EMNLP 2021
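The sparse-dense hybrid mentioned in the abstract can be pictured as a score fusion step. The sketch below min-max normalises each run and interpolates linearly; the normalisation, the weight `alpha`, and the function names are assumptions for illustration, since the abstract does not say how the hybrid is formed.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid(sparse, dense, alpha=0.5):
    """Linear interpolation of normalised sparse (e.g. BM25) and dense scores.

    Documents missing from one run contribute 0 for that run (an assumption).
    Returns (doc_id, fused_score) pairs, best first, ties broken by doc_id.
    """
    s, d = min_max(sparse), min_max(dense)
    docs = set(s) | set(d)
    fused = {
        doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

A document ranked moderately by both runs can overtake one ranked first by a single run, which is one way a weaker dense signal can still improve BM25 results.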
Soumava Paul, Titir Dutta, Soma Biswas
In this work, for the first time, we address the problem of universal
cross-domain retrieval, where the test data can belong to classes or domains
which are unseen during training. Because the number of categories grows
dynamically and training on every possible domain is impractical, since it
requires large amounts of data, generalizing to both unseen classes and domains
is important. Towards that goal, we propose SnMpNet (Semantic Neighbourhood and
Mixture Prediction Network), which incorporates two novel losses to account for
the unseen classes and domains encountered during testing. Specifically, we
introduce a novel Semantic Neighborhood loss to bridge the knowledge gap
between seen and unseen classes and ensure that the latent space embedding of
the unseen classes is semantically meaningful with respect to its neighboring
classes. We also introduce a mix-up based supervision at image-level as well as
semantic-level of the data for training with the Mixture Prediction loss, which
helps in efficient retrieval when the query belongs to an unseen domain. These
losses are incorporated on the SE-ResNet50 backbone to obtain SnMpNet.
Extensive experiments on two large-scale datasets, Sketchy Extended and
DomainNet, and thorough comparisons with the state of the art justify the
effectiveness of the proposed model.
Authors' comments: Accepted at ICCV 2021. 15 pages, 6 figures
Donglin Zhang, Xiao-Jun Wu, He-Feng Yin, Josef Kittler
In recent years, cross-media hashing techniques have attracted increasing attention for their high computation efficiency and low storage cost. However, the existing approaches still have some limitations, which need to be explored. 1) A fixed hash length (e.g., 16 bits or 32 bits) is predefined before learning the binary codes. Therefore, these models need to be retrained when the hash length changes, which consumes additional computation power and reduces scalability in practical applications. 2) Existing cross-modal approaches only explore the information in the original multimedia data to perform the hash learning, without exploiting the semantic information contained in the learned hash codes. To this end, we develop a novel Multiple hash cOdes jOint learNing method (MOON) for cross-media retrieval. Specifically, the developed MOON synchronously learns the hash codes with multiple lengths in a unified framework. Besides, to enhance the underlying discrimination, we combine the clues from the multimodal data, semantic labels and learned hash codes for hash learning. As far as we know, the proposed MOON is the first work to simultaneously learn different length hash codes without retraining in cross-media retrieval. Experiments on several databases show that our MOON can achieve promising performance, outperforming some recent competitive shallow and deep methods.
Hongjin Qian, Zhicheng Dou, Yutao Zhu, Yueyuan Ma, Ji-Rong Wen
In this paper, we explore the problem of developing personalized chatbots. A
personalized chatbot is designed as a digital chatting assistant for a user.
The key characteristic of a personalized chatbot is that it should have a
consistent personality with the corresponding user. It can talk the same way as
the user when it is delegated to respond to others' messages. We present a
retrieval-based personalized chatbot model, namely IMPChat, to learn an
implicit user profile from the user's dialogue history. We argue that the
implicit user profile is superior to the explicit user profile regarding
accessibility and flexibility. IMPChat aims to learn an implicit user profile
by modeling the user's personalized language style and personalized
preferences separately. To learn a user's personalized language style, we
elaborately build language models from shallow to deep using the user's
historical responses. To model a user's personalized preferences, we explore
the conditional relations underneath each post-response pair of the user. The
personalized preferences are dynamic and context-aware: we assign higher
weights to those historical pairs that are topically related to the current
query when aggregating the personalized preferences. We match each response
candidate with the personalized language style and personalized preference,
respectively, and fuse the two matching signals to determine the final ranking
score. Comprehensive experiments on two large datasets show that our method
outperforms all baseline models.
Authors' comments: Accepted by CIKM 2021, codes and dataset will be released at
https://github.com/qhjqhj00/CIKM2021-IMPChat
Philipp Grohs, Martin Rathmair
We consider the problem of reconstructing the missing phase information from spectrogram data $|\mathcal{G} f|,$ with $$ \mathcal{G}f(x,y)=\int_\mathbb{R} f(t) e^{-\pi(t-x)^2}e^{-2\pi i t y}dt, $$ the Gabor transform of a signal $f\in L^2(\mathbb{R})$. More specifically, we are interested in domains $\Omega\subseteq \mathbb{R}^2$, which allow for stable local reconstruction, that is $$ |\mathcal{G}g| \approx |\mathcal{G}f| \quad \text{in} ~\Omega \quad\Longrightarrow \quad \exists \tau\in\mathbb{T}:\quad \mathcal{G}g \approx \tau\mathcal{G}f \quad \text{in} ~\Omega. $$ In recent work [P. Grohs and M. Rathmair. Stable Gabor Phase Retrieval and Spectral Clustering. Comm. Pure Appl. Math. (2019)] and [P. Grohs and M. Rathmair. Stable Gabor phase retrieval for multivariate functions. J. Eur. Math. Soc. (2021)] we established a characterization of the stability of this phase retrieval problem in terms of the connectedness of the observed measurements. The main downside of the aforementioned results is that the similarity of two spectrograms is measured w.r.t. a first order weighted Sobolev norm. In this article we remove this flaw and essentially show that the Sobolev norm may be replaced by the $L^2$-norm. Using this result allows us to show that it suffices to sample the spectrogram on suitable discrete sampling sets -- a property of crucial importance for practical applications.
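A short worked step, added here for orientation, shows why the factor $\tau$ in the implication above cannot be avoided. Taking $\mathbb{T}$ to denote the unit circle $\{\tau\in\mathbb{C}: |\tau|=1\}$ (the usual convention, assumed here since the abstract does not define it), linearity of the Gabor transform gives $$ |\mathcal{G}(\tau f)(x,y)| = |\tau|\,|\mathcal{G}f(x,y)| = |\mathcal{G}f(x,y)| \quad \text{for all } (x,y)\in\mathbb{R}^2, $$ so $f$ and $\tau f$ produce identical spectrograms, and reconstruction can only be expected up to a global phase.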
Craig Macdonald, Nicola Tonellotto, Iadh Ounis
The advent of contextualised language models has brought gains in search
effectiveness, not just when applied for re-ranking the output of classical
weighting models such as BM25, but also when used directly for passage indexing
and retrieval, a technique which is called dense retrieval. In the existing
literature in neural ranking, two dense retrieval families have become
apparent: single representation, where entire passages are represented by a
single embedding (usually BERT's [CLS] token, as exemplified by the recent ANCE
approach), or multiple representations, where each token in a passage is
represented by its own embedding (as exemplified by the recent ColBERT
approach). These two families have not been directly compared. However, because
of the likely importance of dense retrieval moving forward, a clear
understanding of their advantages and disadvantages is paramount. To this end,
this paper contributes a direct study on their comparative effectiveness,
noting situations where each method under/over performs w.r.t. each other, and
w.r.t. a BM25 baseline. We observe that, while ANCE is more efficient than
ColBERT in terms of response time and memory usage, multiple representations
are statistically more effective than the single representations for MAP and
MRR@10. We also show that multiple representations obtain better improvements
than single representations for queries that are the hardest for BM25, as well
as for definitional queries, and those with complex information needs.
Authors' comments: Published at the 11th Italian Information Retrieval Workshop (IIR
2021)
Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, Yong Yu
Prediction over tabular data is an essential task in many data science
applications such as recommender systems, online advertising, medical
treatment, etc. Tabular data is structured into rows and columns, with each row
as a data sample and each column as a feature attribute. Both the columns and
rows of the tabular data carry useful patterns that could improve the model
prediction performance. However, most existing models focus on the cross-column
patterns yet overlook the cross-row patterns as they deal with single samples
independently. In this work, we propose a general learning framework named
Retrieval & Interaction Machine (RIM) that fully exploits both cross-row and
cross-column patterns among tabular data. Specifically, RIM first leverages
search engine techniques to efficiently retrieve useful rows of the table to
assist the label prediction of the target row, then uses feature interaction
networks to capture the cross-column patterns among the target row and the
retrieved rows so as to make the final label prediction. We conduct extensive
experiments on 11 datasets of three important tasks, i.e., CTR prediction
(classification), top-n recommendation (ranking) and rating prediction
(regression). Experimental results show that RIM achieves significant
improvements over the state-of-the-art and various baselines, demonstrating the
superiority and efficacy of RIM.
Authors' comments: SIGKDD 2021
Zhiwei Zhang, Hanyu Peng
Deep hashing has been widely applied to large-scale image retrieval by
encoding high-dimensional data points into binary codes for efficient
retrieval. Compared with pairwise/triplet similarity based hash learning,
central similarity based hashing can more efficiently capture the global data
distribution. For multi-label image retrieval, however, previous methods only
use multiple hash centers with equal weights to generate one centroid as the
learning target, which ignores the relationship between the weights of hash
centers and the proportion of instance regions in the image. To address the
above issue, we propose a two-step alternative optimization approach,
Instance-weighted Central Similarity (ICS), to automatically learn the center
weight corresponding to a hash code. Firstly, we apply the maximum entropy
regularizer to prevent one hash center from dominating the loss function, and
compute the center weights via projected gradient descent. Secondly, we update
neural network parameters by standard back-propagation with fixed center
weights. More importantly, the learned center weights can well reflect the
proportion of foreground instances in the image. Our method achieves the
state-of-the-art performance on the image retrieval benchmarks, and especially
improves the mAP by 1.6%-6.4% on the MS COCO dataset.
Authors' comments: 10 pages, 6 figures
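To make the contrast between equal and instance-dependent center weights concrete, the sketch below forms a learning target from several hash centers as a weighted sum followed by sign thresholding. The thresholding rule (ties go to +1), the example centers, and the function name `weighted_centroid` are assumptions for illustration; the abstract does not specify how ICS builds its target, and the weights here are fixed by hand rather than learned.

```python
def weighted_centroid(centers, weights):
    """Combine +1/-1 hash centers into one binary target code.

    centers: list of equal-length vectors with entries in {+1, -1}.
    weights: non-negative weights, one per center (need not sum to 1).
    Each bit is the sign of the weighted average; ties resolve to +1.
    """
    total = sum(weights)
    length = len(centers[0])
    centroid = []
    for j in range(length):
        value = sum(w * c[j] for w, c in zip(weights, centers)) / total
        centroid.append(1 if value >= 0 else -1)
    return centroid


# Hypothetical 4-bit centers for three labels present in one image.
person = [1, 1, 1, 1]
dog = [1, -1, 1, -1]
car = [1, 1, -1, -1]

equal_target = weighted_centroid([person, dog, car], [1.0, 1.0, 1.0])
dog_heavy_target = weighted_centroid([person, dog, car], [0.1, 0.7, 0.2])
```

With equal weights the target is a majority vote over the three centers, whereas weighting the dog center heavily (as one might if the dog fills most of the image) pulls the target onto that center.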
Tom Schmit, Luigi Giannelli, Anders S. Sørensen, Giovanna Morigi
Quantum networks using photonic channels require control of the interactions
between the photons, carrying the information, and the elements comprising the
nodes. In this work we theoretically analyse the spectral properties of an
optical photon emitted by a solid-state quantum memory, which acts as a
converter of a photon absorbed in another frequency range. We determine
explicitly the expression connecting the stored and retrieved excitation taking
into account possible mode and phase mismatch of the experimental setup. The
expression we obtain describes the output field as a function of the input
field for a transducer working over a wide range of frequencies, from
optical-to-optical to microwave-to-optical. We apply this result to analyse the
photon spectrum and the retrieval probability as a function of the optical
depth for microwave-to-optical transduction. In the absence of losses, the
efficiency of the solid-state quantum transducer is intrinsically determined by
the capability of designing the retrieval process as the time-reversal of the
storage dynamics.
Authors' comments: 14 pages, 4 figures; To appear in Phys Rev A
Sennur Ulukus, Salman Avestimehr, Michael Gastpar, Syed Jafar, Ravi Tandon, Chao Tian
Most of our lives are conducted in cyberspace. The human notion of privacy translates into a cyber notion of privacy for many functions that take place in cyberspace. This article focuses on three such functions: how to privately retrieve information from cyberspace (privacy in information retrieval), how to privately leverage large-scale distributed/parallel processing (privacy in distributed computing), and how to learn/train machine learning models from private data spread across multiple users (privacy in distributed (federated) learning). The article motivates each privacy setting, describes the problem formulation, summarizes breakthrough results in the history of each problem, gives recent results, and discusses some of the major ideas that emerged in each field. In addition, the cross-cutting techniques and interconnections between the three topics are discussed along with a set of open problems and challenges.
Zhizhong Chen, Carsten Eickhoff
Despite advances in neural machine translation, cross-lingual retrieval tasks in which queries and documents live in different natural language spaces remain challenging. Although neural translation models may provide an intuitive approach to tackle the cross-lingual problem, their resource-consuming training and advanced model structures may complicate the overall retrieval pipeline and reduce user engagement. In this paper, we build our end-to-end Cross-Lingual Arabic Information REtrieval (CLAIRE) system based on cross-lingual word embeddings, where searchers are assumed to have a passable passive understanding of Arabic and various supporting information in English is provided to aid the retrieval experience. The proposed system has three major advantages: (1) The usage of English-Arabic word embeddings simplifies the overall pipeline and avoids the potential mistakes caused by machine translation. (2) Our CLAIRE system can incorporate arbitrary word embedding-based neural retrieval models without structural modification. (3) Early empirical results on an Arabic news collection show promising performance.
Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel et al.
Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
Peng Wu, Xiangteng He, Mingqian Tang, Yiliang Lv, Jing Liu
Video-text retrieval is an important yet challenging task in vision-language
understanding, which aims to learn a joint embedding space where related video
and text instances are close to each other. Most current works simply measure
the video-text similarity based on video-level and text-level embeddings.
However, the neglect of more fine-grained or local information causes the
problem of insufficient representation. Some works exploit the local details by
disentangling sentences, but overlook the corresponding videos, causing the
asymmetry of video-text representation. To address the above limitations, we
propose a Hierarchical Alignment Network (HANet) to align different level
representations for video-text matching. Specifically, we first decompose video
and text into three semantic levels, namely event (video and text), action
(motion and verb), and entity (appearance and noun). Based on these, we
naturally construct hierarchical representations in the individual-local-global
manner, where the individual level focuses on the alignment between frame and
word, local level focuses on the alignment between video clip and textual
context, and global level focuses on the alignment between the whole video and
text. Alignments at different levels capture fine-to-coarse correlations between
video and text, and take advantage of the complementary information
among three semantic levels. Besides, our HANet is also richly interpretable by
explicitly learning key semantic concepts. Extensive experiments on two public
datasets, namely MSR-VTT and VATEX, show the proposed HANet outperforms other
state-of-the-art methods, which demonstrates the effectiveness of hierarchical
representation and alignment. Our code is publicly available.
Authors' comments: This work has been accepted to ACM-MM 2021
Souvik Das, Sougata Saha, Rohini K. Srihari
The Covid-19 pandemic has spurred a surge in the medical research literature.
With new research advances in understanding the virus, there is a need for
robust text mining tools which can process, extract and present answers from
the literature in a concise and consumable way. In this paper, we present a
conversational system that combines a DialoGPT-based multi-turn conversation
generation module with an ensemble information retrieval module based on BM25
and neural embeddings. The system can retrieve answers to coronavirus-related
queries from the rich medical literature and present them to the user in a
conversational setting. We further perform experiments to compare neural
embedding-based document retrieval and the traditional BM25 retrieval algorithm
and report the results.
Authors' comments: SBP-BRiMS 2020 Pandemic Track paper
Nguyen Hieu Thao, Oleg Soloviev, Russell Luke, Michel Verhaegen
We develop for the first time a mathematical framework in which the class of
projection algorithms can be applied to high numerical aperture (NA) phase
retrieval. Within this framework, we first analyze the basic steps of solving
the high-NA phase retrieval problem by projection algorithms and establish the
closed forms of all the relevant prox-operators. We then study the geometry of
the high-NA phase retrieval problem and the obtained results are subsequently
used to establish convergence criteria of projection algorithms in the presence
of noise. Making use of the vectorial point-spread-function (PSF) is, on the
one hand, the key difference between this paper and the literature of phase
retrieval mathematics which deals with the scalar PSF. The results of this
paper, on the other hand, can be viewed as extensions of those concerning
projection methods for low-NA phase retrieval. Importantly, the improved
performance of projection methods over the other classes of phase retrieval
algorithms in the low-NA setting now also becomes applicable to the high-NA
case. This is demonstrated by the accompanying numerical results which show
that available solution approaches for high-NA phase retrieval are outperformed
by projection methods.
Authors' comments: 26 pages, 2 figures, 2 tables
Yutao Zhu, Jian-Yun Nie, Kun Zhou, Pan Du, Hao Jiang, Zhicheng Dou
A proactive dialogue system has the ability to proactively lead the
conversation. Unlike general chatbots, which only react to the user,
proactive dialogue systems can be used to achieve some goals, e.g., to
recommend some items to the user. Background knowledge is essential to enable
smooth and natural transitions in dialogue. In this paper, we propose a new
multi-task learning framework for retrieval-based knowledge-grounded proactive
dialogue. To determine the relevant knowledge to be used, we frame knowledge
prediction as a complementary task and use explicit signals to supervise its
learning. The final response is selected according to the predicted knowledge,
the goal to achieve, and the context. Experimental results show that explicit
modeling of knowledge prediction and goal selection can greatly improve the
final response selection. Our code is available at
https://github.com/DaoD/KPN/.
Authors' comments: Accepted by SIGIR 2021
Bruno Moio, Gian Luca Dolso, Giacomo Inzani, Nicola Di Palo, Shunsuke A. Sato, Rocío Borrego-Varillas, Mauro Nisoli, Matteo Lucchini
The first step to gain optical control over the ultrafast processes initiated by light in solids is a correct identification of the physical mechanisms at play. Among them, exciton formation has been identified as a crucial phenomenon which deeply affects the electro-optical properties of most semiconductors and insulators of technological interest. While recent experiments based on attosecond spectroscopy techniques have demonstrated the possibility to observe the early-stage exciton dynamics, the description of the underlying exciton properties remains non-trivial. In this work we propose a new method called extended Ptychographic Iterative engine for eXcitons (ePIX), capable of reconstructing the main physical properties which determine the evolution of the quasi-particle with no prior knowledge of the exact relaxation dynamics or the pump temporal characteristics. By demonstrating its accuracy even when the exciton dynamics is comparable to the pump pulse duration, ePIX is established as a powerful approach to widen our knowledge of solid-state physics.
Yizhi Li, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu
Dense retrieval conducts text retrieval in the embedding space and has shown
many advantages compared to sparse retrieval. Existing dense retrievers
optimize representations of queries and documents with contrastive training and
map them to the embedding space. The embedding space is optimized by aligning
the matched query-document pairs and pushing the negative documents away from
the query. However, in such a training paradigm, the queries are only optimized
to align to the documents and are coarsely positioned, leading to an
anisotropic query embedding space. In this paper, we analyze the embedding
space distributions and propose an effective training paradigm, Contrastive
Dual Learning for Approximate Nearest Neighbor (DANCE) to learn fine-grained
query representations for dense retrieval. DANCE incorporates an additional
dual training object of query retrieval, inspired by the classic information
retrieval training axiom, query likelihood. With contrastive learning, the dual
training object of DANCE learns more tailored representations for queries and
documents to keep the embedding space smooth and uniform, which improves the
ranking performance of DANCE on the MS MARCO document retrieval task. Unlike
ANCE, which is optimized only with the document retrieval task, DANCE
concentrates the query embeddings closer to document representations while
making the document distribution more discriminative. Such concentrated query
embedding distribution assigns more uniform negative sampling probabilities to
queries and helps to sufficiently optimize query representations in the query
retrieval task. Our codes are released at https://github.com/thunlp/DANCE.
Authors' comments: Accepted by ICTIR 2021
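The dual training idea in the abstract can be sketched as two contrastive terms: one that retrieves the matched document given the query, and one that retrieves the matched query given the document. The softmax cross-entropy form, the dot-product similarity, the `weight` parameter, and the function names below are assumptions for illustration and are not taken from the DANCE code release.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchor, positive, negatives):
    """Negative log-softmax of the positive's similarity to the anchor,
    computed over the positive and all negatives (log-sum-exp stabilised)."""
    logits = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


def dual_loss(query, pos_doc, neg_docs, neg_queries, weight=1.0):
    """Document retrieval term plus a weighted query retrieval term.

    neg_docs: documents that do not match the query.
    neg_queries: queries that do not match pos_doc.
    """
    doc_retrieval = contrastive_loss(query, pos_doc, neg_docs)
    query_retrieval = contrastive_loss(pos_doc, query, neg_queries)
    return doc_retrieval + weight * query_retrieval
```

Setting `weight=0.0` reduces this sketch to the document-retrieval-only objective, which is the setting the abstract attributes to ANCE.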