Jianjin Zhang, Zheng Liu, Weihao Han, Shitao Xiao, Ruicheng Zheng, Yingxia Shao, Hao Sun, Hanqing Zhu et al.
Embedding based retrieval (EBR) is a fundamental building block in many web applications. However, EBR in sponsored search is distinguished from other generic scenarios and technically challenging due to the need of serving multiple retrieval purposes: firstly, it has to retrieve high-relevance ads, which may exactly serve user's search intent; secondly, it needs to retrieve high-CTR ads so as to maximize the overall user clicks. In this paper, we present a novel representation learning framework Uni-Retriever developed for Bing Search, which unifies two different training modes knowledge distillation and contrastive learning to realize both required objectives. On one hand, the capability of making high-relevance retrieval is established by distilling knowledge from the ``relevance teacher model''. On the other hand, the capability of making high-CTR retrieval is optimized by learning to discriminate user's clicked ads from the entire corpus. The two training modes are jointly performed as a multi-objective learning process, such that the ads of high relevance and CTR can be favored by the generated embeddings. Besides the learning strategy, we also elaborate our solution for EBR serving pipeline built upon the substantially optimized DiskANN, where massive-scale EBR can be performed with competitive time and memory efficiency, and accomplished in high-quality. We make comprehensive offline and online experiments to evaluate the proposed techniques, whose findings may provide useful insights for the future development of EBR systems. Uni-Retriever has been mainstreamed as the major retrieval path in Bing's production thanks to the notable improvements on the representation and EBR serving quality.
Matthias Wellershoff
We characterise all pairs of finite order entire functions whose magnitudes
agree on two arbitrary lines in the complex plane by means of the Hadamard
factorisation theorem. Building on this, we also characterise all pairs of
second order entire functions whose magnitudes agree on infinitely many
equidistant parallel lines. Furthermore, we show that the magnitude of an
entire function on three parallel lines, whose distances are rationally
independent, uniquely determines the function up to global phase, and that
there exists a first order entire function whose magnitude on the lines
$\mathbb{R} + \tfrac{\mathrm{i}}{n} \mathbb{Z}$ does not uniquely determine it
up to global phase, for all positive integers $n$. Our results have direct
implications for Gabor phase retrieval.
Authors' comments: 29 pages, 2 figures; minor change in acknowledgements
Sophia Althammer, Sebastian Hofstätter, Mete Sertkan, Suzan Verberne, Allan Hanbury
Dense passage retrieval (DPR) models show great effectiveness gains in first
stage retrieval for the web domain. However in the web domain we are in a
setting with large amounts of training data and a query-to-passage or a
query-to-document retrieval task. We investigate in this paper dense
document-to-document retrieval with limited labelled target data for training,
in particular legal case retrieval. In order to use DPR models for
document-to-document retrieval, we propose a Paragraph Aggregation Retrieval
Model (PARM) which liberates DPR models from their limited input length. PARM
retrieves documents on the paragraph-level: for each query paragraph, relevant
documents are retrieved based on their paragraphs. Then the relevant results
per query paragraph are aggregated into one ranked list for the whole query
document. For the aggregation we propose vector-based aggregation with
reciprocal rank fusion (VRRF) weighting, which combines the advantages of
rank-based aggregation and topical aggregation based on the dense embeddings.
Experimental results show that VRRF outperforms rank-based aggregation
strategies for dense document-to-document retrieval with PARM. We compare PARM
to document-level retrieval and demonstrate higher retrieval effectiveness of
PARM for lexical and dense first-stage retrieval on two different legal case
retrieval collections. We investigate how to train the dense retrieval model
for PARM on limited target data with labels on the paragraph or the
document-level. In addition, we analyze the differences of the retrieved
results of lexical and dense retrieval with PARM.
Authors' comments: Accepted at ECIR 2022
Konstantin Avrachenkov, Evsey Morozov, Ruslana Nekrasova
We establish stability criterion for a two-class retrial system with Poisson inputs, general class-dependent service times and class-dependent constant retrial rates. We also characterise an interesting phenomenon of partial stability when one orbit is tight but the other orbit goes to infinity in probability. All theoretical results are illustrated by numerical experiments.
Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta et al.
Despite their recent popularity and well-known advantages, dense retrievers still lag behind sparse methods such as BM25 in their ability to reliably match salient phrases and rare entities in the query and to generalize to out-of-domain data. It has been argued that this is an inherent limitation of dense models. We rebut this claim by introducing the Salient Phrase Aware Retriever (SPAR), a dense retriever with the lexical matching capacity of a sparse model. We show that a dense Lexical Model {\Lambda} can be trained to imitate a sparse one, and SPAR is built by augmenting a standard dense retriever with {\Lambda}. Empirically, SPAR shows superior performance on a range of tasks including five question answering datasets, MS MARCO passage retrieval, as well as the EntityQuestions and BEIR benchmarks for out-of-domain evaluation, exceeding the performance of state-of-the-art dense and sparse retrievers. The code and models of SPAR are available at: https://github.com/facebookresearch/dpr-scale/tree/main/spar
Sophia Althammer, Arian Askari, Suzan Verberne, Allan Hanbury
In this paper, we present our approaches for the case law retrieval and the
legal case entailment task in the Competition on Legal Information
Extraction/Entailment (COLIEE) 2021. As first stage retrieval methods combined
with neural re-ranking methods using contextualized language models like BERT
achieved great performance improvements for information retrieval in the web
and news domain, we evaluate these methods for the legal domain. A distinct
characteristic of legal case retrieval is that the query case and case
description in the corpus tend to be long documents and therefore exceed the
input length of BERT. We address this challenge by combining lexical and dense
retrieval methods on the paragraph-level of the cases for the first stage
retrieval. Here we demonstrate that the retrieval on the paragraph-level
outperforms the retrieval on the document-level. Furthermore the experiments
suggest that dense retrieval methods outperform lexical retrieval. For
re-ranking we address the problem of long documents by summarizing the cases
and fine-tuning a BERT-based re-ranker with the summaries. Overall, our best
results were obtained with a combination of BM25 and dense passage retrieval
using domain-specific embeddings.
Authors' comments: Published in COLIEE 2021
Hao Cheng, Ping Wang, Chun Qi
As important data carriers, the drastically increasing number of multimedia
videos often brings many duplicate and near-duplicate videos in the top results
of search. Near-duplicate video retrieval (NDVR) can cluster and filter out the
redundant contents. In this paper, the proposed NDVR approach extracts the
frame-level video representation based on convolutional neural network (CNN)
features from fully-connected layer and aggregated intermediate convolutional
layers. Unsupervised metric learning is used for similarity measurement and
feature matching. An efficient re-ranking algorithm combined with k-nearest
neighborhood fuses the retrieval results from two levels of features and
further improves the retrieval performance. Extensive experiments on the widely
used CC\_WEB\_VIDEO dataset shows that the proposed approach exhibits superior
performance over the state-of-the-art.
Authors' comments: This paper is submitted to ICIP 2019
Gregor Geigle, Jonas Pfeiffer, Nils Reimers, Ivan Vulić, Iryna Gurevych
Current state-of-the-art approaches to cross-modal retrieval process text and
visual input jointly, relying on Transformer-based architectures with
cross-attention mechanisms that attend over all words and objects in an image.
While offering unmatched retrieval performance, such models: 1) are typically
pretrained from scratch and thus less scalable, 2) suffer from huge retrieval
latency and inefficiency issues, which makes them impractical in realistic
applications. To address these crucial gaps towards both improved and efficient
cross-modal retrieval, we propose a novel fine-tuning framework that turns any
pretrained text-image multi-modal model into an efficient retrieval model. The
framework is based on a cooperative retrieve-and-rerank approach which
combines: 1) twin networks (i.e., a bi-encoder) to separately encode all items
of a corpus, enabling efficient initial retrieval, and 2) a cross-encoder
component for a more nuanced (i.e., smarter) ranking of the retrieved small set
of items. We also propose to jointly fine-tune the two components with shared
weights, yielding a more parameter-efficient model. Our experiments on a series
of standard cross-modal retrieval benchmarks in monolingual, multilingual, and
zero-shot setups, demonstrate improved accuracy and huge efficiency benefits
over the state-of-the-art cross-encoders.
Authors' comments: TACL 2022
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma
Ranking has always been one of the top concerns in information retrieval research. For decades, lexical matching signal has dominated the ad-hoc retrieval process, but it also has inherent defects, such as the vocabulary mismatch problem. Recently, Dense Retrieval (DR) technique has been proposed to alleviate these limitations by capturing the deep semantic relationship between queries and documents. The training of most existing Dense Retrieval models relies on sampling negative instances from the corpus to optimize a pairwise loss function. Through investigation, we find that this kind of training strategy is biased and fails to optimize full retrieval performance effectively and efficiently. To solve this problem, we propose a Learning To Retrieve (LTRe) training technique. LTRe constructs the document index beforehand. At each training iteration, it performs full retrieval without negative sampling and then updates the query representation model parameters. Through this process, it teaches the DR model how to retrieve relevant documents from the entire corpus instead of how to rerank a potentially biased sample of documents. Experiments in both passage retrieval and document retrieval tasks show that: 1) in terms of effectiveness, LTRe significantly outperforms all competitive sparse and dense baselines. It even gains better performance than the BM25-BERT cascade system under reasonable latency constraints. 2) in terms of training efficiency, compared with the previous state-of-the-art DR method, LTRe provides more than 170x speed-up in the training process. Training with a compressed index further saves computing resources with minor performance loss.
Tao Guo, Ruida Zhou, Chao Tian
In a private information retrieval (PIR) system, the user needs to retrieve one of the possible messages from a set of storage servers, but wishes to keep the identity of requested message private from any given server. Existing efforts in this area have made it clear that the efficiency of the retrieval will be impacted significantly by the amount of the storage space allowed at the servers. In this work, we consider the tradeoff between the storage cost and the retrieval cost. We first present three fundamental results: 1) a regime-wise 2-approximate characterization of the optimal tradeoff, 2) a cyclic permutation lemma that can produce more sophisticated codes from simpler ones, and 3) a relaxed entropic linear program (LP) lower bound that has a polynomial complexity. Equipped with the cyclic permutation lemma, we then propose two novel code constructions, and by applying the lemma, obtain new storage-retrieval points. Furthermore, we derive more explicit lower bounds by utilizing only a subset of the constraints in the relaxed entropic LP in a systematic manner. Though the new upper bound and lower bound do not lead to a more precise approximate characterization in general, they are significantly tighter than the existing art.
Numan Khurshid, Talha Hanif, Mohbat Tharani, Murtaza Taj
Cross-modal retrieval aims to measure the content similarity between
different types of data. The idea has been previously applied to visual, text,
and speech data. In this paper, we present a novel cross-modal retrieval method
specifically for multi-view images, called Cross-view Image Retrieval CVIR. Our
approach aims to find a feature space as well as an embedding space in which
samples from street-view images are compared directly to satellite-view images
(and vice-versa). For this comparison, a novel deep metric learning based
solution "DeepCVIR" has been proposed. Previous cross-view image datasets are
deficient in that they (1) lack class information; (2) were originally
collected for cross-view image geolocalization task with coupled images; (3) do
not include any images from off-street locations. To train, compare, and
evaluate the performance of cross-view image retrieval, we present a new 6
class cross-view image dataset termed as CrossViewRet which comprises of images
including freeway, mountain, palace, river, ship, and stadium with 700
high-resolution dual-view images for each class. Results show that the proposed
DeepCVIR outperforms conventional matching approaches on the CVIR task for the
given dataset and would also serve as the baseline for future research.
Authors' comments: International Conference on Neural Information Processing
(ICONIP-2019)
Nils C. Geib, Matthias Zilk, Thomas Pertsch, Falk Eilenberger
We present a common pulse retrieval algorithm (COPRA) that can be used for a broad category of ultrashort laser pulse measurement schemes including frequency-resolved optical gating (FROG), interferometric FROG, dispersion scan, time domain ptychography, and pulse shaper assisted techniques such as multiphoton intrapulse interference phase scan (MIIPS). We demonstrate its properties in comprehensive numerical tests and show that it is fast, reliable and accurate in the presence of Gaussian noise. For FROG it outperforms retrieval algorithms based on generalized projections and ptychography. Furthermore, we discuss the pulse retrieval problem as a nonlinear least-squares problem and demonstrate the importance of obtaining a least-squares solution for noisy data. These results improve and extend the possibilities of numerical pulse retrieval. COPRA is faster and provides more accurate results in comparison to existing retrieval algorithms. Furthermore, it enables full pulse retrieval from measurements for which no retrieval algorithm was known before, e.g., MIIPS measurements.
Kyosuke Nishida, Itsumi Saito, Atsushi Otsuka, Hisako Asano, Junji Tomita
This study considers the task of machine reading at scale (MRS) wherein,
given a question, a system first performs the information retrieval (IR) task
of finding relevant passages in a knowledge source and then carries out the
reading comprehension (RC) task of extracting an answer span from the passages.
Previous MRS studies, in which the IR component was trained without considering
answer spans, struggled to accurately find a small number of relevant passages
from a large set of passages. In this paper, we propose a simple and effective
approach that incorporates the IR and RC tasks by using supervised multi-task
learning in order that the IR component can be trained by considering answer
spans. Experimental results on the standard benchmark, answering SQuAD
questions using the full Wikipedia as the knowledge source, showed that our
model achieved state-of-the-art performance. Moreover, we thoroughly evaluated
the individual contributions of our model components with our new Japanese
dataset and SQuAD. The results showed significant improvements in the IR task
and provided a new perspective on IR for RC: it is effective to teach which
part of the passage answers the question rather than to give only a relevance
score to the whole passage.
Authors' comments: 10 pages, 6 figure. Accepted as a full paper at CIKM 2018
Karim Banawan, Sennur Ulukus
We consider the problem of noisy private information retrieval (NPIR) from
$N$ non-communicating databases, each storing the same set of $M$ messages. In
this model, the answer strings are not returned through noiseless bit pipes,
but rather through \emph{noisy} memoryless channels. We aim at characterizing
the PIR capacity for this model as a function of the statistical information
measures of the noisy channels such as entropy and mutual information. We
derive a general upper bound for the retrieval rate in the form of a max-min
optimization. We use the achievable schemes for the PIR problem under
asymmetric traffic constraints and random coding arguments to derive a general
lower bound for the retrieval rate. The upper and lower bounds match for $M=2$
and $M=3$, for any $N$, and any noisy channel. The results imply that
separation between channel coding and retrieval is optimal except for adapting
the traffic ratio from the databases. We refer to this as \emph{almost
separation}. Next, we consider the private information retrieval problem from
multiple access channels (MAC-PIR). In MAC-PIR, the database responses reach
the user through a multiple access channel (MAC) that mixes the responses
together in a stochastic way. We show that for the additive MAC and the
conjunction/disjunction MAC, channel coding and retrieval scheme are
\emph{inseparable} unlike in NPIR. We show that the retrieval scheme depends on
the properties of the MAC, in particular on the linearity aspect. For both
cases, we provide schemes that achieve the full capacity without any loss due
to the privacy constraint, which implies that the user can exploit the nature
of the channel to improve privacy. Finally, we show that the full unconstrained
capacity is not always attainable by determining the capacity of the selection
channel.
Authors' comments: Submitted to IEEE Transactions on Information Theory, July 2018
Mohini P. Sardey, G. K. Kharate
Basic group of visual techniques such as color, shape, texture are used in
Content Based Image Retrievals (CBIR) to retrieve query image or subregion of
image to find similar images in image database. To improve query result,
relevance feedback is used many times in CBIR to help user to express their
preference and improve query results.In this paper, a new approach for image
retrieval is proposed which is based on the features such as Color Histogram,
Eigen Values and Match Point. Images from various types of database are first
identified by using edge detection techniques.Once the image is identified,
then the image is searched in the particular database, then all related images
are displayed. This will save the retrieval time. Further to retrieve the
precise query image, any of the three techniques are used and comparison is
done w.r.t. average retrieval time. Eigen value technique found to be the best
as compared with other two techniques
Authors' comments: 9 pages, 4 figures, 2 tables
C Ravindranath Chowdary, Anil Kumar Singh, Anil Nelakanti
Retrieval and content management are assumed to be mutually exclusive. In this paper we suggest that they need not be so. In the usual information retrieval scenario, some information about queries leading to a website (due to `hits' or `visits') is available to the server administrator of the concerned website. This information can used to better present the content on the website. Further, we suggest that some more information can be shared by the retrieval system with the content provider. This will enable the content provider (any website) to have a more dynamic presentation of the content that is in tune with the query trends, without violating the privacy of the querying user. The result will be a better synchronization between retrieval systems and content providers, with the purpose of improving the user's web search experience. This will also give the content provider a say in this process, given that the content provider is the one who knows much more about the content than the retrieval system. It also means that the content presentation may change in response to a query. In the end, the user will be able to find the relevant content more easily and quickly.
Robin Aly, Maria Eskevich, Roeland Ordelman, Gareth J. F. Jones
This report describes metrics for the evaluation of the effectiveness of
segment-based retrieval based on existing binary information retrieval metrics.
This metrics are described in the context of a task for the hyperlinking of
video segments. This evaluation approach re-uses existing evaluation measures
from the standard Cranfield evaluation paradigm. Our adaptation approach can in
principle be used with any kind of effectiveness measure that uses binary
relevance, and for other segment-baed retrieval tasks. In our video
hyperlinking setting, we use precision at a cut-off rank n and mean average
precision.
Authors' comments: Explanation of evaluation measures for the linking task of the
MediaEval Workshop 2013
Michael R. Line, Aaron Wolf, Xi Zhang, Heather Knutson, Joshua Kammer, Elias Ellison, Pieter Deroo, Dave Crisp et al.
Spectra of exoplanet atmospheres provide us the opportunity to improve our
understanding of these objects just as remote sensing in our own solar system
has increased our understanding of the solar system bodies. The challenge is to
quantitatively determine the range of temperatures and species abundances
allowed by the data. This challenge is often difficult given the low
information content of most exoplanet spectra which commonly leads to
degeneracies in the interpretation. A variety of temperature and abundance
retrieval approaches have been applied to exoplanet spectra, but no previous
investigations have sought to compare these approaches. In this investigation
we compare three different retrieval methods: Optimal Estimation, Differential
Evolution Markov Chain Monte Carlo, and Bootstrap Monte Carlo. We call our
suite of retrieval algorithms the Caltech Inverse Modeling and Retrieval
Algorithms (CHIMERA). We discuss what we can expect in terms of uncertainties
in abundances and temperatures given current observations as well as potential
future observations and what conclusions can be drawn given those
uncertainties. In general we find that the three approaches agree for high
quality spectra expected to come from potential future spaceborne missions, but
disagree for low quality spectra representative of current observations. We
also show that the Gaussian posterior probability distribution assumption made
in the Optimal Estimation approach is valid for high quality spectral data. We
also discuss the implications of our models for the inferred C to O ratios of
exoplanetary atmospheres, which of course are important for understanding
formation environments. More specifically we show that in the observational
limit of a few photometric points, the retrieved C/O is biased towards values
near solar and near one simply due to the assumption of uninformative priors.
Authors' comments: 27 pages, 13 figures
Konstantin Avrachenkov, Evsey Morozov
We consider a GI/G/c/K-type retrial queueing system with constant retrial rate. The system consists of a primary queue and an orbit queue. The primary queue has $c$ identical servers and can accommodate the maximal number of $K$ jobs. If a newly arriving job finds the full primary queue, it joins the orbit. The original primary jobs arrive to the system according to a renewal process. The jobs have general i.i.d. service times. A job in front of the orbit queue retries to enter the primary queue after an exponentially distributed time independent of the orbit queue length. Telephone exchange systems, Medium Access Protocols and short TCP transfers are just some applications of the proposed queueing system. For this system we establish minimal sufficient stability conditions. Our model is very general. In addition, to the known particular cases (e.g., M/G/1/1 or M/M/c/c systems), the proposed model covers as particular cases the deterministic service model and the Erlang model with constant retrial rate. The latter particular cases have not been considered in the past. The obtained stability conditions have clear probabilistic interpretation.
Tewfik Kernane
We study the stability of single server retrial queues under general distribution for retrial times and stationary ergodic service times, for three main retrial policies studied in the literature: classical linear, constant and control policies. The approach used is the renovating events approach to obtain sufficient stability conditions by strong coupling convergence of the process modeling the dynamics of the system to a unique stationary ergodic regime. We also obtain instability conditions by convergence in distribution to improper limiting sequences.