Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Binxing Jiao, Daxin Jiang
A ranker plays an indispensable role in the de facto 'retrieval & rerank'
pipeline, but its training still lags behind -- learning from moderate
negatives and/or serving as an auxiliary module for a retriever. In this work,
we first identify two major barriers to a robust ranker, i.e., inherent label
noise caused by a well-trained retriever and non-ideal negatives sampled for a
highly capable ranker. We therefore propose using multiple retrievers as negative
generators to improve the ranker's robustness, where i) involving extensive
out-of-distribution label noise renders the ranker robust against each noise
distribution, and ii) diverse hard negatives from a joint distribution are
relatively close to the ranker's negative distribution, leading to more
challenging and thus more effective training. To evaluate our robust ranker (dubbed
R$^2$anker), we conduct experiments in various settings on the popular passage
retrieval benchmark, including BM25-reranking, full-ranking, retriever
distillation, etc. The empirical results verify the new state-of-the-art
effectiveness of our model.
Authors' comments: 11 pages of main content, 4 tables, 3 figures
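As a rough illustration of using several retrievers as negative generators, the sketch below pools the top-ranked non-positive passages from each retriever into one candidate set. The function name `pool_negatives`, the retriever names, and the passage ids are hypothetical, and the abstract does not say how the pooled negatives are sampled during training, so this is only a minimal sketch of the pooling step.

```python
def pool_negatives(retriever_results, positives, top_k=3):
    """Merge the top_k results of every retriever into one deduplicated
    list of hard-negative candidates, skipping known positives."""
    pooled = []
    seen = set()
    for results in retriever_results.values():
        for doc in results[:top_k]:
            if doc not in positives and doc not in seen:
                seen.add(doc)
                pooled.append(doc)
    return pooled


# Hypothetical ranked lists from two retrievers for the same query.
retriever_results = {
    "bm25": ["p1", "p7", "p3"],
    "dense": ["p3", "p1", "p9"],
}
negatives = pool_negatives(retriever_results, positives={"p1"})
```

Because the pool draws on more than one retriever, it contains negatives that no single retriever would have surfaced on its own.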
Peter C. Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Théophane Weber, Timothy Lillicrap
Effective decision making involves flexibly relating past experiences and
relevant contextual information to a novel situation. In deep reinforcement
learning (RL), the dominant paradigm is for an agent to amortise information
that helps decision making into its network weights via gradient descent on
training losses. Here, we pursue an alternative approach in which agents can
utilise large-scale context sensitive database lookups to support their
parametric computations. This allows agents to directly learn in an end-to-end
manner to utilise relevant information to inform their outputs. In addition,
new information can be attended to by the agent, without retraining, by simply
augmenting the retrieval dataset. We study this approach for offline RL in 9x9
Go, a challenging game for which the vast combinatorial state space privileges
generalisation over direct matching to past experiences. We leverage fast,
approximate nearest neighbor techniques in order to retrieve relevant data from
a set of tens of millions of expert demonstration states. Attending to this
information provides a significant boost to prediction accuracy and game-play
performance over simply using these demonstrations as training trajectories,
providing a compelling demonstration of the value of large-scale retrieval in
offline RL agents.
Authors' comments: Thirty-sixth Annual Conference on Neural Information Processing
Systems (NeurIPS 2022), 16 pages
Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia et al.
Current state-of-the-art document retrieval solutions mainly follow an
index-retrieve paradigm, where the index is hard to optimize directly for
the final retrieval target. In this paper, we aim to show that an end-to-end
deep neural network unifying training and indexing stages can significantly
improve the recall performance of traditional methods. To this end, we propose
Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates
relevant document identifiers directly for a designated query. To optimize the
recall performance of NCI, we invent a prefix-aware weight-adaptive decoder
architecture, and leverage tailored techniques including query generation,
semantic document identifiers, and consistency-based regularization. Empirical
studies demonstrated the superiority of NCI on two commonly used academic
benchmarks, achieving +21.4% and +16.8% relative enhancement for Recall@1 on
NQ320k dataset and R-Precision on TriviaQA dataset, respectively, compared to
the best baseline method.
Authors' comments: 19 pages, 6 figures, accepted by NeurIPS 2022
Devansh Gupta, Aditya Saini, Drishti Bhasin, Sarthak Bhagat, Shagun Uppal, Rishi Raj Jain, Ponnurangam Kumaraguru, Rajiv Ratn Shah
Retrieving facial images from attributes plays a vital role in various systems such as face recognition and suspect identification. Compared to other image retrieval tasks, facial image retrieval is more challenging due to the high subjectivity involved in describing a person's facial features. Existing methods do so by comparing specific characteristics from the user's mental image against the suggested images via high-level supervision such as using natural language. In contrast, we propose a method that uses a relatively simpler form of binary supervision by utilizing the user's feedback to label images as either similar or dissimilar to the target image. Such supervision enables us to exploit the contrastive learning paradigm for encapsulating each user's personalized notion of similarity. For this, we propose a novel loss function optimized online via user feedback. We validate the efficacy of our proposed approach using a carefully designed testbed to simulate user feedback and a large-scale user study. Our experiments demonstrate that our method iteratively improves personalization, leading to faster convergence and enhanced recommendation relevance, thereby improving user satisfaction. Our proposed framework is also equipped with a user-friendly web interface with a real-time experience for facial image retrieval.
Philippe Jaming, Martin Rathmair
In this paper we consider the question of finding an as small as possible family of operators $(T_j)_{j\in J}$ on $L^2(R)$ that does phase retrieval: every $\varphi$ is uniquely determined (up to a constant phase factor) by the phaseless data $(|T_j\varphi|)_{j\in J}$. This problem arises in various fields of applied sciences where usually the operators obey further restrictions. Of particular interest here are so-called {\em coded diffraction patterns} where the operators are of the form $T_j\varphi=\mathcal{F}m_j\varphi$, $\mathcal{F}$ the Fourier transform and $m_j\in L^\infty(R)$ are "masks". Here we explicitly construct three real-valued masks $m_1,m_2,m_3\in L^\infty(R)$ so that the associated coded diffraction patterns do phase retrieval. This implies that the three self-adjoint operators $T_j\varphi=\mathcal{F}[m_j\mathcal{F}^{-1}\varphi]$ also do phase retrieval. The proof uses complex analysis. We then show that some natural analogues of these operators in the finite dimensional setting do not always lead to the same uniqueness result due to an undersampling effect.
Avinash Madasu, Junier Oliva, Gedas Bertasius
The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query has ambiguities, which would lead to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, where the user refines retrieved results by answering questions generated by an AI agent. Our novel multimodal question generator learns to ask questions that maximize the subsequent video retrieval performance using (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, to generate questions that incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to generate maximally informative questions, we propose an Information-Guided Supervision (IGS), which guides the question generator to ask questions that would boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our proposed approach generalizes to real-world settings that involve interactions with real humans, thus demonstrating the robustness and generality of our framework.
Sayan Nath, Nikhil Nayak
Interaction with images has increased in recent years. Image similarity search involves fetching images that look similar to a given reference image. The goal is to determine whether an image used as a query can retrieve similar pictures. We use the BigTransfer (BiT) model, which is itself a state-of-the-art model. BiT is essentially a ResNet pre-trained on larger datasets such as ImageNet and ImageNet-21k, with additional modifications. Using the fine-tuned, pre-trained convolutional neural network, we extract key features and fit a K-Nearest Neighbor model on them to obtain the nearest neighbors. Our model is intended for finding similar images that are hard to find through text queries, within a low inference time. We analyse the benchmark of our model based on this application.
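The nearest-neighbor lookup over extracted features can be sketched as follows. This is a minimal brute-force version in plain Python. The three-dimensional vectors are made-up stand-ins for the embeddings a BiT backbone would produce, and the names `euclidean` and `nearest_neighbors` are hypothetical, not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbors(query, gallery, k=2):
    """Return the names of the k gallery images closest to the query."""
    ranked = sorted(gallery.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Toy feature vectors standing in for CNN embeddings (assumed values).
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.95],
}
query = [0.9, 0.1, 0.05]
similar = nearest_neighbors(query, gallery, k=2)
```

A brute-force scan like this is linear in the gallery size; a real system aiming for low inference time would likely use an indexed or approximate search instead.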
Jihyuk Kim, Minsso Kim, Seung-won Hwang
Deep learning for Information Retrieval (IR) requires a large amount of
high-quality query-document relevance labels, but such labels are inherently
sparse. Label smoothing redistributes some observed probability mass over
unobserved instances, often uniformly, uninformed of the true distribution. In
contrast, we propose knowledge distillation for informed labeling, without
incurring high computation overheads at evaluation time. Our contribution is
designing a simple but efficient teacher model which utilizes collective
knowledge, to outperform state-of-the-art models distilled from a more complex
teacher model. Specifically, we train up to 8x faster than the state-of-the-art
teacher, while distilling the rankings better. Our code is publicly available
at https://github.com/jihyukkim-nlp/CollectiveKD.
Authors' comments: NAACL 2022
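For reference, the uniform label smoothing that the abstract contrasts against can be sketched as below. This uses the common formulation in which a fraction `epsilon` of the probability mass is spread evenly over all candidates, including the observed one. The function name and the value of `epsilon` are illustrative assumptions, not taken from the paper.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Uniform label smoothing: keep (1 - epsilon) of the observed mass and
    spread epsilon evenly over all candidates."""
    n = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / n for y in one_hot]


# One relevant document among four candidates.
labels = [1.0, 0.0, 0.0, 0.0]
smoothed = smooth_labels(labels, epsilon=0.1)
```

Every unobserved candidate receives the same mass regardless of how relevant it really is, which is the "uninformed" behaviour that distilled labels from a teacher are meant to replace.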
Chengyuan Qian, Ruida Zhou, Chao Tian, Tie Liu
We study the problem of weakly private information retrieval (W-PIR), where a
user wishes to retrieve a desired message from $N$ non-colluding servers in a
way that the privacy leakage regarding the desired message's identity is less
than or equal to a threshold. We propose a new code construction which
significantly improves upon the best known result in the literature, based on
the following critical observation. In previous constructions, for the extreme
case of minimum download, the retrieval pattern is to download the message
directly from $N-1$ servers; however this causes leakage to all these $N-1$
servers, and a better retrieval pattern for this extreme case is to download
the message directly from a single server. The proposed code construction
allows a natural transition to such a pattern, and for both the maximal leakage
metric and the mutual information leakage metric, significant improvements can
be obtained. We provide explicit solutions, in contrast to a previous work by
Lin et al., where only numerical solutions were obtained.
Authors' comments: 6 pages, 1 figure, ISIT 2022 accepted
Michael Christ, Ben Pineau, Mitchell A. Taylor
Examples are constructed of infinite-dimensional subspaces $V\subset L^2(\mu)$ with the property that for any $f,g\in V$, if $|f|$ is approximately equal to $|g|$ with respect to the $L^2$ norm, then there exists a unimodular scalar $z$ such that $f$ is approximately equal to $zg$.
Ziyue Wang, Aozhu Chen, Fan Hu, Xirong Li
Negation is a common linguistic skill that allows humans to express what we do
NOT want. Naturally, one might expect video retrieval to support
natural-language queries with negation, e.g., finding shots of kids sitting on
the floor and not playing with a dog. However, the state-of-the-art deep
learning based video retrieval models lack such ability, as they are typically
trained on video description datasets such as MSR-VTT and VATEX that lack
negated descriptions. Their retrieved results basically ignore the negator in
the sample query, incorrectly returning videos showing kids playing with a dog.
This paper presents the first study on learning to understand negation in video
retrieval and makes the following contributions. By re-purposing two existing
datasets (MSR-VTT and VATEX), we propose a new evaluation protocol for video
retrieval with negation. We propose a learning based method for training a
negation-aware video retrieval model. The key idea is to first construct a soft
negative caption for a specific training video by partially negating its
original caption, and then compute a bidirectionally constrained loss on the
triplet. This auxiliary loss is added, with a weight, to a standard retrieval loss.
Experiments on the re-purposed benchmarks show that re-training the CLIP
(Contrastive Language-Image Pre-Training) model by the proposed method clearly
improves its ability to handle queries with negation. In addition, the model
performance on the original benchmarks is also improved.
Authors' comments: Accepted by ACMMM2022
Hansi Zeng, Hamed Zamani, Vishwa Vinay
Recent work has shown that more effective dense retrieval models can be
obtained by distilling ranking knowledge from an existing base re-ranking
model. In this paper, we propose a generic curriculum learning based
optimization framework called CL-DRD that controls the difficulty level of
training data produced by the re-ranking (teacher) model. CL-DRD iteratively
optimizes the dense retrieval (student) model by increasing the difficulty of
the knowledge distillation data made available to it. In more detail, we
initially provide the student model coarse-grained preference pairs between
documents in the teacher's ranking and progressively move towards finer-grained
pairwise document ordering requirements. In our experiments, we apply a simple
implementation of the CL-DRD framework to enhance two state-of-the-art dense
retrieval models. Experiments on three public passage retrieval datasets
demonstrate the effectiveness of our proposed framework.
Authors' comments: Accepted to SIGIR 2022
Antonio Mallia, Joel Mackenzie, Torsten Suel, Nicola Tonellotto
Neural information retrieval architectures based on transformers such as BERT
are able to significantly improve system effectiveness over traditional sparse
models such as BM25. Though highly effective, these neural approaches are very
expensive to run, making them difficult to deploy under strict latency
constraints. To address this limitation, recent studies have proposed new
families of learned sparse models that try to match the effectiveness of
learned dense models, while leveraging the traditional inverted index data
structure for efficiency. Current learned sparse models learn the weights of
terms in documents and, sometimes, queries; however, they exploit different
vocabulary structures, document expansion techniques, and query expansion
strategies, which can make them slower than traditional sparse models such as
BM25. In this work, we propose a novel indexing and query processing technique
that exploits a traditional sparse model's "guidance" to efficiently traverse
the index, allowing the more effective learned model to execute fewer scoring
operations. Our experiments show that our guided processing heuristic is able
to boost the efficiency of the underlying learned sparse model by a factor of
four without any measurable loss of effectiveness.
Authors' comments: Accepted at SIGIR 2022
Fernando Diaz, Andres Ferraro
Offline evaluation of information retrieval and recommendation has
traditionally focused on distilling the quality of a ranking into a scalar
metric such as average precision or normalized discounted cumulative gain. We
can use this metric to compare the performance of multiple systems for the same
request. Although evaluation metrics provide a convenient summary of system
performance, they also collapse subtle differences across users into a single
number and can carry assumptions about user behavior and utility not supported
across retrieval scenarios. We propose recall-paired preference (RPP), a
metric-free evaluation method based on directly computing a preference between
ranked lists. RPP simulates multiple user subpopulations per query and compares
systems across these pseudo-populations. Our results across multiple search and
recommendation tasks demonstrate that RPP substantially improves discriminative
power while correlating well with existing metrics and being equally robust to
incomplete data.
Authors' comments: to appear at SIGIR 2022
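One way to picture a recall-level preference between two ranked lists is sketched below. It assumes that each pseudo-population corresponds to a recall level, i.e. users who want their i-th relevant item, and that a system is preferred at that level if it reaches the i-th relevant item at an earlier rank. The helper names and the handling of unretrieved items (rank set to infinity, equal ranks counted as ties) are assumptions made for illustration; the abstract does not spell these details out.

```python
def recall_positions(ranking, relevant):
    """1-based ranks at which successive relevant items appear; relevant
    items that are never retrieved are assigned rank infinity.
    Assumes the ranking contains no duplicate items."""
    positions = [i for i, doc in enumerate(ranking, start=1) if doc in relevant]
    missing = len(relevant) - len(positions)
    return positions + [float("inf")] * missing


def rpp(ranking_a, ranking_b, relevant):
    """Preference in [-1, 1]: positive if A reaches each recall level
    earlier than B, negative if later, averaged over recall levels."""
    pos_a = recall_positions(ranking_a, relevant)
    pos_b = recall_positions(ranking_b, relevant)
    score = 0
    for t_a, t_b in zip(pos_a, pos_b):
        if t_a < t_b:
            score += 1
        elif t_a > t_b:
            score -= 1
    return score / len(relevant)


relevant = {"d1", "d2"}
system_a = ["d1", "d3", "d2"]
system_b = ["d3", "d1", "d4"]
preference = rpp(system_a, system_b, relevant)
```

Here `system_a` reaches both recall levels earlier than `system_b`, so the preference is 1.0, and swapping the arguments flips the sign.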
Meng Huang, Zhiqiang Xu
The recovery of a signal from the intensity measurements with some entries
being known in advance is termed as {\em affine phase retrieval}. In this
paper, we prove that a natural least squares formulation for the affine phase
retrieval is strongly convex on the entire space under some mild conditions,
provided the measurements are complex Gaussian random vectors and the
measurement number $m \gtrsim d \log d$ where $d$ is the dimension of signals.
Based on the result, we prove that the simple gradient descent method for the
affine phase retrieval converges linearly to the target solution with high
probability from an arbitrary initial point. These results show an essential
difference between the affine phase retrieval and the classical phase
retrieval, where the least squares formulations for the classical phase
retrieval are non-convex.
Authors' comments: 32 pages
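The least-squares formulation and gradient descent can be illustrated with a small sketch. It makes two simplifications that are assumptions of this example rather than the paper's setting: the data are real-valued instead of complex Gaussian, and the step size is chosen by halving so that the loss never increases. The measurements are taken to be $y_j = |\langle a_j, x\rangle + b_j|^2$ with known offsets $b_j$, and the objective is $(1/m)\sum_j (|\langle a_j, z\rangle + b_j|^2 - y_j)^2$.

```python
import random


def residuals(z, A, b, y):
    """Pairs (u_j, r_j) with u_j = <a_j, z> + b_j and r_j = u_j^2 - y_j."""
    out = []
    for a_j, b_j, y_j in zip(A, b, y):
        u = sum(a * zi for a, zi in zip(a_j, z)) + b_j
        out.append((u, u * u - y_j))
    return out


def loss(z, A, b, y):
    """Least-squares objective (1/m) * sum_j r_j^2."""
    return sum(r * r for _, r in residuals(z, A, b, y)) / len(A)


def grad(z, A, b, y):
    """Gradient of the objective for real-valued data: (4/m) sum_j r_j u_j a_j."""
    g = [0.0] * len(z)
    for a_j, (u, r) in zip(A, residuals(z, A, b, y)):
        c = 4.0 * r * u / len(A)
        for k, a in enumerate(a_j):
            g[k] += c * a
    return g


def gradient_descent(z0, A, b, y, step=0.1, iters=200):
    """Gradient descent; the step is halved until the loss decreases."""
    z, f = list(z0), loss(z0, A, b, y)
    for _ in range(iters):
        g = grad(z, A, b, y)
        t = step
        while t > 1e-12:
            cand = [zi - t * gi for zi, gi in zip(z, g)]
            f_cand = loss(cand, A, b, y)
            if f_cand < f:
                z, f = cand, f_cand
                break
            t /= 2.0
        else:
            break  # no decreasing step found, so stop early
    return z


rng = random.Random(0)
d, m = 3, 30
x_true = [rng.gauss(0, 1) for _ in range(d)]
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
b = [rng.gauss(0, 1) for _ in range(m)]
y = [(sum(a * xi for a, xi in zip(a_j, x_true)) + b_j) ** 2 for a_j, b_j in zip(A, b)]
z0 = [rng.gauss(0, 1) for _ in range(d)]  # arbitrary starting point
z_hat = gradient_descent(z0, A, b, y)
```

Comparing `loss(z_hat, A, b, y)` with `loss(z0, A, b, y)` shows how far the iterates have moved toward the target. The convergence guarantee in the abstract applies to the complex Gaussian setting under its stated conditions, not necessarily to this real-valued toy.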
Xun Wang, Bingqing Ke, Xuanping Li, Fangyu Liu, Mingyu Zhang, Xiao Liang, Qiushi Xiao, Cheng Luo et al.
Video search has become the main routine for users to discover videos
relevant to a text query on large short-video sharing platforms. While
training a query-video bi-encoder model using online search logs, we identify a
modality bias phenomenon: the video encoder relies almost entirely on text
matching, neglecting other modalities of the videos such as vision and audio. This
modality imbalance results from a) modality gap: the relevance between a query
and a video text is much easier to learn as the query is also a piece of text,
with the same modality as the video text; b) data bias: most training samples
can be solved solely by text matching. Here we share our practices to improve
the first retrieval stage including our solution for the modality imbalance
issue. We propose MBVR (short for Modality Balanced Video Retrieval) with two
key components: manually generated modality-shuffled (MS) samples and a dynamic
margin (DM) based on visual relevance. They can encourage the video encoder to
pay balanced attention to each modality. Through extensive experiments on a
real world dataset, we show empirically that our method is both effective and
efficient in solving the modality bias problem. We have also deployed our MBVR in a
large video platform and observed a statistically significant boost over a highly
optimized baseline in an A/B test and manual GSB evaluations.
Authors' comments: Accepted by SIGIR-2022, short paper
Bill Yuchen Lin, Kangmin Tan, Chris Miller, Beiwen Tian, Xiang Ren
Humans can perform unseen tasks by recalling relevant skills acquired
previously and then generalizing them to the target tasks, even if there is no
supervision at all. In this paper, we aim to improve this kind of cross-task
generalization ability of massive multi-task language models, such as T0 and
FLAN, in an unsupervised setting. We propose a retrieval-augmentation method
named ReCross that takes a few unlabelled examples as queries to retrieve a
small subset of upstream data and uses them to update the multi-task model for
better generalization. ReCross is a straightforward yet effective retrieval
method that combines both efficient dense retrieval and effective pair-wise
reranking. Our results and analysis show that it significantly outperforms both
non-retrieval methods and other baseline methods.
Authors' comments: Accepted to NeurIPS 2022. Website: https://inklab.usc.edu/ReCross/
Andrei Neculai, Yanbei Chen, Zeynep Akata
Existing works in image retrieval often consider retrieving images with one
or two query inputs, which do not generalize to multiple queries. In this work,
we investigate a more challenging scenario for composing multiple multimodal
queries in image retrieval. Given an arbitrary number of query images and (or)
texts, our goal is to retrieve target images containing the semantic concepts
specified in multiple multimodal queries. To learn an informative embedding
that can flexibly encode the semantics of various queries, we propose a novel
multimodal probabilistic composer (MPC). Specifically, we model input images
and texts as probabilistic embeddings, which can be further composed by a
probabilistic composition rule to facilitate image retrieval with multiple
multimodal queries. We propose a new benchmark based on the MS-COCO dataset and
evaluate our model on various setups that compose multiple images and (or) text
queries for multimodal image retrieval. Without bells and whistles, we show
that our probabilistic model formulation significantly outperforms existing
related methods on multimodal image retrieval while generalizing well to queries
with different numbers of inputs given in arbitrary visual and (or) textual
modalities. Code is available here: https://github.com/andreineculai/MPC.
Authors' comments: CVPR2022 MULA workshop
Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, Xueqi Cheng
Fact verification (FV) is a challenging task which aims to verify a claim
using multiple evidential sentences from trustworthy corpora, e.g., Wikipedia.
Most existing approaches follow a three-step pipeline framework, including
document retrieval, sentence retrieval and claim verification. High-quality
evidences provided by the first two steps are the foundation of the effective
reasoning in the last step. Despite being important, high-quality evidences are
rarely studied by existing works for FV, which often adopt the off-the-shelf
models to retrieve relevant documents and sentences in an
"index-retrieve-then-rank" fashion. This classical approach has clear drawbacks
as follows: i) a large document index as well as a complicated search process
is required, leading to considerable memory and computational overhead; ii)
independent scoring paradigms fail to capture the interactions among documents
and sentences in ranking; iii) a fixed number of sentences are selected to form
the final evidence set. In this work, we propose GERE, the first system that
retrieves evidences in a generative fashion, i.e., generating the document
titles as well as evidence sentence identifiers. This enables us to mitigate
the aforementioned technical issues since: i) the memory and computational cost
is greatly reduced because the document index is eliminated and the heavy
ranking process is replaced by a light generative process; ii) the dependency
between documents and that between sentences could be captured via a sequential
generation process; iii) the generative formulation allows us to dynamically
select a precise set of relevant evidences for each claim. The experimental
results on the FEVER dataset show that GERE achieves significant improvements
over the state-of-the-art baselines, with both time-efficiency and
memory-efficiency.
Authors' comments: Accepted by SIGIR 2022
Paulina Lewandowska, Ryszard Kukulski, Łukasz Pawela, Zbigniew Puchała
This work examines the problem of learning an unknown von Neumann measurement
of dimension $d$ from a finite number of copies. To obtain a faithful
approximation of the given measurement we are allowed to use it $N$ times. Our
main goal is to estimate the asymptotic behavior of the maximum value of the
average fidelity function $F_d$ for a general $N \rightarrow 1$ learning
scheme. We show that $F_d = 1 - \Theta\left(\frac{1}{N^2}\right)$ for arbitrary
but fixed dimension $d$. In addition to that, we compared various learning
schemes for $d=2$. We observed that the learning scheme based on deterministic
port-based teleportation is asymptotically optimal but performs poorly for low
$N$. In particular, we discovered a parallel learning scheme, which despite its
lack of asymptotic optimality, provides a high value of the fidelity for low
values of $N$ and uses only two-qubit entangled memory states.
Authors' comments: 19 pages, 9 figures