Yeon Seonwoo, Juhee Son, Jiho Jin, Sang-Woo Lee, Ji-Hoon Kim, Jung-Woo Ha, Alice Oh
The retriever-reader pipeline has shown promising performance in open-domain
QA but suffers from a very slow inference speed. Recently proposed question
retrieval models tackle this problem by indexing question-answer pairs and
searching for similar questions. These models have shown a significant increase
in inference speed, but at the cost of lower QA performance compared to the
retriever-reader models. This paper proposes a two-step question retrieval
model, SQuID (Sequential Question-Indexed Dense retrieval), and distant
supervision for training. SQuID uses two bi-encoders for question retrieval.
The first-step retriever selects top-k similar questions, and the second-step
retriever finds the most similar question from the top-k questions. We evaluate
the performance and the computational efficiency of SQuID. The results show
that SQuID significantly increases the performance of existing question
retrieval models with a negligible loss in inference speed.
Authors' comments: ACL2022-Findings
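The two-step retrieval described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of cosine similarity, and the toy embeddings are all assumptions, and the two "encoders" are represented only by their precomputed embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def two_step_retrieve(q_first, q_second, index_first, index_second, answers, k=2):
    """Return (index, answer) of the best indexed question after two steps.

    index_first and index_second hold embeddings of the same indexed
    questions, produced by two different (hypothetical) bi-encoders.
    """
    # Step 1: shortlist the top-k indexed questions with the first encoder.
    ranked = sorted(range(len(index_first)),
                    key=lambda i: cosine(q_first, index_first[i]),
                    reverse=True)
    shortlist = ranked[:k]
    # Step 2: pick the single best question from the shortlist
    # using the second encoder's embeddings.
    best = max(shortlist, key=lambda i: cosine(q_second, index_second[i]))
    return best, answers[best]

# Toy data (hypothetical): three indexed question-answer pairs.
index_first = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
index_second = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
answers = ["A0", "A1", "A2"]
result = two_step_retrieve([1.0, 0.0], [1.0, 0.0],
                           index_first, index_second, answers, k=2)
```

In the toy data, the third question scores highly under the second encoder but never reaches it, because the first step has already filtered it out; this is the cost-saving idea behind restricting the second step to k candidates.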
Carla Teixeira Lopes
This report provides an overview of the field of Information Retrieval (IR)
in healthcare. It does not aim to introduce general concepts and theories of IR
but to present and describe specific aspects of Health Information Retrieval
(HIR). After a brief introduction to the broader field of IR, the
significance of HIR at current times is discussed. Specific characteristics of
Health Information, its classification and the main existing representations
for health concepts are described together with the main products and services
in the area (e.g., databases of health bibliographic content, health-specific
search engines and others). Recent research work is discussed and the most
active researchers, projects and research groups are also presented. Main
organizations and journals are also identified.
Authors' comments: 38 pages, 0 figures
Kennard Ng, Ser-Nam Lim, Gim Hee Lee
Content-based Video Retrieval (CBVR) is used on media-sharing platforms for applications such as video recommendation and filtering. To manage databases that scale to billions of videos, video-level approaches that use fixed-size embeddings are preferred due to their efficiency. In this paper, we introduce Video Region Attention Graph Networks (VRAG), which improve the state of the art of video-level methods. We represent videos at a finer granularity via region-level features and encode video spatio-temporal dynamics through region-level relations. Our VRAG captures the relationships between regions based on their semantic content via self-attention and the permutation-invariant aggregation of Graph Convolution. In addition, we show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval. We evaluate our VRAG over several video retrieval tasks and achieve a new state of the art for video-level retrieval. Furthermore, our shot-level VRAG shows higher retrieval precision than other existing video-level methods, and closer performance to frame-level methods at faster evaluation speeds. Finally, our code will be made publicly available.
Danilo Ribeiro, Shen Wang, Xiaofei Ma, Rui Dong, Xiaokai Wei, Henry Zhu, Xinchi Chen, Zhiheng Huang et al.
Large language models have achieved high performance on various question
answering (QA) benchmarks, but the explainability of their output remains
elusive. Structured explanations, called entailment trees, were recently
suggested as a way to explain and inspect a QA system's answer. In order to
better generate such entailment trees, we propose an architecture called
Iterative Retrieval-Generation Reasoner (IRGR). Our model is able to explain a
given hypothesis by systematically generating a step-by-step explanation from
textual premises. The IRGR model iteratively searches for suitable premises,
constructing a single entailment step at a time. Contrary to previous
approaches, our method combines generation steps and retrieval of premises,
allowing the model to leverage intermediate conclusions, and mitigating the
input size limit of baseline encoder-decoder models. We conduct experiments
using the EntailmentBank dataset, where we outperform existing benchmarks on
premise retrieval and entailment tree generation, with around 300% gain in
overall correctness.
Authors' comments: published in NAACL 2022
Yuantong Li, Xiaokai Wei, Zijian Wang, Shen Wang, Parminder Bhatia, Xiaofei Ma, Andrew Arnold
People frequently interact with information retrieval (IR) systems; however,
IR models exhibit biases and discrimination towards various demographics. The
in-processing fair ranking methods provide a trade-off between accuracy and
fairness through adding a fairness-related regularization term in the loss
function. However, there haven't been intuitive objective functions that depend
on the click probability and user engagement to directly optimize towards this.
In this work, we propose the In-Batch Balancing Regularization (IBBR) to
mitigate the ranking disparity among subgroups. In particular, we develop a
differentiable normed Pairwise Ranking Fairness (nPRF) and leverage
the T-statistics on top of nPRF over subgroups as a regularization to improve
fairness. Empirical results with the BERT-based neural rankers on the MS MARCO
Passage Retrieval dataset with the human-annotated non-gendered queries
benchmark \citep{rekabsaz2020neural} show that our IBBR method with nPRF
achieves significantly less bias with minimal degradation in ranking
performance compared with the baseline.
Authors' comments: 9 pages, 1 figure, and 3 tables. A version appears in the Proceedings
of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP),
2022
Shufan Wang, Fangyuan Xu, Laure Thompson, Eunsol Choi, Mohit Iyyer
Exemplification is a process by which writers explain or clarify a concept by
providing an example. While common in all forms of writing, exemplification is
particularly useful in the task of long-form question answering (LFQA), where a
complicated answer can be made more understandable through simple examples. In
this paper, we provide the first computational study of exemplification in QA,
performing a fine-grained annotation of different types of examples (e.g.,
hypotheticals, anecdotes) in three corpora. We show that not only do
state-of-the-art LFQA models struggle to generate relevant examples, but also
that standard evaluation metrics such as ROUGE are insufficient to judge
exemplification quality. We propose to treat exemplification as a
retrieval problem in which a partially written answer is used to query a
large set of human-written examples extracted from a corpus. Our approach
allows reliable ranking-type automatic metrics that correlate well with
human evaluation. A human evaluation shows that our model's retrieved examples
are more relevant than examples generated from a state-of-the-art LFQA model.
Authors' comments: 2022 Annual Conference of the North American Chapter of the
Association for Computational Linguistics
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation that outperforms mean-pooling of the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over all prior temporal modelling attempts and mean-pooling. In doing so, we provide an improved baseline for others to compare to and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
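A weighted mean of frame embeddings via query-scoring might look like the sketch below. The abstract does not specify how scores become weights; the softmax over dot-product scores and the `temperature` parameter are assumptions made for illustration, as are all names.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def query_scored_video_embedding(query, frames, temperature=1.0):
    """Pool frame embeddings into one video embedding.

    Each frame is weighted by a softmax over its similarity to the query
    (an assumed weighting scheme). Returns (pooled_embedding, weights).
    """
    scores = [dot(query, f) / temperature for f in frames]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

# Toy example (hypothetical embeddings): the first frame matches the query.
frames = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = query_scored_video_embedding([1.0, 0.0], frames)
# A query with no preference reduces to plain mean-pooling.
mean_pooled, uniform = query_scored_video_embedding([0.0, 0.0], frames)
```

When all frames score equally, the weights are uniform and the result coincides with mean-pooling, so this scheme can be read as a query-conditioned generalization of that baseline.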
Zhusheng Wang, Sennur Ulukus
We introduce the problem of random symmetric private information retrieval (RSPIR). In canonical PIR, a user downloads a message out of $K$ messages from $N$ non-colluding and replicated databases in such a way that no database can know which message the user has downloaded (user privacy). In SPIR, the privacy is symmetric, in that not only can the databases not know which message the user has downloaded, but the user itself also cannot learn anything beyond the particular message it has downloaded (database privacy). In RSPIR, different from SPIR, the user does not have an input to the databases, i.e., the user does not pick a specific message to download, but is instead content with any one of the messages. In RSPIR, the databases need to send symbols to the user in such a way that the user is guaranteed to download a message correctly (random reliability), the databases do not know which message the user has received (user privacy), and the user does not learn anything further than the one message it has received (database privacy). This is the digital version of a blind box, also known as gachapon, which implements the above specified setting with physical objects for entertainment. This is also the blind version of $1$-out-of-$K$ oblivious transfer (OT), an important cryptographic primitive. We study the information-theoretic capacity of RSPIR for the case of $N=2$ databases. We determine its exact capacity for the cases of $K = 2, 3, 4$ messages. While we provide a general achievable scheme that is applicable to any number of messages, the capacity for $K\geq 5$ remains open.
Tal Peer, Simon Welker, Timo Gerkmann
Phase retrieval is a problem encountered not only in speech and audio
processing, but in many other fields such as optics. Iterative algorithms based
on non-convex set projections are effective and frequently used for retrieving
the phase when only STFT magnitudes are available. While the basic Griffin-Lim
algorithm and its variants have been the prevalent method for decades, more
recent advances, e.g. in optics, raise the question: Can we do better than
Griffin-Lim for speech signals, using the same principle of iterative
projection?
In this paper we compare the classical algorithms in the speech domain with
two modern methods from optics with respect to reconstruction quality and
convergence rate. Based on this study, we propose to combine Griffin-Lim with
the Difference Map algorithm in a hybrid approach which shows superior results,
in terms of both convergence and quality of the final reconstruction.
Authors' comments: Submitted to IWAENC 2022
Quang Cao, Rinaldo Gagiano, Duy Huynh, Xun Yi, Son Hoang Dau, Phuc Lu Le, Quang-Hung Luu, Emanuele Viterbo et al.
Motivated by a practical scenario in blockchains in which a client, who
possesses a transaction, wishes to privately verify that the transaction
actually belongs to a block, we investigate the problem of private retrieval of
Merkle proofs (i.e. proofs of inclusion/membership) in a Merkle tree. In this
setting, one or more servers store the nodes of a binary tree (a Merkle tree),
while a client wants to retrieve the set of nodes along a root-to-leaf path
(i.e. a Merkle proof, after appropriate node swapping operations), without
letting the servers know which path is being retrieved. We propose a method
that partitions the Merkle tree to enable parallel private retrieval of the
Merkle proofs. The partitioning step is based on a novel tree coloring called
ancestral coloring in which nodes with ancestor-descendant relationship must
have distinct colors. To minimize the retrieval time, the coloring must be
balanced, i.e. the sizes of the color classes must differ by at most one. We
develop a fast algorithm to find a balanced ancestral coloring in almost linear
time in the number of tree nodes, which can handle trees with billions of nodes
in minutes. Unlike existing approaches, ours allows an efficient indexing with
polylog time and space complexity. Our partitioning method can be applied on
top of any private information retrieval scheme, leading to the minimum storage
overhead and fastest running time compared to existing works.
Authors' comments: 27 pages
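The two properties this abstract requires of a coloring (ancestral and balanced) can be checked with a few lines of code. The sketch below only verifies a given coloring; it is not the authors' near-linear-time construction algorithm. The heap-style node numbering (root = 1, children of node i are 2i and 2i+1) and leaving the root uncolored are assumptions of this illustration.

```python
from collections import Counter

def is_ancestral(coloring):
    """True if no colored node shares a color with any of its ancestors.

    coloring maps a heap index (root = 1) to a color label.
    Nodes absent from the mapping are treated as uncolored.
    """
    for node, color in coloring.items():
        parent = node // 2
        while parent >= 1:
            if parent in coloring and coloring[parent] == color:
                return False
            parent //= 2
    return True

def is_balanced(coloring):
    """True if the sizes of the color classes differ by at most one."""
    sizes = Counter(coloring.values()).values()
    return max(sizes) - min(sizes) <= 1

# Height-2 tree, nodes 2..7 colored (root left uncolored in this sketch).
by_level = {2: "red", 3: "red", 4: "blue", 5: "blue", 6: "blue", 7: "blue"}
mixed = {2: "red", 3: "blue", 4: "blue", 5: "blue", 6: "red", 7: "red"}
clash = {2: "red", 3: "blue", 4: "red", 5: "blue", 6: "red", 7: "red"}
```

Coloring by level is always ancestral but is unbalanced, since deeper levels hold more nodes; the `mixed` coloring shows that both properties can hold at once on this small tree.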
Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Wing-Kwong Chan
Cross-modal recipe retrieval has attracted research attention in recent
years, thanks to the availability of large-scale paired data for training.
Nevertheless, obtaining adequate recipe-image pairs covering the majority of
cuisines for supervised learning is difficult if not impossible. By
transferring knowledge learnt from a data-rich cuisine to a data-scarce
cuisine, domain adaptation sheds light on this practical problem. Nevertheless,
existing works assume recipes in source and target domains mostly
originate from the same cuisine and are written in the same language. This paper
studies unsupervised domain adaptation for image-to-recipe retrieval, where
recipes in source and target domains are in different languages. Moreover, only
recipes are available for training in the target domain. A novel recipe mixup
method is proposed to learn transferable embedding features between the two
domains. Specifically, recipe mixup produces mixed recipes to form an
intermediate domain by discretely exchanging the section(s) between source and
target recipes. To bridge the domain gap, a recipe mixup loss is proposed to
constrain the intermediate domain to lie on the shortest geodesic path between
source and target domains in the recipe embedding space. Using the Recipe 1M
dataset as the source domain (English) and the Vireo-FoodTransfer dataset as the target
domain (Chinese), experiments verify the effectiveness of recipe
mixup for cross-lingual adaptation in the context of image-to-recipe retrieval.
Authors' comments: Accepted by ICMR2022
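The section-exchange step of recipe mixup can be illustrated with a toy sketch. The section names, the per-section coin flip, and the `swap_prob` parameter are assumptions made here for illustration; the abstract states only that sections are discretely exchanged between source and target recipes.

```python
import random

# Hypothetical section names for a recipe record.
SECTIONS = ("title", "ingredients", "instructions")

def recipe_mixup(source, target, rng, swap_prob=0.5):
    """Build a mixed recipe by taking each whole section from one domain.

    Sections are exchanged discretely: each one comes entirely from the
    source recipe or entirely from the target recipe, never blended.
    """
    mixed = {}
    for section in SECTIONS:
        donor = target if rng.random() < swap_prob else source
        mixed[section] = donor[section]
    return mixed

# Toy recipes (hypothetical content).
source = {"title": "Tomato soup",
          "ingredients": ["tomato", "salt"],
          "instructions": ["Boil.", "Blend."]}
target = {"title": "番茄炒蛋",
          "ingredients": ["番茄", "鸡蛋"],
          "instructions": ["炒蛋。", "加入番茄。"]}
mixed = recipe_mixup(source, target, random.Random(0))
```

Each mixed recipe is a point between the two domains in the sense that it shares some sections with each, which is what makes it usable as an intermediate domain.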
Zhengzhong Liang, Tushar Khot, Steven Bethard, Mihai Surdeanu, Ashish Sabharwal
Considerable progress has been made recently in open-domain question
answering (QA) problems, which require Information Retrieval (IR) and Reading
Comprehension (RC). A popular approach to improve the system's performance is
to improve the quality of the retrieved context from the IR stage. In this work
we show that for StrategyQA, a challenging open-domain QA dataset that requires
multi-hop reasoning, this common approach is surprisingly ineffective --
improving the quality of the retrieved context hardly improves the system's
performance. We further analyze the system's behavior to identify potential
reasons.
Authors' comments: 10 pages
Zhenghao Liu, Han Zhang, Chenyan Xiong, Zhiyuan Liu, Yu Gu, Xiaohua Li
Dense retrievers encode queries and documents and map them in an embedding
space using pre-trained language models. These embeddings need to be
high-dimensional to fit training signals and guarantee the retrieval
effectiveness of dense retrievers. However, these high-dimensional embeddings
lead to larger index storage and higher retrieval latency. To reduce the
embedding dimensions of dense retrieval, this paper proposes a Conditional
Autoencoder (ConAE) to compress the high-dimensional embeddings to maintain the
same embedding distribution and better recover the ranking features. Our
experiments show that ConAE is effective in compressing embeddings by achieving
comparable ranking performance with its teacher model and making the retrieval
system more efficient. Our further analyses show that ConAE can alleviate the
redundancy of the embeddings of dense retrieval with only one linear layer. All
code for this work is available at https://github.com/NEUIR/ConAE.
Authors' comments: Accepted by EMNLP 2022
Xiang Chen, Lei Li, Ningyu Zhang, Chuanqi Tan, Fei Huang, Luo Si, Huajun Chen
Pre-trained language models have contributed significantly to relation
extraction by demonstrating remarkable few-shot learning abilities. However,
prompt tuning methods for relation extraction may still fail to generalize to
those rare or hard patterns. Note that the previous parametric learning
paradigm can be viewed as memorization, regarding training data as a book and
inference as a closed-book test. Those long-tailed or hard patterns can hardly
be memorized in parameters given few-shot instances. To this end, we regard RE
as an open-book examination and propose a new semiparametric paradigm of
retrieval-enhanced prompt tuning for relation extraction. We construct an
open-book datastore for retrieval regarding prompt-based instance
representations and corresponding relation labels as memorized key-value pairs.
During inference, the model can infer relations by linearly interpolating the
base output of PLM with the non-parametric nearest neighbor distribution over
the datastore. In this way, our model not only infers relation through
knowledge stored in the weights during training but also assists
decision-making by unwinding and querying examples in the open-book datastore.
Extensive experiments on benchmark datasets show that our method can achieve
state-of-the-art results in both standard supervised and few-shot settings. Code is
available at https://github.com/zjunlp/PromptKG/tree/main/research/RetrievalRE.
Authors' comments: Accepted by SIGIR 2022, short paper
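The inference-time interpolation described in this abstract can be sketched as follows. The distance-based softmax used to build the nearest-neighbor distribution, the parameter names `k`, `temperature`, and `lam`, and the toy relation labels are assumptions for illustration; only the linear interpolation of the model output with a nearest-neighbor distribution over a key-value datastore comes from the abstract.

```python
import math
from collections import defaultdict

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Turn the k nearest (key, label) pairs into a label distribution."""
    nearest = sorted(datastore, key=lambda kv: sq_dist(query, kv[0]))[:k]
    weights = [math.exp(-sq_dist(query, key) / temperature) for key, _ in nearest]
    total = sum(weights)
    dist = defaultdict(float)
    for w, (_, label) in zip(weights, nearest):
        dist[label] += w / total
    return dict(dist)

def interpolate(p_model, p_knn, lam=0.5):
    """Linear interpolation: lam * p_knn + (1 - lam) * p_model."""
    labels = set(p_model) | set(p_knn)
    return {y: lam * p_knn.get(y, 0.0) + (1.0 - lam) * p_model.get(y, 0.0)
            for y in labels}

# Toy datastore (hypothetical): instance representation -> relation label.
datastore = [([0.0, 0.0], "born_in"),
             ([0.1, 0.0], "born_in"),
             ([5.0, 5.0], "works_for")]
p_knn = knn_distribution([0.0, 0.0], datastore, k=2)
p_final = interpolate({"born_in": 0.4, "works_for": 0.6}, p_knn, lam=0.5)
```

In the toy example the base model slightly prefers the wrong label, and the retrieved neighbors shift the final distribution toward the label supported by the datastore.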
Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang
Recently, large-scale pre-training methods like CLIP have made great progress
in multi-modal research such as text-video retrieval. In CLIP, transformers are
vital for modeling complex multi-modal relations. However, in the vision
transformer of CLIP, the essential visual tokenization process, which produces
discrete visual token sequences, generates many homogeneous tokens due to the
redundant nature of consecutive and similar frames in videos. This
significantly increases computation costs and hinders the deployment of video
retrieval models in web applications. In this paper, to reduce the number of
redundant video tokens, we design a multi-segment token clustering algorithm to
find the most representative tokens and drop the non-essential ones. As the
frame redundancy occurs mostly in consecutive frames, we divide videos into
multiple segments and conduct segment-level clustering. Center tokens from each
segment are later concatenated into a new sequence, while their original
spatial-temporal relations are well maintained. We instantiate two clustering
algorithms to efficiently find deterministic medoids and iteratively partition
groups in high dimensional space. Through this token clustering and center
selection procedure, we successfully reduce computation costs by removing
redundant visual tokens. This method further enhances segment-level semantic
alignment between video and text representations, enforcing the spatio-temporal
interactions of tokens from within-segment frames. Our method, coined as
CenterCLIP, surpasses existing state-of-the-art by a large margin on typical
text-video benchmarks, while reducing the training memory cost by 35% and
accelerating the inference speed by 14% in the best case. The code is
available at https://github.com/mzhaoshuai/CenterCLIP.
Authors' comments: accepted by SIGIR 2022, code is at
https://github.com/mzhaoshuai/CenterCLIP
Alex Falcon, Swathikiran Sudhakaran, Giuseppe Serra, Sergio Escalera, Oswald Lanz
Video retrieval using natural language queries has attracted increasing
interest due to its relevance in real-world applications, from intelligent
access in private media galleries to web-scale video search. Learning the
cross-similarity of video and text in a joint embedding space is the dominant
approach. To do so, a contrastive loss is usually employed because it organizes
the embedding space by putting similar items close and dissimilar items far.
This framework leads to competitive recall rates, as they solely focus on the
rank of the groundtruth items. Yet, assessing the quality of the ranking list
is of utmost importance when considering intelligent retrieval systems, since
multiple items may share similar semantics, hence a high relevance. Moreover,
the aforementioned framework uses a fixed margin to separate similar and
dissimilar items, treating all non-groundtruth items as equally irrelevant. In
this paper we propose to use a variable margin: we argue that varying the
margin used during training based on how relevant an item is to a given
query, i.e. a relevance-based margin, easily improves the quality of the
ranking lists measured through nDCG and mAP. We demonstrate the advantages of
our technique using different models on EPIC-Kitchens-100 and YouCook2. We show
that even if we carefully tuned the fixed margin, our technique (which does not
have the margin as a hyper-parameter) would still achieve better performance.
Finally, extensive ablation studies and qualitative analysis support the
robustness of our approach. Code will be released at
https://github.com/aranciokov/RelevanceMargin-ICMR22.
Authors' comments: Accepted for presentation at International Conference on Multimedia
Retrieval (ICMR '22)
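One plausible reading of a relevance-based margin is sketched below. The specific form `margin = 1 - relevance` and the hinge-style triplet loss on similarities are assumptions for illustration, not the formula confirmed by the abstract; relevance is assumed to lie in [0, 1].

```python
def triplet_loss(sim_pos, sim_neg, margin):
    """Hinge-style triplet loss on similarity scores."""
    return max(0.0, margin + sim_neg - sim_pos)

def relevance_margin_loss(sim_pos, sim_neg, relevance_neg):
    """Triplet loss whose margin shrinks as the 'negative' gets more relevant.

    relevance_neg is the relevance of the non-groundtruth item to the query,
    assumed to lie in [0, 1]. A fully irrelevant item gets margin 1; a highly
    relevant one gets a margin near 0 and is barely pushed away.
    """
    margin = 1.0 - relevance_neg  # assumed form; no separate margin hyper-parameter
    return triplet_loss(sim_pos, sim_neg, margin)

# Same similarities, different relevance of the non-groundtruth item.
loss_irrelevant = relevance_margin_loss(0.8, 0.6, relevance_neg=0.0)
loss_relevant = relevance_margin_loss(0.8, 0.6, relevance_neg=0.9)
```

With a fixed margin both cases would incur the same penalty; here the semantically similar item is penalized less, which is the behavior the abstract argues improves ranking-list quality.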
Cristiano Germani
One of the most striking pieces of evidence of the information loss paradox is that,
according to Hawking's calculation, the correlation functions of a test
scalar field decay exponentially in time. In this paper, I argue that a
judicious use of the steepest descent expansion on the classical saddle point
(the black hole background) is enough to change this early-time decay into
late-time growth, in agreement with information retrieval. I will explicitly
show this in Jackiw-Teitelboim gravity. There, the so-called "ramp" in the
bulk two-point function is analytically obtained without the need for any other
subdominant configurations of the gravity path integral.
Authors' comments: v2: 6 pages, 4 figures, clarifications and better figures added.
Results unchanged
Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qifei Wu, Yuchen Ding, Hua Wu, Haifeng Wang et al.
Recent years have witnessed significant advances in dense retrieval (DR) based on powerful pre-trained language models (PLM). DR models have achieved excellent performance on several benchmark datasets, while they are shown to be not as competitive as traditional sparse retrieval models (e.g., BM25) in a zero-shot retrieval setting. However, a detailed and comprehensive study on zero-shot retrieval is still lacking in the related literature. In this paper, we present the first thorough examination of the zero-shot capability of DR models. We aim to identify the key factors and analyze how they affect zero-shot retrieval performance. In particular, we discuss the effect of several key factors related to the source training set, analyze the potential bias from the target dataset, and review and compare existing zero-shot DR models. Our findings provide important evidence to better understand and develop zero-shot DR models.
Xin Zhang, Xiaohua Xie, Jianhuang Lai, Wei-Shi Zheng
We are concerned with retrieving a query person from multiple videos captured
by a non-overlapping camera network. Existing methods often rely on purely
visual matching or consider temporal constraints but ignore the spatial
information of the camera network. To address this issue, we propose a
pedestrian retrieval framework based on cross-camera trajectory generation,
which integrates both temporal and spatial information. To obtain pedestrian
trajectories, we propose a novel cross-camera spatio-temporal model that
integrates pedestrians' walking habits and the path layout between cameras to
form a joint probability distribution. Such a spatio-temporal model among a
camera network can be specified using sparsely sampled pedestrian data. Based
on the spatio-temporal model, cross-camera trajectories can be extracted by the
conditional random field model and further optimized by restricted non-negative
matrix factorization. Finally, a trajectory re-ranking technique is proposed to
improve the pedestrian retrieval results. To verify the effectiveness of our
method, we construct the first cross-camera pedestrian trajectory dataset, the
Person Trajectory Dataset, in real surveillance scenarios. Extensive
experiments verify the effectiveness and robustness of the proposed method.
Authors' comments: IEEE Transactions on Image Processing (TIP), 2023
Jingtao Zhan, Xiaohui Xie, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma
A retrieval model should not only interpolate the training data but also
extrapolate well to the queries that are different from the training data.
While neural retrieval models have demonstrated impressive performance on
ad-hoc search benchmarks, we still know little about how they perform in terms
of interpolation and extrapolation. In this paper, we demonstrate the
importance of separately evaluating the two capabilities of neural retrieval
models. Firstly, we examine existing ad-hoc search benchmarks from the two
perspectives. We investigate the distribution of training and test data and
find a considerable overlap in query entities, query intent, and relevance
labels. This finding implies that the evaluation on these test sets is biased
toward interpolation and cannot accurately reflect the extrapolation capacity.
Secondly, we propose a novel evaluation protocol to separately evaluate the
interpolation and extrapolation performance on existing benchmark datasets. It
resamples the training and test data based on query similarity and utilizes the
resampled dataset for training and evaluation. Finally, we leverage the
proposed evaluation protocol to comprehensively revisit a number of
widely-adopted neural retrieval models. Results show models perform differently
when moving from interpolation to extrapolation. For example,
representation-based retrieval models perform almost as well as
interaction-based retrieval models in terms of interpolation but not
extrapolation. Therefore, it is necessary to separately evaluate both
interpolation and extrapolation performance and the proposed resampling method
serves as a simple yet effective evaluation tool for future IR studies.
Authors' comments: CIKM 2022 Full Paper