Yuanchun Shen
Information retrieval based knowledge base question answering (KBQA) first retrieves a subgraph to reduce search space, then reasons on the subgraph to select answer entities. Existing approaches have three issues that impede the retrieval of such subgraphs. Firstly, there is no off-the-shelf toolkit for semantic-relevant subgraph retrieval. Secondly, existing methods are knowledge-graph-dependent, resulting in outdated knowledge graphs used even in recent studies. Thirdly, previous solutions fail to incorporate the best available techniques for entity linking or path expansion. In this paper, we present SRTK, a user-friendly toolkit for semantic-relevant subgraph retrieval from large-scale knowledge graphs. SRTK is the first toolkit that streamlines the entire lifecycle of subgraph retrieval across multiple knowledge graphs. Additionally, it comes with state-of-the-art subgraph retrieval algorithms, guaranteeing an up-to-date solution set out of the box.
Zida Cheng, Chen Ju, Xu Chen, Zhonghua Zhai, Shuai Xiao, Xiaoyi Zeng, Weilin Huang
We formally define a novel and valuable information retrieval task: image-to-multi-modal-retrieval (IMMR), where the query is an image and the doc is an entity with both an image and a textual description. The IMMR task is valuable in various industrial applications. We analyze three key challenges for IMMR: 1) skewed data and noisy labels in metric learning, 2) multi-modality fusion, 3) effective and efficient training in large-scale industrial scenarios. To tackle the above challenges, we propose a novel framework for the IMMR task. Our framework consists of three components: 1) a novel data governance scheme coupled with a large-scale classification-based learning paradigm, 2) a model architecture specially designed for multimodal learning, where the proposed concept-aware modality fusion module adaptively fuses the image and text modalities, and 3) a hybrid parallel training approach for tackling large-scale training in industrial scenarios. The proposed framework achieves SOTA performance on public datasets and has been deployed in a real-world industrial search system, leading to significant improvements in click-through rate and deal number. Code and data will be made publicly available.
Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, Xiang Bai
Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal input from video, i.e., visual representation, while text is omnipresent in human environments and frequently critical to understanding video. To study how to retrieve video with both modal inputs, i.e., visual and text semantic representations, we first introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR, which contains 42.2k sentence queries for 10.5k videos of 8 scenario domains, i.e., Street View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV Show, and Cooking. The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task. Besides, we present a detailed analysis of TextVR compared to the existing datasets and design a novel multimodal video retrieval baseline for the text-based video retrieval task. The dataset analysis and extensive experiments show that our TextVR benchmark provides many new technical challenges and insights, compared with previous datasets, for the video-and-language community. The project website and GitHub repo can be found at https://sites.google.com/view/loveucvpr23/guest-track and https://github.com/callsys/TextVR, respectively.
Zahra Tabatabaei, Yuandou Wang, Adrián Colomer, Javier Oliver Moll, Zhiming Zhao, Valery Naranjo
The paper proposes a Federated Content-Based Medical Image Retrieval
(FedCBMIR) platform that utilizes Federated Learning (FL) to address the
challenges of acquiring a diverse medical data set for training CBMIR models.
CBMIR assists pathologists in diagnosing breast cancer more rapidly by
identifying similar medical images and relevant patches in prior cases compared
to traditional cancer detection methods. However, CBMIR in histopathology
requires a pool of Whole Slide Images (WSIs) to train a model that extracts an
optimal embedding vector that boosts search engine performance, and such a pool
may not be available in all centers. The strict regulations surrounding data sharing in
medical data sets also hinder research and model development, making it
difficult to collect a rich data set. The proposed FedCBMIR distributes the
model to collaborative centers for training without sharing the data set,
resulting in shorter training times than local training. FedCBMIR was evaluated
in two experiments with three scenarios on BreaKHis and Camelyon17 (CAM17). The
study shows that the FedCBMIR method increases the F1-Score (F1S) of each
client to 98%, 96%, 94%, and 97% in the BreaKHis experiment with a generalized
model of four magnifications and does so in 6.30 hours less time than total
local training. FedCBMIR also achieves 98% accuracy with CAM17 in 2.49 hours
less training time than local training, demonstrating that our FedCBMIR is both
fast and accurate for both pathologists and engineers. In addition, our
FedCBMIR provides similar images with higher magnification to less-developed
countries that participate in the worldwide FedCBMIR alongside developed countries,
facilitating mitosis measurement in breast cancer diagnosis. We evaluate this
scenario by scattering BreaKHis into four centers with different
magnifications.
Authors' comments: This paper has been submitted to IEEE Access
Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko
Compositional reasoning is a hallmark of human visual intelligence. Yet,
despite the size of large vision-language models, they struggle to represent
simple compositions by combining objects with their attributes. To measure this
lack of compositional capability, we design Cola, a text-to-image retrieval
benchmark to Compose Objects Localized with Attributes. To solve Cola, a model
must retrieve images with the correct configuration of attributes and objects
and avoid choosing a distractor image with the same objects and attributes but
in the wrong configuration. Cola contains about 1.2k composed queries of 168
objects and 197 attributes on around 30K images. Our human evaluation finds
that Cola is 83.33% accurate, similar to contemporary compositionality
benchmarks. Using Cola as a testbed, we explore empirical modeling designs to
adapt pre-trained vision-language models to reason compositionally. We explore
6 adaptation strategies on 2 seminal vision-language models, using
compositionality-centric test benchmarks - Cola and CREPE. We find the optimal
adaptation strategy is to train a multi-modal attention layer that jointly
attends over the frozen pre-trained image and language features. Surprisingly,
training multimodal layers on CLIP performs better than tuning a larger FLAVA
model with already pre-trained multimodal layers. Furthermore, our adaptation
strategy improves CLIP and FLAVA to comparable levels, suggesting that training
multimodal layers using contrastive attribute-object data is key, as opposed to
using them pre-trained. Lastly, we show that Cola is harder than a closely
related contemporary benchmark, CREPE, since simpler fine-tuning strategies
without multimodal layers suffice on CREPE but not on Cola. However, we still
see a significant gap between our best adaptation and human accuracy,
suggesting considerable room for further research.
Authors' comments: Accepted to NeurIPS 2023. Webpage:
https://cs-people.bu.edu/array/research/cola/
Xin Cheng, Di Luo, Xiuying Chen, Lemao Liu, Dongyan Zhao, Rui Yan
With direct access to human-written reference as memory, retrieval-augmented
generation has achieved much progress in a wide range of text generation tasks.
Since better memory typically prompts better generation (we define this as the
primal problem), the traditional approach for memory retrieval involves
selecting memory that exhibits the highest similarity to the input. However,
this method is constrained by the quality of the fixed corpus from which memory
is retrieved. In this paper, by exploring the duality of the primal problem:
better generation also prompts better memory, we propose a novel framework,
selfmem, which addresses this limitation by iteratively employing a
retrieval-augmented generator to create an unbounded memory pool and using a
memory selector to choose one output as memory for the subsequent generation
round. This enables the model to leverage its own output, referred to as
self-memory, for improved generation. We evaluate the effectiveness of selfmem
on three distinct text generation tasks: neural machine translation,
abstractive text summarization, and dialogue generation, under two generation
paradigms: fine-tuned small model and few-shot LLM. Our approach achieves
state-of-the-art results in four directions in JRC-Acquis, XSum (50.3 ROUGE-1),
and BigPatent (62.9 ROUGE-1), demonstrating the potential of self-memory in
enhancing retrieval-augmented generation models. Furthermore, we conduct
thorough analyses of each component in the selfmem framework to identify
bottlenecks and provide insights for future research.
Authors' comments: NeurIPS 2023
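The iterative loop described in the abstract above (generate candidates with the current memory, let a selector pick one, reuse it as memory for the next round) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `self_memory_loop`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and selector merely stand in for a retrieval-augmented generator and a trained memory selector.

```python
from typing import Callable, List


def self_memory_loop(
    source: str,
    retrieved_memory: str,
    generate_candidates: Callable[[str, str], List[str]],
    score_candidate: Callable[[str, str], float],
    rounds: int = 3,
) -> str:
    """Repeatedly replace the memory with the selector's preferred candidate."""
    memory = retrieved_memory  # round 0 starts from conventionally retrieved memory
    for _ in range(rounds):
        candidates = generate_candidates(source, memory)
        if not candidates:
            break
        # The selector picks one generated output to serve as the next memory.
        memory = max(candidates, key=lambda c: score_candidate(source, c))
    return memory


def toy_generate(source: str, memory: str) -> List[str]:
    # Hypothetical generator: propose the memory itself plus one variant
    # per source word that the memory does not yet contain.
    have = memory.split()
    missing = [w for w in source.split() if w not in have]
    return [memory] + [" ".join(have + [w]) for w in missing]


def toy_score(source: str, candidate: str) -> float:
    # Hypothetical selector: number of distinct source words covered.
    return float(len(set(source.split()) & set(candidate.split())))


if __name__ == "__main__":
    print(self_memory_loop("the cat sat", "the", toy_generate, toy_score))
```

In this toy setting each round can only improve or keep the selected memory, which mirrors the intuition that better generation yields better memory; a real selector offers no such guarantee.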
Yifan Qiao, Yingrui Yang, Haixin Lin, Tao Yang
Recent studies show that BM25-driven dynamic index skipping can greatly
accelerate MaxScore-based document retrieval based on the learned sparse
representation derived by DeepImpact. This paper investigates the effectiveness
of such a traversal guidance strategy during top k retrieval when using other
models such as SPLADE and uniCOIL, and finds that unconstrained BM25-driven
skipping could have a visible relevance degradation when the BM25 model is not
well aligned with a learned weight model or when retrieval depth k is small.
This paper generalizes the previous work and optimizes the BM25 guided index
traversal with a two-level pruning control scheme and model alignment for fast
retrieval using a sparse representation. Although there can be a cost of
increased latency, the proposed scheme is much faster than the original
MaxScore method without BM25 guidance while retaining the relevance
effectiveness. This paper analyzes the competitiveness of this two-level
pruning scheme, and evaluates its tradeoff in ranking relevance and time
efficiency when searching several test datasets.
Authors' comments: This paper is published in WWW'23
Tao Shen, Guodong Long, Xiubo Geng, Chongyang Tao, Tianyi Zhou, Daxin Jiang
In this work, we propose a simple method that applies a large language model
(LLM) to large-scale retrieval in zero-shot scenarios. Our method, the Language
model as Retriever (LameR), is built upon no other neural models but
an LLM, while breaking brute-force combinations of retrievers with LLMs and
lifting the performance of zero-shot retrieval to be very competitive on
benchmark datasets. Essentially, we propose to augment a query with its
potential answers by prompting LLMs with a composition of the query and the
query's in-domain candidates. The candidates, whether correct or wrong,
are obtained by a vanilla retrieval procedure on the target collection. As a
part of the prompts, they are likely to help LLM generate more precise answers
by pattern imitation or candidate summarization. Even if all the candidates are
wrong, the prompts at least make LLM aware of in-collection patterns and
genres. Moreover, due to the low performance of a self-supervised retriever,
the LLM-based query augmentation becomes less effective as the retriever
bottlenecks the whole pipeline. Therefore, we propose to leverage a
non-parametric lexicon-based method (e.g., BM25) as the retrieval module to
capture query-document overlap in a literal fashion. As such, LameR makes the
retrieval procedure transparent to the LLM, thus circumventing the performance
bottleneck.
Authors' comments: Work in progress
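The pipeline described in the abstract above (retrieve candidates lexically, prompt an LLM with the query plus candidates, append the generated answers to the query, retrieve again lexically) can be sketched as below. This is only an illustration under stated assumptions: `TinyBM25` is a small hand-written BM25 variant, not a library class, the prompt wording is invented, and `llm` is a placeholder callable that the caller must supply.

```python
import math
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class TinyBM25:
    """Minimal in-memory BM25 scorer (illustrative, not optimized)."""

    def __init__(self, docs: List[str], k1: float = 1.2, b: float = 0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query_tokens: List[str], idx: int) -> float:
        doc = self.docs[idx]
        tf = Counter(doc)
        total = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            norm = tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += idf * tf[t] * (self.k1 + 1) / norm
        return total

    def search(self, query: str, k: int) -> List[int]:
        q = tokenize(query)
        ranked = sorted(range(self.n), key=lambda i: self.score(q, i), reverse=True)
        return ranked[:k]


def lamer_style_retrieve(
    query: str,
    docs: List[str],
    llm: Callable[[str], List[str]],  # hypothetical: prompt -> generated answers
    k: int = 2,
    n_candidates: int = 2,
) -> List[int]:
    index = TinyBM25(docs)
    # Step 1: vanilla retrieval supplies in-domain candidates, right or wrong.
    candidates = [docs[i] for i in index.search(query, n_candidates)]
    # Step 2: prompt the LLM with the query and its candidates (assumed wording).
    prompt = (
        "Question: " + query + "\nPossible passages:\n"
        + "\n".join(candidates) + "\nAnswer:"
    )
    answers = llm(prompt)
    # Step 3: augment the query with the answers and retrieve lexically again.
    augmented = " ".join([query] + answers)
    return index.search(augmented, k)
```

Because both retrieval steps are purely lexical, the only neural component in this sketch is the `llm` callable, which matches the abstract's point that the retriever should not bottleneck the LLM.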
Hansi Zeng, Surya Kallumadi, Zaid Alibadi, Rodrigo Nogueira, Hamed Zamani
Developing a universal model that can efficiently and effectively respond to
a wide range of information access requests -- from retrieval to recommendation
to question answering -- has been a long-lasting goal in the information
retrieval community. This paper argues that the flexibility, efficiency, and
effectiveness brought by the recent development in dense retrieval and
approximate nearest neighbor search have smoothed the path towards achieving
this goal. We develop a generic and extensible dense retrieval framework,
called \framework, that can handle a wide range of (personalized) information
access requests, such as keyword search, query by example, and complementary
item recommendation. Our proposed approach extends the capabilities of dense
retrieval models for ad-hoc retrieval tasks by incorporating user-specific
preferences through the development of a personalized attentive network. This
allows for a more tailored and accurate personalized information access
experience. Our experiments on real-world e-commerce data suggest the
feasibility of developing universal information access models by demonstrating
significant improvements even compared to competitive baselines specifically
developed for each of these individual information access tasks. This work
opens up a number of fundamental research directions for future exploration.
Authors' comments: Accepted to SIGIR 2023
Jiahua Rao, Zifei Shan, Longpo Liu, Yao Zhou, Yuedong Yang
With the recent progress in large-scale vision and language representation
learning, Vision Language Pre-training (VLP) models have achieved promising
improvements on various multi-modal downstream tasks. Albeit powerful, these
models have not fully leveraged world knowledge to their advantage. A key
challenge of knowledge-augmented VLP is the lack of clear connections between
knowledge and multi-modal data. Moreover, not all knowledge present in
images/texts is useful, therefore prior approaches often struggle to
effectively integrate knowledge, visual, and textual information. In this
study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL),
a novel knowledge-augmented pre-training framework to address the above issues.
For the first time, we introduce a knowledge-aware self-supervised learning
scheme that efficiently establishes the correspondence between knowledge and
multi-modal data and identifies informative knowledge to improve the modeling
of alignment and interactions between visual and textual modalities. By
adaptively integrating informative knowledge with visual and textual
information, REAVL achieves new state-of-the-art performance uniformly on
knowledge-based vision-language understanding and multi-modal entity linking
tasks, as well as competitive results on general vision-language tasks while
only using 0.2% pre-training data of the best models. Our model shows strong
sample efficiency and effective knowledge utilization.
Authors' comments: arXiv admin note: text overlap with arXiv:2210.09338 by other authors
Carlos Lassance, Simon Lupart, Hervé Dejean, Stéphane Clinchant, Nicola Tonellotto
Sparse neural retrievers, such as DeepImpact, uniCOIL and SPLADE, have been
introduced recently as an efficient and effective way to perform retrieval with
inverted indexes. They aim to learn term importance and, in some cases,
document expansions, to provide a more effective document ranking compared to
traditional bag-of-words retrieval models such as BM25. However, these sparse
neural retrievers have been shown to increase the computational costs and
latency of query processing compared to their classical counterparts. To
mitigate this, we apply a well-known family of techniques for boosting the
efficiency of query processing over inverted indexes: static pruning. We
experiment with three static pruning strategies, namely document-centric,
term-centric and agnostic pruning, and we assess, over diverse datasets, that
these techniques still work with sparse neural retrievers. In particular,
static pruning achieves $2\times$ speedup with negligible effectiveness loss
($\leq 2\%$ drop) and, depending on the use case, even $4\times$ speedup with
minimal impact on the effectiveness ($\leq 8\%$ drop). Moreover, we show that
neural rerankers are robust to candidates from statically pruned indexes.
Authors' comments: Short accepted at SIGIR 2023. Has extra content in the Appendix that
is not present in the ACM version
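The three static pruning families named in the abstract above can be illustrated on a toy inverted index. The sketch below is an assumption-laden simplification: the index is a plain dictionary from term to `(doc_id, weight)` postings, and the function names and cut-off parameters (`keep`, `threshold`) are hypothetical rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, float]          # (doc_id, learned term weight)
Index = Dict[str, List[Posting]]     # term -> posting list


def term_centric(index: Index, keep: int) -> Index:
    """Keep only the `keep` highest-weighted postings of each term."""
    return {
        term: sorted(postings, key=lambda p: p[1], reverse=True)[:keep]
        for term, postings in index.items()
    }


def document_centric(index: Index, keep: int) -> Index:
    """Keep only the `keep` highest-weighted terms of each document."""
    by_doc = defaultdict(list)
    for term, postings in index.items():
        for doc_id, weight in postings:
            by_doc[doc_id].append((term, weight))
    pruned = defaultdict(list)
    for doc_id, terms in by_doc.items():
        for term, weight in sorted(terms, key=lambda t: t[1], reverse=True)[:keep]:
            pruned[term].append((doc_id, weight))
    return dict(pruned)


def agnostic(index: Index, threshold: float) -> Index:
    """Drop every posting whose weight falls below one global threshold."""
    pruned = {
        term: [(d, w) for d, w in postings if w >= threshold]
        for term, postings in index.items()
    }
    return {term: postings for term, postings in pruned.items() if postings}


if __name__ == "__main__":
    toy = {"neural": [(0, 2.0), (1, 0.5), (2, 1.0)], "index": [(0, 0.3), (2, 1.5)]}
    print(term_centric(toy, 1))
    print(document_centric(toy, 1))
    print(agnostic(toy, 1.0))
```

All three variants shrink posting lists before query time, which is why they trade a small amount of effectiveness for lower query-processing cost.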
Jianzhang Zhang, Yiyang Chen, Nan Niu, Chuang Liu
Context: Recently, many illustrative examples have shown ChatGPT's impressive ability to perform programming tasks and answer general domain questions. Objective: We empirically evaluate how ChatGPT performs on requirements analysis tasks to derive insights into how generative large language models, represented by ChatGPT, influence the research and practice of natural language processing for requirements engineering. Method: We design an evaluation pipeline including two common requirements information retrieval tasks, four public datasets involving two typical requirements artifacts, querying ChatGPT with fixed task prompts, and quantitative and qualitative results analysis. Results: Quantitative results show that ChatGPT achieves comparable or better $F_\beta$ values in all datasets under a zero-shot setting. Qualitative analysis further illustrates ChatGPT's powerful natural language processing ability and limited requirements engineering domain knowledge. Conclusion: The evaluation results demonstrate ChatGPT's impressive ability to retrieve requirements information from different types of artifacts involving multiple languages under a zero-shot setting. It is worthwhile for the research and industry communities to study requirements retrieval models based on generative large language models and to develop corresponding tools.
Leigang Qu, Meng Liu, Wenjie Wang, Zhedong Zheng, Liqiang Nie, Tat-Seng Chua
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal
content based on semantic similarities. Prior work usually focuses on the
pairwise relations (i.e., whether a data sample matches another) but ignores
the higher-order neighbor relations (i.e., a matching structure among multiple
data samples). Re-ranking, a popular post-processing practice, has revealed the
superiority of capturing neighbor relations in single-modality retrieval tasks.
However, it is ineffective to directly extend existing re-ranking algorithms to
image-text retrieval. In this paper, we analyze the reason from four
perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and
propose a novel learnable pillar-based re-ranking paradigm. Concretely, we
first select top-ranked intra- and inter-modal neighbors as pillars, and then
reconstruct data samples with the neighbor relations between them and the
pillars. In this way, each sample can be mapped into a multimodal pillar space
only using similarities, ensuring generalization. After that, we design a
neighbor-aware graph reasoning module to flexibly exploit the relations and
excavate the sparse positive items within a neighborhood. We also present a
structure alignment constraint to promote cross-modal collaboration and align
the asymmetric modalities. On top of various base backbones, we carry out
extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO,
demonstrating the effectiveness, superiority, generalization, and
transferability of our proposed re-ranking paradigm.
Authors' comments: Accepted by SIGIR'2023
Xueguang Ma, Tommaso Teofili, Jimmy Lin
Anserini is a Lucene-based toolkit for reproducible information retrieval research in Java that has been gaining traction in the community. It provides retrieval capabilities for both "traditional" bag-of-words retrieval models such as BM25 as well as retrieval using learned sparse representations such as SPLADE. With Pyserini, which provides a Python interface to Anserini, users gain access to both sparse and dense retrieval models, as Pyserini implements bindings to the Faiss vector search library alongside Lucene inverted indexes in a uniform, consistent interface. Nevertheless, hybrid fusion techniques that integrate sparse and dense retrieval models need to stitch together results from two completely different "software stacks", which creates unnecessary complexities and inefficiencies. However, the introduction of HNSW indexes for dense vector search in Lucene promises the integration of both dense and sparse retrieval within a single software framework. We explore exactly this integration in the context of Anserini. Experiments on the MS MARCO passage and BEIR datasets show that our Anserini HNSW integration supports (reasonably) effective and (reasonably) efficient approximate nearest neighbor search for dense retrieval models, using only Lucene.
Haitao Li, Qingyao Ai, Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Zheng Liu, Zhao Cao
Recent studies have shown that Dense Retrieval (DR) techniques can
significantly improve the performance of first-stage retrieval in IR systems.
Despite its empirical effectiveness, the application of DR is still limited. In
contrast to statistical retrieval models that rely on highly efficient inverted
index solutions, DR models build dense embeddings that are difficult to
pre-process with most existing search indexing systems. To avoid the
expensive cost of brute-force search, the Approximate Nearest Neighbor (ANN)
algorithm and corresponding indexes are widely applied to speed up the
inference process of DR models. Unfortunately, while ANN can improve the
efficiency of DR models, it usually comes with a significant price on retrieval
performance.
To solve this issue, we propose JTR, which stands for Joint optimization of
TRee-based index and query encoding. Specifically, we design a new unified
contrastive learning loss to train tree-based index and query encoder in an
end-to-end manner. The tree-based negative sampling strategy is applied to make
the tree have the maximum heap property, which supports the effectiveness of
beam search well. Moreover, we treat the cluster assignment as an optimization
problem to update the tree-based index that allows overlapped clustering. We
evaluate JTR on numerous popular retrieval benchmarks. Experimental results
show that JTR achieves better retrieval performance while retaining high system
efficiency compared with widely-adopted baselines. It provides a potential
solution to balance efficiency and effectiveness in neural retrieval system
designs.
Authors' comments: 10 pages, accepted at SIGIR 2023
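The beam search over a tree-based index mentioned in the abstract above can be sketched as follows. This is a generic illustration, not the JTR implementation: the `Node` layout, dot-product scoring, and the rule of keeping `beam_width` nodes per level are assumptions. The maximum heap property matters here because beam search discards a whole subtree when its root scores poorly, so a parent should score at least as high as its best descendant.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    embedding: List[float]
    children: List["Node"] = field(default_factory=list)
    doc_ids: List[int] = field(default_factory=list)  # filled only on leaves


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def beam_search(root: Node, query: List[float], beam_width: int) -> List[int]:
    """Descend level by level, keeping the `beam_width` best nodes per level."""
    frontier = [root]
    reached_leaves: List[Node] = []
    while frontier:
        next_level: List[Node] = []
        for node in frontier:
            if node.children:
                next_level.extend(node.children)
            else:
                reached_leaves.append(node)
        next_level.sort(key=lambda n: dot(query, n.embedding), reverse=True)
        frontier = next_level[:beam_width]
    # Return documents from the reached leaves, best-scoring leaf first.
    reached_leaves.sort(key=lambda n: dot(query, n.embedding), reverse=True)
    return [d for leaf in reached_leaves for d in leaf.doc_ids]


if __name__ == "__main__":
    a = Node([1.0, 0.0], children=[Node([1.0, 0.1], doc_ids=[1, 2]),
                                   Node([0.8, 0.0], doc_ids=[3])])
    b = Node([0.0, 1.0], children=[Node([0.0, 1.0], doc_ids=[4])])
    root = Node([0.0, 0.0], children=[a, b])
    print(beam_search(root, [1.0, 0.0], beam_width=1))
```

A wider beam visits more leaves and therefore costs more per query, which is the efficiency versus effectiveness trade-off the abstract refers to.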
Tae-Hun Lee, Jarosław K. Korbicz
We address the problem of fundamental limitations of information extraction
from the environment in open quantum systems. We derive a model-independent,
hybrid quantum-classical solution of open dynamics in the recoil-less limit,
which includes environmental degrees of freedom. Specializing to the celebrated
Caldeira-Leggett model of hot thermal environments, ubiquitous in everyday
situations, we reveal the existence of a new lengthscale, called
distinguishability length, different from the well-known thermal de Broglie
wavelength that governs the decoherence. Interestingly, a new integral kernel,
called Quantum Fisher Information kernel, appears in the analysis. It
complements the well-known dissipation and noise kernels and satisfies
disturbance-information gain type of relations, similar to the famous
fluctuation-dissipation relation. Our results complement the existing
treatments of the Caldeira-Leggett model from a non-standard and highly
non-trivial perspective of information dynamics in the environment. This leads
to a full picture of what the open evolution looks like from both the system and
the environment points of view, and sets limits on the precision of
indirect observations.
Authors' comments: Published version, title changed, 12 pages, 1 figure
Mehdi Rafiei, Alexandros Iosifidis
Using a discriminative representation obtained by supervised deep learning
methods has shown promising results on diverse Content-Based Image Retrieval
(CBIR) problems. However, existing methods exploiting labels during training
try to discriminate all available classes, which is not ideal in cases where
the retrieval problem focuses on a class of interest. In this paper, we propose
a regularized loss for Variational Auto-Encoders (VAEs) forcing the model to
focus on a given class of interest. As a result, the model learns to
discriminate the data belonging to the class of interest from any other
possibility, making the learnt latent space of the VAE suitable for
class-specific retrieval tasks. The proposed Class-Specific Variational
Auto-Encoder (CS-VAE) is evaluated on three public datasets and one custom dataset, and
its performance is compared with that of three related VAE-based methods.
Experimental results show that the proposed method outperforms its competition
in both in-domain and out-of-domain retrieval problems.
Authors' comments: 8 pages, 7 figures, 6 tables, accepted at IJCNN conference
Weijing Chen, Linli Yao, Qin Jin
Image-text retrieval, as a fundamental and important branch of information
retrieval, has attracted extensive research attention. The main challenge of
this task is cross-modal semantic understanding and matching. Some recent works
focus more on fine-grained cross-modal semantic matching. With the prevalence
of large scale multimodal pretraining models, several state-of-the-art models
(e.g. X-VLM) have achieved near-perfect performance on widely-used image-text
retrieval benchmarks, i.e. MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper,
we review the two common benchmarks and observe that they are insufficient to
assess the true capability of models on fine-grained cross-modal semantic
matching. The reason is that a large number of images and texts in the
benchmarks are coarse-grained. Based on the observation, we renovate the
coarse-grained images and texts in the old benchmarks and establish the
improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the
image side, we enlarge the original image pool by adopting more similar images.
On the text side, we propose a novel semi-automatic renovation approach to
refine coarse-grained sentences into finer-grained ones with little human
effort. Furthermore, we evaluate representative image-text retrieval models on
our new benchmarks to demonstrate the effectiveness of our method. We also
analyze the capability of models on fine-grained semantic comprehension through
extensive experiments. The results show that even the state-of-the-art models
have much room for improvement in fine-grained semantic understanding,
especially in distinguishing attributes of close objects in images. Our code
and improved benchmark datasets are publicly available at:
https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire
further in-depth research on cross-modal retrieval.
Authors' comments: Accepted to SIGIR2023
Xu Zhang, Xinzheng Niu, Philippe Fournier-Viger, Xudong Dai
Image-text retrieval is one of the major tasks of cross-modal retrieval.
Several approaches for this task map images and texts into a common space to
create correspondences between the two modalities. However, due to the content
(semantics) richness of an image, redundant secondary information in an image
may cause false matches. To address this issue, this paper presents a semantic
optimization approach, implemented as a Visual Semantic Loss (VSL), to assist
the model in focusing on an image's main content. This approach is inspired by
how people typically annotate the content of an image by describing its main
content. Thus, we leverage the annotated texts corresponding to an image to
assist the model in capturing the main content of the image, reducing the
negative impact of secondary content. Extensive experiments on two benchmark
datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our
method. The code is available at: https://github.com/ZhangXu0963/VSL.
Authors' comments: 6 pages, 3 figures, accepted by ICME2023
Max F. Burg, Florian Wenzel, Dominik Zietlow, Max Horn, Osama Makansi, Francesco Locatello, Chris Russell
Many approaches have been proposed to use diffusion models to augment training datasets for downstream tasks, such as classification. However, diffusion models are themselves trained on large datasets, often with noisy annotations, and it remains an open question to what extent these models contribute to downstream classification performance. In particular, it remains unclear if they generalize enough to improve over directly using the additional data of their pre-training process for augmentation. We systematically evaluate a range of existing methods to generate images from diffusion models and study new extensions to assess their benefit for data augmentation. Personalizing diffusion models towards the target data outperforms simpler prompting strategies. However, using the pre-training data of the diffusion model alone, via a simple nearest-neighbor retrieval procedure, leads to even stronger downstream performance. Our study explores the potential of diffusion models in generating new training data, and surprisingly finds that these sophisticated models are not yet able to beat a simple and strong image retrieval baseline on simple downstream vision tasks.
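The nearest-neighbor retrieval baseline mentioned in the abstract above can be sketched in a few lines. The abstract does not specify the embedding model, similarity measure, or how retrieved items are labeled, so everything below is an assumption: embeddings are plain lists of floats, similarity is cosine, and each retrieved pool item inherits the label of the target example that retrieved it. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_neighbors(query: List[float], pool: List[List[float]], k: int) -> List[int]:
    """Indices of the k pool embeddings most similar to the query."""
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True)
    return ranked[:k]


def augment_with_retrieval(
    targets: List[Tuple[List[float], str]],  # (embedding, label) of target data
    pool: List[List[float]],                 # embeddings of the large source pool
    k: int,
) -> List[Tuple[int, str]]:
    """Pair each retrieved pool index with the label of the example that found it."""
    seen = set()
    augmented: List[Tuple[int, str]] = []
    for embedding, label in targets:
        for idx in retrieve_neighbors(embedding, pool, k):
            if idx not in seen:  # avoid adding the same pool item twice
                seen.add(idx)
                augmented.append((idx, label))
    return augmented


if __name__ == "__main__":
    targets = [([1.0, 0.0], "cat"), ([0.0, 1.0], "dog")]
    pool = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
    print(augment_with_retrieval(targets, pool, k=1))
```

At the scale of a diffusion model's pre-training set, the exhaustive sort would be replaced by an approximate nearest-neighbor index; the brute-force version is used here only to keep the sketch self-contained.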