Asbjrn O. Orvedal, Hsuan-Yin Lin, Eirik Rosnes
We consider the problem of weakly-private information retrieval (WPIR) when
data is encoded by a maximum distance separable code and stored across multiple
servers. In WPIR, a user wishes to retrieve a piece of data from a set of
servers without leaking too much information about which piece of data she is
interested in. We study and provide the first WPIR protocols for this scenario
and present results on their optimal trade-off between download rate and
information leakage using the maximal leakage privacy metric.
Authors' comments: To be presented at the 2024 International Zurich Seminar on
Information and Communication (IZS'24), Zurich, Switzerland
Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou
Knowledge retrieval with multi-modal queries plays a crucial role in
supporting knowledge-intensive multi-modal applications. However, existing
methods face challenges in terms of their effectiveness and training
efficiency, especially when it comes to training and integrating multiple
retrievers to handle multi-modal queries. In this paper, we propose an
innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can
effectively serve as virtual knowledge bases, even when trained with limited
data. We retrieve knowledge via a two-step process: 1) generating knowledge
clues related to the queries, and 2) obtaining the relevant document by
searching databases using the knowledge clue. In particular, we first introduce
an object-aware prefix-tuning technique to guide multi-grained visual learning.
Then, we align multi-grained visual features into the textual feature space of
the LLM, employing the LLM to capture cross-modal interactions. Subsequently,
we construct instruction data with a unified format for model training.
Finally, we propose the knowledge-guided generation strategy to impose prior
constraints in the decoding steps, thereby promoting the generation of
distinctive knowledge clues. Through experiments conducted on three benchmarks,
we demonstrate significant improvements ranging from 3.0% to 14.6% across all
evaluation metrics when compared to strong baselines.
Authors' comments: Accepted to AAAI 2024
Zexuan Qiu, Jiahong Liu, Yankai Chen, Irwin King
Existing unsupervised deep product quantization methods primarily aim for the
increased similarity between different views of the identical image, whereas
the delicate multi-level semantic similarities preserved between images are
overlooked. Moreover, these methods predominantly focus on the Euclidean space
for computational convenience, compromising their ability to map the
multi-level semantic relationships between images effectively. To mitigate
these shortcomings, we propose a novel unsupervised product quantization method
dubbed \textbf{Hi}erarchical \textbf{H}yperbolic \textbf{P}roduct
\textbf{Q}uantization (HiHPQ), which learns quantized representations by
incorporating hierarchical semantic similarity within hyperbolic geometry.
Specifically, we propose a hyperbolic product quantizer, where the hyperbolic
codebook attention mechanism and the quantized contrastive learning on the
hyperbolic product manifold are introduced to expedite quantization.
Furthermore, we propose a hierarchical semantics learning module, designed to
enhance the distinction between similar and non-matching images for a query by
utilizing the extracted hierarchical semantics as an additional training
supervision. Experiments on benchmarks show that our proposed method
outperforms state-of-the-art baselines.
Authors' comments: Accepted by AAAI 2024
Thiago Laitz, Konstantinos Papakostas, Roberto Lotufo, Rodrigo Nogueira
Despite multi-billion parameter neural rankers being common components of state-of-the-art information retrieval pipelines, they are rarely used in production due to the enormous amount of compute required for inference. In this work, we propose a new method for distilling large rankers into their smaller versions focusing on out-of-domain effectiveness. We introduce InRanker, a version of monoT5 distilled from monoT5-3B with increased effectiveness on out-of-domain scenarios. Our key insight is to use language models and rerankers to generate as much as possible synthetic "in-domain" training data, i.e., data that closely resembles the data that will be seen at retrieval time. The pipeline consists of two distillation phases that do not require additional user queries or manual annotations: (1) training on existing supervised soft teacher labels, and (2) training on teacher soft labels for synthetic queries generated using a large language model. Consequently, models like monoT5-60M and monoT5-220M improved their effectiveness by using the teacher's knowledge, despite being 50x and 13x smaller, respectively. Models and code are available at https://github.com/unicamp-dl/InRanker.
Puxuan Yu, Antonio Mallia, Matthias Petri
We explore leveraging corpus-specific vocabularies that improve both
efficiency and effectiveness of learned sparse retrieval systems. We find that
pre-training the underlying BERT model on the target corpus, specifically
targeting different vocabulary sizes incorporated into the document expansion
process, improves retrieval quality by up to 12% while in some scenarios
decreasing latency by up to 50%. Our experiments show that adopting
corpus-specific vocabulary and increasing vocabulary size decreases average
postings list length which in turn reduces latency. Ablation studies show
interesting interactions between custom vocabularies, document expansion
techniques, and sparsification objectives of sparse models. Both effectiveness
and efficiency improvements transfer to different retrieval approaches such as
uniCOIL and SPLADE and offer a simple yet effective approach to providing new
efficiency-effectiveness trade-offs for learned sparse retrieval systems.
Authors' comments: ECIR 2024 Full Paper
Zixuan Ke, Weize Kong, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky
Large Language Models (LLMs) have demonstrated superior results across a wide range of tasks, and Retrieval-augmented Generation (RAG) is an effective way to enhance the performance by locating relevant information and placing it into the context window of the LLM. However, the relationship between retrievers and LLMs in a RAG is still under-investigated. Most existing work treats the retriever and the LLM as independent components and leaves a gap between retrieving human-"friendly" information and assembling a LLM-"friendly" context. In this work, we examine a novel bridge mechanism. We validate the ranking and selection assumptions of retrievers in the context of RAG and propose a framework that chains together supervised and reinforcement learning to train a bridge model that optimizes the connection between the retriever and the LLM. Empirical results demonstrate the effectiveness of our method in both question-answering and personalized generation tasks.
Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, Mohamed Abdelrazek
Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system involves finding documents that semantically match a query and then passing the documents to a large language model (LLM) such as ChatGPT to extract the right answer using an LLM. RAG systems aim to: a) reduce the problem of hallucinated responses from LLMs, b) link sources/references to generated responses, and c) remove the need for annotating documents with meta-data. However, RAG systems suffer from limitations inherent to information retrieval systems and from reliance on LLMs. In this paper, we present an experience report on the failure points of RAG systems from three case studies from separate domains: research, education, and biomedical. We share the lessons learned and present 7 failure points to consider when designing a RAG system. The two key takeaways arising from our work are: 1) validation of a RAG system is only feasible during operation, and 2) the robustness of a RAG system evolves rather than designed in at the start. We conclude with a list of potential research directions on RAG systems for the software engineering community.
Paul Lerner, Olivier Ferret, Camille Guinaudeau
Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono-and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.
Yusef Hassan-Montero, Victor Herrero-Solana
Tagging-based systems enable users to categorize web resources by means of tags (freely chosen keywords), in order to refinding these resources later. Tagging is implicitly also a social indexing process, since users share their tags and resources, constructing a social tag index, so-called folksonomy. At the same time of tagging-based system, has been popularised an interface model for visual information retrieval known as Tag-Cloud. In this model, the most frequently used tags are displayed in alphabetical order. This paper presents a novel approach to Tag-Cloud's tags selection, and proposes the use of clustering algorithms for visual layout, with the aim of improve browsing experience. The results suggest that presented approach reduces the semantic density of tag set, and improves the visual consistency of Tag-Cloud layout.
Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke
Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers are rarely directly compared to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments. We empirically run experiments on how information retrieval relevance judgments can be utilized as an anchor to evaluating the generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.
Francisco Ardévol Martínez, Michiel Min, Daniela Huppenkothen, Inga Kamp, Paul I. Palmer
Interpreting the observations of exoplanet atmospheres to constrain physical
and chemical properties is typically done using Bayesian retrieval techniques.
Because these methods require many model computations, a compromise is made
between model complexity and run time. Reaching this compromise leads to the
simplification of many physical and chemical processes (e.g. parameterised
temperature structure). Here we implement and test sequential neural posterior
estimation (SNPE), a machine learning inference algorithm, for exoplanet
atmospheric retrievals. The goal is to speed up retrievals so they can be run
with more computationally expensive atmospheric models, such as those computing
the temperature structure using radiative transfer. We generate 100 synthetic
observations using ARCiS (ARtful Modeling Code for exoplanet Science, an
atmospheric modelling code with the flexibility to compute models in varying
degrees of complexity) and perform retrievals on them to test the faithfulness
of the SNPE posteriors. The faithfulness quantifies whether the posteriors
contain the ground truth as often as we expect. We also generate a synthetic
observation of a cool brown dwarf using the self-consistent capabilities of
ARCiS and run a retrieval with self-consistent models to showcase the
possibilities that SNPE opens. We find that SNPE provides faithful posteriors
and is therefore a reliable tool for exoplanet atmospheric retrievals. We are
able to run a self-consistent retrieval of a synthetic brown dwarf spectrum
using only 50,000 forward model evaluations. We find that SNPE can speed up
retrievals between $\sim2\times$ and $\geq10\times$ depending on the
computational load of the forward model, the dimensionality of the observation,
and the signal-to-noise ratio of the observation. We make the code publicly
available for the community on Github.
Authors' comments: Accepted for publication at A&A
Qian Li, Lixin Su, Jiashu Zhao, Long Xia, Hengyi Cai, Suqi Cheng, Hengzhu Tang, Junfeng Wang et al.
Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous works primarily focus on aligning the query and the video by finely aggregating word-frame matching signals. Inspired by the human cognitive process of modularly judging the relevance between text and video, the judgment needs high-order matching signal due to the consecutive and complex nature of video contents. In this paper, we propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit, and the video chunks are segmented into distinct clips from videos. We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video and introduce a multi-modal hypergraph for n-ary correlation modeling. By representing textual units and video frames as nodes and using hyperedges to depict their relationships, a multi-modal hypergraph is constructed. In this way, the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component for computation, obtaining the variational representation under the Gaussian distribution. The incorporation of hypergraphs and variational inference allows our model to capture complex, n-ary interactions among textual and visual contents. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
Shangyu Wu, Ying Xiong, Yufei Cui, Xue Liu, Buzhou Tang, Tei-Wei Kuo, Chun Jason Xue
Retrieval-based augmentations that aim to incorporate knowledge from an external database into language models have achieved great success in various knowledge-intensive (KI) tasks, such as question-answering and text generation. However, integrating retrievals in non-knowledge-intensive (NKI) tasks, such as text classification, is still challenging. Existing works focus on concatenating retrievals to inputs as context to form the prompt-based inputs. Unfortunately, such methods require language models to have the capability to handle long texts. Besides, inferring such concatenated data would also consume a significant amount of computational resources. To solve these challenges, we propose \textbf{ReFusion} in this paper, a computation-efficient \textbf{Re}trieval representation \textbf{Fusion} with neural architecture search. The main idea is to directly fuse the retrieval representations into the language models. Specifically, we first propose an online retrieval module that retrieves representations of similar sentences. Then, we present a retrieval fusion module including two effective ranking schemes, i.e., reranker-based scheme and ordered-mask-based scheme, to fuse the retrieval representations with hidden states. Furthermore, we use Neural Architecture Search (NAS) to seek the optimal fusion structure across different layers. Finally, we conduct comprehensive experiments, and the results demonstrate our ReFusion can achieve superior and robust performance on various NKI tasks.
Zhichao Yin, Binyuan Hui, Min Yang, Fei Huang, Yongbin Li
Recently, substantial advancements in pre-trained vision-language models have
greatly enhanced the capabilities of multi-modal dialog systems. These models
have demonstrated significant improvements by fine-tuning on downstream tasks.
However, the existing pre-trained models primarily focus on effectively
capturing the alignment between vision and language modalities, often ignoring
the intricate nature of dialog context. In this paper, we propose a
parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog
retrieval. Specifically, our approach introduces a multi-modal context prompt
generator to learn context features which are subsequently distilled into
prompts within the pre-trained vision-language model CLIP. Besides, we
introduce domain prompt to mitigate the disc repancy from the downstream dialog
data. To facilitate various types of retrieval, we also design multiple experts
to learn mappings from CLIP outputs to multi-modal representation space, with
each expert being responsible to one specific retrieval type. Extensive
experiments show that DialCLIP achieves state-of-the-art performance on two
widely recognized benchmark datasets (i.e., PhotoChat and MMDialog) by tuning a
mere 0.04% of the total parameters. These results highlight the efficacy and
efficiency of our proposed approach, underscoring its potential to advance the
field of multi-modal dialog retrieval.
Authors' comments: arxiv: first version
Samarth Mishra, Carlos D. Castillo, Hongcheng Wang, Kate Saenko, Venkatesh Saligrama
In cross-domain retrieval, a model is required to identify images from the
same semantic category across two visual domains. For instance, given a sketch
of an object, a model needs to retrieve a real image of it from an online
store's catalog. A standard approach for such a problem is learning a feature
space of images where Euclidean distances reflect similarity. Even without
human annotations, which may be expensive to acquire, prior methods function
reasonably well using unlabeled images for training. Our problem constraint
takes this further to scenarios where the two domains do not necessarily share
any common categories in training data. This can occur when the two domains in
question come from different versions of some biometric sensor recording
identities of different people. We posit a simple solution, which is to
generate synthetic data to fill in these missing category examples across
domains. This, we do via category preserving translation of images from one
visual domain to another. We compare approaches specifically trained for this
translation for a pair of domains, as well as those that can use large-scale
pre-trained text-to-image diffusion models via prompts, and find that the
latter can generate better replacement synthetic data, leading to more accurate
cross-domain retrieval models. Our best SynCDR model can outperform prior art
by up to 15\%. Code for our work is available at
https://github.com/samarth4149/SynCDR .
Authors' comments: Pre-print
Haifeng Li, Mo Hai, Dong Tang
Ranker and retriever are two important components in dense passage retrieval. The retriever typically adopts a dual-encoder model, where queries and documents are separately input into two pre-trained models, and the vectors generated by the models are used for similarity calculation. The ranker often uses a cross-encoder model, where the concatenated query-document pairs are input into a pre-trained model to obtain word similarities. However, the dual-encoder model lacks interaction between queries and documents due to its independent encoding, while the cross-encoder model requires substantial computational cost for attention calculation, making it difficult to obtain real-time retrieval results. In this paper, we propose a dense retrieval model called MD2PR based on multi-level distillation. In this model, we distill the knowledge learned from the cross-encoder to the dual-encoder at both the sentence level and word level. Sentence-level distillation enhances the dual-encoder on capturing the themes and emotions of sentences. Word-level distillation improves the dual-encoder in analysis of word semantics and relationships. As a result, the dual-encoder can be used independently for subsequent encoding and retrieval, avoiding the significant computational cost associated with the participation of the cross-encoder. Furthermore, we propose a simple dynamic filtering method, which updates the threshold during multiple training iterations to ensure the effective identification of false negatives and thus obtains a more comprehensive semantic representation space. The experimental results over two standard datasets show our MD2PR outperforms 11 baseline models in terms of MRR and Recall metrics.
Benjamin Clavié
As language-specific training data tends to be sparsely available compared to English, document retrieval in many languages has been largely relying on multilingual models. In Japanese, the best performing deep-learning based retrieval approaches rely on multilingual dense embedders, with Japanese-only models lagging far behind. However, multilingual models require considerably more compute and data to train and have higher computational and memory requirements while often missing out on culturally-relevant information. In this paper, we introduce JaColBERT, a family of multi-vector retrievers trained on two magnitudes fewer data than their multilingual counterparts while reaching competitive performance. Our strongest model largely outperform all existing monolingual Japanese retrievers on all dataset, as well as the strongest existing multilingual models on all out-of-domain tasks, highlighting the need for specialised models able to handle linguistic specificities. These results are achieved using a model with only 110 million parameters, considerably smaller than all multilingual models, and using only a limited Japanese-language. We believe our results show great promise to support Japanese retrieval-enhanced application pipelines in a wide variety of domains.
Zeqiang Wei, Kai Jin, Xiuzhuang Zhou
Cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis and various medical generative tasks. Eliminating heterogeneity between different modalities to enhance semantic consistency is the key challenge of this task. The current Vision-Language Pretraining (VLP) models, with cross-modal contrastive learning and masked reconstruction as joint training tasks, can effectively enhance the performance of cross-modal retrieval. This framework typically employs dual-stream inputs, using unmasked data for cross-modal contrastive learning and masked data for reconstruction. However, due to task competition and information interference caused by significant differences between the inputs of the two proxy tasks, the effectiveness of representation learning for intra-modal and cross-modal features is limited. In this paper, we propose an efficient VLP framework named Masked Contrastive and Reconstruction (MCR), which takes masked data as the sole input for both tasks. This enhances task connections, reducing information interference and competition between them, while also substantially decreasing the required GPU memory and training time. Moreover, we introduce a new modality alignment strategy named Mapping before Aggregation (MbA). Unlike previous methods, MbA maps different modalities to a common feature space before conducting local feature aggregation, thereby reducing the loss of fine-grained semantic information necessary for improved modality alignment. Qualitative and quantitative experiments conducted on the MIMIC-CXR dataset validate the effectiveness of our approach, demonstrating state-of-the-art performance in medical cross-modal retrieval tasks.
Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao
Dense retrieval needs to learn discriminative text embeddings to represent the semantic relationship between query and document. It may benefit from the using of large language models (LLMs), given LLMs' strong capability on semantic understanding. However, the LLMs are pre-trained by text generation tasks, whose working pattern is completely different from representing texts as embeddings. As a result, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of LLM for the dense retrieval application. LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from LLM are used to reconstruct the tokens for the input sentence and predict the tokens for the next sentence, respectively. LLaRA turns out to be simple, lightweight, and highly effective. It is applied to adapt LLaMA-2-7B (base) on the Wikipedia corpus, where it substantially improves the model's fine-tuned performances on a variety of dense retrieval benchmarks, like MSMARCO and BEIR. Our model and code will be made publicly available at BGE repository.
Ali Abdari, Alex Falcon, Giuseppe Serra
Recently, the Metaverse is becoming increasingly attractive, with millions of
users accessing the many available virtual worlds. However, how do users find
the one Metaverse which best fits their current interests? So far, the search
process is mostly done by word of mouth, or by advertisement on
technology-oriented websites. However, the lack of search engines similar to
those available for other multimedia formats (e.g., YouTube for videos) is
showing its limitations, since it is often cumbersome to find a Metaverse based
on some specific interests using the available methods, while also making it
difficult to discover user-created ones which lack strong advertisement. To
address this limitation, we propose to use language to naturally describe the
desired contents of the Metaverse a user wishes to find. Second, we highlight
that, differently from more conventional 3D scenes, Metaverse scenarios
represent a more complex data format since they often contain one or more types
of multimedia which influence the relevance of the scenario itself to a user
query. Therefore, in this work, we create a novel task, called
Text-to-Metaverse retrieval, which aims at modeling these aspects while also
taking the cross-modal relations with the textual data into account. Since we
are the first ones to tackle this problem, we also collect a dataset of 33000
Metaverses, each of which consists of a 3D scene enriched with multimedia
content. Finally, we design and implement a deep learning framework based on
contrastive learning, resulting in a thorough experimental setup.
Authors' comments: Accepted at 30th International Conference on Multimedia Modeling-
MMM2024