Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Hao Wu et al.
Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To address these costs, recent studies equip DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite this success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate that the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility.
Gabriel de Jesus, Sérgio Nunes
Searching for information on the internet and digital platforms to satisfy an
information need requires effective retrieval solutions. However, such
solutions are not yet available for Tetun, making it challenging to find
relevant documents for text-based search queries in this language. To address
these challenges, we investigate Tetun text retrieval with a focus on the
ad-hoc retrieval task. The study begins by developing essential language
resources -- including a list of stopwords, a stemmer, and a test collection --
which serve as foundational components for solutions tailored to Tetun text
retrieval. Various strategies are investigated using both document titles and
content to evaluate retrieval effectiveness. The results demonstrate that
retrieving document titles, after removing hyphens and apostrophes without
applying stemming, significantly improves retrieval performance compared to the
baseline. Efficiency increases by 31.37%, while effectiveness achieves an
average relative gain of +9.40% in MAP@10 and +30.35% in NDCG@10 with DFR BM25.
Beyond the top-10 cutoff point, Hiemstra LM shows strong performance across
various retrieval strategies and evaluation metrics. Contributions of this work
include the development of Labadain-Stopwords (a list of 160 Tetun stopwords),
Labadain-Stemmer (a Tetun stemmer with three variants), and
Labadain-Avaliadór (a Tetun test collection containing 59 topics, 33,550
documents, and 5,900 qrels). We make all resources publicly accessible to
facilitate future research in Tetun information retrieval.
Authors' comments: Version 3
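The effectiveness figures in the Tetun retrieval abstract above are reported as relative gains in MAP@10 and NDCG@10. As a minimal sketch of how such metrics are commonly computed (the function names, the binary-relevance AP normalization, and the log2 discount are assumptions made for illustration, not details taken from the paper):

```python
import math

def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k with binary relevance.

    Assumption: normalized by min(k, number of relevant documents);
    other normalization conventions exist.
    """
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(k, len(relevant_ids))
    return score / denom if denom else 0.0

def ndcg_at_k(ranked_ids, gains, k=10):
    """NDCG@k with graded gains (dict: doc_id -> gain) and a log2 discount."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(d, 0) for d in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

def relative_gain(new, baseline):
    """Relative gain in percent, e.g. a strategy's MAP@10 over the baseline's."""
    return 100.0 * (new - baseline) / baseline
```

MAP@10 is then the mean of `average_precision_at_k` over all topics in the test collection.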
Weijie Chen, Ting Bai, Jinbo Su, Jian Luan, Wei Liu, Chuan Shi
Large language models with retrieval-augmented generation encounter a pivotal challenge in intricate retrieval tasks, e.g., multi-hop question answering, which requires the model to navigate across multiple documents and generate comprehensive responses based on fragmented information. To tackle this challenge, we introduce a novel Knowledge Graph-based RAG framework with a hierarchical knowledge retriever, termed KG-Retriever. The retrieval indexing in KG-Retriever is constructed on a hierarchical index graph that consists of a knowledge graph layer and a collaborative document layer. The associative nature of graph structures is fully utilized to strengthen intra-document and inter-document connectivity, thereby fundamentally alleviating the information fragmentation problem while also improving retrieval efficiency in cross-document retrieval of LLMs. With the coarse-grained collaborative information from neighboring documents and concise information from the knowledge graph, KG-Retriever achieves marked improvements on five public QA datasets, showing the effectiveness and efficiency of our proposed RAG framework.
Suyuan Huang, Chao Zhang, Yuanyuan Wu, Haoxin Zhang, Yuan Wang, Maolin Wang, Shaosheng Cao, Tong Xu et al.
Dense retrieval in most industries employs dual-tower architectures to retrieve query-relevant documents. Due to online deployment requirements, existing real-world dense retrieval systems mainly enhance performance by designing negative sampling strategies, overlooking the advantages of scaling up. Recently, Large Language Models (LLMs) have exhibited superior performance that can be leveraged for scaling up dense retrieval. However, scaling up retrieval models significantly increases online query latency. To address this challenge, we propose ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency. The first stage is training dual towers, both initialized from the same LLM, to unlock the potential of LLMs for dense retrieval. Then, we distill only the query tower using mean squared error loss and cosine similarity to reduce online costs. Through theoretical analysis and comprehensive offline and online experiments, we show the effectiveness and efficiency of ScalingNote. Our two-stage scaling method outperforms end-to-end models and verifies the scaling law of dense retrieval with LLMs in industrial scenarios, enabling cost-effective scaling of dense retrieval systems. Our online method incorporating ScalingNote significantly enhances the relevance between retrieved documents and queries.
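The ScalingNote abstract above mentions distilling the query tower with mean squared error loss and cosine similarity. A minimal sketch of one way to combine the two terms on a single pair of embeddings follows; the mixing weight `alpha`, the `1 - cosine` form of the second term, and all function names are assumptions for illustration, since the abstract does not specify them:

```python
import math

def mse(student, teacher):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def distill_loss(student, teacher, alpha=0.5):
    """Hypothetical combined loss: alpha * MSE + (1 - alpha) * (1 - cosine).

    `student` is the small query tower's embedding and `teacher` is the
    LLM-based query tower's embedding for the same query.
    """
    return alpha * mse(student, teacher) + (1 - alpha) * (
        1.0 - cosine_similarity(student, teacher)
    )
```

The loss is zero when the student reproduces the teacher embedding exactly and grows as the two diverge in magnitude or direction.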
Aniket Deroy, Subhankar Maity
Code-mixing, the integration of lexical and grammatical elements from
multiple languages within a single sentence, is a widespread linguistic
phenomenon, particularly prevalent in multilingual societies. In India, social
media users frequently engage in code-mixed conversations using the Roman
script, especially among migrant communities who form online groups to share
relevant local information. This paper focuses on the challenges of extracting
relevant information from code-mixed conversations, specifically within Roman
transliterated Bengali mixed with English. This study presents a novel approach
to address these challenges by developing a mechanism to automatically identify
the most relevant answers from code-mixed conversations. We have experimented
with a dataset comprising queries and documents from Facebook, and Query
Relevance files (QRels) to aid in this task. Our results demonstrate the
effectiveness of our approach in extracting pertinent information from complex,
code-mixed digital conversations, contributing to the broader field of natural
language processing in multilingual and informal text environments. We use
GPT-3.5 Turbo via prompting, along with the sequential nature of relevant
documents, to frame a mathematical model that helps to detect relevant
documents corresponding to a query.
Authors' comments: Final and Updated version
Yuhang Liu, Xueyu Hu, Shengyu Zhang, Jingyuan Chen, Fan Wu, Fei Wu
Retrieval-Augmented Generation (RAG) has proven to be an effective method for
mitigating hallucination issues inherent in large language models (LLMs).
Previous approaches typically train retrievers based on semantic similarity,
lacking optimization for RAG. More recent works have proposed aligning
retrievers with the preference signals of LLMs. However, these preference
signals are often difficult for dense retrievers, which typically have weaker
language capabilities, to understand and learn effectively. Drawing inspiration
from pedagogical theories like Guided Discovery Learning, we propose a novel
framework, FiGRet (Fine-grained Guidance for Retrievers), which leverages the
language capabilities of LLMs to construct examples from a more granular,
information-centric perspective to guide the learning of retrievers.
Specifically, our method utilizes LLMs to construct easy-to-understand examples
from samples where the retriever performs poorly, focusing on three learning
objectives highly relevant to the RAG scenario: relevance, comprehensiveness,
and purity. These examples serve as scaffolding to ultimately align the
retriever with the LLM's preferences. Furthermore, we employ a dual curriculum
learning strategy and leverage the reciprocal feedback between LLM and
retriever to further enhance the performance of the RAG system. A series of
experiments demonstrate that our proposed framework enhances the performance of
RAG systems equipped with different retrievers and is applicable to various
LLMs.
Authors' comments: 13 pages, 4 figures
Qingfei Zhao, Ruobing Wang, Xin Wang, Daren Zha, Nan Mu
Retrieval-Augmented Generation (RAG) has emerged as a reliable external
knowledge augmentation technique to mitigate hallucination issues and
parameterized knowledge limitations in Large Language Models (LLMs). Existing
Adaptive RAG (ARAG) systems struggle to effectively explore multiple retrieval
sources due to their inability to select the right source at the right time. To
address this, we propose a multi-source ARAG framework, termed MSPR, which
synergizes reasoning and preference-driven retrieval to adaptively decide "when
and what to retrieve" and "which retrieval source to use". To better adapt to
retrieval sources of differing characteristics, we also employ retrieval action
adjustment and answer feedback strategies. They enable our framework to fully
explore the high-quality primary source while supplementing it with secondary
sources at the right time. Extensive and multi-dimensional experiments
conducted on three datasets demonstrate the superiority and effectiveness of
MSPR.
Authors' comments: 5 pages, 1 figure
Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, Feifei Li
Large Language Models (LLMs) have demonstrated remarkable generation capabilities but often struggle to access up-to-date information, which can lead to hallucinations. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating knowledge from external databases, enabling more accurate and relevant responses. Due to the context window constraints of LLMs, it is impractical to input the entire external database context directly into the model. Instead, only the most relevant information, referred to as chunks, is selectively retrieved. However, current RAG research faces three key challenges. First, existing solutions often select each chunk independently, overlooking potential correlations among them. Second, in practice the utility of chunks is non-monotonic, meaning that adding more chunks can decrease overall utility. Traditional methods emphasize maximizing the number of included chunks, which can inadvertently compromise performance. Third, each type of user query possesses unique characteristics that require tailored handling, an aspect that current approaches do not fully consider. To overcome these challenges, we propose a cost-constrained retrieval optimization system, CORAG, for retrieval-augmented generation. We employ a Monte Carlo Tree Search (MCTS) based policy framework to find optimal chunk combinations sequentially, allowing for a comprehensive consideration of correlations among chunks. Additionally, rather than viewing budget exhaustion as a termination condition, we integrate budget constraints into the optimization of chunk combinations, effectively addressing the non-monotonicity of chunk utility.
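The non-monotonicity point in the CORAG abstract above can be illustrated with a toy example. The sketch below is not the MCTS policy the abstract describes: it swaps in exhaustive search over a three-chunk pool, and the chunk names, costs, and utility function are all invented for illustration. It only shows why filling the budget is not always the best choice once chunks interact.

```python
from itertools import combinations

def best_combination(chunks, costs, utility, budget):
    """Exhaustively score every chunk subset whose total cost fits the budget.

    Feasible only for tiny pools; a stand-in for a real search policy.
    """
    best, best_u = (), utility(())
    for r in range(1, len(chunks) + 1):
        for combo in combinations(chunks, r):
            if sum(costs[c] for c in combo) > budget:
                continue  # over budget, skip
            u = utility(combo)
            if u > best_u:
                best, best_u = combo, u
    return best, best_u

def toy_utility(combo):
    """Hypothetical utility with chunk interactions (not from the paper)."""
    base = {"a": 0.6, "b": 0.5, "c": 0.2}
    u = sum(base[c] for c in combo)
    if "a" in combo and "b" in combo:
        u -= 0.4  # "a" and "b" are partly redundant
    if "c" in combo and len(combo) > 1:
        u -= 0.3  # "c" distracts when mixed with other chunks
    return u

costs = {"a": 2, "b": 2, "c": 1}
best, score = best_combination(["a", "b", "c"], costs, toy_utility, budget=5)
```

Here all three chunks fit within the budget, yet the best subset is `("a", "b")`: adding `"c"` lowers the utility, which is the behavior that treating budget exhaustion as the stopping rule would miss.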
Zijia Zhao, Longteng Guo, Tongtian Yue, Erdong Hu, Shuai Shao, Zehuan Yuan, Hua Huang, Jing Liu
In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the correct image in the database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at https://github.com/joez17/ChatSearch.
Dae Yon Hwang, Bilal Taha, Harshit Pande, Yaroslav Nechaev
Despite the recent advancements in information retrieval (IR), zero-shot IR
remains a significant challenge, especially when dealing with new domains,
languages, and newly-released use cases that lack historical query traffic from
existing users. For such cases, it is common to use query augmentations
followed by fine-tuning pre-trained models on the document data paired with
synthetic queries. In this work, we propose a novel Universal Document Linking
(UDL) algorithm, which links similar documents to enhance synthetic query
generation across multiple datasets with different characteristics. UDL
leverages entropy for the choice of similarity models and named entity
recognition (NER) for the link decision of documents using similarity scores.
Our empirical studies demonstrate the effectiveness and universality of the UDL
across diverse datasets and IR models, surpassing state-of-the-art methods in
zero-shot cases. The code for reproducibility is available at
https://github.com/eoduself/UDL
Authors' comments: Accepted for publication at EMNLP 2024 Main Conference
Seong-Il Park, Jay-Yoon Lee
Retrieval Augmented Language Models (RALMs) have gained significant attention
for their ability to generate accurate answers and improve efficiency. However,
RALMs are inherently vulnerable to imperfect information due to their reliance
on an imperfect retriever or knowledge source. We identify three common
scenarios (unanswerable, adversarial, and conflicting) where retrieved document
sets can confuse RALMs with plausible real-world examples. We present the first
comprehensive investigation to assess how well RALMs detect and handle such
problematic scenarios. Among these scenarios, to systematically examine
adversarial robustness, we propose a new adversarial attack method, Generative
model-based ADVersarial attack (GenADV), and a novel metric, Robustness under
Additional Document (RAD). Our findings reveal that RALMs often fail to
identify the unanswerability or contradiction of a document set, which
frequently leads to hallucinations. Moreover, we show the addition of an
adversary significantly degrades RALM's performance, with the model becoming
even more vulnerable when the two scenarios overlap (adversarial+unanswerable).
Our research identifies critical areas for assessing and enhancing the
robustness of RALMs, laying the foundation for the development of more robust
models.
Authors' comments: Accepted for publication in Transactions of the Association for
Computational Linguistics (TACL)
Cody Clop, Yannick Teglia
Large Language Models (LLMs) have demonstrated remarkable capabilities in
generating coherent text but remain limited by the static nature of their
training data. Retrieval Augmented Generation (RAG) addresses this issue by
combining LLMs with up-to-date information retrieval, but also expands the
attack surface of the system. This paper investigates prompt injection attacks
on RAG, focusing on malicious objectives beyond misinformation, such as
inserting harmful links, promoting unauthorized services, and initiating
denial-of-service behaviors. We build upon existing corpus poisoning techniques
and propose a novel backdoor attack aimed at the fine-tuning process of the
dense retriever component. Our experiments reveal that corpus poisoning can
achieve significant attack success rates through the injection of a small
number of compromised documents into the retriever corpus. In contrast,
backdoor attacks demonstrate even higher success rates but necessitate a more
complex setup, as the victim must fine-tune the retriever using the
attacker-poisoned dataset.
Authors' comments: 12 pages, 5 figures
Xiangci Li, Jessica Ouyang
Retrieval-augmented generation (RAG) has emerged as a powerful method for enhancing natural language generation by integrating external knowledge into a model's output. While prior work has demonstrated the importance of improving knowledge retrieval for boosting generation quality, the role of knowledge selection remains less clear. In this paper, we perform a comprehensive analysis of how knowledge retrieval and selection influence downstream generation performance in RAG systems. By simulating different retrieval and selection conditions through a controlled mixture of gold and distractor knowledge, we assess the impact of these factors on generation outcomes. Our findings indicate that the downstream generator model's capability, as well as the complexity of the task and dataset, significantly influence the impact of knowledge retrieval and selection on the overall RAG system performance. In typical scenarios, improving the knowledge recall score is key to enhancing generation outcomes, with the knowledge selector providing a limited additional benefit when a strong generator model is used on clear, well-defined tasks. For weaker generator models or more ambiguous tasks and datasets, the knowledge F1 score becomes a critical factor, and the knowledge selector plays a more prominent role in improving overall performance.
Pengfei Jin, Peng Shu, Sekeun Kim, Qing Xiao, Sifan Song, Cheng Chen, Tianming Liu, Xiang Li et al.
Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.
Bolei He, Nuo Chen, Xinran He, Lingyong Yan, Zhenkai Wei, Jinchang Luo, Zhen-Hua Ling
Recent Retrieval Augmented Generation (RAG) aims to enhance Large Language
Models (LLMs) by incorporating extensive knowledge retrieved from external
sources. However, such an approach encounters two challenges: first, the
original queries may not be suitable for precise retrieval, resulting in
erroneous contextual knowledge; second, the language model can easily
generate answers inconsistent with external references due to its knowledge
boundary limitations. To address these issues, we propose the
chain-of-verification (CoV-RAG) to enhance the external retrieval correctness
and internal generation consistency. Specifically, we integrate the
verification module into the RAG, engaging in scoring, judgment, and rewriting.
To correct external retrieval errors, CoV-RAG retrieves new knowledge using a
revised query. To correct internal generation errors, we unify QA and
verification tasks with Chain-of-Thought (CoT) reasoning during training. Our
comprehensive experiments across various LLMs demonstrate its effectiveness and
adaptability compared with other strong baselines. In particular, CoV-RAG
significantly surpasses the state-of-the-art baselines using different LLM
backbones.
Authors' comments: Accepted to EMNLP 2024 Findings. 9 pages, 4 figures, 7 tables
Kasra Hosseini, Thomas Kober, Josip Krapac, Roland Vollgraf, Weiwei Cheng, Ana Peleteiro Ramallo
Evaluating production-level retrieval systems at scale is a crucial yet
challenging task due to the limited availability of a large pool of
well-trained human annotators. Large Language Models (LLMs) have the potential
to address this scaling issue and offer a viable alternative to humans for the
bulk of annotation tasks. In this paper, we propose a framework for assessing
product search engines in a large-scale e-commerce setting, leveraging
Multimodal LLMs for (i) generating tailored annotation guidelines for
individual queries, and (ii) conducting the subsequent annotation task. Our
method, validated through deployment on a large e-commerce platform,
demonstrates comparable quality to human annotations, significantly reduces
time and cost, facilitates rapid problem discovery, and provides an effective
solution for production-level quality control at scale.
Authors' comments: 13 pages, 5 figures, 4 Tables
Benjamin Clavié
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, well above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints and all data used publicly available.
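The JaColBERTv2.5 abstract above does not say how its checkpoint merging step works. One widely used form of merging is linear interpolation between the weights of the original and fine-tuned checkpoints, and the sketch below illustrates only that generic technique. The flat-list parameter representation, the `weight` parameter, and the function name are assumptions, not the paper's recipe.

```python
def merge_checkpoints(base, finetuned, weight=0.5):
    """Linearly interpolate two checkpoints with identical parameter names.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats. `weight` is the share given to the fine-tuned checkpoint.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in base:
        b, f = base[name], finetuned[name]
        if len(b) != len(f):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1 - weight) * x + weight * y for x, y in zip(b, f)]
    return merged
```

With `weight=0.0` the result equals the original checkpoint and with `weight=1.0` it equals the fine-tuned one; values in between trade off the two.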
Zeyu Chen, Pengfei Zhang, Kai Ye, Wei Dong, Xin Feng, Yana Zhang
The burgeoning short video industry has accelerated the advancement of
video-music retrieval technology, assisting content creators in selecting
appropriate music for their videos. In self-supervised training for
video-to-music retrieval, the video and music samples in the dataset are
separated from the same video work, so they are all one-to-one matches. This
does not match the real situation. In reality, a video can use different music
as background music, and a piece of music can be used as background music for
different videos. Many videos and music tracks that are not paired may be
compatible, leading to false negative noise in the dataset. A novel inter-intra modal (II) loss is
proposed as a solution. By reducing the variation of feature distribution
within the two modalities before and after the encoder, II loss can reduce the
model's overfitting to such noise without removing it in a costly and laborious
way. The video-music retrieval framework, II-CLVM (Contrastive Learning for
Video-Music Retrieval), incorporating the II Loss, achieves state-of-the-art
performance on the YouTube8M dataset. The framework II-CLVTM shows better
performance when retrieving music using multi-modal video information (such as
text in videos). Experiments are designed to show that II loss can effectively
alleviate the problem of false negative noise in retrieval tasks. Experiments
also show that II loss improves various self-supervised and supervised
uni-modal and cross-modal retrieval tasks, and can obtain good retrieval models
with a small number of training samples.
Authors' comments: 10 pages, 7 figures
Zhiyu An, Xianzhong Ding, Yen-Chun Fu, Cheng-Chung Chu, Yan Li, Wan Du
This paper introduces Golden-Retriever, designed to efficiently navigate vast industrial knowledge bases, overcoming challenges in traditional LLM fine-tuning and RAG frameworks with domain-specific jargon and context interpretation. Golden-Retriever incorporates a reflection-based question augmentation step before document retrieval, which involves identifying jargon, clarifying its meaning based on context, and augmenting the question accordingly. Specifically, our method extracts and lists all jargon and abbreviations in the input question, determines the context against a pre-defined list, and queries a jargon dictionary for extended definitions and descriptions. This comprehensive augmentation ensures the RAG framework retrieves the most relevant documents by providing clear context and resolving ambiguities, significantly improving retrieval accuracy. Evaluations using three open-source LLMs on a domain-specific question-answer dataset demonstrate Golden-Retriever's superior performance, providing a robust solution for efficiently integrating and querying industrial knowledge bases.
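The question augmentation step in the Golden-Retriever abstract above (list the jargon, determine the context, look up definitions, augment the question) can be sketched as follows. In this sketch a plain dictionary lookup stands in for the jargon identification step, and the dictionary entries, function names, and output format are all hypothetical, not taken from the paper.

```python
import re

# Hypothetical jargon dictionary; a real deployment would load its own.
JARGON = {
    "RAG": "retrieval-augmented generation",
    "LLM": "large language model",
}

def find_jargon(question, dictionary):
    """Return dictionary terms that appear in the question, in order, without repeats."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", question)
    found = []
    for token in tokens:
        if token in dictionary and token not in found:
            found.append(token)
    return found

def augment_question(question, dictionary, context=None):
    """Append the (optional) context label and jargon definitions to the question."""
    terms = find_jargon(question, dictionary)
    parts = [question]
    if context:
        parts.append(f"Context: {context}")
    if terms:
        notes = "; ".join(f"{t}: {dictionary[t]}" for t in terms)
        parts.append(f"Definitions: {notes}")
    return "\n".join(parts)
```

The augmented string, rather than the raw question, would then be passed to the document retriever.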
Nandan Thakur, Luiz Bonifacio, Maik Fröbe, Alexander Bondarenko, Ehsan Kamalloo, Martin Potthast, Matthias Hagen, Jimmy Lin
The zero-shot effectiveness of neural retrieval models is often evaluated on
the BEIR benchmark -- a combination of different IR evaluation datasets.
Interestingly, previous studies found that particularly on the BEIR subset
Touché 2020, an argument retrieval task, neural retrieval models are
considerably less effective than BM25. Still, so far, no further investigation
has been conducted on what makes argument retrieval so "special". To more
deeply analyze the respective potential limits of neural retrieval models, we
run a reproducibility study on the Touché 2020 data. In our study, we focus
on two experiments: (i) a black-box evaluation (i.e., no model retraining),
incorporating a theoretical exploration using retrieval axioms, and (ii) a data
denoising evaluation involving post-hoc relevance judgments. Our black-box
evaluation reveals an inherent bias of neural models towards retrieving short
passages from the Touché 2020 data, and we also find that quite a few of the
neural models' results are unjudged in the Touché 2020 data. As many of the
short Touché passages are not argumentative and thus non-relevant per se, and
as the missing judgments complicate fair comparison, we denoise the Touché
2020 data by excluding very short passages (fewer than 20 words) and by
augmenting the unjudged data with post-hoc judgments following the Touché
guidelines. On the denoised data, the effectiveness of the neural models
improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code
and the augmented Touché 2020 dataset are available at
https://github.com/castorini/touche-error-analysis.
Authors' comments: SIGIR 2024 (Resource & Reproducibility Track)
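The length-based denoising step described in the Touché 2020 abstract above (excluding passages shorter than 20 words) can be sketched as a simple filter. Counting words by whitespace splitting, the dict-based corpus layout, and the function name are assumptions; the authors' actual preprocessing is in their repository.

```python
def drop_short_passages(passages, min_words=20):
    """Keep only passages with at least `min_words` whitespace-separated words.

    `passages` maps a passage id to its text.
    """
    return {
        pid: text
        for pid, text in passages.items()
        if len(text.split()) >= min_words
    }
```

For example, a corpus holding one 3-word passage and one 20-word passage would retain only the latter.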