Zhen Zhang, Xinyu Wang, Yong Jiang, Zhuo Chen, Feiteng Mu, Mengting Hu, Pengjun Xie, Fei Huang
Large Language Models (LLMs) are increasingly recognized for their practical applications. However, these models often encounter challenges in dynamically changing knowledge, as well as in managing unknown static knowledge. Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs. Actually, we find that the impact of RAG on the question answering capabilities of LLMs can be categorized into three groups: beneficial, neutral, and harmful. By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs, while also improving the overall performance of LLMs. This insight motivates us to differentiate between types of questions using certain metrics as indicators, to decrease the retrieval ratio without compromising performance. In our work, we propose a method that is able to identify different types of questions from this view by training a Knowledge Boundary Model (KBM). Experiments conducted on 11 English and Chinese datasets illustrate that the KBM effectively delineates the knowledge boundary, significantly decreasing the proportion of retrievals required for optimal end-to-end performance. Specifically, we evaluate the effectiveness of KBM in three complex scenarios: dynamic knowledge, long-tail static knowledge, and multi-hop problems, as well as its functionality as an external LLM plug-in.
Sangam Lee, Ryang Heo, SeongKu Kang, Susik Yoon, Jinyoung Yeo, Dongha Lee
Generative retrieval has recently emerged as an alternative to traditional information retrieval approaches. However, existing generative retrieval methods directly decode a docid when a query is given, making it impossible to provide users with an explanation that answers "Why was this document retrieved?". To address this limitation, we propose Hierarchical Category Path-Enhanced Generative Retrieval (HyPE), which enhances explainability by generating hierarchical category paths step by step before decoding the docid. HyPE leverages hierarchical category paths as explanations, progressing from broad to specific semantic categories. This approach enables diverse explanations for the same document depending on the query by using category paths shared between the query and the document, and provides reasonable explanations by reflecting the document's semantic structure in a coarse-to-fine manner. HyPE constructs category paths with an external high-quality semantic hierarchy, leverages an LLM to select appropriate candidate paths for each document, and optimizes the generative retrieval model with a path-augmented dataset. During inference, HyPE utilizes a path-aware reranking strategy to aggregate diverse topic information, allowing the most relevant documents to be prioritized in the final ranked list of docids. Our extensive experiments demonstrate that HyPE not only offers a high level of explainability but also improves retrieval performance in the document retrieval task.
Geonmin Kim, Jaeyeon Kim, Hancheol Park, Wooksu Shin, Tae-Ho Kim
Thanks to the unprecedented language understanding and generation capabilities of large language models (LLMs), Retrieval-augmented Code Generation (RaCG) has recently been widely adopted by software developers. While this has increased productivity, incorrect code is still frequently produced. In particular, there are cases where plausible yet incorrect code is generated for user queries that cannot be answered with the given queries and API descriptions. This study proposes a task for evaluating answerability, which assesses whether valid answers can be generated based on users' queries and retrieved APIs in RaCG. Additionally, we build a benchmark dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to evaluate the performance of models performing this task. Experimental results show that this task remains very challenging, with baseline models exhibiting a low performance of 46.7%. Furthermore, this study discusses methods that could significantly improve performance.
Dincy R. Arikkat, Abhinav M., Navya Binu, Parvathi M., Navya Biju, K. S. Arunima, Vinod P., Rafidha Rehiman K. A. et al.
In the rapidly evolving landscape of cyber security, intelligent chatbots are gaining prominence. Artificial Intelligence, Machine Learning, and Natural Language Processing empower these chatbots to handle user inquiries and deliver threat intelligence. This helps make cyber security knowledge readily available to both professionals and the public. Traditional rule-based chatbots often lack flexibility and struggle to adapt to user interactions. In contrast, Large Language Model-based chatbots offer contextually relevant information across multiple domains and adapt to evolving conversational contexts. In this work, we develop IntellBot, an advanced cyber security chatbot built on top of cutting-edge technologies such as Large Language Models and Langchain, alongside a Retrieval-Augmented Generation model, to deliver superior capabilities. This chatbot gathers information from diverse data sources to create a comprehensive knowledge base covering known vulnerabilities, recent cyber attacks, and emerging threats. It delivers tailored responses, serving as a primary hub for cyber security insights. By providing instant access to relevant information and resources, IntellBot enhances threat intelligence, incident response, and overall security posture, saving time and empowering users with knowledge of cyber security best practices. Moreover, we analyzed the performance of our copilot using a two-stage evaluation strategy. We achieved a BERT score above 0.8 with the indirect approach and a cosine similarity score ranging from 0.8 to 1, which affirms the accuracy of our copilot. Additionally, we utilized RAGAS to evaluate the RAG model, and all evaluation metrics consistently produced scores above 0.77, highlighting the efficacy of our system.
Hailey Joren, Jianyi Zhang, Chun-Sung Ferng, Da-Cheng Juan, Ankur Taly, Cyrus Rashtchian
Augmenting LLMs with context leads to improved performance across many applications. Despite much research on Retrieval Augmented Generation (RAG) systems, an open question is whether errors arise because LLMs fail to utilize the context from retrieval or because the context itself is insufficient to answer the query. To shed light on this, we develop a new notion of sufficient context, along with a method to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that larger models with higher baseline performance (Gemini 1.5 Pro, GPT 4o, Claude 3.5) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. On the other hand, smaller models with lower baseline performance (Mistral 3, Gemma 2) hallucinate or abstain often, even with sufficient context. We further categorize cases in which the context is useful and improves accuracy, even though it does not fully answer the query and the model errs without the context. Building on our findings, we explore ways to reduce hallucinations in RAG systems, including a new selective generation method that leverages sufficient context information for guided abstention. Our method improves the fraction of correct answers among cases where the model responds by 2-10% for Gemini, GPT, and Gemma. Key findings and the prompts used in our autorater analysis are available on our GitHub.
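The quantity reported above, the fraction of correct answers among the cases where the model responds, can be illustrated with a small sketch. This is not the authors' method: the function names, the confidence threshold, and the case-insensitive exact-match check are all assumptions made for illustration, and an abstention is represented as `None`.

```python
from typing import List, Optional, Sequence


def apply_abstention(
    answers: Sequence[str], confidences: Sequence[float], threshold: float
) -> List[Optional[str]]:
    """Replace low-confidence answers with None (abstain).

    The single scalar confidence and the threshold are hypothetical; the
    abstract only says that sufficient-context information guides abstention.
    """
    return [a if c >= threshold else None for a, c in zip(answers, confidences)]


def selective_metrics(
    predictions: Sequence[Optional[str]], gold: Sequence[str]
) -> dict:
    """Compute coverage and accuracy restricted to answered questions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    # Assumed correctness check: case-insensitive exact match.
    correct = sum(1 for p, g in answered if p.strip().lower() == g.strip().lower())
    coverage = len(answered) / len(gold) if gold else 0.0
    selective_accuracy = correct / len(answered) if answered else 0.0
    return {"coverage": coverage, "selective_accuracy": selective_accuracy}
```

Raising the threshold typically lowers coverage while raising selective accuracy, which is the trade-off a selective generation method tunes.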
Sonal Prabhune, Donald J. Berndt
Knowing that the generative capabilities of large language models (LLM) are sometimes hampered by tendencies to hallucinate or create non-factual responses, researchers have increasingly focused on methods to ground generated outputs in factual data. Retrieval Augmented Generation (RAG) has emerged as a key approach for integrating knowledge from data sources outside of the LLM's training set, including proprietary and up-to-date information. While many research papers explore various RAG strategies, their true efficacy is tested in real-world applications with actual data. The journey from conceiving an idea to actualizing it in the real world is a lengthy process. We present insights from the development and field-testing of a pilot project that integrates LLMs with RAG for information retrieval. Additionally, we examine the impacts on the information value chain, encompassing people, processes, and technology. Our aim is to identify the opportunities and challenges of implementing this emerging technology, particularly within the context of behavioral research in the information systems (IS) field. The contributions of this work include the development of best practices and recommendations for adopting this promising technology while ensuring compliance with industry regulations through a proposed AI governance model.
Zhichao Geng, Dongyu Ru, Yang Yang
Learned sparse retrieval, which can efficiently perform retrieval through mature inverted-index engines, has garnered growing attention in recent years. In particular, inference-free sparse retrievers are attractive because they eliminate online model inference in the retrieval phase, thereby avoiding a huge computational cost and offering reasonable throughput and latency. However, even the state-of-the-art (SOTA) inference-free sparse models lag far behind both sparse and dense siamese models in terms of search relevance. Towards competitive search relevance for inference-free sparse retrievers, we argue that they deserve dedicated training methods rather than reusing those designed for siamese encoders. In this paper, we propose two different approaches for performance improvement. First, we introduce the IDF-aware FLOPS loss, which introduces Inverse Document Frequency (IDF) to the sparsification of representations. We find that it mitigates the negative impact of the FLOPS regularization on search relevance, allowing the model to achieve a better balance between accuracy and efficiency. Moreover, we propose a heterogeneous ensemble knowledge distillation framework that combines siamese dense and sparse retrievers to generate supervisory signals during the pre-training phase. The ensemble of dense and sparse retrievers capitalizes on their respective strengths, providing a strong upper bound for knowledge distillation. To combine the diverse feedback from heterogeneous supervisors, we normalize and then aggregate the outputs of the teacher models to eliminate score scale differences. On the BEIR benchmark, our model outperforms the existing SOTA inference-free sparse model by 3.3 NDCG@10 points. It exhibits search relevance comparable to siamese sparse retrievers and client-side latency only 1.1x that of BM25.
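As a rough illustration of the regularizer discussed above, the sketch below computes the standard FLOPS penalty (the sum over vocabulary terms of the squared mean absolute activation in a batch) and an IDF-scaled variant. The abstract does not give the exact form of the IDF-aware loss, so dividing each term's mean activation by its IDF is purely an assumption, chosen so that rare, high-IDF terms are penalized less.

```python
from typing import Optional, Sequence


def flops_regularizer(
    batch: Sequence[Sequence[float]], idf: Optional[Sequence[float]] = None
) -> float:
    """FLOPS penalty over a batch of sparse term-weight vectors.

    batch: one row per document, one column per vocabulary term.
    idf:   optional per-term IDF values (must be positive). The IDF scaling
           is an illustrative assumption, not the paper's confirmed formula.
    """
    if not batch:
        return 0.0
    n_docs = len(batch)
    vocab_size = len(batch[0])
    total = 0.0
    for j in range(vocab_size):
        mean_activation = sum(abs(row[j]) for row in batch) / n_docs
        if idf is not None:
            mean_activation /= idf[j]  # assumed: weaker penalty for rare terms
        total += mean_activation ** 2
    return total
```

With weights `[[1.0, 0.0], [1.0, 2.0]]`, both terms have a mean activation of 1.0, so the plain penalty is 2.0; giving the second term an IDF of 2.0 halves its contribution before squaring.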
Moritz Staudinger, Florina Piroi, Andreas Rauber
There are settings in which reproducibility of ranked lists is desirable, such as when extracting a subset of an evolving document corpus for downstream research tasks, or in domains with high reproducibility expectations such as patent retrieval or medical systematic reviews. However, as global term statistics change when documents change or are added to a corpus, queries using typical ranked retrieval models are not reproducible even for the parts of the document corpus that have not changed. Thus, Boolean retrieval frequently remains the mechanism of choice in such settings. We present a hybrid retrieval system combining Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index. The latter component allows re-execution of previously posed queries resulting in the same ranked list, and further allows for time-travel queries over evolving collections, such as web archives, while maintaining the original ranking. Thus, retrieval results in evolving document collections are fully reproducible even when document collections, and thus term statistics, change.
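The core idea, scoring a query against the term statistics that were valid at a given point in time, can be sketched in a few lines. This toy example does not model Lucene or a column store; the `Doc` class, the integer `added_at` timestamp, and the smoothed TF-IDF formula are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    added_at: int  # hypothetical logical timestamp of insertion


def rank_as_of(docs: Sequence[Doc], query: str, as_of: int) -> List[Tuple[str, float]]:
    """Rank documents using only the corpus state visible at time `as_of`."""
    visible = [d for d in docs if d.added_at <= as_of]
    n = len(visible)
    tokens = {d.doc_id: d.text.lower().split() for d in visible}
    terms = query.lower().split()
    # Document frequencies are computed from the visible snapshot only.
    df = {t: sum(1 for toks in tokens.values() if t in toks) for t in terms}
    idf = {t: math.log(1 + n / df[t]) if df[t] else 0.0 for t in terms}
    scored = [
        (doc_id, sum(toks.count(t) * idf[t] for t in terms))
        for doc_id, toks in tokens.items()
    ]
    # Sort by score, then by id, so ties are broken deterministically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

Because documents added after `as_of` are excluded from both the candidate set and the statistics, re-running an old query with its original timestamp returns the same scores no matter how the collection has grown since.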
Simone Betteti, Giacomo Baggio, Francesco Bullo, Sandro Zampieri
The Hopfield model provides a mathematically idealized yet insightful
framework for understanding the mechanisms of memory storage and retrieval in
the human brain. This model has inspired four decades of extensive research on
learning and retrieval dynamics, capacity estimates, and sequential transitions
among memories. Notably, the role and impact of external inputs have been
largely underexplored, from their effects on neural dynamics to how they
facilitate effective memory retrieval. To bridge this gap, we propose a novel
dynamical system framework in which the external input directly influences the
neural synapses and shapes the energy landscape of the Hopfield model. This
plasticity-based mechanism provides a clear energetic interpretation of the
memory retrieval process and proves effective at correctly classifying highly
mixed inputs. Furthermore, we integrate this model within the framework of
modern Hopfield architectures, using this connection to elucidate how current
and past information are combined during the retrieval process. Finally, we
embed both the classic and the new model in an environment disrupted by noise
and compare their robustness during memory retrieval.
Authors' comments: 24 pages, 15 figures
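For readers unfamiliar with the baseline, the sketch below implements only the classic Hopfield model with Hebbian weights and asynchronous sign updates. It does not implement the input-driven plasticity mechanism proposed in the abstract; the function names and the fixed update order are choices made for this illustration.

```python
from typing import List, Sequence


def train_hebbian(patterns: Sequence[Sequence[int]]) -> List[List[float]]:
    """Build the Hebbian weight matrix for +1/-1 patterns (zero diagonal)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def energy(w: Sequence[Sequence[float]], state: Sequence[int]) -> float:
    """Classic Hopfield energy: E = -1/2 * sum_ij w_ij * s_i * s_j."""
    n = len(state)
    return -0.5 * sum(w[i][j] * state[i] * state[j] for i in range(n) for j in range(n))


def retrieve(
    w: Sequence[Sequence[float]], cue: Sequence[int], max_sweeps: int = 10
) -> List[int]:
    """Asynchronously update neurons until a fixed point or the sweep limit."""
    s = list(cue)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            new_value = 1 if field >= 0 else -1
            if new_value != s[i]:
                s[i] = new_value
                changed = True
        if not changed:
            break
    return s
```

Storing two orthogonal 8-bit patterns and cueing the network with one of them after a single bit flip recovers the stored pattern, and the energy of the recalled state is lower than that of the cue.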
Hossein Hosseini, Mohammad Siobhan Zare, Amir Hossein Mohammadi, Arefeh Kazemi, Zahra Zojaji, Mohammad Ali Nematbakhsh
Retrieval augmented generation (RAG) models, which integrate large-scale pre-trained generative models with external retrieval mechanisms, have shown significant success in various natural language processing (NLP) tasks. However, applying RAG models to Persian, a low-resource language, poses distinct challenges. These challenges primarily involve the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation stages of the system. In this paper, we address the challenges of implementing a real-world RAG system for the Persian language, called PersianRAG. We propose novel solutions to overcome these obstacles and evaluate our approach using several Persian benchmark datasets. Our experimental results demonstrate the capability of the PersianRAG framework to enhance question answering in Persian.
Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping
State-of-the-art retrieval models typically address a straightforward search
scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer
a specific question) and only a single modality is supported for both queries
and retrieved results. This paper introduces techniques for advancing
information retrieval with multimodal large language models (MLLMs), enabling a
broader search scenario, termed universal multimodal retrieval, where multiple
modalities and diverse retrieval tasks are accommodated. To this end, we first
study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16
retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever
is capable of understanding challenging queries, composed of both text and
image, but it underperforms compared to a smaller CLIP retriever in cross-modal
retrieval tasks due to the modality bias exhibited by MLLMs. To address the
issue, we propose modality-aware hard negative mining to mitigate the modality
bias exhibited by MLLM retrievers. Second, we propose continuously fine-tuning
the universal multimodal retriever to enhance its text retrieval capability
while preserving multimodal retrieval capability. As a result, our model,
MM-Embed, achieves state-of-the-art performance on the multimodal retrieval
benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing
the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval
benchmark. We also explore prompting the off-the-shelf MLLMs as zero-shot
rerankers to refine the ranking of the candidates from the multimodal
retriever. We find that, through prompt-and-reranking, MLLMs can further
improve multimodal retrieval when the user queries (e.g., text-image composed
queries) are more complex and challenging to understand. These findings also
pave the way for advancing universal multimodal retrieval in the future.
Authors' comments: Accepted at ICLR 2025. We release the model weights at:
https://huggingface.co/nvidia/MM-Embed
Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E. Jones, Oisin Mac Aodha, Sara Beery, Grant Van Horn
We introduce INQUIRE, a text-to-image retrieval benchmark designed to
challenge multimodal vision-language models on expert-level queries. INQUIRE
includes iNaturalist 2024 (iNat24), a new dataset of five million natural world
images, along with 250 expert-level retrieval queries. These queries are paired
with all relevant images comprehensively labeled within iNat24, comprising
33,000 total matches. Queries span categories such as species identification,
context, behavior, and appearance, emphasizing tasks that require nuanced image
understanding and domain expertise. Our benchmark evaluates two core retrieval
tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2)
INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed
evaluation of a range of recent multimodal models demonstrates that INQUIRE
poses a significant challenge, with the best models failing to achieve an
mAP@50 above 50%. In addition, we show that reranking with more powerful
multimodal models can enhance retrieval performance, yet there remains a
significant margin for improvement. By focusing on scientifically-motivated
ecological challenges, INQUIRE aims to bridge the gap between AI capabilities
and the needs of real-world scientific inquiry, encouraging the development of
retrieval systems that can assist with accelerating ecological and biodiversity
research. Our dataset and code are available at
https://inquire-benchmark.github.io
Authors' comments: Published in NeurIPS 2024, Datasets and Benchmarks Track
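For reference, one common way to compute mAP@k for ranked retrieval is sketched below. Several AP@k conventions exist and the abstract does not say which one the benchmark uses; normalizing by min(k, number of relevant items) is an assumption here, as are the function names. The ranking is assumed to contain no duplicate ids.

```python
from typing import Hashable, Sequence, Set, Tuple


def average_precision_at_k(
    ranked_ids: Sequence[Hashable], relevant: Set[Hashable], k: int = 50
) -> float:
    """AP@k: mean of precision values at the ranks where a relevant item appears."""
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalization: the best achievable number of hits in the top k.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    runs: Sequence[Tuple[Sequence[Hashable], Set[Hashable]]], k: int = 50
) -> float:
    """Average AP@k over (ranking, relevant-set) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

For a ranking `["a", "x", "b"]` with relevant set `{"a", "b"}`, precision is 1 at rank 1 and 2/3 at rank 3, so AP@3 is (1 + 2/3) / 2.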
Yuefeng Peng, Junda Wang, Hong Yu, Amir Houmansadr
Despite significant advancements, large language models (LLMs) still struggle with providing accurate answers when lacking domain-specific or up-to-date knowledge. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge bases, but it also introduces new attack surfaces. In this paper, we investigate data extraction attacks targeting RAG's knowledge databases. We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs. As a result, they fail on models that are less responsive to such malicious prompts -- for example, our experiments show that state-of-the-art attacks achieve near-zero success on Gemma-2B-IT. Moreover, even for models that can follow these instructions, we found fine-tuning may significantly reduce attack performance. To further reveal the vulnerability, we propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM. When this compromised LLM is integrated into a RAG system, attackers can exploit specific triggers in prompts to manipulate the LLM to leak documents from the retrieval database. By carefully designing the poisoned data, we achieve both verbatim and paraphrased document extraction. For example, on Gemma-2B-IT, we show that with only 5% poisoned data, our method achieves an average success rate of 94.1% for verbatim extraction (ROUGE-L score: 82.1) and 63.6% for paraphrased extraction (average ROUGE score: 66.4) across four datasets. These results underscore the privacy risks associated with the supply chain when deploying RAG systems.
Lixiao Yang, Mengyang Xu, Weimao Ke
Question-answering (QA) is an important application of Information Retrieval
(IR) and language models, and the latest trend is toward pre-trained large
neural networks with embedding parameters. Augmenting QA performances with
these LLMs requires intensive computational resources for fine-tuning. We
propose an innovative approach to improve QA task performances by integrating
optimized vector retrievals and instruction methodologies. Based on retrieval
augmentation, the process involves document embedding, vector retrieval, and
context construction for optimal QA results. We experiment with different
combinations of text segmentation techniques and similarity functions, and
analyze their impacts on QA performances. Results show that the model with a
small chunk size of 100 without any overlap of the chunks achieves the best
result and outperforms the models based on semantic segmentation using
sentences. We discuss related QA examples and offer insight into how model
performances are improved within the two-stage framework.
Authors' comments: 6 pages, 4 tables
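The retrieval stage described above (fixed-size chunks, vector similarity, context selection) can be sketched as follows. The abstract does not state the unit of the chunk size or the embedding model, so this example assumes whitespace-separated words and substitutes a bag-of-words vector for a learned embedding; all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def chunk_words(text: str, size: int = 100, overlap: int = 0) -> List[str]:
    """Split text into chunks of `size` words, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def bag_of_words(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase term counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the question, best first."""
    q = bag_of_words(question)
    scored = sorted(
        ((cosine(q, bag_of_words(c)), i) for i, c in enumerate(chunks)),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [chunks[i] for _, i in scored[:k]]
```

The selected chunks would then be concatenated into the prompt context; swapping `cosine` for another similarity function or changing `size` and `overlap` reproduces the kind of comparison the abstract reports.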
Nicola Dall'Asen, Yiming Wang, Enrico Fini, Elisa Ricci
Low-resource domains, characterized by scarce data and annotations, present
significant challenges for language and visual understanding tasks, with the
latter much under-explored in the literature. Recent advancements in
Vision-Language Models (VLM) have shown promising results in high-resource
domains but fall short in low-resource concepts that are under-represented
(e.g. only a handful of images per category) in the pre-training set. We tackle
the challenging task of zero-shot low-resource image classification from a
novel perspective. By leveraging a retrieval-based strategy, we achieve this in
a training-free fashion. Specifically, our method, named CoRE (Combination of
Retrieval Enrichment), enriches the representation of both query images and
class prototypes by retrieving relevant textual information from large
web-crawled databases. This retrieval-based enrichment significantly boosts
classification performance by incorporating the broader contextual information
relevant to the specific class. We validate our method on a newly established
benchmark covering diverse low-resource domains, including medical imaging,
rare plants, and circuits. Our experiments demonstrate that CoRE outperforms
existing state-of-the-art methods that rely on synthetic data generation and
model fine-tuning.
Authors' comments: Accepted to EMNLP 2024 (Main)
Nikolaos Flemotomos, Roger Hsiao, Pawel Swietojanski, Dogan Can, Xiaodan Zhuang
Neural contextual biasing allows speech recognition models to leverage
contextually relevant information, leading to improved transcription accuracy.
However, the biasing mechanism is typically based on a cross-attention module
between the audio and a catalogue of biasing entries, which means computational
complexity can pose severe practical limitations on the size of the biasing
catalogue and consequently on accuracy improvements. This work proposes an
approximation to cross-attention scoring based on vector quantization and
enables compute- and memory-efficient use of large biasing catalogues. We
propose to use this technique jointly with a retrieval based contextual biasing
approach. First, we use an efficient quantized retrieval module to shortlist
biasing entries by grounding them on audio. Then we use retrieved entries for
biasing. Since the proposed approach is agnostic to the biasing method, we
investigate using full cross-attention, LLM prompting, and a combination of the
two. We show that retrieval based shortlisting allows the system to efficiently
leverage biasing catalogues of several thousands of entries, resulting in up to
71% relative error rate reduction in personal entity recognition. At the same
time, the proposed approximation algorithm reduces compute time by 20% and
memory usage by 85-95%, for lists of up to one million entries, when compared
to standard dot-product cross-attention.
Authors' comments: 14 pages, 7 figures, submitted to IEEE/ACM Transactions on Audio,
Speech, and Language Processing
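The general idea of approximating dot-product scoring with vector quantization can be sketched as below: each catalogue entry is replaced by the index of its nearest codebook centroid, so a query needs one dot product per centroid rather than one per entry. The abstract does not describe the authors' quantization scheme; the fixed, pre-given codebook and all function names here are assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def assign_codes(keys: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each key to the index of its nearest centroid (squared L2 distance)."""
    def nearest(v: Vector) -> int:
        return min(
            range(len(codebook)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(v, codebook[c])),
        )
    return [nearest(k) for k in keys]


def approx_scores(query: Vector, codes: Sequence[int], codebook: Sequence[Vector]) -> List[float]:
    """Score all entries with one dot product per centroid plus table lookups."""
    centroid_scores = [dot(query, c) for c in codebook]
    return [centroid_scores[code] for code in codes]


def shortlist(query: Vector, codes: Sequence[int], codebook: Sequence[Vector], top_k: int) -> List[int]:
    """Return indices of the top_k entries by approximate score."""
    scores = approx_scores(query, codes, codebook)
    return sorted(range(len(codes)), key=lambda i: (-scores[i], i))[:top_k]
```

Only the integer codes and the small codebook need to be stored, which is where the memory saving in this kind of scheme comes from; the shortlisted entries can then be passed to a more expensive biasing module.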
Anuradha Chopra, Abhinaba Roy, Dorien Herremans
This paper introduces an extendable modular system that compiles a range of
music feature extraction models to aid music information retrieval research.
The features include musical elements like key, downbeats, and genre, as well
as audio characteristics like instrument recognition, vocals/instrumental
classification, and vocals gender detection. The integrated models are
state-of-the-art or latest open-source. The features can be extracted as latent
or post-processed labels, enabling integration into music applications such as
generative music, recommendation, and playlist generation. The modular design
allows easy integration of newly developed systems, making it a good
benchmarking and comparison tool. This versatile toolkit supports the research
community in developing innovative solutions by providing concrete musical
features.
Authors' comments: 2 pages, 4 tables, submitted to Extended Abstracts for the
Late-Breaking Demo Session of the 25th Int. Society for Music Information
Retrieval Conf., San Francisco, United States, 2024
Matyas Juhasz, Kalyan Dutia, Henry Franks, Conor Delahunty, Patrick Fawbert Mills, Harrison Pim
Climate decision making is constrained by the complexity and inaccessibility of key information within lengthy, technical, and multi-lingual documents. Generative AI technologies offer a promising route for improving the accessibility of information contained within these documents, but suffer from limitations. These include (1) a tendency to hallucinate or mis-represent information, (2) difficulty in steering or guaranteeing properties of generated output, and (3) reduced performance in specific technical domains. To address these challenges, we introduce a novel evaluation framework with domain-specific dimensions tailored for climate-related documents. We then apply this framework to evaluate Retrieval-Augmented Generation (RAG) approaches and assess retrieval- and generation-quality within a prototype tool that answers questions about individual climate law and policy documents. In addition, we publish a human-annotated dataset and scalable automated evaluation tools, with the aim of facilitating broader adoption and robust assessment of these systems in the climate domain. Our findings highlight the key components of responsible deployment of RAG to enhance decision-making, while also providing insights into user experience (UX) considerations for safely deploying such systems to build trust with users in high-risk domains.
Jia Song, Wanru Zhuang, Yujie Lin, Liang Zhang, Chunyan Li, Jinsong Su, Song He, Xiaochen Bo
Cross-modal text-molecule retrieval aims to learn a shared feature
space of the text and molecule modalities for accurate similarity calculation,
which facilitates the rapid screening of molecules with specific properties and
activities in drug design. However, previous works have two main defects.
First, they are inadequate in capturing modality-shared features considering
the significant gap between text sequences and molecule graphs. Second, they
mainly rely on contrastive learning and adversarial training for cross-modality
alignment, both of which mainly focus on the first-order similarity, ignoring
the second-order similarity that can capture more structural information in the
embedding space. To address these issues, we propose a novel cross-modal
text-molecule retrieval model with two-fold improvements. Specifically, on
top of two modality-specific encoders, we stack a memory-bank-based feature
projector that contains learnable memory vectors to extract modality-shared
features better. More importantly, during the model training, we calculate four
kinds of similarity distributions (text-to-text, text-to-molecule,
molecule-to-molecule, and molecule-to-text similarity distributions) for each
instance, and then minimize the distance between these similarity distributions
(namely second-order similarity losses) to enhance cross-modal alignment.
Experimental results and analysis strongly demonstrate the effectiveness of our
model. Particularly, our model achieves SOTA performance, outperforming the
previously-reported best result by 6.4%.
Authors' comments: BIBM 2024 regular paper
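A second-order similarity loss of the kind described above can be sketched as follows: for each instance, compute how similar it is to every item in a batch (a similarity distribution), then penalize the distance between the within-modality and cross-modality distributions. The abstract does not specify the distance measure or which distributions are paired, so the softmax over dot products, the KL divergence, and the pairing used here are all assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def similarity_distribution(anchor: Vector, candidates: Sequence[Vector]) -> List[float]:
    """Distribution over candidates given dot-product similarity to the anchor."""
    return softmax([sum(a * c for a, c in zip(anchor, cand)) for cand in candidates])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def second_order_loss(text_embs: Sequence[Vector], mol_embs: Sequence[Vector]) -> float:
    """Mean KL between within-modality and cross-modality distributions.

    Assumed pairing: text-to-text vs text-to-molecule, and
    molecule-to-molecule vs molecule-to-text, for each paired instance.
    """
    total = 0.0
    for t, m in zip(text_embs, mol_embs):
        t2t = similarity_distribution(t, text_embs)
        t2m = similarity_distribution(t, mol_embs)
        m2m = similarity_distribution(m, mol_embs)
        m2t = similarity_distribution(m, text_embs)
        total += kl_divergence(t2t, t2m) + kl_divergence(m2m, m2t)
    return total / len(text_embs)
```

When the two modalities are embedded identically the loss is zero, and it grows as the neighbourhood structure around each instance diverges between modalities.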
Jianqun Zhou, Yuanlei Zheng, Wei Chen, Qianqian Zheng, Hui Su, Wei Zhang, Rui Meng, Xiaoyu Shen
Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances: most still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance and neglect the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics, Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE), to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size enhance performance, most models still fall short of instruction compliance.