Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King
Augmenting Large Language Models (LLMs) with retrieved external knowledge has
proven effective for improving the factual accuracy of generated responses.
Despite their success, retrieval-augmented LLMs still face the distractibility
issue, where the generated responses are negatively influenced by noise from
both external and internal knowledge sources. In this paper, we introduce a
novel, training-free decoding method guided by entropy considerations to
mitigate this issue. Our approach utilizes entropy-based document-parallel
ensemble decoding to prioritize low-entropy distributions from retrieved
documents, thereby enhancing the extraction of relevant information from the context.
Additionally, it incorporates a contrastive decoding mechanism that contrasts
the obtained low-entropy ensemble distribution with the high-entropy
distribution derived from the model's internal knowledge across layers, which
ensures a greater emphasis on reliable external information. Extensive
experiments on open-domain question answering datasets demonstrate the
superiority of our method.
Authors' comments: NAACL 2025 Main Conference
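The entropy-guided ensemble idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the softmax-over-negative-entropy weighting, the `temperature` parameter, and the function names are assumptions made for illustration, and the contrastive step against the model's internal knowledge is left out.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def entropy_weighted_ensemble(dists, temperature=1.0):
    """Mix per-document next-token distributions, favoring low-entropy ones.

    Assumption of this sketch: weights are a softmax over negative entropies.
    """
    scores = [-entropy(p) / temperature for p in dists]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    vocab = len(dists[0])
    mixed = [sum(w * p[i] for w, p in zip(weights, dists)) for i in range(vocab)]
    return weights, mixed

# Toy example: one document yields a confident distribution, one a flat one.
confident = [0.9, 0.05, 0.05]
uncertain = [1 / 3, 1 / 3, 1 / 3]
weights, mixed = entropy_weighted_ensemble([confident, uncertain])
```

In this toy case the confident document receives the larger weight, so the mixed distribution stays peaked on the first token.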
Ziyu Zhao, Leilei Gan, Guoyin Wang, Yuwei Hu, Tao Shen, Hongxia Yang, Kun Kuang, Fei Wu
Low-Rank Adaptation (LoRA) offers an efficient way to fine-tune large
language models (LLMs). Its modular and plug-and-play nature allows the
integration of various domain-specific LoRAs, enhancing LLM capabilities.
Open-source platforms like Hugging Face and ModelScope have introduced a new
computational paradigm, Uploadable Machine Learning (UML). In UML, contributors
use decentralized data to train specialized adapters, which are then uploaded
to a central platform to improve LLMs. This platform uses these domain-specific
adapters to handle mixed-task requests requiring personalized service. Previous
research on LoRA composition either focuses on specific tasks or fixes the LoRA
selection during training. However, in UML, the pool of LoRAs is dynamically
updated with new uploads, requiring a generalizable selection mechanism for
unseen LoRAs. Additionally, the mixed-task nature of downstream requests
necessitates personalized services. To address these challenges, we propose
Retrieval-Augmented Mixture of LoRA Experts (RAMoLE), a framework that
adaptively retrieves and composes multiple LoRAs based on input prompts. RAMoLE
has three main components: LoraRetriever for identifying and retrieving
relevant LoRAs, an on-the-fly MoLE mechanism for coordinating the retrieved
LoRAs, and efficient batch inference for handling heterogeneous requests.
Experimental results show that RAMoLE consistently outperforms baselines,
highlighting its effectiveness and scalability.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2402.09997
Ni Wang, Dongliang Liao, Xing Xu
Many transformer-based methods exist for video-text retrieval. Most of them stack frame features, regard frames as tokens, and then use transformers for temporal modeling of the video. However, they commonly neglect the transformer's limited ability to model local temporal information. To tackle this problem, we propose a transformer variant named Multi-Scale Temporal Difference Transformer (MSTDT), which addresses the traditional transformer's limited ability to capture local temporal information. In addition, to better model detailed dynamic information, we use the difference feature between frames, which reflects the dynamic movement of a video. We extract the inter-frame difference feature and integrate the difference and frame features with the multi-scale temporal transformer. Overall, MSTDT consists of a short-term multi-scale temporal difference transformer and a long-term temporal transformer. The former focuses on modeling local temporal information, while the latter models global temporal information. Finally, we propose a new loss to narrow the distance between similar samples. Extensive experiments show that a backbone such as CLIP combined with MSTDT attains a new state-of-the-art result.
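The inter-frame difference feature mentioned above can be sketched as follows. This is only an illustration: treating "multi-scale" as differences taken at several frame strides is an assumption of this sketch, and the function names are hypothetical rather than taken from the paper.

```python
def frame_differences(frames, stride=1):
    """Element-wise difference between frame features `stride` steps apart."""
    return [
        [later - earlier for earlier, later in zip(frames[t], frames[t + stride])]
        for t in range(len(frames) - stride)
    ]

def multi_scale_differences(frames, strides=(1, 2)):
    """Difference features at several temporal strides (assumed notion of scale)."""
    return {stride: frame_differences(frames, stride) for stride in strides}

# Three frames, each with a two-dimensional feature vector.
frames = [[0.0, 1.0], [1.0, 1.0], [3.0, 0.0]]
diffs = multi_scale_differences(frames)
```

A model would then consume both `frames` and the entries of `diffs`; how the two are fused is not specified here.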
Royi Rassin, Yaron Fairstein, Oren Kalinsky, Guy Kushilevitz, Nachshon Cohen, Alexander Libov, Yoav Goldberg
Retrieval models are often evaluated on partially-annotated datasets. Each
query is mapped to a few relevant texts and the remaining corpus is assumed to
be irrelevant. As a result, models that successfully retrieve false negatives
are punished in evaluation. Unfortunately, completely annotating all texts for
every query is not resource efficient. In this work, we show that using
partially-annotated datasets in evaluation can paint a distorted picture. We
curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to
contain all relevant passages for each query. Queries describe a group (e.g.,
"journals about linguistics") and relevant passages are evidence that entities
belong to the group (e.g., a passage indicating that "Language" is a journal
about linguistics). We show that evaluating on a dataset containing annotations
for only a subset of the relevant passages might result in misleading ranking
of the retrieval systems and that as more relevant texts are included in the
evaluation set, the rankings converge. We propose our dataset as a resource for
evaluation and our study as a recommendation for balance between
resource-efficiency and reliable evaluation when annotating evaluation sets for
text retrieval.
Authors' comments: Accepted to EMNLP 2024 main track. Our dataset can be downloaded from
https://D-MERIT.github.io
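A toy calculation shows how partial annotation can reverse a system ranking. The passage identifiers, the two systems, and the choice of recall@k are made up for illustration and are not drawn from D-MERIT.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the annotated relevant passages found in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for passage in ranked[:k] if passage in relevant)
    return hits / len(relevant)

# Hypothetical query with four truly relevant passages, only one of them annotated.
full_annotations = {"p1", "p2", "p3", "p4"}
partial_annotations = {"p1"}

system_a = ["p1", "x1", "x2", "x3"]  # finds only the annotated passage
system_b = ["p2", "p3", "p4", "x1"]  # finds three unannotated relevant passages

partial_scores = (
    recall_at_k(system_a, partial_annotations, 3),
    recall_at_k(system_b, partial_annotations, 3),
)
full_scores = (
    recall_at_k(system_a, full_annotations, 3),
    recall_at_k(system_b, full_annotations, 3),
)
```

Under the partial annotations system A looks better, while under the full annotations system B does.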
Paul Primus, Gerhard Widmer
Matching raw audio signals with textual descriptions requires understanding
the audio's content and the description's semantics and then drawing
connections between the two modalities. This paper investigates a hybrid
retrieval system that utilizes audio metadata as an additional clue to
understand the content of audio signals before matching them with textual
queries. We experimented with metadata often attached to audio recordings, such
as keywords and natural-language descriptions, and we investigated late and
mid-level fusion strategies to merge audio and metadata. Our hybrid approach
with keyword metadata and late fusion improved the retrieval performance over a
content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and
AudioCaps benchmarks, respectively.
Authors' comments: In Proceedings of the 32nd European Signal Processing Conference,
EUSIPCO 2024
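Late fusion, as referred to above, combines scores after each modality has been matched against the query separately. The sketch below assumes a simple weighted sum; the weighting scheme, the scores, and the item names are hypothetical and not taken from the paper.

```python
def late_fusion(audio_scores, metadata_scores, weight=0.5):
    """Blend per-item query similarities from the audio and metadata branches.

    Assumption: a convex combination; items without metadata get a score of 0.
    """
    return {
        item: (1 - weight) * score + weight * metadata_scores.get(item, 0.0)
        for item, score in audio_scores.items()
    }

def rank(scores):
    """Items sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical similarities of two recordings to one text query.
audio_scores = {"clip_a": 0.60, "clip_b": 0.55}
metadata_scores = {"clip_a": 0.10, "clip_b": 0.80}

audio_only_ranking = rank(audio_scores)
fused_ranking = rank(late_fusion(audio_scores, metadata_scores))
```

Here the keyword evidence promotes `clip_b` above `clip_a`, which the audio-only ranking would not do.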
Yannis Tevissen, Khalil Guetari, Frédéric Petitpont
Video content creators need efficient tools to repurpose content, a task that
often requires complex manual or automated searches. Crafting a new video from
large video libraries remains a challenge. In this paper we introduce the task
of Video Library Question Answering (VLQA) through an interoperable
architecture that applies Retrieval Augmented Generation (RAG) to video
libraries. We propose a system that uses large language models (LLMs) to
generate search queries, retrieving relevant video moments indexed by speech
and visual metadata. An answer generation module then integrates user queries
with this metadata to produce responses with specific video timestamps. This
approach shows promise in multimedia content retrieval and AI-assisted video
content creation.
Authors' comments: Accepted in IEEE HSI 2024
Ziyan Jiang, Xueguang Ma, Wenhu Chen
In the traditional RAG framework, the basic retrieval units are normally short.
Common retrievers like DPR normally work with 100-word Wikipedia
paragraphs. Such a design forces the retriever to search over a large corpus to
find the 'needle' unit. In contrast, the readers only need to generate answers
from the short retrieved units. The imbalanced 'heavy' retriever and 'light'
reader design can lead to sub-optimal performance. The loss of contextual
information in the short, chunked units may increase the likelihood of
introducing hard negatives during the retrieval stage. Additionally, the reader
might not fully leverage the capabilities of recent advancements in LLMs. In
order to alleviate the imbalance, we propose a new framework LongRAG,
consisting of a 'long retriever' and a 'long reader'. In the two
Wikipedia-based datasets, NQ and HotpotQA, LongRAG processes the entire
Wikipedia corpus into 4K-token units by grouping related documents. By
increasing the unit size, we significantly reduce the total number of units.
This greatly reduces the burden on the retriever, resulting in strong retrieval
performance with only a few (fewer than 8) top units. Without requiring any
training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA, which
are on par with the (fully-trained) SoTA model. Furthermore, we test on two
non-Wikipedia-based datasets, Qasper and MultiFieldQA-en. LongRAG processes
each individual document as a single (long) unit rather than chunking them into
smaller units. By doing so, we achieve an F1 score of 25.9% on Qasper and 57.5%
on MultiFieldQA-en. Our study offers insights into the future roadmap for
combining RAG with long-context LLMs.
Authors' comments: Technical Report
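The idea of packing documents into long retrieval units can be sketched with a greedy grouping routine. This is a simplification under stated assumptions: documents are taken to arrive already ordered so that related ones are adjacent, token counts are approximated by whitespace splitting, and `group_into_units` is a hypothetical name, not the paper's code.

```python
def group_into_units(docs, max_tokens=4000):
    """Greedily pack consecutive documents into units of at most max_tokens.

    docs: list of (doc_id, text) pairs, assumed pre-ordered by relatedness.
    A single document longer than the budget becomes its own unit.
    """
    units, current, current_len = [], [], 0
    for doc_id, text in docs:
        n_tokens = len(text.split())  # rough stand-in for a real tokenizer
        if current and current_len + n_tokens > max_tokens:
            units.append(current)
            current, current_len = [], 0
        current.append(doc_id)
        current_len += n_tokens
    if current:
        units.append(current)
    return units

# Tiny budget so the grouping is easy to follow by hand.
docs = [("a", "w w w"), ("b", "w w w"), ("c", "w w w w w"), ("d", "w w")]
units = group_into_units(docs, max_tokens=6)
```

With fewer, larger units the retriever has fewer candidates to rank, at the cost of handing the reader longer inputs.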
Yu Bai, Yukai Miao, Li Chen, Dawei Wang, Dan Li, Yanyu Ren, Hongtao Xie, Ce Yang et al.
RAG systems face limitations when semantic relevance alone does not guarantee improved generation quality. This issue becomes particularly evident due to the sensitivity of large language models (LLMs) to the ordering of few-shot prompts, which can affect model performance. To address this challenge, aligning LLM outputs with human preferences using structured feedback, such as options to copy, regenerate, or dislike, offers a promising method for improvement. This feedback is applied to the entire list of inputs rather than giving specific ratings for individual documents, making it a Listwide Labels Learning-to-Rank task. To address this task, we propose Pistis-RAG, a new RAG framework designed with a content-centric approach to better align LLMs with human preferences. Pistis-RAG effectively utilizes human feedback, enhancing content ranking and generation quality. To validate our framework, we use public datasets to simulate human feedback, allowing us to evaluate and refine our method effectively. Experimental results indicate that Pistis-RAG improves alignment with human preferences relative to the baseline RAG system, showing a 6.06% increase in MMLU (English) and a 7.08% increase in C-EVAL (Chinese) accuracy metrics. These results highlight Pistis-RAG's effectiveness in overcoming the limitations associated with traditional RAG approaches.
William Fleshman, Benjamin Van Durme
Large language models (LLMs) fine-tuned for text retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains and zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.
Yunmo Chen, Tongfei Chen, Harsh Jhamtani, Patrick Xia, Richard Shin, Jason Eisner, Benjamin Van Durme
We introduce iterative retrieval, a novel framework that empowers retrievers to make iterative decisions through policy optimization. Finding an optimal portfolio of retrieved items is a combinatorial optimization problem, generally considered NP-hard. This approach provides a learned approximation to such a solution, meeting specific task requirements under a given family of large language models (LLMs). We propose a training procedure based on reinforcement learning, incorporating feedback from LLMs. We instantiate an iterative retriever for composing in-context learning (ICL) exemplars and apply it to various semantic parsing tasks that demand synthesized programs as outputs. By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever, outperforming previous methods in selecting ICL exemplars on semantic parsing datasets such as CalFlow, TreeDST, and MTOP. Additionally, the trained iterative retriever generalizes across different inference LLMs beyond the one used during training.
Davit Abrahamyan, Fatemeh H. Fard
Developers spend much time finding information that is relevant to their questions. Stack Overflow (SO) has been the leading resource, and with the advent of Large Language Models (LLMs), generative models such as ChatGPT are used frequently. However, there is a catch in using each one separately. Searching for answers is time-consuming and tedious, as shown by the many tools developed by researchers to address this issue. On the other hand, using LLMs is not reliable, as they might produce irrelevant or unreliable answers (i.e., hallucination). In this work, we present StackRAG, a retrieval-augmented multi-agent generation tool based on LLMs that combines the two worlds: aggregating the knowledge from SO to enhance the reliability of the generated answers. Initial evaluations show that the generated answers are correct, accurate, relevant, and useful.
Di Wu, Jia-Chen Gu, Fan Yin, Nanyun Peng, Kai-Wei Chang
Retrieval-augmented language models (RALMs) have shown strong performance and wide applicability in knowledge-intensive tasks. However, there are significant trustworthiness concerns as RALMs are prone to generating unfaithful outputs, including baseless information or contradictions with the retrieved context. This paper proposes SynCheck, a lightweight monitor that leverages fine-grained decoding dynamics including sequence likelihood, uncertainty quantification, context influence, and semantic alignment to synchronously detect unfaithful sentences. By integrating efficiently measurable and complementary signals, SynCheck enables accurate and immediate feedback and intervention, achieving 0.85 AUROC in detecting faithfulness errors across six long-form retrieval-augmented generation tasks, improving on the prior best method by 4%. Leveraging SynCheck, we further introduce FOD, a faithfulness-oriented decoding algorithm guided by beam search for long-form retrieval-augmented generation. Empirical results demonstrate that FOD significantly outperforms traditional strategies such as abstention, reranking, or contrastive decoding in terms of faithfulness, achieving over 10% improvement across six datasets.
Christian Lülf, Denis Mayr Lima Martins, Marcos Antonio Vaz Salles, Yongluan Zhou, Fabian Gieseke
The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, as well as vice versa. This is achieved via a joint embedding of image and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to further concretize the search query by iteratively defining positive and negative examples. Our framework trains a classification model on the additional user feedback and essentially outputs all positively classified instances of the entire data catalog. Building upon recent techniques, this inference phase does not scan the entire data catalog but instead employs efficient index structures pre-built for the data. Our results show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy while maintaining swift response times.
Zhepei Wei, Wei-Lin Chen, Yu Meng
Retrieval-augmented generation (RAG) has shown promising potential to enhance
the accuracy and factuality of language models (LMs). However, imperfect
retrievers or noisy corpora can introduce misleading or even erroneous
information to the retrieved contents, posing a significant challenge to the
generation quality. Existing RAG methods typically address this challenge by
directly predicting final answers despite potentially noisy inputs, resulting
in an implicit denoising process that is difficult to interpret and verify. On
the other hand, the acquisition of explicit denoising supervision is often
costly, involving significant human efforts. In this work, we propose
InstructRAG, where LMs explicitly learn the denoising process through
self-synthesized rationales. First, we instruct the LM to explain how the
ground-truth answer is derived from retrieved documents. Then, these rationales
can be used either as demonstrations for in-context learning of explicit
denoising or as supervised fine-tuning data to train the model. Compared to
standard RAG approaches, InstructRAG requires no additional supervision, allows
for easier verification of the predicted answers, and effectively improves
generation accuracy. Experiments show InstructRAG consistently outperforms
existing RAG methods in both training-free and trainable scenarios, achieving a
relative improvement of 8.3% over the best baseline method on average across
five knowledge-intensive benchmarks. Extensive analysis indicates that
InstructRAG scales well with increased numbers of retrieved documents and
consistently exhibits robust denoising ability even in out-of-domain datasets,
demonstrating strong generalizability.
Authors' comments: ICLR 2025. Code: https://github.com/weizhepei/InstructRAG
Jirui Qi, Gabriele Sarti, Raquel Fernández, Arianna Bisazza
Ensuring the verifiability of model answers is a fundamental challenge for
retrieval-augmented generation (RAG) in the question answering (QA) domain.
Recently, self-citation prompting was proposed to make large language models
(LLMs) generate citations to supporting documents along with their answers.
However, self-citing LLMs often struggle to match the required format, refer to
non-existent sources, and fail to faithfully reflect LLMs' context usage
throughout the generation. In this work, we present MIRAGE (Model
Internals-based RAG Explanations), a plug-and-play approach using model
internals for faithful answer attribution in RAG applications. MIRAGE detects
context-sensitive answer tokens and pairs them with retrieved documents
contributing to their prediction via saliency methods. We evaluate our proposed
approach on a multilingual extractive QA dataset, finding high agreement with
human answer attribution. On open-ended QA, MIRAGE achieves citation quality
and efficiency comparable to self-citation while also allowing for a
finer-grained control of attribution parameters. Our qualitative evaluation
highlights the faithfulness of MIRAGE's attributions and underscores the
promising application of model internals for RAG answer attribution.
Authors' comments: Accepted by EMNLP 2024 Main Conference. Code and data released at
https://github.com/Betswish/MIRAGE
Yige Shen, Hao Jiang, Hua Qu, Jihong Zhao
Despite their impressive capabilities, large language models (LLMs) often
face challenges such as temporal misalignment and generating hallucinatory
content. Enhancing LLMs with retrieval mechanisms to fetch relevant information
from external sources offers a promising solution. Inspired by the proverb
"Think twice before you act," we propose a dual-angle evaluated
retrieval-augmented generation framework, Think-then-Act. Unlike
previous approaches that indiscriminately rewrite queries or perform retrieval
regardless of necessity, or generate temporary responses before deciding on
additional retrieval, which increases model generation costs, our framework
employs a two-phase process: (i) assessing the input query for clarity and
completeness to determine if rewriting is necessary; and (ii) evaluating the
model's capability to answer the query and deciding if additional retrieval is
needed. Experimental results on five datasets show that the
Think-then-Act framework significantly improves performance. Our
framework demonstrates notable improvements in accuracy and efficiency compared
to existing baselines and performs well in both English and non-English
contexts. Ablation studies validate the optimal model confidence threshold,
highlighting the resource optimization benefits of our approach.
Authors' comments: 12 pages, 8 figures
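The two-phase process described above amounts to a pair of gates placed before generation. The control-flow sketch below uses hypothetical callables (`is_clear`, `rewrite`, `confidence`, `retrieve`) and an arbitrary `threshold`; in a real system each would be backed by a model, and none of these names come from the paper.

```python
def think_then_act(query, is_clear, rewrite, confidence, retrieve, threshold=0.5):
    """Phase (i): rewrite only unclear queries. Phase (ii): retrieve only when
    the model's estimated ability to answer falls below the threshold."""
    rewritten = query if is_clear(query) else rewrite(query)
    needs_retrieval = confidence(rewritten) < threshold
    documents = retrieve(rewritten) if needs_retrieval else []
    return rewritten, documents

# Toy stand-ins for the model-backed components (purely illustrative).
def is_clear(query):
    return len(query.split()) >= 3

def rewrite(query):
    return "what is " + query

def confidence(query):
    return 0.9 if "python" in query else 0.2

def retrieve(query):
    return ["doc about " + query]

result_a = think_then_act("python", is_clear, rewrite, confidence, retrieve)
result_b = think_then_act("who won the 2031 cup", is_clear, rewrite, confidence, retrieve)
```

The first query is rewritten but answered without retrieval; the second is left as written and triggers retrieval.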
Masafumi Enomoto, Kunihiro Takeoka, Kosuke Akimoto, Kiril Gashteovski, Masafumi Oyamada
Open-Domain Multi-Document Summarization (ODMDS) is crucial for addressing
diverse information needs. It aims to generate a summary that answers a user's
query by synthesizing relevant content from multiple documents in a large
collection. Existing approaches that first find relevant passages and then
generate a summary using a language model are inadequate for ODMDS. This is
because open-ended queries often require additional context for the retrieved
passages to cover the topic comprehensively, making it challenging to retrieve
all relevant passages initially. While iterative retrieval methods have been
explored for multi-hop question answering (MQA), they are impractical for ODMDS
due to high latency from repeated large language model (LLM) inference for
reasoning. To address this issue, we propose LightPAL, a lightweight passage
retrieval method for ODMDS that constructs a graph representing passage
relationships using an LLM during indexing and employs random walk instead of
iterative reasoning and retrieval at inference time. Experiments on ODMDS
benchmarks show that LightPAL outperforms baseline retrievers in summary
quality while being significantly more efficient than an iterative MQA
approach.
Authors' comments: 13 pages, 3 figures
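A random walk over a passage graph can be approximated deterministically with a random-walk-with-restart (personalized PageRank) iteration, sketched below. This stands in for whatever walk the authors actually use; the graph, the `restart` probability, and the function name are assumptions made for illustration.

```python
def personalized_pagerank(graph, seeds, restart=0.15, iters=50):
    """Approximate visit probabilities of a random walk that restarts at seeds.

    graph: dict mapping each passage to a list of linked passages
           (every linked passage must itself be a key).
    seeds: initially retrieved passages; each must be a key of graph.
    """
    nodes = list(graph)
    restart_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart_mass)
    for _ in range(iters):
        nxt = {n: restart * restart_mass[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if not neighbors:
                # Dangling passage: return its probability mass to the seeds.
                for s in seeds:
                    nxt[s] += (1 - restart) * score[n] / len(seeds)
                continue
            share = (1 - restart) * score[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        score = nxt
    return score

# Hypothetical passage graph built at indexing time; "p1" was retrieved first.
graph = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"], "p4": []}
scores = personalized_pagerank(graph, seeds=["p1"])
```

Passages reachable from the seed accumulate score, so `p2` and `p3` can be added to the context without any further LLM calls at inference time, while the disconnected `p4` stays at zero.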
Rui Yang, Yilin Ning, Emilia Keppo, Mingxuan Liu, Chuan Hong, Danielle S Bitterman, Jasmine Chiat Ling Ong, Daniel Shu Wei Ting et al.
Generative artificial intelligence (AI) has brought revolutionary innovations in various fields, including medicine. However, it also exhibits limitations. In response, retrieval-augmented generation (RAG) provides a potential solution, enabling models to generate more accurate content by leveraging the retrieval of external knowledge. With the rapid advancement of generative AI, RAG can pave the way for connecting this transformative technology with medical applications and is expected to bring innovations in equity, reliability, and personalization to health care.
Jaehee Kim, Yukyung Lee, Pilsung Kang
InfoNCE loss is commonly used to train dense retrievers in information retrieval tasks. It is well known that a large batch is essential for stable and effective training with InfoNCE loss, which requires significant hardware resources. This dependency on large batches creates a bottleneck for both the application of and research on dense retrievers. Recently, memory reduction methods have been broadly adopted to resolve the hardware bottleneck by decomposing the forward and backward passes or by using a memory bank. However, current methods still suffer from slow and unstable training. To address these issues, we propose Contrastive Accumulation (ContAccum), a stable and efficient memory reduction method for dense retriever training that uses a dual memory bank structure to leverage previously generated query and passage representations. Experiments on five widely used information retrieval datasets indicate that ContAccum can surpass not only existing memory reduction methods but also the high-resource scenario. Moreover, theoretical analysis and experimental results confirm that ContAccum provides more stable dual-encoder training than current memory bank utilization methods.
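The role of a memory bank in InfoNCE training can be sketched in plain Python. This is a simplified illustration, not ContAccum itself: the class and function names are hypothetical, the banks are simple FIFO queues, and gradient handling is omitted entirely.

```python
import math
from collections import deque

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=1.0):
    """Negative log-probability of the positive among positive plus negatives."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[0]

class DualMemoryBank:
    """FIFO stores of past query and passage representations."""
    def __init__(self, size):
        self.queries = deque(maxlen=size)
        self.passages = deque(maxlen=size)

    def add(self, query_vecs, passage_vecs):
        self.queries.extend(query_vecs)
        self.passages.extend(passage_vecs)

def symmetric_loss(query, passage, bank):
    """Average of query-to-passage and passage-to-query InfoNCE terms,
    with extra negatives drawn from the corresponding bank."""
    q2p = info_nce(query, passage, list(bank.passages))
    p2q = info_nce(passage, query, list(bank.queries))
    return 0.5 * (q2p + p2q)

bank = DualMemoryBank(size=2)
loss_empty = symmetric_loss([1.0, 0.0], [1.0, 0.0], bank)
bank.add([[0.0, 1.0], [0.5, 0.5], [1.0, 1.0]], [[0.0, 1.0], [0.5, 0.5], [1.0, 1.0]])
loss_with_bank = symmetric_loss([1.0, 0.0], [1.0, 0.0], bank)
```

Stored representations act as additional negatives without enlarging the current batch, which is why the loss becomes non-trivial once the banks are populated.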
Sujoy Roychowdhury, Sumit Soman, Ranjani Hosakere Gireesha, Vansh Chhabra, Neeraj Gunda, Subhadip Bandyopadhyay, Sai Krishna Bala
A plethora of sentence embedding models makes it challenging to choose one,
especially for technical domains rich with specialized vocabulary. In this
work, we domain adapt embeddings using telecom data for question answering. We
evaluate embeddings obtained from publicly available models and their
domain-adapted variants on both point retrieval accuracies and their
(95%) confidence intervals. We establish a systematic method to obtain
thresholds for similarity scores for different embeddings. As expected, we
observe that fine-tuning improves mean bootstrapped accuracies. We also observe
that it results in tighter confidence intervals, which further improve when
pre-training is preceded by fine-tuning. We introduce metrics which measure the
distributional overlaps of top-$K$, correct and random document similarities
with the question. Further, we show that these metrics are correlated with
retrieval accuracy and similarity thresholds. Recent literature shows
conflicting effects of isotropy on retrieval accuracies. Our experiments
establish that the isotropy of embeddings (as measured by two independent
state-of-the-art isotropy metric definitions) is poorly correlated with
retrieval performance. We show that embeddings for domain-specific sentences
have little overlap with those for domain-agnostic ones, and fine-tuning moves
them further apart. Based on our results, we provide recommendations for use of
our methodology and metrics by researchers and practitioners.
Authors' comments: Accepted for the Workshop On Next Gen Networks Through LLMs Action
Models and Multi Agent Systems at IEEE International Conference on
Communications (ICC) 2025
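Bootstrapped accuracy with a confidence interval, as mentioned in the abstract above, can be computed along the following lines. This percentile-bootstrap sketch is an assumption about the general technique, not the authors' exact procedure; the function name, resample count, and toy data are illustrative.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=1000, level=0.95, seed=0):
    """Mean bootstrapped accuracy and a percentile confidence interval.

    correct: list of 0/1 flags, one per question (1 = correct retrieval).
    """
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return sum(means) / n_resamples, lower, upper

# Hypothetical evaluation: 80 of 100 questions retrieved correctly.
outcomes = [1] * 80 + [0] * 20
mean_acc, ci_low, ci_high = bootstrap_accuracy_ci(outcomes)
```

Comparing the interval widths of two embedding models, rather than only their point accuracies, is the kind of analysis the abstract describes.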