Matin Mortaheb, Mohammad A. Amir Khojastepour, Srimat T. Chakradhar, Sennur Ulukus
Retrieval-augmented generation (RAG) improves large language models (LLMs) by using external knowledge to guide response generation, reducing hallucinations. However, RAG, particularly multi-modal RAG, can introduce new hallucination sources: (i) the retrieval process may select irrelevant pieces (e.g., documents, images) as raw context from the database, and (ii) retrieved images are processed into text-based context via vision-language models (VLMs) or directly used by multi-modal language models (MLLMs) like GPT-4o, which may hallucinate. To address this, we propose a novel framework to evaluate the reliability of multi-modal RAG using two performance measures: (i) the relevancy score (RS), assessing the relevance of retrieved entries to the query, and (ii) the correctness score (CS), evaluating the accuracy of the generated response. We train RS and CS models using a ChatGPT-derived database and human evaluator samples. Results show that both models achieve ~88% accuracy on test data. Additionally, we construct a 5000-sample human-annotated database evaluating the relevancy of retrieved pieces and the correctness of response statements. Our RS model aligns with human preferences 20% more often than CLIP in retrieval, and our CS model matches human preferences ~91% of the time. Finally, we assess various RAG systems' selection and generation performances using RS and CS.
David Otero, Javier Parapar, Álvaro Barreiro
Null Hypothesis Significance Testing is the \textit{de facto} tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are just due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most of the IR real-world experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates according to the significance level for typical sample sizes while being the best test in terms of statistical power.
Tamir Bendory, Dan Edidin
The classical phase retrieval problem involves estimating a signal from its Fourier magnitudes (power spectrum) by leveraging prior information about the desired signal. This paper extends the problem to compact groups, addressing the recovery of a set of matrices from their Gram matrices. In this broader context, the missing phases in Fourier space are replaced by missing unitary or orthogonal matrices arising from the action of a compact group on a finite-dimensional vector space. This generalization is driven by applications in multi-reference alignment and single-particle cryo-electron microscopy, a pivotal technology in structural biology. We define the generalized phase retrieval problem over compact groups and explore its underlying algebraic structure. We survey recent results on the uniqueness of solutions, focusing on the significant class of semialgebraic priors. Furthermore, we present a family of algorithms inspired by classical phase retrieval techniques. Finally, we propose a conjecture on the stability of the problem based on bi-Lipschitz analysis, supported by numerical experiments.
Liang Xiao, Wen Dai, Shuai Chen, Bin Qin, Chongyang Shi, Haopeng Jing, Tianyu Guo
Retrieval-augmented generation has gained significant attention due to its ability to integrate relevant external knowledge, enhancing the accuracy and reliability of the LLMs' responses. Most of the existing methods apply a dynamic multiple retrieval-generating process, to address multi-hop complex questions by decomposing them into sub-problems. However, these methods rely on an unidirectional forward reasoning paradigm, where errors from insufficient reasoning steps or inherent flaws in current retrieval systems are irreversible, potentially derailing the entire reasoning chain. For the first time, this work introduces Retroactive Retrieval-Augmented Generation (RetroRAG), a novel framework to build a retroactive reasoning paradigm. RetroRAG revises and updates the evidence, redirecting the reasoning chain to the correct direction. RetroRAG constructs an evidence-collation-discovery framework to search, generate, and refine credible evidence. It synthesizes inferential evidence related to the key entities in the question from the existing source knowledge and formulates search queries to uncover additional information. As new evidence is found, RetroRAG continually updates and organizes this information, enhancing its ability to locate further necessary evidence. Paired with an Answerer to generate and evaluate outputs, RetroRAG is capable of refining its reasoning process iteratively until a reliable answer is obtained. Empirical evaluations show that RetroRAG significantly outperforms existing methods.
Yuhao Zhou
We present a novel approach to automated proof generation for the TLA+ Proof System (TLAPS) using Large Language Models (LLMs). Our method combines two key components: a sub-proof obligation generation phase that breaks down complex proof obligations into simpler sub-obligations, and a proof generation phase that leverages Retrieval-Augmented Generation with verified proof examples. We evaluate our approach using proof obligations from varying complexity levels of proof obligations, spanning from fundamental arithmetic properties to the properties of algorithms. Our experiments demonstrate that while the method successfully generates valid proofs for intermediate-complexity obligations, it faces limitations with more complex theorems. These results indicate that our approach can effectively assist in proof development for certain classes of properties, contributing to the broader goal of integrating LLMs into formal verification workflows.
Mehmet Onurcan Kaya, Figen S. Oktem
Diffusion models have demonstrated their utility as learned priors for solving various inverse problems. However, most existing approaches are limited to linear inverse problems. This paper exploits the efficient and unsupervised posterior sampling framework of Denoising Diffusion Restoration Models (DDRM) for the solution of nonlinear phase retrieval problem, which requires reconstructing an image from its noisy intensity-only measurements such as Fourier intensity. The approach combines the model-based alternating-projection methods with the DDRM to utilize pretrained unconditional diffusion priors for phase retrieval. The performance is demonstrated through both simulations and experimental data. Results demonstrate the potential of this approach for improving the alternating-projection methods as well as its limitations.
Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy et al.
As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.
Shijie Wang, Wenqi Fan, Yue Feng, Shanru Lin, Xinyu Ma, Shuaiqiang Wang, Dawei Yin
Recommender systems have become increasingly vital in our daily lives,
helping to alleviate the problem of information overload across various
user-oriented online services. The emergence of Large Language Models (LLMs)
has yielded remarkable achievements, demonstrating their potential for the
development of next-generation recommender systems. Despite these advancements,
LLM-based recommender systems face inherent limitations stemming from their LLM
backbones, particularly issues of hallucinations and the lack of up-to-date and
domain-specific knowledge. Recently, Retrieval-Augmented Generation (RAG) has
garnered significant attention for addressing these limitations by leveraging
external knowledge sources to enhance the understanding and generation of LLMs.
However, vanilla RAG methods often introduce noise and neglect structural
relationships in knowledge, limiting their effectiveness in LLM-based
recommendations. To address these limitations, we propose to retrieve
high-quality and up-to-date structure information from the knowledge graph (KG)
to augment recommendations. Specifically, our approach develops a
retrieval-augmented framework, termed K-RagRec, that facilitates the
recommendation generation process by incorporating structure information from
the external KG. Extensive experiments have been conducted to demonstrate the
effectiveness of our proposed method.
Authors' comments: Accepted by ACL 2025 main conference
Kangcheng Luo, Quzhe Huang, Cong Jiang, Yansong Feng
Legal articles often include vague concepts for adapting to the ever-changing society. Providing detailed interpretations of these concepts is a critical and challenging task even for legal practitioners. It requires meticulous and professional annotations and summarizations by legal experts, which are admittedly time-consuming and expensive to collect at scale. By emulating legal experts' doctrinal method, we introduce a novel framework, ATRIE, using large language models (LLMs) to AuTomatically Retrieve concept-related information, Interpret legal concepts, and Evaluate generated interpretations, eliminating dependence on legal experts. ATRIE comprises a legal concept interpreter and a legal concept interpretation evaluator. The interpreter uses LLMs to retrieve relevant information from judicial precedents and interpret legal concepts. The evaluator uses performance changes on legal concept entailment, a downstream task we propose, as a proxy of interpretation quality. Automatic and multifaceted human evaluations indicate that the quality of our interpretations is comparable to those written by legal experts, with superior comprehensiveness and readability. Although there remains a slight gap in accuracy, it can already assist legal practitioners in improving the efficiency of concept interpretation.
Ren Hu, Pan Lian
Quaternionic signal processing provides powerful tools for efficiently managing color signals by preserving the intrinsic correlations among signal dimensions through quaternion algebra. In this paper, we address the quaternionic phase retrieval problem by systematically developing novel algorithms based on an amplitude-based model. Specifically, we propose the Quaternionic Reweighted Amplitude Flow (QRAF) algorithm, which is further enhanced by three of its variants: incremental, accelerated, and adapted QRAF algorithms. In addition, we introduce the Quaternionic Perturbed Amplitude Flow (QPAF) algorithm, which has linear convergence. Extensive numerical experiments on both synthetic data and real images, demonstrate that our proposed methods significantly improve recovery performance and computational efficiency compared to state-of-the-art approaches.
Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
Multimodal learning with incomplete modality is practical and challenging.
Recently, researchers have focused on enhancing the robustness of pre-trained
MultiModal Transformers (MMTs) under missing modality conditions by applying
learnable prompts. However, these prompt-based methods face several
limitations: (1) incomplete modalities provide restricted modal cues for
task-specific inference, (2) dummy imputation for missing content causes
information loss and introduces noise, and (3) static prompts are
instance-agnostic, offering limited knowledge for instances with various
missing conditions. To address these issues, we propose RAGPT, a novel
Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three
modules: (I) the multi-channel retriever, which identifies similar instances
through a within-modality retrieval strategy, (II) the missing modality
generator, which recovers missing information using retrieved contexts, and
(III) the context-aware prompter, which captures contextual knowledge from
relevant instances and generates dynamic prompts to largely enhance the MMT's
robustness. Extensive experiments conducted on three real-world datasets show
that RAGPT consistently outperforms all competitive baselines in handling
incomplete modality problems. The code of our work and prompt-based baselines
is available at https://github.com/Jian-Lang/RAGPT.
Authors' comments: 9 pages, 8 figures. Accepted by AAAI 2025. Codes are released at
https://github.com/Jian-Lang/RAGPT
Huichi Zhou, Kin-Hei Lee, Zhonghao Zhan, Yue Chen, Zhenhao Li, Zhaoyang Wang, Hamed Haddadi, Emine Yilmaz
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. These systems, however, remain susceptible to corpus poisoning attacks, which can severely impair the performance of LLMs. To address this challenge, we propose TrustRAG, a robust framework that systematically filters malicious and irrelevant content before it is retrieved for generation. Our approach employs a two-stage defense mechanism. The first stage implements a cluster filtering strategy to detect potential attack patterns. The second stage employs a self-assessment process that harnesses the internal capabilities of LLMs to detect malicious documents and resolve inconsistencies. TrustRAG provides a plug-and-play, training-free module that integrates seamlessly with any open- or closed-source language model. Extensive experiments demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.
Wonduk Seo, Zonghao Yuan, Yi Bu
Ensuring cultural values alignment in Large Language Models (LLMs) remains a
critical challenge, as these models often embed Western-centric biases from
their training data, leading to misrepresentations and fairness concerns in
cross-cultural applications. Existing approaches such as role assignment and
few-shot learning struggle to address these limitations effectively due to
their reliance on pre-trained knowledge, limited scalability, and inability to
capture nuanced cultural values. To address these issues, we propose ValuesRAG,
a novel and effective framework that applies Retrieval-Augmented Generation
(RAG) with In-Context Learning (ICL) to integrate cultural and demographic
knowledge dynamically during text generation. Leveraging the World Values
Survey (WVS) dataset, ValuesRAG first generates summaries of values for each
individual. We subsequently curate several representative regional datasets to
serve as test datasets and retrieve relevant summaries of values based on
demographic features, followed by a reranking step to select the top-k relevant
summaries. We evaluate ValuesRAG using 6 diverse regional datasets and show
that it consistently outperforms baselines: including zero-shot,
role-assignment, few-shot, and hybrid methods, both in main experiments and
ablation settings. Notably, ValuesRAG achieves the best overall performance
over prior methods, demonstrating its effectiveness in fostering culturally
aligned and inclusive AI systems. Our findings underscore the potential of
dynamic retrieval-based methods to bridge the gap between global LLM
capabilities and localized cultural values.
Authors' comments: preprint
Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they only cover limited RAG scenarios. (2) They suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu et al.
Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. However, the existing RAG systems frequently struggle with the quality of retrieval documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN-RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter-agent consensus to ensure robust document selection without requiring additional training data or fine-tuning. Experimental results across four QA benchmarks demonstrate that MAIN-RAG consistently outperforms traditional RAG approaches, achieving a 2-11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training-based solutions.
Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang
Video understanding, including video captioning and retrieval, is still a great challenge for video-language models (VLMs). The existing video retrieval and caption benchmarks only include short descriptions, limits their ability of detailed video understanding evaluation. To address this problem, we present CaReBench, a testing benchmark for fine-grained video captioning and retrieval with 1,000 high-quality pairs of videos and human-annotated detailed captions. Uniquely, it provides manually separated spatial annotations and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks, respectively. These metrics enable a comprehensive investigation into the spatial and temporal biases inherent in VLMs. In addition, to handle both video retrieval and video captioning tasks in a unified framework, we develop a simple baseline based on a Multimodal Language Model (MLLM). By implementing a two-stage Supervised Fine-Tuning (SFT), we fully unlock the potential of MLLM, enabling it not only to generate detailed video descriptions but also to extract video features. Surprisingly, experimental results demonstrate that, compared to the CLIP-based models designed for retrieval and the popular MLLMs skilled in video captioning, our baseline shows competitive performance in both fine-grained video retrieval and video detailed captioning.
Matan Ben-Tov, Mahmood Sharif
Dense embedding-based text retrieval$\unicode{x2013}$retrieval of relevant passages from corpora via deep learning encodings$\unicode{x2013}$has emerged as a powerful method attaining state-of-the-art search results and popularizing the use of Retrieval Augmented Generation (RAG). Still, like other search methods, embedding-based retrieval may be susceptible to search-engine optimization (SEO) attacks, where adversaries promote malicious content by introducing adversarial passages to corpora. To faithfully assess and gain insights into the susceptibility of such systems to SEO, this work proposes the GASLITE attack, a mathematically principled gradient-based search method for generating adversarial passages without relying on the corpus content or modifying the model. Notably, GASLITE's passages (1) carry adversary-chosen information while (2) achieving high retrieval ranking for a selected query distribution when inserted to corpora. We use GASLITE to extensively evaluate retrievers' robustness, testing nine advanced models under varied threat models, while focusing on realistic adversaries targeting queries on a specific concept (e.g., a public figure). We found GASLITE consistently outperformed baselines by $\geq$140% success rate, in all settings. Particularly, adversaries using GASLITE require minimal effort to manipulate search results$\unicode{x2013}$by injecting a negligible amount of adversarial passages ($\leq$0.0001% of the corpus), they could make them visible in the top-10 results for 61-100% of unseen concept-specific queries against most evaluated models. Inspecting variance in retrievers' robustness, we identify key factors that may contribute to models' susceptibility to SEO, including specific properties in the embedding space's geometry.
Haoran Sun, Zimu Wang, Qiuyi Chen, Jianjun Chen, Jia Wang, Haiyang Zhang
In recent years, user-generated audio content has proliferated across various
media platforms, creating a growing need for efficient retrieval methods that
allow users to search for audio clips using natural language queries. This
task, known as language-based audio retrieval, presents significant challenges
due to the complexity of learning semantic representations from heterogeneous
data across both text and audio modalities. In this work, we introduce a novel
framework for the language-based audio retrieval task that leverages
co-attention mechanismto jointly learn meaningful representations from both
modalities. To enhance the model's ability to capture fine-grained cross-modal
interactions, we propose a cascaded co-attention architecture, where
co-attention modules are stacked or iterated to progressively refine the
semantic alignment between text and audio. Experiments conducted on two public
datasets show that the proposed method can achieve better performance than the
state-of-the-art method. Specifically, our best performed co-attention model
achieves a 16.6% improvement in mean Average Precision on Clotho dataset, and a
15.1% improvement on AudioCaps.
Authors' comments: Accepted at UIC 2024 proceedings. Accepted version
Shunpu Tang, Ruichen Zhang, Yuxuan Yan, Qianqian Yang, Dusit Niyato, Xianbin Wang, Shiwen Mao
Semantic communication (SemCom) is an emerging paradigm aiming at transmitting only task-relevant semantic information to the receiver, which can significantly improve communication efficiency. Recent advancements in generative artificial intelligence (GenAI) have empowered GenAI-enabled SemCom (GenSemCom) to further expand its potential in various applications. However, current GenSemCom systems still face challenges such as semantic inconsistency, limited adaptability to diverse tasks and dynamic environments, and the inability to leverage insights from past transmission. Motivated by the success of retrieval-augmented generation (RAG) in the domain of GenAI, this paper explores the integration of RAG in GenSemCom systems. Specifically, we first provide a comprehensive review of existing GenSemCom systems and the fundamentals of RAG techniques. We then discuss how RAG can be integrated into GenSemCom. Following this, we conduct a case study on semantic image transmission using an RAG-enabled diffusion-based SemCom system, demonstrating the effectiveness of the proposed integration. Finally, we outline future directions for advancing RAG-enabled GenSemCom systems.
Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim
Recent lightweight image captioning models using retrieved data mainly focus
on text prompts. However, previous works only utilize the retrieved text as
text prompts, and the visual information relies only on the CLIP visual
embedding. Because of this issue, there is a limitation that the image
descriptions inherent in the prompt are not sufficiently reflected in the
visual embedding space. To tackle this issue, we propose ViPCap, a novel
retrieval text-based visual prompt for lightweight image captioning. ViPCap
leverages the retrieved text with image information as visual prompts to
enhance the ability of the model to capture relevant visual information. By
mapping text prompts into the CLIP space and generating multiple randomized
Gaussian distributions, our method leverages sampling to explore randomly
augmented distributions and effectively retrieves the semantic features that
contain image information. These retrieved features are integrated into the
image and designated as the visual prompt, leading to performance improvements
on the datasets such as COCO, Flickr30k, and NoCaps. Experimental results
demonstrate that ViPCap significantly outperforms prior lightweight captioning
models in efficiency and effectiveness, demonstrating the potential for a
plug-and-play solution. The source code is available at
https://github.com/taewhankim/VIPCAP.
Authors' comments: Accepted to AAAI 2025