Wonduk Seo, Seunghyun Lee
Query expansion is widely used in Information Retrieval (IR) to improve
search outcomes by enriching queries with additional contextual information.
Although recent Large Language Model (LLM) based methods generate
pseudo-relevant content and expanded terms via multiple prompts, they often
yield repetitive, narrow expansions that lack the diverse context needed to
retrieve all relevant information. In this paper, we introduce QA-Expand, a
novel and effective framework for query expansion. It first generates multiple
relevant questions from the initial query and subsequently produces
corresponding pseudo-answers as surrogate documents. A feedback model further
rewrites and filters these answers to ensure only the most informative
augmentations are incorporated. Extensive experiments on benchmarks such as
BEIR and TREC demonstrate that QA-Expand enhances retrieval performance by up
to 13% over state-of-the-art methods, offering a robust solution for modern
retrieval challenges.
Authors' comments: 8 pages
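The question-then-answer expansion described in this abstract can be sketched as a small pipeline. This is only an illustrative sketch: the three callables stand in for LLM calls, the toy stand-ins below are hypothetical, and joining the surviving answers onto the query by concatenation is an assumption, since the abstract does not say how the augmentations are combined. The feedback stage is reduced to a keep/drop filter; rewriting is omitted.

```python
from typing import Callable, List


def qa_expand(
    query: str,
    gen_questions: Callable[[str], List[str]],
    gen_answer: Callable[[str, str], str],
    keep_answer: Callable[[str, str], bool],
) -> str:
    """Expand a query with filtered pseudo-answers (hypothetical helper)."""
    # Step 1: derive several related questions from the initial query.
    questions = gen_questions(query)
    # Step 2: produce one pseudo-answer per question as a surrogate document.
    answers = [gen_answer(query, question) for question in questions]
    # Step 3: a feedback model keeps only the informative answers.
    kept = [answer for answer in answers if keep_answer(query, answer)]
    # Assumption: the expanded query is the original query plus kept answers.
    return " ".join([query] + kept)


# Toy stand-ins for the LLM calls, used only to make the sketch runnable.
def toy_questions(query: str) -> List[str]:
    return [f"What is {query}?", f"Why does {query} matter?"]


def toy_answer(query: str, question: str) -> str:
    return f"Answer to: {question}"


def toy_keep(query: str, answer: str) -> bool:
    return "Why" not in answer


expanded = qa_expand("query expansion", toy_questions, toy_answer, toy_keep)
```

In practice the expanded string would be passed to a sparse or dense retriever in place of the original query.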
Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari
Large Language Models (LLMs) struggle with hallucinations and outdated
knowledge due to their reliance on static training data. Retrieval-Augmented
Generation (RAG) mitigates these issues by integrating external dynamic
information, enhancing factual and up-to-date grounding. Recent advances in
multimodal learning have led to the development of Multimodal RAG,
incorporating multiple modalities such as text, images, audio, and video to
enhance the generated outputs. However, cross-modal alignment and reasoning
introduce unique challenges to Multimodal RAG, distinguishing it from
traditional unimodal RAG. This survey offers a structured and comprehensive
analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks,
evaluation, methodologies, and innovations in retrieval, fusion, augmentation,
and generation. We precisely review training strategies, robustness
enhancements, and loss functions, while also exploring the diverse Multimodal
RAG scenarios. Furthermore, we discuss open challenges and future research
directions to support advancements in this evolving field. This survey lays the
foundation for developing more capable and reliable AI systems that effectively
leverage multimodal dynamic external knowledge bases. Resources are available
at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
Authors' comments: GitHub repository:
https://github.com/llm-lab-org/Multimodal-RAG-Survey
Navid Rajabi, Jana Kosecka
In this work, we propose a modular approach for the Vision-Language Navigation (VLN) task by decomposing the problem into four sub-modules that use state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) in a zero-shot setting. Given a navigation instruction in natural language, we first prompt an LLM to extract the landmarks and the order in which they are visited. Assuming a known model of the environment, we retrieve the top-$k$ locations of the last landmark and generate $k$ path hypotheses from the starting location to the last landmark using a shortest-path algorithm on the topological map of the environment. Each path hypothesis is represented by a sequence of panoramas. We then use dynamic programming to compute the alignment score between the sequence of panoramas and the sequence of landmark names, using match scores obtained from a VLM. Finally, we compute the nDTW metric for the hypothesis that yields the highest alignment score to evaluate path fidelity. We demonstrate superior performance compared to other approaches that use joint semantic maps like VLMaps \cite{vlmaps} on the complex R2R-Habitat \cite{r2r} instruction dataset and quantify in detail the effect of visual grounding on navigation performance.
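The dynamic-programming alignment between a panorama sequence and an ordered landmark list can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' recurrence: each landmark is assigned to exactly one panorama, assignments must respect the landmark order (several landmarks may share a panorama), and the alignment score is the sum of the chosen match scores. The `scores` matrix stands in for VLM outputs and the example values are made up.

```python
from typing import List


def alignment_score(scores: List[List[float]]) -> float:
    """Best order-preserving alignment score.

    scores[i][j] is the (assumed) VLM match score between panorama i
    and landmark j; landmarks must be matched in non-decreasing
    panorama order.
    """
    n_panoramas = len(scores)
    n_landmarks = len(scores[0])
    # best[i]: best total for landmarks 0..j with landmark j at panorama i.
    best = [scores[i][0] for i in range(n_panoramas)]
    for j in range(1, n_landmarks):
        prefix_max = float("-inf")
        new_best = []
        for i in range(n_panoramas):
            # Landmark j-1 may sit at any panorama up to and including i.
            prefix_max = max(prefix_max, best[i])
            new_best.append(prefix_max + scores[i][j])
        best = new_best
    return max(best)


# Two hypothetical path hypotheses, three panoramas and two landmarks each.
in_order = [[0.9, 0.1], [0.2, 0.3], [0.1, 0.8]]      # landmarks seen in order
out_of_order = [[0.1, 0.8], [0.2, 0.3], [0.9, 0.1]]  # landmarks seen reversed
hypotheses = [in_order, out_of_order]
best_index = max(range(len(hypotheses)), key=lambda h: alignment_score(hypotheses[h]))
```

The running prefix maximum keeps the cost at O(panoramas x landmarks) per hypothesis, and the hypothesis with the highest score is the one selected for evaluation.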
Jian Xu, Sichun Luo, Xiangyu Chen, Haoming Huang, Hanxu Hou, Linqi Song
Large Language Models (LLMs) have been integrated into recommendation systems
to enhance user behavior comprehension. The Retrieval Augmented Generation
(RAG) technique is further incorporated into these systems to retrieve more
relevant items and improve system performance. However, existing RAG methods
rely primarily on textual semantics and often fail to incorporate the most
relevant items, limiting the effectiveness of the systems.
In this paper, we propose Representation learning for retrieval-Augmented
Large Language model Recommendation (RALLRec). Specifically, we enhance textual
semantics by prompting LLMs to generate more detailed item descriptions,
followed by joint representation learning of textual and collaborative
semantics, which are extracted by the LLM and recommendation models,
respectively. Considering the potential time-varying characteristics of user
interest, a simple yet effective reranking method is further introduced to
capture the dynamics of user preference. We conducted extensive experiments on
three real-world datasets, and the evaluation results validated the
effectiveness of our method. Code is made public at
https://github.com/JianXu95/RALLRec.
Authors' comments: Accepted by TheWebConf'25 (WWW'25) as a Short Paper
Guoxin Chen, Minpeng Liao, Peiying Yu, Dingmin Wang, Zile Qiao, Chao Yang, Xin Zhao, Kai Fan
Retrieval-augmented generation (RAG) systems face a fundamental challenge in
aligning independently developed retrievers and large language models (LLMs).
Existing approaches typically involve modifying either component or introducing
simple intermediate modules, resulting in practical limitations and sub-optimal
performance. Inspired by human search behavior, which typically involves a
back-and-forth process of proposing search queries and reviewing documents, we
propose C-3PO, a proxy-centric framework that facilitates communication between
retrievers and LLMs through a lightweight multi-agent system. Our framework
implements three specialized agents that collaboratively optimize the entire
RAG pipeline without altering the retriever and LLMs. These agents work
together to assess the need for retrieval, generate effective queries, and
select information suitable for the LLMs. To enable effective multi-agent
coordination, we develop a tree-structured rollout approach for reward credit
assignment in reinforcement learning. Extensive experiments in both in-domain
and out-of-distribution scenarios demonstrate that C-3PO significantly enhances
RAG performance while maintaining plug-and-play flexibility and superior
generalization capabilities.
Authors' comments: Camera ready version for ICML 2025
Zhengyun Zhao, Hongyi Yuan, Jingjing Liu, Haichao Chen, Huaiyuan Ying, Songchi Zhou, Yue Zhong, Sheng Yu
Electronic Health Record (EHR) retrieval plays a pivotal role in various clinical tasks, but its development has been severely impeded by the lack of publicly available benchmarks. In this paper, we introduce a novel public EHR retrieval benchmark, CliniQ, to address this gap. We consider two retrieval settings: Single-Patient Retrieval and Multi-Patient Retrieval, reflecting various real-world scenarios. Single-Patient Retrieval focuses on finding relevant parts within a patient note, while Multi-Patient Retrieval involves retrieving EHRs from multiple patients. We build our benchmark upon 1,000 discharge summary notes along with the ICD codes and prescription labels from MIMIC-III, and collect 1,246 unique queries with 77,206 relevance judgments by further leveraging powerful LLMs as annotators. Additionally, we include a novel assessment of the semantic gap issue in EHR retrieval by categorizing matching types into string match and four types of semantic matches. On our proposed benchmark, we conduct a comprehensive evaluation of various retrieval methods, ranging from conventional exact match to popular dense retrievers. Our experiments find that BM25 sets a strong baseline and performs competitively with the dense retrievers, and general-domain dense retrievers surprisingly outperform those designed for the medical domain. In-depth analyses of the various matching types reveal the strengths and drawbacks of different methods, highlighting the potential for targeted improvement. We believe that our benchmark will stimulate the research community to advance EHR retrieval systems.
Serge Kas Hanna
This work presents a theoretical analysis of the probability of successfully
retrieving data encoded with MDS codes (e.g., Reed-Solomon codes) in DNA
storage systems. We study this probability under independent and identically
distributed (i.i.d.) substitution errors, focusing on a common code design
strategy that combines inner and outer MDS codes. Our analysis demonstrates how
this probability depends on factors such as the total number of sequencing
reads, their distribution across strands, the rates of the inner and outer
codes, and the substitution error probabilities. These results provide
actionable insights into optimizing DNA storage systems under reliability
constraints, including determining the minimum number of sequencing reads
needed for reliable data retrieval and identifying the optimal balance between
the rates of inner and outer MDS codes.
Authors' comments: A shorter version of this paper has been accepted for presentation at
ISIT 2025
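As a simplified illustration of how success probability depends on code rate and substitution error probability, the sketch below covers a single [n, k] MDS codeword under bounded-distance decoding, which corrects up to floor((n - k) / 2) symbol errors. It is the textbook single-code case only, under the assumption of i.i.d. symbol errors with probability p; it does not reproduce the paper's analysis of concatenated inner and outer codes or of how reads are distributed across strands.

```python
from math import comb


def mds_decode_success_prob(n: int, k: int, p: float) -> float:
    """P(at most t symbol errors) for an [n, k] MDS code, t = (n - k) // 2.

    Assumes each of the n symbols is corrupted independently with
    probability p and that decoding succeeds iff the number of
    corrupted symbols does not exceed t.
    """
    t = (n - k) // 2
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(t + 1)
    )


# Lower-rate codes (smaller k) tolerate more errors at the same p.
low_rate = mds_decode_success_prob(10, 4, 0.1)
high_rate = mds_decode_success_prob(10, 8, 0.1)
```

Sweeping k for fixed n and p in this way gives a rough feel for the rate-versus-reliability trade-off that the abstract discusses for the full inner/outer design.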
Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, Heng Huang
Large language models (LLMs) are increasingly integrated into real-world
personalized applications through retrieval-augmented generation (RAG)
mechanisms to supplement their responses with domain-specific knowledge.
However, the valuable and often proprietary nature of the knowledge bases used
in RAG introduces the risk of unauthorized usage by adversaries. Existing
methods that can be generalized as watermarking techniques to protect these
knowledge bases typically involve poisoning or backdoor attacks. However, these
methods require altering the LLM's outputs on verification samples, inevitably
making these watermarks susceptible to anomaly detection and even introducing
new security risks. To address these challenges, we propose \name{} for
`harmless' copyright protection of knowledge bases. Instead of manipulating
LLM's final output, \name{} implants distinct yet benign verification behaviors
in the space of chain-of-thought (CoT) reasoning, maintaining the correctness
of the final answer. Our method has three main stages: (1) Generating CoTs: For
each verification question, we generate two `innocent' CoTs, including a target
CoT for building watermark behaviors; (2) Optimizing Watermark Phrases and
Target CoTs: Inspired by our theoretical analysis, we optimize them to minimize
retrieval errors under the \emph{black-box} and \emph{text-only} setting of the
suspicious LLM, ensuring that only watermarked verification queries can
retrieve their corresponding target CoTs contained in the knowledge base; (3)
Ownership Verification: We exploit a pairwise Wilcoxon test to verify whether a
suspicious LLM is augmented with the protected knowledge base by comparing its
responses to watermarked and benign verification queries. Our experiments on
diverse benchmarks demonstrate that \name{} effectively protects knowledge
bases and resists adaptive attacks.
Authors' comments: The first two authors contributed equally to this work. 25 pages
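The ownership-verification stage relies on a paired Wilcoxon test, which can be sketched with the standard library alone. The code below is a generic one-sided Wilcoxon signed-rank test using the normal approximation, without tie or continuity corrections. What the per-query scores measure is an assumption here (for instance, how closely a response follows the target CoT), because the abstract does not specify the scoring; the function name and sample values are hypothetical.

```python
import math
from typing import List


def wilcoxon_signed_rank_pvalue(watermarked: List[float], benign: List[float]) -> float:
    """One-sided p-value that watermarked scores exceed benign scores.

    Uses the normal approximation to the signed-rank statistic; zero
    differences are dropped and tied magnitudes share an average rank.
    """
    diffs = [w - b for w, b in zip(watermarked, benign) if w != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2 + 1  # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        start = end + 1
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical paired scores for ten verification questions.
watermarked_scores = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
benign_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
p_value = wilcoxon_signed_rank_pvalue(watermarked_scores, benign_scores)
```

A small p-value would suggest that the suspicious LLM responds differently to watermarked queries than to benign ones; for small samples an exact test is preferable to the normal approximation used here.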
Venktesh V, Vinay Setty
The field of automated fact-checking increasingly depends on retrieving
web-based evidence to determine the veracity of claims in real-world scenarios.
A significant challenge in this process is not only retrieving relevant
information, but also identifying evidence that can both support and refute
complex claims. Traditional retrieval methods may return documents that
directly address claims or lean toward supporting them, but often struggle with
more complex claims requiring indirect reasoning. While some existing
benchmarks and methods target retrieval for fact-checking, a comprehensive
real-world open-domain benchmark has been lacking. In this paper, we present a
real-world retrieval benchmark FactIR, derived from Factiverse production logs,
enhanced with human annotations. We rigorously evaluate state-of-the-art
retrieval models in a zero-shot setup on FactIR and offer insights for
developing practical retrieval systems for fact-checking. Code and data are
available at https://github.com/factiverse/factIR.
Authors' comments: Accepted to WWW 2025 resource track
Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong et al.
Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of Artificial Intelligence-Generated Content (AIGC). By integrating context retrieval into content generation, RAG provides reliable and up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. However, despite RAG's success and potential, recent studies have shown that the RAG paradigm also introduces new risks, including robustness issues, privacy concerns, adversarial attacks, and accountability issues. Addressing these risks is critical for future applications of RAG systems, as they directly impact their trustworthiness. Although various methods have been developed to improve the trustworthiness of RAG methods, there is a lack of a unified perspective and framework for research on this topic. Thus, in this paper, we aim to address this gap by providing a comprehensive roadmap for developing trustworthy RAG systems. We place our discussion around six key perspectives: reliability, privacy, safety, fairness, explainability, and accountability. For each perspective, we present a general framework and taxonomy, offering a structured approach to understanding the current challenges, evaluating existing solutions, and identifying promising future research directions. To encourage broader adoption and innovation, we also highlight the downstream applications where trustworthy RAG systems have a significant impact.
Mohammad Kianpisheh
Content-based video retrieval is one of the most challenging tasks in surveillance systems. In this study, the Latent Dirichlet Allocation (LDA) topic model is used to annotate surveillance videos in an unsupervised manner. In scene understanding methods, some of the learned patterns are ambiguous and represent a mixture of atomic actions. To address the ambiguity issue, the proposed method processes the feature vectors and the primary model to obtain a secondary model that describes the scene with primitive patterns that lack any ambiguity. Experiments show performance improvement in the retrieval task compared to other topic model-based methods. In terms of false positive and true positive responses, the proposed method achieves at least 80% and 124% improvement, respectively. Four search strategies are proposed, and users can define and search for a variety of activities using the proposed query formulation, which is based on topic models. In addition, the lightweight database in our method occupies much less storage, which in turn speeds up the search procedure compared to methods based on low-level features.
Subhamoy Chatterjee, Andres Munoz-Jaramillo, Anna Malanushenko
Deep generative models have shown immense potential in generating unseen data
that has properties of real data. These models learn complex data-generating
distributions starting from a smaller set of latent dimensions. However,
generative models have encountered great skepticism in scientific domains due
to the disconnection between generative latent vectors and scientifically
relevant quantities. In this study, we integrate three types of machine
learning models to generate solar magnetic patches in a physically
interpretable manner and use those as a query to find matching patches in real
observations. We use the magnetic field measurements from Space-weather HMI
Active Region Patches (SHARPs) to train a Generative Adversarial Network (GAN).
We connect the physical properties of GAN-generated images with their latent
vectors to train Support Vector Machines (SVMs) that map between
physical and latent spaces. These produce directions in the GAN latent space
along which known physical parameters of the SHARPs change. We train a
self-supervised learner (SSL) to make queries with generated images and find
matches from real data. We find that the GAN-SVM combination enables users to
produce high-quality patches that change smoothly only with a prescribed
physical quantity, making generative models physically interpretable. We also
show that GAN outputs can be used to retrieve real data that shares the same
physical properties as the generated query. This elevates Generative Artificial
Intelligence (AI) from a means-to-produce artificial data to a novel tool for
scientific data interrogation, supporting its applicability beyond the domain
of heliophysics.
Authors' comments: 9 pages, 6 figures
Sicheng Zhong, Jiading Zhu, Yifang Tian, Xujie Si
Scaling automated formal verification to real-world projects requires resolving cross-module dependencies and global contexts, which are challenges overlooked by existing function-centric methods. We introduce RagVerus, a framework that synergizes retrieval-augmented generation with context-aware prompting to automate proof synthesis for multi-module repositories, achieving a 27% relative improvement on our novel RepoVBench benchmark -- the first repository-level dataset for Verus with 383 proof completion tasks. RagVerus triples proof pass rates on existing benchmarks under constrained language model budgets, demonstrating scalable and sample-efficient verification.
Rishabh Uapadhyay, Marco Viviani
The exponential surge in online health information, coupled with its increasing use by non-experts, highlights the pressing need for advanced Health Information Retrieval models that consider not only topical relevance but also the factual accuracy of the retrieved information, given the potential risks associated with health misinformation. To this aim, this paper introduces a solution driven by Retrieval-Augmented Generation (RAG), which leverages the capabilities of generative Large Language Models (LLMs) to enhance the retrieval of health-related documents grounded in scientific evidence. In particular, we propose a three-stage model: in the first stage, the user's query is employed to retrieve topically relevant passages with associated references from a knowledge base constituted by scientific literature. In the second stage, these passages, alongside the initial query, are processed by LLMs to generate a contextually relevant rich text (GenText). In the last stage, the documents to be retrieved are evaluated and ranked both from the point of view of topical relevance and factual accuracy by means of their comparison with GenText, either through stance detection or semantic similarity. In addition to calculating factual accuracy, GenText can offer a layer of explainability for it, aiding users in understanding the reasoning behind the retrieval. Experimental evaluation of our model on benchmark datasets and against baseline models demonstrates its effectiveness in enhancing the retrieval of both topically relevant and factually accurate health information, thus presenting a significant step forward in the health misinformation mitigation problem.
Dylan Brault, Corinne Fournier, Tatiana Latychevskaia
Iterative phase retrieval algorithms are widely used in digital optics for their efficiency and simplicity. Conventionally, these algorithms do not consider aberrations as they assume an ideal, aberration-free optical system. Here, we propose modified iterative phase retrieval algorithms that take into account the space-invariant and space-variant point spread function of the optical system.
Yuwei Yin, Giuseppe Carenini
Large language models (LLMs) achieve remarkable performance on challenging
benchmarks that are often structured as multiple-choice question-answering (QA)
tasks. Zero-shot Chain-of-Thought (CoT) prompting enhances reasoning in LLMs
but provides only vague and generic guidance ("think step by step"). This paper
introduces ARR, an intuitive and effective zero-shot prompting method that
explicitly incorporates three key steps in QA solving: analyzing the intent of
the question, retrieving relevant information, and reasoning step by step.
Comprehensive experiments across diverse and challenging QA tasks demonstrate
that ARR consistently improves the Baseline (without ARR prompting) and
outperforms CoT. Ablation and case studies further validate the positive
contributions of each component: analyzing, retrieving, and reasoning. Notably,
intent analysis plays a vital role in ARR. Additionally, extensive evaluations
across various model sizes, LLM series, and generation settings solidify the
effectiveness, robustness, and generalizability of ARR.
Authors' comments: 20 pages. Code: https://github.com/YuweiYin/ARR
Matthew Smart, Alberto Bietti, Anirvan M. Sengupta
We introduce in-context denoising, a task that refines the connection between
attention-based architectures and dense associative memory (DAM) networks, also
known as modern Hopfield networks. Using a Bayesian framework, we show
theoretically and empirically that certain restricted denoising problems can be
solved optimally even by a single-layer transformer. We demonstrate that a
trained attention layer processes each denoising prompt by performing a single
gradient descent update on a context-aware DAM energy landscape, where context
tokens serve as associative memories and the query token acts as an initial
state. This one-step update yields better solutions than exact retrieval of
either a context token or a spurious local minimum, providing a concrete
example of DAM networks extending beyond the standard retrieval paradigm.
Overall, this work solidifies the link between associative memory and attention
mechanisms first identified by Ramsauer et al., and demonstrates the relevance
of associative memory models in the study of in-context learning.
Authors' comments: Accepted to ICML 2025
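The link between one attention step and one gradient step on an associative-memory energy can be checked numerically. The sketch below uses the standard modern Hopfield energy E(s) = 0.5 * ||s||^2 - (1/beta) * log(sum_i exp(beta * <xi_i, s>)) with untrained, identity projections; this is an assumption for illustration, and the context-aware energy and trained weights studied in the paper may differ. With step size 1, the gradient update collapses to the softmax-weighted average of the memories, which is a softmax attention readout with keys equal to values.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_readout(memories: List[Vector], query: Vector, beta: float) -> Vector:
    """Softmax attention with the context tokens as both keys and values."""
    logits = [beta * sum(m * q for m, q in zip(mem, query)) for mem in memories]
    weights = softmax(logits)
    return [sum(w * mem[d] for w, mem in zip(weights, memories)) for d in range(len(query))]


def energy(memories: List[Vector], state: Vector, beta: float) -> float:
    logits = [beta * sum(m * s for m, s in zip(mem, state)) for mem in memories]
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return 0.5 * sum(s * s for s in state) - log_sum_exp / beta


def gradient_step(memories: List[Vector], state: Vector, beta: float, step_size: float = 1.0) -> Vector:
    """One gradient-descent update; grad E(s) = s - attention_readout(s)."""
    readout = attention_readout(memories, state, beta)
    grad = [s - r for s, r in zip(state, readout)]
    return [s - step_size * g for s, g in zip(state, grad)]


# Context tokens act as memories; the query token is the initial state.
context = [[1.0, 0.0], [0.0, 1.0]]
query_token = [0.8, 0.1]
updated = gradient_step(context, query_token, beta=4.0)
attended = attention_readout(context, query_token, beta=4.0)
```

Here `updated` and `attended` agree up to floating-point rounding, and the update lowers the energy without landing exactly on either stored memory.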
Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a key module of the healthcare copilot, helping reduce misdiagnosis for healthcare practitioners and patients. However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations. This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain that retrieves diagnosis and treatment recommendations based on manifestations. MedRAG systematically constructs a comprehensive four-tier hierarchical diagnostic KG encompassing critical diagnostic differences of various diseases. These differences are dynamically integrated with similar EHRs retrieved from an EHR database, and reasoned within a large language model. This process enables more accurate and specific decision support, while also proactively providing follow-up questions to enhance personalized medical decision-making. MedRAG is evaluated on both a public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) collected from Tan Tock Seng Hospital, and its performance is compared against various existing RAG methods. Experimental results show that, leveraging the information integration and relational abilities of the KG, our MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates. Our code will be available at https://github.com/SNOWTEAM2023/MedRAG
Andreas Baumann, Peter Eberhard
Large Language Models (LLMs) are increasingly helpful in text generation,
even writing code in programming languages based on user prompts written in
natural language. They are even applied to generate simulation models for
multibody systems from natural language. Research results suggest that LLMs
surpass the mere replication of existing code examples, where some LLMs have
been trained on an open-source multibody simulation code. However, for
closed-source simulation software, such results are not to be expected as their
ideas and concepts might differ from other publicly available ones. LLMs can
hallucinate for knowledge-intensive tasks, such as model creation, which can
lead to wrong responses. This is especially the case for closed-source
simulation software unknown to the LLM. The same applies to other internal knowledge
kept private to protect intellectual property or data privacy. The
Retrieval-Augmented Generation (RAG) approach might yield a solution for these
knowledge-intensive tasks. This paper explores the application of RAG to
closed-source simulation software and presents first experiments. After a brief
introduction to LLMs, the RAG approach, and the simulation method applied by
the closed-source simulation software, several examples are provided to test
LLMs' knowledge of the simulation software and the creation of simulation
models using two RAG systems. The examples show promising results indicating
the benefits of applying RAG systems to closed-source simulation software,
helping to access their knowledge. Nevertheless, they also reveal gaps in the
applied information and open questions for further research.
Authors' comments: 11 pages, 6 tables
Qinhan Yu, Zhiyou Xiao, Binghui Li, Zhengren Wang, Chong Chen, Wentao Zhang
Recent advancements in Retrieval-Augmented Generation (RAG) have shown
remarkable performance in enhancing response accuracy and relevance by
integrating external knowledge into generative models. However, existing RAG
methods primarily focus on providing text-only answers, even in multimodal
retrieval-augmented generation scenarios. In this work, we introduce the
Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, which aims
to generate answers that combine both text and images, fully leveraging the
multimodal data within a corpus. Despite the importance of this task, there is
a notable absence of a comprehensive benchmark to effectively evaluate MRAMG
performance. To bridge this gap, we introduce the MRAMG-Bench, a carefully
curated, human-annotated dataset comprising 4,346 documents, 14,190 images, and
4,800 QA pairs, sourced from three categories: Web Data, Academic Papers, and
Lifestyle. The dataset incorporates diverse difficulty levels and complex
multi-image scenarios, providing a robust foundation for evaluating multimodal
generation tasks. To facilitate rigorous evaluation, our MRAMG-Bench
incorporates a comprehensive suite of both statistical and LLM-based metrics,
enabling a thorough analysis of the performance of popular generative models in
the MRAMG task. In addition, we propose an efficient multimodal answer generation
framework that leverages both LLMs and MLLMs to generate multimodal responses.
Our datasets are available at: https://huggingface.co/MRAMG.
Authors' comments: 11 pages