Qing Wang, Chong-Wah Ngo, Yu Cao, Ee-Peng Lim
Existing approaches for image-to-recipe retrieval have the implicit
assumption that a food image can fully capture the details textually documented
in its recipe. However, a food image only reflects the visual outcome of a
cooked dish and not the underlying cooking process. Consequently, learning
cross-modal representations to bridge the modality gap between images and
recipes tends to ignore subtle, recipe-specific details that are not visually
apparent but are crucial for recipe retrieval. Specifically, the
representations are biased to capture the dominant visual elements, resulting
in difficulty in ranking similar recipes with subtle differences in use of
ingredients and cooking methods. The bias in representation learning is
expected to be more severe when the training data is mixed of images and
recipes sourced from different cuisines. This paper proposes a novel causal
approach that predicts the culinary elements potentially overlooked in images,
while explicitly injecting these elements into cross-modal representation
learning to mitigate biases. Experiments are conducted on the standard
monolingual Recipe1M dataset and a newly curated multilingual multicultural
cuisine dataset. The results indicate that the proposed causal representation
learning is capable of uncovering subtle ingredients and cooking actions and
achieves impressive retrieval performance on both monolingual and multilingual
multicultural datasets.
Authors' comments: ACM Multimedia 2025
Rahul Raja, Arpita Vats
Question Answering (QA) systems have traditionally relied on structured text
data, but the rapid growth of multimedia content (images, audio, video, and
structured metadata) has introduced new challenges and opportunities for
retrieval-augmented QA. In this survey, we review recent advancements in QA
systems that integrate multimedia retrieval pipelines, focusing on
architectures that align vision, language, and audio modalities with user
queries. We categorize approaches based on retrieval methods, fusion
techniques, and answer generation strategies, and analyze benchmark datasets,
evaluation protocols, and performance tradeoffs. Furthermore, we highlight key
challenges such as cross-modal alignment, latency-accuracy tradeoffs, and
semantic grounding, and outline open problems and future research directions
for building more robust and context-aware QA systems leveraging multimedia
data.
Authors' comments: In Proceedings of the 2nd ACM Workshop in AI-powered Question and
Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New
York, NY, USA, 8 pages. https://doi.org/10.1145/3746274.3760393
Hamed Jelodar, Mohammad Meymani, Roozbeh Razavi-Far, Ali A. Ghorbani
Generative AI and large language models (LLMs) have shown strong capabilities in code understanding, but their use in cybersecurity, particularly for malware detection and analysis, remains limited. Existing detection systems often fail to generalize to obfuscated or previously unseen threats, underscoring the need for more adaptable and explainable models. To address this challenge, we introduce XGen-Q, a domain-adapted LLM built on the Qwen-Coder architecture and pretrained on a large-scale corpus of over one million malware samples, spanning both source and assembly code. XGen-Q uses a multi-stage prompt strategy combined with retrieval-augmented generation (RAG) to deliver reliable malware identification and detailed forensic reporting, even in the presence of complex code obfuscation. To further enhance generalization, we design a training pipeline that systematically exposes the model to diverse obfuscation patterns. Experimental results show that XGen-Q achieves significantly lower perplexity than competitive baselines and exhibits strong performance on novel malware samples, demonstrating the promise of LLM-based approaches for interpretable and robust malware analysis.
Dong Yun, Marco Schouten, Dim Papadopoulos
User queries in information retrieval are often ambiguous, making it challenging for systems to identify a user's target from a single query. While recent dialogue-based interactive retrieval systems can clarify user intent, they are inefficient as they often lack an explicit strategy to ask the most informative questions. To address this limitation, we propose SherlockLLM, a dialogue-driven retrieval framework that learns an optimal questioning strategy via Reinforcement Learning (RL) and avoids the need for large-scale annotated dialogue data. In our framework, an agent is trained to generate a sequence of binary questions to efficiently narrow down the search space. To validate our approach, we introduce a benchmark with both structured and unstructured tasks. Experimental results show that SherlockLLM is a robust and efficient solution. On the structured tasks, its performance matches strong baselines and approaches the theoretical optimal defined by binary search. On the challenging unstructured task, our agent significantly outperforms these baselines, showcasing its ability to learn a highly effective information-seeking dialogue policy.
Zebin Yang, Sunjian Zheng, Tong Xie, Tianshi Xu, Bo Yu, Fan Wang, Jie Tang, Shaoshan Liu et al.
Object-goal navigation (ObjNav) tasks an agent with navigating to the
location of a specific object in an unseen environment. Embodied agents
equipped with large language models (LLMs) and online constructed navigation
maps can perform ObjNav in a zero-shot manner. However, existing agents heavily
rely on giant LLMs on the cloud, e.g., GPT-4, while directly switching to small
LLMs, e.g., LLaMA3.2-11b, suffer from significant success rate drops due to
limited model capacity for understanding complex navigation maps, which
prevents deploying ObjNav on local devices. At the same time, the long prompt
introduced by the navigation map description will cause high planning latency
on local devices. In this paper, we propose EfficientNav to enable on-device
efficient LLM-based zero-shot ObjNav. To help the smaller LLMs better
understand the environment, we propose semantics-aware memory retrieval to
prune redundant information in navigation maps. To reduce planning latency, we
propose discrete memory caching and attention-based memory clustering to
efficiently save and re-use the KV cache. Extensive experimental results
demonstrate that EfficientNav achieves 11.1% improvement in success rate on
HM3D benchmark over GPT-4-based baselines, and demonstrates 6.7x real-time
latency reduction and 4.7x end-to-end latency reduction over GPT-4 planner. Our
code will be released soon.
Authors' comments: NeurIPS 2025
Wei-Chia Chang, Yan-Ann Chen
Vehicle make and model recognition (VMMR) is an important task in intelligent
transportation systems, but existing approaches struggle to adapt to newly
released models. Contrastive Language-Image Pretraining (CLIP) provides strong
visual-text alignment, yet its fixed pretrained weights limit performance
without costly image-specific finetuning. We propose a pipeline that integrates
vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to
support zero-shot recognition through text-based reasoning. A VLM converts
vehicle images into descriptive attributes, which are compared against a
database of textual features. Relevant entries are retrieved and combined with
the description to form a prompt, and a language model (LM) infers the make and
model. This design avoids large-scale retraining and enables rapid updates by
adding textual descriptions of new vehicles. Experiments show that the proposed
method improves recognition by nearly 20% over the CLIP baseline, demonstrating
the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city
applications.
Authors' comments: Accepted by The 38th Conference of Open Innovations Association
FRUCT, 2025
Jiaao Yu, Mingjie Han, Tao Gong, Jian Zhang, Man Lan
With the rapid growth of video data, text-video retrieval technology has
become increasingly important in numerous application scenarios such as
recommendation and search. Early text-video retrieval methods suffer from two
critical drawbacks: first, they heavily rely on large-scale annotated
video-text pairs, leading to high data acquisition costs; second, there is a
significant modal gap between video and text features, which limits cross-modal
alignment accuracy. With the development of vision-language model, adapting
CLIP to video tasks has attracted great attention. However, existing adaptation
methods generally lack enhancement for dynamic video features and fail to
effectively suppress static redundant features. To address this issue, this
paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), which is a concise
CLIP-based training framework for text-video alignment. Specifically, the
method uses frame differences to generate dynamic region masks, which are input
into Alpha-CLIP as an additional Alpha channel. This proactively guides the
model to focus on semantically critical dynamic regions while suppressing
static background redundancy. Experiments demonstrate that frame
difference-guided video semantic encoding can effectively balance retrieval
efficiency and accuracy.
Authors' comments: 5 pages
Mohd Ruhul Ameen, Akif Islam, Farjana Aktar, M. Saifuzzaman Rafat
In Bangladesh, many farmers continue to face challenges in accessing timely,
expert-level agricultural guidance. This paper presents KrishokBondhu, a
voice-enabled, call-centre-integrated advisory platform built on a
Retrieval-Augmented Generation (RAG) framework, designed specifically for
Bengali-speaking farmers. The system aggregates authoritative agricultural
handbooks, extension manuals, and NGO publications; applies Optical Character
Recognition (OCR) and document-parsing pipelines to digitize and structure the
content; and indexes this corpus in a vector database for efficient semantic
retrieval. Through a simple phone-based interface, farmers can call the system
to receive real-time, context-aware advice: speech-to-text converts the Bengali
query, the RAG module retrieves relevant content, a large language model (Gemma
3-4B) generates a context-grounded response, and text-to-speech delivers the
answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced
high-quality responses for 72.7% of diverse agricultural queries covering crop
management, disease control, and cultivation practices. Compared to the
KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on
a 5-point scale, a 44.7% improvement, with especially large gains in contextual
richness (+367%) and completeness (+100.4%), while maintaining comparable
relevance and technical specificity. Semantic similarity analysis further
revealed a strong correlation between retrieved context and answer quality,
emphasizing the importance of grounding generative responses in curated
documentation. KrishokBondhu demonstrates the feasibility of integrating
call-centre accessibility, multilingual voice interaction, and modern RAG
techniques to deliver expert-level agricultural guidance to remote Bangladeshi
farmers, paving the way toward a fully AI-driven agricultural advisory
ecosystem.
Authors' comments: 6 pages, 7 figures, 5 tables, submitted to the 11th IEEE
International Women in Engineering (WIE) Conference on Electrical and
Computer Engineering (WIECON-ECE 2025)
Lehan Wang, Yi Qin, Honglong Yang, Xiaomeng Li
Incentivizing the reasoning ability of Multimodal Large Language Models
(MLLMs) is essential for medical applications to transparently analyze medical
scans and provide reliable diagnosis. However, existing medical MLLMs rely
solely on internal knowledge during reasoning, leading to hallucinated
reasoning and factual inaccuracies when encountering cases beyond their
training scope. Although recent Agentic Retrieval-Augmented Generation (RAG)
methods elicit the medical model's proactive retrieval ability during
reasoning, they are confined to unimodal LLMs, neglecting the crucial visual
information during reasoning and retrieval. Consequently, we propose the first
Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively
retrieves external knowledge by querying observed symptoms or domain-specific
medical concepts during reasoning. Specifically, we design a two-stage
reinforcement learning strategy with tailored rewards that stimulate the model
to leverage both visual diagnostic findings and textual clinical information
for effective retrieval. Building on this foundation, we further propose a
Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when
low prediction confidence is detected. Evaluation on various public medical
benchmarks demonstrates Med-RwR's significant improvements over baseline
models, proving the effectiveness of enhancing reasoning capabilities with
external knowledge integration. Furthermore, Med-RwR demonstrates remarkable
generalizability to unfamiliar domains, evidenced by 8.8% performance gain on
our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of
echocardiography data in the training corpus. Our data, model, and codes will
be made publicly available at https://github.com/xmed-lab/Med-RwR.
Authors' comments: Work in progress
Lei Li, Xiao Zhou, Yingying Zhang, Xian Wu
Medical question answering (QA) requires extensive access to domain-specific
knowledge. A promising direction is to enhance large language models (LLMs)
with external knowledge retrieved from medical corpora or parametric knowledge
stored in model parameters. Existing approaches typically fall into two
categories: Retrieval-Augmented Generation (RAG), which grounds model reasoning
on externally retrieved evidence, and Generation-Augmented Generation (GAG),
which depends solely on the models internal knowledge to generate contextual
documents. However, RAG often suffers from noisy or incomplete retrieval, while
GAG is vulnerable to hallucinated or inaccurate information due to
unconstrained generation. Both issues can mislead reasoning and undermine
answer reliability. To address these challenges, we propose MedRGAG, a unified
retrieval-generation augmented framework that seamlessly integrates external
and parametric knowledge for medical QA. MedRGAG comprises two key modules:
Knowledge-Guided Context Completion (KGCC), which directs the generator to
produce background documents that complement the missing knowledge revealed by
retrieval; and Knowledge-Aware Document Selection (KADS), which adaptively
selects an optimal combination of retrieved and generated documents to form
concise yet comprehensive evidence for answer generation. Extensive experiments
on five medical QA benchmarks demonstrate that MedRGAG achieves a 12.5%
improvement over MedRAG and a 4.5% gain over MedGENIE, highlighting the
effectiveness of unifying retrieval and generation for knowledge-intensive
reasoning. Our code and data are publicly available at
https://anonymous.4open.science/r/MedRGAG
Authors' comments: 13 pages, 4 figures
Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman
Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
Tenghui Huang, Jinbo Wen, Jiawen Kang, Siyong Chen, Zhengtao Li, Tao Zhang, Dongning Liu, Jiacheng Wang et al.
Smart contracts play a significant role in automating blockchain services. Nevertheless, vulnerabilities in smart contracts pose serious threats to blockchain security. Currently, traditional detection methods primarily rely on static analysis and formal verification, which can result in high false-positive rates and poor scalability. Large Language Models (LLMs) have recently made significant progress in smart contract vulnerability detection. However, they still face challenges such as high inference costs and substantial computational overhead. In this paper, we propose ParaVul, a parallel LLM and retrieval-augmented framework to improve the reliability and accuracy of smart contract vulnerability detection. Specifically, we first develop Sparse Low-Rank Adaptation (SLoRA) for LLM fine-tuning. SLoRA introduces sparsification by incorporating a sparse matrix into quantized LoRA-based LLMs, thereby reducing computational overhead and resource requirements while enhancing their ability to understand vulnerability-related issues. We then construct a vulnerability contract dataset and develop a hybrid Retrieval-Augmented Generation (RAG) system that integrates dense retrieval with Best Matching 25 (BM25), assisting in verifying the results generated by the LLM. Furthermore, we propose a meta-learning model to fuse the outputs of the RAG system and the LLM, thereby generating the final detection results. After completing vulnerability detection, we design chain-of-thought prompts to guide LLMs to generate comprehensive vulnerability detection reports. Simulation results demonstrate the superiority of ParaVul, especially in terms of F1 scores, achieving 0.9398 for single-label detection and 0.9330 for multi-label detection.
Zulun Zhu, Haoyu Liu, Mengke He, Siqiang Luo
Question answering in temporal knowledge graphs requires retrieval that is both time-consistent and efficient. Existing RAG methods are largely semantic and typically neglect explicit temporal constraints, which leads to time-inconsistent answers and inflated token usage. We propose STAR-RAG, a temporal GraphRAG framework that relies on two key ideas: building a time-aligned rule graph and conducting propagation on this graph to narrow the search space and prioritize semantically relevant, time-consistent evidence. This design enforces temporal proximity during retrieval, reduces the candidate set of retrieval results, and lowers token consumption without sacrificing accuracy. Compared with existing temporal RAG approaches, STAR-RAG eliminates the need for heavy model training and fine-tuning, thereby reducing computational cost and significantly simplifying deployment.Extensive experiments on real-world temporal KG datasets show that our method achieves improved answer accuracy while consuming fewer tokens than strong GraphRAG baselines.
Kunyu Peng, Di Wen, Jia Fu, Jiamin Wu, Kailun Yang, Junwei Zheng, Ruiping Liu, Yufan Chen et al.
Referring Atomic Video Action Recognition (RAVAR) aims to recognize
fine-grained, atomic-level actions of a specific person of interest conditioned
on natural language descriptions. Distinct from conventional action recognition
and detection tasks, RAVAR emphasizes precise language-guided action
understanding, which is particularly critical for interactive human action
analysis in complex multi-person scenarios. In this work, we extend our
previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million
frames and >75.1k annotated persons in total. We benchmark this dataset using
baselines from multiple related domains, including atomic action localization,
video question answering, and text-video retrieval, as well as our earlier
model, RefAtomNet. Although RefAtomNet surpasses other baselines by
incorporating agent attention to highlight salient features, its ability to
align and retrieve cross-modal information remains limited, leading to
suboptimal performance in localizing the target person and predicting
fine-grained actions. To overcome the aforementioned limitations, we introduce
RefAtomNet++, a novel framework that advances cross-modal token aggregation
through a multi-hierarchical semantic-aligned cross-attention mechanism
combined with multi-trajectory Mamba modeling at the partial-keyword,
scene-attribute, and holistic-sentence levels. In particular, scanning
trajectories are constructed by dynamically selecting the nearest visual
spatial tokens at each timestep for both partial-keyword and scene-attribute
levels. Moreover, we design a multi-hierarchical semantic-aligned
cross-attention strategy, enabling more effective aggregation of spatial and
temporal tokens across different semantic hierarchies. Experiments show that
RefAtomNet++ establishes new state-of-the-art results. The dataset and code are
released at https://github.com/KPeng9510/refAVA2.
Authors' comments: Extended version of ECCV 2024 paper arXiv:2407.01872. The dataset and
code are released at https://github.com/KPeng9510/refAVA2
Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin
Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
Philip DiGiacomo, Haoyang Wang, Jinrui Fang, Yan Leng, W Michael Brode, Ying Ding
As AI chatbots gain adoption in clinical medicine, developing effective
frameworks for complex, emerging diseases presents significant challenges. We
developed and evaluated six Retrieval-Augmented Generation (RAG) corpus
configurations for Long COVID (LC) clinical question answering, ranging from
expert-curated sources to large-scale literature databases. Our evaluation
employed an LLM-as-a-judge framework across faithfulness, relevance, and
comprehensiveness metrics using LongCOVID-CQ, a novel dataset of
expert-generated clinical questions. Our RAG corpus configuration combining
clinical guidelines with high-quality systematic reviews consistently
outperformed both narrow single-guideline approaches and large-scale literature
databases. Our findings suggest that for emerging diseases, retrieval grounded
in curated secondary reviews provides an optimal balance between narrow
consensus documents and unfiltered primary literature, supporting clinical
decision-making while avoiding information overload and oversimplified
guidance. We propose Guide-RAG, a chatbot system and accompanying evaluation
framework that integrates both curated expert knowledge and comprehensive
literature databases to effectively answer LC clinical questions.
Authors' comments: Accepted to 39th Conference on Neural Information Processing Systems
(NeurIPS 2025) Workshop: The Second Workshop on GenAI for Health: Potential,
Trust, and Policy Compliance
Jinliang Liu
Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.
Jinghao Huang, Yaxiong Chen, Ganchao Liu
With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.
Yijia Sun, Shanshan Huang, Zhiyuan Guan, Qiang Luo, Ruiming Tang, Kun Gai, Guorui Zhou
Industrial-scale recommender systems rely on a cascade pipeline in which the retrieval stage must return a high-recall candidate set from billions of items under tight latency. Existing solutions ei- ther (i) suffer from limited expressiveness in capturing fine-grained user-item interactions, as seen in decoupled dual-tower architectures that rely on separate encoders, or generative models that lack precise target-aware matching capabilities, or (ii) build structured indices (tree, graph, quantization) whose item-centric topologies struggle to incorporate dynamic user preferences and incur prohibitive construction and maintenance costs. We present GRank, a novel structured-index-free retrieval paradigm that seamlessly unifies target-aware learning with user-centric retrieval. Our key innovations include: (1) A target-aware Generator trained to perform personalized candidate generation via GPU-accelerated MIPS, eliminating semantic drift and maintenance costs of structured indexing; (2) A lightweight but powerful Ranker that performs fine-grained, candidate-specific inference on small subsets; (3) An end-to-end multi-task learning framework that ensures semantic consistency between generation and ranking objectives. Extensive experiments on two public benchmarks and a billion-item production corpus demonstrate that GRank improves Recall@500 by over 30% and 1.7$\times$ the P99 QPS of state-of-the-art tree- and graph-based retrievers. GRank has been fully deployed in production in our recommendation platform since Q2 2025, serving 400 million active users with 99.95% service availability. Online A/B tests confirm significant improvements in core engagement metrics, with Total App Usage Time increasing by 0.160% in the main app and 0.165% in the Lite version.
Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang et al.
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.