benty-fields - Search paper

9621. M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, Mohit Bansal

arXiv:2411.04952v1

Document visual question answering (DocVQA) pipelines that answer questions from documents have broad applications. Existing methods focus on handling single-page documents with multi-modal language models (MLMs), or rely on text-based retrieval-augmented generation (RAG) that uses text extraction tools such as optical character recognition (OCR). However, there are difficulties in applying these methods in real-world scenarios: (a) questions often require information across different pages or documents, where MLMs cannot handle many long documents; (b) documents often have important information in visual elements such as figures, but text extraction tools ignore them. We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts (closed-domain and open-domain), question hops (single-hop and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG finds relevant documents and answers questions using a multi-modal retriever and an MLM, so that it can efficiently handle single or many documents while preserving visual information. Since previous DocVQA datasets ask questions in the context of a specific document, we also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages. In three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results show that M3DocRAG with ColPali and Qwen2-VL 7B outperforms many strong baselines, including state-of-the-art performance in MP-DocVQA. We provide comprehensive analyses of different indexing, MLMs, and retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully handle various scenarios, such as when relevant information exists across multiple pages and when answer evidence only exists in images.
Authors' comments: Project webpage: https://m3docrag.github.io

Benty-search

9622. DNN-based 3D Cloud Retrieval for Variable Solar Illumination and Multiview Spaceborne Imaging

arXiv:2411.04682v1

9623. Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval

arXiv:2411.04677v2

9624. Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

arXiv:2411.05141v2

9625. Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval

arXiv:2411.04006v1

9626. LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration

arXiv:2411.05844v2

9627. JPEC: A Novel Graph Neural Network for Competitor Retrieval in Financial Knowledge Graphs

arXiv:2411.02692v1

9628. HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

arXiv:2411.02959v2

9629. Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

arXiv:2411.02937v3

9630. CAD-NeRF: Learning NeRFs from Uncalibrated Few-view Images by CAD Model Retrieval

arXiv:2411.02979v2

9631. TeleOracle: Fine-Tuned Retrieval-Augmented Generation with Long-Context Support for Network

arXiv:2411.02617v1

9632. QCG-Rerank: Chunks Graph Rerank with Query Expansion in Retrieval-Augmented LLMs for Tourism Domain

arXiv:2411.08724v1

9633. Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval

arXiv:2411.11875v1

9634. Social-RAG: Retrieving from Group Interactions to Socially Ground AI Generation

arXiv:2411.02353v2

9635. Efficient Medical Image Retrieval Using DenseNet and FAISS for BIRADS Classification

arXiv:2411.01473v1

9636. Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output

arXiv:2411.01022v1

9637. E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation

arXiv:2411.00437v2

9638. EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning

arXiv:2410.23968v1

9639. MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval

arXiv:2410.23736v1

9640. Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations

arXiv:2410.22874v1
