Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, Mohit Bansal
Document visual question answering (DocVQA) pipelines that answer questions
from documents have broad applications. Existing methods focus on handling
single-page documents with multi-modal language models (MLMs), or rely on
text-based retrieval-augmented generation (RAG) that uses text extraction tools
such as optical character recognition (OCR). However, there are difficulties in
applying these methods in real-world scenarios: (a) questions often require
information across different pages or documents, where MLMs cannot handle many
long documents; (b) documents often have important information in visual
elements such as figures, but text extraction tools ignore them. We introduce
M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various
document contexts (closed-domain and open-domain), question hops (single-hop
and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG
finds relevant documents and answers questions using a multi-modal retriever
and an MLM, so that it can efficiently handle single or many documents while
preserving visual information. Since previous DocVQA datasets ask questions in
the context of a specific document, we also present M3DocVQA, a new benchmark
for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
On three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results
show that M3DocRAG with ColPali and Qwen2-VL 7B outperforms many strong
baselines and achieves state-of-the-art performance on MP-DocVQA. We provide
comprehensive analyses of different indexing, MLMs, and
retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully
handle various scenarios, such as when relevant information exists across
multiple pages and when answer evidence only exists in images.
Authors' comments: Project webpage: https://m3docrag.github.io
Tamar Klein, Tom Aizenberg, Roi Ronen
Climate studies often rely on remotely sensed images to retrieve
two-dimensional maps of cloud properties. To advance volumetric analysis, we
focus on recovering the three-dimensional (3D) heterogeneous extinction
coefficient field of shallow clouds using multiview remote sensing data.
Climate research requires large-scale worldwide statistics. To enable scalable
data processing, previous deep neural networks (DNNs) can infer at spaceborne
remote sensing downlink rates. However, prior methods are limited to a fixed
solar illumination direction. In this work, we introduce the first scalable
DNN-based system for 3D cloud retrieval that accommodates varying camera poses
and solar directions. By integrating multiview cloud intensity images with
camera poses and solar direction data, we achieve greater flexibility in
recovery. Training of the DNN is performed by a novel two-stage scheme to
address the high number of degrees of freedom in this problem. Our approach
shows substantial improvements over previous state-of-the-art, particularly in
handling variations in the sun's zenith angle.
Authors' comments: 4 pages, 4 figures
Ferdinand Schlatt, Maik Fröbe, Matthias Hagen
A wide range of transformer-based language models have been proposed for
information retrieval tasks. However, including transformer-based models in
retrieval pipelines is often complex and requires substantial engineering
effort. In this paper, we introduce Lightning IR, an easy-to-use PyTorch
Lightning-based framework for applying transformer-based language models in
retrieval scenarios. Lightning IR provides a modular and extensible
architecture that supports all stages of a retrieval pipeline: from fine-tuning
and indexing to searching and re-ranking. Designed to be scalable and
reproducible, Lightning IR is available as open-source:
https://github.com/webis-de/lightning-ir.
Authors' comments: Accepted as a demo at WSDM'25
Mu Yang, Bowen Shi, Matthew Le, Wei-Ning Hsu, Andros Tjandra
This work focuses on improving Text-To-Audio (TTA) generation on zero-shot
and few-shot settings (i.e. generating unseen or uncommon audio events).
Inspired by the success of Retrieval-Augmented Generation (RAG) in Large
Language Models, we propose Audiobox TTA-RAG, a novel retrieval-augmented TTA
approach based on Audiobox, a flow-matching audio generation model. Unlike the
vanilla Audiobox TTA solution that generates audio conditioned on text only, we
extend the TTA process by augmenting the conditioning input with both text and
retrieved audio samples. Our retrieval method does not require the external
database to have labeled audio, offering more practical use cases. We show that
the proposed model can effectively leverage the retrieved audio samples and
significantly improve zero-shot and few-shot TTA performance, with large
margins on multiple evaluation metrics, while maintaining the ability to
generate semantically aligned audio for the in-domain setting.
Authors' comments: Interspeech 2025
Davide Buoso, Luke Robinson, Giuseppe Averta, Philip Torr, Tim Franzmeyer, Daniele De Martini
This study explores the potential of off-the-shelf Vision-Language Models (VLMs) for high-level robot planning in the context of autonomous navigation. While most existing learning-based approaches for path planning require extensive task-specific training/fine-tuning, we demonstrate how such training can be avoided for most practical cases. To do this, we introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning which completely eliminates the need for fine-tuning or specialised training. By leveraging structured Visual Question-Answering (VQA) and In-Context Learning (ICL), our approach drastically reduces the need for data collection, requiring a fraction of the task-specific data typically used by trained models, or even relying only on online data. Our method facilitates the effective use of a generally trained VLM in a flexible and cost-efficient way, and does not require additional sensing except for a simple monocular camera. We demonstrate its adaptability across various scene types, context sources, and sensing setups. We evaluate our approach in two distinct scenarios: traditional First-Person View (FPV) and infrastructure-driven Third-Person View (TPV) navigation, demonstrating the flexibility and simplicity of our method. Our technique significantly enhances the navigational capabilities of a baseline VLM by approximately 50% in the TPV scenario, and is comparable to trained models in the FPV one, with as few as 20 demonstrations.
Yukun Cao, Zengyi Gao, Zhiyang Li, Xike Xie, Kevin Zhou, Jianliang Xu
GraphRAG integrates (knowledge) graphs with large language models (LLMs) to improve reasoning accuracy and contextual relevance. Despite its promising applications and strong relevance to multiple research communities, such as databases and natural language processing, GraphRAG currently lacks modular workflow analysis, systematic solution frameworks, and insightful empirical studies. To bridge these gaps, we propose LEGO-GraphRAG, a modular framework that enables: 1) fine-grained decomposition of the GraphRAG workflow, 2) systematic classification of existing techniques and implemented GraphRAG instances, and 3) creation of new GraphRAG instances. Our framework facilitates comprehensive empirical studies of GraphRAG on large-scale real-world graphs and diverse query sets, revealing insights into balancing reasoning quality, runtime efficiency, and token or GPU cost, that are essential for building advanced GraphRAG systems.
Wanying Ding, Manoj Cherukumalli, Santosh Chikoti, Vinay K. Chaudhri
Knowledge graphs have gained popularity for their ability to organize and
analyze complex data effectively. When combined with graph embedding
techniques, such as graph neural networks (GNNs), knowledge graphs become a
potent tool in providing valuable insights. This study explores the application
of graph embedding in identifying competitors from a financial knowledge graph.
Existing state-of-the-art (SOTA) models face challenges due to the unique
attributes of our knowledge graph, including directed and undirected
relationships, attributed nodes, and minimal annotated competitor connections.
To address these challenges, we propose a novel graph embedding model,
JPEC (JPMorgan Proximity Embedding for Competitor Detection), which utilizes a
graph neural network to learn from both first-order and second-order node
proximity together with vital features for competitor retrieval. JPEC
outperformed most existing models in extensive experiments, showcasing its
effectiveness in competitor retrieval.
Authors' comments: 5 pages, 4 figures, accepted by SIGIR'24
Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen
Retrieval-Augmented Generation (RAG) has been shown to improve knowledge
capabilities and alleviate the hallucination problem of LLMs. The Web is a
major source of external knowledge used in RAG systems, and many commercial RAG
systems have used Web search engines as their major retrieval systems.
Typically, such RAG systems retrieve search results, download HTML sources of
the results, and then extract plain texts from the HTML sources. Plain text
documents or chunks are fed into the LLMs to augment the generation. However,
much of the structural and semantic information inherent in HTML, such as
headings and table structures, is lost during this plain-text-based RAG
process. To alleviate this problem, we propose HtmlRAG, which uses HTML instead
of plain text as the format of retrieved knowledge in RAG. We believe HTML is
better than plain text in modeling knowledge in external documents, and most
LLMs possess robust capacities to understand HTML. However, utilizing HTML
presents new challenges. HTML contains additional content such as tags,
JavaScript, and CSS specifications, which bring extra input tokens and noise to
the RAG system. To address this issue, we propose HTML cleaning, compression,
and a two-step block-tree-based pruning strategy, to shorten the HTML while
minimizing the loss of information. Experiments on six QA datasets confirm the
superiority of using HTML in RAG systems.
Authors' comments: Accepted by WWW 2025 main conference. Repo:
https://github.com/plageon/HtmlRAG
Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng et al.
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefine fixed retrieval processes, which causes two issues: (1) Non-adaptive Retrieval Queries. (2) Overloaded Retrieval Queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions, which require complex knowledge retrieval strategies variable in query, tool, and time: (1) Questions with rapidly changing answers. (2) Questions requiring multi-modal knowledge. (3) Multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate human question-solving behavior by dynamically decomposing complex multimodal questions into sub-question chains with retrieval actions. Extensive experiments demonstrate the effectiveness of OmniSearch and also provide direction for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch.
Xin Wen, Xuening Zhu, Renjiao Yi, Zhifeng Wang, Chenyang Zhu, Kai Xu
Reconstructing from multi-view images is a longstanding problem in 3D vision,
where neural radiance fields (NeRFs) have shown great potential and produce
realistic rendered images of novel views. Currently, most NeRF methods either
require accurate camera poses or a large number of input images, or even both.
Reconstructing NeRF from few-view images without poses is challenging and
highly ill-posed. To address this problem, we propose CAD-NeRF, a method
that reconstructs from fewer than 10 images without any known poses. Specifically,
we build a mini library of several CAD models from ShapeNet and render them
from many random views. Given sparse-view input images, we run a model and pose
retrieval from the library, to get a model with similar shapes, serving as the
density supervision and pose initializations. Here we propose a multi-view pose
retrieval method to avoid pose conflicts among views, which is a new and unseen
problem in uncalibrated NeRF methods. Then, the geometry of the object is
trained by the CAD guidance. The deformation of the density field and camera
poses are optimized jointly. Then texture and density are trained and
fine-tuned as well. All training phases are self-supervised.
Comprehensive evaluations on synthetic and real images show that CAD-NeRF
successfully learns accurate densities with a large deformation from retrieved
CAD models, showing the generalization abilities.
Authors' comments: The article has been accepted by Frontiers of Computer Science (FCS)
Nouf Alabbasi, Omar Erak, Omar Alhussein, Ismail Lotfi, Sami Muhaidat, Merouane Debbah
The telecommunications industry's rapid evolution demands intelligent systems capable of managing complex networks and adapting to emerging technologies. While large language models (LLMs) show promise in addressing these challenges, their deployment in telecom environments faces significant constraints due to edge device limitations and inconsistent documentation. To bridge this gap, we present TeleOracle, a telecom-specialized retrieval-augmented generation (RAG) system built on the Phi-2 small language model (SLM). To improve context retrieval, TeleOracle employs a two-stage retriever that incorporates semantic chunking and hybrid keyword and semantic search. Additionally, we expand the context window during inference to enhance the model's performance on open-ended queries. We also employ low-rank adaptation for efficient fine-tuning. A thorough analysis of the model's performance indicates that our RAG framework is effective in aligning Phi-2 to the telecom domain in a downstream question and answer (QnA) task, achieving a 30% improvement in accuracy over the base Phi-2 model, reaching an overall accuracy of 81.20%. Notably, we show that our model not only performs on par with much larger LLMs but also achieves a higher faithfulness score, indicating higher adherence to the retrieved context.
Qikai Wei, Mingzhi Yang, Chunlong Han, Jingfu Wei, Minghao Zhang, Feifei Shi, Huansheng Ning
Retrieval-Augmented Generation (RAG) mitigates the issue of hallucination in Large Language Models (LLMs) by integrating information retrieval techniques. However, in the tourism domain, since the query is usually brief and the content in the database is diverse, existing RAG may return a significant amount of irrelevant or contradictory content after retrieval. To address this challenge, we propose the QCG-Rerank model. This model first performs an initial retrieval to obtain candidate chunks and then enhances semantics by extracting critical information to expand the original query. Next, we utilize the expanded query and candidate chunks to calculate similarity scores as the initial transition probability and construct the chunk graph. Subsequently, we iteratively compute the transition probabilities based on an initial estimate until convergence. The chunks with the highest scores are selected and input into the LLMs to generate responses. We evaluate the model on the Cultour, IIRC, StrategyQA, HotpotQA, SQuAD, and MuSiQue datasets. The experimental results demonstrate the effectiveness and superiority of the QCG-Rerank method.
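The iterate-until-convergence step in the abstract above can be illustrated with a small random-walk sketch. This is not the authors' implementation: the damping factor, the row-normalisation of chunk-to-chunk similarities, and the names `rerank_chunks` and `top_k` are all assumptions made for illustration, in the style of personalised PageRank.

```python
def rerank_chunks(query_sim, chunk_sim, damping=0.85, tol=1e-9, max_iter=1000):
    """Score chunks with a random walk over a chunk graph (illustrative only).

    query_sim: similarity of each candidate chunk to the (expanded) query.
    chunk_sim: square matrix of chunk-to-chunk similarities (graph edge weights).
    damping:   assumed restart parameter, not taken from the abstract.
    """
    n = len(query_sim)
    total = sum(query_sim)
    # Initial estimate: query similarities normalised into a distribution.
    prior = [s / total for s in query_sim] if total > 0 else [1.0 / n] * n
    # Row-normalise edge weights into transition probabilities.
    trans = []
    for row in chunk_sim:
        row_sum = sum(row)
        trans.append([v / row_sum for v in row] if row_sum > 0 else prior[:])
    scores = prior[:]
    for _ in range(max_iter):
        new = [
            (1 - damping) * prior[j]
            + damping * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:  # stop once the scores no longer change
            break
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring chunks, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Because the prior and every transition row sum to one, the scores remain a probability distribution at each iteration, so they can be compared directly when choosing which chunks to pass to the LLM.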
Zijun Min, Bingshuai Liu, Liang Zhang, Jia Song, Jinsong Su, Song He, Xiaochen Bo
The field of bioinformatics has seen significant progress, making the
cross-modal text-molecule retrieval task increasingly vital. This task focuses
on accurately retrieving molecule structures based on textual descriptions, by
effectively aligning textual descriptions and molecules to assist researchers
in identifying suitable molecular candidates. However, many existing approaches
overlook the details inherent in molecule sub-structures. In this work, we
introduce the Optimal TRansport-based Multi-grained Alignments model (ORMA), a
novel approach that facilitates multi-grained alignments between textual
descriptions and molecules. Our model features a text encoder and a molecule
encoder. The text encoder processes textual descriptions to generate both
token-level and sentence-level representations, while molecules are modeled as
hierarchical heterogeneous graphs, encompassing atom, motif, and molecule nodes
to extract representations at these three levels. A key innovation in ORMA is
the application of Optimal Transport (OT) to align tokens with motifs, creating
multi-token representations that integrate multiple token alignments with their
corresponding motifs. Additionally, we employ contrastive learning to refine
cross-modal alignments at three distinct scales: token-atom, multitoken-motif,
and sentence-molecule, ensuring that the similarities between correctly matched
text-molecule pairs are maximized while those of unmatched pairs are minimized.
To our knowledge, this is the first attempt to explore alignments at both the
motif and multi-token levels. Experimental results on the ChEBI-20 and PCdes
datasets demonstrate that ORMA significantly outperforms existing
state-of-the-art (SOTA) models.
Authors' comments: BIBM 2024 Regular Paper
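As a rough illustration of how optimal transport can softly align tokens with motifs, the sketch below runs Sinkhorn iterations on a token-by-motif cost matrix. The abstract does not say which OT solver, cost function, or marginals ORMA uses, so the entropic regularisation, the uniform marginals, and the name `sinkhorn` are assumptions.

```python
import math


def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropic-regularised OT plan between rows (tokens) and columns (motifs).

    cost:   cost[i][j] is the assumed cost of matching token i to motif j,
            e.g. one minus a cosine similarity between their embeddings.
    reg:    entropic regularisation strength (assumed value).
    Returns a transport plan whose entries sum to 1 under uniform marginals.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on tokens (assumption)
    b = [1.0 / m] * m  # uniform mass on motifs (assumption)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each column of the returned plan says how much of every token is assigned to that motif, which is one way a group of tokens could be pooled into a single multi-token representation per motif.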
Ruotong Wang, Xinyi Zhou, Lin Qiu, Joseph Chee Chang, Jonathan Bragg, Amy X. Zhang
AI agents are increasingly tasked with making proactive suggestions in online
spaces where groups collaborate, yet risk being unhelpful or even annoying if
they fail to match group preferences or behave in socially inappropriate ways.
Fortunately, group spaces have a rich history of prior interactions and
affordances for social feedback that can support grounding an agent's
generations to a group's interests and norms. We present Social-RAG, a workflow
for socially grounding agents that retrieves context from prior group
interactions, selects relevant social signals, and feeds them into a language
model to generate messages in a socially aligned manner. We implement this in
PaperPing, a system for posting paper recommendations in group chat,
leveraging social signals determined from formative studies with 39
researchers. From a three-month deployment in 18 channels reaching 500+
researchers, we observed PaperPing posted relevant messages in groups without
disrupting their existing social practices, fostering group common ground.
Authors' comments: To appear at CHI2025
MD Shaikh Rahman, Feiroz Humayara, Syed Maudud E Rabbi, Muhammad Mahbubur Rashid
The datasets used in today's research are especially vast in the
medical field. Different types of medical images, such as X-rays, MRI, and CT
scans, take up large amounts of space. This volume of data introduces challenges
like accessing and retrieving specific images due to the size of the database.
An efficient image retrieval system is essential to save time and resources as
the database continues to grow. In this paper, we propose an approach to
medical image retrieval using DenseNet for feature extraction and FAISS for
similarity search. DenseNet is well-suited for feature extraction in complex
medical images and FAISS enables efficient handling of high-dimensional data in
large-scale datasets. Unlike existing methods focused solely on classification
accuracy, our method prioritizes both retrieval speed and diagnostic relevance,
addressing a critical gap in real-time case comparison for radiologists. We
applied it to the classification of breast cancer images using the BIRADS
system. We utilized DenseNet's powerful feature representation and FAISS's efficient
indexing capabilities to achieve high precision and recall in retrieving
relevant images for diagnosis. We experimented on a dataset of 2006 images from
the Categorized Digital Database for Low Energy and Subtracted Contrast
Enhanced Spectral Mammography (CDD-CESM) images available on The Cancer Imaging
Archive (TCIA). Our method outperforms conventional retrieval techniques,
achieving a precision of 80% at k=5 for BIRADS classification. The dataset
includes annotated CESM images and medical reports, providing a comprehensive
foundation for our research.
Authors' comments: 34 pages, 5 figures
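To make the "precision at k" figure in the abstract above concrete, here is a minimal sketch of similarity search followed by a precision@k computation. It does not use DenseNet or FAISS: the feature vectors are toy values, the brute-force cosine search merely stands in for an index lookup, and the function names are hypothetical. It also assumes that a retrieved image counts as relevant when it shares the query's BIRADS category, which the abstract does not spell out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, index_vecs, k):
    """Brute-force top-k search; a stand-in for a vector index lookup."""
    ranked = sorted(
        range(len(index_vecs)),
        key=lambda i: cosine(query_vec, index_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def precision_at_k(retrieved_labels, query_label):
    """Fraction of retrieved items whose label matches the query's label."""
    if not retrieved_labels:
        return 0.0
    hits = sum(1 for label in retrieved_labels if label == query_label)
    return hits / len(retrieved_labels)
```

Under this reading, a precision of 80% at k=5 means that, on average, four of the five images returned for a query carry the same category as the query image.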
Hithesh Sankararaman, Mohammed Nasheed Yasin, Tanner Sorensen, Alessandro Di Bari, Andreas Stolcke
We present a light-weight approach for detecting nonfactual outputs from
retrieval-augmented generation (RAG). Given a context and putative output, we
compute a factuality score that can be thresholded to yield a binary decision
to check the results of LLM-based question-answering, summarization, or other
systems. Unlike factuality checkers that themselves rely on LLMs, we use
compact, open-source natural language inference (NLI) models that yield a
freely accessible solution with low latency and low cost at run-time, and no
need for LLM fine-tuning. The approach also enables downstream mitigation and
correction of hallucinations, by tracing them back to specific context chunks.
Our experiments show high area under the ROC curve (AUC) across a wide range of
relevant open source datasets, indicating the effectiveness of our method for
fact-checking RAG output.
Authors' comments: To appear in Proceedings of EMNLP 2024 Industry Track
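The two evaluation ideas in the abstract above, thresholding a factuality score into a binary decision and summarising score quality with the area under the ROC curve, can be sketched as follows. The scores here would come from an NLI model in the described system; in this sketch they are plain numbers, and the function names and the default threshold of 0.5 are assumptions.

```python
def is_factual(score, threshold=0.5):
    """Binary decision from a factuality score (threshold is an assumed value)."""
    return score >= threshold


def roc_auc(scores, labels):
    """AUC as the probability that a factual output outscores a nonfactual one.

    scores: factuality scores, higher meaning better supported by the context.
    labels: truthy for factual outputs, falsy for nonfactual ones.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one factual and one nonfactual example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

AUC is independent of any particular threshold, which is why it is a convenient way to compare scoring methods before an operating point is chosen.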
Yun Jiang, Zilong Xie, Wei Zhang, Yun Fang, Shuai Pan
Retrieval-augmented generation methods often neglect the quality of content
retrieved from external knowledge bases, resulting in irrelevant information or
potential misinformation that negatively affects the generation results of
large language models. In this paper, we propose an end-to-end model with
adaptive filtering for retrieval-augmented generation (E2E-AFG), which
integrates answer existence judgment and text generation into a single
end-to-end framework. This enables the model to focus more effectively on
relevant content while reducing the influence of irrelevant information and
generating accurate answers. We evaluate E2E-AFG on six representative
knowledge-intensive language datasets, and the results show that it
consistently outperforms baseline models across all tasks, demonstrating the
effectiveness and robustness of the proposed approach.
Authors' comments: 13 pages, 3 figures, 5 tables
Meghan Booker, Grayson Byrd, Bethany Kemp, Aurora Schmidt, Corban Rivera
Recent advances in Large Language Models (LLMs) have helped facilitate exciting progress for robotic planning in real, open-world environments. 3D scene graphs (3DSGs) offer a promising environment representation for grounding such LLM-based planners as they are compact and semantically rich. However, as the robot's environment scales (e.g., number of entities tracked) and the complexity of scene graph information increases (e.g., maintaining more attributes), providing the 3DSG as-is to an LLM-based planner quickly becomes infeasible due to input token count limits and attentional biases present in LLMs. Inspired by the successes of Retrieval-Augmented Generation (RAG) methods that retrieve query-relevant document chunks for LLM question answering, we adapt the paradigm for our embodied domain. Specifically, we propose a 3D scene subgraph retrieval framework, called EmbodiedRAG, with which we augment an LLM-based planner for executing natural language robotic tasks. Notably, our retrieved subgraphs adapt to changes in the environment as well as changes in task-relevancy as the robot executes its plan. We demonstrate EmbodiedRAG's ability to significantly reduce input token counts (by an order of magnitude) and planning time (up to 70% reduction in average time per planning step) while improving success rates on AI2Thor simulated household tasks with a single-arm, mobile manipulator. Additionally, we implement EmbodiedRAG on a quadruped with a manipulator to highlight the performance benefits for robot deployment at the edge in real environments.
Haiwen Li, Fei Su, Zhicheng Zhao
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, the dependence on costly, manually-labeled triplets limits its scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) is presented along with projection-based approaches. However, such methods face two major problems, i.e., task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image+text $\rightarrow$ image), and modality discrepancy. The latter pertains to approaches based on text-only projection training due to the necessity of feature extraction from the reference image during inference. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, where large language models (LLMs) generate triplet data for fine-tuning, and additionally, prompt learning is introduced in a multi-modal context to effectively alleviate both modality and task discrepancies. The experimental results show that our MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost. The code will be released soon.
Leonardo Ranaldi, Marco Valentino, Andrè Freitas
Retrieval-augmented generation (RAG) has emerged as a critical mechanism in contemporary NLP to support Large Language Models (LLMs) in systematically accessing richer factual context. However, the integration of RAG mechanisms brings its own challenges, as LLMs need to deal with potentially noisy contexts. Recent studies have shown that LLMs still struggle to critically analyse RAG-based in-context information, a limitation that may lead to incorrect inferences and hallucinations. In this paper, we investigate how to elicit critical reasoning in RAG via contrastive explanations. In particular, we propose Contrastive-RAG (C-RAG), a framework that (i) retrieves relevant documents given a query, (ii) selects and exemplifies relevant passages, and (iii) generates explanations that explicitly contrast the relevance of the passages to (iv) support the final answer. We show the impact of C-RAG building contrastive reasoning demonstrations from LLMs to instruct smaller models for retrieval-augmented tasks. Extensive experiments demonstrate that C-RAG improves state-of-the-art RAG models while (a) requiring significantly fewer prompts and demonstrations and (b) being robust to perturbations in the retrieved documents.