Yinglu Li, Zhiying Lu, Zhihang Liu, Chuanbin Liu, Hongtao Xie
Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose \modelname, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, \modelname enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. Improves retrieval accuracy by 10.02\% in R@1 on average and increases question answering accuracy by 3.56\% while using only 71.42\% visual tokens compared to prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
Jamie Mahowald, Benjamin Charles Germain Lee
Multimodal approaches have shown great promise for searching and navigating
digital collections held by libraries, archives, and museums. In this paper, we
introduce map-RAS: a retrieval-augmented search system for historic maps. In
addition to introducing our framework, we detail our publicly-hosted demo for
searching 101,233 map images held by the Library of Congress. With our system,
users can multimodally query the map collection via ColPali, summarize search
results using Llama 3.2, and upload their own collections to perform
inter-collection search. We articulate potential use cases for archivists,
curators, and end-users, as well as future work with our system in both machine
learning and the digital humanities. Our demo can be viewed at:
http://www.mapras.com.
Authors' comments: 5 pages, 5 figures
Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, Vrynsia Vrynsia, Koustav Ghosal, Maraim Masoud, Riccardo Mattivi
Retrieval-Augmented Generation (RAG) systems often face limitations in
specialized domains such as fintech, where domain-specific ontologies, dense
terminology, and acronyms complicate effective retrieval and synthesis. This
paper introduces an agentic RAG architecture designed to address these
challenges through a modular pipeline of specialized agents. The proposed
system supports intelligent query reformulation, iterative sub-query
decomposition guided by keyphrase extraction, contextual acronym resolution,
and cross-encoder-based context re-ranking. We evaluate our approach against a
standard RAG baseline using a curated dataset of 85 question--answer--reference
triples derived from an enterprise fintech knowledge base. Experimental results
demonstrate that the agentic RAG system outperforms the baseline in retrieval
precision and relevance, albeit with increased latency. These findings suggest
that structured, multi-agent methodologies offer a promising direction for
enhancing retrieval robustness in complex, domain-specific settings.
Authors' comments: Keywords: RAG Agentic AI Fintech NLP KB Domain-Specific Ontology
Query Understanding
Prathamesh Tamhane
Modern astronomical surveys such as the Sloan Digital Sky Survey (SDSS) provide extensive astronomical databases enabling researchers to access vast amount of diverse data. However, retrieving data from archives requires knowledge of query languages and familiarity with their schema, which presents a barrier for non-experts. This work investigates the use of Microsoft Phi-2, a compact yet powerful transformer-based language model, fine-tuned on natural language--SQL pairs constructed from SDSS query examples. We develop an interface that translates user queries in natural language into SQL commands compatible with SDSS SkyServer. Preliminary evaluation shows that the fine-tuned model produces syntactically valid and largely semantically correct queries across a variety of astronomy-related requests. Our results show that even small-scale models, when carefully fine-tuned, can provide effective domain-specific natural language interfaces for large scientific databases.
Authors' comments: Submitted to the Proceedings of the International Astronomical Union (IAU Symposium 397, ''UniversAI: Exploring the Universe with Artificial Intelligence,'' 2025). 4 pages
Jiawei Zhou, Lei Chen
As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trialand-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
Michail Dadopoulos, Anestis Ladas, Stratos Moschidis, Ioannis Negkakis
Retrieval-Augmented Generation (RAG) struggles on long, structured financial
filings where relevant evidence is sparse and cross-referenced. This paper
presents a systematic investigation of advanced metadata-driven
Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a
novel, multi-stage RAG architecture that leverages LLM-generated metadata. We
introduce a sophisticated indexing pipeline to create contextually rich
document chunks and benchmark a spectrum of enhancements, including
pre-retrieval filtering, post-retrieval reranking, and enriched embeddings,
benchmarked on the FinanceBench dataset. Our results reveal that while a
powerful reranker is essential for precision, the most significant performance
gains come from embedding chunk metadata directly with text ("contextual
chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval
optimizations with these contextual embeddings to achieve superior performance.
Additionally, we present a custom metadata reranker that offers a compelling,
cost-effective alternative to commercial solutions, highlighting a practical
trade-off between peak performance and operational efficiency. This study
provides a blueprint for building robust, metadata-aware RAG systems for
financial document analysis.
Authors' comments: Preprint version submitted to the International Journal of Accounting
Information Systems; currently under major revision. 20 pages, 1 figure, 1
table
Deniz Gorur, Antoni Rago, Francesca Toni
Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
Ziyu Liu, Yijing Liu, Jianfei Yuan, Minzhi Yan, Le Yue, Honghui Xiong, Yi Yang
Graph-based RAG constructs a knowledge graph (KG) from text chunks to enhance retrieval in Large Language Model (LLM)-based question answering. It is especially beneficial in domains such as biomedicine, law, and political science, where effective retrieval often involves multi-hop reasoning over proprietary documents. However, these methods demand numerous LLM calls to extract entities and relations from text chunks, incurring prohibitive costs at scale. Through a carefully designed ablation study, we observe that certain words (termed concepts) and their associated documents are more important. Based on this insight, we propose Graph-Guided Concept Selection (G2ConS). Its core comprises a chunk selection method and an LLM-independent concept graph. The former selects salient document chunks to reduce KG construction costs; the latter closes knowledge gaps introduced by chunk selection at zero cost. Evaluations on multiple real-world datasets show that G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
Hao Jia, Penghao Zhao, Hao Wu, Yuan Gao, Yangyu Tao, Bin Cui
Accurate and long-term spatiotemporal prediction for complex physical systems remains a fundamental challenge in scientific computing. While deep learning models, as powerful parametric approximators, have shown remarkable success, they suffer from a critical limitation: the accumulation of errors during long-term autoregressive rollouts often leads to physically implausible artifacts. This deficiency arises from their purely parametric nature, which struggles to capture the full constraints of a system's intrinsic dynamics. To address this, we introduce a novel \textbf{Retrieval-Augmented Prediction (RAP)} framework, a hybrid paradigm that synergizes the predictive power of deep networks with the grounded truth of historical data. The core philosophy of RAP is to leverage historical evolutionary exemplars as a non-parametric estimate of the system's local dynamics. For any given state, RAP efficiently retrieves the most similar historical analog from a large-scale database. The true future evolution of this analog then serves as a \textbf{reference target}. Critically, this target is not a hard constraint in the loss function but rather a powerful conditional input to a specialized dual-stream architecture. It provides strong \textbf{dynamic guidance}, steering the model's predictions towards physically viable trajectories. In extensive benchmarks across meteorology, turbulence, and fire simulation, RAP not only surpasses state-of-the-art methods but also significantly outperforms a strong \textbf{analog-only forecasting baseline}. More importantly, RAP generates predictions that are more physically realistic by effectively suppressing error divergence in long-term rollouts.
Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme
We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
Authors' comments: https://github.com/alexmartin1722/mirage
Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
Binxiao Xu, Junyu Feng, Ruichuan An, Yulin Luo, Shilin Yan, Hao Liang, Ming Lu, Wentao Zhang
The rapid development of Vision-language models (VLMs) enables open-ended
perception and reasoning. Recent works have started to investigate how to adapt
general-purpose VLMs into personalized assistants. Even commercial models such
as ChatGPT now support model personalization by incorporating user-specific
information. However, existing methods either learn a set of concept tokens or
train a VLM to utilize user-specific information. However, both pipelines
struggle to generate accurate answers as personalized assistants. We introduce
Jarvis, an innovative framework for a personalized AI assistant through
personal KV-Cache retrieval, which stores user-specific information in the
KV-Caches of both textual and visual tokens. The textual tokens are created by
summarizing user information into metadata, while the visual tokens are
produced by extracting distinct image patches from the user's images. When
answering a question, Jarvis first retrieves related KV-Caches from personal
storage and uses them to ensure accuracy in responses. We also introduce a
fine-grained benchmark built with the same distinct image patch mining
pipeline, emphasizing accurate question answering based on fine-grained
user-specific information. Jarvis is capable of providing more accurate
responses, particularly when they depend on specific local details. Jarvis
achieves state-of-the-art results in both visual question answering and
text-only tasks across multiple datasets, indicating a practical path toward
personalized AI assistants. The code and dataset will be released.
Authors' comments: 19 pages, 7 figures
Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta
If-then rules are widely used to explain machine learning models; e.g., "if employed = no, then loan application = rejected." We present the first proposal to apply rules to explain the emerging class of large language models (LLMs) with retrieval-augmented generation (RAG). Since RAG enables LLM systems to incorporate retrieved information sources at inference time, rules linking the presence or absence of sources can explain output provenance; e.g., "if a Times Higher Education ranking article is retrieved, then the LLM ranks Oxford first." To generate such rules, a brute force approach would probe the LLM with all source combinations and check if the presence or absence of any sources leads to the same output. We propose optimizations to speed up rule generation, inspired by Apriori-like pruning from frequent itemset mining but redefined within the scope of our novel problem. We conclude with qualitative and quantitative experiments demonstrating our solutions' value and efficiency.
Xuan Lu, Haohang Huang, Rui Meng, Yaohui Jin, Wenjun Zeng, Xiaoyu Shen
Large Language Models (LLMs) have recently demonstrated strong capabilities in tool use, yet progress in tool retrieval remains hindered by incomplete and heterogeneous tool documentation. To address this challenge, we introduce Tool-DE, a new benchmark and framework that systematically enriches tool documentation with structured fields to enable more effective tool retrieval, together with two dedicated models, Tool-Embed and Tool-Rank. We design a scalable document expansion pipeline that leverages both open- and closed-source LLMs to generate, validate, and refine enriched tool profiles at low cost, producing large-scale corpora with 50k instances for embedding-based retrievers and 200k for rerankers. On top of this data, we develop two models specifically tailored for tool retrieval: Tool-Embed, a dense retriever, and Tool-Rank, an LLM-based reranker. Extensive experiments on ToolRet and Tool-DE demonstrate that document expansion substantially improves retrieval performance, with Tool-Embed and Tool-Rank achieving new state-of-the-art results on both benchmarks. We further analyze the contribution of individual fields to retrieval effectiveness, as well as the broader impact of document expansion on both training and evaluation. Overall, our findings highlight both the promise and limitations of LLM-driven document expansion, positioning Tool-DE, along with the proposed Tool-Embed and Tool-Rank, as a foundation for future research in tool retrieval.
Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli
While Retrieval-Augmented Generation (RAG) mitigates hallucination and
knowledge staleness in Large Language Models (LLMs), existing frameworks often
falter on complex, multi-hop queries that require synthesizing information from
disparate sources. Current advanced RAG methods, employing iterative or
adaptive strategies, lack a robust mechanism to systematically identify and
fill evidence gaps, often propagating noise or failing to gather a
comprehensive context. We introduce FAIR-RAG, a novel agentic framework that
transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning
process. At its core is an Iterative Refinement Cycle governed by a module we
term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating
mechanism: it deconstructs the initial query into a checklist of required
findings and audits the aggregated evidence to identify confirmed facts and,
critically, explicit informational gaps. These gaps provide a precise signal to
an Adaptive Query Refinement agent, which generates new, targeted sub-queries
to retrieve missing information. This cycle repeats until the evidence is
verified as sufficient, ensuring a comprehensive context for a final, strictly
faithful generation. We conducted experiments on challenging multi-hop QA
benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified
experimental setup, FAIR-RAG significantly outperforms strong baselines. On
HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3
points over the strongest iterative baseline -- establishing a new
state-of-the-art for this class of methods on these benchmarks. Our work
demonstrates that a structured, evidence-driven refinement process with
explicit gap analysis is crucial for unlocking reliable and accurate reasoning
in advanced RAG systems for complex, knowledge-intensive tasks.
Authors' comments: 30 pages, 5 figures, 5 tables. Keywords: Retrieval-Augmented
Generation (RAG), Large Language Models (LLMs), Agentic AI, Multi-hop
Question Answering, Faithfulness
Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu
Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.
Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components
Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, Yanfu Zhang
Large language models (LLMs) excel at factual recall yet still propagate
stale or incorrect knowledge. In-context knowledge editing offers a
gradient-free remedy suitable for black-box APIs, but current editors rely on
static demonstration sets chosen by surface-level similarity, leading to two
persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of
adaptivity to task difficulty. We address these issues by dynamically selecting
supporting demonstrations according to their utility for the edit. We propose
Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight
framework that (1) trains a BERT retriever with REINFORCE to rank
demonstrations by editing reward, and (2) employs a learnable threshold to
prune low-value examples, shortening the prompt when the edit is easy and
expanding it when the task is hard. DR-IKE performs editing without modifying
model weights, relying solely on forward passes for compatibility with
black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to
17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries,
demonstrating scalable and adaptive knowledge editing. The code is available at
https://github.com/mwnafee/DR-IKE .
Authors' comments: Accepted at EMNLP 2025. \c{opyright} 2025 Association for
Computational Linguistics (CC BY 4.0)
Julia Wilkins, Jaehun Kim, Matthew E. P. Davies, Juan Pablo Bello, Matthew C. McCallum
Music representations are the backbone of modern recommendation systems,
powering playlist generation, similarity search, and personalized discovery.
Yet most embeddings offer little control for adjusting a single musical
attribute, e.g., changing only the mood of a track while preserving its genre
or instrumentation. In this work, we address the problem of controllable music
retrieval through embedding-based transformation, where the objective is to
retrieve songs that remain similar to a seed track but are modified along one
chosen dimension. We propose a novel framework for mood-guided music embedding
transformation, which learns a mapping from a seed audio embedding to a target
embedding guided by mood labels, while preserving other musical attributes.
Because mood cannot be directly altered in the seed audio, we introduce a
sampling mechanism that retrieves proxy targets to balance diversity with
similarity to the seed. We train a lightweight translation model using this
sampling strategy and introduce a novel joint objective that encourages
transformation and information preservation. Extensive experiments on two
datasets show strong mood transformation performance while retaining genre and
instrumentation far better than training-free baselines, establishing
controllable embedding transformation as a promising paradigm for personalized
music retrieval.
Authors' comments: Preprint; under review
Fangjian Zhang, Xiaoyong Zhuge, Wenlan Wang, Haixia Xiao, Yuying Zhu, Siyang Cheng
Artificial intelligence has advanced quantitative remote sensing, yet its
effectiveness is constrained by imbalanced label distribution. This imbalance
leads conventionally trained models to favor common samples, which in turn
degrades retrieval performance for rare ones. Rainfall retrieval exemplifies
this issue, with performance particularly compromised for heavy rain. This
study proposes Hurdle-Inversion Model Debiasing Learning (IMDL) framework.
Following a divide-and-conquer strategy, imbalance in the rain distribution is
decomposed into two components: zero inflation, defined by the predominance of
non-rain samples; and long tail, defined by the disproportionate abundance of
light-rain samples relative to heavy-rain samples. A hurdle model is adopted to
handle the zero inflation, while IMDL is proposed to address the long tail by
transforming the learning object into an unbiased ideal inverse model.
Comprehensive evaluation via statistical metrics and case studies investigating
rainy weather in eastern China confirms Hurdle-IMDL's superiority over
conventional, cost-sensitive, generative, and multi-task learning methods. Its
key advancements include effective mitigation of systematic underestimation and
a marked improvement in the retrieval of heavy-to-extreme rain. IMDL offers a
generalizable approach for addressing imbalance in distributions of
environmental variables, enabling enhanced retrieval of rare yet high-impact
events.
Authors' comments: 26 pages