Heydar Soudani, Hamed Zamani, Faegheh Hasibi
Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of systems' outputs. These methods, however, often handle simple queries with no retrieval or single-step retrieval, without properly handling RAR setup. Accurate estimation of UQ for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and generation. In this paper, we account for all these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C)--a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to the state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness for two downstream tasks: in Abstention, it achieves ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves the exact match by ~7% over single models and ~3% over selection methods.
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~ 1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
Meiru Zhang, Philipp Borchert, Milan Gritta, Gerasimos Lampouras
Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal mathematical statements are often complex and offer limited context on the underlying math concepts. To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable ''sub-components''. This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 37.14% and 42.25% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model's capabilities.
Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Xiang Chen
To effectively leverage user-specific data, retrieval augmented generation
(RAG) is employed in multimodal large language model (MLLM) applications.
However, conventional retrieval approaches often suffer from limited retrieval
accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by
decomposing queries and matching against segmented images. They still suffer
from sub-optimal accuracy and efficiency, overlooking alignment between the
query and varying image objects and redundant fine-grained image segments. In
this work, we present an efficient scheduling framework for image retrieval -
HiMIR. First, we introduce a novel hierarchical paradigm, employing multiple
intermediate granularities for varying image objects to enhance alignment.
Second, we minimize redundancy in retrieval by leveraging cross-hierarchy
similarity consistency and hierarchy sparsity to minimize unnecessary matching
computation. Furthermore, we configure parameters for each dataset
automatically for practicality across diverse scenarios. Our empirical study
shows that, HiMIR not only achieves substantial accuracy improvements but also
reduces computation by up to 3.5 times over the existing MVR system.
Authors' comments: Under Review
Arkadeep Acharya, Akash Ghosh, Pradeepika Verma, Kitsuchart Pasupa, Sriparna Saha, Priti Singh
With the increasing use of RetrievalAugmented Generation (RAG), strong
retrieval models have become more important than ever. In healthcare,
multimodal retrieval models that combine information from both text and images
offer major advantages for many downstream tasks such as question answering,
cross-modal retrieval, and multimodal summarization, since medical data often
includes both formats. However, there is currently no standard benchmark to
evaluate how well these models perform in medical settings. To address this
gap, we introduce M3Retrieve, a Multimodal Medical Retrieval Benchmark.
M3Retrieve, spans 5 domains,16 medical fields, and 4 distinct tasks, with over
1.2 Million text documents and 164K multimodal queries, all collected under
approved licenses. We evaluate leading multimodal retrieval models on this
benchmark to explore the challenges specific to different medical specialities
and to understand their impact on retrieval performance. By releasing
M3Retrieve, we aim to enable systematic evaluation, foster model innovation,
and accelerate research toward building more capable and reliable multimodal
retrieval systems for medical applications. The dataset and the baselines code
are available in this github page https://github.com/AkashGhosh/M3Retrieve.
Authors' comments: EMNLP Mains 2025
Yu-Fei Shih, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen
People often struggle to remember specific details of past experiences, which can lead to the need to revisit these memories. Consequently, lifelog retrieval has emerged as a crucial application. Various studies have explored methods to facilitate rapid access to personal lifelogs for memory recall assistance. In this paper, we propose a Captioning-Integrated Visual Lifelog (CIVIL) Retrieval System for extracting specific images from a user's visual lifelog based on textual queries. Unlike traditional embedding-based methods, our system first generates captions for visual lifelogs and then utilizes a text embedding model to project both the captions and user queries into a shared vector space. Visual lifelogs, captured through wearable cameras, provide a first-person viewpoint, necessitating the interpretation of the activities of the individual behind the camera rather than merely describing the scene. To address this, we introduce three distinct approaches: the single caption method, the collective caption method, and the merged caption method, each designed to interpret the life experiences of lifeloggers. Experimental results show that our method effectively describes first-person visual images, enhancing the outcomes of lifelog retrieval. Furthermore, we construct a textual dataset that converts visual lifelogs into captions, thereby reconstructing personal life experiences.
Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse
Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late interaction centered contrastive objectives which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.
Leopold Müller, Joshua Holstein, Sarah Bause, Gerhard Satzger, Niklas Kühl
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to
enhance Large Language Models with enterprise-specific knowledge. However,
current data quality (DQ) frameworks have been primarily developed for static
datasets, and only inadequately address the dynamic, multi-stage nature of RAG
systems. This study aims to develop DQ dimensions for this new type of AI-based
systems. We conduct 16 semi-structured interviews with practitioners of leading
IT service companies. Through a qualitative content analysis, we inductively
derive 15 distinct DQ dimensions across the four processing stages of RAG
systems: data extraction, data transformation, prompt & search, and generation.
Our findings reveal that (1) new dimensions have to be added to traditional DQ
frameworks to also cover RAG contexts; (2) these new dimensions are
concentrated in early RAG steps, suggesting the need for front-loaded quality
management strategies, and (3) DQ issues transform and propagate through the
RAG pipeline, necessitating a dynamic, step-aware approach to quality
management.
Authors' comments: Preprint version. Accepted for presentation at the International
Conference on Information Systems (ICIS 2025). Please cite the published
version when available
Xiaoyu Song, William Han, Tony Chen, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Interest in generative Electrocardiogram-Language Models (ELMs) is growing,
as they can produce textual responses conditioned on ECG signals and textual
queries. Unlike traditional classifiers that output label probabilities, ELMs
are more versatile, supporting domain-specific tasks (e.g., waveform analysis,
diagnosis, prognosis) as well as general tasks (e.g., open-ended questions,
dialogue). Retrieval-Augmented Generation (RAG), widely used in Large Language
Models (LLMs) to ground LLM outputs in retrieved knowledge, helps reduce
hallucinations and improve natural language generation (NLG). However, despite
its promise, no open-source implementation or systematic study of RAG pipeline
design for ELMs currently exists. To address this gap, we present the first
open-source RAG pipeline for ELMs, along with baselines and ablation studies
for NLG. Experiments on three public datasets show that ELMs with RAG
consistently improves performance over non-RAG baselines and highlights key ELM
design considerations. Our code is available at:
https://github.com/willxxy/ECG-Bench.
Authors' comments: 5 pages, 2 figures; Submitted to ICASSP 2026
Victor Sechaud, Patrice Abry, Laurent Jacques, Julián Tachella
In recent years, deep neural networks have emerged as a solution for inverse
imaging problems. These networks are generally trained using pairs of images:
one degraded and the other of high quality, the latter being called 'ground
truth'. However, in medical and scientific imaging, the lack of fully sampled
data limits supervised learning. Recent advances have made it possible to
reconstruct images from measurement data alone, eliminating the need for
references. However, these methods remain limited to linear problems, excluding
non-linear problems such as phase retrieval. We propose a self-supervised
method that overcomes this limitation in the case of phase retrieval by using
the natural invariance of images to translations.
Authors' comments: in French language. GRETSI, Aug 2025, Strasboug, France
Jungsoo Lee, Janghoon Cho, Hyojin Park, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi
Despite their consistent performance improvements, cross-modal retrieval
models (e.g., CLIP) show degraded performances with retrieving keys composed of
fused image-text modality (e.g., Wikipedia pages with both images and text). To
address this critical challenge, multimodal retrieval has been recently
explored to develop a unified single retrieval model capable of retrieving keys
across diverse modality combinations. A common approach involves constructing
new composed sets of image-text triplets (e.g., retrieving a pair of image and
text given a query image). However, such an approach requires careful curation
to ensure the dataset quality and fails to generalize to unseen modality
combinations. To overcome these limitations, this paper proposes Generalized
Contrastive Learning (GCL), a novel loss formulation that improves multimodal
retrieval performance without the burdensome need for new dataset curation.
Specifically, GCL operates by enforcing contrastive learning across all
modalities within a mini-batch, utilizing existing image-caption paired
datasets to learn a unified representation space. We demonstrate the
effectiveness of GCL by showing consistent performance improvements on
off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP)
using the M-BEIR, MMEB, and CoVR benchmarks.
Authors' comments: Accepted to NeurIPS 2025
Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu
GUI agents powered by vision-language models (VLMs) show promise in
automating complex digital tasks. However, their effectiveness in real-world
applications is often limited by scarce training data and the inherent
complexity of these tasks, which frequently require long-tailed knowledge
covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that
leverages web tutorials at inference time. RAG-GUI is first warm-started via
supervised finetuning (SFT) and further refined through self-guided rejection
sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as
a generic plug-in that enhances any VLM-based agent. Evaluated across three
distinct tasks, it consistently outperforms baseline agents and surpasses other
inference baselines by 2.6% to 13.3% across two model sizes, demonstrating
strong generalization and practical plug-and-play capabilities in real-world
scenarios.
Authors' comments: Accepted to EMNLP 2025 (Main Conference)
Yaxiong Wu, Jianyuan Bo, Yongyue Zhang, Sheng Liang, Yong Liu
Graph-based retrieval-augmented generation (RAG) enriches large language
models (LLMs) with external knowledge for long-context understanding and
multi-hop reasoning, but existing methods face a granularity dilemma:
fine-grained entity-level graphs incur high token costs and lose context, while
coarse document-level graphs fail to capture nuanced relations. We introduce
QCG-RAG, a query-centric graph RAG framework that enables query-granular
indexing and multi-hop chunk retrieval. Our query-centric approach leverages
Doc2Query and Doc2Query{-}{-} to construct query-centric graphs with
controllable granularity, improving graph quality and interpretability. A
tailored multi-hop retrieval mechanism then selects relevant chunks via the
generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG
consistently outperforms prior chunk-based and graph-based RAG methods in
question answering accuracy, establishing a new paradigm for multi-hop
reasoning.
Authors' comments: 25 pages, 6 figures, 1 table
Renxiang Wang, Li Zhang
Certain strong LLMs have shown promise for zero-shot formal planning by
generating planning languages like PDDL. Yet, performance of most open-source
models under 50B parameters has been reported to be close to zero due to the
low-resource nature of these languages. We significantly improve their
performance via a series of lightweight pipelines that integrates documentation
retrieval with modular code generation and error refinement. With models like
Llama-4-Maverick, our best pipeline improves plan correctness from 0\% to over
80\% on the common BlocksWorld domain. However, while syntactic errors are
substantially reduced, semantic errors persist in more challenging domains,
revealing fundamental limitations in current models' reasoning
capabilities.\footnote{Our code and data can be found at
https://github.com/Nangxxxxx/PDDL-RAG
Authors' comments: 12 pages, 14 figures, 1 table
Seunghan Yang, Juntae Lee, Jihwan Bang, Kyuhong Shim, Minsoo Kim, Simyung Chang
Effective conversational search demands a deep understanding of user intent
across multiple dialogue turns. Users frequently use abbreviations and shift
topics in the middle of conversations, posing challenges for conventional
retrievers. While query rewriting techniques improve clarity, they often incur
significant computational cost due to additional autoregressive steps.
Moreover, although LLM-based retrievers demonstrate strong performance, they
are not explicitly optimized to track user intent in multi-turn settings, often
failing under topic drift or contextual ambiguity. To address these
limitations, we propose ContextualRetriever, a novel LLM-based retriever that
directly incorporates conversational context into the retrieval process. Our
approach introduces: (1) a context-aware embedding mechanism that highlights
the current query within the dialogue history; (2) intent-guided supervision
based on high-quality rewritten queries; and (3) a training strategy that
preserves the generative capabilities of the base LLM. Extensive evaluations
across multiple conversational search benchmarks demonstrate that
ContextualRetriever significantly outperforms existing methods while incurring
no additional inference overhead.
Authors' comments: EMNLP 2025 main conference
Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze
Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques -- like prompting or multi-hop retrieval -- that can help resolve complexity. In contrast, we present Impliret, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving "two days ago"), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 14.91%. We also test whether long-context models can overcome this limitation. But even with a short context of only thirty documents, including the positive document, GPT-o4-mini scores only 55.54%, showing that document-side reasoning remains a challenge. Our codes are available at github.com/ZeinabTaghavi/IMPLIRET.
Changjiang Zhou, Ruqing Zhang, Jiafeng Guo, Yu-An Liu, Fan Zhang, Ganyuan Luo, Xueqi Cheng
Formulating information retrieval as a variant of generative modeling,
specifically using autoregressive models to generate relevant identifiers for a
given query, has recently attracted considerable attention. However, its
application to personalized sticker retrieval remains largely unexplored and
presents unique challenges: existing relevance-based generative retrieval
methods typically lack personalization, leading to a mismatch between diverse
user expectations and the retrieved results. To address this gap, we propose
PEARL, a novel generative framework for personalized sticker retrieval, and
make two key contributions: (i) To encode user-specific sticker preferences, we
design a representation learning model to learn discriminative user
representations. It is trained on three prediction tasks that leverage personal
information and click history; and (ii) To generate stickers aligned with a
user's query intent, we propose a novel intent-aware learning objective that
prioritizes stickers associated with higher-ranked intents. Empirical results
from both offline evaluations and online tests demonstrate that PEARL
significantly outperforms state-of-the-art methods.
Authors' comments: Findings of EMNLP2025
Stanislas Ducotterd, Zhiyuan Hu, Michael Unser, Jonathan Dong
Phase retrieval seeks to recover a complex signal from amplitude-only measurements, a challenging nonlinear inverse problem. Current theory and algorithms often ignore signal priors. By contrast, we evaluate here a variety of image priors in the context of severe undersampling with structured random Fourier measurements. Our results show that those priors significantly improve reconstruction, allowing accurate reconstruction even below the weak recovery threshold.
Letian Zhang, Guanghao Meng, Xudong Ren, Yiming Wang, Shu-Tao Xia
Retrieval-augmented Generation (RAG) is a prevalent approach for domain-specific LLMs, yet it is often plagued by "Retrieval Hallucinations"--a phenomenon where fine-tuned models fail to recognize and act upon poor-quality retrieved documents, thus undermining performance. To address this, we propose the Adversarial Collaboration RAG (AC-RAG) framework. AC-RAG employs two heterogeneous agents: a generalist Detector that identifies knowledge gaps, and a domain-specialized Resolver that provides precise solutions. Guided by a moderator, these agents engage in an adversarial collaboration, where the Detector's persistent questioning challenges the Resolver's expertise. This dynamic process allows for iterative problem dissection and refined knowledge retrieval. Extensive experiments show that AC-RAG significantly improves retrieval accuracy and outperforms state-of-the-art RAG methods across various vertical domains.
Ying Li, Mengyu Wang, Miguel de Carvalho, Sotirios Sabanis, Tiejun Ma
Financial disclosures such as 10-K filings present challenging retrieval problems due to their length, regulatory section hierarchy, and domain-specific language, which standard retrieval-augmented generation (RAG) models underuse. We introduce FinGEAR (Financial Mapping-Guided Enhanced Answer Retrieval), a retrieval framework tailored to financial documents. FinGEAR combines a finance lexicon for Item-level guidance (FLAM), dual hierarchical indices for within-Item search (Summary Tree and Question Tree), and a two-stage cross-encoder reranker. This design aligns retrieval with disclosure structure and terminology, enabling fine-grained, query-aware context selection. Evaluated on full 10-Ks with queries aligned to the FinQA dataset, FinGEAR delivers consistent gains in precision, recall, F1, and relevancy, improving F1 by up to 56.7% over flat RAG, 12.5% over graph-based RAGs, and 217.6% over prior tree-based systems, while also increasing downstream answer accuracy with a fixed reader. By jointly modeling section hierarchy and domain lexicon signals, FinGEAR improves retrieval fidelity and provides a practical foundation for high-stakes financial analysis.