Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He et al.
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 that pushes this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthetic data collection. In our experiments, our prover, using only supervised fine-tuning, achieves a competitive 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke
Neural retrieval methods using transformer-based pre-trained language models
have advanced multilingual and cross-lingual retrieval. However, their
effectiveness for low-resource, morphologically rich languages such as Amharic
remains underexplored due to data scarcity and suboptimal tokenization. We
address this gap by introducing Amharic-specific dense retrieval models based
on pre-trained Amharic BERT and RoBERTa backbones. Our proposed
RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative
improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest
multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact
variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while
being over 13x smaller. Additionally, we train a ColBERT-based late interaction
retrieval model that achieves the highest MRR@10 score (0.843) among all
evaluated models. We benchmark our proposed models against both sparse and
dense retrieval baselines to systematically assess retrieval effectiveness in
Amharic. Our analysis highlights key challenges in low-resource settings and
underscores the importance of language-specific adaptation. To foster future
research in low-resource IR, we publicly release our dataset, codebase, and
trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
Authors' comments: 10 pages (excluding references and appendix), 10 figures. Accepted to
ACL 2025 Findings. Public release includes dataset, code, and trained models:
https://github.com/kidist-amde/amharic-ir-benchmarks
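The MRR@10 and Recall@10 figures quoted in this abstract are standard ranking metrics. As a minimal sketch of how such numbers are typically computed (not the authors' evaluation code; the function names and the toy rankings below are hypothetical), one could write:

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break  # only the first relevant hit contributes
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Average fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        gold = relevant[query_id]
        hits = len(gold.intersection(ranked_docs[:k]))
        total += hits / len(gold)
    return total / len(rankings)


# Hypothetical toy data: two queries with ranked document ids and gold labels.
rankings = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
}
relevant = {
    "q1": {"d1"},
    "q2": {"d4", "d8"},
}

mrr = mrr_at_k(rankings, relevant, k=10)
recall = recall_at_k(rankings, relevant, k=10)
print(f"MRR@10 = {mrr:.4f}, Recall@10 = {recall:.4f}")
```

A "relative improvement" such as the 17.6% reported above would then be (new - baseline) / baseline on the chosen metric, assuming the usual definition.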
Zirui Li, Siwei Wu, Xingyu Wang, Yi Zhou, Yizhi Li, Chenghua Lin
The rapid advancement of unsupervised representation learning and large-scale
pre-trained vision-language models has significantly improved cross-modal
retrieval tasks. However, existing multi-modal information retrieval (MMIR)
studies lack a comprehensive exploration of document-level retrieval and suffer
from the absence of cross-domain datasets at this granularity. To address this
limitation, we introduce DocMMIR, a novel multi-modal document retrieval
framework designed explicitly to unify diverse document formats and domains,
including Wikipedia articles, scientific papers (arXiv), and presentation
slides, within a comprehensive retrieval scenario. We construct a large-scale
cross-domain multimodal benchmark, comprising 450K samples, which
systematically integrates textual and visual information. Our comprehensive
experimental analysis reveals substantial limitations in current
state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our
tasks, with only CLIP demonstrating reasonable zero-shot performance.
Furthermore, we conduct a systematic investigation of training strategies,
including cross-modal fusion methods and loss functions, and develop a tailored
approach to train CLIP on our benchmark. This results in a +31% improvement in
MRR@10 compared to the zero-shot baseline. All our data and code are released
at https://github.com/J1mL1/DocMMIR.
Authors' comments: 13 pages, 7 figures. Code and data publicly available at
https://github.com/J1mL1/DocMMIR
Robin D. Pesl, Jerin G. Mathew, Massimo Mecella, Marco Aiello
Integrating multiple (sub-)systems is essential to create advanced
Information Systems. Difficulties mainly arise when integrating dynamic
environments, e.g., the integration at design time of not yet existing
services. This has been traditionally addressed using a registry that provides
the API documentation of the endpoints. Large Language Models have been shown
to be capable of automatically creating system integrations (e.g., as service
composition) based on this documentation but require concise input due to input
token limitations, especially regarding comprehensive API descriptions.
Currently, it is unknown how best to preprocess these API descriptions. In the
present work, we (i) analyze the usage of Retrieval Augmented Generation for
endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice
OpenAPIs to reduce the input token length while preserving the most relevant
information. To further reduce the input token length for the composition
prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that
only receives a summary of the most relevant endpoints and retrieves
specification details on demand. We evaluate RAG for endpoint discovery using
(iii) a proposed novel service discovery benchmark SOCBench-D representing a
general setting across numerous domains and the real-world RestBench benchmark,
first, for the different chunking possibilities and parameters measuring the
endpoint retrieval accuracy. Then, we assess the Discovery Agent using the same
test data set. The prototype shows how to successfully employ RAG for endpoint
discovery to reduce the token count. Our experiments show that endpoint-based
approaches outperform naive chunking methods for preprocessing. Relying on an
agent significantly improves precision while being prone to decrease recall,
disclosing the need for further reasoning capabilities.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2411.19804
Yaoyang Liu, Junlin Li, Yinjun Wu, Zhen Chen
Although Multi-Vector Retrieval (MVR) has achieved the state of the art on
many information retrieval (IR) tasks, its performance highly depends on how to
decompose queries into smaller pieces, say phrases or tokens. However,
optimizing query decomposition for MVR performance is not end-to-end
differentiable. Even worse, jointly solving this problem and training the
downstream retrieval-based systems, say RAG systems, could be highly
inefficient. To overcome these challenges, we propose Performance-Oriented
Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD
leverages one LLM for query decomposition and searches the optimal prompt with
an LLM-based optimizer. We further propose an end-to-end training algorithm to
alternately optimize the prompt for query decomposition and the downstream
models. This algorithm can achieve superior MVR performance at a reasonable
training cost as our theoretical analysis suggests. POQD can be integrated
seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented
Generation (RAG) systems. Extensive empirical studies on representative
RAG-based QA tasks show that POQD outperforms existing query decomposition
strategies in both retrieval performance and end-to-end QA accuracy. POQD is
available at https://github.com/PKU-SDS-lab/POQD-ICML25.
Authors' comments: Published in ICML 2025
Abhijit Chakraborty, Chahana Dahal, Vivek Gupta
Federated Retrieval-Augmented Generation (Federated RAG) combines Federated Learning (FL), which enables distributed model training without exposing raw data, with Retrieval-Augmented Generation (RAG), which improves the factual accuracy of language models by grounding outputs in external knowledge. As large language models are increasingly deployed in privacy-sensitive domains such as healthcare, finance, and personalized assistance, Federated RAG offers a promising framework for secure, knowledge-intensive natural language processing (NLP). To the best of our knowledge, this paper presents the first systematic mapping study of Federated RAG, covering literature published between 2020 and 2025. Following Kitchenham's guidelines for evidence-based software engineering, we develop a structured classification of research focuses, contribution types, and application domains. We analyze architectural patterns, temporal trends, and key challenges, including privacy-preserving retrieval, cross-client heterogeneity, and evaluation limitations. Our findings synthesize a rapidly evolving body of research, identify recurring design patterns, and surface open questions, providing a foundation for future work at the intersection of RAG and federated systems.
Yongjie Wang, Jonathan Leung, Zhiqi Shen
Large Language Models (LLMs) have shown promise in character imitation,
enabling immersive and engaging conversations. However, they often generate
content that is irrelevant or inconsistent with a character's background. We
attribute these failures to: (1) the inability to accurately recall
character-specific knowledge due to entity ambiguity, and (2) a lack of
awareness of the character's cognitive boundaries. To address these issues, we
propose RoleRAG, a retrieval-based framework that integrates efficient entity
disambiguation for knowledge indexing with a boundary-aware retriever for
extracting contextually appropriate information from a structured knowledge
graph. Experiments on role-playing benchmarks show that RoleRAG's calibrated
retrieval helps both general-purpose and role-specific LLMs better align with
character knowledge and reduce hallucinated responses.
Authors' comments: A Retrieval-enhanced LLM Role-playing
Ainulla Khan, Yamada Moyuru, Srinidhi Akella
Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored, especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also further relevant content for answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce the MM-RAG test set, specifically designed for multi-modal question answering tasks that require understanding text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on multi-modal documents.
Hansa Meghwani, Amit Agarwal, Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Srikant Panda
Enterprise search systems often struggle to retrieve accurate,
domain-specific information due to semantic mismatches and overlapping
terminologies. These issues can degrade the performance of downstream
applications such as knowledge management, customer support, and
retrieval-augmented generation agents. To address this challenge, we propose a
scalable hard-negative mining framework tailored specifically for
domain-specific enterprise data. Our approach dynamically selects semantically
challenging but contextually irrelevant documents to enhance deployed
re-ranking models.
Our method integrates diverse embedding models, performs dimensionality
reduction, and uniquely selects hard negatives, ensuring computational
efficiency and semantic precision. Evaluation on our proprietary enterprise
corpus (cloud services domain) demonstrates substantial improvements of 15% in
MRR@3 and 19% in MRR@10 compared to state-of-the-art baselines and other
negative sampling techniques. Further validation on public domain-specific
datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability
and readiness for real-world applications.
Authors' comments: Accepted to ACL 2025
Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, Yong Liu
Document Visual Question Answering (DocVQA) faces dual challenges in
processing lengthy multimodal documents (text, images, tables) and performing
cross-modal reasoning. Current document retrieval-augmented generation (DocRAG)
methods remain limited by their text-centric approaches, frequently missing
critical visual information. The field also lacks robust benchmarks for
assessing multimodal evidence selection and integration. We introduce MMDocRAG,
a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with
multi-page, cross-modal evidence chains. Our framework introduces innovative
metrics for evaluating multimodal quote selection and enables answers that
interleave text with relevant visual elements. Through large-scale experiments
with 60 VLM/LLM models and 14 retrieval systems, we identify persistent
challenges in multimodal evidence retrieval, selection, and integration. Key
findings reveal that advanced proprietary LVMs outperform open-source
alternatives. They also show moderate advantages using multimodal
inputs over text-only inputs, while open-source alternatives show significant
performance degradation. Notably, fine-tuned LLMs achieve substantial
improvements when using detailed image descriptions. MMDocRAG establishes a
rigorous testing ground and provides actionable insights for developing more
robust multimodal DocVQA systems. Our benchmark and code are available at
https://mmdocrag.github.io/MMDocRAG/.
Authors' comments: Preprint. Code available at
https://mmdocrag.github.io/MMDocRAG/
Pierre Achkar, Tim Gollub, Martin Potthast
The exponential growth of scientific publications has made it increasingly
difficult for researchers to stay updated and synthesize knowledge effectively.
This paper presents XSum, a modular pipeline for multi-document summarization
(MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The
pipeline includes two core components: a question-generation module and an
editor module. The question-generation module dynamically generates questions
adapted to the input papers, ensuring the retrieval of relevant and accurate
information. The editor module synthesizes the retrieved content into coherent
and well-structured summaries that adhere to academic standards for proper
citation. Evaluated on the SurveySum dataset, XSum demonstrates strong
performance, achieving considerable improvements in metrics such as CheckEval,
G-Eval and Ref-F1 compared to existing approaches. This work provides a
transparent, adaptable framework for scientific summarization with potential
applications in a wide range of domains. Code available at
https://github.com/webis-de/scolia25-xsum
Authors' comments: Accepted at SCOLIA@ECIR 2025 Workshop
Quentin Macé, António Loison, Manuel Faysse
The ViDoRe Benchmark V1 was approaching saturation with top models exceeding
90% nDCG@5, limiting its ability to discern improvements. ViDoRe Benchmark V2
introduces realistic, challenging retrieval scenarios via blind contextual
querying, long and cross-document queries, and a hybrid synthetic and
human-in-the-loop query generation process. It comprises four diverse,
multilingual datasets and provides clear evaluation instructions. Initial
results demonstrate substantial room for advancement and highlight insights on
model generalization and multilingual capability. This benchmark is designed as
a living resource, inviting community contributions to maintain relevance
through future evaluations.
Authors' comments: Published as a HuggingFace Blog
Siting Li, Xiang Gao, Simon Shaolei Du
While an image is worth more than a thousand words, only a few provide
crucial information for a given task and thus should be focused on. In light of
this, ideal text-to-image (T2I) retrievers should prioritize specific visual
attributes relevant to queries. To evaluate current retrievers on handling
attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with
9,112 queries about diverse attributes of interest. We find that CLIP-like
retrievers, which are widely adopted due to their efficiency and zero-shot
ability, have poor and imbalanced performance, possibly because their image
embeddings focus on global semantics and subjects while leaving out other
details. Notably, we reveal that even recent Multimodal Large Language Model
(MLLM)-based, stronger retrievers with a larger output dimension struggle with
this limitation. Hence, we hypothesize that retrieving with general image
embeddings is suboptimal for performing such queries. As a solution, we propose
to use promptable image embeddings enabled by these multimodal retrievers,
which boost performance by highlighting required attributes. Our pipeline for
deriving such embeddings generalizes across query types, image pools, and base
retriever architectures. To enhance real-world applicability, we offer two
acceleration strategies: Pre-processing promptable embeddings and using linear
approximations. We show that the former yields a 15% improvement in Recall@5
when prompts are predefined, while the latter achieves an 8% improvement when
prompts are only available during inference.
Authors' comments: 25 pages, 5 figures
Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.
Marc Allain, Selin Aslan, Wim Coene, Sjoerd Dirksen, Jonathan Dong, Julien Flamant, Mark Iwen, Felix Krahmer et al.
Phase retrieval is an inverse problem that, on one hand, is crucial in many applications across imaging and physics, and, on the other hand, leads to deep research questions in theoretical signal processing and applied harmonic analysis. This survey paper is an outcome of the recent workshop Phase Retrieval in Mathematics and Applications (PRiMA) (held on August 5-9, 2024 at the Lorentz Center in Leiden, The Netherlands) that brought together experts working on theoretical and practical aspects of the phase retrieval problem with the aim of formulating and exploring essential open problems in the field.
Adarsh Singh, Kushal Raj Bhandari, Jianxi Gao, Soham Dan, Vivek Gupta
Table Question Answering (TQA) involves retrieving relevant tables from a large corpus to answer natural language queries. Traditional dense retrieval models, such as DTR and ColBERT, not only incur high computational costs for large-scale retrieval tasks but also require retraining or fine-tuning on new datasets, limiting their adaptability to evolving domains and knowledge. In this work, we propose CRAFT, a cascaded retrieval approach that first uses a sparse retrieval model to filter a subset of candidate tables before applying more computationally expensive dense models and neural re-rankers. Our approach achieves better retrieval performance than state-of-the-art (SOTA) sparse, dense, and hybrid retrievers. We further enhance table representations by generating table descriptions and titles using Gemini Flash 1.5. End-to-end TQA results using various Large Language Models (LLMs) on NQ-Tables, a subset of the Natural Questions Dataset, demonstrate CRAFT's effectiveness.
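The cascade described in this abstract (a cheap sparse filter followed by more expensive scoring on the survivors) can be illustrated with a two-stage sketch. This is not the CRAFT implementation: the token-overlap scorer stands in for a sparse retriever, the character-trigram scorer is only a placeholder for a dense model or neural re-ranker, and all names and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def sparse_score(query: str, text: str) -> int:
    """Cheap lexical score: count of query tokens that also occur in the text."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    return sum(min(count, t[token]) for token, count in q.items())


def char_trigrams(text: str) -> Set[str]:
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def stand_in_dense_score(query: str, text: str) -> float:
    """Placeholder for an expensive model: Jaccard overlap of character trigrams."""
    a, b = char_trigrams(query), char_trigrams(text)
    return len(a & b) / len(a | b)


def cascade_retrieve(query: str,
                     corpus: Dict[str, str],
                     rerank_score: Callable[[str, str], float],
                     first_stage_k: int,
                     final_k: int) -> List[str]:
    # Stage 1: score the whole corpus with the cheap scorer, keep the top candidates.
    candidates = sorted(corpus,
                        key=lambda tid: sparse_score(query, corpus[tid]),
                        reverse=True)[:first_stage_k]
    # Stage 2: run the expensive scorer only on the surviving candidates.
    reranked = sorted(candidates,
                      key=lambda tid: rerank_score(query, corpus[tid]),
                      reverse=True)
    return reranked[:final_k]


# Hypothetical table descriptions keyed by table id.
corpus = {
    "t1": "population of european capitals by year",
    "t2": "list of european capitals and their countries",
    "t3": "olympic medal table by country",
    "t4": "chemical elements and atomic weights",
}
query = "european capitals population"

calls: List[str] = []


def counted_score(q: str, text: str) -> float:
    calls.append(text)  # track how often the expensive scorer is invoked
    return stand_in_dense_score(q, text)


top = cascade_retrieve(query, corpus, counted_score, first_stage_k=2, final_k=2)
print(top, "expensive scorer calls:", len(calls))
```

The point of the design is that the expensive scorer runs on `first_stage_k` items rather than on the full corpus, so its cost no longer grows with corpus size.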
Yunjia Xi, Jianghao Lin, Menghui Zhu, Yongzhao Xiao, Zhuoying Ou, Jiaqi Liu, Tong Wan, Bo Chen et al.
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.
Kai Yin, Xiangjue Dong, Chengkai Liu, Lipai Huang, Yiming Xiao, Zhewei Liu, Ali Mostafavi, James Caverlee
Effective disaster management requires timely access to accurate and contextually relevant information. Existing Information Retrieval (IR) benchmarks, however, focus primarily on general or specialized domains, such as medicine or finance, neglecting the unique linguistic complexity and diverse information needs encountered in disaster management scenarios. To bridge this gap, we introduce DisastIR, the first comprehensive IR evaluation benchmark specifically tailored for disaster management. DisastIR comprises 9,600 diverse user queries and more than 1.3 million labeled query-passage pairs, covering 48 distinct retrieval tasks derived from six search intents and eight general disaster categories that include 301 specific event types. Our evaluations of 30 state-of-the-art retrieval models demonstrate significant performance variances across tasks, with no single model excelling universally. Furthermore, comparative analyses reveal significant performance gaps between general-domain and disaster management-specific tasks, highlighting the necessity of disaster management-specific benchmarks for guiding IR model selection to support effective decision-making in disaster management scenarios. All source code and DisastIR are available at https://github.com/KaiYin97/Disaster_IR.
Lei Li, Xiao Zhou, Zheng Liu
Current medical retrieval benchmarks primarily emphasize lexical or shallow
semantic similarity, overlooking the reasoning-intensive demands that are
central to clinical decision-making. In practice, physicians often retrieve
authoritative medical evidence to support diagnostic hypotheses. Such evidence
typically aligns with an inferred diagnosis rather than the surface form of a
patient's symptoms, leading to low lexical or semantic overlap between queries
and relevant documents. To address this gap, we introduce R2MED, the first
benchmark explicitly designed for reasoning-driven medical retrieval. It
comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical
evidence retrieval, and clinical case retrieval. These tasks are drawn from
five representative medical scenarios and twelve body systems, capturing the
complexity and diversity of real-world medical information needs. We evaluate
15 widely-used retrieval systems on R2MED and find that even the best model
achieves only 31.4 nDCG@10, demonstrating the benchmark's difficulty. Classical
re-ranking and generation-augmented retrieval methods offer only modest
improvements. Although large reasoning models improve performance via
intermediate inference generation, the best results still peak at 41.4 nDCG@10.
These findings underscore a substantial gap between current retrieval
techniques and the reasoning demands of real clinical tasks. We release R2MED
as a challenging benchmark to foster the development of next-generation medical
retrieval systems with enhanced reasoning capabilities. Data and code are
available at https://github.com/R2MED/R2MED
Authors' comments: 38 pages, 16 figures
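The nDCG@10 scores quoted in this abstract follow a standard definition. As a minimal sketch (not the R2MED evaluation code), assuming linear gains and a log2 position discount, with hypothetical function names and toy relevance labels:

```python
import math
from typing import List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: List[float],
              judged_gains: List[float],
              k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_gains: relevance of each returned document, in ranked order.
    judged_gains: relevance of every judged document for the query,
                  used to build the ideal ordering.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: two relevant documents, retrieved at ranks 2 and 4.
ranked_gains = [0.0, 1.0, 0.0, 1.0]
judged_gains = [1.0, 1.0]

score = ndcg_at_k(ranked_gains, judged_gains, k=10)
perfect = ndcg_at_k([1.0, 1.0, 0.0], judged_gains, k=10)
print(f"nDCG@10 = {score:.4f} (perfect ranking: {perfect:.1f})")
```

Values such as 31.4 or 41.4 in the abstract are presumably this quantity averaged over queries and multiplied by 100.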
Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang et al.
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.