Mingxuan Yan, Yuping Wang, Zechun Liu, Jiachen Li
To tackle long-horizon tasks, recent hierarchical vision-language-action
(VLA) frameworks employ vision-language model (VLM)-based planners to
decompose complex manipulation tasks into simpler sub-tasks that low-level
visuomotor policies can easily handle. Typically, the VLM planner is finetuned
to learn to decompose a target task. This finetuning requires target task
demonstrations segmented into sub-tasks by either human annotation or heuristic
rules. However, the heuristic sub-tasks can deviate significantly from the
training data of the visuomotor policy, which degrades task performance. To
address these issues, we propose a Retrieval-based Demonstration Decomposer
(RDD) that automatically decomposes demonstrations into sub-tasks by aligning
the visual features of the decomposed sub-task intervals with those from the
training data of the low-level visuomotor policies. Our method outperforms the
state-of-the-art sub-task decomposer on both simulation and real-world tasks,
demonstrating robustness across diverse settings. Code and more results are
available at rdd-neurips.github.io.
Authors' comments: 39th Conference on Neural Information Processing Systems (NeurIPS
2025); Project Website: rdd-neurips.github.io
Rikiya Takehi, Benjamin Clavié, Sean Lee, Aamir Shakir
In this work, we introduce the mxbai-edge-colbert-v0 models, at two different parameter counts: 17M and 32M. As part of our research, we conduct numerous experiments to improve retrieval and late-interaction models, which we intend to distill into smaller models as proofs of concept. Our ultimate aim is to support retrieval at all scales, from large-scale retrieval which lives in the cloud to models that can run locally, on any device. mxbai-edge-colbert-v0 is a model that we hope will serve as a solid backbone for all future experiments, representing the first version of a long series of small proofs of concept. As part of the development of mxbai-edge-colbert-v0, we conducted multiple ablation studies, of which we report the results. In terms of downstream performance, mxbai-edge-colbert-v0 is a particularly capable small model, outperforming ColBERTv2 on common short-text benchmarks (BEIR) and representing a large step forward in long-context tasks, with unprecedented efficiency.
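For readers unfamiliar with late interaction, the scoring rule used by ColBERT-style models can be sketched as below. This is a minimal illustration of the generic MaxSim operator, not the mxbai-edge-colbert-v0 implementation; the function names and the toy 2-dimensional token embeddings are assumptions made up for the example.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document token (max cosine similarity), then sum over query tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy per-token embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [3.0, 0.0]]              # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because every query token is matched independently, `doc_a` scores higher than `doc_b`, which only covers one of the two query tokens.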
Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang et al.
Knowledge-based visual question answering (KB-VQA) requires visual language
models (VLMs) to integrate visual understanding with external knowledge
retrieval. Although retrieval-augmented generation (RAG) achieves significant
advances in this task by combining knowledge-base querying, it still struggles
with the quality of multimodal queries and the relevance of retrieved results.
To overcome these challenges, we propose a novel three-stage method, termed
Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing
stage dynamically invokes visual tools to extract precise multimodal
information for retrieval. The retrieval stage integrates visual and text
features to achieve multimodal knowledge retrieval. The filtering stage
performs relevance filtering and concentration on retrieval results. To this
end, we introduce a visual language model trained via reinforcement learning,
with answer accuracy and format consistency as reward signals. This
enhances the model's reasoning, tool invocation for accurate queries, and
filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and
InfoSeek) show significant improvements (36.0 and 42.8) in answer quality,
achieving state-of-the-art performance. Code is available at
https://github.com/cqu-student/Wiki-PRF
Authors' comments: Accepted by NeurIPS 2025
Rashmi R, Vidyadhar Upadhya
Current Retrieval-Augmented Generation (RAG) systems primarily operate on
unimodal textual data, limiting their effectiveness on unstructured multimodal
documents. Such documents often combine text, images, tables, equations, and
graphs, each contributing unique information. In this work, we present a
Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for
multimodal question answering with reasoning through a modality-aware knowledge
graph. MAHA integrates dense vector retrieval with structured graph traversal,
where the knowledge graph encodes cross-modal semantics and relationships. This
design enables both semantically rich and context-aware retrieval across
diverse modalities. Evaluations on multiple benchmark datasets demonstrate that
MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of
0.486 while providing complete modality coverage. These results highlight MAHA's
ability to combine embeddings with explicit document structure, enabling
effective multimodal retrieval. Our work establishes a scalable and
interpretable retrieval framework that advances RAG systems by enabling
modality-aware reasoning over unstructured multimodal data.
Authors' comments: 12 pages, 6 figures, submitted for review
Keima Abe, Hayato Muraki, Shuhei Tomoshige, Kenichi Oishi, Hitoshi Iyatomi
Medical images like MR scans often show domain shifts across imaging sites
due to scanner and protocol differences, which degrade machine learning
performance in tasks such as disease classification. Domain harmonization is
thus a critical research focus. Recent approaches encode brain images
$\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then
disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and
$\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these
methods often lack interpretability, an essential requirement in medical
applications, leaving practical issues unresolved. We propose
Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a
general framework for domain harmonization and interpretable representation
learning that preserves disease-relevant information in brain MR images.
PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract
$\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder $f_D$ to reconstruct the
image, and a domain predictor $g_D$. Beyond adversarial training between the
encoder and domain predictor, the model learns to reconstruct the input image
$\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and
$\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared
to prior methods, PL-SE-ADA achieves equal or better performance in image
reconstruction, disease classification, and domain recognition. It also enables
visualization of both domain-independent brain features and domain-specific
components, offering high interpretability across the entire framework.
Authors' comments: 6 pages, 3 figures, 3 tables. Accepted at 2025 IEEE International
Conference on Systems, Man, and Cybernetics (IEEE SMC 2025)
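One way to write the additive reconstruction described in this abstract is sketched below. It assumes that $f_E$ produces $\boldsymbol{z_u}$, that $f_{SE}$ produces $\boldsymbol{z_d}$, and that the same decoder $f_D$ is applied to each latent; the abstract does not spell out these details, and $\hat{\boldsymbol{x}}$ is a symbol introduced here for the reconstruction.

$$\boldsymbol{z_u} = f_E(\boldsymbol{x}), \qquad \boldsymbol{z_d} = f_{SE}(\boldsymbol{x}), \qquad \hat{\boldsymbol{x}} = f_D(\boldsymbol{z_u}) + f_D(\boldsymbol{z_d}) \approx \boldsymbol{x}$$

Under this reading, each term of the sum can be visualized separately, which is consistent with the claim that domain-independent and domain-specific components are both viewable.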
Md Mahadi Hasan Nahid, Davood Rafiei, Weiwei Zhang, Yong Zhang
Schema linking -- the process of aligning natural language questions with
database schema elements -- is a critical yet underexplored component of
Text-to-SQL systems. While recent methods have focused primarily on improving
SQL generation, they often neglect the retrieval of relevant schema elements,
which can lead to hallucinations and execution failures. In this work, we
propose a context-aware bidirectional schema retrieval framework that treats
schema linking as a standalone problem. Our approach combines two complementary
strategies: table-first retrieval followed by column selection, and
column-first retrieval followed by table selection. It is further augmented
with techniques such as question decomposition, keyword extraction, and
keyphrase extraction. Through comprehensive evaluations on challenging
benchmarks such as BIRD and Spider, we demonstrate that our method
significantly improves schema recall while reducing false positives. Moreover,
SQL generation using our retrieved schema consistently outperforms full-schema
baselines and closely approaches oracle performance, all without requiring
query refinement. Notably, our method narrows the performance gap between full
and perfect schema settings by 50%. Our findings highlight schema linking as a
powerful lever for enhancing Text-to-SQL accuracy and efficiency.
Authors' comments: 30 Pages
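The two retrieval directions described in this abstract can be illustrated with a toy sketch. It substitutes a simple token-overlap score for whatever retriever the authors use, and the schema, question, function names, and cut-offs (`k_tables`, `k_columns`) are all hypothetical.

```python
import re

def tokens(text):
    """Lowercase word tokens; underscores in identifiers act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower().replace("_", " ")))

def overlap(question, name):
    """Toy relevance score: shared-token count (stand-in for a real retriever)."""
    return len(tokens(question) & tokens(name))

def table_first(question, schema, k_tables=1):
    """Rank tables by name, keep the top ones, then select matching columns."""
    ranked = sorted(schema, key=lambda t: overlap(question, t), reverse=True)
    return {
        (t, c)
        for t in ranked[:k_tables]
        for c in schema[t]
        if overlap(question, c) > 0
    }

def column_first(question, schema, k_columns=2):
    """Rank all columns directly, then keep the tables that own the top ones."""
    scored = [(overlap(question, c), t, c) for t, cols in schema.items() for c in cols]
    scored.sort(key=lambda item: item[0], reverse=True)
    return {(t, c) for score, t, c in scored[:k_columns] if score > 0}

def bidirectional(question, schema):
    """Union of both directions, so one can recover what the other misses."""
    return table_first(question, schema) | column_first(question, schema)

schema = {
    "customers": ["customer_id", "name", "city"],
    "orders": ["order_id", "customer_id", "order_date", "total_amount"],
}
question = "What is the total amount of orders placed by customers in the city of Paris?"

only_table_first = table_first(question, schema)
linked = bidirectional(question, schema)
```

In this toy case the table-first pass keeps only `customers.city`, while the column-first pass also surfaces `orders.total_amount`, so the union covers both tables the question needs.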
Lifu Tu, Yingbo Zhou, Semih Yavuz
Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau, indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.
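To make the role of hard negatives concrete, here is a minimal sketch of a standard InfoNCE-style contrastive loss. It is not the authors' training code; the temperature value and the toy 2-dimensional embeddings are assumptions chosen only to show why a hard negative produces a stronger training signal than an easy one.

```python
import math

def dot(a, b):
    """Inner product used as the similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.05):
    """Contrastive loss: negative log-softmax of the positive's similarity
    among the positive and all negatives (computed in a numerically stable way)."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy embeddings (hypothetical values).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [0.0, 1.0]   # clearly unrelated to the query
hard_negative = [0.8, 0.2]   # close to the query, but not the right answer

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
```

With the easy negative the loss is nearly zero, so the model learns almost nothing from that example; with the hard negative the loss is about 0.127, which is the kind of non-trivial signal the abstract attributes to hard-negative mining.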
Jihao Zhao, Zhiyuan Ji, Simin Niu, Hanyu Wang, Feiyu Xiong, Zhiyu Li
The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
Sudarshan Srinivasa Ramanujam, Antonio Alonso, Saurabh Kataria, Siddharth Dangi, Akhilesh Gupta, Birjodh Singh Tiwana, Manas Somaiya, Luke Simon et al.
In large scale recommendation systems like the LinkedIn Feed, the retrieval
stage is critical for narrowing hundreds of millions of potential candidates to
a manageable subset for ranking. LinkedIn's Feed serves suggested content from
outside of the member's network (based on the member's topical interests),
where 2000 candidates are retrieved from a pool of hundreds of millions of
candidates with a latency budget of a few milliseconds and inbound traffic of
several thousand queries per second. This paper presents a novel retrieval approach
that fine-tunes a large causal language model (Meta's LLaMA 3) as a dual
encoder to generate high quality embeddings for both users (members) and
content (items), using only textual input. We describe the end to end pipeline,
including prompt design for embedding generation, techniques for fine-tuning at
LinkedIn's scale, and infrastructure for low latency, cost effective online
serving. We share our findings on how quantizing numerical features in the
prompt enables the information to get properly encoded in the embedding,
facilitating greater alignment between the retrieval and ranking layer. The
system was evaluated using offline metrics and an online A/B test, which showed
substantial improvements in member engagement. We observed significant gains
among newer members, who often lack strong network connections, indicating that
high-quality suggested content aids retention. This work demonstrates how
generative language models can be effectively adapted for real time, high
throughput retrieval in industrial applications.
Authors' comments: 9 pages, 4 figures
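The abstract does not say how numerical features are quantized before being placed in the prompt. The sketch below shows one plausible reading, bucketing a raw count against fixed boundaries and rendering the bucket as text; the feature name `post_views`, the boundaries, and the text template are all hypothetical.

```python
from bisect import bisect_right

def quantize(value, boundaries):
    """Return the index of the bucket that `value` falls into.
    `boundaries` must be sorted; a value equal to a boundary goes to the upper bucket."""
    return bisect_right(boundaries, value)

def render_feature(name, value, boundaries):
    """Turn a raw number into a short, coarse textual token for the prompt."""
    level = quantize(value, boundaries)
    return f"{name}: level {level} of {len(boundaries)}"

# Hypothetical bucket boundaries for a view-count feature.
VIEW_BOUNDARIES = [10, 100, 1000]

raw_prompt = "post_views: 250"
quantized_prompt = render_feature("post_views", 250, VIEW_BOUNDARIES)
```

The design intuition is that a small, fixed vocabulary of levels is easier for a text encoder to represent consistently than arbitrary digit strings.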
Aofan Liu, Shiyuan Song, Haoxuan Li, Cehao Yang, Yiyan Qi
The escalating complexity of modern codebases has intensified the need for retrieval systems capable of interpreting cross-component change intents, a capability fundamentally absent in conventional function-level search paradigms. While recent studies have improved the alignment between natural language queries and code snippets, retrieving contextually relevant code for specific change requests remains largely underexplored. To address this gap, we introduce RepoAlign-Bench, the first benchmark specifically designed to evaluate repository-level code retrieval under change-request-driven scenarios, encompassing 52k annotated instances. This benchmark shifts the retrieval paradigm from function-centric matching to holistic repository-level reasoning. Furthermore, we propose ReflectCode, an adversarial reflection augmented dual-tower architecture featuring disentangled code_encoder and doc_encoder components. ReflectCode dynamically integrates syntactic patterns, function dependencies, and semantic expansion intents through large language model guided reflection. Comprehensive experiments demonstrate that ReflectCode achieves a 12.2% improvement in Top-5 Accuracy and a 7.1% improvement in Recall over state-of-the-art baselines, establishing a new direction for context-aware code retrieval.
Authors' comments: Accepted by EMNLP 2025
Kai Yin, Xiangjue Dong, Chengkai Liu, Allen Lin, Lingfeng Shi, Ali Mostafavi, James Caverlee
Effective and efficient access to relevant information is essential for disaster management. However, no retrieval model is specialized for disaster management, and existing general-domain models fail to handle the varied search intents inherent to disaster management scenarios, resulting in inconsistent and unreliable performance. To this end, we introduce DMRetriever, the first series of dense retrieval models (33M to 7.6B) tailored for this domain. It is trained through a novel three-stage framework of bidirectional attention adaptation, unsupervised contrastive pre-training, and difficulty-aware progressive instruction fine-tuning, using high-quality data generated through an advanced data refinement pipeline. Comprehensive experiments demonstrate that DMRetriever achieves state-of-the-art (SOTA) performance across all six search intents at every model scale. Moreover, DMRetriever is highly parameter-efficient, with the 596M model outperforming baselines over 13.3x larger and the 33M model exceeding baselines with only 7.6% of their parameters. All code, data, and checkpoints are available at https://github.com/KaiYin97/DMRETRIEVER
Mert Sonmezer, Matthew Zheng, Pinar Yanardag
Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like Civit.ai, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
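The abstract does not state which submodular objective is used. As a generic illustration of the idea, the sketch below greedily maximizes a facility-location function, a standard monotone submodular objective that rewards covering the whole collection with a few representatives. The similarity matrix and the interpretation of its rows as adapters are hypothetical.

```python
def facility_location(selected, sim):
    """f(S) = sum over all items of their similarity to the closest selected item."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)

def greedy_select(sim, k):
    """Pick k items one at a time, each time adding the item that raises f the most."""
    selected = []
    for _ in range(k):
        candidates = [j for j in range(len(sim)) if j not in selected]
        best = max(candidates, key=lambda j: facility_location(selected + [j], sim))
        selected.append(best)
    return selected

# Hypothetical pairwise similarities between four adapters:
# items 0-2 are near-duplicates of one style, item 3 is a distinct style.
sim = [
    [1.0, 0.9, 0.7, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.7, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]

chosen = greedy_select(sim, 2)
```

Greedy selection first takes item 1, the most central member of the large cluster, and then item 3, because adding another near-duplicate would raise the objective far less than covering the distinct style.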
Jiale Han, Austin Cheung, Yubai Wei, Zheng Yu, Xusheng Wang, Bing Zhu, Yi Yang
Knowledge is inherently time-sensitive and continuously evolves over time. Although current Retrieval-Augmented Generation (RAG) systems enrich LLMs with external knowledge, they largely ignore this temporal nature. This raises two challenges for RAG. First, current RAG methods lack effective time-aware representations. The same fact at different times is difficult to distinguish with vector embeddings or conventional knowledge graphs. Second, most RAG evaluations assume a static corpus, leaving a blind spot regarding update costs and retrieval stability as knowledge evolves. To make RAG time-aware, we propose Temporal GraphRAG (TG-RAG), which models external corpora as a bi-level temporal graph consisting of a temporal knowledge graph with timestamped relations and a hierarchical time graph. Multi-granularity temporal summaries are generated for each time node to capture both key events and broader trends at that time. The design supports incremental updates by extracting new temporal facts from the incoming corpus and merging them into the existing graph. The temporal graph explicitly represents identical facts at different times as distinct edges to avoid ambiguity, and the time hierarchy graph allows reports to be generated only for new leaf time nodes and their ancestors, ensuring effective and efficient updates. During inference, TG-RAG dynamically retrieves a subgraph within the temporal and semantic scope of the query, enabling precise evidence gathering. Moreover, we introduce ECT-QA, a time-sensitive question-answering dataset featuring both specific and abstract queries, along with a comprehensive evaluation protocol designed to assess incremental update capabilities of RAG systems. Extensive experiments show that TG-RAG significantly outperforms existing baselines, demonstrating the effectiveness of our method in handling temporal knowledge and incremental updates.
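A toy data structure can illustrate two ideas from this abstract: identical facts at different times stored as distinct edges, and incremental updates that only touch a new leaf time node and its ancestors. This is a sketch under assumptions, not the TG-RAG implementation; the class, the tuple-based time paths such as `("2024", "Q1")`, and the example facts are hypothetical.

```python
def ancestors(time_path):
    """The leaf time node followed by its ancestors, e.g. (2024, Q1) then (2024,)."""
    return [time_path[:i] for i in range(len(time_path), 0, -1)]

class TemporalGraph:
    def __init__(self):
        self.edges = set()        # (head, relation, tail, time_path)
        self.time_nodes = set()   # nodes of the time hierarchy seen so far

    def add_fact(self, head, relation, tail, time_path):
        """Insert a timestamped edge and return the time nodes whose
        summaries would need to be regenerated after this update."""
        self.edges.add((head, relation, tail, time_path))
        stale = ancestors(time_path)
        self.time_nodes.update(stale)
        return stale

    def facts_in(self, time_prefix):
        """Retrieve edges whose timestamp falls under the given time node."""
        n = len(time_prefix)
        return sorted(e for e in self.edges if e[3][:n] == time_prefix)

graph = TemporalGraph()
graph.add_fact("Acme", "reports", "revenue growth", ("2023", "Q4"))
stale = graph.add_fact("Acme", "reports", "revenue growth", ("2024", "Q1"))
facts_2024 = graph.facts_in(("2024",))
```

The same triple recorded in two quarters yields two separate edges, and the second insertion marks only `("2024", "Q1")` and `("2024",)` as needing fresh summaries, leaving the 2023 nodes untouched.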
Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng et al.
LLMs hold great promise for healthcare applications, but the rapid evolution
of medical knowledge and errors in training data often cause them to generate
outdated or inaccurate information, limiting their applicability in high-stakes
clinical practice. Model editing has emerged as a potential remedy without full
retraining. While parameter-based editing often compromises locality and is
thus ill-suited for the medical domain, retrieval-based editing offers a more
viable alternative. However, it still faces two critical challenges: (1)
representation overlap within the medical knowledge space often causes
inaccurate retrieval and reduces editing accuracy; (2) existing methods are
restricted to single-sample edits, while batch-editing remains largely
unexplored despite its importance for real-world medical applications. To
address these challenges, we first construct MedVersa, an enhanced
benchmark with broader coverage of medical subjects, designed to evaluate both
single and batch edits under strict locality constraints. We then propose
MedREK, a retrieval-based editing framework that integrates a shared query-key
module for precise matching with an attention-based prompt encoder for
informative guidance. Experimental results on various medical benchmarks
demonstrate that our MedREK achieves superior performance across different core
metrics and provides the first validated solution for batch-editing in medical
LLMs. Our code and dataset are available at
https://github.com/mylittleriver/MedREK.
Authors' comments: Preprint, work in progress
Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas
We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static 'rewrite, retrieve, and generate' pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.
Jiamin Chen, Yuchen Li, Xinyu Ma, Xinran Chen, Xiaokun Zhang, Shuaiqiang Wang, Chen Ma, Dawei Yin
Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.
Subhendu Khatuya, Shashwat Naidu, Pawan Goyal, Niloy Ganguly
Despite continuous advancements in the capabilities of large language models
(LLMs), numerical reasoning remains a challenging area. Techniques like
chain-of-thought prompting, tree-of-thought prompting, and program-of-thought
prompting guide LLMs through intermediate reasoning steps. Although in-context
learning with few-shot prompting has improved performance, LLMs still lag
behind state-of-the-art models on financial numerical reasoning datasets such
as FinQA and ConvFinQA. In this work, we introduce FINDER, a novel two-step
framework, to enhance LLMs' capabilities in financial numerical reasoning. The
first step utilizes a generative retriever to extract relevant facts from
unstructured data, including both text and tables. This is followed by
context-aware Program of Thought prompting with dynamic selection of in-context
examples. Our model FINDER achieves a new state-of-the-art performance on both
the FinQA and ConvFinQA datasets, surpassing previous benchmarks with execution
accuracy improvements of 5.98% and 4.05%, respectively.
Authors' comments: This work has been accepted for publication in the Main Conference of
the Empirical Methods in Natural Language Processing (EMNLP) 2025
Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li
Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.
Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Rose, Jesse C. Cresswell
Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.
Authors' comments: 8 pages
Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny
Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty of processing intensive video tokens beyond the context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0% to 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.
Authors' comments: NeurIPS 2025 (Spotlight). Webpage at https://xiaoqian-shen.github.io/Vgent