Zhiqiang Xu, Zili Xu, Xinyue Zhang
This paper aims to characterize the optimal frame for phase retrieval, defined as the frame whose condition number for phase retrieval attains its minimal value. In the context of the two-dimensional real case, we reveal the connection between optimal frames for phase retrieval and the perimeter-maximizing isodiametric problem, originally proposed by Reinhardt in 1922. Our work establishes that every optimal solution to the perimeter-maximizing isodiametric problem inherently leads to an optimal frame in ${\mathbb R}^2$. By recasting the optimal polygons problem as one concerning the discrepancy of roots of unity, we characterize all optimal polygons. Building upon this connection, we then characterize all optimal frames with $m$ vectors in ${\mathbb R}^2$ for phase retrieval when $m \geq 3$ has an odd factor. As a key corollary, we show that the harmonic frame $E_m$ is {\em not} optimal for any even integer $m \geq 4$. This finding disproves a conjecture proposed by Xia, Xu, and Xu (Math. Comp., 90(356): 2931-2960). Previous work has established that the harmonic frame $E_m \subset {\mathbb R}^2$ is indeed optimal when $m$ is an odd integer. Exploring the connection between phase retrieval and discrete geometry, this paper aims to illuminate advancements in phase retrieval and offer new perspectives on the perimeter-maximizing isodiametric problem.
Zhenyu Pan, Yucheng Lu, Han Liu
We present MetaFind, a scene-aware tri-modal compositional retrieval
framework designed to enhance scene generation in the metaverse by retrieving
3D assets from large-scale repositories. MetaFind addresses two core
challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic,
and stylistic constraints, and (ii) the absence of a standardized retrieval
paradigm specifically tailored for 3D asset retrieval, as existing approaches
mainly rely on general-purpose 3D shape representation models. Our key
innovation is a flexible retrieval mechanism that supports arbitrary
combinations of text, image, and 3D modalities as queries, enhancing spatial
reasoning and style consistency by jointly modeling object-level features
(including appearance) and scene-level layout structures. Methodologically,
MetaFind introduces a plug-and-play equivariant layout encoder ESSGNN that
captures spatial relationships and object appearance features, ensuring
retrieved 3D assets are contextually and stylistically coherent with the
existing scene, regardless of coordinate frame transformations. The framework
supports iterative scene construction by continuously adapting retrieval
results to current scene updates. Empirical evaluations demonstrate the
improved spatial and stylistic consistency of MetaFind in various retrieval
tasks compared to baseline methods.
Authors' comments: The Thirty-Ninth Annual Conference on Neural Information Processing
Systems (NeurIPS 2025)
Kirandeep Kaur, Preetam Prabhu Srikar Dammu, Hideo Joho, Chirag Shah
Personalized AI agents are becoming central to modern information retrieval, yet most evaluation methodologies remain static, relying on fixed benchmarks and one-off metrics that fail to reflect how users' needs evolve over time. These limitations hinder our ability to assess whether agents can meaningfully adapt to individuals across dynamic, longitudinal interactions. In this perspective paper, we propose a conceptual lens for rethinking evaluation in adaptive personalization, shifting the focus from static performance snapshots to interaction-aware, evolving assessments. We organize this lens around three core components: (1) persona-based user simulation with temporally evolving preference models; (2) structured elicitation protocols inspired by reference interviews to extract preferences in context; and (3) adaptation-aware evaluation mechanisms that measure how agent behavior improves across sessions and tasks. While recent works have embraced LLM-driven user simulation, we situate this practice within a broader paradigm for evaluating agents over time. To illustrate our ideas, we conduct a case study in e-commerce search using the PersonalWAB dataset. Beyond presenting a framework, our work lays a conceptual foundation for understanding and evaluating personalization as a continuous, user-centric endeavor.
Yohan Lee, Yongwoo Song, Sangyeop Kim
We present the Conversational Data Retrieval (CDR) benchmark, the first
comprehensive test set for evaluating systems that retrieve conversation data
for product insights. With 1.6k queries across five analytical tasks and 9.1k
conversations, our benchmark provides a reliable standard for measuring
conversational data retrieval performance. Our evaluation of 16 popular
embedding models shows that even the best models reach an NDCG@10 of only
around 0.51, revealing a substantial gap between document and conversational data
retrieval capabilities. Our work identifies unique challenges in conversational
data retrieval (implicit state recognition, turn dynamics, contextual
references) while providing practical query templates and detailed error
analysis across different task categories. The benchmark dataset and code are
available at https://github.com/l-yohai/CDR-Benchmark.
Authors' comments: Accepted by EMNLP 2025 Industry Track
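For readers unfamiliar with the NDCG@10 metric quoted above, the following is a minimal sketch of the standard definition with linear gain and a log2 rank discount. It is not the benchmark's evaluation code. The function names are illustrative, and the sketch assumes that every relevant item appears somewhere in the ranked list, so the ideal ordering can be obtained by sorting the same labels.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    Assumption of this sketch: the ideal ranking is the same labels sorted
    in descending order, i.e. no relevant item is missing from the list.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# A relevant conversation ranked second instead of first is penalized
# by the log2 discount of its position.
score = ndcg_at_k([0, 1, 0])
```

A score of 1.0 means the ranking already places the most relevant conversations first; lower values indicate relevant items pushed down the list.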
Aydin Javadov, Samir Garibov, Tobias Hoesli, Qiyang Sun, Florian von Wangenheim, Joseph Ollier, Björn W. Schuller
Medical time series analysis is challenging due to data sparsity, noise, and
highly variable recording lengths. Prior work has shown that stochastic sparse
sampling effectively handles variable-length signals, while retrieval-augmented
approaches improve explainability and robustness to noise and weak temporal
correlations. In this study, we generalize the stochastic sparse sampling
framework for retrieval-informed classification. Specifically, we weight window
predictions by within-channel similarity and aggregate them in probability
space, yielding convex series-level scores and an explicit evidence trail for
explainability. Our method achieves competitive iEEG classification performance
and provides practitioners with greater transparency and explainability. We
evaluate our method on iEEG recordings collected at four medical centers,
demonstrating its potential for reliable and explainable clinical
variable-length time series classification.
Authors' comments: Accepted at the NeurIPS 2025 Workshop on Learning from Time Series
for Health
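The aggregation step described above (similarity-weighted window predictions combined in probability space) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the softmax over similarity scores, the `temperature` parameter, and the function name `aggregate_windows` are all hypothetical choices made here. The point is only that non-negative weights summing to one make the series-level score a convex combination of window-level probabilities, and that the weights themselves serve as an evidence trail.

```python
import math

def aggregate_windows(window_probs, similarities, temperature=1.0):
    """Combine per-window class probabilities into one series-level distribution.

    window_probs: list of per-window probability vectors (each sums to 1).
    similarities: one similarity score per window (higher = more weight).
    The softmax weighting is an assumption of this sketch; it guarantees
    non-negative weights that sum to 1, so the result is a convex combination.
    """
    top = max(similarities)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in similarities]
    total = sum(exps)
    weights = [e / total for e in exps]
    n_classes = len(window_probs[0])
    series = [
        sum(w * probs[c] for w, probs in zip(weights, window_probs))
        for c in range(n_classes)
    ]
    return series, weights

# Two windows, two classes; equal similarity gives a plain average.
series, weights = aggregate_windows([[0.9, 0.1], [0.2, 0.8]], [0.0, 0.0])
```

Returning the weights alongside the series-level scores lets a practitioner see which windows drove the decision.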
Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, Shiqi Wang
The success of DeepSeek-R1 demonstrates the immense potential of using
reinforcement learning (RL) to enhance LLMs' reasoning capabilities. This paper
introduces Retrv-R1, the first R1-style MLLM specifically designed for
multimodal universal retrieval, achieving higher performance by employing
step-by-step reasoning to produce more accurate retrieval results. We find that
directly applying the methods of DeepSeek-R1 to retrieval tasks is not
feasible, mainly due to (1) the high computational cost caused by the large
token consumption required for multiple candidates with reasoning processes,
and (2) the instability and suboptimal results when directly applying RL to
train for retrieval tasks. To address these issues, Retrv-R1 introduces an
information compression module with a details inspection mechanism, which
enhances computational efficiency by reducing the number of tokens while
ensuring that critical information for challenging candidates is preserved.
Furthermore, a new training paradigm is proposed, including an activation stage
using a retrieval-tailored synthetic CoT dataset for more effective
optimization, followed by RL with a novel curriculum reward to improve both
performance and efficiency. Incorporating these novel designs, Retrv-R1
achieves SOTA performance, high efficiency, and strong generalization ability,
as demonstrated by experiments across multiple benchmarks and tasks.
Authors' comments: NeurIPS 2025
Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge et al.
We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text - video) and joint-modal (e.g., text - video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Van-Cuong Pham, Hoang Ngo, Dat Quoc Nguyen
We introduce AccurateRAG -- a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.
Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu
Recent studies suggest that the deeper layers of Large Language Models (LLMs)
contribute little to representation learning and can often be removed without
significant performance loss. However, such claims are typically drawn from
narrow evaluations and may overlook important aspects of model behavior. In
this work, we present a systematic study of depth utilization across diverse
dimensions, including evaluation protocols, task categories, and model
architectures. Our analysis confirms that very deep layers are generally less
effective than earlier ones, but their contributions vary substantially with
the evaluation setting. Under likelihood-based metrics without generation,
pruning most layers preserves performance, with only the initial few being
critical. By contrast, generation-based evaluation uncovers indispensable roles
for middle and deeper layers in enabling reasoning and maintaining long-range
coherence. We further find that knowledge and retrieval are concentrated in
shallow components, whereas reasoning accuracy relies heavily on deeper layers
-- yet can be reshaped through distillation. These results highlight that depth
usage in LLMs is highly heterogeneous and context-dependent, underscoring the
need for task-, metric-, and model-aware perspectives in both interpreting and
compressing large models.
Authors' comments: ICASSP 2025
Lovely Yeswanth Panchumarthi, Sai Prasad Gudari, Atharva Negi, Praveen Raj Budime, Harsit Upadhya
The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.
Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu
A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation
Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude that the optimal fine-tuning strategy depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.
Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning
EMNLP 2025 Findings. Published: 20 Aug 2025; last modified: 17 Sept 2025. License: CC BY 4.0
Zhengyang Shen, Xuehao Zhai, Hua Tu, Mayue Shi
Chagas disease affects nearly 6 million people worldwide, with Chagas
cardiomyopathy representing its most severe complication. In regions where
serological testing capacity is limited, AI-enhanced electrocardiogram (ECG)
screening provides a critical diagnostic alternative. However, existing machine
learning approaches face challenges such as limited accuracy, reliance on large
labeled datasets, and more importantly, weak integration with evidence-based
clinical diagnostic indicators. We propose a retrieval-augmented generation
framework, CardioRAG, integrating large language models with interpretable
ECG-based clinical features, including right bundle branch block, left anterior
fascicular block, and heart rate variability metrics. The framework uses
variational autoencoder-learned representations for semantic case retrieval,
providing contextual cases to guide clinical reasoning. Evaluation demonstrated
high recall performance of 89.80%, with a maximum F1 score of 0.68 for
effective identification of positive cases requiring prioritized serological
testing. CardioRAG provides an interpretable, clinical evidence-based approach
particularly valuable for resource-limited settings, demonstrating a pathway
for embedding clinical indicators into trustworthy medical AI systems.
Authors' comments: 4 pages, 2 figures. Accepted for oral presentation at the 52nd
international Computing in Cardiology Conference (CinC2025)
Fangzheng Tian, Debasis Ganguly, Craig Macdonald
Leveraging query variants (QVs), i.e., queries with potentially similar information needs to the target query, has been shown to improve the effectiveness of query performance prediction (QPP) approaches. Existing QV-based QPP methods generate QVs through either query expansion or non-contextual embeddings, which may introduce topic drift and hallucinations. In this paper, we propose a method that instead retrieves QVs for a given target query from a training set (e.g., MS MARCO). To achieve high recall in retrieving the training queries whose information needs are most similar to that of the target query, we extend the directly retrieved QVs (1-hop QVs) with a second retrieval that uses their annotated relevant documents, which yields 2-hop QVs. Our experiments, conducted on TREC DL'19 and DL'20, show that QPP methods using QVs retrieved by our method outperform the best-performing existing generated-QV-based QPP approaches by up to around 20% on neural ranking models like MonoT5.
Authors' comments: 11 pages, 4 figures
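The 1-hop and 2-hop retrieval of query variants described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: Jaccard token overlap stands in for whatever retriever the authors use, `train_qrels` is a hypothetical mapping from training queries to the texts of their relevant documents, and the cut-offs `k1` and `k2` are arbitrary.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def jaccard(a, b):
    """Token-overlap similarity; a stand-in for a real retrieval model."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def top_k_queries(text, train_queries, k, exclude=()):
    """Return up to k training queries with non-zero similarity to `text`."""
    scored = [(jaccard(text, q), q) for q in train_queries if q not in exclude]
    scored = [(s, q) for s, q in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:k]]

def two_hop_variants(target, train_qrels, k1=2, k2=1):
    """1-hop: queries retrieved directly with the target query.
    2-hop: queries retrieved with the relevant documents of the 1-hop queries."""
    queries = list(train_qrels)
    one_hop = top_k_queries(target, queries, k1, exclude=[target])
    two_hop = []
    for qv in one_hop:
        for doc in train_qrels[qv]:
            seen = [target] + one_hop + two_hop
            two_hop.extend(top_k_queries(doc, queries, k2, exclude=seen))
    return one_hop, two_hop

# Hypothetical training set: query -> texts of its relevant documents.
train_qrels = {
    "symptoms of flu": ["influenza causes fever cough and fatigue"],
    "flu symptoms in adults": ["adults with influenza report fever and body aches"],
    "what causes fever": ["fever is caused by infection"],
    "best pizza dough recipe": ["mix flour water yeast and salt"],
}
one_hop, two_hop = two_hop_variants("flu symptoms", train_qrels)
```

In this toy data the query "what causes fever" shares no terms with the target and is reached only through a relevant document of a 1-hop variant, which is the recall gain the second hop is meant to provide.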
Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu
Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (\ourwork), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.
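The split-then-merge idea described above can be sketched for Python source using the standard-library `ast` module. This is a simplified illustration, not the authors' implementation: the size measure (non-whitespace characters), the default limit, and the function names are assumptions, and the sketch only descends into a node's `body`, so the header line, decorators, and `else` branches of a node that gets split are not preserved.

```python
import ast

def _size(text):
    """Chunk size in non-whitespace characters (an assumption of this sketch)."""
    return sum(1 for ch in text if not ch.isspace())

def chunk_nodes(source, nodes, max_size):
    """Greedily merge sibling nodes into chunks; recurse into oversized nodes."""
    chunks, current = [], []

    def flush():
        # Close the running chunk, if any.
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for node in nodes:
        segment = ast.get_source_segment(source, node)
        body = getattr(node, "body", None)
        if _size(segment) > max_size and isinstance(body, list) and body:
            # Node is too large for one chunk: split it along its child statements.
            flush()
            chunks.extend(chunk_nodes(source, body, max_size))
        elif current and _size("\n".join(current + [segment])) > max_size:
            # Adding this sibling would exceed the limit: start a new chunk.
            flush()
            current.append(segment)
        else:
            # Small sibling: merge it into the running chunk.
            current.append(segment)
    flush()
    return chunks

def chunk_source(source, max_size=200):
    """Chunk a Python module along statement boundaries of its AST."""
    return chunk_nodes(source, ast.parse(source).body, max_size)

example = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = chunk_source(example, max_size=25)
```

Because chunk boundaries always fall between statements, a function is never cut in the middle unless it alone exceeds the size limit and has to be split along its own child statements.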
Oussama Gabouj, Kamel Charaf, Ivan Zakazov, Nicolas Baldwin, Robert West
Large Language Models (LLMs) achieve strong performance across diverse tasks,
but their effectiveness often depends on the quality of the provided context.
Retrieval-Augmented Generation (RAG) enriches prompts with external
information, but its reliance on static databases constrains adaptability and
can result in irrelevant demonstrations. In this work, we propose a Generative
Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach
where an LLM is trained to generate input-specific concise
demonstrations. By tailoring demonstrations to each input, our method offers
better contextual support than traditional RAG approaches. We demonstrate the
superiority of GRAD under budget constraints, where we limit both the number of
tokens used per demonstration and the number of tokens used for the final
output. Trained solely on a math dataset, GRAD consistently outperforms strong
baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM
questions, highlighting GRAD's robust generalization to out-of-distribution
(OOD) domains such as physics, chemistry, and computer science. Furthermore, we
show that demonstrations generated by trained smaller models can effectively
guide larger target models, reducing training costs while maintaining
competitive accuracy. Overall, this work introduces a scalable demonstration
generator model presenting the first step toward a dynamic few-shot learning
paradigm in resource-constrained settings. We release the code used for the
project.
Authors' comments: EMNLP 2025 (findings)
Shunfeng Zheng, Yudi Zhang, Meng Fang, Zihan Zhang, Zhitan Wu, Mykola Pechenizkiy, Ling Chen
Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but their capacity for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.
Roksana Goworek, Olivia Macmillan-Scott, Eda B. Özyiğit
Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Research in this area has typically framed the task as monolingual retrieval augmented by translation, treating retrieval methods and cross-lingual capabilities in isolation. Both monolingual and cross-lingual retrieval usually follow a pipeline of query expansion, ranking, re-ranking and, increasingly, question answering. Recent advances, however, have shifted from translation-based methods toward embedding-based approaches and leverage multilingual large language models (LLMs), for which aligning representations across languages remains a central challenge. The emergence of cross-lingual embeddings and multilingual LLMs has introduced a new paradigm, offering improved retrieval performance and enabling answer generation. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques. It presents a structured account of core CLIR components, evaluation practices, and available resources. Persistent challenges such as data imbalance and linguistic variation are identified, while promising directions are suggested for advancing equitable and effective cross-lingual information retrieval. By situating CLIR within the broader landscape of information retrieval and multilingual language processing, this work not only reviews current capabilities but also outlines future directions for building retrieval systems that are robust, inclusive, and adaptable.
Loris Bergeron, Ioana Buhnila, Jérôme François, Radu State
Large Language Models (LLMs) excel in many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling specialized models, MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%) while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, matching larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and datasets under Apache 2.0 upon acceptance.
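Balanced accuracy (BAcc), the metric reported above, is the mean of per-class recall, which keeps a classifier from scoring well by always predicting the majority class. The following is a minimal sketch of the standard definition, not the benchmark's evaluation code; the labels 1 (grounded) and 0 (hallucinated) are an illustrative encoding chosen here.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for two classes this is (TPR + TNR) / 2."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)

# Four grounded claims (1) and one hallucinated claim (0); the classifier
# labels everything as grounded. Plain accuracy would be 0.8, but recall on
# the hallucinated class is 0, so balanced accuracy is only 0.5.
bacc = balanced_accuracy([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])
```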
Yanming Sun, Runzhe Zhan, Chi Seng Cheang, Han Wu, Xuebo Liu, Yuyao Niu, Fengying Ye, Kaixin Lan et al.
REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.
Gao Huang, Song Li, Deanna Needell
We investigate stable recovery guarantees for phase retrieval under two
realistic and challenging noise models: the Poisson model and the heavy-tailed
model. Our analysis covers both nonconvex least squares (NCVX-LS) and convex
least squares (CVX-LS) estimators. For the Poisson model, we demonstrate that
in the high-energy regime where the true signal $\pmb{x}$ exceeds a certain
energy threshold, both estimators achieve a signal-independent, minimax optimal
error rate $\mathcal{O}(\sqrt{\frac{n}{m}})$, with $n$ denoting the signal
dimension and $m$ the number of sampling vectors. In contrast, in the
low-energy regime, the NCVX-LS estimator attains an error rate of
$\mathcal{O}(\|\pmb{x}\|^{1/4}_2\cdot(\frac{n}{m})^{1/4})$, which decreases as
the energy of signal $\pmb{x}$ diminishes and remains nearly optimal with
respect to the oversampling ratio. This demonstrates a signal-energy-adaptive
behavior in the Poisson setting. For the heavy-tailed model with noise having a
finite $q$-th moment ($q>2$), both estimators attain the minimax optimal error
rate $\mathcal{O}( \frac{\| \xi \|_{L_q}}{\| \pmb{x} \|_2} \cdot
\sqrt{\frac{n}{m}} )$ in the high-energy regime, while the NCVX-LS estimator
further achieves the minimax optimal rate $\mathcal{O}( \sqrt{\|\xi
\|_{L_q}}\cdot (\frac{n}{m})^{1/4} )$ in the low-energy regime. Our analysis
builds on two key ideas: the use of multiplier inequalities to handle noise
that may exhibit dependence on the sampling vectors, and a novel interpretation
of Poisson noise as sub-exponential in the high-energy regime yet heavy-tailed
in the low-energy regime. These insights form the foundation of a unified
analytical framework, which we further apply to a range of related problems,
including sparse phase retrieval, low-rank PSD matrix recovery, and random
blind deconvolution.
Authors' comments: 77 pages, 6 figures