Esakkivel Esakkiraja, Denis Akhiyarov, Aditya Shanmugham, Chitra Ganapathy
Current search techniques are limited to standard RAG query-document
applications. In this paper, we propose a novel technique to expand the code
and index for predicting the required APIs, directly enabling high-quality,
end-to-end code generation for auto-completion and agentic AI applications. We
address the problem of API leaks in current code-to-code benchmark datasets by
introducing a new dataset built from real-world ServiceNow Script Includes that
capture the challenge of unclear API usage intent in the code. Our evaluation
shows that this method achieves 87.86% top-40 retrieval accuracy, supplying
the critical API context needed for successful downstream code
generation. To enable real-time predictions, we develop a comprehensive
post-training pipeline that optimizes a compact 0.6B reranker through synthetic
dataset generation, supervised fine-tuning, and reinforcement learning. This
approach enables our compact reranker to outperform a much larger 8B model
while achieving 2.5x lower latency, effectively addressing the nuances of
enterprise-specific code without the computational overhead of larger models.
Authors' comments: Retrieval-Augmented Generation, API Prediction, Context-Aware Code
Generation, Enterprise Code Completion, Reinforcement Learning, ServiceNow,
Real-Time Code Search, Query Enhancement, Fine-Tuning, Embedding, Reranker
Junwei Lan, Jianlyu Chen, Zheng Liu, Chaofan Li, Siqi Bao, Defu Lian
With the growing popularity of LLM agents and RAG, it has become increasingly important to retrieve documents that are essential for solving a task, even when their connection to the task is indirect or implicit. Addressing this problem requires fine-grained reasoning to accurately assess the relevance between the task and each candidate document. This capability, however, poses a significant challenge for existing IR techniques. Despite recent progress in reasoning-enhanced IR, existing approaches still face significant challenges in applicability, scalability, and efficiency. In this work, we propose Retro*, a novel approach for reasoning-intensive document retrieval. Our method introduces a rubric-based relevance scoring mechanism, enabling the model to reason about the relationship between a task and a document based on explicitly defined criteria, thereby producing a fine-grained, interpretable relevance score. Retro* also supports test-time scaling by combining multiple reasoning trajectories via score integration, which produces more reliable relevance estimates. To optimize Retro*'s reasoning capabilities, we introduce a novel reinforcement learning algorithm tailored for its relevance scoring mechanism, which employs two composite rewards to fully exploit the trajectories of each training sample. Our experiments show that Retro* outperforms existing document retrieval methods with notable advantages, leading to state-of-the-art performance on the BRIGHT benchmark.
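The score-integration idea in the Retro* abstract can be illustrated with a very small sketch. The abstract does not say how the scores from different reasoning trajectories are combined, so the arithmetic mean used here is an assumption, as are the 0-100 score scale and the function names `integrate_scores` and `rank_documents`.

```python
from statistics import mean
from typing import Dict, List


def integrate_scores(trajectory_scores: List[float]) -> float:
    """Combine the relevance scores from several sampled reasoning
    trajectories into one estimate (assumption: a plain average)."""
    return mean(trajectory_scores)


def rank_documents(scores_by_doc: Dict[str, List[float]]) -> List[str]:
    """Rank document ids by their integrated score, highest first."""
    integrated = {doc: integrate_scores(s) for doc, s in scores_by_doc.items()}
    return sorted(integrated, key=integrated.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical scores: three trajectories per candidate document.
    sampled = {"d1": [60, 80, 70], "d2": [90, 20, 40], "d3": [75, 75, 75]}
    print(rank_documents(sampled))
```

Averaging more trajectories reduces the influence of any single noisy trajectory, which is one plausible reading of why score integration would yield more reliable estimates.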
Zhongbin Xie, Thomas Lukasiewicz
Dense retrieval models usually adopt vectors from the last hidden layer of
the document encoder to represent a document, which is in contrast to the fact
that representations in different layers of a pre-trained language model
usually contain different kinds of linguistic knowledge, and behave differently
during fine-tuning. Therefore, we propose to investigate utilizing
representations from multiple encoder layers to make up the representation of a
document, which we denote Multi-layer Representations (MLR). We first
investigate how representations in different layers affect MLR's performance
under the multi-vector retrieval setting, and then propose to leverage pooling
strategies to reduce multi-vector models to single-vector ones to improve
retrieval efficiency. Experiments demonstrate the effectiveness of MLR over
dual encoder, ME-BERT and ColBERT in the single-vector retrieval setting, as
well as demonstrate that it works well with other advanced training techniques
such as retrieval-oriented pre-training and hard negative mining.
Authors' comments: Accepted to Findings of EMNLP 2025
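As a rough illustration of reducing a multi-layer, multi-vector document representation to a single vector, the sketch below mean-pools one vector per selected encoder layer and scores a query by dot product. Mean pooling is only one possible pooling strategy and is an assumption here, as are the toy vectors and the names `mean_pool` and `dot`; the abstract does not specify which strategies MLR uses.

```python
from typing import List

Vector = List[float]


def mean_pool(layer_vectors: List[Vector]) -> Vector:
    """Average the per-layer document vectors into a single vector."""
    if not layer_vectors:
        raise ValueError("need at least one layer vector")
    dim = len(layer_vectors[0])
    if any(len(v) != dim for v in layer_vectors):
        raise ValueError("all layer vectors must share one dimension")
    n = len(layer_vectors)
    return [sum(v[i] for v in layer_vectors) / n for i in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product relevance score between a query and a document vector."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Hypothetical vectors taken from two different encoder layers.
    doc_layers = [[1.0, 2.0], [3.0, 4.0]]
    doc_vec = mean_pool(doc_layers)
    print(doc_vec, dot([1.0, 0.0], doc_vec))
```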
Rajaa El Hamdani, Samy Haffoudhi, Nils Holzenberger, Fabian Suchanek, Thomas Bonald, Fragkiskos D. Malliaros
Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models' parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.
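One common way to restrict decoding to a fixed set of surface forms is a prefix trie that exposes which tokens may follow the text generated so far. The sketch below shows that general technique, not the authors' implementation: whitespace tokenization, the `END` marker, the toy scoring function, and the names `build_trie`, `allowed_next`, and `constrained_greedy` are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # hypothetical end-of-answer marker


def build_trie(surface_forms: Sequence[str]) -> Dict:
    """Build a nested-dict prefix trie over whitespace-tokenized forms."""
    trie: Dict = {}
    for form in surface_forms:
        node = trie
        for token in form.split():
            node = node.setdefault(token, {})
        node[END] = {}  # mark that a complete surface form ends here
    return trie


def allowed_next(trie: Dict, prefix: List[str]) -> List[str]:
    """Return the tokens that keep the output inside the allowed set."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(node.keys())


def constrained_greedy(trie: Dict, score: Callable[[List[str], str], float]) -> str:
    """Greedy decoding that only ever picks among allowed tokens."""
    prefix: List[str] = []
    while True:
        options = allowed_next(trie, prefix)
        if not options:
            return " ".join(prefix)
        best = max(options, key=lambda tok: score(prefix, tok))
        if best == END:
            return " ".join(prefix)
        prefix.append(best)


if __name__ == "__main__":
    trie = build_trie(["New York City", "New Delhi", "Paris"])
    # Toy stand-in for a language model: prefer longer tokens.
    print(constrained_greedy(trie, lambda prefix, tok: len(tok)))
```

In a real system the `score` callable would be replaced by the language model's next-token scores, masked to the tokens returned by `allowed_next`.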
Zhaohua Zhang, Jianhuan Zhuo, Muxi Chen, Chenchen Zhao, Wenyu Jiang, Tianwen Jiang, Mingyang Chen, Yu Tang et al.
The CLIP model has become a cornerstone of large-scale retrieval systems by aligning text and image data in a unified embedding space. Despite its simplicity and efficiency, CLIP struggles when applied to tasks whose input distributions diverge from its training corpus, such as queries with multilingual, long-form, or multimodal differences. To avoid costly retraining, existing methods mainly adopt query-rewriting strategies with large language models (LLMs), aiming to mitigate distribution gaps at the query level. However, due to the lack of supervision signals, LLMs fail to generate the optimal rewrite that fits the training distribution. We address this challenge with GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play enhancement approach that incorporates ranking signals into retrieval-guided query rewriting with LLMs. Intuitively, GRAPE proposes to leverage GRPO to bridge distributional differences -- including length, multilingual, and modality shifts -- by transforming queries into forms better aligned with the retriever's training distribution. However, our preliminary experiment finds that naively fine-tuning an LLM with similarity scores can lead to score inflation, where nearly all candidates are assigned unexpectedly high scores regardless of their true relevance. To address score inflation, we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation. Extensive experiments demonstrate that GRAPE consistently improves retrieval performance under distributional shifts -- including multilingual differences (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR) -- achieving an average improvement of 4.9% in Recall@10. The code is available at https://github.com/Chinese0123456/GRAPE.git
Yichi Zhang, Jun Bai, Zhixin Cai, Shuhan Qin, Zhuofan Chen, Jinghua Guan, Wenge Rong
Dense retrievers enhance retrieval by encoding queries and documents into continuous vectors, but they often struggle with reasoning-intensive queries. Although Large Language Models (LLMs) can reformulate queries to capture complex reasoning, applying them universally incurs significant computational cost. In this work, we propose Adaptive Query Reasoning (AdaQR), a hybrid query rewriting framework. Within this framework, a Reasoner Router dynamically directs each query to either fast dense reasoning or deep LLM reasoning. The dense reasoning is achieved by the Dense Reasoner, which performs LLM-style reasoning directly in the embedding space, enabling a controllable trade-off between efficiency and accuracy. Experiments on the large-scale retrieval benchmark BRIGHT show that AdaQR reduces reasoning cost by 28% while preserving, or even improving by 7%, retrieval performance.
Authors' comments: 16 pages, 11 figures
Pratik Shah, Rajat Ghosh, Aryan Singhal, Debojyoti Dutta
General-purpose automated software engineering (ASE) includes tasks such as code completion, retrieval, repair, QA, and summarization. These tasks require a code retrieval system that can handle specific queries about code entities, or code entity queries (for example, locating a specific class or retrieving the dependencies of a function), as well as general queries without explicit code entities, or natural language queries (for example, describing a task and retrieving the corresponding code). We present RANGER, a repository-level code retrieval agent designed to address both query types, filling a gap in recent works that have focused primarily on code-entity queries. We first present a tool that constructs a comprehensive knowledge graph of the entire repository, capturing hierarchical and cross-file dependencies down to the variable level, and augments graph nodes with textual descriptions and embeddings to bridge the gap between code and natural language. RANGER then operates on this graph through a dual-stage retrieval pipeline. Entity-based queries are answered through fast Cypher lookups, while natural language queries are handled by MCTS-guided graph exploration. We evaluate RANGER across four diverse benchmarks that represent core ASE tasks including code search, question answering, cross-file dependency retrieval, and repository-level code completion. On CodeSearchNet and RepoQA it outperforms retrieval baselines that use embeddings from strong models such as Qwen3-8B. On RepoBench, it achieves superior cross-file dependency retrieval over baselines, and on CrossCodeEval, pairing RANGER with BM25 delivers the highest exact match rate in code completion compared to other RAG methods.
Authors' comments: 24 pages, 4 figures
Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang, Shuo Han, Razieh Rahimi, Hong Yu
Inspired by the dual-process theory of human cognition from "Thinking,
Fast and Slow", we introduce PRIME (Planning and Retrieval-Integrated
Memory for Enhanced Reasoning), a multi-agent reasoning framework that
dynamically integrates System 1 (fast, intuitive thinking) and
System 2 (slow, deliberate thinking). PRIME first employs a Quick
Thinking Agent (System 1) to generate a rapid answer; if uncertainty is
detected, it then triggers a structured System 2 reasoning pipeline composed of
specialized agents for planning, hypothesis generation,
retrieval, information integration, and
decision-making. This multi-agent design faithfully mimics human
cognitive processes and enhances both efficiency and accuracy. Experimental
results with LLaMA 3 models demonstrate that PRIME enables open-source LLMs to
perform competitively with state-of-the-art closed-source models like GPT-4 and
GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning. This
research establishes PRIME as a scalable solution for improving LLMs in domains
requiring complex, knowledge-intensive reasoning.
Authors' comments: 8 pages
Taiga Sasaki, Takehiro Yamamoto, Hiroaki Ohshima, Sumio Fujita
In this study, we evaluate the effect of model merging in ad-hoc retrieval
tasks. Model merging is a technique that combines the diverse characteristics
of multiple models. We hypothesized that applying model merging to
domain-specific ad-hoc retrieval tasks could improve retrieval effectiveness.
To verify this hypothesis, we merged the weights of a source retrieval model
and a domain-specific (non-retrieval) model using a linear interpolation
approach. A key advantage of our approach is that it requires no additional
fine-tuning of the models. We conducted two experiments each in the medical and
Japanese domains. The first compared the merged model with the source retrieval
model, and the second compared it with a LoRA fine-tuned model under both full
and limited data settings for model construction. The experimental results
indicate that model merging has the potential to produce more effective
domain-specific retrieval models than the source retrieval model, and may serve
as a practical alternative to LoRA fine-tuning, particularly when only a
limited amount of data is available.
Authors' comments: Accepted at CIKM 2025, 5 pages
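The linear interpolation of weights described in the model merging abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions: weights are represented as plain name-to-list dictionaries rather than real model tensors, both models share the same architecture and parameter names, and the function name `merge_weights` and the coefficient name `alpha` are illustrative.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_weights(source: Weights, domain: Weights, alpha: float) -> Weights:
    """Linearly interpolate two models' weights, parameter by parameter:
    merged = (1 - alpha) * source + alpha * domain."""
    if source.keys() != domain.keys():
        raise ValueError("models must share the same parameter names")
    merged: Weights = {}
    for name, src in source.items():
        dom = domain[name]
        if len(src) != len(dom):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1 - alpha) * s + alpha * d for s, d in zip(src, dom)]
    return merged


if __name__ == "__main__":
    # Hypothetical toy weights for a retrieval model and a domain model.
    retrieval_model = {"w": [0.0, 2.0]}
    domain_model = {"w": [1.0, 4.0]}
    print(merge_weights(retrieval_model, domain_model, alpha=0.5))
```

With `alpha = 0` the result is the source retrieval model and with `alpha = 1` it is the domain-specific model, so no gradient updates are involved, which matches the abstract's point that merging needs no additional fine-tuning.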
Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad, Sheng Di, Zirui Liu, Ali Anwar
Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable "thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
Lei Hei, Tingjing Liao, Yingxin Pei, Yiyang Qi, Jiaqi Wang, Ruiting Li, Feiliang Ren
Relation extraction (RE) aims to identify semantic relations between entities
in unstructured text. Although recent work extends traditional RE to multimodal
scenarios, most approaches still adopt classification-based paradigms with
fused multimodal features, representing relations as discrete labels. This
paradigm has two significant limitations: (1) it overlooks structural
constraints like entity types and positional cues, and (2) it lacks semantic
expressiveness for fine-grained relation understanding. We propose
Retrieval Over Classification (ROC), a
novel framework that reformulates multimodal RE as a retrieval task driven by
relation semantics. ROC integrates entity type and positional information
through a multimodal encoder, expands relation labels into natural language
descriptions using a large language model, and aligns entity-relation pairs via
semantic similarity-based contrastive learning. Experiments show that our
method achieves state-of-the-art performance on the benchmark datasets MNRE and
MORE and exhibits stronger robustness and interpretability.
Authors' comments: Accepted by EMNLP 2025 Main Conference
Kairui Fu, Tao Zhang, Shuwen Xiao, Ziyang Wang, Xinming Zhang, Chenchi Zhang, Yuliang Yan, Junjun Zheng et al.
Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, which typically rely on costly GR training for evaluation, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance the SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on our platform, which serves over 300 million users daily, reveals a 0.35% increase in transaction count, highlighting the practical impact of our method. Regarding the expensive SID validation accompanied by the full training of GRs, we propose two novel metrics of SID that correlate positively with recommendation performance, enabling convenient evaluations without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half. The code and data are available at https://github.com/selous123/al_sid.
Guo Chen, Qiuyuan Li, Qiuxian Li, Hongliang Dai, Xiang Chen, Piji Li
In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the proposed approach can generate high-quality and more readable citations.
Meng Yuan, Justin Zobel
A range of approaches have been proposed for estimating the accuracy or robustness of the measured performance of IR methods. One is to use bootstrapping of test sets, which, as we confirm, provides an estimate of variation in performance. For IR methods that rely on a seed, such as those that involve machine learning, another approach is to use a random set of seeds to examine performance variation. Using three different IR tasks we have used such randomness to examine a range of traditional statistical learning models and transformer-based learning models. While the statistical models are stable, the transformer models show huge variation as seeds are changed. In 9 of 11 cases the F1-scores (in the range 0.0-1.0) had a standard deviation of over 0.075, while 7 of 11 precision values (also in the range 0.0-1.0) had a standard deviation of over 0.125. This is in a context where differences of less than 0.02 have been used as evidence of method improvement. Our findings highlight the vulnerability of transformer models to training instabilities and moreover raise questions about the reliability of previous results, thus underscoring the need for rigorous evaluation practices.
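The two sources of variation discussed in this abstract, test-set bootstrapping and seed changes, can each be estimated with a few lines of code. The sketch below is a generic illustration, not the authors' experimental protocol; the per-query scores, the per-seed F1 values, and the names `bootstrap_std` and `seed_std` are hypothetical.

```python
import random
import statistics
from typing import List


def bootstrap_std(per_query_scores: List[float],
                  n_resamples: int = 1000,
                  seed: int = 0) -> float:
    """Resample the test set with replacement and return the standard
    deviation of the mean score across resamples."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = []
    for _ in range(n_resamples):
        sample = [per_query_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.pstdev(means)


def seed_std(score_per_seed: List[float]) -> float:
    """Sample standard deviation of one metric over several training seeds."""
    return statistics.stdev(score_per_seed)


if __name__ == "__main__":
    # Hypothetical per-query scores for one trained model.
    print(bootstrap_std([0.0, 1.0] * 5))
    # Hypothetical F1 of the same method trained with three different seeds.
    print(seed_std([0.6, 0.7, 0.8]))
```

A seed-level standard deviation that is larger than the improvement being claimed (for example 0.1 against a reported gain of 0.02) is the situation the abstract warns about.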
Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Clinical diagnosis is a highly specialized discipline requiring both domain
expertise and strict adherence to rigorous guidelines. While current AI-driven
medical research predominantly focuses on knowledge graphs or natural text
pretraining paradigms to incorporate medical knowledge, these approaches
primarily rely on implicitly encoded knowledge within model parameters,
neglecting task-specific knowledge required by diverse downstream tasks. To
address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a
novel framework that explicitly injects external knowledge into multimodal
models directly on downstream tasks. Specifically, RAD operates through three
key mechanisms: retrieval and refinement of disease-centered knowledge from
multiple medical sources, a guideline-enhanced contrastive loss that constrains
the latent distance between multi-modal features and guideline knowledge, and
the dual transformer decoder that employs guidelines as queries to steer
cross-modal fusion, aligning the models with clinical diagnostic workflows from
guideline acquisition to feature extraction and decision-making. Moreover,
recognizing the lack of quantitative evaluation of interpretability for
multimodal diagnostic models, we introduce a set of criteria to assess the
interpretability from both image and text perspectives. Extensive evaluations
across four datasets with different anatomies demonstrate RAD's
generalizability, achieving state-of-the-art performance. Furthermore, RAD
enables the model to concentrate more precisely on abnormal regions and
critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code
is available at https://github.com/tdlhl/RAD.
Authors' comments: Accepted to NeurIPS 2025
Dimitrios Siskos, Stavros Papadopoulos, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Anastasios Drosou
This work investigates retrieval augmented generation as an efficient
strategy for automatic context discovery in context-aware Automatic Speech
Recognition (ASR) systems, in order to improve transcription accuracy in the
presence of rare or out-of-vocabulary terms. However, identifying the right
context automatically remains an open challenge. This work proposes an
efficient embedding-based retrieval approach for automatic context discovery in
ASR. To contextualize its effectiveness, two alternatives based on large
language models (LLMs) are also evaluated: (1) LLM-based
context generation via prompting, and (2) post-recognition transcript
correction using LLMs. Experiments on the TED-LIUMv3, Earnings21, and SPGISpeech datasets
demonstrate that the proposed approach reduces WER by up to 17% (percentage
difference) relative to using no-context, while the oracle context results in a
reduction of up to 24.1%.
Authors' comments: Accepted at EMNLP 2025
Qiao Xiao, Hong Ting Tsang, Jiaxin Bai
Graph-based Retrieval-augmented generation (RAG) has become a widely studied
approach for improving the reasoning, accuracy, and factuality of Large
Language Models. However, many existing graph-based RAG systems overlook the
high cost associated with LLM token usage during graph construction, hindering
large-scale adoption. To address this, we propose TERAG, a simple yet effective
framework designed to build informative graphs at a significantly lower cost.
Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the
retrieval phase, and we achieve at least 80% of the accuracy of widely used
graph-based RAG methods while consuming only 3%-11% of the output tokens.
Authors' comments: 16 pages, 2 figures, 4 tables. Submitted to the 2026 18th
International Conference on Machine Learning and Computing (ICMLC 2026),
under review
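Personalized PageRank, which the TERAG abstract uses during retrieval, can be computed with simple power iteration. The sketch below is the textbook algorithm on a toy graph, not TERAG's implementation: the adjacency-dict graph format, the damping factor of 0.85, the handling of dangling nodes, and the name `personalized_pagerank` are assumptions.

```python
from typing import Dict, List, Set


def personalized_pagerank(graph: Dict[str, List[str]],
                          seeds: Set[str],
                          damping: float = 0.85,
                          iters: int = 50) -> Dict[str, float]:
    """Power-iteration PPR. `graph` maps each node to its out-neighbors
    (every neighbor must also be a key); restarts go only to `seeds`."""
    if not seeds or not seeds <= graph.keys():
        raise ValueError("seeds must be a non-empty subset of the graph nodes")
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Restart mass flows back to the seed nodes on every step.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if neighbors:
                share = damping * rank[n] / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


if __name__ == "__main__":
    # Hypothetical graph; the query is assumed to mention entity "a".
    toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    scores = personalized_pagerank(toy, seeds={"a"})
    print(sorted(scores, key=scores.get, reverse=True))
```

Seeding the restart distribution with the nodes matched by the query biases the ranking toward their neighborhood, which is why PPR is a natural fit for graph-based retrieval.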
Bo Xiong, Linghao Zhang, Chong Wang, Peng Liang
Commit messages play a key role in documenting the intent behind code
changes. However, they are often low-quality, vague, or incomplete, limiting
their usefulness. Commit Message Generation (CMG) aims to automatically
generate descriptive commit messages from code diffs to reduce developers'
effort and improve message quality. Although recent advances in LLMs have shown
promise in automating CMG, their performance remains limited. This paper aims
to enhance CMG performance by retrieving similar diff-message pairs to guide
LLMs to generate commit messages that are more precise and informative. We
propose CoRaCMG, a Contextual Retrieval-augmented framework for Commit Message
Generation, structured in three phases: (1) Retrieve: retrieving similar
diff-message pairs; (2) Augment: combining them with the query diff into a
structured prompt; and (3) Generate: generating commit messages corresponding
to the query diff via LLMs. CoRaCMG enables LLMs to learn project-specific
terminologies and writing styles from the retrieved diff-message pairs, thereby
producing high-quality commit messages. We evaluated our method on various
LLMs, including closed-source GPT models and open-source DeepSeek models.
Experimental results show that CoRaCMG significantly boosts LLM performance
across four metrics (BLEU, Rouge-L, METEOR, and CIDEr). Specifically,
DeepSeek-R1 achieves relative improvements of 76% in BLEU and 71% in CIDEr when
augmented with a single retrieved example pair. After incorporating the single
example pair, GPT-4o achieves the highest improvement rate, with BLEU
increasing by 89%. Moreover, performance gains plateau after more than three
examples are used, indicating diminishing returns. Further analysis shows that
the improvements are attributed to the model's ability to capture the
terminologies and writing styles of human-written commit messages from the
retrieved example pairs.
Authors' comments: 15 pages, 4 images, 6 tables, Manuscript submitted to a Journal
(2025)
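The Retrieve and Augment phases described in the CoRaCMG abstract can be sketched as follows. This is a simplified illustration under stated assumptions: token-level Jaccard overlap stands in for whatever retriever the authors use, the prompt layout is invented for the example, and the names `jaccard`, `retrieve`, and `build_prompt` are hypothetical. The Generate phase would pass the resulting prompt to an LLM and is left out.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (diff, commit message)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two diffs (stand-in similarity measure)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query_diff: str, corpus: List[Pair], k: int = 1) -> List[Pair]:
    """Retrieve: return the k diff-message pairs most similar to the query."""
    ranked = sorted(corpus, key=lambda pair: jaccard(query_diff, pair[0]),
                    reverse=True)
    return ranked[:k]


def build_prompt(query_diff: str, examples: List[Pair]) -> str:
    """Augment: combine retrieved pairs and the query diff into one prompt."""
    parts = [f"Diff:\n{diff}\nMessage:\n{message}\n" for diff, message in examples]
    parts.append(f"Diff:\n{query_diff}\nMessage:\n")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical project history.
    history = [
        ("- foo = 1\n+ foo = 2", "Update foo default"),
        ("+ import os", "Add os import"),
    ]
    query = "- foo = 2\n+ foo = 3"
    print(build_prompt(query, retrieve(query, history, k=1)))
```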
Yuval Kern, Ido Nisim, Michael Birk, Andrei Rasputnyi, Doron Behar, Zhaopin Chen, Ido Kaminer, Pavel Sidorenko et al.
Bright squeezed vacuum (BSV) is an intense quantum state of light with zero
mean electric field and huge photon number fluctuations, sufficiently intense
to drive extreme nonlinear processes and imprint nonclassical statistics.
However, the temporal structure of single BSV shots has not been fully
characterized. Here, we retrieve the spectral and temporal pulse
characteristics of a set of single-peak BSV shots. To do so, we realize
a femtosecond BSV source at 1040 nm with a single spatial mode and perform
single-shot spectral interferometry with a fully characterized coherent-state
reference pulse. Our approach reveals that the group delay is consistent
between the various shots, resulting in an average pulse duration of 27.2 fs,
much shorter than the pump pulse, and a variation of 5.5 fs (standard
deviation). We also observe a characteristic nodal structure in the spectral
interferograms, demonstrating the BSV's random phase ambiguity of $\pi$ rad.
Our approach demonstrates that BSV is a viable source of femtosecond light
pulses for attosecond sub-cycle metrology of ultrafast electron dynamics.
Authors' comments: 5 pages, 4 figures, Supplemental Document available on request
Lvzhou Luo, Yixuan Cao, Ping Luo
Retrieval-augmented generation improves the factual accuracy of Large
Language Models (LLMs) by incorporating external context, but often suffers
from irrelevant retrieved content that hinders effectiveness. Context
compression addresses this issue by filtering out irrelevant information from
context before LLM generation. However, existing methods struggle to adaptively
adjust compression rates for different contexts, maintain low latency, and
integrate information across multiple documents. To overcome these limitations,
we introduce AttnComp, an adaptive, efficient and context-aware compression
framework. By leveraging the attention mechanism of LLMs to identify relevant
information, AttnComp employs a Top-P compression algorithm to retain the
minimal set of documents whose cumulative attention weights exceed a
predefined threshold. In addition to compression, AttnComp estimates response
confidence by assessing the overall relevance of the retrieved content,
enabling users to gauge response reliability. Experiments demonstrate that
AttnComp outperforms existing compression methods and uncompressed baselines,
achieving higher accuracy with substantial compression rates and lower latency.
Authors' comments: Accepted at EMNLP 2025 (Findings)
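The Top-P selection rule described in the AttnComp abstract, keeping the minimal set of documents whose cumulative attention weight exceeds a threshold, can be sketched as below. This is a minimal sketch, not the authors' code: it assumes one non-negative attention weight per document is already available, normalizes the weights to sum to 1, and uses the hypothetical name `top_p_select`.

```python
from typing import List


def top_p_select(weights: List[float], p: float) -> List[int]:
    """Return indices of the smallest set of documents whose normalized
    cumulative weight exceeds p, taking the heaviest documents first."""
    total = sum(weights)
    if total <= 0:
        return []
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative > p:
            break
    return sorted(kept)  # restore the original document order


if __name__ == "__main__":
    # Hypothetical attention mass assigned to four retrieved documents.
    print(top_p_select([1, 5, 3, 1], p=0.7))
```

Because the cutoff depends on how the attention mass is distributed, the number of retained documents adapts to each query: a single dominant document can satisfy the threshold alone, while evenly spread weights keep more documents.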