Taiga Sasaki, Takehiro Yamamoto, Hiroaki Ohshima, Sumio Fujita
In this study, we evaluate the effect of model merging in ad-hoc retrieval
tasks. Model merging is a technique that combines the diverse characteristics
of multiple models. We hypothesized that applying model merging to
domain-specific ad-hoc retrieval tasks could improve retrieval effectiveness.
To verify this hypothesis, we merged the weights of a source retrieval model
and a domain-specific (non-retrieval) model using a linear interpolation
approach. A key advantage of our approach is that it requires no additional
fine-tuning of the models. We conducted two experiments each in the medical and
Japanese domains. The first compared the merged model with the source retrieval
model, and the second compared it with a LoRA fine-tuned model under both full
and limited data settings for model construction. The experimental results
indicate that model merging has the potential to produce more effective
domain-specific retrieval models than the source retrieval model, and may serve
as a practical alternative to LoRA fine-tuning, particularly when only a
limited amount of data is available.
Authors' comments: Accepted at CIKM 2025, 5 pages
Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad, Sheng Di, Zirui Liu, Ali Anwar
Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable ``thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
Lei Hei, Tingjing Liao, Yingxin Pei, Yiyang Qi, Jiaqi Wang, Ruiting Li, Feiliang Ren
Relation extraction (RE) aims to identify semantic relations between entities
in unstructured text. Although recent work extends traditional RE to multimodal
scenarios, most approaches still adopt classification-based paradigms with
fused multimodal features, representing relations as discrete labels. This
paradigm has two significant limitations: (1) it overlooks structural
constraints like entity types and positional cues, and (2) it lacks semantic
expressiveness for fine-grained relation understanding. We propose
\underline{R}etrieval \underline{O}ver \underline{C}lassification (ROC), a
novel framework that reformulates multimodal RE as a retrieval task driven by
relation semantics. ROC integrates entity type and positional information
through a multimodal encoder, expands relation labels into natural language
descriptions using a large language model, and aligns entity-relation pairs via
semantic similarity-based contrastive learning. Experiments show that our
method achieves state-of-the-art performance on the benchmark datasets MNRE and
MORE and exhibits stronger robustness and interpretability.
Authors' comments: Accepted by EMNLP 2025 Main Conference
Kairui Fu, Tao Zhang, Shuwen Xiao, Ziyang Wang, Xinming Zhang, Chenchi Zhang, Yuliang Yan, Junjun Zheng et al.
Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, which typically rely on costly GR training for evaluation, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance the SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on our platform, which serves over 300 million users daily, reveals a 0.35% increase in transaction count, highlighting the practical impact of our method. Regarding the expensive SID validation accompanied by the full training of GRs, we propose two novel metrics of SID that correlate positively with recommendation performance, enabling convenient evaluations without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half. The code and data are available at https://github.com/selous123/al_sid.
Guo Chen, Qiuyuan Li, Qiuxian Li, Hongliang Dai, Xiang Chen, Piji Li
In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the propose approach can generate high-quality and more readable citations.
Meng Yuan, Justin Zobel
A range of approaches have been proposed for estimating the accuracy or robustness of the measured performance of IR methods. One is to use bootstrapping of test sets, which, as we confirm, provides an estimate of variation in performance. For IR methods that rely on a seed, such as those that involve machine learning, another approach is to use a random set of seeds to examine performance variation. Using three different IR tasks we have used such randomness to examine a range of traditional statistical learning models and transformer-based learning models. While the statistical models are stable, the transformer models show huge variation as seeds are changed. In 9 of 11 cases the F1-scores (in the range 0.0--1.0) had a standard deviation of over 0.075; while 7 of 11 precision values (also in the range 0.0--1.0) had a standard deviation of over 0.125. This is in a context where differences of less than 0.02 have been used as evidence of method improvement. Our findings highlight the vulnerability of transformer models to training instabilities and moreover raise questions about the reliability of previous results, thus underscoring the need for rigorous evaluation practices.
Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Clinical diagnosis is a highly specialized discipline requiring both domain
expertise and strict adherence to rigorous guidelines. While current AI-driven
medical research predominantly focuses on knowledge graphs or natural text
pretraining paradigms to incorporate medical knowledge, these approaches
primarily rely on implicitly encoded knowledge within model parameters,
neglecting task-specific knowledge required by diverse downstream tasks. To
address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a
novel framework that explicitly injects external knowledge into multimodal
models directly on downstream tasks. Specifically, RAD operates through three
key mechanisms: retrieval and refinement of disease-centered knowledge from
multiple medical sources, a guideline-enhanced contrastive loss that constrains
the latent distance between multi-modal features and guideline knowledge, and
the dual transformer decoder that employs guidelines as queries to steer
cross-modal fusion, aligning the models with clinical diagnostic workflows from
guideline acquisition to feature extraction and decision-making. Moreover,
recognizing the lack of quantitative evaluation of interpretability for
multimodal diagnostic models, we introduce a set of criteria to assess the
interpretability from both image and text perspectives. Extensive evaluations
across four datasets with different anatomies demonstrate RAD's
generalizability, achieving state-of-the-art performance. Furthermore, RAD
enables the model to concentrate more precisely on abnormal regions and
critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code
is available at https://github.com/tdlhl/RAD.
Authors' comments: Accepted to NeurIPS 2025
Dimitrios Siskos, Stavros Papadopoulos, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Anastasios Drosou
This work investigates retrieval augmented generation as an efficient
strategy for automatic context discovery in context-aware Automatic Speech
Recognition (ASR) system, in order to improve transcription accuracy in the
presence of rare or out-of-vocabulary terms. However, identifying the right
context automatically remains an open challenge. This work proposes an
efficient embedding-based retrieval approach for automatic context discovery in
ASR. To contextualize its effectiveness, two alternatives based on large
language models (LLMs) are also evaluated: (1) large language model (LLM)-based
context generation via prompting, and (2) post-recognition transcript
correction using LLMs. Experiments on the TED-LIUMv3, Earnings21 and SPGISpeech
demonstrate that the proposed approach reduces WER by up to 17% (percentage
difference) relative to using no-context, while the oracle context results in a
reduction of up to 24.1%.
Authors' comments: Accepted at EMNLP 2025
Qiao Xiao, Hong Ting Tsang, Jiaxin Bai
Graph-based Retrieval-augmented generation (RAG) has become a widely studied
approach for improving the reasoning, accuracy, and factuality of Large
Language Models. However, many existing graph-based RAG systems overlook the
high cost associated with LLM token usage during graph construction, hindering
large-scale adoption. To address this, we propose TERAG, a simple yet effective
framework designed to build informative graphs at a significantly lower cost.
Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the
retrieval phase, and we achieve at least 80% of the accuracy of widely used
graph-based RAG methods while consuming only 3%-11% of the output tokens.
Authors' comments: 16 pages, 2 figures, 4 tables. Submitted to the 2026 18th
International Conference on Machine Learning and Computing (ICMLC 2026),
under review
Bo Xiong, Linghao Zhang, Chong Wang, Peng Liang
Commit messages play a key role in documenting the intent behind code
changes. However, they are often low-quality, vague, or incomplete, limiting
their usefulness. Commit Message Generation (CMG) aims to automatically
generate descriptive commit messages from code diffs to reduce developers'
effort and improve message quality. Although recent advances in LLMs have shown
promise in automating CMG, their performance remains limited. This paper aims
to enhance CMG performance by retrieving similar diff-message pairs to guide
LLMs to generate commit messages that are more precise and informative. We
proposed CoRaCMG, a Contextual Retrieval-augmented framework for Commit Message
Generation, structured in three phases: (1) Retrieve: retrieving the similar
diff-message pairs; (2) Augment: combining them with the query diff into a
structured prompt; and (3) Generate: generating commit messages corresponding
to the query diff via LLMs. CoRaCMG enables LLMs to learn project-specific
terminologies and writing styles from the retrieved diff-message pairs, thereby
producing high-quality commit messages. We evaluated our method on various
LLMs, including closed-source GPT models and open-source DeepSeek models.
Experimental results show that CoRaCMG significantly boosts LLM performance
across four metrics (BLEU, Rouge-L, METEOR, and CIDEr). Specifically,
DeepSeek-R1 achieves relative improvements of 76% in BLEU and 71% in CIDEr when
augmented with a single retrieved example pair. After incorporating the single
example pair, GPT-4o achieves the highest improvement rate, with BLEU
increasing by 89%. Moreover, performance gains plateau after more than three
examples are used, indicating diminishing returns. Further analysis shows that
the improvements are attributed to the model's ability to capture the
terminologies and writing styles of human-written commit messages from the
retrieved example pairs.
Authors' comments: 15 pages, 4 images, 6 tables, Manuscript submitted to a Journal
(2025)
Yuval Kern, Ido Nisim, Michael Birk, Andrei Rasputnyi, Doron Behar, Zhaopin Chen, Ido Kaminer, Pavel Sidorenko et al.
Bright squeezed vacuum (BSV) is an intense quantum state of light with zero
mean electric field and huge photon number fluctuations, sufficiently intense
to drive extreme nonlinear processes and imprint nonclassical statistics.
However, the temporal structure of single BSV shots has not been fully
characterized. Here, we retrieve the spectral and temporal pulse
characteristics of a set of single-peak BSV shots. It is obtained by realizing
a femtosecond BSV source at 1040 nm with a single spatial mode and perform
single-shot spectral interferometry with a fully characterized coherent-state
reference pulse. Our approach reveals that the group delay is consistent
between the various shots, resulting in an average pulse duration of 27.2 fs,
much shorter than the pump pulse, and a variation of 5.5 fs (standard
deviation). We also observe a characteristic nodal structure in the spectral
interferograms, demonstrating the BSV's random phase ambiguity of $\pi$ rad.
Our approach demonstrates that BSV is a viable source of femtosecond light
pulses for attosecond sub-cycle metrology of ultrafast electron dynamics.
Authors' comments: 5 pages, 4 figures, Supplemental Document available on request
Lvzhou Luo, Yixuan Cao, Ping Luo
Retrieval-augmented generation improves the factual accuracy of Large
Language Models (LLMs) by incorporating external context, but often suffers
from irrelevant retrieved content that hinders effectiveness. Context
compression addresses this issue by filtering out irrelevant information from
context before LLM generation. However, existing methods struggle to adaptively
adjust compression rates for different context, maintain low latency and
integrate information across multiple documents. To overcome these limitations,
We introduce AttnComp, an adaptive, efficient and context-aware compression
framework. By leveraging the attention mechanism of LLMs to identify relevant
information, AttnComp employs a Top-P compression algorithm to retain the
minimal set of documents whose cumulative attention weights exceeds a
predefined threshold. In addition to compression, AttnComp estimates response
confidence by assessing the overall relevance of the retrieved content,
enabling users to gauge response reliability. Experiments demonstrate that
AttnComp outperforms existing compression methods and uncompressed baselines,
achieving higher accuracy with substantial compression rates and lower latency.
Authors' comments: Accepted at EMNLP 2025 (Findings)
Tianyuan Li, Lei Wang, Ahtamjan Ahmat, Yating Yang, Bo Ma, Rui Dong, Bangju Han
Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability.To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy, prompting the model to produce a one-sentence explanation alongside each identifier serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
Dayu Yang, Hui Fang
Connecting conversation with external domain knowledge is vital for conversational recommender systems (CRS) to correctly understand user preferences. However, existing solutions either require domain-specific engineering, which limits flexibility, or rely solely on large language models, which increases the risk of hallucination. While Retrieval-Augmented Generation (RAG) holds promise, its naive use in CRS is hindered by noisy dialogues that weaken retrieval and by overlooked nuances among similar items. We propose ReGeS, a reciprocal Retrieval-Generation Synergy framework that unifies generation-augmented retrieval to distill informative user intent from conversations and retrieval-augmented generation to differentiate subtle item features. This synergy obviates the need for extra annotations, reduces hallucinations, and simplifies continuous updates. Experiments on multiple CRS benchmarks show that ReGeS achieves state-of-the-art performance in recommendation accuracy, demonstrating the effectiveness of reciprocal synergy for knowledge-intensive CRS tasks.
Authors' comments: Accepted by WISE 2025: 26th International Web Information Systems Engineering conference. Our code is publicly available at the link: https://github.com/dayuyang1999/ReGeS
Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
Shunyu Liu, Guangdong Bai, Mark Utting, Guowei Yang
Automated Program Repair (APR) has emerged as a promising paradigm for
reducing debugging time and improving the overall efficiency of software
development. Recent advances in Large Language Models (LLMs) have demonstrated
their potential for automated bug fixing and other software engineering tasks.
Nevertheless, the general-purpose nature of LLM pre-training means these models
often lack the capacity to perform project-specific repairs, which require
understanding of domain-specific identifiers, code structures, and contextual
relationships within a particular codebase. As a result, LLMs may struggle to
generate correct patches when the repair depends on project-specific
information.
To address this limitation, we introduce RelRepair, a novel approach that
retrieves relevant project-specific code to enhance automated program repair.
RelRepair first identifies relevant function signatures by analyzing function
names and code comments within the project. It then conducts deeper code
analysis to retrieve code snippets relevant to the repair context. The
retrieved relevant information is then incorporated into the LLM's input
prompt, guiding the model to generate more accurate and informed patches. We
evaluate RelRepair on two widely studied datasets, Defects4J V1.2 and
ManySStuBs4J, and compare its performance against several state-of-the-art
LLM-based APR approaches. RelRepair successfully repairs 101 bugs in Defects4J
V1.2. Furthermore, RelRepair achieves a 17.1\% improvement in the ManySStuBs4J
dataset, increasing the overall fix rate to 48.3\%. These results highlight the
importance of providing relevant project-specific information to LLMs, shedding
light on effective strategies for leveraging LLMs in APR tasks.
Authors' comments: 11 pages, 5 figures, under review at TSE
Hyun Jun Kim, Hyeong Yong Choi, Changwon Lim
This report presents the AISTAT team's submission to the language-based audio
retrieval task in DCASE 2025 Task 6. Our proposed system employs dual encoder
architecture, where audio and text modalities are encoded separately, and their
representations are aligned using contrastive learning. Drawing inspiration
from methodologies of the previous year's challenge, we implemented a
distillation approach and leveraged large language models (LLMs) for effective
data augmentation techniques, including back-translation and LLM mix.
Additionally, we incorporated clustering to introduce an auxiliary
classification task for further finetuning. Our best single system achieved a
mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on
the Clotho development test split.
Authors' comments: 5 pages, 1 figure, DCASE2025 Task2 technical report
Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Qinhong Yang, Tao Gong, Qi Chu, Mang Ye et al.
In real applications, person re-identification (ReID) is expected to retrieve
the target person at any time, including both daytime and nighttime, ranging
from short-term to long-term. However, existing ReID tasks and datasets can not
meet this requirement, as they are constrained by available time and only
provide training and evaluation for specific scenarios. Therefore, we
investigate a new task called Anytime Person Re-identification (AT-ReID), which
aims to achieve effective retrieval in multiple scenarios based on variations
in time. To address the AT-ReID problem, we collect the first large-scale
dataset, AT-USTC, which contains 403k images of individuals wearing multiple
clothes captured by RGB and IR cameras. Our data collection spans 21 months,
and 270 volunteers were photographed on average 29.1 times across different
dates or scenes, 4-15 times more than current datasets, providing conditions
for follow-up investigations in AT-ReID. Further, to tackle the new challenge
of multi-scenario retrieval, we propose a unified model named Uni-AT, which
comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific
features learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate
inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW)
strategy to ensure balanced training across all scenarios. Extensive
experiments show that our model leads to satisfactory results and exhibits
excellent generalization to all scenarios.
Authors' comments: Accepted by IJCAI 2025
Jialin Chen, Houyu Zhang, Seongjun Yun, Alejandro Mottini, Rex Ying, Xiang Song, Vassilis N. Ioannidis, Zheng Li et al.
Retrieval-Augmented Generation (RAG) has significantly mitigated the hallucinations of Large Language Models (LLMs) by grounding the generation with external knowledge. Recent extensions of RAG to graph-based retrieval offer a promising direction, leveraging the structural knowledge for multi-hop reasoning. However, existing graph RAG typically decouples retrieval and reasoning processes, which prevents the retriever from adapting to the reasoning needs of the LLM. They also struggle with scalability when performing multi-hop expansion over large-scale graphs, or depend heavily on annotated ground-truth entities, which are often unavailable in open-domain settings. To address these challenges, we propose a novel graph retriever trained end-to-end with LLM, which features an attention-based growing and pruning mechanism, adaptively navigating multi-hop relevant entities while filtering out noise. Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together, thereby enhancing its reasoning capability and facilitating interactive joint training of the graph retriever and the LLM reasoner. Experimental results across three QA benchmarks show that our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks. Notably, our framework eliminates the need for predefined ground-truth entities by directly optimizing the retriever using LLM logits as implicit feedback, making it especially effective in open-domain settings.
Jongyeop Hyun, Bumsoo Kim
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE's potential for enhancing multimodal reasoning.
Authors' comments: Accepted at EMNLP 2025 main conference