Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta
If-then rules are widely used to explain machine learning models; e.g., "if employed = no, then loan application = rejected." We present the first proposal to apply rules to explain the emerging class of large language models (LLMs) with retrieval-augmented generation (RAG). Since RAG enables LLM systems to incorporate retrieved information sources at inference time, rules linking the presence or absence of sources can explain output provenance; e.g., "if a Times Higher Education ranking article is retrieved, then the LLM ranks Oxford first." To generate such rules, a brute force approach would probe the LLM with all source combinations and check if the presence or absence of any sources leads to the same output. We propose optimizations to speed up rule generation, inspired by Apriori-like pruning from frequent itemset mining but redefined within the scope of our novel problem. We conclude with qualitative and quantitative experiments demonstrating our solutions' value and efficiency.
Xuan Lu, Haohang Huang, Rui Meng, Yaohui Jin, Wenjun Zeng, Xiaoyu Shen
Large Language Models (LLMs) have recently demonstrated strong capabilities in tool use, yet progress in tool retrieval remains hindered by incomplete and heterogeneous tool documentation. To address this challenge, we introduce Tool-DE, a new benchmark and framework that systematically enriches tool documentation with structured fields to enable more effective tool retrieval, together with two dedicated models, Tool-Embed and Tool-Rank. We design a scalable document expansion pipeline that leverages both open- and closed-source LLMs to generate, validate, and refine enriched tool profiles at low cost, producing large-scale corpora with 50k instances for embedding-based retrievers and 200k for rerankers. On top of this data, we develop two models specifically tailored for tool retrieval: Tool-Embed, a dense retriever, and Tool-Rank, an LLM-based reranker. Extensive experiments on ToolRet and Tool-DE demonstrate that document expansion substantially improves retrieval performance, with Tool-Embed and Tool-Rank achieving new state-of-the-art results on both benchmarks. We further analyze the contribution of individual fields to retrieval effectiveness, as well as the broader impact of document expansion on both training and evaluation. Overall, our findings highlight both the promise and limitations of LLM-driven document expansion, positioning Tool-DE, along with the proposed Tool-Embed and Tool-Rank, as a foundation for future research in tool retrieval.
Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli
While Retrieval-Augmented Generation (RAG) mitigates hallucination and
knowledge staleness in Large Language Models (LLMs), existing frameworks often
falter on complex, multi-hop queries that require synthesizing information from
disparate sources. Current advanced RAG methods, employing iterative or
adaptive strategies, lack a robust mechanism to systematically identify and
fill evidence gaps, often propagating noise or failing to gather a
comprehensive context. We introduce FAIR-RAG, a novel agentic framework that
transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning
process. At its core is an Iterative Refinement Cycle governed by a module we
term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating
mechanism: it deconstructs the initial query into a checklist of required
findings and audits the aggregated evidence to identify confirmed facts and,
critically, explicit informational gaps. These gaps provide a precise signal to
an Adaptive Query Refinement agent, which generates new, targeted sub-queries
to retrieve missing information. This cycle repeats until the evidence is
verified as sufficient, ensuring a comprehensive context for a final, strictly
faithful generation. We conducted experiments on challenging multi-hop QA
benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified
experimental setup, FAIR-RAG significantly outperforms strong baselines. On
HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3
points over the strongest iterative baseline -- establishing a new
state-of-the-art for this class of methods on these benchmarks. Our work
demonstrates that a structured, evidence-driven refinement process with
explicit gap analysis is crucial for unlocking reliable and accurate reasoning
in advanced RAG systems for complex, knowledge-intensive tasks.
Authors' comments: 30 pages, 5 figures, 5 tables. Keywords: Retrieval-Augmented
Generation (RAG), Large Language Models (LLMs), Agentic AI, Multi-hop
Question Answering, Faithfulness
Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu
Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.
Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components
Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, Yanfu Zhang
Large language models (LLMs) excel at factual recall yet still propagate
stale or incorrect knowledge. In-context knowledge editing offers a
gradient-free remedy suitable for black-box APIs, but current editors rely on
static demonstration sets chosen by surface-level similarity, leading to two
persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of
adaptivity to task difficulty. We address these issues by dynamically selecting
supporting demonstrations according to their utility for the edit. We propose
Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight
framework that (1) trains a BERT retriever with REINFORCE to rank
demonstrations by editing reward, and (2) employs a learnable threshold to
prune low-value examples, shortening the prompt when the edit is easy and
expanding it when the task is hard. DR-IKE performs editing without modifying
model weights, relying solely on forward passes for compatibility with
black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to
17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries,
demonstrating scalable and adaptive knowledge editing. The code is available at
https://github.com/mwnafee/DR-IKE .
Authors' comments: Accepted at EMNLP 2025. \c{opyright} 2025 Association for
Computational Linguistics (CC BY 4.0)
Julia Wilkins, Jaehun Kim, Matthew E. P. Davies, Juan Pablo Bello, Matthew C. McCallum
Music representations are the backbone of modern recommendation systems,
powering playlist generation, similarity search, and personalized discovery.
Yet most embeddings offer little control for adjusting a single musical
attribute, e.g., changing only the mood of a track while preserving its genre
or instrumentation. In this work, we address the problem of controllable music
retrieval through embedding-based transformation, where the objective is to
retrieve songs that remain similar to a seed track but are modified along one
chosen dimension. We propose a novel framework for mood-guided music embedding
transformation, which learns a mapping from a seed audio embedding to a target
embedding guided by mood labels, while preserving other musical attributes.
Because mood cannot be directly altered in the seed audio, we introduce a
sampling mechanism that retrieves proxy targets to balance diversity with
similarity to the seed. We train a lightweight translation model using this
sampling strategy and introduce a novel joint objective that encourages
transformation and information preservation. Extensive experiments on two
datasets show strong mood transformation performance while retaining genre and
instrumentation far better than training-free baselines, establishing
controllable embedding transformation as a promising paradigm for personalized
music retrieval.
Authors' comments: Preprint; under review
Fangjian Zhang, Xiaoyong Zhuge, Wenlan Wang, Haixia Xiao, Yuying Zhu, Siyang Cheng
Artificial intelligence has advanced quantitative remote sensing, yet its
effectiveness is constrained by imbalanced label distribution. This imbalance
leads conventionally trained models to favor common samples, which in turn
degrades retrieval performance for rare ones. Rainfall retrieval exemplifies
this issue, with performance particularly compromised for heavy rain. This
study proposes Hurdle-Inversion Model Debiasing Learning (IMDL) framework.
Following a divide-and-conquer strategy, imbalance in the rain distribution is
decomposed into two components: zero inflation, defined by the predominance of
non-rain samples; and long tail, defined by the disproportionate abundance of
light-rain samples relative to heavy-rain samples. A hurdle model is adopted to
handle the zero inflation, while IMDL is proposed to address the long tail by
transforming the learning object into an unbiased ideal inverse model.
Comprehensive evaluation via statistical metrics and case studies investigating
rainy weather in eastern China confirms Hurdle-IMDL's superiority over
conventional, cost-sensitive, generative, and multi-task learning methods. Its
key advancements include effective mitigation of systematic underestimation and
a marked improvement in the retrieval of heavy-to-extreme rain. IMDL offers a
generalizable approach for addressing imbalance in distributions of
environmental variables, enabling enhanced retrieval of rare yet high-impact
events.
Authors' comments: 26 pages
Louis Mozart Kamdem Teyou, Luke Friedrichs, N'Dah Jean Kouagou, Caglar Demir, Yasir Mahmood, Stefan Heindorf, Axel-Cyrille Ngonga Ngomo
Concept learning exploits background knowledge in the form of description
logic axioms to learn explainable classification models from knowledge bases.
Despite recent breakthroughs in neuro-symbolic concept learning, most
approaches still cannot be deployed on real-world knowledge bases. This is due
to their use of description logic reasoners, which are not robust against
inconsistencies nor erroneous data. We address this challenge by presenting a
novel neural reasoner dubbed EBR. Our reasoner relies on embeddings to
approximate the results of a symbolic reasoner. We show that EBR solely
requires retrieving instances for atomic concepts and existential restrictions
to retrieve or approximate the set of instances of any concept in the
description logic $\mathcal{SHOIQ}$. In our experiments, we compare EBR with
state-of-the-art reasoners. Our results suggest that EBR is robust against
missing and erroneous data in contrast to existing reasoners.
Authors' comments: Accepted as a full research paper at K-CAP 2025
Ajay Sridhar, Jennifer Pan, Satvik Sharma, Chelsea Finn
Humans routinely rely on memory to perform tasks, yet most robot policies
lack this capability; our goal is to endow robot policies with the same
ability. Naively conditioning on long observation histories is computationally
expensive and brittle under covariate shift, while indiscriminate subsampling
of history leads to irrelevant or redundant information. We propose a
hierarchical policy framework, where the high-level policy is trained to select
and track previous relevant keyframes from its experience. The high-level
policy uses selected keyframes and the most recent frames when generating text
instructions for a low-level policy to execute. This design is compatible with
existing vision-language-action (VLA) models and enables the system to
efficiently reason over long-horizon dependencies. In our experiments, we
finetune Qwen2.5-VL-7B-Instruct and $\pi_{0.5}$ as the high-level and low-level
policies respectively, using demonstrations supplemented with minimal language
annotations. Our approach, MemER, outperforms prior methods on three real-world
long-horizon robotic manipulation tasks that require minutes of memory. Videos
and code can be found at https://jen-pan.github.io/memer/.
Authors' comments: Project page: https://jen-pan.github.io/memer/
Joseph Casale, Andrew Silverschotz, Joseph DeSimone
Top-K masking schemes have been proposed as a method to promote sparse
representations in Information Retrieval (IR) tasks, as a simple alternative to
Floating Point Operations per Second (FLOPS) regularization. Algorithms such as
Bilingual Lexical and Document Expansion Model (BLADE), adopt this approach as
a post-processing stage. We propose using Top-P Dynamic Masking similar to
Nucleus Sampling in Large Language Models, and demonstrate better performance
than Top-K masking. Specifically, we evaluate our methods in the domain of
Cross Language Information Retrieval (CLIR)
Authors' comments: Unsubmitted
Daria Cherniuk, Nikita Sukhorukov, Nikita Sushko, Daniil Gusak, Danil Sivtsov, Elena Tutubalina, Evgeny Frolov
Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module we can significantly increase the EM and ES metrics of coding model with negligible latency increase. Our experiments demonstrate that compressed context enables 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
Zhengxuan Wei, Jiajin Tang, Sibei Yang
Existing Moment Retrieval methods face three critical bottlenecks: (1) data
scarcity forces models into shallow keyword-feature associations; (2) boundary
ambiguity in transition regions between adjacent events; (3) insufficient
discrimination of fine-grained semantics (e.g., distinguishing ``kicking" vs.
``throwing" a ball). In this paper, we propose a zero-external-dependency
Augmented Moment Retrieval framework, AMR, designed to overcome local optima
caused by insufficient data annotations and the lack of robust boundary and
semantic discrimination capabilities. AMR is built upon two key insights: (1)
it resolves ambiguous boundary information and semantic confusion in existing
annotations without additional data (avoiding costly manual labeling), and (2)
it preserves boundary and semantic discriminative capabilities enhanced by
training while generalizing to real-world scenarios, significantly improving
performance. Furthermore, we propose a two-stage training framework with
cold-start and distillation adaptation. The cold-start stage employs curriculum
learning on augmented data to build foundational boundary/semantic awareness.
The distillation stage introduces dual query sets: Original Queries maintain
DETR-based localization using frozen Base Queries from the cold-start model,
while Active Queries dynamically adapt to real-data distributions. A
cross-stage distillation loss enforces consistency between Original and Base
Queries, preventing knowledge forgetting while enabling real-world
generalization. Experiments on multiple benchmarks show that AMR achieves
improved performance over prior state-of-the-art approaches.
Authors' comments: This work is accepted by ICCV 2025
Zhiping Zhou, Xiaohong Li, Ruitao Feng, Yao Zhang, Yuekang Li, Wenbu Feng, Yunqian Wang, Yuqing Li
Decompilation converts machine code into human-readable form, enabling analysis and debugging without source code. However, fidelity issues often degrade the readability and semantic accuracy of decompiled output. Existing methods, such as variable renaming or structural simplification, provide partial improvements but lack robust detection and correction, particularly for complex closed-source binaries. We present FidelityGPT, a framework that enhances decompiled code accuracy and readability by systematically detecting and correcting semantic distortions. FidelityGPT introduces distortion-aware prompt templates tailored to closed-source settings and integrates Retrieval-Augmented Generation (RAG) with a dynamic semantic intensity algorithm to locate distorted lines and retrieve semantically similar code from a database. A variable dependency algorithm further mitigates long-context limitations by analyzing redundant variables and integrating their dependencies into the prompt context. Evaluated on 620 function pairs from a binary similarity benchmark, FidelityGPT achieved an average detection accuracy of 89% and a precision of 83%. Compared to the state-of-the-art DeGPT (Fix Rate 83%, Corrected Fix Rate 37%), FidelityGPT attained 94% FR and 64% CFR, demonstrating significant gains in accuracy and readability. These results highlight its potential to advance LLM-based decompilation and reverse engineering.
Jaya Krishna Mandivarapu
Learning a set of tasks over time, also known as continual learning (CL), is one of the most challenging problems in artificial intelligence due to catastrophic forgetting. Large language models (LLMs) are often impractical to frequent re-training and continual learning , due to high cost of computational resources for training. Moreover, LLM are not suitable for continual learning as updating these models over time for acquiring new knowledge leads to overwrites existing knowledge leading to common phenomenon know as \textit{catastrophic forgetting}. In this paper, we aim to address these concerns using a novel framework , COLA that employs an autoencoder to learn capture low-dimensional embeddings of the weights associated with various tasks. Our approach facilitates the transfer of knowledge to new tasks while preventing catastrophic forgetting, all without using data replay or a substantial set of task-specific parameters. Our approach, COLA, makes the LLM efficiently learn new tasks with minimal training, insignificant performance degradation on previous tasks, and eliminates the need for retaining earlier training data. Empirical evaluation on different datasets ranging from task oriented dialouge system to intent classsfication datasets showcases that our method not only overcomes catastrophic forgetting but also achieves significant reduction in parameter usage and memory size, across multiple tasks and outperforming the existing state of the art methods across multiple datasets.
Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Sen Li, Wenjun Peng, Fuyu Lv et al.
Product search is a crucial component of modern e-commerce platforms, with
billions of user queries every day. In product search systems, first-stage
retrieval should achieve high recall while ensuring efficient online
deployment. Sparse retrieval is particularly attractive in this context due to
its interpretability and storage efficiency. However, sparse retrieval methods
suffer from severe vocabulary mismatch issues, leading to suboptimal
performance in product search scenarios.With their potential for semantic
analysis, large language models (LLMs) offer a promising avenue for mitigating
vocabulary mismatch issues and thereby improving retrieval quality. Directly
applying LLMs to sparse retrieval in product search exposes two key
challenges:(1)Queries and product titles are typically short and highly
susceptible to LLM-induced hallucinations, such as generating irrelevant
expansion terms or underweighting critical literal terms like brand names and
model numbers;(2)The large vocabulary space of LLMs leads to difficulty in
initializing training effectively, making it challenging to learn meaningful
sparse representations in such ultra-high-dimensional spaces.To address these
challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs
as SParsE Retrievers. PROSPER incorporates: (1)A literal residual network that
alleviates hallucination in lexical expansion by reinforcing underweighted
literal terms through a residual compensation mechanism; and (2)A lexical
focusing window that facilitates effective training initialization via a
coarse-to-fine sparsification strategy.Extensive offline and online experiments
show that PROSPER significantly outperforms sparse baselines and achieves
recall performance comparable to advanced dense retrievers, while also
achieving revenue increments online.
Authors' comments: 16 pages
Ji Du, Xin Wang, Fangwei Hao, Mingyang Yu, Chunyuan Chen, Jiesheng Wu, Bin Wang, Jing Xu et al.
At the core of Camouflaged Object Detection (COD) lies segmenting objects
from their highly similar surroundings. Previous efforts navigate this
challenge primarily through image-level modeling or annotation-based
optimization. Despite advancing considerably, this commonplace practice hardly
taps valuable dataset-level contextual information or relies on laborious
annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented
paradigm that exploits the entire training dataset to generate pseudo-labels
for single images, which could be used to train COD models. RISE begins by
constructing prototype libraries for environments and camouflaged objects using
training images (without ground truth), followed by K-Nearest Neighbor (KNN)
retrieval to generate pseudo-masks for each image based on these libraries. It
is important to recognize that using only training images without annotations
exerts a pronounced challenge in crafting high-quality prototype libraries. In
this light, we introduce a Clustering-then-Retrieval (CR) strategy, where
coarse masks are first generated through clustering, facilitating subsequent
histogram-based image filtering and cross-category retrieval to produce
high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect
of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which
integrates retrieval results from diverse views to produce more robust and
precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms
state-of-the-art unsupervised and prompt-based methods. Code is available at
https://github.com/xiaohainku/RISE.
Authors' comments: ICCV 2025
Felix Michalak, Steven Abreu
We demonstrate complete functional segregation in hybrid SSM-Transformer architectures: retrieval depends exclusively on self-attention layers. Across RecurrentGemma-2B/9B and Jamba-Mini-1.6, attention ablation causes catastrophic retrieval failure (0% accuracy), while SSM layers show no compensatory mechanisms even with improved prompting. Conversely, sparsifying attention to just 15% of heads maintains near-perfect retrieval while preserving 84% MMLU performance, suggesting self-attention specializes primarily for retrieval tasks. We identify precise mechanistic requirements for retrieval: needle tokens must be exposed during generation and sufficient context must be available during prefill or generation. This strict functional specialization challenges assumptions about redundancy in hybrid architectures and suggests these models operate as specialized modules rather than integrated systems, with immediate implications for architecture optimization and interpretability.
Authors' comments: 16 pages, 10 figures
Zhiming Lin
Multi turn intent understanding is central to task oriented chatbots, yet
real deployments face tight token budgets and noisy contexts, and most
retrieval pipelines emphasize relevance while overlooking set level diversity
and confounds such as more context or exemplar order. We ask whether retrieval
diversity, rather than longer prompts, systematically improves LLM intent
understanding under fixed budgets. We present a diversity aware retrieval
framework that selects in context exemplars to balance intent coverage and
linguistic variety, and integrates this selection with standard LLM decoders;
the evaluation enforces budget matched prompts and randomized positions, and
includes sensitivity analyses over exemplar count, diversity strength, and
backbone size. On MultiWOZ 2.4 and SGD, the approach achieves strong gains in
Joint Goal Accuracy under equal token budgets, surpassing strong LLM/DST
baselines, with consistent improvements across K from 4 to 7 and moderate
latency. Overall, the study isolates and validates the impact of content
diversity in retrieval and offers a simple, deployable selection principle for
building accurate, budget constrained multi turn intent systems.
Authors' comments: 15 pages,6 figs
George Ma, Anurag Koul, Qi Chen, Yawen Wu, Sachit Kuhar, Yu Yu, Aritra Sengupta, Varun Kumar et al.
Large Language Models (LLMs) excel at code-related tasks but often struggle in realistic software repositories, where project-specific APIs and cross-file dependencies are crucial. Retrieval-augmented methods mitigate this by injecting repository context at inference time. The low inference-time latency budget affects either retrieval quality or the added latency adversely impacts user experience. We address this limitation with SpecAgent, an agent that improves both latency and code-generation quality by proactively exploring repository files during indexing and constructing speculative context that anticipates future edits in each file. This indexing-time asynchrony allows thorough context computation, masking latency, and the speculative nature of the context improves code-generation quality. Additionally, we identify the problem of future context leakage in existing benchmarks, which can inflate reported performance. To address this, we construct a synthetic, leakage-free benchmark that enables a more realistic evaluation of our agent against baselines. Experiments show that SpecAgent consistently achieves absolute gains of 9-11% (48-58% relative) compared to the best-performing baselines, while significantly reducing inference latency.