Louis Mozart Kamdem Teyou, Luke Friedrichs, N'Dah Jean Kouagou, Caglar Demir, Yasir Mahmood, Stefan Heindorf, Axel-Cyrille Ngonga Ngomo
Concept learning exploits background knowledge in the form of description
logic axioms to learn explainable classification models from knowledge bases.
Despite recent breakthroughs in neuro-symbolic concept learning, most
approaches still cannot be deployed on real-world knowledge bases. This is due
to their use of description logic reasoners, which are not robust against
inconsistencies nor erroneous data. We address this challenge by presenting a
novel neural reasoner dubbed EBR. Our reasoner relies on embeddings to
approximate the results of a symbolic reasoner. We show that EBR solely
requires retrieving instances for atomic concepts and existential restrictions
to retrieve or approximate the set of instances of any concept in the
description logic $\mathcal{SHOIQ}$. In our experiments, we compare EBR with
state-of-the-art reasoners. Our results suggest that EBR is robust against
missing and erroneous data in contrast to existing reasoners.
Authors' comments: Accepted as a full research paper at K-CAP 2025
Ajay Sridhar, Jennifer Pan, Satvik Sharma, Chelsea Finn
Humans routinely rely on memory to perform tasks, yet most robot policies
lack this capability; our goal is to endow robot policies with the same
ability. Naively conditioning on long observation histories is computationally
expensive and brittle under covariate shift, while indiscriminate subsampling
of history leads to irrelevant or redundant information. We propose a
hierarchical policy framework, where the high-level policy is trained to select
and track previous relevant keyframes from its experience. The high-level
policy uses selected keyframes and the most recent frames when generating text
instructions for a low-level policy to execute. This design is compatible with
existing vision-language-action (VLA) models and enables the system to
efficiently reason over long-horizon dependencies. In our experiments, we
finetune Qwen2.5-VL-7B-Instruct and $\pi_{0.5}$ as the high-level and low-level
policies respectively, using demonstrations supplemented with minimal language
annotations. Our approach, MemER, outperforms prior methods on three real-world
long-horizon robotic manipulation tasks that require minutes of memory. Videos
and code can be found at https://jen-pan.github.io/memer/.
Authors' comments: Project page: https://jen-pan.github.io/memer/
Joseph Casale, Andrew Silverschotz, Joseph DeSimone
Top-K masking schemes have been proposed as a method to promote sparse
representations in Information Retrieval (IR) tasks, as a simple alternative to
Floating Point Operations per Second (FLOPS) regularization. Algorithms such as
Bilingual Lexical and Document Expansion Model (BLADE), adopt this approach as
a post-processing stage. We propose using Top-P Dynamic Masking similar to
Nucleus Sampling in Large Language Models, and demonstrate better performance
than Top-K masking. Specifically, we evaluate our methods in the domain of
Cross Language Information Retrieval (CLIR)
Authors' comments: Unsubmitted
Daria Cherniuk, Nikita Sukhorukov, Nikita Sushko, Daniil Gusak, Danil Sivtsov, Elena Tutubalina, Evgeny Frolov
Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module we can significantly increase the EM and ES metrics of coding model with negligible latency increase. Our experiments demonstrate that compressed context enables 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
Zhengxuan Wei, Jiajin Tang, Sibei Yang
Existing Moment Retrieval methods face three critical bottlenecks: (1) data
scarcity forces models into shallow keyword-feature associations; (2) boundary
ambiguity in transition regions between adjacent events; (3) insufficient
discrimination of fine-grained semantics (e.g., distinguishing ``kicking" vs.
``throwing" a ball). In this paper, we propose a zero-external-dependency
Augmented Moment Retrieval framework, AMR, designed to overcome local optima
caused by insufficient data annotations and the lack of robust boundary and
semantic discrimination capabilities. AMR is built upon two key insights: (1)
it resolves ambiguous boundary information and semantic confusion in existing
annotations without additional data (avoiding costly manual labeling), and (2)
it preserves boundary and semantic discriminative capabilities enhanced by
training while generalizing to real-world scenarios, significantly improving
performance. Furthermore, we propose a two-stage training framework with
cold-start and distillation adaptation. The cold-start stage employs curriculum
learning on augmented data to build foundational boundary/semantic awareness.
The distillation stage introduces dual query sets: Original Queries maintain
DETR-based localization using frozen Base Queries from the cold-start model,
while Active Queries dynamically adapt to real-data distributions. A
cross-stage distillation loss enforces consistency between Original and Base
Queries, preventing knowledge forgetting while enabling real-world
generalization. Experiments on multiple benchmarks show that AMR achieves
improved performance over prior state-of-the-art approaches.
Authors' comments: This work is accepted by ICCV 2025
Zhiping Zhou, Xiaohong Li, Ruitao Feng, Yao Zhang, Yuekang Li, Wenbu Feng, Yunqian Wang, Yuqing Li
Decompilation converts machine code into human-readable form, enabling analysis and debugging without source code. However, fidelity issues often degrade the readability and semantic accuracy of decompiled output. Existing methods, such as variable renaming or structural simplification, provide partial improvements but lack robust detection and correction, particularly for complex closed-source binaries. We present FidelityGPT, a framework that enhances decompiled code accuracy and readability by systematically detecting and correcting semantic distortions. FidelityGPT introduces distortion-aware prompt templates tailored to closed-source settings and integrates Retrieval-Augmented Generation (RAG) with a dynamic semantic intensity algorithm to locate distorted lines and retrieve semantically similar code from a database. A variable dependency algorithm further mitigates long-context limitations by analyzing redundant variables and integrating their dependencies into the prompt context. Evaluated on 620 function pairs from a binary similarity benchmark, FidelityGPT achieved an average detection accuracy of 89% and a precision of 83%. Compared to the state-of-the-art DeGPT (Fix Rate 83%, Corrected Fix Rate 37%), FidelityGPT attained 94% FR and 64% CFR, demonstrating significant gains in accuracy and readability. These results highlight its potential to advance LLM-based decompilation and reverse engineering.
Jaya Krishna Mandivarapu
Learning a set of tasks over time, also known as continual learning (CL), is one of the most challenging problems in artificial intelligence due to catastrophic forgetting. Large language models (LLMs) are often impractical to frequent re-training and continual learning , due to high cost of computational resources for training. Moreover, LLM are not suitable for continual learning as updating these models over time for acquiring new knowledge leads to overwrites existing knowledge leading to common phenomenon know as \textit{catastrophic forgetting}. In this paper, we aim to address these concerns using a novel framework , COLA that employs an autoencoder to learn capture low-dimensional embeddings of the weights associated with various tasks. Our approach facilitates the transfer of knowledge to new tasks while preventing catastrophic forgetting, all without using data replay or a substantial set of task-specific parameters. Our approach, COLA, makes the LLM efficiently learn new tasks with minimal training, insignificant performance degradation on previous tasks, and eliminates the need for retaining earlier training data. Empirical evaluation on different datasets ranging from task oriented dialouge system to intent classsfication datasets showcases that our method not only overcomes catastrophic forgetting but also achieves significant reduction in parameter usage and memory size, across multiple tasks and outperforming the existing state of the art methods across multiple datasets.
Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Sen Li, Wenjun Peng, Fuyu Lv et al.
Product search is a crucial component of modern e-commerce platforms, with
billions of user queries every day. In product search systems, first-stage
retrieval should achieve high recall while ensuring efficient online
deployment. Sparse retrieval is particularly attractive in this context due to
its interpretability and storage efficiency. However, sparse retrieval methods
suffer from severe vocabulary mismatch issues, leading to suboptimal
performance in product search scenarios.With their potential for semantic
analysis, large language models (LLMs) offer a promising avenue for mitigating
vocabulary mismatch issues and thereby improving retrieval quality. Directly
applying LLMs to sparse retrieval in product search exposes two key
challenges:(1)Queries and product titles are typically short and highly
susceptible to LLM-induced hallucinations, such as generating irrelevant
expansion terms or underweighting critical literal terms like brand names and
model numbers;(2)The large vocabulary space of LLMs leads to difficulty in
initializing training effectively, making it challenging to learn meaningful
sparse representations in such ultra-high-dimensional spaces.To address these
challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs
as SParsE Retrievers. PROSPER incorporates: (1)A literal residual network that
alleviates hallucination in lexical expansion by reinforcing underweighted
literal terms through a residual compensation mechanism; and (2)A lexical
focusing window that facilitates effective training initialization via a
coarse-to-fine sparsification strategy.Extensive offline and online experiments
show that PROSPER significantly outperforms sparse baselines and achieves
recall performance comparable to advanced dense retrievers, while also
achieving revenue increments online.
Authors' comments: 16 pages
Ji Du, Xin Wang, Fangwei Hao, Mingyang Yu, Chunyuan Chen, Jiesheng Wu, Bin Wang, Jing Xu et al.
At the core of Camouflaged Object Detection (COD) lies segmenting objects
from their highly similar surroundings. Previous efforts navigate this
challenge primarily through image-level modeling or annotation-based
optimization. Despite advancing considerably, this commonplace practice hardly
taps valuable dataset-level contextual information or relies on laborious
annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented
paradigm that exploits the entire training dataset to generate pseudo-labels
for single images, which could be used to train COD models. RISE begins by
constructing prototype libraries for environments and camouflaged objects using
training images (without ground truth), followed by K-Nearest Neighbor (KNN)
retrieval to generate pseudo-masks for each image based on these libraries. It
is important to recognize that using only training images without annotations
exerts a pronounced challenge in crafting high-quality prototype libraries. In
this light, we introduce a Clustering-then-Retrieval (CR) strategy, where
coarse masks are first generated through clustering, facilitating subsequent
histogram-based image filtering and cross-category retrieval to produce
high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect
of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which
integrates retrieval results from diverse views to produce more robust and
precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms
state-of-the-art unsupervised and prompt-based methods. Code is available at
https://github.com/xiaohainku/RISE.
Authors' comments: ICCV 2025
Felix Michalak, Steven Abreu
We demonstrate complete functional segregation in hybrid SSM-Transformer architectures: retrieval depends exclusively on self-attention layers. Across RecurrentGemma-2B/9B and Jamba-Mini-1.6, attention ablation causes catastrophic retrieval failure (0% accuracy), while SSM layers show no compensatory mechanisms even with improved prompting. Conversely, sparsifying attention to just 15% of heads maintains near-perfect retrieval while preserving 84% MMLU performance, suggesting self-attention specializes primarily for retrieval tasks. We identify precise mechanistic requirements for retrieval: needle tokens must be exposed during generation and sufficient context must be available during prefill or generation. This strict functional specialization challenges assumptions about redundancy in hybrid architectures and suggests these models operate as specialized modules rather than integrated systems, with immediate implications for architecture optimization and interpretability.
Authors' comments: 16 pages, 10 figures
Zhiming Lin
Multi turn intent understanding is central to task oriented chatbots, yet
real deployments face tight token budgets and noisy contexts, and most
retrieval pipelines emphasize relevance while overlooking set level diversity
and confounds such as more context or exemplar order. We ask whether retrieval
diversity, rather than longer prompts, systematically improves LLM intent
understanding under fixed budgets. We present a diversity aware retrieval
framework that selects in context exemplars to balance intent coverage and
linguistic variety, and integrates this selection with standard LLM decoders;
the evaluation enforces budget matched prompts and randomized positions, and
includes sensitivity analyses over exemplar count, diversity strength, and
backbone size. On MultiWOZ 2.4 and SGD, the approach achieves strong gains in
Joint Goal Accuracy under equal token budgets, surpassing strong LLM/DST
baselines, with consistent improvements across K from 4 to 7 and moderate
latency. Overall, the study isolates and validates the impact of content
diversity in retrieval and offers a simple, deployable selection principle for
building accurate, budget constrained multi turn intent systems.
Authors' comments: 15 pages,6 figs
George Ma, Anurag Koul, Qi Chen, Yawen Wu, Sachit Kuhar, Yu Yu, Aritra Sengupta, Varun Kumar et al.
Large Language Models (LLMs) excel at code-related tasks but often struggle in realistic software repositories, where project-specific APIs and cross-file dependencies are crucial. Retrieval-augmented methods mitigate this by injecting repository context at inference time. The low inference-time latency budget affects either retrieval quality or the added latency adversely impacts user experience. We address this limitation with SpecAgent, an agent that improves both latency and code-generation quality by proactively exploring repository files during indexing and constructing speculative context that anticipates future edits in each file. This indexing-time asynchrony allows thorough context computation, masking latency, and the speculative nature of the context improves code-generation quality. Additionally, we identify the problem of future context leakage in existing benchmarks, which can inflate reported performance. To address this, we construct a synthetic, leakage-free benchmark that enables a more realistic evaluation of our agent against baselines. Experiments show that SpecAgent consistently achieves absolute gains of 9-11% (48-58% relative) compared to the best-performing baselines, while significantly reducing inference latency.
Shantanu Agarwal, Joel Barry, Steven Fincke, Scott Miller
Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
Wonduk Seo, Juhyeon Lee, Junseo Koh, Hyunjin An, Jian Park, Seunghyun Lee, Haihua Chen, Yi Bu
Prompt optimization has emerged as an effective alternative to retraining for
improving the performance of Large Language Models (LLMs). However, most
existing approaches treat evaluation as a black box, relying solely on
numerical scores while offering limited insight into why a prompt succeeds or
fails. They also depend heavily on trial-and-error refinements, which are
difficult to interpret and control. In this paper, we introduce MA-SAPO, a
Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior
methods, MA-SAPO explicitly couples evaluation outcomes with structured
reasoning to guide systematic edits. The framework specifically consists of two
stages: during the Reasoning Phase, agents collaboratively explain metric
scores, diagnose weaknesses, and synthesize targeted refinements that are
stored as reusable reasoning assets; during the Test Phase, agents retrieve
these assets to analyze optimized prompts and apply only evidence-grounded
edits. By turning evaluation signals into interpretable reasoning chains,
MA-SAPO produces prompt refinements that are more transparent, auditable, and
controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent
improvements over single-pass prompting, retrieval-augmented baselines, and
prior multi-agent strategies, validating the effectiveness of our approach.
Authors' comments: Preprint
Wenbiao Tao, Yunshi Lan, Weining Qian
Retrieval-Augmented Generation enhances language models by retrieving external knowledge to support informed and grounded responses. However, traditional RAG methods rely on fragment-level retrieval, limiting their ability to address query-focused summarization queries. GraphRAG introduces a graph-based paradigm for global knowledge reasoning, yet suffers from inefficiencies in information extraction, costly resource consumption, and poor adaptability to incremental updates. To overcome these limitations, we propose TagRAG, a tag-guided hierarchical knowledge graph RAG framework designed for efficient global reasoning and scalable graph maintenance. TagRAG introduces two key components: (1) Tag Knowledge Graph Construction, which extracts object tags and their relationships from documents and organizes them into hierarchical domain tag chains for structured knowledge representation, and (2) Tag-Guided Retrieval-Augmented Generation, which retrieves domain-centric tag chains to localize and synthesize relevant knowledge during inference. This design significantly adapts to smaller language models, improves retrieval granularity, and supports efficient knowledge increment. Extensive experiments on UltraDomain datasets spanning Agriculture, Computer Science, Law, and cross-domain settings demonstrate that TagRAG achieves an average win rate of 95.41\% against baselines while maintaining about 14.6x construction and 1.9x retrieval efficiency compared with GraphRAG.
Effrosyni Sokli, Pranav Kasela, Georgios Peikos, Gabriella Pasi
Dense Retrieval Models (DRMs) are a prominent development in Information
Retrieval (IR). A key challenge with these neural Transformer-based models is
that they often struggle to generalize beyond the specific tasks and domains
they were trained on. To address this challenge, prior research in IR
incorporated the Mixture-of-Experts (MoE) framework within each Transformer
layer of a DRM, which, though effective, substantially increased the number of
additional parameters. In this paper, we propose a more efficient design, which
introduces a single MoE block (SB-MoE) after the final Transformer layer. To
assess the retrieval effectiveness of SB-MoE, we perform an empirical
evaluation across three IR tasks. Our experiments involve two evaluation
setups, aiming to assess both in-domain effectiveness and the model's zero-shot
generalizability. In the first setup, we fine-tune SB-MoE with four different
underlying DRMs on seven IR benchmarks and evaluate them on their respective
test sets. In the second setup, we fine-tune SB-MoE on MSMARCO and perform
zero-shot evaluation on thirteen BEIR datasets. Additionally, we perform
further experiments to analyze the model's dependency on its hyperparameters
(i.e., the number of employed and activated experts) and investigate how this
variation affects SB-MoE's performance. The obtained results show that SB-MoE
is particularly effective for DRMs with lightweight base models, such as
TinyBERT and BERT-Small, consistently exceeding standard model fine-tuning
across benchmarks. For DRMs with more parameters, such as BERT-Base and
Contriever, our model requires a larger number of training samples to achieve
improved retrieval performance. Our code is available online at:
https://github.com/FaySokli/SB-MoE.
Authors' comments: 8 pages, 4 figures, 3 tables, reproducible code available at
https://github.com/FaySokli/SB-MoE , Accepted for publication in Proceedings
of the 2025 IEEE/WIC International Conference on Web Intelligence and
Intelligent Agent Technology (WI-IAT 2025)
Ines Besrour, Jingbo He, Tobias Schreieder, Michael Färber
We present SQuAI (https://squai.scads.ai/), a scalable and trustworthy
multi-agent retrieval-augmented generation (RAG) framework for scientific
question answering (QA) with large language models (LLMs). SQuAI addresses key
limitations of existing RAG systems in the scholarly domain, where complex,
open-domain questions demand accurate answers, explicit claims with citations,
and retrieval across millions of scientific documents. Built on over 2.3
million full-text papers from arXiv.org, SQuAI employs four collaborative
agents to decompose complex questions into sub-questions, retrieve targeted
evidence via hybrid sparse-dense retrieval, and adaptively filter documents to
improve contextual relevance. To ensure faithfulness and traceability, SQuAI
integrates in-line citations for each generated claim and provides supporting
sentences from the source documents. Our system improves faithfulness, answer
relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG
baseline. We further release a benchmark of 1,000 scientific
question-answer-evidence triplets to support reproducibility. With transparent
reasoning, verifiable citations, and domain-wide scalability, SQuAI
demonstrates how multi-agent RAG enables more trustworthy scientific QA with
LLMs.
Authors' comments: Accepted at CIKM 2025
Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki, Yuki Mitsufuji
Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to contents production. Despite the success of separate-encoder approaches like CLIP align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learn modality shortcut, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from its unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as a effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.
Jaewan Park, Solbee Cho, Jay-Yoon Lee
Iterative retrieval-augmented generation (RAG) enables large language models
to answer complex multi-hop questions, but each additional loop increases
latency, costs, and the risk of introducing distracting evidence, motivating
the need for an efficient stopping strategy. Existing methods either use a
predetermined number of iterations or rely on confidence proxies that poorly
reflect whether more retrieval will actually help. We cast iterative RAG as a
finite-horizon Markov decision process and introduce Stop-RAG, a value-based
controller that adaptively decides when to stop retrieving. Trained with
full-width forward-view Q($\lambda$) targets from complete trajectories,
Stop-RAG learns effective stopping policies while remaining compatible with
black-box APIs and existing pipelines. On multi-hop question-answering
benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines
and prompting-based stopping with LLMs. These results highlight adaptive
stopping as a key missing component in current agentic systems, and demonstrate
that value-based control can improve the accuracy of RAG systems.
Authors' comments: NeurIPS 2025 MTI-LLM Workshop
Jianting Tang, Dongshuai Li, Tao Wen, Fuyu Lv, Dan Ou, Linli Xu
In modern e-commerce search systems, dense retrieval has become an indispensable component. By computing similarities between query and item (product) embeddings, it efficiently selects candidate products from large-scale repositories. With the breakthroughs in large language models (LLMs), mainstream embedding models have gradually shifted from BERT to LLMs for more accurate text modeling. However, these models still adopt direct-embedding methods, and the semantic accuracy of embeddings remains inadequate. Therefore, contrastive learning is heavily employed to achieve tight semantic alignment between positive pairs. Consequently, such models tend to capture statistical co-occurrence patterns in the training data, biasing them toward shallow lexical and semantic matches. For difficult queries exhibiting notable lexical disparity from target items, the performance degrades significantly. In this work, we propose the Large Reasoning Embedding Model (LREM), which novelly integrates reasoning processes into representation learning. For difficult queries, LREM first conducts reasoning to achieve a deep understanding of the original query, and then produces a reasoning-augmented query embedding for retrieval. This reasoning process effectively bridges the semantic gap between original queries and target items, significantly improving retrieval accuracy. Specifically, we adopt a two-stage training process: the first stage optimizes the LLM on carefully curated Query-CoT-Item triplets with SFT and InfoNCE losses to establish preliminary reasoning and embedding capabilities, and the second stage further refines the reasoning trajectories via reinforcement learning (RL). Extensive offline and online experiments validate the effectiveness of LREM, leading to its deployment on China's largest e-commerce platform since August 2025.