Jianing Yang, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari
With the advancement of self-supervised learning (SSL), fine-tuning pretrained SSL models for mean opinion score (MOS) prediction has achieved state-of-the-art performance. However, during fine-tuning, these SSL-based MOS prediction models often suffer from catastrophic forgetting of the pretrained knowledge and tend to overfit the training set, resulting in poor generalization performance. In this study, we propose DistilMOS, a novel method that learns to predict not only MOS but also token IDs obtained by clustering the hidden representations of each layer in the pretrained SSL model. These layer-wise token targets serve as self-distillation signals that enable the MOS prediction model to extract rich internal knowledge from SSL models, enhancing both prediction accuracy and generalization capability. Experimental evaluations demonstrate that our method significantly outperforms standard SSL-based MOS prediction models on both in-domain and out-of-domain evaluations, verifying the effectiveness and practicality of the proposed method.
Authors' comments: Accepted to ICASSP 2026
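As a rough illustration of the layer-wise token targets mentioned in the DistilMOS abstract above, the sketch below maps each frame of one layer's hidden states to the ID of its nearest cluster centroid. The centroids, the 2-D toy vectors, and the function names are hypothetical; the abstract does not specify the clustering algorithm or the distance metric, so squared Euclidean distance is an assumption.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_token_ids(frames: List[Sequence[float]],
                     centroids: List[Sequence[float]]) -> List[int]:
    """Map each frame vector to the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]


# Hypothetical hidden states for one layer and three pre-computed centroids.
frames = [[0.1, 0.0], [0.9, 1.1], [2.0, 2.1], [0.0, 0.2]]
centroids = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
token_ids = assign_token_ids(frames, centroids)
```

In a setup like the one the abstract describes, such IDs would be computed per layer and used as auxiliary classification targets next to the MOS regression target.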
Shima Sadat Mousavi, Xiao Tan, Aaron D. Ames
This paper develops certificates that propagate compatibility of multiple control barrier function (CBF) constraints from sampled vertices to their convex hull. Under mild concavity and affinity assumptions, we present three sufficient feasibility conditions under which feasible inputs over the convex hull can be obtained per coordinate, with a common input, or via convex blending. We also describe the associated computational methods, based on interval intersections or an offline linear program (LP). Beyond certifying compatibility, we give conditions under which the quadratic-program (QP) safety filter is affine in the state. This enables explicit implementations via convex combinations of vertex-feasible inputs. Case studies illustrate the results.
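The interval-intersection idea mentioned above can be sketched for the simplest case. The example assumes a scalar input u and CBF constraints written in the affine form a*u + b >= 0 at a fixed state (the usual shape of a CBF condition for a control-affine system). The function name and the numbers are hypothetical, and this is not the paper's certificate, only the per-constraint interval bookkeeping.

```python
import math
from typing import List, Optional, Tuple


def feasible_interval(
    constraints: List[Tuple[float, float]]
) -> Optional[Tuple[float, float]]:
    """Intersect the half-lines {u : a*u + b >= 0} for a scalar input u.

    Returns (lo, hi) when the intersection is non-empty, otherwise None.
    """
    lo, hi = -math.inf, math.inf
    for a, b in constraints:
        if a > 0:
            lo = max(lo, -b / a)   # constraint reads u >= -b/a
        elif a < 0:
            hi = min(hi, -b / a)   # constraint reads u <= -b/a
        elif b < 0:
            return None            # 0*u + b >= 0 cannot hold
    return (lo, hi) if lo <= hi else None


compatible = feasible_interval([(1.0, 2.0), (-2.0, 4.0)])
incompatible = feasible_interval([(1.0, -3.0), (-1.0, 1.0)])
```

Here `compatible` is the interval of inputs satisfying both constraints, while `incompatible` is None because the two half-lines do not overlap.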
Atharva Dange, Ramon E. Lopez, Louis Deslauriers, Nimish Shah
This exploratory study examines the classroom deployment of aiPlato, an AI-enabled homework platform, in a large introductory physics course at the University of Texas at Arlington. Designed to support open-ended problem solving, aiPlato provides step-wise feedback and iterative guidance through tools such as "Evaluate My Work" and "AI Tutor Chat", while preserving opportunities for productive struggle. Over four optional extra-credit assignments, the platform captured detailed student interaction data, which were analyzed alongside course performance and end-of-semester survey responses. We examine how students engaged with different feedback tools, whether engagement patterns were associated with performance on the cumulative final exam, and how students perceived the platform's usability and learning value. Students who engaged more frequently with aiPlato tended to achieve higher final exam scores, with a mean difference corresponding to a standardized effect size of approximately 0.81 between high and low engagement groups after controlling for prior academic performance. Usage patterns and survey responses indicate that students primarily relied on iterative, formative feedback rather than solution-revealing assistance. As a quasi-experimental pilot study, these findings do not establish causality and may reflect self-selection effects. Nonetheless, the results demonstrate the feasibility of integrating AI-mediated, step-wise feedback into authentic physics homework and motivate future controlled studies of AI-assisted tutoring systems.
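For readers unfamiliar with standardized effect sizes such as the 0.81 quoted above, the following sketch computes Cohen's d with a pooled standard deviation for two groups. The scores are invented for illustration, and the sketch does not adjust for prior academic performance as the study reports doing, so it should not be read as the authors' analysis.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical final exam scores for high- and low-engagement groups.
high_engagement = [80.0, 85.0, 90.0]
low_engagement = [70.0, 75.0, 80.0]
d = cohens_d(high_engagement, low_engagement)
```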
Jonas Römer, Timo Dickscheid
End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies. In order to compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts consistent with stronger early mixing that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.
Yuqing Zhou, Zhuoer Wang, Jie Yuan, Hong Wang, Samson Koelle, Ziwei Zhu, Wei Niu
Large language model (LLM)-based agents are widely deployed in user-facing services but remain error-prone in new tasks, tend to repeat the same failure patterns, and show substantial run-to-run variability. Fixing failures via environment-specific training or manual patching is costly and hard to scale. To enable self-evolving agents in user-facing service environments, we propose WISE-Flow, a workflow-centric framework that converts historical service interactions into reusable procedural experience by inducing workflows with prerequisite-augmented action blocks. At deployment, WISE-Flow aligns the agent's execution trajectory to retrieved workflows and performs prerequisite-aware feasibility reasoning to achieve state-grounded next actions. Experiments on ToolSandbox and $τ^2$-bench show consistent improvement across base models.
Authors' comments: 19 pages
M. Galloway, E. Gjerløw, M. San, R. M. Sullivan, D. J. Watts, R. Aurvik, A. Basyrov, L. A. Bianchi et al.
We present a model of starlight emission in the Diffuse Infrared Background Explorer (DIRBE) data between 1.25 and 25$\,μ$m based on \textit{Gaia} and WISE measurements. We include two classes of compact objects, namely bright stars with individual spectral energy densities (SEDs) measured by \textit{Gaia}, and a combined diffuse background of dim point source emission. Of the 424\,829 bright sources that we fit, the number of stars with a flux density detected by WISE at Galactic latitudes $|b|>20^{\circ}$ at more than $5\,σ$ is 94\,680, for an average of 1.36~stars per DIRBE beam area. For each star, we adopt physical parameters ($T_{\mathrm{eff}}$, $\log g$, and [M/H]) from \textit{Gaia}; use these to identify a best-fit effective SED with the PHOENIX stellar model library; convolve with the respective DIRBE bandpass; and fit an overall free amplitude per star within the Bayesian end-to-end \texttt{Cosmoglobe} DR2 framework. The contributions from faint sources are accounted for by coadding all 710\,825\,587 WISE sources not included as bright stars and fitting a single overall amplitude per DIRBE band. Based on this model, we find that total star emission accounts for 91\,\% of the observed flux density at 2.2\,$μ$m; 54\,\% at 4.9$\,μ$m; and 1\,\% at 25\,$μ$m. As shown in companion papers, this new model is sufficiently accurate to support high-precision measurements of both the Cosmic Infrared Background monopole and zodiacal light emission in the three highest DIRBE frequencies.
Authors' comments: 13 pages, 15 figures
Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among the numerous works proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in the KV cache and prune the others, are among the most popular schemes. They primarily adopt a set of pre-defined layers at which tokens are selected. Such a design is inflexible in the sense that the accuracy varies significantly across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. Through evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that, equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
Authors' comments: Source code is available at https://github.com/TANIGUCHIREI/ASL
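The one-shot token selection step mentioned in the ASL abstract above can be pictured with a generic top-k sketch: given one attention score per token at the chosen layer, keep the highest-scoring tokens up to the KV budget. This is an assumption-laden illustration with hypothetical names and scores; it does not implement ASL's adaptive choice of the selection layer, which the abstract only describes as exploiting the variance of token ranks.

```python
from typing import List, Sequence


def select_tokens(scores: Sequence[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens.

    Positions are returned in ascending order so the retained tokens
    keep their original sequence order.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical per-token attention scores at the selection layer.
attention_scores = [0.05, 0.40, 0.10, 0.30, 0.15]
kept_positions = select_tokens(attention_scores, budget=3)
```

In a one-shot scheme, only the KV entries at `kept_positions` would be propagated to deeper layers.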
Bo-Lun Huang, Zhen-Zhao Tao, Tong-Jie Zhang
We search for galaxy-scale (Dysonian) waste heat in the mid-infrared using WISE. Starting from the 2MASS Redshift Survey (2MRS), we cross-match to CatWISE2020 and AllWISE, apply standard MIR AGN/starburst vetoes (Stern, Assef R90, Jarrett), and treat W1 and W2 as stellar baselines and W3 and W4 as constraining bands. For each galaxy and for blackbody waste heat temperatures T=150-600 K, we convert W3/W4 photometry into conservative 3-sigma per-galaxy upper limits on the bolometric waste heat luminosity using the WISE bandpass (RSR) color correction. The resulting distributions have median caps of ~(5-9) x 10^8 L_sun across T=150-600 K. Aggregated at the population level, the one-sided 95% upper bound on the fraction of nearby galaxies that could host waste heat above a given threshold monotonically decreases with threshold and asymptotes to ~1/6500 at high thresholds (set by the sample size). Sensitivity transitions from W4 at T <= 200 K to W3 at T >= 300 K. Interpreted with the AGENT formalism, a fiducial Milky Way-like stellar luminosity L_*=3 x 10^10 L_sun implies typical per-galaxy caps of alpha = L_wh/L_* <= 1.7-2.9% over T=150-600 K (e.g., alpha <= 1.8% at T=300 K). At T ~= 300 K, no more than f_95 ~= 1.61 x 10^-4 (~= 0.0161%) of nearby galaxies can host KIII-scale systems reprocessing >= 21% of a Milky Way-like stellar luminosity into ~ 300 K waste heat.
Authors' comments: 18 pages, 12 figures, 2 tables. Accepted for publication in The Astronomical Journal
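The conversion between a waste heat luminosity cap and the fraction alpha of stellar luminosity quoted in the abstract above is simple arithmetic, sketched below. The fiducial stellar luminosity of 3 x 10^10 L_sun is taken from the abstract; the example cap of 5.4 x 10^8 L_sun is a hypothetical value inside the quoted (5-9) x 10^8 L_sun range, chosen only to illustrate the calculation.

```python
# Fiducial Milky Way-like stellar luminosity in solar units (from the abstract).
L_STAR_LSUN = 3e10


def waste_heat_fraction(l_wh_lsun: float, l_star_lsun: float = L_STAR_LSUN) -> float:
    """alpha: waste heat luminosity as a fraction of the stellar luminosity."""
    return l_wh_lsun / l_star_lsun


def luminosity_at_fraction(alpha: float, l_star_lsun: float = L_STAR_LSUN) -> float:
    """Waste heat luminosity (L_sun) corresponding to a given fraction alpha."""
    return alpha * l_star_lsun


example_alpha = waste_heat_fraction(5.4e8)        # hypothetical cap, about 1.8%
kiii_threshold = luminosity_at_fraction(0.21)     # the 21% threshold in L_sun
```

Under these numbers, the 21% threshold corresponds to roughly 6.3 x 10^9 L_sun of waste heat.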
Amirreza Zamani, Parastoo Sadeghi, Mikael Skoglund
An information-theoretic privacy mechanism design is studied, where an agent observes useful data $Y$ which is correlated with the private data $X$. The agent wants to reveal the information to a user and hence utilizes a privacy mechanism to produce disclosed data $U$ that can be revealed. We assume that the agent has no direct access to $X$, i.e., the private data is hidden. We study privacy mechanism design that maximizes the disclosed information about $Y$, measured by the mutual information between $Y$ and $U$, while satisfying a point-wise constraint with different privacy leakage budgets. We introduce a new measure, called the \emph{multi-level point-wise leakage}, which allows us to impose different leakage levels for different realizations of $U$. In contrast to previous studies on point-wise measures, which use the same leakage level for each realization, we consider a more general scenario in which each data point can leak information up to a different threshold. As a result, this concept also covers cases in which some data points should not leak any information about the private data, i.e., they must satisfy perfect privacy. In other words, a combination of perfect privacy and non-zero leakage can be considered. When the leakage is sufficiently small, concepts from information geometry allow us to locally approximate the mutual information. We show that when the leakage matrix $P_{X|Y}$ is invertible, utilizing this approximation leads to a quadratic optimization problem that has a closed-form solution under some constraints. In particular, we show that it is sufficient to consider only binary $U$ to attain the optimal utility. This leads to simple privacy designs with low complexity which are based on finding the maximum singular value and singular vector of a matrix.
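Since the resulting designs reduce to finding a maximum singular value and its singular vector, a minimal power-iteration sketch may help. The abstract does not say which matrix is decomposed, so the 2x2 matrix below is purely hypothetical. The routine also assumes the all-ones starting vector is not orthogonal to the top right singular vector.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def top_singular(matrix: Matrix, iters: int = 100) -> Tuple[float, List[float]]:
    """Largest singular value and right singular vector via power iteration on A^T A."""
    at = transpose(matrix)
    v = [1.0] * len(matrix[0])
    for _ in range(iters):
        w = matvec(at, matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in matvec(matrix, v)))
    return sigma, v


# Hypothetical matrix whose top singular value is 3 with vector (1, 0).
sigma_max, v_max = top_singular([[3.0, 0.0], [0.0, 1.0]])
```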
Kaiyan Zhao, Zijie Meng, Zheyong Xie, Jin Duan, Yao Hu, Zuozhu Liu, Shaosheng Cao
Large Language Model (LLM)-based agents are increasingly deployed in e-commerce applications to assist customer services in tasks such as product inquiries, recommendations, and order management. Existing benchmarks primarily evaluate whether these agents successfully complete the final task, overlooking the intermediate reasoning stages that are crucial for effective decision-making. To address this gap, we propose EComStage, a unified benchmark for evaluating agent-capable LLMs across the comprehensive stage-wise reasoning process: Perception (understanding user intent), Planning (formulating an action plan), and Action (executing the decision). EComStage evaluates LLMs through seven separate representative tasks spanning diverse e-commerce scenarios, with all samples human-annotated and quality-checked. Unlike prior benchmarks that focus only on customer-oriented interactions, EComStage also evaluates merchant-oriented scenarios, including promotion management, content review, and operational support relevant to real-world applications. We evaluate a wide range of over 30 LLMs, spanning from 1B to over 200B parameters, including open-source models and closed-source APIs, revealing stage/orientation-specific strengths and weaknesses. Our results provide fine-grained, actionable insights for designing and optimizing LLM-based agents in real-world e-commerce settings.
Authors' comments: preprint
Hongzhan Lin, Zixin Chen, Zhiqi Shen, Ziyang Luo, Zhen Ye, Jing Ma, Tat-Seng Chua, Guandong Xu
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems, yet existing evaluations focus predominantly on claim verification and overlook the broader fact-checking workflow, including claim extraction and evidence retrieval. This narrow focus prevents current benchmarks from revealing systematic reasoning failures, factual blind spots, and robustness limitations of modern LLMs. To bridge this gap, we present FactArena, a fully automated arena-style evaluation framework that conducts comprehensive, stage-wise benchmarking of LLMs across the complete fact-checking pipeline. FactArena integrates three key components: (i) an LLM-driven fact-checking process that standardizes claim decomposition, evidence retrieval via tool-augmented interactions, and justification-based verdict prediction; (ii) an arena-styled judgment mechanism guided by consolidated reference guidelines to ensure unbiased and consistent pairwise comparisons across heterogeneous judge agents; and (iii) an arena-driven claim-evolution module that adaptively generates more challenging and semantically controlled claims to probe LLMs' factual robustness beyond fixed seed data. Across 16 state-of-the-art LLMs spanning seven model families, FactArena produces stable and interpretable rankings. Our analyses further reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence, highlighting the necessity of holistic evaluation. The proposed framework offers a scalable and trustworthy paradigm for diagnosing LLMs' factual reasoning, guiding future model development, and advancing the reliable deployment of LLMs in safety-critical fact-checking applications.
Authors' comments: 17 pages, 21 figures, 7 tables
Xiangjun Kong, Qingkang Bao, Tibebe Yalew, Gerardo Adesso, Samanta Piano
Driven by the growing demand for high-speed 3D measurement in advanced manufacturing, optical metrology algorithms must deliver high accuracy and robustness under dynamic conditions. Fringe projection profilometry (FPP) offers high precision, yet the 2π ambiguity of the wrapped phase means that conventional absolute phase recovery typically relies on multiple coded patterns, sacrificing temporal resolution. Deep learning-based composite FPP (CFPP) shows promise for single-shot phase recovery from a composite fringe, but limited interpretability makes it difficult to assess reconstruction reliability or trace error sources in the absence of ground truth. To address this, we propose HSURE-CFPP (Heteroscedastic Snapshot-ensemble Uncertainty-aware Ratio Estimation for CFPP). HSURE-CFPP predicts the numerator-denominator ratio used for wrapped-phase computation with a heteroscedastic snapshot-ensemble network, enabling ultra-fast 3D imaging from a single composite fringe and producing pixel-wise uncertainty maps for confidence assessment and unreliable-region identification. Specifically, a heteroscedastic likelihood jointly estimates pixel-wise noise variance to capture data uncertainty, while a snapshot ensemble quantifies model uncertainty via dispersion across snapshots, yielding total predictive uncertainty as an interpretable reliability measure. Experiments on static and dynamic scenes demonstrate that HSURE-CFPP achieves high-accuracy reconstruction at high speed and that the predicted uncertainty correlates well with reconstruction errors, providing a deployable quality-assessment mechanism for deep-learning-based FPP.
Authors' comments: 19 pages, 10 figures
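To make the uncertainty decomposition in the HSURE-CFPP abstract above concrete, the sketch below combines per-snapshot predictions for a single pixel. It assumes the common ensemble recipe in which data uncertainty is the mean of the predicted variances and model uncertainty is the variance of the predicted means; the paper's exact formulation may differ. The wrapped phase is computed with the usual arctangent of numerator over denominator. All numbers and names are hypothetical.

```python
import math
from statistics import mean, pvariance
from typing import Sequence


def wrapped_phase(numerator: float, denominator: float) -> float:
    """Wrapped phase in (-pi, pi] from the numerator and denominator terms."""
    return math.atan2(numerator, denominator)


def total_predictive_variance(snapshot_means: Sequence[float],
                              snapshot_variances: Sequence[float]) -> float:
    """Sum of data (aleatoric) and model (epistemic) variance for one pixel."""
    data_uncertainty = mean(snapshot_variances)    # average predicted noise variance
    model_uncertainty = pvariance(snapshot_means)  # spread of predictions across snapshots
    return data_uncertainty + model_uncertainty


# Hypothetical predictions of three snapshots for one pixel.
means = [0.9, 1.0, 1.1]
variances = [0.02, 0.02, 0.02]
pixel_uncertainty = total_predictive_variance(means, variances)
phase = wrapped_phase(1.0, 1.0)
```

Evaluating this per pixel would give an uncertainty map of the kind the abstract describes for flagging unreliable regions.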
Yonglak Son, Suhyeok Kim, Seungryong Kim, Young Geun Kim
Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
Authors' comments: 16 pages, 20 figures
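The interval caching idea in the CorGi abstract above can be sketched with scalar toy blocks. The sketch assumes each block contributes a residual update, that the set of cacheable (low-contribution) blocks is given by hand, and that caches are refreshed at the first step of every interval. How CorGi actually scores block contribution is not described in the abstract and is not modeled here.

```python
from typing import Callable, Dict, List, Set, Tuple

Block = Callable[[float], float]


def run_with_interval_cache(x: float, blocks: List[Block], cacheable: Set[int],
                            num_steps: int, interval: int) -> Tuple[float, int]:
    """Run denoising steps, reusing cached updates of selected blocks.

    Returns the final state and the number of block evaluations performed.
    """
    cache: Dict[int, float] = {}
    evaluations = 0
    for step in range(num_steps):
        refresh = step % interval == 0  # first step of an interval recomputes everything
        for idx, block in enumerate(blocks):
            if idx in cacheable and not refresh and idx in cache:
                update = cache[idx]      # reuse the update stored earlier in this interval
            else:
                update = block(x)
                evaluations += 1
                if idx in cacheable:
                    cache[idx] = update
            x = x + update               # residual connection
    return x, evaluations


# Hypothetical constant-update blocks; block 1 is treated as low-contribution.
toy_blocks: List[Block] = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
final_state, num_evals = run_with_interval_cache(
    0.0, toy_blocks, cacheable={1}, num_steps=4, interval=2)
```

With four steps and an interval of two, the cached block is evaluated only twice, so 10 block evaluations replace the 12 of an uncached run.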
Aiyue Chen, Yaofu Liu, Junjian Huang, Guang Lian, Yiwu Yao, Wangli Lan, Jing Lin, Zhixin Ma et al.
In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing units (GPUs), such as application-specific integrated circuits (ASICs), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPUs. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can reach 80% sparsity with an end-to-end speedup of 1.5x to 1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.
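The first insight, using block-wise means as representative tokens for mask prediction, can be illustrated as follows. The sketch averages queries and keys within fixed-size blocks, scores each query block against each key block with a dot product, and keeps the top-scoring key blocks per query block. The block size, the top-k rule, and all names are assumptions for illustration; the token permutation and first-frame sink are not modeled.

```python
from typing import List

Vector = List[float]


def block_means(tokens: List[Vector], block_size: int) -> List[Vector]:
    """Average the token vectors inside each consecutive block."""
    means = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        dim = len(block[0])
        means.append([sum(t[d] for t in block) / len(block) for d in range(dim)])
    return means


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_mask(queries: List[Vector], keys: List[Vector],
               block_size: int, keep_per_row: int) -> List[List[bool]]:
    """Boolean mask over (query block, key block) pairs; True means compute attention."""
    q_means = block_means(queries, block_size)
    k_means = block_means(keys, block_size)
    mask = []
    for q in q_means:
        scores = [dot(q, k) for k in k_means]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        keep = set(ranked[:keep_per_row])
        mask.append([j in keep for j in range(len(scores))])
    return mask


# Hypothetical 2-D tokens forming two blocks of two tokens each.
toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_mask = block_mask(toy_tokens, toy_tokens, block_size=2, keep_per_row=1)
```

Because the mask is predicted from one mean vector per block, its cost scales with the number of blocks rather than the number of tokens.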
Ajvad Haneef K, Karan Kuwar Singh, Madhu Kumar S D
Confronting the substantial challenges of malware detection in cybersecurity necessitates solutions that are both robust and adaptable to the ever-evolving threat environment. The paper introduces Meta Learning Malware Detection (MeLeMaD), a novel framework leveraging the adaptability and generalization capabilities of Model-Agnostic Meta-Learning (MAML) for malware detection. MeLeMaD incorporates a novel feature selection technique, Chunk-wise Feature Selection based on Gradient Boosting (CFSGB), tailored for handling large-scale, high-dimensional malware datasets, significantly enhancing the detection efficiency. Two benchmark malware datasets (CIC-AndMal2020 and BODMAS) and a custom dataset (EMBOD) were used to rigorously validate MeLeMaD, which achieves remarkable performance in terms of key evaluation measures, including accuracy, precision, recall, F1-score, MCC, and AUC. With accuracies of 98.04\% on CIC-AndMal2020 and 99.97\% on BODMAS, MeLeMaD outperforms the state-of-the-art approaches. On the custom dataset, EMBOD, it also achieves a commendable accuracy of 97.85\%. The results underscore MeLeMaD's potential to address the challenges of robustness, adaptability, and large-scale, high-dimensional datasets in malware detection, paving the way for more effective and efficient cybersecurity solutions.
Authors' comments: 20 pages, 8 Figures
Hengyi Wu, Zhenyi Wang, Heng Huang
Continual learning aims to acquire new tasks while preserving performance on previously learned ones, but most methods struggle with catastrophic forgetting. Existing approaches typically treat all layers uniformly, often trading stability for plasticity or vice versa. However, different layers naturally exhibit varying levels of uncertainty (entropy) when classifying tasks. High-entropy layers tend to underfit by failing to capture task-specific patterns, while low-entropy layers risk overfitting by becoming overly confident and specialized. To address this imbalance, we propose an entropy-aware continual learning method that employs a dynamic feedback mechanism to regulate each layer based on its entropy. Specifically, our approach reduces entropy in high-entropy layers to mitigate underfitting and increases entropy in overly confident layers to alleviate overfitting. This adaptive regulation encourages the model to converge to wider local minima, which have been shown to improve generalization. Our method is general and can be seamlessly integrated with both replay- and regularization-based approaches. Experiments on various datasets demonstrate substantial performance gains over state-of-the-art continual learning baselines.
Authors' comments: 14 pages
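The entropy-based regulation described in the abstract above can be pictured with a small sketch: compute the Shannon entropy of a layer's class probabilities, then derive a signed feedback term that pushes entropy up for overly confident layers and down for uncertain ones. The proportional rule, the target value, and the function names are assumptions; the abstract only states that a dynamic feedback mechanism is used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_feedback(layer_entropy: float, target: float, gain: float = 1.0) -> float:
    """Signed coefficient: positive pushes entropy up, negative pushes it down."""
    return gain * (target - layer_entropy)


# Hypothetical logits from a probe on one layer.
layer_probs = softmax([0.0, 0.0])
layer_entropy = entropy(layer_probs)
coefficient = entropy_feedback(layer_entropy, target=0.5)
```

Here the uniform two-class prediction has entropy ln 2, above the hypothetical target of 0.5, so the coefficient is negative and would act to reduce entropy.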
Rongyao Cai, Yuxi Wan, Kexin Zhang, Ming Jin, Hao Wang, Zhiqiang Ge, Daoyi Dong, Yong Liu et al.
Optimizing time series models via point-wise loss functions (e.g., MSE) relies on a flawed point-wise independent and identically distributed (i.i.d.) assumption that disregards the causal temporal structure, an issue of growing awareness that still lacks formal theoretical grounding. Focusing on the core independence issue under covariance stationarity, this paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB), formalizing it information-theoretically as the discrepancy between the true joint distribution and its flawed i.i.d. counterpart. Our analysis reveals a fundamental paradigm paradox: the more deterministic and structured the time series, the more severe the bias induced by point-wise loss functions. We derive the first closed-form quantification for the non-deterministic EOB across linear and non-linear systems, and prove EOB is an intrinsic data property, governed exclusively by sequence length and our proposed Structural Signal-to-Noise Ratio (SSNR). This theoretical diagnosis motivates our principled debiasing program that eliminates the bias through sequence length reduction and structural orthogonalization. We present a concrete solution that simultaneously achieves both principles via DFT or DWT. Furthermore, a novel harmonized $\ell_p$ norm framework is proposed to rectify gradient pathologies of high-variance series. Extensive experiments validate EOB Theory's generality and the superior performance of the debiasing program.
Wenhan Guo, Jinglun Yu, Yaning Wang, Jin U. Kang, Yu Sun
Diffusion models are highly expressive image priors for Bayesian inverse problems. However, most diffusion models cannot operate on large-scale, high-dimensional data due to high training and inference costs. In this work, we introduce a Plug-and-play algorithm for 3D stochastic inference with latent diffusion prior (PSI3D) to address massive ($1024\times 1024\times 128$) volumes. Specifically, we formulate a Markov chain Monte Carlo approach to reconstruct each two-dimensional (2D) slice by sampling from a 2D latent diffusion model. To enhance inter-slice consistency, we also incorporate total variation (TV) regularization stochastically along the concatenation axis. We evaluate our performance on optical coherence tomography (OCT) super-resolution. Our method significantly improves reconstruction quality for large-scale scientific imaging compared to traditional and learning-based baselines, while providing robust and credible reconstructions.
Authors' comments: 10 pages, 3 figures
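The inter-slice total variation term mentioned in the PSI3D abstract above can be illustrated by computing an anisotropic TV penalty along the slice axis of a small volume. The sketch only evaluates the penalty on a hypothetical toy volume; how the paper applies the regularization stochastically within its sampling scheme is not specified in the abstract and is not modeled.

```python
from typing import List, Sequence


def tv_along_slices(volume: Sequence[Sequence[float]]) -> float:
    """Sum of absolute differences between corresponding pixels of adjacent slices.

    `volume` is a sequence of slices, each flattened to a list of pixel values.
    """
    return sum(
        abs(a - b)
        for current, following in zip(volume, volume[1:])
        for a, b in zip(current, following)
    )


# Hypothetical volume of three slices with two pixels each.
toy_volume: List[List[float]] = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
penalty = tv_along_slices(toy_volume)
```

A smaller penalty means adjacent slices agree more closely, which is the inter-slice consistency the regularizer is meant to encourage.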
Wuyi Liu, Le Jin, Junxian Yang, Yuanchao Yu, Zishuo Peng, Jinfeng Xu, Xianzhi Li, Jun Zhou
Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is challenging due to insufficient labeled data and micro-defects spanning just a few pixels in visually complex, high-resolution images. To address these challenges, we present HiSIR-Net, a High resolution, Self-supervised Reconstruction framework for pixel-wise PCBA localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with a low false-positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset of 500 images named SIPCBA-500. We conduct extensive experiments on SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.
Harsh Vardhan Bansal
Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1x speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
Authors' comments: Accepted and presented at 13th IEEE International Conference on Intelligent Systems and Embedded Design (ISED-2025)
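A similarity-keyed activation cache of the kind the LLMCache abstract above describes can be sketched as follows. The sketch assumes fingerprints are small numeric vectors compared by cosine similarity against a fixed threshold, and it substitutes plain least-recently-used eviction for the paper's adaptive eviction strategies, which the abstract does not detail. The class, its parameters, and the string placeholders for activations are all hypothetical.

```python
import math
from collections import OrderedDict
from typing import Any, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class SimilarityCache:
    """Stores activations keyed by input fingerprints, with LRU eviction."""

    def __init__(self, capacity: int, threshold: float) -> None:
        self.capacity = capacity
        self.threshold = threshold
        self.entries: "OrderedDict[int, Tuple[List[float], Any]]" = OrderedDict()
        self._next_key = 0

    def get(self, fingerprint: List[float]) -> Optional[Any]:
        """Return the activation of the most similar entry above the threshold."""
        best_key, best_sim = None, self.threshold
        for key, (stored, _) in self.entries.items():
            sim = cosine(fingerprint, stored)
            if sim >= best_sim:
                best_key, best_sim = key, sim
        if best_key is None:
            return None
        self.entries.move_to_end(best_key)  # mark as most recently used
        return self.entries[best_key][1]

    def put(self, fingerprint: List[float], activation: Any) -> None:
        """Insert an entry, evicting the least recently used one when full."""
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[self._next_key] = (fingerprint, activation)
        self._next_key += 1


cache = SimilarityCache(capacity=2, threshold=0.95)
cache.put([1.0, 0.0], "activation_a")
cache.put([0.0, 1.0], "activation_b")
hit = cache.get([0.99, 0.05])   # close to the first fingerprint
miss = cache.get([1.0, 1.0])    # not close enough to either entry
```

On a hit, the stored activation for the matched layer would be reused instead of recomputed; on a miss, the model would run normally and `put` the new activation.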