Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon, U Kang
How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs that overcomes these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ achieves state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
Authors' comments: AAAI 2026
Mingkun Yang, Ran Zhu, Qing Wang, Jie Yang
Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge undermining the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on gradient updates of the server-side submodel at each optimization step. Extensive validations on multiple real-world datasets show that SMoFi consistently improves global model accuracy (up to 7.1%) and convergence speed (up to 10.25$\times$). Furthermore, SMoFi has a greater impact with more clients involved and deeper learning models, making it particularly suitable for model training in resource-constrained contexts.
Authors' comments: Paper accepted by AAAI 2026
Remi Luschei, Werner Brannath
We consider clinical trials with multiple overlapping patient populations that test multiple treatment policies specifically tailored to these populations. Such designs may lead to multiplicity issues, as false statements will affect several populations. For type I error control, often the family-wise error rate (FWER) is controlled, which is the probability of rejecting at least one true null hypothesis. If the joint distribution of the test statistics is known, the FWER level can be exhausted by determining critical values or adjusted $α$-levels. The adjustment is typically done under the common ANOVA assumptions. However, the performed tests are then only valid under the rather strong assumption of homogeneous null effects, i.e., when the null hypothesis applies to all subpopulations and their intersections. We show that under cancelling null effects, when heterogeneous effects cancel out in some or all subpopulations, this procedure does not provide FWER control. We also suggest different alternatives and compare them in terms of FWER control and their power.
Liang Luo, Lei Zhang
Image restoration requires a careful balance between noise suppression and structure preservation. While first-order total variation (TV) regularization effectively preserves edges, it often introduces staircase artifacts, whereas higher-order TV removes such artifacts but oversmooths fine details. To reconcile these competing effects, we propose a semi-convergent stage-wise framework that sequentially integrates first- and higher-order TV regularizers within an iterative restoration process implemented via ADMM. Each stage exhibits semi-convergence behavior, i.e., the iterates initially approach the ground truth before being degraded by over-regularization. By monitoring this evolution, the algorithm adaptively selects the locally optimal iterate (e.g., with the highest PSNR) and propagates it as the initial point for the next stage. This select-and-propagate mechanism effectively transfers local semi-convergence into a globally convergent iterative process. We establish theoretical guarantees showing that the sequence of stage-wise iterates is bounded and that the objective values decrease monotonically. Extensive numerical experiments on denoising and deblurring benchmarks confirm that the proposed method achieves superior quantitative and perceptual performance compared with conventional first-order, higher-order, and hybrid TV methods, as well as learning-based methods, while maintaining theoretical interpretability and algorithmic simplicity.
Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin et al.
Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
Authors' comments: Preprint
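The abstract above names TD-based estimation combined with Generalized Advantage Estimation (GAE) but does not spell out the computation. As a minimal sketch of textbook GAE only (not necessarily the exact labeling procedure used for AgentPRM), the per-step advantage can be accumulated backwards from TD residuals. The function name, the discount `gamma`, and the mixing factor `lam` are illustrative assumptions.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Textbook GAE over one finished trajectory (illustrative sketch).

    `values` holds V(s_0), ..., V(s_T): one more entry than `rewards`.
    The last entry is the bootstrap value (0.0 for a terminal state).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0.0` the result reduces to the one-step TD residuals, and with `lam=1.0` it becomes the discounted return minus the value baseline, which is the usual bias-variance trade-off that GAE exposes.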
Grigory Kovalev, Mikhail Tikhomirov
This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.
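The abstract above mentions a joint loss based on KL divergence and mean squared error without giving its exact form. The following is a minimal sketch under stated assumptions: the KL term compares teacher and student output distributions, the MSE term compares a pair of hidden vectors, and the weight `alpha` is a hypothetical mixing coefficient. None of these choices is confirmed by the source.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes every q_i > 0 (e.g. softmax output)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_distillation_loss(teacher_probs: Sequence[float],
                            student_probs: Sequence[float],
                            teacher_hidden: Sequence[float],
                            student_hidden: Sequence[float],
                            alpha: float = 0.5) -> float:
    """Weighted sum of a KL term and an MSE term (alpha is a hypothetical weight)."""
    kl_term = kl_divergence(teacher_probs, student_probs)
    mse_term = mse(teacher_hidden, student_hidden)
    return alpha * kl_term + (1.0 - alpha) * mse_term
```

The loss is zero when the student reproduces both the teacher's distribution and its hidden vector, and grows as either diverges.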
Timothy W. H. Yiu, Harish K. Vedantham, Joseph R. Callingham, Timothy W. Shimwell
Brown dwarfs display Jupiter-like auroral phenomena, such as rotationally
modulated electron cyclotron maser radio emission. Radio observations of
cyclotron maser emission can be used to measure their magnetic field strength
and topology, and to deduce the presence of magnetically interacting exoplanets.
Observations of the coldest brown dwarfs (spectral types T and Y) are
especially intriguing, as their magnetospheric phenomena could closely resemble
those of gas-giant exoplanets. Here we report observations made over ten
epochs, amounting to 44 hours, of WISEP J101905.63+652954.2 (J1019+65
hereinafter) using the LOFAR telescope between 120 and 168 MHz. J1019+65 is a
methane dwarf binary (T5.5+T7) whose radio emission was originally detected in
a single-epoch LOFAR observation to be highly circularly polarised and
rotationally modulated at $\approx 3$h. Unexpectedly, our long-term monitoring
reveals an additional periodic signature at $\approx 0.787$h. We consider
several explanations for the second period and suggest that it could be the
rotationally modulated emission of the second brown dwarf in the binary,
although follow-up infrared observations are necessary to confirm this
hypothesis. In addition, the data also allow us to statistically estimate the
duty cycle and observed radio-loud fraction of the 120-168\,MHz cyclotron
emission from methane dwarfs to be $\langle D \rangle =
0.030^{+0.034}_{-0.030}$ and $F^{'}_{\rm radio} = 0.088^{+0.168}_{-0.088}$
respectively.
Authors' comments: 16 pages, 15 figures, 1 table. Accepted for publication in A&A
Iancu Andrei, Marius Kloetzer, Cristian Mahulea, Catalin Dosoftei
In this paper, we propose a computationally efficient quadratic programming (QP) approach for generating smooth, $C^1$ continuous paths for mobile robots using piece-wise quadratic Bezier (PWB) curves. Our method explicitly incorporates safety margins within a structured optimization framework, balancing trajectory smoothness and robustness with manageable numerical complexity suitable for real-time and embedded applications. Comparative simulations demonstrate clear advantages over traditional piece-wise linear (PWL) path planning methods, showing reduced trajectory deviations, enhanced robustness, and improved overall path quality. These benefits are validated through simulations using a Pure-Pursuit controller in representative scenarios, highlighting the practical effectiveness and scalability of our approach for safe navigation.
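To make the $C^1$ requirement on piece-wise quadratic Bezier curves concrete, here is a minimal sketch of the standard curve evaluation and the usual joint condition between two consecutive segments. It does not reproduce the paper's QP formulation or its safety margins. The function names are illustrative, and equal parameter ranges for both segments are assumed.

```python
from typing import Tuple

Point = Tuple[float, float]


def bezier2(p0: Point, p1: Point, p2: Point, t: float) -> Point:
    """Point on a quadratic Bezier curve: (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2."""
    a, b, c = (1.0 - t) ** 2, 2.0 * (1.0 - t) * t, t ** 2
    return (a * p0[0] + b * p1[0] + c * p2[0],
            a * p0[1] + b * p1[1] + c * p2[1])


def bezier2_derivative(p0: Point, p1: Point, p2: Point, t: float) -> Point:
    """Tangent vector: 2(1-t)(P1 - P0) + 2t(P2 - P1)."""
    return (2.0 * (1.0 - t) * (p1[0] - p0[0]) + 2.0 * t * (p2[0] - p1[0]),
            2.0 * (1.0 - t) * (p1[1] - p0[1]) + 2.0 * t * (p2[1] - p1[1]))


def c1_next_control(p1: Point, p2: Point) -> Point:
    """Middle control point of the next segment that keeps the joint C^1.

    The next segment starts at Q0 = P2; matching tangents requires
    Q1 - Q0 = P2 - P1, i.e. Q1 = 2 * P2 - P1.
    """
    return (2.0 * p2[0] - p1[0], 2.0 * p2[1] - p1[1])
```

For example, with a first segment `(0, 0), (1, 2), (2, 2)`, the next segment must use `c1_next_control((1, 2), (2, 2))` as its middle control point, which leaves only its end point free. In an optimization setting this joint condition is a linear equality in the control points.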
Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He et al.
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method
widely used in large language models (LLMs). LoRA essentially describes the
projection of an input space into a low-dimensional output space, with the
dimensionality determined by the LoRA rank. In standard LoRA, all input tokens
share the same weights and undergo an identical input-output projection. This
limits LoRA's ability to capture token-specific information due to the inherent
semantic differences among tokens. To address this limitation, we propose
Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts
LoRA weights according to the input token, thereby learning token-wise
input-output projections in an end-to-end manner. Formally, the weights of
TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank
matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated
from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA
weights but achieves more granular adaptation by learning token-wise LoRA
weights (i.e., token-wise input-output projections). Extensive experiments
across multiple models and datasets demonstrate that TopLoRA consistently
outperforms LoRA and its variants. The code is available at
https://github.com/Leopold1423/toplora-neurips25.
Authors' comments: Accepted by NeurIPS 2025
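The token-wise weight $B\Sigma_X A$ described in the TopLoRA abstract can be illustrated with a small sketch. The abstract does not say how $\Sigma_X$ is generated from the token, so the diagonal entries are passed in as a plain vector `sigma`. The helper names and the pure-Python matrix representation are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def toplora_update(x: Vector, A: Matrix, B: Matrix, sigma: Vector) -> Vector:
    """Compute (B diag(sigma) A) x for a single token x.

    A has shape (r, d_in), B has shape (d_out, r), and sigma has length r.
    How sigma is produced from the token is left open here (assumption).
    """
    low_rank = matvec(A, x)  # project the token into the rank-r space
    if len(sigma) != len(low_rank):
        raise ValueError("sigma must have one entry per LoRA rank dimension")
    scaled = [s * z for s, z in zip(sigma, low_rank)]  # token-specific diagonal scaling
    return matvec(B, scaled)  # map back to the output space
```

Setting every entry of `sigma` to 1 recovers the standard LoRA update $BAx$. The rank of the update never exceeds $r$, which matches the abstract's remark that TopLoRA does not increase the rank of the LoRA weights.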
Jinhee Kim, Jae Jun An, Kang Eun Jeon, Jong Hwan Ko
Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time by up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
Bin Gu, Lipeng Dai, Huipeng Du, Haitao Zhao, Jibo Wei
Learning robust speaker representations under noisy conditions presents significant challenges, requiring careful handling of both discriminative and noise-invariant properties. In this work, we propose an anchor-based stage-wise learning strategy for robust speaker representation learning. Specifically, our approach begins by training a base model to establish discriminative speaker boundaries, and then extracts anchor embeddings from this model as stable references. Finally, a copy of the base model is fine-tuned on noisy inputs, regularized by enforcing proximity to their corresponding fixed anchor embeddings to preserve speaker identity under distortion. Experimental results suggest that this strategy offers advantages over conventional joint optimization, particularly in maintaining discrimination while improving noise robustness. The proposed method demonstrates consistent improvements across various noise conditions, potentially due to its ability to handle boundary stabilization and variation suppression separately.
Zahra Mobini, Ahmet Hasim Gokceoglu, Li Wang, Gunnar Peters, Hyundong Shin, Hien Quoc Ngo
We exploit a general cluster-based network architecture for a fronthaul-limited user-centric cell-free massive multiple-input multiple-output (CF-mMIMO) system under different degrees of cooperation among the access points (APs) to achieve scalable implementation. In particular, we consider a CF-mMIMO system wherein the available APs are grouped into multiple processing clusters (PCs) to share channel state information (CSI), ensuring that they have knowledge of the CSI for all users assigned to the given cluster for the purposes of designing resource allocation and precoding. We utilize the sum pseudo-SE metric, which accounts for intra-cluster interference and inter-cluster leakage, providing a close approximation to the true sum achievable SE. For a given PC, we formulate two optimization problems to maximize the cluster-wise weighted sum pseudo-SE under fronthaul constraints, relying solely on local CSI. These optimization problems are associated with different computational complexity requirements. The first optimization problem jointly designs precoding, user association, and power allocation, and is performed at the small-scale fading time scale. The second optimization problem optimizes user association and power allocation at the large-scale fading time scale. Accordingly, we develop a novel application of a modified weighted minimum mean square error (WMMSE)-based approach to solve the challenging formulated non-convex mixed-integer problems.
Jinwoo Baek
Transformers trained in low precision can suffer forward-error amplification.
We give a first-order, module-wise theory that predicts when and where errors
grow. For self-attention we derive a per-layer bound that factorizes into three
interpretable diagnostics: a score-scale ratio $\kappa_{\rm score}$, a rowwise
softmax sensitivity $\kappa_{\rm softmax}$, and value conditioning $\kappa(V)$.
We prove a residual relaxation inequality showing that residual blocks
attenuate depth-wise accumulation, and we introduce a precision- and
width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order
bound in the $\epsilon$-dominated regime. These pieces yield a unified
forward-stability bound whose right-hand side is directly estimable during
training.
On Tiny-ViT/CIFAR-10 we evaluate the bound and components. (1) The combined
predictor $\kappa_{\rm softmax}\,(1+\kappa_{\rm score})\,\kappa(V)\,\|W_O\|_2
+\kappa_{\rm eff}+C_{\rm LN}$ tracks
FP32$\leftrightarrow$LP mismatches across seeds, widths, and precisions;
scaling by $\epsilon_{\rm mach}$ collapses mixed-precision points. (2) The
time-series maximum of $\kappa_{\rm softmax}$ acts as an early-warning signal,
leading error spikes by 16-24 steps (corr. 0.65-0.82; permutation
$p\!\approx\!10^{-3}$; Precision@K 0.89-1.00). (3) Guided by $\rho_{\rm LN}$, a
small LayerNorm-$\epsilon$ tweak targeting $\rho_\star$ gives consistent
stabilization (mean tail-loss $\downarrow\ \approx0.010$ at $\rho_\star\!=\!0.6$,
cap$=10^{-2}$) with negligible overhead.
Overall, our theory supplies actionable, unitless diagnostics that (i)
explain when self-attention is fragile, (ii) forecast instability, and (iii)
motivate a minimally invasive mitigation.
Authors' comments: 15 pages
Thao Nguyen, Jiaqi Ma, Fahad Shahbaz Khan, Souhaib Ben Taieb, Salman Khan
Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the spatio-temporal encoder that dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.
Xujun Peng, Anoop Kumar, Jingyu Wu, Parker Glenn, Daben Liu
Retrieval-Augmented Generation (RAG) systems leverage Large Language Models
(LLMs) to generate accurate and reliable responses that are grounded in
retrieved context. However, LLMs often generate inconsistent outputs for
semantically equivalent inputs, a problem compounded by the scarcity of
consistency-focused training data and the limitations of current fine-tuning
techniques in enhancing output consistency. We propose a new approach combining
systematic synthetic data generation, triplet loss for better embeddings, and a
novel layer-wise model merging approach. Using consistency-aware weights
derived from intermediate layer activations, our method effectively integrates
knowledge from specialized models. Experimental results show that our merged
model significantly enhances output consistency, achieving a ~47.5\%
improvement in response similarity over the baseline, thus offering a practical
solution for increasing the reliability of an industrial RAG system.
Authors' comments: EMNLP 2025 Industry track
Runlin Zhou, Letian Li, Zemin Zheng
We study personalized federated learning for multivariate responses where client models are heterogeneous yet share variable-level structure. Existing entry-wise penalties ignore cross-response dependence, while matrix-wise fusion over-couples clients. We propose a Sparse Row-wise Fusion (SROF) regularizer that clusters row vectors across clients and induces within-row sparsity, and we develop RowFed, a communication-efficient federated algorithm that embeds SROF into a linearized ADMM framework with privacy-preserving partial participation. Theoretically, we establish an oracle property for SROF (achieving correct variable-level group recovery with asymptotic normality) and prove convergence of RowFed to a stationary solution. Under random client participation, the iterate gap contracts at a rate that improves with participation probability. Empirically, simulations in heterogeneous regimes show that RowFed consistently lowers estimation and prediction error and strengthens variable-level cluster recovery over NonFed, FedAvg, and a personalized matrix-fusion baseline. A real-data study further corroborates these gains while preserving interpretability. Together, our results position row-wise fusion as an effective and transparent paradigm for large-scale personalized federated multivariate learning, bridging the gap between entry-wise and matrix-wise formulations.
Xiao Zheng, Wenchi Cheng, Jingqing Wang, Zhuohui Yao, Jiangzhou Wang
Active reconfigurable intelligent surface (RIS) emerges as an effective
technique to resist the double-fading attenuation of passive RIS. By embedding
a power-harvesting function, it further evolves into zero-power active RIS,
which can effectively enhance the flexibility of RIS deployment without
external power demand. Nevertheless, existing works neglected the inherent
difficulty of channel estimation (CE) for RIS-assisted systems, and the
discrete phase shift constraint in practical deployment. In this paper we
design a new element-wise RIS architecture and propose a distributed
location-aided transmission scheme with low complexity to enhance the reflected
gain for channel state information (CSI)-limited RIS-assisted near-field
communications. Specifically, the new element-wise RIS provides dynamic element
selection capability with low hardware resources. Based on Fresnel diffraction
theory, we construct the mapping from locations in space-domain to phase
distributions of waves in phase-domain and reveal the priority of elements for
harvesting and reflecting. Then, the distributed beamforming design with the
phase of determine-then-align is proposed, where the estimation overhead
reduction stems from exempted requirements of RIS-associated CE at the base
station (BS). The asymptotic analysis indicates that the proposed scheme can achieve
the optimal gain with a fixed proportion of reflective elements when RIS is
large, followed by simulations to verify its superiority to other protocols.
Authors' comments: 17 Pages
Hong-Kai Zheng, Piji Li
Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. The post-training codebook resampling method also achieves the desired flexibility in adjusting the codebook size.
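The basic vector quantization step that the abstract above builds on, representing each continuous vector by its closest codebook entry, can be sketched as follows. This shows only the generic nearest-code lookup and a simple utilization measure related to codebook collapse. It does not implement Group-VQ's group-wise optimization or its resampling method, and the function names are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    def sq_dist(k: int) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[k]))
    return min(range(len(codebook)), key=sq_dist)


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map every input vector to the index of its nearest code."""
    return [nearest_code(v, codebook) for v in vectors]


def codebook_utilization(indices: Sequence[int], codebook_size: int) -> float:
    """Fraction of codes selected at least once; low values hint at codebook collapse."""
    return len(set(indices)) / codebook_size
```

For instance, quantizing three 2-D vectors against a three-entry codebook in which one entry lies far from all inputs leaves that entry unused, and `codebook_utilization` then reports 2/3.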
Cheng Gong, Chunyu Qiang, Tianrui Wang, Yu Jiang, Yuheng Lu, Ruihao Jing, Xiaoxiao Miao, Xiaolei Zhang et al.
Cross-lingual emotional text-to-speech (TTS) aims to produce speech in one
language that captures the emotion of a speaker from another language while
maintaining the target voice's timbre. This process of cross-lingual emotional
speech synthesis presents a complex challenge, necessitating flexible control
over emotion, timbre, and language. However, emotion and timbre are highly
entangled in speech signals, making fine-grained control challenging. To
address this issue, we propose EMM-TTS, a novel two-stage cross-lingual
emotional speech synthesis framework based on perturbed self-supervised
learning (SSL) representations. In the first stage, the model explicitly and
implicitly encodes prosodic cues to capture emotional expressiveness, while the
second stage restores the timbre from perturbed SSL representations. We further
investigate the effect of different speaker perturbation strategies (formant
shifting and speaker anonymization) on the disentanglement of emotion and
timbre. To strengthen speaker preservation and expressive control, we introduce
Speaker Consistency Loss (SCL) and Speaker-Emotion Adaptive Layer Normalization
(SEALN) modules. Additionally, we find that incorporating explicit acoustic
features (e.g., F0, energy, and duration) alongside pretrained latent features
improves voice cloning performance. Comprehensive multi-metric evaluations,
including both subjective and objective measures, demonstrate that EMM-TTS
achieves superior naturalness, emotion transferability, and timbre consistency
across languages.
Authors' comments: Submitted to Expert Systems with Applications, 11 pages
Shuwei Chen, Jiajun Cui, Zhengqi Xu, Fan Zhang, Jiangke Fan, Teng Zhang, Xingxing Wang
Click-through rate (CTR) prediction, which models behavior sequence and
non-sequential features (e.g., user/item profiles or cross features) to infer
user interest, underpins industrial recommender systems. However, most methods
face three forms of heterogeneity that degrade predictive performance: (i)
Feature Heterogeneity persists when limited sequence side features provide less
granular interest representation compared to extensive non-sequential features,
thereby impairing sequence modeling performance; (ii) Context Heterogeneity
arises because a user's interest in an item will be influenced by other items,
yet point-wise prediction neglects cross-item interaction context from the
entire item set; (iii) Architecture Heterogeneity stems from the fragmented
integration of specialized network modules, which compromises the model's
effectiveness, efficiency and scalability in industrial deployments. To tackle
the above limitations, we propose HoMer, a Homogeneous-Oriented TransforMer for
modeling sequential and set-wise contexts. First, we align sequence side
features with non-sequential features for accurate sequence modeling and
fine-grained interest representation. Second, we shift the prediction paradigm
from point-wise to set-wise, facilitating cross-item interaction in a highly
parallel manner. Third, HoMer's unified encoder-decoder architecture achieves
dual optimization through structural simplification and shared computation,
ensuring computational efficiency while maintaining scalability with model
size. Without arduous modification to the prediction pipeline, HoMer
successfully scales up and outperforms our industrial baseline by 0.0099 in the
AUC metric, and enhances online business metrics like CTR/RPM by 1.99%/2.46%.
Additionally, HoMer saves 27% of GPU resources via preliminary engineering
optimization, further validating its superiority and practicality.
Authors' comments: 10 pages, 6 figures