Joseph Onuegbu, Dafne Guetta, Yael Hillman, Volker Perdelwitz, Massimo Della Valle
We present a novel approach for characterizing nova candidates by exploiting the infrared capabilities of the Wide-field Infrared Survey Explorer (WISE) catalog. We developed a pipeline to identify novae based on well-defined infrared criteria, and leveraging this pipeline, we successfully identified 41 optically confirmed novae in the WISE catalog. In particular, we focus on the color difference between the optical V band and the WISE 3.4 micron W1 band as a diagnostic. We compared their infrared light curves with their optical counterparts. We identified a strong correlation from which we proposed a color difference model that can be used for further identification and characterization of novae. Our analysis validates the mass-loss timescale theory, which predicts that systems with lower accretion rates accumulate larger envelopes and produce more massive ejecta. We also confirm the models' prediction that the early color evolution of novae is governed by ejecta expansion and cooling. From our sample statistics, we infer a Galactic nova rate of approximately 40 to 50 novae per year, consistent with modern and infrared-corrected estimates. The resultant model from this work paves the way for future large-scale investigations of nova candidates.
Farhana Amin, Sabiha Afroz, Kanchon Gharami, Mona Moghadampanah, Dimitrios S. Nikolopoulos
Diffusion models produce high-quality images, but inference is costly due to many denoising steps and heavy matrix operations. We present DiffPro, a post-training, hardware-faithful framework that works with the exact integer kernels used in deployment and jointly tunes timesteps and per-layer precision in Diffusion Transformers (DiTs) to reduce latency and memory without any training. DiffPro combines three parts: a manifold-aware sensitivity metric to allocate weight bits, dynamic activation quantization to stabilize activations across timesteps, and a budgeted timestep selector guided by teacher-student drift. In experiments, DiffPro achieves up to 6.25x model compression, fifty percent fewer timesteps, and 2.8x faster inference with Delta FID <= 10 on standard benchmarks, demonstrating practical efficiency gains. DiffPro unifies step reduction and precision planning into a single budgeted deployable plan for real-time energy-aware diffusion inference.
Jiahao Wang, Bokang Fu, Yu Zhu, Yuli Liu
LLM-based agents are emerging as a promising paradigm for simulating user behavior to enhance recommender systems. However, their effectiveness is often limited by existing studies that focus on modeling user ratings for individual items. This point-wise approach leads to prevalent issues such as inaccurate user preference comprehension and rigid item-semantic representations. To address these limitations, we propose the novel Set-wise Reflective Learning Framework (SRLF). Our framework operationalizes a closed-loop "assess-validate-reflect" cycle that harnesses the powerful in-context learning capabilities of LLMs. SRLF departs from conventional point-wise assessment by formulating a holistic judgment on an entire set of items. It accomplishes this by comprehensively analyzing both the intricate interrelationships among items within the set and their collective alignment with the user's preference profile. This method of set-level contextual understanding allows our model to capture complex relational patterns essential to user behavior, making it significantly more adept for sequential recommendation. Extensive experiments validate our approach, confirming that this set-wise perspective is crucial for achieving state-of-the-art performance in sequential recommendation tasks.
Jawad Ibn Ahad, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman
Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit their accessibility to many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs ($\sim$3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.
Ci Lin, Tet Yeap, Iluju Kiringa, Biwei Zhang
Deep neural networks are known to be vulnerable to adversarial perturbations, which are small and carefully crafted inputs that lead to incorrect predictions. In this paper, we propose DeepDefense, a novel defense framework that applies Gradient-Feature Alignment (GFA) regularization across multiple layers to suppress adversarial vulnerability. By aligning input gradients with internal feature representations, DeepDefense promotes a smoother loss landscape in tangential directions, thereby reducing the model's sensitivity to adversarial noise.
We provide theoretical insights into how adversarial perturbation can be decomposed into radial and tangential components and demonstrate that alignment suppresses loss variation in tangential directions, where most attacks are effective. Empirically, our method achieves significant improvements in robustness across both gradient-based and optimization-based attacks. For example, on CIFAR-10, CNN models trained with DeepDefense outperform standard adversarial training by up to 15.2% under APGD attacks and 24.7% under FGSM attacks. Against optimization-based attacks such as DeepFool and EADEN, DeepDefense requires 20 to 30 times higher perturbation magnitudes to cause misclassification, indicating stronger decision boundaries and a flatter loss landscape. Our approach is architecture-agnostic, simple to implement, and highly effective, offering a promising direction for improving the adversarial robustness of deep learning models.
Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon, U Kang
How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ achieves state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
Authors' comments: AAAI 2026
Mingkun Yang, Ran Zhu, Qing Wang, Jie Yang
Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge undermining the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on gradient updates of the server-side submodel at each optimization step. Extensive validations on multiple real-world datasets show that SMoFi consistently improves global model accuracy (up to 7.1%) and convergence speed (up to 10.25$\times$). Furthermore, SMoFi has a greater impact with more clients involved and deeper learning models, making it particularly suitable for model training in resource-constrained contexts.
Authors' comments: Paper accepted by AAAI 2026
Remi Luschei, Werner Brannath
We consider clinical trials with multiple, overlapping patient populations that test multiple treatment policies specifically tailored to these populations. Such designs may lead to multiplicity issues, as false statements will affect several populations. For type I error control, often the family-wise error rate (FWER) is controlled, which is the probability of rejecting at least one true null hypothesis. If the joint distribution of the test statistics is known, the FWER level can be exhausted by determining critical values or adjusted $α$-levels. The adjustment is typically done under the common ANOVA assumptions. However, the performed tests are then only valid under the rather strong assumption of homogeneous null effects, i.e., when the null hypothesis applies to all subpopulations and their intersections. We show that under cancelling null effects, when heterogeneous effects cancel out in some or all subpopulations, this procedure does not provide FWER control. We also suggest different alternatives and compare them in terms of FWER control and their power.
Liang Luo, Lei Zhang
Image restoration requires a careful balance between noise suppression and structure preservation. While first-order total variation (TV) regularization effectively preserves edges, it often introduces staircase artifacts, whereas higher-order TV removes such artifacts but oversmooths fine details. To reconcile these competing effects, we propose a semi-convergent stage-wise framework that sequentially integrates first- and higher-order TV regularizers within an iterative restoration process implemented via ADMM. Each stage exhibits semi-convergence behavior, i.e., the iterates initially approach the ground truth before being degraded by over-regularization. By monitoring this evolution, the algorithm adaptively selects the locally optimal iterate (e.g., with the highest PSNR) and propagates it as the initial point for the next stage. This select-and-propagate mechanism effectively transfers local semi-convergence into a globally convergent iterative process. We establish theoretical guarantees showing that the sequence of stage-wise iterates is bounded and that the objective values decrease monotonically. Extensive numerical experiments on denoising and deblurring benchmarks confirm that the proposed method achieves superior quantitative and perceptual performance compared with conventional first-order, higher-order, and hybrid TV methods, as well as learning-based methods, while maintaining theoretical interpretability and algorithmic simplicity.
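The select-and-propagate mechanism described in this abstract can be illustrated with a minimal sketch. It assumes each stage is available as a hypothetical callable `step` that maps one iterate to the next (the actual ADMM updates with first- and higher-order TV are not reproduced here), and it scores iterates with PSNR against a known reference, which is only one possible selection criterion. All function names are illustrative and not taken from the paper.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio (dB) between two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)

def run_stage(x0, step, num_iters, score):
    """Run one stage and return the best-scoring iterate seen (select)."""
    best, best_score = x0, score(x0)
    x = x0
    for _ in range(num_iters):
        x = step(x)  # hypothetical single iteration of this stage's solver
        s = score(x)
        if s > best_score:
            best, best_score = x, s
    return best

def stagewise_restore(x0, stages, num_iters, score):
    """Chain stages, starting each from the previous stage's best iterate (propagate)."""
    x = x0
    for step in stages:
        x = run_stage(x, step, num_iters, score)
    return x
```

With a toy `step` that first approaches and then overshoots the reference, `run_stage` returns the iterate closest to the reference rather than the last one, which is the semi-convergence behavior the abstract refers to.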
Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin et al.
Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
Authors' comments: Preprint
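For readers unfamiliar with Generalized Advantage Estimation, the following is a minimal sketch of the standard GAE recursion built on TD residuals. The abstract does not specify how AgentPRM combines TD-based estimation with GAE to produce step labels, so this only shows the textbook computation; the function name and default hyperparameters are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE over one trajectory.

    rewards: per-step rewards r_0..r_{T-1}
    values:  value estimates V(s_0)..V(s_T), i.e. one more entry than rewards
             (the last entry is the bootstrap value of the final state)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD residual, while `lam=1` accumulates the full discounted sum of residuals along the trajectory.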
Grigory Kovalev, Mikhail Tikhomirov
This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.
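The two ingredients named in this abstract, iterative layer-importance evaluation and a joint KL plus MSE loss, can be sketched as follows. This is an illustrative sketch under stated assumptions: `evaluate` is a hypothetical callback returning a quality score for a given set of kept layers, the MSE is assumed to be taken over hidden states, and the weighting `alpha` is an assumption. The further training that the abstract combines with pruning is omitted.

```python
import math

def iterative_prune(layer_ids, evaluate, target_count):
    """Greedily drop, one at a time, the layer whose removal hurts quality least."""
    kept = list(layer_ids)
    while len(kept) > target_count:
        # Re-assess importance at every step: score the model without each layer.
        least_important = max(
            kept, key=lambda l: evaluate([k for k in kept if k != l])
        )
        kept.remove(least_important)
    return kept

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def joint_loss(teacher_logits, student_logits,
               teacher_hidden, student_hidden, alpha=0.5):
    """Weighted sum of KL on output distributions and MSE on hidden states."""
    kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return alpha * kl + (1.0 - alpha) * mse(teacher_hidden, student_hidden)
```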
Timothy W. H. Yiu, Harish K. Vedantham, Joseph R. Callingham, Timothy W. Shimwell
Brown dwarfs display Jupiter-like auroral phenomena, such as rotationally
modulated electron cyclotron maser radio emission. Radio observations of
cyclotron maser emission can be used to measure their magnetic field strength and
topology, and to deduce the presence of magnetically interacting exoplanets.
Observations of the coldest brown dwarfs (spectral types T and Y) are
especially intriguing, as their magnetospheric phenomena could closely resemble
those of gas-giant exoplanets. Here we report observations made over ten
epochs, amounting to 44 hours, of WISEP J101905.63+652954.2 (J1019+65
hereinafter) using the LOFAR telescope between 120 and 168 MHz. J1019+65 is a
methane dwarf binary (T5.5+T7) whose radio emission was originally detected in
a single-epoch LOFAR observation to be highly circularly polarised and
rotationally modulated at $\approx 3$h. Unexpectedly, our long-term monitoring
reveals an additional periodic signature at $\approx 0.787$h. We consider
several explanations for the second period and suggest that it could be the
rotationally modulated emission of the second brown dwarf in the binary,
although follow-up infrared observations are necessary to confirm this
hypothesis. In addition, the data also allow us to statistically estimate the
duty cycle and observed radio-loud fraction of the 120-168\,MHz cyclotron
emission from methane dwarfs to be $\langle D \rangle =
0.030^{+0.034}_{-0.030}$ and $F^{'}_{\rm radio} = 0.088^{+0.168}_{-0.088}$
respectively.
Authors' comments: 16 pages, 15 figures, 1 table. Accepted for publication in A&A
Iancu Andrei, Marius Kloetzer, Cristian Mahulea, Catalin Dosoftei
In this paper, we propose a computationally efficient quadratic programming (QP) approach for generating smooth, $C^1$ continuous paths for mobile robots using piece-wise quadratic Bezier (PWB) curves. Our method explicitly incorporates safety margins within a structured optimization framework, balancing trajectory smoothness and robustness with manageable numerical complexity suitable for real-time and embedded applications. Comparative simulations demonstrate clear advantages over traditional piece-wise linear (PWL) path planning methods, showing reduced trajectory deviations, enhanced robustness, and improved overall path quality. These benefits are validated through simulations using a Pure-Pursuit controller in representative scenarios, highlighting the practical effectiveness and scalability of our approach for safe navigation.
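The $C^1$ continuity requirement for piece-wise quadratic Bezier paths mentioned in this abstract can be made concrete with a small sketch. It assumes each segment is parameterized on $[0, 1]$ with equal duration, so continuity at a joint reduces to matching end points and matching end derivatives. The function names are illustrative, and the paper's QP formulation and safety margins are not reproduced.

```python
def bezier2(p0, p1, p2, t):
    """Point on a quadratic Bezier curve with control points p0, p1, p2."""
    u = 1.0 - t
    return tuple(u * u * a + 2.0 * u * t * b + t * t * c
                 for a, b, c in zip(p0, p1, p2))

def bezier2_derivative(p0, p1, p2, t):
    """First derivative: 2(1 - t)(p1 - p0) + 2t(p2 - p1)."""
    return tuple(2.0 * (1.0 - t) * (b - a) + 2.0 * t * (c - b)
                 for a, b, c in zip(p0, p1, p2))

def is_c1_joint(seg_a, seg_b, tol=1e-9):
    """Check position and first-derivative continuity where seg_a meets seg_b."""
    end_a = bezier2(*seg_a, 1.0)
    start_b = bezier2(*seg_b, 0.0)
    d_end_a = bezier2_derivative(*seg_a, 1.0)
    d_start_b = bezier2_derivative(*seg_b, 0.0)
    same_point = all(abs(x - y) <= tol for x, y in zip(end_a, start_b))
    same_slope = all(abs(x - y) <= tol for x, y in zip(d_end_a, d_start_b))
    return same_point and same_slope
```

Under this parameterization the condition is linear in the control points (the last control leg of one segment equals the first leg of the next), which is the kind of constraint that fits naturally into a QP.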
Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He et al.
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method
widely used in large language models (LLMs). LoRA essentially describes the
projection of an input space into a low-dimensional output space, with the
dimensionality determined by the LoRA rank. In standard LoRA, all input tokens
share the same weights and undergo an identical input-output projection. This
limits LoRA's ability to capture token-specific information due to the inherent
semantic differences among tokens. To address this limitation, we propose
Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts
LoRA weights according to the input token, thereby learning token-wise
input-output projections in an end-to-end manner. Formally, the weights of
TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank
matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated
from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA
weights but achieves more granular adaptation by learning token-wise LoRA
weights (i.e., token-wise input-output projections). Extensive experiments
across multiple models and datasets demonstrate that TopLoRA consistently
outperforms LoRA and its variants. The code is available at
https://github.com/Leopold1423/toplora-neurips25.
Authors' comments: Accepted by NeurIPS 2025
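The token-wise weight $B\Sigma_X A$ from the TopLoRA abstract can be illustrated with a minimal sketch in plain Python. The abstract does not say how $\Sigma_X$ is generated from the token, so `sigma_from_token` below is a purely hypothetical linear generator; only the structure $B\,\mathrm{diag}(\sigma)\,A\,x$ is taken from the text. For the actual method, see the authors' repository linked in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sigma_from_token(w_sigma, x):
    """Hypothetical generator: one diagonal entry per rank dimension, from token x."""
    return matvec(w_sigma, x)

def toplora_delta(B, A, sigma_diag, x):
    """Apply the low-rank update (B diag(sigma) A) to one token x.

    A: r x d_in, B: d_out x r, sigma_diag: length r.
    Multiplying by a diagonal matrix is an element-wise scaling, so the
    rank of the update is still at most r.
    """
    z = matvec(A, x)                               # project down to rank r
    z = [s * zi for s, zi in zip(sigma_diag, z)]   # token-specific scaling
    return matvec(B, z)                            # project up to d_out
```

With `sigma_diag` fixed to all ones this reduces to the standard LoRA update $BAx$, so different tokens receive different effective weights only through the diagonal.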
Jinhee Kim, Jae Jun An, Kang Eun Jeon, Jong Hwan Ko
Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time by up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
Bin Gu, Lipeng Dai, Huipeng Du, Haitao Zhao, Jibo Wei
Learning robust speaker representations under noisy conditions presents significant challenges, requiring careful handling of both discriminative and noise-invariant properties. In this work, we propose an anchor-based stage-wise learning strategy for robust speaker representation learning. Specifically, our approach begins by training a base model to establish discriminative speaker boundaries and then extracts anchor embeddings from this model as stable references. Finally, a copy of the base model is fine-tuned on noisy inputs, regularized by enforcing proximity to their corresponding fixed anchor embeddings to preserve speaker identity under distortion. Experimental results suggest that this strategy offers advantages over conventional joint optimization, particularly in maintaining discrimination while improving noise robustness. The proposed method demonstrates consistent improvements across various noise conditions, potentially due to its ability to handle boundary stabilization and variation suppression separately.
Zahra Mobini, Ahmet Hasim Gokceoglu, Li Wang, Gunnar Peters, Hyundong Shin, Hien Quoc Ngo
We exploit a general cluster-based network architecture for a fronthaul-limited user-centric cell-free massive multiple-input multiple-output (CF-mMIMO) system under different degrees of cooperation among the access points (APs) to achieve scalable implementation. In particular, we consider a CF-mMIMO system wherein the available APs are grouped into multiple processing clusters (PCs) to share channel state information (CSI), ensuring that they have knowledge of the CSI for all users assigned to the given cluster for the purposes of designing resource allocation and precoding. We utilize the sum pseudo-SE metric, which accounts for intra-cluster interference and inter-cluster leakage, providing a close approximation to the true sum achievable SE. For a given PC, we formulate two optimization problems to maximize the cluster-wise weighted sum pseudo-SE under fronthaul constraints, relying solely on local CSI. These optimization problems are associated with different computational complexity requirements. The first optimization problem jointly designs precoding, user association, and power allocation, and is performed at the small-scale fading time scale. The second optimization problem optimizes user association and power allocation at the large-scale fading time scale. Accordingly, we develop a novel application of a modified weighted minimum mean square error (WMMSE)-based approach to solve the challenging formulated non-convex mixed-integer problems.
Jinwoo Baek
Transformers trained in low precision can suffer forward-error amplification.
We give a first-order, module-wise theory that predicts when and where errors
grow. For self-attention we derive a per-layer bound that factorizes into three
interpretable diagnostics: a score-scale ratio $\kappa_{\rm score}$, a rowwise
softmax sensitivity $\kappa_{\rm softmax}$, and value conditioning $\kappa(V)$.
We prove a residual relaxation inequality showing that residual blocks
attenuate depth-wise accumulation, and we introduce a precision- and
width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order
bound in the $\epsilon$-dominated regime. These pieces yield a unified
forward-stability bound whose right-hand side is directly estimable during
training.
On Tiny-ViT/CIFAR-10 we evaluate the bound and components. (1) The combined
predictor $\kappa_{\rm softmax}\,(1+\kappa_{\rm
score})\,\kappa(V)\,\|W_O\|_2+\kappa_{\rm eff}+C_{\rm LN}$ tracks
FP32$\leftrightarrow$LP mismatches across seeds, widths, and precisions;
scaling by $\epsilon_{\rm mach}$ collapses mixed-precision points. (2) The
time-series maximum of $\kappa_{\rm softmax}$ acts as an early-warning signal,
leading error spikes by 16-24 steps (corr. 0.65-0.82; permutation
$p\!\approx\!10^{-3}$; Precision@K 0.89-1.00). (3) Guided by $\rho_{\rm LN}$, a
small LayerNorm-$\epsilon$ tweak targeting $\rho_\star$ gives consistent
stabilization (mean tail-loss $\downarrow\ \approx0.010$ at $\rho_\star\!=\!0.6$,
cap$=10^{-2}$) with negligible overhead.
Overall, our theory supplies actionable, unitless diagnostics that (i)
explain when self-attention is fragile, (ii) forecast instability, and (iii)
motivate a minimally invasive mitigation.
Authors' comments: 15 pages
Thao Nguyen, Jiaqi Ma, Fahad Shahbaz Khan, Souhaib Ben Taieb, Salman Khan
Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the spatio-temporal encoder that dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.
Xujun Peng, Anoop Kumar, Jingyu Wu, Parker Glenn, Daben Liu
Retrieval-Augmented Generation (RAG) systems leverage Large Language Models
(LLMs) to generate accurate and reliable responses that are grounded in
retrieved context. However, LLMs often generate inconsistent outputs for
semantically equivalent inputs, a problem compounded by the scarcity of
consistency-focused training data and the limitations of current fine-tuning
techniques in enhancing output consistency. We propose a new approach combining
systematic synthetic data generation, triplet loss for better embeddings, and a
novel layer-wise model merging approach. Using consistency-aware weights
derived from intermediate layer activations, our method effectively integrates
knowledge from specialized models. Experimental results show that our merged
model significantly enhances output consistency, achieving a ~47.5\%
improvement in response similarity over the baseline, thus offering a practical
solution for increasing the reliability of an industrial RAG system.
Authors' comments: EMNLP 2025 Industry track