Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng
Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency by more than 20% and inference throughput by more than 15% while maintaining strong performance.
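The space-to-channel merge described in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 2x2 merge factor, the `proj` projection matrix, and the choice of block averaging as the non-parametric shortcut are all assumptions made for illustration.

```python
import numpy as np

def space_to_channel(tokens, r=2):
    """Merge each r x r neighbourhood of tokens into one token by stacking channels."""
    h, w, c = tokens.shape
    assert h % r == 0 and w % r == 0
    x = tokens.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (h/r, w/r, r, r, c)
    return x.reshape(h // r, w // r, r * r * c)

def compress(tokens, proj, r=2):
    """Pixel-shuffle merge, linear projection, plus a parameter-free shortcut."""
    h, w, c = tokens.shape
    merged = space_to_channel(tokens, r)        # (h/r, w/r, r*r*c)
    # Assumed shortcut: the plain average of the r*r tokens that were merged.
    shortcut = merged.reshape(h // r, w // r, r * r, c).mean(axis=2)
    return merged @ proj + shortcut

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 4, 8))             # a 4x4 grid of 8-dim tokens
proj = rng.normal(size=(2 * 2 * 8, 8)) * 0.01   # hypothetical learned projection
out = compress(tokens, proj)                    # 16 tokens become 4 tokens
```

With `r=2` the token count drops by a factor of four, while the shortcut passes an unlearned summary of the merged tokens around the projection.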
Dake Guo, Jixun Yao, Linhan Ma, Wang He, Lei Xie
Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms. Speech samples: https://dukguo.github.io/StreamFlow/
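A block-wise attention mask of the kind described above can be sketched in a few lines of plain Python. The function name and the `look_back` / `look_ahead` parameters are hypothetical; the abstract does not specify how StreamFlow parameterizes or combines its masks.

```python
def block_mask(seq_len, block_size, look_back=1, look_ahead=0):
    """mask[i][j] is True when position i may attend to position j.

    A position sees its own block, `look_back` earlier blocks and
    `look_ahead` later blocks (all parameter names are illustrative).
    """
    mask = []
    for i in range(seq_len):
        bi = i // block_size
        row = []
        for j in range(seq_len):
            bj = j // block_size
            row.append(bi - look_back <= bj <= bi + look_ahead)
        mask.append(row)
    return mask

# Six frames in blocks of two; each block sees itself and the previous block.
mask = block_mask(6, 2, look_back=1, look_ahead=0)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Stacking layers that each look back one block widens the effective context by one block per layer, which is one way a bounded receptive field can be regulated across layers.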
Yash Vardhan Tomar
Bias in predictive machine learning (ML) models is a fundamental challenge due to the skewed or unfair outcomes produced by biased models. Existing mitigation strategies rely on either post-hoc corrections or rigid constraints. However, emerging research claims that these techniques can limit scalability and reduce generalizability. To address this, this paper introduces a feature-wise mixing framework to mitigate contextual bias. It does so by redistributing feature representations across multiple contextual datasets. To assess feature-wise mixing's effectiveness, four ML classifiers were trained using cross-validation and evaluated with bias-sensitive loss functions, including disparity metrics and mean squared error (MSE), which served as a standard measure of predictive performance. The proposed method achieved an average bias reduction of 43.35% and a statistically significant decrease in MSE across all classifiers trained on mixed datasets. Additionally, benchmarking against established bias mitigation techniques found that feature-wise mixing consistently outperformed SMOTE oversampling and demonstrated competitive effectiveness without requiring explicit bias attribute identification. Feature-wise mixing efficiently avoids the computational overhead typically associated with fairness-aware learning algorithms. Future work could explore applying feature-wise mixing to real-world domains where accurate predictions are necessary.
Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu
Recently, autoregressive (AR) language models have emerged as a dominant
approach in speech synthesis, offering expressive generation and scalable
training. However, conventional AR speech synthesis models relying on the
next-token prediction paradigm often encounter significant challenges when
handling long speech sequences. These models often struggle to construct stable
frame-to-frame attention, leading to increased latency and degraded synthesis
quality, thereby limiting their feasibility for real-time applications. To
address these limitations, we introduce a novel dynamic chunk-wise
autoregressive synthesis framework, termed DCAR, designed to enhance both
efficiency and intelligibility robustness in AR speech generation. DCAR
introduces a chunk-to-frame attention mechanism through training with
multi-token prediction, enabling dynamic chunk prediction in variable speech
contexts using a lightweight module trained on-policy. DCAR dynamically adjusts
the token prediction span, significantly reducing the sequence length
dependency while obtaining high synthesis quality. Comprehensive empirical
evaluations demonstrate that DCAR substantially outperforms traditional
next-token prediction models, achieving up to 72.27% intelligibility
improvement and 2.61x inference speedup simultaneously on the test set.
Furthermore, we conduct comprehensive analysis to support it as a versatile
foundation for next-generation speech synthesis systems.
Authors' comments: 17 pages, 8 figures, 5 tables
Taiga Someya, Ryo Yoshida, Hitomi Yanaka, Yohei Oseki
Recent work has demonstrated that neural language models encode syntactic structures in their internal representations, yet the derivations by which these structures are constructed across layers remain poorly understood. In this paper, we propose Derivational Probing to investigate how micro-syntactic structures (e.g., subject noun phrases) and macro-syntactic structures (e.g., the relationship between the root verbs and their direct dependents) are constructed as word embeddings propagate upward across layers. Our experiments on BERT reveal a clear bottom-up derivation: micro-syntactic structures emerge in lower layers and are gradually integrated into a coherent macro-syntactic structure in higher layers. Furthermore, a targeted evaluation on subject-verb number agreement shows that the timing of constructing macro-syntactic structures is critical for downstream performance, suggesting an optimal timing for integrating global syntactic information.
Haofeng Wang, Fangtao Zhou, Qi Zhang, Zeyuan Chen, Enci Zhang, Zhao Wang, Xiaofeng Huang, Siwei Ma
RGB-IR (RGB-Infrared) image pairs are frequently applied simultaneously in
various applications like intelligent surveillance. However, as the number of
modalities increases, the required data storage and transmission costs also
double. Therefore, efficient RGB-IR data compression is essential. This work
proposes a joint compression framework for RGB-IR image pairs. Specifically, to
fully utilize cross-modality prior information for accurate context probability
modeling within and between modalities, we propose a Channel-wise
Cross-modality Entropy Model (CCEM). Within CCEM, a Low-frequency Context
Extraction Block (LCEB) and a Low-frequency Context Fusion Block (LCFB) are
designed for extracting and aggregating the global low-frequency information
from both modalities, which assist the model in predicting entropy parameters
more accurately. Experimental results demonstrate that our approach outperforms
existing RGB-IR image pair and single-modality compression methods on the LLVIP
and KAIST datasets. For instance, the proposed framework achieves a 23.1% bit
rate saving on the LLVIP dataset compared to the state-of-the-art RGB-IR image codec
presented at CVPR 2022.
Authors' comments: IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2025,
under review
Lucas Tendick, Costantino Budroni, Marco Túlio Quintino
The incompatibility of quantum measurements, i.e., the fact that certain
observable quantities cannot be measured jointly, is widely regarded as a
distinctive quantum feature with important implications for the foundations and
the applications of quantum information theory. While the standard
incompatibility of multiple measurements has been the focus of attention since
the inception of quantum theory, its generalizations, such as measurement
simulability, $n$-wise incompatibility, and multi-copy incompatibility, have
only been proposed recently. Here, we point out that all these generalizations
are differing notions of the question of how many measurements are genuinely
contained in a measurement device. We then show that all notions differ not
only in their operational meaning but also mathematically in the set of
measurement assemblages they describe. We then fully resolve the relations
between these different generalizations, by showing a strict hierarchy between
these notions. Hence, we provide a general framework for generalized
measurement incompatibility. Finally, we consider the implications our results
have for recent works using these different notions.
Authors' comments: 24 pages, 6 figures, comments are welcome!
Lingfei Wang, Yu Xing, Yuhao Yi, Ming Cao, Karl H. Johansson
This paper investigates the problem of leadership development for an external influencer using the Friedkin-Johnsen (FJ) opinion dynamics model, where the influencer is modeled as a fully stubborn agent and leadership is quantified by social power. The influencer seeks to maximize her social power by strategically adding a limited number of links to regular agents. This optimization problem is shown to be equivalent to maximizing the absorbing probability to the influencer in an augmented Markov chain. The resulting objective function is both monotone and submodular, enabling the use of a greedy algorithm to compute an approximate solution. To handle large-scale networks efficiently, a random walk sampling over the Markov chain is employed to reduce computational complexity. Analytical characterizations of the solution are provided for both low and high stubbornness of regular agents. Specific network topologies are also examined: for complete graphs with rank-one weight matrices, the problem reduces to a hyperbolic 0-1 programming problem, which is solvable in polynomial time; for symmetric ring graphs with circulant weight matrices and uniform agent stubbornness, the optimal strategy involves selecting agents that are sufficiently dispersed across the network. Numerical simulations are presented for illustration.
Yu-Yang Qian, Yuan-Ze Xu, Zhen-Yu Zhang, Peng Zhao, Zhi-Hua Zhou
Many real-world applications collect data in a streaming environment, where
learning tasks are encountered sequentially. This necessitates continual
learning (CL) to update models online, enabling adaptation to new tasks while
preserving past knowledge to prevent catastrophic forgetting. Nowadays, with
the rise of large pre-trained models (LPMs), efficiency has become
increasingly critical for CL, due to their substantial computational demands
and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of
Low-Rank Adapters), a novel approach that constructs layer-wise adapters by
leveraging hierarchical gradient similarity to enable efficient CL,
particularly for LPMs. To reduce the computational burden of task similarity
estimation, we employ bandit techniques to develop an algorithm based on lower
confidence bounds to efficiently explore the task structure. Furthermore, we
use sparse gradient updates to facilitate parameter optimization, making the
approach better suited for LPMs. Theoretical analysis is provided to justify
the rationale behind our approach, and experiments on both vision transformers
(ViTs) and large language models (LLMs) demonstrate the effectiveness and
efficiency of our approach across various domains, including vision and natural
language processing tasks.
Authors' comments: ICML 2025
Cheng Wang, Siqi Chen, Donghua Mi, Yang Chen, Yudong Zhang, Yinsheng Li
Recent advances in medical imaging have established deep learning-based
segmentation as the predominant approach, though it typically requires large
amounts of manually annotated data. However, obtaining annotations for
intracranial hemorrhage (ICH) remains particularly challenging due to the
tedious and costly labeling process. Semi-supervised learning (SSL) has emerged
as a promising solution to address the scarcity of labeled data, especially in
volumetric medical image segmentation. Unlike conventional SSL methods that
primarily focus on high-confidence pseudo-labels or consistency regularization,
we propose SWDL-Net, a novel SSL framework that exploits the complementary
advantages of Laplacian pyramid and deep convolutional upsampling. The
Laplacian pyramid excels at edge sharpening, while deep convolutions enhance
detail precision through flexible feature mapping. Our framework achieves
superior segmentation of lesion details and boundaries through a difference
learning mechanism that effectively integrates these complementary approaches.
Extensive experiments on a 271-case ICH dataset and public benchmarks
demonstrate that SWDL-Net outperforms current state-of-the-art methods in
scenarios with only 2% labeled data. Additional evaluations on the publicly
available Brain Hemorrhage Segmentation Dataset (BHSD) with 5% labeled data
further confirm the superiority of our approach. Code and data have been
released at https://github.com/SIAT-CT-LAB/SWDL.
Authors' comments: 11 pages, 4 figures, 6 Tables
Shiji Zhao, Chi Chen, Ranjie Duan, Xizhe Wang, Xingxing Wei
Adversarial Training (AT) is widely recognized as an effective approach to
enhance the adversarial robustness of Deep Neural Networks. As a variant of AT,
Adversarial Robustness Distillation (ARD) has shown outstanding performance in
enhancing the robustness of small models. However, both AT and ARD face a robust
fairness issue: these models tend to display strong adversarial robustness
against some classes (easy classes) while demonstrating weak adversarial
robustness against others (hard classes). This paper explores the underlying
factors of this problem and points out that the smoothness degree of soft labels for
different classes significantly impacts the robust fairness from both empirical
observation and theoretical analysis. Based on the above exploration, we
propose Anti-Bias Soft Label Distillation (ABSLD) within the Knowledge
Distillation framework to enhance the adversarial robust fairness.
Specifically, ABSLD adaptively reduces the student's error risk gap between
different classes, which is accomplished by adjusting the class-wise smoothness
degree of teacher's soft labels during the training process, and the adjustment
is managed by assigning varying temperatures to different classes.
Additionally, as a label-based approach, ABSLD is highly adaptable and can be
integrated with the sample-based methods. Extensive experiments demonstrate
ABSLD outperforms state-of-the-art methods on the comprehensive performance of
robustness and fairness.
Authors' comments: arXiv admin note: text overlap with arXiv:2312.05508
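The effect of assigning different temperatures to different classes, as described in the abstract above, can be illustrated with a small sketch. The temperature values and the `class_temperature` lookup are hypothetical. How ABSLD actually adapts temperatures during training is not shown here.

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax: a larger temperature gives a smoother label."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy, used here as a simple measure of label smoothness."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical per-class temperatures, indexed by the sample's true class.
class_temperature = {0: 1.0, 1: 4.0}

teacher_logits = [2.0, 0.5, -1.0]
sharp = soften(teacher_logits, class_temperature[0])
smooth = soften(teacher_logits, class_temperature[1])
```

Both soft labels keep the same ranking of classes; only the smoothness (entropy) of the distribution handed to the student changes.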
Zengjue Chen, Runliang Niu, He Kong, Qi Wang
Recent advances in Vision-Language-Action (VLA) models have demonstrated strong generalization capabilities across diverse scenes, tasks, and robotic platforms when pretrained on large-scale datasets. However, these models still require task-specific fine-tuning in novel environments, a process that relies almost exclusively on supervised fine-tuning (SFT) using static trajectory datasets. Such approaches neither allow the robot to interact with the environment nor leverage feedback from live execution. Also, their success is critically dependent on the size and quality of the collected trajectories. Reinforcement learning (RL) offers a promising alternative by enabling closed-loop interaction and aligning learned policies directly with task objectives. In this work, we draw inspiration from the ideas of GRPO and propose the Trajectory-wise Group Relative Policy Optimization (TGRPO) method. By fusing step-level and trajectory-level advantage signals, this method improves GRPO's group-level advantage estimation, thereby making the algorithm more suitable for online reinforcement learning training of VLA. Experimental results on ten manipulation tasks from the libero-object benchmark demonstrate that TGRPO consistently outperforms various baseline methods and is capable of generating more robust and efficient policies across multiple tested scenarios. Our source code is available at: https://github.com/hahans/TGRPO
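One way to read "fusing step-level and trajectory-level advantage signals" is sketched below. This is an illustrative guess, not the TGRPO algorithm itself: the weighted sum, the weights `w_step` and `w_traj`, and the assumption of equal-length trajectories are all hypothetical.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def fused_advantages(step_rewards, w_step=1.0, w_traj=1.0):
    """step_rewards[i][t] is the reward of trajectory i at step t (equal lengths assumed)."""
    n_steps = len(step_rewards[0])
    # Trajectory-level signal: each trajectory's return relative to the group.
    traj_adv = normalize([sum(r) for r in step_rewards])
    # Step-level signal: each step's reward relative to the same step in the group.
    step_adv = [normalize([r[t] for r in step_rewards]) for t in range(n_steps)]
    return [
        [w_step * step_adv[t][i] + w_traj * traj_adv[i] for t in range(n_steps)]
        for i in range(len(step_rewards))
    ]

# A group of three sampled trajectories with two steps each.
rewards = [[0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
adv = fused_advantages(rewards)
```

Because both components are normalized within the group, no separate value network is needed to form the baseline, which is the general appeal of group-relative estimators.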
Tarushri N. S.
Universal Differential Equations (UDEs), which blend neural networks with physical differential equations, have emerged as a powerful framework for scientific machine learning (SciML), enabling data-efficient, interpretable, and physically consistent modeling. In the context of smart grid systems, modeling node-wise battery dynamics remains a challenge due to the stochasticity of solar input and variability in household load profiles. Traditional approaches often struggle with generalization and fail to capture unmodeled residual dynamics. This work proposes a UDE-based approach to learn node-specific battery evolution by embedding a neural residual into a physically inspired battery ODE. Synthetic yet realistic solar generation and load demand data are used to simulate battery dynamics over time. The neural component learns to model unobserved or stochastic corrections arising from heterogeneity in node demand and environmental conditions. Comprehensive experiments reveal that the trained UDE aligns closely with ground truth battery trajectories, exhibits smooth convergence behavior, and maintains stability in long-term forecasts. These findings affirm the viability of UDE-based SciML approaches for battery modeling in decentralized energy networks and suggest broader implications for real-time control and optimization in renewable-integrated smart grids.
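The idea of embedding a neural residual into a battery ODE can be sketched as follows. Everything concrete here is an assumption for illustration: the state-of-charge equation, the `capacity` and `efficiency` constants, the tiny untrained network, and the explicit Euler integrator. The abstract does not give the actual model or training procedure.

```python
import numpy as np

def physics_rhs(soc, solar, load, capacity=10.0, efficiency=0.95):
    """Assumed physical term: net power divided by battery capacity."""
    return (efficiency * solar - load) / capacity

def neural_residual(soc, solar, load, params):
    """Small MLP standing in for the learned correction term."""
    w1, b1, w2, b2 = params
    hidden = np.tanh(w1 @ np.array([soc, solar, load]) + b1)
    return float(w2 @ hidden + b2)

def simulate(soc0, solar, load, params, dt=1.0):
    """Explicit Euler rollout of d(soc)/dt = physics + neural residual."""
    soc = [soc0]
    for s, l in zip(solar, load):
        rate = physics_rhs(soc[-1], s, l) + neural_residual(soc[-1], s, l, params)
        # Keep the state of charge inside [0, 1] (a modelling assumption).
        soc.append(float(np.clip(soc[-1] + dt * rate, 0.0, 1.0)))
    return np.array(soc)

rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 3)) * 0.1, np.zeros(8), rng.normal(size=8) * 0.01, 0.0)
solar = np.clip(np.sin(np.linspace(0, np.pi, 24)) * 3.0, 0.0, None)  # synthetic daylight curve
load = np.full(24, 1.0)                                               # synthetic flat demand
trajectory = simulate(0.5, solar, load, params)
```

In a real UDE workflow the residual parameters would be fitted by differentiating through the solver against observed trajectories; that training loop is omitted here.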
Yao Yan
Multi-digit addition is a clear probe of the computational power of large
language models. To dissect the internal arithmetic processes in
LLaMA-3-8B-Instruct, we combine linear probing with logit-lens inspection.
Inspired by the step-by-step manner in which humans perform addition, we
propose and analyze a coherent four-stage trajectory in the forward
pass: Formula-structure representations become linearly decodable first, while
the answer token is still far down the candidate list. Core computational
features then emerge prominently. At deeper activation layers, numerical
abstractions of the result become clearer, enabling near-perfect detection and
decoding of the individual digits in the sum. Near the output, the model
organizes and generates the final content, with the correct token reliably
occupying the top rank. This trajectory suggests a hierarchical process that
favors internal computation over rote memorization. We release our code and
data to facilitate reproducibility.
Authors' comments: 12 pages, including appendix, 7 figures. EMNLP 2025 submission (ARR
May 2025 cycle, reviews pending)
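The "rank of the answer token" that the abstract above tracks across layers can be computed with a logit-lens style projection. The sketch below uses toy arrays rather than a real model, and it skips the final normalization layer that a full logit-lens pass would normally apply. The function and variable names are hypothetical.

```python
import numpy as np

def logit_lens_rank(hidden, unembed, target_id):
    """Project an intermediate hidden state to the vocabulary and rank the target.

    Rank 1 means the target token has the highest logit at this layer.
    """
    logits = hidden @ unembed
    return int((logits > logits[target_id]).sum()) + 1

# Toy setup: 3-dimensional hidden state, 3-token vocabulary, identity unembedding.
unembed = np.eye(3)
early_layer = np.array([0.9, 0.1, 0.5])   # target token 1 is still ranked last
late_layer = np.array([0.1, 0.9, 0.5])    # target token 1 now has the top logit
ranks = [logit_lens_rank(h, unembed, target_id=1) for h in (early_layer, late_layer)]
```

Plotting such ranks layer by layer is one simple way to see when the correct token moves from far down the candidate list to the top position.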
Haosong Liu, Yuge Cheng, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng et al.
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).
Yeonjoo Park, Aiguo Han
Among inferential problems in functional data analysis, domain selection is of practical interest: it aims to identify sub-interval(s) of the domain where desired functional features are displayed. Motivated by applications in quantitative ultrasound signal analysis, we propose a robust domain selection method, particularly aiming to discover a subset of the domain presenting distinct behaviors on location parameters among different groups. By extending the interval testing approach, we propose to take into account multiple aspects of functional features simultaneously to detect the practically interpretable domain. To further handle potential outliers and missing segments on collected functional trajectories, we perform interval testing with a test statistic based on functional M-estimators for the inference. In addition, we introduce the effect size heatmap by calculating robustified effect sizes from the smallest to the largest scales over the domain to reflect dynamic functional behaviors among groups so that clinicians get a comprehensive understanding and select practically meaningful sub-interval(s). The performance of the proposed method is demonstrated through simulation studies and an application to the motivating quantitative ultrasound measurements.
Patrik Czakó, Gábor Kertész, Sándor Szénási
We present SmoothRot, a novel post-training quantization technique to enhance
the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot
addresses the critical challenge of massive activation outliers by integrating
channel-wise scaling with Hadamard transformations. Our technique effectively
transforms extreme outliers into quantization-friendly activations,
significantly improving quantization accuracy. Experiments conducted on popular
LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot
consistently reduces the performance gap between quantized and FP16 models by
approximately 10-30% across language generation and zero-shot reasoning tasks,
without introducing additional inference latency. Code is available at
https://github.com/czakop/smoothrot.
Authors' comments: 6 pages, 3 figures, 5 tables. Submitted to the IEEE SMC 2025
conference
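Combining channel-wise scaling with a Hadamard rotation, as the abstract above describes, can be illustrated on a single linear layer. This sketch is not the SmoothRot implementation: the scaling rule is borrowed from SmoothQuant-style smoothing as an assumption, and `alpha`, the function names, and the toy shapes are all hypothetical.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def smooth_and_rotate(x, w, alpha=0.5):
    """Rescale input channels, then rotate; the product x @ w is left unchanged."""
    # Assumed SmoothQuant-style per-channel scale.
    s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
    h = hadamard(x.shape[1])
    x_t = (x / s) @ h                 # activations: scale down, then rotate
    w_t = h.T @ (w * s[:, None])      # weights: scale up, then counter-rotate
    return x_t, w_t

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
x[3, 5] = 100.0                       # inject one massive activation outlier
w = rng.normal(size=(8, 4))
x_t, w_t = smooth_and_rotate(x, w)
```

Because the Hadamard matrix is orthogonal and the scaling is folded into the weights, the layer output is mathematically identical, while the outlier's energy is spread across channels before quantization.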
Mingjun Sun, Chongjun Ouyang, Shaochuan Wu, Yuanwei Liu
The pinching-antenna system (PASS) reconstructs wireless channels through pinching beamforming, i.e., optimizing the activated locations of pinching antennas (PAs) along the waveguide. The aim of this article is to investigate the joint design of baseband beamforming and pinching beamforming. A low-complexity element-wise sequential optimization framework is proposed to address the sum-rate maximization problem in PASS-enabled downlink and uplink channels. i) For the downlink scenario, maximum ratio transmission (MRT), zero-forcing (ZF), and minimum mean square error (MMSE) beamforming schemes are employed as baseband beamformers. For each beamformer, a closed-form expression for the downlink sum-rate is derived as a single-variable function with respect to the pinching beamformer. Based on this, a sequential optimization method is proposed, where the positions of the PAs are updated element-wise using a low-complexity one-dimensional search. ii) For the uplink scenario, signal detection is performed using maximum ratio combining (MRC), ZF, and MMSE combiners. A closed-form sum-rate expression is derived for each linear combiner, and a similar element-wise design is applied to optimize the pinching beamforming. Numerical results are provided to validate the effectiveness of the proposed method and demonstrate that: (i) For all considered linear beamformers, the proposed PASS architecture outperforms conventional fixed-antenna systems in terms of sum-rate performance; (ii) in both downlink and uplink channels, ZF achieves performance close to that of MMSE and significantly outperforms MRT or MRC; and (iii) the proposed element-wise design eliminates the need for alternating updates between the baseband and pinching beamformers, thereby ensuring low computational complexity.
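The element-wise sequential update with a one-dimensional search can be sketched generically. The objective below is a toy single-user gain and only a stand-in: the geometry, wavelengths, and `toy_gain` itself are assumptions, not the closed-form sum-rate expressions derived in the paper. Minimum-spacing constraints between antennas are also ignored.

```python
import cmath
import math

def toy_gain(positions, user=(5.0, 2.0), height=3.0,
             wavelength=0.1, guide_wavelength=0.07):
    """Hypothetical stand-in objective: coherent gain at one user."""
    total = 0j
    for p in positions:
        dist = math.sqrt((user[0] - p) ** 2 + user[1] ** 2 + height ** 2)
        phase = 2 * math.pi * (dist / wavelength + p / guide_wavelength)
        total += cmath.exp(-1j * phase) / dist
    return abs(total) ** 2 / len(positions)

def elementwise_search(objective, positions, grid, sweeps=2):
    """Update one antenna position at a time with a 1-D grid search."""
    positions = list(positions)
    for _ in range(sweeps):
        for n in range(len(positions)):
            # Keeping the current value as a candidate makes each update non-decreasing.
            candidates = [positions[n]] + list(grid)
            positions[n] = max(
                candidates,
                key=lambda c: objective(positions[:n] + [c] + positions[n + 1:]),
            )
    return positions

start = [1.0, 3.0, 6.0, 9.0]
grid = [i * 0.01 for i in range(1001)]      # candidate locations on a 10 m waveguide
best = elementwise_search(toy_gain, start, grid)
```

Each update is a single-variable search with the other positions held fixed, so the cost per sweep grows linearly with the number of antennas and grid points.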
Junghyun Lee, Kyoungseok Jang, Kwang-Sung Jun, Milan Vojnović, Se-Young Yun
We present `GL-LowPopArt`, a novel Catoni-style estimator for generalized
low-rank trace regression. Building on `LowPopArt` (Jang et al., 2024), it
employs a two-stage approach: nuclear norm regularization followed by matrix
Catoni estimation. We establish state-of-the-art estimation error bounds,
surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and
reveal a novel experimental design objective, $\mathrm{GL}(\pi)$. The key
technical challenge is controlling bias from the nonlinear inverse link
function, which we address by our two-stage approach. We prove a *local*
minimax lower bound, showing that our `GL-LowPopArt` enjoys instance-wise
optimality up to the condition number of the ground-truth Hessian. Applications
include generalized linear matrix completion, where `GL-LowPopArt` achieves a
state-of-the-art Frobenius error guarantee, and **bilinear dueling bandits**, a
novel setting inspired by general preference learning (Zhang et al., 2024). Our
analysis of a `GL-LowPopArt`-based explore-then-commit algorithm reveals a new,
potentially interesting problem-dependent quantity, along with an improved Borda
regret bound compared to vectorization (Wu et al., 2024).
Authors' comments: 53 pages, 2 figures, 3 tables; Accepted as a Spotlight Poster to the
42nd International Conference on Machine Learning (ICML 2025)
Hongtao Huang, Xiaojun Chang, Lina Yao
Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but are constrained by high computational costs due to iterative multi-step inference. While Neural Architecture Search (NAS) can optimize DMs, existing methods are hindered by retraining requirements, exponential search complexity from step-wise optimization, and slow evaluation relying on massive image generation. To address these challenges, we propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our key insight is to decompose the generation process into flexible segments of equal length, where each segment dynamically combines three step types: full (complete computation), partial (cache-reused computation), and null (skipped computation). This segment-wise search space reduces the candidate pool exponentially compared to step-wise NAS while preserving architectural diversity. Further, we introduce relative FID (rFID), a lightweight evaluation metric for NAS that measures divergence from a teacher model's outputs instead of ground truth, slashing evaluation time by over $90\%$. In practice, Flexiffusion achieves at least $2\times$ acceleration across LDMs, Stable Diffusion, and DDPMs on ImageNet and MS-COCO, with FID degradation under $5\%$, outperforming prior NAS and caching methods. Notably, it attains $5.1\times$ speedup on Stable Diffusion with near-identical CLIP scores. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.