Yeonjoo Park, Aiguo Han
Among inferential problems in functional data analysis, domain selection is one of the practical interests aiming to identify sub-interval(s) of the domain where desired functional features are displayed. Motivated by applications in quantitative ultrasound signal analysis, we propose the robust domain selection method, particularly aiming to discover a subset of the domain presenting distinct behaviors on location parameters among different groups. By extending the interval testing approach, we propose to take into account multiple aspects of functional features simultaneously to detect the practically interpretable domain. To further handle potential outliers and missing segments on collected functional trajectories, we perform interval testing with a test statistic based on functional M-estimators for the inference. In addition, we introduce the effect size heatmap by calculating robustified effect sizes from the lowest to the largest scales over the domain to reflect dynamic functional behaviors among groups so that clinicians get a comprehensive understanding and select practically meaningful sub-interval(s). The performance of the proposed method is demonstrated through simulation studies and an application to motivating quantitative ultrasound measurements.
Patrik Czakó, Gábor Kertész, Sándor Szénási
We present SmoothRot, a novel post-training quantization technique to enhance
the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot
addresses the critical challenge of massive activation outliers, by integrating
channel-wise scaling with Hadamard transformations. Our technique effectively
transforms extreme outliers into quantization-friendly activations,
significantly improving quantization accuracy. Experiments conducted on popular
LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot
consistently reduces the performance gap between quantized and FP16 models by
approximately 10-30\% across language generation and zero-shot reasoning tasks,
without introducing additional inference latency. Code is available at
https://github.com/czakop/smoothrot.
Authors' comments: 6 pages, 3 figures, 5 tables. Submitted to the IEEE SMC 2025
conference
Mingjun Sun, Chongjun Ouyang, Shaochuan Wu, Yuanwei Liu
The pinching-antenna system (PASS) reconstructs wireless channels through pinching beamforming, i.e., optimizing the activated locations of pinching antennas (PAs) along the waveguide. The aim of this article is to investigate the joint design of baseband beamforming and pinching beamforming. A low-complexity element-wise sequential optimization framework is proposed to address the sum-rate maximization problem in PASS-enabled downlink and uplink channels. i) For the downlink scenario, maximum ratio transmission (MRT), zero-forcing (ZF), and minimum mean square error (MMSE) beamforming schemes are employed as baseband beamformers. For each beamformer, a closed-form expression for the downlink sum-rate is derived as a single-variable function with respect to the pinching beamformer. Based on this, a sequential optimization method is proposed, where the positions of the PAs are updated element-wise using a low-complexity one-dimensional search. ii) For the uplink scenario, signal detection is performed using maximum ratio combining (MRC), ZF, and MMSE combiners. A closed-form sum-rate expression is derived for each linear combiner, and a similar element-wise design is applied to optimize the pinching beamforming. Numerical results are provided to validate the effectiveness of the proposed method and demonstrate that: (i) For all considered linear beamformers, the proposed PASS architecture outperforms conventional fixed-antenna systems in terms of sum-rate performance; (ii) in both downlink and uplink channels, ZF achieves performance close to that of MMSE and significantly outperforms MRT or MRC; and (iii) the proposed element-wise design eliminates the need for alternating updates between the baseband and pinching beamformers, thereby ensuring low computational complexity.
Junghyun Lee, Kyoungseok Jang, Kwang-Sung Jun, Milan Vojnović, Se-Young Yun
We present `GL-LowPopArt`, a novel Catoni-style estimator for generalized
low-rank trace regression. Building on `LowPopArt` (Jang et al., 2024), it
employs a two-stage approach: nuclear norm regularization followed by matrix
Catoni estimation. We establish state-of-the-art estimation error bounds,
surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and
reveal a novel experimental design objective, $\mathrm{GL}(\pi)$. The key
technical challenge is controlling bias from the nonlinear inverse link
function, which we address by our two-stage approach. We prove a *local*
minimax lower bound, showing that our `GL-LowPopArt` enjoys instance-wise
optimality up to the condition number of the ground-truth Hessian. Applications
include generalized linear matrix completion, where `GL-LowPopArt` achieves a
state-of-the-art Frobenius error guarantee, and **bilinear dueling bandits**, a
novel setting inspired by general preference learning (Zhang et al., 2024). Our
analysis of a `GL-LowPopArt`-based explore-then-commit algorithm reveals a new,
potentially interesting problem-dependent quantity, along with improved Borda
regret bound than vectorization (Wu et al., 2024).
Authors' comments: 53 pages, 2 figures, 3 tables; Accepted as a Spotlight Poster to the
42nd International Conference on Machine Learning (ICML 2025)
Hongtao Huang, Xiaojun Chang, Lina Yao
Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but are constrained by high computational costs due to iterative multi-step inference. While Neural Architecture Search (NAS) can optimize DMs, existing methods are hindered by retraining requirements, exponential search complexity from step-wise optimization, and slow evaluation relying on massive image generation. To address these challenges, we propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our key insight is to decompose the generation process into flexible segments of equal length, where each segment dynamically combines three step types: full (complete computation), partial (cache-reused computation), and null (skipped computation). This segment-wise search space reduces the candidate pool exponentially compared to step-wise NAS while preserving architectural diversity. Further, we introduce relative FID (rFID), a lightweight evaluation metric for NAS that measures divergence from a teacher model's outputs instead of ground truth, slashing evaluation time by over $90\%$. In practice, Flexiffusion achieves at least $2\times$ acceleration across LDMs, Stable Diffusion, and DDPMs on ImageNet and MS-COCO, with FID degradation under $5\%$, outperforming prior NAS and caching methods. Notably, it attains $5.1\times$ speedup on Stable Diffusion with near-identical CLIP scores. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.
Duo Liu, Zhiquan Tan, Linglan Zhao, Zhongqiang Zhang, Xiangzhong Fang, Weiran Huang
Generalized Category Discovery (GCD) aims to identify unlabeled samples by
leveraging the base knowledge from labeled ones, where the unlabeled set
consists of both base and novel classes. Since clustering methods are
time-consuming at inference, parametric-based approaches have become more
popular. However, recent parametric-based methods suffer from inferior base
discrimination due to unreliable self-supervision. To address this issue, we
propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary
branch devoted to base classification. During training, the main branch filters
the pseudo-base samples to the auxiliary branch. In response, the auxiliary
branch provides more reliable soft labels for the main branch, leading to a
virtuous cycle. Furthermore, we introduce Class-wise Distribution
Regularization (CDR) to mitigate the learning bias towards base classes. CDR
essentially increases the prediction confidence of the unlabeled data and
boosts the novel class performance. Combined with both components, our proposed
method, RLCD, achieves superior performance in all classes with negligible
extra computation. Comprehensive experiments across seven GCD datasets validate
its superiority. Our codes are available at https://github.com/APORduo/RLCD.
Authors' comments: ICML2025 Poster
Tianyu Lin, Xinran Li, Chuntung Zhuang, Qi Chen, Yuanhao Cai, Kai Ding, Alan L. Yuille, Zongwei Zhou
Widely adopted evaluation metrics for sparse-view CT reconstruction--such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio--prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.
Yufan Zhou, Yongbo Xiao, An Liu
In this paper, we consider uplink channel estimation for massive multi-input multi-output (MIMO) systems with partially connected hybrid beamforming (PC-HBF) structures. Existing beam design and channel estimation schemes are usually based on ideal assumptions and require transmitting pilots across multiple timeslots, making them unsuitable for practical PC-HBF systems. To overcome these drawbacks, we propose a novel beam design and a corresponding channel estimation algorithm to achieve accurate and real-time uplink channel estimation. Firstly, we introduce a group-wise narrow beam design in the vertical dimension to suppress interference from undesired angular components and improve vertical angle estimation accuracy,which divides the columns of the uniform planar array (UPA)into groups and the vertical angle interval into sub-intervals.In this way, each group is assigned with a narrow beam to cover one vertical angle sub-interval, and the set of narrow beams is designed based on the filter design theory. Secondly, we optimize the antenna grouping pattern using the Estimation of Distribution Algorithm (EDA), balancing interference suppression and resolution capability in the horizontal dimension, leading to a better horizontal angle estimation performance. Finally, we design a low-complexity group-wise subspace constrained variational Bayesian inference (GW-SC-VBI) algorithm to fully take advantage of the proposed beam design to achieve both low-complexity and high-accurate channel estimation. Simulation results demonstrate that the proposed scheme achieves notable performance gains over baseline methods.
Rukuo Li, Jianchun Liu, Hongli Xu, Liusheng Huang
Federated fine-tuning (FedFT) provides an effective paradigm for fine-tuning large language models (LLMs) in privacy-sensitive scenarios. However, practical deployment remains challenging due to the limited resources on end devices. Existing methods typically utilize parameter-efficient fine-tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), to substantially reduce communication overhead. Nevertheless, significant memory usage for activation storage and computational demands from full backpropagation remain major barriers to efficient deployment on resource-constrained end devices. Moreover, substantial resource heterogeneity across devices results in severe synchronization bottlenecks, diminishing the overall fine-tuning efficiency. To address these issues, we propose FedQuad, a novel LoRA-based FedFT framework that adaptively adjusts the LoRA depth (the number of consecutive tunable LoRA layers from the output) according to device computational capabilities, while employing activation quantization to reduce memory overhead, thereby enabling efficient deployment on resource-constrained devices. Specifically, FedQuad first identifies the feasible and efficient combinations of LoRA depth and the number of activation quantization layers based on device-specific resource constraints. Subsequently, FedQuad employs a greedy strategy to select the optimal configurations for each device, effectively accommodating system heterogeneity. Extensive experiments demonstrate that FedQuad achieves a 1.4-5.3x convergence acceleration compared to state-of-the-art baselines when reaching target accuracy, highlighting its efficiency and deployability in resource-constrained and heterogeneous end-device environments.
Yang Sui, Qi Xu, Yang Bai, Annie Qu
Multi-task learning (MTL) has emerged as an imperative machine learning tool to solve multiple learning tasks simultaneously and has been successfully applied to healthcare, marketing, and biomedical fields. However, in order to borrow information across different tasks effectively, it is essential to utilize both homogeneous and heterogeneous information. Among the extensive literature on MTL, various forms of heterogeneity are presented in MTL problems, such as block-wise, distribution, and posterior heterogeneity. Existing methods, however, struggle to tackle these forms of heterogeneity simultaneously in a unified framework. In this paper, we propose a two-step learning strategy for MTL which addresses the aforementioned heterogeneity. First, we impute the missing blocks using shared representations extracted from homogeneous source across different tasks. Next, we disentangle the mappings between input features and responses into a shared component and a task-specific component, respectively, thereby enabling information borrowing through the shared component. Our numerical experiments and real-data analysis from the ADNI database demonstrate the superior MTL performance of the proposed method compared to other competing methods.
Wenju Sun, Qingyong Li, Wen Wang, Yang Liu, Yangli-ao Geng, Boyang Li
Multi-task model merging aims to consolidate knowledge from multiple fine-tuned task-specific experts into a unified model while minimizing performance degradation. Existing methods primarily approach this by minimizing differences between task-specific experts and the unified model, either from a parameter-level or a task-loss perspective. However, parameter-level methods exhibit a significant performance gap compared to the upper bound, while task-loss approaches entail costly secondary training procedures. In contrast, we observe that performance degradation closely correlates with feature drift, i.e., differences in feature representations of the same sample caused by model merging. Motivated by this observation, we propose Layer-wise Optimal Task Vector Merging (LOT Merging), a technique that explicitly minimizes feature drift between task-specific experts and the unified model in a layer-by-layer manner. LOT Merging can be formulated as a convex quadratic optimization problem, enabling us to analytically derive closed-form solutions for the parameters of linear and normalization layers. Consequently, LOT Merging achieves efficient model consolidation through basic matrix operations. Extensive experiments across vision and vision-language benchmarks demonstrate that LOT Merging significantly outperforms baseline methods, achieving improvements of up to 4.4% (ViT-B/32) over state-of-the-art approaches.
Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
Continually expanding new languages for existing large language models (LLMs)
is a promising yet challenging approach to building powerful multilingual LLMs.
The biggest challenge is to make the model continuously learn new languages
while preserving the proficient ability of old languages. To achieve this,
recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new
languages by adding new experts and avoid catastrophic forgetting of old
languages by routing corresponding tokens to the original model backbone (old
experts). Although intuitive, this kind of method is parameter-costly when
expanding new languages and still inevitably impacts the performance of old
languages. To address these limitations, we analyze the language
characteristics of different layers in LLMs and propose a layer-wise expert
allocation algorithm (LayerMoE) to determine the appropriate number of new
experts for each layer. Specifically, we find different layers in LLMs exhibit
different representation similarities between languages and then utilize the
similarity as the indicator to allocate experts for each layer, i.e., the
higher similarity, the fewer experts. Additionally, to further mitigate the
forgetting of old languages, we add a classifier in front of the router network
on the layers with higher similarity to guide the routing of old language
tokens. Experimental results show that our method outperforms the previous
state-of-the-art baseline with 60% fewer experts in the single-expansion
setting and with 33.3% fewer experts in the lifelong-expansion setting,
demonstrating the effectiveness of our method.
Authors' comments: ACL 2025 (Main), 16 pages, 5 figures, 11 tables
Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh
While recent success of large reasoning models (LRMs) significantly advanced
LLMs' reasoning capability by optimizing the final answer accuracy using
reinforcement learning, they may also drastically increase the output length
due to overthinking, characterized by unnecessarily complex reasoning paths
that waste computation and potentially degrade the performance. We hypothesize
that such inefficiencies stem from LRMs' limited capability to dynamically
select the proper modular reasoning strategies, termed thinking patterns at the
right position. To investigate this hypothesis, we propose a dynamic
optimization framework that segments model-generated reasoning paths into
distinct thinking patterns, systematically identifying and promoting beneficial
patterns that improve the answer while removing detrimental ones. Empirical
analysis confirms that our optimized thinking paths yield more concise yet
sufficiently informative trajectories, enhancing reasoning efficiency by
reducing attention FLOPs by up to 47% while maintaining accuracy for originally
correct responses. Moreover, a non-trivial portion of originally incorrect
responses are transformed into correct ones, achieving a 15.6% accuracy
improvement with reduced length. Motivated by the improvement brought by the
optimized thinking paths, we apply a preference optimization technique
supported by a pairwise dataset contrasting suboptimal and optimal reasoning
paths. Experimental evaluations across multiple mathematical reasoning
benchmarks reveal that our method notably reduces computational overhead while
simultaneously improving reasoning accuracy, achieving up to a 12% accuracy
improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
Authors' comments: Work In Progress
Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha
Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates the feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.
Nima Shahbazi, Stavros Sintos, Abolfazl Asudeh
Frequency estimation in streaming data often relies on sketches like Count-Min (CM) to provide approximate answers with sublinear space. However, CM sketches introduce additive errors that disproportionately impact low-frequency elements, creating fairness concerns across different groups of elements. We introduce Fair-Count-Min, a frequency estimation sketch that guarantees equal expected approximation factors across element groups, thus addressing the unfairness issue. We propose a column partitioning approach with group-aware semi-uniform hashing to eliminate collisions between elements from different groups. We provide theoretical guarantees for fairness, analyze the price of fairness, and validate our theoretical findings through extensive experiments on real-world and synthetic datasets. Our experimental results show that Fair-Count-Min achieves fairness with minimal additional error and maintains competitive efficiency compared to standard CM sketches.
Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, Yiren Song
Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance. Our code will be available at: https://github.com/Wangzt1121/DiffDecompose.
Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Hongjian Fang, Ruihan Jin, Feihu Che, Pengpeng Shao et al.
The deployment of Large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.
Dingqiang Ye, Chao Fan, Zhanbo Huang, Chengwen Luo, Jianqiang Li, Shiqi Yu, Xiaoming Liu
Large vision models (LVM) based gait recognition has achieved impressive performance. However, existing LVM-based approaches may overemphasize gait priors while neglecting the intrinsic value of LVM itself, particularly the rich, distinct representations across its multi-layers. To adequately unlock LVM's potential, this work investigates the impact of layer-wise representations on downstream recognition tasks. Our analysis reveals that LVM's intermediate layers offer complementary properties across tasks, integrating them yields an impressive improvement even without rich well-designed gait priors. Building on this insight, we propose a simple and universal baseline for LVM-based gait recognition, termed BiggerGait. Comprehensive evaluations on CCPG, CAISA-B*, SUSTech1K, and CCGR\_MINI validate the superiority of BiggerGait across both within- and cross-domain tasks, establishing it as a simple yet practical baseline for gait representation learning. All the models and code will be publicly available.
Florentin Beck, William Rudman, Carsten Eickhoff
Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
Yizhi Zhou, Haina Zhu, Hangting Chen
Recently, pre-trained models for music information retrieval based on self-supervised learning (SSL) are becoming popular, showing success in various downstream tasks. However, there is limited research on the specific meanings of the encoded information and their applicability. Exploring these aspects can help us better understand their capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.