Duo Liu, Zhiquan Tan, Linglan Zhao, Zhongqiang Zhang, Xiangzhong Fang, Weiran Huang
Generalized Category Discovery (GCD) aims to identify unlabeled samples by
leveraging the base knowledge from labeled ones, where the unlabeled set
consists of both base and novel classes. Since clustering methods are
time-consuming at inference, parametric-based approaches have become more
popular. However, recent parametric-based methods suffer from inferior base
discrimination due to unreliable self-supervision. To address this issue, we
propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary
branch devoted to base classification. During training, the main branch filters
the pseudo-base samples to the auxiliary branch. In response, the auxiliary
branch provides more reliable soft labels for the main branch, leading to a
virtuous cycle. Furthermore, we introduce Class-wise Distribution
Regularization (CDR) to mitigate the learning bias towards base classes. CDR
essentially increases the prediction confidence of the unlabeled data and
boosts the novel class performance. Combined with both components, our proposed
method, RLCD, achieves superior performance in all classes with negligible
extra computation. Comprehensive experiments across seven GCD datasets validate
its superiority. Our codes are available at https://github.com/APORduo/RLCD.
Authors' comments: ICML2025 Poster
Tianyu Lin, Xinran Li, Chuntung Zhuang, Qi Chen, Yuanhao Cai, Kai Ding, Alan L. Yuille, Zongwei Zhou
Widely adopted evaluation metrics for sparse-view CT reconstruction--such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio--prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.
Yufan Zhou, Yongbo Xiao, An Liu
In this paper, we consider uplink channel estimation for massive multi-input multi-output (MIMO) systems with partially connected hybrid beamforming (PC-HBF) structures. Existing beam design and channel estimation schemes are usually based on ideal assumptions and require transmitting pilots across multiple timeslots, making them unsuitable for practical PC-HBF systems. To overcome these drawbacks, we propose a novel beam design and a corresponding channel estimation algorithm to achieve accurate and real-time uplink channel estimation. Firstly, we introduce a group-wise narrow beam design in the vertical dimension to suppress interference from undesired angular components and improve vertical angle estimation accuracy,which divides the columns of the uniform planar array (UPA)into groups and the vertical angle interval into sub-intervals.In this way, each group is assigned with a narrow beam to cover one vertical angle sub-interval, and the set of narrow beams is designed based on the filter design theory. Secondly, we optimize the antenna grouping pattern using the Estimation of Distribution Algorithm (EDA), balancing interference suppression and resolution capability in the horizontal dimension, leading to a better horizontal angle estimation performance. Finally, we design a low-complexity group-wise subspace constrained variational Bayesian inference (GW-SC-VBI) algorithm to fully take advantage of the proposed beam design to achieve both low-complexity and high-accurate channel estimation. Simulation results demonstrate that the proposed scheme achieves notable performance gains over baseline methods.
Rukuo Li, Jianchun Liu, Hongli Xu, Liusheng Huang
Federated fine-tuning (FedFT) provides an effective paradigm for fine-tuning large language models (LLMs) in privacy-sensitive scenarios. However, practical deployment remains challenging due to the limited resources on end devices. Existing methods typically utilize parameter-efficient fine-tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), to substantially reduce communication overhead. Nevertheless, significant memory usage for activation storage and computational demands from full backpropagation remain major barriers to efficient deployment on resource-constrained end devices. Moreover, substantial resource heterogeneity across devices results in severe synchronization bottlenecks, diminishing the overall fine-tuning efficiency. To address these issues, we propose FedQuad, a novel LoRA-based FedFT framework that adaptively adjusts the LoRA depth (the number of consecutive tunable LoRA layers from the output) according to device computational capabilities, while employing activation quantization to reduce memory overhead, thereby enabling efficient deployment on resource-constrained devices. Specifically, FedQuad first identifies the feasible and efficient combinations of LoRA depth and the number of activation quantization layers based on device-specific resource constraints. Subsequently, FedQuad employs a greedy strategy to select the optimal configurations for each device, effectively accommodating system heterogeneity. Extensive experiments demonstrate that FedQuad achieves a 1.4-5.3x convergence acceleration compared to state-of-the-art baselines when reaching target accuracy, highlighting its efficiency and deployability in resource-constrained and heterogeneous end-device environments.
Yang Sui, Qi Xu, Yang Bai, Annie Qu
Multi-task learning (MTL) has emerged as an imperative machine learning tool to solve multiple learning tasks simultaneously and has been successfully applied to healthcare, marketing, and biomedical fields. However, in order to borrow information across different tasks effectively, it is essential to utilize both homogeneous and heterogeneous information. Among the extensive literature on MTL, various forms of heterogeneity are presented in MTL problems, such as block-wise, distribution, and posterior heterogeneity. Existing methods, however, struggle to tackle these forms of heterogeneity simultaneously in a unified framework. In this paper, we propose a two-step learning strategy for MTL which addresses the aforementioned heterogeneity. First, we impute the missing blocks using shared representations extracted from homogeneous source across different tasks. Next, we disentangle the mappings between input features and responses into a shared component and a task-specific component, respectively, thereby enabling information borrowing through the shared component. Our numerical experiments and real-data analysis from the ADNI database demonstrate the superior MTL performance of the proposed method compared to other competing methods.
Wenju Sun, Qingyong Li, Wen Wang, Yang Liu, Yangli-ao Geng, Boyang Li
Multi-task model merging aims to consolidate knowledge from multiple fine-tuned task-specific experts into a unified model while minimizing performance degradation. Existing methods primarily approach this by minimizing differences between task-specific experts and the unified model, either from a parameter-level or a task-loss perspective. However, parameter-level methods exhibit a significant performance gap compared to the upper bound, while task-loss approaches entail costly secondary training procedures. In contrast, we observe that performance degradation closely correlates with feature drift, i.e., differences in feature representations of the same sample caused by model merging. Motivated by this observation, we propose Layer-wise Optimal Task Vector Merging (LOT Merging), a technique that explicitly minimizes feature drift between task-specific experts and the unified model in a layer-by-layer manner. LOT Merging can be formulated as a convex quadratic optimization problem, enabling us to analytically derive closed-form solutions for the parameters of linear and normalization layers. Consequently, LOT Merging achieves efficient model consolidation through basic matrix operations. Extensive experiments across vision and vision-language benchmarks demonstrate that LOT Merging significantly outperforms baseline methods, achieving improvements of up to 4.4% (ViT-B/32) over state-of-the-art approaches.
Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
Continually expanding new languages for existing large language models (LLMs)
is a promising yet challenging approach to building powerful multilingual LLMs.
The biggest challenge is to make the model continuously learn new languages
while preserving the proficient ability of old languages. To achieve this,
recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new
languages by adding new experts and avoid catastrophic forgetting of old
languages by routing corresponding tokens to the original model backbone (old
experts). Although intuitive, this kind of method is parameter-costly when
expanding new languages and still inevitably impacts the performance of old
languages. To address these limitations, we analyze the language
characteristics of different layers in LLMs and propose a layer-wise expert
allocation algorithm (LayerMoE) to determine the appropriate number of new
experts for each layer. Specifically, we find different layers in LLMs exhibit
different representation similarities between languages and then utilize the
similarity as the indicator to allocate experts for each layer, i.e., the
higher similarity, the fewer experts. Additionally, to further mitigate the
forgetting of old languages, we add a classifier in front of the router network
on the layers with higher similarity to guide the routing of old language
tokens. Experimental results show that our method outperforms the previous
state-of-the-art baseline with 60% fewer experts in the single-expansion
setting and with 33.3% fewer experts in the lifelong-expansion setting,
demonstrating the effectiveness of our method.
Authors' comments: ACL 2025 (Main), 16 pages, 5 figures, 11 tables
Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh
While recent success of large reasoning models (LRMs) significantly advanced
LLMs' reasoning capability by optimizing the final answer accuracy using
reinforcement learning, they may also drastically increase the output length
due to overthinking, characterized by unnecessarily complex reasoning paths
that waste computation and potentially degrade the performance. We hypothesize
that such inefficiencies stem from LRMs' limited capability to dynamically
select the proper modular reasoning strategies, termed thinking patterns at the
right position. To investigate this hypothesis, we propose a dynamic
optimization framework that segments model-generated reasoning paths into
distinct thinking patterns, systematically identifying and promoting beneficial
patterns that improve the answer while removing detrimental ones. Empirical
analysis confirms that our optimized thinking paths yield more concise yet
sufficiently informative trajectories, enhancing reasoning efficiency by
reducing attention FLOPs by up to 47% while maintaining accuracy for originally
correct responses. Moreover, a non-trivial portion of originally incorrect
responses are transformed into correct ones, achieving a 15.6% accuracy
improvement with reduced length. Motivated by the improvement brought by the
optimized thinking paths, we apply a preference optimization technique
supported by a pairwise dataset contrasting suboptimal and optimal reasoning
paths. Experimental evaluations across multiple mathematical reasoning
benchmarks reveal that our method notably reduces computational overhead while
simultaneously improving reasoning accuracy, achieving up to a 12% accuracy
improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
Authors' comments: Work In Progress
Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha
Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates the feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.
Nima Shahbazi, Stavros Sintos, Abolfazl Asudeh
Frequency estimation in streaming data often relies on sketches like Count-Min (CM) to provide approximate answers with sublinear space. However, CM sketches introduce additive errors that disproportionately impact low-frequency elements, creating fairness concerns across different groups of elements. We introduce Fair-Count-Min, a frequency estimation sketch that guarantees equal expected approximation factors across element groups, thus addressing the unfairness issue. We propose a column partitioning approach with group-aware semi-uniform hashing to eliminate collisions between elements from different groups. We provide theoretical guarantees for fairness, analyze the price of fairness, and validate our theoretical findings through extensive experiments on real-world and synthetic datasets. Our experimental results show that Fair-Count-Min achieves fairness with minimal additional error and maintains competitive efficiency compared to standard CM sketches.
Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, Yiren Song
Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance. Our code will be available at: https://github.com/Wangzt1121/DiffDecompose.
Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Hongjian Fang, Ruihan Jin, Feihu Che, Pengpeng Shao et al.
The deployment of Large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.
Dingqiang Ye, Chao Fan, Zhanbo Huang, Chengwen Luo, Jianqiang Li, Shiqi Yu, Xiaoming Liu
Large vision models (LVM) based gait recognition has achieved impressive performance. However, existing LVM-based approaches may overemphasize gait priors while neglecting the intrinsic value of LVM itself, particularly the rich, distinct representations across its multi-layers. To adequately unlock LVM's potential, this work investigates the impact of layer-wise representations on downstream recognition tasks. Our analysis reveals that LVM's intermediate layers offer complementary properties across tasks, integrating them yields an impressive improvement even without rich well-designed gait priors. Building on this insight, we propose a simple and universal baseline for LVM-based gait recognition, termed BiggerGait. Comprehensive evaluations on CCPG, CAISA-B*, SUSTech1K, and CCGR\_MINI validate the superiority of BiggerGait across both within- and cross-domain tasks, establishing it as a simple yet practical baseline for gait representation learning. All the models and code will be publicly available.
Florentin Beck, William Rudman, Carsten Eickhoff
Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
Yizhi Zhou, Haina Zhu, Hangting Chen
Recently, pre-trained models for music information retrieval based on self-supervised learning (SSL) are becoming popular, showing success in various downstream tasks. However, there is limited research on the specific meanings of the encoded information and their applicability. Exploring these aspects can help us better understand their capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.
Longlong Li, Cunquan Qu, Guanghui Wang
Conventional Graph Neural Networks (GNNs) aggregate neighbor embeddings as holistic vectors, lacking the ability to identify fine-grained, direction-specific feature relevance. We propose MSH-GNN (Multi-Scale Harmonic Graph Neural Network), a novel architecture that performs feature-wise adaptive message passing through node-specific harmonic projections. For each node, MSH-GNN dynamically projects neighbor features onto frequency-sensitive directions determined by the target node's own representation. These projections are further modulated using learnable sinusoidal encodings at multiple frequencies, enabling the model to capture both smooth and oscillatory structural patterns across scales. A frequency-aware attention pooling mechanism is introduced to emphasize spectrally and structurally salient nodes during readout. Theoretically, we prove that MSH-GNN approximates shift-invariant kernels and matches the expressive power of the 1-Weisfeiler-Lehman (1-WL) test. Empirically, MSH-GNN consistently outperforms state-of-the-art models on a wide range of graph and node classification tasks. Furthermore, in challenging classification settings involving joint variations in graph topology and spectral frequency, MSH-GNN excels at capturing structural asymmetries and high-frequency modulations, enabling more accurate graph discrimination.
Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu
Efficient multi-hop reasoning requires Large Language Models (LLMs) based
agents to acquire high-value external knowledge iteratively. Previous work has
explored reinforcement learning (RL) to train LLMs to perform search-based
document retrieval, achieving notable improvements in QA performance, but
underperform on complex, multi-hop QA resulting from the sparse rewards from
global signal only. To address this gap in existing research, we introduce
StepSearch, a framework for search LLMs that trained with step-wise proximal
policy optimization method. It consists of richer and more detailed
intermediate search rewards and token-level process supervision based on
information gain and redundancy penalties to better guide each search step. We
constructed a fine-grained question-answering dataset containing
sub-question-level search trajectories based on open source datasets through a
set of data pipeline method. On standard multi-hop QA benchmarks, it
significantly outperforms global-reward baselines, achieving 11.2% and 4.2%
absolute improvements for 3B and 7B models over various search with RL
baselines using only 19k training data, demonstrating the effectiveness of
fine-grained, stepwise supervision in optimizing deep search LLMs. Our code
will be released on https://github.com/Zillwang/StepSearch.
Authors' comments: 20 pages, 6 figures
Guoming Li, Jian Yang, Yifan Chen
Filtering-based graph neural networks (GNNs) constitute a distinct class of
GNNs that employ graph filters to handle graph-structured data, achieving
notable success in various graph-related tasks. Conventional methods adopt a
graph-wise filtering paradigm, imposing a uniform filter across all nodes, yet
recent findings suggest that this rigid paradigm struggles with heterophilic
graphs. To overcome this, recent works have introduced node-wise filtering,
which assigns distinct filters to individual nodes, offering enhanced
adaptability. However, a fundamental gap remains: a comprehensive framework
unifying these two strategies is still absent, limiting theoretical insights
into the filtering paradigms. Moreover, through the lens of Contextual
Stochastic Block Model, we reveal that a synthesis of graph-wise and node-wise
filtering provides a sufficient solution for classification on graphs
exhibiting both homophily and heterophily, suggesting the risk of excessive
parameterization and potential overfitting with node-wise filtering. To address
the limitations, this paper introduces Coarsening-guided Partition-wise
Filtering (CPF). CPF innovates by performing filtering on node partitions. The
method begins with structure-aware partition-wise filtering, which filters node
partitions obtained via graph coarsening algorithms, and then performs
feature-aware partition-wise filtering, refining node embeddings via filtering
on clusters produced by $k$-means clustering over features. In-depth analysis
is conducted for each phase of CPF, showing its superiority over other
paradigms. Finally, benchmark node classification experiments, along with a
real-world graph anomaly detection application, validate CPF's efficacy and
practical utility.
Authors' comments: Accepted at the 31st ACM SIGKDD Conference on Knowledge Discovery and
Data Mining, KDD 2025 February Cycle
Jan Hurt, Stefan Thurner, Peter Klimek
Dynamic input-output models are standard tools for understanding inter-industry dependencies and how economies respond to shocks like disasters and pandemics. However, traditional approaches often assume fixed prices, limiting their ability to capture realistic economic behavior. Here, we introduce an adaptive extension to dynamic input-output recovery models where producers respond to shocks through simultaneous price and quantity adjustments. Our framework preserves the economic constraints of the Leontief input-output model while converging towards equilibrium configurations based on sector-specific behavioral parameters. When applied to input-output data, the model allows us to compute behavioral metrics indicating whether specific sectors predominantly favor price or quantity adjustments. Using the World Input-Output Database, we identify strong, consistent regional and sector-specific behavioral patterns. These findings provide insights into how different regions employ distinct strategies to manage shocks, thereby influencing economic resilience and recovery dynamics.
Alexandre Moreira, Patricia Silva, Miguel Heleno, Andre Marcato
Long-duration energy storage (LDES) assets can be fundamental resources for the next-generation power systems. However, LDES technologies are still immature and their future technology costs remain highly uncertain. In this context, we perform in this paper an extensive study to estimate the maximum LDES technology costs (which we define as viability costs) under which LDES systems would be economically viable in each state of the contiguous U.S. according to their characteristics. Our results indicate that only 4 states (out of 48) would be able to remove firm conventional generation supported by LDES systems without increasing their total system costs under the current US-DOE cost target of 1,100 US$/kW for multi-day LDES. In addition, we find that states with the highest LDES viability costs have in general low participation of thermal generation, a high share of wind generation, and higher thermal-related fixed operation and maintenance (FO&M) costs.