Tao Song, Yicheng Wu, Minhao Hu, Xiangde Luo, Linda Wei, Guotai Wang, Yi Guo, Feng Xu et al.
Multimodal MR image synthesis aims to generate missing modality images by effectively fusing and mapping from a subset of available MRI modalities. Most existing methods adopt an image-to-image translation paradigm, treating multiple modalities as input channels. However, these approaches often yield sub-optimal results due to the inherent difficulty in achieving precise feature- or semantic-level alignment across modalities. To address these challenges, we propose an Adaptive Group-wise Interaction Network (AGI-Net) that explicitly models both inter-modality and intra-modality relationships for multimodal MR image synthesis. Specifically, feature channels are first partitioned into predefined groups, after which an adaptive rolling mechanism is applied to conventional convolutional kernels to better capture feature and semantic correspondences between different modalities. In parallel, a cross-group attention module is introduced to enable effective feature fusion across groups, thereby enhancing the network's representational capacity. We validate the proposed AGI-Net on the publicly available IXI and BraTS2023 datasets. Experimental results demonstrate that AGI-Net achieves state-of-the-art performance in multimodal MR image synthesis tasks, confirming the effectiveness of its modality-aware interaction design. We release the relevant code at: https://github.com/zunzhumu/Adaptive-Group-wise-Interaction-Network-for-Multimodal-MRI-Synthesis.git.
Nuria Fonseca-Bonilla, Luis Cerdán, Alberto Noriega-Crespo, Amaya Moro-Martín
While WISE is the largest, best quality infrared all-sky survey to date, a
smaller coverage mission, Spitzer, was designed to have better sensitivity and
spatial resolution at similar wavelengths. Confusion and contamination in WISE
data result in discrepancies between them. We present a novel approach to work
with WISE measurements with the goal of maintaining both its high coverage and
vast amount of data while taking full advantage of the higher sensitivity and
spatial resolution of Spitzer. We have applied machine learning (ML) techniques
to a complete WISE data sample of open cluster members, using a training set of
paired data from high-quality Spitzer Enhanced Imaging Products (SEIP), MIPS
and IRAC, and allWISE catalogs, W1 (3.4 {\mu}m) to W4 (22 {\mu}m) bands. We
have tested several ML regression models with the aim of predicting
mid-infrared fluxes at MIPS1 (24 {\mu}m) and IRAC4 (8 {\mu}m) bands from WISE
fluxes and quality flags. In addition, to improve the prediction quality, we
have implemented feature selection techniques to remove irrelevant WISE
variables. We have notably enhanced WISE detection capabilities, mostly at
the lowest magnitudes, which previously showed the largest discrepancies with
Spitzer. In our particular case, extremely randomized trees was found to be the
best algorithm to predict mid-infrared fluxes from WISE variables. We have
tested our results in the SED of members of IC 348. We show discrepancies in
the measurements of Spitzer and WISE and demonstrate the good concordance of
our predicted fluxes with the real ones. ML is a fast and powerful tool that
can be used to find hidden relationships between datasets, such as those that
exist between WISE and Spitzer fluxes. We believe this approach could be
employed for other samples from the allWISE catalog with SEIP positional
counterparts, and in other astrophysical studies with analogous discrepancies.
Authors' comments: 13 pages, 10 figures
Yasaman Saadati, M. Hadi Amini
Federated Learning (FL) is a decentralized learning approach that protects sensitive information by utilizing local model parameters rather than sharing clients' raw datasets. While this privacy-preserving method is widely employed across various applications, it still requires significant development and optimization. Automated Machine Learning (Auto-ML) has been adapted for reducing the need for manual adjustments. Previous studies have explored the integration of AutoML with different FL algorithms to evaluate their effectiveness in enhancing FL settings. However, Automated FL (Auto-FL) faces additional challenges due to the involvement of a large cohort of clients and global training rounds between clients and the server, rendering the tuning process time-consuming and nearly impossible on resource-constrained edge devices (e.g., IoT devices). This paper investigates the deployment and integration of two lightweight Hyper-Parameter Optimization (HPO) tools, Raytune and Optuna, within the context of FL settings. A step-wise feedback mechanism has also been designed to accelerate the hyper-parameter tuning process and coordinate AutoML toolkits with the FL server. To this end, both local and global feedback mechanisms are integrated to limit the search space and expedite the HPO process. Further, a novel client selection technique is introduced to mitigate the straggler effect in Auto-FL. The selected hyper-parameter tuning tools are evaluated using two benchmark datasets, FEMNIST and CIFAR10. Further, the paper discusses the essential properties of successful HPO tools, the integration mechanism with the FL pipeline, and the challenges posed by the distributed and heterogeneous nature of FL environments.
Ying Yang, De Cheng, Chaowei Fang, Yubiao Wang, Changzhe Jiao, Lechao Cheng, Nannan Wang
Unsupervised out-of-distribution (OOD) detection aims to identify
out-of-domain data by learning only from unlabeled In-Distribution (ID)
training samples, which is crucial for developing a safe real-world machine
learning system. Current reconstruction-based methods provide a good
alternative approach by measuring the reconstruction error between the input
and its corresponding generative counterpart in the pixel/feature space.
However, such generative methods face a key dilemma: improving the
reconstruction power of the generative model while keeping a compact
representation of the ID data. To address this issue, we propose the
diffusion-based layer-wise semantic reconstruction approach for unsupervised
OOD detection. The innovation of our approach is that we leverage the diffusion
model's intrinsic data reconstruction ability to distinguish ID samples from
OOD samples in the latent feature space. Moreover, to set up a comprehensive
and discriminative feature representation, we devise a multi-layer semantic
feature extraction strategy. By distorting the extracted features with Gaussian
noise and applying the diffusion model for feature reconstruction, the
separation of ID and OOD samples is implemented according to the reconstruction
errors. Extensive experimental results on multiple benchmarks built upon
various datasets demonstrate that our method achieves state-of-the-art
performance in terms of detection accuracy and speed. Code is available at
<https://github.com/xbyym/DLSR>.
Authors' comments: 26 pages, 23 figures, published at NeurIPS 2024
Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie
There exists recent work in computer vision, named VAR, that proposes a new autoregressive paradigm for image generation. Diverging from the vanilla next-token prediction, VAR structurally reformulates the image generation into a coarse-to-fine next-scale prediction. In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into \textit{intra-scale modeling}, which captures local spatial dependencies within each scale, and \textit{inter-scale modeling}, which models cross-scale relationships progressively from coarse-to-fine scales. This decoupling structure allows us to rebuild VAR in a more computationally efficient manner. Specifically, for intra-scale modeling -- crucial for generating high-fidelity images -- we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments demonstrate that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference speed, outperforms the largest VAR-d30-2B. Moreover, our largest model M-VAR-d32 impressively registers 1.78 FID on ImageNet 256$\times$256 and outperforms the prior-art autoregressive models LlamaGen/VAR by 0.4/0.19 and popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is available at \url{https://github.com/OliverRensu/MVAR}.
Huali Xu, Yongxiang Liu, Li Liu, Shuaifeng Zhi, Shuzhou Sun, Tianpeng Liu, MingMing Cheng
Existing cross-domain few-shot learning (CDFSL) methods, which develop
source-domain training strategies to enhance model transferability, face
challenges with large-scale pre-trained models (LMs) due to inaccessible source
data and training strategies. Moreover, fine-tuning LMs for CDFSL demands
substantial computational resources, limiting practicality. This paper
addresses the source-free CDFSL (SF-CDFSL) problem, tackling few-shot learning
(FSL) in the target domain using only pre-trained models and a few target
samples without source data or strategies. To overcome the challenge of
inaccessible source data, this paper introduces Step-wise Distribution
Alignment Guided Style Prompt Tuning (StepSPT), which implicitly narrows domain
gaps through prediction distribution optimization. StepSPT proposes a style
prompt to align target samples with the desired distribution and adopts a
dual-phase optimization process. In the external process, a step-wise
distribution alignment strategy factorizes prediction distribution optimization
into a multi-step alignment problem to tune the style prompt. In the internal
process, the classifier is updated using standard cross-entropy loss.
Evaluations on five datasets demonstrate that StepSPT outperforms existing
prompt tuning-based methods and SOTAs. Ablation studies further verify its
effectiveness. Code will be made publicly available at
\url{https://github.com/xuhuali-mxj/StepSPT}.
Authors' comments: 15 pages, 12 figures, 7 tables
Hao Tang, Junhao Lu, Guoheng Huang, Ming Li, Xuhang Chen, Guo Zhong, Zhengguang Tan, Zinuo Li
In Few-Shot Learning (FSL), traditional metric-based approaches often rely on global metrics to compute similarity. However, in natural scenes, the spatial arrangement of key instances is often inconsistent across images. This spatial misalignment can result in mismatched semantic pixels, leading to inaccurate similarity measurements. To address this issue, we propose a novel method called the Layer-Wise Features Metric of Semantic-Pixel Matching (LWFM-SPM) to make finer comparisons. Our method enhances model performance through two key modules: (1) the Layer-Wise Embedding (LWE) Module, which refines the cross-correlation of image pairs to generate well-focused feature maps for each layer; (2) the Semantic-Pixel Matching (SPM) Module, which aligns critical pixels based on semantic embeddings using an assignment algorithm. We conducted extensive experiments to evaluate our method on four widely used few-shot classification benchmarks: miniImageNet, tieredImageNet, CUB-200-2011, and CIFAR-FS. The results indicate that LWFM-SPM achieves competitive performance across these benchmarks. Our code will be publicly available at https://github.com/Halo2Tang/Code-for-LWFM-SPM.
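The pixel-alignment step in the SPM module can be pictured as an assignment problem. The abstract does not say which assignment algorithm is used, so the sketch below is only an illustration: it brute-forces the one-to-one matching that maximizes the total cosine similarity between two small sets of embeddings. The names `cosine` and `best_assignment` are hypothetical, and the toy vectors are made up.

```python
from itertools import permutations
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_assignment(query, support):
    """Match each query embedding to a distinct support embedding.

    Returns (perm, score): perm[i] is the support index matched to query i,
    and score is the mean cosine similarity of the matched pairs.
    Brute force over all permutations, so only usable for tiny inputs;
    both sets must have the same number of embeddings.
    """
    n = len(query)
    sim = [[cosine(q, s) for s in support] for q in query]
    best_perm, best_total = None, -math.inf
    for perm in permutations(range(n)):
        total = sum(sim[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total / n
```

For example, `best_assignment([[1, 0], [0, 1]], [[0, 2], [3, 0]])` pairs query 0 with support 1 and query 1 with support 0. For realistic feature-map sizes a polynomial-time solver such as the Hungarian algorithm would be needed in place of the permutation loop.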
Ioannis Caragiannis, Nick Gravin, Zhile Jiang
The problem of identifying the satisfiability threshold of random $3$-SAT
formulas has received a lot of attention during the last decades and has
inspired the study of other threshold phenomena in random combinatorial
structures. The classical assumption in this line of research is that, for a
given set of $n$ Boolean variables, each clause is drawn uniformly at random
among all sets of three literals from these variables, independently from other
clauses. Here, we keep the uniform distribution of each clause, but deviate
significantly from the independence assumption and consider richer families of
probability distributions. For integer parameters $n$, $m$, and $k$, we denote
by $\DistFamily_k(n,m)$ the family of probability distributions that produce
formulas with $m$ clauses, each selected uniformly at random from all sets of
three literals from the $n$ variables, so that the clauses are $k$-wise
independent. Our aim is to make general statements about the satisfiability or
unsatisfiability of formulas produced by distributions in $\DistFamily_k(n,m)$
for different values of the parameters $n$, $m$, and $k$.
Authors' comments: 26 pages, 1 figure
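As a general illustration of what $k$-wise independence means (this is the textbook XOR example, not a construction taken from the paper), the sketch below builds three uniform bits that are pairwise independent but not 3-wise independent. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def xor_triples():
    """Equally likely outcomes (x, y, z): x, y are uniform bits, z = x XOR y."""
    return [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]


def prob(outcomes, constraint):
    """Probability that every coordinate in `constraint` (index -> value) matches."""
    hits = sum(all(o[i] == v for i, v in constraint.items()) for o in outcomes)
    return Fraction(hits, len(outcomes))


def is_k_wise_independent(outcomes, k):
    """True if every set of k coordinates is jointly uniform over {0,1}^k.

    For uniform bits this is exactly k-wise independence.
    """
    n = len(outcomes[0])
    for idx in combinations(range(n), k):
        for vals in product((0, 1), repeat=k):
            if prob(outcomes, dict(zip(idx, vals))) != Fraction(1, 2 ** k):
                return False
    return True
```

Here `is_k_wise_independent(xor_triples(), 2)` holds while the same check with `k = 3` fails, because the outcome (0, 0, 1) never occurs. Each bit is still uniform, which mirrors the setting above where every clause keeps its uniform marginal distribution while full independence is dropped.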
Zhirui Deng, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen, Ruibin Xiong, Mang Wang, Weipeng Chen
The outstanding capabilities of large language models (LLMs) render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward the reinforcement learning strategy to further enhance agents' ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi-step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.
Xiaoqing Chen, Siyang Li, Yunlu Tu, Ziwei Wang, Dongrui Wu
Objective: An electroencephalogram (EEG)-based brain-computer interface (BCI) is a direct communication pathway between the human brain and a computer. Most research so far studied more accurate BCIs, but much less attention has been paid to the ethics of BCIs. Aside from task-specific information, EEG signals also contain rich private information, e.g., user identity, emotion, disorders, etc., which should be protected. Approach: We show for the first time that adding user-wise perturbations can make identity information in EEG unlearnable. We propose four types of user-wise privacy-preserving perturbations, i.e., random noise, synthetic noise, error minimization noise, and error maximization noise. After adding the proposed perturbations to EEG training data, the user identity information in the data becomes unlearnable, while the BCI task information remains unaffected. Main results: Experiments on six EEG datasets using three neural network classifiers and various traditional machine learning models demonstrated the robustness and practicability of the proposed perturbations. Significance: Our research shows the feasibility of hiding user identity information in EEG data without impacting the primary BCI task information.
Chengting Yu, Fengzhao Zhang, Ruizhe Chen, Aili Wang, Zuozhu Liu, Shurun Tan, Er-Ping Li
Knowledge Distillation (KD), a learning paradigm in which a larger teacher network guides a smaller student network, transfers dark knowledge from the teacher to the student via logits or intermediate features, with the aim of producing a well-performing lightweight model. Notably, many subsequent feature-based KD methods outperformed the earliest logit-based KD method and iteratively generated numerous state-of-the-art distillation methods. Nevertheless, recent work has uncovered the potential of the logit-based method, bringing the simple KD form based on logits back into the limelight. Features or logits? They partially implement KD from entirely distinct perspectives; therefore, choosing between logits and features is not straightforward. This paper provides a unified perspective of feature alignment in order to obtain a better comprehension of their fundamental distinction. Inheriting the design philosophy and insights of feature-based and logit-based methods, we introduce a block-wise logit distillation framework to apply implicit logit-based feature alignment by gradually replacing the teacher's blocks as intermediate stepping-stone models to bridge the gap between the student and the teacher. Our method obtains comparable or superior results to state-of-the-art distillation methods. This paper demonstrates the great potential of combining logits and features, and we hope it will inspire future research to revisit KD from a higher vantage point.
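For readers unfamiliar with the logit-based form of KD mentioned above, the following is a minimal sketch of the classic temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). It is not the block-wise framework proposed in the paper; the function names and the default temperature are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_logit_loss(student_logits, teacher_logits, temperature=4.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge. In practice it is usually added to the ordinary cross-entropy loss on the ground-truth labels.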
Nikita Guseynov, Nana Liu
Efficiently uploading data into quantum states is essential for many quantum
algorithms to achieve advantage across various applications. In this paper, we
address this challenge by proposing a method to upload a polynomial function
$f(x)$ on the interval $x \in (a, b)$ into a pure quantum state consisting of
qubits, where a discretized $f(x)$ is the amplitude of this state. The
preparation cost has $\mathcal{O}(n\log n)$ scaling in the number of qubits $n$
and linear scaling with the degree of the polynomial $Q$. This efficiency
allows the preparation of states whose amplitudes correspond to high-degree
polynomials, enabling the approximation of almost any continuous function. We
introduce an explicit algorithm for uploading such functions using four real
polynomials that meet specific parity and boundedness conditions. We also
generalize this approach to piece-wise polynomial functions, with the algorithm
scaling linearly with the number of piecewise parts. Our method achieves
efficient quantum circuit implementation and we present detailed gate counting
and resource analysis.
Authors' comments: 17 pages, 9 figures, 2 tables
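To make the target of the state preparation concrete, the sketch below computes, classically, the normalized amplitude vector that a discretized polynomial would define on $n$ qubits. It does not implement the quantum circuit. The midpoint grid on $(a, b)$ and the names `poly_eval` and `target_amplitudes` are assumptions made for illustration; the abstract does not specify the discretization.

```python
import math


def poly_eval(coeffs, x):
    """Horner evaluation of sum_i coeffs[i] * x**i."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def target_amplitudes(coeffs, a, b, n_qubits):
    """Amplitudes proportional to f(x_j) on 2**n_qubits midpoints of (a, b).

    The vector is normalized so that the squared amplitudes sum to 1,
    as required for a pure quantum state.
    """
    size = 2 ** n_qubits
    step = (b - a) / size
    samples = [poly_eval(coeffs, a + (j + 0.5) * step) for j in range(size)]
    norm = math.sqrt(sum(s * s for s in samples))
    if norm == 0.0:
        raise ValueError("polynomial vanishes on the whole grid")
    return [s / norm for s in samples]
```

For instance, `target_amplitudes([0.0, 1.0], 0.0, 1.0, 2)` encodes $f(x) = x$ on four grid points. The resulting vector has $2^n$ entries, which is why a preparation cost that scales as $\mathcal{O}(n\log n)$ in the number of qubits matters.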
Wenhan Chang, Tianqing Zhu, Yufeng Wu, Wanlei Zhou
In the rapid advancement of artificial intelligence, privacy protection has
become crucial, giving rise to machine unlearning. Machine unlearning is a
technique that removes specific data influences from trained models without the
need for extensive retraining. However, it faces several key challenges,
including accurately implementing unlearning, ensuring privacy protection
during the unlearning process, and achieving effective unlearning without
significantly compromising model performance. This paper presents a novel
approach to machine unlearning by employing Layer-wise Relevance Analysis and
Neuronal Path Perturbation. We address three primary challenges: the lack of
detailed unlearning principles, privacy guarantees in the zero-shot unlearning
scenario, and the balance between unlearning effectiveness and model utility.
Our method balances machine unlearning performance and model utility by
identifying and perturbing highly relevant neurons, thereby achieving effective
unlearning. By using data not present in the original training set during the
unlearning process, we satisfy the zero-shot unlearning scenario and ensure
robust privacy protection. Experimental results demonstrate that our approach
effectively removes targeted data from the target unlearning model while
maintaining the model's utility, offering a practical solution for
privacy-preserving machine learning.
Authors' comments: 17 pages, 5 figures
Stefan Stojanovic, Yassir Jedra, Alexandre Proutiere
We consider the problem of learning an $\varepsilon$-optimal policy in
controlled dynamical systems with low-rank latent structure. For this problem,
we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm
alternating between policy improvement and policy evaluation steps. In the
latter, the algorithm estimates the low-rank matrix corresponding to the
(state, action) value function of the current policy using the following
two-phase procedure. The entries of the matrix are first sampled uniformly at
random to estimate, via a spectral method, the leverage scores of its rows and
columns. These scores are then used to extract a few important rows and columns
whose entries are further sampled. The algorithm exploits these new samples to
complete the matrix estimation using a CUR-like method. For this leveraged
matrix estimation procedure, we establish entry-wise guarantees that,
remarkably, do not depend on the coherence of the matrix but only on its
spikiness. These guarantees imply that LoRa-PI learns an $\varepsilon$-optimal
policy using $\widetilde{O}({S+A\over \mathrm{poly}(1-\gamma)\varepsilon^2})$
samples where $S$ (resp. $A$) denotes the number of states (resp. actions) and
$\gamma$ the discount factor. Our algorithm achieves this order-optimal (in
$S$, $A$ and $\varepsilon$) sample complexity under milder conditions than
those assumed in previously proposed approaches.
Authors' comments: Accepted for presentation at the Conference on Neural Information
Processing Systems (NeurIPS) 2024
Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang
The advanced capabilities of Large Language Models (LLMs) have inspired the development of various interactive web services or applications, such as ChatGPT, which offer query inference services for users. Unlike that of traditional DNN models, the inference of LLMs entails different numbers of forward-computation iterations for different queries, which results in efficiency challenges for existing run-to-completion batch-wise inference. Hence, some methods refine batch-wise inference to iteration-level by duplicating all nonlinear layers of the LLM. However, this approach not only increases resource usage but also introduces idle computations to the batch due to the prefilling of newly added queries. Therefore, we propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch and can achieve near-zero idle computations without incurring additional resource consumption. To do so, BATON 1) shapes the vectors involved in the inference of the newly inserted query and processing batch to align dimensions and generates a new attention mask based on vector shaping to ensure inference correctness, which enables query insertion without consuming additional resources; 2) embeds prefilled Keys and Values of the new query into the KV_Cache of the processing batch by leveraging the prefilling and decoding separation mechanism, eliminating idle computations to the batch introduced by the prefilling process of the new query. Experimental results show that compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park
We present Unified Microphone Conversion, a unified generative framework
designed to bolster sound event classification (SEC) systems against device
variability. While our prior CycleGAN-based methods effectively simulate device
characteristics, they require separate models for each device pair, limiting
scalability. Our approach overcomes this constraint by conditioning the
generator on frequency response data, enabling many-to-many device mappings
through unpaired training. We integrate frequency-response information via
Feature-wise Linear Modulation, further enhancing scalability. Additionally,
incorporating synthetic frequency response differences improves the
applicability of our framework for real-world application. Experimental results
show that our method outperforms the state-of-the-art by 2.6% and reduces
variability by 0.8% in macro-average F1 score.
Authors' comments: Accepted to Interspeech 2025
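Feature-wise Linear Modulation scales and shifts each feature channel with parameters predicted from a conditioning input. The sketch below shows that operation on plain Python lists, using a frequency-response vector as the condition. The single linear layer per parameter, the shapes, and every name here are assumptions for illustration; the abstract does not describe the generator's actual architecture.

```python
def linear(weights, bias, x):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Apply y[c][t] = gamma[c] * x[c][t] + beta[c] to every channel c."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


def condition_and_modulate(features, freq_response,
                           w_gamma, b_gamma, w_beta, b_beta):
    """Predict per-channel (gamma, beta) from the condition, then modulate."""
    gamma = linear(w_gamma, b_gamma, freq_response)
    beta = linear(w_beta, b_beta, freq_response)
    return film(features, gamma, beta)
```

Because the device information enters only through `gamma` and `beta`, one set of generator weights can serve any source/target pair, which is the scalability argument made in the abstract.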
Khunanon Thongkham, Anthony H. Gonzalez, Mark Brodwin, Ariane Trudeau, Peter Eisenhardt, S. A. Stanford, Emily Moravec, Thomas Connor et al.
We present the second data release of the Massive and Distant Clusters of
WISE Survey 2 (MaDCoWS2). We expand from the equatorial first data release to
most of the Dark Energy Camera Legacy Survey area, covering a total area of
6498 deg^2. The catalog consists of 133,036 S/N $\geq5$ galaxy cluster
candidates at $0.1\leq z \leq2$, including 6790 candidates at z > 1.5. We train
a convolutional neural network (CNN) to identify spurious detections, and
include CNN-based cluster probabilities in the final catalog. We also compare
the MaDCoWS2 sample with literature catalogs in the same area. The larger
sample provides robust results that are consistent with our first data release.
At S/N $\geq5$, we rediscover 59-91% of clusters in existing catalogs that lie
in the unmasked area of MC2. The median positional offsets are under 250 kpc,
and the standard deviation of the redshifts is 0.031(1+z). We fit a
redshift-dependent power law to the relation between MaDCoWS2 S/N and
observables from existing catalogs. Over the redshift ranges where the surveys
overlap with MaDCoWS2, the lowest scatter is found between S/N and observables
from optical/infrared surveys. We also assess the performance of our method
using a mock light cone, measuring purity and completeness as a function of
cluster mass. The purity is above 90%, and we estimate the 50% completeness
threshold at a virial mass of log(M/M$_\odot$)$\approx14.3$. The completeness
estimate is uncertain due to the small number of massive halos in the light
cone, but consistent with the recovery fraction found by comparing to other
cluster catalogs.
Authors' comments: 21 pages, 14 figures, 4 tables. Accepted for publication in ApJ
M. E. Cluver, T. H. Jarrett, D. A. Dale, J. -D. T. Smith, M. J. I. Brown, W. van Kempen, E. Lengerer, R. Incoll et al.
In this work we present source-tailored WISE mid-infrared photometry (at
3.4$\mu$m, 4.6$\mu$m, 12$\mu$m, and 23$\mu$m) of 2812 galaxies in the extended
Spitzer Survey of Stellar Structure in Galaxies (S$^4$G) sample, and
characterise the mid-infrared colors and dust properties of this legacy nearby
galaxy data set. Informed by the relative emission between W3 (12$\mu$m) and
W4 (23$\mu$m), we re-derive star formation rate (SFR) scaling relations
calibrated to L$_{\rm TIR}$, which results in improved agreement between the
two tracers. By inverse-variance weighting the W3 and W4-derived SFRs, we
generate a combined mid-infrared SFR that is a broadly robust measure of star
formation activity in dusty, star-forming galaxies in the nearby Universe. In
addition, we investigate the use of a W3-derived dust density metric,
$\Sigma_{\rm 12\mu m}$ (L$_\odot$/kpc$^2$), to estimate the SFR deficit of low
mass, low dust galaxies. This is achieved by combining WISE with existing GALEX
ultraviolet (UV) photometry, which we further use to explore the relationship
between dust and UV emission as a function of morphology. Finally, we use our
derived SFR prescriptions to examine the location of galaxies in the log SFR -
log M$_\textrm{stellar}$ plane, as a function of morphological type, which
underscores the complexity of dust-derived properties seen in galaxies of
progressively earlier type.
Authors' comments: Accepted to ApJ
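The inverse-variance weighting used to combine the W3- and W4-derived SFRs follows the standard formula: each estimate is weighted by $1/\sigma^2$, and the combined uncertainty is $(\sum_i 1/\sigma_i^2)^{-1/2}$. The sketch below applies it to generic values. Whether the paper combines the SFRs in linear or logarithmic space is not stated in the abstract, and the function name is hypothetical.

```python
def inverse_variance_combine(values, sigmas):
    """Inverse-variance weighted mean and its 1-sigma uncertainty.

    values: independent estimates of the same quantity
    sigmas: their 1-sigma uncertainties (all strictly positive)
    """
    weights = [1.0 / (s * s) for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    sigma = total ** -0.5
    return mean, sigma
```

With equal uncertainties the result reduces to the plain average; otherwise the more precise tracer dominates, so a noisy W4 measurement pulls the combined SFR only slightly away from the W3 value.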
Qian Tao, Wenyuan Yu, Jingren Zhou
Large language models have shown exceptional capabilities in a wide range of
tasks, such as text generation and video generation, among others. However, due
to their massive parameter count, these models often require substantial
storage space, imposing significant constraints on the machines deploying LLMs.
To overcome this limitation, one research direction proposes to compress the
models using integer replacements for floating-point numbers, in a process
known as Quantization. Some recent studies suggest quantizing the key and value
cache (KV Cache) of LLMs, and designing quantization techniques that treat the
key and value matrices equivalently.
This work delves deeper into the asymmetric structural roles of KV Cache, a
phenomenon where the transformer's output loss is more sensitive to the
quantization of key matrices. We conduct a systematic examination of the
attention output error resulting from key and value quantization. The
phenomenon inspires us to propose an asymmetric quantization strategy. Our
approach allows for 1-bit quantization of the KV cache by implementing distinct
configurations for key and value matrices. We carry out experiments across a
variety of datasets, demonstrating that our proposed model allows for the
quantization of up to 75% decoder layers with 1 bit, while simultaneously
maintaining performance levels comparable to those of the models with floating
parameters.
Authors' comments: 12 pages, 4 figures
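To show what 1-bit quantization of a cache row involves, here is a generic sign-plus-scale scheme: each value is stored as one bit for its sign, and the row shares a single floating-point scale. This is an illustrative baseline only; the paper's asymmetric configurations for key and value matrices are not described in the abstract, and the function names are hypothetical.

```python
def quantize_1bit(row):
    """Store each value as a sign bit plus one shared scale (mean |value|)."""
    scale = sum(abs(v) for v in row) / len(row)
    bits = [1 if v >= 0 else 0 for v in row]
    return bits, scale


def dequantize_1bit(bits, scale):
    """Reconstruct +scale for a set bit and -scale for a cleared bit."""
    return [scale if b else -scale for b in bits]


def mean_abs_error(row):
    """Mean absolute reconstruction error of the 1-bit round trip."""
    bits, scale = quantize_1bit(row)
    restored = dequantize_1bit(bits, scale)
    return sum(abs(a - b) for a, b in zip(row, restored)) / len(row)
```

The round trip keeps only the sign pattern and the average magnitude of a row, so the reconstruction error depends on how far the magnitudes spread around that average. Measuring how such errors propagate into the attention output, separately for keys and values, is the kind of comparison the abstract describes.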
Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang
Recent work on pruning large language models (LLMs) has shown that one can
eliminate a large number of parameters without compromising performance, making
pruning a promising strategy to reduce LLM model size. Existing LLM pruning
strategies typically assign uniform pruning ratios across layers, limiting
overall pruning ability; and recent work on layerwise pruning of LLMs is often
based on heuristics that can easily lead to suboptimal performance. In this
paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in
particular the shape of empirical spectral densities (ESDs) of weight matrices,
to design improved layerwise pruning ratios for LLMs. Our analysis reveals a
wide variability in how well-trained, and thus relatedly how prunable,
different layers of an LLM are. Based on this, we propose AlphaPruning, which
uses shape metrics to allocate layerwise sparsity ratios in a more
theoretically principled manner. AlphaPruning can be used in conjunction with
multiple existing LLM pruning methods. Our empirical results show that
AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable
perplexity, marking a first in the literature on LLMs. We have open-sourced our
code at https://github.com/haiquanlu/AlphaPruning.
Authors' comments: NeurIPS 2024, first two authors contributed equally
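The idea of turning a per-layer shape metric into non-uniform sparsity ratios can be sketched as a linear mapping centered on the global target. Everything below is an assumption for illustration: the direction of the mapping (a larger metric means more pruning), the equal-sized layers, the `spread` parameter, and the function name are not taken from the abstract, and the metric values in the example are made up.

```python
def allocate_sparsity(metrics, target, spread):
    """Map per-layer metrics linearly to sparsities whose mean is `target`.

    metrics: one shape-metric value per layer (hypothetical inputs)
    target:  desired average sparsity, assuming equally sized layers
    spread:  difference between the largest and smallest layer sparsity
    The result is not clamped to [0, 1]; the caller must pick a valid spread.
    """
    low, high = min(metrics), max(metrics)
    if high == low:
        return [target] * len(metrics)
    scaled = [(m - low) / (high - low) for m in metrics]
    center = sum(scaled) / len(scaled)
    return [target + spread * (s - center) for s in scaled]
```

For example, `allocate_sparsity([2.0, 3.0, 4.0], 0.7, 0.2)` yields roughly 0.6, 0.7 and 0.8, so the average stays at the target while individual layers are pruned more or less aggressively. The resulting ratios could then be handed to an existing pruning method layer by layer.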