Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen
Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to make the model robust to the simulated embedding drift. However, applying layer-wise uniform perturbation may lead to excess perturbations for some particular safety-irrelevant layers, resulting in defense performance degradation and unnecessary memory consumption. To address this limitation, we propose Targeted Vaccine (T-Vaccine), a memory-efficient safety alignment method that applies perturbation to only selected layers of the model. T-Vaccine follows two core steps: First, it uses gradient norm as a statistical metric to identify the safety-critical layers. Second, instead of applying uniform perturbation across all layers, T-Vaccine only applies perturbation to the safety-critical layers while keeping other layers frozen during training. Results show that T-Vaccine outperforms Vaccine in terms of both defense effectiveness and resource efficiency. Comparisons with other defense baselines, e.g., RepNoise and TAR, also demonstrate the superiority of T-Vaccine. Notably, T-Vaccine is the first defense that can address harmful fine-tuning issues for 7B pre-trained models trained on consumer GPUs with limited memory (e.g., RTX 4090). Our code is available at https://github.com/Lslland/T-Vaccine.
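The layer-selection step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the top-k selection rule, and the toy gradients are all assumptions. The abstract only states that gradient norm is used as the metric for identifying safety-critical layers.

```python
import numpy as np

def layer_gradient_norms(grads):
    """Return the L2 (Frobenius) norm of each layer's gradient.

    grads: dict mapping a layer name to a NumPy array of gradients.
    """
    return {name: float(np.linalg.norm(g)) for name, g in grads.items()}

def select_safety_critical_layers(grads, k):
    """Pick the k layers with the largest gradient norm.

    The top-k rule is a hypothetical choice made for this sketch.
    """
    norms = layer_gradient_norms(grads)
    ranked = sorted(norms, key=norms.get, reverse=True)
    return ranked[:k]

# Toy gradients for three hypothetical layers.
toy_grads = {
    "layer_0": np.array([0.1, 0.1]),  # small norm
    "layer_1": np.array([3.0, 4.0]),  # norm 5.0
    "layer_2": np.array([1.0, 0.0]),  # norm 1.0
}
selected = select_safety_critical_layers(toy_grads, k=2)
```

In this sketch only the layers in `selected` would receive perturbation, while the remaining layers would stay frozen.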
Yanfeng Jiang, Zelan Yang, Bohua Chen, Shen Li, Yong Li, Tao Li
Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning. However, the diversity of downstream tasks and practical requirements makes deploying multiple full-parameter fine-tuned models challenging. Current methods that compress the delta weight struggle to achieve ultra-high compression, failing to minimize the deployment overhead. To address the above issue, we propose a novel distribution-driven delta compression framework DeltaDQ, which utilizes Group-wise Dropout and Separate Quantization to achieve ultra-high compression for the delta weight. We have observed that the matrix-computed intermediate results for the delta weight exhibit extremely small variance and min-max range characteristics, referred to as Balanced Intermediate Results. Exploiting this phenomenon, we introduce Group-wise Dropout to perform dropout on the delta weight using an optimal group size. Furthermore, using Separate Quantization, sparse weights are quantized and decomposed to achieve a lower bit. Experimental results show that DeltaDQ achieves 16x compression with improved accuracy compared to baselines for WizardMath and WizardCoder models across different parameter scales. Moreover, DeltaDQ demonstrates the ability for ultra-high compression ratio, achieving 128x compression for the WizardMath-7B model and 512x compression for the WizardMath-70B model.
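As a rough illustration of dropout applied within fixed-size groups of a delta weight, consider the sketch below. It is not the DeltaDQ implementation: the grouping along rows, the fixed number of survivors per group, the rescaling factor, and the name `groupwise_dropout` are assumptions chosen to keep the example small.

```python
import numpy as np

def groupwise_dropout(delta, group_size, keep_per_group, seed=0):
    """Keep `keep_per_group` random entries in every group of `group_size`
    consecutive entries along each row, and rescale the survivors by
    group_size / keep_per_group so each group's expected sum is preserved."""
    rows, cols = delta.shape
    assert cols % group_size == 0, "row length must be a multiple of group_size"
    rng = np.random.default_rng(seed)
    groups = delta.reshape(rows, cols // group_size, group_size)
    mask = np.zeros_like(groups, dtype=bool)
    for r in range(rows):
        for g in range(groups.shape[1]):
            keep = rng.choice(group_size, size=keep_per_group, replace=False)
            mask[r, g, keep] = True
    scale = group_size / keep_per_group
    return (groups * mask * scale).reshape(rows, cols)

# Toy delta weight (fine-tuned weight minus base weight), values 1..16.
delta = np.arange(1, 17, dtype=float).reshape(2, 8)
sparse = groupwise_dropout(delta, group_size=4, keep_per_group=1)
```

Here one entry out of every four survives, so `sparse` stores a quarter of the original entries before any quantization is applied.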
Shuo Xie, Mohamad Amin Mohamadi, Zhiyuan Li
Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically -- previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$ and are already minimax-optimal in non-convex cases, with both rates being $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
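For orientation, one standard way to state smoothness of a loss $L$ with respect to the $\ell_\infty$ norm is given below. The constant $H$ and the exact form are assumptions for illustration, and the paper's precise assumption may differ. The gradient difference is measured in the dual norm, which for $\ell_\infty$ is $\ell_1$.

```latex
% Smoothness with respect to the \ell_\infty norm (dual norm: \ell_1)
\[
  \|\nabla L(x) - \nabla L(y)\|_1 \;\le\; H \,\|x - y\|_\infty
  \qquad \text{for all } x, y,
\]
% which implies the quadratic upper bound
\[
  L(y) \;\le\; L(x) + \langle \nabla L(x),\, y - x \rangle
        + \frac{H}{2}\,\|y - x\|_\infty^2 .
\]
```

Replacing both norms by $\ell_2$ recovers the more common smoothness condition mentioned in the abstract.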
Cai Dieball, Yasamin Mohebi Satalsari, Angel B. Zuccolotto-Bernezb, Stefan U. Egelhaaf, Manuel A. Escobedo-Sánchez, Aljaž Godec
We investigate path-wise observables in experiments on driven colloids in a periodic light field to dissect selected intricate transport features, kinetics, and transition-path time statistics out of thermodynamic equilibrium. These observables directly reflect the properties of individual paths in contrast to the properties of an ensemble of particles, such as radial distribution functions or mean-squared displacements. In particular, we present two distinct albeit equivalent formulations of the underlying stochastic equation of motion, highlight their respective practical relevance, and show how to interchange between them. We discuss conceptually different notions of local velocities and interrogate one- and two-sided first-passage and transition-path time statistics in and out of equilibrium. Our results reiterate how path-wise observables may be employed to systematically assess the quality of experimental data and demonstrate that, given sufficient control and sampling, one may quantitatively verify subtle theoretical predictions.
David Alonso
We present a generalisation of the standard pseudo-$C_\ell$ approach for
power spectrum estimation to the case of spin-$s$ fields weighted by a general
positive-definite weight matrix that couples the different spin components of
the field (e.g. $Q$ and $U$ maps in CMB polarisation analyses, or $\gamma_1$
and $\gamma_2$ shear components in weak lensing). Relevant use cases are, for
example, data with significantly anisotropic noise properties, or situations in
which different masks must be applied to the different field components. The
weight matrix map is separated into a spin-0 part, which corresponds to the
"mask" in the standard pseudo-$C_\ell$ approach, and a spin-$2s$ part sourced
solely by the anisotropic elements of the matrix, leading to additional
coupling between angular scales and $E/B$ modes. The general expressions for
the mode-coupling coefficients involving the power spectra of these anisotropic
weight components are derived and validated. The generalised algorithm is as
computationally efficient as the standard approach. We implement the method in
the public code NaMaster.
Authors' comments: Accepted in the Open Journal of Astrophysics
Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney
The increasing size of deep learning models has made distributed training
across multiple devices essential. However, current methods such as distributed
data-parallel training suffer from large communication and synchronization
overheads when training across devices, leading to longer training times as a
result of suboptimal hardware utilization. Asynchronous stochastic gradient
descent (ASGD) methods can improve training speed, but are sensitive to delays
due to both communication and differences in throughput. Moreover, the
backpropagation algorithm used within ASGD workers is bottlenecked by the
interlocking between its forward and backward passes. Current methods also do
not take advantage of the large differences in the computation required for the
forward and backward passes. Therefore, we propose an extension to ASGD called
Partial Decoupled ASGD (PD-ASGD) that addresses these issues. PD-ASGD uses
separate threads for the forward and backward passes, decoupling the updates
and allowing for a higher ratio of forward to backward threads than the usual
1:1 ratio, leading to higher throughput. PD-ASGD also performs layer-wise
(partial) model updates concurrently across multiple threads. This reduces
parameter staleness and consequently improves robustness to delays. Our
approach yields close to state-of-the-art results while running up to
$5.95\times$ faster than synchronous data parallelism in the presence of
delays, and up to $2.14\times$ faster than comparable ASGD algorithms by
achieving higher model flops utilization. We mathematically describe the
gradient bias introduced by our method, establish an upper bound, and prove
convergence.
Authors' comments: 17 pages, 5 figures
Minchan Kim, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim
We present SegINR, a novel approach to neural Text-to-Speech (TTS) that
addresses sequence alignment without relying on an auxiliary duration predictor
and complex autoregressive (AR) or non-autoregressive (NAR) frame-level
sequence modeling. SegINR simplifies the process by converting text sequences
directly into frame-level features. It leverages an optimal text encoder to
extract embeddings, transforming each into a segment of frame-level features
using a conditional implicit neural representation (INR). This method, named
segment-wise INR (SegINR), models temporal dynamics within each segment and
autonomously defines segment boundaries, reducing computational costs. We
integrate SegINR into a two-stage TTS framework, using it for semantic token
prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate
that SegINR outperforms conventional methods in speech quality with
computational efficiency.
Authors' comments: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
Xianlong Wang, Minghui Li, Wei Liu, Hangtao Zhang, Shengshan Hu, Yechao Zhang, Ziqi Zhou, Hai Jin
Traditional unlearnable strategies have been proposed to prevent unauthorized
users from training on 2D image data. With more 3D point cloud data
containing sensitive information, unauthorized usage of this new type of data
has also become a serious concern. To address this, we propose the first
integral unlearnable framework for 3D point clouds including two processes: (i)
we propose an unlearnable data protection scheme, involving a class-wise
setting established by a category-adaptive allocation strategy and
multi-transformations assigned to samples; (ii) we propose a data restoration
scheme that utilizes class-wise inverse matrix transformation, thus enabling
authorized-only training for unlearnable data. This restoration process is a
practical issue overlooked in most existing unlearnable literature, i.e., even
authorized users struggle to gain knowledge from 3D unlearnable data. Both
theoretical and empirical results (including 6 datasets, 16 models, and 2
tasks) demonstrate the effectiveness of our proposed unlearnable framework. Our
code is available at https://github.com/CGCL-codes/UnlearnablePC.
Authors' comments: NeurIPS 2024
Minh Duong Nguyen, Khanh Le, Khoi Do, Nguyen H. Tran, Duc Nguyen, Chien Trinh, Zhaohui Yang
In personalized Federated Learning (pFL), high data heterogeneity can cause significant gradient divergence across devices, adversely affecting the learning process. This divergence, especially when gradients from different users form an obtuse angle during aggregation, can negate progress, leading to severe weight and gradient update degradation. To address this issue, we introduce a new approach to pFL design, namely Federated Learning with Layer-wise Aggregation via Gradient Analysis (FedLAG), utilizing the concept of gradient conflict at the layer level. Specifically, when layer-wise gradients of different clients form acute angles, those gradients align in the same direction, enabling updates across different clients toward identifying client-invariant features. Conversely, when layer-wise gradient pairs create obtuse angles, the layers tend to focus on client-specific tasks. In hindsight, FedLAG assigns layers for personalization based on the extent of layer-wise gradient conflicts. Specifically, layers with gradient conflicts are excluded from the global aggregation process. The theoretical evaluation demonstrates that when integrated into other pFL baselines, FedLAG enhances pFL performance by a certain margin. Therefore, our proposed method achieves superior convergence behavior compared with other baselines. Extensive experiments show that our FedLAG outperforms several state-of-the-art methods and can be easily incorporated with many existing methods to further enhance performance.
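The acute-versus-obtuse angle test described above can be sketched with cosine similarity. This is a minimal illustration under stated assumptions, not the FedLAG algorithm itself: the "any conflicting pair" rule, the zero threshold, and the function names are hypothetical choices.

```python
from itertools import combinations

import numpy as np

def cosine(u, v):
    """Cosine of the angle between two flattened gradient vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def split_layers_by_conflict(client_grads):
    """Split layers into (shared, personalized) lists.

    client_grads: list with one dict per client, mapping layer name to a
    flattened gradient. A layer counts as conflicting if any pair of
    clients has a negative cosine (obtuse angle); this rule is an
    assumption made for the sketch.
    """
    shared, personalized = [], []
    for layer in client_grads[0]:
        conflict = any(
            cosine(a[layer], b[layer]) < 0.0
            for a, b in combinations(client_grads, 2)
        )
        (personalized if conflict else shared).append(layer)
    return shared, personalized

# Two hypothetical clients with two layers each.
clients = [
    {"feat": np.array([1.0, 0.0]), "head": np.array([1.0, 0.0])},
    {"feat": np.array([1.0, 1.0]), "head": np.array([-1.0, 0.5])},
]
shared, personalized = split_layers_by_conflict(clients)
```

In this toy case the `feat` gradients form an acute angle and the layer is aggregated globally, while the `head` gradients form an obtuse angle and the layer is kept out of the global aggregation.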
Eduard Tulchinskii, Laida Kushnareva, Kristian Kuznetsov, Anastasia Voznyuk, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
A standard way to evaluate the abilities of LLMs involves presenting a multiple-choice question and selecting the option with the highest logit as the model's predicted answer. However, such a format for evaluating LLMs has limitations, since even if the model knows the correct answer, it may struggle to select the corresponding letter simply due to difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal the model's underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific select-and-copy heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, achieving nearly perfect accuracy, therefore demonstrating the method's efficiency in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
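A score built from query-key interaction can be illustrated as a dot product between one query vector and one key vector per answer option. The sketch below is an assumption-laden toy, not the paper's procedure: which token positions and which attention head supply the vectors is not specified here, and the vectors are made up.

```python
import numpy as np

def qk_scores(query, option_keys):
    """Dot product between a single query vector and one key vector per option."""
    return option_keys @ query

# Hypothetical vectors taken from one attention head.
query = np.array([1.0, 0.0, 2.0])
option_keys = np.array([
    [0.5, 1.0, 0.0],   # option A
    [1.0, 0.0, 1.0],   # option B
    [0.0, 2.0, 0.5],   # option C
    [-1.0, 0.0, 0.0],  # option D
])
scores = qk_scores(query, option_keys)
predicted = "ABCD"[int(np.argmax(scores))]
```

The option with the largest score is taken as the prediction, in place of the option with the highest output logit.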
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
While large language models (LLMs) have seen unprecedented advancements in capabilities and applications across a variety of use-cases, safety alignment of these models is still an area of active research. The fragile nature of LLMs, even models that have undergone extensive alignment and safety training regimes, warrants additional safety steering steps via training-free, inference-time methods. While recent work in the area of mechanistic interpretability has investigated how activations in latent representation spaces may encode concepts, and thereafter performed representation engineering to induce such concepts in LLM outputs, the applicability of such for safety is relatively under-explored. Unlike recent inference-time safety steering works, in this paper we explore safety steering of LLM outputs using: (i) category-specific steering vectors, thereby enabling fine-grained control over the steering, and (ii) sophisticated methods for extracting informative steering vectors for more effective safety steering while retaining quality of the generated text. We demonstrate our exploration on multiple LLMs and datasets, and showcase the effectiveness of the proposed steering method, along with a discussion on the implications and best practices.
Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang et al.
The expanding context windows in large language models (LLMs) have greatly
enhanced their capabilities in various applications, but they also introduce
significant challenges in maintaining low latency, particularly in Time to
First Token (TTFT). This paper identifies that the sharp rise in TTFT as
context length increases is predominantly driven by queuing delays, which are
caused by the growing demands for GPU Key-Value (KV) cache allocation clashing
with the limited availability of KV cache blocks. To address this issue, we
propose LayerKV, a simple yet effective plug-in method that reduces
TTFT without requiring additional hardware or compromising output performance,
while seamlessly integrating with existing parallelism strategies and
scheduling techniques. Specifically, LayerKV introduces layer-wise KV block
allocation, management, and offloading for fine-grained control over system
memory, coupled with an SLO-aware scheduler to optimize overall Service Level
Objectives (SLOs). Comprehensive evaluations on representative models, ranging
from 7B to 70B parameters, across various GPU configurations, demonstrate that
LayerKV improves TTFT latency by up to 69x and reduces SLO violation rates by
28.7%, significantly enhancing the user experience.
Authors' comments: 11 pages, 7 figures, 1 table
Shahed Masoudian, Markus Frohmann, Navid Rekabsaz, Markus Schedl
Language models frequently inherit societal biases from their training data.
Numerous techniques have been proposed to mitigate these biases during both the
pre-training and fine-tuning stages. However, fine-tuning a pre-trained
debiased language model on a downstream task can reintroduce biases into the
model. Additionally, existing debiasing methods for downstream tasks either (i)
require labels of protected attributes (e.g., age, race, or political views)
that are often not available or (ii) rely on indicators of bias, which
restricts their applicability to gender debiasing since they rely on
gender-specific words. To address this, we introduce a novel debiasing
regularization technique based on the class-wise variance of embeddings.
Crucially, our method does not require attribute labels and targets any
attribute, thus addressing the shortcomings of existing debiasing methods. Our
experiments on encoder language models and three datasets demonstrate that our
method outperforms existing strong debiasing baselines that rely on target
attribute labels while maintaining performance on the target task.
Authors' comments: Accepted to EMNLP 2024
Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Jose M Martínez
Merging parameters of multiple models has resurfaced as an effective strategy to enhance task performance and robustness, but prior work is limited by the high costs of ensemble creation and inference. In this paper, we leverage the abundance of freely accessible trained models to introduce a cost-free approach to model merging. It focuses on a layer-wise integration of merged models, aiming to maintain the distinctiveness of the task-specific final layers while unifying the initial layers, which are primarily associated with feature extraction. This approach ensures parameter consistency across all layers, essential for boosting performance. Moreover, it facilitates seamless integration of knowledge, enabling effective merging of models from different datasets and tasks. Specifically, we investigate its applicability in Unsupervised Domain Adaptation (UDA), an unexplored area for model merging, for Semantic and Panoptic Segmentation. Experimental results demonstrate substantial UDA improvements without additional costs for merging same-architecture models from distinct datasets ($\uparrow 2.6\%$ mIoU) and different-architecture models with a shared backbone ($\uparrow 6.8\%$ mIoU). Furthermore, merging Semantic and Panoptic Segmentation models increases mPQ by $\uparrow 7\%$. These findings are validated across a wide variety of UDA strategies, architectures, and datasets.
Dohyeong Kim, Hyeokjin Kwon, Junseok Kim, Gunmin Lee, Songhwai Oh
As the complexity of tasks addressed through reinforcement learning (RL)
increases, the definition of reward functions has also become highly
complicated. We introduce an RL method aimed at simplifying the reward-shaping
process through intuitive strategies. Initially, instead of a single reward
function composed of various terms, we define multiple reward and cost
functions within a constrained multi-objective RL (CMORL) framework. For tasks
involving sequential complex movements, we segment the task into distinct
stages and define multiple rewards and costs for each stage. Finally, we
introduce a practical CMORL algorithm that maximizes objectives based on these
rewards while satisfying constraints defined by the costs. The proposed method
has been successfully demonstrated across a variety of acrobatic tasks in both
simulation and real-world environments. Additionally, in comparisons with
existing RL and constrained RL algorithms, it has been shown to perform the
tasks successfully. Our code is available at
https://github.com/rllab-snu/Stage-Wise-CMORL.
Authors' comments: 7 pages
Ziyu Zhao, Tao Shen, Didi Zhu, Zexi Li, Jing Su, Xuwu Wang, Kun Kuang, Fei Wu
Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs) to various domains due to its modular design and widespread availability on platforms like Huggingface. This modularity has sparked interest in combining multiple LoRAs to enhance LLM capabilities. However, existing methods for LoRA composition primarily focus on task-specific adaptations that require additional training, and current model merging techniques often fail to fully leverage LoRA's modular nature, leading to parameter interference and performance degradation. In this paper, we investigate the feasibility of disassembling and reassembling multiple LoRAs at a finer granularity, analogous to assembling LEGO blocks. We introduce the concept of Minimal Semantic Units (MSUs), where the parameters corresponding to each rank in LoRA function as independent units. These MSUs demonstrate permutation invariance and concatenation-summation equivalence properties, enabling flexible combinations to create new LoRAs. Building on these insights, we propose the LoRA-LEGO framework. This framework conducts rank-wise parameter clustering by grouping MSUs from different LoRAs into $k$ clusters. The centroid of each cluster serves as a representative MSU, enabling the assembly of a merged LoRA with an adjusted rank of $k$. Additionally, we apply a dual reweighting strategy to optimize the scale of the merged LoRA. Experiments across various benchmarks demonstrate that our method outperforms existing approaches in LoRA merging.
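The two MSU properties named above, permutation invariance and concatenation-summation equivalence, follow from basic matrix algebra and can be checked numerically. The sketch below uses random matrices with hypothetical shapes and assumes the common LoRA parameterization in which the update is the product of an up-projection `B` and a down-projection `A`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r1, r2 = 6, 5, 2, 3  # hypothetical dimensions and ranks

# Two LoRA updates, each written as B @ A.
B1, A1 = rng.normal(size=(d_out, r1)), rng.normal(size=(r1, d_in))
B2, A2 = rng.normal(size=(d_out, r2)), rng.normal(size=(r2, d_in))

# Summing the two updates ...
summed = B1 @ A1 + B2 @ A2

# ... equals one LoRA of rank r1 + r2 built by concatenating along the rank axis.
B_cat = np.concatenate([B1, B2], axis=1)
A_cat = np.concatenate([A1, A2], axis=0)
concatenated = B_cat @ A_cat

# Reordering the rank-wise units (column i of B with row i of A) changes nothing.
perm = rng.permutation(r1 + r2)
permuted = B_cat[:, perm] @ A_cat[perm, :]
```

Because each rank contributes one independent outer product, the units can be pooled across LoRAs and regrouped, which is what makes rank-wise clustering into $k$ representative units possible.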
Xian Zhong, Shengwang Hu, Wenxuan Liu, Wenxin Huang, Jianhao Ding, Zhaofei Yu, Tiejun Huang
Spiking neural networks (SNNs) have garnered significant attention for their low power consumption and high biological interpretability. Their rich spatio-temporal information processing capability and event-driven nature make them well suited for neuromorphic datasets. However, current SNNs struggle to balance accuracy and latency in classifying these datasets. In this paper, we propose a Hybrid Step-wise Distillation (HSD) method, tailored for neuromorphic datasets, to mitigate the notable decline in performance at lower time steps. Our work disentangles the dependency between the number of event frames and the time steps of SNNs, utilizing more event frames during the training stage to improve performance, while using fewer event frames during the inference stage to reduce latency. Nevertheless, the average output of SNNs across all time steps is susceptible to individual time steps with abnormal outputs, particularly at extremely low time steps. To tackle this issue, we implement a Step-wise Knowledge Distillation (SKD) module that considers variations in the output distribution of SNNs at each time step. Empirical evidence demonstrates that our method yields competitive performance in classification tasks on neuromorphic datasets, especially at lower time steps. Our code will be available at: https://github.com/hsw0929/HSD.
Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie
Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical computations between query segments and cache blocks in the self-attention mechanism, the prefilling process can be significantly accelerated. Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU, with minimal quality degradation.
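The idea of estimating criticality per query segment and per cache block, rather than per token, can be sketched as below. This is a simplified illustration and not the CritiPrefill algorithm: summarizing segments and blocks by their mean vectors, the dot-product score, the top-`top_b` rule, and the omission of causal masking are all assumptions.

```python
import numpy as np

def critical_blocks(Q, K, seg_size, block_size, top_b):
    """For each query segment, return the indices of the top_b key blocks.

    Q: (n_q, d) queries, K: (n_k, d) keys. Segments and blocks are
    summarized by their mean vector (an assumption of this sketch).
    """
    n_q, d = Q.shape
    n_k = K.shape[0]
    assert n_q % seg_size == 0 and n_k % block_size == 0
    q_seg = Q.reshape(n_q // seg_size, seg_size, d).mean(axis=1)
    k_blk = K.reshape(n_k // block_size, block_size, d).mean(axis=1)
    scores = q_seg @ k_blk.T  # shape: (num_segments, num_blocks)
    top = np.argsort(-scores, axis=1)[:, :top_b]
    return np.sort(top, axis=1)

# Toy example: 2 query segments of size 2, 3 key blocks of size 2.
Q = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
K = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 3.0],
              [0.0, 3.0], [1.0, 1.0], [1.0, 1.0]])
kept = critical_blocks(Q, K, seg_size=2, block_size=2, top_b=2)
```

Attention for a query segment would then be computed only against the blocks listed in its row of `kept`, and the remaining segment-block pairs would be skipped.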
Zhixing Hou, Maoxu Gao, Hang Yu, Mengyu Yang, Chio-In Ieong
This paper introduces a Spiking Diffusion Policy (SDP) learning method for robotic manipulation by integrating Spiking Neurons and Learnable Channel-wise Membrane Thresholds (LCMT) into the diffusion policy model, thereby enhancing computational efficiency and achieving high performance in evaluated tasks. Specifically, the proposed SDP model employs the U-Net architecture as the backbone for diffusion learning within the Spiking Neural Network (SNN). It strategically places residual connections between the spike convolution operations and the Leaky Integrate-and-Fire (LIF) nodes, thereby preventing disruptions to the spiking states. Additionally, we introduce a temporal encoding block and a temporal decoding block to transform static and dynamic data with timestep $T_S$ into each other, enabling the transmission of data within the SNN in spike format. Furthermore, we propose LCMT to enable the adaptive acquisition of membrane potential thresholds, thereby matching the conditions of varying membrane potentials and firing rates across channels and avoiding the cumbersome process of manually setting and tuning hyperparameters. Evaluating the SDP model on seven distinct tasks with SNN timestep $T_S=4$, we achieve results comparable to those of the ANN counterparts, along with faster convergence speeds than the baseline SNN method. This improvement is accompanied by a reduction of 94.3% in dynamic energy consumption estimated on 45nm hardware.
Andrew W. Blain
Soon after the release of the WISE all-sky catalogue of 500 million
mid-infrared (IR) objects, suggestions were made that it could be used to
search for extrasolar devices constructed by an advanced civilization to
convert a significant fraction of their host star's luminosity into useful
work: "technostructures", "megastructures" or "Dyson spheres/structures",
hereafter DSMs, whose inevitable waste heat would be seen by WISE at mid-IR
wavelengths. However, a trawl of several million potentially-habitable
Gaia-detected stars for mid-IR-excess signatures is fraught with danger, due to
both noise from such a large sample and, more importantly, confusion with the
emission from dusty background galaxies. In light of a recent claim of seven
potential DSMs in MNRAS, a brief rebuttal appeared on arXiv. Further to this
response, the relevance of WISE-detected galaxies is discussed in more detail,
leading to a seemingly tight limit on the number and lifetime of DSMs, and
indeed intelligent worlds, in the ~600-pc-radius region patrolled by Gaia.
However, the detectability of DSMs is questioned: a DSM might extinguish its
star at optical/near-IR wavelengths, and thus either not appear or appear
anomalously faint in a stellar catalogue. Moreover, a civilization advanced
enough to construct a DSM is likely to be advanced enough to use
countermeasures to mask its presence from us.
Authors' comments: 6 pages. No figures. Submitted to MNRAS, possibly letters