Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park
We present Unified Microphone Conversion, a unified generative framework
designed to bolster sound event classification (SEC) systems against device
variability. While our prior CycleGAN-based methods effectively simulate device
characteristics, they require separate models for each device pair, limiting
scalability. Our approach overcomes this constraint by conditioning the
generator on frequency response data, enabling many-to-many device mappings
through unpaired training. We integrate frequency-response information via
Feature-wise Linear Modulation, further enhancing scalability. Additionally,
incorporating synthetic frequency response differences improves the
applicability of our framework to real-world use. Experimental results
show that our method outperforms the state-of-the-art by 2.6% and reduces
variability by 0.8% in macro-average F1 score.
Authors' comments: Accepted to Interspeech 2025
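The Feature-wise Linear Modulation (FiLM) conditioning mentioned in the abstract can be illustrated with a minimal sketch. It assumes a conditioning vector (here a hypothetical two-bin frequency response) is mapped by two linear layers to a per-channel scale and shift; the names `film` and `linear`, the weights, and the toy values are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def linear(cond: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Plain linear layer: one output per weight row."""
    return [sum(w * c for w, c in zip(row, cond)) + b for row, b in zip(weights, bias)]


def film(features: List[List[float]], gamma: List[float], beta: List[float]) -> List[List[float]]:
    """Apply FiLM: scale and shift every channel by its own (gamma, beta)."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need exactly one gamma and one beta per channel")
    return [[g * x + b for x in channel] for channel, g, b in zip(features, gamma, beta)]


# Hypothetical conditioning vector (e.g. a coarse frequency response).
freq_response = [1.0, 0.5]

# Hypothetical projection weights; in practice these would be learned.
gamma = linear(freq_response, [[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
beta = linear(freq_response, [[0.0, 0.0], [1.0, 0.0]], [0.0, 0.0])

# Two channels with two time steps each.
features = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(features, gamma, beta)
print(modulated)
```

Because the same generator weights are reused and only `gamma` and `beta` change with the conditioning vector, a single model can in principle serve many device pairs, which is the scalability argument made above.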
Khunanon Thongkham, Anthony H. Gonzalez, Mark Brodwin, Ariane Trudeau, Peter Eisenhardt, S. A. Stanford, Emily Moravec, Thomas Connor et al.
We present the second data release of the Massive and Distant Clusters of
WISE Survey 2 (MaDCoWS2). We expand from the equatorial first data release to
most of the Dark Energy Camera Legacy Survey area, covering a total area of
6498 deg^2. The catalog consists of 133,036 S/N $\geq5$ galaxy cluster
candidates at $0.1\leq z \leq2$, including 6790 candidates at z > 1.5. We train
a convolutional neural network (CNN) to identify spurious detections, and
include CNN-based cluster probabilities in the final catalog. We also compare
the MaDCoWS2 sample with literature catalogs in the same area. The larger
sample provides robust results that are consistent with our first data release.
At S/N $\geq5$, we rediscover 59-91% of clusters in existing catalogs that lie
in the unmasked area of MaDCoWS2. The median positional offsets are under 250 kpc,
and the standard deviation of the redshifts is 0.031(1+z). We fit a
redshift-dependent power law to the relation between MaDCoWS2 S/N and
observables from existing catalogs. Over the redshift ranges where the surveys
overlap with MaDCoWS2, the lowest scatter is found between S/N and observables
from optical/infrared surveys. We also assess the performance of our method
using a mock light cone, measuring purity and completeness as a function of
cluster mass. The purity is above 90%, and we estimate the 50% completeness
threshold at a virial mass of log(M/M$_\odot$)$\approx14.3$. The completeness
estimate is uncertain due to the small number of massive halos in the light
cone, but consistent with the recovery fraction found by comparing to other
cluster catalogs.
Authors' comments: 21 pages, 14 figures, 4 tables. Accepted for publication in ApJ
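As a rough illustration of the purity and completeness statistics quoted in the abstract, the sketch below uses the standard definitions (fraction of detections matched to a true halo, and fraction of true halos recovered per mass bin). The matching procedure, bin edges, and all numbers are invented for illustration and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def purity(n_matched_detections: int, n_detections: int) -> float:
    """Fraction of detections that correspond to a real halo."""
    return n_matched_detections / n_detections


def completeness_by_mass(
    halos: List[Tuple[float, bool]], edges: List[float]
) -> List[Optional[float]]:
    """Fraction of halos recovered in each half-open log-mass bin [lo, hi)."""
    result: List[Optional[float]] = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [recovered for log_mass, recovered in halos if lo <= log_mass < hi]
        # Empty bins have no defined completeness.
        result.append(sum(in_bin) / len(in_bin) if in_bin else None)
    return result


# Toy mock catalog: (log10 virial mass, was the halo recovered?).
mock_halos = [
    (14.0, False), (14.1, False),
    (14.3, True), (14.4, False),
    (14.6, True), (14.7, True),
]
bin_edges = [14.0, 14.25, 14.5, 14.75]

toy_purity = purity(9, 10)
toy_completeness = completeness_by_mass(mock_halos, bin_edges)
print(toy_purity, toy_completeness)
```

With few halos per bin, as in this toy catalog, each completeness value rests on a handful of objects, which mirrors the caveat in the abstract about the small number of massive halos in the light cone.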
M. E. Cluver, T. H. Jarrett, D. A. Dale, J. -D. T. Smith, M. J. I. Brown, W. van Kempen, E. Lengerer, R. Incoll et al.
In this work we present source-tailored WISE mid-infrared photometry (at
3.4$\mu$m, 4.6$\mu$m, 12$\mu$m, and 23$\mu$m) of 2812 galaxies in the extended
Spitzer Survey of Stellar Structure in Galaxies (S$^4$G) sample, and
characterise the mid-infrared colors and dust properties of this legacy nearby
galaxy data set. Informed by the relative emission between W3 (12$\mu$m) and
W4 (23$\mu$m), we re-derive star formation rate (SFR) scaling relations
calibrated to L$_{\rm TIR}$, which results in improved agreement between the
two tracers. By inverse-variance weighting the W3 and W4-derived SFRs, we
generate a combined mid-infrared SFR that is a broadly robust measure of star
formation activity in dusty, star-forming galaxies in the nearby Universe. In
addition, we investigate the use of a W3-derived dust density metric,
$\Sigma_{\rm 12\mu m}$ (L$_\odot$/kpc$^2$), to estimate the SFR deficit of low
mass, low dust galaxies. This is achieved by combining WISE with existing GALEX
ultraviolet (UV) photometry, which we further use to explore the relationship
between dust and UV emission as a function of morphology. Finally, we use our
derived SFR prescriptions to examine the location of galaxies in the log SFR -
log M$_\textrm{stellar}$ plane, as a function of morphological type, which
underscores the complexity of dust-derived properties seen in galaxies of
progressively earlier type.
Authors' comments: Accepted to ApJ
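The inverse-variance weighting used to combine the W3- and W4-derived SFRs can be sketched as follows. This is the textbook estimator for independent measurements; the function name and the example SFR values and uncertainties are hypothetical, and the paper may apply additional steps not shown here.

```python
import math
from typing import List, Tuple


def inverse_variance_mean(values: List[float], sigmas: List[float]) -> Tuple[float, float]:
    """Combine independent measurements, weighting each by 1 / sigma^2.

    Returns the weighted mean and its 1-sigma uncertainty.
    """
    if len(values) != len(sigmas) or not values:
        raise ValueError("values and sigmas must be non-empty and equal in length")
    weights = [1.0 / (s * s) for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)


# Hypothetical SFRs (solar masses per year) from W3 and W4 with their uncertainties.
sfr_w3, sigma_w3 = 2.0, 1.0
sfr_w4, sigma_w4 = 4.0, 2.0

sfr_combined, sigma_combined = inverse_variance_mean([sfr_w3, sfr_w4], [sigma_w3, sigma_w4])
print(sfr_combined, sigma_combined)
```

The combined value is pulled toward the more precise tracer, and its uncertainty is smaller than either input uncertainty.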
Qian Tao, Wenyuan Yu, Jingren Zhou
Large language models have shown exceptional capabilities in a wide range of
tasks, such as text generation and video generation, among others. However, due
to their massive parameter count, these models often require substantial
storage space, imposing significant constraints on the machines deploying LLMs.
To overcome this limitation, one research direction proposes to compress the
models using integer replacements for floating-point numbers, in a process
known as Quantization. Some recent studies suggest quantizing the key and value
cache (KV Cache) of LLMs, and designing quantization techniques that treat the
key and value matrices equivalently.
This work delves deeper into the asymmetric structural roles of KV Cache, a
phenomenon where the transformer's output loss is more sensitive to the
quantization of key matrices. We conduct a systematic examination of the
attention output error resulting from key and value quantization. The
phenomenon inspires us to propose an asymmetric quantization strategy. Our
approach allows for 1-bit quantization of the KV cache by implementing distinct
configurations for key and value matrices. We carry out experiments across a
variety of datasets, demonstrating that our proposed model allows for the
quantization of up to 75% of decoder layers with 1 bit, while simultaneously
maintaining performance levels comparable to those of the models with
floating-point parameters.
Authors' comments: 12 pages, 4 figures
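To make the idea of distinct key and value configurations concrete, here is a minimal sketch of a uniform min-max quantizer applied with different bit widths. The quantizer, the choice of 2 bits for keys and 1 bit for values, and the toy data are all assumptions made for illustration; the abstract does not specify the paper's actual configurations.

```python
from typing import List


def quantize(values: List[float], bits: int) -> List[float]:
    """Uniform min-max quantization to 2**bits levels, returned in dequantized form."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def mean_abs_error(original: List[float], approx: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Toy row shared by the "key" and "value" caches so the errors are comparable.
row = [0.0, 0.2, 0.9, 1.0]

# Hypothetical asymmetric configuration: more precision for keys than for values.
KEY_BITS, VALUE_BITS = 2, 1
key_q = quantize(row, KEY_BITS)
value_q = quantize(row, VALUE_BITS)

key_error = mean_abs_error(row, key_q)
value_error = mean_abs_error(row, value_q)
print(value_q, key_error, value_error)
```

At 1 bit every entry collapses to the row minimum or maximum, so the matrix whose quantization the loss is more sensitive to is the natural place to spend extra precision.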
Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang
Recent work on pruning large language models (LLMs) has shown that one can
eliminate a large number of parameters without compromising performance, making
pruning a promising strategy to reduce LLM model size. Existing LLM pruning
strategies typically assign uniform pruning ratios across layers, limiting
overall pruning ability; and recent work on layerwise pruning of LLMs is often
based on heuristics that can easily lead to suboptimal performance. In this
paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in
particular the shape of empirical spectral densities (ESDs) of weight matrices,
to design improved layerwise pruning ratios for LLMs. Our analysis reveals a
wide variability in how well-trained, and thus relatedly how prunable,
different layers of an LLM are. Based on this, we propose AlphaPruning, which
uses shape metrics to allocate layerwise sparsity ratios in a more
theoretically principled manner. AlphaPruning can be used in conjunction with
multiple existing LLM pruning methods. Our empirical results show that
AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable
perplexity, marking a first in the literature on LLMs. We have open-sourced our
code at https://github.com/haiquanlu/AlphaPruning.
Authors' comments: NeurIPS 2024, first two authors contributed equally
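One way to picture shape-metric-based sparsity allocation is sketched below: a Hill estimator gives a power-law exponent for the tail of a layer's eigenvalue spectrum, and a linear map turns per-layer exponents into sparsity ratios whose mean equals the global target. The estimator variant, the linear map, the `spread` parameter, and the assumption that a lower exponent marks a better-trained layer that should be pruned less are illustrative choices, not a description of the released code.

```python
import math
from typing import List


def hill_alpha(eigenvalues: List[float], k: int) -> float:
    """Hill estimate of the power-law exponent from the k largest eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    if not 0 < k < len(lam):
        raise ValueError("k must satisfy 0 < k < number of eigenvalues")
    threshold = lam[k]  # the (k+1)-th largest eigenvalue
    return 1.0 + k / sum(math.log(value / threshold) for value in lam[:k])


def allocate_sparsity(alphas: List[float], target: float, spread: float) -> List[float]:
    """Map per-layer alphas linearly to sparsities whose mean equals `target`.

    Assumption: layers with larger alpha receive higher sparsity.
    """
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        return [target] * len(alphas)
    scaled = [(a - lo) / (hi - lo) for a in alphas]
    centre = sum(scaled) / len(scaled)
    return [target + spread * (s - centre) for s in scaled]


toy_alpha = hill_alpha([8.0, 4.0, 2.0, 1.0], k=2)

# Hypothetical per-layer alphas for a three-layer model.
layer_sparsity = allocate_sparsity([2.0, 3.0, 4.0], target=0.7, spread=0.2)
print(toy_alpha, layer_sparsity)
```

Keeping the mean sparsity fixed means such an allocation can be paired with any per-layer pruning method that accepts a sparsity ratio, in line with the abstract's remark that AlphaPruning works in conjunction with existing methods.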
Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen
Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to all layers of embedding to make the model robust to the simulated embedding drift. However, applying layer-wise uniform perturbation may lead to excess perturbations for some particular safety-irrelevant layers, resulting in defense performance degradation and unnecessary memory consumption. To address this limitation, we propose Targeted Vaccine (T-Vaccine), a memory-efficient safety alignment method that applies perturbation to only selected layers of the model. T-Vaccine follows two core steps: First, it uses gradient norm as a statistical metric to identify the safety-critical layers. Second, instead of applying uniform perturbation across all layers, T-Vaccine only applies perturbation to the safety-critical layers while keeping other layers frozen during training. Results show that T-Vaccine outperforms Vaccine in terms of both defense effectiveness and resource efficiency. Comparison with other defense baselines, e.g., RepNoise and TAR, also demonstrates the superiority of T-Vaccine. Notably, T-Vaccine is the first defense that can address harmful fine-tuning issues for 7B pre-trained models trained on consumer GPUs with limited memory (e.g., RTX 4090). Our code is available at https://github.com/Lslland/T-Vaccine.
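The two steps described above (rank layers by gradient norm, then perturb only the selected layers while freezing the rest) can be sketched in a few lines. Selecting the top-k layers deterministically is an assumption made here for simplicity; the layer names, gradients, and the value of `k` are hypothetical, and the released code may select layers differently.

```python
import math
from typing import Dict, List


def gradient_norms(layer_grads: Dict[str, List[float]]) -> Dict[str, float]:
    """L2 norm of each layer's (flattened) gradient."""
    return {name: math.sqrt(sum(g * g for g in grads)) for name, grads in layer_grads.items()}


def select_safety_critical(layer_grads: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k layers with the largest gradient norm (assumed selection rule)."""
    norms = gradient_norms(layer_grads)
    return sorted(norms, key=lambda name: norms[name], reverse=True)[:k]


# Hypothetical flattened gradients for a four-layer model.
layer_grads = {
    "layer_0": [0.1, 0.1],
    "layer_1": [3.0, 4.0],
    "layer_2": [0.5, 0.0],
    "layer_3": [1.0, 1.0],
}

critical = select_safety_critical(layer_grads, k=2)
frozen = [name for name in layer_grads if name not in critical]
print(critical, frozen)
```

Only the layers in `critical` would then receive perturbation and gradient updates, which is where the memory savings relative to perturbing every layer would come from.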
Yanfeng Jiang, Zelan Yang, Bohua Chen, Shen Li, Yong Li, Tao Li
Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning. However, the diversity of downstream tasks and practical requirements makes deploying multiple full-parameter fine-tuned models challenging. Current methods that compress the delta weight struggle to achieve ultra-high compression, failing to minimize the deployment overhead. To address the above issue, we propose a novel distribution-driven delta compression framework DeltaDQ, which utilizes Group-wise Dropout and Separate Quantization to achieve ultra-high compression for the delta weight. We have observed that the matrix-computed intermediate results for the delta weight exhibit extremely small variance and min-max range characteristics, referred to as Balanced Intermediate Results. Exploiting this phenomenon, we introduce Group-wise Dropout to perform dropout on the delta weight using an optimal group size. Furthermore, using Separate Quantization, sparse weights are quantized and decomposed to achieve a lower bit. Experimental results show that DeltaDQ achieves 16x compression with improved accuracy compared to baselines for WizardMath and WizardCoder models across different parameter scales. Moreover, DeltaDQ demonstrates the ability for ultra-high compression ratio, achieving 128x compression for the WizardMath-7B model and 512x compression for the WizardMath-70B model.
Shuo Xie, Mohamad Amin Mohamadi, Zhiyuan Li
Adam outperforms SGD when training language models. Yet this advantage is not well understood theoretically: previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$ and are already minimax-optimal in non-convex cases, with both rates being $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
Cai Dieball, Yasamin Mohebi Satalsari, Angel B. Zuccolotto-Bernezb, Stefan U. Egelhaaf, Manuel A. Escobedo-Sánchez, Aljaž Godec
We investigate path-wise observables in experiments on driven colloids in a periodic light field to dissect selected intricate transport features, kinetics, and transition-path time statistics out of thermodynamic equilibrium. These observables directly reflect the properties of individual paths in contrast to the properties of an ensemble of particles, such as radial distribution functions or mean-squared displacements. In particular, we present two distinct albeit equivalent formulations of the underlying stochastic equation of motion, highlight their respective practical relevance, and show how to interchange between them. We discuss conceptually different notions of local velocities and interrogate one- and two-sided first-passage and transition-path time statistics in and out of equilibrium. Our results reiterate how path-wise observables may be employed to systematically assess the quality of experimental data and demonstrate that, given sufficient control and sampling, one may quantitatively verify subtle theoretical predictions.
David Alonso
We present a generalisation of the standard pseudo-$C_\ell$ approach for
power spectrum estimation to the case of spin-$s$ fields weighted by a general
positive-definite weight matrix that couples the different spin components of
the field (e.g. $Q$ and $U$ maps in CMB polarisation analyses, or $\gamma_1$
and $\gamma_2$ shear components in weak lensing). Relevant use cases are, for
example, data with significantly anisotropic noise properties, or situations in
which different masks must be applied to the different field components. The
weight matrix map is separated into a spin-0 part, which corresponds to the
"mask" in the standard pseudo-$C_\ell$ approach, and a spin-$2s$ part sourced
solely by the anisotropic elements of the matrix, leading to additional
coupling between angular scales and $E/B$ modes. The general expressions for
the mode-coupling coefficients involving the power spectra of these anisotropic
weight components are derived and validated. The generalised algorithm is as
computationally efficient as the standard approach. We implement the method in
the public code NaMaster.
Authors' comments: Accepted in the Open Journal of Astrophysics
Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney
The increasing size of deep learning models has made distributed training
across multiple devices essential. However, current methods such as distributed
data-parallel training suffer from large communication and synchronization
overheads when training across devices, leading to longer training times as a
result of suboptimal hardware utilization. Asynchronous stochastic gradient
descent (ASGD) methods can improve training speed, but are sensitive to delays
due to both communication and differences in throughput. Moreover, the
backpropagation algorithm used within ASGD workers is bottlenecked by the
interlocking between its forward and backward passes. Current methods also do
not take advantage of the large differences in the computation required for the
forward and backward passes. Therefore, we propose an extension to ASGD called
Partial Decoupled ASGD (PD-ASGD) that addresses these issues. PD-ASGD uses
separate threads for the forward and backward passes, decoupling the updates
and allowing for a higher ratio of forward to backward threads than the usual
1:1 ratio, leading to higher throughput. PD-ASGD also performs layer-wise
(partial) model updates concurrently across multiple threads. This reduces
parameter staleness and consequently improves robustness to delays. Our
approach yields close to state-of-the-art results while running up to
$5.95\times$ faster than synchronous data parallelism in the presence of
delays, and up to $2.14\times$ faster than comparable ASGD algorithms by
achieving higher model flops utilization. We mathematically describe the
gradient bias introduced by our method, establish an upper bound, and prove
convergence.
Authors' comments: 17 pages, 5 figures
Minchan Kim, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim
We present SegINR, a novel approach to neural Text-to-Speech (TTS) that
addresses sequence alignment without relying on an auxiliary duration predictor
and complex autoregressive (AR) or non-autoregressive (NAR) frame-level
sequence modeling. SegINR simplifies the process by converting text sequences
directly into frame-level features. It leverages an optimal text encoder to
extract embeddings, transforming each into a segment of frame-level features
using a conditional implicit neural representation (INR). This method, named
segment-wise INR (SegINR), models temporal dynamics within each segment and
autonomously defines segment boundaries, reducing computational costs. We
integrate SegINR into a two-stage TTS framework, using it for semantic token
prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate
that SegINR outperforms conventional methods in speech quality with
computational efficiency.
Authors' comments: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
Xianlong Wang, Minghui Li, Wei Liu, Hangtao Zhang, Shengshan Hu, Yechao Zhang, Ziqi Zhou, Hai Jin
Traditional unlearnable strategies have been proposed to prevent unauthorized
users from training on 2D image data. With more 3D point cloud data
containing sensitive information, unauthorized usage of this new type of data
has also become a serious concern. To address this, we propose the first
integral unlearnable framework for 3D point clouds including two processes: (i)
we propose an unlearnable data protection scheme, involving a class-wise
setting established by a category-adaptive allocation strategy and
multi-transformations assigned to samples; (ii) we propose a data restoration
scheme that utilizes class-wise inverse matrix transformation, thus enabling
authorized-only training for unlearnable data. This restoration process is a
practical issue overlooked in most existing unlearnable literature, i.e., even
authorized users struggle to gain knowledge from 3D unlearnable data. Both
theoretical and empirical results (including 6 datasets, 16 models, and 2
tasks) demonstrate the effectiveness of our proposed unlearnable framework. Our
code is available at https://github.com/CGCL-codes/UnlearnablePC.
Authors' comments: NeurIPS 2024
Minh Duong Nguyen, Khanh Le, Khoi Do, Nguyen H. Tran, Duc Nguyen, Chien Trinh, Zhaohui Yang
In personalized Federated Learning (pFL), high data heterogeneity can cause significant gradient divergence across devices, adversely affecting the learning process. This divergence, especially when gradients from different users form an obtuse angle during aggregation, can negate progress, leading to severe weight and gradient update degradation. To address this issue, we introduce a new approach to pFL design, namely Federated Learning with Layer-wise Aggregation via Gradient Analysis (FedLAG), utilizing the concept of gradient conflict at the layer level. Specifically, when layer-wise gradients of different clients form acute angles, those gradients align in the same direction, enabling updates across different clients toward identifying client-invariant features. Conversely, when layer-wise gradient pairs form obtuse angles, the layers tend to focus on client-specific tasks. In light of this, FedLAG assigns layers for personalization based on the extent of layer-wise gradient conflicts. Specifically, layers with gradient conflicts are excluded from the global aggregation process. The theoretical evaluation demonstrates that when integrated into other pFL baselines, FedLAG enhances pFL performance by a certain margin. Therefore, our proposed method achieves superior convergence behavior compared with other baselines. Extensive experiments show that our FedLAG outperforms several state-of-the-art methods and can be easily incorporated with many existing methods to further enhance performance.
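The layer-level conflict test described above can be sketched with cosine similarity: a negative cosine between two clients' gradients for a layer corresponds to an obtuse angle. Treating a layer as conflicting when any client pair has a negative cosine, and using a threshold of exactly zero, are simplifying assumptions; the function names and toy gradients are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Gradients = Dict[str, List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two flattened gradients (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def split_layers(client_grads: List[Gradients]) -> Tuple[List[str], List[str]]:
    """Partition layers into (shared, personalized) by pairwise gradient conflict."""
    shared, personalized = [], []
    for layer in client_grads[0]:
        grads = [client[layer] for client in client_grads]
        conflict = any(
            cosine(grads[i], grads[j]) < 0.0
            for i in range(len(grads))
            for j in range(i + 1, len(grads))
        )
        (personalized if conflict else shared).append(layer)
    return shared, personalized


def aggregate(client_grads: List[Gradients], shared: List[str]) -> Gradients:
    """Average only the shared layers across clients."""
    n = len(client_grads)
    return {
        layer: [sum(vals) / n for vals in zip(*(client[layer] for client in client_grads))]
        for layer in shared
    }


# Two hypothetical clients with two layers each.
clients = [
    {"conv": [1.0, 0.0], "head": [1.0, 0.0]},
    {"conv": [1.0, 1.0], "head": [-1.0, 0.5]},
]
shared, personalized = split_layers(clients)
global_update = aggregate(clients, shared)
print(shared, personalized, global_update)
```

Here the `head` gradients point in opposing directions, so that layer stays local to each client, while `conv` is averaged into the global update.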
Eduard Tulchinskii, Laida Kushnareva, Kristian Kuznetsov, Anastasia Voznyuk, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
A standard way to evaluate the abilities of an LLM involves presenting a multiple-choice question and selecting the option with the highest logit as the model's predicted answer. However, such a format for evaluating LLMs has limitations, since even if the model knows the correct answer, it may struggle to select the corresponding letter simply due to difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal the model's underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific select-and-copy heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, achieving nearly perfect accuracy, therefore demonstrating the method's efficiency in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
While large language models (LLMs) have seen unprecedented advancements in capabilities and applications across a variety of use-cases, safety alignment of these models is still an area of active research. The fragile nature of LLMs, even models that have undergone extensive alignment and safety training regimes, warrants additional safety steering steps via training-free, inference-time methods. While recent work in the area of mechanistic interpretability has investigated how activations in latent representation spaces may encode concepts, and thereafter performed representation engineering to induce such concepts in LLM outputs, the applicability of such methods to safety is relatively under-explored. Unlike recent inference-time safety steering works, in this paper we explore safety steering of LLM outputs using: (i) category-specific steering vectors, thereby enabling fine-grained control over the steering, and (ii) sophisticated methods for extracting informative steering vectors for more effective safety steering while retaining quality of the generated text. We demonstrate our exploration on multiple LLMs and datasets, and showcase the effectiveness of the proposed steering method, along with a discussion on the implications and best practices.
Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang et al.
The expanding context windows in large language models (LLMs) have greatly
enhanced their capabilities in various applications, but they also introduce
significant challenges in maintaining low latency, particularly in Time to
First Token (TTFT). This paper identifies that the sharp rise in TTFT as
context length increases is predominantly driven by queuing delays, which are
caused by the growing demands for GPU Key-Value (KV) cache allocation clashing
with the limited availability of KV cache blocks. To address this issue, we
propose LayerKV, a simple yet effective plug-in method that reduces
TTFT without requiring additional hardware or compromising output performance,
while seamlessly integrating with existing parallelism strategies and
scheduling techniques. Specifically, LayerKV introduces layer-wise KV block
allocation, management, and offloading for fine-grained control over system
memory, coupled with an SLO-aware scheduler to optimize overall Service Level
Objectives (SLOs). Comprehensive evaluations on representative models, ranging
from 7B to 70B parameters, across various GPU configurations, demonstrate that
LayerKV improves TTFT latency by up to 69x and reduces SLO violation rates by
28.7%, significantly enhancing the user experience.
Authors' comments: 11 pages, 7 figures, 1 table
Shahed Masoudian, Markus Frohmann, Navid Rekabsaz, Markus Schedl
Language models frequently inherit societal biases from their training data.
Numerous techniques have been proposed to mitigate these biases during both the
pre-training and fine-tuning stages. However, fine-tuning a pre-trained
debiased language model on a downstream task can reintroduce biases into the
model. Additionally, existing debiasing methods for downstream tasks either (i)
require labels of protected attributes (e.g., age, race, or political views)
that are often not available or (ii) rely on indicators of bias, which
restricts their applicability to gender debiasing since they rely on
gender-specific words. To address this, we introduce a novel debiasing
regularization technique based on the class-wise variance of embeddings.
Crucially, our method does not require attribute labels and targets any
attribute, thus addressing the shortcomings of existing debiasing methods. Our
experiments on encoder language models and three datasets demonstrate that our
method outperforms existing strong debiasing baselines that rely on target
attribute labels while maintaining performance on the target task.
Authors' comments: Accepted to EMNLP 2024
Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Jose M Martínez
Merging parameters of multiple models has resurfaced as an effective strategy to enhance task performance and robustness, but prior work is limited by the high costs of ensemble creation and inference. In this paper, we leverage the abundance of freely accessible trained models to introduce a cost-free approach to model merging. It focuses on a layer-wise integration of merged models, aiming to maintain the distinctiveness of the task-specific final layers while unifying the initial layers, which are primarily associated with feature extraction. This approach ensures parameter consistency across all layers, essential for boosting performance. Moreover, it facilitates seamless integration of knowledge, enabling effective merging of models from different datasets and tasks. Specifically, we investigate its applicability in Unsupervised Domain Adaptation (UDA), an unexplored area for model merging, for Semantic and Panoptic Segmentation. Experimental results demonstrate substantial UDA improvements without additional costs for merging same-architecture models from distinct datasets ($\uparrow 2.6\%$ mIoU) and different-architecture models with a shared backbone ($\uparrow 6.8\%$ mIoU). Furthermore, merging Semantic and Panoptic Segmentation models increases mPQ by $7\%$. These findings are validated across a wide variety of UDA strategies, architectures, and datasets.
Dohyeong Kim, Hyeokjin Kwon, Junseok Kim, Gunmin Lee, Songhwai Oh
As the complexity of tasks addressed through reinforcement learning (RL)
increases, the definition of reward functions has also become highly
complicated. We introduce an RL method aimed at simplifying the reward-shaping
process through intuitive strategies. Initially, instead of a single reward
function composed of various terms, we define multiple reward and cost
functions within a constrained multi-objective RL (CMORL) framework. For tasks
involving sequential complex movements, we segment the task into distinct
stages and define multiple rewards and costs for each stage. Finally, we
introduce a practical CMORL algorithm that maximizes objectives based on these
rewards while satisfying constraints defined by the costs. The proposed method
has been successfully demonstrated across a variety of acrobatic tasks in both
simulation and real-world environments. Additionally, it has been shown to
perform these tasks successfully in comparison with existing RL and
constrained RL algorithms. Our code is available at
https://github.com/rllab-snu/Stage-Wise-CMORL.
Authors' comments: 7 pages