Suneel Nadipalli
Fine-tuning pre-trained transformers is a powerful technique for enhancing
the performance of base models on specific tasks. From early applications in
models like BERT to fine-tuning Large Language Models (LLMs), this approach has
been instrumental in adapting general-purpose architectures for specialized
downstream tasks. Understanding the fine-tuning process is crucial for
uncovering how transformers adapt to specific objectives, retain general
representations, and acquire task-specific features. This paper explores the
underlying mechanisms of fine-tuning, specifically in the BERT transformer, by
analyzing activation similarity, training Sparse AutoEncoders (SAEs), and
visualizing token-level activations across different layers. Based on
experiments conducted across multiple datasets and BERT layers, we observe a
steady progression in how features adapt to the task at hand: early layers
primarily retain general representations, middle layers act as a transition
between general and task-specific features, and later layers fully specialize
in task adaptation. These findings provide key insights into the inner workings
of fine-tuning and its impact on representation learning within transformer
architectures.
Authors' comments: 14 pages, 5 figures
Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) from a theoretical perspective. Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70$\%$ sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50$\%$, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively.
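As a rough illustration of the allocation idea described above, the following sketch generates layer-wise sparsity rates as an increasing arithmetic progression controlled by a single common difference. The function name, the choice to centre the progression so that its mean equals the target sparsity, and the example values are assumptions made for this sketch; the abstract does not specify them.

```python
def arithmetic_sparsity(num_layers, target_sparsity, common_diff):
    """Return per-layer sparsity rates forming an arithmetic progression.

    Assumption of this sketch: the progression is centred so that the
    mean rate equals target_sparsity. common_diff is the single
    hyperparameter that would be searched over.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    if common_diff < 0:
        raise ValueError("common_diff must be non-negative")
    # First term chosen so that the average of all terms is target_sparsity.
    first = target_sparsity - common_diff * (num_layers - 1) / 2
    rates = [first + common_diff * i for i in range(num_layers)]
    if rates[0] < 0.0 or rates[-1] > 1.0:
        raise ValueError("common_diff too large: rates leave [0, 1]")
    return rates


# Hypothetical example: 32 layers, 70% average sparsity.
rates = arithmetic_sparsity(num_layers=32, target_sparsity=0.7, common_diff=0.005)
print(rates[0], rates[-1])
```

Because only `common_diff` varies, a search over allocations reduces to trying a handful of scalar values.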
Xinghan Pan
This paper investigates the efficacy of RWKV, a novel language model
architecture known for its linear attention mechanism, for generating sentence
embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate
the semantic similarity captured by embeddings from different hidden layers of
a pre-trained RWKV model. The performance is assessed on the Microsoft Research
Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared
against a GloVe-based baseline. My results indicate that while RWKV embeddings
capture some semantic relatedness, they underperform compared to the GloVe
baseline in terms of Spearman correlation. I also analyze the inference time
and GPU memory usage, highlighting the computational trade-offs associated with
RWKV embeddings. The findings suggest that while RWKV offers potential
advantages in terms of linear scaling, its zero-shot sentence embedding quality
for semantic similarity tasks requires further investigation and potential
task-specific fine-tuning to match or exceed simpler baselines.
Authors' comments: 17 pages, 3 tables, preprint on arXiv, includes detailed analysis of
RWKV for semantic similarity tasks
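The evaluation protocol described in this abstract (cosine similarity between sentence embeddings, scored against labels with Spearman correlation) can be sketched in plain Python. This is a minimal illustration, not the author's code; the toy embeddings and labels are hypothetical, and tied values receive average ranks, which matters when labels are binary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical sentence-pair embeddings and paraphrase labels (1 = paraphrase).
pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.0, 1.0]), ([0.5, 0.5], [0.6, 0.4])]
labels = [1, 0, 1]
similarities = [cosine(u, v) for u, v in pairs]
print(spearman(similarities, labels))
```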
Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu
Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards. This endows language models with stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
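For context, the standard (sequence-level) DPO objective that this line of work builds on can be written in a few lines. The sketch below is the vanilla DPO loss for a single preference pair, not the step-wise loss proposed in the abstract, whose exact form is not given here; the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Vanilla DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the
    trained policy and under the frozen reference model.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2).
print(dpo_loss(-10.0, -14.0, -11.0, -12.0))
```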
Zheng Li, Bingxu Xie, Chao Chu, Weiqing Li, Zhiyong Su
Geometry quality assessment (GQA) of colorless point clouds is crucial for evaluating the performance of emerging point cloud-based solutions (e.g., watermarking, compression, and 3-Dimensional (3D) reconstruction). Unfortunately, existing objective GQA approaches are traditional full-reference metrics, whereas state-of-the-art learning-based point cloud quality assessment (PCQA) methods target both color and geometry distortions; neither is suited to the no-reference GQA task. In addition, the lack of large-scale GQA datasets with subjective scores, which are always imprecise, biased, and inconsistent, also hinders the development of learning-based GQA metrics. Driven by these limitations, this paper proposes a no-reference geometry-only quality assessment approach based on list-wise rank learning, termed LRL-GQA, which comprises a geometry quality assessment network (GQANet) and a list-wise rank learning network (LRLNet). The proposed LRL-GQA formulates the no-reference GQA as a list-wise rank problem, with the objective of directly optimizing the entire quality ordering. Specifically, a large dataset containing a variety of geometry-only distortions is constructed first, named the LRL dataset, in which each sample is label-free but coupled with quality ranking information. Then, the GQANet is designed to capture intrinsic multi-scale patch-wise geometric features in order to predict a quality index for each point cloud. After that, the LRLNet leverages the LRL dataset and a likelihood loss to train the GQANet and ranks the input list of degraded point clouds according to their distortion levels. In addition, the pre-trained GQANet can be fine-tuned further to obtain absolute quality scores. Experimental results demonstrate the superior performance of the proposed no-reference LRL-GQA method compared with existing full-reference GQA metrics.
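One common way to realise a list-wise likelihood loss is the negative log-likelihood of the ground-truth ordering under a Plackett-Luce model (the ListMLE formulation). The sketch below assumes that formulation; the abstract does not state which likelihood the authors use, so treat this as an illustration of list-wise rank learning in general rather than of LRLNet itself.

```python
import math


def listwise_likelihood_loss(scores_in_true_order):
    """Negative log-likelihood of the true ranking (Plackett-Luce / ListMLE).

    scores_in_true_order[0] is the predicted quality score of the item
    that should rank first, [1] of the item that should rank second, etc.
    """
    loss = 0.0
    n = len(scores_in_true_order)
    for i in range(n):
        tail = scores_in_true_order[i:]
        # Stable log-sum-exp over the items not yet placed.
        m = max(tail)
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_norm - scores_in_true_order[i]
    return loss


# Hypothetical predicted scores for three point clouds, listed from
# least to most distorted: a correct ordering yields the lower loss.
print(listwise_likelihood_loss([3.0, 2.0, 1.0]))
print(listwise_likelihood_loss([1.0, 2.0, 3.0]))
```

The loss depends on the whole ordering at once, which is what distinguishes list-wise training from pairwise ranking objectives.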
Zhenheng Tang, Zichen Tang, Junlin Huang, Xinglin Pan, Rudan Yan, Yuxin Wang, Amelie Chi Zhou, Shaohuai Shi et al.
The growth of large language models (LLMs) makes it increasingly challenging to accelerate distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in geo-distributed data centers. Communication in geo-distributed data parallel training (DDP) with synchronous stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Local SGD mitigates communication overhead by reducing synchronization frequency, and recent studies have successfully applied it to pre-train LLMs in geo-distributed settings. However, we identify that its model synchronization mechanism prevents communication from being overlapped with computation. To overcome this limitation, we expand the design space of local SGD by decoupling model synchronization layer by layer. In each iteration, only some layers are synchronized, instead of the entire model being synchronized after a specific number of iterations. Leveraging this methodology, we introduce DreamDDP, a training framework to accelerate low-bandwidth distributed training with three key innovations: (1) partial local SGD with theoretical assurances of convergence rates comparable to S-SGD; (2) overlapping parameter synchronization with computation without extra GPU memory occupation; (3) identifying and exploiting three properties to schedule the communication and computation to reduce the training time based on fine-grained profiling of layer-wise communication and computation time. Empirical evaluations conducted on 32 GPUs using prominent deep learning models, including ResNet-18, ResNet-50, GPT-2, and Llama-2, demonstrate that DreamDDP enhances the convergence properties of Local SGD (and Adam) and achieves speedups ranging from $1.49\times$ to $3.91\times$ over leading baseline methods.
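The idea of synchronizing only some layers per iteration can be illustrated with a toy schedule. In the sketch below, layers are split into contiguous groups, one group per iteration of a synchronization period, and "synchronization" is a plain average over workers. The contiguous partition and both function names are assumptions of this sketch; the abstract says the actual schedule is derived from profiling of layer-wise communication and computation time.

```python
def layer_sync_schedule(num_layers, sync_period):
    """Assign each layer to exactly one iteration slot within a period."""
    if num_layers < 1 or sync_period < 1:
        raise ValueError("num_layers and sync_period must be positive")
    schedule = [[] for _ in range(sync_period)]
    for layer in range(num_layers):
        # Contiguous, near-equal groups (an illustrative choice).
        schedule[layer * sync_period // num_layers].append(layer)
    return schedule


def partial_sync_step(workers, layers_to_sync):
    """Average the selected layers across workers, in place.

    workers: one list per worker holding a scalar "parameter" per layer.
    """
    for layer in layers_to_sync:
        mean = sum(w[layer] for w in workers) / len(workers)
        for w in workers:
            w[layer] = mean


# Hypothetical setup: 2 workers, 6 layers, period of 3 iterations.
workers = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [3.0, 2.0, 1.0, 0.0, 5.0, 8.0]]
schedule = layer_sync_schedule(num_layers=6, sync_period=3)
for iteration in range(3):
    # A local gradient step would happen here; omitted in this toy example.
    partial_sync_step(workers, schedule[iteration % 3])
print(schedule, workers)
```

Over one full period every layer is synchronized once, so the communication volume matches a full-model synchronization but is spread across iterations, which is what creates room for overlapping it with computation.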
Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu et al.
To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
Zhiwen Ruan, Yixia Li, He Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang, Yun Chen, Guanhua Chen
Despite being pretrained on multilingual corpora, large language models
(LLMs) exhibit suboptimal performance on low-resource languages. Recent
approaches have leveraged multilingual encoders alongside LLMs by introducing
trainable parameters connecting the two models. However, these methods
typically focus on the encoder's output, overlooking valuable information from
other layers. We propose \aname (\mname), a framework that integrates
representations from all encoder layers, coupled with the \attaname mechanism
to enable layer-wise interaction between the LLM and the multilingual encoder.
Extensive experiments on multilingual reasoning tasks, along with analyses of
learned representations, show that our approach consistently outperforms
existing baselines.
Authors' comments: In Findings of NAACL 2025 (The 2025 Annual Conference of the Nations
of the Americas Chapter of the ACL)
Toshinori Kitamura, Arnob Ghosh, Tadashi Kozuno, Wataru Kumagai, Kazumi Kasaura, Kenta Hoshino, Yohei Hosoe, Yutaka Matsuo
We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
Thierry Paul, Stefano Rossi, Emmanuel Trélat, Eth Zurich
We consider interacting multi-agent systems where the interaction is not only pairwise but involves simultaneous interactions among multiple agents (multiple-wise interaction). By passing through the mesoscopic and macroscopic limits with a fixed multiple-wise interaction of order $m$, we derive a macroscopic equation in the limit $m \rightarrow \infty$, capturing the dominant effects in large-size multiple-wise order.
Ahmed Elhussein, Gamze Gürsoy
Non-identically distributed data is a major challenge in Federated Learning (FL). Personalized FL tackles this by balancing local model adaptation with global model consistency. One variant, partial FL, leverages the observation that early layers learn more transferable features by federating only early layers. However, current partial FL approaches use predetermined, architecture-specific rules to select layers, limiting their applicability. We introduce Principled Layer-wise-FL (PLayer-FL), which uses a novel federation sensitivity metric to identify layers that benefit from federation. This metric, inspired by model pruning, quantifies each layer's contribution to cross-client generalization after the first training epoch, identifying a transition point in the network where the benefits of federation diminish. We first demonstrate that our federation sensitivity metric shows strong correlation with established generalization measures across diverse architectures. Next, we show that PLayer-FL outperforms existing FL algorithms on a range of tasks, also achieving more uniform performance improvements across clients.
Yan Dai, Moise Blanchard, Patrick Jaillet
We study a repeated resource allocation problem with strategic agents where
monetary transfers are disallowed and the central planner has no prior
information on agents' utility distributions. In light of Arrow's impossibility
theorem, acquiring information about agent preferences through some form of
feedback is necessary. We assume that the central planner can request powerful
but expensive audits on the winner in any round, revealing the true utility of
the winner in that round. We design a mechanism achieving $T$-independent
$O(K^2)$ social welfare regret while only requesting $O(K^3 \log T)$ audits in
expectation, where $K$ is the number of agents and $T$ is the number of rounds.
We also show an $\Omega(K)$ lower bound on the regret and an $\Omega(1)$ lower
bound on the number of audits when having low regret. Algorithmically, we show
that incentive-compatibility can be mostly enforced via the imposition of
adaptive future punishments, where the audit probability is inversely
proportional to the winner's future winning probability. To accurately estimate
such probabilities in presence of strategic agents, who may adversely react to
any potential misestimate, we introduce a flagging component that allows agents
to flag any biased estimate (we show that doing so aligns with individual
incentives). On the technical side, without a unique and known distribution,
one cannot apply the revelation principle and conclude that truthful reporting
is exactly an equilibrium. Instead, we characterize the equilibrium via a
reduction to a simpler auxiliary game, in which agents cannot strategize until
close to the end of the game; we show equilibria in this game can induce
equilibria in the actual, fully strategic game. The tools developed therein may
be of independent interest for other mechanism design problems in which the
revelation principle cannot be readily applied.
Authors' comments: Accepted for presentation at the Conference on Learning Theory (COLT)
2025
Jason Wu, Kang Yang, Lance Kaplan, Mani Srivastava
Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Current multimodal systems employ static resource provisioning and cannot easily adapt when compute resources change over time. Additionally, their reliance on processing sensor data with fixed feature extractors is ill-equipped to handle variations in modality quality. Consequently, uninformative modalities, such as those with high noise, needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet compute resource constraints, and continually reallocates layers across input modalities according to their modality quality. Our evaluations show that ADMN can match the accuracy of state-of-the-art networks while reducing their floating-point operations by up to 75%.
Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim, Jong Hwan Ko
Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells for higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant.
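To make "independent column-wise scale factors" concrete, here is a minimal sketch of symmetric uniform quantization in which every column of a weight matrix gets its own scale. The max-abs scaling rule and the function names are assumptions for illustration; the abstract does not describe the authors' quantizer in this detail, and their released code should be consulted for the actual scheme.

```python
def quantize_columnwise(weights, num_bits):
    """Quantize a matrix (list of rows) with one scale factor per column.

    Returns (integer codes, per-column scales). Uses symmetric uniform
    quantization with a max-abs scale, an assumption of this sketch.
    """
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2 for signed codes")
    qmax = 2 ** (num_bits - 1) - 1
    num_cols = len(weights[0])
    scales = []
    for c in range(num_cols):
        max_abs = max(abs(row[c]) for row in weights)
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    codes = [
        [max(-qmax, min(qmax, round(row[c] / scales[c]))) for c in range(num_cols)]
        for row in weights
    ]
    return codes, scales


def dequantize_columnwise(codes, scales):
    """Map integer codes back to real values using the column scales."""
    return [[q * scales[c] for c, q in enumerate(row)] for row in codes]


# Hypothetical 2x2 weights whose columns differ in magnitude by 20x.
weights = [[0.3, 5.0], [-1.0, -20.0]]
codes, scales = quantize_columnwise(weights, num_bits=4)
print(codes, scales, dequantize_columnwise(codes, scales))
```

With a single shared scale, the small-magnitude column would be rounded almost entirely to zero; per-column scales keep its resolution independent of the large column.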
Yunhui Liu, Zhen Tao, Xiang Zhao, Jianhua Zhao, Tao Zheng, Tieke He
Multiplex graphs, with multiple edge types (graph views) among common nodes,
provide richer structural semantics and better modeling capabilities. Multiplex
Graph Neural Networks (MGNNs), typically comprising view-specific GNNs and a
multi-view integration layer, have achieved advanced performance in various
downstream tasks. However, their reliance on neighborhood aggregation poses
challenges for deployment in latency-sensitive applications. Motivated by
recent GNN-to-MLP knowledge distillation frameworks, we propose Multiplex
Graph-Free Neural Networks (MGFNN and MGFNN+) to combine MGNNs' superior
performance and MLPs' efficient inference via knowledge distillation. MGFNN
directly trains student MLPs with node features as input and soft labels from
teacher MGNNs as targets. MGFNN+ further employs a low-rank approximation-based
reparameterization to learn node-wise coefficients, enabling adaptive knowledge
ensemble from each view-specific GNN. This node-wise multi-view ensemble
distillation strategy allows student MLPs to learn more informative multiplex
semantic knowledge for different nodes. Experiments show that MGFNNs achieve
average accuracy improvements of about 10% over vanilla MLPs and perform
comparably to or even better than teacher MGNNs (accurate); MGFNNs achieve a
35.40$\times$-89.14$\times$ speedup in inference over MGNNs (efficient); MGFNN+
adaptively assigns different coefficients for multi-view ensemble distillation
regarding different nodes (interpretable).
Authors' comments: Accepted by DASFAA 2025
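The distillation setup described in this abstract, a student MLP trained on soft labels with a node-wise weighted ensemble over view-specific teachers, can be sketched for a single node as follows. This is a hedged illustration: the convex-combination ensemble, the cross-entropy form of the loss, and all names are assumptions, and the low-rank reparameterization used to learn the coefficients is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher(view_probs, coeffs):
    """Combine per-view teacher distributions with node-wise coefficients.

    view_probs: one probability vector per graph view for this node.
    coeffs: non-negative weights, one per view, normalized here to sum to 1.
    """
    total = sum(coeffs)
    weights = [c / total for c in coeffs]
    num_classes = len(view_probs[0])
    return [sum(w * p[k] for w, p in zip(weights, view_probs))
            for k in range(num_classes)]


def soft_label_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy between teacher soft labels and student predictions."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(t * (x - log_z) for t, x in zip(teacher_probs, student_logits))


# Hypothetical node with two graph views and three classes.
views = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target = ensemble_teacher(views, coeffs=[3.0, 1.0])
print(target, soft_label_cross_entropy([1.0, 0.5, -1.0], target))
```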
Jialiang Wu, Yi Shen, Sijia Liu, Yi Tang, Sen Song, Xiaoyi Wang, Longjun Cai
Despite their impressive capacities, large language models (LLMs) often
struggle with the hallucination issue of generating inaccurate or fabricated
content even when they possess correct knowledge. In this paper, we extend the
exploration of the correlation between hidden-state prediction changes and
output factuality into a deeper, token-wise level. Based on these insights, we
propose cross-layer Entropy eNhanced Decoding (END), a decoding method that
mitigates hallucinations without requiring extra training. END leverages inner
probability changes across layers to individually quantify the factual
knowledge required for each candidate token, and adjusts the final predicting
distribution to prioritize tokens with higher factuality. Experiments on both
hallucination and QA benchmarks demonstrate that END significantly enhances the
truthfulness and informativeness of generated content while maintaining robust
QA accuracy. Moreover, our work provides a deeper perspective on understanding
the correlations between inherent knowledge and output factuality.
Authors' comments: NAACL 2025 Findings
Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller
This paper conducts a comprehensive layer-wise analysis of self-supervised
learning (SSL) models for audio deepfake detection across diverse contexts,
including multilingual datasets (English, Chinese, Spanish), partial, song, and
scene-based deepfake scenarios. By systematically evaluating the contributions
of different transformer layers, we uncover critical insights into model
behavior and performance. Our findings reveal that lower layers consistently
provide the most discriminative features, while higher layers capture less
relevant information. Notably, all models achieve competitive equal error rate
(EER) scores even when employing a reduced number of layers. This indicates
that we can reduce computational costs and increase the inference speed of
detecting deepfakes by utilizing only a few lower layers. This work enhances
our understanding of SSL models in deepfake detection, offering valuable
insights applicable across varied linguistic and contextual settings. Our
trained models and code are publicly available:
https://github.com/Yaselley/SSL_Layerwise_Deepfake.
Authors' comments: Accepted to NAACL Findings 2025
Runbing Zheng
Pairwise network comparison is essential for various applications, including neuroscience, disease research, and dynamic network analysis. While existing literature primarily focuses on comparing entire network structures, we address a vertex-wise comparison problem where two random networks share the same set of vertices but allow for structural variations in some vertices, enabling a more detailed and flexible analysis of network differences. In our framework, some vertices retain their latent positions between networks, while others undergo shifts. To identify the shifted and unshifted vertices and estimate their latent position shifts, we propose a method that first derives vertex embeddings in a low-rank Euclidean space for each network, then aligns these estimated vertex latent positions into a common space to resolve potential non-identifiability, and finally tests whether each vertex is shifted or not and estimates the vertex shifts. Our theoretical results establish the test statistic for the algorithms, guide parameter selection, and provide performance guarantees. Simulation studies and real data applications, including a case-control study in disease research and dynamic network analysis, demonstrate that the proposed algorithms are both computationally efficient and effective in extracting meaningful insights from network comparisons.
Thomas T. Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, Nikolai Matni
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
Sangyeon Park, Isaac Han, Seungwon Oh, Kyung-Joong Kim
Plasticity loss, a critical challenge in neural network training, limits a
model's ability to adapt to new tasks or shifts in data distribution. This
paper introduces AID (Activation by Interval-wise Dropout), a novel method
inspired by Dropout, designed to address plasticity loss. Unlike Dropout, AID
generates subnetworks by applying Dropout with different probabilities on each
preactivation interval. Theoretical analysis reveals that AID regularizes the
network, promoting behavior analogous to that of deep linear networks, which do
not suffer from plasticity loss. We validate the effectiveness of AID in
maintaining plasticity across various benchmarks, including continual learning
tasks on standard image classification datasets such as CIFAR10, CIFAR100, and
TinyImageNet. Furthermore, we show that AID enhances reinforcement learning
performance in the Arcade Learning Environment benchmark.
Authors' comments: Accepted to ICML 2025 (poster)
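The core mechanism in this abstract, dropout applied with a different probability on each preactivation interval, can be illustrated with a two-interval toy version. The split at zero, the absence of any rescaling, and the function name are assumptions of this sketch rather than details taken from the paper.

```python
import random


def interval_dropout(preactivations, drop_prob_negative,
                     drop_prob_nonnegative, rng=None):
    """Zero each preactivation with a probability set by its interval.

    Two intervals are used here: z < 0 and z >= 0. No rescaling of the
    surviving values is applied in this sketch.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for z in preactivations:
        p = drop_prob_negative if z < 0 else drop_prob_nonnegative
        # rng.random() lies in [0, 1): p = 1 always drops, p = 0 never drops.
        out.append(0.0 if rng.random() < p else z)
    return out


z = [-2.0, 0.0, 3.0]
# Dropping negatives with probability 1 and non-negatives with probability 0
# reproduces ReLU exactly; equal probabilities reduce to ordinary dropout
# (without rescaling).
print(interval_dropout(z, 1.0, 0.0))
print(interval_dropout(z, 0.8, 0.1, rng=random.Random(0)))
```

Intermediate settings interpolate between a ReLU-like nonlinearity and a linear (identity) activation, which fits the abstract's remark about behaviour analogous to deep linear networks.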