Hayun Lee, Dongkun Shin
With the recent proliferation of on-device AI, there is an increasing need to
run computationally intensive DNNs directly on mobile devices. However, the
limited computing and memory resources of these devices necessitate effective
pruning techniques. Block-wise pruning is promising because it trades a small
accuracy drop for speedup gains, but it requires block positions to be aligned
with block size, hindering optimal position selection to minimize model
accuracy drop. Unaligned block pruning (UBP) addresses this by allowing blocks
to be selected at arbitrary positions, yet its practical use is limited by a
time-consuming optimal block selection algorithm and lack of efficient
inference kernels. In this paper, we propose a pseudo-optimal yet fast block
selection algorithm called Block Expansion and Division (BED), which can be
integrated into an iterative model training process. Additionally, we introduce
an efficient inference kernel implementation for mobile devices, enabling a
UBP-based model to achieve similar latency to a DNN model compressed by aligned
block pruning. We demonstrate the superiority of our techniques on a real
mobile phone with MobileNet and ResNet models.
Authors' comments: 11 pages, 8 figures
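To illustrate why alignment can hurt block selection in the abstract above, the following minimal sketch compares aligned and unaligned selection on a one-dimensional weight vector. It assumes blocks are scored by summed absolute magnitude and that the highest-scoring blocks are kept; both are simplifications for illustration. The exact dynamic program shown here is not the BED algorithm, and the function names are hypothetical.

```python
from typing import List


def aligned_best(weights: List[float], block: int, k: int) -> float:
    """Keep the k highest-magnitude blocks whose start index is a multiple of `block`."""
    scores = [
        sum(abs(w) for w in weights[s:s + block])
        for s in range(0, len(weights) - block + 1, block)
    ]
    return sum(sorted(scores, reverse=True)[:k])


def unaligned_best(weights: List[float], block: int, k: int) -> float:
    """Keep k non-overlapping blocks at arbitrary start positions (exact DP)."""
    n = len(weights)
    # prefix[i] holds the summed magnitude of the first i weights.
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + abs(w))
    neg = float("-inf")
    # dp[j][i]: best kept magnitude using exactly j blocks within the first i weights.
    dp = [[neg] * (n + 1) for _ in range(k + 1)]
    dp[0] = [0.0] * (n + 1)
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            dp[j][i] = dp[j][i - 1]  # weight i-1 is not the end of a block
            if i >= block and dp[j - 1][i - block] != neg:
                cand = dp[j - 1][i - block] + prefix[i] - prefix[i - block]
                dp[j][i] = max(dp[j][i], cand)
    return dp[k][n]


weights = [0, 5, 5, 0]
print(aligned_best(weights, block=2, k=1), unaligned_best(weights, block=2, k=1))
```

With the two large weights straddling an alignment boundary, the aligned selection can keep only one of them, while the unaligned selection keeps both.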
Tianxiao Zhang, Wenju Xu, Bo Luo, Guanghui Wang
The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures the global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring the fine-grained local details. Consequently, ViT lacks inductive bias during image or video dataset training. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inherent inductive bias, making them more efficient and quicker to converge than ViT with less data. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the models capture both local and global information with minimal overhead. Additionally, we introduce two architecture variants, allowing the Depth-Wise Convolution modules to be applied to multiple Transformer blocks for parameter savings, and incorporating independent parallel Depth-Wise Convolution modules with different kernels to enhance the acquisition of local information. The proposed approach significantly boosts the performance of ViT models on image classification, object detection, and instance segmentation by a large margin, especially on small datasets, as evaluated on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet for image classification, and COCO for object detection and instance segmentation. The source code can be accessed at https://github.com/ZTX-100/Efficient_ViT_with_DW.
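The "minimal overhead" of a depth-wise convolution shortcut can be made concrete by counting weights. The sketch below uses the standard parameter count of a grouped 2D convolution; the channel width of 192 and the 3x3 kernel are hypothetical values chosen for illustration and are not taken from the paper.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1, bias: bool = False) -> int:
    """Number of learnable parameters in a 2D convolution with square kernels."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel sees only c_in / groups input channels.
    weights = c_out * (c_in // groups) * kernel * kernel
    return weights + (c_out if bias else 0)


channels = 192  # hypothetical embedding width
standard = conv_params(channels, channels, kernel=3)
depthwise = conv_params(channels, channels, kernel=3, groups=channels)
print(standard, depthwise, standard // depthwise)
```

Setting `groups` equal to the channel count gives one filter per channel, so the weight count drops by a factor equal to the number of channels.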
Daniel Berend, Philip A. Ernst, Aryeh Kontorovich, Rishi Kumar
Let $M(n, k, p)$ denote the maximum probability of the event $X_1 = X_2 = \cdots = X_n=1$ under a $k$-wise independent distribution whose marginals are Bernoulli random variables with mean $p$. A long-standing question is to calculate $M(n, k, p)$ for all values of $n,k,p$. This question has been partially addressed by several authors, primarily with the goal of answering asymptotic questions. The present paper focuses on obtaining exact expressions for this probability. To this end, we provide closed-form formulas of $M(n,k,p)$ for $p$ near 0 as well as $p$ near 1.
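As orientation for the quantity defined above, two elementary bounds follow directly from the definition, assuming $1 \le k \le n$. They are not the paper's closed-form results, only the trivial envelope within which $M(n,k,p)$ must lie.

```latex
% Lower bound: the fully independent product distribution is k-wise independent.
% Upper bound: the event {X_1 = ... = X_n = 1} is contained in {X_1 = ... = X_k = 1},
% whose probability is p^k under k-wise independence.
\[
  p^{n} \;\le\; M(n,k,p) \;\le\; p^{k},
  \qquad
  M(n,n,p) = p^{n},
  \qquad
  M(n,1,p) = p .
\]
```

The last equality holds because $1$-wise independence constrains only the marginals, so taking $X_1 = X_2 = \cdots = X_n$ attains the upper bound.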
Ye Lin Tun, Chu Myaet Thwal, Minh N. H. Nguyen, Choong Seon Hong
Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving training approaches such as federated learning (FL). However, compared to conventional unimodal learning, the multimodal setting requires dedicated encoders for each modality, resulting in larger and more complex models that demand significant resources. This presents a substantial challenge for FL clients operating with limited computational resources and communication bandwidth. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach, which decomposes the training process into multiple steps. Each step focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL scenarios and multimodal learning setups to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also introduce a progressive training approach called Prog-FedMML. While it is less resource-efficient than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.
Huyen Ngo, Khoi Do, Duong Nguyen, Viet Dung Nguyen, Lan Dang
A significant challenge in electroencephalogram (EEG) analysis lies in the fact that current data representations involve multiple electrode signals, resulting in data redundancy and dominant lead information. However, extensive research on EEG classification focuses on designing model architectures without tackling these underlying issues. Moreover, there has been a notable gap in addressing data preprocessing for EEG, leading to considerable computational overhead in Deep Learning (DL) processes. In light of these issues, we propose a simple yet effective approach for EEG data pre-processing. Our method first transforms the EEG data into an encoded image by an Inverted Channel-wise Magnitude Homogenization (ICWMH) to mitigate inter-channel biases. Next, we apply an edge detection technique to the EEG-encoded image, combined with a skip connection, to emphasize the most significant transitions in the data while preserving structural and invariant information. By doing so, we can improve the EEG learning process efficiently without using a huge DL network. Our experimental evaluations reveal that our approach improves significantly (by 2% to 5%) over current baselines.
Zhourui Zhang, Jun Li, Zhijian Wu, Jifeng Shen, Jianhua Xu
Current mainstream feature-masking distillation methods mainly function by reconstructing selectively masked regions of a student network from the feature maps of a teacher network. In these methods, attention mechanisms can help to identify spatially important regions and crucial object-aware channel clues, such that the reconstructed features are encoded with sufficient discriminative and representational power similar to teacher features. However, previous feature-masking distillation methods mainly address homogeneous knowledge distillation without fully taking into account the heterogeneous knowledge distillation scenario. In particular, the huge discrepancy between the teacher and the student frameworks within the heterogeneous distillation paradigm is detrimental to feature masking, leading to deteriorated reconstructed student features. In this study, a novel dual feature-masking heterogeneous distillation framework termed DFMSD is proposed for object detection. More specifically, a stage-wise adaptation learning module is incorporated into the dual feature-masking framework, and thus the student model can be progressively adapted to the teacher models for bridging the gap between heterogeneous networks. Furthermore, a masking enhancement strategy is combined with stage-wise learning such that object-aware masking regions are adaptively strengthened to improve feature-masking reconstruction. In addition, semantic alignment is performed at each Feature Pyramid Network (FPN) layer between the teacher and the student networks for generating consistent feature distributions. Our experiments for the object detection task demonstrate the promise of our approach, suggesting that DFMSD outperforms both the state-of-the-art heterogeneous and homogeneous distillation methods.
Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang
Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times. Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and runtime settings (sub-batch size and scheduling time point) to optimize overall performance while ensuring DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, even as a preemption-free policy, SJF-BSBF reduces the average job completion time by 27-33\% relative to the state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF can wisely determine the optimal resource sharing settings, such as the sharing time point and sub-batch size for gradient accumulation, outperforming the aggressive GPU sharing approach (baseline SJF-FFS policy) by up to 17\% in large-scale traces.
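The claim that gradient accumulation preserves convergence accuracy when the batch is split into sub-batches rests on a simple identity: the size-weighted average of sub-batch gradients equals the full-batch gradient. The sketch below checks this for a one-parameter least-squares model. The model, data, and function names are hypothetical and are not part of SJF-BSBF itself.

```python
from typing import List, Tuple

Sample = Tuple[float, float]


def mse_grad(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: List[Sample], sub_batch_size: int) -> float:
    """Accumulate sub-batch gradients, weighting each by its sample count."""
    total = 0.0
    for start in range(0, len(batch), sub_batch_size):
        sub = batch[start:start + sub_batch_size]
        total += mse_grad(w, sub) * len(sub)
    return total / len(batch)


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 9.0)]
full = mse_grad(0.5, batch)
for size in (1, 2, 3):
    # Each difference should be zero up to floating-point rounding.
    print(size, abs(full - accumulated_grad(0.5, batch, size)))
```

Because the update is unchanged, the sub-batch size can be chosen for scheduling reasons, such as fitting two jobs on one GPU, without altering the effective batch size seen by the optimizer.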
Fengyu Cai, Xinran Zhao, Hongming Zhang, Iryna Gurevych, Heinz Koeppl
Recent advances in measuring hardness-wise properties of data guide language
models in sample selection within low-resource scenarios. However,
class-specific properties are overlooked for task setup and learning. How will
these properties influence model learning, and is their influence generalizable across
datasets? To answer this question, this work formally initiates the concept of
$\textit{class-wise hardness}$. Experiments across eight natural language
understanding (NLU) datasets demonstrate a consistent hardness distribution
across learning paradigms, models, and human judgment. Subsequent experiments
unveil a notable challenge in measuring such class-wise hardness with
instance-level metrics in previous works. To address this, we propose
$\textit{GeoHard}$ for class-wise hardness measurement by modeling class
geometry in the semantic embedding space. $\textit{GeoHard}$ surpasses
instance-level metrics by over 59 percent on $\textit{Pearson}$'s correlation
on measuring class-wise hardness. Our analysis theoretically and empirically
underscores the generality of $\textit{GeoHard}$ as a fresh perspective on data
diagnosis. Additionally, we showcase how understanding class-wise hardness can
practically aid in improving task learning.
Authors' comments: Findings of ACL 2024
Mijoo Kim, Junseok Kwon
With the rapid advancement in the performance of deep neural networks (DNNs),
there has been significant interest in deploying and incorporating artificial
intelligence (AI) systems into real-world scenarios. However, many DNNs lack
the ability to represent uncertainty, often exhibiting excessive confidence
even when making incorrect predictions. To ensure the reliability of AI
systems, particularly in safety-critical cases, DNNs should transparently
reflect the uncertainty in their predictions. In this paper, we investigate
robust post-hoc uncertainty calibration methods for DNNs within the context of
multi-class classification tasks. While previous studies have made notable
progress, they still face challenges in achieving robust calibration,
particularly in scenarios involving out-of-distribution (OOD) data. We identify
that previous methods lack adaptability to individual input data and struggle to
accurately estimate uncertainty when processing inputs drawn from in-the-wild
datasets. To address this issue, we introduce a novel instance-wise calibration
method based on an energy model. Our method incorporates energy scores instead
of softmax confidence scores, allowing for adaptive consideration of DNN
uncertainty for each prediction within a logit space. In experiments, we show
that the proposed method consistently maintains robust performance across the
spectrum, spanning from in-distribution to OOD scenarios, when compared to
other state-of-the-art methods.
Authors' comments: Accepted to ECCV 2024
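The contrast between energy scores and softmax confidence in the abstract above can be sketched as follows. The energy definition used here, the negative temperature-scaled log-sum-exp of the logits, is the one common in the energy-based OOD detection literature and is assumed rather than taken from this paper. How the paper turns the score into a calibration method is not shown.

```python
import math
from typing import List


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def energy_score(logits: List[float], temperature: float = 1.0) -> float:
    """Assumed definition: E(x) = -T * logsumexp(logits / T)."""
    return -temperature * logsumexp([z / temperature for z in logits])


def softmax_confidence(logits: List[float]) -> float:
    """Maximum softmax probability."""
    return math.exp(max(logits) - logsumexp(logits))


low, high = [1.0, 1.0], [10.0, 10.0]
print(softmax_confidence(low), softmax_confidence(high))
print(energy_score(low), energy_score(high))
```

Shifting all logits by a constant leaves the softmax confidence unchanged but shifts the energy by the same constant, so the energy retains logit-magnitude information that softmax normalization discards.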
Amanda Olmin, Fredrik Lindsten
Epoch-wise double descent is the phenomenon where generalisation performance improves beyond the point of overfitting, resulting in a generalisation curve exhibiting two descents over the course of learning. Understanding the mechanisms driving this behaviour is crucial not only for understanding the generalisation behaviour of machine learning models in general, but also for employing conventional selection methods, such as the use of early stopping to mitigate overfitting. While we ultimately want to draw conclusions about more complex models, such as deep neural networks, a majority of theoretical results regarding the underlying cause of epoch-wise double descent are based on simple models, such as standard linear regression. In this paper, to take a step towards more complex models in theoretical analysis, we study epoch-wise double descent in two-layer linear neural networks. First, we derive a gradient flow for the linear two-layer model that bridges the learning dynamics of the standard linear regression model and the linear two-layer diagonal network with quadratic weights. Second, we identify additional factors of epoch-wise double descent emerging with the extra model layer, by deriving necessary conditions for the generalisation error to follow a double descent pattern. While epoch-wise double descent in linear regression has been attributed to differences in input variance, in the two-layer model, the singular values of the input-output covariance matrix also play an important role. This opens up further questions regarding unidentified factors of epoch-wise double descent for truly deep models.
Jingjing Xu, Wei Zhou, Zijian Yang, Eugen Beck, Ralf Schlueter
Varying-size models are often required to deploy ASR systems under different
hardware and/or application constraints such as memory and latency. To avoid
redundant training and optimization efforts for individual models of different
sizes, we present the dynamic encoder size approach, which jointly trains
multiple performant models within one supernet from scratch. These subnets of
various sizes are layer-wise pruned from the supernet, and thus, enjoy full
parameter sharing. By combining score-based pruning with supernet training, we
propose two novel methods, Simple-Top-k and Iterative-Zero-Out, to
automatically select the best-performing subnets in a data-driven manner,
avoiding resource-intensive search efforts. Our experiments using CTC on both
Librispeech and TED-LIUM-v2 corpora show that our methods can achieve
performance on par with individually trained models of each size category. Also, our
approach consistently brings small performance improvements for the full-size
supernet.
Authors' comments: Accepted by Interspeech 2024
Ardhi Wiratama Baskara Yudha, Jiaqi Xue, Qian Lou, Huiyang Zhou, Yan Solihin
Fully Homomorphic Encryption (FHE) allows for the execution of computations
on encrypted data without the need to decrypt it first, offering significant
potential for privacy-preserving computational operations. Emerging
arithmetic-based FHE schemes (ar-FHE), like BGV, demonstrate even better
performance in word-wise comparison operations over non-arithmetic FHE (na-FHE)
schemes, such as TFHE, especially for basic tasks like comparing values,
finding maximums, and minimums. This shows the universality of ar-FHE in
effectively handling both arithmetic and non-arithmetic operations without the
expensive conversion between arithmetic and non-arithmetic FHEs. We refer to
universal arithmetic Fully Homomorphic Encryption as uFHE. The arithmetic
operations in uFHE remain consistent with those in the original arithmetic FHE,
which have seen significant acceleration. However, its non-arithmetic
comparison operations differ, are slow, and have not been as thoroughly studied
or accelerated. In this paper, we introduce BoostCom, a scheme designed to
speed up word-wise comparison operations, enhancing the efficiency of uFHE
systems. BoostCom involves multi-pronged optimizations, including infrastructure
acceleration (multi-level heterogeneous parallelization and GPU-related
improvements) and algorithm-aware optimizations (slot compaction and non-blocking
comparison semantics). Together, these give BoostCom an end-to-end performance
improvement of more than an order of magnitude (11.1x faster) compared to the
state-of-the-art CPU-based uFHE systems, across various FHE parameters and
tasks.
Authors' comments: To appear at PACT 2024
Heikki Muhli, Tapio Ala-Nissila, Miguel A. Caro
A common approach to modeling dispersion interactions and overcoming the inaccurate description of long-range correlation effects in electronic structure calculations is the use of pairwise-additive potentials, as in the Tkatchenko-Scheffler [Phys. Rev. Lett. 102, 073005 (2009)] method. In previous work [Phys. Rev. B 104, 054106 (2021)], we have shown how these are amenable to highly efficient atomistic simulation by machine learning their local parametrization. However, the atomic polarizability and the electron correlation energy have a complex and non-local many-body character and some of the dispersion effects in complex systems are not sufficiently described by these types of pairwise-additive potentials. Currently, one of the most widely used rigorous descriptions of the many-body effects is based on the many-body dispersion (MBD) model [Phys. Rev. Lett. 108, 236402 (2012)]. In this work, we show that the MBD model can also be locally parametrized to derive a local approximation for the highly non-local many-body effects. With this local parametrization, we develop an atom-wise formulation of MBD that we refer to as linear MBD (lMBD), as this decomposition enables linear scaling with system size. This model provides a transparent and controllable approximation to the full MBD model with tunable convergence parameters for a fraction of the computational cost observed in electronic structure calculations with popular density-functional theory codes. We show that our model scales linearly with the number of atoms in the system and is easily parallelizable. Furthermore, we show how using the same machinery already established in previous work for predicting Hirshfeld volumes with machine learning enables access to large-scale simulations with MBD-level corrections.
Shirley Kokane, Mostofa Rafid Uddin, Min Xu
Transfer learning methods start performing poorly when the complexity of the learning task is increased. Most of these methods calculate the cumulative differences of all the matched features and then use them to back-propagate that loss through all the layers. Contrary to these methods, in this work, we propose a novel layer-wise learning scheme that adjusts learning parameters per layer as a function of the differences in the Jacobian/Attention/Hessian of the output activations w.r.t. the network parameters. We applied this novel scheme to attention map-based and derivative-based (first and second order) transfer learning methods. We obtained improved learning performance and stability across a wide range of datasets. From extensive experimental evaluation, we observed that the performance boost achieved by our method becomes more significant with the increasing difficulty of the learning task.
Xuqi Zhu, Huaizhi Zhang, JunKyu Lee, Jiacheng Zhu, Chandrajit Pal, Sangeet Saha, Klaus D. McDonald-Maier, Xiaojun Zhai
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations, which constitute the predominant computational cost. Therefore, this paper proposes a high-throughput, scalable and energy-efficient non-element-wise matrix multiplication unit on FPGAs as a basic component of NNs. We first streamline the inter-layer and intra-layer redundancies of the MADDNESS algorithm, a LUT-based approximate matrix multiplication method, to design a fast, efficient, and scalable approximate matrix multiplication module termed "Approximate Multiplication Unit (AMU)". The AMU optimizes LUT-based matrix multiplications further through dedicated memory management and access design, decoupling computational overhead from input resolution and boosting FPGA-based NN accelerator efficiency significantly. The experimental results show that our AMU achieves up to 9x higher throughput and 112x higher energy efficiency over the state-of-the-art solutions for FPGA-based Quantised Neural Network (QNN) accelerators.
Beatrice Alessandra Motetti, Matteo Risso, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari
The resource requirements of deep neural networks (DNNs) pose significant
challenges to their deployment on edge devices. Common approaches to address
this issue are pruning and mixed-precision quantization, which lead to latency
and memory occupation improvements. These optimization techniques are usually
applied independently. We propose a novel methodology to apply them jointly via
a lightweight gradient-based search, and in a hardware-aware manner, greatly
reducing the time required to generate Pareto-optimal DNNs in terms of accuracy
versus cost (i.e., latency or memory). We test our approach on three
edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny
ImageNet. When targeting the optimization of the memory footprint, we are able
to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the
baseline networks with all weights quantized at 8 and 2-bit, respectively. Our
method surpasses a previous state-of-the-art approach with up to 56.17% size
reduction at iso-accuracy. With respect to the sequential application of
state-of-the-art pruning and mixed-precision optimizations, we obtain
comparable or superior results, but with a significantly lowered training time.
In addition, we show how well-tailored cost models can improve the cost versus
accuracy trade-offs when targeting specific hardware for deployment.
Authors' comments: Accepted for publication in IEEE Transactions on Computers
Jingheng Ye, Shang Qin, Yinghui Li, Xuxin Cheng, Libo Qin, Hai-Tao Zheng, Ying Shen, Peng Xing et al.
Existing studies explore the explainability of Grammatical Error Correction
(GEC) in a limited scenario, where they ignore the interaction between
corrections and explanations and have not established a corresponding
comprehensive benchmark. To bridge the gap, this paper first introduces the
task of EXplainable GEC (EXGEC), which focuses on the integral role of
correction and explanation tasks. To facilitate the task, we propose EXCGEC, a
tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented
samples featuring the design of hybrid edit-wise explanations. We then
benchmark several series of LLMs in multi-task learning settings, including
post-explaining and pre-explaining. To promote the development of the task, we
also build a comprehensive evaluation suite by leveraging existing automatic
metrics and conducting human evaluation experiments to demonstrate the human
consistency of the automatic metrics for free-text explanations. Our
experiments reveal the effectiveness of evaluating free-text explanations using
traditional metrics like METEOR and ROUGE, and the inferior performance of
multi-task models compared to the pipeline solution, indicating the difficulty
of establishing positive effects in learning both tasks.
Authors' comments: Accepted to AAAI 2025. 19 pages with an appendix, 10 tables, and 9
figures
Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia
Mathematical reasoning presents a significant challenge for Large Language
Models (LLMs) due to the extensive and precise chain of reasoning required for
accuracy. Ensuring the correctness of each reasoning step is critical. To
address this, we aim to enhance the robustness and factuality of LLMs by
learning from human feedback. However, Direct Preference Optimization (DPO) has
shown limited benefits for long-chain mathematical reasoning, as models
employing DPO struggle to identify detailed errors in incorrect answers. This
limitation stems from a lack of fine-grained process supervision. We propose a
simple, effective, and data-efficient method called Step-DPO, which treats
individual reasoning steps as units for preference optimization rather than
evaluating answers holistically. Additionally, we have developed a data
construction pipeline for Step-DPO, enabling the creation of a high-quality
dataset containing 10K step-wise preference pairs. We also observe that in DPO,
self-generated data is more effective than data generated by humans or GPT-4,
due to the latter's out-of-distribution nature. Our findings demonstrate that
as few as 10K preference data pairs and fewer than 500 Step-DPO training steps
can yield a nearly 3% gain in accuracy on MATH for models with over 70B
parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves
scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively,
surpassing a series of closed-source models, including GPT-4-1106,
Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at
https://github.com/dvlab-research/Step-DPO.
Authors' comments: Code, data, and models are available at
https://github.com/dvlab-research/Step-DPO
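The idea of treating an individual reasoning step as the unit of preference optimization can be sketched with the standard DPO loss. The sketch assumes that each log-probability is that of a single candidate next step, conditioned on the problem and the preceding correct steps. That conditioning, the value of `beta`, and the function name are assumptions for illustration and may differ from the released Step-DPO code.

```python
import math


def step_dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss, -log(sigmoid(margin)), for one step-wise preference pair."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference model: the loss is log(2).
print(step_dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy prefers the correct step more strongly than the reference does: the loss falls.
print(step_dpo_loss(-3.0, -7.0, -5.0, -5.0))
```

Applying the loss to a single erroneous step, rather than to a whole answer, localizes the training signal to the place where the reasoning goes wrong.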
Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu
We present a simple variable quantization approach that quantizes different
layers of a large language model (LLM) at different bit levels. Specifically,
we quantize the most important layers to higher bit precision and less
important layers to lower bits to achieve floating point quantization levels.
We propose two effective strategies to measure the importance of layers within
LLMs: the first measures the importance of a layer based on how different its
output embeddings are from the input embeddings (the higher the better); the
second estimates the importance of a layer using the number of layer weights
that are much larger than average (the smaller the better). We show that
quantizing different layers at varying bits according to our importance scores
results in minimal performance drop with a far more compressed model size.
Finally, we present several practical key takeaways from our variable
layer-wise quantization experiments: (a) LLM performance under variable
quantization remains close to the original model until 25-50% of layers are
moved to lower quantization using our proposed ordering but only until 5-10% if
moved using no specific ordering; (b) Quantizing LLMs to lower bits performs
substantially better than pruning unless extreme quantization (2-bit) is used;
and (c) Layer-wise quantization to lower bits works better in the case of
larger LLMs with more layers compared to smaller LLMs with fewer layers. The
code used to run the experiments is available at:
https://github.com/RazvanDu/LayerwiseQuant.
Authors' comments: submitted to EMNLP, 15 pages, 10 figures, 4 tables
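The two layer-importance measures described in the abstract above can be sketched roughly as follows. The concrete choices, one minus cosine similarity for the embedding change, a threshold of three times the mean magnitude for "much larger than average", and 4-bit versus 2-bit levels, are hypothetical stand-ins, since the abstract does not specify them.

```python
import math
from typing import List


def embedding_change(inp: List[float], out: List[float]) -> float:
    """First measure (assumed form): 1 - cosine similarity; higher means more important."""
    dot = sum(a * b for a, b in zip(inp, out))
    norm = math.sqrt(sum(a * a for a in inp)) * math.sqrt(sum(b * b for b in out))
    return 1.0 - dot / norm


def outlier_count(weights: List[float], factor: float = 3.0) -> int:
    """Second measure (assumed form): weights far above the mean magnitude.

    Per the abstract, a smaller count indicates a more important layer.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    return sum(1 for w in weights if abs(w) > factor * mean_abs)


def assign_bits(importance: List[float], high_bits: int = 4, low_bits: int = 2,
                low_fraction: float = 0.5) -> List[int]:
    """Give the least important fraction of layers the lower bit width."""
    n_low = int(len(importance) * low_fraction)
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    low = set(order[:n_low])
    return [low_bits if i in low else high_bits for i in range(len(importance))]


print(assign_bits([0.9, 0.1, 0.5, 0.3]))
```

Mixing two bit widths across layers in this way yields a fractional average bit width for the whole model, which is one reading of the "floating point quantization levels" mentioned in the abstract.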
Måns Williamson, Monika Eisenmann, Tony Stillfjord
Choosing the optimization algorithm that performs best on a given machine learning problem is often delicate, and there is no guarantee that current state-of-the-art algorithms will perform well across all tasks. Consequently, the more reliable methods that one has at hand, the larger the likelihood of a good end result. To this end, we introduce and analyze a large class of stochastic so-called soft-clipping schemes with a broad range of applications. Despite the wide adoption of clipping techniques in practice, soft-clipping methods have not been analyzed to a large extent in the literature. In particular, a rigorous mathematical analysis is lacking in the general, nonlinear case. Our analysis lays a theoretical foundation for a large class of such schemes, and motivates their usage. In particular, under standard assumptions such as Lipschitz continuous gradients of the objective function, we give rigorous proofs of convergence in expectation. These include rates in both the convex and the non-convex case, as well as almost sure convergence to a stationary point in the non-convex case. The computational cost of the analyzed schemes is essentially the same as that of stochastic gradient descent.
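To make the contrast between hard and soft clipping concrete, the sketch below uses one simple smooth clipping function, $g / (1 + |g|/\gamma)$, applied per component. The abstract covers a large class of schemes and does not single out this function, so it is an assumption chosen for illustration. The function names are hypothetical.

```python
from typing import List


def hard_clip(g: float, gamma: float) -> float:
    """Standard clipping: truncate the component to the interval [-gamma, gamma]."""
    return max(-gamma, min(gamma, g))


def soft_clip(g: float, gamma: float) -> float:
    """Assumed soft-clipping function: close to g for small |g|, bounded by gamma in magnitude."""
    return g / (1.0 + abs(g) / gamma)


def soft_clipped_sgd_step(params: List[float], grads: List[float],
                          lr: float, gamma: float) -> List[float]:
    """One SGD step with component-wise soft clipping of the stochastic gradient."""
    return [p - lr * soft_clip(g, gamma) for p, g in zip(params, grads)]


for g in (0.01, 1.0, 100.0):
    print(g, hard_clip(g, 1.0), soft_clip(g, 1.0))
print(soft_clipped_sgd_step([1.0], [3.0], lr=0.1, gamma=1.0))
```

The soft variant costs one extra division per component, consistent with the statement that the computational cost is essentially that of stochastic gradient descent.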