Zijian Wang, Bin Wang, Haifeng Jing, Huayu Li, Hongbo Dou
In recent years, multi-hop reasoning has been widely studied for knowledge graph
(KG) reasoning due to its efficacy and interpretability. However, previous
multi-hop reasoning approaches are subject to two primary shortcomings. First,
agents struggle to learn effective and robust policies at the early phase due
to sparse rewards. Second, these approaches often falter on specific datasets
like sparse knowledge graphs, where agents are required to traverse lengthy
reasoning paths. To address these problems, we propose a multi-hop reasoning
model with dual agents based on hierarchical reinforcement learning (HRL),
which is named FULORA. FULORA tackles the above reasoning challenges by
eFficient GUidance-ExpLORAtion between dual agents. The high-level agent walks
on the simplified knowledge graph to provide stage-wise hints for the low-level
agent walking on the original knowledge graph. In this framework, the low-level
agent optimizes a value function that balances two objectives: (1) maximizing
return, and (2) integrating efficient guidance from the high-level agent.
Experiments conducted on three real-world knowledge graph datasets demonstrate
that FULORA outperforms RL-based baselines, especially in the case of
long-distance reasoning.
Authors' comments: Accepted by AAAI-25
Chenyan Liu, Yufan Cai, Yun Lin, Yuhuan Huang, Yunrui Pei, Bo Jiang, Ping Yang, Jin Song Dong et al.
Recent years have seen the development of LLM-based code generation. Compared
to generating code in a software project, incremental code edits are
empirically observed to be more frequent. The emerging code editing approaches
usually formulate the problem as generating an edit based on known relevant
prior edits and context. However, practical code edits can be more complicated.
First, an editing session can include multiple (ir)relevant edits to the code
under edit. Second, the inference of the subsequent edits is non-trivial as the
scope of its ripple effect can be the whole project. In this work, we propose
CoEdPilot, an LLM-driven solution to recommend code edits by discriminating the
relevant edits, exploring their interactive natures, and estimating their ripple
effects in the project. Specifically, CoEdPilot orchestrates multiple neural
transformers to identify what and how to edit in the project regarding both
edit location and edit content. When a user accomplishes an edit with an
optional editing description, a Subsequent Edit Analysis first reports the most
relevant files in the project with what types of edits (e.g., keep, insert, and
replace) can happen for each line of their code. Next, an Edit-content
Generator generates concrete edit options for the lines of code, regarding its
relevant prior changes reported by an Edit-dependency Analyzer. Lastly, both
the Subsequent Edit Analysis and the Edit-content Generator capture relevant
prior edits as feedback to readjust their recommendations. We train our models
by collecting over 180K commits from 471 open-source projects in 5 programming
languages. Our extensive experiments show that CoEdPilot can well predict the
edits (i.e., predicting edit location with an accuracy of 70.8%-85.3%, and the
edit content with an exact match rate of 41.8% and BLEU4 score of 60.7)...
Authors' comments: 13 pages, 7 figures
Seungeun Oh, Sihun Baek, Jihong Park, Hyelin Nam, Praneeth Vepakomma, Ramesh Raskar, Mehdi Bennis, Seong-Lyun Kim
In computer vision, the vision transformer (ViT) has increasingly superseded
the convolutional neural network (CNN) for improved accuracy and robustness.
However, ViT's large model sizes and high sample complexity make it difficult
to train on resource-constrained edge devices. Split learning (SL) emerges as a
viable solution, leveraging server-side resources to train ViTs while utilizing
private data from distributed devices. However, SL requires additional
information exchange for weight updates between the device and the server,
which can be exposed to various attacks on private training data. To mitigate
the risk of data breaches in classification tasks, inspired by the CutMix
regularization, we propose a novel privacy-preserving SL framework that injects
Gaussian noise into smashed data and mixes randomly chosen patches of smashed
data across clients, coined DP-CutMixSL. Our analysis demonstrates that
DP-CutMixSL is a differentially private (DP) mechanism that strengthens privacy
protection against membership inference attacks during forward propagation.
Through simulations, we show that DP-CutMixSL improves privacy protection
against membership inference attacks, reconstruction attacks, and label
inference attacks, while also improving accuracy compared to DP-SL and
DP-MixSL.
Authors' comments: 23 pages, 11 figures, 8 tables, to be published in Transactions on
Machine Learning Research (TMLR)
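The two operations named in the DP-CutMixSL abstract above (Gaussian noise injection into smashed data, then mixing randomly chosen patches across clients) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the array layout `(num_clients, num_patches, dim)`, the uniform per-patch choice of client, and the CutMix-style soft label are all assumptions made for the example.

```python
import numpy as np

def dp_cutmix_smashed(smashed, noise_std, rng):
    """Add Gaussian noise to smashed data, then mix patches across clients.

    smashed: array of shape (num_clients, num_patches, dim), one sample per client
             (hypothetical layout chosen for this sketch).
    Returns the mixed sample of shape (num_patches, dim) and, for each patch
    position, the index of the client that contributed it.
    """
    num_clients, num_patches, _ = smashed.shape
    # Gaussian noise injection on every client's smashed data.
    noisy = smashed + rng.normal(0.0, noise_std, size=smashed.shape)
    # For each patch position, pick uniformly which client's patch is kept.
    owner = rng.integers(0, num_clients, size=num_patches)
    mixed = noisy[owner, np.arange(num_patches)]
    return mixed, owner

def mixed_label(labels_onehot, owner):
    """Soft label weighted by each client's share of patches (as in CutMix)."""
    num_clients = labels_onehot.shape[0]
    weights = np.bincount(owner, minlength=num_clients) / owner.size
    return weights @ labels_onehot
```

The noise scale that yields a given differential privacy guarantee is derived in the paper's analysis and is not reproduced here; `noise_std` is left as a free parameter.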
Lukas Kratochvila, Gijs de Jong, Monique Arkesteijn, Simon Bilik, Tomas Zemcik, Karel Horak, Jan S. Rellermeyer
Digital twins have major potential to form a significant part of urban management in emergency planning, as they allow more efficient design of escape routes, better orientation in exceptional situations, and faster rescue intervention. Nevertheless, creating the twins remains a largely manual effort due to a lack of 3D representations, which are available only in limited amounts for some new buildings. Thus, in this paper we aim to synthesize 3D information from commonly available 2D architectural floor plans. We propose two novel pixel-wise segmentation methods based on the MDA-Unet and MACU-Net architectures with improved skip connections, an attention mechanism, and a training objective, together with a reconstruction part of the pipeline that vectorizes the segmented plans to create a 3D model. The proposed methods are compared with two other state-of-the-art techniques on several benchmark datasets. On the commonly used CubiCasa benchmark dataset, our methods achieve a mean F1 score of 0.86 over five examined classes, outperforming the other pixel-wise approaches tested. We have also made our code publicly available to support research in the field.
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix
This paper introduces a novel approach called sentence-wise speech
summarization (Sen-SSum), which generates text summaries from a spoken document
in a sentence-by-sentence manner. Sen-SSum combines the real-time processing of
automatic speech recognition (ASR) with the conciseness of speech
summarization. To explore this approach, we present two datasets for Sen-SSum:
Mega-SSum and CSJ-SSum. Using these datasets, our study evaluates two types of
Transformer-based models: 1) cascade models that combine ASR and strong text
summarization models, and 2) end-to-end (E2E) models that directly convert
speech into a text summary. While E2E models are appealing for developing
compute-efficient models, they perform worse than cascade models. Therefore, we
propose knowledge distillation for E2E models using pseudo-summaries generated
by the cascade models. Our experiments show that this proposed knowledge
distillation effectively improves the performance of the E2E model on both
datasets.
Authors' comments: Accepted to Interspeech2024. Dataset:
https://huggingface.co/datasets/komats/mega-ssum
Hayun Lee, Dongkun Shin
With the recent proliferation of on-device AI, there is an increasing need to
run computationally intensive DNNs directly on mobile devices. However, the
limited computing and memory resources of these devices necessitate effective
pruning techniques. Block-wise pruning is promising due to its low accuracy
drop tradeoff for speedup gains, but it requires block positions to be aligned
with block size, hindering optimal position selection to minimize model
accuracy drop. Unaligned block pruning (UBP) addresses this by allowing blocks
to be selected at arbitrary positions, yet its practical use is limited by a
time-consuming optimal block selection algorithm and lack of efficient
inference kernels. In this paper, we propose a pseudo-optimal yet fast block
selection algorithm called Block Expansion and Division (BED), which can be
integrated into an iterative model training process. Additionally, we introduce
an efficient inference kernel implementation for mobile devices, enabling a
UBP-based model to achieve similar latency to a DNN model compressed by aligned
block pruning. We demonstrate the superiority of our techniques on a real
mobile phone with MobileNet and ResNet models.
Authors' comments: 11 pages, 8 figures
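The difference between aligned and unaligned block positions described in the abstract above can be seen on a single weight row. The sketch below is an assumption-laden toy: it exhaustively searches for one block with the smallest L1 norm to prune, which is neither the paper's BED algorithm nor its inference kernel, and the function name and example weights are invented for illustration.

```python
import numpy as np

def best_block_start(row, block, aligned):
    """Start index of the length-`block` window with the smallest L1 norm.

    aligned=True  restricts candidates to multiples of `block`.
    aligned=False allows any start position (unaligned block pruning).
    """
    step = block if aligned else 1
    starts = range(0, len(row) - block + 1, step)
    return min(starts, key=lambda s: np.abs(row[s:s + block]).sum())

# Toy weight row: the two smallest weights straddle an aligned block boundary.
row = np.array([0.9, 0.01, 0.02, 0.8, 0.7, 0.6, 0.5, 0.4])
block = 2

a = best_block_start(row, block, aligned=True)   # candidates: 0, 2, 4, 6
u = best_block_start(row, block, aligned=False)  # candidates: 0, 1, ..., 6

pruned_aligned = np.abs(row[a:a + block]).sum()
pruned_unaligned = np.abs(row[u:u + block]).sum()
```

Because every aligned start is also an unaligned candidate, the pruned magnitude under the unaligned search can never exceed the aligned one; the cost is a larger search space, which is the motivation the abstract gives for a faster selection algorithm.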
Tianxiao Zhang, Wenju Xu, Bo Luo, Guanghui Wang
The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures the global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring the fine-grained local details. Consequently, ViT lacks inductive bias during image or video dataset training. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inherent inductive bias, making them more efficient and quicker to converge than ViT with less data. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the models capture both local and global information with minimal overhead. Additionally, we introduce two architecture variants, allowing the Depth-Wise Convolution modules to be applied to multiple Transformer blocks for parameter savings, and incorporating independent parallel Depth-Wise Convolution modules with different kernels to enhance the acquisition of local information. The proposed approach significantly boosts the performance of ViT models on image classification, object detection, and instance segmentation by a large margin, especially on small datasets, as evaluated on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet for image classification, and COCO for object detection and instance segmentation. The source code can be accessed at https://github.com/ZTX-100/Efficient_ViT_with_DW.
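A depth-wise convolution used as a shortcut around a Transformer block, as the abstract above describes, might look like the following NumPy sketch. The 3x3 kernel size, zero padding, the absence of a class token, and the names `depthwise_conv3x3` and `block_with_dw_shortcut` are assumptions for illustration; the repository linked in the abstract is the reference for the actual design.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding and stride 1.

    x:       (H, W, C) feature map.
    kernels: (3, 3, C), one spatial filter per channel (no channel mixing).
    """
    H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            # Shifted view of the input, scaled per channel by the kernel tap.
            out += padded[i:i + H, j:j + W, :] * kernels[i, j, :]
    return out

def block_with_dw_shortcut(tokens, grid, transformer_block, kernels):
    """Add a depth-wise convolution branch that bypasses a Transformer block.

    tokens: (N, C) patch tokens with N = H * W (no class token assumed).
    grid:   (H, W) spatial layout of the patches.
    transformer_block: any callable mapping (N, C) -> (N, C).
    """
    H, W = grid
    local = depthwise_conv3x3(tokens.reshape(H, W, -1), kernels)
    return transformer_block(tokens) + local.reshape(tokens.shape)
```

Each channel has only nine weights here, which is why such a branch adds little overhead compared with the attention and MLP layers it bypasses.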
Daniel Berend, Philip A. Ernst, Aryeh Kontorovich, Rishi Kumar
Let $M(n, k, p)$ denote the maximum probability of the event $X_1 = X_2 = \cdots = X_n=1$ under a $k$-wise independent distribution whose marginals are Bernoulli random variables with mean $p$. A long-standing question is to calculate $M(n, k, p)$ for all values of $n,k,p$. This question has been partially addressed by several authors, primarily with the goal of answering asymptotic questions. The present paper focuses on obtaining exact expressions for this probability. To this end, we provide closed-form formulas of $M(n,k,p)$ for $p$ near 0 as well as $p$ near 1.
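As a quick orientation for the quantity defined above, two elementary bounds follow directly from the definition. They are standard observations added here for illustration, assuming $1 \le k \le n$ and $0 \le p \le 1$; they are not the closed-form results of the paper.

```latex
% Lower bound: the fully independent distribution is also k-wise independent.
% Upper bound: {X_1 = ... = X_n = 1} is contained in {X_1 = ... = X_k = 1},
%              and any k of the variables are independent, so that event has
%              probability p^k.
\[
  p^{n} \;\le\; M(n,k,p) \;\le\; p^{k}.
\]
% Boundary cases: k = n forces full independence; k = 1 constrains only the
% marginals, and X_1 = ... = X_n (a single shared Bernoulli(p) draw) attains p.
\[
  M(n,n,p) = p^{n}, \qquad M(n,1,p) = p.
\]
```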
Ye Lin Tun, Chu Myaet Thwal, Minh N. H. Nguyen, Choong Seon Hong
Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving training approaches such as federated learning (FL). However, compared to conventional unimodal learning, the multimodal setting requires dedicated encoders for each modality, resulting in larger and more complex models that demand significant resources. This presents a substantial challenge for FL clients operating with limited computational resources and communication bandwidth. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach, which decomposes the training process into multiple steps. Each step focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL scenarios and multimodal learning setups to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also introduce a progressive training approach called Prog-FedMML. While it offers lower resource efficiency than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.
Huyen Ngo, Khoi Do, Duong Nguyen, Viet Dung Nguyen, Lan Dang
A significant challenge in electroencephalogram (EEG) analysis lies in the fact that current data representations involve multiple electrode signals, resulting in data redundancy and dominant lead information. However, extensive research on EEG classification focuses on designing model architectures without tackling these underlying issues. Moreover, there has been a notable gap in addressing data preprocessing for EEG, leading to considerable computational overhead in deep learning (DL) processes. In light of these issues, we propose a simple yet effective approach for EEG data preprocessing. Our method first transforms the EEG data into an encoded image by an Inverted Channel-wise Magnitude Homogenization (ICWMH) to mitigate inter-channel biases. Next, we apply an edge detection technique to the EEG-encoded image, combined with a skip connection, to emphasize the most significant transitions in the data while preserving structural and invariant information. By doing so, we can improve the EEG learning process efficiently without using a huge DL network. Our experimental evaluations reveal that we can significantly improve (i.e., by 2% to 5%) over current baselines.
Zhourui Zhang, Jun Li, Zhijian Wu, Jifeng Shen, Jianhua Xu
Current mainstream feature masking distillation methods mainly function by reconstructing selectively masked regions of a student network from the feature maps of a teacher network. In these methods, attention mechanisms can help to identify spatially important regions and crucial object-aware channel clues, such that the reconstructed features are encoded with sufficient discriminative and representational power similar to teacher features. However, previous feature-masking distillation methods mainly address homogeneous knowledge distillation without fully taking into account the heterogeneous knowledge distillation scenario. In particular, the huge discrepancy between the teacher and the student frameworks within the heterogeneous distillation paradigm is detrimental to feature masking, leading to deteriorating reconstructed student features. In this study, a novel dual feature-masking heterogeneous distillation framework termed DFMSD is proposed for object detection. More specifically, a stage-wise adaptation learning module is incorporated into the dual feature-masking framework, and thus the student model can be progressively adapted to the teacher models for bridging the gap between heterogeneous networks. Furthermore, a masking enhancement strategy is combined with stage-wise learning such that object-aware masking regions are adaptively strengthened to improve feature-masking reconstruction. In addition, semantic alignment is performed at each Feature Pyramid Network (FPN) layer between the teacher and the student networks for generating consistent feature distributions. Our experiments for the object detection task demonstrate the promise of our approach, suggesting that DFMSD outperforms both the state-of-the-art heterogeneous and homogeneous distillation methods.
Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang
Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times. Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and runtime settings (sub-batch size and scheduling time point) to optimize overall performance while ensuring DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, even as a preemption-free policy, SJF-BSBF reduces the average job completion time by 27-33\% relative to the state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF can wisely determine the optimal resource sharing settings, such as the sharing time point and sub-batch size for gradient accumulation, outperforming the aggressive GPU sharing approach (baseline SJF-FFS policy) by up to 17\% in large-scale traces.
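The abstract above relies on gradient accumulation to keep convergence behaviour unchanged when a job runs with a smaller sub-batch size. The sketch below illustrates why that works for a mean-reduced loss, using linear regression as a stand-in model; the model, the function names, and the data are assumptions for illustration and say nothing about the scheduler itself.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def accumulated_grad(w, X, y, sub_batch):
    """Accumulate sub-batch gradients, each weighted by its share of the batch.

    The weighting makes the result equal to the full-batch gradient even when
    the last sub-batch is smaller than the others.
    """
    total = np.zeros_like(w)
    for start in range(0, len(y), sub_batch):
        Xs = X[start:start + sub_batch]
        ys = y[start:start + sub_batch]
        total += mse_grad(w, Xs, ys) * len(ys) / len(y)
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)

full = mse_grad(w, X, y)
even = accumulated_grad(w, X, y, sub_batch=8)    # four equal sub-batches
uneven = accumulated_grad(w, X, y, sub_batch=5)  # last sub-batch has 2 rows
```

Up to floating-point rounding, `full`, `even`, and `uneven` agree, so the optimizer sees the same update regardless of how the batch is split; only memory use and step time per sub-batch change.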
Fengyu Cai, Xinran Zhao, Hongming Zhang, Iryna Gurevych, Heinz Koeppl
Recent advances in measuring hardness-wise properties of data guide language
models in sample selection within low-resource scenarios. However,
class-specific properties are overlooked for task setup and learning. How will
these properties influence model learning, and is this influence generalizable across
datasets? To answer this question, this work formally initiates the concept of
$\textit{class-wise hardness}$. Experiments across eight natural language
understanding (NLU) datasets demonstrate a consistent hardness distribution
across learning paradigms, models, and human judgment. Subsequent experiments
unveil a notable challenge in measuring such class-wise hardness with
instance-level metrics in previous works. To address this, we propose
$\textit{GeoHard}$ for class-wise hardness measurement by modeling class
geometry in the semantic embedding space. $\textit{GeoHard}$ surpasses
instance-level metrics by over 59 percent on $\textit{Pearson}$'s correlation
on measuring class-wise hardness. Our analysis theoretically and empirically
underscores the generality of $\textit{GeoHard}$ as a fresh perspective on data
diagnosis. Additionally, we showcase how understanding class-wise hardness can
practically aid in improving task learning.
Authors' comments: Findings of ACL 2024
Mijoo Kim, Junseok Kwon
With the rapid advancement in the performance of deep neural networks (DNNs),
there has been significant interest in deploying and incorporating artificial
intelligence (AI) systems into real-world scenarios. However, many DNNs lack
the ability to represent uncertainty, often exhibiting excessive confidence
even when making incorrect predictions. To ensure the reliability of AI
systems, particularly in safety-critical cases, DNNs should transparently
reflect the uncertainty in their predictions. In this paper, we investigate
robust post-hoc uncertainty calibration methods for DNNs within the context of
multi-class classification tasks. While previous studies have made notable
progress, they still face challenges in achieving robust calibration,
particularly in scenarios involving out-of-distribution (OOD) data. We identify that
previous methods lack adaptability to individual input data and struggle to
accurately estimate uncertainty when processing inputs drawn from the wild
dataset. To address this issue, we introduce a novel instance-wise calibration
method based on an energy model. Our method incorporates energy scores instead
of softmax confidence scores, allowing for adaptive consideration of DNN
uncertainty for each prediction within a logit space. In experiments, we show
that the proposed method consistently maintains robust performance across the
spectrum, spanning from in-distribution to OOD scenarios, when compared to
other state-of-the-art methods.
Authors' comments: Accepted to ECCV 2024
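To make the contrast between energy scores and softmax confidence in the abstract above concrete, the sketch below computes both from raw logits. It uses the commonly used free-energy definition $E(x) = -T \log \sum_j \exp(z_j / T)$; whether the paper uses exactly this form, and how it maps energy to a calibrated confidence, is not stated in the abstract, so treat the functions as illustrative assumptions.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T), computed in a stable way."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

def softmax_confidence(logits):
    """Maximum softmax probability for each row of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

# Two logit vectors that differ only by a constant shift of 9.
logits = np.array([[10.0, 0.0, 0.0],
                   [1.0, -9.0, -9.0]])

conf = softmax_confidence(logits)  # identical: softmax is shift-invariant
energy = energy_score(logits)      # differs by exactly the shift
```

Softmax discards the overall scale of the logits, so both rows receive the same confidence, while the energy score keeps that information. This is one reason energy is a candidate signal for per-instance adaptation in logit space.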
Amanda Olmin, Fredrik Lindsten
Epoch-wise double descent is the phenomenon where generalisation performance improves beyond the point of overfitting, resulting in a generalisation curve exhibiting two descents over the course of learning. Understanding the mechanisms driving this behaviour is crucial not only for understanding the generalisation behaviour of machine learning models in general, but also for employing conventional selection methods, such as the use of early stopping to mitigate overfitting. While we ultimately want to draw conclusions about more complex models, such as deep neural networks, a majority of theoretical results regarding the underlying cause of epoch-wise double descent are based on simple models, such as standard linear regression. In this paper, to take a step towards more complex models in theoretical analysis, we study epoch-wise double descent in two-layer linear neural networks. First, we derive a gradient flow for the linear two-layer model that bridges the learning dynamics of the standard linear regression model and the linear two-layer diagonal network with quadratic weights. Second, we identify additional factors of epoch-wise double descent emerging with the extra model layer, by deriving necessary conditions for the generalisation error to follow a double descent pattern. While epoch-wise double descent in linear regression has been attributed to differences in input variance, in the two-layer model the singular values of the input-output covariance matrix also play an important role. This opens up further questions regarding unidentified factors of epoch-wise double descent for truly deep models.
Jingjing Xu, Wei Zhou, Zijian Yang, Eugen Beck, Ralf Schlueter
Varying-size models are often required to deploy ASR systems under different
hardware and/or application constraints such as memory and latency. To avoid
redundant training and optimization efforts for individual models of different
sizes, we present the dynamic encoder size approach, which jointly trains
multiple performant models within one supernet from scratch. These subnets of
various sizes are layer-wise pruned from the supernet, and thus, enjoy full
parameter sharing. By combining score-based pruning with supernet training, we
propose two novel methods, Simple-Top-k and Iterative-Zero-Out, to
automatically select the best-performing subnets in a data-driven manner,
avoiding resource-intensive search efforts. Our experiments using CTC on both
Librispeech and TED-LIUM-v2 corpora show that our methods can achieve performance
on par with individually trained models of each size category. Also, our
approach consistently brings small performance improvements for the full-size
supernet.
Authors' comments: Accepted by Interspeech 2024
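The idea of picking layer-wise pruned subnets from per-layer scores, mentioned in the abstract above, can be illustrated with a plain top-k selection. This is a hypothetical sketch: the abstract does not say how scores are obtained or exactly how Simple-Top-k and Iterative-Zero-Out use them, so the scores, the function name, and the selection rule below are assumptions.

```python
import numpy as np

def top_k_layers(scores, k):
    """Indices of the k highest-scoring layers, returned in network order."""
    scores = np.asarray(scores, dtype=float)
    # Stable sort on negated scores: ties keep the earlier layer first.
    keep = np.argsort(-scores, kind="stable")[:k]
    return sorted(keep.tolist())

# Hypothetical learned importance scores for a 6-layer encoder.
layer_scores = [0.2, 0.9, 0.1, 0.7, 0.5, 0.3]

small = top_k_layers(layer_scores, 2)   # smallest subnet
medium = top_k_layers(layer_scores, 3)  # contains every layer of `small`
```

With a single shared score vector, each smaller subnet is a subset of the larger ones, which is consistent with the full parameter sharing the abstract describes.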
Ardhi Wiratama Baskara Yudha, Jiaqi Xue, Qian Lou, Huiyang Zhou, Yan Solihin
Fully Homomorphic Encryption (FHE) allows for the execution of computations
on encrypted data without the need to decrypt it first, offering significant
potential for privacy-preserving computational operations. Emerging
arithmetic-based FHE schemes (ar-FHE), like BGV, demonstrate even better
performance in word-wise comparison operations over non-arithmetic FHE (na-FHE)
schemes, such as TFHE, especially for basic tasks like comparing values,
finding maximums, and minimums. This shows the universality of ar-FHE in
effectively handling both arithmetic and non-arithmetic operations without the
expensive conversion between arithmetic and non-arithmetic FHEs. We refer to
universal arithmetic Fully Homomorphic Encryption as uFHE. The arithmetic
operations in uFHE remain consistent with those in the original arithmetic FHE,
which have seen significant acceleration. However, its non-arithmetic
comparison operations differ, are slow, and have not been as thoroughly studied
or accelerated. In this paper, we introduce BoostCom, a scheme designed to
speed up word-wise comparison operations, enhancing the efficiency of uFHE
systems. BoostCom involves multi-pronged optimizations, including infrastructure
acceleration (multi-level heterogeneous parallelization and GPU-related
improvements) and algorithm-aware optimizations (slot compaction and non-blocking
comparison semantics). Overall, BoostCom achieves an end-to-end performance
improvement of more than an order of magnitude (11.1x faster) compared to the
state-of-the-art CPU-based uFHE systems, across various FHE parameters and
tasks.
Authors' comments: To appear at PACT 2024
Heikki Muhli, Tapio Ala-Nissila, Miguel A. Caro
A common approach to modeling dispersion interactions and overcoming the inaccurate description of long-range correlation effects in electronic structure calculations is the use of pairwise-additive potentials, as in the Tkatchenko-Scheffler [Phys. Rev. Lett. 102, 073005 (2009)] method. In previous work [Phys. Rev. B 104, 054106 (2021)], we have shown how these are amenable to highly efficient atomistic simulation by machine learning their local parametrization. However, the atomic polarizability and the electron correlation energy have a complex and non-local many-body character and some of the dispersion effects in complex systems are not sufficiently described by these types of pairwise-additive potentials. Currently, one of the most widely used rigorous descriptions of the many-body effects is based on the many-body dispersion (MBD) model [Phys. Rev. Lett. 108, 236402 (2012)]. In this work, we show that the MBD model can also be locally parametrized to derive a local approximation for the highly non-local many-body effects. With this local parametrization, we develop an atom-wise formulation of MBD that we refer to as linear MBD (lMBD), as this decomposition enables linear scaling with system size. This model provides a transparent and controllable approximation to the full MBD model with tunable convergence parameters for a fraction of the computational cost observed in electronic structure calculations with popular density-functional theory codes. We show that our model scales linearly with the number of atoms in the system and is easily parallelizable. Furthermore, we show how using the same machinery already established in previous work for predicting Hirshfeld volumes with machine learning enables access to large-scale simulations with MBD-level corrections.
Shirley Kokane, Mostofa Rafid Uddin, Min Xu
Transfer learning methods start to perform poorly as the complexity of the learning task increases. Most of these methods calculate the cumulative differences of all the matched features and then use them to back-propagate that loss through all the layers. Contrary to these methods, in this work, we propose a novel layer-wise learning scheme that adjusts learning parameters per layer as a function of the differences in the Jacobian/Attention/Hessian of the output activations w.r.t. the network parameters. We applied this novel scheme to attention map-based and derivative-based (first and second order) transfer learning methods. We observed improved learning performance and stability across a wide range of datasets. From extensive experimental evaluation, we observed that the performance boost achieved by our method becomes more significant with the increasing difficulty of the learning task.
Xuqi Zhu, Huaizhi Zhang, JunKyu Lee, Jiacheng Zhu, Chandrajit Pal, Sangeet Saha, Klaus D. McDonald-Maier, Xiaojun Zhai
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations, constituting the predominant computational cost. Therefore, this paper proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs as a basic component of NNs. We first streamline inter-layer and intra-layer redundancies of the MADDNESS algorithm, a LUT-based approximate matrix multiplication, to design a fast, efficient, and scalable approximate matrix multiplication module termed "Approximate Multiplication Unit (AMU)". The AMU optimizes LUT-based matrix multiplications further through dedicated memory management and access design, decoupling computational overhead from input resolution and boosting FPGA-based NN accelerator efficiency significantly. The experimental results show that using our AMU achieves up to 9x higher throughput and 112x higher energy efficiency over the state-of-the-art solutions for the FPGA-based Quantised Neural Network (QNN) accelerators.