Ning-Chi Huang, Chi-Chih Chang, Wei-Cheng Lin, Endri Taka, Diana Marculescu, Kai-Chiang Wu
$N{:}M$ sparsity is an emerging model compression method supported by more and more accelerators to speed up sparse matrix multiplication in deep neural networks. Most existing $N{:}M$ sparsity methods compress neural networks with a uniform setting for all layers in a network or heuristically determine the layer-wise configuration by considering the number of parameters in each layer. However, very few methods have been designed for obtaining a layer-wise customized $N{:}M$ sparse configuration for vision transformers (ViTs), which usually consist of transformer blocks involving the same number of parameters. In this work, to address the challenge of selecting a suitable sparse configuration for ViTs on $N{:}M$ sparsity-supporting accelerators, we propose ELSA, Exploiting Layer-wise $N{:}M$ Sparsity for ViTs. Considering not only all $N{:}M$ sparsity levels supported by a given accelerator but also the expected throughput improvement, our methodology can reap the benefits of accelerators supporting mixed sparsity by trading off negligible accuracy loss with both memory usage and inference time reduction for ViT models. For instance, our approach achieves a noteworthy 2.9$\times$ reduction in FLOPs for both Swin-B and DeiT-B with only a marginal degradation of accuracy on ImageNet. Our code will be released upon paper acceptance.
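To make the $N{:}M$ constraint concrete, the following is a minimal sketch of pruning one weight row so that at most $N$ entries in every consecutive group of $M$ are nonzero. Keeping the largest-magnitude entries is an assumption made for illustration; the abstract does not state how ELSA chooses which weights to keep, and the function name `nm_prune` is hypothetical.

```python
def nm_prune(weights, n, m):
    """Keep the n largest-magnitude entries in each consecutive group of m; zero the rest.

    Magnitude-based selection is an illustrative assumption, not the ELSA criterion.
    """
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest magnitudes inside this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
sparse_row = nm_prune(row, 2, 4)  # 2:4 sparsity: two survivors per group of four
```

A layer-wise configuration in this picture is just a different `(n, m)` pair per layer, chosen from the levels the target accelerator supports.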
Nick Gravin, Zhiqi Wang
This paper reexamines the classic problem of revenue maximization in single-item auctions with $n$ buyers under the lens of the robust optimization framework. The celebrated Myerson's mechanism is the format that maximizes the seller's revenue under the prior distribution, which is mutually independent across all $n$ buyers. As argued in a recent line of work (Caragiannis et al. 22; Dughmi et al. 24), mutual independence is a strong assumption that is extremely hard to verify statistically, so it is important to relax it. While it is optimal under a mutually independent prior, we find that Myerson's mechanism may lose almost all of its revenue when the independence assumption is relaxed to pairwise independence, i.e., Myerson's mechanism is not pairwise-robust. The mechanism regains robustness when the prior is assumed to be 3-wise independent. In contrast, we show that second-price auctions with anonymous reserve, including optimal auctions under i.i.d. priors, lose at most a constant fraction of their revenues on any regular pairwise independent prior. Our findings draw a comprehensive picture of robustness to $k$-wise independence in single-item auction settings.
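For readers unfamiliar with the auction format named above, here is a minimal sketch of a second-price auction with an anonymous reserve: the same reserve price applies to every buyer, the highest eligible bidder wins, and the payment is the larger of the reserve and the second-highest eligible bid. The function name and the tie-breaking rule are assumptions for illustration.

```python
def second_price_with_reserve(bids, reserve):
    """Return (winner_index, payment), or (None, 0.0) if no bid meets the reserve.

    Ties are broken in favour of the higher index, an arbitrary illustrative choice.
    """
    eligible = [(bid, i) for i, bid in enumerate(bids) if bid >= reserve]
    if not eligible:
        return None, 0.0
    eligible.sort(reverse=True)
    _, winner = eligible[0]
    # With a single eligible bidder, the reserve acts as the second price.
    runner_up = eligible[1][0] if len(eligible) > 1 else reserve
    return winner, max(reserve, runner_up)


print(second_price_with_reserve([3.0, 7.0, 5.0], 4.0))
print(second_price_with_reserve([3.0, 7.0, 5.0], 6.0))
```

In the first call the runner-up bid sets the price; in the second only one bidder clears the reserve, so the reserve itself is paid.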
Lu Wang, Tianyuan Zhang, Yikai Han, Muyang Fang, Ting Jin, Jiaqi Kang
With recent breakthroughs in deep neural networks, numerous tasks within autonomous driving have exhibited remarkable performance. However, deep learning models are susceptible to adversarial attacks, presenting significant security risks to autonomous driving systems. Presently, end-to-end architectures have emerged as the predominant solution for autonomous driving, owing to their collaborative nature across different tasks. Yet, the implications of adversarial attacks on such models remain relatively unexplored. In this paper, we conduct comprehensive adversarial security research on the modular end-to-end autonomous driving model for the first time. We thoroughly consider the potential vulnerabilities in the model inference process and design a universal attack scheme through module-wise noise injection. We conduct large-scale experiments on the full-stack autonomous driving model and demonstrate that our attack method outperforms previous attack methods. We trust that our research will offer fresh insights into ensuring the safety and reliability of autonomous driving systems.
Alessandro Perlo, Giordano Paoletti, Nikhil Jha, Luca Vassio, Jussara Almeida, Marco Mellia
Although currently one of the most popular instant messaging apps worldwide, Telegram has been largely understudied in the past years. In this paper, we aim to address this gap by presenting an analysis of publicly accessible groups covering discussions encompassing different topics, as diverse as Education, Erotic, Politics, and Cryptocurrencies. We engineer and offer an open-source tool to automate the collection of messages from Telegram groups, a non-straightforward problem. We use it to collect more than 50 million messages from 669 groups. Here, we present a first-of-its-kind, per-topic analysis, contrasting the characteristics of the messages sent on the platform from different angles -- the language, the presence of bots, the type and volume of shared media content. Our results confirm some anecdotal evidence, e.g., clues that Telegram is used to share possibly illicit content, and unveil some unexpected findings, e.g., the different sharing patterns of video and stickers in groups of different topics. While preliminary, we hope that our work paves the road for several avenues of future research on the understudied Telegram platform.
Hengyu Zhou, Hui Zhang, Bin Wang
The widespread use of vector graphics creates a significant demand for vectorization methods. While recent learning-based techniques have shown their capability to create vector images of clear topology, filling these primitives with gradients remains a challenge. In this paper, we propose a segmentation-guided vectorization framework to convert raster images into concise vector graphics with radial gradient fills. With the guidance of an embedded gradient-aware segmentation subroutine, our approach progressively appends gradient-filled Bézier paths to the output, where primitive parameters are initiated with our newly designed initialization technique and are optimized to minimize our novel loss function. We build our method on a differentiable renderer with traditional segmentation algorithms to develop it as a model-free tool for raster-to-vector conversion. It is tested on various inputs to demonstrate its feasibility, independent of datasets, to synthesize vector graphics with improved visual quality and layer-wise topology compared to prior work.
Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes et al.
Dysfluent speech detection is the bottleneck for disordered speech analysis
and spoken language learning. Current state-of-the-art models are governed by
rule-based systems which lack efficiency and robustness, and are sensitive to
template design. In this paper, we propose YOLO-Stutter: a first end-to-end
method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes
imperfect speech-text alignment as input, followed by a spatial feature
aggregator, and a temporal dependency extractor to perform region-wise boundary
and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter
and VCTK-TTS, that simulate natural spoken dysfluencies including repetition,
block, missing, replacement, and prolongation. Our end-to-end method achieves
state-of-the-art performance with a minimum number of trainable parameters
on both simulated data and real aphasia speech. Code and datasets are
open-sourced at https://github.com/rorizzz/YOLO-Stutter
Authors' comments: Interspeech 2024
Muyao Wang, Zeke Xie, Bo Chen
The influence function, a technique from robust statistics, measures the impact on model parameters or related functions when training data is removed or modified. This effective and valuable post-hoc method allows for studying the interpretability of machine learning models without requiring costly model retraining. It also enables extensions such as increasing model performance, improving model generalization, and offering interpretability. Recently, Multivariate Time Series (MTS) analysis has become an important yet challenging task, attracting significant attention. However, there is no preceding research on the influence functions of MTS to shed light on the effects of modifying the channel of training MTS. Given that each channel in an MTS plays a crucial role in its analysis, it is essential to characterize the influence of different channels. To fill this gap, we propose a channel-wise influence function, which is the first method that can estimate the influence of different channels in MTS, utilizing a first-order gradient approximation that leverages the more informative average gradient of the data set. Additionally, we demonstrate how this influence function can be used to estimate the impact of a channel in MTS. Finally, we validate the accuracy and effectiveness of our influence estimation function in critical MTS analysis tasks, such as MTS anomaly detection and MTS forecasting. Extensive experiments on real-world datasets show that the original influence function performs worse than our method and even fails on the channel pruning problem, which demonstrates the superiority and necessity of the channel-wise influence function in MTS analysis tasks.
Jenni Raitoharju
This paper proposes an easy-to-use method for one-class classification:
Repeated Element-wise Folding (REF). The algorithm consists of repeatedly
standardizing and applying an element-wise folding operation on the one-class
training data. Equivalent mappings are performed on unknown test items and the
classification prediction is based on the item's distance to the origin of the
final distribution. As all the included operations have linear time complexity,
the proposed algorithm provides a linear-time alternative for the commonly used
computationally much more demanding approaches. Furthermore, REF can avoid the
challenges of hyperparameter setting in one-class classification by providing
robust default settings. The experiments show that the proposed method can
produce similar classification performance or even outperform the more complex
algorithms on various benchmark datasets. Matlab codes for REF are publicly
available at https://github.com/JenniRaitoharju/REF.
Authors' comments: Accepted to EUSIPCO 2024
Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, Xiangnan He
Sequential recommendation systems predict the next interaction item based on users' past interactions, aligning recommendations with individual preferences. Leveraging the strengths of Large Language Models (LLMs) in knowledge comprehension and reasoning, recent approaches are eager to apply LLMs to sequential recommendation. A common paradigm is converting user behavior sequences into instruction data, and fine-tuning the LLM with parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA). However, the uniform application of LoRA across diverse user behaviors is insufficient to capture individual variability, resulting in negative transfer between disparate sequences. To address these challenges, we propose Instance-wise LoRA (iLoRA). We innovatively treat the sequential recommendation task as a form of multi-task learning, integrating LoRA with the Mixture of Experts (MoE) framework. This approach encourages different experts to capture various aspects of user behavior. Additionally, we introduce a sequence representation guided gate function that generates customized expert participation weights for each user sequence, which allows dynamic parameter adjustment for instance-wise recommendations. In sequential recommendation, iLoRA achieves an average relative improvement of 11.4\% over basic LoRA in the hit ratio metric, with less than a 1\% relative increase in trainable parameters. Extensive experiments on three benchmark datasets demonstrate the effectiveness of iLoRA, highlighting its superior performance compared to existing methods in mitigating negative transfer and improving recommendation accuracy. Our data and code are available at https://github.com/AkaliKong/iLoRA.
Yabin Wang, Zhiwu Huang, Su Zhou, Adam Prugel-Bennett, Xiaopeng Hong
The diffusion of deepfake technologies has sparked serious concerns about their potential misuse across various domains, prompting the urgent need for robust detection methods. Despite advances, many current approaches prioritize short-term gains at the expense of long-term effectiveness. This paper critiques the overly specialized approach of fine-tuning pre-trained models solely with a penny-wise objective on a single deepfake dataset, while disregarding the pound-wise balance for generalization and knowledge retention. To address this "Penny-Wise and Pound-Foolish" issue, we propose a novel learning framework (PoundNet) for generalization of deepfake detection on a pre-trained vision-language model. PoundNet incorporates a learnable prompt design and a balanced objective to preserve broad knowledge from upstream tasks (object classification) while enhancing generalization for downstream tasks (deepfake detection). We train PoundNet on a standard single deepfake dataset, following common practice in the literature. We then evaluate its performance across 10 public large-scale deepfake datasets with 5 main evaluation metrics, forming the largest benchmark test set for assessing the generalization ability of deepfake detection models, to our knowledge. The comprehensive benchmark evaluation demonstrates that the proposed PoundNet is significantly less "Penny-Wise and Pound-Foolish", achieving a remarkable improvement of 19% in deepfake detection performance compared to state-of-the-art methods, while maintaining a strong performance of 63% on object classification tasks, where other deepfake detection models tend to be ineffective. Code and data are open-sourced at https://github.com/iamwangyabin/PoundNet.
Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li et al.
This paper identifies significant redundancy in the query-key interactions
within self-attention mechanisms of diffusion transformer models, particularly
during the early stages of denoising diffusion steps. In response to this
observation, we present a novel diffusion transformer framework incorporating
an additional set of mediator tokens to engage with queries and keys
separately. By modulating the number of mediator tokens during the denoising
generation phases, our model initiates the denoising process with a precise,
non-ambiguous stage and gradually transitions to a phase enriched with detail.
Concurrently, integrating mediator tokens simplifies the attention module's
complexity to a linear scale, enhancing the efficiency of global attention
processes. Additionally, we propose a time-step dynamic mediator token
adjustment mechanism that further decreases the required computational FLOPs
for generation, simultaneously facilitating the generation of high-quality
images within the constraints of varied inference budgets. Extensive
experiments demonstrate that the proposed method can improve the generated
image quality while also reducing the inference cost of diffusion transformers.
When integrated with the recent work SiT, our method achieves a
state-of-the-art FID score of 2.01. The source code is available at
https://github.com/LeapLabTHU/Attention-Mediators.
Authors' comments: ECCV 2024
David Zagardo
Traditional Differentially Private Stochastic Gradient Descent (DP-SGD)
adds statistical noise drawn from a Gaussian distribution to the gradients
to ensure privacy. This paper introduces the novel Differentially
Private Block-wise Gradient Shuffle (DP-BloGS) algorithm for deep learning.
BloGS builds off of existing private deep learning literature, but makes a
definitive shift by taking a probabilistic approach to gradient noise
introduction through shuffling modeled after information theoretic privacy
analyses. The theoretical results presented in this paper show that the
combination of shuffling, parameter-specific block size selection, batch layer
clipping, and gradient accumulation allows DP-BloGS to achieve training times
close to those of non-private training while maintaining similar privacy and
utility guarantees to DP-SGD. DP-BloGS is found to be significantly more
resistant to data extraction attempts than DP-SGD. The theoretical results are
validated by the experimental findings.
Authors' comments: 43 pages, 11 figures, 8 tables
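As a rough illustration of what "block-wise shuffling" of a gradient could look like, the sketch below splits a flattened gradient into fixed-size blocks and permutes the order of the blocks. This is only one possible reading of the abstract: the actual DP-BloGS procedure, its block size selection, clipping, and privacy accounting are not described there, and the function name `blockwise_shuffle` is hypothetical. The sketch provides no privacy guarantee on its own.

```python
import random


def blockwise_shuffle(grad, block_size, rng):
    """Permute fixed-size blocks of a flattened gradient.

    Illustrative assumption only: blocks are moved as units and the values
    inside each block keep their order. The last block may be shorter.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [grad[i:i + block_size] for i in range(0, len(grad), block_size)]
    rng.shuffle(blocks)  # in-place permutation of the list of blocks
    return [value for block in blocks for value in block]


rng = random.Random(0)  # seeded so that the example is reproducible
grad = [0.1 * i for i in range(10)]
shuffled = blockwise_shuffle(grad, 4, rng)
```

The shuffled vector contains exactly the same values as the input, which is the sense in which shuffling differs from adding Gaussian noise.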
Aadarsh Singh
In this paper, we have explored the impact of certain index-dependent
element-wise transformations on the null space of a matrix. We have found the
conditions on this transformation that will preserve the rank and nullity of
the original matrix. We have also found some transformations which give
localized null vectors for the transformed matrix. Finally, some possible
applications of these localized null vectors and eigenvalues are mentioned in
different domains.
Authors' comments: 13 pages
Ruizi Han, Jinglei Tang
Parameter-efficient transfer learning (PETL) aims to adapt large pre-trained
models using limited parameters. While most PETL approaches update the added
parameters and freeze pre-trained weights during training, the minimal impact
of task-specific deep layers on cross-domain data poses a challenge as PETL
cannot modify them, resulting in redundant model structures. Structural pruning
effectively reduces model redundancy; however, common pruning methods often
lead to an excessive increase in stored parameters due to varying pruning
structures based on pruning rates and data. Recognizing the storage parameter
volume issue, we propose a Straightforward layer-wise pruning method, called
SLS, for pruning PETL-transferred models. By evaluating parameters from a
feature perspective of each layer and utilizing clustering metrics to assess
current parameters based on clustering phenomena in low-dimensional space
obtained through t-SNE, SLS facilitates informed pruning decisions. Our study
reveals that layer-wise pruning, with a focus on storing pruning indices,
addresses storage volume concerns. Notably, mainstream layer-wise pruning
methods may not be suitable for assessing layer importance in PETL-transferred
models, where the majority of parameters are pre-trained and have limited
relevance to downstream datasets. Comparative analysis against state-of-the-art
PETL methods demonstrates that the pruned model achieved a notable balance
between model throughput and accuracy. Moreover, SLS effectively reduces
storage overhead arising from varying pruned structures while enhancing the
accuracy and speed of pruned models compared to conventional pruning methods.
Authors' comments: Published at ECCV 2024
Jiahong Ma, Mingguo He, Zhewei Wei
Spectral Graph Neural Networks have demonstrated superior performance in
graph representation learning. However, many current methods focus on employing
shared polynomial coefficients for all nodes, i.e., learning node-unified
filters, which limits the filters' flexibility for node-level tasks. The recent
DSF attempts to overcome this limitation by learning node-wise coefficients
based on positional encoding. However, the initialization and updating process
of the positional encoding are burdensome, hindering scalability on large-scale
graphs. In this work, we propose a scalable node-wise filter, PolyAttn.
Leveraging the attention mechanism, PolyAttn can directly learn node-wise
filters in an efficient manner, offering powerful representation capabilities.
Building on PolyAttn, we introduce the whole model, named PolyFormer. Through the
lens of Graph Transformer models, PolyFormer, which calculates attention scores
within nodes, shows great scalability. Moreover, the model captures spectral
information, enhancing expressiveness while maintaining efficiency. With these
advantages, PolyFormer offers a desirable balance between scalability and
expressiveness for node-level tasks. Extensive experiments demonstrate that our
proposed methods excel at learning arbitrary node-wise filters, showing
superior performance on both homophilic and heterophilic graphs, and handling
graphs containing up to 100 million nodes. The code is available at
https://github.com/air029/PolyFormer.
Authors' comments: ACM SIGKDD 2024
Luís Almeida, Inês Dutra, Francesco Renna
Semantic segmentation is a fundamental computer vision task with a vast number of applications. State-of-the-art methods increasingly rely on deep learning models, which are known to estimate uncertainty incorrectly and to be overconfident in their predictions, especially on data not seen during training. This is particularly problematic in semantic segmentation due to inherent class imbalance. Popular uncertainty quantification approaches are task-agnostic and fail to leverage spatial pixel correlations in uncertainty estimates, crucial in this task. In this work, a novel training methodology specifically designed for semantic segmentation is presented. Training samples are weighted by instance-wise uncertainty masks computed by an ensemble. This is shown to increase performance on minority classes, boost model generalization and robustness to domain shift when compared to using the inverse of class proportions or no class weights at all. This method addresses the challenges of class imbalance and uncertainty estimation in semantic segmentation, potentially enhancing model performance and reliability across various applications.
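The weighting idea above can be sketched as follows: an ensemble produces one class distribution per pixel, an uncertainty value per pixel is derived from the ensemble, and that value scales the pixel's loss term. Using the entropy of the ensemble-mean distribution as the uncertainty measure is an assumption chosen for illustration; the abstract does not specify how the masks are computed, and all function names here are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def uncertainty_mask(ensemble_probs):
    """ensemble_probs[m][i] is member m's class distribution for pixel i.

    Returns one uncertainty value per pixel: the entropy of the mean
    prediction (an illustrative assumption).
    """
    n_members = len(ensemble_probs)
    mask = []
    for pixel_preds in zip(*ensemble_probs):
        mean = [sum(class_probs) / n_members for class_probs in zip(*pixel_preds)]
        mask.append(entropy(mean))
    return mask


def weighted_cross_entropy(probs, labels, weights):
    """Mean of per-pixel cross-entropy terms, each scaled by its weight."""
    terms = [-w * math.log(p[y]) for p, y, w in zip(probs, labels, weights)]
    return sum(terms) / len(terms)


# Two ensemble members, two pixels, two classes.
ensemble = [
    [[0.9, 0.1], [0.9, 0.1]],  # member 0
    [[0.9, 0.1], [0.1, 0.9]],  # member 1 disagrees on pixel 1
]
mask = uncertainty_mask(ensemble)
loss = weighted_cross_entropy([[0.8, 0.2], [0.6, 0.4]], [0, 0], mask)
```

Pixels on which the ensemble disagrees receive a larger weight, so they contribute more to the training loss.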
Seitaro Otsuki, Tsumugi Iida, Félix Doublet, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura
The transparent formulation of explanation methods is essential for
elucidating the predictions of neural networks, which are typically black-box
models. Layer-wise Relevance Propagation (LRP) is a well-established method
that transparently traces the flow of a model's prediction backward through its
architecture by backpropagating relevance scores. However, the conventional LRP
does not fully consider the existence of skip connections, and thus its
application to the widely used ResNet architecture has not been thoroughly
explored. In this study, we extend LRP to ResNet models by introducing
Relevance Splitting at points where the output from a skip connection converges
with that from a residual block. Our formulation guarantees the conservation
property throughout the process, thereby preserving the integrity of the
generated explanations. To evaluate the effectiveness of our approach, we
conduct experiments on ImageNet and the Caltech-UCSD Birds-200-2011 dataset.
Our method achieves superior performance to that of baseline methods on
standard evaluation metrics such as the Insertion-Deletion score while
maintaining its conservation property. We will release our code for further
research at https://5ei74r0.github.io/lrp-for-resnet.page/
Authors' comments: Accepted for presentation at ECCV2024
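The conservation property mentioned above can be illustrated with a small sketch. At the point where a skip connection and a residual block are added, the relevance arriving from the layer above is divided between the two branches so that the two parts sum back to the original value. Splitting in proportion to the absolute contribution of each branch is an assumption made here for illustration; the abstract does not state the exact splitting rule, and `split_relevance` is a hypothetical name.

```python
def split_relevance(relevance, skip_out, block_out, eps=1e-12):
    """Split element-wise relevance between a skip branch and a residual block.

    Each element is divided in proportion to |skip| and |block| (illustrative
    assumption); if both contributions are zero the relevance is halved.
    """
    r_skip, r_block = [], []
    for r, s, b in zip(relevance, skip_out, block_out):
        denom = abs(s) + abs(b)
        share = abs(s) / denom if denom > eps else 0.5
        r_skip.append(r * share)
        r_block.append(r - r * share)  # remainder, so the two parts sum to r
    return r_skip, r_block


relevance = [1.0, 2.0, -0.5]
skip_out = [3.0, 0.0, 1.0]
block_out = [1.0, 2.0, 1.0]
r_skip, r_block = split_relevance(relevance, skip_out, block_out)
```

Because the block's share is computed as the remainder, the total relevance entering the junction equals the total leaving it, which is what conservation requires.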
Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao
Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights by the activation scaling factors, avoiding the internal scaling and costly additional computational overhead brought by per-channel activation quantization. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring balanced numerical ranges across different activation channels. OutlierTune is easy to implement and hardware-efficient, introducing almost no additional computational overhead during inference. Extensive experiments show that the proposed framework outperforms existing methods across multiple tasks. Demonstrating better generalization, this framework improves the Int6 quantization of instruction-tuned LLMs, such as OPT-IML, to the same level as half-precision (FP16). Moreover, we have shown that the proposed framework is 1.48x faster than the FP16 implementation while reducing memory usage by approximately 2x.
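The algebra behind folding activation scales into the weights can be sketched in a few lines. If each activation channel $j$ has its own scale $s_j$, then $(X_q \,\mathrm{diag}(s))\, W = X_q\, (\mathrm{diag}(s)\, W)$, so the scales can be multiplied into the rows of $W$ ahead of time. The sketch below only checks this identity on a toy example; the symmetric max-based scale, the value `qmax=31` for a signed 6-bit range, and all function names are assumptions for illustration, and the symmetrization step is not shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def channel_scales(x, qmax=31):
    """One symmetric scale per activation channel (column of x)."""
    scales = []
    for col in zip(*x):
        peak = max(abs(v) for v in col)
        scales.append(peak / qmax if peak > 0 else 1.0)
    return scales


def quantize(x, scales):
    """Round each activation to an integer after dividing by its channel scale."""
    return [[round(v / s) for v, s in zip(row, scales)] for row in x]


def fold_scales_into_weights(w, scales):
    """Scale row j of w by s_j, so integer activations need no dequantization."""
    return [[s * v for v in row] for row, s in zip(w, scales)]


x = [[0.5, 40.0, -1.0], [-0.25, -35.0, 2.0]]  # channel 1 holds the outliers
w = [[1.0, -2.0], [0.5, 0.25], [-1.5, 3.0]]

s = channel_scales(x)
x_q = quantize(x, s)
# Reference path: dequantize the activations, then multiply.
reference = matmul([[q * sj for q, sj in zip(row, s)] for row in x_q], w)
# Folded path: multiply integer activations by pre-scaled weights.
folded = matmul(x_q, fold_scales_into_weights(w, s))
```

Both paths give the same output up to floating-point rounding, but the folded path performs the scaling once, offline, instead of at every inference step.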
Kevin Gu, Eva Tuecke, Dmitriy Katz, Raya Horesh, David Alvarez-Melis, Mikhail Yurochkin
Large language models (LLMs) have shown remarkable potential for problem
solving, with open source models achieving increasingly impressive performance
on benchmarks measuring areas from logical reasoning to mathematical ability.
Ensembling models can further improve capabilities across a variety of domains.
However, conventional methods of combining models at inference time such as
shallow fusion necessitate a shared vocabulary and tokenization, and
alternatives like fine-tuning for domain-specific performance are both time
consuming and computationally expensive. We therefore present an inference-time
ensembling algorithm aimed at "averaging" outputs from multiple LLMs and
illustrate its improved performance across multiple domains compared to its
constituent models alone. Character-wise ensemble decoding, CharED, finds the
marginal distribution of each character for an individual model and performs a
weighted average to generate an output, character by character. In coding,
math, and toxicity benchmarks, we find our proposed model able to combine
complementary strengths of multiple LLMs, regardless of vocabulary,
tokenization, or model size.
Authors' comments: 9 pages, 4 figures
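A single decoding step of the character-wise idea can be sketched as follows. Each model's next-token distribution is collapsed into a distribution over the next character by summing the probability of all tokens that start with that character, and the per-model character distributions are then averaged with weights. The token distributions, weights, and function names below are made up for illustration; a full decoder would also have to track which tokens remain consistent with the characters already emitted, which is not shown.

```python
from collections import defaultdict


def char_marginal(token_probs):
    """Collapse a next-token distribution into a next-character distribution."""
    marginal = defaultdict(float)
    for token, prob in token_probs.items():
        if token:  # ignore empty tokens
            marginal[token[0]] += prob
    total = sum(marginal.values())
    return {char: prob / total for char, prob in marginal.items()}


def ensemble_next_char(marginals, weights):
    """Weighted average of character distributions; returns (best_char, combined)."""
    combined = defaultdict(float)
    for marginal, weight in zip(marginals, weights):
        for char, prob in marginal.items():
            combined[char] += weight * prob
    return max(combined, key=combined.get), dict(combined)


# Hypothetical next-token distributions from two models with different vocabularies.
model_a = {"the": 0.5, "to": 0.2, "a": 0.3}
model_b = {"t": 0.1, "an": 0.6, "and": 0.3}

best_char, combined = ensemble_next_char(
    [char_marginal(model_a), char_marginal(model_b)], [0.5, 0.5]
)
```

Because the averaging happens over characters, the two models never need to agree on a vocabulary or tokenization.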
Derck W. E. Prinzhorn, Thijmen Nijdam, Putri A. van der Linden, Alexander Timans
Conformal prediction offers a practical framework for distribution-free
uncertainty quantification, providing finite-sample coverage guarantees under
relatively mild assumptions on data exchangeability. However, these assumptions
cease to hold for time series due to their temporally correlated nature. In
this work, we present a novel use of conformal prediction for time series
forecasting that incorporates time series decomposition. This approach allows
us to model different temporal components individually. By applying specific
conformal algorithms to each component and then merging the obtained prediction
intervals, we customize our methods to account for the different
exchangeability regimes underlying each component. Our decomposition-based
approach is thoroughly discussed and empirically evaluated on synthetic and
real-world data. We find that the method provides promising results on
well-structured time series, but can be limited by factors such as the
decomposition step for more complex data.
Authors' comments: Accepted at COPA 2024; 34 pages, 14 figures, 8 tables (incl.
appendix)
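A minimal sketch of the per-component idea, assuming plain split conformal prediction with absolute-residual scores for each component: a quantile of held-out calibration residuals gives a symmetric interval around each component's forecast, and the component intervals are then merged. Merging by adding endpoints is an assumption chosen for illustration, and the merged interval does not automatically inherit the per-component coverage level. All names and numbers below are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]


def interval(prediction, half_width):
    """Symmetric prediction interval around a point forecast."""
    return prediction - half_width, prediction + half_width


def merge_intervals(component_intervals):
    """Merge component intervals by adding endpoints (illustrative assumption)."""
    lower = sum(lo for lo, _ in component_intervals)
    upper = sum(hi for _, hi in component_intervals)
    return lower, upper


# Absolute calibration residuals of a hypothetical trend model.
trend_scores = [0.3, 1.2, 0.7, 2.0, 0.1, 0.9, 1.5]
q_trend = conformal_quantile(trend_scores, 0.25)

# The seasonal half-width is set by hand here to keep the example short.
merged = merge_intervals([interval(10.0, q_trend), interval(2.0, 0.5)])
```

Treating each component separately allows a different conformal method, or a different calibration window, to be used where exchangeability is more or less plausible.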