Qi Bing, Chaoyi Zhang, Weidong Cai
In contrast to the well-established technique of rasterization, vectorization
of images poses a significant challenge in the field of computer graphics.
Recent learning-based methods for converting raster images to vector formats
frequently suffer from incomplete shapes, redundant path prediction, and a lack
of accuracy in preserving the semantics of the original content. These
shortcomings severely hinder the utility of these methods for further editing
and manipulation of images. To address these challenges, we present DeepIcon, a
novel hierarchical image vectorization network specifically tailored for
generating variable-length icon vector graphics based on the raster image
input. Our experimental results indicate that DeepIcon can efficiently produce
Scalable Vector Graphics (SVGs) directly from raster images, bypassing the need
for a differentiable rasterizer while also demonstrating a profound
understanding of the image contents.
Authors' comments: Accepted as Oral Presentation at DICTA 2024
Zihan Chen, Bike Xie, Jundong Li, Cong Shen
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on different sizes of LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ thus represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
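The fractional-bit idea can be illustrated with a minimal sketch. It assumes only two precision levels (3 and 4 bits) and a hypothetical per-channel importance score derived from activations; the function name `allocate_channel_bits` is invented here. This is not CMPQ's actual allocation rule, and it leaves out the non-uniform quantization and outlier extraction described above.

```python
import numpy as np

def allocate_channel_bits(importance, target_bits, low=3, high=4):
    """Give `high` bits to the most important channels and `low` bits to the
    rest, so the average bit-width approaches `target_bits` from below."""
    importance = np.asarray(importance, dtype=float)
    n = importance.size
    if not low <= target_bits <= high:
        raise ValueError("target_bits must lie between low and high")
    # Number of channels that can afford the higher precision.
    n_high = int(np.floor(n * (target_bits - low) / (high - low) + 1e-9))
    bits = np.full(n, low, dtype=int)
    if n_high > 0:
        top = np.argsort(-importance, kind="stable")[:n_high]
        bits[top] = high
    return bits

rng = np.random.default_rng(0)
act_scale = rng.random(8)  # hypothetical per-channel activation statistic
bits = allocate_channel_bits(act_scale, target_bits=3.25)
```

With 8 channels and a 3.25-bit budget, two channels receive 4 bits and six receive 3 bits, so the average matches the budget exactly.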
Shen Yuan, Hongteng Xu
The Transformer plays a central role in many fundamental deep learning models,
e.g., ViT in computer vision and BERT and GPT in natural language
processing, whose effectiveness is mainly attributed to its multi-head
attention (MHA) mechanism. In this study, we propose a simple and novel
channel-wise sample permutation (CSP) operator, achieving a new structured MHA
with fewer parameters and lower complexity. Given an input matrix, CSP
circularly shifts the samples of different channels with various steps and then
sorts grouped samples of each channel. This operator is equivalent to
implicitly implementing cross-channel attention maps as permutation matrices,
which achieves linear complexity and suppresses the risk of rank collapse when
representing data. We replace the MHA of some representative models with CSP
and test the CSP-based models in several discriminative tasks, including image
classification and long sequence analysis. Experiments show that the CSP-based
models achieve comparable or better performance with fewer parameters and lower
computational costs than the classic Transformer and its state-of-the-art
variants. The code is available at https://github.com/DaShenZi721/CSP.
Authors' comments: 18 pages, 4 figures
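The two CSP steps described above (per-channel circular shift, then sorting within groups) can be sketched in NumPy. The shift steps, the group size, and the function name `csp` are assumptions made for illustration; the authors' repository is the reference for how these are actually chosen.

```python
import numpy as np

def csp(x, shifts, group_size):
    """x: (n_samples, n_channels). Circularly shift each channel along the
    sample axis, then sort the samples inside each consecutive group."""
    n, c = x.shape
    if n % group_size != 0:
        raise ValueError("n_samples must be divisible by group_size")
    out = np.empty_like(x)
    for j in range(c):
        out[:, j] = np.roll(x[:, j], shifts[j])
    # Sort within groups of `group_size` consecutive samples, per channel.
    out = out.reshape(n // group_size, group_size, c)
    out = np.sort(out, axis=1)
    return out.reshape(n, c)

x = np.arange(12, dtype=float).reshape(6, 2)
y = csp(x, shifts=[0, 1], group_size=3)
```

Each output channel is a reordering of the corresponding input channel, which is why the operator can be read as applying a permutation matrix per channel without any learned attention weights.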
Siyuan Huang, Yunchong Song, Jiayue Zhou, Zhouhan Lin
In the realm of graph learning, there is a category of methods that
conceptualize graphs as hierarchical structures, utilizing node clustering to
capture broader structural information. While generally effective, these
methods often rely on a fixed graph coarsening routine, leading to overly
homogeneous cluster representations and loss of node-level information. In this
paper, we envision the graph as a network of interconnected node sets without
compressing each cluster into a single embedding. To enable effective
information transfer among these node sets, we propose the Node-to-Cluster
Attention (N2C-Attn) mechanism. N2C-Attn incorporates techniques from Multiple
Kernel Learning into the kernelized attention framework, effectively capturing
information at both node and cluster levels. We then devise an efficient form
for N2C-Attn using the cluster-wise message-passing framework, achieving linear
time complexity. We further analyze how N2C-Attn combines bi-level feature maps
of queries and keys, demonstrating its capability to merge dual-granularity
information. The resulting architecture, Cluster-wise Graph Transformer
(Cluster-GT), which uses node clusters as tokens and employs our proposed
N2C-Attn module, shows superior performance on various graph-level tasks. Code
is available at https://github.com/LUMIA-Group/Cluster-wise-Graph-Transformer.
Authors' comments: Accepted as NeurIPS 2024 Spotlight
Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran
Transferability of adversarial examples is a well-known property that endangers all classification models, even those that are only accessible through black-box queries. Prior work has shown that an ensemble of models is more resilient to transferability: the probability that an adversarial example is effective against most models of the ensemble is low. Thus, most ongoing research focuses on improving ensemble diversity. Another line of prior work has shown that Lipschitz continuity of the models can make models more robust since it limits how a model's output changes with small input perturbations. In this paper, we study the effect of Lipschitz continuity on transferability rates. We show that although a lower Lipschitz constant increases the robustness of a single model, it is not as beneficial in training robust ensembles as it increases the transferability rate of adversarial examples across models in the ensemble. Therefore, we introduce LOTOS, a new training paradigm for ensembles, which counteracts this adverse effect. It does so by promoting orthogonality among the top-$k$ sub-spaces of the transformations of the corresponding affine layers of any pair of models in the ensemble. We theoretically show that $k$ does not need to be large for convolutional layers, which makes the computational overhead negligible. Through various experiments, we show LOTOS increases the robust accuracy of ensembles of ResNet-18 models by $6$ percentage points (p.p) against black-box attacks on CIFAR-10. It is also capable of combining with the robustness of prior state-of-the-art methods for training robust ensembles to enhance their robust accuracy by $10.7$ p.p.
Fei Liu, Yang Ai, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, Zhen-Hua Ling
This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase
Prediction (SP-NSPP) model, which predicts the phase spectrum from the input
amplitude spectrum using a two-stage neural network. In the initial
prior-construction stage, we preliminarily predict a rough prior phase spectrum
from the amplitude spectrum. The subsequent refinement stage transforms the
amplitude spectrum into a refined high-quality phase spectrum conditioned on
the prior phase. Networks in both stages use ConvNeXt v2 blocks as the backbone
and adopt adversarial training by innovatively introducing a phase spectrum
discriminator (PSD). To further improve the continuity of the refined phase, we
also incorporate a time-frequency integrated difference (TFID) loss in the
refinement stage. Experimental results confirm that, compared to neural
network-based no-prior phase prediction methods, the proposed SP-NSPP achieves
higher phase prediction accuracy, thanks to introducing the coarse phase priors
and diverse training criteria. Compared to iterative phase estimation
algorithms, our proposed SP-NSPP does not require multiple rounds of staged
iterations, resulting in higher generation efficiency.
Authors' comments: Accepted by SLT2024
Ismail Alkhouri, Shijun Liang, Cheng-Han Huang, Jimmy Dai, Qing Qu, Saiprasad Ravishankar, Rongrong Wang
Diffusion models (DMs) are a class of generative models that allow sampling from a distribution learned over a training set. When applied to solving inverse imaging problems (IPs), the reverse sampling steps of DMs are typically modified to approximately sample from a measurement-conditioned distribution in the image space. However, these modifications may be unsuitable for certain settings (such as in the presence of measurement noise) and non-linear tasks, as they often struggle to correct errors from earlier sampling steps and generally require a large number of optimization and/or sampling steps. To address these challenges, we state three conditions for achieving measurement-consistent diffusion trajectories. Building on these conditions, we propose a new optimization-based sampling method that not only enforces the standard data manifold measurement consistency and forward diffusion consistency, as seen in previous studies, but also incorporates backward diffusion consistency that maintains a diffusion trajectory by optimizing over the input of the pre-trained model at every sampling step. By enforcing these conditions, either implicitly or explicitly, our sampler requires significantly fewer reverse steps. Therefore, we refer to our accelerated method as Step-wise Triple-Consistent Sampling (SITCOM). Compared to existing state-of-the-art baseline methods, under different levels of measurement noise, our extensive experiments across five linear and three non-linear image restoration tasks demonstrate that SITCOM achieves competitive or superior results in terms of standard image similarity metrics while requiring a significantly reduced run-time across all considered tasks.
Urszula Jessen, Dirk Fahland
Anomalies in complex industrial processes are often obscured by high variability and complexity of event data, which hinders their identification and interpretation using process mining. To address this problem, we introduce WISE (Weighted Insights for Evaluating Efficiency), a novel method for analyzing business process metrics through the integration of domain knowledge, process mining, and machine learning. The methodology involves defining business goals and establishing Process Norms with weighted constraints at the activity level, incorporating input from domain experts and process analysts. Individual process instances are scored based on these constraints, and the scores are normalized to identify features impacting process goals. Evaluation using the BPIC 2019 dataset and real industrial contexts demonstrates that WISE enhances automation in business process analysis and effectively detects deviations from desired process flows. While LLMs support the analysis, the inclusion of domain experts ensures the accuracy and relevance of the findings.
Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang
Diffusion transformers have shown significant effectiveness in both image and
video synthesis at the expense of huge computation costs. To address this
problem, feature caching methods have been introduced to accelerate diffusion
transformers by caching the features in previous timesteps and reusing them in
the following timesteps. However, previous caching methods ignore that
different tokens exhibit different sensitivities to feature caching, and
feature caching on some tokens may lead to 10$\times$ more destruction to the
overall generation quality compared with other tokens. In this paper, we
introduce token-wise feature caching, allowing us to adaptively select the most
suitable tokens for caching, and further enable us to apply different caching
ratios to neural layers in different types and depths. Extensive experiments on
PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image
and video generation with no requirements for training. For instance,
2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and
PixArt-$\alpha$ with almost no drop in generation quality.
Authors' comments: ToCa is honored to be accepted by ICLR 2025
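A minimal sketch of token-wise caching for a single layer follows. It assumes a hypothetical per-token sensitivity score is already available (how such scores are computed is not specified in the abstract) and uses invented names (`token_wise_cached_layer`, `layer_fn`); it is an illustration of the idea, not the ToCa implementation.

```python
import numpy as np

def token_wise_cached_layer(x, cache, scores, cache_ratio, layer_fn):
    """x: (n_tokens, dim) layer input; cache: (n_tokens, dim) layer output
    stored at an earlier timestep; scores: (n_tokens,) hypothetical
    sensitivity to caching (higher means caching hurts more)."""
    n = x.shape[0]
    n_cached = int(n * cache_ratio)
    order = np.argsort(scores, kind="stable")  # least sensitive tokens first
    reuse, recompute = order[:n_cached], order[n_cached:]
    out = np.empty_like(cache)
    out[reuse] = cache[reuse]                # reuse features from the cache
    out[recompute] = layer_fn(x[recompute])  # compute the remaining tokens
    return out, recompute

x = np.ones((4, 2))
cache = np.zeros((4, 2))
scores = np.array([0.9, 0.1, 0.5, 0.2])
out, recomputed = token_wise_cached_layer(
    x, cache, scores, cache_ratio=0.5, layer_fn=lambda t: t * 2.0
)
```

Passing a different `cache_ratio` per layer would correspond to the layer-dependent caching ratios mentioned in the abstract.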
Sawinder Kaur, Avery Gump, Yi Xiao, Jingyu Xin, Harshit Sharma, Nina R Benway, Jonathan L Preston, Asif Salekin
Advances in deep learning and the Internet of Things have led to diverse
human sensing applications. However, distinct patterns in human sensing,
influenced by various factors or contexts, challenge the generic neural network
model's performance due to natural distribution shifts. To address this,
personalization tailors models to individual users. Yet most personalization
studies overlook intra-user heterogeneity across contexts in sensory data,
limiting intra-user generalizability. This limitation is especially critical in
clinical applications, where limited data availability hampers both
generalizability and personalization. Notably, intra-user sensing attributes
are expected to change due to external factors such as treatment progression,
further complicating the challenges. To address the intra-user generalization
challenge, this work introduces CRoP, a novel static personalization approach.
CRoP leverages off-the-shelf pre-trained models as generic starting points and
captures user-specific traits through adaptive pruning on a minimal sub-network
while allowing generic knowledge to be incorporated in remaining parameters.
CRoP demonstrates superior personalization effectiveness and intra-user
robustness across four human-sensing datasets, including two from real-world
health domains, underscoring its practical and social impact. Additionally, to
support CRoP's generalization ability and design choices, we provide empirical
justification through gradient inner product analysis, ablation studies, and
comparisons against state-of-the-art baselines.
Authors' comments: 34 pages, 6 figures and 15 tables
Hongtao Huang, Xiaojun Chang, Lina Yao
Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images. Despite their effectiveness, these models often require significant computational resources owing to their numerous sequential denoising steps and the significant inference cost of each step. Recently, Neural Architecture Search (NAS) techniques have been employed to automatically search for faster generation processes. However, NAS for diffusion is inherently time-consuming as it requires estimating thousands of diffusion models to search for the optimal one. In this paper, we introduce Flexiffusion, a novel training-free NAS paradigm designed to accelerate diffusion models by concurrently optimizing generation steps and network structures. Specifically, we partition the generation process into isometric step segments, each sequentially composed of a full step, multiple partial steps, and several null steps. The full step computes all network blocks, while the partial step involves part of the blocks, and the null step entails no computation. Flexiffusion autonomously explores flexible step combinations for each segment, substantially reducing search costs and enabling greater acceleration compared to the state-of-the-art (SOTA) method for diffusion models. Our searched models reported speedup factors of $2.6\times$ and $1.5\times$ for the original LDM-4-G and the SOTA, respectively. The factors for Stable Diffusion V1.5 and the SOTA are $5.1\times$ and $2.0\times$. We also verified the performance of Flexiffusion on multiple datasets, and positive experiment results indicate that Flexiffusion can effectively reduce redundancy in diffusion models.
Liangyu Zhong, Joachim Sicking, Fabian Hüger, Hanno Gottschalk
Semantic segmentation networks have achieved significant success under the
assumption of independent and identically distributed data. However, these
networks often struggle to detect anomalies from unknown semantic classes due
to the limited set of visual concepts they are typically trained on. To address
this issue, anomaly segmentation often involves fine-tuning on outlier samples,
necessitating additional efforts for data collection, labeling, and model
retraining. Seeking to avoid this cumbersome work, we take a different approach
and propose to incorporate Vision-Language (VL) encoders into existing anomaly
detectors to leverage the semantically broad VL pre-training for improved
outlier awareness. Additionally, we propose a new scoring function that enables
data- and training-free outlier supervision via textual prompts. The resulting
VL4AD model, which includes max-logit prompt ensembling and a class-merging
strategy, achieves competitive performance on widely used benchmark datasets,
thereby demonstrating the potential of vision-language models for pixel-wise
anomaly detection.
Authors' comments: 27 pages, 9 figures, to be published in ECCV 2024 2nd Workshop on
Vision-Centric Autonomous Driving (VCAD)
Ning-Chi Huang, Chi-Chih Chang, Wei-Cheng Lin, Endri Taka, Diana Marculescu, Kai-Chiang Wu
$N{:}M$ sparsity is an emerging model compression method supported by a growing number of accelerators to speed up sparse matrix multiplication in deep neural networks. Most existing $N{:}M$ sparsity methods compress neural networks with a uniform setting for all layers in a network or heuristically determine the layer-wise configuration by considering the number of parameters in each layer. However, very few methods have been designed for obtaining a layer-wise customized $N{:}M$ sparse configuration for vision transformers (ViTs), which usually consist of transformer blocks involving the same number of parameters. In this work, to address the challenge of selecting a suitable sparse configuration for ViTs on $N{:}M$ sparsity-supporting accelerators, we propose ELSA, Exploiting Layer-wise $N{:}M$ Sparsity for ViTs. Considering not only all $N{:}M$ sparsity levels supported by a given accelerator but also the expected throughput improvement, our methodology can reap the benefits of accelerators supporting mixed sparsity by trading a negligible accuracy loss for reductions in both memory usage and inference time for ViT models. For instance, our approach achieves a noteworthy 2.9$\times$ reduction in FLOPs for both Swin-B and DeiT-B with only a marginal degradation of accuracy on ImageNet. Our code will be released upon paper acceptance.
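For readers unfamiliar with the pattern, a minimal sketch of magnitude-based $N{:}M$ pruning is given below. It assumes the common convention that at most $N$ of every $M$ consecutive weights are non-zero (for example 2:4); the function name `nm_prune` is hypothetical, and the layer-wise selection of configurations that ELSA performs is not shown.

```python
import numpy as np

def nm_prune(weights, n, m):
    """Keep the n largest-magnitude entries in every group of m consecutive
    weights along the last axis and zero the rest."""
    w = np.asarray(weights, dtype=float)
    if w.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = w.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1, kind="stable")[:, : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return np.where(mask, groups, 0.0).reshape(w.shape)

w = np.array([[0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5]])
sparse = nm_prune(w, n=2, m=4)
# sparse is [[0, -0.9, 0.4, 0, 2.0, 0, 0, 1.5]]
```

A layer-wise scheme would call such a routine with a different `(n, m)` pair per layer, chosen from the levels the target accelerator supports.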
Nick Gravin, Zhiqi Wang
This paper reexamines the classic problem of revenue maximization in single-item auctions with $n$ buyers through the lens of the robust optimization framework. The celebrated Myerson mechanism is the format that maximizes the seller's revenue when the prior distribution is mutually independent across all $n$ buyers. As argued in a recent line of work (Caragiannis et al. 22), (Dughmi et al. 24), mutual independence is a strong assumption that is extremely hard to verify statistically, so it is important to relax it. We find that Myerson's mechanism, while optimal under a mutually independent prior, may lose almost all of its revenue when the independence assumption is relaxed to pairwise independence, i.e., Myerson's mechanism is not pairwise-robust. The mechanism regains robustness when the prior is assumed to be 3-wise independent. In contrast, we show that second-price auctions with anonymous reserve, including optimal auctions under i.i.d. priors, lose at most a constant fraction of their revenues on any regular pairwise independent prior. Our findings draw a comprehensive picture of robustness to $k$-wise independence in single-item auction settings.
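The second-price auction with an anonymous reserve mentioned above is simple to state in code: one reserve price applies to every buyer, the highest bidder wins only if the bid meets the reserve, and the winner pays the larger of the second-highest bid and the reserve. The sketch below uses a hypothetical function name and breaks ties in favor of the lowest index, which is an arbitrary choice made here.

```python
def second_price_with_reserve(bids, reserve):
    """Return (winner_index, payment), or (None, 0.0) if no bid meets the
    reserve. The same reserve applies to every buyer (anonymous reserve)."""
    if not bids:
        return None, 0.0
    winner = max(range(len(bids)), key=lambda i: bids[i])
    if bids[winner] < reserve:
        return None, 0.0
    others = [b for i, b in enumerate(bids) if i != winner]
    second = max(others) if others else 0.0
    return winner, float(max(second, reserve))

print(second_price_with_reserve([5, 3, 1], reserve=2))  # (0, 3.0)
print(second_price_with_reserve([5, 3, 1], reserve=4))  # (0, 4.0)
print(second_price_with_reserve([5, 3, 1], reserve=6))  # (None, 0.0)
```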
Lu Wang, Tianyuan Zhang, Yikai Han, Muyang Fang, Ting Jin, Jiaqi Kang
With recent breakthroughs in deep neural networks, numerous tasks within autonomous driving have exhibited remarkable performance. However, deep learning models are susceptible to adversarial attacks, presenting significant security risks to autonomous driving systems. Presently, end-to-end architectures have emerged as the predominant solution for autonomous driving, owing to their collaborative nature across different tasks. Yet, the implications of adversarial attacks on such models remain relatively unexplored. In this paper, we conduct comprehensive adversarial security research on the modular end-to-end autonomous driving model for the first time. We thoroughly consider the potential vulnerabilities in the model inference process and design a universal attack scheme through module-wise noise injection. We conduct large-scale experiments on the full-stack autonomous driving model and demonstrate that our attack method outperforms previous attack methods. We trust that our research will offer fresh insights into ensuring the safety and reliability of autonomous driving systems.
Alessandro Perlo, Giordano Paoletti, Nikhil Jha, Luca Vassio, Jussara Almeida, Marco Mellia
Although currently one of the most popular instant messaging apps worldwide, Telegram has been largely understudied in the past years. In this paper, we aim to address this gap by presenting an analysis of publicly accessible groups covering discussions encompassing different topics, as diverse as Education, Erotic, Politics, and Cryptocurrencies. We engineer and offer an open-source tool to automate the collection of messages from Telegram groups, a non-straightforward problem. We use it to collect more than 50 million messages from 669 groups. Here, we present a first-of-its-kind, per-topic analysis, contrasting the characteristics of the messages sent on the platform from different angles -- the language, the presence of bots, the type and volume of shared media content. Our results confirm some anecdotal evidence, e.g., clues that Telegram is used to share possibly illicit content, and unveil some unexpected findings, e.g., the different sharing patterns of video and stickers in groups of different topics. While preliminary, we hope that our work paves the road for several avenues of future research on the understudied Telegram platform.
Hengyu Zhou, Hui Zhang, Bin Wang
The widespread use of vector graphics creates a significant demand for vectorization methods. While recent learning-based techniques have shown their capability to create vector images of clear topology, filling these primitives with gradients remains a challenge. In this paper, we propose a segmentation-guided vectorization framework to convert raster images into concise vector graphics with radial gradient fills. With the guidance of an embedded gradient-aware segmentation subroutine, our approach progressively appends gradient-filled Bézier paths to the output, where primitive parameters are initialized with our newly designed initialization technique and are optimized to minimize our novel loss function. We build our method on a differentiable renderer with traditional segmentation algorithms to develop it as a model-free tool for raster-to-vector conversion. It is tested on various inputs to demonstrate its feasibility, independent of datasets, to synthesize vector graphics with improved visual quality and layer-wise topology compared to prior work.
Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes et al.
Dysfluent speech detection is the bottleneck for disordered speech analysis
and spoken language learning. Current state-of-the-art models are governed by
rule-based systems which lack efficiency and robustness, and are sensitive to
template design. In this paper, we propose YOLO-Stutter: the first end-to-end
method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes
imperfect speech-text alignment as input, followed by a spatial feature
aggregator, and a temporal dependency extractor to perform region-wise boundary
and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter
and VCTK-TTS, that simulate natural spoken dysfluencies including repetition,
block, missing, replacement, and prolongation. Our end-to-end method achieves
state-of-the-art performance with a minimum number of trainable parameters on
both simulated data and real aphasia speech. Code and datasets are
open-sourced at https://github.com/rorizzz/YOLO-Stutter
Authors' comments: Interspeech 2024
Muyao Wang, Zeke Xie, Bo Chen
The influence function, a technique from robust statistics, measures the impact on model parameters or related functions when training data is removed or modified. This effective and valuable post-hoc method allows for studying the interpretability of machine learning models without requiring costly model retraining. It also supports applications such as increasing model performance, improving model generalization, and offering interpretability. Recently, Multivariate Time Series (MTS) analysis has become an important yet challenging task, attracting significant attention. However, no prior research has studied influence functions for MTS to shed light on the effects of modifying a channel of the training MTS. Given that each channel in an MTS plays a crucial role in its analysis, it is essential to characterize the influence of different channels. To fill this gap, we propose a channel-wise influence function, which is the first method that can estimate the influence of different channels in MTS, utilizing a first-order gradient approximation that leverages the more informative average gradient of the data set. Additionally, we demonstrate how this influence function can be used to estimate the impact of a channel in MTS. Finally, we validate the accuracy and effectiveness of our influence estimation function in critical MTS analysis tasks, such as MTS anomaly detection and MTS forecasting. In extensive experiments on real-world datasets, the original influence function performs worse than our method and even fails on the channel pruning problem, which demonstrates the superiority and necessity of the channel-wise influence function in MTS analysis tasks.
Jenni Raitoharju
This paper proposes an easy-to-use method for one-class classification:
Repeated Element-wise Folding (REF). The algorithm consists of repeatedly
standardizing and applying an element-wise folding operation on the one-class
training data. Equivalent mappings are performed on unknown test items and the
classification prediction is based on the item's distance to the origin of the
final distribution. As all the included operations have linear time complexity,
the proposed algorithm provides a linear-time alternative for the commonly used
computationally much more demanding approaches. Furthermore, REF can avoid the
challenges of hyperparameter setting in one-class classification by providing
robust default settings. The experiments show that the proposed method can
produce similar classification performance or even outperform the more complex
algorithms on various benchmark datasets. Matlab codes for REF are publicly
available at https://github.com/JenniRaitoharju/REF.
Authors' comments: Accepted to EUSIPCO 2024
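The REF procedure described above can be sketched in a few lines of Python (the authors' released code is in Matlab). The sketch assumes that the element-wise folding operation is the absolute value and that five iterations are used; both are assumptions made here, as the abstract does not state them, and the function names are hypothetical.

```python
import numpy as np

def ref_fit(train, n_iter=5, eps=1e-12):
    """Learn per-iteration means and standard deviations from one-class data."""
    x = np.asarray(train, dtype=float)
    stats = []
    for _ in range(n_iter):
        mu, sigma = x.mean(axis=0), x.std(axis=0) + eps
        stats.append((mu, sigma))
        x = np.abs((x - mu) / sigma)  # standardize, then fold (assumed: abs)
    return stats

def ref_score(items, stats):
    """Apply the stored mappings; return each item's distance to the origin."""
    x = np.asarray(items, dtype=float)
    for mu, sigma in stats:
        x = np.abs((x - mu) / sigma)
    return np.linalg.norm(x, axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
stats = ref_fit(train)
train_scores = ref_score(train, stats)
outlier_score = ref_score(np.array([[50.0, 50.0]]), stats)[0]
```

Every step is an element-wise operation or a column statistic, so fitting and scoring are linear in the number of samples. A decision threshold on the score, for example a high percentile of `train_scores`, would still need to be chosen.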