Yi-Jun Chang, Yanyu Chen, Gopinath Mishra
We consider the problem of constructing distributed overlay networks, where nodes in a reconfigurable system can create or sever connections with nodes whose identifiers they know. Initially, each node knows only its own and its neighbors' identifiers, forming a local channel, while the evolving structure is termed the global channel. The goal is to reconfigure any connected graph into a desired topology, such as a bounded-degree expander graph or a well-formed tree with a constant maximum degree and logarithmic diameter, minimizing the total number of rounds and message complexity. This problem mirrors real-world peer-to-peer network construction, where creating robust and efficient systems is desired. We study the overlay reconstruction problem in a network of $n$ nodes in two models: \textbf{GOSSIP-reply} and \textbf{HYBRID}. In the \textbf{GOSSIP-reply} model, each node can send a message and receive a corresponding reply message in one round. In the \textbf{HYBRID} model, a node can send $O(1)$ messages to each neighbor in the local channel and a total of $O(\log n)$ messages in the global channel. In both models, we propose protocols with $O\left(\log^2 n\right)$ round complexities and $O\left(n \log^2 n\right)$ message complexities using messages of $O(\log n)$ bits. Both protocols use $O\left(n \log^3 n\right)$ bits of communication, which we conjecture to be optimal. Additionally, our approach ensures that the total number of messages for node $v$, with degree $\deg(v)$ in the initial topology, is bounded by $O\left(\deg(v) + \log^2 n\right)$ with high probability.
Jacob Marks, Brent A. Griffin, Jason J. Corso
We introduce a new framework for analyzing classification datasets based on
the ratios of reconstruction errors between autoencoders trained on individual
classes. This analysis framework enables efficient characterization of datasets
on the sample, class, and entire dataset levels. We define reconstruction error
ratios (RERs) that probe classification difficulty and allow its decomposition
into (1) finite sample size and (2) Bayes error and decision-boundary
complexity. Through systematic study across 19 popular visual datasets, we find
that our RER-based dataset difficulty probe strongly correlates with error rate
for state-of-the-art (SOTA) classification models. By interpreting sample-level
classification difficulty as a label mistakenness score, we further find that
RERs achieve SOTA performance on mislabel detection tasks on hard datasets
under symmetric and asymmetric label noise. Our code is publicly available at
https://github.com/voxel51/reconstruction-error-ratios.
Authors' comments: 30 pages, 18 figures
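As an illustration of the idea in the abstract above, here is a minimal sketch of a sample-level reconstruction error ratio. It assumes one autoencoder per class has already produced a reconstruction error for the sample, and it assumes the ratio is taken between the error under the labeled class and the smallest error under any other class. The function name and this exact definition are assumptions for illustration, not the authors' code.

```python
def reconstruction_error_ratio(errors, label):
    """Ratio of the labeled class's reconstruction error to the best other class.

    errors: dict mapping class name -> reconstruction error of ONE sample
            under the autoencoder trained on that class (hypothetical input).
    label:  the class the sample is labeled with.
    """
    own_error = errors[label]
    best_other = min(err for cls, err in errors.items() if cls != label)
    return own_error / best_other


# A sample labeled "cat" that the "cat" autoencoder reconstructs best.
sample_errors = {"cat": 0.2, "dog": 0.4, "bird": 0.8}
rer_if_cat = reconstruction_error_ratio(sample_errors, "cat")
rer_if_dog = reconstruction_error_ratio(sample_errors, "dog")
```

Under this assumed definition, a ratio well below 1 suggests an easy, consistently labeled sample, while a ratio above 1 flags a sample that another class's autoencoder reconstructs better, which is the kind of signal a mislabel detector could threshold.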
Michele De Vita, Vasileios Belagiannis
Despite the remarkable progress in generative modelling, current diffusion
models lack a quantitative approach to assess image quality. To address this
limitation, we propose to estimate the pixel-wise aleatoric uncertainty during
the sampling phase of diffusion models and utilise the uncertainty to improve
the sample generation quality. The uncertainty is computed as the variance of
the denoising scores with a perturbation scheme that is specifically designed
for diffusion models. We then show that the aleatoric uncertainty estimates are
related to the second-order derivative of the diffusion noise distribution. We
evaluate our uncertainty estimation algorithm and the uncertainty-guided
sampling on the ImageNet and CIFAR-10 datasets. In our comparisons with the
related work, we demonstrate promising results in filtering out low quality
samples. Furthermore, we show that our guided approach leads to better sample
generation in terms of FID scores.
Authors' comments: Accepted at WACV 2025
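The variance-of-scores idea in the abstract above can be sketched as follows. The abstract does not specify the perturbation scheme, so the perturbations are passed in as plain offsets, and `score_fn` is a hypothetical stand-in for a denoising network operating on a flattened image. This is a toy illustration under those assumptions.

```python
def pixelwise_score_variance(score_fn, x, perturbations):
    """Per-pixel variance of score_fn over perturbed copies of x.

    score_fn:      hypothetical denoiser, maps a flat list of pixels to a
                   flat list of scores of the same length.
    x:             flat list of pixel values.
    perturbations: list of offsets, each the same length as x.
    """
    scores = [
        score_fn([p + d for p, d in zip(x, delta)])
        for delta in perturbations
    ]
    n = len(scores)
    means = [sum(col) / n for col in zip(*scores)]
    # Population variance per pixel across the perturbed evaluations.
    return [
        sum((s - m) ** 2 for s in col) / n
        for col, m in zip(zip(*scores), means)
    ]


toy_score = lambda pixels: [2.0 * p for p in pixels]
uncertainty = pixelwise_score_variance(
    toy_score, [1.0, 2.0], [[0.0, 1.0], [0.0, -1.0]]
)
```

Pixels whose score barely moves under perturbation get low variance, and pixels whose score swings get high variance; such a map could then be used to rank or filter samples.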
Dylan Clark-Boucher, Brent A Coull, Harrison T Reeder, Fenglei Wang, Qi Sun, Jacqueline R Starr, Kyu Ha Lee
A key challenge in differential abundance analysis of microbial samples is that the counts for each sample are compositional, resulting in biased comparisons of the absolute abundance across study groups. Normalization-based differential abundance analysis methods rely on external normalization factors that account for the compositionality by standardizing the counts onto a common numerical scale. However, existing normalization methods have struggled to maintain the false discovery rate in settings where the variance or compositional bias is large. This article proposes a novel framework for normalization that can reduce bias in differential abundance analysis by re-conceptualizing normalization as a group-level task. We present two normalization methods within the group-wise framework: group-wise relative log expression (G-RLE) and fold-truncated sum scaling (FTSS). G-RLE and FTSS achieve higher statistical power for identifying differentially abundant taxa than existing methods in model-based and synthetic data simulation settings, while maintaining the false discovery rate in challenging scenarios where existing methods suffer. The best results are obtained from using FTSS normalization with the differential abundance analysis method MetagenomeSeq. Code for implementing the methods and replicating the analysis can be found at our GitHub page (https://github.com/dclarkboucher/microbiome_groupwise_normalization).
Sanghyun Byun, Kayvan Shah, Ayushi Gang, Christopher Apton, Jacob Song, Woo Seong Chung
Many state-of-the-art computer vision architectures leverage U-Net for its adaptability and efficient feature extraction. However, the multi-resolution convolutional design often leads to significant computational demands, limiting deployment on edge devices. We present a streamlined alternative: a 1D convolutional encoder that retains accuracy while enhancing its suitability for edge applications. Our novel encoder architecture achieves semantic segmentation through channel-wise 1D convolutions combined with pixel-unshuffle operations. By incorporating PixelShuffle, known for improving accuracy in super-resolution tasks while reducing computational load, OneNet captures spatial relationships without requiring 2D convolutions, reducing parameters by up to 47%. Additionally, we explore a fully 1D encoder-decoder that achieves a 71% reduction in size, albeit with some accuracy loss. We benchmark our approach against U-Net variants across diverse mask-generation tasks, demonstrating that it preserves accuracy effectively. Although focused on image segmentation, this architecture is adaptable to other convolutional applications. Code for the project is available at https://github.com/shbyun080/OneNet .
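The pixel-unshuffle operation mentioned in the abstract above rearranges spatial blocks into channels, which is what lets a channel-wise operator see neighboring pixels. A minimal pure-Python sketch follows, using nested lists instead of tensors. The channel ordering is chosen to match the usual space-to-depth convention, and the function name is illustrative rather than taken from the OneNet code.

```python
def pixel_unshuffle(image, r):
    """Rearrange a (C, H, W) nested list into (C*r*r, H//r, W//r).

    Each r x r spatial block of a channel is spread across r*r output
    channels, so spatial neighbors become channel neighbors.
    """
    out = []
    for channel in image:
        height, width = len(channel), len(channel[0])
        assert height % r == 0 and width % r == 0, "H and W must divide by r"
        for i in range(r):          # row offset inside each block
            for j in range(r):      # column offset inside each block
                out.append([
                    [channel[y * r + i][x * r + j] for x in range(width // r)]
                    for y in range(height // r)
                ])
    return out


image = [[[0, 1, 2, 3],
          [4, 5, 6, 7],
          [8, 9, 10, 11],
          [12, 13, 14, 15]]]
unshuffled = pixel_unshuffle(image, 2)  # 1 channel of 4x4 -> 4 channels of 2x2
```

After this rearrangement, a 1D convolution along the channel axis mixes values that were originally adjacent in the 2D image, which is one way to read the abstract's claim that spatial relationships can be captured without 2D convolutions.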
Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell et al.
Although AI has become increasingly smart, its wisdom has not kept pace. In
this article, we examine what is known about human wisdom and sketch a vision
of its AI counterpart. We analyze human wisdom as a set of strategies for
solving intractable problems (those outside the scope of analytic
techniques), including both object-level strategies like heuristics [for managing
problems] and metacognitive strategies like intellectual humility,
perspective-taking, or context-adaptability [for managing object-level
strategies]. We argue that AI systems particularly struggle with metacognition;
improved metacognition would lead to AI more robust to novel environments,
explainable to users, cooperative with others, and safer in risking fewer
misaligned goals with human users. We discuss how wise AI might be benchmarked,
trained, and implemented.
Authors' comments: 23 pages, 1 figure, 3 tables
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster
Large language models (LLMs) are expensive to deploy. Parameter sharing
offers a possible path towards reducing their size and cost, but its
effectiveness in modern LLMs remains fairly limited. In this work, we revisit
"layer tying" as a form of parameter sharing in Transformers, and introduce novel
methods for converting existing LLMs into smaller "Recursive Transformers" that
share parameters across layers, with minimal loss of performance. Here, our
Recursive Transformers are efficiently initialized from standard pretrained
Transformers, but only use a single block of unique layers that is then
repeated multiple times in a loop. We further improve performance by
introducing Relaxed Recursive Transformers that add flexibility to the layer
tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still
preserve the compactness of the overall model. We show that our recursive
models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla
pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge
distillation baselines -- and can even recover most of the performance of the
original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally,
we propose Continuous Depth-wise Batching, a promising new inference paradigm
enabled by the Recursive Transformer when paired with early exiting. In a
theoretical analysis, we show that this has the potential to lead to
significant (2-3x) gains in inference throughput.
Authors' comments: ICLR 2025; 49 pages, 17 figures, 19 tables
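The layer-tying idea in the abstract above can be sketched with a toy linear "layer": one shared weight matrix is applied in a loop, and each loop iteration adds its own low-rank (LoRA-style) correction. All names here are hypothetical, and the real models use full Transformer blocks with nonlinearities; this only illustrates how the parameters are shared.

```python
def matvec(matrix, vector):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_delta(b, a):
    """Low-rank update B @ A, with B of shape d x r and A of shape r x d."""
    rank, cols = len(a), len(a[0])
    return [
        [sum(b_row[k] * a[k][j] for k in range(rank)) for j in range(cols)]
        for b_row in b
    ]


def relaxed_recursive_forward(x, shared_w, lora_per_loop):
    """Apply the SAME shared_w once per loop, plus a per-loop low-rank delta."""
    for b, a in lora_per_loop:
        delta = lora_delta(b, a)
        effective = [
            [w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(shared_w, delta)
        ]
        x = matvec(effective, x)
    return x


shared_w = [[1, 1], [0, 1]]
no_lora = ([[0], [0]], [[0, 0]])      # rank-1 factors that add nothing
some_lora = ([[1], [0]], [[0, 1]])    # rank-1 factors that adjust one entry
tied = relaxed_recursive_forward([1, 2], shared_w, [no_lora, no_lora])
relaxed = relaxed_recursive_forward([1, 2], shared_w, [no_lora, some_lora])
```

With zero LoRA factors the loop reduces to strict layer tying. The per-loop factors add only `2 * d * r` parameters per iteration, which is how the relaxed variant can differ across depth while staying compact.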
Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen
The development of large language models (LLMs) has significantly expanded
model sizes, resulting in substantial GPU memory requirements during inference.
The key and value storage of the attention map in the KV (key-value) cache
accounts for more than 80\% of this memory consumption. Nowadays, most existing
KV cache compression methods focus on intra-layer compression within a single
Transformer layer, but few works consider layer-wise compression. In this paper,
we propose a plug-and-play method called \textit{KVSharer}, which shares the KV
cache between layers to achieve layer-wise compression. Rather than intuitively
sharing based on higher similarity, we discover a counterintuitive phenomenon:
sharing dissimilar KV caches better preserves the model performance.
Experiments show that \textit{KVSharer} can reduce KV cache computation by
30\%, thereby lowering memory consumption without significantly impacting model
performance, and it can also achieve at least 1.3 times generation acceleration.
Additionally, we verify that \textit{KVSharer} is compatible with existing
intra-layer KV cache compression methods, and combining both can further save
memory.
Authors' comments: Under Review by ICLR2025
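To make the layer-wise sharing idea in the abstract above concrete, the sketch below builds a layer-to-cache mapping by greedily pairing the most dissimilar layers first. The greedy rule, the dissimilarity matrix, and the function name are all assumptions for illustration; the abstract does not describe the actual search procedure, and any check against the original model's outputs is omitted here.

```python
def build_sharing_map(dissimilarity, num_to_share):
    """Map each layer to the layer whose KV cache it will read.

    dissimilarity: symmetric n x n matrix (hypothetical), larger means the
                   two layers' KV caches differ more.
    num_to_share:  how many layers should give up their own cache.
    """
    n = len(dissimilarity)
    pairs = sorted(
        ((dissimilarity[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,  # most dissimilar pairs first
    )
    source = {layer: layer for layer in range(n)}
    used_as_source = set()
    shared = 0
    for _, i, j in pairs:
        if shared == num_to_share:
            break
        # Layer j may drop its cache only if it still owns one, nobody reads
        # from it, and layer i still owns the cache j would read.
        if source[i] == i and source[j] == j and j not in used_as_source:
            source[j] = i
            used_as_source.add(i)
            shared += 1
    return source


d = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.2, 0.7],
    [0.5, 0.2, 0.0, 0.3],
    [0.9, 0.7, 0.3, 0.0],
]
sharing = build_sharing_map(d, num_to_share=2)
stored_caches = len(set(sharing.values()))
```

In this toy run, two of the four layers reuse layer 0's cache, so only two caches need to be stored.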
Qi Bing, Chaoyi Zhang, Weidong Cai
In contrast to the well-established technique of rasterization, vectorization
of images poses a significant challenge in the field of computer graphics.
Recent learning-based methods for converting raster images to vector formats
frequently suffer from incomplete shapes, redundant path prediction, and a lack
of accuracy in preserving the semantics of the original content. These
shortcomings severely hinder the utility of these methods for further editing
and manipulation of images. To address these challenges, we present DeepIcon, a
novel hierarchical image vectorization network specifically tailored for
generating variable-length icon vector graphics based on the raster image
input. Our experimental results indicate that DeepIcon can efficiently produce
Scalable Vector Graphics (SVGs) directly from raster images, bypassing the need
for a differentiable rasterizer while also demonstrating a profound
understanding of the image contents.
Authors' comments: Accepted as Oral Presentation at DICTA 2024
Zihan Chen, Bike Xie, Jundong Li, Cong Shen
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on different sizes of LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ thus represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
Shen Yuan, Hongteng Xu
The Transformer plays a central role in many fundamental deep learning models,
e.g., ViT in computer vision and BERT and GPT in natural language
processing, whose effectiveness is mainly attributed to its multi-head
attention (MHA) mechanism. In this study, we propose a simple and novel
channel-wise sample permutation (CSP) operator, achieving a new structured MHA
with fewer parameters and lower complexity. Given an input matrix, CSP
circularly shifts the samples of different channels with various steps and then
sorts grouped samples of each channel. This operator is equivalent to
implicitly implementing cross-channel attention maps as permutation matrices,
which achieves linear complexity and suppresses the risk of rank collapse when
representing data. We replace the MHA of some representative models with CSP
and test the CSP-based models in several discriminative tasks, including image
classification and long sequence analysis. Experiments show that the CSP-based
models achieve comparable or better performance with fewer parameters and lower
computational costs than the classic Transformer and its state-of-the-art
variants. The code is available at https://github.com/DaShenZi721/CSP.
Authors' comments: 18 pages, 4 figures
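The two steps the abstract above attributes to CSP (circularly shifting each channel by its own step, then sorting samples within groups) can be sketched on a small matrix. The shift steps, the group size, and the function name are illustrative assumptions; the authors' repository should be consulted for the actual operator.

```python
def channelwise_sample_permutation(x, shifts, group_size):
    """Toy CSP: per-channel circular shift, then per-group sort.

    x:          list of N samples, each a list of C channel values.
    shifts:     one shift step per channel (hypothetical choice).
    group_size: number of consecutive samples sorted together; must divide N.
    """
    n = len(x)
    assert n % group_size == 0, "group_size must divide the number of samples"
    columns = [list(col) for col in zip(*x)]  # one list per channel
    out_columns = []
    for col, shift in zip(columns, shifts):
        s = shift % n
        rolled = col[-s:] + col[:-s] if s else col[:]
        grouped = []
        for start in range(0, n, group_size):
            grouped.extend(sorted(rolled[start:start + group_size]))
        out_columns.append(grouped)
    return [list(row) for row in zip(*out_columns)]  # back to N x C


x = [[3, 10], [1, 30], [2, 20], [0, 40]]
y = channelwise_sample_permutation(x, shifts=[0, 1], group_size=2)
```

Both steps only reorder the values within each channel, so the operator acts like a permutation matrix per channel and needs no learned attention weights, which is consistent with the reduced parameter count the abstract describes.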
Siyuan Huang, Yunchong Song, Jiayue Zhou, Zhouhan Lin
In the realm of graph learning, there is a category of methods that
conceptualize graphs as hierarchical structures, utilizing node clustering to
capture broader structural information. While generally effective, these
methods often rely on a fixed graph coarsening routine, leading to overly
homogeneous cluster representations and loss of node-level information. In this
paper, we envision the graph as a network of interconnected node sets without
compressing each cluster into a single embedding. To enable effective
information transfer among these node sets, we propose the Node-to-Cluster
Attention (N2C-Attn) mechanism. N2C-Attn incorporates techniques from Multiple
Kernel Learning into the kernelized attention framework, effectively capturing
information at both node and cluster levels. We then devise an efficient form
for N2C-Attn using the cluster-wise message-passing framework, achieving linear
time complexity. We further analyze how N2C-Attn combines bi-level feature maps
of queries and keys, demonstrating its capability to merge dual-granularity
information. The resulting architecture, Cluster-wise Graph Transformer
(Cluster-GT), which uses node clusters as tokens and employs our proposed
N2C-Attn module, shows superior performance on various graph-level tasks. Code
is available at https://github.com/LUMIA-Group/Cluster-wise-Graph-Transformer.
Authors' comments: Accepted as NeurIPS 2024 Spotlight
Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran
Transferability of adversarial examples is a well-known property that endangers all classification models, even those that are only accessible through black-box queries. Prior work has shown that an ensemble of models is more resilient to transferability: the probability that an adversarial example is effective against most models of the ensemble is low. Thus, most ongoing research focuses on improving ensemble diversity. Another line of prior work has shown that Lipschitz continuity of the models can make models more robust since it limits how a model's output changes with small input perturbations. In this paper, we study the effect of Lipschitz continuity on transferability rates. We show that although a lower Lipschitz constant increases the robustness of a single model, it is not as beneficial in training robust ensembles as it increases the transferability rate of adversarial examples across models in the ensemble. Therefore, we introduce LOTOS, a new training paradigm for ensembles, which counteracts this adverse effect. It does so by promoting orthogonality among the top-$k$ sub-spaces of the transformations of the corresponding affine layers of any pair of models in the ensemble. We theoretically show that $k$ does not need to be large for convolutional layers, which makes the computational overhead negligible. Through various experiments, we show LOTOS increases the robust accuracy of ensembles of ResNet-18 models by $6$ percentage points (p.p.) against black-box attacks on CIFAR-10. It is also capable of combining with the robustness of prior state-of-the-art methods for training robust ensembles to enhance their robust accuracy by $10.7$ p.p.
Fei Liu, Yang Ai, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, Zhen-Hua Ling
This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase
Prediction (SP-NSPP) model, which predicts the phase spectrum from input
amplitude spectrum by two-stage neural networks. In the initial
prior-construction stage, we preliminarily predict a rough prior phase spectrum
from the amplitude spectrum. The subsequent refinement stage transforms the
amplitude spectrum into a refined high-quality phase spectrum conditioned on
the prior phase. Networks in both stages use ConvNeXt v2 blocks as the backbone
and adopt adversarial training by innovatively introducing a phase spectrum
discriminator (PSD). To further improve the continuity of the refined phase, we
also incorporate a time-frequency integrated difference (TFID) loss in the
refinement stage. Experimental results confirm that, compared to neural
network-based no-prior phase prediction methods, the proposed SP-NSPP achieves
higher phase prediction accuracy, thanks to introducing the coarse phase priors
and diverse training criteria. Compared to iterative phase estimation
algorithms, our proposed SP-NSPP does not require multiple rounds of staged
iterations, resulting in higher generation efficiency.
Authors' comments: Accepted by SLT2024
Ismail Alkhouri, Shijun Liang, Cheng-Han Huang, Jimmy Dai, Qing Qu, Saiprasad Ravishankar, Rongrong Wang
Diffusion models (DMs) are a class of generative models that allow sampling from a distribution learned over a training set. When applied to solving inverse imaging problems (IPs), the reverse sampling steps of DMs are typically modified to approximately sample from a measurement-conditioned distribution in the image space. However, these modifications may be unsuitable for certain settings (such as in the presence of measurement noise) and non-linear tasks, as they often struggle to correct errors from earlier sampling steps and generally require a large number of optimization and/or sampling steps. To address these challenges, we state three conditions for achieving measurement-consistent diffusion trajectories. Building on these conditions, we propose a new optimization-based sampling method that not only enforces the standard data manifold measurement consistency and forward diffusion consistency, as seen in previous studies, but also incorporates backward diffusion consistency that maintains a diffusion trajectory by optimizing over the input of the pre-trained model at every sampling step. By enforcing these conditions, either implicitly or explicitly, our sampler requires significantly fewer reverse steps. Therefore, we refer to our accelerated method as Step-wise Triple-Consistent Sampling (SITCOM). Compared to existing state-of-the-art baseline methods, under different levels of measurement noise, our extensive experiments across five linear and three non-linear image restoration tasks demonstrate that SITCOM achieves competitive or superior results in terms of standard image similarity metrics while requiring a significantly reduced run-time across all considered tasks.
Urszula Jessen, Dirk Fahland
Anomalies in complex industrial processes are often obscured by high variability and complexity of event data, which hinders their identification and interpretation using process mining. To address this problem, we introduce WISE (Weighted Insights for Evaluating Efficiency), a novel method for analyzing business process metrics through the integration of domain knowledge, process mining, and machine learning. The methodology involves defining business goals and establishing Process Norms with weighted constraints at the activity level, incorporating input from domain experts and process analysts. Individual process instances are scored based on these constraints, and the scores are normalized to identify features impacting process goals. Evaluation using the BPIC 2019 dataset and real industrial contexts demonstrates that WISE enhances automation in business process analysis and effectively detects deviations from desired process flows. While LLMs support the analysis, the inclusion of domain experts ensures the accuracy and relevance of the findings.
Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang
Diffusion transformers have shown significant effectiveness in both image and
video synthesis at the expense of huge computation costs. To address this
problem, feature caching methods have been introduced to accelerate diffusion
transformers by caching the features in previous timesteps and reusing them in
the following timesteps. However, previous caching methods ignore that
different tokens exhibit different sensitivities to feature caching, and
feature caching on some tokens may lead to 10$\times$ more destruction to the
overall generation quality compared with other tokens. In this paper, we
introduce token-wise feature caching, allowing us to adaptively select the most
suitable tokens for caching, and further enable us to apply different caching
ratios to neural layers in different types and depths. Extensive experiments on
PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image
and video generation with no requirements for training. For instance,
2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and
PixArt-$\alpha$ with almost no drop in generation quality.
Authors' comments: ToCa is honored to be accepted by ICLR 2025
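The token-wise caching described in the abstract above can be sketched as two small steps: pick the tokens least sensitive to caching, then reuse their previous features while recomputing the rest. The sensitivity scores, the selection rule, and both function names are hypothetical; the abstract does not say how sensitivity is measured.

```python
def select_cached_tokens(sensitivity, cache_ratio):
    """Return indices of the tokens whose features will be reused.

    sensitivity: one score per token (hypothetical), higher means caching
                 that token hurts generation quality more.
    cache_ratio: fraction of tokens to serve from the cache.
    """
    num_cached = int(len(sensitivity) * cache_ratio)
    ranked = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    return sorted(ranked[:num_cached])


def mixed_step(cached_features, recompute, cached_idx):
    """Reuse cached features for cached_idx, recompute every other token."""
    cached = set(cached_idx)
    return [
        cached_features[i] if i in cached else recompute(i)
        for i in range(len(cached_features))
    ]


scores = [0.9, 0.1, 0.5, 0.2]
reuse = select_cached_tokens(scores, cache_ratio=0.5)
features = mixed_step(["a", "b", "c", "d"], lambda i: f"new{i}", reuse)
```

Passing a different `cache_ratio` per layer would correspond to the abstract's point about applying different caching ratios to layers of different types and depths.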
Sawinder Kaur, Avery Gump, Yi Xiao, Jingyu Xin, Harshit Sharma, Nina R Benway, Jonathan L Preston, Asif Salekin
Advances in deep learning and the Internet of Things have led to diverse
human sensing applications. However, distinct patterns in human sensing,
influenced by various factors or contexts, challenge the generic neural network
model's performance due to natural distribution shifts. To address this,
personalization tailors models to individual users. Yet most personalization
studies overlook intra-user heterogeneity across contexts in sensory data,
limiting intra-user generalizability. This limitation is especially critical in
clinical applications, where limited data availability hampers both
generalizability and personalization. Notably, intra-user sensing attributes
are expected to change due to external factors such as treatment progression,
further complicating the challenges. To address the intra-user generalization
challenge, this work introduces CRoP, a novel static personalization approach.
CRoP leverages off-the-shelf pre-trained models as generic starting points and
captures user-specific traits through adaptive pruning on a minimal sub-network
while allowing generic knowledge to be incorporated in remaining parameters.
CRoP demonstrates superior personalization effectiveness and intra-user
robustness across four human-sensing datasets, including two from real-world
health domains, underscoring its practical and social impact. Additionally, to
support CRoP's generalization ability and design choices, we provide empirical
justification through gradient inner product analysis, ablation studies, and
comparisons against state-of-the-art baselines.
Authors' comments: 34 pages, 6 figures and 15 tables
Hongtao Huang, Xiaojun Chang, Lina Yao
Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images. Despite their effectiveness, these models often require significant computational resources owing to their numerous sequential denoising steps and the significant inference cost of each step. Recently, Neural Architecture Search (NAS) techniques have been employed to automatically search for faster generation processes. However, NAS for diffusion is inherently time-consuming as it requires estimating thousands of diffusion models to search for the optimal one. In this paper, we introduce Flexiffusion, a novel training-free NAS paradigm designed to accelerate diffusion models by concurrently optimizing generation steps and network structures. Specifically, we partition the generation process into isometric step segments, each sequentially composed of a full step, multiple partial steps, and several null steps. The full step computes all network blocks, while the partial step involves part of the blocks, and the null step entails no computation. Flexiffusion autonomously explores flexible step combinations for each segment, substantially reducing search costs and enabling greater acceleration compared to the state-of-the-art (SOTA) method for diffusion models. Our searched models reported speedup factors of $2.6\times$ and $1.5\times$ for the original LDM-4-G and the SOTA, respectively. The factors for Stable Diffusion V1.5 and the SOTA are $5.1\times$ and $2.0\times$. We also verified the performance of Flexiffusion on multiple datasets, and positive experiment results indicate that Flexiffusion can effectively reduce redundancy in diffusion models.
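The segment structure described in the abstract above (one full step, then some partial steps, then null steps) can be sketched as a schedule builder with a simple cost count. The per-segment partial counts, the block counts, and the function names are illustrative assumptions, not the search space or cost model used by Flexiffusion.

```python
def build_schedule(segment_len, partials_per_segment):
    """Build a step schedule from equal-length segments.

    Each segment is: 1 full step, then n partial steps, then null steps.
    partials_per_segment: one partial-step count per segment (hypothetical
                          search variable).
    """
    schedule = []
    for n_partial in partials_per_segment:
        assert 0 <= n_partial <= segment_len - 1
        n_null = segment_len - 1 - n_partial
        schedule += ["full"] + ["partial"] * n_partial + ["null"] * n_null
    return schedule


def schedule_cost(schedule, num_blocks, partial_blocks):
    """Count network blocks evaluated: all, some, or none per step type."""
    per_step = {"full": num_blocks, "partial": partial_blocks, "null": 0}
    return sum(per_step[step] for step in schedule)


schedule = build_schedule(segment_len=3, partials_per_segment=[1, 0])
cost = schedule_cost(schedule, num_blocks=10, partial_blocks=4)
baseline = schedule_cost(["full"] * len(schedule), num_blocks=10, partial_blocks=4)
```

In this toy setting the searched schedule evaluates 24 blocks against 60 for six full steps, which shows the kind of trade-off a search over step combinations would be exploring.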
Liangyu Zhong, Joachim Sicking, Fabian Hüger, Hanno Gottschalk
Semantic segmentation networks have achieved significant success under the
assumption of independent and identically distributed data. However, these
networks often struggle to detect anomalies from unknown semantic classes due
to the limited set of visual concepts they are typically trained on. To address
this issue, anomaly segmentation often involves fine-tuning on outlier samples,
necessitating additional efforts for data collection, labeling, and model
retraining. Seeking to avoid this cumbersome work, we take a different approach
and propose to incorporate Vision-Language (VL) encoders into existing anomaly
detectors to leverage the semantically broad VL pre-training for improved
outlier awareness. Additionally, we propose a new scoring function that enables
data- and training-free outlier supervision via textual prompts. The resulting
VL4AD model, which includes max-logit prompt ensembling and a class-merging
strategy, achieves competitive performance on widely used benchmark datasets,
thereby demonstrating the potential of vision-language models for pixel-wise
anomaly detection.
Authors' comments: 27 pages, 9 figures, to be published in ECCV 2024 2nd Workshop on
Vision-Centric Autonomous Driving (VCAD)