Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
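The reuse rule in the abstract above (skip a block's computation when its features change little between adjacent timesteps) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the relative L1 metric, the one-step reuse policy, and the names `relative_l1` and `run_with_cache` are all hypothetical choices made for the example.

```python
def relative_l1(curr, prev):
    """Sum of absolute changes between two feature vectors, relative to |prev|.

    Assumption: a relative L1 difference stands in for the paper's similarity indicator.
    """
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1e-12  # avoid division by zero
    return num / den


def run_with_cache(block, inputs, threshold):
    """Apply `block` to each timestep input, reusing the cached output for one
    step whenever the two most recently computed outputs differ by < threshold."""
    outputs, computed = [], []
    last = None          # most recently computed block output (the cache)
    reuse_next = False   # set when the indicator signals redundancy
    for x in inputs:
        if reuse_next:
            outputs.append(last)      # reuse the cached features, skip the block
            computed.append(False)
            reuse_next = False
            continue
        y = block(x)
        if last is not None and relative_l1(y, last) < threshold:
            reuse_next = True
        last = y
        outputs.append(y)
        computed.append(True)
    return outputs, computed


if __name__ == "__main__":
    toy_block = lambda x: [2.0 * v for v in x]  # hypothetical stand-in for a DiT block
    steps = [[1.0, 1.0], [1.0, 1.01], [5.0, 5.0], [9.0, 9.0]]
    _, computed = run_with_cache(toy_block, steps, threshold=0.05)
    print(computed)
```

In this toy run the third step is served from the cache because the first two computed outputs are nearly identical. A real system would also need a policy for how many consecutive steps may be reused, which the abstract does not specify.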
Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui
Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu
KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava.
Qihua Zhu, Mingshuo Liu, Yuefeng Han, Doudou Zhou
We propose a nonparametric test for serial independence that aggregates pairwise similarities of observations with lag-dependent weights. The resulting statistic is powerful against general forms of temporal dependence, including nonlinear and uncorrelated alternatives, and applies to ultra-high-dimensional and non-Euclidean data. We derive asymptotic normality under both permutation and population nulls, and establish consistency in classical large-sample and high-dimension-low-sample-size (HDLSS) regimes. The test therefore provides the first theoretical power guarantees for serial independence in the HDLSS setting. Simulations demonstrate accurate size and strong power against a wide range of alternatives, showing significant power improvement over existing methods under various high-dimensional time series models. An application to spatio-temporal data illustrates the method's utility for non-Euclidean observations.
Erica Cooper, Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai
While supervised quality predictors for synthesized speech have demonstrated
strong correlations with human ratings, their requirement for in-domain labeled
training data hinders their generalization ability to new domains. Unsupervised
approaches based on pretrained self-supervised learning (SSL) based models and
automatic speech recognition (ASR) models are a promising alternative; however,
little is known about how these models encode information about speech quality.
Towards the goal of better understanding how different aspects of speech
quality are encoded in a multilingual setting, we present a layer-wise analysis
of multilingual pretrained speech models based on reference modeling. We find
that features extracted from early SSL layers show correlations with human
ratings of synthesized speech, and later layers of ASR models can predict
quality of non-neural systems as well as intelligibility. We also demonstrate
the importance of using well-matched reference data.
Authors' comments: Copyright 2025 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
Bo Wu, Zhiqi Ai, Jun Jiang, Congcong Zhu, Shugong Xu
Label ambiguity poses a significant challenge in age estimation tasks. Most existing methods address this issue by modeling correlations between adjacent age groups through label distribution learning. However, they often overlook the varying degrees of ambiguity present across different age stages. In this paper, we propose a Stage-wise Adaptive Label Distribution Learning (SA-LDL) algorithm, which leverages the observation -- revealed through our analysis of embedding similarities between an anchor and all other ages -- that label ambiguity exhibits clear stage-wise patterns. By jointly employing stage-wise adaptive variance modeling and weighted loss function, SA-LDL effectively captures the complex and structured nature of label ambiguity, leading to more accurate and robust age estimation. Extensive experiments demonstrate that SA-LDL achieves competitive performance, with MAE of 1.74 and 2.15 on the MORPH-II and FG-NET datasets.
Authors' comments: 14 pages, 3 figures
Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized modules increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques often rely on interference heuristics, importance weighting, or activation matching while treating each layer independently, thereby failing to account for the inter-layer dependencies inherent in deep networks. This simplification leads to distributional mismatches, especially in activation-based methods, when changes in early layers are not properly reflected in downstream ones. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural network training. To address it, we propose Chain of Merges (CoM), a layer-wise merging procedure that updates activation statistics in an auto-regressive fashion, explicitly accounting for cross-layer interactions. CoM produces a coherent merged model through a series of conditionally optimal updates, effectively mitigating degradation caused by covariate shift. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
Kaiqi Zhao
Neural network quantization aims to reduce the bit-widths of weights and activations, making it a critical technique for deploying deep neural networks on resource-constrained hardware. Most Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions by replacing their derivatives with that of the identity function. While effective, STE overlooks discretization errors between continuous and quantized values, which can lead to accuracy degradation -- especially at extremely low bit-widths. In this paper, we propose Progressive Element-wise Gradient Estimation (PEGE), a simple yet effective alternative to STE, which can be seamlessly integrated with any forward propagation method and improves quantized model accuracy. PEGE progressively replaces full-precision weights and activations with their quantized counterparts via a novel logarithmic curriculum-driven mixed-precision replacement strategy. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the task loss for prediction and the discretization error for quantization, providing a unified and generalizable framework. Extensive experiments on CIFAR-10 and ImageNet across various architectures (e.g., ResNet, VGG) demonstrate that PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
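For readers unfamiliar with the Straight-Through Estimator that PEGE is positioned against, the sketch below shows the baseline idea on a single scalar weight: quantize in the forward pass, then treat the quantizer's derivative as 1 in the backward pass. It illustrates STE only and does not attempt to reproduce PEGE. The uniform quantizer, the squared-error toy loss, and the names `quantize`, `discretization_error`, and `ste_step` are assumptions made for this example.

```python
def quantize(w, num_bits=2, w_min=-1.0, w_max=1.0):
    """Uniform quantizer: clip w to [w_min, w_max] and snap to one of 2**num_bits levels."""
    levels = 2 ** num_bits - 1
    step = (w_max - w_min) / levels
    clipped = min(max(w, w_min), w_max)
    return w_min + round((clipped - w_min) / step) * step


def discretization_error(w, **kwargs):
    """Gap between the continuous weight and its quantized value.

    This is the quantity the abstract says plain STE ignores.
    """
    return abs(w - quantize(w, **kwargs))


def ste_step(w, target, lr):
    """One SGD step on the toy loss (quantize(w) - target)**2 using STE."""
    q = quantize(w)                 # forward pass uses the quantized weight
    grad_q = 2.0 * (q - target)     # dLoss/dq
    grad_w = grad_q * 1.0           # STE: pretend dq/dw == 1 (identity)
    return w - lr * grad_w


if __name__ == "__main__":
    w = 0.3
    print(quantize(w), discretization_error(w), ste_step(w, target=1.0, lr=0.1))
```

Note that the update in `ste_step` depends only on the quantized value, so two weights that round to the same level receive the same gradient regardless of how far each sits from that level.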
Hiroaki Aizawa, Yuta Naito, Kohei Fukuda
The purpose of training neural networks is to achieve high generalization
performance on unseen inputs. However, when trained on imbalanced datasets, a
model's prediction tends to favor majority classes over minority classes,
leading to significant degradation in the recognition performance of minority
classes. To address this issue, we propose class-wise flooding regularization,
an extension of flooding regularization applied at the class level. Flooding is
a regularization technique that mitigates overfitting by preventing the
training loss from falling below a predefined threshold, known as the flooding
level, thereby discouraging memorization. Our proposed method assigns a
class-specific flooding level based on class frequencies. By doing so, it
suppresses overfitting in majority classes while allowing sufficient learning
for minority classes. We validate our approach on imbalanced image
classification. Compared to conventional flooding regularizations, our method
improves the classification performance of minority classes and achieves better
overall generalization.
Authors' comments: Accepted to ACPR2025
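The class-wise flooding idea in the entry above can be sketched as follows. The standard flooding objective replaces a loss L with |L - b| + b for a flooding level b. The class-wise version here applies that per class and averages the results. The abstract does not state how levels are derived from class frequencies, so the proportional rule in `class_flood_levels` (larger classes get higher levels) is a hypothetical choice, as are all function names.

```python
def flood(loss, level):
    """Standard flooding: the loss is pushed back up once it drops below `level`."""
    return abs(loss - level) + level


def class_flood_levels(class_counts, max_level):
    """Hypothetical rule: scale the flooding level with relative class frequency,
    so majority classes are regularised more strongly than minority classes."""
    largest = max(class_counts.values())
    return {c: max_level * n / largest for c, n in class_counts.items()}


def class_wise_flooded_loss(per_sample_losses, labels, levels):
    """Average the per-class mean losses after flooding each with its own level."""
    sums, counts = {}, {}
    for loss, y in zip(per_sample_losses, labels):
        sums[y] = sums.get(y, 0.0) + loss
        counts[y] = counts.get(y, 0) + 1
    flooded = [flood(sums[c] / counts[c], levels[c]) for c in sums]
    return sum(flooded) / len(flooded)


if __name__ == "__main__":
    levels = class_flood_levels({0: 100, 1: 10}, max_level=0.2)
    print(levels)
    print(class_wise_flooded_loss([0.1, 0.3, 0.5], [0, 0, 1], levels))
```

In an autodiff framework the same expression would be applied to per-class loss tensors. Once a class's loss falls below its level, the sign of its gradient flips, which is what discourages further memorization of that class.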
Durgesh Kumar Singh, Qing Cao, Sarina Thomas, Ahcène Boubekki, Robert Jenssen, Michael Kampffmeyer
Clinical guidelines recommend performing left ventricular (LV) linear measurements in B-mode echocardiographic images at the basal level -- typically at the mitral valve leaflet tips -- and aligned perpendicular to the LV long axis along a virtual scanline (SL). However, most automated methods estimate landmarks directly from B-mode images for the measurement task, where even small shifts in predicted points along the LV walls can lead to significant measurement errors, reducing their clinical reliability. A recent semi-automatic method, EnLVAM, addresses this limitation by constraining landmark prediction to a clinician-defined SL and training on generated Anatomical Motion Mode (AMM) images to predict LV landmarks along the same. To enable full automation, a contour-aware SL placement approach is proposed in this work, in which the LV contour is estimated using a weakly supervised B-mode landmark detector. SL placement is then performed by inferring the LV long axis and the basal level, mimicking clinical guidelines. Building on this foundation, we introduce \textit{WiseLVAM} -- a novel, fully automated yet manually adaptable framework for automatically placing the SL and then automatically performing the LV linear measurements in the AMM mode. \textit{WiseLVAM} utilizes the structure-awareness from B-mode images and the motion-awareness from AMM mode to enhance robustness and accuracy with the potential to provide a practical solution for the routine clinical application.
Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
Abdullah Al Raqibul Islam, Helen Xu, Dong Dai, Aydın Buluç
Sparse matrix-sparse matrix multiplication (SpGEMM) is a key kernel in many
scientific applications and graph workloads. Unfortunately, SpGEMM is
bottlenecked by data movement due to its irregular memory access patterns.
Significant work has been devoted to developing row reordering schemes towards
improving locality in sparse operations, but prior studies mostly focus on the
case of sparse matrix-vector multiplication (SpMV).
In this paper, we address these issues with hierarchical clustering for
SpGEMM that leverages both row reordering and cluster-wise computation to
improve reuse in the second input (B) matrix with a novel row-clustered matrix
format and access pattern in the first input (A) matrix. We find that
hierarchical clustering can speed up SpGEMM by 1.39x on average with low
preprocessing cost (less than 20x the cost of a single SpGEMM on about 90% of
inputs). Furthermore, we decouple the reordering algorithm from the clustered
matrix format so they can be applied as independent optimizations.
Additionally, this paper sheds light on the role of both row reordering and
clustering independently and together for SpGEMM with a comprehensive empirical
study of the effect of 10 different reordering algorithms and 3 clustering
schemes on SpGEMM performance on a suite of 110 matrices. We find that
reordering based on graph partitioning provides better SpGEMM performance than
existing alternatives at the cost of high preprocessing time. The evaluation
demonstrates that the proposed hierarchical clustering method achieves greater
average speedup compared to other reordering schemes with similar preprocessing
times.
Authors' comments: Accepted to appear in the International Conference for High
Performance Computing, Networking, Storage, and Analysis (SC) 2025
Heitor R. Medeiros, Atif Belal, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli
Object detection (OD) in infrared (IR) imagery is critical for low-light and
nighttime applications. However, the scarcity of large-scale IR datasets forces
models to rely on weights pre-trained on RGB images. While fine-tuning on IR
improves accuracy, it often compromises robustness under distribution shifts
due to the inherent modality gap between RGB and IR. To address this, we
introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution (OOD)
benchmarks built by applying corruption to standard IR datasets. Additionally,
to fully leverage the complementary knowledge from RGB and infrared trained
models, we propose WiSE-OD, a weight-space ensembling method with two variants:
WiSE-OD$_{ZS}$, which combines RGB zero-shot and IR fine-tuned weights, and
WiSE-OD$_{LP}$, which blends zero-shot and linear probing. Evaluated across
three RGB-pretrained detectors and two robust baselines, WiSE-OD improves both
cross-modality and corruption robustness without any additional training or
inference cost.
Authors' comments: 8 pages, conference
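Weight-space ensembling, as referenced in the WiSE-OD entry above, merges two checkpoints of the same architecture in parameter space, so no second forward pass is needed at inference. The sketch below assumes the merge is a simple linear interpolation with a mixing coefficient `alpha`, which is a common form of weight-space ensembling. The abstract does not confirm the exact rule, and the dictionary layout and function name are hypothetical.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha=0.5):
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned, parameter by parameter.

    Both arguments map parameter names to flat lists of floats
    (a hypothetical stand-in for a framework state dict).
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in zero_shot:
        a, b = zero_shot[name], fine_tuned[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged


if __name__ == "__main__":
    rgb_zero_shot = {"backbone.w": [0.0, 2.0]}   # toy RGB-pretrained weights
    ir_fine_tuned = {"backbone.w": [1.0, 4.0]}   # toy IR fine-tuned weights
    print(interpolate_weights(rgb_zero_shot, ir_fine_tuned, alpha=0.5))
```

Setting `alpha=0` recovers the zero-shot weights and `alpha=1` the fine-tuned ones. Intermediate values trade in-distribution accuracy against robustness, and the coefficient would typically be chosen on held-out data.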
Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu
In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate's profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as 4.0x, and preserves ImageNet top-1 accuracy within 2.1\% of the dense baseline.
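As background for the entry above, N:M pruning keeps at most N nonzero weights in every group of M consecutive weights. The sketch below shows the generic magnitude-based form on a flat list. It does not reproduce the paper's column-wise, tile-level variant or its XNNPACK kernels, whose details the abstract does not give, and the function name is hypothetical.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude weights in each group of m consecutive weights."""
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


if __name__ == "__main__":
    dense = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]
    print(prune_n_of_m(dense, n=2, m=4))
```

Because every group has the same number of surviving weights, the sparse layout is regular. That regularity is what makes N:M patterns practical to vectorize compared with unstructured pruning.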
Soumen Sinha, Shahryar Rahnamayan, Azam Asilian Bidgoli
Efficient text embedding is crucial for large-scale natural language processing (NLP) applications, where storage and computational efficiency are key concerns. In this paper, we explore how binary representations (barcodes) can replace real-valued features for NLP embeddings derived from machine learning models such as BERT. Thresholding is a common method for converting continuous embeddings into binary representations, often using a fixed threshold across all features. We propose a Coordinate Search-based optimization framework that instead identifies the optimal threshold for each feature, demonstrating that feature-specific thresholds lead to improved performance in binary encoding. This ensures that the binary representations are both accurate and efficient, enhancing performance across various features. Our optimal barcode representations have shown promising results in various NLP applications, demonstrating their potential to transform text representation. We conducted extensive experiments and statistical tests on different NLP tasks and datasets to evaluate our approach and compare it to other thresholding methods. Binary embeddings generated using optimal thresholds found by our method outperform traditional binarization methods in accuracy. This technique for generating binary representations is versatile and can be applied to any features, not just limited to NLP embeddings, making it useful for a wide range of domains in machine learning applications.
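The per-feature thresholding described above can be illustrated with a small coordinate search: one threshold is varied at a time while the others stay fixed, and a change is kept only if it improves a score computed on the binarized embeddings. This is a sketch under assumptions. The candidate grid, the number of sweeps, the `score` callback, and the function names are hypothetical, and the paper's actual objective and search schedule are not given in the abstract.

```python
def binarize(vectors, thresholds):
    """Map each real-valued feature to 1 if it exceeds its own threshold, else 0."""
    return [[1 if v > t else 0 for v, t in zip(vec, thresholds)] for vec in vectors]


def coordinate_search(vectors, score, candidates, init, sweeps=2):
    """Greedy coordinate search over per-feature thresholds.

    `score` takes the binarized vectors and returns a number to maximise.
    """
    thresholds = list(init)
    best = score(binarize(vectors, thresholds))
    for _ in range(sweeps):
        for j in range(len(thresholds)):        # one coordinate at a time
            for cand in candidates:
                trial = thresholds[:j] + [cand] + thresholds[j + 1:]
                s = score(binarize(vectors, trial))
                if s > best:                    # keep only strict improvements
                    best, thresholds = s, trial
    return thresholds, best


if __name__ == "__main__":
    embeddings = [[0.2, 5.0], [0.4, 7.0], [0.1, 5.5], [0.5, 6.5]]
    wanted = [[0, 0], [1, 1], [0, 0], [1, 1]]   # toy target codes

    def matching_bits(codes):
        return sum(c == w for code, row in zip(codes, wanted) for c, w in zip(code, row))

    print(coordinate_search(embeddings, matching_bits, [0.3, 6.0], init=[0.0, 0.0]))
```

The toy data has features on very different scales, so no single shared threshold separates both features, whereas a separate threshold per feature does.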
Riku Inoue, Masamitsu Tsuchiya, Yuji Yasui
For safety-critical robotics applications such as autonomous driving, it is
important to detect all required objects accurately in real-time. Motion
segmentation offers a solution by identifying dynamic objects from the scene in
a class-agnostic manner. Recently, various motion segmentation models have been
proposed, most of which jointly use subnetworks to estimate Depth, Pose,
Optical Flow, and Scene Flow. As a result, the overall computational cost of
the model increases, hindering real-time performance.
In this paper, we propose a novel cost-volume-based motion feature
representation, Channel-wise Motion Features. By extracting depth features of
each instance in the feature map and capturing the scene's 3D motion
information, it offers enhanced efficiency. The only subnetwork used to build
Channel-wise Motion Features is the Pose Network, and no others are required.
Our method not only achieves about 4 times the FPS of state-of-the-art models
in the KITTI Dataset and Cityscapes of the VCAS-Motion Dataset, but also
demonstrates equivalent accuracy while reducing the parameters to about 25$\%$.
Authors' comments: This paper has been accepted to IROS 2024 (Abu Dhabi, UAE), October
14-18, 2024
Sunwoo Kim, Haneul Yoo, Alice Oh
Understanding how large language models (LLMs) internally represent and
process their predictions is central to detecting uncertainty and preventing
hallucinations. While several studies have shown that models encode uncertainty
in their hidden states, it is underexplored how this affects the way they
process such hidden states. In this work, we demonstrate that the dynamics of
output token probabilities across layers for certain and uncertain outputs are
largely aligned, revealing that uncertainty does not seem to affect inference
dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to
analyze the layer-wise probability trajectories of final prediction tokens
across 11 datasets and 5 models. Using incorrect predictions as those with
higher epistemic uncertainty, our results show aligned trajectories for certain
and uncertain predictions that both observe abrupt increases in confidence at
similar layers. We balance this finding by showing evidence that more competent
models may learn to process uncertainty differently. Our findings challenge the
feasibility of leveraging simplistic methods for detecting uncertainty at
inference. More broadly, our work demonstrates how interpretability methods may
be used to investigate the way uncertainty affects inference.
Authors' comments: Accepted to Actionable Interpretability Workshop - ICML 2025
Yen-Tung Yeh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji
General-purpose audio representations have proven effective across diverse
music information retrieval applications, yet their utility in intelligent
music production remains limited by insufficient understanding of audio effects
(Fx). Although previous approaches have emphasized audio effects analysis at
the mixture level, this focus falls short for tasks demanding instrument-wise
audio effects understanding, such as automatic mixing. In this work, we present
Fx-Encoder++, a novel model designed to extract instrument-wise audio effects
representations from music mixtures. Our approach leverages a contrastive
learning framework and introduces an "extractor" mechanism that, when provided
with instrument queries (audio or text), transforms mixture-level audio effects
embeddings into instrument-wise audio effects embeddings. We evaluated our
model across retrieval and audio effects parameter matching tasks, testing its
performance across a diverse range of instruments. The results demonstrate that
Fx-Encoder++ outperforms previous approaches at mixture level and shows a novel
ability to extract effects representation instrument-wise, addressing a
critical capability gap in intelligent music production systems.
Authors' comments: ISMIR 2025
Yuting He, Shuo Li
Contrastive learning (CL) has become a cornerstone of self-supervised
pretraining (SSP) in foundation models; however, extending CL to pixel-wise
representation, crucial for medical vision, remains an open problem. Standard
CL formulates SSP as a binary optimization problem (binary CL) where the
excessive pursuit of feature dispersion leads to an over-dispersion problem,
breaking pixel-wise feature correlation and thus disrupting the intra-class
distribution. Our vector CL reformulates CL as a vector regression problem,
enabling dispersion quantification in pixel-wise pretraining via modeling
feature distances in regressing displacement vectors. To implement this novel
paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER
establishes an extendable vector-based self-learning, enforces a consistent
optimization flow from vector regression to distance modeling, and leverages a
vector pyramid architecture for granularity adaptation, thus preserving
pixel-wise feature correlations in SSP. Extensive experiments across 8 tasks,
spanning 2 dimensions and 4 modalities, show that COVER significantly improves
pixel-wise SSP, advancing generalizable medical visual foundation models.
Authors' comments: Accepted by ICCV 2025
Tingting Zhu, Tingyang Chen, Yinghui Wu, Arijit Khan, Xiangyu Ke
Ensuring the trustworthiness of graph neural networks (GNNs) as black-box models requires effective explanation methods. Existing GNN explanations typically apply input perturbations to identify subgraphs that are responsible for the occurrence of the final output of GNNs. However, such approaches lack finer-grained, layer-wise analysis of how intermediate representations contribute to the final result, capabilities that are crucial for model diagnosis and architecture optimization. This paper introduces SliceGX, a novel GNN explanation approach that generates explanations at specific GNN layers in a progressive manner. Given a GNN M, a set of selected intermediate layers, and a target layer, SliceGX automatically segments M into layer blocks ("model slices") and discovers high-quality explanatory subgraphs in each layer block that clarify the occurrence of the output of M at the targeted layer. Although finding such layer-wise explanations is computationally challenging, we develop efficient algorithms and optimization techniques that incrementally generate and maintain these subgraphs with provable approximation guarantees. Additionally, SliceGX offers a SPARQL-like query interface, providing declarative access and search capacities for the generated explanations. Through experiments on large real-world graphs and representative GNN architectures, we verify the effectiveness and efficiency of SliceGX, and illustrate its practical utility in supporting model debugging.