Hiroaki Aizawa, Yuta Naito, Kohei Fukuda
The purpose of training neural networks is to achieve high generalization
performance on unseen inputs. However, when trained on imbalanced datasets, a
model's prediction tends to favor majority classes over minority classes,
leading to significant degradation in the recognition performance of minority
classes. To address this issue, we propose class-wise flooding regularization,
an extension of flooding regularization applied at the class level. Flooding is
a regularization technique that mitigates overfitting by preventing the
training loss from falling below a predefined threshold, known as the flooding
level, thereby discouraging memorization. Our proposed method assigns a
class-specific flooding level based on class frequencies. By doing so, it
suppresses overfitting in majority classes while allowing sufficient learning
for minority classes. We validate our approach on imbalanced image
classification. Compared to conventional flooding regularizations, our method
improves the classification performance of minority classes and achieves better
overall generalization.
Authors' comments: Accepted to ACPR2025
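The flooding idea described above can be sketched in a few lines. Standard flooding replaces a loss L with |L - b| + b for a flooding level b. The class-wise variant below floods each class's mean loss separately. The rule in `class_flood_levels` (level proportional to class frequency), the averaging over classes, and all function names are assumptions made for illustration; the abstract does not specify the exact assignment rule.

```python
from collections import defaultdict


def flooding(loss, b):
    """Standard flooding: |loss - b| + b, so the loss cannot settle below b."""
    return abs(loss - b) + b


def class_flood_levels(class_counts, b_max):
    """Hypothetical rule: the flood level grows linearly with class frequency."""
    largest = max(class_counts.values())
    return {c: b_max * n / largest for c, n in class_counts.items()}


def class_wise_flooded_loss(sample_losses, labels, levels):
    """Flood the mean loss of each class separately, then average over classes."""
    per_class = defaultdict(list)
    for loss, y in zip(sample_losses, labels):
        per_class[y].append(loss)
    flooded = [flooding(sum(v) / len(v), levels[c]) for c, v in per_class.items()]
    return sum(flooded) / len(flooded)


# Toy batch: the majority class is already fit closely, the minority class is not.
levels = class_flood_levels({"major": 900, "minor": 100}, b_max=0.09)
batch_loss = class_wise_flooded_loss(
    [0.02, 0.04, 0.5], ["major", "major", "minor"], levels
)
```

In this toy batch the majority-class mean loss (0.03) is below its level (0.09), so flooding pushes it back up, while the minority-class loss is far above its small level and is left essentially unchanged.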
Durgesh Kumar Singh, Qing Cao, Sarina Thomas, Ahcène Boubekki, Robert Jenssen, Michael Kampffmeyer
Clinical guidelines recommend performing left ventricular (LV) linear measurements in B-mode echocardiographic images at the basal level -- typically at the mitral valve leaflet tips -- and aligned perpendicular to the LV long axis along a virtual scanline (SL). However, most automated methods estimate landmarks directly from B-mode images for the measurement task, where even small shifts in predicted points along the LV walls can lead to significant measurement errors, reducing their clinical reliability. A recent semi-automatic method, EnLVAM, addresses this limitation by constraining landmark prediction to a clinician-defined SL and training on generated Anatomical Motion Mode (AMM) images to predict LV landmarks along that SL. To enable full automation, a contour-aware SL placement approach is proposed in this work, in which the LV contour is estimated using a weakly supervised B-mode landmark detector. SL placement is then performed by inferring the LV long axis and the basal level, mimicking clinical guidelines. Building on this foundation, we introduce \textit{WiseLVAM} -- a novel, fully automated yet manually adaptable framework that places the SL and then performs the LV linear measurements in AMM mode. \textit{WiseLVAM} combines the structure-awareness of B-mode images with the motion-awareness of AMM mode to enhance robustness and accuracy, with the potential to provide a practical solution for routine clinical application.
Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders (SAEs) to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
Abdullah Al Raqibul Islam, Helen Xu, Dong Dai, Aydın Buluç
Sparse matrix-sparse matrix multiplication (SpGEMM) is a key kernel in many
scientific applications and graph workloads. Unfortunately, SpGEMM is
bottlenecked by data movement due to its irregular memory access patterns.
Significant work has been devoted to developing row reordering schemes towards
improving locality in sparse operations, but prior studies mostly focus on the
case of sparse-matrix vector multiplication (SpMV).
In this paper, we address these issues with hierarchical clustering for
SpGEMM that leverages both row reordering and cluster-wise computation to
improve reuse in the second input (B) matrix with a novel row-clustered matrix
format and access pattern in the first input (A) matrix. We find that
hierarchical clustering can speed up SpGEMM by 1.39x on average with low
preprocessing cost (less than 20x the cost of a single SpGEMM on about 90% of
inputs). Furthermore, we decouple the reordering algorithm from the clustered
matrix format so they can be applied as independent optimizations.
Additionally, this paper sheds light on the role of both row reordering and
clustering independently and together for SpGEMM with a comprehensive empirical
study of the effect of 10 different reordering algorithms and 3 clustering
schemes on SpGEMM performance on a suite of 110 matrices. We find that
reordering based on graph partitioning provides better SpGEMM performance than
existing alternatives at the cost of high preprocessing time. The evaluation
demonstrates that the proposed hierarchical clustering method achieves greater
average speedup compared to other reordering schemes with similar preprocessing
times.
Authors' comments: Accepted to appear in the International Conference for High
Performance Computing, Networking, Storage, and Analysis (SC) 2025
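To make the reuse argument above concrete, here is a minimal sketch of row-wise (Gustavson-style) SpGEMM on dict-of-dict matrices. It is not the paper's row-clustered format or its clustering algorithm; the helper `shared_b_rows` and the toy matrices are illustrative assumptions. The point is only that row i of A reads exactly the rows of B indexed by A's nonzero columns in row i, so placing rows of A with similar sparsity patterns next to each other lets the same rows of B be reused.

```python
def spgemm_rowwise(A, B):
    """Row-wise SpGEMM: row i of C accumulates A[i][k] * B[k, :] over nonzero k."""
    C = {}
    for i, a_row in A.items():
        acc = {}
        for k, a_ik in a_row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C


def shared_b_rows(A, i, j):
    """Rows of B that rows i and j of A both read (their common column indices)."""
    return set(A[i]) & set(A[j])


# Rows 0 and 2 of A have the same column pattern; row 1 touches a different B row.
A = {0: {0: 1.0, 1: 2.0}, 1: {5: 3.0}, 2: {0: 4.0, 1: 1.0}}
B = {0: {0: 1.0}, 1: {0: 1.0, 2: 2.0}, 5: {1: 1.0}}
C = spgemm_rowwise(A, B)
```

Processing the rows of A in the order 0, 2, 1 keeps the two rows that share B rows 0 and 1 adjacent, which is the kind of locality a reordering or clustering scheme tries to create.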
Heitor R. Medeiros, Atif Belal, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli
Object detection (OD) in infrared (IR) imagery is critical for low-light and
nighttime applications. However, the scarcity of large-scale IR datasets forces
models to rely on weights pre-trained on RGB images. While fine-tuning on IR
improves accuracy, it often compromises robustness under distribution shifts
due to the inherent modality gap between RGB and IR. To address this, we
introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution (OOD)
benchmarks built by applying corruption to standard IR datasets. Additionally,
to fully leverage the complementary knowledge from RGB and infrared trained
models, we propose WiSE-OD, a weight-space ensembling method with two variants:
WiSE-OD$_{ZS}$, which combines RGB zero-shot and IR fine-tuned weights, and
WiSE-OD$_{LP}$, which blends zero-shot and linear probing. Evaluated across
three RGB-pretrained detectors and two robust baselines, WiSE-OD improves both
cross-modality and corruption robustness without any additional training or
inference cost.
Authors' comments: 8 pages, conference
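Weight-space ensembling of the kind described above is commonly implemented as a per-parameter linear interpolation between two checkpoints of the same architecture. The sketch below assumes that form; the mixing coefficient `lam`, the function name, and the plain-dict checkpoint layout are illustrative assumptions rather than details taken from the abstract.

```python
def interpolate_weights(zero_shot, fine_tuned, lam=0.5):
    """Blend two checkpoints parameter by parameter.

    lam = 0 returns the zero-shot weights, lam = 1 the fine-tuned weights.
    Both checkpoints map parameter names to flat lists of floats.
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [(1.0 - lam) * a + lam * b
               for a, b in zip(zero_shot[name], fine_tuned[name])]
        for name in zero_shot
    }


rgb_zero_shot = {"head.weight": [0.0, 2.0]}
ir_fine_tuned = {"head.weight": [1.0, 4.0]}
merged = interpolate_weights(rgb_zero_shot, ir_fine_tuned, lam=0.25)
```

Because the blend happens once, offline, the merged model has the same size and inference cost as either input model, which is consistent with the "no additional training or inference cost" claim.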
Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu
In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate's profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as 4.0x, and preserves ImageNet top-1 accuracy within 2.1\% of the dense baseline.
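The abstract does not define column-wise N:M pruning precisely. One plausible reading, sketched below purely as an assumption, is that within each tile of the weight matrix every group of M consecutive columns keeps its N columns with the largest L1 norm and zeroes the rest. The function name, the L1 criterion, and the tile layout are hypothetical.

```python
def prune_tile_columns(tile, n, m):
    """Keep the n strongest columns in every group of m columns of one tile.

    `tile` is a list of equal-length rows. Column strength is the L1 norm over
    the tile's rows (an assumed criterion). Returns a pruned copy.
    """
    rows, cols = len(tile), len(tile[0])
    pruned = [row[:] for row in tile]
    for start in range(0, cols, m):
        group = list(range(start, min(start + m, cols)))
        norms = {c: sum(abs(tile[r][c]) for r in range(rows)) for c in group}
        keep = set(sorted(group, key=lambda c: norms[c], reverse=True)[:n])
        for c in group:
            if c not in keep:
                for r in range(rows):
                    pruned[r][c] = 0.0
    return pruned


tile = [[1.0, -5.0, 0.1, 2.0],
        [2.0, 4.0, 0.2, -3.0]]
pruned_tile = prune_tile_columns(tile, n=2, m=4)
```

Under this reading, whole columns are removed inside a tile, so the surviving columns can be packed densely and processed with regular vector loads.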
Soumen Sinha, Shahryar Rahnamayan, Azam Asilian Bidgoli
Efficient text embedding is crucial for large-scale natural language processing (NLP) applications, where storage and computational efficiency are key concerns. In this paper, we explore how binary representations (barcodes) can replace real-valued features in NLP embeddings derived from machine learning models such as BERT. Thresholding is a common method for converting continuous embeddings into binary representations, often using a fixed threshold across all features. We propose a Coordinate Search-based optimization framework that instead identifies the optimal threshold for each feature, demonstrating that feature-specific thresholds lead to improved performance in binary encoding. This ensures that the binary representations are both accurate and efficient, enhancing performance across various features. Our optimal barcode representations have shown promising results in various NLP applications, demonstrating their potential to transform text representation. We conducted extensive experiments and statistical tests on different NLP tasks and datasets to evaluate our approach and compare it to other thresholding methods. Binary embeddings generated using optimal thresholds found by our method outperform traditional binarization methods in accuracy. This technique for generating binary representations is versatile and can be applied to any features, not just NLP embeddings, making it useful for a wide range of machine learning applications.
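A minimal sketch of per-feature thresholding with a coordinate search follows. It tunes one threshold at a time while holding the others fixed. The candidate grid, the number of sweeps, and the toy objective `informative_bits` are assumptions for illustration; the paper's actual objective and search details are not given in the abstract.

```python
def binarize(vec, thresholds):
    """Turn a real-valued vector into bits using one threshold per feature."""
    return [1 if x > t else 0 for x, t in zip(vec, thresholds)]


def coordinate_search(embeddings, score, candidates, sweeps=2):
    """Optimize thresholds one coordinate at a time to maximize `score`."""
    dim = len(embeddings[0])
    thresholds = [candidates[0]] * dim
    best = score([binarize(e, thresholds) for e in embeddings])
    for _ in range(sweeps):
        for d in range(dim):
            for c in candidates:
                trial = thresholds[:d] + [c] + thresholds[d + 1:]
                s = score([binarize(e, trial) for e in embeddings])
                if s > best:
                    best, thresholds = s, trial
    return thresholds, best


def informative_bits(codes):
    """Toy objective (hypothetical): count bit positions that are not constant."""
    return sum(1 for column in zip(*codes) if len(set(column)) > 1)


# Two features on very different scales: no single shared threshold splits both.
embeddings = [[0.1, 5.0], [0.2, 6.0], [0.8, 7.0], [0.9, 8.0]]
best_thresholds, best_score = coordinate_search(
    embeddings, informative_bits, candidates=[0.0, 0.5, 6.5]
)
```

In this toy case any single threshold shared by both features leaves at least one bit constant, whereas the per-feature search finds a separate cut point for each.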
Riku Inoue, Masamitsu Tsuchiya, Yuji Yasui
For safety-critical robotics applications such as autonomous driving, it is
important to detect all required objects accurately in real-time. Motion
segmentation offers a solution by identifying dynamic objects from the scene in
a class-agnostic manner. Recently, various motion segmentation models have been
proposed, most of which jointly use subnetworks to estimate Depth, Pose,
Optical Flow, and Scene Flow. As a result, the overall computational cost of
the model increases, hindering real-time performance.
In this paper, we propose a novel cost-volume-based motion feature
representation, Channel-wise Motion Features. By extracting depth features of
each instance in the feature map and capturing the scene's 3D motion
information, it offers enhanced efficiency. The only subnetwork used to build
Channel-wise Motion Features is the Pose Network, and no others are required.
Our method not only achieves about 4 times the FPS of state-of-the-art models
in the KITTI Dataset and Cityscapes of the VCAS-Motion Dataset, but also
demonstrates equivalent accuracy while reducing the parameters to about 25$\%$.
Authors' comments: This paper has been accepted to IROS 2024 (Abu Dhabi, UAE), October
14-18, 2024
Sunwoo Kim, Haneul Yoo, Alice Oh
Understanding how large language models (LLMs) internally represent and
process their predictions is central to detecting uncertainty and preventing
hallucinations. While several studies have shown that models encode uncertainty
in their hidden states, it is underexplored how this affects the way they
process such hidden states. In this work, we demonstrate that the dynamics of
output token probabilities across layers for certain and uncertain outputs are
largely aligned, revealing that uncertainty does not seem to affect inference
dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to
analyze the layer-wise probability trajectories of final prediction tokens
across 11 datasets and 5 models. Treating incorrect predictions as those with
higher epistemic uncertainty, we find aligned trajectories for certain
and uncertain predictions, both of which show abrupt increases in confidence at
similar layers. We balance this finding by showing evidence that more competent
models may learn to process uncertainty differently. Our findings challenge the
feasibility of leveraging simplistic methods for detecting uncertainty at
inference. More broadly, our work demonstrates how interpretability methods may
be used to investigate the way uncertainty affects inference.
Authors' comments: Accepted to Actionable Interpretability Workshop - ICML 2025
Yen-Tung Yeh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji
General-purpose audio representations have proven effective across diverse
music information retrieval applications, yet their utility in intelligent
music production remains limited by insufficient understanding of audio effects
(Fx). Although previous approaches have emphasized audio effects analysis at
the mixture level, this focus falls short for tasks demanding instrument-wise
audio effects understanding, such as automatic mixing. In this work, we present
Fx-Encoder++, a novel model designed to extract instrument-wise audio effects
representations from music mixtures. Our approach leverages a contrastive
learning framework and introduces an "extractor" mechanism that, when provided
with instrument queries (audio or text), transforms mixture-level audio effects
embeddings into instrument-wise audio effects embeddings. We evaluated our
model across retrieval and audio effects parameter matching tasks, testing its
performance across a diverse range of instruments. The results demonstrate that
Fx-Encoder++ outperforms previous approaches at the mixture level and shows a novel
ability to extract effects representation instrument-wise, addressing a
critical capability gap in intelligent music production systems.
Authors' comments: ISMIR 2025
Yuting He, Shuo Li
Contrastive learning (CL) has become a cornerstone of self-supervised
pretraining (SSP) in foundation models; however, extending CL to pixel-wise
representation, crucial for medical vision, remains an open problem. Standard
CL formulates SSP as a binary optimization problem (binary CL) where the
excessive pursuit of feature dispersion leads to an over-dispersion problem,
breaking pixel-wise feature correlation and thus disrupting the intra-class
distribution. Our vector CL reformulates CL as a vector regression problem,
enabling dispersion quantification in pixel-wise pretraining via modeling
feature distances in regressing displacement vectors. To implement this novel
paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER
establishes an extendable vector-based self-learning, enforces a consistent
optimization flow from vector regression to distance modeling, and leverages a
vector pyramid architecture for granularity adaptation, thus preserving
pixel-wise feature correlations in SSP. Extensive experiments across 8 tasks,
spanning 2 dimensions and 4 modalities, show that COVER significantly improves
pixel-wise SSP, advancing generalizable medical visual foundation models.
Authors' comments: Accepted by ICCV 2025
Tingting Zhu, Tingyang Chen, Yinghui Wu, Arijit Khan, Xiangyu Ke
Ensuring the trustworthiness of graph neural networks (GNNs) as black-box models requires effective explanation methods. Existing GNN explanations typically apply input perturbations to identify subgraphs that are responsible for the final output of GNNs. However, such approaches lack finer-grained, layer-wise analysis of how intermediate representations contribute to the final result, capabilities that are crucial for model diagnosis and architecture optimization. This paper introduces SliceGX, a novel GNN explanation approach that generates explanations at specific GNN layers in a progressive manner. Given a GNN M, a set of selected intermediate layers, and a target layer, SliceGX automatically segments M into layer blocks ("model slices") and discovers high-quality explanatory subgraphs in each layer block that clarify the output of M at the targeted layer. Although finding such layer-wise explanations is computationally challenging, we develop efficient algorithms and optimization techniques that incrementally generate and maintain these subgraphs with provable approximation guarantees. Additionally, SliceGX offers a SPARQL-like query interface, providing declarative access and search capabilities for the generated explanations. Through experiments on large real-world graphs and representative GNN architectures, we verify the effectiveness and efficiency of SliceGX, and illustrate its practical utility in supporting model debugging.
Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.
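The module-wise assignment described above can be sketched as two steps: estimate a power-law exponent alpha from each module's eigenvalue spectrum, then give weaker decay to modules with smaller alpha (heavier tails). The Hill-type estimator below is a common choice in HT-SR-style analyses, but the abstract does not say which estimator is used. The linear mapping in `assign_decay`, the decay range, and the module names are hypothetical.

```python
import math


def hill_alpha(eigenvalues, k):
    """Hill-type estimate of the power-law exponent from the k largest eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    if not 0 < k < len(lam):
        raise ValueError("k must satisfy 0 < k < number of eigenvalues")
    ref = lam[k]  # the (k+1)-th largest eigenvalue is the tail cutoff
    return 1.0 + k / sum(math.log(lam[i] / ref) for i in range(k))


def assign_decay(alphas, wd_min, wd_max):
    """Hypothetical mapping: the smallest alpha gets wd_min, the largest wd_max."""
    lo, hi = min(alphas.values()), max(alphas.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all alphas are equal
    return {
        name: wd_min + (a - lo) / span * (wd_max - wd_min)
        for name, a in alphas.items()
    }


# Toy spectra: "attn" has a much heavier tail than "mlp".
alphas = {
    "attn": hill_alpha([100.0, 10.0, 1.0, 0.5], k=2),
    "mlp": hill_alpha([2.0, 1.5, 1.0, 0.5], k=2),
}
decay = assign_decay(alphas, wd_min=0.05, wd_max=0.15)
```

In practice the eigenvalues would come from each module's weight correlation matrix; here they are hard-coded toy values so the sketch stays dependency-free.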
Seyed Mohsen Hosseini
Class imbalance and difficulty imbalance are the two types of data imbalance that affect the performance of neural networks in medical segmentation tasks. In class imbalance, the loss is dominated by the majority classes, and in difficulty imbalance, the loss is dominated by easy-to-classify pixels. This leads to ineffective training. Dice loss, which is based on a geometrical metric, is very effective in addressing class imbalance compared to the cross entropy (CE) loss, which is adopted directly from classification tasks. To address difficulty imbalance, the common approach is to employ a re-weighted CE loss or a modified Dice loss to focus the training on difficult-to-classify areas. The existing modification methods are computationally costly and have had limited success. In this study we propose a simple modification to the Dice loss with minimal computational cost. With a pixel-level modulating term, we take advantage of the effectiveness of Dice loss in handling class imbalance to also handle difficulty imbalance. Results on three commonly used medical segmentation tasks show that the proposed Pixel-wise Modulated Dice loss (PM Dice loss) outperforms other methods designed to tackle the difficulty imbalance problem.
Kangye Ji, Yuan Meng, Hanyun Cui, Ye Li, Shengjia Hua, Lei Chen, Zhi Wang
Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose Block-wise Adaptive Caching (BAC), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities vary non-uniformly across timesteps and blocks. To operationalize this insight, we first propose the Adaptive Caching Scheduler, designed to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to significant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with significant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3x inference speedup for free.
YaChen Yan, Liubo Li, Ravi Choudhary
In modern recommender systems, CTR/CVR models are increasingly trained with ranking objectives to improve item ranking quality. While this shift aligns training more closely with serving goals, most existing methods rely on in-batch negative sampling, which predominantly surfaces easy negatives. This limits the model's ability to capture fine-grained user preferences and weakens overall ranking performance. To address this, we propose a Hierarchical Group-wise Ranking Framework with two key components. First, we apply residual vector quantization to user embeddings to generate hierarchical user codes that partition users into hierarchical, trie-structured clusters. Second, we apply listwise ranking losses to user-item pairs at each level of the hierarchy, where shallow levels group loosely similar users and deeper levels group highly similar users, reinforcing learning-to-rank signals through progressively harder negatives. Since users with similar preferences and content exposure tend to yield more informative negatives, applying ranking losses within these hierarchical user groups serves as an effective approximation of hard negative mining. Our approach improves ranking performance without requiring complex real-time context collection or retrieval infrastructure. Extensive experiments demonstrate that the proposed framework consistently enhances both model calibration and ranking accuracy, offering a scalable and practical solution for industrial recommender systems.
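The hierarchical user codes described above can be illustrated with a small residual vector quantization (RVQ) sketch. At each level the nearest codeword to the current residual is selected and subtracted, and users sharing a code prefix fall into the same trie node. The codebooks here are hand-written toy values (in practice they would be learned), and the function names are illustrative assumptions.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_codes(vec, codebooks):
    """Quantize vec level by level, each level encoding the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


def groups_at_level(user_codes, level):
    """Group users by their first `level` code indices (one trie depth)."""
    groups = {}
    for user, codes in user_codes.items():
        groups.setdefault(codes[:level], []).append(user)
    return groups


codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],  # coarse level
    [[-1.0, 0.0], [1.0, 0.0]],   # fine level, applied to the residual
]
users = {"u1": [1.0, 0.0], "u2": [-1.0, 0.2], "u3": [11.0, 10.0]}
user_codes = {u: rvq_codes(v, codebooks) for u, v in users.items()}
coarse = groups_at_level(user_codes, 1)
fine = groups_at_level(user_codes, 2)
```

A listwise ranking loss would then be applied within each group at each depth; shallow groups mix loosely similar users, while deeper groups contain only close neighbors and therefore supply harder negatives.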
Fuhan Cai, Yong Guo, Jie Li, Wenbo Li, Xiangzhong Fang, Jian Chen
Recent advancements in text-to-image (T2I) generation have led to the
emergence of highly expressive models such as diffusion transformers (DiTs),
exemplified by FLUX. However, their massive parameter sizes lead to slow
inference, high memory usage, and poor deployability. Existing acceleration
methods (e.g., single-step distillation and attention pruning) often suffer
from significant performance degradation and incur substantial training costs.
To address these limitations, we propose FastFLUX, an architecture-level
pruning framework designed to enhance the inference efficiency of FLUX. At its
core is the Block-wise Replacement with Linear Layers (BRLL) method, which
replaces structurally complex residual branches in ResBlocks with lightweight
linear layers while preserving the original shortcut connections for stability.
Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning
strategy that leverages LoRA to supervise neighboring blocks, mitigating
performance drops caused by structural replacement. Experiments show that our
FastFLUX maintains high image quality under both qualitative and quantitative
evaluations, while significantly improving inference speed, even with 20\% of
the hierarchy pruned. Our code will be available soon.
Authors' comments: 14 pages
Weijie Shi, Han Zhu, Jiaming Ji, Mengze Li, Jipeng Zhang, Ruiyuan Zhang, Jia Zhu, Jiajie Xu et al.
Legal judgment prediction (LJP) aims to function as a judge by making final rulings based on case claims and facts, which plays a vital role in the judicial domain for supporting court decision-making and improving judicial efficiency. However, existing methods often struggle with logical errors when conducting complex legal reasoning. We propose LegalReasoner, which enhances LJP reliability through step-wise verification and correction of the reasoning process. Specifically, it first identifies dispute points to decompose complex cases, and then conducts step-wise reasoning while employing a process verifier to validate each step's logic from correctness, progressiveness, and potential perspectives. When errors are detected, expert-designed attribution and resolution strategies are applied for correction. To fine-tune LegalReasoner, we release the LegalHK dataset, containing 58,130 Hong Kong court cases with detailed annotations of dispute points, step-by-step reasoning chains, and process verification labels. Experiments demonstrate that LegalReasoner significantly improves concordance with court decisions from 72.37 to 80.27 on LLAMA-3.1-70B. The data is available at https://huggingface.co/datasets/weijiezz/LegalHK.
Yuling Wang, Zihui Chen, Pengfei Jiao, Xiao Wang
Heterogeneous Graph Neural Networks (HGNNs) are vulnerable, highlighting the
need for tailored attacks to assess their robustness and ensure security.
However, existing HGNN attacks often require complex retraining of parameters
to generate specific perturbations for new scenarios. Recently, foundation
models have opened new horizons for the generalization of graph neural networks
by capturing shared semantics across various graph distributions. This leads us
to ask: Can we design a foundation attack model for HGNNs that enables
generalizable perturbations across different HGNNs, and quickly adapts to new
heterogeneous graphs (HGs)? Empirical findings reveal that, despite significant
differences in model design and parameter space, different HGNNs surprisingly
share common vulnerability patterns from a relation-aware perspective.
Therefore, we explore how to design foundation HGNN attack criteria by mining
shared attack units. In this paper, we propose a novel relation-wise
heterogeneous graph foundation attack model, HeTa. We introduce a foundation
surrogate model to align heterogeneity and identify the importance of shared
relation-aware attack units. Building on this, we implement a serialized
relation-by-relation attack based on the identified relational weights. In this
way, the perturbation can be transferred to various target HGNNs and easily
fine-tuned for new HGs. Extensive experiments demonstrate the strong attack
performance and generalizability of our method.
Authors' comments: Accepted by IJCAI 2025
Chunyuan Deng, Ruidi Chang, Hanjie Chen
Interventions in language models (LMs) are applied strategically to steer
model behavior during the forward pass. Learnable interventions, also known as
representation fine-tuning, aim to apply pointwise control within the concept
subspace and have proven effective in altering high-level behaviors. In this
work, we extend this approach to the distribution level, enabling the model to
learn not only pointwise transformations but also the surrounding regions of
the concept subspace. We demonstrate that these methods perform effectively in
early layers, with larger standard deviations correlating strongly with
improved performance. Across eight commonsense reasoning and seven arithmetic
reasoning benchmarks, our distribution-wise interventions consistently
outperform pointwise interventions in controllability and robustness. These
results illustrate that distribution-wise interventions provide a more
comprehensive method for steering model behavior and enabling finer-grained
control over language models. The code is at:
\href{https://github.com/chili-lab/D-Intervention}{https://github.com/chili-lab/D-Intervention}.
Authors' comments: ICML 2025