Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Shaojie Lyu, Qingming Huang
Real-world datasets often exhibit a long-tailed distribution, where the vast
majority of classes, known as tail classes, have only a few samples. Traditional
methods tend to overfit on these tail classes. Recently, a new approach called
Imbalanced SAM (ImbSAM) was proposed to leverage the generalization benefits of
Sharpness-Aware Minimization (SAM) for long-tailed distributions. Its main
strategy is to enhance the smoothness of the loss function for tail classes
only. However, we argue that improving generalization in long-tail scenarios
requires a careful balance between head and tail classes. We show that neither
SAM nor ImbSAM alone can fully achieve this balance. For SAM, we prove that
although it enhances the model's generalization ability by escaping saddle
points in the overall loss landscape, it does not escape them effectively for
tail-class losses. Conversely, while ImbSAM is more effective at avoiding
saddle points in tail classes, the head classes are trained insufficiently,
resulting in significant performance drops. Based on these insights, we propose
Stage-wise Saddle Escaping SAM (SSE-SAM), which uses the complementary strengths
of ImbSAM and SAM in a phased approach. Initially, SSE-SAM follows the majority
samples to avoid saddle points of the head-class loss. During the later phase,
it focuses on tail classes to help them escape saddle points. Our experiments
confirm that SSE-SAM is better at escaping saddle points on both head and tail
classes and improves performance.
Authors' comments: Update: Add missing information and correct some grammatical issues
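The stage-wise idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy losses, the radius `RHO`, the learning rate `LR`, and the switch point `switch_at` are all assumptions chosen for illustration. The first stage perturbs the weights along the gradient of the total loss (SAM-style); the second perturbs them along the tail-class gradient only, which is one reading of how ImbSAM restricts sharpness-awareness to tail classes.

```python
import math

RHO = 0.05  # perturbation radius (assumed value)
LR = 0.1    # learning rate (assumed value)

def head_grad(w):
    # Gradient of a toy head-class loss (x - 1)^2 (hypothetical).
    x, _ = w
    return [2.0 * (x - 1.0), 0.0]

def tail_grad(w):
    # Gradient of a toy tail-class loss x^2 - y^2 + 0.25 * y^4 (hypothetical).
    x, y = w
    return [2.0 * x, -2.0 * y + y ** 3]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def perturb(w, g):
    # Move w by RHO along the normalized direction of g.
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [wi + RHO * gi / (norm + 1e-12) for wi, gi in zip(w, g)]

def sam_gradient(w):
    # Stage 1: perturb along the total gradient, then take the total gradient there.
    w_adv = perturb(w, add(head_grad(w), tail_grad(w)))
    return add(head_grad(w_adv), tail_grad(w_adv))

def tail_only_gradient(w):
    # Stage 2: perturb along the tail gradient only; the head gradient is unperturbed.
    w_adv = perturb(w, tail_grad(w))
    return add(head_grad(w), tail_grad(w_adv))

def train(w, steps=200, switch_at=100):
    for t in range(steps):
        g = sam_gradient(w) if t < switch_at else tail_only_gradient(w)
        w = [wi - LR * gi for wi, gi in zip(w, g)]
    return w

w_final = train([0.0, 0.2])
```

The sketch only shows where the two stages differ in how the perturbed gradient is formed; it makes no claim about saddle-escaping behavior on real networks.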
Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang et al.
Recently, large language models (LLMs) have demonstrated superior performance across various tasks by scaling up model size in line with scaling laws. However, the huge computational overhead during inference hinders deployment in industrial applications. Many works leverage traditional compression approaches to accelerate model inference, but these typically introduce additional training costs to restore performance, and the pruned models usually show noticeable performance drops compared to the original model at a given level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify the less important tokens and skip them across model blocks to reduce computational cost during inference. To construct the router efficiently, we present a search-based sparsity scheduler for pruning sparsity allocation, a trainable router that takes our four proposed low-dimensional factors as input, and three proposed losses. We conduct extensive experiments across different benchmarks on different LLMs to demonstrate the superiority of our method. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods. For instance, our method outperforms BlockPruner and ShortGPT by approximately 10 points on both LLaMA2-7B and Qwen1.5-7B in accuracy retention at comparable token sparsity levels.
Jiajun Gong, Wei Cai, Siyuan Liang, Zhong Guan, Tao Wang, Ee-Chien Chang
Website Fingerprinting (WF) aims to deanonymize users on the Tor network by
analyzing encrypted network traffic. Recent deep-learning-based attacks show
high accuracy on undefended traces. However, they struggle against modern
defenses that use tactics like injecting dummy packets and delaying real
packets, which significantly degrade classification performance. Our analysis
reveals that current attacks inadequately leverage the timing information
inherent in traffic traces, which persists as a source of leakage even under
robust defenses. Addressing this shortfall, we introduce a novel feature
representation named the Inter-Arrival Time (IAT) histogram, which quantifies
the frequencies of packet inter-arrival times across predetermined time slots.
Complementing this feature, we propose a new CNN-based attack, WFCAT, enhanced
with two innovative architectural blocks designed to optimally extract and
utilize timing information. Our approach uses kernels of varying sizes to
capture multi-scale features, which are then integrated using a weighted sum
across all feature channels to enhance the model's efficacy in identifying
temporal patterns. Our experiments validate that WFCAT substantially
outperforms existing methods on defended traces in both closed- and open-world
scenarios. Notably, WFCAT achieves over 59% accuracy against Surakav, a
recently developed robust defense, marking improvements of over 28% and 48%
over the state-of-the-art attacks RF and Tik-Tok, respectively, in the
closed-world scenario.
Authors' comments: 13 pages
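As a rough illustration of the IAT histogram feature described above, the following sketch bins the gaps between consecutive packet timestamps. The bin edges and the example trace are hypothetical, and the sketch builds a single histogram for a whole trace; the paper's exact slot layout is not given in the abstract.

```python
from bisect import bisect_right

def iat_histogram(timestamps, bin_edges):
    """Count packet inter-arrival times (in seconds) per bin.

    bin_edges must be sorted; bin i covers [bin_edges[i-1], bin_edges[i]),
    bin 0 starts at zero, and the last bin collects everything at or above
    the largest edge.
    """
    iats = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    counts = [0] * (len(bin_edges) + 1)
    for iat in iats:
        counts[bisect_right(bin_edges, iat)] += 1
    return counts

# Hypothetical trace: packet arrival times in seconds.
trace = [0.0, 0.0005, 0.0009, 0.0030, 0.0530, 0.5530]
edges = [0.001, 0.01, 0.1]  # assumed bin edges, roughly log-spaced
histogram = iat_histogram(trace, edges)
```

Here `histogram` has one more entry than `edges`, so very long gaps are still counted instead of being dropped.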
Zhuo Wu, Qinglin Jia, Chuhan Wu, Zhaocheng Du, Shuai Wang, Zan Wang, Zhenhua Dong
Evaluating the quality of recommender systems is critical for algorithm design and optimization. Most evaluation methods rely on offline metrics for quick algorithm iteration, since online experiments are usually risky and time-consuming. However, offline evaluation usually cannot fully reflect users' preferences for the outcomes of different recommendation algorithms, and the results may not be consistent with online A/B tests. Moreover, many offline metrics such as AUC do not offer sufficient information for comparing the subtle differences between two competitive recommender systems in different aspects, which may lead to substantial performance differences in long-term online serving. Fortunately, due to the strong commonsense knowledge and role-play capability of large language models (LLMs), it is possible to obtain simulated user feedback on offline recommendation results. Motivated by the idea of LLM Chatbot Arena, in this paper we present RecSys Arena, where the recommendation results given by two different recommender systems in each session are evaluated by an LLM judge to obtain fine-grained evaluation feedback. More specifically, for each sample we use an LLM to generate a user profile description based on user behavior history or off-the-shelf profile features, which is then used to guide the LLM to play the role of this user and evaluate the relative preference between two recommendation results generated by different models. Through extensive experiments on two recommendation datasets in different scenarios, we demonstrate that many different LLMs not only provide general evaluation results that are highly consistent with canonical offline metrics, but also provide rich insight into many subjective aspects. Moreover, RecSys Arena can better distinguish different algorithms with comparable performance in terms of AUC and nDCG.
Shuo Xie, Fangzhi Zhu, Jiahui Wang, Lulu Wen, Wei Dai, Xiaowei Chen, Junxiong Zhu, Kai Zhou et al.
Aligning Large Language Models (LLMs) with human feedback is crucial for
their development. Existing preference optimization methods such as DPO and
KTO, while improving on Reinforcement Learning from Human Feedback (RLHF),
are inherently derived from PPO: they require a reference model, which consumes
additional GPU memory, and rely heavily on abundant preference data. Meanwhile,
current preference optimization research mainly targets single-question
scenarios with two replies, neglecting optimization with multiple replies,
which wastes data in practice. This study introduces the
MPPO algorithm, which leverages the average likelihood of model responses to
fit the reward function and maximizes the utilization of preference data.
Through a comparison of Point-wise, Pair-wise, and List-wise implementations,
we find that the Pair-wise approach achieves the best performance,
significantly enhancing the quality of model responses. Experimental results
demonstrate MPPO's outstanding performance across various benchmarks. On
MT-Bench, MPPO outperforms DPO, ORPO, and SimPO. Notably, on Arena-Hard, MPPO
surpasses DPO and ORPO by substantial margins. These achievements underscore
the remarkable advantages of MPPO in preference optimization tasks.
Authors' comments: Accepted by COLING2025
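To make the idea of a reference-free, pair-wise objective over multiple replies concrete, here is a small sketch. It is an assumption-laden illustration, not the MPPO objective: it treats the average token log-likelihood of a reply as its score and averages a logistic loss over every (chosen, rejected) pair. The function names, the scale `beta`, and the example log-probabilities are all hypothetical.

```python
import math

def average_log_likelihood(token_logprobs):
    # Length-normalized score of one reply (no reference model involved).
    return sum(token_logprobs) / len(token_logprobs)

def pairwise_preference_loss(chosen_logprobs, rejected_logprobs_list, beta=1.0):
    """Average logistic loss between one chosen reply and several rejected replies."""
    chosen_score = average_log_likelihood(chosen_logprobs)
    losses = []
    for rejected_logprobs in rejected_logprobs_list:
        margin = beta * (chosen_score - average_log_likelihood(rejected_logprobs))
        # -log(sigmoid(margin)) written as log(1 + exp(-margin))
        losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses)

# Hypothetical per-token log-probabilities for one prompt with three replies.
chosen = [-0.1, -0.2]
rejected = [[-1.0, -1.2], [-0.5, -0.7, -0.6]]
loss = pairwise_preference_loss(chosen, rejected)
```

Because every rejected reply contributes a pair, none of the additional replies for a prompt is discarded, which is the data-utilization point the abstract makes.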
Hazel Kim, Adel Bibi, Philip Torr, Yarin Gal
Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel approach to detecting model hallucination through systematic analysis of information flow across model layers when processing inputs with insufficient or ambiguous context. Our investigation reveals that hallucination manifests as deficiencies in usable information in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics ($\mathcal{L}$I) provides robust indicators of model reliability, accounting for both information gain and loss during computation. $\mathcal{L}$I improves model reliability and integrates immediately with universal LLMs without additional training or architectural modifications.
Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, Chen Lv
The emergence of general human knowledge and impressive logical reasoning capacity in rapidly progressing vision-language models (VLMs) has driven increasing interest in applying VLMs to high-level autonomous driving tasks, such as scene understanding and decision-making. However, the relationship between knowledge proficiency, especially essential driving expertise, and closed-loop autonomous driving performance requires further in-depth study. In this paper, we investigate the effects of the depth and breadth of fundamental driving knowledge on closed-loop trajectory planning and introduce WiseAD, a specialized VLM tailored for end-to-end autonomous driving capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios. We employ joint training on driving knowledge and planning datasets, enabling the model to perform knowledge-aligned trajectory planning accordingly. Extensive experiments indicate that as the diversity of driving knowledge extends, critical accidents are notably reduced, contributing 11.9% and 12.4% improvements in the driving score and route completion on the Carla closed-loop evaluations, achieving state-of-the-art performance. Moreover, WiseAD also demonstrates remarkable performance in knowledge evaluations on both in-domain and out-of-domain datasets.
Jin-Seop Lee, Noo-ri Kim, Jee-Hyong Lee
Self-supervised learning (SSL) methods based on the instance discrimination
tasks with InfoNCE have achieved remarkable success. Despite their success, SSL
models often struggle to generate effective representations for unseen-domain
data. To address this issue, research on unsupervised domain generalization
(UDG), which aims to develop SSL models that can generate domain-irrelevant
features, has been conducted. Most UDG approaches utilize contrastive learning
with InfoNCE to generate representations, and perform feature alignment based
on strong assumptions to generalize domain-irrelevant common features from
multi-source domains. However, existing methods that rely on instance
discrimination tasks are not effective at extracting domain-irrelevant common
features. This leads to the suppression of domain-irrelevant common features
and the amplification of domain-relevant features, thereby hindering domain
generalization. Furthermore, strong assumptions underlying feature alignment
can lead to biased feature learning, reducing the diversity of common features.
In this paper, we propose a novel approach, DomCLP, Domain-wise Contrastive
Learning with Prototype Mixup. We explore how InfoNCE suppresses
domain-irrelevant common features and amplifies domain-relevant features. Based
on this analysis, we propose Domain-wise Contrastive Learning (DCon) to enhance
domain-irrelevant common features. We also propose Prototype Mixup Learning
(PMix) to generalize domain-irrelevant common features across multiple domains
without relying on strong assumptions. The proposed method consistently
outperforms state-of-the-art methods on the PACS and DomainNet datasets across
various label fractions, showing significant improvements. Our code will be
released. Our project page is available at https://github.com/jinsuby/DomCLP.
Authors' comments: Code page: https://github.com/jinsuby/DomCLP
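For readers unfamiliar with the InfoNCE loss that the abstract above analyzes, a minimal sketch follows. The loss itself is the standard softmax-over-similarities form; the temperature value and the helper `same_domain_negatives` are assumptions. In particular, drawing negatives only from the anchor's own domain is one hypothetical reading of "domain-wise" contrastive learning, not a description of DCon as published.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax of the positive pair among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, negative) / temperature for negative in negatives]
    largest = max(logits)  # subtract the max for a stable log-sum-exp
    log_denominator = largest + math.log(sum(math.exp(l - largest) for l in logits))
    return log_denominator - logits[0]

def same_domain_negatives(pool, anchor_domain):
    # Hypothetical helper: keep only negatives that share the anchor's domain.
    return [feature for feature, domain in pool if domain == anchor_domain]

# Hypothetical 2-D features tagged with a domain label.
pool = [([0.0, 1.0], "photo"), ([1.0, 1.0], "sketch")]
anchor = [1.0, 0.0]
loss = info_nce(anchor, [1.0, 0.0], same_domain_negatives(pool, "photo"))
```

The loss is small when the positive is much more similar to the query than any negative, and grows as negatives become more similar than the positive.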
Morgan B. Talbot, Gabriel Kreiman, James J. DiCarlo, Guy Gaziv
The currently leading artificial neural network models of the visual ventral stream - which are derived from a combination of performance optimization and robustification methods - have demonstrated a remarkable degree of behavioral alignment with humans on visual categorization tasks. We show that image perturbations generated by these models can enhance the ability of humans to accurately report the ground truth class. Furthermore, we find that the same models can also be used out-of-the-box to predict the proportion of correct human responses to individual images, providing a simple, human-aligned estimator of the relative difficulty of each image. Motivated by these observations, we propose to augment visual learning in humans in a way that improves human categorization accuracy at test time. Our learning augmentation approach consists of (i) selecting images based on their model-estimated recognition difficulty, and (ii) applying image perturbations that aid recognition for novice learners. We find that combining these model-based strategies leads to categorization accuracy gains of 33-72% relative to control subjects without these interventions, on unmodified, randomly selected held-out test images. Beyond the accuracy gain, the training time for the augmented learning group was also shortened by 20-23%, despite both groups completing the same number of training trials. We demonstrate the efficacy of our approach in a fine-grained categorization task with natural images, as well as two tasks in clinically relevant image domains - histology and dermoscopy - where visual learning is notoriously challenging. To the best of our knowledge, our work is the first application of artificial neural networks to increase visual learning performance in humans by enhancing category-specific image features.
Yuchun He, Yuhan He
Single image super-resolution (SR) has long posed a challenge in the field of computer vision. While the advent of deep learning has produced numerous methods for this persistent problem, current methods still struggle to model long-sequence information, which limits their ability to capture global pixel interactions. To tackle this challenge and achieve superior SR outcomes, we propose the Mamba pixel-wise sequential interaction network (MPSI), which aims to strengthen long-range connections of information, focusing in particular on pixel-wise sequential interaction. We propose the Channel-Mamba Block (CMB) to capture comprehensive pixel interaction information by effectively modeling long-sequence information. Moreover, existing SR methods often neglect the features extracted by preceding layers, losing valuable feature information. While certain existing models strive to preserve these features, they frequently have difficulty establishing connections across all layers. To overcome this limitation, MPSI introduces the Mamba channel recursion module (MCRM), which maximizes the retention of valuable feature information from early layers, thereby facilitating the acquisition of pixel sequence interaction information from layers at multiple levels. Through extensive experimentation, we demonstrate that MPSI outperforms existing super-resolution methods in terms of image reconstruction results, attaining state-of-the-art performance.
Haihang Wu
Large language models (LLMs) have demonstrated remarkable performance across various language tasks, but their widespread deployment is impeded by their large size and high computational costs. Structural pruning is a prevailing technique used to introduce sparsity into pre-trained models and facilitate direct hardware acceleration during inference by removing redundant connections (structurally-grouped parameters), such as channels and attention heads. Existing structural pruning approaches often employ either global or layer-wise pruning criteria; however, they are hindered by ineffectiveness stemming from inaccurate evaluation of connection importance. Global pruning methods typically assess component importance using near-zero and unreliable gradients, while layer-wise pruning approaches encounter significant pruning error accumulation issues. To this end, we propose a more accurate pruning metric based on block-wise importance score propagation, termed LLM-BIP. Specifically, LLM-BIP precisely evaluates connection importance by gauging its influence on the respective transformer block output, which can be efficiently approximated in a single forward pass through an upper bound derived from the assumption of Lipschitz continuity. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks. The results demonstrate that our approach achieves an average accuracy increase of 3.26% on common reasoning tasks compared to previous best baselines. It also reduces perplexity by 14.09 and 68.76 on average for the WikiText2 dataset and PTB dataset, respectively.
Yang Jiao, Kai Yang, Chengtao Jian
Trilevel learning (TLL) has found diverse applications in machine learning, ranging from robust hyperparameter optimization to domain adaptation. However, existing research primarily focuses on scenarios where TLL can be addressed with first-order information available at each level, which is inadequate in many situations involving zeroth-order constraints, such as when black-box models are employed. Moreover, in trilevel learning, data may be distributed across various nodes, necessitating strategies to address TLL problems without centralizing data on servers to uphold data privacy. To this end, an effective distributed trilevel zeroth-order learning framework, DTZO, is proposed in this work to address TLL problems with level-wise zeroth-order constraints in a distributed manner. The proposed DTZO is versatile and can be adapted to a wide range of (grey-box) TLL problems with partial zeroth-order constraints. In DTZO, the cascaded polynomial approximation can be constructed without relying on gradients or sub-gradients, leveraging a novel cut, i.e., the zeroth-order cut. Furthermore, we theoretically carry out a non-asymptotic convergence rate analysis for the proposed DTZO in achieving an $\epsilon$-stationary point. Extensive experiments have been conducted to demonstrate and validate the superior performance of the proposed DTZO, e.g., it achieves up to approximately 40$\%$ improvement in performance.
Deepshikha Bhati, Fnu Neha, Md Amiruzzaman, Angela Guercio, Deepak Kumar Shukla, Ben Ward
Interpreting complex neural networks is crucial for understanding their decision-making processes, particularly in applications where transparency and accountability are essential. The proposed method addresses this need by focusing on Layer-wise Relevance Propagation (LRP), a technique used in explainable artificial intelligence (XAI) to attribute neural network outputs to input features through backpropagated relevance scores. Existing LRP methods often struggle with precision in evaluating individual neuron contributions. To overcome this limitation, we present a novel approach that improves the parsing of selected neurons during LRP backward propagation, using the Visual Geometry Group 16 (VGG16) architecture as a case study. Our method creates neural network graphs to highlight critical paths and visualizes these paths with heatmaps, optimizing neuron selection through accuracy metrics such as Mean Squared Error (MSE) and Symmetric Mean Absolute Percentage Error (SMAPE). Additionally, we utilize a deconvolutional visualization technique to reconstruct feature maps, offering a comprehensive view of the network's inner workings. Extensive experiments demonstrate that our approach enhances interpretability and supports the development of more transparent artificial intelligence (AI) systems for computer vision applications. This advancement has the potential to improve the trustworthiness of AI models in real-world machine vision applications, thereby increasing their reliability and effectiveness.
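The two metrics named above can be computed as in the following sketch. SMAPE has several variants in the literature; the one here divides by the mean of the absolute values and reports a percentage, which is an assumption about the variant used. What the paper compares (for example, outputs with all neurons versus outputs with selected neurons) is not specified in the abstract, so the example values are purely hypothetical.

```python
def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def smape(actual, predicted):
    """Symmetric Mean Absolute Percentage Error, in percent (0 to 200)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denominator = (abs(a) + abs(p)) / 2.0
        if denominator > 0:  # a pair of exact zeros contributes no error
            total += abs(a - p) / denominator
    return 100.0 * total / len(actual)

# Hypothetical scores: a reference output and an approximation of it.
reference = [1.0, 2.0, 3.0]
approximation = [1.0, 2.0, 5.0]
error_mse = mse(reference, approximation)
error_smape = smape(reference, approximation)
```

MSE penalizes large absolute deviations, while SMAPE is scale-free, so the two can rank candidate neuron selections differently.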
Fuchao Yang, Jianhong Cheng, Hui Liu, Yongqiang Dong, Yuheng Jia, Junhui Hou
In partial label learning (PLL), every sample is associated with a candidate
label set comprising the ground-truth label and several noisy labels. The
conventional PLL assumes the noisy labels are randomly generated
(instance-independent), while in practical scenarios, the noisy labels are
always instance-dependent and are highly related to the sample features,
leading to the instance-dependent partial label learning (IDPLL) problem.
Instance-dependent noisy labels are a double-edged sword. On the one hand, they
may promote model training, as the noisy labels can depict the sample to some
extent. On the other hand, they bring high label ambiguity, as the noisy labels
are hard to distinguish from the ground-truth label. To leverage the
nuances of IDPLL effectively, for the first time we create class-wise
embeddings for each sample, which allow us to explore the relationship of
instance-dependent noisy labels, i.e., the class-wise embeddings in the
candidate label set should have high similarity, while the class-wise
embeddings between the candidate label set and the non-candidate label set
should have high dissimilarity. Moreover, to reduce the high label ambiguity,
we introduce the concept of class prototypes containing global feature
information to disambiguate the candidate label set. Extensive experimental
comparisons with twelve methods on six benchmark data sets, including four
fine-grained data sets, demonstrate the effectiveness of the proposed method.
The code implementation is publicly available at
https://github.com/Yangfc-ML/CEL.
Authors' comments: Accepted by KDD 2025
Yiqin Zhang, Qingkui Chen, Chen Huang, Zhengjie Zhang, Meiling Chen, Zhibing Fu
Most data-driven models for medical image analysis rely on universal augmentations to improve performance. Experimental evidence has confirmed their effectiveness, but the unclear mechanism underlying them poses a barrier to the widespread acceptance of and trust in such methods within the medical community. We revisit and acknowledge the unique characteristics that set medical images apart from traditional digital images and, consequently, propose a medical-specific augmentation algorithm that is more elastic and aligns well with the radiology scan procedure. The method performs a piecewise affine transformation with sinusoidally distorted rays according to radius in polar coordinates, thus simulating the uncertain postures of a human lying flat on the scanning table. Our method can generate human visceral distributions without affecting the fundamental relative positions on the axial plane. Two non-adaptive algorithms, namely Meta-based Scan Table Removal and Similarity-Guided Parameter Search, are introduced to bolster the robustness of our augmentation method. Experiments show that our method improves accuracy across multiple well-known segmentation frameworks without requiring more data samples. Our preview code is available at https://github.com/MGAMZ/PSBPD.
Daniel Siegismund, Mario Wieser, Stephan Heyse, Stephan Steigele
Deep Neural Networks (DNNs) have shown remarkable success in various computer vision tasks. However, their black-box nature often makes their decisions difficult to interpret, creating an unmet need for methods that explain those decisions and ultimately forming a barrier to their wide acceptance, especially in biomedical applications. This work introduces a novel method, Pixel-wise Channel Isolation Mixing (PCIM), to calculate pixel attribution maps that highlight the image parts most crucial for a classification decision, without the need to extract internal network states or gradients. Unlike existing methods, PCIM treats each pixel as a distinct input channel and trains a blending layer to mix these pixels, reflecting specific classifications. This approach allows the generation of pixel attribution maps for each image while remaining agnostic to the choice of the underlying classification network. Benchmark testing on three application-relevant, diverse high-content imaging datasets shows state-of-the-art performance, particularly for model fidelity and localization ability, in both fluorescence and bright-field high-content imaging. PCIM is a unique and effective method for creating pixel-level attribution maps from arbitrary DNNs, enabling interpretability and trust.
Gustavo P. C. P. da Luz, Gabriel Massuyoshi Sato, Luis Fernando Gomez Gonzalez, Juliana Freitag Borin
The increasing urbanization and the growing number of vehicles in cities have
underscored the need for efficient parking management systems. Traditional
smart parking solutions often rely on sensors or cameras for occupancy
detection, each with its limitations. Recent advancements in deep learning have
introduced new YOLO models (YOLOv8, YOLOv9, YOLOv10, and YOLOv11), but these
models have not been extensively evaluated in the context of smart parking
systems, particularly when combined with Region of Interest (ROI) selection for
object detection. Existing methods still rely on fixed polygonal ROI selections
or simple pixel-based modifications, which limit flexibility and precision.
This work introduces a novel approach that integrates Internet of Things, Edge
Computing, and Deep Learning concepts, by using the latest YOLO models for
vehicle detection. By exploring both edge and cloud computing, it was found
that inference times on edge devices ranged from 1 to 92 seconds, depending on
the hardware and model version. Additionally, a new pixel-wise post-processing
ROI selection method is proposed for accurately identifying regions of interest
to count vehicles in parking lot images. The proposed system achieved 99.68%
balanced accuracy on a custom dataset of 3,484 images, offering a
cost-effective smart parking solution that ensures precise vehicle detection
while preserving data privacy.
Authors' comments: Submitted to Elsevier Internet of Things, 22 pages, 11 figures, 6 tables
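Two ingredients of the smart parking abstract above lend themselves to a short sketch: counting detections inside a pixel-wise region of interest, and the balanced accuracy metric. The ROI rule used here (a detection counts if its box center falls on a masked pixel) is an assumption for illustration, not the paper's post-processing method; the mask, boxes, and labels are hypothetical. Balanced accuracy is the usual mean of per-class recall.

```python
def count_in_roi(boxes, roi_mask):
    """Count boxes (x1, y1, x2, y2) whose center pixel lies inside the ROI mask."""
    height, width = len(roi_mask), len(roi_mask[0])
    count = 0
    for x1, y1, x2, y2 in boxes:
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        if 0 <= cx < width and 0 <= cy < height and roi_mask[cy][cx]:
            count += 1
    return count

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

# Hypothetical 4x4 mask: only the right half of the image is parking area.
roi_mask = [[0, 0, 1, 1] for _ in range(4)]
boxes = [(0, 0, 1, 1), (2, 0, 3, 3)]  # hypothetical detector output
vehicles = count_in_roi(boxes, roi_mask)
score = balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0])
```

Balanced accuracy is useful when one class dominates, because each class contributes equally regardless of how many samples it has.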
Kunyang Han, Yibo Hu, Mengxue Qu, Hailin Shi, Yao Zhao, Yunchao Wei
Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, limiting free-form category self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm, integrating the prediction result from the previous step with a textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.
Zhiming Xu, Suorong Yang, Baile Xu, Jian Zhao, Furao Shen
Class-incremental learning (CIL) aims to acquire new classes while conserving
historical knowledge incrementally. Despite existing pre-trained model (PTM)
based methods performing excellently in CIL, it is better to fine-tune them on
downstream incremental tasks with massive patterns unknown to PTMs. However,
using task streams for fine-tuning could lead to catastrophic forgetting that
will erase the knowledge in PTMs. This paper proposes the Dual Prototype
network for Task-wise Adaption (DPTA) of PTM-based CIL. For each incremental
learning task, a task-wise adapter module is built to fine-tune the PTM, where
the center-adapt loss forces the representation to be more centrally clustered
and class separable. The dual prototype network improves the prediction process
by enabling test-time adapter selection, where the raw prototypes deduce
several possible task indexes of test samples to select suitable adapter
modules for PTM, and the augmented prototypes that could separate highly
correlated classes are utilized to determine the final result. Experiments on
several benchmark datasets demonstrate the state-of-the-art performance of
DPTA. The code will be open-sourced after the paper is published.
Authors' comments: 9 pages, 6 figures, 2 tables
Wei Lin, Qingyu Song, Hong Xu
Tuning effective step sizes is crucial for the stability and efficiency of optimization algorithms. While adaptive coordinate-wise step-size tuning methods have been explored for first-order methods, second-order methods still lack efficient techniques. Current approaches, including hypergradient descent and cutting plane methods, offer limited improvements or encounter difficulties in second-order contexts. To address these challenges, we introduce a novel Learning-to-Optimize (L2O) model within the Broyden-Fletcher-Goldfarb-Shanno (BFGS) framework, which leverages neural networks to predict optimal coordinate-wise step sizes. Our model integrates a theoretical foundation that establishes conditions for the stability and convergence of these step sizes. Extensive experiments demonstrate that our approach achieves substantial improvements over traditional backtracking line search and hypergradient descent-based methods, offering stable performance that is up to 7$\times$ faster across diverse optimization tasks.
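The sketch below shows where coordinate-wise step sizes would enter a BFGS iteration. It is an illustration under assumptions, not the paper's model: `predict_step_sizes` is a hypothetical placeholder that returns a step of 1.0 for every coordinate, standing in for the neural network, and the test problem is a toy quadratic. The inverse-Hessian update is the standard BFGS formula.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bfgs_update(H, s, y):
    """Standard BFGS update of the (symmetric) inverse-Hessian approximation H."""
    sy = dot(s, y)
    if sy <= 1e-12:  # skip the update when the curvature condition fails
        return H
    n = len(s)
    rho = 1.0 / sy
    Hy = matvec(H, y)
    scale = rho * rho * dot(y, Hy) + rho
    return [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + scale * s[i] * s[j]
             for j in range(n)] for i in range(n)]

def predict_step_sizes(x, g):
    # Hypothetical placeholder for a learned model: one step size per coordinate.
    return [1.0] * len(x)

def minimize(grad, x, iterations=20):
    n = len(x)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(iterations):
        direction = [-v for v in matvec(H, g)]
        steps = predict_step_sizes(x, g)
        s = [p * d for p, d in zip(steps, direction)]  # element-wise scaling
        x = [xi + si for xi, si in zip(x, s)]
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]
        H = bfgs_update(H, s, y)
        g = g_new
    return x

def quadratic_grad(x):
    # Gradient of the toy objective 0.5 * (x0^2 + 10 * x1^2).
    return [x[0], 10.0 * x[1]]

x_min = minimize(quadratic_grad, [1.0, 1.0])
```

Replacing the scalar line-search step with a vector `steps` is the only structural change relative to textbook BFGS; everything else in the loop is unchanged.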