Jialiang Wu, Yi Shen, Sijia Liu, Yi Tang, Sen Song, Xiaoyi Wang, Longjun Cai
Despite their impressive capabilities, large language models (LLMs) often
struggle with the hallucination issue of generating inaccurate or fabricated
content even when they possess correct knowledge. In this paper, we extend the
exploration of the correlation between hidden-state prediction changes and
output factuality to a deeper, token-wise level. Based on these insights, we
propose cross-layer Entropy eNhanced Decoding (END), a decoding method that
mitigates hallucinations without requiring extra training. END leverages inner
probability changes across layers to individually quantify the factual
knowledge required for each candidate token, and adjusts the final predictive
distribution to prioritize tokens with higher factuality. Experiments on both
hallucination and QA benchmarks demonstrate that END significantly enhances the
truthfulness and informativeness of generated content while maintaining robust
QA accuracy. Moreover, our work provides a deeper perspective on understanding
the correlations between inherent knowledge and output factuality.
Authors' comments: NAACL 2025 Findings
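The cross-layer idea in the END abstract can be illustrated with a minimal sketch. It assumes that each candidate token's probability has already been read out at several intermediate layers, normalizes those values across layers, and uses the resulting entropy to adjust the final score. The function names, the weight `alpha`, and the choice to penalize high entropy are illustrative assumptions, not the authors' exact formulation.

```python
import math


def cross_layer_entropy(layer_probs):
    """Entropy of one token's probabilities after normalizing across layers."""
    total = sum(layer_probs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for p in layer_probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


def adjust_scores(final_logprobs, per_layer_probs, alpha=1.0):
    """Subtract alpha * cross-layer entropy from each candidate's final log-prob.

    final_logprobs: dict token -> log-probability at the last layer.
    per_layer_probs: dict token -> list of probabilities, one per layer.
    The sign of the adjustment is an assumption made for this sketch.
    """
    return {
        tok: final_logprobs[tok] - alpha * cross_layer_entropy(per_layer_probs[tok])
        for tok in final_logprobs
    }
```

A token whose probability is flat across layers gets the maximum entropy, `log(number of layers)`, while a token whose probability appears only in the last layer gets entropy 0 and keeps its original score.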
Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller
This paper conducts a comprehensive layer-wise analysis of self-supervised
learning (SSL) models for audio deepfake detection across diverse contexts,
including multilingual datasets (English, Chinese, Spanish), partial, song, and
scene-based deepfake scenarios. By systematically evaluating the contributions
of different transformer layers, we uncover critical insights into model
behavior and performance. Our findings reveal that lower layers consistently
provide the most discriminative features, while higher layers capture less
relevant information. Notably, all models achieve competitive equal error rate
(EER) scores even when employing a reduced number of layers. This indicates
that we can reduce computational costs and increase the inference speed of
detecting deepfakes by utilizing only a few lower layers. This work enhances
our understanding of SSL models in deepfake detection, offering valuable
insights applicable across varied linguistic and contextual settings. Our
trained models and code are publicly available:
https://github.com/Yaselley/SSL_Layerwise_Deepfake.
Authors' comments: Accepted to NAACL Findings 2025
Runbing Zheng
Pairwise network comparison is essential for various applications, including neuroscience, disease research, and dynamic network analysis. While existing literature primarily focuses on comparing entire network structures, we address a vertex-wise comparison problem where two random networks share the same set of vertices but allow for structural variations in some vertices, enabling a more detailed and flexible analysis of network differences. In our framework, some vertices retain their latent positions between networks, while others undergo shifts. To identify the shifted and unshifted vertices and estimate their latent position shifts, we propose a method that first derives vertex embeddings in a low-rank Euclidean space for each network, then aligns these estimated vertex latent positions into a common space to resolve potential non-identifiability, and finally tests whether each vertex is shifted or not and estimates the vertex shifts. Our theoretical results establish the test statistic for the algorithms, guide parameter selection, and provide performance guarantees. Simulation studies and real data applications, including a case-control study in disease research and dynamic network analysis, demonstrate that the proposed algorithms are both computationally efficient and effective in extracting meaningful insights from network comparisons.
Thomas T. Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, Nikolai Matni
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
Sangyeon Park, Isaac Han, Seungwon Oh, Kyung-Joong Kim
Plasticity loss, a critical challenge in neural network training, limits a
model's ability to adapt to new tasks or shifts in data distribution. This
paper introduces AID (Activation by Interval-wise Dropout), a novel method
inspired by Dropout, designed to address plasticity loss. Unlike Dropout, AID
generates subnetworks by applying Dropout with different probabilities on each
preactivation interval. Theoretical analysis reveals that AID regularizes the
network, promoting behavior analogous to that of deep linear networks, which do
not suffer from plasticity loss. We validate the effectiveness of AID in
maintaining plasticity across various benchmarks, including continual learning
tasks on standard image classification datasets such as CIFAR10, CIFAR100, and
TinyImageNet. Furthermore, we show that AID enhances reinforcement learning
performance in the Arcade Learning Environment benchmark.
Authors' comments: Accepted to ICML 2025 (poster)
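The phrase "applying Dropout with different probabilities on each preactivation interval" in the AID abstract can be sketched as follows. The interval boundaries, the drop probabilities, and the function name are hypothetical, and the sketch omits any rescaling of kept values, so it illustrates the masking idea only, not the authors' implementation.

```python
import bisect
import random


def interval_dropout(pre_acts, boundaries, drop_probs, rng=random):
    """Zero each pre-activation with a probability chosen by its interval.

    boundaries: sorted cut points; they define len(boundaries) + 1 intervals.
    drop_probs: one drop probability per interval, from lowest to highest.
    """
    if len(drop_probs) != len(boundaries) + 1:
        raise ValueError("need one drop probability per interval")
    out = []
    for x in pre_acts:
        idx = bisect.bisect_right(boundaries, x)  # interval that contains x
        keep = rng.random() >= drop_probs[idx]
        out.append(x if keep else 0.0)
    return out
```

With a single boundary at 0 and drop probabilities `[1.0, 0.0]`, every negative value is dropped and every non-negative value is kept, so this sketch reduces to ReLU. With `[0.0, 0.0]` it is the identity map.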
Mehdi Nickzamir, Seyed Mohammad Sheikh Ahamdi Gandab
A novel hybrid Random Forest and Convolutional Neural Network (CNN) framework is presented for oil-water classification in hyperspectral images (HSI). To address the challenge of preserving spatial context, the images were divided into smaller, non-overlapping tiles, which served as the basis for training, validation, and testing. Random Forest demonstrated strong performance in pixel-wise classification, outperforming models such as XGBoost, Attention-Based U-Net, and HybridSN. However, Random Forest loses spatial context, limiting its ability to fully exploit the spatial relationships in hyperspectral data. To improve performance, a CNN was trained on the probability maps generated by the Random Forest, leveraging the CNN's capacity to incorporate spatial context. The hybrid approach achieved a 7.6% improvement in recall (to 0.85), a 2.4% improvement in F1 score (to 0.84), and a 0.54% improvement in AUC (to 0.99) compared to the baseline. These results highlight the effectiveness of combining probabilistic outputs with spatial feature learning for context-aware analysis of hyperspectral images.
Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li
Modern foundation models often undergo iterative "bootstrapping" in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model's performance improves--raising a crucial question: how should the total budget on generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework to analyze budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies--particularly exponential growth policies--exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.
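To make the budget-allocation question concrete, the following minimal sketch shows how constant, polynomial, and exponential growth policies would split a fixed total budget across iterations. The function names, the exponent `power`, and the growth factor `rate` are illustrative assumptions and are not taken from the paper.

```python
def constant_policy(total, rounds):
    """Split the budget evenly across all rounds."""
    return [total / rounds] * rounds


def polynomial_policy(total, rounds, power=2.0):
    """Give round t (starting at 1) a share proportional to t ** power."""
    weights = [t ** power for t in range(1, rounds + 1)]
    scale = total / sum(weights)
    return [w * scale for w in weights]


def exponential_policy(total, rounds, rate=2.0):
    """Give round t (starting at 0) a share proportional to rate ** t."""
    weights = [rate ** t for t in range(rounds)]
    scale = total / sum(weights)
    return [w * scale for w in weights]
```

For example, `exponential_policy(15, 4)` returns `[1.0, 2.0, 4.0, 8.0]`, whereas `constant_policy(15, 4)` returns `[3.75, 3.75, 3.75, 3.75]`. All three policies spend the same total, and they differ only in how late the spending is concentrated.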
Shipra Mahata, Samala Rathan, Juan Ruiz-Álvarez, Dionisio F. Yáñez
This article presents a novel approach to enhance the accuracy of classical quadrature rules by incorporating correction terms. The proposed method is particularly effective when the position of an isolated discontinuity in the function and the jump in the function and its derivatives at that position are known. Traditional numerical integration rules are exact for polynomials up to a certain degree. However, they may not provide accurate results for piecewise polynomials or functions with discontinuities without modifying the location and number of data points in the formula. Our proposed correction terms address this limitation, enabling the integration rule to retain its accuracy even in the presence of a jump discontinuity. The numerical experiments that we present support the theoretical results obtained.
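As a simple illustration of a correction term, the sketch below adds a zeroth-order correction to the composite trapezoidal rule for a single known jump in the function value. It is not the formula from the article, which also uses jumps in the derivatives. The function names and the assumption that the jump lies strictly inside one cell of a uniform grid are ours. For a jump $[f] = f(x^*_+) - f(x^*_-)$ located in the cell with midpoint $m$, the term $[f]\,(m - x^*)$ makes the rule exact for piecewise-constant integrands.

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule on n uniform cells of [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def jump_correction(a, b, n, x_star, jump):
    """Zeroth-order correction for one jump of size `jump` at x_star.

    jump = f(x_star+) - f(x_star-); x_star must lie strictly inside a cell.
    """
    h = (b - a) / n
    cell = int((x_star - a) / h)       # index of the cell containing x_star
    midpoint = a + (cell + 0.5) * h
    return jump * (midpoint - x_star)


def corrected_trapezoid(f, a, b, n, x_star, jump):
    """Trapezoidal rule plus the jump correction."""
    return trapezoid(f, a, b, n) + jump_correction(a, b, n, x_star, jump)
```

For the step function that equals 1 for x < 0.3 and 3 otherwise, the exact integral over [0, 1] is 2.4. With n = 4 the plain rule gives 2.25, and the correction 2 * (0.375 - 0.3) = 0.15 recovers the exact value.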
Peng Xue, Wei Fang, Zhengyu Ma, Zihan Huang, Zhaokun Zhou, Yonghong Tian, Timothée Masquelier, Huihui Zhou
Spiking Neural Networks (SNNs) are distinguished from Artificial Neural Networks (ANNs) by their sophisticated neuronal dynamics and sparse binary activations (spikes) inspired by the biological neural system. Traditional neuron models use iterative step-by-step dynamics, resulting in serial computation and slow training of SNNs. Recently, parallelizable spiking neuron models have been proposed to fully utilize the massive parallel computing ability of graphics processing units to accelerate the training of SNNs. However, existing parallelizable spiking neuron models involve dense floating-point operations and can learn long-term dependencies well only with a large order, at huge computational and memory cost. To resolve this trade-off between performance and cost, we propose the mul-free channel-wise Parallel Spiking Neuron, which is hardware-friendly and suitable for SNNs' resource-restricted application scenarios. The proposed neuron incorporates channel-wise convolution to enhance the learning ability, introduces sawtooth dilations to reduce the neuron order, and employs the bit shift operation to avoid multiplications. The algorithm design and the implementation of acceleration methods are discussed in detail. Our methods are validated on the neuromorphic Spiking Heidelberg Digits voice, sequential CIFAR image, and neuromorphic DVS-Lip vision datasets, achieving the best accuracy among SNNs. Training speed results demonstrate the effectiveness of our acceleration methods, providing a practical reference for future research.
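The idea of replacing multiplications with bit shifts can be sketched in a few lines. The example assumes integer (fixed-point) inputs and weights restricted to negative powers of two, so that each product becomes a right shift. How the proposed neuron actually arranges its shifts is not described in the abstract, so the function below is a hypothetical illustration only.

```python
def shift_weighted_sum(inputs, shifts):
    """Compute sum(inputs[i] * 2 ** -shifts[i]) using only shifts and adds.

    inputs: integers (for example fixed-point values).
    shifts: non-negative integers; a right shift by s floors the division by 2 ** s.
    """
    return sum(x >> s for x, s in zip(inputs, shifts))
```

For instance, `shift_weighted_sum([64, 32, 16], [1, 2, 4])` evaluates 64/2 + 32/4 + 16/16 = 41 without a single multiplication.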
Eric Nyiri, Olivier Gibaru
Machine learning methods successfully solve a plethora of tasks, but they have
the disadvantage of not providing any information about their decisions.
Consequently, estimating the reasoning of the system provides additional
information. To this end, Layer-Wise Relevance Propagation (LRP) is one of the
methods in eXplainable Machine Learning (XML). Its purpose is to express the
contributions of any neural network output in the domain of its input. The main
drawback of current methods is division by small values. To
overcome this problem, we provide a new definition called Relative LRP (R-LRP),
where the classical conservation law is satisfied up to a multiplicative factor
but without divisions by small values, except for ResNet skip connections. In
this article, we focus on image classification. This allows us to visualize the
contributions of a pixel to the predictions of a multi-layer neural network.
Pixel contributions provide a focus for further analysis of regions of
potential interest. R-LRP can be applied to any dense, convolutional, or
residual neural network. Moreover, unlike other LRP methods, R-LRP has no
hyperparameters to tune. We then compare the R-LRP method on different datasets
with simple CNN, VGG16, VGG19 and ResNet50 networks.
Authors' comments: arXiv admin note: text overlap with arXiv:2012.14501,
arXiv:1605.01713 by other authors
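The "division by small values" issue can be seen in the standard LRP-epsilon rule for a dense layer, sketched below. This is the classical rule, not the Relative LRP definition proposed in the article. The function name and the bias-free layer are simplifying assumptions.

```python
def lrp_epsilon(a, W, R_out, eps=1e-6):
    """Propagate relevance through a bias-free dense layer with the LRP-epsilon rule.

    a: input activations (length n); W: n x m weight matrix as nested lists;
    R_out: relevance of the m outputs. Returns the relevance of the n inputs.
    """
    n, m = len(W), len(W[0])
    z = [sum(a[i] * W[i][j] for i in range(n)) for j in range(m)]
    R_in = [0.0] * n
    for j in range(m):
        # The stabilizer eps keeps the denominator away from exactly zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n):
            R_in[i] += a[i] * W[i][j] / denom * R_out[j]
    return R_in
```

With `a = [1.0, 1.0]`, `W = [[1.0], [-0.999]]`, and `R_out = [1.0]`, the pre-activation is about 0.001, so the two input relevances come out near +999 and -998 even though they still sum to roughly 1. The conservation law holds, but the individual contributions are inflated by the small denominator.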
Tabinda Aman, Mohammad Nadeem, Shahab Saquib Sohail, Mohammad Anas, Erik Cambria
Animal stereotypes are deeply embedded in human culture and language. They often shape our perceptions and expectations of various species. Our study investigates how animal stereotypes manifest in vision-language models during the task of image generation. Through targeted prompts, we explore whether DALL-E perpetuates stereotypical representations of animals, such as "owls as wise," "foxes as unfaithful," etc. Our findings reveal significant stereotyped instances where the model consistently generates images aligned with cultural biases. The current work is the first of its kind to examine animal stereotyping in vision-language models systematically and to highlight a critical yet underexplored dimension of bias in AI-generated visual content.
Gurleen Kaur, Shubham Ghoshal, Reena Marbate, Neetiraj Malviya, Arshmehar Kaur, Vaisakh SB, Amit Kumar Srivastava, Manmeet Singh
Climate change significantly impacts public health, driving the emergence and spread of epidemics. Climate health models are essential for assessing and predicting disease outbreaks influenced by climatic variables like temperature and precipitation. For instance, dengue and malaria correlate with temperature changes, while cholera is linked to precipitation anomalies. Advances in AI-enabled weather prediction (AI-NWP) have improved forecasting, but integrating climate models with health systems is hindered by the lack of comprehensive, granular health datasets. This study introduces EpiClim: India's Epidemic-Climate Dataset, the first weekly district-wise dataset for major epidemics in India from 2009 to the present, sourced from the Integrated Disease Surveillance Programme (IDSP). The dataset, covering diseases like dengue, malaria, and acute-diarrheal disease, bridges the gap between climate and health data, enabling the integration of climate forecasts with epidemic prediction models. This work lays the foundation for coupling predictive climate health models with weather and climate models, advancing efforts to mitigate climate-induced public health crises.
Meng Wang, Jintao Yang, Bin Yang, Hui Li, Tongxin Gong, Bo Yang, Jiangtao Cui
Patch-wise Transformer based time series forecasting achieves superior
accuracy. However, this superiority relies heavily on intricate model design
with massive parameters, rendering both training and inference expensive and
preventing deployment on edge devices with limited resources and low
latency requirements. In addition, existing methods often work in an
autoregressive manner, taking into account only historical values while
ignoring valuable, easy-to-obtain context information, such as weather
forecasts, date, and time of day. To address these two limitations, we propose
LiPFormer, a novel Lightweight Patch-wise Transformer with weak data enriching.
First, to simplify the Transformer backbone, LiPFormer employs a novel
lightweight cross-patch attention and a linear transformation-based attention
to eliminate Layer Normalization and Feed Forward Network, two heavy components
in existing Transformers. Second, we propose a lightweight weak data enriching
module to provide additional, valuable weak supervision for training. It
enhances forecasting accuracy without significantly increasing model complexity,
as it relies on easily accessible context information rather than expensive
human labeling. This makes the weak data enriching module plug-and-play
for existing models. Extensive experiments on nine benchmark time series
datasets demonstrate that LiPFormer outperforms state-of-the-art methods in
accuracy, while significantly reducing parameter scale, training duration, and
GPU memory usage. Deployment on an edge device reveals that LiPFormer takes
only 1/3 of the inference time of classic Transformers. In addition, we
demonstrate that the weak data enriching can integrate seamlessly into various
Transformer based models to enhance their accuracy, suggesting its generality.
Authors' comments: Accepted by the 41st IEEE International Conference on Data
Engineering (ICDE 2025)
Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng
Current Multilingual ASR models only support a fraction of the world's
languages. Continual Learning (CL) aims to tackle this problem by adding new
languages to pre-trained models while avoiding the loss of performance on
existing languages, also known as Catastrophic Forgetting (CF). However,
existing CL methods overlook the adaptation of the token embedding lookup table
at the decoder, despite its significant contribution to CF. We propose
Embedding Layer Surgery, where a separate copy of the token embeddings is
created for each new language, and the matching copy is selected to replace
the old languages' embeddings when transcribing the corresponding new language.
Unfortunately, this approach means language identification (LID) errors also cause incorrect ASR
embedding selection. Our Task-wise Beam Search allows self-correction for such
mistakes. By adapting Whisper to 10 hours of data for each of 10 unseen
languages from Common Voice, results show that our method reduces the Average
WER (AWER) of pre-trained languages from 14.2% to 11.9% compared with
Experience Replay, without compromising the AWER of the unseen languages.
Authors' comments: Published in 2024 IEEE Spoken Language Technology Workshop
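The embedding-copy mechanism described in the Embedding Layer Surgery abstract can be sketched with a small container that keeps one embedding table per new language and falls back to the original table otherwise. The class name, the dictionary-based table, and the language codes are hypothetical. The sketch does not cover the Task-wise Beam Search used to recover from language identification errors.

```python
import copy


class EmbeddingBank:
    """Keeps the original token embeddings plus one trainable copy per new language."""

    def __init__(self, base_table):
        self.base = base_table   # embeddings of the pre-trained languages
        self.copies = {}         # language code -> separate embedding copy

    def add_language(self, lang):
        """Create an independent copy that can be fine-tuned for `lang`."""
        self.copies[lang] = copy.deepcopy(self.base)
        return self.copies[lang]

    def select(self, lang):
        """Return the copy for a new language, or the original table otherwise."""
        return self.copies.get(lang, self.base)
```

Because each copy is independent, updating the table of a new language leaves the embeddings used for the pre-trained languages untouched. The selection in `select` depends on the predicted language being correct.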
Haojun Yu, Di Dai, Ziwei Zhao, Di He, Han Hu, Liwei Wang
Scaling up the vocabulary of semantic segmentation models is extremely
challenging because annotating large-scale mask labels is labour-intensive and
time-consuming. Recently, language-guided segmentation models have been
proposed to address this challenge. However, their performance drops
significantly when applied to out-of-distribution categories. In this paper, we
propose a new large vocabulary semantic segmentation framework, called LarvSeg.
Unlike previous work, LarvSeg leverages image classification data to
scale the vocabulary of semantic segmentation models as large-vocabulary
classification datasets usually contain balanced categories and are much easier
to obtain. However, for classification tasks, the category is image-level,
while for segmentation we need to predict the label at pixel level. To address
this issue, we first propose a general baseline framework to incorporate
image-level supervision into the training process of a pixel-level segmentation
model, making the trained network perform semantic segmentation on newly
introduced categories in the classification data. We then observe that a model
trained on segmentation data can group pixel features of categories beyond the
training vocabulary. Inspired by this finding, we design a category-wise
attentive classifier to apply supervision to the precise regions of
corresponding categories to improve the model performance. Extensive
experiments demonstrate that LarvSeg significantly improves the large
vocabulary semantic segmentation performance, especially in the categories
without mask labels. For the first time, we provide a 21K-category semantic
segmentation model with the help of ImageNet21K. The code is available at
https://github.com/HaojunYu1998/large_voc_seg.
Authors' comments: PRCV 2024
Tianle Tao, Shizhao Peng, Tianyu Mei, Shoumo Li, Haogang Zhu
Accurate nonlinear computation is a key challenge in privacy-preserving machine learning (PPML). Most existing frameworks approximate it through linear operations, resulting in significant precision loss. This paper proposes an efficient, verifiable, and accurate secure 2-party logistic regression framework (EVA-S2PLoR), which achieves accurate nonlinear function computation through a novel secure element-wise multiplication protocol and its derived protocols. Our framework primarily includes secure 2-party vector element-wise multiplication, addition to multiplication, reciprocal, and sigmoid function protocols based on data disguising technology, where high efficiency and accuracy are guaranteed by the simple computation flow over the real number domain and the small, fixed number of communication rounds. We provide secure and robust anomaly detection through dimension transformation and Monte Carlo methods. EVA-S2PLoR outperforms many advanced frameworks in terms of precision (improving the performance of the sigmoid function by about 10 orders of magnitude compared to most frameworks) and delivers the best overall performance in secure logistic regression experiments.
Aadya Arora, Vinay Namboodiri
With the popularity of foundation models, parameter-efficient fine-tuning has become the de facto approach to leveraging pretrained models for downstream tasks. Taking inspiration from recent advances in large language models, Visual Prompt Tuning and similar techniques learn an additional prompt to efficiently fine-tune a pretrained vision foundation model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where there is large inter-class variance and small intra-class variance. Hence, in this paper we propose to leverage the advanced segmentation capabilities of Segment Anything Model 2 (SAM2) as a visual prompting cue for the visual encoder of CLIP (Contrastive Language-Image Pretraining), guiding the encoder's attention to relevant regions in the image. This helps the model focus on highly discriminative regions without being distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets including X-rays, CT scans, and MRI images, and report accuracies of (71%, 81%, 86%, 58%) for the proposed approach on the (COVID, lung-disease, brain-tumor, breast-cancer) datasets, against (66%, 70%, 68%, 29%) for a pretrained CLIP model after few-shot training. The proposed approach also yields interpretable explanations of the classification performance through the localization obtained from segmentation.
Hao Shu
Deep learning has significantly advanced image edge detection (ED), primarily
through improved feature extraction. However, most existing ED models apply
uniform feature fusion across all pixels, ignoring critical differences between
regions such as edges and textures. To address this limitation, we propose the
Extractor-Selector (E-S) paradigm, a novel framework that introduces pixel-wise
feature selection for more adaptive and precise fusion. Unlike conventional
image-level fusion that applies the same convolutional kernel to all pixels,
our approach dynamically selects relevant features at each pixel, enabling more
refined edge predictions. The E-S framework can be seamlessly integrated with
existing ED models without architectural changes, delivering substantial
performance gains. It can also be combined with enhanced feature extractors for
further accuracy improvements. Extensive experiments across multiple benchmarks
confirm that our method consistently outperforms baseline ED models. For
instance, on the BIPED2 dataset, the proposed framework can achieve over 7%
improvements in ODS and OIS, and 22% improvements in AP, demonstrating its
effectiveness and superiority.
Authors' comments: 17 pages
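The contrast between image-level fusion and pixel-wise selection in the E-S abstract can be sketched as follows. Image-level fusion is modeled as one shared weight vector (a 1x1 kernel) applied at every pixel, and pixel-wise selection as a softmax over per-pixel selector scores. Both function names and the softmax form of the selector are assumptions made for illustration, not the architecture from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]


def image_level_fusion(features, weights):
    """Fuse K feature maps (each H x W) with the same K weights at every pixel."""
    k, h, w = len(features), len(features[0]), len(features[0][0])
    return [[sum(weights[c] * features[c][i][j] for c in range(k))
             for j in range(w)] for i in range(h)]


def pixel_wise_fusion(features, selector_logits):
    """Fuse K feature maps with weights chosen separately at each pixel.

    selector_logits: H x W x K scores, e.g. produced by a selector network.
    """
    k, h, w = len(features), len(features[0]), len(features[0][0])
    fused = []
    for i in range(h):
        row = []
        for j in range(w):
            pw = softmax(selector_logits[i][j])  # per-pixel fusion weights
            row.append(sum(pw[c] * features[c][i][j] for c in range(k)))
        fused.append(row)
    return fused
```

When the selector scores are identical at every pixel, `pixel_wise_fusion` collapses to `image_level_fusion` with the corresponding softmax weights. The two differ only where the selector prefers different feature maps in different regions.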
Jingkai Sun, Qiang Zhang, Jiaxu Wang, Jiahang Cao, Renjing Xu
Dynamic vision sensors (DVS) are bio-inspired devices that capture visual
information in the form of asynchronous events, which encode changes in pixel
intensity with high temporal resolution and low latency. These events provide
rich motion cues that can be exploited for various computer vision tasks, such
as action recognition. However, most existing DVS-based action recognition
methods lose temporal information during data transformation or suffer from
noise and outliers caused by sensor imperfections or environmental factors. To
address these challenges, we propose a novel framework that preserves and
exploits the spatiotemporal structure of event data for action recognition. Our
framework consists of two main components: 1) a point-wise event masked
autoencoder (MAE) that learns a compact and discriminative representation of
event patches by reconstructing them from masked raw event camera points data;
2) an improved event points patch generation algorithm that leverages an event
data inlier model and point-wise data augmentation techniques to enhance the
quality and diversity of event points patches. To the best of our knowledge,
our approach is the first to apply pre-training to raw event camera point data,
and we propose a novel event points patch embedding to
utilize transformer-based models on event cameras.
Authors' comments: ICASSP 2025 Camera Ready
Wonsuk Jang, Thierry Tambe
Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves an 11.83% (7.56%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to the MXFP4 format with lower bit usage per data, while being only 5.46% (2.65%) below full precision even when quantizing full-path matrix multiplication. By focusing on how to represent rather than how to scale, our work presents a promising path for energy-efficient LLM inference.
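The per-block format selection described in the BlockDialect abstract can be sketched as a search over a small formatbook: each block is scaled by its largest magnitude, quantized with every candidate format, and assigned the format with the lowest squared error. The `"e2m1"` entry lists the standard FP4 (E2M1) magnitudes. The `"variant"` entry, the error criterion, and all function names are hypothetical and do not reproduce DialectFP4 or the two-stage online procedure.

```python
import math

FORMATBOOK = {
    "e2m1": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],     # standard FP4 magnitudes
    "variant": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0],  # hypothetical dialect
}


def quantize_block(block, levels):
    """Scale the block so its max magnitude maps to max(levels), then round."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0:
        return [0.0] * len(block)
    scale = max_abs / max(levels)
    out = []
    for x in block:
        mag = min(levels, key=lambda q: abs(abs(x) / scale - q))  # nearest level
        out.append(math.copysign(mag * scale, x))
    return out


def pick_format(block, formatbook=FORMATBOOK):
    """Return (format name, squared error, quantized block) with the lowest error."""
    best = None
    for name, levels in formatbook.items():
        quantized = quantize_block(block, levels)
        err = sum((x - y) ** 2 for x, y in zip(block, quantized))
        if best is None or err < best[1]:
            best = (name, err, quantized)
    return best
```

The block `[6.0, 3.0, -1.0, 0.5]` is represented exactly by `"e2m1"`, while `[5.0, 4.0, 1.0, 0.5]` is represented exactly only by the hypothetical `"variant"`. This is the situation in which choosing a format per block pays off.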