Mehdi Nickzamir, Seyed Mohammad Sheikh Ahamdi Gandab
A novel hybrid Random Forest and Convolutional Neural Network (CNN) framework is presented for oil-water classification in hyperspectral images (HSI). To address the challenge of preserving spatial context, the images were divided into smaller, non-overlapping tiles, which served as the basis for training, validation, and testing. Random Forest demonstrated strong performance in pixel-wise classification, outperforming models such as XGBoost, Attention-Based U-Net, and HybridSN. However, Random Forest loses spatial context, limiting its ability to fully exploit the spatial relationships in hyperspectral data. To improve performance, a CNN was trained on the probability maps generated by the Random Forest, leveraging the CNN's capacity to incorporate spatial context. The hybrid approach achieved a 7.6% improvement in recall (to 0.85), a 2.4% improvement in F1 score (to 0.84), and a 0.54% improvement in AUC (to 0.99) compared to the baseline. These results highlight the effectiveness of combining probabilistic outputs with spatial feature learning for context-aware analysis of hyperspectral images.
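The tiling step described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names `tile_origins` and `split_tiles`, the split fractions, and the choice to drop partial tiles at the image border are all assumptions.

```python
def tile_origins(height, width, tile):
    """Return (row, col) top-left corners of non-overlapping tile x tile tiles.

    Assumption: partial tiles at the right/bottom border are dropped.
    """
    return [
        (r, c)
        for r in range(0, height - tile + 1, tile)
        for c in range(0, width - tile + 1, tile)
    ]


def split_tiles(origins, train_frac=0.6, val_frac=0.2):
    """Assign whole tiles (not individual pixels) to train/val/test splits.

    Assumption: tiles are taken in order; a real pipeline would likely
    shuffle them first.
    """
    n_train = int(len(origins) * train_frac)
    n_val = int(len(origins) * val_frac)
    return (
        origins[:n_train],
        origins[n_train:n_train + n_val],
        origins[n_train + n_val:],
    )
```

Splitting by tile rather than by pixel keeps neighbouring pixels in the same subset, which is what lets a spatial model be evaluated without leakage between splits.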
Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li
Modern foundation models often undergo iterative "bootstrapping" in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model's performance improves, raising a crucial question: how should the total budget on generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework to analyze budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies, particularly exponential growth policies, exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.
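To make the three policy families concrete, here is a minimal sketch of how a fixed total budget might be split across iterations. The function name `allocate_budget`, the `rate` parameter, and the specific weight formulas are assumptions for illustration; the abstract does not give the authors' exact parameterization.

```python
def allocate_budget(total, iterations, policy="exponential", rate=2.0):
    """Split `total` across `iterations` rounds according to a growth policy.

    policy: "constant", "polynomial" (weights t**rate), or
            "exponential" (weights rate**t), with t = 1..iterations.
    The weights are normalised so the per-round budgets sum to `total`.
    """
    steps = range(1, iterations + 1)
    if policy == "constant":
        weights = [1.0 for _ in steps]
    elif policy == "polynomial":
        weights = [float(t) ** rate for t in steps]
    elif policy == "exponential":
        weights = [rate ** t for t in steps]
    else:
        raise ValueError(f"unknown policy: {policy}")
    norm = sum(weights)
    return [total * w / norm for w in weights]
```

For example, `allocate_budget(70, 3, "exponential", 2.0)` uses weights 2, 4, 8 and so assigns most of the budget to the last iteration, whereas the constant policy spreads it evenly.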
Shipra Mahata, Samala Rathan, Juan Ruiz-Álvarez, Dionisio F. Yáñez
This article presents a novel approach to enhance the accuracy of classical quadrature rules by incorporating correction terms. The proposed method is particularly effective when the position of an isolated discontinuity in the function and the jumps in the function and its derivatives at that position are known. Traditional numerical integration rules are exact for polynomials up to a certain degree. However, they may not provide accurate results for piecewise polynomials or functions with discontinuities unless the location and number of data points in the formula are modified. Our proposed correction terms address this limitation, enabling the integration rule to retain its accuracy even in the presence of a jump discontinuity. The numerical experiments that we present support the theoretical results obtained.
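The idea of a correction term can be illustrated on the simplest case, a single-interval trapezoidal rule. The sketch below is a derivation made for this example, not the authors' formulas: it writes the integrand as a smooth part plus a jump part J(x) = ([f] + [f'](x - x*)) for x >= x*, and adds the trapezoidal error of J back in. The function name and argument names are hypothetical.

```python
def corrected_trapezoid(f, a, b, x_star, jump_f, jump_df):
    """Single-interval trapezoidal rule plus a jump-correction term.

    jump_f  = f(x_star+) - f(x_star-)
    jump_df = f'(x_star+) - f'(x_star-)
    Exact when f is linear on each side of x_star (a < x_star < b).
    """
    h = b - a
    d = b - x_star  # length of the sub-interval to the right of the jump
    trapezoid = 0.5 * h * (f(a) + f(b))
    # Correction = (exact integral - trapezoid value) of the jump part
    # J(x) = jump_f + jump_df * (x - x_star) for x >= x_star, else 0.
    correction = jump_f * (d - 0.5 * h) + 0.5 * jump_df * d * (d - h)
    return trapezoid + correction


def f(x):
    """Piecewise-linear test function with a jump at x = 0.3."""
    return x if x < 0.3 else 2.0 * x + 1.0


plain = 0.5 * (f(0.0) + f(1.0))
corrected = corrected_trapezoid(f, 0.0, 1.0, 0.3, jump_f=1.3, jump_df=1.0)
```

Here the exact integral is 0.045 + 1.61 = 1.655. The plain trapezoidal value is 1.5, and the corrected value recovers 1.655 without moving or adding nodes.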
Peng Xue, Wei Fang, Zhengyu Ma, Zihan Huang, Zhaokun Zhou, Yonghong Tian, Timothée Masquelier, Huihui Zhou
Spiking Neural Networks (SNNs) are distinguished from Artificial Neural Networks (ANNs) by their sophisticated neuronal dynamics and sparse binary activations (spikes) inspired by the biological neural system. Traditional neuron models use iterative step-by-step dynamics, resulting in serial computation and slow training of SNNs. Recently, parallelizable spiking neuron models have been proposed to fully utilize the massive parallel computing ability of graphics processing units to accelerate the training of SNNs. However, existing parallelizable spiking neuron models involve dense floating-point operations and can only learn long-term dependencies well with a large order, at a huge computational and memory cost. To resolve this trade-off between performance and cost, we propose the mul-free channel-wise Parallel Spiking Neuron, which is hardware-friendly and suitable for SNNs' resource-restricted application scenarios. The proposed neuron introduces channel-wise convolution to enhance the learning ability, uses sawtooth dilations to reduce the neuron order, and employs the bit shift operation to avoid multiplications. The design and implementation of the acceleration methods are discussed in detail. Our methods are validated on the neuromorphic Spiking Heidelberg Digits speech dataset, sequential CIFAR images, and the neuromorphic DVS-Lip vision dataset, achieving the best accuracy among SNNs. Training speed results demonstrate the effectiveness of our acceleration methods, providing a practical reference for future research.
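The "bit shift instead of multiplication" idea can be shown in isolation. The sketch below assumes integer (fixed-point) inputs and weights restricted to powers of two, stored as their exponents; both are assumptions made for illustration, and the function is not the authors' neuron model.

```python
def shift_accumulate(inputs, shifts):
    """Weighted sum of integer inputs with power-of-two weights 2**-k.

    Each weight is stored as its exponent k, so `x * 2**-k` becomes `x >> k`
    (an arithmetic right shift, i.e. floor division by 2**k).
    """
    return sum(x >> k for x, k in zip(inputs, shifts))
```

With inputs `[8, 16, 32]` and shifts `[0, 1, 2]`, each term contributes 8, so no multiplier is needed to form the weighted sum.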
Eric Nyiri, Olivier Gibaru
Machine learning methods solve a plethora of tasks very successfully,
but they have the disadvantage of not providing any information about their
decisions. Consequently, estimating the reasoning of the system provides
additional information. For this, Layer-Wise Relevance Propagation (LRP) is one
of the methods in eXplainable Machine Learning (XML). Its purpose is to provide
contributions of any neural network output in the domain of its input. The main
drawback of current methods is due to division by small values. To
overcome this problem, we provide a new definition called Relative LRP (R-LRP),
where the classical conservation law is satisfied up to a multiplicative factor
but without divisions by small values, except for ResNet skip connections. In
this article, we focus on image classification. This allows us to visualize the
contributions of a pixel to the predictions of a multi-layer neural network.
Pixel contributions provide a focus for further analysis of regions of potential
interest. R-LRP can be applied to any dense, CNN, or residual neural network.
Moreover, R-LRP has no hyperparameters to tune, contrary to other LRP
methods. We then compare the R-LRP method on different datasets with simple
CNN, VGG16, VGG19, and ResNet50 networks.
Authors' comments: arXiv admin note: text overlap with arXiv:2012.14501,
arXiv:1605.01713 by other authors
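For context on the "division by small values" issue, the sketch below implements the classical LRP-epsilon rule for one bias-free dense layer. It is the standard baseline rule, not the R-LRP definition proposed in the abstract (which is not specified there); the function name and the list-of-lists weight layout are assumptions.

```python
def lrp_epsilon_dense(a, w, relevance_out, eps=1e-6):
    """Propagate relevance through a bias-free dense layer z_j = sum_i a_i w[i][j].

    Uses the standard LRP-epsilon rule:
        R_i = sum_j a_i * w[i][j] / (z_j + eps * sign(z_j)) * R_j
    """
    n_in, n_out = len(a), len(relevance_out)
    z = [sum(a[i] * w[i][j] for i in range(n_in)) for j in range(n_out)]
    # The stabiliser eps keeps the denominator away from zero when z_j is tiny.
    denom = [zj + (eps if zj >= 0 else -eps) for zj in z]
    return [
        sum(a[i] * w[i][j] / denom[j] * relevance_out[j] for j in range(n_out))
        for i in range(n_in)
    ]
```

With `eps = 0` and nonzero pre-activations, the input relevances sum exactly to the output relevances (the conservation law). A larger `eps` stabilises the division but absorbs some relevance, which is the trade-off the stabiliser hyperparameter controls.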
Tabinda Aman, Mohammad Nadeem, Shahab Saquib Sohail, Mohammad Anas, Erik Cambria
Animal stereotypes are deeply embedded in human culture and language. They often shape our perceptions and expectations of various species. Our study investigates how animal stereotypes manifest in vision-language models during the task of image generation. Through targeted prompts, we explore whether DALL-E perpetuates stereotypical representations of animals, such as "owls as wise," "foxes as unfaithful," etc. Our findings reveal significant stereotyped instances where the model consistently generates images aligned with cultural biases. The current work is the first of its kind to examine animal stereotyping in vision-language models systematically and to highlight a critical yet underexplored dimension of bias in AI-generated visual content.
Gurleen Kaur, Shubham Ghoshal, Reena Marbate, Neetiraj Malviya, Arshmehar Kaur, Vaisakh SB, Amit Kumar Srivastava, Manmeet Singh
Climate change significantly impacts public health, driving the emergence and spread of epidemics. Climate health models are essential for assessing and predicting disease outbreaks influenced by climatic variables like temperature and precipitation. For instance, dengue and malaria correlate with temperature changes, while cholera is linked to precipitation anomalies. Advances in AI-enabled weather prediction (AI-NWP) have improved forecasting, but integrating climate models with health systems is hindered by the lack of comprehensive, granular health datasets. This study introduces EpiClim: India's Epidemic-Climate Dataset, the first weekly district-wise dataset for major epidemics in India from 2009 to the present, sourced from the Integrated Disease Surveillance Programme (IDSP). The dataset, covering diseases like dengue, malaria, and acute-diarrheal disease, bridges the gap between climate and health data, enabling the integration of climate forecasts with epidemic prediction models. This work lays the foundation for coupling predictive climate health models with weather and climate models, advancing efforts to mitigate climate-induced public health crises.
Meng Wang, Jintao Yang, Bin Yang, Hui Li, Tongxin Gong, Bo Yang, Jiangtao Cui
Patch-wise Transformer-based time series forecasting achieves superior
accuracy. However, this superiority relies heavily on intricate model designs
with massive numbers of parameters, rendering both training and inference
expensive and preventing deployment on edge devices with limited resources and
low latency requirements. In addition, existing methods often work in an
autoregressive manner, taking into account only historical values while
ignoring valuable, easy-to-obtain context information, such as weather
forecasts, date, and time of day. To address these two limitations, we propose
LiPFormer, a novel Lightweight Patch-wise Transformer with weak data enriching.
First, to simplify the Transformer backbone, LiPFormer employs a novel
lightweight cross-patch attention and a linear transformation-based attention
to eliminate Layer Normalization and Feed Forward Network, two heavy components
in existing Transformers. Second, we propose a lightweight, weak data enriching
module to provide additional, valuable weak supervision to the training. It
enhances forecasting accuracy without significantly increasing model complexity,
as it does not involve expensive human labeling but instead uses easily
accessible context information. This allows the weak data enriching module to
be plugged into existing models. Extensive experiments on nine benchmark time series
datasets demonstrate that LiPFormer outperforms state-of-the-art methods in
accuracy, while significantly reducing parameter scale, training duration, and
GPU memory usage. Deployment on an edge device reveals that LiPFormer takes
only 1/3 of the inference time of classic Transformers. In addition, we
demonstrate that the weak data enriching can integrate seamlessly into various
Transformer-based models to enhance their accuracy, suggesting its generality.
Authors' comments: Accepted by the 41st IEEE International Conference on Data
Engineering (ICDE 2025)
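As background for the "patch-wise" terminology, the sketch below shows the generic patching step that patch-wise forecasters apply before attention: the input window is cut into fixed-length segments that become the tokens. The function name, `patch_len`, and `stride` are illustrative assumptions and are not taken from LiPFormer.

```python
def make_patches(series, patch_len, stride):
    """Slice a 1-D series into (possibly overlapping) fixed-length patches.

    Trailing values that do not fill a whole patch are dropped.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    return [
        series[start:start + patch_len]
        for start in range(0, len(series) - patch_len + 1, stride)
    ]
```

A window of 10 values with `patch_len=4` and `stride=2` yields 4 tokens instead of 10, which is the main source of the reduced attention cost in patch-based designs.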
Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng
Current Multilingual ASR models only support a fraction of the world's
languages. Continual Learning (CL) aims to tackle this problem by adding new
languages to pre-trained models while avoiding the loss of performance on
existing languages, also known as Catastrophic Forgetting (CF). However,
existing CL methods overlook the adaptation of the token embedding lookup table
at the decoder, despite its significant contribution to CF. We propose
Embedding Layer Surgery, where a separate copy of the token embeddings is
created for each new language, and one of the copies is selected to replace
the old languages' embeddings when transcribing the corresponding new language.
Unfortunately, this approach means that language identification (LID) errors
also cause incorrect ASR embedding selection. Our Task-wise Beam Search allows
self-correction of such
mistakes. By adapting Whisper to 10 hours of data for each of 10 unseen
languages from Common Voice, results show that our method reduces the Average
WER (AWER) of pre-trained languages from 14.2% to 11.9% compared with
Experience Replay, without compromising the AWER of the unseen languages.
Authors' comments: Published in 2024 IEEE Spoken Language Technology Workshop
Haojun Yu, Di Dai, Ziwei Zhao, Di He, Han Hu, Liwei Wang
Scaling up the vocabulary of semantic segmentation models is extremely
challenging because annotating large-scale mask labels is labour-intensive and
time-consuming. Recently, language-guided segmentation models have been
proposed to address this challenge. However, their performance drops
significantly when applied to out-of-distribution categories. In this paper, we
propose a new large vocabulary semantic segmentation framework, called LarvSeg.
Different from previous works, LarvSeg leverages image classification data to
scale the vocabulary of semantic segmentation models as large-vocabulary
classification datasets usually contain balanced categories and are much easier
to obtain. However, for classification tasks, the category is image-level,
while for segmentation we need to predict the label at pixel level. To address
this issue, we first propose a general baseline framework to incorporate
image-level supervision into the training process of a pixel-level segmentation
model, making the trained network perform semantic segmentation on newly
introduced categories in the classification data. We then observe that a model
trained on segmentation data can group pixel features of categories beyond the
training vocabulary. Inspired by this finding, we design a category-wise
attentive classifier to apply supervision to the precise regions of
corresponding categories to improve the model performance. Extensive
experiments demonstrate that LarvSeg significantly improves the large
vocabulary semantic segmentation performance, especially in the categories
without mask labels. For the first time, we provide a 21K-category semantic
segmentation model with the help of ImageNet21K. The code is available at
https://github.com/HaojunYu1998/large_voc_seg.
Authors' comments: PRCV 2024
Tianle Tao, Shizhao Peng, Tianyu Mei, Shoumo Li, Haogang Zhu
Accurate nonlinear computation is a key challenge in privacy-preserving machine learning (PPML). Most existing frameworks approximate it through linear operations, resulting in significant precision loss. This paper proposes an efficient, verifiable, and accurate secure 2-party logistic regression framework (EVA-S2PLoR), which achieves accurate nonlinear function computation through a novel secure element-wise multiplication protocol and its derived protocols. Our framework primarily includes secure 2-party vector element-wise multiplication, addition to multiplication, reciprocal, and sigmoid function based on data disguising technology, where high efficiency and accuracy are guaranteed by the simple computation flow based on the real number domain and the small, fixed number of communication rounds. We provide secure and robust anomaly detection through dimension transformation and Monte Carlo methods. EVA-S2PLoR outperforms many advanced frameworks in terms of precision (improving the performance of the sigmoid function by about 10 orders of magnitude compared to most frameworks) and delivers the best overall performance in secure logistic regression experiments.
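The precision loss from linear approximations can be seen with a small plaintext experiment. The three-piece surrogate below is one simple piecewise-linear stand-in for the sigmoid of the kind used in some secure-computation work; it is an assumption chosen for illustration and is not necessarily what any framework compared in the abstract uses. No cryptography is involved here.

```python
import math


def sigmoid(x):
    """Exact logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def piecewise_linear_sigmoid(x):
    """Three-piece linear surrogate: 0, x + 1/2, or 1."""
    if x < -0.5:
        return 0.0
    if x > 0.5:
        return 1.0
    return x + 0.5


def max_abs_error(lo=-6.0, hi=6.0, steps=1201):
    """Largest gap between the surrogate and the exact sigmoid on a grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(sigmoid(x) - piecewise_linear_sigmoid(x)) for x in xs)
```

On this grid the worst-case gap occurs near x = ±0.5, where the surrogate saturates while the true sigmoid is still about 0.38 away from 0 or 1.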
Aadya Arora, Vinay Namboodiri
With the popularity of foundational models, parameter-efficient fine-tuning has become the de facto approach to leverage pretrained models to perform downstream tasks. Taking inspiration from recent advances in large language models, Visual Prompt Tuning and similar techniques learn an additional prompt to efficiently fine-tune a pretrained vision foundational model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where there is large inter-class variance, and small intra-class variance. Hence, in this paper we propose to leverage the advanced segmentation capabilities of Segment Anything Model 2 (SAM2) as a visual prompting cue to help the visual encoder in CLIP (Contrastive Language-Image Pretraining) by guiding the attention of the CLIP visual encoder to relevant regions in the image. This helps the model focus on highly discriminative regions without getting distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets including X-rays, CT scans, and MRI images, and report an accuracy of (71%, 81%, 86%, 58%) from the proposed approach on (COVID, lung-disease, brain-tumor, breast-cancer) datasets against (66%, 70%, 68%, 29%) from a pretrained CLIP model after few-shot training. The proposed approach also makes it possible to obtain an interpretable explanation for the classification performance through the localization obtained using segmentation.
Hao Shu
Deep learning has significantly advanced image edge detection (ED), primarily
through improved feature extraction. However, most existing ED models apply
uniform feature fusion across all pixels, ignoring critical differences between
regions such as edges and textures. To address this limitation, we propose the
Extractor-Selector (E-S) paradigm, a novel framework that introduces pixel-wise
feature selection for more adaptive and precise fusion. Unlike conventional
image-level fusion that applies the same convolutional kernel to all pixels,
our approach dynamically selects relevant features at each pixel, enabling more
refined edge predictions. The E-S framework can be seamlessly integrated with
existing ED models without architectural changes, delivering substantial
performance gains. It can also be combined with enhanced feature extractors for
further accuracy improvements. Extensive experiments across multiple benchmarks
confirm that our method consistently outperforms baseline ED models. For
instance, on the BIPED2 dataset, the proposed framework can achieve over 7$\%$
improvements in ODS and OIS, and 22$\%$ improvements in AP, demonstrating its
effectiveness and superiority.
Authors' comments: 17 pages
Jingkai Sun, Qiang Zhang, Jiaxu Wang, Jiahang Cao, Renjing Xu
Dynamic vision sensors (DVS) are bio-inspired devices that capture visual
information in the form of asynchronous events, which encode changes in pixel
intensity with high temporal resolution and low latency. These events provide
rich motion cues that can be exploited for various computer vision tasks, such
as action recognition. However, most existing DVS-based action recognition
methods lose temporal information during data transformation or suffer from
noise and outliers caused by sensor imperfections or environmental factors. To
address these challenges, we propose a novel framework that preserves and
exploits the spatiotemporal structure of event data for action recognition. Our
framework consists of two main components: 1) a point-wise event masked
autoencoder (MAE) that learns a compact and discriminative representation of
event patches by reconstructing them from masked raw event camera points data;
2) an improved event points patch generation algorithm that leverages an event
data inlier model and point-wise data augmentation techniques to enhance the
quality and diversity of event points patches. To the best of our knowledge,
our approach is the first to introduce pre-training on raw event camera point
data, and we propose a novel event points patch embedding to
utilize transformer-based models on event cameras.
Authors' comments: ICASSP 2025 Camera Ready
Wonsuk Jang, Thierry Tambe
Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 11.83% (7.56%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.46% (2.65%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
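Per-block format selection can be sketched as a small search: quantize the block with every candidate format and keep the one with the lowest error. Everything specific below is hypothetical, including the three example "dialects" in `FORMATBOOK`, the squared-error criterion, and the max-based scaling; the actual DialectFP4 value sets and selection procedure are not given in the abstract.

```python
import math

# Hypothetical formatbook: each "dialect" is a set of non-negative magnitudes
# representable after scaling. These values are illustrative, not DialectFP4.
FORMATBOOK = {
    "uniform": [0, 1, 2, 3, 4, 5, 6, 7],
    "dense_low": [0, 0.5, 1, 1.5, 2, 3, 5, 7],
    "dense_high": [0, 2, 4, 5, 5.5, 6, 6.5, 7],
}


def quantize_block(block, levels):
    """Scale the block so its max magnitude hits the top level, then snap."""
    peak = max(abs(x) for x in block)
    scale = peak / max(levels) if peak > 0 else 1.0
    out = []
    for x in block:
        nearest = min(levels, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(nearest * scale, x))
    return out


def squared_error(block, approx):
    """Sum of squared differences between a block and its approximation."""
    return sum((x - y) ** 2 for x, y in zip(block, approx))


def pick_dialect(block, formatbook=FORMATBOOK):
    """Return the name of the format with the lowest squared error."""
    return min(
        formatbook,
        key=lambda name: squared_error(
            block, quantize_block(block, formatbook[name])
        ),
    )
```

A block whose values cluster near zero favours a format with more small levels, while a block clustered near its maximum favours one with more large levels, which is the intuition behind choosing how to represent rather than only how to scale.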
Kun Li, George Vosselman, Michael Ying Yang
The goal of referring remote sensing image segmentation (RRSIS) is to extract
specific pixel-level regions within an aerial image via a natural language
expression. Recent advancements, particularly Transformer-based fusion designs,
have demonstrated remarkable progress in this domain. However, existing methods
primarily focus on refining visual features using language-aware guidance
during the cross-modal fusion stage, neglecting the complementary
vision-to-language flow. This limitation often leads to irrelevant or
suboptimal representations. In addition, the diverse spatial scales of ground
objects in aerial images pose significant challenges to the visual perception
capabilities of existing models when conditioned on textual inputs. In this
paper, we propose an innovative framework called Scale-wise Bidirectional
Alignment Network (SBANet) to address these challenges for RRSIS. Specifically,
we design a Bidirectional Alignment Module (BAM) with learnable query tokens to
selectively and effectively represent visual and linguistic features,
emphasizing regions associated with key tokens. BAM is further enhanced with a
dynamic feature selection block, designed to provide both macro- and
micro-level visual features, preserving global context and local details to
facilitate more effective cross-modal interaction. Furthermore, SBANet
incorporates a text-conditioned channel and spatial aggregator to bridge the
gap between the encoder and decoder, enhancing cross-scale information exchange
in complex aerial scenarios. Extensive experiments demonstrate that our
proposed method achieves superior performance in comparison to previous
state-of-the-art methods on the RRSIS-D and RefSegRS datasets, both
quantitatively and qualitatively. The code will be released after publication.
Authors' comments: Under review
Ninad Jadhav, Meghna Behari, Robert J. Wood, Stephanie Gil
We introduce a Wireless Signal based Efficient multi-Robot eXploration (WiSER-X) algorithm applicable to a decentralized team of robots exploring an unknown environment with communication bandwidth constraints. WiSER-X relies only on local inter-robot relative position estimates, which can be obtained by exchanging signal pings from onboard sensors such as WiFi and Ultra-Wide Band, among others, to inform the exploration decisions of individual robots and minimize redundant coverage overlaps. Furthermore, WiSER-X enables asynchronous termination without requiring a shared map between the robots. It also adapts to heterogeneous robot behaviors and even complete failures in unknown environments while ensuring complete coverage. Simulations show that WiSER-X leads to 58% lower overlap than a zero-information-sharing baseline (algorithm-1) and only 23% more overlap than a full-information-sharing baseline (algorithm-2).
Shuokai Pan, Gerti Tuzi, Sudarshan Sreeram, Dibakar Gope
Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, the significant quality loss of fully quantized Winograd using existing coarser-grained post-training quantization methods, combined with the complexity and cost of finetuning the Winograd transformation matrices for such large models to recover quality, makes them unsuitable for large-scale foundation models. Motivated by the presence of a large range of values in them, we investigate the impact of finer-grained group-wise quantization in quantizing diffusion models. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training data, the generalization performance of quantized diffusion models is safely guaranteed. For the text-to-image generation task, the 8-bit fully-quantized diffusion model with Winograd provides near-lossless quality (FID and CLIP scores) in comparison to the full-precision model. For image classification, our method outperforms the state-of-the-art Winograd PTQ method by 1.62% and 2.56% in top-1 ImageNet accuracy on ResNet-18 and ResNet-34, respectively, with Winograd F(6, 3).
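Why finer-grained scales help with a large value range can be shown with a toy comparison of one scale per tensor versus one scale per group. The sketch below uses plain symmetric uniform quantization on a flat list. It does not involve Winograd transforms or the authors' scale finetuning, and the function names and group layout are assumptions.

```python
def quantize_dequantize(values, bits=8):
    """Symmetric uniform quantization with one scale for all `values`."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def groupwise_quantize_dequantize(values, group_size, bits=8):
    """Same quantizer, but with a separate scale per contiguous group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy data: a small-magnitude group followed by a large-magnitude group.
x = [0.01, 0.02, -0.03, 0.015, 10.0, -20.0, 30.0, 5.0]
per_tensor_error = mse(x, quantize_dequantize(x))
group_error = mse(x, groupwise_quantize_dequantize(x, group_size=4))
```

With a single scale set by the value 30.0, the four small entries all round to zero. Giving the first group its own scale preserves them, so `group_error` is smaller than `per_tensor_error`.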
Takashi Horiuchi, Yoshiki Toba, Toru Misawa, Katsuhiro L. Murata, Keisuke Isogai, Yoichi Yatsu, Ichiro Takahashi, Mahito Sasada et al.
The extremely luminous infrared galaxy (ELIRG) WISE J090924.01+000211.1
(hereafter WISE J0909+0002; $z=1.87$) is an extraordinary object with a quasar
aspect. This study performs monitoring observations of WISE J0909+0002 with the
105 cm Murikabushi telescope, Okayama and Akeno 50 cm telescopes/MITSuME ($g'$,
$R_{\rm c}$, and $I_{\rm c}$ bands), and the SaCRA 55 cm telescope/MuSaSHI
($r$, $i$, and $z$ bands). We obtain the following results by combining the
UV/optical light curves of the CRTS, Pan-STARRS, and ZTF archive data, and our
observational data: (1) the light curves of WISE J0909+0002 present
quasi-periodic (sinusoidal) oscillations with a rest-frame period of
$\sim$660$-$689 days; (2) the structure functions of WISE J0909+0002 do not
show a damped random walk (DRW) trend; (3) the mock DRW light curves present a
periodic-like trend only on rare occasions in 10000 simulations; (4) the
relativistic boost scenario is favored, since the relation between variability
amplitude and power-law slope ratio is consistent with the theoretical
prediction of this scenario, and a substantial parameter space exists between
the inclination angles and the black hole mass; (5) the circumbinary disk model
has difficulty explaining the spectral energy distribution of our target; (6)
no significant radio flux density of WISE J0909+0002 is detected in the VLA
FIRST Survey, thus the radio jet precession scenario is ruled out. From our
results, the Doppler boost scenario is the likely cause of the periodic
variability; consequently, the quasi-periodic oscillations in WISE J0909+0002
can possibly be interpreted as arising from a supermassive black hole binary.
Additional observations to investigate the continuity of the periodic trend
would bring new insights into the mechanisms of the quasi-periodic oscillations
and/or ELIRGs.
Authors' comments: 19 pages, 11 figures, published by publication in PASJ
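Since the quoted period is in the rest frame, it may help to see what it implies for the observed light curve. The snippet applies the standard cosmological time-dilation relation P_obs = P_rest (1 + z); the function name is hypothetical, and the observed-frame values are derived here from the abstract's numbers rather than quoted from the paper.

```python
def observed_period(rest_period_days, z):
    """Cosmological time dilation: P_obs = P_rest * (1 + z)."""
    return rest_period_days * (1.0 + z)


low = observed_period(660.0, 1.87)
high = observed_period(689.0, 1.87)
```

With $z=1.87$, the 660 to 689 day rest-frame range corresponds to roughly 1894 to 1977 days (about 5.2 to 5.4 years) in the observer frame, which is why long archival baselines such as CRTS, Pan-STARRS, and ZTF are needed to cover multiple cycles.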
Xin Gao, Yang Lin, Ruiqing Li, Yasha Wang, Xu Chu, Xinyu Ma, Hailong Yu
Data mining and knowledge discovery are essential aspects of extracting valuable insights from vast datasets. Neural topic models (NTMs) have emerged as a valuable unsupervised tool in this field. However, the predominant objective in NTMs, which aims to discover topics maximizing data likelihood, often lacks alignment with the central goal of data mining and knowledge discovery, which is to reveal interpretable insights from large data repositories. Overemphasizing likelihood maximization without incorporating topic regularization can lead to an overly expansive latent space for topic modeling. In this paper, we present an innovative approach to NTMs that addresses this misalignment by introducing contrastive learning measures to assess topic interpretability. We propose a novel NTM framework, named ContraTopic, that integrates a differentiable regularizer capable of evaluating multiple facets of topic interpretability throughout the training process. Our regularizer adopts a unique topic-wise contrastive methodology, fostering both internal coherence within topics and clear external distinctions among them. Comprehensive experiments conducted on three diverse datasets demonstrate that our approach consistently produces topics with superior interpretability compared to state-of-the-art NTMs.