Kun Li, George Vosselman, Michael Ying Yang
The goal of referring remote sensing image segmentation (RRSIS) is to extract
specific pixel-level regions within an aerial image via a natural language
expression. Recent advancements, particularly Transformer-based fusion designs,
have demonstrated remarkable progress in this domain. However, existing methods
primarily focus on refining visual features using language-aware guidance
during the cross-modal fusion stage, neglecting the complementary
vision-to-language flow. This limitation often leads to irrelevant or
suboptimal representations. In addition, the diverse spatial scales of ground
objects in aerial images pose significant challenges to the visual perception
capabilities of existing models when conditioned on textual inputs. In this
paper, we propose an innovative framework called Scale-wise Bidirectional
Alignment Network (SBANet) to address these challenges for RRSIS. Specifically,
we design a Bidirectional Alignment Module (BAM) with learnable query tokens to
selectively and effectively represent visual and linguistic features,
emphasizing regions associated with key tokens. BAM is further enhanced with a
dynamic feature selection block, designed to provide both macro- and
micro-level visual features, preserving global context and local details to
facilitate more effective cross-modal interaction. Furthermore, SBANet
incorporates a text-conditioned channel and spatial aggregator to bridge the
gap between the encoder and decoder, enhancing cross-scale information exchange
in complex aerial scenarios. Extensive experiments demonstrate that our
proposed method achieves superior performance in comparison to previous
state-of-the-art methods on the RRSIS-D and RefSegRS datasets, both
quantitatively and qualitatively. The code will be released after publication.
Authors' comments: Under review
Ninad Jadhav, Meghna Behari, Robert J. Wood, Stephanie Gil
We introduce a Wireless Signal based Efficient multi-Robot eXploration (WiSER-X) algorithm applicable to a decentralized team of robots exploring an unknown environment with communication bandwidth constraints. WiSER-X relies only on local inter-robot relative position estimates, which can be obtained by exchanging signal pings from onboard sensors such as WiFi and Ultra-Wide Band, among others, to inform the exploration decisions of individual robots and minimize redundant coverage overlaps. Furthermore, WiSER-X enables asynchronous termination without requiring a shared map between the robots. It also adapts to heterogeneous robot behaviors and even complete failures in unknown environments while ensuring complete coverage. Simulations show that WiSER-X leads to 58% lower overlap than a zero-information-sharing baseline (algorithm-1) and only 23% more overlap than a full-information-sharing baseline (algorithm-2).
Shuokai Pan, Gerti Tuzi, Sudarshan Sreeram, Dibakar Gope
Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, the significant quality loss of fully quantized Winograd using existing coarser-grained post-training quantization methods, combined with the complexity and cost of finetuning the Winograd transformation matrices for such large models to recover quality, makes them unsuitable for large-scale foundation models. Motivated by the presence of a large range of values in them, we investigate the impact of finer-grained group-wise quantization in quantizing diffusion models. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training data, the generalization performance of quantized diffusion models is safely guaranteed. For the text-to-image generation task, the 8-bit fully-quantized diffusion model with Winograd provides near-lossless quality (FID and CLIP scores) in comparison to the full-precision model. For image classification, our method outperforms the state-of-the-art Winograd PTQ method by 1.62% and 2.56% in top-1 ImageNet accuracy on ResNet-18 and ResNet-34, respectively, with Winograd F(6, 3).
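The contrast between coarse and group-wise quantization mentioned in this abstract can be illustrated with a small sketch. It is not the authors' method: it uses plain symmetric 8-bit quantization on a made-up list of values, and the helper names (`quantize_dequantize`, `groupwise_quantize_dequantize`) and the group size are assumptions chosen for illustration.

```python
def quantize_dequantize(values, num_bits=8):
    """Symmetric uniform quantization with a single scale for all values."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def groupwise_quantize_dequantize(values, group_size, num_bits=8):
    """Quantize consecutive groups independently, one scale per group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], num_bits))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical tensor: four small values followed by four large ones.
weights = [0.01, -0.02, 0.015, 0.005, 50.0, -40.0, 30.0, -20.0]
per_tensor = quantize_dequantize(weights)
per_group = groupwise_quantize_dequantize(weights, group_size=4)

# Error measured on the small-magnitude half only.
err_small_tensor = mean_squared_error(weights[:4], per_tensor[:4])
err_small_group = mean_squared_error(weights[:4], per_group[:4])
print(err_small_tensor, err_small_group)
```

With one shared scale, the large values set the step size and the small values all collapse to zero. A per-group scale keeps them distinguishable, which is the motivation for finer-grained quantization when value ranges differ widely.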
Takashi Horiuchi, Yoshiki Toba, Toru Misawa, Katsuhiro L. Murata, Keisuke Isogai, Yoichi Yatsu, Ichiro Takahashi, Mahito Sasada et al.
The extremely luminous infrared galaxy (ELIRG) WISE J090924.01+000211.1
(hereafter WISE J0909+0002, $z=1.87$) is an extraordinary object with a quasar
aspect. This study performs monitoring observations of WISE J0909+0002 with the
105 cm Murikabushi telescope, Okayama and Akeno 50 cm telescopes/MITSuME ($g'$,
$R_{\rm c}$, and $I_{\rm c}$ bands), and the SaCRA 55 cm telescope/MuSaSHI
($r$, $i$, and $z$ bands). We obtain the following results by combining the
UV/optical light curves of the CRTS, Pan-STARRS, and ZTF archive data, and our
observational data: (1) the light curves of WISE J0909+0002 present
quasi-periodic (sinusoidal) oscillations with a rest-frame period of $\sim$
660$-$689 days; (2) the structure functions of WISE J0909+0002 do not show a
damped random walk (DRW) trend; (3) the mock DRW light curves present a
periodic-like trend only on rare occasions in 10000 simulations; (4) the
relativistic boost scenario is favored, since the relation between variability
amplitude and power-law slope ratio is consistent with the theoretical
prediction of this scenario, and a substantial parameter space exists between
the inclination angles and the black hole mass; (5) the circumbinary disk model
has difficulty explaining the spectral energy distribution of our target; (6) no
significant radio flux density of WISE J0909+0002 is detected in the VLA
FIRST Survey, so the radio jet precession scenario is ruled out. From our
results, the Doppler boost scenario is the likely cause of the periodic
variability; consequently, the quasi-periodic oscillations in WISE J0909+0002 can
possibly be interpreted as arising from a supermassive black hole binary. Additional
observations to investigate the continuity of the periodic trend would bring
new insights into mechanisms of the quasi-periodic oscillations and/or ELIRGs.
Authors' comments: 19 pages, 11 figures, published by publication in PASJ
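The abstract quotes the period in the rest frame. As a small worked example, the corresponding observed-frame period follows from cosmological time dilation, P_obs = P_rest (1 + z). The function name below is an assumption; only the redshift and the period range come from the abstract.

```python
def observed_period(rest_frame_period_days, redshift):
    """Convert a rest-frame period to the observed frame via time dilation."""
    return rest_frame_period_days * (1.0 + redshift)


z = 1.87  # redshift quoted in the abstract
low = observed_period(660.0, z)
high = observed_period(689.0, z)
print(f"observed period: {low:.0f} to {high:.0f} days "
      f"({low / 365.25:.1f} to {high / 365.25:.1f} yr)")
```

At this redshift the 660 to 689 day rest-frame range corresponds to roughly 1894 to 1977 days in the observed frame, a little over five years per cycle.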
Xin Gao, Yang Lin, Ruiqing Li, Yasha Wang, Xu Chu, Xinyu Ma, Hailong Yu
Data mining and knowledge discovery are essential aspects of extracting valuable insights from vast datasets. Neural topic models (NTMs) have emerged as a valuable unsupervised tool in this field. However, the predominant objective in NTMs, which aims to discover topics maximizing data likelihood, often lacks alignment with the central goal of data mining and knowledge discovery, which is to reveal interpretable insights from large data repositories. Overemphasizing likelihood maximization without incorporating topic regularization can lead to an overly expansive latent space for topic modeling. In this paper, we present an innovative approach to NTMs that addresses this misalignment by introducing contrastive learning measures to assess topic interpretability. We propose a novel NTM framework, named ContraTopic, that integrates a differentiable regularizer capable of evaluating multiple facets of topic interpretability throughout the training process. Our regularizer adopts a unique topic-wise contrastive methodology, fostering both internal coherence within topics and clear external distinctions among them. Comprehensive experiments conducted on three diverse datasets demonstrate that our approach consistently produces topics with superior interpretability compared to state-of-the-art NTMs.
Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Shaojie Lyu, Qingming Huang
Real-world datasets often exhibit a long-tailed distribution, where the vast
majority of classes, known as tail classes, have only a few samples. Traditional
methods tend to overfit on these tail classes. Recently, a new approach called
Imbalanced SAM (ImbSAM) was proposed to leverage the generalization benefits of
Sharpness-Aware Minimization (SAM) for long-tailed distributions. The main
strategy is to merely enhance the smoothness of the loss function for tail
classes. However, we argue that improving generalization in long-tail scenarios
requires a careful balance between head and tail classes. We show that neither
SAM nor ImbSAM alone can fully achieve this balance. For SAM, we prove that
although it enhances the model's generalization ability by escaping saddle
points in the overall loss landscape, it does not effectively address this for
tail-class losses. Conversely, while ImbSAM is more effective at avoiding
saddle points in tail classes, the head classes are trained insufficiently,
resulting in significant performance drops. Based on these insights, we propose
Stage-wise Saddle Escaping SAM (SSE-SAM), which uses the complementary strengths of
ImbSAM and SAM in a phased approach. Initially, SSE-SAM follows the majority
samples to avoid saddle points of the head-class loss. During the later phase,
it focuses on tail classes to help them escape saddle points. Our experiments
confirm that SSE-SAM is better at escaping saddle points on both head and
tail classes, and shows performance improvements.
Authors' comments: Update: Add missing information and correct some grammatical issues
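For readers unfamiliar with Sharpness-Aware Minimization, the basic SAM update that this abstract builds on can be sketched as follows. This is the generic two-step SAM rule on a toy quadratic loss, not SSE-SAM or ImbSAM. The loss, learning rate, and perturbation radius `rho` are assumptions chosen for illustration.

```python
import math


def loss(w):
    # Toy quadratic loss with different curvature along each axis.
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy loss above.
    return [10.0 * w[0], w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: perturb along the gradient, then descend from there."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)
    # Step 1: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
print(initial_loss, final_loss)
```

According to the abstract, ImbSAM applies this kind of smoothing only to the tail-class loss, and SSE-SAM switches emphasis between head and tail classes across training phases. Those variants are not implemented here.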
Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang et al.
Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders their deployment in industrial applications. Many works leverage traditional compression approaches to speed up model inference, but these always introduce additional training costs to restore performance, and the pruned models typically show noticeable performance drops compared to the original model when aiming for a specific level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify the less important tokens and skip them across model blocks to reduce computational cost during inference. To construct the router efficiently, we present a search-based sparsity scheduler for pruning sparsity allocation, a trainable router that takes our four proposed low-dimensional factors as input, and three proposed losses. We conduct extensive experiments across different benchmarks on different LLMs to demonstrate the superiority of our method. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods. For instance, our method outperforms BlockPruner and ShortGPT by approximately 10 points on both LLaMA2-7B and Qwen1.5-7B in accuracy retention at comparable token sparsity levels.
Jiajun Gong, Wei Cai, Siyuan Liang, Zhong Guan, Tao Wang, Ee-Chien Chang
Website Fingerprinting (WF) aims to deanonymize users on the Tor network by
analyzing encrypted network traffic. Recent deep-learning-based attacks show
high accuracy on undefended traces. However, they struggle against modern
defenses that use tactics like injecting dummy packets and delaying real
packets, which significantly degrade classification performance. Our analysis
reveals that current attacks inadequately leverage the timing information
inherent in traffic traces, which persists as a source of leakage even under
robust defenses. Addressing this shortfall, we introduce a novel feature
representation named the Inter-Arrival Time (IAT) histogram, which quantifies
the frequencies of packet inter-arrival times across predetermined time slots.
Complementing this feature, we propose a new CNN-based attack, WFCAT, enhanced
with two innovative architectural blocks designed to optimally extract and
utilize timing information. Our approach uses kernels of varying sizes to
capture multi-scale features, which are then integrated using a weighted sum
across all feature channels to enhance the model's efficacy in identifying
temporal patterns. Our experiments validate that WFCAT substantially
outperforms existing methods on defended traces in both closed- and open-world
scenarios. Notably, WFCAT achieves over 59% accuracy against Surakav, a
recently developed robust defense, marking an improvement of over 28% and 48%
against the state-of-the-art attacks RF and Tik-Tok, respectively, in the
closed-world scenario.
Authors' comments: 13 pages
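One possible reading of the Inter-Arrival Time (IAT) histogram described in this abstract is a count of how many packet gaps fall into each of a fixed set of time slots. The sketch below follows that reading. The bin edges, timestamps, and function name are hypothetical, and the actual WFCAT feature may be constructed differently.

```python
import bisect


def iat_histogram(timestamps, bin_edges):
    """Count inter-arrival times per slot.

    Slot i covers [bin_edges[i], bin_edges[i + 1]); the last slot is open-ended.
    `timestamps` must be sorted in ascending order.
    """
    counts = [0] * len(bin_edges)
    for prev, cur in zip(timestamps, timestamps[1:]):
        iat = cur - prev
        # Index of the right-most edge that is <= iat.
        idx = bisect.bisect_right(bin_edges, iat) - 1
        if idx >= 0:
            counts[idx] += 1
    return counts


# Hypothetical packet arrival times in seconds.
timestamps = [0.0, 0.001, 0.004, 0.05, 0.051, 0.40]
# Hypothetical slot boundaries in seconds.
bin_edges = [0.0, 0.002, 0.01, 0.1]
hist = iat_histogram(timestamps, bin_edges)
print(hist)
```

A fixed-length vector like `hist` can be fed to a classifier regardless of trace length, which is one reason histogram features are convenient for traffic analysis.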
Zhuo Wu, Qinglin Jia, Chuhan Wu, Zhaocheng Du, Shuai Wang, Zan Wang, Zhenhua Dong
Evaluating the quality of recommender systems is critical for algorithm design and optimization. Most evaluation methods are computed based on offline metrics for quick algorithm evolution, since online experiments are usually risky and time-consuming. However, offline evaluation usually cannot fully reflect users' preference for the outcome of different recommendation algorithms, and the results may not be consistent with online A/B tests. Moreover, many offline metrics such as AUC do not offer sufficient information for comparing the subtle differences between two competitive recommender systems in different aspects, which may lead to substantial performance differences in long-term online serving. Fortunately, due to the strong commonsense knowledge and role-play capability of large language models (LLMs), it is possible to obtain simulated user feedback on offline recommendation results. Motivated by the idea of LLM Chatbot Arena, in this paper we present the idea of RecSys Arena, where the recommendation results given by two different recommender systems in each session are evaluated by an LLM judge to obtain fine-grained evaluation feedback. More specifically, for each sample we use an LLM to generate a user profile description based on user behavior history or off-the-shelf profile features, which is used to guide the LLM to play the role of this user and evaluate the relative preference for two recommendation results generated by different models. Through extensive experiments on two recommendation datasets in different scenarios, we demonstrate that many different LLMs not only provide general evaluation results that are highly consistent with canonical offline metrics, but also provide rich insight in many subjective aspects. Moreover, RecSys Arena can better distinguish different algorithms with comparable performance in terms of AUC and nDCG.
Shuo Xie, Fangzhi Zhu, Jiahui Wang, Lulu Wen, Wei Dai, Xiaowei Chen, Junxiong Zhu, Kai Zhou et al.
Aligning Large Language Models (LLMs) with human feedback is crucial for
their development. Existing preference optimization methods such as DPO and
KTO, while improving on Reinforcement Learning from Human Feedback (RLHF),
are inherently derived from PPO: they require a reference model, which consumes
additional GPU memory, and rely heavily on abundant preference data. Meanwhile,
current preference optimization research mainly targets single-question
scenarios with two replies, neglecting optimization with multiple replies,
which leads to a waste of data in the application. This study introduces the
MPPO algorithm, which leverages the average likelihood of model responses to
fit the reward function and maximizes the utilization of preference data.
Through a comparison of Point-wise, Pair-wise, and List-wise implementations,
we found that the Pair-wise approach achieves the best performance,
significantly enhancing the quality of model responses. Experimental results
demonstrate MPPO's outstanding performance across various benchmarks. On
MT-Bench, MPPO outperforms DPO, ORPO, and SimPO. Notably, on Arena-Hard, MPPO
surpasses DPO and ORPO by substantial margins. These achievements underscore
the remarkable advantages of MPPO in preference optimization tasks.
Authors' comments: Accepted by COLING2025
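To make the idea of a reference-free, likelihood-based reward over multiple replies concrete, here is a loose sketch. It is not the MPPO objective from the paper. It assumes the reward is the length-normalized log-likelihood of a reply and pairs the best reply against each other reply with a logistic loss. All function names and numbers are hypothetical.

```python
import math


def average_log_likelihood(token_logprobs):
    """Length-normalized log-likelihood of one reply (assumed reward)."""
    return sum(token_logprobs) / len(token_logprobs)


def pairwise_preference_loss(best_logprobs, other_replies_logprobs):
    """Mean logistic loss over (best, other) pairs; needs no reference model."""
    r_best = average_log_likelihood(best_logprobs)
    losses = []
    for logprobs in other_replies_logprobs:
        margin = r_best - average_log_likelihood(logprobs)
        # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
        losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses)


# Hypothetical per-token log-probabilities for one question with three replies.
best = [-0.1, -0.2, -0.3]
others = [[-1.0, -1.2], [-0.5, -0.7, -0.9]]
loss_value = pairwise_preference_loss(best, others)
tie_value = pairwise_preference_loss(best, [best])
print(loss_value, tie_value)
```

The loss falls below log 2 once the preferred reply has a higher average log-likelihood than the alternatives. Every extra reply to the same question contributes another pair, which is one way multiple replies can be used rather than discarded.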
Hazel Kim, Adel Bibi, Philip Torr, Yarin Gal
Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel approach to detecting model hallucination through systematic analysis of information flow across model layers when processing inputs with insufficient or ambiguous context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics ($\mathcal{L}$I) provides robust indicators of model reliability, accounting for both information gain and loss during computation. $\mathcal{L}$I improves model reliability by immediately integrating with universal LLMs without additional training or architectural modifications.
Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, Chen Lv
The emergence of general human knowledge and impressive logical reasoning capacity in rapidly progressing vision-language models (VLMs) has driven increasing interest in applying VLMs to high-level autonomous driving tasks, such as scene understanding and decision-making. However, the relationship between knowledge proficiency, especially essential driving expertise, and closed-loop autonomous driving performance requires further in-depth study. In this paper, we investigate the effects of the depth and breadth of fundamental driving knowledge on closed-loop trajectory planning and introduce WiseAD, a specialized VLM tailored for end-to-end autonomous driving capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios. We employ joint training on driving knowledge and planning datasets, enabling the model to perform knowledge-aligned trajectory planning accordingly. Extensive experiments indicate that as the diversity of driving knowledge extends, critical accidents are notably reduced, contributing 11.9% and 12.4% improvements in the driving score and route completion on the Carla closed-loop evaluations, achieving state-of-the-art performance. Moreover, WiseAD also demonstrates remarkable performance in knowledge evaluations on both in-domain and out-of-domain datasets.
Jin-Seop Lee, Noo-ri Kim, Jee-Hyong Lee
Self-supervised learning (SSL) methods based on the instance discrimination
tasks with InfoNCE have achieved remarkable success. Despite their success, SSL
models often struggle to generate effective representations for unseen-domain
data. To address this issue, research on unsupervised domain generalization
(UDG), which aims to develop SSL models that can generate domain-irrelevant
features, has been conducted. Most UDG approaches utilize contrastive learning
with InfoNCE to generate representations, and perform feature alignment based
on strong assumptions to generalize domain-irrelevant common features from
multi-source domains. However, existing methods that rely on instance
discrimination tasks are not effective at extracting domain-irrelevant common
features. This leads to the suppression of domain-irrelevant common features
and the amplification of domain-relevant features, thereby hindering domain
generalization. Furthermore, strong assumptions underlying feature alignment
can lead to biased feature learning, reducing the diversity of common features.
In this paper, we propose a novel approach, DomCLP, Domain-wise Contrastive
Learning with Prototype Mixup. We explore how InfoNCE suppresses
domain-irrelevant common features and amplifies domain-relevant features. Based
on this analysis, we propose Domain-wise Contrastive Learning (DCon) to enhance
domain-irrelevant common features. We also propose Prototype Mixup Learning
(PMix) to generalize domain-irrelevant common features across multiple domains
without relying on strong assumptions. The proposed method consistently
outperforms state-of-the-art methods on the PACS and DomainNet datasets across
various label fractions, showing significant improvements. Our code will be
released. Our project page is available at https://github.com/jinsuby/DomCLP.
Authors' comments: Code page: https://github.com/jinsuby/DomCLP
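Since the analysis in this abstract centers on InfoNCE, a minimal sketch of the standard InfoNCE loss for a single anchor may help. It shows the generic instance-discrimination objective, not the DCon or PMix components of DomCLP. The toy vectors and the temperature value are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: cross-entropy with the positive as target."""
    a = normalize(anchor)
    # Cosine similarities scaled by temperature; index 0 is the positive.
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


anchor = [1.0, 0.0]
good_loss = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad_loss = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(good_loss, bad_loss)
```

Every other sample in the batch is treated as a negative, including samples that share content with the anchor but come from a different domain. That is the setting in which the abstract argues common features can be suppressed.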
Morgan B. Talbot, Gabriel Kreiman, James J. DiCarlo, Guy Gaziv
The currently leading artificial neural network models of the visual ventral stream - which are derived from a combination of performance optimization and robustification methods - have demonstrated a remarkable degree of behavioral alignment with humans on visual categorization tasks. We show that image perturbations generated by these models can enhance the ability of humans to accurately report the ground truth class. Furthermore, we find that the same models can also be used out-of-the-box to predict the proportion of correct human responses to individual images, providing a simple, human-aligned estimator of the relative difficulty of each image. Motivated by these observations, we propose to augment visual learning in humans in a way that improves human categorization accuracy at test time. Our learning augmentation approach consists of (i) selecting images based on their model-estimated recognition difficulty, and (ii) applying image perturbations that aid recognition for novice learners. We find that combining these model-based strategies leads to categorization accuracy gains of 33-72% relative to control subjects without these interventions, on unmodified, randomly selected held-out test images. Beyond the accuracy gain, the training time for the augmented learning group was also shortened by 20-23%, despite both groups completing the same number of training trials. We demonstrate the efficacy of our approach in a fine-grained categorization task with natural images, as well as two tasks in clinically relevant image domains - histology and dermoscopy - where visual learning is notoriously challenging. To the best of our knowledge, our work is the first application of artificial neural networks to increase visual learning performance in humans by enhancing category-specific image features.
Yuchun He, Yuhan He
Single image super-resolution (SR) has long posed a challenge in the field of computer vision. While the advent of deep learning has produced numerous methods for this problem, current methods still struggle to model long sequence information, which limits their ability to capture global pixel interactions. To tackle this challenge and achieve superior SR outcomes, we propose the Mamba pixel-wise sequential interaction network (MPSI), aimed at enhancing the establishment of long-range connections of information, particularly focusing on pixel-wise sequential interaction. We propose the Channel-Mamba Block (CMB) to capture comprehensive pixel interaction information by effectively modeling long sequence information. Moreover, existing SR methods often neglect features extracted by preceding layers, leading to the loss of valuable feature information. While certain existing models strive to preserve these features, they frequently encounter difficulty in establishing connections across all layers. To overcome this limitation, MPSI introduces the Mamba channel recursion module (MCRM), which maximizes the retention of valuable feature information from early layers, thereby facilitating the acquisition of pixel sequence interaction information from multiple-level layers. Through extensive experimentation, we demonstrate that MPSI outperforms existing super-resolution methods in terms of image reconstruction results, attaining state-of-the-art performance.
Haihang Wu
Large language models (LLMs) have demonstrated remarkable performance across various language tasks, but their widespread deployment is impeded by their large size and high computational costs. Structural pruning is a prevailing technique used to introduce sparsity into pre-trained models and facilitate direct hardware acceleration during inference by removing redundant connections (structurally-grouped parameters), such as channels and attention heads. Existing structural pruning approaches often employ either global or layer-wise pruning criteria; however, they are hindered by ineffectiveness stemming from inaccurate evaluation of connection importance. Global pruning methods typically assess component importance using near-zero and unreliable gradients, while layer-wise pruning approaches encounter significant pruning error accumulation issues. To this end, we propose a more accurate pruning metric based on the block-wise importance score propagation, termed LLM-BIP. Specifically, LLM-BIP precisely evaluates connection importance by gauging its influence on the respective transformer block output, which can be efficiently approximated in a single forward pass through an upper bound derived from the assumption of Lipschitz continuity. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks. The results demonstrate that our approach achieves an average of 3.26% increase in accuracy for common reasoning tasks compared to previous best baselines. It also reduces perplexity by 14.09 and 68.76 on average for the WikiText2 dataset and PTB dataset, respectively.
Yang Jiao, Kai Yang, Chengtao Jian
Trilevel learning (TLL) has found diverse applications in machine learning, ranging from robust hyperparameter optimization to domain adaptation. However, existing research primarily focuses on scenarios where TLL can be addressed with first order information available at each level, which is inadequate in many situations involving zeroth order constraints, such as when black-box models are employed. Moreover, in trilevel learning, data may be distributed across various nodes, necessitating strategies to address TLL problems without centralizing data on servers to uphold data privacy. To this end, an effective distributed trilevel zeroth order learning framework, DTZO, is proposed in this work to address TLL problems with level-wise zeroth order constraints in a distributed manner. The proposed DTZO is versatile and can be adapted to a wide range of (grey-box) TLL problems with partial zeroth order constraints. In DTZO, the cascaded polynomial approximation can be constructed without relying on gradients or sub-gradients, leveraging a novel cut, i.e., the zeroth order cut. Furthermore, we theoretically carry out the non-asymptotic convergence rate analysis for the proposed DTZO in achieving the $\epsilon$-stationary point. Extensive experiments have been conducted to demonstrate and validate the superior performance of the proposed DTZO, e.g., it achieves up to approximately a 40$\%$ improvement in performance.
Deepshikha Bhati, Fnu Neha, Md Amiruzzaman, Angela Guercio, Deepak Kumar Shukla, Ben Ward
Interpreting complex neural networks is crucial for understanding their decision-making processes, particularly in applications where transparency and accountability are essential. The proposed method addresses this need by focusing on Layer-wise Relevance Propagation (LRP), a technique used in explainable artificial intelligence (XAI) to attribute neural network outputs to input features through backpropagated relevance scores. Existing LRP methods often struggle with precision in evaluating individual neuron contributions. To overcome this limitation, we present a novel approach that improves the parsing of selected neurons during LRP backward propagation, using the Visual Geometry Group 16 (VGG16) architecture as a case study. Our method creates neural network graphs to highlight critical paths and visualizes these paths with heatmaps, optimizing neuron selection through accuracy metrics like Mean Squared Error (MSE) and Symmetric Mean Absolute Percentage Error (SMAPE). Additionally, we utilize a deconvolutional visualization technique to reconstruct feature maps, offering a comprehensive view of the network's inner workings. Extensive experiments demonstrate that our approach enhances interpretability and supports the development of more transparent artificial intelligence (AI) systems for computer vision applications. This advancement has the potential to improve the trustworthiness of AI models in real-world machine vision applications, thereby increasing their reliability and effectiveness.
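As background for the relevance scores mentioned in this abstract, the commonly used LRP epsilon rule for a single bias-free dense layer can be sketched as below. This is the textbook rule, not the neuron-selection procedure the authors propose. The layer weights, activations, and output relevances are made-up numbers.

```python
def lrp_epsilon_dense(activations, weights, relevance_out, eps=1e-9):
    """Redistribute output relevance to inputs with the LRP epsilon rule.

    weights[j][k] connects input neuron j to output neuron k (no bias term).
    """
    n_in, n_out = len(activations), len(relevance_out)
    # Pre-activations z_k = sum_j a_j * w_jk.
    z = [sum(activations[j] * weights[j][k] for j in range(n_in))
         for k in range(n_out)]
    relevance_in = [0.0] * n_in
    for k in range(n_out):
        # The stabilizer keeps the denominator away from zero.
        denom = z[k] + (eps if z[k] >= 0 else -eps)
        for j in range(n_in):
            share = activations[j] * weights[j][k] / denom
            relevance_in[j] += share * relevance_out[k]
    return relevance_in


# Hypothetical layer with three inputs and two outputs.
activations = [1.0, 2.0, 0.5]
weights = [[0.5, -1.0], [0.25, 0.5], [1.0, 2.0]]
relevance_out = [0.6, 0.4]
relevance_in = lrp_epsilon_dense(activations, weights, relevance_out)
print(relevance_in, sum(relevance_in))
```

Each input receives relevance in proportion to its contribution to every output pre-activation, so with a small `eps` and no bias the total relevance is approximately conserved from one layer to the next.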
Fuchao Yang, Jianhong Cheng, Hui Liu, Yongqiang Dong, Yuheng Jia, Junhui Hou
In partial label learning (PLL), every sample is associated with a candidate
label set comprising the ground-truth label and several noisy labels. The
conventional PLL assumes the noisy labels are randomly generated
(instance-independent), while in practical scenarios, the noisy labels are
always instance-dependent and are highly related to the sample features,
leading to the instance-dependent partial label learning (IDPLL) problem.
Instance-dependent noisy labels are a double-edged sword. On one side, they may
promote model training as the noisy labels can depict the sample to some
extent. On the other side, they bring high label ambiguity as the noisy labels
are hard to distinguish from the ground-truth label. To leverage the
nuances of IDPLL effectively, for the first time we create class-wise
embeddings for each sample, which allow us to explore the relationship of
instance-dependent noisy labels, i.e., the class-wise embeddings in the
candidate label set should have high similarity, while the class-wise
embeddings between the candidate label set and the non-candidate label set
should have high dissimilarity. Moreover, to reduce the high label ambiguity,
we introduce the concept of class prototypes containing global feature
information to disambiguate the candidate label set. Extensive experimental
comparisons with twelve methods on six benchmark data sets, including four
fine-grained data sets, demonstrate the effectiveness of the proposed method.
The code implementation is publicly available at
https://github.com/Yangfc-ML/CEL.
Authors' comments: Accepted by KDD 2025
Yiqin Zhang, Qingkui Chen, Chen Huang, Zhengjie Zhang, Meiling Chen, Zhibing Fu
Most data-driven models for medical image analysis rely on universal augmentations to improve performance. Experimental evidence has confirmed their effectiveness, but the unclear mechanism underlying them poses a barrier to the widespread acceptance of and trust in such methods within the medical community. We revisit and acknowledge the unique characteristics that set medical images apart from traditional digital images and, consequently, propose a medical-specific augmentation algorithm that is more elastic and aligns well with the radiology scan procedure. The method performs a piecewise affine transformation with sinusoidally distorted rays according to radius in polar coordinates, thus simulating the uncertain postures of a human lying flat on the scanning table. Our method can generate human visceral distributions without affecting the fundamental relative positions on the axial plane. Two non-adaptive algorithms, namely Meta-based Scan Table Removal and Similarity-Guided Parameter Search, are introduced to bolster the robustness of our augmentation method. Experiments show that our method improves accuracy across multiple well-known segmentation frameworks without requiring more data samples. Our preview code is available at https://github.com/MGAMZ/PSBPD.