Mohammad Khalooei, Mohammad Mehdi Homayounpour, Maryam Amirmazlaghani
Deep neural network models are used today in various applications of
artificial intelligence, and strengthening them against adversarial
attacks is of particular importance. An appropriate solution to adversarial
attacks is adversarial training, which reaches a trade-off between robustness
and generalization. This paper introduces a novel framework (Layer
Sustainability Analysis (LSA)) for the analysis of layer vulnerability in an
arbitrary neural network in the scenario of adversarial attacks. LSA can be a
helpful toolkit to assess deep neural networks and to extend the adversarial
training approaches towards improving the sustainability of model layers via
layer monitoring and analysis. The LSA framework identifies a list of Most
Vulnerable Layers (MVL list) of the given network. The relative error, as a
comparison measure, is used to evaluate representation sustainability of each
layer against adversarial inputs. The proposed approach for obtaining robust
neural networks to fend off adversarial attacks is based on a layer-wise
regularization (LR) over LSA proposal(s) for adversarial training (AT); i.e.
the AT-LR procedure. AT-LR could be used with any benchmark adversarial attack
to reduce the vulnerability of network layers and to improve conventional
adversarial training approaches. The proposed idea performs well theoretically
and experimentally for state-of-the-art multilayer perceptron and convolutional
neural network architectures. Relative to its corresponding base adversarial
training, AT-LR increased the classification accuracy under more significant
perturbations by 16.35%, 21.79%, and 10.730% on the Moon, MNIST, and
CIFAR-10 benchmark datasets, respectively. The LSA framework is available and
published at https://github.com/khalooei/LSA.
Authors' comments: Layers Sustainability Analysis (LSA) framework
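Editorial sketch: the abstract says a relative error is used to compare each layer's representation of clean and adversarial inputs. The snippet below illustrates one way such a measure could be computed. The choice of the L2 norm, the function names, and the fixed threshold for flagging vulnerable layers are assumptions made for illustration and are not taken from the LSA code base.

```python
import math

def relative_error(clean, adversarial):
    """Relative L2 error ||adv - clean|| / ||clean|| between two layer outputs."""
    diff = math.sqrt(sum((a - c) ** 2 for c, a in zip(clean, adversarial)))
    norm = math.sqrt(sum(c ** 2 for c in clean))
    return diff / norm if norm > 0 else float("inf")

def most_vulnerable_layers(clean_acts, adv_acts, threshold):
    """Flag layers whose relative error exceeds a (hypothetical) threshold."""
    errors = [relative_error(c, a) for c, a in zip(clean_acts, adv_acts)]
    flagged = [i for i, e in enumerate(errors) if e > threshold]
    return flagged, errors

# Toy activations for a two-layer network (flattened per layer).
clean_acts = [[1.0, 2.0, 2.0], [3.0, 4.0]]
adv_acts = [[1.0, 2.0, 2.3], [6.0, 8.0]]
flagged, errors = most_vulnerable_layers(clean_acts, adv_acts, threshold=0.5)
print(flagged, errors)
```

In this toy case only the second layer is flagged, because its representation moves by as much as its own norm while the first layer moves by about a tenth of its norm.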
Pedro R. A. S. Bassi, Sergio S. J. Dertkigil, Andrea Cavalli
Features in images' backgrounds can spuriously correlate with the images'
classes, representing background bias. They can influence the classifier's
decisions, causing shortcut learning (Clever Hans effect). The phenomenon
generates deep neural networks (DNNs) that perform well on standard evaluation
datasets but generalize poorly to real-world data. Layer-wise Relevance
Propagation (LRP) explains DNNs' decisions. Here, we show that the optimization
of LRP heatmaps can minimize the background bias influence on deep classifiers,
hindering shortcut learning. By not increasing run-time computational cost, the
approach is light and fast. Furthermore, it applies to virtually any
classification architecture. After injecting synthetic bias in images'
backgrounds, we compared our approach (dubbed ISNet) to eight state-of-the-art
DNNs, quantitatively demonstrating its superior robustness to background bias.
Mixed datasets are common for COVID-19 and tuberculosis classification with
chest X-rays, fostering background bias. By focusing on the lungs, the ISNet
reduced shortcut learning. Thus, its generalization performance on external
(out-of-distribution) test databases significantly surpassed all implemented
benchmark models.
Authors' comments: Text and style improvements. Included reference to the published
article (Nature Communications, https://doi.org/10.1038/s41467-023-44371-z)
Chau Tran, Pedro Cisneros-Velarde, Sang-Yun Oh, Alexander Petersen
Recently, a special case of precision matrix estimation based on a distributionally robust optimization (DRO) framework has been shown to be equivalent to the graphical lasso. From this formulation, a method for choosing the regularization term, i.e., for graphical model selection, was proposed. In this work, we establish a theoretical connection between the confidence level of graphical model selection via the DRO formulation and the asymptotic family-wise error rate of estimating false edges. Simulation experiments and real data analyses illustrate the utility of the asymptotic family-wise error rate control behavior even in finite samples.
Piotr Garbaczewski, Mariusz Żaba
We discuss the impact of various (path-wise) reflection-from-the-barrier
scenarios upon confining properties of a paradigmatic family of symmetric
$\alpha $-stable L\'{e}vy processes, whose permanent residence in a finite
interval on a line is secured by a two-sided reflection. Depending on the
specific reflection "mechanism", the inferred jump-type processes differ in
their spectral and statistical characteristics, e.g. relaxation
properties, and functional shapes of invariant (equilibrium, or asymptotic
near-equilibrium) probability density functions in the interval. The analysis
is carried out in conjunction with attempts to give meaning to the notion of a
reflecting L\'{e}vy process, in terms of the domain of its motion generator, to
which an invariant pdf (actually an eigenfunction) does belong.
Authors' comments: 20 pp, 8 figures, Text amendments, Abstract and Section I modified
Ian Mason, Sebastian Starke, Taku Komura
Controlling the manner in which a character moves in a real-time animation system is a challenging task with useful applications. Existing style transfer systems require access to a reference content motion clip; however, in real-time systems the future motion content is unknown and liable to change with user input. In this work we present a style modelling system that uses an animation synthesis network to model motion content based on local motion phases. An additional style modulation network uses feature-wise transformations to modulate style in real-time. To evaluate our method, we create and release a new style modelling dataset, 100STYLE, containing over 4 million frames of stylised locomotion data in 100 different styles that present a number of challenges for existing systems. To model these styles, we extend the local phase calculation with a contact-free formulation. In comparison to other methods for real-time style modelling, we show our system is more robust and efficient in its style representation while improving motion quality.
Yang Long, Gui-Song Xia, Liangpei Zhang, Gong Cheng, Deren Li
Given an aerial image, aerial scene parsing (ASP) aims to interpret the semantic structure of the image content, e.g., by assigning a semantic label to every pixel of the image. With the popularization of data-driven methods, the past decades have witnessed promising progress on ASP by approaching the problem with the schemes of tile-level scene classification or segmentation-based image analysis, when using high-resolution aerial images. However, the former scheme often produces results with tile-wise boundaries, while the latter one needs to handle the complex modeling process from pixels to semantics, which often requires large-scale and well-annotated image samples with pixel-wise semantic labels. In this paper, we address these issues in ASP, with perspectives from tile-level scene classification to pixel-wise semantic labeling. Specifically, we first revisit aerial image interpretation by a literature review. We then present a large-scale scene classification dataset that contains one million aerial images, termed Million-AID. With the presented dataset, we also report benchmarking experiments using classical convolutional neural networks (CNNs). Finally, we perform ASP by unifying the tile-level scene classification and object-based image analysis to achieve pixel-wise semantic labeling. Intensive experiments show that Million-AID is a challenging yet useful dataset, which can serve as a benchmark for evaluating newly developed algorithms. When transferring knowledge from Million-AID, fine-tuned CNN models pretrained on Million-AID perform consistently better than those pretrained on ImageNet for aerial scene classification. Moreover, our designed hierarchical multi-task learning method achieves state-of-the-art pixel-wise classification on the challenging GID, bridging the tile-level scene classification toward pixel-wise semantic labeling for aerial image interpretation.
Gongxin Yao, Yiwei Chen, Yong Liu, Xiaomin Hu, Yu Pan
Single-photon light detection and ranging (LiDAR) has been widely applied to 3D imaging in challenging scenarios. However, limited signal photon counts and high noise in the collected data have posed great challenges for predicting the depth image precisely. In this paper, we propose a pixel-wise residual shrinkage network for photon-efficient imaging from high-noise data, which adaptively generates the optimal thresholds for each pixel and denoises the intermediate features by soft thresholding. Besides, redefining the optimization target as pixel-wise classification provides a sharp advantage in producing confident and accurate depth estimation when compared with existing research. Comprehensive experiments conducted on both simulated and real-world datasets demonstrate that the proposed model outperforms state-of-the-art methods and maintains robust imaging performance under different signal-to-noise ratios including the extreme case of 1:100.
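Editorial sketch: the abstract describes denoising intermediate features by soft thresholding with a separate threshold per pixel. The snippet below shows only the standard soft-thresholding operator applied with per-pixel thresholds. In the paper the thresholds are generated by the network; here they are fixed by hand, and the function names are hypothetical.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values with |x| <= tau become 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def shrink_pixels(features, thresholds):
    """Apply soft thresholding with one threshold per pixel."""
    return [soft_threshold(f, t) for f, t in zip(features, thresholds)]

# Hand-picked per-pixel thresholds stand in for the learned ones.
features = [3.0, -3.0, 0.5, -0.2]
thresholds = [1.0, 1.0, 1.0, 0.1]
print(shrink_pixels(features, thresholds))
```

Small-magnitude responses are zeroed and larger ones are pulled toward zero by the threshold, which is why the operator is commonly used for noise suppression.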
Quanziang Wang, Renzhen Wang, Yuexiang Li, Dong Wei, Kai Ma, Yefeng Zheng, Deyu Meng
Continual learning is a promising machine learning paradigm to learn new tasks while retaining previously learned knowledge over streaming training data. Till now, rehearsal-based methods, keeping a small part of data from old tasks as a memory buffer, have shown good performance in mitigating catastrophic forgetting for previously learned knowledge. However, most of these methods typically treat each new task equally, which may not adequately consider the relationship or similarity between old and new tasks. Furthermore, these methods commonly neglect sample importance in the continual training process and result in sub-optimal performance on certain tasks. To address this challenging problem, we propose Relational Experience Replay (RER), a bi-level learning framework, to adaptively tune task-wise relationships and sample importance within each task to achieve a better `stability' and `plasticity' trade-off. As such, the proposed method is capable of accumulating new knowledge while consolidating previously learned old knowledge during continual learning. Extensive experiments conducted on three publicly available datasets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet) show that the proposed method can consistently improve the performance of all baselines and surpass current state-of-the-art methods.
Xinheng Liu, Yao Chen, Prakhar Ganesh, Junhao Pan, Jinjun Xiong, Deming Chen
Quantization for Convolutional Neural Network (CNN) has shown significant
progress with the intention of reducing the cost of computation and storage
with low-bitwidth data inputs. There are, however, no systematic studies on how
existing full-bitwidth processing units, such as CPUs and DSPs, can be better
utilized to carry out significantly higher computation throughput for
convolution under various quantized bitwidths. In this study, we propose
HiKonv, a unified solution that maximizes the compute throughput of a given
underlying processing unit to process low-bitwidth quantized data inputs
through novel bit-wise parallel computation. We establish theoretical
performance bounds using a full-bitwidth multiplier for highly parallelized
low-bitwidth convolution, and demonstrate new breakthroughs for
high-performance computing in this critical domain. For example, a single
32-bit processing unit can deliver 128 binarized convolution operations
(multiplications and additions) under one CPU instruction, and a single 27x18
DSP core can deliver eight convolution operations with 4-bit inputs in one
cycle. We demonstrate the effectiveness of HiKonv on CPU and FPGA for both
convolutional layers and a complete DNN model. For a convolutional layer
quantized to 4-bit, HiKonv achieves a 3.17x latency improvement over the
baseline implementation using C++ on CPU. Compared to the DAC-SDC 2020 champion
model for FPGA, HiKonv achieves a 2.37x throughput improvement and 2.61x DSP
efficiency improvement.
Authors' comments: 7 pages, 6 figures. Accepted by ASP-DAC 2022
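Editorial sketch: the abstract states that one full-bitwidth multiplier can produce several low-bitwidth convolution results at once. The snippet below illustrates the general idea with the well-known polynomial-multiplication trick: pack each sequence into one wide integer, multiply once, and read the convolution outputs from fixed-width slices of the product. It handles unsigned values only, relies on Python's arbitrary-precision integers instead of a fixed-width multiplier, and uses hypothetical function names; it is not the HiKonv implementation.

```python
def pack(values, slice_bits):
    """Pack unsigned integers into one wide integer, one value per slice."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (slice_bits * i)
    return packed

def conv_by_one_multiply(f, g, slice_bits):
    """Full 1-D convolution of unsigned f and g from a single multiplication."""
    product = pack(f, slice_bits) * pack(g, slice_bits)
    mask = (1 << slice_bits) - 1
    n_out = len(f) + len(g) - 1
    return [(product >> (slice_bits * k)) & mask for k in range(n_out)]

def conv_reference(f, g):
    """Schoolbook convolution used to check the packed result."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

# 4-bit unsigned inputs: each output is at most 3 * 15 * 15 = 675 < 2**10,
# so 10-bit slices cannot overflow into their neighbours.
f, g = [3, 15, 7], [1, 2, 15]
print(conv_by_one_multiply(f, g, slice_bits=10), conv_reference(f, g))
```

The slice width must be large enough to hold the largest possible partial sum; otherwise carries spill into the next slice and corrupt neighbouring outputs.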
Jiaxing Yan, Hong Zhao, Penghui Bu, YuSheng Jin
Self-supervised learning has shown very promising results for monocular depth estimation. Both scene structure and local details are significant clues for high-quality depth estimation. Recent works suffer from the lack of explicit modeling of scene structure and proper handling of detail information, which leads to a performance bottleneck and blurry artefacts in predicted results. In this paper, we propose the Channel-wise Attention-based Depth Estimation Network (CADepth-Net) with two effective contributions: 1) The structure perception module employs the self-attention mechanism to capture long-range dependencies and aggregates discriminative features in channel dimensions, which explicitly enhances the perception of scene structure and yields better scene understanding and a richer feature representation. 2) The detail emphasis module re-calibrates channel-wise feature maps and selectively emphasizes the informative features, aiming to highlight crucial local detail information and fuse features from different levels more efficiently, resulting in more precise and sharper depth prediction. Furthermore, extensive experiments validate the effectiveness of our method and show that our model achieves state-of-the-art results on the KITTI benchmark and Make3D datasets.
Nan Gao, Mohammad Saiedur Rahaman, Wei Shao, Kaixin Ji, Flora D. Salim
Seating location in the classroom can affect student engagement, attention
and academic performance by providing better visibility, improved movement, and
participation in discussions. Existing studies typically explore how
traditional seating arrangements (e.g. grouped tables or traditional rows)
influence students' perceived engagement, without considering group seating
behaviours under more flexible seating arrangements. Furthermore, survey-based
measures of student engagement are prone to subjectivity and various response
biases. Therefore, in this research, we investigate how individual and group-wise
classroom seating experiences affect student engagement using wearable
physiological sensors. We conducted a field study at a high school and
collected survey and wearable data from 23 students in 10 courses over four
weeks. We aim to answer the following research questions: 1. How does the
seating proximity between students relate to their perceived learning
engagement? 2. How do students' group seating behaviours relate to their
physiologically-based measures of engagement (i.e. physiological arousal and
physiological synchrony)? Experiment results indicate that the individual and
group-wise classroom seating experience is associated with perceived student
engagement and physiologically-based engagement measured from electrodermal
activity. We also find that students who sit close together are more likely to
have similar learning engagement and tend to have high physiological synchrony.
This research opens up opportunities to explore the implications of flexible
seating arrangements and has great potential to maximize student engagement by
suggesting intelligent seating choices in the future.
Authors' comments: The manuscript has been accepted by IMWUT
Thilini Cooray, Ngai-Man Cheung
Unsupervised graph-level representation learning plays a crucial role in a
variety of tasks such as molecular property prediction and community analysis,
especially when data annotation is expensive. Currently, most of the
best-performing graph embedding methods are based on the Infomax principle. The
performance of these methods depends heavily on the selection of negative
samples and suffers if the samples are not carefully selected.
Inter-graph similarity-based methods also suffer if the selected set of graphs
for similarity matching is low in quality. To address this, we focus only on
utilizing the current input graph for embedding learning. We are motivated by
an observation from real-world graph generation processes where the graphs are
formed based on one or more global factors which are common to all elements of
the graph (e.g., topic of a discussion thread, solubility level of a molecule).
We hypothesize that extracting these common factors could be highly beneficial.
Hence, this work proposes a new principle for unsupervised graph representation
learning: Graph-wise Common latent Factor EXtraction (GCFX). We further propose
a deep model for GCFX, deepGCFX, based on the idea of reversing the
above-mentioned graph generation process which could explicitly extract common
latent factors from an input graph and achieve improved results on downstream
tasks compared to the current state-of-the-art. Through extensive experiments and
analysis, we demonstrate that, while extracting common latent factors is
beneficial for graph-level tasks to alleviate distractions caused by local
variations of individual nodes or local neighbourhoods, it also benefits
node-level tasks by enabling long-range node dependencies, especially for
disassortative graphs.
Authors' comments: Accepted to AAAI 2022
Haohe Liu, Qiuqiang Kong, Jiafeng Liu
Music source separation (MSS) has shown active progress with deep learning models
in recent years. Many MSS models perform separations on spectrograms by
estimating bounded ratio masks and reusing the phases of the mixture. When
using convolutional neural networks (CNN), weights are usually shared within a
spectrogram during convolution regardless of the different patterns between
frequency bands. In this study, we propose a new MSS model, channel-wise
subband phase-aware ResUNet (CWS-PResUNet), to decompose signals into subbands
and estimate an unbounded complex ideal ratio mask (cIRM) for each source.
CWS-PResUNet utilizes a channel-wise subband (CWS) feature to limit unnecessary
global weight sharing on the spectrogram and reduce computational resource
consumption. The saved computational cost and memory can in turn allow for a
larger architecture. On the MUSDB18HQ test set, we propose a 276-layer
CWS-PResUNet and achieve state-of-the-art (SoTA) performance on vocals with an
8.92 signal-to-distortion ratio (SDR) score. By combining CWS-PResUNet and
Demucs, our ByteMSS system ranks 2nd on the vocals score and 5th on the average
score in the 2021 ISMIR Music Demixing (MDX) Challenge limited training data
track (leaderboard A). Our code and pre-trained models are publicly available
at: https://github.com/haoheliu/2021-ISMIR-MSS-Challenge-CWS-PResUNet
Authors' comments: Published at MDX Workshop @ ISMIR 2021
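Editorial sketch: the abstract contrasts bounded ratio masks that reuse the mixture phase with an unbounded complex ideal ratio mask (cIRM). The snippet below shows, for a single time-frequency bin, how a complex mask is defined as the ratio of source to mixture and why its magnitude is not limited to 1. The small `eps` guard and the function names are assumptions; in the paper the mask is estimated by a network and not computed from the ground-truth source.

```python
def complex_ratio_mask(source_bin, mixture_bin, eps=1e-12):
    """Ideal complex mask for one STFT bin: M = S / X."""
    if abs(mixture_bin) < eps:
        return 0j  # avoid dividing by a (near-)silent mixture bin
    return source_bin / mixture_bin

def apply_mask(mask, mixture_bin):
    """Recover the source estimate: S_hat = M * X (changes magnitude and phase)."""
    return mask * mixture_bin

# One bin where the source is louder than the mixture and has a different phase.
mixture = 1 + 1j
source = 2j
mask = complex_ratio_mask(source, mixture)
print(mask, abs(mask), apply_mask(mask, mixture))
```

Here the mask has magnitude greater than 1 and a non-zero phase, so a mask bounded to [0, 1] that reuses the mixture phase could not reproduce this source exactly.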
Jing You, Shaocheng Jia, Xin Pei, Danya Yao
Scene perception is essential for driving decision-making and traffic safety.
However, fog, as a kind of common weather, frequently appears in the real
world, especially in the mountain areas, making it difficult to accurately
observe the surrounding environments. Therefore, precisely estimating the
visibility under foggy weather can significantly benefit traffic management and
safety. To address this, most current methods use professional instruments
outfitted at fixed locations on the roads to perform the visibility
measurement; these methods are expensive and less flexible. In this paper, we
propose an innovative end-to-end convolutional neural network framework to
estimate the visibility leveraging Koschmieder's law exclusively using the
image data. The proposed method estimates the visibility by integrating the
physical model into the proposed framework, instead of directly predicting the
visibility value via the convolutional neural network. Moreover, we estimate the
visibility as a pixel-wise visibility map, in contrast to previous visibility
measurement methods, which solely predict a single value for an entire image.
Thus, the estimated result of our method is more informative, particularly in
uneven fog scenarios, which can benefit the development of a more precise early
warning system for foggy weather, thereby better protecting intelligent
transportation infrastructure systems and promoting their development. To
validate the proposed framework, a virtual dataset, FACI, containing 3,000
foggy images in different concentrations, is collected using the AirSim
platform. Detailed experiments show that the proposed method achieves
performance competitive with that of state-of-the-art methods.
Authors' comments: 8 figures
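Editorial sketch: the abstract relies on Koschmieder's law to turn image data into visibility. The snippet below writes out the law in its usual form, I = J t + A (1 - t) with transmission t = exp(-beta d), and the resulting visibility distance V = -ln(eps) / beta, applied per pixel. The contrast threshold of 0.05 is a common convention assumed here, and the function names and the per-pixel extinction map are hypothetical; how the paper embeds the law in its network is not described in the abstract.

```python
import math

def transmission(beta, distance_m):
    """Fraction of scene radiance that survives the fog: t = exp(-beta * d)."""
    return math.exp(-beta * distance_m)

def foggy_intensity(scene, airlight, beta, distance_m):
    """Koschmieder's law: I = J * t + A * (1 - t)."""
    t = transmission(beta, distance_m)
    return scene * t + airlight * (1.0 - t)

def visibility_m(beta, contrast_threshold=0.05):
    """Distance at which contrast falls to the threshold: V = -ln(eps) / beta."""
    return -math.log(contrast_threshold) / beta

def visibility_map(beta_map, contrast_threshold=0.05):
    """Per-pixel visibility from a per-pixel extinction coefficient map."""
    return [[visibility_m(b, contrast_threshold) for b in row] for row in beta_map]

# Uneven fog: the right-hand pixel has twice the extinction coefficient.
print(visibility_map([[0.01, 0.02]]))
```

Doubling the extinction coefficient halves the visibility, which is the kind of spatial variation a single image-level value cannot express.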
Hai Phan, Anh Nguyen
Face identification (FI) is ubiquitous and drives many high-stakes decisions
made by law enforcement. State-of-the-art FI approaches compare two images by
taking the cosine similarity between their image embeddings. Yet, such an
approach suffers from poor out-of-distribution (OOD) generalization to new
types of images (e.g., when a query face is masked, cropped, or rotated) not
included in the training set or the gallery. Here, we propose a re-ranking
approach that compares two faces using the Earth Mover's Distance on the deep,
spatial features of image patches. Our extra comparison stage explicitly
examines image similarity at a fine-grained level (e.g., eyes to eyes) and is
more robust to OOD perturbations and occlusions than traditional FI.
Interestingly, without finetuning feature extractors, our method consistently
improves the accuracy on all tested OOD queries: masked, cropped, rotated, and
adversarial while obtaining similar results on in-distribution images.
Authors' comments: CVPR 2022
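Editorial sketch: the abstract describes the conventional first stage of face identification as taking the cosine similarity between image embeddings. The snippet below shows only that baseline ranking step on toy vectors; the patch-wise Earth Mover's Distance re-ranking proposed in the paper is not implemented here. The embeddings, gallery keys, and function names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery,
                  key=lambda k: cosine_similarity(query, gallery[k]),
                  reverse=True)

# Toy 2-D embeddings standing in for real face embeddings.
gallery = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(rank_gallery([2.0, 0.1], gallery))
```

A re-ranking stage such as the one the abstract proposes would take the top candidates from this list and compare them again with a finer-grained distance.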
Weixuan Sun, Jing Zhang, Zheyuan Liu, Yiran Zhong, Nick Barnes
Weakly Supervised Semantic Segmentation (WSSS) is challenging, particularly when image-level labels are used to supervise pixel-level prediction. To bridge their gap, a Class Activation Map (CAM) is usually generated to provide pixel-level pseudo labels. CAMs in Convolutional Neural Networks suffer from partial activation, i.e., only the most discriminative regions are activated. Transformer-based methods, on the other hand, are highly effective at exploring global context with long-range dependency modeling, potentially alleviating the "partial activation" issue. In this paper, we propose the first transformer-based WSSS approach, and introduce the Gradient-weighted Element-wise Transformer Attention Map (GETAM). GETAM shows fine-scale activation for all feature map elements, revealing different parts of the object across transformer layers. Further, we propose an activation-aware label completion module to generate high-quality pseudo labels. Finally, we incorporate our methods into an end-to-end framework for WSSS using double backward propagation. Extensive experiments on PASCAL VOC and COCO demonstrate that our results beat the state-of-the-art end-to-end approaches by a significant margin, and outperform most multi-stage methods.
Yucheng Shi, Yahong Han, Yu-an Tan, Xiaohui Kuang
Vision transformers (ViTs) have demonstrated impressive performance and stronger adversarial robustness compared to Convolutional Neural Networks (CNNs). On the one hand, ViTs' focus on global interaction between individual patches reduces the local noise sensitivity of images. On the other hand, the neglect of noise sensitivity differences between image regions by existing decision-based attacks further compromises the efficiency of noise compression, especially for ViTs. Therefore, validating the black-box adversarial robustness of ViTs when the target model can only be queried still remains a challenging problem. In this paper, we theoretically analyze the limitations of existing decision-based attacks from the perspective of noise sensitivity difference between regions of the image, and propose a new decision-based black-box attack against ViTs, termed Patch-wise Adversarial Removal (PAR). PAR divides images into patches through a coarse-to-fine search process and compresses the noise on each patch separately. PAR records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can be used as a noise initialization method for other decision-based attacks to improve the noise compression efficiency on both ViTs and CNNs without introducing additional calculations. Extensive experiments on three datasets demonstrate that PAR achieves a much lower noise magnitude with the same number of queries.
Yuanyuan Yuan, Qi Pang, Shuai Wang
Various deep neural network (DNN) coverage criteria have been proposed to
assess DNN test inputs and steer input mutations. The coverage is characterized
via neurons having certain outputs, or the discrepancy between neuron outputs.
Nevertheless, recent research indicates that neuron coverage criteria show
little correlation with test suite quality.
In general, DNNs approximate distributions, by incorporating hierarchical
layers, to make predictions for inputs. Thus, we advocate deducing DNN
behaviors from their approximated distributions, from a layer perspective. A
test suite should be assessed using its induced layer output distributions.
Accordingly, to fully examine DNN behaviors, input mutation should be directed
toward diversifying the approximated distributions.
This paper summarizes eight design requirements for DNN coverage criteria,
taking into account distribution properties and practical concerns. We then
propose a new criterion, NeuraL Coverage (NLC), that satisfies all design
requirements. NLC treats a single DNN layer as the basic computational unit
(rather than a single neuron) and captures four critical properties of neuron
output distributions. Thus, NLC accurately describes how DNNs comprehend inputs
via approximated distributions. We demonstrate that NLC is significantly
correlated with the diversity of a test suite across a number of tasks
(classification and generation) and data formats (image and text). Its capacity
to discover DNN prediction errors is promising. Test input mutation guided by
NLC results in a greater quality and diversity of exposed erroneous behaviors.
Authors' comments: The extended version of a paper to appear in the Proceedings of the
45th IEEE/ACM International Conference on Software Engineering, 2023, (ICSE
'23), 14 pages
Ha Min Son, Moon Hyun Kim, Tai-Myoung Chung
Federated Learning is a widely adopted method to train neural networks over
distributed data. One main limitation is the performance degradation that
occurs when data is heterogeneously distributed. While many works have
attempted to address this problem, these methods under-perform because they are
founded on a limited understanding of neural networks. In this work, we verify
that only certain important layers in a neural network require regularization
for effective training. We additionally verify that Centered Kernel Alignment
(CKA) most accurately calculates similarity between layers of neural networks
trained on different data. By applying CKA-based regularization to important
layers during training, we significantly improve performance in heterogeneous
settings. We present FedCKA: a simple framework that out-performs previous
state-of-the-art methods on various deep learning tasks while also improving
efficiency and scalability.
Authors' comments: 8 pages, 5 figures, 4 tables
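Editorial sketch: the abstract uses Centered Kernel Alignment (CKA) to measure similarity between layers of networks trained on different data. The snippet below computes the linear variant of CKA, ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered activations, in plain Python. Whether FedCKA uses the linear or a kernel variant is not stated in the abstract, so the linear form and all function names are assumptions.

```python
import math

def center_columns(m):
    """Subtract each feature's mean over the examples (rows)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    x, y = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return cross_gram_sq_norm(x, y) / denom

# Activations of one layer for four inputs, and a rescaled copy of them.
acts = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
scaled = [[2.0 * v for v in row] for row in acts]
print(linear_cka(acts, acts), linear_cka(acts, scaled))
```

Linear CKA is invariant to isotropic rescaling of the activations, so the rescaled copy scores the same as the original.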
Shengbo Wang, Bo Lyu, Shiping Wen, Kaibo Shi, Song Zhu, Tingwen Huang
Safety is always one of the most critical principles for a system to be controlled. This paper investigates a safety-critical control scheme for unknown structured systems by using the control barrier function (CBF) method. Benefiting from the dynamic regressor extension and mixing (DREM), an extended element-wise parameter identification law is utilized to dismiss the uncertainty. On the one hand, it is shown that the proposed control scheme can always guarantee the safety in the identification process with noised signal injection excitation, which was not considered in the previous study. On the other hand, the element-wise estimation process in DREM can minimize conservatism of the safe adaptive process compared to other existing adaptive CBF algorithms. The stability as well as the forward invariance of the presented safe control-estimation scheme is proved. Furthermore, the robustness of the scheme under bounded disturbances is analyzed, where a robust CBF with modest conditions is used to ensure safety. The framework is illustrated by simulations on adaptive cruise control, where the slope resistance of the following vehicle is robustly estimated in finite time against small disturbances and the potential crash risk is avoided by the proposed safe control scheme.