Yang Long, Gui-Song Xia, Liangpei Zhang, Gong Cheng, Deren Li
Given an aerial image, aerial scene parsing (ASP) aims to interpret the semantic structure of the image content, e.g., by assigning a semantic label to every pixel of the image. With the popularization of data-driven methods, the past decades have witnessed promising progress on ASP by approaching the problem with tile-level scene classification or segmentation-based image analysis on high-resolution aerial images. However, the former scheme often produces results with tile-wise boundaries, while the latter needs to handle the complex modeling process from pixels to semantics, which often requires large-scale and well-annotated image samples with pixel-wise semantic labels. In this paper, we address these issues in ASP, with perspectives from tile-level scene classification to pixel-wise semantic labeling. Specifically, we first revisit aerial image interpretation through a literature review. We then present a large-scale scene classification dataset, termed Million-AID, that contains one million aerial images. With the presented dataset, we also report benchmarking experiments using classical convolutional neural networks (CNNs). Finally, we perform ASP by unifying tile-level scene classification and object-based image analysis to achieve pixel-wise semantic labeling. Extensive experiments show that Million-AID is a challenging yet useful dataset, which can serve as a benchmark for evaluating newly developed algorithms. When transferring knowledge from Million-AID, fine-tuned CNN models pretrained on Million-AID perform consistently better than those pretrained on ImageNet for aerial scene classification. Moreover, our hierarchical multi-task learning method achieves state-of-the-art pixel-wise classification on the challenging GID, bridging tile-level scene classification and pixel-wise semantic labeling for aerial image interpretation.
Gongxin Yao, Yiwei Chen, Yong Liu, Xiaomin Hu, Yu Pan
Single-photon light detection and ranging (LiDAR) has been widely applied to 3D imaging in challenging scenarios. However, limited signal photon counts and high noise levels in the collected data pose great challenges for predicting the depth image precisely. In this paper, we propose a pixel-wise residual shrinkage network for photon-efficient imaging from high-noise data, which adaptively generates the optimal thresholds for each pixel and denoises the intermediate features by soft thresholding. In addition, redefining the optimization target as pixel-wise classification provides a sharp advantage in producing confident and accurate depth estimation when compared with existing research. Comprehensive experiments conducted on both simulated and real-world datasets demonstrate that the proposed model outperforms state-of-the-art methods and maintains robust imaging performance under different signal-to-noise ratios, including the extreme case of 1:100.
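Illustrative sketch (not the authors' implementation): soft thresholding, the denoising operation named in the abstract above, shrinks each value toward zero by a threshold and zeroes anything whose magnitude is below it. The version below applies a separate threshold per pixel on plain nested lists. In the paper the thresholds are generated by the network; here `thresholds` is a hypothetical hand-written input, and all function names are assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def pixelwise_soft_threshold(features, thresholds):
    """Apply soft thresholding with one threshold per pixel.

    features and thresholds are 2D lists of equal shape (hypothetical layout).
    """
    return [
        [soft_threshold(v, t) for v, t in zip(row_v, row_t)]
        for row_v, row_t in zip(features, thresholds)
    ]


# Hypothetical 2x2 feature map and per-pixel thresholds
features = [[0.9, -0.2], [-1.5, 0.05]]
thresholds = [[0.5, 0.3], [0.5, 0.1]]
denoised = pixelwise_soft_threshold(features, thresholds)
print(denoised)
```

Values whose magnitude falls below their own threshold are removed entirely, while larger values keep their sign and lose only the threshold amount.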
Quanziang Wang, Renzhen Wang, Yuexiang Li, Dong Wei, Kai Ma, Yefeng Zheng, Deyu Meng
Continual learning is a promising machine learning paradigm to learn new tasks while retaining previously learned knowledge over streaming training data. To date, rehearsal-based methods, which keep a small part of the data from old tasks as a memory buffer, have shown good performance in mitigating catastrophic forgetting of previously learned knowledge. However, most of these methods treat each new task equally, which may not adequately consider the relationship or similarity between old and new tasks. Furthermore, these methods commonly neglect sample importance in the continual training process, resulting in sub-optimal performance on certain tasks. To address this challenging problem, we propose Relational Experience Replay (RER), a bi-level learning framework, to adaptively tune task-wise relationships and sample importance within each task to achieve a better 'stability' and 'plasticity' trade-off. As such, the proposed method is capable of accumulating new knowledge while consolidating previously learned old knowledge during continual learning. Extensive experiments conducted on three publicly available datasets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet) show that the proposed method can consistently improve the performance of all baselines and surpass current state-of-the-art methods.
Xinheng Liu, Yao Chen, Prakhar Ganesh, Junhao Pan, Jinjun Xiong, Deming Chen
Quantization for Convolutional Neural Network (CNN) has shown significant
progress with the intention of reducing the cost of computation and storage
with low-bitwidth data inputs. There are, however, no systematic studies on how
an existing full-bitwidth processing unit, such as CPUs and DSPs, can be better
utilized to carry out significantly higher computation throughput for
convolution under various quantized bitwidths. In this study, we propose
HiKonv, a unified solution that maximizes the compute throughput of a given
underlying processing unit to process low-bitwidth quantized data inputs
through novel bit-wise parallel computation. We establish theoretical
performance bounds using a full-bitwidth multiplier for highly parallelized
low-bitwidth convolution, and demonstrate new breakthroughs for
high-performance computing in this critical domain. For example, a single
32-bit processing unit can deliver 128 binarized convolution operations
(multiplications and additions) under one CPU instruction, and a single 27x18
DSP core can deliver eight convolution operations with 4-bit inputs in one
cycle. We demonstrate the effectiveness of HiKonv on CPU and FPGA for both
convolutional layers and a complete DNN model. For a convolutional layer
quantized to 4-bit, HiKonv achieves a 3.17x latency improvement over the
baseline implementation using C++ on CPU. Compared to the DAC-SDC 2020 champion
model for FPGA, HiKonv achieves a 2.37x throughput improvement and 2.61x DSP
efficiency improvement, respectively.
Authors' comments: 7 pages, 6 figures. Accepted by ASP-DAC 2022
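Illustrative sketch of the general packing idea behind the HiKonv abstract above, under stated assumptions: a 1-D convolution of two short sequences equals the coefficient list of a polynomial product, so several low-bitwidth values can be packed into one wide integer, multiplied once, and unpacked. The abstract does not give HiKonv's actual packing scheme; the slot-width rule, the restriction to unsigned inputs, and all function names below are assumptions, and Python's unbounded integers stand in for a fixed-width hardware multiplier.

```python
def pack(values, slot_bits):
    """Place values[i] at bit offset i * slot_bits of a single integer."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * slot_bits)
    return packed


def conv1d_by_single_multiply(a, b, bits=4):
    """Full 1-D convolution of two unsigned `bits`-wide sequences via one multiply."""
    # Largest sum any output slot can hold; the slot must fit it without carrying
    max_sum = min(len(a), len(b)) * ((1 << bits) - 1) ** 2
    slot_bits = max_sum.bit_length()
    product = pack(a, slot_bits) * pack(b, slot_bits)
    mask = (1 << slot_bits) - 1
    n_out = len(a) + len(b) - 1
    return [(product >> (k * slot_bits)) & mask for k in range(n_out)]


def conv1d_reference(a, b):
    """Plain nested-loop convolution used to check the packed version."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


# Hypothetical 4-bit unsigned inputs
a = [3, 15, 7]
b = [2, 9]
print(conv1d_by_single_multiply(a, b))
print(conv1d_reference(a, b))
```

The guard bits in each slot are what prevent partial sums from spilling into neighbouring outputs; on real hardware the total packed width would also have to fit the multiplier's operand and result sizes.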
Jiaxing Yan, Hong Zhao, Penghui Bu, YuSheng Jin
Self-supervised learning has shown very promising results for monocular depth estimation. Scene structure and local details are both significant cues for high-quality depth estimation. Recent works lack explicit modeling of scene structure and proper handling of detail information, which leads to a performance bottleneck and blurry artefacts in the predicted results. In this paper, we propose the Channel-wise Attention-based Depth Estimation Network (CADepth-Net) with two effective contributions: 1) The structure perception module employs the self-attention mechanism to capture long-range dependencies and aggregates discriminative features in channel dimensions, which explicitly enhances the perception of scene structure and yields better scene understanding and richer feature representations. 2) The detail emphasis module re-calibrates channel-wise feature maps and selectively emphasizes informative features, aiming to highlight crucial local detail information and fuse features from different levels more efficiently, resulting in more precise and sharper depth predictions. Furthermore, extensive experiments validate the effectiveness of our method and show that our model achieves state-of-the-art results on the KITTI benchmark and Make3D datasets.
Nan Gao, Mohammad Saiedur Rahaman, Wei Shao, Kaixin Ji, Flora D. Salim
Seating location in the classroom can affect student engagement, attention
and academic performance by providing better visibility, improved movement, and
participation in discussions. Existing studies typically explore how
traditional seating arrangements (e.g. grouped tables or traditional rows)
influence students' perceived engagement, without considering group seating
behaviours under more flexible seating arrangements. Furthermore, survey-based
measures of student engagement are prone to subjectivity and various response
biases. Therefore, in this research, we investigate how individual and group-wise
classroom seating experiences affect student engagement using wearable
physiological sensors. We conducted a field study at a high school and
collected survey and wearable data from 23 students in 10 courses over four
weeks. We aim to answer the following research questions: 1. How does the
seating proximity between students relate to their perceived learning
engagement? 2. How do students' group seating behaviours relate to their
physiologically-based measures of engagement (i.e. physiological arousal and
physiological synchrony)? Experiment results indicate that the individual and
group-wise classroom seating experience is associated with perceived student
engagement and physiologically-based engagement measured from electrodermal
activity. We also find that students who sit close together are more likely to
have similar learning engagement and tend to have high physiological synchrony.
This research opens up opportunities to explore the implications of flexible
seating arrangements and has great potential to maximize student engagement by
suggesting intelligent seating choices in the future.
Authors' comments: The manuscript has been accepted by IMWUT
Thilini Cooray, Ngai-Man Cheung
Unsupervised graph-level representation learning plays a crucial role in a
variety of tasks such as molecular property prediction and community analysis,
especially when data annotation is expensive. Currently, most of the
best-performing graph embedding methods are based on the Infomax principle. The
performance of these methods depends heavily on the selection of negative
samples and degrades if the samples are not carefully selected.
Inter-graph similarity-based methods also suffer if the selected set of graphs
for similarity matching is low in quality. To address this, we focus only on
utilizing the current input graph for embedding learning. We are motivated by
an observation from real-world graph generation processes where the graphs are
formed based on one or more global factors which are common to all elements of
the graph (e.g., topic of a discussion thread, solubility level of a molecule).
We hypothesize that extracting these common factors could be highly beneficial.
Hence, this work proposes a new principle for unsupervised graph representation
learning: Graph-wise Common latent Factor EXtraction (GCFX). We further propose
a deep model for GCFX, deepGCFX, based on the idea of reversing the
above-mentioned graph generation process, which can explicitly extract common
latent factors from an input graph and achieve improved results on downstream
tasks over the current state of the art. Through extensive experiments and
analysis, we demonstrate that, while extracting common latent factors is
beneficial for graph-level tasks to alleviate distractions caused by local
variations of individual nodes or local neighbourhoods, it also benefits
node-level tasks by enabling long-range node dependencies, especially for
disassortative graphs.
Authors' comments: Accepted to AAAI 2022
Haohe Liu, Qiuqiang Kong, Jiafeng Liu
Music source separation (MSS) has shown active progress with deep learning models
in recent years. Many MSS models perform separations on spectrograms by
estimating bounded ratio masks and reusing the phases of the mixture. When
using convolutional neural networks (CNN), weights are usually shared within a
spectrogram during convolution regardless of the different patterns between
frequency bands. In this study, we propose a new MSS model, channel-wise
subband phase-aware ResUNet (CWS-PResUNet), to decompose signals into subbands
and estimate an unbounded complex ideal ratio mask (cIRM) for each source.
CWS-PResUNet utilizes a channel-wise subband (CWS) feature to limit unnecessary
global weight sharing on the spectrogram and reduce computational resource
consumption. The saved computational cost and memory can in turn allow for a
larger architecture. On the MUSDB18HQ test set, we propose a 276-layer
CWS-PResUNet and achieve state-of-the-art (SoTA) performance on vocals with an
8.92 signal-to-distortion ratio (SDR) score. By combining CWS-PResUNet and
Demucs, our ByteMSS system ranks 2nd on vocals score and 5th on average
score in the 2021 ISMIR Music Demixing (MDX) Challenge limited training data
track (leaderboard A). Our code and pre-trained models are publicly available
at: https://github.com/haoheliu/2021-ISMIR-MSS-Challenge-CWS-PResUNet
Authors' comments: Published at MDX Workshop @ ISMIR 2021
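Illustrative sketch of what an unbounded complex ratio mask does, assuming the conventional definition of the cIRM as the complex ratio of a source's spectrogram bin to the mixture's bin. This is not the CWS-PResUNet model, which estimates the mask with a network; the toy bin values and function names are hypothetical, and mixture bins are assumed to be nonzero.

```python
def ideal_complex_mask(source_bins, mixture_bins):
    """Per-bin complex ratio source / mixture (mixture bins assumed nonzero)."""
    return [s / x for s, x in zip(source_bins, mixture_bins)]


def apply_complex_mask(mask_bins, mixture_bins):
    """Complex multiplication changes both magnitude and phase of each bin."""
    return [m * x for m, x in zip(mask_bins, mixture_bins)]


# Hypothetical spectrogram bins for a mixture and one of its sources
mixture = [1 + 1j, 2 + 0j]
source = [0 + 2j, 1 - 1j]

mask = ideal_complex_mask(source, mixture)
recovered = apply_complex_mask(mask, mixture)
print(mask)
print([abs(m) for m in mask])
print(recovered)
```

In this toy example the first mask value has magnitude above 1, which a mask bounded to [0, 1] could not represent, and the mask also rotates the phase instead of reusing the mixture's phase.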
Jing You, Shaocheng Jia, Xin Pei, Danya Yao
Scene perception is essential for driving decision-making and traffic safety.
However, fog is a common weather condition in the real world, especially in
mountainous areas, and makes it difficult to accurately observe the
surrounding environment. Therefore, precisely estimating the
visibility under foggy weather can significantly benefit traffic management and
safety. To address this, most current methods use professional instruments
outfitted at fixed locations on the roads to perform the visibility
measurement; these methods are expensive and less flexible. In this paper, we
propose an innovative end-to-end convolutional neural network framework to
estimate visibility from image data alone by leveraging Koschmieder's law.
The proposed method estimates the visibility by integrating the
physical model into the framework, instead of directly predicting the
visibility value with the convolutional neural network. Moreover, we estimate the
visibility as a pixel-wise visibility map, in contrast to previous visibility
measurement methods, which predict only a single value for an entire image.
Thus, the estimated result of our method is more informative, particularly in
uneven fog scenarios, which can benefit the development of a more precise early
warning system for foggy weather, thereby better protecting intelligent
transportation infrastructure systems and promoting their development. To
validate the proposed framework, a virtual dataset, FACI, containing 3,000
foggy images in different concentrations, is collected using the AirSim
platform. Detailed experiments show that the proposed method achieves
performance competitive with that of state-of-the-art methods.
Authors' comments: 8 figures
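Illustrative sketch of the physical model referred to in the abstract above. Koschmieder's law is commonly written as I = J * t + A * (1 - t) with transmission t = exp(-beta * d), where J is the fog-free intensity, A the airlight, beta the extinction coefficient and d the scene depth; visibility then follows as V = -ln(eps) / beta for a contrast threshold eps. How the paper integrates this into its network is not stated in the abstract. The threshold of 0.05 used here is an assumption (0.02 is also in use), and the function names and the per-pixel `beta_map` are hypothetical.

```python
import math


def transmission(beta, depth):
    """Fraction of scene radiance that survives fog over the given depth."""
    return math.exp(-beta * depth)


def foggy_intensity(clear, airlight, beta, depth):
    """Koschmieder's law: attenuated scene radiance plus scattered airlight."""
    t = transmission(beta, depth)
    return clear * t + airlight * (1.0 - t)


def visibility_from_extinction(beta, contrast_threshold=0.05):
    """Distance at which contrast drops to the threshold (assumed 0.05)."""
    return -math.log(contrast_threshold) / beta


def visibility_map(beta_map, contrast_threshold=0.05):
    """Per-pixel visibility from a per-pixel extinction map (hypothetical input)."""
    return [
        [visibility_from_extinction(b, contrast_threshold) for b in row]
        for row in beta_map
    ]


# Hypothetical uneven fog: thin on the left, dense on the right (beta in 1/m)
beta_map = [[0.01, 0.1]]
print(visibility_map(beta_map))
print(foggy_intensity(clear=0.8, airlight=1.0, beta=0.05, depth=40.0))
```

A map of extinction values yields a map of visibilities, which is what makes uneven fog visible in the output instead of being averaged into a single number.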
Hai Phan, Anh Nguyen
Face identification (FI) is ubiquitous and drives many high-stakes decisions
made by law enforcement. State-of-the-art FI approaches compare two images by
taking the cosine similarity between their image embeddings. Yet, such an
approach suffers from poor out-of-distribution (OOD) generalization to new
types of images (e.g., when a query face is masked, cropped, or rotated) not
included in the training set or the gallery. Here, we propose a re-ranking
approach that compares two faces using the Earth Mover's Distance on the deep,
spatial features of image patches. Our extra comparison stage explicitly
examines image similarity at a fine-grained level (e.g., eyes to eyes) and is
more robust to OOD perturbations and occlusions than traditional FI.
Interestingly, without finetuning feature extractors, our method consistently
improves the accuracy on all tested OOD queries: masked, cropped, rotated, and
adversarial while obtaining similar results on in-distribution images.
Authors' comments: CVPR 2022
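Illustrative sketch of the baseline comparison described in the abstract above: ranking gallery faces by the cosine similarity between image embeddings. It covers only this first stage, not the proposed Earth Mover's Distance re-ranking on patch features. The embeddings, gallery names and function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(query, gallery):
    """Return gallery identities sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in gallery.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical 2-D embeddings; real face embeddings have hundreds of dimensions
query = [1.0, 0.0]
gallery = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
print(rank_gallery(query, gallery))
```

A re-ranking stage such as the one the abstract proposes would take the top candidates from this list and reorder them with a finer-grained comparison.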
Weixuan Sun, Jing Zhang, Zheyuan Liu, Yiran Zhong, Nick Barnes
Weakly Supervised Semantic Segmentation (WSSS) is challenging, particularly when image-level labels are used to supervise pixel-level prediction. To bridge this gap, a Class Activation Map (CAM) is usually generated to provide pixel-level pseudo labels. CAMs in Convolutional Neural Networks suffer from partial activation, i.e., only the most discriminative regions are activated. Transformer-based methods, on the other hand, are highly effective at exploring global context with long-range dependency modeling, potentially alleviating the "partial activation" issue. In this paper, we propose the first transformer-based WSSS approach, and introduce the Gradient-weighted Element-wise Transformer Attention Map (GETAM). GETAM shows fine-scale activation for all feature map elements, revealing different parts of the object across transformer layers. Further, we propose an activation-aware label completion module to generate high-quality pseudo labels. Finally, we incorporate our methods into an end-to-end framework for WSSS using double backward propagation. Extensive experiments on PASCAL VOC and COCO demonstrate that our results beat the state-of-the-art end-to-end approaches by a significant margin, and outperform most multi-stage methods.
Yucheng Shi, Yahong Han, Yu-an Tan, Xiaohui Kuang
Vision transformers (ViTs) have demonstrated impressive performance and stronger adversarial robustness compared to Convolutional Neural Networks (CNNs). On the one hand, ViTs' focus on global interaction between individual patches reduces the local noise sensitivity of images. On the other hand, the neglect of noise sensitivity differences between image regions by existing decision-based attacks further compromises the efficiency of noise compression, especially for ViTs. Therefore, validating the black-box adversarial robustness of ViTs when the target model can only be queried still remains a challenging problem. In this paper, we theoretically analyze the limitations of existing decision-based attacks from the perspective of noise sensitivity difference between regions of the image, and propose a new decision-based black-box attack against ViTs, termed Patch-wise Adversarial Removal (PAR). PAR divides images into patches through a coarse-to-fine search process and compresses the noise on each patch separately. PAR records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can be used as a noise initialization method for other decision-based attacks to improve the noise compression efficiency on both ViTs and CNNs without introducing additional calculations. Extensive experiments on three datasets demonstrate that PAR achieves a much lower noise magnitude with the same number of queries.
Yuanyuan Yuan, Qi Pang, Shuai Wang
Various deep neural network (DNN) coverage criteria have been proposed to
assess DNN test inputs and steer input mutations. The coverage is characterized
via neurons having certain outputs, or the discrepancy between neuron outputs.
Nevertheless, recent research indicates that neuron coverage criteria show
little correlation with test suite quality.
In general, DNNs approximate distributions, by incorporating hierarchical
layers, to make predictions for inputs. Thus, we advocate deducing DNN
behaviors from a layer perspective, based on their approximated distributions. A
test suite should be assessed using its induced layer output distributions.
Accordingly, to fully examine DNN behaviors, input mutation should be directed
toward diversifying the approximated distributions.
This paper summarizes eight design requirements for DNN coverage criteria,
taking into account distribution properties and practical concerns. We then
propose a new criterion, NeuraL Coverage (NLC), that satisfies all design
requirements. NLC treats a single DNN layer as the basic computational unit
(rather than a single neuron) and captures four critical properties of neuron
output distributions. Thus, NLC accurately describes how DNNs comprehend inputs
via approximated distributions. We demonstrate that NLC is significantly
correlated with the diversity of a test suite across a number of tasks
(classification and generation) and data formats (image and text). Its capacity
to discover DNN prediction errors is promising. Test input mutation guided by
NLC results in a greater quality and diversity of exposed erroneous behaviors.
Authors' comments: The extended version of a paper to appear in the Proceedings of the
45th IEEE/ACM International Conference on Software Engineering, 2023, (ICSE
'23), 14 pages
Ha Min Son, Moon Hyun Kim, Tai-Myoung Chung
Federated Learning is a widely adopted method to train neural networks over
distributed data. One main limitation is the performance degradation that
occurs when data is heterogeneously distributed. While many works have
attempted to address this problem, these methods under-perform because they are
founded on a limited understanding of neural networks. In this work, we verify
that only certain important layers in a neural network require regularization
for effective training. We additionally verify that Centered Kernel Alignment
(CKA) most accurately calculates similarity between layers of neural networks
trained on different data. By applying CKA-based regularization to important
layers during training, we significantly improve performance in heterogeneous
settings. We present FedCKA: a simple framework that out-performs previous
state-of-the-art methods on various deep learning tasks while also improving
efficiency and scalability.
Authors' comments: 8 pages, 5 figures, 4 tables
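Illustrative sketch of the similarity measure named in the abstract above, assuming the linear variant of Centered Kernel Alignment: for column-centered activation matrices X and Y (one row per example), CKA = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F). The abstract does not say which CKA variant FedCKA uses or how it enters the loss, so this shows only the measure itself; the function names and toy activations are hypothetical.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_sq_frobenius(x, y):
    """Squared Frobenius norm of x^T y, computed column pair by column pair."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_sq_frobenius(xc, yc)
    denominator = math.sqrt(cross_sq_frobenius(xc, xc) * cross_sq_frobenius(yc, yc))
    return numerator / denominator


# Hypothetical activations of one layer for 4 inputs in two models
acts_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Same features, rescaled and with the two units swapped
acts_b = [[2.0 * b, 2.0 * a] for a, b in acts_a]
print(linear_cka(acts_a, acts_b))
```

Because the score is unchanged by rescaling or permuting units, two layers that encode the same information in a different arrangement still count as similar, which is a property one would want when comparing models trained on different data.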
Shengbo Wang, Bo Lyu, Shiping Wen, Kaibo Shi, Song Zhu, Tingwen Huang
Safety is one of the most critical requirements for any controlled system. This paper investigates a safety-critical control scheme for unknown structured systems by using the control barrier function (CBF) method. Benefiting from dynamic regressor extension and mixing (DREM), an extended element-wise parameter identification law is utilized to eliminate the uncertainty. On the one hand, it is shown that the proposed control scheme can always guarantee safety during the identification process under noisy signal injection excitation, which was not considered in the previous study. On the other hand, the element-wise estimation process in DREM can minimize the conservatism of the safe adaptive process compared to other existing adaptive CBF algorithms. The stability as well as the forward invariance of the presented safe control-estimation scheme is proved. Furthermore, the robustness of the scheme under bounded disturbances is analyzed, where a robust CBF with modest conditions is used to ensure safety. The framework is illustrated by simulations on adaptive cruise control, where the slope resistance of the following vehicle is robustly estimated in finite time against small disturbances and the potential crash risk is avoided by the proposed safe control scheme.
Sheng Xu, Yanjing Li, Junhe Zhao, Baochang Zhang, Guodong Guo
Real-time point cloud processing is fundamental for many computer vision
tasks, yet remains challenged by computational cost on resource-limited
edge devices. To address this issue, we implement XNOR-Net-based binary neural
networks (BNNs) for efficient point cloud processing, but their performance
suffers severely from two main drawbacks: Gaussian-distributed weights and
non-learnable scale factors. In this paper, we introduce point-wise operations
based on Expectation-Maximization (POEM) into BNNs for efficient point cloud
processing. The EM algorithm can efficiently constrain weights for a robust
bi-modal distribution. We introduce a well-designed reconstruction loss to calculate
learnable scale factors to enhance the representation capacity of 1-bit
fully-connected (Bi-FC) layers. Extensive experiments demonstrate that our POEM
surpasses existing state-of-the-art binary point cloud networks by a
significant margin, up to 6.7%.
Authors' comments: Accepted by BMVC 2021. arXiv admin note: text overlap with
arXiv:2010.05501 by other authors
Changyao Tian, Wenhai Wang, Xizhou Zhu, Jifeng Dai, Yu Qiao
Deep learning-based models encounter challenges when processing long-tailed
data in the real world. Existing solutions usually employ some balancing
strategies or transfer learning to deal with the class imbalance problem, based
on the image modality. In this work, we present a visual-linguistic long-tailed
recognition framework, termed VL-LTR, and conduct empirical studies on the
benefits of introducing text modality for long-tailed recognition (LTR).
Compared to existing approaches, the proposed VL-LTR has the following merits.
(1) Our method can not only learn visual representation from images but also
learn corresponding linguistic representation from noisy class-level text
descriptions collected from the Internet; (2) Our method can effectively use
the learned visual-linguistic representation to improve the visual recognition
performance, especially for classes with fewer image samples. We also conduct
extensive experiments and set new state-of-the-art performance on
widely-used LTR benchmarks. Notably, our method achieves 77.2% overall accuracy
on ImageNet-LT, which significantly outperforms the previous best method by
over 17 points, and is close to the prevailing performance of models trained on the full
ImageNet. Code is available at https://github.com/ChangyaoTian/VL-LTR.
Authors' comments: Accepted by ECCV 2022; 14 pages, 9 figures
Linda Albanese, Andrea Alessandrelli
The purpose of this paper is to address the P-spin glass and the Gaussian P-spin model, i.e., spin glasses with polynomial interactions of degree P > 2. We consider the replica symmetry (RS) and first step of replica symmetry breaking assumptions and solve the models via the transport equation and Guerra's interpolating technique, showing that both reach the same results. Thus, using rigorous approaches, we recover the same expressions for the quenched statistical pressure and the self-consistency equations under both assumptions as those found with other techniques, including the well-known replica trick. Finally, we show that for P = 2 the Gaussian P-spin glass model is intrinsically RS.
Yu Tian, Yuyuan Liu, Guansong Pang, Fengbei Liu, Yuanhong Chen, Gustavo Carneiro
State-of-the-art (SOTA) anomaly segmentation approaches on complex urban
driving scenes explore pixel-wise classification uncertainty learned from
outlier exposure, or external reconstruction models. However, previous
uncertainty approaches that directly associate high uncertainty with anomalies may
sometimes lead to incorrect anomaly predictions, and external reconstruction
models tend to be too inefficient for real-time self-driving embedded systems.
In this paper, we propose a new anomaly segmentation method, named pixel-wise
energy-biased abstention learning (PEBAL), that explores pixel-wise abstention
learning (AL) with a model that learns an adaptive pixel-level anomaly class,
and an energy-based model (EBM) that learns inlier pixel distribution. More
specifically, PEBAL is based on a non-trivial joint training of EBM and AL,
where EBM is trained to output high-energy for anomaly pixels (from outlier
exposure) and AL is trained such that these high-energy pixels receive adaptive
low penalty for being included in the anomaly class. We extensively evaluate
PEBAL against the SOTA and show that it achieves the best performance across
four benchmarks. Code is available at https://github.com/tianyu0207/PEBAL.
Authors' comments: ECCV 2022 Oral
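Illustrative sketch of a per-pixel energy score, assuming the free-energy form commonly used with classifier logits, E(x) = -T * log(sum_k exp(z_k / T)). The abstract above says only that the EBM is trained to output high energy for anomaly pixels; whether PEBAL uses exactly this form is an assumption, and the threshold, logit values and function names are hypothetical. The joint training with abstention learning is not shown.

```python
import math


def free_energy(logits, temperature=1.0):
    """Negative temperature-scaled log-sum-exp of the class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def anomaly_mask(logit_map, threshold):
    """Flag pixels whose energy exceeds a (hypothetical) threshold."""
    return [[free_energy(logits) > threshold for logits in row] for row in logit_map]


# Hypothetical 1x2 image with 3 inlier classes per pixel:
# the first pixel is confidently classified, the second has no strong class
logit_map = [[[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
print([[free_energy(p) for p in row] for row in logit_map])
print(anomaly_mask(logit_map, threshold=-5.0))
```

Pixels with a dominant inlier logit get low energy, while pixels where no inlier class responds strongly get higher energy and are flagged.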
Yatin Dandi, Arthur Jacot
Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and Spherical Harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks.