Alexander Bors, Qiang Wang
Let $K$ be a finite field of characteristic $p$. We study a certain class of
functions $K\rightarrow K$ that agree with an $\mathbb{F}_p$-affine function
$K\rightarrow K$ on each coset of a given additive subgroup $W$ of $K$ - we
call them $W$-coset-wise $\mathbb{F}_p$-affine functions of $K$. We show that
these functions form a permutation group on $K$ with the structure of an
imprimitive wreath product and characterize which of them are complete mappings
of $K$. As a consequence, we are able to provide various new examples of cycle
types of complete mappings of $K$, including that $K$ has a complete mapping
moving all elements of $K$ in one cycle if $p>2$.
Authors' comments: 29 pages
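To make the central notion concrete: a complete mapping of an additive group is a permutation $f$ for which $x \mapsto x + f(x)$ is also a permutation. The following is a minimal Python sketch that checks this only for the prime field $\mathbb{F}_p$, the simplest case of $K$. The helper names are illustrative and are not taken from the paper. It suggests that the shift $x \mapsto x+1$ is a complete mapping with a single $p$-cycle for $p=7$, but not for $p=2$.

```python
def is_permutation(values, n):
    """Return True if `values` is a rearrangement of 0..n-1."""
    return sorted(values) == list(range(n))


def is_complete_mapping(f, p):
    """Check that both x -> f(x) and x -> x + f(x) permute Z/pZ."""
    image = [f(x) % p for x in range(p)]
    shifted = [(x + f(x)) % p for x in range(p)]
    return is_permutation(image, p) and is_permutation(shifted, p)


def cycle_lengths(f, p):
    """Sorted cycle lengths of x -> f(x) mod p (assumes f is a permutation)."""
    seen, lengths = set(), []
    for start in range(p):
        if start in seen:
            continue
        x, length = start, 0
        while x not in seen:
            seen.add(x)
            x = f(x) % p
            length += 1
        lengths.append(length)
    return sorted(lengths)


def shift(x):
    """Illustrative candidate map x -> x + 1 (hypothetical example)."""
    return x + 1


print(is_complete_mapping(shift, 7), cycle_lengths(shift, 7))
print(is_complete_mapping(shift, 2))
```

For odd $p$ the map $x \mapsto 2x+1$ is a bijection, which is why the shift passes the check; for $p=2$ it collapses to a constant.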
Alperen Kantarcı, Hasan Dertli, Hazım Kemal Ekenel
Face anti-spoofing is essential to prevent false facial verification by using
a photo, video, mask, or a different substitute for an authorized person's
face. Most of the state-of-the-art presentation attack detection (PAD) systems
suffer from overfitting, where they achieve near-perfect scores on a single
dataset but fail on a different dataset with more realistic data. This problem
drives researchers to develop models that perform well under real-world
conditions. This is an especially challenging problem for frame-based
presentation attack detection systems that use convolutional neural networks
(CNN). To this end, we propose a new PAD approach, which combines pixel-wise
binary supervision with patch-based CNN. We believe that training a CNN with
face patches allows the model to distinguish spoofs without learning background
or dataset-specific traces. We tested the proposed method both on the standard
benchmark datasets -- Replay-Mobile, OULU-NPU -- and on a real-world dataset.
The proposed approach shows its superiority on challenging experimental setups.
Namely, it achieves higher performance on OULU-NPU protocols 3 and 4 and on
inter-dataset real-world experiments.
Authors' comments: Accepted to 20th International Conference of the Biometrics Special
Interest Group (BIOSIG 2021) as Oral paper
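As a rough illustration of pixel-wise binary supervision on a face patch, the sketch below computes a mean binary cross-entropy over every pixel of a predicted map. It assumes the network outputs one probability per pixel and that all pixels of a patch share the patch label (1 for bona fide, 0 for attack); both conventions and the function name are assumptions, not details given in the abstract.

```python
import math


def pixelwise_bce(pred_map, label, eps=1e-7):
    """Mean binary cross-entropy over all pixels of a 2D probability map.

    pred_map: 2D list of probabilities in (0, 1).
    label: 1 or 0, applied to every pixel of the patch (assumed convention).
    """
    total, count = 0.0, 0
    for row in pred_map:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
            count += 1
    return total / count


confident_patch = [[0.9, 0.8], [0.95, 0.85]]
print(pixelwise_bce(confident_patch, 1) < pixelwise_bce(confident_patch, 0))
```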
Chun Fan, Jiwei Li, Xiang Ao, Fei Wu, Yuxian Meng, Xiaofei Sun
The proposed pruning strategy offers merits over weight-based pruning
techniques: (1) it avoids irregular memory access since representations and
matrices can be squeezed into their smaller but dense counterparts, leading to
greater speedup; (2) in a manner of top-down pruning, the proposed method
operates from a more global perspective based on training signals in the top
layer, and prunes each layer by propagating the effect of global signals
through layers, leading to better performances at the same sparsity level.
Extensive experiments show that at the same sparsity level, the proposed
strategy offers both greater speedup and higher performances than weight-based
pruning methods (e.g., magnitude pruning, movement pruning).
Authors' comments: To appear at EMNLP2021
Kishan Wimalawarne, Taiji Suzuki
We investigate adaptive layer-wise graph convolution in deep GCN models. We propose AdaGPR to learn generalized Pageranks at each layer of a GCNII network to induce adaptive convolution. We show that the generalization bound for AdaGPR is bounded by a polynomial of the eigenvalue spectrum of the normalized adjacency matrix in the order of the number of generalized Pagerank coefficients. By analysing the generalization bounds we show that oversmoothing depends on both the convolutions by the higher orders of the normalized adjacency matrix and the depth of the model. We performed evaluations on node-classification using benchmark real data and show that AdaGPR provides improved accuracies compared to existing graph convolution networks while demonstrating robustness against oversmoothing. Further, we demonstrate that analysis of coefficients of layer-wise generalized Pageranks allows us to qualitatively understand convolution at each layer enabling model interpretations.
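A generalized PageRank propagation is commonly written as $Z = \sum_{k} \gamma_k \hat{A}^k X$, where $\hat{A}$ is the normalized adjacency matrix. The sketch below is a minimal pure-Python version with fixed coefficients; it assumes the symmetric normalization with self-loops, $\hat{A} = \tilde{D}^{-1/2}(A+I)\tilde{D}^{-1/2}$, and does not reproduce AdaGPR itself, where the coefficients are learned at each layer. All function names are illustrative.

```python
import math


def normalized_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def generalized_pagerank(adj, features, gammas):
    """Return sum_k gammas[k] * A_hat^k @ features."""
    a_hat = normalized_adjacency(adj)
    power = features  # A_hat^0 @ X
    out = [[0.0] * len(features[0]) for _ in features]
    for gamma in gammas:
        out = [[o + gamma * p for o, p in zip(orow, prow)] for orow, prow in zip(out, power)]
        power = matmul(a_hat, power)  # advance to the next power of A_hat
    return out


adj = [[0, 1], [1, 0]]
print(generalized_pagerank(adj, [[1.0], [3.0]], [0.0, 1.0]))
```

Large coefficients on high powers of $\hat{A}$ average features over wide neighbourhoods, which is the mechanism usually associated with oversmoothing.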
Hualian Sheng, Sijia Cai, Yuan Liu, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, Min-Jian Zhao
Though 3D object detection from point clouds has achieved rapid progress in
recent years, the lack of flexible and high-performance proposal refinement
remains a great hurdle for existing state-of-the-art two-stage detectors.
Previous works on refining 3D proposals have relied on human-designed
components such as keypoints sampling, set abstraction and multi-scale feature
fusion to produce powerful 3D object representations. Such methods, however,
have limited ability to capture rich contextual dependencies among points. In
this paper, we leverage the high-quality region proposal network and a
Channel-wise Transformer architecture to constitute our two-stage 3D object
detection framework (CT3D) with minimal hand-crafted design. The proposed CT3D
simultaneously performs proposal-aware embedding and channel-wise context
aggregation for the point features within each proposal. Specifically, CT3D
uses the proposal's keypoints for spatial contextual modelling and learns attention
propagation in the encoding module, mapping the proposal to point embeddings.
Next, a new channel-wise decoding module enriches the query-key interaction via
channel-wise re-weighting to effectively merge multi-level contexts, which
contributes to more accurate object predictions. Extensive experiments
demonstrate that our CT3D method has superior performance and excellent
scalability. Remarkably, CT3D achieves an AP of 81.77% in the moderate car
category on the KITTI test 3D detection benchmark, outperforming state-of-the-art
3D detectors.
Authors' comments: Accepted by ICCV2021
Ansheng You, Chenglin Zhou, Qixuan Zhang, Lan Xu
Adaptive and flexible image editing is a desirable function of modern generative models. In this work, we present a generative model with an auto-encoder architecture for per-region style manipulation. We apply a code consistency loss to enforce an explicit disentanglement between content and style latent representations, making the content and style of generated samples consistent with their corresponding content and style references. The model is also constrained by a content alignment loss to ensure that foreground editing does not interfere with background content. As a result, given region-of-interest masks provided by users, our model supports foreground region-wise style transfer. Notably, our model relies only on self-supervision and receives no extra annotations such as semantic labels. Extensive experiments show the effectiveness of the proposed method and exhibit the flexibility of the proposed model for various applications, including region-wise style editing, latent space interpolation, and cross-domain style transfer.
Heng Wang, Chaoyi Zhang, Jianhui Yu, Yang Song, Siqi Liu, Wojciech Chrzanowski, Weidong Cai
Automatic 3D neuron reconstruction is critical for analysing the morphology
and functionality of neurons in brain circuit activities. However, the
performance of existing tracing algorithms is hindered by low image quality.
Recently, a series of deep learning based segmentation methods have been
proposed to improve the quality of raw 3D optical image stacks by removing
noise and restoring neuronal structures from low-contrast background. Due to
the variety of neuron morphology and the lack of large neuron datasets, most
current neuron segmentation models rely on introducing complex and
specially-designed submodules to a base architecture with the aim of encoding
better feature representations. Though successful, these submodules place an
extra computational burden on inference. Therefore, rather than modifying the base
network, we shift our focus to the dataset itself. The encoder-decoder backbone
used in most neuron segmentation models attends only to intra-volume voxel points
to learn structural features of neurons but neglects the shared intrinsic
semantic features of voxels belonging to the same category among different
volumes, which are also important for expressive representation learning. Hence,
to better utilise the scarce dataset, we propose to explicitly exploit such
intrinsic features of voxels through a novel voxel-level cross-volume
representation learning paradigm on the basis of an encoder-decoder
segmentation model. Our method introduces no extra cost during inference.
Evaluated on 42 3D neuron images from the BigNeuron project, our proposed method is
demonstrated to improve the learning ability of the original segmentation model
and further enhance the reconstruction performance.
Authors' comments: 10 pages, 3 figures, 3 tables, accepted by MICCAI-MLMI 2021
Chi Zhang, Xiaoning Ma, Yu Liu, Le Wang, Yuanqi Su, Yuehu Liu
Fundamental machine learning theory shows that different samples contribute
unequally both in learning and testing processes. Contemporary studies on DNNs
imply that such sample differences are rooted in the distribution of intrinsic
pattern information, namely sample regularity. Motivated by recent
discoveries on network memorization and generalization, we propose a pair of
sample regularity measures for both processes with a formulation-consistent
representation. Specifically, the cumulative binary training/generalizing loss
(CBTL/CBGL), the cumulative number of correct classifications of a
training/testing sample within the training stage, is proposed to quantify the
stability in the memorization-generalization process, while
forgetting/mal-generalizing events, i.e., the misclassification of previously
learned or generalized samples, are utilized to represent the uncertainty of
sample regularity with respect to optimization dynamics. Experiments validate
the effectiveness and robustness of the proposed approaches for mini-batch SGD
optimization. Further applications to training/testing sample selection show
that the proposed measures, which share a unified computing procedure, can benefit
both tasks.
Authors' comments: 20 pages, 13 figures, 3 tables
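The two kinds of measures described above can be sketched from a per-sample record of whether the sample was classified correctly at each evaluation step. The code below is a minimal illustration under that assumption; the function names are hypothetical and any normalization used in the paper is not reproduced here.

```python
def cumulative_correct(history):
    """Number of evaluation steps at which the sample was classified correctly."""
    return sum(1 for correct in history if correct)


def forgetting_events(history):
    """Count transitions from correct (True) to incorrect (False)."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev and not cur)


# One sample tracked over five epochs: learned, forgotten once, then relearned.
history = [False, True, True, False, True]
print(cumulative_correct(history), forgetting_events(history))
```

Applied to a training sample, the same routine counts forgetting events; applied to a test sample, it counts mal-generalizing events, which reflects the shared computing procedure mentioned in the abstract.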
Zifan Shi, Na Fan, Dit-Yan Yeung, Qifeng Chen
Existing vision systems for autonomous driving or robots are sensitive to
waterdrops adhered to windows or camera lenses. Most recent waterdrop removal
approaches take a single image as input and often fail to recover the missing
content behind waterdrops faithfully. Thus, we propose a learning-based model
for waterdrop removal with stereo images. To better detect and remove
waterdrops from stereo images, we propose a novel row-wise dilated attention
module to enlarge attention's receptive field for effective information
propagation between the two stereo images. In addition, we propose an attention
consistency loss between the ground-truth disparity map and attention scores to
enhance the left-right consistency in stereo images. Because no related
datasets are available, we collect a real-world dataset that contains stereo
images with and without waterdrops. Extensive experiments on our dataset
suggest that our model outperforms state-of-the-art methods both quantitatively
and qualitatively. Our source code and the stereo waterdrop dataset are
available at
https://github.com/VivianSZF/Stereo-Waterdrop-Removal
Authors' comments: IROS 2021
Noga Alon
Let $r \geq 2$, $n$ and $k$ be integers satisfying $k \leq \frac{r-1}{r}n$. In the original arXiv version of this note we suggested a conjecture that the family of all $k$-subsets of an $n$-set cannot be partitioned into fewer than $\lceil n-\frac{r}{r-1}(k-1) \rceil$ $r$-wise intersecting families. We noted that if true this is tight for all values of the parameters, that the case $r=2$ is Kneser's conjecture, proved by Lovász, and observed that the assertion also holds provided $r$ is either a prime number or a power of $2$. We have recently learned, however, that the assertion of the conjecture for all values of the parameters follows from a recent result of Azarpendar and Jafari \cite{AJ}.
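As a small worked example, the bound $\lceil n-\frac{r}{r-1}(k-1) \rceil$ can be evaluated with exact integer arithmetic, using $\lceil n - x \rceil = n - \lfloor x \rfloor$ for integer $n$. The function name below is illustrative. For $r=2$, $n=5$, $k=2$ it gives $3$, matching the Kneser value $n-2k+2$.

```python
def partition_lower_bound(n, k, r):
    """Evaluate ceil(n - r/(r-1) * (k-1)) exactly, for k <= (r-1)/r * n."""
    if r < 2 or k < 1:
        raise ValueError("need r >= 2 and k >= 1")
    if k * r > (r - 1) * n:
        raise ValueError("need k <= (r-1)/r * n")
    # ceil(n - x) equals n - floor(x) when n is an integer.
    return n - (r * (k - 1)) // (r - 1)


print(partition_lower_bound(5, 2, 2))
print(partition_lower_bound(6, 4, 3))
```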
Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, Guoqi Li
Dealing effectively and efficiently with spatio-temporal event streams,
where the events are generally sparse and non-uniform and have microsecond
temporal resolution, is of great value and has various real-life applications.
Spiking neural networks (SNNs), as one of the brain-inspired event-triggered
computing models, have the potential to extract effective spatio-temporal
features from the event streams. However, when aggregating individual events
into frames with a new higher temporal resolution, existing SNN models do not
account for the fact that the serial frames have different signal-to-noise
ratios since event streams are sparse and non-uniform. This situation
interferes with the performance of existing SNNs. In this work, we propose a
temporal-wise attention SNN (TA-SNN) model to learn frame-based representation
for processing event streams. Concretely, we extend the attention concept to
temporal-wise input to judge the significance of frames for the final decision
at the training stage, and discard the irrelevant frames at the inference
stage. We demonstrate that TA-SNN models improve the accuracy of event stream
classification tasks. We also study the impact of multiple-scale temporal
resolutions for frame-based representation. Our approach is tested on three
different classification tasks: gesture recognition, image classification, and
spoken digit recognition. We report state-of-the-art results on these
tasks and obtain a substantial accuracy improvement (almost 19%) for gesture
recognition with only 60 ms.
Authors' comments: Accepted by ICCV 2021
Lucas Liebenwein, Alaa Maalouf, Oren Gal, Dan Feldman, Daniela Rus
We present a novel global compression framework for deep neural networks that
automatically analyzes each layer to identify the optimal per-layer compression
ratio, while simultaneously achieving the desired overall compression. Our
algorithm hinges on the idea of compressing each convolutional (or
fully-connected) layer by slicing its channels into multiple groups and
decomposing each group via low-rank decomposition. At the core of our algorithm
is the derivation of layer-wise error bounds from the Eckart-Young-Mirsky
theorem. We then leverage these bounds to frame the compression problem as an
optimization problem where we wish to minimize the maximum compression error
across layers and propose an efficient algorithm towards a solution. Our
experiments indicate that our method outperforms existing low-rank compression
approaches across a wide range of networks and data sets. We believe that our
results open up new avenues for future research into the global
performance-size trade-offs of modern neural networks. Our code is available at
https://github.com/lucaslie/torchprune.
Authors' comments: NeurIPS 2021
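For reference, the Eckart-Young-Mirsky theorem can be stated as follows. This is the standard textbook statement, written with generic symbols ($A$, $\sigma_i$, $u_i$, $v_i$, $A_k$) that are not taken from the paper; how the layer-wise bounds are derived from it is not described in the abstract.

```latex
% Let A have singular value decomposition A = \sum_i \sigma_i u_i v_i^\top
% with \sigma_1 \ge \sigma_2 \ge \dots \ge 0, and let A_k be its rank-k truncation.
\[
  A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{\top}, \qquad
  \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_2
    = \lVert A - A_k \rVert_2 = \sigma_{k+1},
\]
\[
  \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F
    = \lVert A - A_k \rVert_F = \sqrt{\sum_{i > k} \sigma_i^{2}} .
\]
```

In words, truncating the singular value decomposition gives the best rank-$k$ approximation, and the error is determined entirely by the discarded singular values.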
Mingjie He, Jie Zhang, Shiguang Shan, Xiao Liu, Zhongqin Wu, Xilin Chen
Face recognition remains a challenging task in unconstrained scenarios, especially when faces are partially occluded. To improve the robustness against occlusion, augmenting the training images with artificial occlusions has proven to be a useful approach. However, these artificial occlusions are commonly generated by adding a black rectangle or several object templates including sunglasses, scarves and phones, which cannot simulate realistic occlusions well. In this paper, based on the argument that the occlusion essentially damages a group of neurons, we propose a novel and elegant occlusion-simulation method via dropping the activations of a group of neurons in some elaborately selected channel. Specifically, we first employ a spatial regularization to encourage each feature channel to respond to local and different face regions. In this way, the activations affected by an occlusion in a local region are more likely to be located in a single feature channel. Then, the locality-aware channel-wise dropout (LCD) is designed to simulate the occlusion by dropping out the entire feature channel. Furthermore, by randomly dropping out several feature channels, our method can simulate occlusions of larger areas well. The proposed LCD can encourage its succeeding layers to minimize the intra-class feature variance caused by occlusions, thus leading to improved robustness against occlusion. In addition, we design an auxiliary spatial attention module by learning a channel-wise attention vector to reweight the feature channels, which increases the contribution of non-occluded regions. Extensive experiments on various benchmarks show that the proposed method outperforms state-of-the-art methods with a remarkable improvement.
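The core operation, dropping entire feature channels rather than individual activations, can be sketched as below. This is only the generic channel-wise dropout step on nested Python lists: the spatial regularization, the channel selection strategy and any rescaling used in LCD are not reproduced, and the function name is hypothetical.

```python
import random


def channel_dropout(feature_map, drop_count, rng):
    """Zero out `drop_count` randomly chosen whole channels.

    feature_map: list of channels, each a 2D list of activations.
    rng: a random.Random instance, passed in for reproducibility.
    Returns the new feature map and the sorted indices of dropped channels.
    """
    dropped = set(rng.sample(range(len(feature_map)), drop_count))
    out = []
    for index, channel in enumerate(feature_map):
        if index in dropped:
            out.append([[0.0 for _ in row] for row in channel])  # whole channel removed
        else:
            out.append([list(row) for row in channel])  # untouched copy
    return out, sorted(dropped)


features = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
masked, dropped = channel_dropout(features, 2, random.Random(0))
print(dropped)
```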
Yu-Yen Chang, Bau-Ching Hsieh, Wei-Hao Wang, Yen-Ting Lin, Chen-Fatt Lim, Yoshiki Toba, Yuxing Zhong, Siou-Yu Chang
We investigate the performance of machine learning techniques in
classifying active galactic nuclei (AGNs), including X-ray selected AGNs
(XAGNs), infrared selected AGNs (IRAGNs), and radio selected AGNs (RAGNs).
Using known physical parameters in the Cosmic Evolution Survey (COSMOS) field,
we are able to construct well-established training samples in the region of the Hyper
Suprime-Cam (HSC) survey. We compare several Python packages (e.g.,
scikit-learn, Keras, and XGBoost), and use XGBoost to identify AGNs and show
the performance (e.g., accuracy, precision, recall, F1 score, and AUROC). Our
results indicate that the performance is high for bright XAGN and IRAGN host
galaxies. The combination of the HSC (optical) information with the Wide-field
Infrared Survey Explorer (WISE) band-1 and WISE band-2 (near-infrared)
information performs well in identifying AGN hosts. For both type-1 (broad-line)
XAGNs and type-1 (unobscured) IRAGNs, the performance is very good when using
optical to infrared information. These results can be applied to the five-band data
from the wide regions of the HSC survey and to future all-sky surveys.
Authors' comments: accepted for publication in ApJ
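For readers unfamiliar with the performance measures listed above, the sketch below computes accuracy, precision, recall and F1 score for a binary AGN / non-AGN classifier from predicted and true labels. It is written in plain Python for self-containment rather than with the packages named in the abstract, and the toy labels are invented for illustration only.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = AGN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

AUROC is omitted because it requires predicted scores rather than hard labels.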
Yuanyi Zhong, Yuan Zhou, Jian Peng
The control variates (CV) method is widely used in policy gradient estimation
to reduce the variance of the gradient estimators in practice. A control
variate is applied by subtracting a baseline function from the state-action
value estimates. Then the variance-reduced policy gradient presumably leads to
higher learning efficiency. Recent research on control variates with deep
neural net policies mainly focuses on scalar-valued baseline functions. The
effect of vector-valued baselines is under-explored. This paper investigates
variance reduction with coordinate-wise and layer-wise control variates
constructed from vector-valued baselines for neural net policies. We present
experimental evidence suggesting that lower variance can be obtained with such
baselines than with the conventional scalar-valued baseline. We demonstrate how
to equip the popular Proximal Policy Optimization (PPO) algorithm with these
new control variates. We show that the resulting algorithm with proper
regularization can achieve higher sample efficiency than scalar control
variates in continuous control benchmarks.
Authors' comments: 14 pages, 3 figures, added references compared to v1
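The variance-reduction effect of a baseline can be seen in a one-dimensional toy problem. The sketch below uses a Gaussian policy with unit variance and an invented reward $r(a) = a + 10$, and compares the score-function gradient estimate with and without a constant scalar baseline. It illustrates the conventional scalar case only, not the coordinate-wise or layer-wise control variates studied in the paper; all names and constants are assumptions.

```python
import random
import statistics


def gradient_samples(baseline, n_samples=20000, seed=0):
    """Score-function estimates of d/dmu E[r(a)] for a ~ N(mu, 1) at mu = 0."""
    rng = random.Random(seed)
    mu = 0.0
    samples = []
    for _ in range(n_samples):
        action = rng.gauss(mu, 1.0)
        reward = action + 10.0  # toy reward, invented for illustration
        score = action - mu  # derivative of log N(action; mu, 1) w.r.t. mu
        samples.append(score * (reward - baseline))
    return samples


plain = gradient_samples(baseline=0.0)
with_cv = gradient_samples(baseline=10.0)
print(statistics.variance(plain) > statistics.variance(with_cv))
```

Both estimators target the same gradient (equal to 1 here), because the baseline term has zero expectation; only the spread of the estimates changes.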
Yiqun Lin, Lichang Chen, Haibin Huang, Chongyang Ma, Xiaoguang Han, Shuguang Cui
Sampling, grouping, and aggregation are three important components in the
multi-scale analysis of point clouds. In this paper, we present a novel
data-driven sampler learning strategy for point-wise analysis tasks. Unlike the
widely used sampling technique, Farthest Point Sampling (FPS), we propose to
learn sampling and downstream applications jointly. Our key insight is that
uniform sampling methods like FPS are not always optimal for different tasks:
sampling more points around boundary areas can make the point-wise
classification easier for segmentation. Towards this end, we propose a novel
sampler learning strategy that learns sampling point displacement supervised by
task-related ground truth information and can be trained jointly with the
underlying tasks. We further demonstrate our methods in various point-wise
analysis tasks, including semantic part segmentation, point cloud completion,
and keypoint detection. Our experiments show that jointly learning the
sampler and task brings better performance than using FPS in various
point-based networks.
Authors' comments: 14 pages, 13 figures and 14 tables
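For context, the baseline sampler named above, Farthest Point Sampling, can be sketched in a few lines: starting from one point, it repeatedly adds the point farthest from the set chosen so far. This is a minimal pure-Python version of the standard greedy procedure, not the learned sampler proposed in the paper; the function name and the choice of starting index are assumptions.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return selected


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.0, 5.0)]
print(farthest_point_sampling(points, 3))
```

Because the selection depends only on geometry, it spreads samples roughly uniformly and ignores task-specific regions such as segmentation boundaries.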
Francisco Eiras, Motasem Alfarra, M. Pawan Kumar, Philip H. S. Torr, Puneet K. Dokania, Bernard Ghanem, Adel Bibi
Randomized smoothing has recently emerged as an effective tool that enables
certification of deep neural network classifiers at scale. All prior art on
randomized smoothing has focused on isotropic $\ell_p$ certification, which has
the advantage of yielding certificates that can be easily compared among
isotropic methods via $\ell_p$-norm radius. However, isotropic certification
limits the region that can be certified around an input to worst-case
adversaries, i.e., it cannot reason about other "close", potentially large,
constant prediction safe regions. To alleviate this issue, (i) we theoretically
extend the isotropic randomized smoothing $\ell_1$ and $\ell_2$ certificates to
their generalized anisotropic counterparts following a simplified analysis.
Moreover, (ii) we propose evaluation metrics allowing for the comparison of
general certificates - a certificate is superior to another if it certifies a
superset region - with the quantification of each certificate through the
volume of the certified region. We introduce ANCER, a framework for obtaining
anisotropic certificates for a given test set sample via volume maximization.
We achieve it by generalizing memory-based certification of data-dependent
classifiers. Our empirical results demonstrate that ANCER achieves
state-of-the-art $\ell_1$ and $\ell_2$ certified accuracy on CIFAR-10 and
ImageNet in the data-dependence setting, while certifying larger regions in
terms of volume, highlighting the benefits of moving away from isotropic
analysis. Our code is available at https://github.com/MotasemAlfarra/ANCER.
Authors' comments: First two authors and the last one contributed equally to this work
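As background for the isotropic case, a widely used $\ell_2$ certificate for Gaussian randomized smoothing takes the form $R = \sigma\,\Phi^{-1}(\underline{p_A})$, where $\underline{p_A}$ is a lower bound on the top-class probability under noise and $\Phi^{-1}$ is the standard normal quantile function. The sketch below evaluates that well-known isotropic radius only; it is not the anisotropic ANCER certificate, and the function and parameter names are assumptions.

```python
from statistics import NormalDist


def l2_certified_radius(p_a_lower, sigma):
    """Isotropic l2 radius sigma * Phi^{-1}(p_a_lower); 0.0 if nothing is certified.

    p_a_lower: lower bound on the top-class probability, strictly below 1.
    sigma: standard deviation of the isotropic Gaussian smoothing noise.
    """
    if p_a_lower <= 0.5:
        return 0.0  # no certificate when the top class is not a clear majority
    return sigma * NormalDist().inv_cdf(p_a_lower)


print(l2_certified_radius(0.5, 0.25))
print(l2_certified_radius(0.99, 0.25) > l2_certified_radius(0.9, 0.25))
```

A single radius of this kind describes a ball; an anisotropic certificate instead allows different extents along different directions, which is why the abstract compares certificates by volume.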
Ankita Pasad, Ju-Chieh Chou, Karen Livescu
Recently proposed self-supervised learning approaches have been successful
for pre-training speech representation models. The utility of these learned
representations has been observed empirically, but not much has been studied
about the type or extent of information encoded in the pre-trained
representations themselves. Developing such insights can help understand the
capabilities and limits of these models and enable the research community to
more efficiently develop their usage for downstream applications. In this work,
we begin to fill this gap by examining one recent and successful pre-trained
model (wav2vec 2.0), via its intermediate representation vectors, using a suite
of analysis tools. We use the metrics of canonical correlation, mutual
information, and performance on simple downstream tasks with non-parametric
probes, in order to (i) query for acoustic and linguistic information content,
(ii) characterize the evolution of information across model layers, and (iii)
understand how fine-tuning the model for automatic speech recognition (ASR)
affects these observations. Our findings motivate modifying the fine-tuning
protocol for ASR, which produces improved word error rates in a low-resource
setting.
Authors' comments: Accepted to ASRU 2021. Code:
https://github.com/ankitapasad/layerwise-analysis
Chen Dun, Cameron R. Wolfe, Christopher M. Jermaine, Anastasios Kyrillidis
We propose ResIST, a novel distributed training protocol for Residual
Networks (ResNets). ResIST randomly decomposes a global ResNet into several
shallow sub-ResNets that are trained independently in a distributed manner for
several local iterations, before having their updates synchronized and
aggregated into the global model. In the next round, new sub-ResNets are
randomly generated and the process repeats until convergence. By construction,
per iteration, ResIST communicates only a small portion of network parameters
to each machine and never uses the full model during training. Thus, ResIST
reduces the per-iteration communication, memory, and time requirements of
ResNet training to only a fraction of the requirements of full-model training.
In comparison to common protocols, like data-parallel training and
data-parallel training with local SGD, ResIST yields a decrease in
communication and compute requirements, while being competitive with respect to
model performance.
Authors' comments: 26 pages, 8 figures, pre-print under review
Huajun Liu, Fuqiang Liu, Xinyi Fan, Dong Huang
Pixel-wise regression is probably the most common problem in fine-grained computer vision tasks, such as estimating keypoint heatmaps and segmentation masks. These regression problems are very challenging particularly because they require, at low computation overheads, modeling long-range dependencies on high-resolution inputs/outputs to estimate the highly nonlinear pixel-wise semantics. While attention mechanisms in Deep Convolutional Neural Networks (DCNNs) have become popular for boosting long-range dependencies, element-specific attention, such as Nonlocal blocks, is highly complex and noise-sensitive to learn, and most simplified attention hybrids try to reach the best compromise among multiple types of tasks. In this paper, we present the Polarized Self-Attention (PSA) block that incorporates two critical designs towards high-quality pixel-wise regression: (1) Polarized filtering: keeping high internal resolution in both channel and spatial attention computation while completely collapsing input tensors along their counterpart dimensions. (2) Enhancement: composing non-linearity that directly fits the output distribution of typical fine-grained regression, such as the 2D Gaussian distribution (keypoint heatmaps), or the 2D Binomial distribution (binary segmentation masks). PSA appears to have exhausted the representation capacity within its channel-only and spatial-only branches, such that there are only marginal metric differences between its sequential and parallel layouts. Experimental results show that PSA boosts standard baselines by $2-4$ points, and boosts state-of-the-art methods by $1-2$ points on 2D pose estimation and semantic segmentation benchmarks.