Jeffrey Fong, Siwei Chen, Kaiqi Chen
Training neural networks with large batches is of fundamental significance to deep learning. Large-batch training remarkably reduces training time but has difficulty maintaining accuracy. Recent works have put forward optimization methods such as LARS and LAMB to tackle this issue through adaptive layer-wise optimization using trust ratios. Though prevalent, such methods are observed to still suffer from unstable and extreme trust ratios, which degrade performance. In this paper, we propose a new variant of LAMB, called LAMBC, which employs trust ratio clipping to stabilize the magnitude of the trust ratio and prevent extreme values. We conducted experiments on image classification tasks such as ImageNet and CIFAR-10, and our empirical results demonstrate promising improvements across different batch sizes.
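As a rough illustration of trust ratio clipping, the sketch below computes a LAMB-style layer-wise trust ratio (weight norm divided by update norm) and clamps it to a fixed interval. The function names and the bounds `lower` and `upper` are hypothetical and chosen only for this example; the abstract does not give the exact clipping rule or bound values.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def clipped_trust_ratio(weights, update, lower=0.0, upper=1.0, eps=1e-12):
    """Layer-wise trust ratio ||w|| / ||update||, clamped to [lower, upper].

    The bounds are illustrative assumptions, not values from the paper.
    """
    w_norm = l2_norm(weights)
    u_norm = l2_norm(update)
    if w_norm == 0.0 or u_norm == 0.0:
        ratio = 1.0  # common fallback when either norm vanishes
    else:
        ratio = w_norm / (u_norm + eps)
    return min(max(ratio, lower), upper)


def apply_layer_update(weights, update, lr, lower=0.0, upper=1.0):
    """One layer-wise step: w <- w - lr * clipped_ratio * update."""
    ratio = clipped_trust_ratio(weights, update, lower, upper)
    return [w - lr * ratio * u for w, u in zip(weights, update)]
```

Here a layer whose weight norm is ten times its update norm would otherwise be scaled by 10; with `upper=1.0` the scaling is capped at 1, which is the stabilizing effect the abstract describes.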
Hao Zheng, Yulei Qin, Yun Gu, Fangfang Xie, Jie Yang, Jiayuan Sun, Guang-zhong Yang
Automated airway segmentation is a prerequisite for pre-operative diagnosis and intra-operative navigation for pulmonary intervention. Due to the small size and scattered spatial distribution of peripheral bronchi, this task is hampered by severe class imbalance between foreground and background regions, which makes it challenging for CNN-based methods to parse distal small airways. In this paper, we demonstrate that this problem arises from gradient erosion and dilation of the neighborhood voxels. During back-propagation, if the ratio of the foreground gradient to background gradient is small while the class imbalance is local, the foreground gradients can be eroded by their neighborhoods. This process cumulatively increases the noise information included in the gradient flow from top layers to the bottom ones, limiting the learning of small structures in CNNs. To alleviate this problem, we use group supervision and the corresponding WingsNet to provide complementary gradient flows to enhance the training of shallow layers. To further address the intra-class imbalance between large and small airways, we design a General Union loss function which obviates the impact of airway size by distance-based weights and adaptively tunes the gradient ratio based on the learning process. Extensive experiments on public datasets demonstrate that the proposed method can predict the airway structures with higher accuracy and better morphological completeness than the baselines.
Zitong Yu, Xiaobai Li, Jingang Shi, Zhaoqiang Xia, Guoying Zhao
Face anti-spoofing (FAS) plays a vital role in securing face recognition
systems from presentation attacks (PAs). As more and more realistic PAs
with novel types spring up, it is necessary to develop robust algorithms for
detecting unknown attacks even in unseen scenarios. However, deep models
supervised by traditional binary loss (e.g., '0' for bona fide vs. '1' for PAs)
are weak in describing intrinsic and discriminative spoofing patterns.
Recently, pixel-wise supervision has been proposed for the FAS task, intending
to provide more fine-grained pixel/patch-level cues. In this paper, we first
give a comprehensive review and analysis of the existing pixel-wise
supervision methods for FAS. Then we propose a novel pyramid supervision, which
guides deep models to learn both local details and global semantics from
multi-scale spatial context. Extensive experiments are performed on five FAS
benchmark datasets to show that, without bells and whistles, the proposed
pyramid supervision could not only improve the performance beyond existing
pixel-wise supervision frameworks, but also enhance the model's
interpretability (i.e., locating the patch-level positions of PAs more
reasonably). Furthermore, elaborate studies are conducted for exploring the
efficacy of different architecture configurations with two kinds of pixel-wise
supervisions (binary mask and depth map supervisions), which provide
inspiring insights for future architecture/supervision design.
Authors' comments: submitted to IEEE Transactions on Biometrics, Behavior and Identity
Science
Nicolas Nadisic, Jeremy E Cohen, Arnaud Vandaele, Nicolas Gillis
Nonnegative least squares problems with multiple right-hand sides (MNNLS)
arise in models that rely on additive linear combinations. In particular, they
are at the core of most nonnegative matrix factorization algorithms and have
many applications. The nonnegativity constraint is known to naturally favor
sparsity, that is, solutions with few non-zero entries. However, it is often
useful to further enhance this sparsity, as it improves the interpretability of
the results and helps reduce noise, which leads to the sparse MNNLS problem.
In this paper, as opposed to most previous works that enforce sparsity column-
or row-wise, we first introduce a novel formulation for sparse MNNLS, with a
matrix-wise sparsity constraint. Then, we present a two-step algorithm to
tackle this problem. The first step divides sparse MNNLS into subproblems, one
per column of the original problem. It then uses different algorithms to
produce, either exactly or approximately, a Pareto front for each subproblem,
that is, to produce a set of solutions representing different tradeoffs between
reconstruction error and sparsity. The second step selects solutions among
these Pareto fronts in order to build a sparsity-constrained matrix that
minimizes the reconstruction error. We perform experiments on facial and
hyperspectral images, and we show that our proposed two-step approach provides
more accurate results than state-of-the-art sparse coding heuristics applied
both column-wise and globally.
Authors' comments: 25 pages + 18 pages supplementary material. This is the new version
of a work originally called "A Homotopy-based Algorithm for Sparse Multiple
Right-hand Sides Nonnegative Least Squares". Although the central concept is
the same, the paper has been almost completely rewritten
Jonas Spethmann, Martin Grünebohm, Roland Wiesendanger, Kirsten von Bergmann, André Kubetzka
We investigate magnetic domain walls in a single fcc Mn layer on Re(0001)
employing spin-polarized STM, atom manipulation, and spin dynamics simulations.
The low symmetry of the row-wise antiferromagnetic (1Q) state leads to a new
type of domain wall which connects rotational 1Q domains by a transient 2Q
state with characteristic 90$^\circ$ angles between neighboring magnetic
moments. The domain wall properties depend on their orientation, and their width
of about 2 nm essentially results from a balance of Heisenberg and higher-order
exchange interactions. Atom manipulation allows domain wall imaging with atomic
spin-resolution, as well as domain wall positioning, and we demonstrate that
the force to move an atom is anisotropic on the 1Q domain.
Authors' comments: 6 pages, 4 figures
Hao Li, Xiaopeng Zhang, Hongkai Xiong
Contrastive learning based on instance discrimination trains a model to
discriminate different transformations of the anchor sample from other samples,
which does not consider the semantic similarity among samples. This paper
proposes a new kind of contrastive learning method, named CLIM, which uses
positives from other samples in the dataset. This is achieved by searching
local similar samples of the anchor, and selecting samples that are closer to
the corresponding cluster center, which we denote as center-wise local image
selection. The selected samples are instantiated via a data mixture strategy,
which acts as a smoothing regularization. As a result, CLIM encourages both
local similarity and global aggregation in a robust way, which we find is
beneficial for feature representation. Besides, we introduce
multi-resolution augmentation, which enables the representation to be
scale invariant. We reach 75.5% top-1 accuracy with linear evaluation over
ResNet-50, and 59.3% top-1 accuracy when fine-tuned with only 1% labels.
Authors' comments: Accepted by BMVC2021
Qiang Wang, Changliang Li, Yue Zhang, Tong Xiao, Jingbo Zhu
Traditional neural machine translation is limited to the topmost encoder
layer's context representation and cannot directly perceive the lower encoder
layers. Existing solutions usually rely on the adjustment of network
architecture, making the calculation more complicated or introducing additional
structural restrictions. In this work, we propose layer-wise multi-view
learning to solve this problem, circumventing the necessity to change the model
structure. We regard each encoder layer's off-the-shelf output, a by-product in
layer-by-layer encoding, as the redundant view for the input sentence. In this
way, in addition to the topmost encoder layer (referred to as the primary
view), we also incorporate an intermediate encoder layer as the auxiliary view.
We feed the two views to a partially shared decoder to maintain independent
prediction. Consistency regularization based on KL divergence is used to
encourage the two views to learn from each other. Extensive experimental
results on five translation tasks show that our approach yields stable
improvements over multiple strong baselines. As another bonus, our method is
agnostic to network architectures and can maintain the same inference speed as
the original model.
Authors' comments: COLING 2020
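The KL-based consistency regularization mentioned above can be sketched as follows. This is a minimal illustration, assuming each view produces a vector of logits over the vocabulary for one target position and that the regularizer is a symmetric KL term; the symmetric form, the 0.5 weighting, and all function names are assumptions made for this example, not details taken from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(primary_logits, auxiliary_logits):
    """Symmetric KL between the predictions of the two views (assumed form)."""
    p = softmax(primary_logits)
    q = softmax(auxiliary_logits)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The loss is zero when both views predict the same distribution and grows as their predictions diverge, which is what pushes the two views to learn from each other.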
Ruizhe Li, Xiao Li, Guanyi Chen, Chenghua Lin
The Variational Autoencoder (VAE) is a popular and powerful model applied to
text modelling to generate diverse sentences. However, an issue known as
posterior collapse (or KL loss vanishing) happens when the VAE is used in text
modelling, where the approximate posterior collapses to the prior, and the
model completely ignores the latent variables and degrades to a plain
language model during text generation. Such an issue is particularly prevalent
when RNN-based VAE models are employed for text modelling. In this paper, we
propose a simple, generic architecture called Timestep-Wise Regularisation VAE
(TWR-VAE), which can effectively avoid posterior collapse and can be applied to
any RNN-based VAE models. The effectiveness and versatility of our model are
demonstrated in different tasks, including language modelling and dialogue
response generation.
Authors' comments: Accepted by COLING 2020, final camera ready version
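To make the KL term behind posterior collapse concrete, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, and averages it over the timesteps of an RNN encoder. Averaging over timesteps is an assumption suggested by the method's name, not a detail given in the abstract, and the function names are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def timestep_wise_kl(means, log_vars):
    """Average the KL term over all timesteps (assumed aggregation)."""
    kls = [
        gaussian_kl_to_standard_normal(m, lv) for m, lv in zip(means, log_vars)
    ]
    return sum(kls) / len(kls)
```

A KL value of zero at a timestep means the posterior there equals the prior, which is exactly the collapsed case in which the latent variable carries no information.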
Yiwen Liao, Raphaël Latty, Bin Yang
Feature selection is generally used as one of the most important
preprocessing techniques in machine learning, as it helps to reduce the
dimensionality of data and assists researchers and practitioners in
understanding data. Thus, with feature selection, better performance
and reduced computational cost, memory usage, and even data volume
can be expected. Although there exist approaches leveraging the power of deep
neural networks to carry out feature selection, many of them often suffer from
sensitive hyperparameters. This paper proposes a feature mask module
(FM-module) for feature selection based on a novel batch-wise attenuation and
feature mask normalization. The proposed method is almost free from
hyperparameters and can be easily integrated into common neural networks as an
embedded feature selection method. Experiments on popular image, text and
speech datasets have shown that our approach is easy to use and has superior
performance in comparison with other state-of-the-art deep-learning-based
feature selection methods.
Authors' comments: accepted by IJCNN2021
Trung Trinh, Samuel Kaski, Markus Heinonen
We introduce implicit Bayesian neural networks, a simple and scalable
approach for uncertainty representation in deep learning. The standard Bayesian
approach to deep learning requires the impractical inference of the posterior
distribution over millions of parameters. Instead, we propose to induce a
distribution that captures the uncertainty over neural networks by augmenting
each layer's inputs with latent variables. We present appropriate input
distributions and demonstrate state-of-the-art performance in terms of
calibration, robustness and uncertainty characterisation over large-scale,
multi-million parameter image classification tasks.
Authors' comments: 8 pages
Marc Abeille, Louis Faury, Clément Calauzènes
Logistic Bandits have recently attracted substantial attention, by providing
an uncluttered yet challenging framework for understanding the impact of
non-linearity in parametrized bandits. It was shown by Faury et al. (2020) that
the learning-theoretic difficulties of Logistic Bandits can be embodied by a
(sometimes prohibitively) large problem-dependent constant $\kappa$,
characterizing the magnitude of the reward's non-linearity. In this paper we
introduce a novel algorithm for which we provide a refined analysis. This
allows for a better characterization of the effect of non-linearity and yields
improved problem-dependent guarantees. In the most favorable cases this leads to a
regret upper-bound scaling as $\tilde{\mathcal{O}}(d\sqrt{T/\kappa})$, which
dramatically improves over the $\tilde{\mathcal{O}}(d\sqrt{T}+\kappa)$
state-of-the-art guarantees. We prove that this rate is minimax-optimal by
deriving an $\Omega(d\sqrt{T/\kappa})$ problem-dependent lower-bound. Our
analysis identifies two regimes (permanent and transitory) of the regret, which
ultimately reconciles Faury et al. (2020) with the Bayesian approach of
Dong et al. (2019). In contrast to previous works, we find that in the
permanent regime non-linearity can dramatically ease the
exploration-exploitation trade-off. While it also impacts the length of the
transitory phase in a problem-dependent fashion, we show that this impact is
mild in most reasonable configurations.
Authors' comments: 40 pages. AISTATS 2021, oral
Sean McBane, Youngsoo Choi
Lattice-type structures can provide a combination of stiffness with light
weight that is desirable in a variety of applications. Design optimization of
these structures must rely on approximations of the governing physics to render
solution of a mathematical model feasible. In this paper, we propose a topology
optimization (TO) formulation that approximates the governing physics using
component-wise reduced order modeling, which can reduce solution time by
multiple orders of magnitude over a full-order finite element model while
providing a relative error in the solution of less than one percent. In
addition, the offline training data set from such component-wise models is
reusable, allowing its application to many design problems for only the cost of
a single offline training phase, and the component-wise method is nearly
embarrassingly parallel. We also show how the parameterization chosen in our
optimization allows a simplification of the component-wise reduced order model
(CWROM) not noted in previous literature, for further speedup of the
optimization process. The sensitivity of the compliance with respect to the
particular parameterization is derived solely at the component level. In
numerical examples, we demonstrate a 1000x speedup over a full-order FEM model
with relative error of less than one percent and show minimum compliance
designs for two different cantilever beam examples, one smaller and one larger.
Finally, error bounds for displacement field, compliance, and compliance
sensitivity of the CWROM are derived.
Authors' comments: 27 pages, 11 figures
Chen Tang, Wenyu Sun, Zhuqing Yuan, Yongpan Liu
To accelerate deep CNN models, this paper proposes a novel spatially adaptive framework that can dynamically generate pixel-wise sparsity according to the input image. The sparse scheme is pixel-wise refined and regionally adaptive under a unified importance map, which makes it friendly to hardware implementation. A sparsity control method is further presented to enable online adjustment for applications with different precision/latency requirements. The sparse model is applicable to a wide range of vision tasks. Experimental results show that this method improves the computing efficiency for both image classification using ResNet-18 and super resolution using SRResNet. On the image classification task, our method can save 30%-70% of MACs with a slight drop in top-1 and top-5 accuracy. On the super resolution task, our method can reduce MACs by more than 90% while only causing decreases of around 0.1 dB in PSNR and 0.01 in SSIM. Hardware validation is also included.
Hao Wang, Jia Zhang, Yingce Xia, Jiang Bian, Chao Zhang, Tie-Yan Liu
Semantic code search, which aims to retrieve code snippets relevant to a given natural language query, has attracted many research efforts with the purpose of accelerating software development. The huge number of publicly available online code repositories has prompted the use of deep learning techniques to build state-of-the-art code search models. In particular, these models leverage deep neural networks to embed code and queries into a unified semantic vector space and then use the similarity between the code and query vectors to approximate the semantic correlation between the code and the query. However, most existing studies overlook the code's intrinsic structural logic, which contains a wealth of semantic information, and fail to capture the intrinsic features of code. In this paper, we propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's valuable intrinsic structural logic. To further increase the learning efficiency of COSEA, we propose a variant of contrastive loss for training the code search model, where the ground-truth code should be distinguished from the most similar negative sample. We have implemented a prototype of COSEA. Extensive experiments over existing public datasets of Python and SQL have demonstrated that COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
Fan Mo, Anastasia Borovykh, Mohammad Malekzadeh, Hamed Haddadi, Soteris Demetriou
Training deep neural networks via federated learning allows clients to share,
instead of the original data, only the model trained on their data. Prior work
has demonstrated that in practice a client's private information, unrelated to
the main learning task, can be discovered from the model's gradients, which
compromises the promised privacy protection. However, there is still no formal
approach for quantifying the leakage of private information via the shared
updated model or gradients. In this work, we analyze property inference attacks
and define two metrics based on (i) an adaptation of the empirical
$\mathcal{V}$-information, and (ii) a sensitivity analysis using Jacobian
matrices allowing us to measure changes in the gradients with respect to latent
information. We show the applicability of our proposed metrics in localizing
private latent information in a layer-wise manner and in two settings where (i)
we have or (ii) we do not have knowledge of the attackers' capabilities. We
evaluate the proposed metrics for quantifying information leakage on three
real-world datasets using three benchmark models.
Authors' comments: 9 pages, at ICLR workshop (Distributed and Private Machine Learning)
Alessio Netti, Daniele Tafani, Michael Ott, Martin Schulz
Modern High-Performance Computing (HPC) and data center operators rely more
and more on data analytics techniques to improve the efficiency and reliability
of their operations. They employ models that ingest time-series monitoring
sensor data and transform it into actionable knowledge for system tuning: a
process known as Operational Data Analytics (ODA). However, monitoring data has
a high dimensionality, is hardware-dependent and difficult to interpret. This,
coupled with the strict requirements of ODA, makes most traditional data mining
methods impractical and in turn renders this type of data cumbersome to
process. Most current ODA solutions use ad-hoc processing methods that are not
generic, are sensitive to the sensors' features and are not fit for
visualization.
In this paper we propose a novel method, called Correlation-wise Smoothing
(CS), to extract descriptive signatures from time-series monitoring data in a
generic and lightweight way. Our CS method exploits correlations between data
dimensions to form groups and produces image-like signatures that can be easily
manipulated, visualized and compared. We evaluate the CS method on HPC-ODA, a
collection of datasets that we release with this work, and show that it leads
to the same performance as most state-of-the-art methods while producing
signatures that are up to ten times smaller and up to ten times faster, while
gaining visualizability, portability across systems and clear scaling
properties.
Authors' comments: Accepted for publication at the 35th IEEE International Parallel &
Distributed Processing Symposium (IPDPS 2021)
Tsai-Shien Chen, Man-Yu Lee, Chih-Ting Liu, Shao-Yi Chien
Vehicle re-identification (re-ID) matches images of the same vehicle across
different cameras. It is fundamentally challenging because the dramatically
different appearance caused by different viewpoints would make the framework
fail to match two vehicles of the same identity. Most existing works solve the
problem by extracting viewpoint-aware features via spatial attention mechanisms,
which, however, usually suffer from noisy generated attention maps or otherwise
require expensive keypoint labels to improve the quality. In this work, we
propose Viewpoint-aware Channel-wise Attention Mechanism (VCAM) by observing
the attention mechanism from a different aspect. Our VCAM enables the feature
learning framework to reweigh the importance of each feature map channel-wise
according to the "viewpoint" of the input vehicle. Extensive experiments
validate the effectiveness of the proposed method and show that we perform
favorably against state-of-the-art methods on the public VeRi-776 dataset and
obtain promising results on the 2020 AI City Challenge. We also conduct other
experiments to demonstrate the interpretability of how our VCAM practically
assists the learning framework.
Authors' comments: CVPR Workshop 2020
Yuhki Hatakeyama, Hiroki Sakuma, Yoshinori Konishi, Kohei Suenaga
Image classification based on machine learning is now commonly used.
However, a classification result given by an advanced method, including deep
learning, is often hard to interpret. This problem of interpretability is one
of the major obstacles in deploying a trained model in safety-critical systems.
Several techniques have been proposed to address this problem, one of which is
RISE, which explains a classification result with a heatmap, called a saliency
map, that shows the significance of each pixel. We propose MC-RISE
(Multi-Color RISE), which is an enhancement of RISE to take color information
into account in an explanation. Our method not only shows the saliency of each
pixel in a given image as the original RISE does, but also the significance of the color
components of each pixel; a saliency map with color information is useful
especially in the domain where the color information matters (e.g.,
traffic-sign recognition). We implemented MC-RISE and evaluated it using two
datasets (GTSRB and ImageNet) to demonstrate the effectiveness of our method
in comparison with existing techniques for interpreting image classification
results.
Authors' comments: To appear in ACCV 2020
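For readers unfamiliar with the baseline, the sketch below illustrates the general idea of a RISE-style saliency map: the image is occluded with random binary masks, the model is scored on each masked copy, and each pixel's saliency is the score-weighted average of how often it was kept. It covers only the original per-pixel saliency, not the color extension of MC-RISE. The flat-list image representation, the toy model, and all parameter values are assumptions made for this example.

```python
import random


def rise_saliency(image, model, num_masks=500, keep_prob=0.5, seed=0):
    """Estimate per-pixel saliency from model scores on randomly masked inputs.

    `image` is a flat list of pixel values and `model` maps such a list to a
    scalar class score; both are simplifications for illustration.
    """
    rng = random.Random(seed)
    saliency = [0.0] * len(image)
    for _ in range(num_masks):
        # Keep each pixel independently with probability keep_prob.
        mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in image]
        masked = [pixel * m for pixel, m in zip(image, mask)]
        score = model(masked)
        # Credit the score to every pixel that was visible in this mask.
        for j, m in enumerate(mask):
            saliency[j] += score * m
    return [s / (num_masks * keep_prob) for s in saliency]


def toy_model(pixels):
    """Hypothetical classifier whose score depends only on pixel 2."""
    return pixels[2]
```

With the toy model, the pixel at index 2 receives the largest saliency, since the score is nonzero only when that pixel is left unmasked.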
Saptarshi Sinha, Hiroki Ohashi, Katsuyuki Nakamura
Class-imbalance is one of the major challenges in real world datasets, where
a few classes (called majority classes) account for far more data samples than
the rest (called minority classes). Learning deep neural networks using such
datasets leads to performances that are typically biased towards the majority
classes. Most of the prior works try to solve class-imbalance by assigning more
weights to the minority classes in various manners (e.g., data re-sampling,
cost-sensitive learning). However, we argue that the number of available
training samples may not always be a good clue for determining the weighting strategy
because some of the minority classes might be sufficiently represented even by
a small number of training samples. Overweighting samples of such classes can lead
to a drop in the model's overall performance. We claim that the 'difficulty' of a
class as perceived by the model is more important to determine the weighting.
In this light, we propose a novel loss function named Class-wise
Difficulty-Balanced loss, or CDB loss, which dynamically distributes weights to
each sample according to the difficulty of the class that the sample belongs
to. Note that the assigned weights dynamically change as the 'difficulty' for
the model may change with the learning progress. Extensive experiments are
conducted on both image (artificially induced class-imbalanced MNIST,
long-tailed CIFAR and ImageNet-LT) and video (EGTEA) datasets. The results show
that CDB loss consistently outperforms the recently proposed loss functions on
class-imbalanced datasets irrespective of the data type (i.e., video or image).
Authors' comments: Accepted for ACCV 2020 oral presentation
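The idea of weighting each sample by the difficulty of its class can be sketched as below. This is a minimal illustration, assuming that class difficulty is measured as one minus the model's current per-class accuracy and raised to an exponent `tau`; that difficulty measure, the exponent, and the function names are assumptions for this example, since the abstract does not give the exact formula.

```python
import math


def class_difficulty_weights(class_accuracies, tau=1.0):
    """Weight per class: (1 - accuracy) ** tau (assumed difficulty measure)."""
    return [(1.0 - acc) ** tau for acc in class_accuracies]


def weighted_cross_entropy(probabilities, labels, class_weights, eps=1e-12):
    """Mean cross-entropy where each sample is scaled by its class weight.

    `probabilities` holds one predicted distribution per sample and `labels`
    holds the integer class index of each sample.
    """
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total += -class_weights[label] * math.log(probs[label] + eps)
    return total / len(labels)
```

Recomputing the weights from fresh per-class accuracies, for example after each epoch, makes them change as training progresses, matching the dynamic behavior described in the abstract.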
Wolfgang Fuhl, Enkelejda Kasneci
Simple image rotations significantly reduce the accuracy of deep neural networks. Moreover, training with all possible rotations enlarges the data set, which also increases the training duration. In this work, we address trainable rotation-invariant convolutions as well as the construction of nets, since fully connected layers can only be rotation invariant with a one-dimensional input. We show that our approach is rotationally invariant for different models and on different public data sets. We also discuss the influence of purely rotationally invariant features on accuracy. The rotationally adaptive convolution models presented in this work are more computationally intensive than normal convolution models. Therefore, we also present a depth-wise separable approach with radial convolution. Link to CUDA code: https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/