Keyu Tian, Chen Lin, Ming Sun, Luping Zhou, Junjie Yan, Wanli Ouyang
The recent progress on automatically searching augmentation policies has
boosted the performance substantially for various tasks. A key component of
automatic augmentation search is the evaluation process for a particular
augmentation policy, which is utilized to return reward and usually runs
thousands of times. A plain evaluation process, which includes full model
training and validation, would be time-consuming. To achieve efficiency, many
choose to sacrifice evaluation reliability for speed. In this paper, we dive
into the dynamics of augmented training of the model. This inspires us to
design a powerful and efficient proxy task based on the Augmentation-Wise
Weight Sharing (AWS) to form a fast yet accurate evaluation process in an
elegant way. Comprehensive analysis verifies the superiority of this approach
in terms of effectiveness and efficiency. The augmentation policies found by
our method achieve superior accuracies compared with existing auto-augmentation
search methods. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is
currently the best performing single model without extra training data. On
ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which leads to
3.34% absolute error rate reduction over the baseline augmentation.
Authors' comments: Accepted to NeurIPS 2020 (Poster)
Behzad Ghazanfari, Fatemeh Afghah, Sixian Zhang
This paper proposes a piece-wise matching layer as a novel layer in representation learning methods for electrocardiogram (ECG) classification. Despite the remarkable performance of representation learning methods in the analysis of time series, several challenges remain, including complex model structures, the lack of generality of solutions, the need for expert knowledge, and the need for large-scale training datasets. We introduce the piece-wise matching layer, which works on two levels to address some of these challenges. At the first level, a set of morphological, statistical, and frequency features and comparative forms of them are computed based on each periodic part and its neighbors. At the second level, these features are modified by predefined transformation functions based on a receptive field scenario. Several scenarios (offline processing, incremental processing, a fixed sliding receptive field, and an event-triggered receptive field) can be implemented depending on the length of the receptive field and the mechanism that indicates it. We propose dynamic time warping as a mechanism that indicates a receptive field based on event-triggering tactics. To evaluate the method in time series analysis, we applied the proposed layer to two publicly available datasets from the PhysioNet competitions in 2015 and 2017, where the input data are ECG signals. We compared our method against a variety of well-known tuned methods based on expert knowledge, machine learning, deep learning, and combinations of them. The proposed approach improves the state of the art in the 2015 and 2017 competitions by around 4% and 7%, respectively, while it does not rely on advance knowledge of the classes or the possible locations of arrhythmia.
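The abstract above names dynamic time warping (DTW) as the mechanism that indicates a receptive field, but it does not say how the DTW score is turned into an event trigger. The following is therefore only a minimal sketch of the textbook DTW distance between two 1-D sequences; the function name `dtw_distance` and the absolute-difference local cost are assumptions for illustration, not the authors' implementation.

```python
import math


def dtw_distance(a, b):
    """Textbook dynamic-programming DTW distance between two 1-D sequences.

    Hypothetical helper for illustration; the local cost |a_i - b_j| is an
    assumption, not something stated in the abstract.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

Because DTW allows one sample to align with several, a beat that is merely stretched in time, such as `[1, 2, 3]` against `[1, 2, 2, 3]`, has distance 0, whereas a point-wise comparison of the two would not even be defined.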
Weiwei Hou, Hanna Suominen, Piotr Koniusz, Sabrina Caldwell, Tom Gedeon
Sentence compression is a Natural Language Processing (NLP) task aimed at shortening original sentences while preserving their key information. Its applications can benefit many fields; for example, one can build tools for language education. However, current methods are largely based on Recurrent Neural Network (RNN) models, which suffer from poor processing speed. To address this issue, we propose a token-wise Convolutional Neural Network, a CNN-based model used together with pre-trained Bidirectional Encoder Representations from Transformers (BERT) features, for deletion-based sentence compression. We also compare our model with RNN-based models and fine-tuned BERT. Although one of the RNN-based models marginally outperforms the other models given the same input, our CNN-based model was ten times faster than the RNN-based approach.
Hengyi Cai, Hongshen Chen, Yonghao Song, Zhuoye Ding, Yongjun Bao, Weipeng Yan, Xiaofang Zhao
Neural dialogue response generation has gained much popularity in recent years. The Maximum Likelihood Estimation (MLE) objective is widely adopted in existing dialogue model learning. However, models trained with the MLE objective are plagued by low diversity in the open-domain conversational setting. Inspired by the observation that humans not only learn from positive signals but also benefit from correcting undesirable behaviors, in this work we introduce contrastive learning into dialogue generation, where the model explicitly perceives the difference between well-chosen positive and negative utterances. Specifically, we employ a pretrained baseline model as a reference. During contrastive learning, the target dialogue model is trained to give higher conditional probabilities to the positive samples, and lower conditional probabilities to the negative samples, compared to the reference model. To manage the multi-mapping relations prevalent in human conversation, we augment contrastive dialogue learning with group-wise dual sampling. Extensive experimental results show that the proposed group-wise contrastive learning framework is suited for training a wide range of neural dialogue generation models, with very favorable performance over the baseline training approaches.
Xuyang Shen, Jo Plested, Yue Yao, Tom Gedeon
Three-dimensional face reconstruction is one of the popular applications in
computer vision. However, even state-of-the-art models still require frontal
faces as input, which restricts their usage scenarios in the wild. A similar
dilemma also happens in face recognition. New research designed to recover the
frontal face from a single side-pose facial image has emerged. The
state-of-the-art in this area is the Face-Transformation generative adversarial
network, which is based on the CycleGAN. This inspired our research which
explores the performance of two models from pixel transformation in frontal
facial synthesis, Pix2Pix and CycleGAN. We conducted experiments with five
different loss functions on Pix2Pix to improve its performance, and then
proposed a new network, Pairwise-GAN, for frontal facial synthesis.
Pairwise-GAN uses two parallel U-Nets as the generator and PatchGAN as the
discriminator. The detailed hyper-parameters are also discussed. Based on the
quantitative measurement by face similarity comparison, our results showed that
Pix2Pix with L1 loss, gradient difference loss, and identity loss yields a
2.72% improvement in average similarity compared to the default Pix2Pix
model. Additionally, the performance of Pairwise-GAN is 5.4% better than
CycleGAN and 9.1% better than Pix2Pix in average similarity.
Authors' comments: The 27th International Conference on Neural Information
Processing (ICONIP 2020)
Xu Qian, Victor Li, Crews Darren
Second-order information has proven to be very effective in determining the redundancy of neural network weights and activations. A recent paper proposes using Hessian traces of weights and activations for mixed-precision quantization and achieves state-of-the-art results. However, prior works only focus on selecting bits for each layer, while the redundancy of different channels within a layer also differs considerably. This is mainly because the complexity of determining bits for each channel is too high for the original methods. Here, we introduce Channel-wise Hessian Aware trace-Weighted Quantization (CW-HAWQ). CW-HAWQ uses the Hessian trace to determine the relative sensitivity order of different channels of activations and weights. Moreover, CW-HAWQ uses a deep reinforcement learning (DRL) agent based on Deep Deterministic Policy Gradient (DDPG) to find the optimal ratios of different quantization bits and to assign bits to channels according to the Hessian trace order. The number of states in CW-HAWQ is much smaller than in traditional AutoML-based mixed-precision methods, since we only need to search ratios for the quantization bits. Comparing CW-HAWQ with the state of the art shows that we can achieve better results for multiple networks.
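The assignment step described above, where bits are handed to channels according to the Hessian trace order once the bit-width ratios are known, can be sketched roughly as follows. This is an illustrative sketch only: the function name `assign_bits`, the rounding scheme, and the assumption that a larger trace means a more sensitive channel that receives more bits are not taken from the abstract, and the ratios are given by hand here rather than found by a DDPG agent.

```python
def assign_bits(traces, bit_ratios):
    """Assign a bit-width to each channel from a sensitivity ranking.

    traces:     per-channel Hessian trace values (assumed: larger = more sensitive)
    bit_ratios: dict {bit_width: fraction of channels}, fractions summing to 1
    Hypothetical helper; returns a list of bit-widths indexed by channel.
    """
    # Rank channels from most to least sensitive
    order = sorted(range(len(traces)), key=lambda c: traces[c], reverse=True)
    widths = sorted(bit_ratios, reverse=True)  # highest precision first
    assignment = [None] * len(traces)
    start = 0
    for k, bits in enumerate(widths):
        if k == len(widths) - 1:
            end = len(order)  # lowest precision absorbs any rounding remainder
        else:
            end = min(len(order), start + round(bit_ratios[bits] * len(order)))
        for channel in order[start:end]:
            assignment[channel] = bits
        start = end
    return assignment
```

With this formulation the search variable is only the small vector of ratios, which matches the abstract's remark that the state space is much smaller than searching a bit-width per channel directly.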
Han Liu, Caixia Yuan, Xiaojie Wang
A major challenge of multi-label text classification (MLTC) is to
simultaneously exploit possible label differences and label correlations. In
this paper, we tackle this challenge by developing the Label-Wise Pre-Training
(LW-PT) method to get a document representation with label-aware information.
The basic idea is that a multi-label document can be represented as a
combination of multiple label-wise representations, and that correlated labels
always co-occur in the same or similar documents. LW-PT implements this idea by
constructing label-wise document classification tasks and training label-wise
document encoders. Finally, the pre-trained label-wise encoder is fine-tuned
with the downstream MLTC task. Extensive experimental results validate that the
proposed method has significant advantages over the previous state-of-the-art
models and is able to discover reasonable label relationships. The code is
released to facilitate other researchers.
Authors' comments: Accepted to NLPCC 2020
Zhanghan Ke, Di Qiu, Kaican Li, Qiong Yan, Rynson W. H. Lau
We investigate the generalization of semi-supervised learning (SSL) to
diverse pixel-wise tasks. Although SSL methods have achieved impressive results
in image classification, their performance on pixel-wise tasks
is unsatisfactory due to these tasks' need for dense outputs. In addition, existing
pixel-wise SSL approaches are only suitable for certain tasks as they usually
require task-specific properties. In this paper, we present a new SSL
framework, named Guided Collaborative Training (GCT), for pixel-wise tasks,
with two main technical contributions. First, GCT addresses the issues caused
by the dense outputs through a novel flaw detector. Second, the modules in GCT
learn from unlabeled data collaboratively through two newly proposed
constraints that are independent of task-specific properties. As a result, GCT
can be applied to a wide range of pixel-wise tasks without structural
adaptation. Our extensive experiments on four challenging vision tasks,
including semantic segmentation, real image denoising, portrait image matting,
and night image enhancement, show that GCT outperforms state-of-the-art SSL
methods by a large margin. Our code is available at:
https://github.com/ZHKKKe/PixelSSL.
Authors' comments: 16th European Conference on Computer Vision (ECCV 2020)
Myoungha Song, Jeongho Lee, Donghwan Kim
6D pose estimation refers to object recognition and estimation of 3D rotation
and 3D translation. The key to estimating 6D pose is extracting enough
features to find the pose in any environment. Previous
methods utilized depth information in the refinement process or were designed
as a heterogeneous architecture for each data space to extract features.
However, these methods are limited in that they cannot extract sufficient
features. Therefore, this paper proposes a Point Attention Module that can
efficiently extract powerful features from RGB-D. In our module, the attention map
is formed through a Geometric Attention Path (GAP) and a Channel Attention
Path (CAP). GAP is designed to pay attention to important information in the
geometric information, and CAP is designed to pay attention to important
information in the channel information. We show that the attention module
efficiently creates feature representations without significantly increasing
computational complexity. Experimental results show that the proposed method
outperforms the existing methods in benchmarks, YCB Video and LineMod. In
addition, the attention module was applied to the classification task, and it
was confirmed that the performance significantly improved compared to the
existing model.
Authors' comments: 11 pages, 5 figures
Dikshant Sagar, Jatin Garg, Prarthana Kansal, Sejal Bhalla, Rajiv Ratn Shah, Yi Yu
Fashion is an important part of human experience. Events such as interviews,
meetings, marriages, etc. are often based on clothing styles. The rise in the
fashion industry and its effect on social influencing have made outfit
compatibility a need. Thus, it necessitates an outfit compatibility model to
aid people in clothing recommendation. However, due to the highly subjective
nature of compatibility, it is necessary to account for personalization. Our
paper devises an attribute-wise interpretable compatibility scheme with
personal preference modelling which captures user-item interaction along with
general item-item interaction. Our work solves the problem of interpretability
in clothing matching by locating the discordant and harmonious attributes
between fashion items. Extensive experiment results on IQON3000, a publicly
available real-world dataset, verify the effectiveness of the proposed model.
Authors' comments: 10 pages, 5 figures, to be published in IEEE BigMM, 2020
Siniša Družeta, Stefan Ivić
Throughout the course of the development of Particle Swarm Optimization,
particle inertia has been established as an important aspect of the method for
researching possible method improvements. As a continuation of our previous
research, we propose a novel generalized technique of inertia weight adaptation
based on individual particle's fitness improvement, called anakatabatic
inertia. This technique allows for adapting inertia weight value for each
particle corresponding to the particle's increasing or decreasing fitness, i.e.
conditioned by particle's ascending (anabatic) or descending (katabatic)
movement. The proposed inertia weight control framework was metaoptimized and
tested on the 30 test functions of the CEC 2014 test suite. The conducted
procedure produced four anakatabatic models, two for each of the PSO methods
used (Standard PSO and TVAC-PSO). The benchmark testing results show that using
the proposed anakatabatic inertia models reliably yields moderate improvements
in accuracy of Standard PSO (final fitness minimum reduced up to 0.09 orders of
magnitude) and rather strong improvements for TVAC-PSO (final fitness minimum
reduced up to 0.59 orders of magnitude), mostly without any adverse effects on
the method's performance.
Authors' comments: 6 pages, 5 figures, 2 tables. arXiv admin note: substantial text
overlap with arXiv:1906.02474
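To make the idea in the abstract above concrete, the sketch below picks an inertia weight per particle depending on whether its fitness value went down or up in the last step, and plugs it into the standard PSO velocity update. The two-value rule, the weights 0.5 and 0.9, and the names `inertia_for` and `update_velocity` are placeholders chosen for illustration; the abstract does not give the actual anakatabatic models, which were obtained by metaoptimization.

```python
import random


def inertia_for(prev_fitness, curr_fitness, w_descending=0.5, w_ascending=0.9):
    """Pick an inertia weight from the direction of the particle's fitness change.

    Hypothetical two-value rule: the weights are placeholders, not the
    metaoptimized models reported in the paper.
    """
    return w_descending if curr_fitness < prev_fitness else w_ascending


def update_velocity(v, x, pbest, gbest, w, c1=1.5, c2=1.5, rng=random):
    """Standard PSO velocity update using a per-particle inertia weight w."""
    return [
        w * vi + c1 * rng.random() * (pb - xi) + c2 * rng.random() * (gb - xi)
        for vi, xi, pb, gb in zip(v, x, pbest, gbest)
    ]
```

In a full optimizer, `inertia_for` would be called once per particle per iteration, so that particles whose fitness is currently rising and those whose fitness is falling move with different momentum.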
Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu
A novel framework for meeting transcription using asynchronous microphones is
proposed in this paper. It consists of audio synchronization, speaker
diarization, utterance-wise speech enhancement using guided source separation,
automatic speech recognition, and duplication reduction. Doing speaker
diarization before speech enhancement enables the system to deal with
overlapped speech without considering sampling frequency mismatch between
microphones. Evaluation on our real meeting datasets showed that our framework
achieved a character error rate (CER) of 28.7% by using 11 distributed
microphones, while a monaural microphone placed at the center of the table had
a CER of 38.2%. We also showed that our framework achieved a CER of 21.8%,
which is only 2.1 percentage points higher than the CER in headset
microphone-based transcription.
Authors' comments: Accepted to INTERSPEECH 2020
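The results above are reported as character error rate (CER). As a reminder of how this metric is usually computed, here is a minimal sketch using the Levenshtein edit distance divided by the reference length; the function names are illustrative, and any text normalization the authors applied before scoring is not described in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.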
Wenshuang Liu, Wenting Chen, Linlin Shen
Though techniques based on GANs (Generative Adversarial Networks) have greatly
advanced the performance of image synthesis and face translation, only a few
works in the literature provide region-based style encoding and
translation. In this paper, we propose a region-wise normalization framework
for region-level face translation. While per-region style is encoded using an
available approach, we build a so-called RIN (region-wise normalization) block
to individually inject the styles into per-region feature maps and then fuse
them for the following convolution and upsampling. Both shape and texture of
different regions can thus be translated to various target styles. A region
matching loss has also been proposed to significantly reduce the interference
between regions during the translation process. Extensive experiments on three
publicly available datasets, i.e. Morph, RaFD and CelebAMask-HQ, suggest that
our approach demonstrates a large improvement over state-of-the-art methods like
StarGAN, SEAN and FUNIT. Our approach has further advantages in precise control
of the regions to be translated. As a result, region-level expression changes
and step-by-step makeup can be achieved. The video demo is available at
https://youtu.be/ceRqsbzXAfk.
Authors' comments: 13 pages, 13 figures
Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, Junjie Yan
Recent works have made great progress in semantic segmentation by exploiting
contextual information in a local or global manner with dilated convolutions,
pyramid pooling or self-attention mechanisms. In order to avoid the potentially
misleading contextual information aggregation of previous works, we propose a
class-wise dynamic graph convolution (CDGC) module to adaptively propagate
information. The graph reasoning is performed among pixels in the same class.
Based on the proposed CDGC module, we further introduce the Class-wise Dynamic
Graph Convolution Network (CDGCNet), which consists of two main parts including
the CDGC module and a basic segmentation network, forming a coarse-to-fine
paradigm. Specifically, the CDGC module takes the coarse segmentation result as
class mask to extract node features for graph construction and performs dynamic
graph convolutions on the constructed graph to learn the feature aggregation
and weight allocation. Then the refined feature and the original feature are
fused to get the final prediction. We conduct extensive experiments on three
popular semantic segmentation benchmarks including Cityscapes, PASCAL VOC 2012
and COCO Stuff, and achieve state-of-the-art performance on all three
benchmarks.
Authors' comments: Accepted by ECCV2020
David Minnen, Saurabh Singh
In learning-based approaches to image compression, codecs are developed by
optimizing a computational model to minimize a rate-distortion objective.
Currently, the most effective learned image codecs take the form of an
entropy-constrained autoencoder with an entropy model that uses both forward
and backward adaptation. Forward adaptation makes use of side information and
can be efficiently integrated into a deep neural network. In contrast, backward
adaptation typically makes predictions based on the causal context of each
symbol, which requires serial processing that prevents efficient GPU / TPU
utilization. We introduce two enhancements, channel-conditioning and latent
residual prediction, that lead to network architectures with better
rate-distortion performance than existing context-adaptive models while
minimizing serial processing. Empirically, we see an average rate savings of
6.7% on the Kodak image set and 11.4% on the Tecnick image set compared to a
context-adaptive baseline model. At low bit rates, where the improvements are
most effective, our model saves up to 18% over the baseline and outperforms
hand-engineered codecs like BPG by up to 25%.
Authors' comments: Published at the IEEE International Conference on Image Processing
(ICIP) 2020
Yunxiao Qin, Weiguo Zhang, Zezheng Wang, Chenxu Zhao, Jingping Shi
Few-shot image classification (FSIC), which requires a model to recognize new categories by learning from a few images of these categories, has attracted a lot of attention. Recently, meta-learning based methods have been shown to be a promising direction for FSIC. Commonly, they train a meta-learner (meta-learning model) to learn easy-to-fine-tune weights, and when solving an FSIC task, the meta-learner efficiently fine-tunes itself into a task-specific model by updating itself on the few images of the task. In this paper, we propose a novel meta-learning based layer-wise adaptive updating (LWAU) method for FSIC. LWAU is inspired by an interesting finding that, compared with common deep models, the meta-learner pays much more attention to updating its top layer when learning from few images. Based on this finding, we assume that the meta-learner may greatly prefer updating its top layer to updating its bottom layers for better FSIC performance. Therefore, in LWAU, the meta-learner is trained to learn not only the easy-to-fine-tune model but also its preferred layer-wise adaptive updating rule to improve its learning efficiency. Extensive experiments show that, with the layer-wise adaptive updating rule, the proposed LWAU: 1) outperforms existing few-shot classification methods by a clear margin; 2) learns from few images at least 5 times more efficiently than existing meta-learners when solving FSIC.
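A layer-wise updating rule of the kind described above can be pictured as a gradient step in which each layer has its own step size. The sketch below shows only that inner update with hand-set rates; in LWAU the rule is learned by the meta-learner, and the function name `layerwise_update` and the dictionary layout are assumptions made for illustration.

```python
def layerwise_update(params, grads, layer_lrs):
    """One gradient step with a separate learning rate per layer.

    params, grads: dict layer_name -> list of floats
    layer_lrs:     dict layer_name -> float (hand-set here; learned in LWAU)
    Hypothetical helper; returns a new dict and leaves the inputs unchanged.
    """
    return {
        name: [w - layer_lrs[name] * g for w, g in zip(params[name], grads[name])]
        for name in params
    }
```

Setting a large rate for the top layer and rates near zero for the bottom layers reproduces the preference the abstract reports, where adaptation to a new task happens mostly in the top layer.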
Lianli Gao, Qilong Zhang, Jingkuan Song, Xianglong Liu, Heng Tao Shen
By adding human-imperceptible noise to clean images, the resultant
adversarial examples can fool other unknown models. Features of a pixel
extracted by deep neural networks (DNNs) are influenced by its surrounding
regions, and different DNNs generally focus on different discriminative regions
in recognition. Motivated by this, we propose a patch-wise iterative algorithm
-- a black-box attack towards mainstream normally trained and defense models,
which differs from the existing attack methods manipulating pixel-wise noise.
In this way, without sacrificing the performance of white-box attack, our
adversarial examples can have strong transferability. Specifically, we
introduce an amplification factor to the step size in each iteration, and one
pixel's overall gradient overflowing the $\epsilon$-constraint is properly
assigned to its surrounding regions by a project kernel. Our method can be
generally integrated into any gradient-based attack method. Compared with the
current state-of-the-art attacks, we significantly improve the success rate by
9.2% for defense models and 3.7% for normally trained models on average. Our
code is available at
https://github.com/qilong-zhang/Patch-wise-iterative-attack.
Authors' comments: Accepted by ECCV 2020
Frédéric Bernicot, Polona Durcik
We prove $L^p$ estimates for various multi-parameter bi- and trilinear
operators with symbols acting on fibers of the two-dimensional functions. In
particular, this yields estimates for the general bi-parameter form of the
twisted paraproduct studied in arXiv:1011.6140.
Authors' comments: 26 pages
Yawei Li, Wen Li, Martin Danelljan, Kai Zhang, Shuhang Gu, Luc Van Gool, Radu Timofte
In this paper, we tackle the problem of convolutional neural network design.
Instead of focusing on the design of the overall architecture, we investigate a
design space that is usually overlooked, i.e. adjusting the channel
configurations of predefined networks. We find that this adjustment can be
achieved by shrinking widened baseline networks and leads to superior
performance. Based on that, we articulate the heterogeneity hypothesis: with
the same training protocol, there exists a layer-wise differentiated network
architecture (LW-DNA) that can outperform the original network with regular
channel configurations but with a lower level of model complexity.
The LW-DNA models are identified without extra computational cost or training
time compared with the original network. This constraint leads to controlled
experiments which direct the focus to the importance of layer-wise specific
channel configurations. LW-DNA models come with advantages related to
overfitting, i.e. the relative relationship between model complexity and
dataset size. Experiments are conducted on various networks and datasets for
image classification, visual tracking and image restoration. The resultant
LW-DNA models consistently outperform the baseline models. Code is available at
https://github.com/ofsoundof/Heterogeneity_Hypothesis.
Authors' comments: CVPR2021 paper
Tsubasa Takahashi, Shun Takagi, Hajime Ono, Tatsuya Komatsu
This paper studies how to learn variational autoencoders with a variety of
divergences under differential privacy constraints. We often build a VAE with
an appropriate prior distribution to describe the desired properties of the
learned representations and introduce a divergence as a regularization term to
bring the representations close to the prior. Using differentially private SGD
(DP-SGD), which randomizes a stochastic gradient by injecting dedicated noise
designed according to the gradient's sensitivity, we can easily build a
differentially private model. However, we reveal that attaching several
divergences increases the sensitivity from O(1) to O(B) in terms of the batch size
B. This results in injecting a vast amount of noise that makes it hard to
learn. To solve the above issue, we propose term-wise DP-SGD that crafts
randomized gradients in two different ways tailored to the compositions of the
loss terms. The term-wise DP-SGD keeps the sensitivity at O(1) even when
attaching the divergence. We can therefore reduce the amount of noise. In our
experiments, we demonstrate that our method works well with two pairs of the
prior distribution and the divergence.
Authors' comments: 10 pages
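For context on the sensitivity argument above, the following is a minimal sketch of the standard DP-SGD gradient step: clip each per-example gradient, sum, add Gaussian noise scaled to the clipping norm, and average. It is not the term-wise variant proposed in the paper, and the names `clip`, `dp_sgd_gradient`, `max_norm`, and `noise_multiplier` are illustrative choices.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Standard DP-SGD step: clip per example, sum, add Gaussian noise, average.

    Hypothetical helper; rng is a random.Random instance supplying the noise.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[k] for g in clipped) for k in range(dim)]
    # Noise is calibrated to the per-example clipping bound
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]
```

Clipping bounds each example's contribution to the summed gradient by `max_norm`, which is what keeps the sensitivity independent of the batch size. A loss term that couples all examples in a batch, as the abstract says some divergences do, cannot be clipped example by example in this way, which is consistent with the O(B) sensitivity the authors report.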