Philipp Benz, Chaoning Zhang, Adil Karjauv, In So Kweon
Despite their overwhelming success on a wide range of applications,
convolutional neural networks (CNNs) are widely recognized to be vulnerable to
adversarial examples. This intriguing phenomenon led to a competition between
adversarial attacks and defense techniques. So far, adversarial training is the
most widely used method for defending against adversarial attacks. It has also
been extended to defend against universal adversarial perturbations (UAPs). The
SOTA universal adversarial training (UAT) method optimizes a single
perturbation for all training samples in the mini-batch. In this work, we find
that a UAP does not attack all classes equally. Inspired by this observation,
we identify it as the source of the model having unbalanced robustness. To this
end, we improve the SOTA UAT by proposing to utilize class-wise UAPs during
adversarial training. On multiple benchmark datasets, our class-wise UAT leads
to superior performance for both clean accuracy and adversarial robustness against
universal attack.
Authors' comments: Accepted to ICME 2021
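The difference between one shared perturbation and class-wise perturbations can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `update_classwise_uaps`, the sign-gradient ascent step, and the clipping to an `eps` ball are all assumptions, and the per-sample input gradients are taken as given rather than computed from a network.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def update_classwise_uaps(uaps, grads, labels, step, eps):
    """Update one universal perturbation per class (hypothetical sketch).

    uaps   : dict mapping class label -> perturbation (list of floats)
    grads  : per-sample loss gradients w.r.t. the input (list of lists)
    labels : class label of each sample in the mini-batch
    """
    # Group the gradients of the mini-batch by ground-truth class.
    by_class = {}
    for g, y in zip(grads, labels):
        by_class.setdefault(y, []).append(g)

    for y, class_grads in by_class.items():
        # Average gradient over the samples of this class only.
        mean_grad = [sum(col) / len(class_grads) for col in zip(*class_grads)]
        # Ascent step on the loss, then clip back into [-eps, eps].
        uaps[y] = [
            max(-eps, min(eps, d + step * sign(m)))
            for d, m in zip(uaps[y], mean_grad)
        ]
    return uaps
```

A single-perturbation variant would average the gradients over the whole mini-batch instead of per class; here each class keeps its own perturbation, which is applied to the training samples carrying that label.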
Themos Stafylakis, Johan Rohdin, Lukas Burget
Speaker embeddings extracted with deep 2D convolutional neural networks are
typically modeled as projections of first and second order statistics of
channel-frequency pairs onto a linear layer, using either average or attentive
pooling along the time axis. In this paper we examine an alternative pooling
method, where pairwise correlations between channels for given frequencies are
used as statistics. The method is inspired by style-transfer methods in
computer vision, where the style of an image, modeled by the matrix of
channel-wise correlations, is transferred to another image, in order to produce
a new image having the style of the first and the content of the second. By
drawing analogies between image style and speaker characteristics, and between
image content and phonetic sequence, we explore the use of such channel-wise
correlation features to train a ResNet architecture in an end-to-end fashion.
Our experiments on VoxCeleb demonstrate the effectiveness of the proposed
pooling method in speaker recognition.
Authors' comments: Accepted at Interspeech 2021
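The pooling idea can be sketched as follows, assuming a feature tensor indexed as `feats[channel][frequency][time]` and Pearson correlation computed along the time axis. The function name `channel_correlation_pooling`, the tensor layout, and the choice to keep only the pairs of distinct channels are illustrative assumptions, not the authors' implementation.

```python
import math


def channel_correlation_pooling(feats, eps=1e-8):
    """Pool over time by computing channel-pair correlations per frequency.

    feats[c][f][t] -> list over f of correlations for channel pairs (i < j).
    """
    n_ch = len(feats)
    n_freq = len(feats[0])
    pooled = []
    for f in range(n_freq):
        # Center and length-normalize each channel's time series.
        normed = []
        for c in range(n_ch):
            series = feats[c][f]
            mean = sum(series) / len(series)
            centered = [v - mean for v in series]
            norm = math.sqrt(sum(v * v for v in centered)) + eps
            normed.append([v / norm for v in centered])
        # Correlation of every pair of distinct channels at this frequency.
        corr = []
        for i in range(n_ch):
            for j in range(i + 1, n_ch):
                corr.append(sum(a * b for a, b in zip(normed[i], normed[j])))
        pooled.append(corr)
    return pooled
```

With C channels and F frequency bins this yields F * C * (C - 1) / 2 statistics, independent of the utterance length, which could then be flattened and projected by a linear layer.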
Junghyup Lee, Dohyung Kim, Bumsub Ham
Network quantization aims at reducing bit-widths of weights and/or
activations, particularly important for implementing deep neural networks with
limited hardware resources. Most methods use the straight-through estimator
(STE) to train quantized networks, which avoids a zero-gradient problem by
replacing a derivative of a discretizer (i.e., a round function) with that of
an identity function. Although quantized networks exploiting the STE have shown
decent performance, the STE is sub-optimal in that it simply propagates the
same gradient without considering discretization errors between inputs and
outputs of the discretizer. In this paper, we propose an element-wise gradient
scaling (EWGS), a simple yet effective alternative to the STE, training a
quantized network better than the STE in terms of stability and accuracy. Given
a gradient of the discretizer output, EWGS adaptively scales up or down each
gradient element, and uses the scaled gradient as the one for the discretizer
input to train quantized networks via backpropagation. The scaling is performed
depending on both the sign of each gradient element and an error between the
continuous input and discrete output of the discretizer. We adjust a scaling
factor adaptively using Hessian information of a network. We show extensive
experimental results on the image classification datasets, including CIFAR-10
and ImageNet, with diverse network architectures under a wide range of
bit-width settings, demonstrating the effectiveness of our method.
Authors' comments: Accepted to CVPR 2021
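The contrast between the STE and element-wise gradient scaling can be sketched as follows. The scaling rule used here, g_in = g_out * (1 + delta * sign(g_out) * (x - x_q)), is one form consistent with the description above (it depends on the gradient sign and on the discretization error); the function names and the fixed `delta` are assumptions, whereas the abstract states that the scaling factor is adjusted using Hessian information.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def ste_backward(grad_out):
    """Straight-through estimator: pass the gradient through unchanged."""
    return list(grad_out)


def ewgs_backward(grad_out, x, x_q, delta=0.2):
    """Element-wise gradient scaling (sketch under the assumed rule).

    grad_out : gradient w.r.t. the discretizer output
    x        : continuous discretizer input
    x_q      : discrete discretizer output
    delta    : non-negative scaling factor (fixed here for illustration)
    """
    return [
        g * (1.0 + delta * sign(g) * (xi - qi))
        for g, xi, qi in zip(grad_out, x, x_q)
    ]


# Usage: round the inputs, then compare both backward rules.
x = [0.4, 1.6]
x_q = [float(round(v)) for v in x]
grads_ste = ste_backward([1.0, 1.0])
grads_ewgs = ewgs_backward([1.0, 1.0], x, x_q)
```

Setting `delta` to zero recovers the STE, so the sketch makes explicit that the two rules differ only through the error term `x - x_q`.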
Lianli Gao, Qilong Zhang, Jingkuan Song, Heng Tao Shen
Although great progress has been made on adversarial attacks for deep neural networks (DNNs), their transferability is still unsatisfactory, especially for targeted attacks. Two underlying problems have long been overlooked: 1) the conventional setting of $T$ iterations with a step size of $\epsilon/T$ to comply with the $\epsilon$-constraint, in which most pixels are allowed to add only very small noise, much less than $\epsilon$; and 2) the usual practice of manipulating pixel-wise noise, even though the features of a pixel extracted by DNNs are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. To tackle these issues, our previous work proposes a patch-wise iterative method (PIM) aimed at crafting adversarial examples with high transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. But targeted attacks aim to push the adversarial examples into the territory of a specific class, and the amplification factor may lead to underfitting. Thus, we introduce the temperature and propose a patch-wise++ iterative method (PIM++) to further improve transferability without significantly sacrificing the performance of the white-box attack. Our method can be generally integrated into any gradient-based attack method. Compared with the current state-of-the-art attack methods, we significantly improve the success rate by 33.1\% for defense models and 31.4\% for normally trained models on average.
Kartheek Akella, Sai Himal Allu, Sridhar Suresh Ragupathi, Aman Singhal, Zeeshan Khan, Vinay P. Namboodiri, C V Jawahar
In this paper, we address the task of improving pair-wise machine translation
for specific low resource Indian languages. Multilingual NMT models have
demonstrated a reasonable amount of effectiveness on resource-poor languages.
In this work, we show that the performance of these models can be significantly
improved upon by using back-translation through a filtered back-translation
process and subsequent fine-tuning on the limited pair-wise language corpora.
The analysis in this paper suggests that this method can significantly improve
a multilingual model's performance over its baseline, yielding state-of-the-art
results for various Indian languages.
Authors' comments: ICON 2020 Short paper
Evan Petrosky, Hsiang-Chih Hwang, Nadia L. Zakamska, Vedant Chandra, Matthew J. Hill
The time-series component of WISE is a valuable resource for the study of
variable objects. We present an analysis of an all-sky sample of ~450,000
AllWISE+NEOWISE infrared light curves of likely variables identified in
AllWISE. By computing periodograms of all these sources, we identify ~56,000
periodic variables. Of these, ~42,000 are short-period (P<1 day), near-contact
or contact eclipsing binaries, many of which are on the main sequence. We use
the periodic and aperiodic variables to test computationally inexpensive
methods of periodic variable classification and identification, utilizing
various measures of the probability distribution function of fluxes and of
timescales of variability. The combination of variability measures from our
periodogram and non-parametric analyses with infrared colors from WISE and
absolute magnitudes, colors and variability amplitude from Gaia is useful for
the identification and classification of periodic variables. Furthermore, we
show that the effectiveness of non-parametric methods for the identification of
periodic variables is comparable to that of the periodogram but at a much lower
computational cost. Future surveys can utilize these methods to accelerate more
traditional time-series analyses and to identify evolving sources missed by
periodogram-based selections.
Authors' comments: 23 pages, 15 figures. Accepted for Publication in MNRAS. The full
catalog of WISE variables, periodic variables, and binaries is available at
https://zakamska.johnshopkins.edu/data.htm
Mingbao Lin, Rongrong Ji, Xiaoshuai Sun, Baochang Zhang, Feiyue Huang, Yonghong Tian, Dacheng Tao
Online image hashing has received increasing research attention recently,
which processes large-scale data in a streaming fashion to update the hash
functions on-the-fly. To this end, most existing works exploit this problem
under a supervised setting, i.e., using class labels to boost the hashing
performance, which suffers from the defects in both adaptivity and efficiency:
First, large amounts of training batches are required to learn up-to-date hash
functions, which leads to poor online adaptivity. Second, the training is
time-consuming, which contradicts the core need of online learning. In
this paper, a novel supervised online hashing scheme, termed Fast Class-wise
Updating for Online Hashing (FCOH), is proposed to address the above two
challenges by introducing a novel and efficient inner product operation. To
achieve fast online adaptivity, a class-wise updating method is developed to
decompose the binary code learning and alternately renew the hash functions
in a class-wise fashion, which well addresses the burden on large amounts of
training batches. Quantitatively, such a decomposition further leads to at
least 75% storage saving. To further achieve online efficiency, we propose a
semi-relaxation optimization, which accelerates the online training by treating
different binary constraints independently. Without additional constraints and
variables, the time complexity is significantly reduced. Such a scheme is also
quantitatively shown to well preserve past information while updating the hash
functions. We have quantitatively demonstrated that the collective effort of
class-wise updating and semi-relaxation optimization provides superior
performance compared to various state-of-the-art methods, which is verified
through extensive experiments on three widely-used datasets.
Authors' comments: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI)
Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, Chunhua Shen
Knowledge distillation (KD) has been proven to be a simple and effective tool
for training compact models. Almost all KD variants for dense prediction tasks
align the student and teacher networks' feature maps in the spatial domain,
typically by minimizing point-wise and/or pair-wise discrepancy. Observing that
in semantic segmentation, some layers' feature activations of each channel tend
to encode saliency of scene categories (analogous to class activation mapping),
we propose to align features channel-wise between the student and teacher
networks. To this end, we first transform the feature map of each channel into
a probability map using softmax normalization, and then minimize the
Kullback-Leibler (KL) divergence of the corresponding channels of the two
networks. By doing so, our method focuses on mimicking the soft distributions
of channels between networks. In particular, the KL divergence enables learning
to pay more attention to the most salient regions of the channel-wise maps,
presumably corresponding to the most useful signals for semantic segmentation.
Experiments demonstrate that our channel-wise distillation outperforms almost
all existing spatial distillation methods for semantic segmentation
considerably, and requires less computational cost during training. We
consistently achieve superior performance on three benchmarks with various
network structures. Code is available at: https://git.io/Distiller
Authors' comments: Accepted to Proc. Int. Conf. Computer Vision (ICCV) 2021. Code is
available at: https://git.io/Distiller
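The channel-wise alignment described above can be sketched as follows: each channel's activations are normalized over spatial locations with a softmax, and the KL divergence between the teacher's and the student's per-channel distributions is averaged over channels. The function names, the `temperature` parameter, the flattened `[channel][location]` layout, and the KL direction (teacher as reference) are assumptions made for illustration.

```python
import math


def log_softmax(values, temperature=1.0):
    """Numerically stable log-softmax over one channel's spatial locations."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]


def channel_wise_kd_loss(teacher, student, temperature=1.0):
    """Mean over channels of KL(teacher || student) on spatial softmax maps.

    teacher, student : feature maps as [channel][flattened spatial location]
    """
    total = 0.0
    for t_ch, s_ch in zip(teacher, student):
        log_p = log_softmax(t_ch, temperature)  # teacher distribution
        log_q = log_softmax(s_ch, temperature)  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher)
```

Because each term is weighted by the teacher's probability, locations where the teacher's channel map is most salient dominate the loss, which matches the motivation given in the abstract.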
Jason O'Neill, Jacques Verstraëte
For integers $2 \leq t \leq k$, we consider a collection of $k$ set families $\mathcal{A}_j: 1 \leq j \leq k$ where $\mathcal{A}_j = \{ A_{j,i} \subseteq [n] : 1 \leq i \leq m \}$ and $|A_{1, i_1} \cap \cdots \cap A_{k,i_k}|$ is even if and only if at least $t$ of the $i_j$ are distinct. In this paper, we prove that $m =O(n^{ 1/ \lfloor k/2 \rfloor})$ when $t=k$ and $m = O( n^{1/(t-1)})$ when $2t-2 \leq k$ and prove that both of these bounds are best possible. Specializing to the case where $\mathcal{A} = \mathcal{A}_1 = \cdots = \mathcal{A}_k$, we recover a variation of the classical oddtown problem.
Maxwell Horton, Yanzi Jin, Ali Farhadi, Mohammad Rastegari
We present a computationally efficient method for compressing a trained neural network without using real data. We break the problem of data-free network compression into independent layer-wise compressions. We show how to efficiently generate layer-wise training data using only a pretrained network. We use this data to perform independent layer-wise compressions on the pretrained network. We also show how to precondition the network to improve the accuracy of our layer-wise compression method. We present results for layer-wise compression using quantization and pruning. When quantizing, we compress with higher accuracy than related works while using orders of magnitude less compute. When compressing MobileNetV2 and evaluating on ImageNet, our method outperforms existing methods for quantization at all bit-widths, achieving a $+0.34\%$ improvement in $8$-bit quantization, and a stronger improvement at lower bit-widths (up to a $+28.50\%$ improvement at $5$ bits). When pruning, we outperform baselines of a similar compute envelope, achieving $1.5$ times the sparsity rate at the same accuracy. We also show how to combine our efficient method with high-compute generative methods to improve upon their results.
Mingyang Zhang, Xinyi Yu, Jingtao Rong, Linlin Ou
Automated Machine Learning (Auto-ML) pruning methods aim to search for a pruning strategy automatically to reduce the computational complexity of deep Convolutional Neural Networks (deep CNNs). However, some previous work found that the results of many Auto-ML pruning methods cannot even surpass those of the uniform pruning method. In this paper, we show that the ineffectiveness of Auto-ML pruning is caused by unfull and unfair training of the supernet. A deep supernet suffers from unfull training because it contains too many candidates. To overcome the unfull training, a stage-wise pruning (SWP) method is proposed, which splits a deep supernet into several stage-wise supernets to reduce the candidate number and utilizes inplace distillation to supervise the stage training. Besides, a wide supernet is hit by unfair training since the sampling probability of each channel is unequal. Therefore, the fullnet and the tinynet are sampled in each training iteration to ensure each channel can be overtrained. Remarkably, the proxy performance of the subnets trained with SWP is closer to the actual performance than that of most of the previous Auto-ML pruning work. Experiments show that SWP achieves the state-of-the-art on both CIFAR-10 and ImageNet under the mobile setting.
Ruisong Zhang, Weize Quan, Baoyuan Wu, Zhifeng Li, Dong-Ming Yan
Recent GAN-based image inpainting approaches adopt an average strategy to
discriminate the generated image and output a scalar, which inevitably loses the
position information of visual artifacts. Moreover, the adversarial loss and
reconstruction loss (e.g., l1 loss) are combined with tradeoff weights, which
are also difficult to tune. In this paper, we propose a novel detection-based
generative framework for image inpainting, which adopts the min-max strategy in
an adversarial process. The generator follows an encoder-decoder architecture
to fill the missing regions, and the detector using weakly supervised learning
localizes the position of artifacts in a pixel-wise manner. Such position
information makes the generator pay attention to artifacts and further enhance
them. More importantly, we explicitly insert the output of the detector into
the reconstruction loss with a weighting criterion, which balances the weight
of the adversarial loss and reconstruction loss automatically rather than
manually. Experiments on multiple public datasets show the superior
performance of the proposed framework. The source code is available at
https://github.com/Evergrow/GDN_Inpainting.
Authors' comments: 12 pages, 9 figures, accepted by Computer Graphics Forum,
supplementary material link:
https://evergrow.github.io/GDN_Inpainting_files/GDN_Inpainting_Supplement.pdf
Tycho F. A. van der Ouderaa, Ivana Išgum, Wouter B. Veldhuis, Bob D. de Vos
Deep neural networks are increasingly used for pair-wise image registration. We propose to extend current learning-based image registration to allow simultaneous registration of multiple images. To achieve this, we build upon the pair-wise variational and diffeomorphic VoxelMorph approach and present a general mathematical framework that enables both registration of multiple images to their geodesic average and registration in which any of the available images can be used as a fixed image. In addition, we provide a likelihood based on normalized mutual information, a well-known image similarity metric in registration, between multiple images, and a prior that allows for explicit control over the viscous fluid energy to effectively regularize deformations. We trained and evaluated our approach using intra-patient registration of breast MRI and thoracic 4DCT exams acquired over multiple time points. Comparison with Elastix and VoxelMorph demonstrates competitive quantitative performance of the proposed method in terms of image similarity and reference landmark distances, with significantly faster registration.
Hang Yang, Shan Jiang, Xinge Zhu, Mingyang Huang, Zhiqiang Shen, Chunxiao Liu, Jianping Shi
Generic object detection has been immensely promoted by the development of
deep convolutional neural networks in the past decade. However, in the domain
shift circumstance, the changes in weather, illumination, etc., often cause
domain gap, and thus performance drops substantially when detecting objects
from one domain to another. Existing methods on this task usually focus
on high-level alignment based on the whole image or object of
interest, which, naturally, cannot fully utilize the fine-grained channel
information. In this paper, we realize adaptation from a thoroughly different
perspective, i.e., channel-wise alignment. Motivated by the finding that each
channel focuses on a specific pattern (e.g., on special semantic regions, such
as car), we aim to align the distribution of source and target domain on the
channel level, which is finer for integration between discrepant domains. Our
method mainly consists of self channel-wise and cross channel-wise alignment.
These two parts explore the inner-relation and cross-relation of attention
regions implicitly from the view of channels. Furthermore, we also propose an
RPN domain classifier module to obtain a domain-invariant RPN network.
Extensive experiments show that the proposed method performs notably better
than existing methods with about 5% improvement under various domain-shift
settings. Experiments on different tasks (e.g. instance segmentation) also
demonstrate its good scalability.
Authors' comments: First two authors contributed equally
Qi Wang, Junyu Gao, Wei Lin, Yuan Yuan
Crowd analysis via computer vision techniques is an important topic in the
field of video surveillance, which has wide-spread applications including crowd
monitoring, public safety, space design and so on. Pixel-wise crowd
understanding is the most fundamental task in crowd analysis because of its
finer results for video sequences or still images than other analysis tasks.
Unfortunately, pixel-level understanding needs a large amount of labeled
training data. Annotating such data is expensive, so current
crowd datasets are small. As a result, most algorithms suffer from over-fitting
to varying degrees. In this paper, taking crowd counting and segmentation as
examples of pixel-wise crowd understanding, we attempt to remedy these
problems from two aspects, namely data and methodology. Firstly, we develop a
free data collector and labeler to generate synthetic and labeled crowd scenes
in a computer game, Grand Theft Auto V. Then we use it to construct a
large-scale, diverse synthetic crowd dataset, named the "GCC Dataset".
Secondly, we propose two simple methods to improve the performance of crowd
understanding via exploiting the synthetic data. To be specific, 1) supervised
crowd understanding: pre-train a crowd analysis model on the synthetic data,
then fine-tune it using the real data and labels, which makes the model perform
better on the real world; 2) crowd understanding via domain adaptation:
translate the synthetic data to photo-realistic images, then train the model on
translated data and labels. As a result, the trained model works well in real
crowd scenes.
Authors' comments: Accepted by IJCV. arXiv admin note: text overlap with
arXiv:1903.03303
Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong
Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model "forget" distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTM on the convergence and performance of deep Transformers.
Yao-Hung Hubert Tsai, Han Zhao, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov
Since its inception, the neural estimation of mutual information (MI) has demonstrated empirical success in modeling expected dependency between high-dimensional random variables. However, MI is an aggregate statistic and cannot be used to measure point-wise dependency between different events. In this work, instead of estimating the expected dependency, we focus on estimating point-wise dependency (PD), which quantitatively measures how likely two outcomes co-occur. We show that we can naturally obtain PD when we are optimizing MI neural variational bounds. However, optimizing these bounds is challenging due to their large variance in practice. To address this issue, we develop two methods (free of optimizing MI variational bounds): Probabilistic Classifier and Density-Ratio Fitting. We demonstrate the effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation learning, and 3) cross-modal retrieval task.
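The relation between point-wise dependency and MI can be illustrated on a small discrete example, assuming PD is the ratio p(x, y) / (p(x) p(y)), so that MI is the expectation of log PD under the joint distribution. The function names and the tabular joint distribution are illustrative assumptions; the neural estimators discussed in the abstract are not reproduced here, and all marginals are assumed to be strictly positive.

```python
import math


def pointwise_dependency(joint):
    """Return PD[i][j] = p(x_i, y_j) / (p(x_i) * p(y_j)) for a joint table."""
    p_x = [sum(row) for row in joint]        # marginal over rows
    p_y = [sum(col) for col in zip(*joint)]  # marginal over columns
    return [
        [joint[i][j] / (p_x[i] * p_y[j]) for j in range(len(p_y))]
        for i in range(len(p_x))
    ]


def mutual_information(joint):
    """MI as the expectation of log PD under the joint distribution."""
    pd = pointwise_dependency(joint)
    return sum(
        p * math.log(r)
        for row_p, row_r in zip(joint, pd)
        for p, r in zip(row_p, row_r)
        if p > 0
    )


# Usage: two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
pd_table = pointwise_dependency(joint)
mi = mutual_information(joint)
```

A PD value above 1 marks a pair of outcomes that co-occur more often than under independence, and a value below 1 marks a pair that co-occurs less often; MI summarizes the whole table in a single number and so cannot distinguish individual pairs.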
Manikandasriram Srinivasan Ramanagopal, Zixu Zhang, Ram Vasudevan, Matthew Johnson-Roberson
Uncooled microbolometers can enable robots to see in the absence of visible
illumination by imaging the "heat" radiated from the scene. Despite this
ability to see in the dark, these sensors suffer from significant motion blur.
This has limited their application on robotic systems. As described in this
paper, this motion blur arises due to the thermal inertia of each pixel. This
has meant that traditional motion deblurring techniques, which rely on
identifying an appropriate spatial blur kernel to perform spatial
deconvolution, are unable to reliably perform motion deblurring on thermal
camera images. To address this problem, this paper formulates reversing the
effect of thermal inertia at a single pixel as a Least Absolute Shrinkage and
Selection Operator (LASSO) problem which we can solve rapidly using a quadratic
programming solver. By leveraging sparsity and a high frame rate, this
pixel-wise LASSO formulation is able to recover motion deblurred frames of
thermal videos without using any spatial information. To compare its quality
against state-of-the-art visible camera based deblurring methods, this paper
evaluated the performance of a family of pre-trained object detectors on a set
of images restored by different deblurring algorithms. All evaluated object
detectors performed systematically better on images restored by the proposed
algorithm than on those restored by any other tested state-of-the-art method.
Authors' comments: 10 pages, 8 figures, Accepted to Robotics: Science and Systems 2020
Jie Fu, Xue Geng, Zhijian Duan, Bohan Zhuang, Xingdi Yuan, Adam Trischler, Jie Lin, Chris Pal et al.
Knowledge Distillation (KD) is a common method for transferring the ``knowledge'' learned by one machine learning model (the \textit{teacher}) into another model (the \textit{student}), where typically, the teacher has a greater capacity (e.g., more parameters or higher bit-widths). To our knowledge, existing methods overlook the fact that although the student absorbs extra knowledge from the teacher, both models share the same input data -- and this data is the only medium by which the teacher's knowledge can be demonstrated. Due to the difference in model capacities, the student may not benefit fully from the same data points on which the teacher is trained. On the other hand, a human teacher may demonstrate a piece of knowledge with individualized examples adapted to a particular student, for instance, in terms of her cultural background and interests. Inspired by this behavior, we design data augmentation agents with distinct roles to facilitate knowledge distillation. Our data augmentation agents generate distinct training data for the teacher and student, respectively. We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student. We compare our approach with existing KD methods on training popular neural architectures and demonstrate that role-wise data augmentation improves the effectiveness of KD over strong prior approaches. The code for reproducing our results can be found at https://github.com/bigaidream-projects/role-kd
Tuyen Trung Truong
Let $z=(x,y)$ be coordinates for the product space $\mathbb{R}^{m_1}\times
\mathbb{R}^{m_2}$. Let $f:\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}\rightarrow
\mathbb{R}$ be a $C^1$ function, and $\nabla f=(\partial _xf,\partial _yf)$ its
gradient. Fix $0<\alpha <1$. For a point $(x,y) \in \mathbb{R}^{m_1}\times
\mathbb{R}^{m_2}$, a number $\delta >0$ satisfies Armijo's condition at $(x,y)$
if the following inequality holds: \begin{eqnarray*} f(x-\delta \partial
_xf,y-\delta \partial _yf)-f(x,y)\leq -\alpha \delta (||\partial
_xf||^2+||\partial _yf||^2). \end{eqnarray*}
In a previous paper, we proposed the following {\bf coordinate-wise}
Armijo's condition. Fix again $0<\alpha <1$. A pair of positive numbers $\delta
_1,\delta _2>0$ satisfies the coordinate-wise variant of Armijo's condition at
$(x,y)$ if the following inequality holds: \begin{eqnarray*} [f(x-\delta
_1\partial _xf(x,y), y-\delta _2\partial _y f(x,y))]-[f(x,y)]\leq -\alpha
(\delta _1||\partial _xf(x,y)||^2+\delta _2||\partial _yf(x,y)||^2).
\end{eqnarray*} Previously we applied this condition for functions of the form
$f(x,y)=f(x)+g(y)$, and proved various convergence results for them. For a
general function, it is crucial - for being able to do real computations - to
have a systematic algorithm for obtaining $\delta _1$ and $\delta _2$
satisfying the coordinate-wise version of Armijo's condition, much like
Backtracking for the usual Armijo's condition. In this paper we propose such an
algorithm, and prove corresponding convergence results.
We then analyse and present experimental results for some functions such as
$f(x,y)=a|x|+y$ (given by Asl and Overton in connection to Wolfe's method),
$f(x,y)=x^3 \sin (1/x) + y^3 \sin(1/y)$ and Rosenbrock's function.
Authors' comments: 6 pages. Preprint arXiv:1911.07820 is incorporated as a very special
case
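The coordinate-wise condition can be checked and searched for numerically. The sketch below is one simple backtracking scheme and is not claimed to be the algorithm proposed in the paper: it first halves a common step size until the usual Armijo condition holds, then tries to enlarge each coordinate's step separately while the coordinate-wise condition remains true. The function names, the defaults `alpha=0.5` and `beta=0.5`, the cap `delta0`, and the list representation of $x$ and $y$ are assumptions.

```python
def armijo_ok(f, x, y, gx, gy, d1, d2, alpha):
    """Check the coordinate-wise Armijo condition for step sizes d1, d2."""
    new_x = [xi - d1 * gi for xi, gi in zip(x, gx)]
    new_y = [yi - d2 * gi for yi, gi in zip(y, gy)]
    required = alpha * (d1 * sum(g * g for g in gx)
                        + d2 * sum(g * g for g in gy))
    return f(new_x, new_y) - f(x, y) <= -required


def coordinate_wise_backtracking(f, grad, x, y, delta0=1.0, alpha=0.5,
                                 beta=0.5, max_halvings=60):
    """Return (d1, d2) satisfying the coordinate-wise Armijo condition."""
    gx, gy = grad(x, y)
    # Phase 1: usual backtracking with a common step size d1 = d2 = d.
    d = delta0
    for _ in range(max_halvings):
        if armijo_ok(f, x, y, gx, gy, d, d, alpha):
            break
        d *= beta
    else:
        raise RuntimeError("no admissible step size found")
    # Phase 2: enlarge each step separately while the condition still holds.
    d1 = d2 = d
    while d1 / beta <= delta0 and armijo_ok(f, x, y, gx, gy, d1 / beta, d2, alpha):
        d1 /= beta
    while d2 / beta <= delta0 and armijo_ok(f, x, y, gx, gy, d1, d2 / beta, alpha):
        d2 /= beta
    return d1, d2


def rosenbrock(x, y):
    """Rosenbrock's function with one-dimensional blocks x and y."""
    return (1.0 - x[0]) ** 2 + 100.0 * (y[0] - x[0] ** 2) ** 2


def rosenbrock_grad(x, y):
    """Partial gradients of Rosenbrock's function w.r.t. x and y."""
    r = y[0] - x[0] ** 2
    return [-2.0 * (1.0 - x[0]) - 400.0 * x[0] * r], [200.0 * r]


# Usage: step sizes at the classical starting point (-1.2, 1).
step_x, step_y = coordinate_wise_backtracking(
    rosenbrock, rosenbrock_grad, [-1.2], [1.0])
```

Phase 1 terminates for a $C^1$ function at a non-critical point because the usual Armijo condition holds for all sufficiently small common step sizes; Phase 2 only ever accepts pairs that satisfy the coordinate-wise condition, so the returned pair is admissible by construction.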