Wenshuang Liu, Wenting Chen, Linlin Shen
Although techniques based on generative adversarial networks (GANs) have greatly
advanced the performance of image synthesis and face translation, only a few
works in the literature provide region-based style encoding and
translation. In this paper, we propose a region-wise normalization framework
for region-level face translation. While per-region style is encoded using
an existing approach, we build a so-called RIN (region-wise normalization) block
to individually inject the styles into per-region feature maps and then fuse
them for the subsequent convolution and upsampling. Both the shape and texture of
different regions can thus be translated to various target styles. We also
propose a region matching loss that significantly reduces the interference
between regions during the translation process. Extensive experiments on three
publicly available datasets, i.e. Morph, RaFD and CelebAMask-HQ, suggest that
our approach offers a large improvement over state-of-the-art methods such as
StarGAN, SEAN and FUNIT. Our approach has the further advantage of precise control
over the regions to be translated. As a result, region-level expression changes
and step-by-step makeup can be achieved. The video demo is available at
https://youtu.be/ceRqsbzXAfk.
Authors' comments: 13 pages, 13 figures
Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, Junjie Yan
Recent works have made great progress in semantic segmentation by exploiting
contextual information in a local or global manner with dilated convolutions,
pyramid pooling or self-attention mechanisms. To avoid the potentially
misleading aggregation of contextual information in previous works, we propose a
class-wise dynamic graph convolution (CDGC) module to adaptively propagate
information. The graph reasoning is performed among pixels in the same class.
Based on the proposed CDGC module, we further introduce the Class-wise Dynamic
Graph Convolution Network (CDGCNet), which consists of two main parts,
the CDGC module and a basic segmentation network, forming a coarse-to-fine
paradigm. Specifically, the CDGC module takes the coarse segmentation result as
a class mask to extract node features for graph construction and performs dynamic
graph convolutions on the constructed graph to learn the feature aggregation
and weight allocation. Then the refined feature and the original feature are
fused to get the final prediction. We conduct extensive experiments on three
popular semantic segmentation benchmarks including Cityscapes, PASCAL VOC 2012
and COCO Stuff, and achieve state-of-the-art performance on all three
benchmarks.
Authors' comments: Accepted by ECCV2020
David Minnen, Saurabh Singh
In learning-based approaches to image compression, codecs are developed by
optimizing a computational model to minimize a rate-distortion objective.
Currently, the most effective learned image codecs take the form of an
entropy-constrained autoencoder with an entropy model that uses both forward
and backward adaptation. Forward adaptation makes use of side information and
can be efficiently integrated into a deep neural network. In contrast, backward
adaptation typically makes predictions based on the causal context of each
symbol, which requires serial processing that prevents efficient GPU / TPU
utilization. We introduce two enhancements, channel-conditioning and latent
residual prediction, that lead to network architectures with better
rate-distortion performance than existing context-adaptive models while
minimizing serial processing. Empirically, we see an average rate savings of
6.7% on the Kodak image set and 11.4% on the Tecnick image set compared to a
context-adaptive baseline model. At low bit rates, where the improvements are
most effective, our model saves up to 18% over the baseline and outperforms
hand-engineered codecs like BPG by up to 25%.
Authors' comments: Published at the IEEE International Conference on Image Processing
(ICIP) 2020
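For readers unfamiliar with the term, a rate-distortion objective is commonly written in the following form. The symbols are assumptions introduced here for illustration and are not taken from the abstract: $x$ is the input image, $\hat{x}$ its reconstruction, $\hat{y}$ the quantized latent, $p$ the entropy model, $d$ a distortion measure, and $\lambda$ a weight that trades rate against distortion.

```latex
\mathcal{L} = R + \lambda D,
\qquad R = \mathbb{E}\left[-\log_2 p(\hat{y})\right],
\qquad D = \mathbb{E}\left[d(x, \hat{x})\right]
```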
Yunxiao Qin, Weiguo Zhang, Zezheng Wang, Chenxu Zhao, Jingping Shi
Few-shot image classification (FSIC), which requires a model to recognize new categories by learning from only a few images of these categories, has attracted a great deal of attention. Recently, meta-learning based methods have been shown to be a promising direction for FSIC. Commonly, they train a meta-learner (meta-learning model) to learn easily fine-tuned weights, and when solving an FSIC task, the meta-learner efficiently fine-tunes itself into a task-specific model by updating itself on a few images of the task. In this paper, we propose a novel meta-learning based layer-wise adaptive updating (LWAU) method for FSIC. LWAU is inspired by an interesting finding that, compared with common deep models, the meta-learner pays much more attention to updating its top layer when learning from few images. Based on this finding, we assume that the meta-learner may greatly prefer updating its top layer to updating its bottom layers for better FSIC performance. Therefore, in LWAU, the meta-learner is trained to learn not only the easily fine-tuned model but also its preferred layer-wise adaptive updating rule to improve its learning efficiency. Extensive experiments show that with the layer-wise adaptive updating rule, the proposed LWAU: 1) outperforms existing few-shot classification methods by a clear margin; 2) learns from few images at least 5 times more efficiently than existing meta-learners when solving FSIC.
Lianli Gao, Qilong Zhang, Jingkuan Song, Xianglong Liu, Heng Tao Shen
By adding human-imperceptible noise to clean images, the resultant
adversarial examples can fool other unknown models. Features of a pixel
extracted by deep neural networks (DNNs) are influenced by its surrounding
regions, and different DNNs generally focus on different discriminative regions
in recognition. Motivated by this, we propose a patch-wise iterative algorithm
-- a black-box attack towards mainstream normally trained and defense models,
which differs from the existing attack methods manipulating pixel-wise noise.
In this way, without sacrificing the performance of white-box attack, our
adversarial examples can have strong transferability. Specifically, we
introduce an amplification factor to the step size in each iteration, and one
pixel's overall gradient overflowing the $\epsilon$-constraint is properly
assigned to its surrounding regions by a project kernel. Our method can
generally be integrated into any gradient-based attack method. Compared with the
current state-of-the-art attacks, we significantly improve the success rate by
9.2\% for defense models and 3.7\% for normally trained models on average. Our
code is available at
\url{https://github.com/qilong-zhang/Patch-wise-iterative-attack}
Authors' comments: Accepted by ECCV 2020
Frédéric Bernicot, Polona Durcik
We prove $L^p$ estimates for various multi-parameter bi- and trilinear
operators with symbols acting on fibers of the two-dimensional functions. In
particular, this yields estimates for the general bi-parameter form of the
twisted paraproduct studied in arXiv:1011.6140.
Authors' comments: 26 pages
Yawei Li, Wen Li, Martin Danelljan, Kai Zhang, Shuhang Gu, Luc Van Gool, Radu Timofte
In this paper, we tackle the problem of convolutional neural network design.
Instead of focusing on the design of the overall architecture, we investigate a
design space that is usually overlooked, i.e. adjusting the channel
configurations of predefined networks. We find that this adjustment can be
achieved by shrinking widened baseline networks and leads to superior
performance. Based on that, we articulate the heterogeneity hypothesis: with
the same training protocol, there exists a layer-wise differentiated network
architecture (LW-DNA) that can outperform the original network with regular
channel configurations but with a lower level of model complexity.
The LW-DNA models are identified without extra computational cost or training
time compared with the original network. This constraint leads to controlled
experiments which direct the focus to the importance of layer-wise specific
channel configurations. LW-DNA models come with advantages related to
overfitting, i.e. the relative relationship between model complexity and
dataset size. Experiments are conducted on various networks and datasets for
image classification, visual tracking and image restoration. The resultant
LW-DNA models consistently outperform the baseline models. Code is available at
https://github.com/ofsoundof/Heterogeneity_Hypothesis.
Authors' comments: CVPR2021 paper
Tsubasa Takahashi, Shun Takagi, Hajime Ono, Tatsuya Komatsu
This paper studies how to learn variational autoencoders with a variety of
divergences under differential privacy constraints. We often build a VAE with
an appropriate prior distribution to describe the desired properties of the
learned representations and introduce a divergence as a regularization term to
bring the representations close to the prior. Using differentially private SGD
(DP-SGD), which randomizes a stochastic gradient by injecting dedicated noise
designed according to the gradient's sensitivity, we can easily build a
differentially private model. However, we reveal that attaching several
divergences increases the sensitivity from O(1) to O(B) in terms of batch size
B. This results in the injection of a large amount of noise, which makes
learning difficult. To solve this issue, we propose term-wise DP-SGD that crafts
randomized gradients in two different ways tailored to the compositions of the
loss terms. The term-wise DP-SGD keeps the sensitivity at O(1) even when
attaching the divergence. We can therefore reduce the amount of noise. In our
experiments, we demonstrate that our method works well with two pairings of
prior distribution and divergence.
Authors' comments: 10 pages
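As background for the abstract above, the following is a minimal sketch of a standard DP-SGD gradient step (per-example clipping followed by Gaussian noise), not the term-wise variant the authors propose. The function names, the `clip_norm` and `noise_multiplier` parameters, and the toy gradients are assumptions made for illustration. Because each example's gradient is clipped to norm at most `clip_norm`, one example can change the sum by a bounded amount regardless of batch size.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return the noisy average of clipped per-example gradients."""
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Gaussian noise is calibrated to the clipping norm (hypothetical settings).
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / batch_size for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

With `noise_multiplier=0.0` the result is simply the mean of the clipped gradients, which makes the clipping step easy to inspect in isolation.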
Yao Zhao, Mohammad Saleh, Peter J. Liu
Most prior work in the sequence-to-sequence paradigm focused on datasets with input sequence lengths in the hundreds of tokens due to the computational constraints of common RNN and Transformer architectures. In this paper, we study long-form abstractive text summarization, a sequence-to-sequence setting with input sequence lengths up to 100,000 tokens and output sequence lengths up to 768 tokens. We propose SEAL, a Transformer-based model, featuring a new encoder-decoder attention that dynamically extracts/selects input snippets to sparsely attend to for each output segment. Using only the original documents and summaries, we derive proxy labels that provide weak supervision for extractive layers simultaneously with regular supervision from abstractive summaries. The SEAL model achieves state-of-the-art results on existing long-form summarization tasks, and outperforms strong baseline models on a new dataset/task we introduce, Search2Wiki, with much longer input text. Since content selection is explicit in the SEAL model, a desirable side effect is that the selection can be inspected for enhanced interpretability.
Bartosz Wójcik, Paweł Morawiecki, Marek Śmieja, Tomasz Krzyżek, Przemysław Spurek, Jacek Tabor
We present a mechanism for detecting adversarial examples based on data representations taken from the hidden layers of the target network. For this purpose, we train individual autoencoders at intermediate layers of the target network. This allows us to describe the manifold of true data and, in consequence, decide whether a given example has the same characteristics as true data. It also gives us insight into the behavior of adversarial examples and their flow through the layers of a deep neural network. Experimental results show that our method outperforms the state of the art in supervised and unsupervised settings.
Chieh Wu, Aria Masoomi, Arthur Gretton, Jennifer Dy
There is currently a debate within the neuroscience community over the
likelihood of the brain performing backpropagation (BP). To better mimic the
brain, training a network $\textit{one layer at a time}$ with only a "single
forward pass" has been proposed as an alternative to bypass BP; we refer to
these networks as "layer-wise" networks. We continue the work on layer-wise
networks by answering two outstanding questions. First, $\textit{do they have a
closed-form solution?}$ Second, $\textit{how do we know when to stop adding
more layers?}$ This work proves that the Kernel Mean Embedding is the
closed-form weight that achieves the network's global optimum while driving these
networks to converge towards a highly desirable kernel for classification; we
call it the $\textit{Neural Indicator Kernel}$.
Authors' comments: This version will be published in AIStats 2022
Shariq Iqbal, Christian A. Schroeder de Witt, Bei Peng, Wendelin Böhmer, Shimon Whiteson, Fei Sha
Multi-agent settings in the real world often involve tasks with varying types
and quantities of agents and non-agent entities; however, common patterns of
behavior often emerge among these agents/entities. Our method aims to leverage
these commonalities by asking the question: ``What is the expected utility of
each agent when only considering a randomly selected sub-group of its observed
entities?'' By posing this counterfactual question, we can recognize
state-action trajectories within sub-groups of entities that we may have
encountered in another task and use what we learned in that task to inform our
prediction in the current one. We then reconstruct a prediction of the full
returns as a combination of factors considering these disjoint groups of
entities and train this ``randomly factorized'' value function as an auxiliary
objective for value-based multi-agent reinforcement learning. By doing so, our
model can recognize and leverage similarities across tasks to improve learning
efficiency in a multi-task setting. Our approach, Randomized Entity-wise
Factorization for Imagined Learning (REFIL), outperforms all strong baselines
by a significant margin in challenging multi-task StarCraft micromanagement
settings.
Authors' comments: ICML 2021 Camera Ready
Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu
Speaker diarization is an essential step for processing multi-speaker audio.
Although an end-to-end neural diarization (EEND) method has achieved
state-of-the-art performance, it is limited to a fixed number of speakers. In
this paper, we address this fixed-number-of-speakers limitation with a novel speaker-wise
conditional inference method based on the probabilistic chain rule. In the
proposed method, each speaker's speech activity is regarded as a single random
variable and is estimated sequentially, conditioned on the previously estimated
speech activities of the other speakers. Similar to other sequence-to-sequence
models, the proposed method produces a variable number of speakers with a stop
sequence condition. We evaluated the proposed method on multi-speaker audio
recordings of a variable number of speakers. Experimental results show that the
proposed method can correctly produce diarization results with a variable
number of speakers and outperforms the state-of-the-art end-to-end speaker
diarization methods in terms of diarization error rate.
Authors' comments: Submitted to Interspeech 2020
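The factorization referred to in the abstract above is the ordinary probabilistic chain rule. As a sketch, with notation assumed here rather than taken from the abstract, let $X$ denote the input audio features, $y_s$ the speech activity of speaker $s$, and $S$ the number of speakers:

```latex
P(y_1, \dots, y_S \mid X) = \prod_{s=1}^{S} P(y_s \mid y_1, \dots, y_{s-1}, X)
```

Each factor conditions on the activities already estimated for earlier speakers, which is what allows decoding to continue until a stop condition is met.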
Zaida Zhou, Chaoran Zhuge, Xinwei Guan, Wen Liu
Knowledge distillation transfers the knowledge that a teacher network has learned from data to a student network, so that the student has fewer parameters and requires less computation while reaching an accuracy close to the teacher's. In this paper, we propose a new distillation method, which contains two transfer distillation strategies and a loss decay strategy. The first transfer strategy is based on channel-wise attention and is called Channel Distillation (CD). CD transfers the channel information from the teacher to the student. The second is Guided Knowledge Distillation (GKD). Unlike Knowledge Distillation (KD), which lets the student mimic the teacher's prediction distribution for every sample, GKD only lets the student mimic the correct outputs of the teacher. The last part is Early Decay Teacher (EDT). During the training process, we gradually decay the weight of the distillation loss, so that the student rather than the teacher gradually takes control of the optimization. Our proposed method is evaluated on ImageNet and CIFAR100. On ImageNet, we achieve a top-1 error of 27.68% with ResNet18, which outperforms state-of-the-art methods. On CIFAR100, we achieve the surprising result that the student outperforms the teacher. Code is available at https://github.com/zhouzaida/channel-distillation.
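The GKD idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the use of KL divergence without a temperature, averaging over the retained samples, and the function names `softmax` and `guided_kd_loss` are all choices made here. The only part taken from the abstract is that the student mimics the teacher solely on samples the teacher predicts correctly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_kd_loss(teacher_logits, student_logits, labels):
    """Mean KL(teacher || student) over samples the teacher classifies correctly."""
    total, count = 0.0, 0
    for t, s, y in zip(teacher_logits, student_logits, labels):
        teacher_pred = max(range(len(t)), key=t.__getitem__)
        if teacher_pred != y:
            continue  # teacher is wrong on this sample, so it is not mimicked
        p, q = softmax(t), softmax(s)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        count += 1
    return total / count if count else 0.0


teacher = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 0]  # the teacher is right on the first sample only
student = [[0.0, 2.0], [5.0, 0.0]]
loss = guided_kd_loss(teacher, student, labels)
```

In this toy batch only the first sample contributes to the loss, because the teacher's prediction for the second sample does not match its label.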
Wennan Chang, Xinyu Zhou, Yong Zang, Chi Zhang, Sha Cao
Parameter estimation for mixture regression models using the expectation maximization (EM) algorithm is highly sensitive to outliers. Here we propose a fast and efficient robust mixture regression algorithm, called the Component-wise Adaptive Trimming (CAT) method. We consider simultaneous outlier detection and robust parameter estimation to minimize the effect of outlier contamination. Robust mixture regression has many important applications, including human cancer genomics data, where the population often displays strong heterogeneity compounded by unwanted technological perturbations. Existing robust mixture regression methods suffer from outliers as they either conduct parameter estimation in the presence of outliers or rely on prior knowledge of the level of outlier contamination. CAT is implemented in the framework of classification expectation maximization, under which a natural definition of outliers can be derived. It applies a least trimmed squares (LTS) approach within each exclusive mixing component, so that the robustness issue is transformed from the mixture case to the simple linear regression case. The high breakdown point of the LTS approach allows us to avoid pre-specifying a trimming parameter. Compared with multiple existing algorithms, CAT is the most competitive in handling and adaptively trimming outliers as well as heavy-tailed noise across different scenarios of simulated data and real genomic data. CAT has been implemented in the R package `RobMixReg', available on CRAN.
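To illustrate the least trimmed squares idea mentioned above for a single component, here is a minimal sketch that alternates between an ordinary least squares fit and keeping the best-fitting points. It is a simple heuristic written for this example, not the CAT algorithm and not the `RobMixReg' API: the function names, the fixed `keep` count, and the toy data are assumptions, whereas the abstract states that CAT trims adaptively within each mixture component.

```python
def ols(points):
    """Ordinary least squares fit y = a + b * x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def trimmed_fit(points, keep, max_iter=50):
    """Alternate between fitting and keeping the `keep` best-fitting points."""
    subset = list(points)
    for _ in range(max_iter):
        a, b = ols(subset)
        # Rank all points by squared residual under the current fit.
        ranked = sorted(points, key=lambda p: (p[1] - a - b * p[0]) ** 2)
        new_subset = ranked[:keep]
        if sorted(new_subset) == sorted(subset):
            break  # the retained subset no longer changes
        subset = new_subset
    return a, b, subset


points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points[5] = (5.0, 100.0)  # one gross outlier
intercept, slope, kept = trimmed_fit(points, keep=9)
```

On this toy data the outlier is dropped after the first refit, and the fit on the remaining nine points recovers the underlying line.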
Seungwoo Yoo, Heeseok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, Duck Hoon Kim
In autonomous driving, reliably and accurately detecting lane marker positions is a crucial yet challenging task. Conventional approaches to the lane marker detection problem perform a pixel-level dense prediction task followed by sophisticated post-processing, which is unavoidable since lane markers are typically represented by a collection of line segments without thickness. In this paper, we propose a method that performs direct lane marker vertex prediction in an end-to-end manner, i.e., without any of the post-processing steps required in the pixel-level dense prediction task. Specifically, we translate the lane marker detection problem into a row-wise classification task, which takes advantage of the innate shape of lane markers but, surprisingly, has not been explored well. In order to compactly extract sufficient information about lane markers, which spread from left to right in an image, we devise a novel layer that successively compresses horizontal components and thus enables an end-to-end lane marker detection system in which the final lane marker positions are obtained simply via argmax operations at test time. Experimental results demonstrate the effectiveness of the proposed method, which is on par with or outperforms the state-of-the-art methods on two popular lane marker detection benchmarks, i.e., TuSimple and CULane.
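The decoding step described above, where lane positions come from an argmax per image row, can be sketched as follows. The function name `decode_lane` and the toy scores are assumptions for illustration. In the proposed method the scores would come from the network, and this sketch omits both the novel compression layer and any handling of rows that contain no lane marker.

```python
def decode_lane(row_scores):
    """For each image row, return the column index with the highest score."""
    return [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in row_scores
    ]


# Hypothetical per-row classification scores over three horizontal positions.
row_scores = [
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
]
columns = decode_lane(row_scores)
```

Each entry of `columns` gives the horizontal position of the lane marker in the corresponding row, so no clustering or curve fitting is needed afterwards.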
Abhijith Punnappurath, Michael S. Brown
Imaging sensors digitize incoming scene light at a dynamic range of 10--12 bits (i.e., 1024--4096 tonal values). The sensor image is then processed onboard the camera and finally quantized to only 8 bits (i.e., 256 tonal values) to conform to prevailing encoding standards. There are a number of important applications, such as high-bit-depth displays and photo editing, where it is beneficial to recover the lost bit depth. Deep neural networks are effective at this bit-depth reconstruction task. Given the quantized low-bit-depth image as input, existing deep learning methods employ a single-shot approach that attempts to either (1) directly estimate the high-bit-depth image, or (2) directly estimate the residual between the high- and low-bit-depth images. In contrast, we propose a training and inference strategy that recovers the residual image bitplane-by-bitplane. Our bitplane-wise learning framework has the advantage of allowing for multiple levels of supervision during training and is able to obtain state-of-the-art results using a simple network architecture. We test our proposed method extensively on several image datasets and demonstrate an improvement of 0.5 dB to 2.3 dB in PSNR over prior methods, depending on the quantization level.
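To make the notion of a residual recovered bitplane-by-bitplane concrete, the sketch below decomposes the bits lost in a 12-bit to 8-bit quantization of a single pixel value and re-attaches them one plane at a time. It assumes quantization by truncating the least significant bits, and it uses the true bits where the proposed method would use network predictions. The function names are hypothetical.

```python
def quantize(value, high_bits, low_bits):
    """Drop the least significant bits of a high-bit-depth integer value."""
    return value >> (high_bits - low_bits)


def residual_bitplanes(value, high_bits, low_bits):
    """Return the bits lost by quantization, most significant first."""
    lost = high_bits - low_bits
    return [(value >> shift) & 1 for shift in range(lost - 1, -1, -1)]


def reconstruct(quantized, bitplanes):
    """Re-attach the bitplanes one at a time, most significant first."""
    value = quantized
    for bit in bitplanes:
        value = (value << 1) | bit
    return value


pixel = 0xABC  # a 12-bit tonal value (2748)
low = quantize(pixel, high_bits=12, low_bits=8)             # 0xAB
planes = residual_bitplanes(pixel, high_bits=12, low_bits=8)  # bits of 0xC
restored = reconstruct(low, planes)
```

Because each bitplane is a separate binary target, every step of the reconstruction can be supervised on its own, which is the property the abstract highlights.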
Daniella C. Bardalez Gagliuffi, Jacqueline K. Faherty, Adam C. Schneider, Aaron Meisner, Dan Caselden, Guilluame Colin, Sam Goodman, J. Davy Kirkpatrick et al.
We present the discovery of WISEA J083011.95+283716.0, the first Y dwarf
candidate identified through the Backyard Worlds: Planet 9 citizen science
project. We identified this object as a red, fast-moving source with a faint
$W2$ detection in multi-epoch \textit{AllWISE} and unWISE images. We have
characterized this object with Spitzer Space Telescope and \textit{Hubble Space
Telescope} follow-up imaging. With mid-infrared detections in
\textit{Spitzer}'s \emph{ch1} and \emph{ch2} bands and flux upper limits in
Hubble Space Telescope $F105W$ and $F125W$ filters, we find that this object is
both very faint and has extremely red colors ($ch1-ch2 = 3.25\pm0.23$ mag,
$F125W-ch2 \geq 9.36$ mag), consistent with a T$_{eff}\sim300$ K source, as
estimated from the known Y dwarf population. A preliminary parallax provides a
distance of $11.1^{+2.0}_{-1.5}$ pc, leading to a slightly warmer temperature
of $\sim350$ K. The extreme faintness and red Hubble Space Telescope and
Spitzer Space Telescope colors of this object suggest it may be a link between
the broader Y dwarf population and the coldest known brown dwarf WISE
J0855$-$0714, and highlight our limited knowledge of the true spread of Y dwarf
colors. We also present four additional Backyard Worlds: Planet 9 late-T brown
dwarf discoveries within 30 pc.
Authors' comments: 13 pages, 6 figures, 5 tables
Chun Jiang Zhu, Song Han, Kam-Yiu Lam
In this paper, we study the problem of fast constructions of source-wise
round-trip spanners in weighted directed graphs. For a source vertex set
$S\subseteq V$ in a graph $G(V,E)$, an $S$-sourcewise round-trip spanner of $G$
of stretch $k$ is a subgraph $H$ of $G$ such that for every pair of vertices
$(u,v)\in S\times V$, their round-trip distance in $H$ is at most $k$ times
their round-trip distance in $G$. We show that for a graph $G(V,E)$ with $n$
vertices and $m$ edges, an $s$-sized source vertex set $S\subseteq V$ and an
integer $k>1$, there exists an algorithm that in time $O(ms^{1/k}\log^5n)$
constructs an $S$-sourcewise round-trip spanner with stretch $O(k\log n)$ and
$O(ns^{1/k}\log^2n)$ edges with high probability. Compared to the fast
algorithms for constructing all-pairs round-trip spanners \cite{PRS+18,CLR+20},
our algorithm improves the running time and the number of edges in the spanner
when $k$ is super-constant. Compared with the existing algorithm for
constructing source-wise round-trip spanners \cite{ZL17}, our algorithm
significantly improves the construction time from $\Omega(\min\{ms,n^\omega\})$
(where $\omega \in [2,2.373)$ is the matrix multiplication exponent)
to nearly linear $O(ms^{1/k}\log^5n)$, at the expense of paying an extra
$O(\log n)$ in the stretch. As an important building block of the algorithm, we
develop a graph partitioning algorithm to partition $G$ into clusters of
bounded radius and prove that for every $(u,v)\in S\times V$ at small round-trip
distance, the probability of separating them into different clusters is small.
The algorithm takes the size of $S$ as input and does not need the knowledge of
$S$. With the algorithm and a reachability vertex size estimation algorithm, we
show that the recursive algorithm for constructing standard round-trip spanners
\cite{PRS+18} can be adapted to the source-wise setting.
Authors' comments: Chun Jiang Zhu, Song Han and Kam-Yiu Lam. A Fast Algorithm for
Source-Wise Round-Trip Spanners. Theoretical Computer Science (TCS), 876,
34-44, 2021
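The definition of an $S$-sourcewise round-trip spanner given above can be checked on small graphs with the following sketch. It only measures the stretch of a given subgraph and does not implement the construction algorithm from the paper. The adjacency-list representation, the function names, and the toy graph are assumptions for illustration. The round-trip distance between $u$ and $v$ is taken as $d(u,v) + d(v,u)$.

```python
import heapq


def dijkstra(adj, source):
    """Shortest-path distances from source in a weighted directed graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def reverse(adj):
    """Return the graph with every edge direction flipped."""
    rev = {}
    for u, edges in adj.items():
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    return rev


def round_trip_from(adj, s):
    """Round-trip distance d(s, v) + d(v, s) for every v reachable both ways."""
    fwd = dijkstra(adj, s)
    bwd = dijkstra(reverse(adj), s)
    return {v: fwd[v] + bwd[v] for v in fwd if v in bwd}


def sourcewise_stretch(g, h, sources):
    """Largest ratio of round-trip distance in h to that in g over S x V."""
    worst = 1.0
    for s in sources:
        rt_g = round_trip_from(g, s)
        rt_h = round_trip_from(h, s)
        for v, d_g in rt_g.items():
            if v != s:
                worst = max(worst, rt_h.get(v, float("inf")) / d_g)
    return worst


# Toy graph: a directed 3-cycle plus a shortcut edge b -> a.
g = {"a": [("b", 1)], "b": [("c", 1), ("a", 1)], "c": [("a", 1)]}
h = {"a": [("b", 1)], "b": [("c", 1)], "c": [("a", 1)]}  # shortcut removed
stretch = sourcewise_stretch(g, h, sources=["a"])
```

Removing the shortcut lengthens the round trip between `a` and `b` from 2 to 3, so `h` is a sourcewise round-trip spanner of stretch 1.5 for the source set containing only `a`.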
Ratnabali Pal, Arif Ahmed Sekh, Samarjit Kar, Dilip K. Prasad
The recent worldwide outbreak of the novel coronavirus (COVID-19) has opened up new challenges to the research community. Artificial intelligence (AI) driven methods can be useful to predict the parameters, risks, and effects of such an epidemic. Such predictions can be helpful to control and prevent the spread of such diseases. The main challenges in applying AI are the small volume of data and its uncertain nature. Here, we propose a shallow long short-term memory (LSTM) based neural network to predict the risk category of a country. We have used a Bayesian optimization framework to optimize and automatically design country-specific networks. The results show that the proposed pipeline outperforms state-of-the-art methods on data from 180 countries and can be a useful tool for such risk categorization. We have also experimented with the trend data and weather data combined for the prediction. The outcome shows that the weather does not have a significant role. The tool can be used to predict long-duration outbreaks of such an epidemic so that preventive steps can be taken earlier.