Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong
Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model "forget" distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTM on the convergence and performance of deep Transformers.
Yao-Hung Hubert Tsai, Han Zhao, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov
Since its inception, the neural estimation of mutual information (MI) has demonstrated the empirical success of modeling expected dependency between high-dimensional random variables. However, MI is an aggregate statistic and cannot be used to measure point-wise dependency between different events. In this work, instead of estimating the expected dependency, we focus on estimating point-wise dependency (PD), which quantitatively measures how likely two outcomes co-occur. We show that we can naturally obtain PD when we are optimizing MI neural variational bounds. However, optimizing these bounds is challenging due to their large variance in practice. To address this issue, we develop two methods (free of optimizing MI variational bounds): Probabilistic Classifier and Density-Ratio Fitting. We demonstrate the effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation learning, and 3) cross-modal retrieval.
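As a small illustration of the difference between point-wise and expected dependency, the sketch below computes both exactly for a discrete joint distribution. It assumes the usual density-ratio form PD(x, y) = p(x, y) / (p(x) p(y)), which the abstract does not spell out, and it is not one of the neural estimators proposed in the paper. The function names and the toy table are hypothetical.

```python
import math


def pointwise_dependency(joint):
    """Return PD[i][j] = p(x_i, y_j) / (p(x_i) * p(y_j)) for a joint table.

    Assumes `joint` is a list of rows summing to 1 overall, with strictly
    positive marginals (otherwise the ratio is undefined).
    """
    px = [sum(row) for row in joint]            # marginal over y
    py = [sum(col) for col in zip(*joint)]      # marginal over x
    return [[joint[i][j] / (px[i] * py[j]) for j in range(len(py))]
            for i in range(len(px))]


def mutual_information(joint):
    """MI in nats, computed as the expectation of log PD under the joint."""
    pd = pointwise_dependency(joint)
    return sum(p * math.log(pd[i][j])
               for i, row in enumerate(joint)
               for j, p in enumerate(row)
               if p > 0)


# Toy joint distribution: matching outcomes co-occur more often than chance.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
pd = pointwise_dependency(joint)   # PD > 1 on the diagonal, < 1 off it
mi = mutual_information(joint)     # a single aggregate number
```

PD assigns a separate value to every pair of outcomes, whereas MI collapses the whole table into one number.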
Manikandasriram Srinivasan Ramanagopal, Zixu Zhang, Ram Vasudevan, Matthew Johnson-Roberson
Uncooled microbolometers can enable robots to see in the absence of visible
illumination by imaging the "heat" radiated from the scene. Despite this
ability to see in the dark, these sensors suffer from significant motion blur.
This has limited their application on robotic systems. As described in this
paper, this motion blur arises due to the thermal inertia of each pixel. This
has meant that traditional motion deblurring techniques, which rely on
identifying an appropriate spatial blur kernel to perform spatial
deconvolution, are unable to reliably perform motion deblurring on thermal
camera images. To address this problem, this paper formulates reversing the
effect of thermal inertia at a single pixel as a Least Absolute Shrinkage and
Selection Operator (LASSO) problem which we can solve rapidly using a quadratic
programming solver. By leveraging sparsity and a high frame rate, this
pixel-wise LASSO formulation is able to recover motion deblurred frames of
thermal videos without using any spatial information. To compare its quality
against state-of-the-art visible camera based deblurring methods, this paper
evaluated the performance of a family of pre-trained object detectors on a set
of images restored by different deblurring algorithms. All evaluated object
detectors performed systematically better on images restored by the proposed
algorithm than on images restored by any of the other tested state-of-the-art methods.
Authors' comments: 10 pages, 8 figures, Accepted to Robotics: Science and Systems 2020
Jie Fu, Xue Geng, Zhijian Duan, Bohan Zhuang, Xingdi Yuan, Adam Trischler, Jie Lin, Chris Pal et al.
Knowledge Distillation (KD) is a common method for transferring the ``knowledge'' learned by one machine learning model (the \textit{teacher}) into another model (the \textit{student}), where typically, the teacher has a greater capacity (e.g., more parameters or higher bit-widths). To our knowledge, existing methods overlook the fact that although the student absorbs extra knowledge from the teacher, both models share the same input data -- and this data is the only medium by which the teacher's knowledge can be demonstrated. Due to the difference in model capacities, the student may not benefit fully from the same data points on which the teacher is trained. On the other hand, a human teacher may demonstrate a piece of knowledge with individualized examples adapted to a particular student, for instance, in terms of her cultural background and interests. Inspired by this behavior, we design data augmentation agents with distinct roles to facilitate knowledge distillation. Our data augmentation agents generate distinct training data for the teacher and student, respectively. We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student. We compare our approach with existing KD methods on training popular neural architectures and demonstrate that role-wise data augmentation improves the effectiveness of KD over strong prior approaches. The code for reproducing our results can be found at https://github.com/bigaidream-projects/role-kd
Tuyen Trung Truong
Let $z=(x,y)$ be coordinates for the product space $\mathbb{R}^{m_1}\times
\mathbb{R}^{m_2}$. Let $f:\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}\rightarrow
\mathbb{R}$ be a $C^1$ function, and $\nabla f=(\partial _xf,\partial _yf)$ its
gradient. Fix $0<\alpha <1$. For a point $(x,y) \in \mathbb{R}^{m_1}\times
\mathbb{R}^{m_2}$, a number $\delta >0$ satisfies Armijo's condition at $(x,y)$
if the following inequality holds: \begin{eqnarray*} f(x-\delta \partial
_xf,y-\delta \partial _yf)-f(x,y)\leq -\alpha \delta (||\partial
_xf||^2+||\partial _yf||^2). \end{eqnarray*}
In a previous paper, we proposed the following {\bf coordinate-wise}
Armijo's condition. Fix again $0<\alpha <1$. A pair of positive numbers $\delta
_1,\delta _2>0$ satisfies the coordinate-wise variant of Armijo's condition at
$(x,y)$ if the following inequality holds: \begin{eqnarray*} [f(x-\delta
_1\partial _xf(x,y), y-\delta _2\partial _y f(x,y))]-[f(x,y)]\leq -\alpha
(\delta _1||\partial _xf(x,y)||^2+\delta _2||\partial _yf(x,y)||^2).
\end{eqnarray*} Previously we applied this condition for functions of the form
$f(x,y)=f(x)+g(y)$, and proved various convergence results for them. For a
general function, it is crucial - for being able to do real computations - to
have a systematic algorithm for obtaining $\delta _1$ and $\delta _2$
satisfying the coordinate-wise version of Armijo's condition, much like
Backtracking for the usual Armijo's condition. In this paper we propose such an
algorithm, and prove the corresponding convergence results.
We then analyse and present experimental results for some functions such as
$f(x,y)=a|x|+y$ (given by Asl and Overton in connection to Wolfe's method),
$f(x,y)=x^3 \sin(1/x) + y^3 \sin(1/y)$ and Rosenbrock's function.
Authors' comments: 6 pages. Preprint arXiv:1911.07820 is incorporated as a very special
case
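The coordinate-wise Armijo condition stated in the abstract above can be checked directly. The sketch below does so for scalar blocks ($m_1=m_2=1$). The helper `naive_backtracking` is only a placeholder that shrinks a common step size, which reduces to ordinary Backtracking with $\delta_1=\delta_2$. It is not the algorithm proposed in the paper, and all function names are hypothetical.

```python
def satisfies_coordinatewise_armijo(f, grad, x, y, d1, d2, alpha):
    """Check f(x - d1*fx, y - d2*fy) - f(x, y) <= -alpha*(d1*fx^2 + d2*fy^2)."""
    gx, gy = grad(x, y)
    lhs = f(x - d1 * gx, y - d2 * gy) - f(x, y)
    rhs = -alpha * (d1 * gx ** 2 + d2 * gy ** 2)
    return lhs <= rhs


def naive_backtracking(f, grad, x, y, alpha=0.5, beta=0.5, delta0=1.0,
                       max_steps=60):
    """Shrink a common step d1 = d2 until the condition holds (assumed scheme)."""
    d = delta0
    for _ in range(max_steps):
        if satisfies_coordinatewise_armijo(f, grad, x, y, d, d, alpha):
            return d, d
        d *= beta
    raise RuntimeError("no admissible step size found")


# Badly scaled quadratic: the y-direction needs a much smaller step than x.
def f(x, y):
    return x ** 2 + 10.0 * y ** 2


def grad_f(x, y):
    return (2.0 * x, 20.0 * y)


common_ok = satisfies_coordinatewise_armijo(f, grad_f, 1.0, 1.0, 0.5, 0.5, 0.25)
split_ok = satisfies_coordinatewise_armijo(f, grad_f, 1.0, 1.0, 0.5, 0.05, 0.25)
```

In this example the pair $(\delta_1,\delta_2)=(0.5,0.05)$ is accepted while the common step $0.5$ is rejected, which is the kind of freedom the coordinate-wise condition allows.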
Yuchen Fan, Jiahui Yu, Ding Liu, Thomas S. Huang
While scale-invariant modeling has substantially boosted the performance of
visual recognition tasks, it remains largely under-explored in deep networks
based image restoration. Naively applying those scale-invariant techniques
(e.g. multi-scale testing, random-scale data augmentation) to image restoration
tasks usually leads to inferior performance. In this paper, we show that
properly modeling scale-invariance into neural networks can bring significant
benefits to image restoration performance. Inspired by spatial-wise
convolution for shift-invariance, "scale-wise convolution" is proposed to
convolve across multiple scales for scale-invariance. In our scale-wise
convolutional network (SCN), we first map the input image to the feature space
and then build a feature pyramid representation via bi-linear down-scaling
progressively. The feature pyramid is then passed to a residual network with
scale-wise convolutions. The proposed scale-wise convolution learns to
dynamically activate and aggregate features from different input scales in each
residual building block, in order to exploit contextual information on multiple
scales. In experiments, we compare the restoration accuracy and parameter
efficiency among our model and many different variants of multi-scale neural
networks. The proposed network with scale-wise convolution achieves superior
performance in multiple image restoration tasks including image
super-resolution, image denoising and image compression artifacts removal. Code
and models are available at: https://github.com/ychfan/scn_sr
Authors' comments: AAAI 2020
Ahmad Abdi, Gérard Cornuéjols, Tony Huynh, Dabeen Lee
A clutter is \emph{$k$-wise intersecting} if every $k$ members have a common
element, yet no element belongs to all members. We conjecture that, for some
integer $k\geq 4$, every $k$-wise intersecting clutter is non-ideal. As
evidence for our conjecture, we prove it for $k=4$ for the class of binary
clutters. Two key ingredients for our proof are Jaeger's $8$-flow theorem for
graphs, and Seymour's characterization of the binary matroids with the sums of
circuits property. As further evidence for our conjecture, we also note that it
follows from an unpublished conjecture of Seymour from 1975. We also discuss
connections to the chromatic number of a clutter, projective geometries over
the two-element field, uniform cycle covers in graphs, and quarter-integral
packings of value two in ideal clutters.
Authors' comments: 20 pages, 2 figures. An extended abstract under the same title
appeared in the 21st Conference in Integer Programming and Combinatorial
Optimization
Jason O'Neill, Jacques Verstraete
For an integer $d \geq 2$, a family $\mathcal{F}$ of sets is $\textit{$d$-wise intersecting}$ if for any distinct sets $A_1,A_2,\dots,A_d \in \mathcal{F}$, $A_1 \cap A_2 \cap \dots \cap A_d \neq \emptyset$, and $\textit{non-trivial}$ if $\bigcap \mathcal{F} = \emptyset$. Hilton and Milner conjectured that for $k \geq d \geq 2$ and large enough $n$, the extremal non-trivial $d$-wise intersecting family of $k$-element subsets of $[n]$ is one of the following two families: \begin{align*} &\mathcal{H}(k,d) = \{A \in \binom{[n]}{k} : [d-1] \subset A, A \cap [d,k+1] \neq \emptyset\} \cup \{[k+1] \setminus \{i \} : i \in [d - 1]\} \\ &\mathcal{A}(k,d) = \{ A \in \binom{[n]}{k} : |A \cap [d+1]| \geq d \}. \end{align*} The celebrated Hilton-Milner Theorem states that $\mathcal{H}(k,2)$ is the unique extremal non-trivial intersecting family for $k>3$. We prove the conjecture and prove a stability theorem, stating that any large enough non-trivial $d$-wise intersecting family of $k$-element subsets of $[n]$ is a subfamily of $\mathcal{A}(k,d)$ or $\mathcal{H}(k,d)$.
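For small parameters, the definitions above can be verified by brute force. The sketch below builds $\mathcal{A}(k,d)$ and $\mathcal{H}(k,d)$ as written in the abstract, reading $\subset$ as "is a subset of", and checks that they are $d$-wise intersecting and non-trivial. The function names are hypothetical, and the enumeration is only practical for small $n$.

```python
from itertools import combinations


def is_d_wise_intersecting(family, d):
    """True if every d distinct members of `family` share a common element."""
    return all(frozenset.intersection(*group)
               for group in combinations(family, d))


def is_non_trivial(family):
    """True if no element belongs to every member (family must be non-empty)."""
    return not frozenset.intersection(*family)


def family_A(n, k, d):
    """A(k,d): k-subsets of [n] meeting [d+1] in at least d elements."""
    head = set(range(1, d + 2))
    return [frozenset(c) for c in combinations(range(1, n + 1), k)
            if len(head & set(c)) >= d]


def family_H(n, k, d):
    """H(k,d): k-subsets containing [d-1] and meeting [d, k+1],
    together with the sets [k+1] minus {i} for i in [d-1]."""
    core = set(range(1, d))            # [d-1]
    window = set(range(d, k + 2))      # [d, k+1]
    first = {frozenset(c) for c in combinations(range(1, n + 1), k)
             if core <= set(c) and window & set(c)}
    second = {frozenset(range(1, k + 2)) - {i} for i in core}
    return list(first | second)


A = family_A(6, 3, 2)
H = family_H(6, 3, 2)
```

For $n=6$, $k=3$, $d=2$ both families have 10 members, so this small case alone does not separate the two candidates.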
Cyprien Ruffino, Romain Hérault, Eric Laloy, Gilles Gasso
Generative Adversarial Networks (GANs) have proven successful for unsupervised image generation. Several works extended GANs to image inpainting by conditioning the generation with parts of the image one wants to reconstruct. However, these methods have limitations in settings where only a small subset of the image pixels is known beforehand. In this paper, we study the effectiveness of conditioning GANs by adding an explicit regularization term to enforce pixel-wise conditions when very few pixel values are provided. In addition, we investigate the influence of this regularization term on the quality of the generated images and the satisfaction of the conditions. Experiments on MNIST and FashionMNIST provide evidence that this regularization term allows for controlling the trade-off between quality of the generated images and constraint satisfaction.
Tapas Kumar Mishra
Let $L = \{\frac{a_1}{b_1}, \ldots , \frac{a_s}{b_s}\}$, where for every $i
\in [s]$, $\frac{a_i}{b_i} \in [0,1)$ is an irreducible fraction. Let
$\mathcal{F} = \{A_1, \ldots , A_m\}$ be a family of subsets of $[n]$. We say
$\mathcal{F}$ is an \emph{$r$-wise fractional $L$-intersecting family} if for
every distinct $i_1,i_2, \ldots,i_r \in [m]$, there exists an $\frac{a}{b} \in
L$ such that $|A_{i_1} \cap A_{i_2} \cap \ldots \cap A_{i_r}| \in \{
\frac{a}{b}|A_{i_1}|, \frac{a}{b} |A_{i_2}|,\ldots, \frac{a}{b} |A_{i_r}| \}$.
In this paper, we introduce and study the notion of $r$-wise fractional
$L$-intersecting families. This is a generalization of the notion of fractional
$L$-intersecting families studied in [Niranjan et al., Fractional
$L$-intersecting families, The Electronic Journal of Combinatorics, 2019].
Authors' comments: 8 pages
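The definition of an $r$-wise fractional $L$-intersecting family above translates directly into a brute-force check. The sketch below uses exact rational arithmetic; the function name and the toy family are hypothetical, and the check enumerates all $r$-subsets of the family, so it is only meant for small examples.

```python
from fractions import Fraction
from itertools import combinations


def is_r_wise_fractional_L_intersecting(family, L, r):
    """Check the definition for every r distinct members of `family`.

    `family` is a list of frozensets and `L` a collection of Fractions in [0, 1).
    For each group, the size of the common intersection must equal
    (a/b) * |A| for some a/b in L and some member A of the group.
    """
    for group in combinations(family, r):
        common = len(frozenset.intersection(*group))
        allowed = {frac * len(A) for frac in L for A in group}
        if common not in allowed:
            return False
    return True


# All 2-element subsets of {1, 2, 3}: any two of them share exactly one element.
family = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]
half = {Fraction(1, 2)}
zero_or_half = {Fraction(0, 1), Fraction(1, 2)}
```

With $L=\{1/2\}$ this family satisfies the condition for $r=2$ but not for $r=3$, since the three sets have empty common intersection; adding $0$ to $L$ restores it.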
Ricard Durall, Franz-Josef Pfreundt, Ullrich Köthe, Janis Keuper
Recent deep learning based approaches have shown remarkable success on object segmentation tasks. However, there is still room for further improvement. Inspired by generative adversarial networks, we present a generic end-to-end adversarial approach, which can be combined with a wide range of existing semantic segmentation networks to improve their segmentation performance. The key element of our method is to replace the commonly used binary adversarial loss with a high resolution pixel-wise loss. In addition, we train our generator using stochastic weight averaging, which further enhances the predicted output label maps, leading to state-of-the-art results. We show that this combination of pixel-wise adversarial training and weight averaging leads to significant and consistent gains in segmentation performance compared to the baseline models.
Raniere de Menezes, Harold A. Peña-Herazo, Ezequiel J. Marchesini, Raffaele D'Abrusco, Nicola Masetti, Rodrigo Nemmen, Francesco Massaro, Federica Ricci et al.
Over the last decade more than five thousand gamma-ray sources were detected
by the Large Area Telescope (LAT) on board the Fermi Gamma-ray Space Telescope.
Given the positional uncertainty of the telescope, nearly 30% of these sources
remain without an obvious counterpart at lower energies. This motivated the
release of new catalogs of gamma-ray counterpart candidates and several
follow-up campaigns in the last decade. Recently, two new catalogs of blazar
candidates were released: the improved and expanded version of the
WISE Blazar-Like Radio-Loud Sources (WIBRaLS2) catalog and the Kernel Density
Estimation selected candidate BL Lacs (KDEBLLACS) catalog, both selecting
blazar-like sources based on their infrared colors from the Wide-field Infrared
Survey Explorer (WISE). In this work we characterized these two catalogs,
clarifying the true nature of their sources based on their optical spectra from
SDSS data release 15, thus testing how efficient they are in selecting true
blazars. We first selected all WIBRaLS2 and KDEBLLACS sources with available
optical spectra in the footprint of Sloan Digital Sky Survey data release 15.
Then we analyzed these spectra to verify the nature of each selected candidate
and see which fraction of the catalogs is composed of spectroscopically
confirmed blazars. Finally, we evaluated the impact of selection effects,
especially those related to optical colors of WIBRaLS2/KDEBLLACS sources and
their optical magnitude distributions. We found that at least ~ 30% of each
catalog is composed of confirmed blazars, with quasars being the major
contaminants in the case of WIBRaLS2 (~ 58%) and normal galaxies in the case of
KDEBLLACS (~ 38.2%). The spectral analysis also allowed us to identify the
nature of 11 blazar candidates of uncertain type (BCUs) from the Fermi-LAT 4th
Point Source Catalog (4FGL) and to find 25 new BL Lac objects.
Authors' comments: 11 pages, 11 figures
Sachin Mehta, Hannaneh Hajishirzi, Mohammad Rastegari
We introduce a novel and generic convolutional unit, DiCE unit, that is built
using dimension-wise convolutions and dimension-wise fusion. The dimension-wise
convolutions apply light-weight convolutional filtering across each dimension
of the input tensor while dimension-wise fusion efficiently combines these
dimension-wise representations, allowing the DiCE unit to efficiently encode
spatial and channel-wise information contained in the input tensor. The DiCE
unit is simple and can be seamlessly integrated with any architecture to
improve its efficiency and performance. Compared to depth-wise separable
convolutions, the DiCE unit shows significant improvements across different
architectures. When DiCE units are stacked to build the DiCENet model, we
observe significant improvements over state-of-the-art models across various
computer vision tasks including image classification, object detection, and
semantic segmentation. On the ImageNet dataset, the DiCENet delivers 2-4%
higher accuracy than state-of-the-art manually designed models (e.g.,
MobileNetv2 and ShuffleNetv2). Also, DiCENet generalizes better to tasks (e.g.,
object detection) that are often used in resource-constrained devices in
comparison to state-of-the-art separable convolution-based efficient networks,
including neural search-based methods (e.g., MobileNetv3 and MixNet). Our source
code in PyTorch is open-source and is available at
https://github.com/sacmehta/EdgeNets/
Authors' comments: Accepted at IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI)
Radhika Vasisht, Ruchi Das
In this paper, we study continuum-wise expansive non-autonomous discrete
dynamical systems. We discuss various properties of such non-autonomous
systems. We further obtain results for cw-expansive non-autonomous systems with
shadowing property and obtain an important equivalence.
\keywords{Non-autonomous dynamical systems, continuum-wise expansive,
shadowing property}
Authors' comments: Some results may not be correct
George Retsinas, Athena Elafrou, Georgios Goumas, Petros Maragos
In this paper, we introduce Channel-wise recurrent convolutional neural networks (RecNets), a family of novel, compact neural network architectures for computer vision tasks inspired by recurrent neural networks (RNNs). RecNets build upon Channel-wise recurrent convolutional (CRC) layers, a novel type of convolutional layer that splits the input channels into disjoint segments and processes them in a recurrent fashion. In this way, we simulate wide, yet compact models, since the number of parameters is vastly reduced via the parameter sharing of the RNN formulation. Experimental results on the CIFAR-10 and CIFAR-100 image classification tasks demonstrate the superior size-accuracy trade-off of RecNets compared to other compact state-of-the-art architectures.
Roger Behling, J. -Yunier Bello-Cruz, Luiz-Rafael Santos
The elementary Euclidean concept of circumcenter has recently been employed to improve two aspects of the classical Douglas--Rachford method for projecting onto the intersection of affine subspaces. The so-called circumcentered-reflection method is able to both accelerate the average reflection scheme by the Douglas--Rachford method and cope with the intersection of more than two affine subspaces. We now introduce the technique of circumcentering in blocks, which, more than just an option over the basic algorithm of circumcenters, turns out to be an elegant manner of generalizing the method of alternating projections. Linear convergence for this novel block-wise circumcenter framework is derived and illustrated numerically. Furthermore, we prove that the original circumcentered-reflection method essentially finds the best approximation solution in one single step if the given affine subspaces are hyperplanes.
Qian Lou, Feng Guo, Lantao Liu, Minje Kim, Lei Jiang
Network quantization is one of the most hardware friendly techniques to
enable the deployment of convolutional neural networks (CNNs) on low-power
mobile devices. Recent network quantization techniques quantize each weight
kernel in a convolutional layer independently for higher inference accuracy,
since the weight kernels in a layer exhibit different variances and hence have
different amounts of redundancy. The quantization bitwidth or bit number (QBN)
directly decides the inference accuracy, latency, energy and hardware overhead.
To effectively reduce the redundancy and accelerate CNN inferences, various
weight kernels should be quantized with different QBNs. However, prior works
use only one QBN to quantize each convolutional layer or the entire CNN,
because the design space of searching a QBN for each weight kernel is too
large. The hand-crafted heuristic of the kernel-wise QBN search is so
sophisticated that domain experts can obtain only sub-optimal results. It is
difficult even for deep reinforcement learning (DRL) agents based on Deep Deterministic
Policy Gradient (DDPG) to find a kernel-wise QBN configuration that can
achieve reasonable inference accuracy. In this paper, we propose a
hierarchical-DRL-based kernel-wise network quantization technique, AutoQ, to
automatically search a QBN for each weight kernel, and choose another QBN for
each activation layer. Compared to the models quantized by the state-of-the-art
DRL-based schemes, on average, the same models quantized by AutoQ reduce the
inference latency by 54.06\%, and decrease the inference energy consumption by
50.69\%, while achieving the same inference accuracy.
Authors' comments: 10 pages, 12 figures
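To make the idea of kernel-wise bit-widths in the abstract above concrete, the sketch below quantizes each weight kernel of a layer with its own QBN. It assumes a generic symmetric uniform quantizer, which the abstract does not specify. The QBNs are supplied by hand rather than searched, so this shows only what a kernel-wise configuration means, not AutoQ's search procedure. All names are hypothetical.

```python
def quantize_kernel(weights, bits):
    """Symmetric uniform quantization of one kernel (flat list of floats).

    Returns (quantized_weights, scale). With `bits` bits the kernel is mapped
    onto integer levels in [-(2**(bits-1) - 1), 2**(bits-1) - 1] times `scale`.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 0.0          # all-zero kernel: nothing to do
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights], scale


def quantize_layer(kernels, qbns):
    """Quantize each kernel with its own bit-width (one QBN per kernel)."""
    if len(kernels) != len(qbns):
        raise ValueError("need exactly one QBN per kernel")
    return [quantize_kernel(k, b)[0] for k, b in zip(kernels, qbns)]


# Two kernels in one layer, given different (hand-picked) bit-widths.
kernels = [[0.5, -0.25, 0.1], [1.0, 0.3, -0.7]]
qbns = [2, 8]
quantized = quantize_layer(kernels, qbns)
```

The rounding error of each weight is bounded by half the kernel's scale, so a kernel given more bits is represented on a finer grid.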
Dániel Gerbner, Dániel T. Nagy, Balázs Patkós, Máté Vizer
In many proofs concerning extremal parameters of Berge hypergraphs one starts
by analyzing the part of the shadow graph that is contained in many
hyperedges. Capturing this phenomenon, we introduce two new types of
hypergraphs. A hypergraph $\mathcal{H}$ is a $t$-heavy copy of a graph $F$ if
there is a copy of $F$ on its vertex set such that each edge of $F$ is
contained in at least $t$ hyperedges of $\mathcal{H}$. $\mathcal{H}$ is a
$t$-wise Berge copy of $F$ if additionally for distinct edges of $F$ those $t$
hyperedges are distinct.
We extend known upper bounds on the Tur\'an number of Berge hypergraphs to
the $t$-wise Berge hypergraphs case. We asymptotically determine the Tur\'an
number of $t$-heavy and $t$-wise Berge copies of long paths and cycles and
exactly determine the Tur\'an number of $t$-heavy and $t$-wise Berge copies of
cliques.
In the case of 3-uniform hypergraphs, we consider the problem in more detail
and obtain additional results.
Authors' comments: 20 pages
Yingquan Wu, Eyal En Gad
In this paper we comprehensively investigate block-wise product (BWP) BCH
codes, wherein raw data is arranged in the form of a block-wise matrix and each
row and column BCH codes intersect on one data block. We first devise efficient
BCH decoding algorithms, including reduced-1-bit decoding, extra-1-bit list
decoding, and extra-2-bit list decoding. We next present a systematic
construction of BWP-BCH codes for given message and parity lengths that takes
performance, implementation and scalability into account, rather than
focusing on a regularly defined BWP-BCH code. It can easily accommodate
a different message length or parity length with minimal changes. It employs
extended BCH codes instead of BCH codes to reduce the miscorrection rate and an
inner RS code to lower the error floor. We also describe a high-speed scalable
encoder. We finally present a novel iterative decoding algorithm which is
divided into three phases. The first phase iteratively applies reduced BCH
correction capabilities to correct lightly corrupted rows/columns while
suppressing miscorrection, until the process stalls. The second phase
iteratively decodes up to the designed correction capabilities, until the
process stalls. The last phase iteratively applies the proposed list decoding
in a novel manner which effectively determines the correct candidate. The key
idea is to use cross decoding upon each list candidate to pick the candidate
which enables the maximum number of successful cross decodings. Our simulations
show that the proposed algorithm provides a significant performance boost
compared to the state-of-the-art algorithms.
Authors' comments: Submitted to IEEE trans. Info. Theory
Hyungtae Lee, Heesung Kwon, Wonkook Kim
Pixel-wise classification in remote sensing identifies entities in
large-scale satellite-based images at the pixel level. Few fully annotated
large-scale datasets for pixel-wise classification exist due to the challenges
of annotating individual pixels. Training data scarcity inevitably ensues from
the annotation challenge, leading to overfitting classifiers and degraded
classification performance. The lack of annotated pixels also necessarily
results in few hard examples of various entities critical for generating a
robust classification hyperplane. To overcome the problems of data scarcity
and lack of hard examples in training, we introduce a two-step hard example
generation (HEG) approach that first generates hard example candidates and then
mines actual hard examples. In the first step, a generator that creates hard
example candidates is learned via the adversarial learning framework by fooling
a discriminator and a pixel-wise classification model at the same time. In the
second step, mining is performed to build a fixed number of hard examples from
a large pool of real and artificially generated examples. To evaluate the
effectiveness of the proposed HEG approach, we design a 9-layer fully
convolutional network suitable for pixel-wise classification. Experiments show
that using generated hard examples from the proposed HEG approach improves the
pixel-wise classification model's accuracy on red tide detection and
hyperspectral image classification tasks.
Authors' comments: IEEE Journal of Selected Topics in Applied Earth Observations and
Remote Sensing (JSTARS)