Hongyang Gao, Shuiwang Ji
Attention operators have been widely applied in various fields, including
computer vision, natural language processing, and network embedding learning.
Attention operators on graph data enable learnable weights when aggregating
information from neighboring nodes. However, graph attention operators (GAOs)
consume excessive computational resources, preventing their applications on
large graphs. In addition, GAOs belong to the family of soft attention, instead
of hard attention, which has been shown to yield better performance. In this
work, we propose a novel hard graph attention operator (hGAO) and a channel-wise
graph attention operator (cGAO). hGAO uses the hard attention mechanism by
attending to only important nodes. Compared to GAO, hGAO improves performance
and saves computational cost by only attending to important nodes. To further
reduce the requirements on computational resources, we propose the cGAO that
performs attention operations along channels. cGAO avoids the dependency on the
adjacency matrix, leading to dramatic reductions in computational resource
requirements. Experimental results demonstrate that our proposed deep models
with the new operators achieve consistently better performance. Comparison
results also indicate that hGAO achieves significantly better performance than
GAO on both node and graph embedding tasks. Efficiency comparison shows that
our cGAO leads to dramatic savings in computational resources, making it
applicable to large graphs.
Authors' comments: 9 pages, KDD19
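To illustrate the hard attention idea in the abstract above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that node importance is a scalar projection of the features onto a vector `p`, that each node keeps only its `k` highest-scoring neighbors, and that dot-product attention is applied over that subset. The function name and all parameters are hypothetical.

```python
import numpy as np

def hard_graph_attention(X, adj, p, k):
    """Attend, for each node, only to its k most important neighbors.

    X   : (N, C) node features
    adj : (N, N) 0/1 adjacency matrix
    p   : (C,) projection vector used to score node importance (assumed)
    k   : number of neighbors kept per node
    """
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    scores = X @ p / np.linalg.norm(p)      # one importance score per node
    out = np.zeros_like(X)
    for i in range(X.shape[0]):
        nbrs = np.flatnonzero(adj[i])
        if nbrs.size == 0:                  # isolated node: keep its own features
            out[i] = X[i]
            continue
        # Hard selection: keep the k neighbors with the highest scores.
        top = nbrs[np.argsort(scores[nbrs])[::-1][:k]]
        # Soft attention restricted to the selected neighbors.
        logits = X[top] @ X[i]
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[i] = w @ X[top]
    return out
```

Because each node attends to at most `k` neighbors, the per-node cost no longer grows with the node degree, which is the source of the savings the abstract describes.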
Marc Brockschmidt
This paper presents a new Graph Neural Network (GNN) type using feature-wise
linear modulation (FiLM). Many standard GNN variants propagate information
along the edges of a graph by computing "messages" based only on the
representation of the source of each edge. In GNN-FiLM, the representation of
the target node of an edge is additionally used to compute a transformation
that can be applied to all incoming messages, allowing feature-wise modulation
of the passed information.
Results of experiments comparing different GNN architectures on three tasks
from the literature are presented, based on re-implementations of baseline
methods. Hyperparameters for all methods were found using extensive search,
yielding somewhat surprising results: differences between baseline models are
smaller than reported in the literature. Nonetheless, GNN-FiLM outperforms
baseline methods on a regression task on molecular graphs and performs
competitively on other tasks.
Authors' comments: As published in ICML 2020 proceedings
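The message-passing rule described in the abstract above can be sketched as follows. This is a simplified, single-edge-type illustration under stated assumptions: the modulation parameters are linear functions of the target representation (`W_gamma`, `W_beta`), the non-linearity is a ReLU, and messages are summed. All names are hypothetical and not taken from the paper's code.

```python
import numpy as np

def film_gnn_layer(H, edges, W, W_gamma, W_beta):
    """One FiLM-style message-passing step.

    H       : (N, D) node representations
    edges   : iterable of (source, target) index pairs
    W       : (D, D_out) message weights applied to the source node
    W_gamma : (D, D_out) weights producing the per-feature scale from the target
    W_beta  : (D, D_out) weights producing the per-feature shift from the target
    """
    H = np.asarray(H, dtype=float)
    msgs = H @ W            # messages depend only on the source node
    gamma = H @ W_gamma     # feature-wise scale, computed from the target node
    beta = H @ W_beta       # feature-wise shift, computed from the target node
    H_new = np.zeros((H.shape[0], W.shape[1]))
    for src, tgt in edges:
        # The target node modulates every incoming message feature by feature.
        H_new[tgt] += np.maximum(gamma[tgt] * msgs[src] + beta[tgt], 0.0)
    return H_new
```

Setting `gamma` to all ones and `beta` to zero recovers a plain source-only message, which makes the role of the target-dependent modulation easy to see.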
Grégoire Clarté, Christian P. Robert, Robin Ryder, Julien Stoehr
Approximate Bayesian computation methods are useful for generative models
with intractable likelihoods. These methods are however sensitive to the
dimension of the parameter space, requiring exponentially increasing resources
as this dimension grows. To tackle this difficulty, we explore a Gibbs version
of the ABC approach that runs component-wise approximate Bayesian computation
steps aimed at the corresponding conditional posterior distributions, and based
on summary statistics of reduced dimensions. While lacking the standard
justifications for the Gibbs sampler, the resulting Markov chain is shown to
converge in distribution under some partial independence conditions. The
associated stationary distribution can further be shown to be close to the true
posterior distribution and some hierarchical versions of the proposed mechanism
enjoy a closed form limiting distribution. Experiments also demonstrate the
gain in efficiency brought by the Gibbs version over the standard solution.
Authors' comments: 28 pages, 13 figures, third revision (accepted for publication in
Biometrika on 17 September, 2020)
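As a rough illustration of component-wise ABC steps, here is a toy sketch on a hypothetical hierarchical model: alpha ~ U(-4, 4), mu_j ~ N(alpha, 1), x_j ~ N(mu_j, 1). The model, the summary statistics, and the "keep the closest of `n_cand` prior draws" rule (used here in place of a tolerance-based rejection step) are all assumptions made for brevity, not the paper's exact algorithm.

```python
import numpy as np

def abc_gibbs(x_obs, n_iter=200, n_cand=50, seed=0):
    """Toy ABC-Gibbs sampler for a two-level Gaussian hierarchy (illustrative)."""
    rng = np.random.default_rng(seed)
    x_obs = np.asarray(x_obs, dtype=float)
    n = x_obs.size
    mu = np.zeros(n)
    alphas = np.empty(n_iter)
    for t in range(n_iter):
        # ABC step for alpha | mu: the summary statistic is the mean of mu.
        cand = rng.uniform(-4.0, 4.0, size=n_cand)
        sim_mu = rng.normal(cand[:, None], 1.0, size=(n_cand, n))
        dist = np.abs(sim_mu.mean(axis=1) - mu.mean())
        alpha = cand[np.argmin(dist)]
        # ABC step for each mu_j | alpha, x_j: the summary statistic is x_j itself.
        cand_mu = rng.normal(alpha, 1.0, size=(n_cand, n))
        sim_x = rng.normal(cand_mu, 1.0)
        best = np.argmin(np.abs(sim_x - x_obs), axis=0)
        mu = cand_mu[best, np.arange(n)]
        alphas[t] = alpha
    return alphas, mu
```

Each update compares a low-dimensional statistic (a single mean or a single observation) rather than the full data set, which is how the component-wise scheme sidesteps the dimension sensitivity mentioned in the abstract.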
Andrea Testa, Francesco Farina, Giuseppe Notarstefano
In this paper we deal with a network of computing agents with local processing and neighboring communication capabilities that aim at solving (without any central unit) a submodular optimization problem. The cost function is the sum of many local submodular functions and each agent in the network has access to one function in the sum only. In this distributed set-up, in order to preserve their own privacy, agents communicate with neighbors but do not share their local cost functions. We propose a distributed algorithm in which agents resort to the Lovász extension of their local submodular functions and perform local updates and communications in terms of single blocks of the entire optimization variable. Updates are performed by means of a greedy algorithm which is run only until the selected block is computed, thus resulting in a reduced computational burden. The proposed algorithm is shown to converge in expected value to the optimal cost of the problem, and an approximate solution to the submodular problem is retrieved by a thresholding operation. As an application, we consider a distributed image segmentation problem in which each agent has access only to a portion of the entire image. While agents cannot segment the entire image on their own, they correctly complete the task by cooperating through the proposed distributed algorithm.
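For readers unfamiliar with the Lovász extension, the sketch below shows the standard greedy computation of a subgradient and the thresholding step that turns a relaxed point back into a set. It is a centralized, single-function illustration only; the distributed block-wise updates of the paper are not reproduced, and the function names and the default threshold of 0.5 are assumptions.

```python
import numpy as np

def lovasz_subgradient(f, x):
    """Greedy subgradient of the Lovász extension of a set function f at x.

    f takes a frozenset of indices and returns a float, with f(empty set) = 0.
    The extension value at x is then the dot product of the subgradient and x.
    """
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    chosen = []
    prev = f(frozenset())
    # Visit coordinates in decreasing order of x and record marginal gains.
    for i in np.argsort(-x):
        chosen.append(int(i))
        cur = f(frozenset(chosen))
        g[i] = cur - prev
        prev = cur
    return g

def lovasz_extension(f, x):
    return float(lovasz_subgradient(f, x) @ np.asarray(x, dtype=float))

def threshold(x, tau=0.5):
    """Recover a set from a relaxed solution by keeping entries above tau."""
    return {i for i, xi in enumerate(x) if xi > tau}
```

For example, with `f = lambda S: min(len(S), 1)` the extension at `[0.2, 0.9, 0.5]` is the largest entry, since only the first element added has a non-zero marginal gain.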
Cao Vien Phung, Jasenka Dizdarevic, Admela Jukan
CoAP (Constrained Application Protocol) with block-wise transfer (BWT) option
is a known protocol choice for large data transfer in general lossy IoT network
environments. Lossy transmission environments on the other hand lead to CoAP
resending multiple blocks, which creates overheads. To tackle this problem, we
design a BWT with network coding (NC), with the goal of reducing the number of
unnecessary retransmissions. The results show a reduction in the number of
block retransmissions for different values of blocksize, implying a reduced
transfer time. For the maximum blocksize of 1024 bytes and a total loss
probability of 0.5, CoAP with NC can resend up to 5 times fewer blocks.
Authors' comments: 4 pages, 2 figures, submitted to Euro-Par 2019
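To show why coding can avoid retransmissions, here is a deliberately simple sketch that adds a single XOR parity block to a group of equal-sized blocks. It is an assumption-laden stand-in for network coding in general, not the scheme evaluated in the paper, and it does not model CoAP messages or the BWT option. All function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length payloads."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(blocks):
    """Return the original blocks followed by one XOR parity block."""
    parity = blocks[0]
    for blk in blocks[1:]:
        parity = xor_bytes(parity, blk)
    return list(blocks) + [parity]

def decode(received, n_blocks):
    """Rebuild the n_blocks originals from a dict {index: payload}.

    Index n_blocks holds the parity block. At most one original block may be
    missing; otherwise a retransmission is still required.
    """
    missing = [i for i in range(n_blocks) if i not in received]
    if len(missing) > 1 or (missing and n_blocks not in received):
        raise ValueError("too many losses; retransmission needed")
    if missing:
        acc = received[n_blocks]
        for i in range(n_blocks):
            if i != missing[0]:
                acc = xor_bytes(acc, received[i])
        received = {**received, missing[0]: acc}
    return [received[i] for i in range(n_blocks)]
```

With three data blocks and one parity block, losing any single data block is repaired locally by the receiver instead of triggering a resend.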
Sathyaprakash Narayanan, Yeshwanth Bethi, Chetan Singh Thakur
A vast amount of video data is generated every minute, for purposes ranging from surveillance to broadcasting. Two roadblocks restrain us from using this data as is. The first is storage: hardware constraints limit how much information can be stored. The second is computation: processing this data is so expensive that working on it becomes infeasible. Compressive sensing (CS) [2] is a signal processing technique [11] in which, through optimization, the sparsity of a signal is exploited to recover it from far fewer samples than required by the Shannon-Nyquist sampling theorem. Recovery is possible under two conditions. The first is sparsity, which requires the signal to be sparse in some domain. The second is incoherence, which is applied through the restricted isometry property and is sufficient for sparse signals [9][10]. To sustain these characteristics, preserving all attributes in the uncompressed domain would help any kind of research in this field. However, existing datasets fall short in continuous tracking of all the objects present in the scene; very few video datasets offer comprehensive continuous tracking of objects. To address these problems collectively, in this work we propose a new comprehensive video dataset in which the data is compressed using pixel-wise coded exposure [3], which resolves various other impediments.
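The idea of pixel-wise coded exposure can be sketched as follows: each pixel integrates light only during the frames selected by its own binary code, so a block of T frames collapses into a single coded image. The mask pattern below (each pixel exposed in exactly one randomly chosen frame) is an assumption chosen for simplicity and is not necessarily the pattern used for the dataset.

```python
import numpy as np

def single_frame_mask(T, H, W, seed=0):
    """Binary mask of shape (T, H, W) exposing each pixel in one random frame."""
    rng = np.random.default_rng(seed)
    t = rng.integers(0, T, size=(H, W))
    mask = np.zeros((T, H, W))
    mask[t, np.arange(H)[:, None], np.arange(W)[None, :]] = 1.0
    return mask

def coded_exposure(video, mask):
    """Compress T frames into one image by summing the exposed samples per pixel."""
    video = np.asarray(video, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if video.shape != mask.shape:
        raise ValueError("video and mask must both have shape (T, H, W)")
    return (video * mask).sum(axis=0)
```

The coded image has the size of a single frame, so storage drops by a factor of T, while the per-pixel codes retain temporal information that a sparse-recovery method can later exploit.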
Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. Humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to force the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thoughtful experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons.
Kui Jia, Jiehong Lin, Mingkui Tan, Dacheng Tao
Many machine learning problems are concerned with discovering or associating common patterns in data of multiple views or modalities. Multi-view learning is one of the methods to achieve such goals. Recent methods propose deep multi-view networks via adaptation of generic Deep Neural Networks (DNNs), which concatenate features of individual views at intermediate network layers (i.e., fusion layers). In this work, we study the problem of multi-view learning in such end-to-end networks. We take a regularization approach via multi-view learning criteria, and propose a novel, effective, and efficient neuron-wise correlation-maximizing regularizer. We implement our proposed regularizers collectively as a correlation-regularized network layer (CorrReg). CorrReg can be applied to either fully-connected or convolutional fusion layers, simply by replacing them with their CorrReg counterparts. By partitioning neurons of a hidden layer in generic DNNs into multiple subsets, we also consider a multi-view feature learning perspective of generic DNNs. Such a perspective enables us to study deep multi-view learning in the context of regularized network training, for which we present control experiments of benchmark image classification to show the efficacy of our proposed CorrReg. To investigate how CorrReg is useful for practical multi-view learning problems, we conduct experiments of RGB-D object/scene recognition and multi-view based 3D object recognition, using networks with fusion layers that concatenate intermediate features of individual modalities or views for subsequent classification. Applying CorrReg to fusion layers of these networks consistently improves classification performance. In particular, we achieve the new state of the art on the benchmark RGB-D object and RGB-D scene datasets. We make the implementation of CorrReg publicly available.
Charanjeet, Anuj Sharma
Second-order methods such as the Newton step are a suitable technique in online learning for guaranteeing a regret bound. With large data, storing second-order matrices such as the Hessian is a challenge for the Newton method. In this paper, we propose a modified online Newton step that stores first- and second-order matrices of dimension m (classes) by d (features). We use element-wise arithmetic operations to keep the matrix sizes the same. The reduced size of the second-order matrix results in faster computations. Also, the mistake rate is on par with popular methods in the literature. The experimental outcomes indicate that the proposed method could help handle large multi-class datasets on common desktop machines using a second-order method such as the Newton step.
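One way to picture an element-wise second-order update is the sketch below, which keeps an m-by-d accumulator of squared gradients instead of a full Hessian-like matrix. The softmax cross-entropy loss, the accumulator initialisation, and the exact update rule are assumptions made for illustration; the paper's update may differ.

```python
import numpy as np

def elementwise_ons_step(W, A, x, y, lr=1.0):
    """One online update with an element-wise second-order accumulator.

    W  : (m, d) weight matrix (m classes, d features)
    A  : (m, d) running sum of squared gradients, initialised to a positive value
    x  : (d,) feature vector
    y  : index of the true class
    lr : step size (assumed)
    """
    scores = W @ x
    p = np.exp(scores - scores.max())
    p /= p.sum()
    p[y] -= 1.0                 # gradient of softmax cross-entropy w.r.t. scores
    G = np.outer(p, x)          # first-order matrix, shape (m, d)
    A = A + G * G               # element-wise second-order information, shape (m, d)
    W = W - lr * G / A          # element-wise scaled step; A stays positive
    return W, A
```

Both `G` and `A` have the same m-by-d shape as `W`, so memory grows linearly in the number of features rather than quadratically.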
Marco Dinarelli, Loïc Grobol
During the last couple of years, Recurrent Neural Networks (RNN) have reached
state-of-the-art performances on most of the sequence modelling problems. In
particular, the "sequence to sequence" model and the neural CRF have proved to
be very effective in this domain. In this article, we propose a new RNN
architecture for sequence labelling, leveraging gated recurrent layers to take
arbitrarily long contexts into account, and using two decoders operating
forward and backward. We compare several variants of the proposed solution and
their performances to the state-of-the-art. Most of our results are better than
the state-of-the-art or very close to it and thanks to the use of recent
technologies, our architecture can scale on corpora larger than those used in
this work.
Authors' comments: Slightly improved version of the paper accepted to the CICling 2019
conference
Quay Au, Daniel Schalk, Giuseppe Casalicchio, Ramona Schoedel, Clemens Stachl, Bernd Bischl
Multi-output prediction deals with the prediction of several targets of possibly diverse types. One way to address this problem is the so-called problem transformation method. This method is often used in multi-label learning, but can also be used for multi-output prediction due to its generality and simplicity. In this paper, we introduce an algorithm that uses the problem transformation method for multi-output prediction, while simultaneously learning the dependencies between target variables in a sparse and interpretable manner. In a first step, predictions are obtained for each target individually. Target dependencies are then learned via a component-wise boosting approach. We compare our new method with similar approaches in a benchmark using multi-label, multivariate regression and mixed-type datasets.
Dahuin Jung, Ho Bae, Hyun-Soo Choi, Sungroh Yoon
Recently, the field of steganography has experienced rapid developments based
on deep learning (DL). DL based steganography distributes secret information
over all the available bits of the cover image, thereby posing difficulties in
using conventional steganalysis methods to detect, extract or remove hidden
secret images. However, our proposed framework is the first to effectively
disable covert communications and transactions that use DL based steganography.
We propose a DL based steganalysis technique that effectively removes secret
images by restoring the distribution of the original images. We formulate a
problem and address it by exploiting sophisticated pixel distributions and an
edge distribution of images by using a deep neural network. Based on the given
information, we remove the hidden secret information at the pixel level. We
evaluate our technique by comparing it with conventional steganalysis methods
using three public benchmarks. As the decoding method of DL based steganography
is approximate (lossy) and is different from the decoding method of
conventional steganography, we also introduce a new quantitative metric called
the destruction rate (DT). The experimental results demonstrate performance
improvements of 10-20% in both the decoded rate and the DT.
Authors' comments: IEEE TDSC
Zehao Yu, Jia Zheng, Dongze Lian, Zihan Zhou, Shenghua Gao
Single-image piece-wise planar 3D reconstruction aims to simultaneously
segment plane instances and recover 3D plane parameters from an image. Most
recent approaches leverage convolutional neural networks (CNNs) and achieve
promising results. However, these methods are limited to detecting a fixed
number of planes with a certain learned order. To tackle this problem, we propose
a novel two-stage method based on associative embedding, inspired by its recent
success in instance segmentation. In the first stage, we train a CNN to map
each pixel to an embedding space where pixels from the same plane instance have
similar embeddings. Then, the plane instances are obtained by grouping the
embedding vectors in planar regions via an efficient mean shift clustering
algorithm. In the second stage, we estimate the parameter for each plane
instance by considering both pixel-level and instance-level consistencies. With
the proposed method, we are able to detect an arbitrary number of planes.
Extensive experiments on public datasets validate the effectiveness and
efficiency of our method. Furthermore, our method runs at 30 fps at test
time and could thus facilitate many real-time applications such as visual SLAM and
human-robot interaction. Code is available at
https://github.com/svip-lab/PlanarReconstruction.
Authors' comments: Minor Revision
Sixue Gong, Yichun Shi, Anil K. Jain
We propose a new approach to video face recognition. Our component-wise feature aggregation network (C-FAN) accepts a set of face images of a subject as an input, and outputs a single feature vector as the face representation of the set for the recognition task. The whole network is trained in two steps: (i) train a base CNN for still image face recognition; (ii) add an aggregation module to the base network to learn the quality value for each feature component, which adaptively aggregates deep feature vectors into a single vector to represent the face in a video. C-FAN automatically learns to retain salient face features with high quality scores while suppressing features with low quality scores. The experimental results on three benchmark datasets, YouTube Faces, IJB-A, and IJB-S show that the proposed C-FAN network is capable of generating a compact feature vector with 512 dimensions for a video sequence by efficiently aggregating feature vectors of all the video frames to achieve state of the art performance.
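The component-wise aggregation described above can be sketched as a per-dimension weighted average over frames. The softmax normalisation across frames is an assumption made here for illustration, and the quality scores are simply passed in rather than predicted by a network.

```python
import numpy as np

def aggregate_componentwise(features, quality):
    """Fuse T frame-level feature vectors into one vector.

    features : (T, D) deep features, one row per video frame
    quality  : (T, D) quality score for every component of every frame
    """
    features = np.asarray(features, dtype=float)
    quality = np.asarray(quality, dtype=float)
    # Normalise the scores over frames, independently for each component.
    q = quality - quality.max(axis=0, keepdims=True)
    w = np.exp(q)
    w /= w.sum(axis=0, keepdims=True)
    return (w * features).sum(axis=0)
```

With equal quality scores the result is the plain average of the frames; a high score for one component in one frame lets that frame dominate only that component of the output.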
Shuyu Lin, Ronald Clark, Robert Birke, Niki Trigoni, Stephen Roberts
Variational Auto-encoders (VAEs) have been very successful as methods for
forming compressed latent representations of complex, often high-dimensional,
data. In this paper, we derive an alternative variational lower bound from the
one common in VAEs, which aims to minimize aggregate information loss. Using
our lower bound as the objective function for an auto-encoder enables us to
place a prior on the bulk statistics, corresponding to an aggregate posterior
for the entire dataset, as opposed to a single sample posterior as in the
original VAE. This alternative form of prior constraint allows individual
posteriors more flexibility to preserve necessary information for good
reconstruction quality. We further derive an analytic approximation to our
lower bound, leading to an efficient learning algorithm - WiSE-ALE. Through
various examples, we demonstrate that WiSE-ALE can reach excellent
reconstruction quality in comparison to other state-of-the-art VAE models,
while still retaining the ability to learn a smooth, compact representation.
Authors' comments: 18 pages, appendix included
Xinrui Cui, Dan Wang, Z. Jane Wang
With the widespread applications of deep convolutional neural networks
(DCNNs), it becomes increasingly important for DCNNs not only to make accurate
predictions but also to explain how they make their decisions. In this work, we
propose a CHannel-wise disentangled InterPretation (CHIP) model to give the
visual interpretation to the predictions of DCNNs. The proposed model distills
the class-discriminative importance of channels in networks by utilizing the
sparse regularization. Here, we first introduce the network perturbation
technique to learn the model. The proposed model is capable of not only distilling
the global-perspective knowledge from networks but also presenting the
class-discriminative visual interpretation for specific predictions of
networks. It is noteworthy that the proposed model is able to interpret
different layers of networks without re-training. By combining the distilled
interpretation knowledge in different layers, we further propose the Refined
CHIP visual interpretation that is both high-resolution and
class-discriminative. Experimental results on the standard dataset demonstrate
that the proposed model provides promising visual interpretation for the
predictions of networks in the image classification task compared with existing
visual interpretation methods. Besides, the proposed method outperforms related
approaches on the ILSVRC 2015 weakly-supervised localization
task.
Authors' comments: 15 pages, 10 figures
Matan Shoef, Sharon Fogel, Daniel Cohen-Or
We present a novel approach to learning a point-wise, meaningful embedding for point-clouds in an unsupervised manner, through the use of neural-networks. The domain of point-cloud processing via neural-networks is rapidly evolving, with novel architectures and applications frequently emerging. Within this field of research, the availability and plethora of unlabeled point-clouds as well as their possible applications make finding ways of characterizing this type of data appealing. Though significant advancement was achieved in the realm of unsupervised learning, its adaptation to the point-cloud representation is not trivial. Previous research focuses on the embedding of entire point-clouds representing an object in a meaningful manner. We present a deep learning framework to learn point-wise description from a set of shapes without supervision. Our approach leverages self-supervision to define a relevant loss function to learn rich per-point features. We train a neural-network with objectives based on context derived directly from the raw data, with no added annotation. We use local structures of point-clouds to incorporate geometric information into each point's latent representation. In addition to using local geometric information, we encourage adjacent points to have similar representations and vice versa, creating a smoother, more descriptive representation. We demonstrate the ability of our method to capture meaningful point-wise features through three applications. By clustering the learned embedding space, we perform unsupervised part-segmentation on point clouds. By calculating Euclidean distance in the latent space, we derive semantic point-analogies. Finally, by retrieving nearest-neighbors in our learned latent space we present meaningful point-correspondence within and among point-clouds.
Fei Xue, Annie Qu
For multi-source data, blocks of variable information from certain sources
are likely missing. Existing methods for handling missing data do not take
structures of block-wise missing data into consideration. In this paper, we
propose a Multiple Block-wise Imputation (MBI) approach, which incorporates
imputations based on both complete and incomplete observations. Specifically,
for a given missing pattern group, the imputations in MBI incorporate more
samples from groups with fewer observed variables in addition to the group with
complete observations. We propose to construct estimating equations based on
all available information, and optimally integrate informative estimating
functions to achieve efficient estimators. We show that the proposed method has
estimation and model selection consistency under both fixed-dimensional and
high-dimensional settings. Moreover, the proposed estimator is asymptotically
more efficient than the estimator based on a single imputation from complete
observations only. In addition, the proposed method is not restricted to
missing completely at random. Numerical studies and ADNI data application
confirm that the proposed method outperforms existing variable selection
methods under various missing mechanisms.
Authors' comments: 35 pages, 2 figures, accepted for publication in Journal of the
American Statistical Association
Paul N. Beuchat, Joseph Warrington, John Lygeros
We describe an approximate dynamic programming approach to compute lower
bounds on the optimal value function for a discrete time, continuous space,
infinite horizon setting. The approach iteratively constructs a family of lower
bounding approximate value functions by using the so-called Bellman inequality.
The novelty of our approach is that, at each iteration, we aim to compute an
approximate value function that maximizes the point-wise maximum taken with the
family of approximate value functions computed thus far. This leads to a
non-convex objective, and we propose a gradient ascent algorithm to find
stationary points by solving a sequence of convex optimization problems. We
provide convergence guarantees for our algorithm and an interpretation for how
the gradient computation relates to the state relevance weighting parameter
appearing in related approximate dynamic programming approaches. We demonstrate
through numerical examples that, when compared to existing approaches, the
algorithm we propose computes tighter sub-optimality bounds with less
computation time.
Authors' comments: 14 pages, 3 figures
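The point-wise maximum mentioned in the abstract above is easy to state in code: if every member of a family is a lower bound on the optimal value function, so is their maximum at each state. The quadratic parameterisation below is an assumption used only to make the sketch concrete.

```python
import numpy as np

def pointwise_max_value(x, quadratics):
    """Evaluate max_j V_j(x) for V_j(x) = x^T P x + p^T x + c.

    quadratics : list of (P, p, c) tuples with P of shape (n, n),
                 p of shape (n,), and c a scalar
    """
    x = np.asarray(x, dtype=float)
    return max(float(x @ P @ x + p @ x + c) for P, p, c in quadratics)
```

Adding a new function to the family can only raise the maximum, so the bound tightens monotonically as iterations proceed; the price is that the objective involving this maximum is no longer convex in the new function's parameters.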
Hyun-Joo Jung, Jaedeok Kim, Yoonsuck Choe
Various forms of representations may arise in the many layers embedded in
deep neural networks (DNNs). Of these, where can we find the most compact
representation? We propose to use a pruning framework to answer this question:
How compactly can each layer be compressed, without losing performance? Most of
the existing DNN compression methods do not consider the relative
compressibility of the individual layers. They uniformly apply a single target
sparsity to all layers or adapt layer sparsity using heuristics and additional
training. We propose a principled method that automatically determines the
sparsity of individual layers derived from the importance of each layer. To do
this, we consider a metric to measure the importance of each layer based on the
layer-wise capacity. Given the trained model and the total target sparsity, we
first evaluate the importance of each layer from the model. From the evaluated
importance, we compute the layer-wise sparsity of each layer. The proposed
method can be applied to any DNN architecture and can be combined with any
pruning method that takes the total target sparsity as a parameter. To validate
the proposed method, we carried out an image classification task with two types
of DNN architectures on two benchmark datasets and used three pruning methods
for compression. In the case of the VGG-16 model with weight pruning on the ImageNet
dataset, we achieved up to 75% (17.5% on average) better top-5 accuracy than
the baseline under the same total target sparsity. Furthermore, we analyzed
where the maximum compression can occur in the network. This kind of analysis
can help us identify the most compact representation within a deep neural
network.
Authors' comments: Accepted to AAAI 2019 Workshop on Network Interpretability for Deep
Learning