Charanjeet, Anuj Sharma
A second-order method such as the Newton step is a suitable technique in online learning for guaranteeing a regret bound. With large data, storing second-order matrices such as the Hessian is a challenge for the Newton method. In this paper, we propose a modified online Newton step that stores first- and second-order matrices of dimension m (classes) by d (features). We use element-wise arithmetic operations to keep the matrix sizes the same. The reduced size of the second-order matrix results in faster computations. Also, the mistake rate is on par with popular methods in the literature. The experimental outcomes indicate that the proposed method could help handle large multi-class datasets on common desktop machines using a second-order method such as the Newton step.
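As a rough illustration of the idea in the abstract above, the following sketch keeps both the weight matrix and a second-order accumulator at size m by d and updates them with element-wise operations only. The mistake-driven multiclass update, the squared-gradient accumulator, and the names `epsilon` and `learning_rate` are assumptions made for this sketch (it amounts to a diagonal approximation of an online Newton step), not the authors' exact update rule.

```python
import numpy as np


class ElementwiseNewtonStep:
    """Mistake-driven multiclass learner with m-by-d first- and second-order state."""

    def __init__(self, num_classes, num_features, epsilon=1.0, learning_rate=1.0):
        # First-order state: one weight row per class (m by d).
        self.weights = np.zeros((num_classes, num_features))
        # Second-order state has the same m-by-d shape, instead of a d-by-d Hessian.
        self.second_order = np.full((num_classes, num_features), epsilon)
        self.learning_rate = learning_rate

    def predict(self, x):
        return int(np.argmax(self.weights @ x))

    def update(self, x, label):
        """Update on a mistake; return True if the prediction was wrong."""
        predicted = self.predict(x)
        if predicted == label:
            return False
        # Gradient of a perceptron-style multiclass loss (an assumption of this sketch).
        gradient = np.zeros_like(self.weights)
        gradient[predicted] = x
        gradient[label] = -x
        # Element-wise accumulation and element-wise scaled step.
        self.second_order += gradient * gradient
        self.weights -= self.learning_rate * gradient / self.second_order
        return True
```

Because every operation is element-wise, memory grows with m times d rather than with the square of the feature dimension.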
Marco Dinarelli, Loïc Grobol
During the last couple of years, Recurrent Neural Networks (RNN) have reached
state-of-the-art performances on most of the sequence modelling problems. In
particular, the "sequence to sequence" model and the neural CRF have proved to
be very effective in this domain. In this article, we propose a new RNN
architecture for sequence labelling, leveraging gated recurrent layers to take
arbitrarily long contexts into account, and using two decoders operating
forward and backward. We compare several variants of the proposed solution and
their performances to the state-of-the-art. Most of our results are better than
the state-of-the-art or very close to it and thanks to the use of recent
technologies, our architecture can scale on corpora larger than those used in
this work.
Authors' comments: Slightly improved version of the paper accepted to the CICling 2019
conference
Quay Au, Daniel Schalk, Giuseppe Casalicchio, Ramona Schoedel, Clemens Stachl, Bernd Bischl
Multi-output prediction deals with the prediction of several targets of possibly diverse types. One way to address this problem is the so-called problem transformation method. This method is often used in multi-label learning, but can also be used for multi-output prediction due to its generality and simplicity. In this paper, we introduce an algorithm that uses the problem transformation method for multi-output prediction, while simultaneously learning the dependencies between target variables in a sparse and interpretable manner. In a first step, predictions are obtained for each target individually. Target dependencies are then learned via a component-wise boosting approach. We compare our new method with similar approaches in a benchmark using multi-label, multivariate regression, and mixed-type datasets.
Dahuin Jung, Ho Bae, Hyun-Soo Choi, Sungroh Yoon
Recently, the field of steganography has experienced rapid developments based
on deep learning (DL). DL based steganography distributes secret information
over all the available bits of the cover image, thereby posing difficulties in
using conventional steganalysis methods to detect, extract or remove hidden
secret images. However, our proposed framework is the first to effectively
disable covert communications and transactions that use DL based steganography.
We propose a DL based steganalysis technique that effectively removes secret
images by restoring the distribution of the original images. We formulate a
problem and address it by exploiting sophisticated pixel distributions and an
edge distribution of images by using a deep neural network. Based on the given
information, we remove the hidden secret information at the pixel level. We
evaluate our technique by comparing it with conventional steganalysis methods
using three public benchmarks. As the decoding method of DL based steganography
is approximate (lossy) and is different from the decoding method of
conventional steganography, we also introduce a new quantitative metric called
the destruction rate (DT). The experimental results demonstrate performance
improvements of 10-20% in both the decoded rate and the DT.
Authors' comments: IEEE TDSC
Zehao Yu, Jia Zheng, Dongze Lian, Zihan Zhou, Shenghua Gao
Single-image piece-wise planar 3D reconstruction aims to simultaneously
segment plane instances and recover 3D plane parameters from an image. Most
recent approaches leverage convolutional neural networks (CNNs) and achieve
promising results. However, these methods are limited to detecting a fixed
number of planes with a certain learned order. To tackle this problem, we propose
a novel two-stage method based on associative embedding, inspired by its recent
success in instance segmentation. In the first stage, we train a CNN to map
each pixel to an embedding space where pixels from the same plane instance have
similar embeddings. Then, the plane instances are obtained by grouping the
embedding vectors in planar regions via an efficient mean shift clustering
algorithm. In the second stage, we estimate the parameter for each plane
instance by considering both pixel-level and instance-level consistencies. With
the proposed method, we are able to detect an arbitrary number of planes.
Extensive experiments on public datasets validate the effectiveness and
efficiency of our method. Furthermore, our method runs at 30 fps at test
time and thus could facilitate many real-time applications such as visual SLAM and
human-robot interaction. Code is available at
https://github.com/svip-lab/PlanarReconstruction.
Authors' comments: Minor Revision
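The grouping step in the plane-reconstruction abstract above (clustering per-pixel embeddings with mean shift) can be sketched as follows. This is a plain Gaussian-kernel mean shift, not the efficient variant the authors refer to; the `bandwidth` value, the fixed iteration count, and the rule for merging nearby modes are assumptions for illustration.

```python
import numpy as np


def mean_shift_cluster(embeddings, bandwidth=0.5, num_iters=20):
    """Group embedding vectors by shifting each one toward a density mode."""
    embeddings = np.asarray(embeddings, dtype=float)
    modes = embeddings.copy()
    for _ in range(num_iters):
        # Squared distances between current modes and all original points.
        diffs = modes[:, None, :] - embeddings[None, :, :]
        sq_dists = (diffs ** 2).sum(axis=-1)
        kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
        # Each mode moves to the kernel-weighted mean of the data.
        modes = kernel @ embeddings / kernel.sum(axis=1, keepdims=True)

    # Merge modes that ended up within one bandwidth of each other.
    centers = []
    labels = np.empty(len(modes), dtype=int)
    for index, mode in enumerate(modes):
        for center_index, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth:
                labels[index] = center_index
                break
        else:
            centers.append(mode)
            labels[index] = len(centers) - 1
    return labels, np.array(centers)
```

The number of clusters is not fixed in advance, which matches the abstract's point that an arbitrary number of plane instances can be detected.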
Sixue Gong, Yichun Shi, Anil K. Jain
We propose a new approach to video face recognition. Our component-wise feature aggregation network (C-FAN) accepts a set of face images of a subject as an input, and outputs a single feature vector as the face representation of the set for the recognition task. The whole network is trained in two steps: (i) train a base CNN for still image face recognition; (ii) add an aggregation module to the base network to learn the quality value for each feature component, which adaptively aggregates deep feature vectors into a single vector to represent the face in a video. C-FAN automatically learns to retain salient face features with high quality scores while suppressing features with low quality scores. The experimental results on three benchmark datasets, YouTube Faces, IJB-A, and IJB-S show that the proposed C-FAN network is capable of generating a compact feature vector with 512 dimensions for a video sequence by efficiently aggregating feature vectors of all the video frames to achieve state of the art performance.
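A minimal sketch of component-wise aggregation as described in the C-FAN abstract: each feature component of each frame gets its own quality value, and the frames are combined per component. Normalizing the quality values with a softmax over frames is an assumption of this sketch, and in the paper the quality values come from a learned aggregation module rather than being passed in.

```python
import numpy as np


def aggregate_componentwise(features, quality_scores):
    """Fuse per-frame features of shape (num_frames, dim) into one vector of shape (dim,).

    quality_scores has the same shape as features: one score per frame and component.
    """
    features = np.asarray(features, dtype=float)
    quality_scores = np.asarray(quality_scores, dtype=float)
    # Softmax over the frame axis, computed separately for every component.
    shifted = quality_scores - quality_scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * features).sum(axis=0)
```

With equal quality scores this reduces to average pooling over frames; strongly unequal scores let a single frame dominate an individual component.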
Shuyu Lin, Ronald Clark, Robert Birke, Niki Trigoni, Stephen Roberts
Variational Auto-encoders (VAEs) have been very successful as methods for
forming compressed latent representations of complex, often high-dimensional,
data. In this paper, we derive an alternative variational lower bound from the
one common in VAEs, which aims to minimize aggregate information loss. Using
our lower bound as the objective function for an auto-encoder enables us to
place a prior on the bulk statistics, corresponding to an aggregate posterior
for the entire dataset, as opposed to a single sample posterior as in the
original VAE. This alternative form of prior constraint allows individual
posteriors more flexibility to preserve necessary information for good
reconstruction quality. We further derive an analytic approximation to our
lower bound, leading to an efficient learning algorithm - WiSE-ALE. Through
various examples, we demonstrate that WiSE-ALE can reach excellent
reconstruction quality in comparison to other state-of-the-art VAE models,
while still retaining the ability to learn a smooth, compact representation.
Authors' comments: 18 pages, appendix included
Xinrui Cui, Dan Wang, Z. Jane Wang
With the widespread applications of deep convolutional neural networks
(DCNNs), it becomes increasingly important for DCNNs not only to make accurate
predictions but also to explain how they make their decisions. In this work, we
propose a CHannel-wise disentangled InterPretation (CHIP) model to provide
visual interpretations of the predictions of DCNNs. The proposed model distills
the class-discriminative importance of channels in networks using sparse
regularization. Here, we first introduce a network perturbation
technique to learn the model. The proposed model is capable of not only distilling
global-perspective knowledge from networks but also presenting
class-discriminative visual interpretations for specific predictions of
networks. It is noteworthy that the proposed model is able to interpret
different layers of networks without re-training. By combining the distilled
interpretation knowledge in different layers, we further propose the Refined
CHIP visual interpretation that is both high-resolution and
class-discriminative. Experimental results on a standard dataset demonstrate
that the proposed model provides promising visual interpretations of the
predictions of networks in the image classification task compared with existing
visual interpretation methods. In addition, the proposed method outperforms related
approaches on the ILSVRC 2015 weakly-supervised localization
task.
Authors' comments: 15 pages, 10 figures
Matan Shoef, Sharon Fogel, Daniel Cohen-Or
We present a novel approach to learning a point-wise, meaningful embedding for point clouds in an unsupervised manner, through the use of neural networks. The domain of point-cloud processing via neural networks is rapidly evolving, with novel architectures and applications frequently emerging. Within this field of research, the abundance of unlabeled point clouds as well as their possible applications make finding ways of characterizing this type of data appealing. Though significant advances have been achieved in unsupervised learning, its adaptation to the point-cloud representation is not trivial. Previous research focuses on embedding entire point clouds representing an object in a meaningful manner. We present a deep learning framework to learn point-wise descriptions from a set of shapes without supervision. Our approach leverages self-supervision to define a relevant loss function to learn rich per-point features. We train a neural network with objectives based on context derived directly from the raw data, with no added annotation. We use local structures of point clouds to incorporate geometric information into each point's latent representation. In addition to using local geometric information, we encourage adjacent points to have similar representations and vice versa, creating a smoother, more descriptive representation. We demonstrate the ability of our method to capture meaningful point-wise features through three applications. By clustering the learned embedding space, we perform unsupervised part segmentation on point clouds. By calculating Euclidean distance in the latent space, we derive semantic point analogies. Finally, by retrieving nearest neighbors in our learned latent space, we present meaningful point correspondence within and among point clouds.
Fei Xue, Annie Qu
For multi-source data, blocks of variable information from certain sources
are likely missing. Existing methods for handling missing data do not take
structures of block-wise missing data into consideration. In this paper, we
propose a Multiple Block-wise Imputation (MBI) approach, which incorporates
imputations based on both complete and incomplete observations. Specifically,
for a given missing pattern group, the imputations in MBI incorporate more
samples from groups with fewer observed variables in addition to the group with
complete observations. We propose to construct estimating equations based on
all available information, and optimally integrate informative estimating
functions to achieve efficient estimators. We show that the proposed method has
estimation and model selection consistency under both fixed-dimensional and
high-dimensional settings. Moreover, the proposed estimator is asymptotically
more efficient than the estimator based on a single imputation from complete
observations only. In addition, the proposed method is not restricted to
missing completely at random. Numerical studies and an application to ADNI data
confirm that the proposed method outperforms existing variable selection
methods under various missing mechanisms.
Authors' comments: 35 pages, 2 figures, accepted for publication in Journal of the
American Statistical Association
Paul N. Beuchat, Joseph Warrington, John Lygeros
We describe an approximate dynamic programming approach to compute lower
bounds on the optimal value function for a discrete time, continuous space,
infinite horizon setting. The approach iteratively constructs a family of lower
bounding approximate value functions by using the so-called Bellman inequality.
The novelty of our approach is that, at each iteration, we aim to compute an
approximate value function that maximizes the point-wise maximum taken with the
family of approximate value functions computed thus far. This leads to a
non-convex objective, and we propose a gradient ascent algorithm to find
stationary points by solving a sequence of convex optimization problems. We
provide convergence guarantees for our algorithm and an interpretation for how
the gradient computation relates to the state relevance weighting parameter
appearing in related approximate dynamic programming approaches. We demonstrate
through numerical examples that, when compared to existing approaches, the
algorithm we propose computes tighter sub-optimality bounds with less
computation time.
Authors' comments: 14 pages, 3 figures
Hyun-Joo Jung, Jaedeok Kim, Yoonsuck Choe
Various forms of representations may arise in the many layers embedded in
deep neural networks (DNNs). Of these, where can we find the most compact
representation? We propose to use a pruning framework to answer this question:
How compactly can each layer be compressed without losing performance? Most of
the existing DNN compression methods do not consider the relative
compressibility of the individual layers. They uniformly apply a single target
sparsity to all layers or adapt layer sparsity using heuristics and additional
training. We propose a principled method that automatically determines the
sparsity of individual layers derived from the importance of each layer. To do
this, we consider a metric to measure the importance of each layer based on the
layer-wise capacity. Given the trained model and the total target sparsity, we
first evaluate the importance of each layer from the model. From the evaluated
importance, we compute the layer-wise sparsity of each layer. The proposed
method can be applied to any DNN architecture and can be combined with any
pruning method that takes the total target sparsity as a parameter. To validate
the proposed method, we carried out an image classification task with two types
of DNN architectures on two benchmark datasets and used three pruning methods
for compression. In the case of the VGG-16 model with weight pruning on the ImageNet
dataset, we achieved up to 75% (17.5% on average) better top-5 accuracy than
the baseline under the same total target sparsity. Furthermore, we analyzed
where the maximum compression can occur in the network. This kind of analysis
can help us identify the most compact representation within a deep neural
network.
Authors' comments: Accepted to AAAI 2019 Workshop on Network Interpretability for Deep
Learning
Seung-Geon Lee, Jaedeok Kim, Hyun-Joo Jung, Yoonsuck Choe
Estimating the relative importance of each sample in a training set has
important practical and theoretical value, such as in importance sampling or
curriculum learning. This kind of focus on individual samples invokes the
concept of sample-wise learnability: How easy is it to correctly learn each
sample (cf. PAC learnability)? In this paper, we approach the sample-wise
learnability problem within a deep learning context. We propose a measure of
the learnability of a sample with a given deep neural network (DNN) model. The
basic idea is to train the given model on the training set, and for each
sample, aggregate the hits and misses over all training epochs. Our
experiments show that the sample-wise learnability measure collected this way
is highly linearly correlated across different DNN models (ResNet-20, VGG-16,
and MobileNet), suggesting that such a measure can provide deep general
insights on the data's properties. We expect our method to help develop better
curricula for training, and help us better understand the data itself.
Authors' comments: Accepted to AAAI 2019 Student Abstract
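The hit-and-miss aggregation described in the sample-wise learnability abstract above can be sketched in a few lines. Using the fraction of epochs in which a sample is classified correctly as the aggregate is an assumption of this sketch; the abstract does not state the exact formula, and the function and argument names are hypothetical.

```python
import numpy as np


def samplewise_learnability(predictions_per_epoch, labels):
    """Return, for each training sample, the fraction of epochs with a correct prediction.

    predictions_per_epoch: array-like of shape (num_epochs, num_samples).
    labels: array-like of shape (num_samples,).
    """
    predictions = np.asarray(predictions_per_epoch)
    hits = predictions == np.asarray(labels)[None, :]
    return hits.mean(axis=0)
```

Scores collected this way for two different models could then be compared with `np.corrcoef` to check the kind of linear correlation the abstract reports.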
Sida Peng, Yuan Liu, Qixing Huang, Hujun Bao, Xiaowei Zhou
This paper addresses the challenge of 6DoF pose estimation from a single RGB
image under severe occlusion or truncation. Many recent works have shown that a
two-stage approach, which first detects keypoints and then solves a
Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable
performance. However, most of these methods only localize a set of sparse
keypoints by regressing their image coordinates or heatmaps, which are
sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise
Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the
keypoints and use these vectors to vote for keypoint locations using RANSAC.
This creates a flexible representation for localizing occluded or truncated
keypoints. Another important feature of this representation is that it provides
uncertainties of keypoint locations that can be further leveraged by the PnP
solver. Experiments show that the proposed approach outperforms the state of
the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large
margin, while being efficient for real-time pose estimation. We further create
a Truncation LINEMOD dataset to validate the robustness of our approach against
truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
Authors' comments: The first two authors contributed equally to this paper. Project
page: https://zju-3dv.github.io/pvnet/
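The voting idea in the PVNet abstract above (per-pixel unit vectors pointing at a keypoint, combined by RANSAC) can be illustrated with the sketch below. It samples pairs of pixels, intersects their rays to form a keypoint hypothesis, and scores each hypothesis by how many pixels point toward it. The function name, the cosine threshold, the number of hypotheses, and returning only the best hypothesis are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def vote_keypoint(pixels, directions, num_hypotheses=64, cos_threshold=0.99, seed=0):
    """RANSAC-style voting for a 2D keypoint from per-pixel unit vectors.

    pixels: (n, 2) pixel coordinates; directions: (n, 2) unit vectors.
    Returns the best hypothesis and its inlier count.
    """
    rng = np.random.default_rng(seed)
    num_pixels = len(pixels)
    best_point, best_score = None, -1
    for _ in range(num_hypotheses):
        i, j = rng.choice(num_pixels, size=2, replace=False)
        # Solve pixels[i] + s * directions[i] == pixels[j] + t * directions[j].
        system = np.column_stack([directions[i], -directions[j]])
        if abs(np.linalg.det(system)) < 1e-8:
            continue  # (nearly) parallel rays give no intersection
        s, _ = np.linalg.solve(system, pixels[j] - pixels[i])
        hypothesis = pixels[i] + s * directions[i]

        # A pixel votes for the hypothesis if its vector points toward it.
        offsets = hypothesis - pixels
        norms = np.linalg.norm(offsets, axis=1)
        valid = norms > 1e-8
        cosines = np.zeros(num_pixels)
        cosines[valid] = (offsets[valid] * directions[valid]).sum(axis=1) / norms[valid]
        score = int((cosines >= cos_threshold).sum())
        if score > best_score:
            best_point, best_score = hypothesis, score
    return best_point, best_score
```

Because every visible pixel contributes a direction, the keypoint can be recovered even when it lies outside the visible region, which is the property the abstract highlights for occluded or truncated objects.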
Nian Liu, Junwei Han, Ming-Hsuan Yang
In saliency detection, every pixel needs contextual information to make its saliency prediction. Previous models usually incorporate contexts holistically. However, for each pixel, usually only part of its context region is useful and contributes to its prediction, while other parts may act as noise and distraction. In this paper, we propose a novel pixel-wise contextual attention network, i.e., PiCANet, to learn to selectively attend to informative context locations at each pixel. Specifically, PiCANet generates an attention map over the context region of each pixel, where each attention weight corresponds to the relevance of a context location w.r.t. the referred pixel. Then, attentive contextual features can be constructed by selectively incorporating the features of useful context locations with the learned attention. We propose three specific formulations of the PiCANet by embedding the pixel-wise contextual attention mechanism into the pooling and convolution operations, attending to global or local contexts. All three models are fully differentiable and can be integrated with CNNs for joint training. We introduce the proposed PiCANets into a U-Net architecture for salient object detection. Experimental results indicate that the proposed PiCANets can significantly improve saliency detection performance. The generated global and local attention can learn to incorporate global contrast and smoothness, respectively, which helps localize salient objects more accurately and highlight them more uniformly. Consequently, our saliency model performs favorably against other state-of-the-art methods. Moreover, we validate that PiCANets can also improve semantic segmentation and object detection performance, which further demonstrates their effectiveness and generalization ability.
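A minimal sketch of the global, pooling-style form of pixel-wise contextual attention: every pixel has its own attention distribution over all context locations, and its attended feature is the attention-weighted sum of the features at those locations. In PiCANet the attention logits are produced by the network; here they are simply passed in, and the function name and array layout are assumptions of this sketch.

```python
import numpy as np


def global_attentive_features(feature_map, attention_logits):
    """Apply per-pixel attention over all locations of a feature map.

    feature_map: (height, width, channels).
    attention_logits: (height * width, height * width); row p holds the scores
    of every context location with respect to pixel p.
    """
    height, width, channels = feature_map.shape
    flat = feature_map.reshape(height * width, channels)
    # Softmax over context locations, separately for each referred pixel.
    shifted = attention_logits - attention_logits.max(axis=1, keepdims=True)
    attention = np.exp(shifted)
    attention /= attention.sum(axis=1, keepdims=True)
    attended = attention @ flat
    return attended.reshape(height, width, channels), attention
```

With uniform logits every pixel receives the global average feature, which corresponds to the holistic context pooling that the abstract contrasts with learned, selective attention.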
Martin Mundt, Sagnik Majumder, Tobias Weis, Visvanathan Ramesh
We characterize convolutional neural networks with respect to the relative
amount of features per layer. Using a skew normal distribution as a
parametrized framework, we investigate the common assumption of monotonically
increasing feature-counts with higher layers of architecture designs. Our
evaluation on models with VGG-type layers on the MNIST, Fashion-MNIST and
CIFAR-10 image classification benchmarks provides evidence that motivates
rethinking of our common assumption: architectures that favor larger early
layers seem to yield better accuracy.
Authors' comments: Accepted at the Critiquing and Correcting Trends in Machine Learning
(CRACT) Workshop at the 32nd Conference on Neural Information Processing
Systems (NeurIPS 2018)
Chengyue Gong, Xu Tan, Di He, Tao Qin
Maximum-likelihood estimation (MLE) is widely used in sequence to sequence
tasks for model training. It uniformly treats the generation/prediction of each
target token as multi-class classification, and yields non-smooth prediction
probabilities: in a target sequence, some tokens are predicted with small
probabilities while others are predicted with large probabilities. According to our
empirical study, we find that the non-smoothness of the probabilities results
in low quality of generated sequences. In this paper, we propose a
sentence-wise regularization method which aims to output smooth prediction
probabilities for all the tokens in the target sequence. Our proposed method
can automatically adjust the weights and gradients of each token in one
sentence to ensure that the predictions in a sequence are uniformly good. Experiments on
three neural machine translation tasks and one text summarization task show
that our method outperforms conventional MLE loss on all these tasks and
achieves promising BLEU scores on the WMT14 English-German and WMT17
Chinese-English translation tasks.
Authors' comments: AAAI 2019
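To make the notion of smooth token probabilities from the sentence-wise regularization abstract above concrete, the sketch below adds a penalty on the spread of per-token log-probabilities within one sentence to the usual MLE loss. The variance penalty and the `smoothness_weight` parameter are assumptions chosen for illustration; the abstract does not give the form of the regularizer, so this should not be read as the authors' objective.

```python
import numpy as np


def smoothed_sentence_loss(token_probs, smoothness_weight=1.0):
    """Mean negative log-likelihood plus a within-sentence smoothness penalty.

    token_probs: probabilities the model assigns to the reference tokens
    of a single target sentence.
    """
    log_probs = np.log(np.asarray(token_probs, dtype=float))
    negative_log_likelihood = -log_probs.mean()
    # Assumed regularizer: variance of token log-probabilities in the sentence.
    smoothness_penalty = log_probs.var()
    return negative_log_likelihood + smoothness_weight * smoothness_penalty
```

Two sentences with the same likelihood then receive different losses when one of them concentrates its errors on a few low-probability tokens.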
Alexey Kruglov
Neural network pruning is an important step in the design process of efficient neural networks for edge devices with limited computational power. Pruning is a form of knowledge transfer from the weights of the original network to a smaller target subnetwork. We propose a new method for compute-constrained structured channel-wise pruning of convolutional neural networks. The method iteratively fine-tunes the network while gradually tapering the computation resources available to the pruned network via a holonomic constraint within the framework of the method of Lagrange multipliers. Explicit and adaptive automatic control over the rate of tapering is provided. The trainable parameters of our pruning method are separate from the weights of the neural network, which allows us to avoid interference with the neural network solver (e.g., avoid the direct dependence of pruning speed on neural network learning rates). Our method combines the 'rigoristic' approach by the direct application of constrained optimization, avoiding the pitfalls of ADMM-based methods, such as their need to define the target amount of resources for each pruning run and the direct dependence of pruning speed and priority of pruning on the relative scale of weights between layers. For VGG-16 @ ILSVRC-2012, we achieve a reduction of 15.47 -> 3.87 GMAC with only 1% top-1 accuracy reduction (68.4% -> 67.4%). For AlexNet @ ILSVRC-2012, we achieve 0.724 -> 0.411 GMAC with 1% top-1 accuracy reduction (56.8% -> 55.8%).
J. I. Penney, A. W. Blain, D. Wylezalek, N. A. Hatch, C. Lonsdale, A. Kimball, R. J. Assef, J. J. Condon et al.
We have observed the environments of a population of 33 heavily dust
obscured, ultra-luminous, high-redshift galaxies, selected using WISE and NVSS
at $z>$1.3 with the Infra-Red Array Camera on the $Spitzer$ Space Telescope
over $\rm5.12\,'\times5.12\,'$ fields. Colour selections are used to quantify
any potential overdensities of companion galaxies in these fields. We find no
significant excess of galaxies with the standard colour selection for IRAC
colours of $\rm[3.6]-[4.5]>-0.1$ consistent with galaxies at $z>$1.3 across the
whole fields with respect to wide-area $Spitzer$ comparison fields, but there
is a $\rm>2\sigma$ statistical excess within $\rm0.25\,'$ of the central
radio-WISE galaxy. Using a colour selection of $\rm[3.6]-[4.5]>0.4$, 0.5
magnitudes redder than the standard method of selecting galaxies at $z>$1.3, we
find a significant overdensity, in which $\rm76\%$ ($\rm33\%$) of the 33 fields
have a surface density greater than the $\rm3\sigma$ ($\rm5\sigma$) level.
There is a statistical excess of these redder galaxies within $\rm0.5\,'$,
rising to a central peak $\rm\sim2$--4 times the average density. This implies
that these galaxies are statistically linked to the radio-WISE selected galaxy,
indicating similar structures to those traced by red galaxies around radio-loud
AGN.
Authors' comments: 17 pages, 16 figures, 2 tables
Yuefeng Liang, Cho-Jui Hsieh, Thomas C. M. Lee
Extreme multi-label classification aims to learn a classifier that annotates an instance with a relevant subset of labels from an extremely large label set. Many existing solutions embed the label matrix to a low-dimensional linear subspace, or examine the relevance of a test instance to every label via a linear scan. In practice, however, those approaches can be computationally exorbitant. To alleviate this drawback, we propose a Block-wise Partitioning (BP) pretreatment that divides all instances into disjoint clusters, to each of which the most frequently tagged label subset is attached. One multi-label classifier is trained on one pair of instance and label clusters, and the label set of a test instance is predicted by first delivering it to the most appropriate instance cluster. Experiments on benchmark multi-label data sets reveal that BP pretreatment significantly reduces prediction time, and retains almost the same level of prediction accuracy.
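The partition-then-route structure of the BP pretreatment can be sketched as follows: instances are split into disjoint clusters, each cluster keeps its most frequently tagged labels, and a test instance is first routed to the nearest cluster. Using a few Lloyd (k-means) iterations for the partition, initializing centroids from the first `num_clusters` instances, and keeping a fixed number of labels per cluster are assumptions of this sketch; the per-cluster multi-label classifiers from the abstract are omitted.

```python
import numpy as np


def fit_block_partition(instances, label_matrix, num_clusters, labels_per_cluster, num_iters=10):
    """Partition instances and attach the most frequent labels to each cluster.

    instances: (n, d) feature matrix; label_matrix: (n, num_labels) binary matrix.
    """
    instances = np.asarray(instances, dtype=float)
    label_matrix = np.asarray(label_matrix)
    centroids = instances[:num_clusters].copy()
    for _ in range(num_iters):
        distances = np.linalg.norm(instances[:, None, :] - centroids[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        for k in range(num_clusters):
            members = instances[assignment == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)

    label_subsets = []
    for k in range(num_clusters):
        counts = label_matrix[assignment == k].sum(axis=0)
        top = np.argsort(-counts, kind="stable")[:labels_per_cluster]
        label_subsets.append(sorted(int(label) for label in top))
    return centroids, label_subsets


def route_to_cluster(instance, centroids):
    """Return the index of the cluster whose centroid is closest to the instance."""
    return int(np.linalg.norm(centroids - np.asarray(instance, dtype=float), axis=1).argmin())
```

At prediction time only the classifier for the routed cluster, restricted to that cluster's label subset, would need to be evaluated, which is where the reduction in prediction time comes from.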