Insu Han, Haim Avron, Jinwoo Shin
This paper studies how to sketch element-wise functions of low-rank matrices. Formally, given a low-rank matrix A = [Aij] and a scalar non-linear function f, we aim to find an approximate low-rank representation of the (possibly high-rank) matrix [f(Aij)]. To this end, we propose an efficient sketching-based algorithm whose complexity is significantly lower than the number of entries of A, i.e., it runs without accessing all entries of [f(Aij)] explicitly. The main idea underlying our method is to combine a polynomial approximation of f with the existing tensor sketch scheme for approximating monomials of entries of A. To balance the errors of the two approximation components in an optimal manner, we propose a novel regression formula to find polynomial coefficients given A and f. In particular, we utilize a coreset-based regression with a rigorous approximation guarantee. Finally, we demonstrate the applicability and superiority of the proposed scheme under various machine learning tasks.
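One way to see why monomials are a natural building block: if A = U V^T has rank r, then the element-wise square of A factors exactly through the row-wise Kronecker squares of U and V, so it has rank at most r^2. The sketch below only checks this exact identity for k = 2 in plain Python. It does not implement tensor sketch or the paper's regression, and the helper names `matmul_t` and `row_kron` are illustrative assumptions.

```python
def matmul_t(U, V):
    """Return U @ V^T, where U and V are lists of rows."""
    return [[sum(u * v for u, v in zip(ur, vr)) for vr in V] for ur in U]

def row_kron(U):
    """Row-wise Kronecker square: each row u becomes the flattened outer product of u with itself."""
    return [[a * b for a in row for b in row] for row in U]

# Hypothetical rank-2 factors of a 3 x 2 matrix A = U V^T
U = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
V = [[2.0, 1.0], [-1.0, 4.0]]

A = matmul_t(U, V)
# f(x) = x^2 applied entry by entry to A
squared_direct = [[a * a for a in row] for row in A]
# The same matrix obtained from factors of width r^2 = 4, without forming A
squared_factored = matmul_t(row_kron(U), row_kron(V))
```

The factor width grows as r^k for degree k, which is the cost that a sketching scheme would aim to reduce.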
Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen et al.
We propose NovoGrad, an adaptive stochastic gradient descent method with
layer-wise gradient normalization and decoupled weight decay. In our
experiments on neural networks for image classification, speech recognition,
machine translation, and language modeling, it performs on par with or better than
well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is
robust to the choice of learning rate and weight initialization, (2) works well
in a large batch setting, and (3) has half the memory footprint of
Adam.
Authors' comments: Preprint, under review
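A minimal sketch of what a layer-wise normalized update with decoupled weight decay can look like, for a single layer stored as a plain list. This is an assumption-laden illustration of the idea described in the abstract, not the authors' reference implementation: the function name `novograd_step` and all default hyperparameter values are hypothetical.

```python
import math

def novograd_step(w, g, m, v, lr=0.01, beta1=0.95, beta2=0.98, wd=0.001, eps=1e-8):
    """One update for one layer. w, g, m are lists; v is a scalar (None before the first step)."""
    g_sq = sum(x * x for x in g)  # squared L2 norm of this layer's gradient
    if v is None:
        v = g_sq  # first step: initialise the second moment with the gradient norm
    else:
        v = beta2 * v + (1.0 - beta2) * g_sq
    denom = math.sqrt(v) + eps
    # Normalise the gradient by the layer-wise second moment, add the decoupled
    # weight-decay term, and accumulate the result into the momentum buffer.
    m = [beta1 * mi + (gi / denom + wd * wi) for mi, gi, wi in zip(m, g, w)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m, v
```

In this sketch the second moment `v` is one scalar per layer rather than one value per weight, so the only per-weight optimizer state is the momentum buffer `m`.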
Xiang Li, Xiaolin Hu, Jian Yang
Convolutional Neural Networks (CNNs) generate the feature representation
of complex objects by collecting hierarchical and different parts of semantic
sub-features. These sub-features can usually be distributed in grouped form in
the feature vector of each layer, representing various semantic entities.
However, the activation of these sub-features is often spatially affected by
similar patterns and noisy backgrounds, resulting in erroneous localization and
identification. We propose a Spatial Group-wise Enhance (SGE) module that can
adjust the importance of each sub-feature by generating an attention factor for
each spatial location in each semantic group, so that every individual group
can autonomously enhance its learnt expression and suppress possible noise. The
attention factors are only guided by the similarities between the global and
local feature descriptors inside each group, thus the design of SGE module is
extremely lightweight with \emph{almost no extra parameters and calculations}.
Despite being trained with only category supervision, the SGE component is
extremely effective in highlighting multiple active areas with various
high-order semantics (such as the dog's eyes, nose, etc.). When integrated with
popular CNN backbones, SGE can significantly boost the performance of image
recognition tasks. Specifically, based on ResNet50 backbones, SGE achieves
1.2\% Top-1 accuracy improvement on the ImageNet benchmark and 1.0$\sim$2.0\%
AP gain on the COCO benchmark across a wide range of detectors
(Faster/Mask/Cascade RCNN and RetinaNet). Code and pretrained models are
available at https://github.com/implus/PytorchInsight.
Authors' comments: Code available at: https://github.com/implus/PytorchInsight
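The attention computation described above (a similarity between a global descriptor and each local descriptor within one group, turned into a per-position gate) can be sketched for a single group as follows. This is a plain-Python illustration under stated assumptions: the global descriptor is taken to be the spatial average, the similarity is a dot product normalised over positions, and `gamma` and `beta` are hypothetical per-group scale and shift parameters. The linked repository should be consulted for the actual module.

```python
import math

def sge_group(features, gamma=1.0, beta=0.0, eps=1e-5):
    """features: list of per-position feature vectors belonging to ONE semantic group."""
    n = len(features)
    d = len(features[0])
    # Global descriptor: spatial average of the local descriptors
    g = [sum(x[k] for x in features) / n for k in range(d)]
    # Similarity between the global descriptor and each local descriptor
    c = [sum(gk * xk for gk, xk in zip(g, x)) for x in features]
    # Normalise the similarities over spatial positions
    mean = sum(c) / n
    std = math.sqrt(sum((ci - mean) ** 2 for ci in c) / n)
    a = [gamma * (ci - mean) / (std + eps) + beta for ci in c]
    # Gate every local descriptor with a sigmoid of its normalised similarity
    return [[xk / (1.0 + math.exp(-ai)) for xk in x] for x, ai in zip(features, a)]
```

Positions whose local descriptor agrees with the group's global descriptor receive a gate above 0.5, and the others are suppressed. The only parameters in this sketch are the two scalars per group.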
Karsten Roth, Tomasz Konopczyński, Jürgen Hesser
At present, lesion segmentation is still performed manually (or semi-automatically) by medical experts. To facilitate this process, we contribute a fully automatic lesion segmentation pipeline. This work proposes a method as part of the LiTS (Liver Tumor Segmentation Challenge) competition for ISBI 17 and MICCAI 17, which compares methods for automatic segmentation of liver lesions in CT scans. By utilizing cascaded, densely connected 2D U-Nets and a Tversky-coefficient-based loss function, our framework achieves very good shape extractions with high detection sensitivity, with competitive scores at time of publication. In addition, adjusting hyperparameters in our Tversky loss allows tuning the network towards higher sensitivity or robustness.
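To make the role of the Tversky hyperparameters concrete, here is a minimal sketch of a Tversky-index loss over flattened soft predictions. It assumes the common convention in which `alpha` weights false positives and `beta` weights false negatives. The function name and the smoothing constant `eps` are illustrative and not taken from the paper.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """pred: predicted probabilities in [0, 1]; target: binary ground-truth labels."""
    tp = sum(p * t for p, t in zip(pred, target))        # soft true positives
    fp = sum(p * (1 - t) for p, t in zip(pred, target))  # soft false positives
    fn = sum((1 - p) * t for p, t in zip(pred, target))  # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index
```

With `alpha = beta = 0.5` the index reduces to the Dice coefficient. Raising `beta` above `alpha` penalises missed lesion pixels more heavily, which pushes a model towards higher sensitivity.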
Kai Su, Dongdong Yu, Zhenqi Xu, Xin Geng, Changhu Wang
Multi-person pose estimation is an important but challenging problem in
computer vision. Although current approaches have achieved significant progress
by fusing the multi-scale feature maps, they pay little attention to enhancing
the channel-wise and spatial information of the feature maps. In this paper, we
propose two novel modules to perform the enhancement of the information for the
multi-person pose estimation. First, a Channel Shuffle Module (CSM) is proposed
to adopt the channel shuffle operation on the feature maps with different
levels, promoting cross-channel information communication among the pyramid
feature maps. Second, a Spatial, Channel-wise Attention Residual Bottleneck
(SCARB) is designed to boost the original residual unit with attention
mechanism, adaptively highlighting the information of the feature maps both in
the spatial and channel-wise context. The effectiveness of our proposed modules
is evaluated on the COCO keypoint benchmark, and experimental results show that
our approach achieves state-of-the-art results.
Authors' comments: Accepted by CVPR 2019
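The abstract does not spell out the channel shuffle operation. A common form of it, assumed here, views the channels as a (groups, channels per group) grid, transposes the grid, and flattens it again. The sketch below applies that permutation to a list of channel indices. The function name is illustrative.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, per_group), transpose, flatten."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]
```

For example, six channels in two groups, `[0, 1, 2, 3, 4, 5]`, become `[0, 3, 1, 4, 2, 5]`, so each consecutive pair now mixes one channel from every group.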
Seungyul Han, Youngchul Sung
In importance sampling (IS)-based reinforcement learning algorithms such as
Proximal Policy Optimization (PPO), IS weights are typically clipped to avoid
large variance in learning. However, policy update from clipped statistics
induces large bias in tasks with high action dimensions, and bias from clipping
makes it difficult to reuse old samples with large IS weights. In this paper,
we consider PPO, a representative on-policy algorithm, and propose its
improvement by dimension-wise IS weight clipping which separately clips the IS
weight of each action dimension to avoid large bias and adaptively controls the
IS weight to bound policy update from the current policy. This new technique
enables efficient learning for high action-dimensional tasks and the reuse of old
samples, as in off-policy learning, to increase sample efficiency.
Numerical results show that the proposed algorithm outperforms PPO and
other RL algorithms in various OpenAI Gym tasks.
Authors' comments: Accepted to the 36th International Conference on Machine Learning
(ICML), 2019
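A small numerical illustration of why clipping the joint IS weight becomes restrictive as the action dimension grows. It assumes a policy that factorises over action dimensions, so the joint weight is the product of per-dimension weights. The code only demonstrates this effect and does not implement the proposed algorithm. The clipping range `eps = 0.2` and the example ratios are hypothetical.

```python
import math

def clip(ratio, eps=0.2):
    """Clip an importance-sampling weight to [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

def joint_is_weight(dim_ratios):
    """Joint IS weight of a factorised policy: product of per-dimension weights."""
    return math.prod(dim_ratios)

# Ten action dimensions, each with a mild per-dimension ratio of 1.05
dim_ratios = [1.05] * 10
joint = joint_is_weight(dim_ratios)

joint_is_clipped = clip(joint) != joint              # the joint weight leaves the range
dims_clipped = [clip(r) != r for r in dim_ratios]    # no single dimension does
```

Here every per-dimension ratio stays inside the clipping range while their product is roughly 1.63, so a clip applied to the joint weight is triggered even though each dimension has changed only slightly.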
Jimuyang Zhang, Sanping Zhou, Jinjun Wang, Dong Huang
The main challenge of Multiple Object Tracking (MOT) is the efficiency in
associating an indefinite number of objects between video frames. Standard motion
estimators used in tracking, e.g., Long Short Term Memory (LSTM), only deal
with a single object, while Re-IDentification (Re-ID) based approaches
exhaustively compare object appearances. Both approaches are computationally
costly when they are scaled to a large number of objects, making it very
difficult for real-time MOT. To address these problems, we propose a highly
efficient Deep Neural Network (DNN) that simultaneously models association
among an indefinite number of objects. The inference computation of the DNN does
not increase with the number of objects. Our approach, Frame-wise Motion and
Appearance (FMA), computes the Frame-wise Motion Fields (FMF) between two
frames, which leads to very fast and reliable matching among a large number of
object bounding boxes. As auxiliary information to fix uncertain
matches, Frame-wise Appearance Features (FAF) are learned in parallel with
FMFs. Extensive experiments on the MOT17 benchmark show that our method
achieves real-time MOT with results competitive with state-of-the-art
approaches.
Authors' comments: 13 pages, 4 figures, 4 tables
Jie Xing, Zheren Li, Biyuan Wang, Yuji Qi, Bingbin Yu, Farhad G. Zanjani, Aiwen Zheng, Remco Duits et al.
Breast cancer is the most common invasive cancer and has the highest incidence among cancers in females. Handheld ultrasound is one of the most efficient ways to identify and diagnose breast cancer. The area and shape information of a lesion is very helpful for clinicians in making diagnostic decisions. In this study we propose a new deep-learning scheme, the semi-pixel-wise cycle generative adversarial net (SPCGAN), for segmenting the lesion in 2D ultrasound. The method takes advantage of a fully convolutional neural network (FCN) and a generative adversarial net to segment a lesion by using prior knowledge. We compared the proposed method to an FCN and the level set segmentation method on a test dataset consisting of 32 malignant lesions and 109 benign lesions. Our proposed method achieved a Dice similarity coefficient (DSC) of 0.92, while FCN and the level set achieved 0.90 and 0.79, respectively. In particular, for malignant lesions, our method significantly increases the DSC from 0.90 for the FCN to 0.93 (p$<$0.001). The results show that our SPCGAN can obtain robust segmentation results. Compared to FCN, the SPCGAN framework is particularly effective when sufficient training samples are not available. Our proposed method may be used to relieve the radiologists' annotation burden.
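For reference, the Dice similarity coefficient reported above is twice the overlap between the predicted and reference masks divided by the sum of their sizes. A minimal sketch over flattened binary masks follows. The function name and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_coefficient(pred, truth):
    """pred, truth: flattened binary masks of equal length (truthy = lesion pixel)."""
    pred_set = {i for i, p in enumerate(pred) if p}
    truth_set = {i for i, t in enumerate(truth) if t}
    total = len(pred_set) + len(truth_set)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * len(pred_set & truth_set) / total
```

A score of 1.0 means the masks coincide exactly, and 0.0 means they share no pixels.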
Xingyuan Zhang, Fuhai Zhang
3D Hand pose estimation from a single depth image is an essential topic in
computer vision and human-computer interaction. Although the rise of deep
learning methods has boosted accuracy considerably, the problem is still hard to solve
due to the complex structure of the human hand. Existing deep learning
methods either lose spatial information of hand structure or lack direct
supervision of joint coordinates. In this paper, we propose a novel Pixel-wise
Regression method, which uses a spatial-form representation (SFR) and a
differentiable decoder (DD) to solve the two problems. To use our method, we
build a model in which we design a particular SFR and its corresponding DD, which
divide the 3D joint coordinates into two parts, plane coordinates and depth
coordinates, and we use two modules named Plane Regression (PR) and Depth
Regression (DR) to deal with them respectively. We conduct an ablation
experiment to show that the proposed method achieves better results than
former methods. We also explore how different training
strategies influence the learned SFRs and results. Experiments on three
public datasets demonstrate that our model is comparable with the existing
state-of-the-art models, and on one of them our model reduces mean 3D joint
error by 25%.
Authors' comments: Update LaTeX version. Code coming soon
Zaizheng Li, Qidi Zhang
We prove a Hopf's lemma in the point-wise sense. The essential technique is
to prove $(-\Delta)^s_p u(x)$ is uniformly bounded in the unit ball
$B_1\subset\mathbb{R}^n$, where $u(x)=(1-|x|^2)^s_{+}$. Also we study the
global H\"older continuity of bounded positive solutions for $(-\Delta)^s_p
u(x)=f(x,u).$
Authors' comments: 33 pages
Chunfeng Song, Yan Huang, Wanli Ouyang, Liang Wang
Semantic segmentation has achieved huge progress via adopting deep Fully
Convolutional Networks (FCN). However, the performance of FCN-based models
relies heavily on the amount of pixel-level annotations, which are expensive and
time-consuming to obtain. To address this problem, it is a good choice to learn to
segment with weak supervision from bounding boxes. How to make full use of the
class-level and region-level supervisions from bounding boxes is the critical
challenge for the weakly supervised learning task. In this paper, we first
introduce a box-driven class-wise masking model (BCM) to remove irrelevant
regions of each class. Moreover, based on the pixel-level segment proposal
generated from the bounding box supervision, we could calculate the mean
filling rates of each class to serve as an important prior cue, then we propose
a filling rate guided adaptive loss (FR-Loss) to help the model ignore the
wrongly labeled pixels in proposals. Unlike previous methods that directly train
models with fixed individual segment proposals, our method can adjust the
model learning with global statistical information. Thus it can help reduce the
negative impacts from wrongly labeled proposals. We evaluate the proposed
method on the challenging PASCAL VOC 2012 benchmark and compare with other
methods. Extensive experimental results show that the proposed method is
effective and achieves state-of-the-art results.
Authors' comments: Accepted by CVPR 2019
Shonosuke Sugasawa, Genya Kobayashi, Yuki Kawakubo
Estimating income distributions plays an important role in the measurement of
inequality and poverty over space. The existing literature on income
distributions predominantly focuses on estimating an income distribution for a
country or a region separately and the simultaneous estimation of multiple
income distributions has not been discussed in spite of its practical
importance. In this work, we develop an effective method for the simultaneous
estimation and inference for area-wise spatial income distributions taking
account of geographical information from grouped data. Based on the multinomial
likelihood function for grouped data, we propose a spatial state-space model
for area-wise parameters of parametric income distributions. We provide an
efficient Bayesian approach to estimation and inference for area-wise latent
parameters, which enables us to compute area-wise summary measures of income
distributions such as mean incomes and Gini indices, not only for sampled areas
but also for areas without any samples thanks to the latent spatial state-space
structure. The proposed method is demonstrated using the Japanese
municipality-wise grouped income data. The simulation studies show the
superiority of the proposed method to a crude conventional approach which
estimates the income distributions separately.
Authors' comments: 25 pages
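As a reminder of what an area-wise Gini index summarises, the sketch below computes the Gini index of a list of individual incomes using the standard sorted-rank formula. This is an illustration only: the paper works with grouped data and parametric income distributions, so it would not compute the index from raw incomes in this way. The function name is an assumption.

```python
def gini_index(incomes):
    """Sample Gini index of non-negative incomes: 0 = perfect equality."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over incomes in ascending order (ranks start at 1)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

Equal incomes give an index of 0. When one person out of n holds all the income, the index is (n - 1) / n.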
Thomas Uriot
In this paper, we propose an extension to an existing algorithm
(instance-MIR) which tackles the multiple instance regression (MIR) problem,
also known as distribution regression. The MIR setting arises when the data is
a collection of bags, where each bag consists of several instances which
correspond to the same and unique real-valued label. The goal of a MIR
algorithm is to find a mapping from the instances of an unseen bag to its
target value. The instance-MIR algorithm treats all the instances separately
and maps each instance to a label. The final bag label is then taken as the
mean or the median of the predictions for that given bag. While it is
conceptually simple, taking a single statistic to summarize the distribution of
the labels in each bag is a limitation. In spite of this performance
bottleneck, the instance-MIR algorithm has been shown to be competitive when
compared to the current state-of-the-art methods. We address the aforementioned
issue by computing the kernel mean embeddings of the distributions of the
predicted labels, for each bag, and learn a regressor from these embeddings to
the bag label. We test our algorithm (instance-kme-MIR) on five real world
datasets and obtain better results than the baseline instance-MIR across all
the datasets, while achieving state-of-the-art results on two of the datasets.
Authors' comments: KDD 2019, FEED Workshop
Dongjun Lee
Most deep learning approaches for text-to-SQL generation are limited to the
WikiSQL dataset, which only supports very simple queries over a single table.
We focus on the Spider dataset, a complex and cross-domain text-to-SQL task,
which includes complex queries over multiple tables. In this paper, we propose
a SQL clause-wise decoding neural architecture with a self-attention based
database schema encoder to address the Spider task. Each of the clause-specific
decoders consists of a set of sub-modules, which is defined by the syntax of
each clause. Additionally, our model works recursively to support nested
queries. When evaluated on the Spider dataset, our approach achieves 4.6\% and
9.8\% accuracy gain in the test and dev sets, respectively. In addition, we
show that our model is significantly more effective at predicting complex and
nested queries than previous work.
Authors' comments: EMNLP 2019
Yaoyao Liu, Bernt Schiele, Qianru Sun
Few-shot learning aims to train efficient predictive models with a few examples. The lack of training data leads to poor models that make high-variance or low-confidence predictions. In this paper, we propose to meta-learn the ensemble of epoch-wise empirical Bayes models (E3BM) to achieve robust predictions. "Epoch-wise" means that each training epoch has a Bayes model whose parameters are specifically learned and deployed. "Empirical" means that the hyperparameters, e.g., used for learning and ensembling the epoch-wise models, are generated by hyperprior learners conditional on task-specific data. We introduce four kinds of hyperprior learners by considering inductive vs. transductive, and epoch-dependent vs. epoch-independent, in the paradigm of meta-learning. We conduct extensive experiments for five-class few-shot tasks on three challenging benchmarks: miniImageNet, tieredImageNet, and FC100, and achieve top performance using the epoch-dependent transductive hyperprior learner, which captures the richest information. Our ablation study shows that both "epoch-wise ensemble" and "empirical" encourage high efficiency and robustness in the model performance.
Shaahin Angizi, Deliang Fan
With Von-Neumann computing architectures struggling to address
computationally- and memory-intensive big data analytic tasks today,
Processing-in-Memory (PIM) platforms are gaining growing interest. In this
regard, processing-in-DRAM architectures have achieved remarkable success by
dramatically reducing data transfer energy and latency. However, the
performance of such systems unavoidably diminishes when dealing with more
complex applications seeking bulk bit-wise X(N)OR or addition operations,
despite utilizing maximum internal DRAM bandwidth and in-memory parallelism. In
this paper, we develop the DRIM platform, which harnesses DRAM as computational
memory and transforms it into a fundamental processing unit. DRIM uses the
analog operation of DRAM sub-arrays and elevates it to implement bit-wise
X(N)OR operation between operands stored in the same bit-line, based on a new
dual-row activation mechanism with a modest change to peripheral circuits such
as sense amplifiers. The simulation results show that DRIM achieves on average 71x
and 8.4x higher throughput for performing bulk bit-wise X(N)OR-based operations
compared with CPU and GPU, respectively. Besides, DRIM outperforms recent
processing-in-DRAM platforms with up to 3.7x better performance.
Authors' comments: 7 pages, 9 Figures
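For readers unfamiliar with the operation being accelerated, a bulk bit-wise XNOR compares two fixed-width bit vectors and sets a bit wherever they agree. The software sketch below shows the logical operation on Python integers. It says nothing about the DRAM mechanism itself, and the function names are illustrative.

```python
def bulk_xnor(a, b, width):
    """Bit-wise XNOR of two `width`-bit operands stored as non-negative integers."""
    mask = (1 << width) - 1  # keep only the low `width` bits after inversion
    return ~(a ^ b) & mask

def popcount(x):
    """Number of set bits, i.e. the number of positions where the operands agreed."""
    return bin(x).count("1")
```

For example, `bulk_xnor(0b1100, 0b1010, 4)` yields `0b1001`: the operands agree in the highest and lowest of the four bit positions.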
Jon Hoffman
Neural Networks accomplish amazing things, but they suffer from computational
and memory bottlenecks that restrict their usage. Nowhere can this be better
seen than in the mobile space, where specialized hardware is being created just
to satisfy the demand for neural networks. Previous studies have shown that
neural networks have vastly more connections than they actually need to do
their work. This thesis develops a method that can compress networks to less
than 10% of memory and less than 25% of computational power, without loss of
accuracy, and without creating sparse networks that require special code to
run.
Authors' comments: Thesis for Masters degree
Yuning Chai
Recent advances in single-frame object detection and segmentation techniques
have motivated a wide range of works to extend these methods to process video
streams. In this paper, we explore the idea of hard attention aimed at
latency-sensitive applications. Instead of reasoning about every frame
separately, our method selects and only processes a small sub-window of the
frame. Our technique then makes predictions for the full frame based on the
sub-windows from previous frames and the update from the current sub-window.
The latency reduction by this hard attention mechanism comes at the cost of
degraded accuracy. We make two contributions to address this. First, we propose
a specialized memory cell that recovers lost context when processing
sub-windows. Second, we adopt a Q-learning-based policy training strategy
that enables our approach to intelligently select the sub-windows such that the
staleness in the memory hurts the performance the least. Our experiments
suggest that our approach reduces the latency by approximately four times
without significantly sacrificing the accuracy on the ImageNet VID video object
detection dataset and the DAVIS video object segmentation dataset. We further
demonstrate that we can reinvest the saved computation into other parts of the
network, resulting in an accuracy increase at a computational cost comparable
to that of the original system and beating other recently proposed
state-of-the-art methods in the low latency range.
Authors' comments: ICCV 2019 Camera Ready + Supplementary
Ashu Sharma, Sanjay K. Sahay
Among fast-growing smart devices, Android is the most popular OS, and owing to
its attractive features, mobility, and ease of use, these devices hold sensitive
information such as personal data, browsing history, shopping history,
financial details, etc. Therefore, any security gap in these devices means that
the information stored on or accessed through them is at high risk of being
breached by malware. Malware is continuously growing and is also
used for military espionage, disrupting industry, power grids, etc. To
detect malware, traditional signature matching techniques are widely
used. However, such strategies are not capable of detecting advanced Android
malicious apps because malware developers use several obfuscation techniques.
Hence, researchers are continuously addressing the security issues in
Android-based smart devices. In this paper, using the Drebin benchmark
malware dataset, we experimentally demonstrate how to improve the detection
accuracy by analyzing the apps after grouping the collected data based on
permissions, and we achieve 97.15% overall average accuracy. Our results
outperform the accuracy obtained without grouping data (79.27%, 2017), Arp et
al. (94%, 2014), Annamalai et al. (84.29%, 2016), Bahman Rashidi et al. (82%,
2017), and Ali Feizollah et al. (95.5%, 2017). The analysis also shows that
among the groups, the Microphone group has the lowest detection accuracy while
Calendar group apps are detected with the highest accuracy, and that for the
best performance, one should use 80-100 features.
Authors' comments: 9 pages, 20 Figures
João Lita da Silva
The main purpose of this paper is to obtain strong laws of large numbers for
arrays or weighted sums of random variables under a scenario of dependence.
Namely, for triangular arrays $\{X_{n,k}, \, 1 \leqslant k \leqslant n, \, n
\geqslant 1 \}$ of row-wise extended negatively dependent random variables
weakly mean dominated by a random variable $X \in \mathscr{L}_{1}$ and
sequences $\{b_{n} \}$ of positive constants, conditions are given to ensure
$\sum_{k=1}^{n} \left(X_{n,k} - \mathbb{E} \, X_{n,k} \right)/b_{n}
\overset{\textnormal{a.s.}}{\longrightarrow} 0$. Our statements also allow us
to improve recent results about complete convergence.
Authors' comments: 15 pages