Nikolaos Manginas, Ilias Chalkidis, Prodromos Malakasiotis
Although BERT is widely used by the NLP community, little is known about its
inner workings. Several attempts have been made to shed light on certain
aspects of BERT, often with contradicting conclusions. A much raised concern
focuses on BERT's over-parameterization and under-utilization issues. To this
end, we propose a novel approach to fine-tune BERT in a structured manner.
Specifically, we focus on Large Scale Multilabel Text Classification (LMTC)
where documents are assigned one or more labels from a large predefined
set of hierarchically organized labels. Our approach guides specific BERT
layers to predict labels from specific hierarchy levels. Experimenting with two
LMTC datasets we show that this structured fine-tuning approach not only yields
better classification results but also leads to better parameter utilization.
Authors' comments: 5 pages, short paper at SPNLP 2020 (EMNLP 2020 Workshop)
Seyedsaeid Mirkamali, P. Nagabhushan
Image segmentation has long been a basic problem in computer vision. Depth-wise Layering is a kind of segmentation that slices an image in a depth-wise sequence, unlike conventional image segmentation problems, which deal with surface-wise decomposition. The proposed Depth-wise Layering technique uses a single depth image of a static scene to slice it into multiple layers. The technique employs a thresholding approach to segment rows of the dense depth map into smaller partitions, called Line-Segments in this paper. Then, it uses the line-segment labelling method to identify the number of objects and layers of the scene independently. The final stage is to link objects of the scene to their respective object-layers. We evaluate the efficiency of the proposed technique by applying it to many images along with their dense depth maps. The experiments have shown promising layering results.
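The row-wise thresholding step described above can be illustrated with a minimal sketch. The abstract does not state the thresholding rule, so the rule used here (start a new line segment wherever neighbouring depth values differ by more than a threshold) and the names `split_row_into_segments` and `threshold` are assumptions made for illustration only.

```python
def split_row_into_segments(depth_row, threshold):
    """Split one row of a depth map into line segments.

    Assumed rule: a new segment starts wherever the absolute depth
    difference between neighbouring pixels exceeds `threshold`.
    Returns a list of (start, end) index pairs, with `end` exclusive.
    """
    if not depth_row:
        return []
    segments = []
    start = 0
    for i in range(1, len(depth_row)):
        if abs(depth_row[i] - depth_row[i - 1]) > threshold:
            segments.append((start, i))  # close the current segment
            start = i                    # open the next one
    segments.append((start, len(depth_row)))
    return segments


if __name__ == "__main__":
    row = [1.0, 1.1, 1.0, 3.0, 3.1, 5.0]
    print(split_row_into_segments(row, threshold=0.5))
```

Applying the function to every row of a dense depth map would give the per-row partitions that a later labelling stage could group into objects and layers.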
Zhong-Qiu Wang, Peidong Wang, DeLiang Wang
We propose multi-microphone complex spectral mapping, a simple way of
applying deep learning for time-varying non-linear beamforming, for speaker
separation in reverberant conditions. We aim at both speaker separation and
dereverberation. Our study first investigates offline utterance-wise speaker
separation and then extends to block-online continuous speech separation (CSS).
Assuming a fixed array geometry between training and testing, we train deep
neural networks (DNN) to predict the real and imaginary (RI) components of
target speech at a reference microphone from the RI components of multiple
microphones. We then integrate multi-microphone complex spectral mapping with
minimum variance distortionless response (MVDR) beamforming and post-filtering
to further improve separation, and combine it with frame-level speaker counting
for block-online CSS. Although our system is trained on simulated room impulse
responses (RIR) based on a fixed number of microphones arranged in a given
geometry, it generalizes well to a real array with the same geometry.
State-of-the-art separation performance is obtained on the simulated two-talker
SMS-WSJ corpus and the real-recorded LibriCSS dataset.
Authors' comments: 14 pages, 6 figures. To appear in IEEE/ACM Transactions on Audio,
Speech, and Language Processing. Sound demo
https://zqwang7.github.io/demos/SMSWSJ_demo/taslp20_SMSWSJ_demo.html
Hujie Pan, Xuesong Li, Min Xu
The classic algebraic reconstruction technique (ART) for computed tomography requires pre-determined weights of the voxels for projecting pixel values. However, such weights cannot be accurately obtained due to limited physical understanding and computational resources. In this study, we propose a semi-case-wise learning-based method named Weight Encode Reconstruction Network (WERNet) to tackle the issues mentioned above. The model is trained in a self-supervised manner without the label of a voxel set. It contains two branches: the voxel weight encoder and the voxel attention part. Using gradient normalization, we are able to co-train the encoder and voxel set in a numerically stable way. With WERNet, the reconstructed result was obtained with a cosine similarity greater than 0.999 with the ground truth. Moreover, the model shows an extraordinary denoising capability compared to the classic ART method. In the generalization test of the model, the encoder is transferable from a voxel set with complex structure to unseen cases without a reduction in accuracy.
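To make the role of the pre-determined weights concrete, here is a minimal sketch of one sweep of classic ART in its textbook (Kaczmarz-style) form. It is not WERNet; the function name `art_sweep` and the `relaxation` parameter are illustrative assumptions.

```python
def art_sweep(weights, projections, voxels, relaxation=1.0):
    """Run one ART sweep over all rays and return the updated voxel values.

    weights[i][j] is the (pre-determined) contribution of voxel j to ray i,
    projections[i] is the measured value of ray i.
    """
    x = list(voxels)
    for w_i, p_i in zip(weights, projections):
        norm_sq = sum(w * w for w in w_i)
        if norm_sq == 0.0:
            continue  # this ray touches no voxel
        # Mismatch between the measurement and the current forward projection.
        residual = p_i - sum(w * v for w, v in zip(w_i, x))
        step = relaxation * residual / norm_sq
        # Distribute the correction over the voxels in proportion to their weights.
        x = [v + step * w for v, w in zip(x, w_i)]
    return x


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]
    print(art_sweep(weights, [2.0, 3.0], [0.0, 0.0]))
```

Every update depends directly on `weights`, which is why inaccurate weights limit the quality of the classic reconstruction.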
Oskari Miettinen
Physically unassociated background or foreground objects seen towards
submillimetre sources are potential contaminants of both the studies of young
stellar objects embedded in Galactic dust clumps and multiwavelength
counterparts of submillimetre galaxies (SMGs). We employed the near-infrared
and mid-infrared data from the Wide-field Infrared Survey Explorer (WISE) and
the submillimetre data from the Planck satellite, and uncovered a source,
namely WISE J044232.92+322734.9, whose WISE infrared colours suggest that it is
a star-forming galaxy (SFG), and which is seen in projection towards the
Planck-detected dust clump PGCC G169.20-8.96. We used the MAGPHYS+photo-$z$
spectral energy distribution code to derive the photometric redshift and
physical properties of J044232.92. The redshift was derived to be $z_{\rm
phot}=1.132^{+0.280}_{-0.165}$, while, for example, the stellar mass, IR
(8-1000 $\mu$m) luminosity, and star formation rate were derived to be
$M_{\star}=4.6^{+4.7}_{-2.5}\times10^{11}$ M$_{\odot}$, $L_{\rm
IR}=2.8^{+5.7}_{-1.5}\times10^{12}$ L$_{\odot}$, and ${\rm
SFR}=191^{+580}_{-146}$ ${\rm M}_{\odot}$ yr$^{-1}$. The derived value of
$L_{\rm IR}$ suggests that J044232.92 could be an ultraluminous infrared
galaxy, and we found that it is consistent with a main sequence SFG at a
redshift of 1.132. Moreover, the estimated physical properties of J044232.92
are comparable to those of SMGs. Further observations, in particular
high-resolution (sub-)millimetre and radio continuum imaging, are needed to
better constrain the redshift and physical properties of J044232.92 and to see
if the source really is a galaxy seen through a Galactic dust clump, in
particular an SMG population member at $z\sim1.1$.
Authors' comments: 7 pages, 4 figures, 3 tables, accepted for publication in A&A,
abstract abridged for arXiv
MaungMaung AprilPyone, Hitoshi Kiya
In this paper, we propose a novel defensive transformation that enables us to
maintain a high classification accuracy under the use of both clean images and
adversarial examples for adversarially robust defense. The proposed
transformation is a block-wise preprocessing technique with a secret key to
input images. We developed three algorithms to realize the proposed
transformation: Pixel Shuffling, Bit Flipping, and FFX Encryption. Experiments
were carried out on the CIFAR-10 and ImageNet datasets by using both black-box
and white-box attacks with various metrics including adaptive ones. The results
show that the proposed defense achieves high accuracy close to that of using
clean images even under adaptive attacks for the first time. In the best-case
scenario, a model trained by using images transformed by FFX Encryption (block
size of 4) yielded an accuracy of 92.30% on clean images and 91.48% under PGD
attack with a noise distance of 8/255, which is close to the non-robust
accuracy (95.45%) for the CIFAR-10 dataset, and it yielded an accuracy of
72.18% on clean images and 71.43% under the same attack, which is also close to
the standard accuracy (73.70%) for the ImageNet dataset. Overall, all three
proposed algorithms are demonstrated to outperform state-of-the-art defenses
including adversarial training whether or not a model is under attack.
Authors' comments: Under review
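As a rough illustration of a block-wise transformation with a secret key, the sketch below shuffles pixel positions inside each block using a permutation derived from the key. The abstract does not give the details of the authors' Pixel Shuffling algorithm, so the single-channel image layout, the reuse of one permutation for every block, and the name `shuffle_blocks` are all assumptions.

```python
import random


def shuffle_blocks(image, block_size, key):
    """Shuffle pixels inside each block_size x block_size block of a 2-D image.

    The permutation is derived from `key`, so the same key always yields
    the same transformation.
    """
    height, width = len(image), len(image[0])
    if height % block_size or width % block_size:
        raise ValueError("image sides must be multiples of block_size")
    n = block_size * block_size
    perm = list(range(n))
    random.Random(key).shuffle(perm)  # key-seeded permutation of block positions
    out = [row[:] for row in image]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            # Flatten the block in row-major order.
            block = [image[top + k // block_size][left + k % block_size]
                     for k in range(n)]
            # Write the permuted pixels back to the same block.
            for k, src in enumerate(perm):
                out[top + k // block_size][left + k % block_size] = block[src]
    return out


if __name__ == "__main__":
    img = [[r * 4 + c for c in range(4)] for r in range(4)]
    for row in shuffle_blocks(img, block_size=2, key=1234):
        print(row)
```

In a setup like this, the same key would be used to transform both training and test images, while an attacker without the key sees a differently arranged input space.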
Sam Sattarzadeh, Mahesh Sudhakar, Anthony Lem, Shervin Mehryar, K. N. Plataniotis, Jongseong Jang, Hyunwoo Kim, Yeonjeong Jeong et al.
As an emerging field in Machine Learning, Explainable AI (XAI) has been
offering remarkable performance in interpreting the decisions made by
Convolutional Neural Networks (CNNs). To achieve visual explanations for CNNs,
methods based on class activation mapping and randomized input sampling have
gained great popularity. However, the attribution methods based on these
techniques provide lower resolution and blurry explanation maps that limit
their explanation power. To circumvent this issue, visualization based on
various layers is sought. In this work, we collect visualization maps from
multiple layers of the model based on an attribution-based input sampling
technique and aggregate them to reach a fine-grained and complete explanation.
We also propose a layer selection strategy that applies to the whole family of
CNN-based models, based on which our extraction framework is applied to
visualize the last layers of each convolutional block of the model. Moreover,
we perform an empirical analysis of the efficacy of derived lower-level
information to enhance the represented attributions. Comprehensive experiments
conducted on shallow and deep models trained on natural and industrial
datasets, using both ground-truth and model-truth based evaluation metrics
validate our proposed algorithm by meeting or outperforming the
state-of-the-art methods in terms of explanation ability and visual quality,
demonstrating that our method shows stability regardless of the size of objects
or instances to be explained.
Authors' comments: 9 pages, 9 figures, Accepted at the Thirty-Fifth AAAI Conference on
Artificial Intelligence (AAAI-21)
Xinyue Liang, Alireza M. Javid, Mikael Skoglund, Saikat Chatterjee
We design a low complexity decentralized learning algorithm to train a
recently proposed large neural network in distributed processing nodes
(workers). We assume the communication network between the workers is
synchronized and can be modeled as a doubly-stochastic mixing matrix without
having any master node. In our setup, the training data is distributed among
the workers but is not shared in the training process due to privacy and
security concerns. Using alternating-direction-method-of-multipliers (ADMM)
along with a layerwise convex optimization approach, we propose a decentralized
learning algorithm which enjoys low computational complexity and communication
cost among the workers. We show that it is possible to achieve the same
learning performance as if the data were available in a single place. Finally, we
experimentally illustrate the time complexity and convergence behavior of the
algorithm.
Authors' comments: Accepted to The International Joint Conference on Neural Networks
(IJCNN) 2020, to appear
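The role of a doubly-stochastic mixing matrix in a network without a master node can be illustrated with plain consensus averaging. This is only a sketch of the communication model, not the authors' ADMM-based algorithm; the three-worker matrix and the function names are assumptions.

```python
def is_doubly_stochastic(matrix, tol=1e-9):
    """Check that entries are non-negative and all rows and columns sum to 1."""
    n = len(matrix)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in matrix)
    cols_ok = all(abs(sum(matrix[i][j] for i in range(n)) - 1.0) < tol
                  for j in range(n))
    non_negative = all(w >= 0.0 for row in matrix for w in row)
    return rows_ok and cols_ok and non_negative


def consensus(values, mixing_matrix, rounds):
    """Each round, every worker replaces its value by a weighted average
    of the values held by the workers it communicates with."""
    for _ in range(rounds):
        values = [sum(w * v for w, v in zip(row, values))
                  for row in mixing_matrix]
    return values


if __name__ == "__main__":
    mixing = [[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]]
    print(is_doubly_stochastic(mixing))
    print(consensus([1.0, 2.0, 6.0], mixing, rounds=50))
```

Because the columns also sum to one, the network-wide average is preserved in every round, so for a connected network such as this one the local values approach the global mean without any worker sharing its raw data.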
Xuezhe Ma
In this paper, we introduce Apollo, a quasi-Newton method for nonconvex
stochastic optimization, which dynamically incorporates the curvature of the
loss function by approximating the Hessian via a diagonal matrix. Importantly,
the update and storage of the diagonal approximation of the Hessian are as efficient
as adaptive first-order optimization methods with linear complexity for both
time and memory. To handle nonconvexity, we replace the Hessian with its
rectified absolute value, which is guaranteed to be positive-definite.
Experiments on three tasks of vision and language show that Apollo achieves
significant improvements over other stochastic optimization methods, including
SGD and variants of Adam, in terms of both convergence speed and generalization
performance. The implementation of the algorithm is available at
https://github.com/XuezheMax/apollo.
Authors' comments: Fixed errors in convergence analysis. 29 pages (plus appendix), 6
figures, 7 tables
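The effect of replacing a diagonal Hessian approximation with a rectified absolute value can be sketched as follows. The sketch takes the diagonal approximation as given and does not show how Apollo constructs it; the rectification form `max(|b|, sigma)`, the default values, and the name `rectified_diagonal_step` are assumptions for illustration, not the reference implementation.

```python
def rectified_diagonal_step(params, grads, diag_hessian, lr=0.1, sigma=1.0):
    """Take one elementwise step p <- p - lr * g / max(|b|, sigma).

    `diag_hessian` holds one curvature estimate per parameter and may
    contain negative or near-zero entries in a nonconvex problem.
    """
    new_params = []
    for p, g, b in zip(params, grads, diag_hessian):
        curvature = max(abs(b), sigma)  # strictly positive whenever sigma > 0
        new_params.append(p - lr * g / curvature)
    return new_params


if __name__ == "__main__":
    # A negative curvature estimate still produces a descent direction.
    print(rectified_diagonal_step([1.0], [2.0], [-4.0]))
```

Time and memory are linear in the number of parameters, matching the cost profile the abstract attributes to adaptive first-order methods.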
Qing Guo, Jingyang Sun, Felix Juefei-Xu, Lei Ma, Xiaofei Xie, Wei Feng, Yang Liu
Single-image deraining is rather challenging due to the unknown rain model.
Existing methods often make specific assumptions of the rain model, which can
hardly cover many diverse circumstances in the real world, forcing them to
employ complex optimization or progressive refinement. This, however,
significantly affects these methods' efficiency and effectiveness for many
efficiency-critical applications. To fill this gap, in this paper, we regard
the single-image deraining as a general image-enhancing problem and originally
propose a model-free deraining method, i.e., EfficientDeRain, which is able to
process a rainy image within 10~ms (i.e., around 6~ms on average), over 80
times faster than the state-of-the-art method (i.e., RCDNet), while achieving
similar de-rain effects. We first propose the novel pixel-wise dilation
filtering. In particular, a rainy image is filtered with the pixel-wise kernels
estimated from a kernel prediction network, by which suitable multi-scale
kernels for each pixel can be efficiently predicted. Then, to eliminate the gap
between synthetic and real data, we further propose an effective data
augmentation method (i.e., RainMix) that helps to train the network for real rainy
image handling. We perform comprehensive evaluation on both synthetic and
real-world rainy datasets to demonstrate the effectiveness and efficiency of
our method. We release the model and code in
https://github.com/tsingqguo/efficientderain.git.
Authors' comments: 9 pages, 9 figures
Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, Zhongfeng Wang
Designing hardware accelerators for deep neural networks (DNNs) has been much
desired. Nonetheless, most of these existing accelerators are built for either
convolutional neural networks (CNNs) or recurrent neural networks (RNNs).
Recently, the Transformer model has been replacing the RNN in the natural language
processing (NLP) area. However, because of intensive matrix computations and
complicated data flow being involved, the hardware design for the Transformer
model has never been reported. In this paper, we propose the first hardware
accelerator for two key components, i.e., the multi-head attention (MHA)
ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are
the two most complex layers in the Transformer. Firstly, an efficient method is
introduced to partition the huge matrices in the Transformer, allowing the two
ResBlocks to share most of the hardware resources. Secondly, the computation
flow is well designed to ensure the high hardware utilization of the systolic
array, which is the biggest module in our design. Thirdly, complicated
nonlinear functions are highly optimized to further reduce the hardware
complexity and also the latency of the entire system. Our design is coded using
hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared
with the implementation on GPU with the same setting, the proposed design
demonstrates a speed-up of 14.6x in the MHA ResBlock, and 3.4x in the FFN
ResBlock, respectively. Therefore, this work lays a good foundation for
building efficient hardware accelerators for multiple Transformer networks.
Authors' comments: 6 pages, 8 figures. This work has been accepted by IEEE SOCC
(System-on-chip Conference) 2020, and presented by Siyuan Lu in SOCC2020. It
also received the Best Paper Award in the Methodology Track in this conference
Metodi P. Yankov, Uiara Celine de Moura, Francesco Da Ros
Cascades of a machine learning-based EDFA gain model trained on a single physical device and a fully differentiable stimulated Raman scattering fiber model are used to predict and optimize the power profile at the output of an experimental multi-span fully-loaded C-band optical communication system.
Li Yang, Zhezhi He, Junshan Zhang, Deliang Fan
Deep Neural Networks (DNN) could forget the knowledge about earlier tasks when learning new tasks, and this is known as \textit{catastrophic forgetting}. While recent continual learning methods are capable of alleviating the catastrophic forgetting problem on toy-sized datasets, some issues still remain to be tackled when applying them to real-world problems. Recently, the fast mask-based learning method (e.g. Piggyback \cite{mallya2018piggyback}) was proposed to address these issues by learning only a binary element-wise mask in a fast manner, while keeping the backbone model fixed. However, the binary mask has limited modeling capacity for new tasks. A more recent work \cite{hung2019compacting} proposes a compress-grow-based method (CPG) to achieve better accuracy for new tasks by partially training the backbone model, but with order-higher training cost, which makes it infeasible to deploy in popular state-of-the-art edge-/mobile-learning. The primary goal of this work is to simultaneously achieve fast and high-accuracy multi-task adaptation in the continual learning setting. Thus motivated, we propose a new training method called \textit{kernel-wise Soft Mask} (KSM), which learns a kernel-wise hybrid binary and real-value soft mask for each task, while using the same backbone model. Such a soft mask can be viewed as a superposition of a binary mask and a properly scaled real-value tensor, which offers a richer representation capability without low-level kernel support to meet the objective of low hardware overhead. We validate KSM on multiple benchmark datasets against recent state-of-the-art methods (e.g. Piggyback, Packnet, CPG, etc.), which shows good improvement in both accuracy and training cost.
Laura Giordano, Valentina Gliozzi, Daniele Theseider Dupré
In this paper we describe a concept-wise multi-preference semantics for description
logic which has its root in the preferential approach for modeling defeasible
reasoning in knowledge representation. We argue that this proposal, besides
satisfying some desired properties, such as KLM postulates, and avoiding the
drowning problem, also defines a plausible notion of semantics. We motivate the
plausibility of the concept-wise multi-preference semantics by developing a
logical semantics of self-organising maps, which have been proposed as possible
candidates to explain the psychological mechanisms underlying category
generalisation, in terms of multi-preference interpretations.
Authors' comments: 13 pages
Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li et al.
Network pruning can reduce the high computation cost of deep neural network
(DNN) models. However, to maintain their accuracies, sparse models often carry
randomly-distributed weights, leading to irregular computations. Consequently,
sparse models cannot achieve meaningful speedup on commodity hardware (e.g.,
GPU) built for dense matrix computations. As such, prior works usually modify
or design completely new sparsity-optimized architectures for exploiting
sparsity. We propose an algorithm-software co-designed pruning method that
achieves latency speedups on existing dense architectures. Our work builds upon
the insight that the matrix multiplication generally breaks the large matrix
into multiple smaller tiles for parallel execution. We propose a
tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern
at the tile level for efficient execution but allows for irregular, arbitrary
pruning at the global scale to maintain the high accuracy. We implement and
evaluate the sparsity pattern on GPU tensor core, achieving a 1.95x speedup
over the dense model.
Authors' comments: 12 pages, ACM/IEEE Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis (SC20)
Qiong Liu
Debris disks around stars are considered components of planetary systems.
Constraining the dust properties of these disks can provide crucial information on
the formation and evolution of planetary systems. As an all-sky survey, the
\textit{InfRared Astronomical Satellite} (IRAS) contributed greatly to the
search for debris disks and discovered the first debris disk host star (Vega).
The IRAS-detected debris disk sample published by Rhee \citep{rhe07} contains
146 stars with detailed information on dust properties. However, the dust
properties of 45 of them still cannot be determined because of the limitations
of the IRAS database (they have IRAS detections at 60 $\mu$m only). Therefore,
using more sensitive data from the \textit{Wide-field Infrared Survey Explorer}
(WISE), we can better characterize the sample stars. For the stars with IRAS
detections at 60 $\mu$m only, we refit the excess flux densities and obtain
the dust temperatures and fractional luminosities. For the remaining
stars with multi-band IRAS detections, the dust properties are revised, which
shows that the dust temperatures were previously overestimated in the
high-temperature band. Moreover, we identify 17 stars with excesses at WISE 22 $\mu$m,
which have a smaller distribution of distances from Earth and higher fractional
luminosities than the other stars without mid-infrared excess emission. Among
them, 15 stars can be found in previous works.
Authors' comments: 21 pages, 4 figures; 4 Tables, RAA in press
O. V. Maryeva, V. V. Gvaramadze, A. Y. Kniazev, L. N. Berdnikov
We present the results of a study of the Galactic candidate luminous blue
variable Wray 15-906, revealed via detection of its infrared circumstellar
shell (of \approx2 pc in diameter) with the Wide-field Infrared Survey Explorer
and the Herschel Space Observatory. Using the stellar atmosphere code CMFGEN
and the Gaia parallax, we found that Wray 15-906 is a relatively
low-luminosity, log(L/Lsun)\approx5.4, star of temperature of 25\pm2 kK, with a
mass-loss rate of \approx3\times10^{-5} Msun/yr, a wind velocity of 280\pm50
km/s, and a surface helium abundance of 65\pm2 per cent (by mass). In the
framework of single star evolution, the obtained results suggest that Wray
15-906 is a post-red supergiant star with initial mass of \approx26\pm2 Msun
and that before exploding as a supernova it could transform for a short time
into a WN11h star. Our spectroscopic monitoring with the Southern African Large
Telescope (SALT) does not reveal significant changes in the spectrum of Wray
15-906 during the last 8 yr, while the V-band light curve of this star over
years 1999--2019 shows quasi-periodic variability with a period of \approx1700
d and an amplitude of \approx0.1 mag. We estimated the mass of the shell to be
2.9\pm0.5 Msun assuming the gas-to-dust mass ratio of 200. The presence of such
a shell indicates that Wray 15-906 has suffered substantial mass loss in the
recent past. We found that the open star cluster C1128-631 could be the birth
place of Wray 15-906 provided that this star is a rejuvenated product of binary
evolution (a blue straggler).
Authors' comments: 18 pages, 15 figures, accepted to MNRAS
Esa Ollila, Ammar Mian
Huber's criterion can be used for robust joint estimation of regression and
scale parameters in the linear model. Huber's (Huber, 1981) motivation for
introducing the criterion stemmed from non-convexity of the joint maximum
likelihood objective function as well as non-robustness (unbounded influence
function) of the associated ML-estimate of scale. In this paper, we illustrate
how the original algorithm proposed by Huber can be set within the block-wise
minimization majorization framework. In addition, we propose novel
data-adaptive step sizes for both the location and scale, which further
improve the convergence. We then illustrate how Huber's criterion can be used
for sparse learning of an underdetermined linear model using the iterative hard
thresholding approach. We illustrate the usefulness of the algorithms in an
image denoising application and simulation studies.
Authors' comments: To appear in International Workshop on Machine Learning for Signal
Processing (MLSP), 2020
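Two building blocks mentioned above can be sketched in a few lines: Huber's loss function, which is quadratic for small residuals and linear for large ones, and the hard-thresholding operator used inside iterative hard thresholding. The sketch uses the standard textbook definitions and does not reproduce the authors' joint regression-and-scale algorithm or their step sizes; the function names are assumptions.

```python
def huber_rho(residual, c):
    """Huber's loss: quadratic for |r| <= c, linear beyond the threshold c."""
    a = abs(residual)
    if a <= c:
        return 0.5 * residual * residual
    return c * a - 0.5 * c * c


def hard_threshold(vector, k):
    """Keep the k largest-magnitude entries of `vector` and zero out the rest."""
    order = sorted(range(len(vector)), key=lambda i: -abs(vector[i]))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


if __name__ == "__main__":
    print(huber_rho(0.5, c=1.0), huber_rho(3.0, c=1.0))
    print(hard_threshold([0.1, -3.0, 2.0, 0.5], k=2))
```

The linear growth of the loss for large residuals bounds the influence of outliers, while the thresholding step enforces a sparsity level of at most `k` non-zero coefficients.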
Liangwei Li, Liucheng Sun, Chenwei Weng, Chengfu Huo, Weijun Ren
Online electronic coupons (e-coupons) are becoming a primary tool for e-commerce platforms to attract users to place orders. E-coupons are the digital equivalent of traditional paper coupons, which provide customers with discounts or gifts. One of the fundamental related problems is how to deliver e-coupons with minimal cost while maximizing users' willingness to place an order. We call this problem the coupon allocation problem. This is a non-trivial problem since the number of regular users on a mature e-platform often reaches hundreds of millions and there are often multiple types of e-coupons to be allocated. The policy space is extremely large and the online allocation has to satisfy a budget constraint. Besides, one can never observe the responses of one user under different policies, which increases the uncertainty of the policy-making process. Previous work fails to deal with these challenges. In this paper, we decompose the coupon allocation task into two subtasks: the user intent detection task and the allocation task. Accordingly, we propose a two-stage solution: at the first stage (detection stage), we put forward a novel Instantaneous Intent Detection Network (IIDN), which takes the user-coupon features as input and predicts users' real-time intents; at the second stage (allocation stage), we model the allocation problem as a Multiple-Choice Knapsack Problem (MCKP) and provide a computationally efficient allocation method using the intents predicted at the detection stage. We conduct extensive online and offline experiments and the results show the superiority of our proposed framework, which has brought great profits to the platform and continues to function online.
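The Multiple-Choice Knapsack formulation can be made concrete with a small exact solver: each user is a group, each coupon type (including "no coupon") is an option with a cost and a predicted value, and exactly one option is chosen per user under a total budget. The sketch below is a textbook dynamic program over integer costs, not the authors' allocation method, and it would not scale to hundreds of millions of users; all names and numbers are hypothetical.

```python
def solve_mckp(groups, budget):
    """Pick exactly one (cost, value) option per group, maximizing total value
    subject to total cost <= budget. Costs must be non-negative integers.

    Returns (best_value, chosen_option_indices), or None if infeasible.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (budget + 1)  # best[b]: max value at total cost exactly b
    best[0] = 0.0
    history = []
    for options in groups:
        new_best = [neg_inf] * (budget + 1)
        picked = [None] * (budget + 1)  # (option index, previous cost) per cost
        for spent in range(budget + 1):
            if best[spent] == neg_inf:
                continue
            for idx, (cost, value) in enumerate(options):
                total = spent + cost
                if total <= budget and best[spent] + value > new_best[total]:
                    new_best[total] = best[spent] + value
                    picked[total] = (idx, spent)
        best = new_best
        history.append(picked)
    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg_inf:
        return None
    # Walk back through the groups to recover the chosen options.
    selection = []
    spent = end
    for picked in reversed(history):
        idx, spent = picked[spent]
        selection.append(idx)
    selection.reverse()
    return best[end], selection


if __name__ == "__main__":
    # Two users; option 0 means "no coupon" (zero cost, zero value).
    users = [[(0, 0.0), (2, 3.0), (3, 4.0)],
             [(0, 0.0), (1, 2.0), (3, 5.0)]]
    print(solve_mckp(users, budget=4))
```

In the two-stage setting described above, the `value` entries would come from the intents predicted at the detection stage, and the `cost` entries from the face value of each coupon type.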
Wenli Mo, Anthony Gonzalez, Mark Brodwin, Bandon Decker, Peter Eisenhardt, Emily Moravec, S. A. Stanford, Daniel Stern et al.
We present a study of the central radio activity of galaxy clusters at high
redshift. Using a large sample of galaxy clusters at $0.7<z<1.5$ from the
Massive and Distant Clusters of {\it WISE} Survey and the Faint Images of the
Radio Sky at Twenty-Centimeters $1.4$~GHz catalog, we measure the fraction of
clusters containing a radio source within the central $500$~kpc, which we term
the cluster radio-active fraction, and the fraction of cluster galaxies within
the central $500$~kpc exhibiting radio emission. We find tentative
($2.25\sigma$) evidence that the cluster radio-active fraction increases with
cluster richness, while the fraction of cluster galaxies that are
radio-luminous ($L_{1.4~\mathrm{GHz}}\geq10^{25}$~W~Hz$^{-1}$) does not
correlate with richness at a statistically significant level. Compared to that
calculated at $0 < z < 0.6$, the cluster radio-active fraction at $0.7 < z < 1.5$
increases by a factor of $10$. This fraction is also dependent on the radio
luminosity. Clusters at higher redshift are much more likely to host a radio
source of luminosity $L_{1.4~\mathrm{GHz}}\gtrsim10^{26}$~W~Hz$^{-1}$ than are
lower redshift clusters. We compare the fraction of radio-luminous cluster
galaxies to the fraction measured in a field environment. For $0.7<z<1.5$, we
find that both the cluster and field radio-luminous galaxy fractions increase
with stellar mass, regardless of environment, though at fixed stellar mass,
cluster galaxies are roughly $2$ times more likely to be radio-luminous than
field galaxies.
Authors' comments: 12 pages, 6 figures, accepted to ApJ