Jiajun Zhang, Pengyuan Ren, Jianmin Li
Pedestrian Attribute Recognition (PAR) has aroused extensive attention due to
its important role in video surveillance scenarios. In most cases, the
existence of a particular attribute is strongly related to a partial region.
Recent works design complicated modules, e.g., attention mechanisms and
body-part proposals, to localize the region corresponding to each attribute.
These works further show that precise localization of attribute-specific
regions helps improve performance. However, these part-information-based
methods are still not accurate and increase model complexity, which makes them
hard to deploy in realistic applications. In this paper, we propose a Deep
Template Matching based method to capture body-part features with less
computation. Further, we also propose an auxiliary supervision method that uses
human pose keypoints to guide the learning toward discriminative local cues.
Extensive experiments show that the proposed method outperforms
state-of-the-art approaches while having lower computational complexity on
large-scale pedestrian attribute datasets, including PETA, PA-100K, RAP, and
RAPv2 zs.
Authors' comments: 8 pages, 5 figures
Ayush Manish Agrawal, Atharva Tendle, Harshvardhan Sikka, Sahib Singh, Amr Kayid
Understanding the per-layer learning dynamics of deep neural networks is of
significant interest as it may provide insights into how neural networks learn
and the potential for better training regimens. We investigate learning in Deep
Convolutional Neural Networks (CNNs) by measuring the relative weight change of
layers while training. Several interesting trends emerge in a variety of CNN
architectures across various computer vision classification tasks, including
the overall increase in relative weight change of later layers as compared to
earlier ones.
Authors' comments: 14 pages, 20 figures
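As a rough illustration of the measurement described in the abstract above, the following Python sketch computes a per-layer relative weight change between two training snapshots. The definition used here (sum of absolute differences divided by the sum of absolute previous weights), the function names, and the toy snapshots are assumptions made for this example, not necessarily the paper's exact formula.

```python
def relative_weight_change(prev, curr, eps=1e-12):
    """Return sum|curr - prev| / sum|prev| for one layer's flattened weights."""
    if len(prev) != len(curr):
        raise ValueError("weight vectors must have the same length")
    delta = sum(abs(c - p) for p, c in zip(prev, curr))
    scale = sum(abs(p) for p in prev)
    # eps guards against division by zero for an all-zero layer.
    return delta / (scale + eps)


def per_layer_change(prev_layers, curr_layers):
    """Map each layer name to its relative weight change between two snapshots."""
    return {
        name: relative_weight_change(prev_layers[name], curr_layers[name])
        for name in prev_layers
    }


# Hypothetical snapshots taken before and after one training epoch.
before = {"conv1": [0.5, -0.5, 1.0], "fc": [0.2, -0.2, 0.4]}
after = {"conv1": [0.5, -0.4, 1.0], "fc": [0.4, -0.2, 0.2]}
changes = per_layer_change(before, after)
```

In this toy case the later layer (`fc`) changes more than the earlier one (`conv1`), which is the kind of trend the abstract reports; real measurements would use the flattened weights of an actual network.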
Songxiang Liu, Yuewen Cao, Na Hu, Dan Su, Helen Meng
This paper presents FastSVC, a light-weight cross-domain singing voice
conversion (SVC) system, which can achieve high conversion performance, with
inference speed 4x faster than real-time on CPUs. FastSVC uses a Conformer-based
phoneme recognizer to extract singer-agnostic linguistic features from singing
signals. A feature-wise linear modulation based generator is used to synthesize
waveform directly from linguistic features, leveraging information from
sine-excitation signals and loudness features. The waveform generator can be
trained conveniently using a multi-resolution spectral loss and an adversarial
loss. Experimental results show that the proposed FastSVC system, compared with
a computationally heavy baseline system, can achieve comparable conversion
performance in some scenarios and significantly better conversion performance
in other scenarios. Moreover, the proposed FastSVC system achieves desirable
cross-lingual singing conversion performance. The inference speed of the
FastSVC system is 3x and 70x faster than the baseline system on GPUs and CPUs,
respectively.
Authors' comments: Accepted by IEEE International Conference on Multimedia and Expo
(ICME) 2021
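Feature-wise linear modulation (FiLM), mentioned in the abstract above, scales and shifts each feature channel using conditioning-dependent parameters. The sketch below shows only that core operation in plain Python. How FastSVC derives the scale and shift from sine-excitation and loudness features is not described in the abstract, so `gamma` and `beta` are placeholder values here, and using one scalar pair per channel is a simplifying assumption.

```python
def film(features, gamma, beta):
    """Apply y[c][t] = gamma[c] * x[c][t] + beta[c] to a channels-by-time map."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two channels, two time steps; gamma and beta are illustrative placeholders.
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```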
Bo Li, Wiro J. Niessen, Stefan Klein, M. Arfan Ikram, Meike W. Vernooij, Esther E. Bron
Analysis of longitudinal changes in imaging studies often involves both
segmentation of structures of interest and registration of multiple timeframes.
The accuracy of such analysis could benefit from a tailored framework that
jointly optimizes both tasks to fully exploit the information available in the
longitudinal data. Most learning-based registration algorithms, including joint
optimization approaches, currently suffer from bias due to selection of a fixed
reference frame and only support pairwise transformations. We here propose an
analytical framework based on an unbiased learning strategy for group-wise
registration that simultaneously registers images to the mean space of a group
to obtain consistent segmentations. We evaluate the proposed method on
longitudinal analysis of a white matter tract in a brain MRI dataset with 2-3
time-points for 3249 individuals, i.e., 8045 images in total. The
reproducibility of the method is evaluated on test-retest data from 97
individuals. The results confirm that the implicit reference image is an
average of the input images. In addition, the proposed framework leads to
consistent segmentations and significantly lower processing bias than that of a
pair-wise fixed-reference approach. This processing bias is even smaller than
that obtained when translating segmentations by only one voxel, which can be
attributed to subtle numerical instabilities and interpolation. Therefore, we
postulate that the proposed mean-space learning strategy could be widely
applied to learning-based registration tasks. In addition, this group-wise
framework introduces a novel approach to learning-based longitudinal studies by
directly constructing an unbiased within-subject template, allowing
reliable and efficient analysis of spatio-temporal imaging biomarkers.
Authors' comments: SPIE Medical Imaging 2021 (oral)
Ryo Fujii, Masato Mita, Kaori Abe, Kazuaki Hanawa, Makoto Morishita, Jun Suzuki, Kentaro Inui
Neural Machine Translation (NMT) has shown drastic improvement in its quality
when translating clean input, such as text from the news domain. However,
existing studies suggest that NMT still struggles with certain kinds of input
with considerable noise, such as User-Generated Content (UGC) on the Internet.
To make better use of NMT for cross-cultural communication, one of the most
promising directions is to develop a model that correctly handles these
expressions. Though its importance has been recognized, it is still not clear
as to what creates the great gap in performance between the translation of
clean input and that of UGC. To answer the question, we present a new dataset,
PheMT, for evaluating the robustness of MT systems against specific linguistic
phenomena in Japanese-English translation. Our experiments with the created
dataset revealed that not only our in-house models but even widely used
off-the-shelf systems are greatly disturbed by the presence of certain
phenomena.
Authors' comments: 15 pages, 4 figures, accepted at COLING 2020
Huijuan Zhou, Xianyang Zhang, Jun Chen
The family-wise error rate (FWER) has been widely used in genome-wide association studies. With the increasing availability of functional genomics data, it is possible to increase the detection power by leveraging these genomic functional annotations. Previous efforts to accommodate covariates in multiple testing focus on the false discovery rate control while covariate-adaptive FWER-controlling procedures remain under-developed. Here we propose a novel covariate-adaptive FWER-controlling procedure that incorporates external covariates which are potentially informative of either the statistical power or the prior null probability. An efficient algorithm is developed to implement the proposed method. We prove its asymptotic validity and obtain the rate of convergence through a perturbation-type argument. Our numerical studies show that the new procedure is more powerful than competing methods and maintains robustness across different settings. We apply the proposed approach to the UK Biobank data and analyze 27 traits with 9 million single-nucleotide polymorphisms tested for associations. Seventy-five genomic annotations are used as covariates. Our approach detects more genome-wide significant loci than other methods in 21 out of the 27 traits.
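The abstract above does not spell out the proposed procedure. As background only, the sketch below shows the classical weighted Bonferroni rule, a much simpler way of letting covariate-derived weights redistribute the FWER budget among hypotheses. It is not the authors' method. The weights are hypothetical and are assumed to be chosen without looking at the p-values.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha * w_i / m, with weights averaging 1."""
    m = len(p_values)
    if len(weights) != m:
        raise ValueError("need one weight per p-value")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Rescale so the weights sum to m; the thresholds then sum to alpha.
    scaled = [w * m / total for w in weights]
    return [p <= alpha * w / m for p, w in zip(p_values, scaled)]


p = [0.02, 0.02, 0.001, 0.5]
# Hypothetical prior weights, e.g. derived from functional annotations.
informed = weighted_bonferroni(p, [3, 1, 0, 0])
uniform = weighted_bonferroni(p, [1, 1, 1, 1])
```

With uniform weights this reduces to ordinary Bonferroni. The informed weights rescue the first hypothesis but give up the third, which shows both the potential gain and the cost of misplaced weights.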
Isaac Alonso Asensio, Claudio Dalla Vecchia
Widely used Lagrangian numerical codes that compute the physical interaction
with neighbouring resolution elements (particles) duplicate the calculation of
the interaction between pairs of particles. We developed an algorithm that
reduces the number of redundant calculations. The algorithm makes use of a hash
function to flag already computed interactions and possible collisions. The
result of the hashing is stored in two caches. Without limiting the cache
memory usage, all duplicated calculations can be avoided, achieving a
speed-up of a factor of two. We show that, by limiting the cache size (in bits) to
double the typical number of neighbouring particles, 70 per cent of the
redundant calculations can be avoided, yielding a speed-up of almost 35 per
cent.
Authors' comments: There are existing methods that solve this problem more efficiently
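The idea of flagging already computed pair interactions, described in the abstract above, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses an unbounded Python `set` instead of the paper's fixed-size, bit-level caches and hash function, and it assumes the interaction is antisymmetric (the contribution on `j` is minus the contribution on `i`). All names are hypothetical.

```python
def pair_key(i, j):
    """Order-independent key so (i, j) and (j, i) share one cache entry."""
    return (i, j) if i < j else (j, i)


def accumulate(neighbours, interaction):
    """Sum pairwise contributions, evaluating each unordered pair only once.

    neighbours maps a particle index to an iterable of neighbour indices.
    interaction(i, j) returns the contribution on i due to j.
    """
    totals = {i: 0.0 for i in neighbours}
    done = set()          # unbounded cache of pairs already computed
    evaluations = 0
    for i, nbrs in neighbours.items():
        for j in nbrs:
            key = pair_key(i, j)
            if key in done:
                continue  # the mirrored pair was handled earlier
            done.add(key)
            f = interaction(i, j)
            evaluations += 1
            totals[i] += f
            totals[j] = totals.get(j, 0.0) - f
    return totals, evaluations


# Toy 1D example: the "interaction" is just the signed separation.
positions = {0: 0.0, 1: 1.0, 2: 3.0}
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
totals, evaluations = accumulate(
    neighbours, lambda i, j: positions[j] - positions[i]
)
```

Here three evaluations replace the six a naive double loop would perform, matching the factor-of-two saving quoted for the unlimited-cache case.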
Hang Dong, Víctor Suárez-Paniagua, William Whiteley, Honghan Wu
Diagnostic or procedural coding of clinical notes aims to derive a coded
summary of disease-related information about patients. Such coding is usually
done manually in hospitals but could potentially be automated to improve the
efficiency and accuracy of medical coding. Recent studies on deep learning for
automated medical coding have achieved promising performance. However, the
explainability of these models is usually poor, preventing them from being used
confidently in supporting clinical practice. Another limitation is that these
models mostly assume independence among labels, ignoring the complex
correlations among medical codes, which can potentially be exploited to improve
performance. First, we propose a Hierarchical Label-wise Attention Network (HLAN),
which aims to interpret the model by quantifying the importance (as attention
weights) of words and sentences related to each of the labels. Secondly, we
propose to enhance the major deep learning models with a label embedding (LE)
initialisation approach, which learns a dense, continuous vector representation
and then injects the representation into the final layers and the label-wise
attention layers in the models. We evaluated the methods using three settings
on the MIMIC-III discharge summaries: full codes, top-50 codes, and the UK NHS
COVID-19 shielding codes. Experiments were conducted to compare HLAN and LE
initialisation to the state-of-the-art neural network based methods. HLAN
achieved the best Micro-level AUC and $F_1$ on the top-50 code prediction and
comparable results on the NHS COVID-19 shielding code prediction to other
models. By highlighting the most salient words and sentences for each label,
HLAN showed more meaningful and comprehensive model interpretation compared to
its downgraded baselines and the CNN-based models. LE initialisation
consistently boosted most deep learning models for automated medical coding.
Authors' comments: Accepted to Journal of Biomedical Informatics, structured abstract in
full text, 21 pages, 5 figures, 4 supplementary materials (4 extra pages)
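To make the notion of label-wise attention weights in the abstract above concrete, the sketch below computes one softmax attention distribution per label over a sequence of token vectors. It is a minimal single-level illustration: dot-product scoring against a per-label query vector is an assumption, and HLAN's hierarchical word-level and sentence-level structure is not reproduced. All names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_wise_attention(hidden, label_queries):
    """Return {label: (weights, context)} with one attention map per label.

    hidden is a list of T token vectors of dimension D.
    label_queries maps each label to a D-dimensional query vector.
    """
    result = {}
    for label, query in label_queries.items():
        scores = [sum(h * q for h, q in zip(vec, query)) for vec in hidden]
        weights = softmax(scores)
        # Context vector: attention-weighted sum of the token vectors.
        context = [
            sum(w * vec[k] for w, vec in zip(weights, hidden))
            for k in range(len(query))
        ]
        result[label] = (weights, context)
    return result


hidden = [[1.0, 0.0], [0.0, 1.0]]
queries = {"A": [2.0, 0.0], "B": [0.0, 2.0]}
attention = label_wise_attention(hidden, queries)
```

Each label gets its own weights over the same tokens, which is what allows the most salient words to be highlighted separately for each code.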
Philipp Benz, Chaoning Zhang, Adil Karjauv, In So Kweon
Convolutional neural networks (CNNs) have advanced significantly; however, they are widely known to be vulnerable to adversarial attacks. Adversarial training is the most widely used technique for improving adversarial robustness to strong white-box attacks. Prior works have evaluated and improved the average robustness of models without class-wise evaluation. The average evaluation alone might provide a false sense of robustness. For example, the attacker can focus on attacking the vulnerable class, which can be dangerous, especially when the vulnerable class is a critical one, such as "human" in autonomous driving. We present an empirical study on the class-wise accuracy and robustness of adversarially trained models. We find that there exists inter-class discrepancy for accuracy and robustness even when the training dataset has an equal number of samples for each class. For example, in CIFAR10, "cat" is much more vulnerable than other classes. Moreover, this inter-class discrepancy also exists for normally trained models, while adversarial training tends to further increase the discrepancy. Our work aims to investigate the following questions: (a) Is the phenomenon of inter-class discrepancy universal regardless of datasets, model architectures and optimization hyper-parameters? (b) If so, what are possible explanations for the inter-class discrepancy? (c) Can the techniques proposed in long-tail classification be readily extended to adversarial training to address the inter-class discrepancy?
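The difference between average and class-wise evaluation discussed in the abstract above can be seen in a few lines of Python. The labels and predictions below are made-up toy data, and `class_wise_accuracy` is a hypothetical helper. The same computation applies to robust accuracy if the predictions are taken on adversarial examples.

```python
from collections import Counter


def class_wise_accuracy(labels, predictions):
    """Return {class: fraction of that class's samples predicted correctly}."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    total = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    return {c: correct[c] / total[c] for c in total}


labels = ["cat", "cat", "dog", "dog"]
predictions = ["dog", "cat", "dog", "dog"]
per_class = class_wise_accuracy(labels, predictions)
average = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
```

The average of 0.75 hides the fact that one class is right only half of the time, which is the "false sense of robustness" the abstract warns about.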
Ethan M. Alt, Matthew A. Psioda, Joseph G. Ibrahim
Given the cost and duration of phase III and phase IV clinical trials, the development of statistical methods for go/no-go decisions is vital. In this paper, we introduce a Bayesian methodology to compute the probability of success based on the current data of a treatment regimen for the multivariate linear model. Our approach utilizes a Bayesian seemingly unrelated regression model, which allows for multiple endpoints to be modeled jointly even if the covariates between the endpoints are different. Correlations between endpoints are explicitly modeled. This Bayesian joint modeling approach unifies single and multiple testing procedures under a single framework. We develop an approach to multiple testing that asymptotically guarantees strict family-wise error rate control, and is more powerful than frequentist approaches to multiplicity. The method effectively yields those of Ibrahim et al. and Chuang-Stein as special cases, and, to our knowledge, is the only method that allows for robust sample size determination for multiple endpoints and/or hypotheses and the only method that provides strict family-wise type I error control in the presence of multiplicity.
Younggyo Seo, Kimin Lee, Ignasi Clavera, Thanard Kurutach, Jinwoo Shin, Pieter Abbeel
Model-based reinforcement learning (RL) has shown great potential in various
control tasks in terms of both sample-efficiency and final performance.
However, learning a generalizable dynamics model robust to changes in dynamics
remains a challenge since the target transition dynamics follow a multi-modal
distribution. In this paper, we present a new model-based RL algorithm, coined
trajectory-wise multiple choice learning, that learns a multi-headed dynamics
model for dynamics generalization. The main idea is updating the most accurate
prediction head to specialize each head in certain environments with similar
dynamics, i.e., clustering environments. Moreover, we incorporate context
learning, which encodes dynamics-specific information from past experiences
into the context latent vector, enabling the model to perform online adaptation
to unseen environments. Finally, to utilize the specialized prediction heads
more effectively, we propose an adaptive planning method, which selects the
most accurate prediction head over a recent experience. Our method exhibits
superior zero-shot generalization performance across a variety of control
tasks, compared to state-of-the-art RL methods. Source code and videos are
available at https://sites.google.com/view/trajectory-mcl.
Authors' comments: Accepted in NeurIPS2020. First two authors contributed equally,
website: https://sites.google.com/view/trajectory-mcl code:
https://github.com/younggyoseo/trajectory_mcl
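The head-selection step at the core of trajectory-wise multiple choice learning, as described in the abstract above, can be sketched as follows: every prediction head is scored on a whole trajectory, and only the most accurate one would then be updated. This is a simplified illustration with scalar states and hand-written heads. The mean-squared-error criterion, the function names, and the toy dynamics are assumptions, and the gradient update, context encoder, and planner are omitted.

```python
def trajectory_loss(predict, trajectory):
    """Mean squared one-step prediction error of one head over a trajectory.

    trajectory is a list of (state, action, next_state) tuples of floats.
    """
    errors = [(predict(s, a) - s_next) ** 2 for s, a, s_next in trajectory]
    return sum(errors) / len(errors)


def select_head(heads, trajectory):
    """Return (index of the most accurate head, list of per-head losses)."""
    losses = [trajectory_loss(head, trajectory) for head in heads]
    best = min(range(len(heads)), key=losses.__getitem__)
    return best, losses


# Two hypothetical heads, each matching a different environment.
heads = [lambda s, a: s + a, lambda s, a: s + 2 * a]
# Trajectory generated by the dynamics s' = s + 2a.
trajectory = [(0.0, 1.0, 2.0), (2.0, 1.0, 4.0), (4.0, -1.0, 2.0)]
best, losses = select_head(heads, trajectory)
```

Selecting per trajectory rather than per transition is what lets each head specialize in environments with similar dynamics.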
Nithin Rao Koluguri, Jason Li, Vitaly Lavrukhin, Boris Ginsburg
We propose SpeakerNet - a new neural architecture for speaker recognition and
speaker verification tasks. It is composed of residual blocks with 1D
depth-wise separable convolutions, batch-normalization, and ReLU layers. This
architecture uses an x-vector based statistics pooling layer to map
variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M
is a simple lightweight model with just 5M parameters. It doesn't use voice
activity detection (VAD) and achieves close to state-of-the-art performance
scoring an Equal Error Rate (EER) of 2.10% on the VoxCeleb1 cleaned and 2.29%
on the VoxCeleb1 trial files.
Authors' comments: Preprint, submitted to ICASSP 2021
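Statistics pooling, which the abstract above uses to map variable-length utterances to a fixed-length vector, can be illustrated with the sketch below: the per-dimension mean and standard deviation over time are concatenated. The use of the population standard deviation and the toy frame values are assumptions, and the layers that turn the pooled vector into the final embedding are not shown.

```python
import math


def statistics_pooling(frames):
    """Concatenate per-dimension mean and std over time.

    frames is a list of T frame vectors of dimension D; the result has
    length 2 * D regardless of T.
    """
    if not frames:
        raise ValueError("need at least one frame")
    t = len(frames)
    d = len(frames[0])
    means = [sum(f[k] for f in frames) / t for k in range(d)]
    stds = [
        math.sqrt(sum((f[k] - means[k]) ** 2 for f in frames) / t)
        for k in range(d)
    ]
    return means + stds


short = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
longer = statistics_pooling([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0], [0.0, 2.0]])
```

Both utterances yield a vector of the same length even though they contain different numbers of frames.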
Mohammad Hamghalam, Baiying Lei, Tianfu Wang
Structural magnetic resonance imaging (MRI) has been widely utilized for analysis and diagnosis of brain diseases. Automatic segmentation of brain tumors is a challenging task for computer-aided diagnosis due to low tissue contrast in the tumor subregions. To overcome this, we devise a novel pixel-wise segmentation framework through a convolutional 3D to 2D MR patch conversion model to predict the class label of the central pixel in the input sliding patches. Specifically, we first extract 3D patches from each modality to calibrate slices through the squeeze and excitation (SE) block. Then, the output of the SE block is fed directly into subsequent bottleneck layers to reduce the number of channels. Finally, the calibrated 2D slices are concatenated to obtain multimodal features through a 2D convolutional neural network (CNN) for prediction of the central pixel. In our architecture, both local inter-slice and global intra-slice features are jointly exploited to predict the class label of the central voxel in a given patch through the 2D CNN classifier. We implicitly apply all modalities through trainable parameters to assign weights to the contributions of each sequence for segmentation. Experimental results on the segmentation of brain tumors in multimodal MRI scans (BraTS'19) demonstrate that our proposed method can efficiently segment the tumor regions.
Wenchi Ma, Miao Yu, Kaidong Li, Guanghui Wang
Layer-wise learning, as an alternative to global back-propagation, is easy to interpret and analyze, and it is memory efficient. Recent studies demonstrate that layer-wise learning can achieve state-of-the-art performance in image classification on various datasets. However, previous studies of layer-wise learning are limited to networks with simple hierarchical structures, and the performance decreases severely for deeper networks like ResNet. This paper, for the first time, reveals that the fundamental reason impeding the scale-up of layer-wise learning is the relatively poor separability of the feature space in shallow layers. This argument is empirically verified by controlling the intensity of the convolution operation in local layers. We discover that the poorly separable features from shallow layers are mismatched with the strong supervision constraint throughout the entire network, making layer-wise learning sensitive to network depth. The paper further proposes a downsampling acceleration approach to weaken the poor learning of shallow layers so as to transfer the learning emphasis to the deep feature space, where the separability matches better with the supervision constraint. Extensive experiments have been conducted to verify the new finding and demonstrate the advantages of the proposed downsampling acceleration in improving the performance of layer-wise learning.
Alexandre Moly, Alexandre Aksenov, Alim Louis Benabid, Tetiana Aksenova
Objective. Brain-computer interfaces (BCIs) create a new communication pathway between the brain and an effector without neuromuscular activation. BCI experiments have highlighted high intra- and inter-subject variability in BCI decoders. Although a BCI model generally relies on neurological markers that generalize to the majority of subjects, it requires generating a wide range of neural features to include possible neurophysiological patterns. However, the processing of noisy and high-dimensional features, such as brain signals, brings several challenges, such as model calibration issues, model generalization and interpretation problems, and hardware-related obstacles. Approach. An online adaptive group-wise sparse decoder named Lp-Penalized REW-NPLS algorithm (PREW-NPLS) is presented to reduce the feature space dimension employed for BCI decoding. The proposed decoder was designed to create BCI systems with low computational cost suited for portable applications and was tested in an offline pseudo-online study based on online closed-loop BCI control of the left and right 3D arm movements of a virtual avatar from the ECoG recordings of a tetraplegic patient. Main results. The PREW-NPLS algorithm shows decoding performance at least as good as that of the REW-NPLS algorithm. Moreover, the decoding performance of PREW-NPLS was achieved with sparse models in which up to 64% and 75% of the electrodes were set to 0 for the left and right hand models, respectively, using L1-PREW-NPLS. Significance. The proposed solution is an online incremental adaptive algorithm, suitable for online adaptive decoder calibration, that estimates sparse decoding solutions. The PREW-NPLS models are suited for portable applications with low computational power, using only a small number of electrodes without degrading the decoding performance.
Nikolaos Manginas, Ilias Chalkidis, Prodromos Malakasiotis
Although BERT is widely used by the NLP community, little is known about its
inner workings. Several attempts have been made to shed light on certain
aspects of BERT, often with contradicting conclusions. A frequently raised concern
focuses on BERT's over-parameterization and under-utilization issues. To this
end, we propose a novel approach to fine-tune BERT in a structured manner.
Specifically, we focus on Large Scale Multilabel Text Classification (LMTC)
where documents are assigned one or more labels from a large predefined
set of hierarchically organized labels. Our approach guides specific BERT
layers to predict labels from specific hierarchy levels. Experimenting with two
LMTC datasets we show that this structured fine-tuning approach not only yields
better classification results but also leads to better parameter utilization.
Authors' comments: 5 pages, short paper at SPNLP 2020 (EMNLP 2020 Workshop)
Seyedsaeid Mirkamali, P. Nagabhushan
Image segmentation has long been a basic problem in computer vision. Depth-wise Layering is a kind of segmentation that slices an image in a depth-wise sequence, unlike conventional image segmentation problems, which deal with surface-wise decomposition. The proposed Depth-wise Layering technique uses a single depth image of a static scene to slice it into multiple layers. The technique employs a thresholding approach to segment rows of the dense depth map into smaller partitions, called Line-Segments in this paper. Then, it uses the line-segment labelling method to identify the number of objects and layers of the scene independently. The final stage is to link objects of the scene to their respective object-layers. We evaluate the efficiency of the proposed technique by applying it to many images along with their dense depth maps. The experiments have shown promising layering results.
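The row-wise thresholding step mentioned in the abstract above can be sketched as follows. The abstract does not give the exact thresholding rule, so this example assumes a simple one: a new line segment starts wherever two horizontally adjacent depth values differ by more than a threshold. The function name, threshold, and depth values are hypothetical.

```python
def row_line_segments(depth_row, threshold):
    """Split one depth-map row into runs of similar depth.

    Returns a list of (start, end) index pairs, both inclusive.
    """
    if not depth_row:
        return []
    segments = []
    start = 0
    for k in range(1, len(depth_row)):
        # A large jump in depth between neighbours closes the current run.
        if abs(depth_row[k] - depth_row[k - 1]) > threshold:
            segments.append((start, k - 1))
            start = k
    segments.append((start, len(depth_row) - 1))
    return segments


row = [1.0, 1.1, 1.2, 3.0, 3.1, 1.0]
segments = row_line_segments(row, threshold=0.5)
```

Applying this to every row gives the per-row partitions that a later labelling stage could group into objects and layers.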
Zhong-Qiu Wang, Peidong Wang, DeLiang Wang
We propose multi-microphone complex spectral mapping, a simple way of
applying deep learning for time-varying non-linear beamforming, for speaker
separation in reverberant conditions. We aim at both speaker separation and
dereverberation. Our study first investigates offline utterance-wise speaker
separation and then extends to block-online continuous speech separation (CSS).
Assuming a fixed array geometry between training and testing, we train deep
neural networks (DNN) to predict the real and imaginary (RI) components of
target speech at a reference microphone from the RI components of multiple
microphones. We then integrate multi-microphone complex spectral mapping with
minimum variance distortionless response (MVDR) beamforming and post-filtering
to further improve separation, and combine it with frame-level speaker counting
for block-online CSS. Although our system is trained on simulated room impulse
responses (RIR) based on a fixed number of microphones arranged in a given
geometry, it generalizes well to a real array with the same geometry.
State-of-the-art separation performance is obtained on the simulated two-talker
SMS-WSJ corpus and the real-recorded LibriCSS dataset.
Authors' comments: 14 pages, 6 figures. To appear in IEEE/ACM Transactions on Audio,
Speech, and Language Processing. Sound demo
https://zqwang7.github.io/demos/SMSWSJ_demo/taslp20_SMSWSJ_demo.html
Hujie Pan, Xuesong Li, Min Xu
The classic algebraic reconstruction technique (ART) for computed tomography requires pre-determined weights of the voxels for projecting pixel values. However, such weights cannot be accurately obtained due to limitations in physical understanding and computational resources. In this study, we propose a semi-case-wise learning-based method named Weight Encode Reconstruction Network (WERNet) to tackle the issues mentioned above. The model is trained in a self-supervised manner without the label of a voxel set. It contains two branches, including the voxel weight encoder and the voxel attention part. Using gradient normalization, we are able to co-train the encoder and voxel set in a numerically stable way. With WERNet, the reconstructed result was obtained with a cosine similarity greater than 0.999 with the ground truth. Moreover, the model shows an extraordinary denoising capability compared to the classic ART method. In the generalization test of the model, the encoder is transferable from a voxel set with complex structure to unseen cases without loss of accuracy.
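For context, the classic ART baseline that the abstract above refers to can be written as a Kaczmarz-style row-action iteration: each measured projection pulls the voxel estimate toward consistency with its weight row. The sketch below illustrates that baseline only, not WERNet. The weight matrix is a tiny made-up example, and in practice obtaining accurate weights is exactly the difficulty the abstract describes.

```python
def art_reconstruct(weights, projections, sweeps=100, relaxation=1.0):
    """Estimate voxel values x such that weights @ x is close to projections.

    weights is a list of rows (one per projection pixel), each a list of
    per-voxel weights; projections is the list of measured pixel values.
    """
    x = [0.0] * len(weights[0])
    for _ in range(sweeps):
        for row, p in zip(weights, projections):
            norm = sum(w * w for w in row)
            if norm == 0.0:
                continue  # a ray that touches no voxel carries no information
            residual = p - sum(w * v for w, v in zip(row, x))
            step = relaxation * residual / norm
            # Project the estimate onto the hyperplane defined by this ray.
            x = [v + step * w for v, w in zip(x, row)]
    return x


# Hypothetical 3-voxel object seen by three rays.
weights = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
projections = [3.0, 5.0, 4.0]  # produced by voxel values [1, 2, 3]
voxels = art_reconstruct(weights, projections)
```

With exact weights and consistent data the iteration recovers the voxel values; errors in `weights` propagate directly into the reconstruction.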
Oskari Miettinen
Physically unassociated background or foreground objects seen towards
submillimetre sources are potential contaminants of both the studies of young
stellar objects embedded in Galactic dust clumps and multiwavelength
counterparts of submillimetre galaxies (SMGs). We employed the near-infrared
and mid-infrared data from the Wide-field Infrared Survey Explorer (WISE) and
the submillimetre data from the Planck satellite, and uncovered a source,
namely WISE J044232.92+322734.9, whose WISE infrared colours suggest that it is
a star-forming galaxy (SFG), and which is seen in projection towards the
Planck-detected dust clump PGCC G169.20-8.96. We used the MAGPHYS+photo-$z$
spectral energy distribution code to derive the photometric redshift and
physical properties of J044232.92. The redshift was derived to be $z_{\rm
phot}=1.132^{+0.280}_{-0.165}$, while, for example, the stellar mass, IR
(8-1000 $\mu$m) luminosity, and star formation rate were derived to be
$M_{\star}=4.6^{+4.7}_{-2.5}\times10^{11}$ M$_{\odot}$, $L_{\rm
IR}=2.8^{+5.7}_{-1.5}\times10^{12}$ L$_{\odot}$, and ${\rm
SFR}=191^{+580}_{-146}$ ${\rm M}_{\odot}$ yr$^{-1}$. The derived value of
$L_{\rm IR}$ suggests that J044232.92 could be an ultraluminous infrared
galaxy, and we found that it is consistent with a main sequence SFG at a
redshift of 1.132. Moreover, the estimated physical properties of J044232.92
are comparable to those of SMGs. Further observations, in particular
high-resolution (sub-)millimetre and radio continuum imaging, are needed to
better constrain the redshift and physical properties of J044232.92 and to see
if the source really is a galaxy seen through a Galactic dust clump, in
particular an SMG population member at $z\sim1.1$.
Authors' comments: 7 pages, 4 figures, 3 tables, accepted for publication in A&A,
abstract abridged for arXiv