Jinlei Hou, Yingying Zhang, Qiaoyong Zhong, Di Xie, Shiliang Pu, Hong Zhou
Reconstruction-based methods play an important role in unsupervised anomaly
detection in images. Ideally, we expect a perfect reconstruction for normal
samples and poor reconstruction for abnormal samples. Since the
generalizability of deep neural networks is difficult to control, existing
models such as autoencoders do not work well. In this work, we interpret the
reconstruction of an image as a divide-and-assemble procedure. Surprisingly, by
varying the granularity of division on feature maps, we are able to modulate
the reconstruction capability of the model for both normal and abnormal
samples. That is, finer granularity leads to better reconstruction, while
coarser granularity leads to poorer reconstruction. With proper granularity,
the gap between the reconstruction error of normal and abnormal samples can be
maximized. The divide-and-assemble framework is implemented by embedding a
novel multi-scale block-wise memory module into an autoencoder network.
Besides, we introduce adversarial learning and explore the semantic latent
representation of the discriminator, which improves the detection of subtle
anomalies. We achieve state-of-the-art performance on the challenging MVTec AD
dataset. Remarkably, we improve the vanilla autoencoder model by 10.1% in terms
of the AUROC score.
Authors' comments: accepted by ICCV 2021
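The abstract above scores samples by reconstruction error and reports AUROC. The following is only a minimal sketch of that evaluation idea, not the authors' model: the mean-squared-error score, the function names, and the toy data are all assumptions made for illustration.

```python
def reconstruction_error(image, reconstruction):
    """Mean squared error between a flattened image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(image, reconstruction)) / len(image)


def auroc(normal_scores, abnormal_scores):
    """Probability that a random abnormal sample scores above a random normal one.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    wins = 0.0
    for abnormal in abnormal_scores:
        for normal in normal_scores:
            if abnormal > normal:
                wins += 1.0
            elif abnormal == normal:
                wins += 0.5
    return wins / (len(abnormal_scores) * len(normal_scores))


# Toy data (hypothetical): normal images are reconstructed well, abnormal ones poorly.
normal = [reconstruction_error([0.2, 0.4], [0.2, 0.5]),
          reconstruction_error([0.9, 0.1], [0.8, 0.1])]
abnormal = [reconstruction_error([0.2, 0.4], [0.9, 0.0]),
            reconstruction_error([0.5, 0.5], [0.0, 1.0])]
print(auroc(normal, abnormal))
```

A wider gap between the two groups of reconstruction errors pushes the AUROC towards 1, which is the quantity the granularity choice is said to influence.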
David Bonet, Antonio Ortega, Javier Ruiz-Hidalgo, Sarath Shekkizhar
State-of-the-art neural network architectures continue to scale in size and
deliver impressive generalization results, although this comes at the expense
of limited interpretability. In particular, a key challenge is to determine
when to stop training the model, as this has a significant impact on
generalization. Convolutional neural networks (ConvNets) comprise
high-dimensional feature spaces formed by the aggregation of multiple channels,
where analyzing intermediate data representations and the model's evolution can
be challenging owing to the curse of dimensionality. We present channel-wise
DeepNNK (CW-DeepNNK), a novel channel-wise generalization estimate based on
non-negative kernel regression (NNK) graphs with which we perform local
polytope interpolation on low-dimensional channels. This method leads to
instance-based interpretability of both the learned data representations and
the relationship between channels. Motivated by our observations, we use
CW-DeepNNK to propose a novel early stopping criterion that (i) does not
require a validation set, (ii) is based on a task performance metric, and (iii)
allows stopping to be reached at different points for each channel. Our
experiments demonstrate that our proposed method has advantages as compared to
the standard criterion based on validation set performance.
Authors' comments: Submitted to APSIPA 2021
Andrés Gómez, Thomas Genevois, Jerome Lussereau, Christian Laugier
Object detection is a critical problem for the safe interaction between
autonomous vehicles and road users. Deep-learning methodologies allowed the
development of object detection approaches with better performance. However,
there is still the challenge to obtain more characteristics from the objects
detected in real-time. The main reason is that more information from the
environment's objects can improve the autonomous vehicle capacity to face
different urban situations. This paper proposes a new approach to detect static
and dynamic objects in front of an autonomous vehicle. Our approach can also
get other characteristics from the objects detected, like their position,
velocity, and heading. We develop our proposal by fusing the environment
interpretations produced by YoloV3 and a Bayesian filter. To
demonstrate our proposal's performance, we assess it on a benchmark dataset
and on real-world data obtained from an autonomous platform. We compare the
results achieved with those of another approach.
Authors' comments: 6 pages, 7 figures
Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, Weiming Hu
Graph convolutional networks (GCNs) have been widely used and achieved
remarkable results in skeleton-based action recognition. In GCNs, graph
topology dominates feature aggregation and therefore is the key to extracting
representative features. In this work, we propose a novel Channel-wise Topology
Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies
and effectively aggregate joint features in different channels for
skeleton-based action recognition. The proposed CTR-GC models channel-wise
topologies through learning a shared topology as a generic prior for all
channels and refining it with channel-specific correlations for each channel.
Our refinement method introduces few extra parameters and significantly reduces
the difficulty of modeling channel-wise topologies. Furthermore, via
reformulating graph convolutions into a unified form, we find that CTR-GC
relaxes strict constraints of graph convolutions, leading to stronger
representation capability. Combining CTR-GC with temporal modeling modules, we
develop a powerful graph convolutional network named CTR-GCN which notably
outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and
NW-UCLA datasets.
Authors' comments: Accepted to ICCV2021. Camera-ready version with supplementary
materials. Code is available at https://github.com/Uason-Chen/CTR-GCN
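The channel-wise refinement described in the abstract above (a shared topology acting as a prior, refined per channel) can be illustrated with a small sketch. This is not the released CTR-GCN code: the function names are hypothetical, each channel holds a single scalar per joint, and using the tanh of pairwise feature differences as the channel-specific correlation is an assumption made for this example.

```python
import math


def refine_topology(shared, channel_features, alpha=1.0):
    """Return one refined N x N topology per channel.

    shared: N x N adjacency shared by all channels (the generic prior).
    channel_features: C lists, each holding one value per joint.
    The tanh of pairwise differences is an assumed correlation function.
    """
    num_joints = len(shared)
    refined = []
    for values in channel_features:
        refined.append([
            [shared[i][j] + alpha * math.tanh(values[i] - values[j])
             for j in range(num_joints)]
            for i in range(num_joints)
        ])
    return refined


def aggregate(topologies, channel_features):
    """Aggregate joint features in each channel with that channel's topology."""
    return [
        [sum(w * v for w, v in zip(row, values)) for row in topology]
        for topology, values in zip(topologies, channel_features)
    ]


shared = [[1.0, 0.0], [0.5, 0.5]]        # toy 2-joint shared topology
features = [[2.0, 4.0], [3.0, 3.0]]      # 2 channels, one value per joint
print(aggregate(refine_topology(shared, features), features))
```

Setting `alpha` to zero recovers a plain graph convolution in which every channel uses the same topology.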
Napat Wanchaitanawong, Masayuki Tanaka, Takashi Shibata, Masatoshi Okutomi
The combined use of multiple modalities enables accurate pedestrian detection
under poor lighting conditions by using the high visibility areas from these
modalities together. The vital assumption for the combined use is that there
is no or only a weak misalignment between the two modalities. In general,
however, this assumption often breaks in actual situations. Due to this
assumption's breakdown, the position of the bounding boxes does not match
between the two modalities, resulting in a significant decrease in detection
accuracy, especially in regions where the amount of misalignment is large. In
this paper, we propose a multi-modal Faster-RCNN that is robust against large
misalignment. The keys are 1) modal-wise regression and 2) multi-modal IoU for
mini-batch sampling. To deal with large misalignment, we perform bounding box
regression for both the RPN and detection-head with both modalities. We also
propose a new sampling strategy called "multi-modal mini-batch sampling" that
integrates the IoU for both modalities. We demonstrate that the proposed
method's performance is much better than that of the state-of-the-art methods
for data with large misalignment through actual image experiments.
Authors' comments: Accepted by MVA2021
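To make the idea of an IoU that takes both modalities into account concrete, here is a minimal sketch. The abstract above does not state how the two IoUs are integrated, so averaging them is purely an assumption, as are the function names and the `(x1, y1, x2, y2)` box format.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def multimodal_iou(proposal_a, proposal_b, gt_a, gt_b):
    """Assumed combination rule: mean of the per-modality IoUs.

    proposal_a / gt_a live in the first modality's image,
    proposal_b / gt_b in the second, so misalignment between
    the two images is handled by separate boxes per modality.
    """
    return 0.5 * (iou(proposal_a, gt_a) + iou(proposal_b, gt_b))


# A proposal that fits the first modality but misses in the second one.
print(multimodal_iou((0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2), (5, 5, 7, 7)))
```

A score of this kind could then be thresholded to pick positive and negative samples for a mini-batch, instead of relying on the IoU in one modality alone.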
Hiroki Ito, MaungMaung AprilPyone, Hitoshi Kiya
Since production-level trained deep neural networks (DNNs) are of great
business value, protecting such DNN models against copyright infringement and
unauthorized access is in rising demand. However, conventional model
protection methods focused only on the image classification task, and these
protection methods were never applied to semantic segmentation although it has
an increasing number of applications. In this paper, we propose to protect
semantic segmentation models from unauthorized access by utilizing block-wise
transformation with a secret key for the first time. Protected models are
trained by using transformed images. Experimental results show that the proposed
protection method allows rightful users with the correct key to access the
model to full capacity and degrades the performance for unauthorized users.
However, protected models show a slight drop in segmentation performance compared
to non-protected models.
Authors' comments: To appear in 2021 International Workshop on Smart Info-Media Systems
in Asia (SISA 2021)
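A block-wise transformation with a secret key, as mentioned in the abstract above, can take several forms. The sketch below assumes one simple variant, keyed pixel shuffling inside each block of a single-channel image, and should not be read as the authors' exact transformation. The function names and the use of Python's `random.Random` as the key-driven generator are illustrative choices.

```python
import random


def keyed_permutation(block_size, key):
    """Derive a fixed permutation of in-block pixel positions from the key."""
    rng = random.Random(key)
    perm = list(range(block_size * block_size))
    rng.shuffle(perm)
    return perm


def blockwise_shuffle(image, block_size, key):
    """Shuffle pixels inside every block with the same keyed permutation.

    image: list of rows (single channel); both sides must be
    multiples of block_size.
    """
    height, width = len(image), len(image[0])
    if height % block_size or width % block_size:
        raise ValueError("image sides must be multiples of block_size")
    perm = keyed_permutation(block_size, key)
    out = [row[:] for row in image]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            # Flatten the block row by row, then reorder it with the key.
            flat = [image[top + r][left + c]
                    for r in range(block_size) for c in range(block_size)]
            for i, src in enumerate(perm):
                out[top + i // block_size][left + i % block_size] = flat[src]
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
protected_input = blockwise_shuffle(image, block_size=2, key=1234)
```

Under this scheme a model would be trained on images transformed with the key, and a rightful user would apply the same keyed transformation to query images before inference.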
Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng
Existing approaches for anti-spoofing in automatic speaker verification (ASV)
still lack generalizability to unseen attacks. The Res2Net approach designs a
residual-like connection between feature groups within one block, which
increases the possible receptive fields and improves the system's detection
generalizability. However, such a residual-like connection is performed by a
direct addition between feature groups without channel-wise priority. We argue
that the information across channels may not contribute to spoofing cues
equally, and the less relevant channels are expected to be suppressed before
adding onto the next feature group, so that the system can generalize better to
unseen attacks. This argument motivates the current work that presents a novel,
channel-wise gated Res2Net (CG-Res2Net), which modifies Res2Net to enable a
channel-wise gating mechanism in the connection between feature groups. This
gating mechanism dynamically selects channel-wise features based on the input,
to suppress the less relevant channels and enhance the detection
generalizability. Three gating mechanisms with different structures are
proposed and integrated into Res2Net. Experimental results conducted on
ASVspoof 2019 logical access (LA) demonstrate that the proposed CG-Res2Net
significantly outperforms Res2Net on both the overall LA evaluation set and
individual difficult unseen attacks, and also outperforms other
state-of-the-art single systems, demonstrating the effectiveness of our method.
Authors' comments: Accepted to INTERSPEECH 2021
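The gated connection between feature groups described in the abstract above can be sketched as follows. This is a simplified illustration, not any of the three gating structures proposed in the paper: the gate here is an assumed sigmoid of a per-channel average with hypothetical scalar weights and biases.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(prev_output, weights, biases):
    """One gate in (0, 1) per channel, computed from the input itself.

    Each channel is average-pooled, then passed through an assumed
    per-channel affine map and a sigmoid.
    """
    gates = []
    for channel, w, b in zip(prev_output, weights, biases):
        pooled = sum(channel) / len(channel)
        gates.append(sigmoid(w * pooled + b))
    return gates


def gated_add(prev_output, next_group, weights, biases):
    """Scale the previous group's output channel-wise before adding it
    onto the next feature group (instead of a direct addition)."""
    gates = channel_gates(prev_output, weights, biases)
    return [
        [g * p + x for p, x in zip(prev_channel, next_channel)]
        for g, prev_channel, next_channel in zip(gates, prev_output, next_group)
    ]


prev_output = [[2.0, 4.0]]   # one channel, two time-frequency positions
next_group = [[1.0, 1.0]]
print(gated_add(prev_output, next_group, weights=[0.0], biases=[0.0]))
```

With a strongly negative bias the gate approaches zero and the channel is effectively suppressed, which is the behaviour the abstract argues for in less relevant channels.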
Laurent Lejeune, Raphael Sznitman
The ability to quickly annotate medical imaging data plays a critical role in training deep learning frameworks for segmentation. Doing so for image volumes or video sequences is even more pressing as annotating these is particularly burdensome. To alleviate this problem, this work proposes a new method to efficiently segment medical imaging volumes or videos using point-wise annotations only. This allows annotations to be collected extremely quickly and remains applicable to numerous segmentation tasks. Our approach trains a deep learning model with an appropriate Positive/Unlabeled objective function on sparse point-wise annotations. While most methods of this kind assume that the proportion of positive samples in the data is known a priori, we introduce a novel self-supervised method to estimate this prior efficiently by combining a Bayesian estimation framework and new stopping criteria. Our method iteratively estimates appropriate class priors and yields high segmentation quality for a variety of object types and imaging modalities. In addition, using a spatio-temporal tracking framework, we regularize our predictions by leveraging the complete data volume. We show experimentally that our approach outperforms state-of-the-art methods tailored to the same problem.
Chong Tang, Wenda Li, Shelly Vishwakarma, Fangzhan Shi, Simon Julier, Kevin Chetty
Micro-Doppler signatures contain considerable information about target dynamics. However, radar sensing systems are easily affected by noisy surroundings, resulting in uninterpretable motion patterns on the micro-Doppler spectrogram. Meanwhile, radar returns often suffer from multipath, clutter and interference. These issues make tasks such as motion feature extraction and activity classification using micro-Doppler signatures ($\mu$-DS) difficult. In this paper, we propose a latent feature-wise mapping strategy, called Feature Mapping Network (FMNet), to transform measured spectrograms so that they more closely resemble the output from a simulation under the same conditions. Based on the measured spectrogram and the matched simulated data, our framework contains three parts: an Encoder, which extracts latent representations/features; a Decoder, which outputs a reconstructed spectrogram from the latent features; and a Discriminator, which minimizes the distance between the latent features of measured and simulated data. We demonstrate FMNet on data from six activities in two experimental scenarios; the final results show strongly enhanced patterns while keeping the actual motion information to the greatest extent. In addition, we propose training a classifier with only simulated data and predicting new measured samples after cleaning them up with FMNet. The final classification results show significant improvements.
Zeyu Wu, Cheng Wang, Weidong Liu
In this paper, we estimate the high dimensional precision matrix under the
weak sparsity condition where many entries are nearly zero. We revisit the
sparse column-wise inverse operator (SCIO) estimator \cite{liu2015fast} and
derive its general error bounds under the weak sparsity condition. A unified
framework is established to deal with various cases including the heavy-tailed
data, the non-paranormal data, and the matrix variate data. These new methods
can achieve the same convergence rates as the existing methods and can be
implemented efficiently.
Authors' comments: 29 pages, 5 figures
Michael C. Cushing, Adam C. Schneider, J. Davy Kirkpatrick, Caroline V. Morley, Mark S. Marley, Christopher R. Gelino, Gregory N. Mace, Edward L. Wright et al.
We present a Hubble Space Telescope/Wide-Field Camera 3 near infrared
spectrum of the archetype Y dwarf WISEP 182831.08+265037.8. The spectrum covers
the 0.9-1.7 um wavelength range at a resolving power of lambda/Delta lambda
~180 and is a significant improvement over the previously published spectrum
because it covers a broader wavelength range and is uncontaminated by light
from a background star. The spectrum is unique for a cool brown dwarf in that
the flux peaks in the Y, J, and H bands are of near equal intensity in units of
f_lambda. We fail to detect any absorption bands of NH_3 in the spectrum, in
contrast to the predictions of chemical equilibrium models, but tentatively
identify CH_4 as the carrier of an unknown absorption feature centered at 1.015
um. Using previously published ground- and space-based photometry, and using a
Rayleigh Jeans tail to account for flux emerging longward of 4.5 um, we compute
a bolometric luminosity of log (L_bol/L_sun)=-6.50+-0.02 which is significantly
lower than previously published results. Finally, we compare the spectrum and
photometry to two sets of atmospheric models and find that the best overall match
to the observed properties of WISEP 182831.08+265037.8 is a ~1 Gyr old binary
composed of two T_eff~325 K, ~5 M_Jup brown dwarfs with subsolar [C/O] ratios.
Authors' comments: Accepted for publication in the Astrophysical Journal
Yuming Zhang, Davide Cucci, Roberto Molinari, Stéphane Guerrier
The increased use of low-cost gyroscopes within inertial sensors for
navigation purposes, among others, has led to a
considerable amount of research on improving their measurement precision. Aside
from developing methods that model and account for the deterministic
and stochastic components that contribute to the measurement errors of these
devices, an approach that has been put forward in recent years is to make use
of arrays of such sensors in order to combine their measurements, thereby
reducing the impact of individual sensor noise. Nevertheless, combining these
measurements is not straightforward given the complex stochastic nature of
these errors and, although some solutions have been suggested, these are
limited to certain specific settings and do not extend to
more general and common circumstances. Hence, in this work we put forward a
non-parametric method that makes use of the wavelet cross-covariance at
different scales to combine the measurements coming from an array of gyroscopes
in order to deliver an optimal measurement signal without needing any
assumption on the processes underlying the individual error signals. We also
study an appropriate non-parametric approach for the estimation of the
asymptotic covariance matrix of the wavelet cross-covariance estimator which
has important applications beyond the scope of this work. The theoretical
properties of the proposed approach are studied and are supported by
simulations and real applications, indicating that this method represents an
appropriate and general tool for the construction of optimal virtual signals
that are particularly relevant for arrays of gyroscopes. Moreover, our results
can support the creation of optimal signals for inertial sensors
other than gyroscopes as well as for redundant measurements in domains
other than navigation.
Authors' comments: 18 pages, 10 figures
Meiling Fang, Naser Damer, Fadi Boutros, Florian Kirchbuchner, Arjan Kuijper
Iris presentation attack detection (PAD) plays a vital role in iris
recognition systems. Most existing CNN-based iris PAD solutions 1) perform only
binary label supervision during the training of CNNs, serving global
information learning but weakening the capture of local discriminative
features, 2) prefer the stacked deeper convolutions or expert-designed
networks, raising the risk of overfitting, 3) fuse multiple PAD systems or
various types of features, increasing difficulty for deployment on mobile
devices. Hence, we propose a novel attention-based deep pixel-wise binary
supervision (A-PBS) method. Pixel-wise supervision is first able to capture the
fine-grained pixel/patch-level cues. Then, the attention mechanism guides the
network to automatically find regions that most contribute to an accurate PAD
decision. Extensive experiments are performed on LivDet-Iris 2017 and three
other publicly available databases to show the effectiveness and robustness of
the proposed A-PBS method. For instance, the A-PBS model achieves an HTER of 6.50%
on the IIITD-WVU database, outperforming state-of-the-art methods.
Authors' comments: To appear at the 2021 International Joint Conference on Biometrics
(IJCB 2021)
Xianjing Liu, Bo Li, Esther Bron, Wiro Niessen, Eppo Wolvius, Gennady Roshchupkin
Confounding bias is a crucial problem when applying machine learning to
practice, especially in clinical practice. We consider the problem of learning
representations independent of multiple biases. In the literature, this is mostly
solved by purging the bias information from learned representations. We, however,
expect this strategy to harm the diversity of information in the
representation, thus limiting its prospective usage (e.g., interpretation).
Therefore, we propose to mitigate the bias while keeping almost all information
in the latent representations, which enables us to observe and interpret them
as well. To achieve this, we project latent features onto a learned vector
direction, and enforce the independence between biases and projected features
rather than all learned features. To interpret the mapping between projected
features and input data, we propose projection-wise disentangling: a sampling
and reconstruction along the learned vector direction. The proposed method was
evaluated on the analysis of 3D facial shape and patient characteristics
(N=5011). Experiments showed that this conceptually simple method achieved
state-of-the-art fair prediction performance and interpretability, showing its
great potential for clinical applications.
Authors' comments: Accepted at MICCAI 2021
Yan Liu, Zheng Li, Lin Li, Qingyang Hong
This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV). In the proposed structure, the frame-level multi-task learning along with the segment-level adversarial learning is adopted for speaker embedding extraction. The phoneme-aware attentive pooling is exploited on frame-level features in the main network for the speaker classifier, with the corresponding posterior probability for the phoneme distribution in the auxiliary subnet. Further, the introduction of Squeeze and Excitation (SE-block) performs dynamic channel-wise feature recalibration, which improves the representational ability. The proposed method exploits speaker idiosyncrasies associated with pass-phrases, and is further improved by the phoneme-aware attentive pooling and SE-block from temporal and channel-wise aspects, respectively. The experiments conducted on the RSR2015 Part 1 database confirm that the proposed system achieves outstanding results for text-dependent SV.
Yuki Endo, Yoshihiro Kanamori
Semantic image synthesis is a process for generating photorealistic images
from a single semantic mask. To enrich the diversity of multimodal image
synthesis, previous methods have controlled the global appearance of an output
image by learning a single latent space. However, a single latent code is often
insufficient for capturing various object styles because object appearance
depends on multiple factors. To handle individual factors that determine object
styles, we propose a class- and layer-wise extension to the variational
autoencoder (VAE) framework that allows flexible control over each object class
at the local to global levels by learning multiple latent spaces. Furthermore,
we demonstrate that our method generates images that are both plausible and
more diverse compared to state-of-the-art methods via extensive experiments
with real and synthetic datasets in three different domains. We also show that
our method enables a wide range of applications in image synthesis and editing
tasks.
Authors' comments: Accepted to Pacific Graphics 2020, codes available at
https://github.com/endo-yuki-t/DiversifyingSMIS
Wanqing Xie, Lizhong Liang, Yao Lu, Chen Wang, Jihong Shen, Hui Luo, Xiaofeng Liu
The Self-Rating Depression Scale (SDS) questionnaire has frequently been used for
efficient preliminary depression screening. However, this uncontrollable
self-administered measure can be easily affected by insouciant or deceptive
answering, producing results that differ from the clinician-administered
Hamilton Depression Rating Scale (HDRS) and the final diagnosis. Clinically,
facial expression (FE) and actions play a vital role in clinician-administered
evaluation, while FE and action are underexplored for self-administered
evaluations. In this work, we collect a novel dataset of 200 subjects to
evidence the validity of self-rating questionnaires with their corresponding
question-wise video recording. To automatically interpret depression from the
SDS evaluation and the paired video, we propose an end-to-end hierarchical
framework for the long-term variable-length video, which is also conditioned on
the questionnaire results and the answering time. Specifically, we resort to a
hierarchical model which utilizes a 3D CNN for local temporal pattern
exploration and a redundancy-aware self-attention (RAS) scheme for
question-wise global feature aggregation. Targeting redundant long-term
FE video processing, our RAS is able to effectively exploit the correlations of
each video clip within a question set to emphasize the discriminative
information and eliminate the redundancy based on feature pair-wise affinity.
Then, the question-wise video feature is concatenated with the questionnaire
scores for final depression detection. Our thorough evaluations also show the
validity of fusing SDS evaluation and its video recording, and the superiority
of our framework to the conventional state-of-the-art temporal modeling
methods.
Authors' comments: Published in IEEE Journal of Biomedical and Health Informatics
Shubo Lv, Yanxin Hu, Shimin Zhang, Lei Xie
Deep complex convolution recurrent network (DCCRN), which extends CRN with complex structure, has achieved superior performance in MOS evaluation in the Interspeech 2020 deep noise suppression challenge (DNS2020). This paper further extends DCCRN with the following significant revisions. We first extend the model to sub-band processing where the bands are split and merged by learnable neural network filters instead of engineered FIR filters, leading to a faster noise suppressor trained in an end-to-end manner. Then the LSTM is further substituted with a complex TF-LSTM to better model temporal dependencies along both time and frequency axes. Moreover, instead of simply concatenating the output of each encoder layer to the input of the corresponding decoder layer, we use convolution blocks to first aggregate essential information from the encoder output before feeding it to the decoder layers. We specifically formulate the decoder with an extra a priori SNR estimation module to maintain good speech quality while removing noise. Finally, a post-processing module is adopted to further suppress the unnatural residual noise. The new model, named DCCRN+, has surpassed the original DCCRN as well as several competitive models in terms of PESQ and DNSMOS, and has achieved superior performance in the new Interspeech 2021 DNS challenge.
Alberto Boffi
This paper describes the most efficient way to manage operations on ranges of
elements within an ordered set. The goal is to improve existing solutions, by
optimizing the average-case time complexity and getting rid of heavy
multiplicative constants in the worst-case, without sacrificing space
complexity. Such range operations have a high impact in practical applications.
The improvement is achieved by introducing a new data structure called Wise
Red-Black Tree, an augmented version of the Red-Black Tree.
Authors' comments: Added references to order-statistic trees. Corrected some terms and
form. Results unchanged
Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, Jingdong Wang
Vision Transformer (ViT) attains state-of-the-art performance in visual
recognition, and the variant, Local Vision Transformer, makes further
improvements. The major component in Local Vision Transformer, local attention,
performs the attention separately over small local windows. We rephrase local
attention as a channel-wise locally-connected layer and analyze it in terms of two
network regularization forms, sparse connectivity and weight sharing, as well
as weight computation. Sparse connectivity: there is no connection across
channels, and each position is connected to the positions within a small local
window. Weight sharing: the connection weights for one position are shared
across channels or within each group of channels. Dynamic weight: the
connection weights are dynamically predicted according to each image instance.
We point out that local attention resembles depth-wise convolution and its
dynamic version in sparse connectivity. The main difference lies in weight
sharing - depth-wise convolution shares connection weights (kernel weights)
across spatial positions. We empirically observe that the models based on
depth-wise convolution and the dynamic variant with lower computation
complexity perform on-par with or sometimes slightly better than Swin
Transformer, an instance of Local Vision Transformer, for ImageNet
classification, COCO object detection and ADE semantic segmentation. These
observations suggest that Local Vision Transformer takes advantage of two
regularization forms and dynamic weight to increase the network capacity. Code
is available at https://github.com/Atten4Vis/DemystifyLocalViT.
Authors' comments: ICLR 2022 Spotlight
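The contrast drawn in the abstract above between depth-wise convolution and local attention can be seen in a small one-dimensional sketch. It is a deliberately simplified illustration, not the code from the linked repository: the attention variant assumes a single head, uses the raw features as queries, keys, and values without learned projections, and all function names are hypothetical.

```python
import math


def depthwise_conv1d(x, kernels):
    """Static weights: one kernel per channel, shared across positions.

    x: C channels of length L; kernels: C odd-length kernels; zero padding.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        half = len(kernel) // 2
        row = []
        for pos in range(len(channel)):
            total = 0.0
            for k, weight in enumerate(kernel):
                src = pos + k - half
                if 0 <= src < len(channel):
                    total += weight * channel[src]
            row.append(total)
        out.append(row)
    return out


def local_attention1d(x, window):
    """Dynamic weights: predicted per position from the input, shared across channels."""
    num_channels, length = len(x), len(x[0])
    half = window // 2
    out = [[0.0] * length for _ in range(num_channels)]
    for pos in range(length):
        neighbours = [j for j in range(pos - half, pos + half + 1) if 0 <= j < length]
        # Dot-product similarity between this position and each neighbour.
        logits = [sum(x[c][pos] * x[c][j] for c in range(num_channels))
                  for j in neighbours]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        norm = sum(exps)
        for c in range(num_channels):
            out[c][pos] = sum(e / norm * x[c][j] for e, j in zip(exps, neighbours))
    return out


signal = [[1.0, 2.0, 3.0]]
print(depthwise_conv1d(signal, [[1.0, 1.0, 1.0]]))
print(local_attention1d(signal, window=3))
```

Both functions connect each position only to a small window and never mix channels (sparse connectivity); they differ in where the weights are shared and in whether the weights depend on the input.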