Zifan Shi, Na Fan, Dit-Yan Yeung, Qifeng Chen
Existing vision systems for autonomous driving or robots are sensitive to
waterdrops adhering to windows or camera lenses. Most recent waterdrop removal
approaches take a single image as input and often fail to recover the missing
content behind waterdrops faithfully. Thus, we propose a learning-based model
for waterdrop removal with stereo images. To better detect and remove
waterdrops from stereo images, we propose a novel row-wise dilated attention
module to enlarge attention's receptive field for effective information
propagation between the two stereo images. In addition, we propose an attention
consistency loss between the ground-truth disparity map and attention scores to
enhance the left-right consistency in stereo images. Because no related
datasets are available, we collect a real-world dataset that contains stereo
images with and without waterdrops. Extensive experiments on our dataset
suggest that our model outperforms state-of-the-art methods both quantitatively
and qualitatively. Our source code and the stereo waterdrop dataset are
available at
https://github.com/VivianSZF/Stereo-Waterdrop-Removal.
Authors' comments: IROS 2021
Noga Alon
Let $r \geq 2$, $n$ and $k$ be integers satisfying $k \leq \frac{r-1}{r}n$. In the original arXiv version of this note we suggested a conjecture that the family of all $k$-subsets of an $n$-set cannot be partitioned into fewer than $\lceil n-\frac{r}{r-1}(k-1) \rceil$ $r$-wise intersecting families. We noted that if true this is tight for all values of the parameters, that the case $r=2$ is Kneser's conjecture, proved by Lov\'asz, and observed that the assertion also holds provided $r$ is either a prime number or a power of $2$. We have recently learned, however, that the assertion of the conjecture for all values of the parameters follows from a recent result of Azarpendar and Jafari \cite{AJ}.
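As a quick sanity check on the bound, the two instances below evaluate $\lceil n-\frac{r}{r-1}(k-1) \rceil$ by hand. The specific values of $n$ and $k$ are chosen here purely for illustration and do not come from the note.

```latex
% Case r = 2: the bound reduces to the Kneser-Lovasz value.
\[
  r = 2:\qquad \left\lceil n - \tfrac{2}{1}(k-1) \right\rceil = n - 2k + 2 ,
  \qquad\text{e.g. } n = 5,\ k = 2 \ \Rightarrow\ 5 - 4 + 2 = 3 .
\]
% Case r = 3 with n = 6, k = 4 (allowed, since k <= (2/3) n = 4).
\[
  r = 3,\ n = 6,\ k = 4:\qquad
  \left\lceil 6 - \tfrac{3}{2}\cdot 3 \right\rceil
  = \left\lceil 1.5 \right\rceil = 2 .
\]
```

The second value is consistent with a direct check: the $4$-subsets $\{1,2,3,4\}$, $\{1,2,5,6\}$, $\{3,4,5,6\}$ of $\{1,\dots,6\}$ have empty common intersection, so the family of all $4$-subsets is not itself $3$-wise intersecting and at least two families are needed.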
Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, Guoqi Li
How to effectively and efficiently deal with spatio-temporal event streams,
where the events are generally sparse and non-uniform and have the microsecond
temporal resolution, is of great value and has various real-life applications.
Spiking neural network (SNN), as one of the brain-inspired event-triggered
computing models, has the potential to extract effective spatio-temporal
features from the event streams. However, when aggregating individual events
into frames with a new higher temporal resolution, existing SNN models do not
account for the fact that the serial frames have different signal-to-noise
ratios, since event streams are sparse and non-uniform. This situation
interferes with the performance of existing SNNs. In this work, we propose a
temporal-wise attention SNN (TA-SNN) model to learn frame-based representation
for processing event streams. Concretely, we extend the attention concept to
temporal-wise input to judge the significance of frames for the final decision
at the training stage, and discard the irrelevant frames at the inference
stage. We demonstrate that TA-SNN models improve the accuracy of event stream
classification tasks. We also study the impact of multiple-scale temporal
resolutions for frame-based representation. Our approach is tested on three
different classification tasks: gesture recognition, image classification, and
spoken digit recognition. We report state-of-the-art results on these
tasks, and obtain a substantial accuracy improvement (almost 19\%) for gesture
recognition with only 60 ms.
Authors' comments: Accepted by ICCV 2021
Lucas Liebenwein, Alaa Maalouf, Oren Gal, Dan Feldman, Daniela Rus
We present a novel global compression framework for deep neural networks that
automatically analyzes each layer to identify the optimal per-layer compression
ratio, while simultaneously achieving the desired overall compression. Our
algorithm hinges on the idea of compressing each convolutional (or
fully-connected) layer by slicing its channels into multiple groups and
decomposing each group via low-rank decomposition. At the core of our algorithm
is the derivation of layer-wise error bounds from the Eckart-Young-Mirsky
theorem. We then leverage these bounds to frame the compression problem as an
optimization problem where we wish to minimize the maximum compression error
across layers and propose an efficient algorithm towards a solution. Our
experiments indicate that our method outperforms existing low-rank compression
approaches across a wide range of networks and data sets. We believe that our
results open up new avenues for future research into the global
performance-size trade-offs of modern neural networks. Our code is available at
https://github.com/lucaslie/torchprune.
Authors' comments: NeurIPS 2021
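The role of the Eckart-Young-Mirsky theorem in the abstract above can be illustrated with a minimal sketch: a single weight matrix is factored by truncated SVD, and the approximation error is read off the discarded singular values. This is not the authors' algorithm (it omits channel slicing and the per-layer ratio search); the function names, matrix shape, and rank are hypothetical choices made for the example.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Return (A, B) such that A @ B is the best rank-`rank` approximation."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_features, rank)
    b = vt[:rank, :]             # shape (rank, in_features)
    return a, b


def eym_errors(weight, rank):
    """Errors of the best rank-`rank` approximation (Eckart-Young-Mirsky).

    Spectral norm error: the first discarded singular value.
    Frobenius norm error: root of the sum of squared discarded singular values.
    """
    s = np.linalg.svd(weight, compute_uv=False)
    spectral = float(s[rank]) if rank < len(s) else 0.0
    frobenius = float(np.sqrt(np.sum(s[rank:] ** 2)))
    return spectral, frobenius


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # hypothetical layer weight
A, B = low_rank_factors(W, rank=8)
spec, frob = eym_errors(W, rank=8)
residual = W - A @ B

print("parameters:", W.size, "->", A.size + B.size)
print("spectral error:", np.linalg.norm(residual, 2), "predicted:", spec)
print("Frobenius error:", np.linalg.norm(residual, "fro"), "predicted:", frob)
```

Because the error is known from the singular values alone, candidate ranks can be compared across layers without actually building the compressed layers, which is the kind of per-layer comparison a global compression scheme needs.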
Mingjie He, Jie Zhang, Shiguang Shan, Xiao Liu, Zhongqin Wu, Xilin Chen
Face recognition remains a challenging task in unconstrained scenarios, especially when faces are partially occluded. To improve robustness against occlusion, augmenting the training images with artificial occlusions has proved to be a useful approach. However, these artificial occlusions are commonly generated by adding a black rectangle or several object templates, including sunglasses, scarves, and phones, which cannot simulate realistic occlusions well. In this paper, based on the argument that occlusion essentially damages a group of neurons, we propose a novel and elegant occlusion-simulation method that drops the activations of a group of neurons in some elaborately selected channel. Specifically, we first employ a spatial regularization to encourage each feature channel to respond to local and different face regions. In this way, the activations affected by an occlusion in a local region are more likely to be located in a single feature channel. Then, the locality-aware channel-wise dropout (LCD) is designed to simulate the occlusion by dropping out the entire feature channel. Furthermore, by randomly dropping out several feature channels, our method can simulate the occlusion of a larger area well. The proposed LCD can encourage its succeeding layers to minimize the intra-class feature variance caused by occlusions, thus leading to improved robustness against occlusion. In addition, we design an auxiliary spatial attention module by learning a channel-wise attention vector to reweight the feature channels, which improves the contributions of non-occluded regions. Extensive experiments on various benchmarks show that the proposed method outperforms state-of-the-art methods with a remarkable improvement.
Yu-Yen Chang, Bau-Ching Hsieh, Wei-Hao Wang, Yen-Ting Lin, Chen-Fatt Lim, Yoshiki Toba, Yuxing Zhong, Siou-Yu Chang
We use machine learning techniques to investigate their performance in
classifying active galactic nuclei (AGNs), including X-ray selected AGNs
(XAGNs), infrared selected AGNs (IRAGNs), and radio selected AGNs (RAGNs).
Using known physical parameters in the Cosmic Evolution Survey (COSMOS) field,
we are able to construct well-established training samples in the region of the
Hyper Suprime-Cam (HSC) survey. We compare several Python packages (e.g.,
scikit-learn, Keras, and XGBoost), and use XGBoost to identify AGNs and show
the performance (e.g., accuracy, precision, recall, F1 score, and AUROC). Our
results indicate that the performance is high for bright XAGN and IRAGN host
galaxies. The combination of the HSC (optical) information with the Wide-field
Infrared Survey Explorer (WISE) band-1 and WISE band-2 (near-infrared)
information performs well in identifying AGN hosts. For both type-1 (broad-line)
XAGNs and type-1 (unobscured) IRAGNs, the performance is very good when using
optical to infrared information. These results can be applied to the five-band data
from the wide regions of the HSC survey, and future all-sky surveys.
Authors' comments: accepted for publication in ApJ
Yuanyi Zhong, Yuan Zhou, Jian Peng
The control variates (CV) method is widely used in policy gradient estimation
to reduce the variance of the gradient estimators in practice. A control
variate is applied by subtracting a baseline function from the state-action
value estimates. Then the variance-reduced policy gradient presumably leads to
higher learning efficiency. Recent research on control variates with deep
neural net policies mainly focuses on scalar-valued baseline functions. The
effect of vector-valued baselines is under-explored. This paper investigates
variance reduction with coordinate-wise and layer-wise control variates
constructed from vector-valued baselines for neural net policies. We present
experimental evidence suggesting that lower variance can be obtained with such
baselines than with the conventional scalar-valued baseline. We demonstrate how
to equip the popular Proximal Policy Optimization (PPO) algorithm with these
new control variates. We show that the resulting algorithm with proper
regularization can achieve higher sample efficiency than scalar control
variates in continuous control benchmarks.
Authors' comments: 14 pages, 3 figures, added references compared to v1
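The difference between a scalar baseline and a coordinate-wise (vector-valued) baseline can be seen in a toy setting. The sketch below is not the paper's estimator or its PPO variant: it assumes a two-dimensional unit-variance Gaussian policy with mean zero, a made-up reward function, and the standard per-coordinate baseline that minimizes the second moment of each gradient coordinate. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy policy: a ~ N(theta, I) with theta = 0, so grad_theta log pi(a) = a - theta = a.
actions = rng.standard_normal((n, 2))
scores = actions

# Hypothetical reward, chosen so that the two coordinates prefer different baselines.
rewards = 9.0 + 2.0 * actions[:, 0] - actions[:, 0] ** 2 + 4.0 * actions[:, 1] ** 2


def grad_samples(baseline):
    """Per-sample score-function gradients; `baseline` is a scalar or shape (2,)."""
    return scores * (rewards[:, None] - baseline)


# Conventional scalar baseline: the mean reward.
scalar_b = rewards.mean()
# Coordinate-wise baseline: b_i = E[s_i^2 r] / E[s_i^2], one value per coordinate.
coord_b = (scores ** 2 * rewards[:, None]).mean(axis=0) / (scores ** 2).mean(axis=0)

var_none = grad_samples(0.0).var(axis=0)
var_scalar = grad_samples(scalar_b).var(axis=0)
var_coord = grad_samples(coord_b).var(axis=0)
grad_estimate = grad_samples(coord_b).mean(axis=0)

print("total variance, no baseline:    ", var_none.sum())
print("total variance, scalar baseline:", var_scalar.sum())
print("total variance, coord baseline: ", var_coord.sum())
print("gradient estimate:", grad_estimate)
```

For this reward the exact gradient is $(2, 0)$, and all three estimators target it; only their variance differs. The baselines here are estimated from the same samples for brevity, which introduces a small bias that a careful implementation would avoid.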
Yiqun Lin, Lichang Chen, Haibin Huang, Chongyang Ma, Xiaoguang Han, Shuguang Cui
Sampling, grouping, and aggregation are three important components in the
multi-scale analysis of point clouds. In this paper, we present a novel
data-driven sampler learning strategy for point-wise analysis tasks. Unlike the
widely used sampling technique, Farthest Point Sampling (FPS), we propose to
learn sampling and downstream applications jointly. Our key insight is that
uniform sampling methods like FPS are not always optimal for different tasks:
sampling more points around boundary areas can make the point-wise
classification easier for segmentation. Towards this end, we propose a novel
sampler learning strategy that learns sampling point displacement supervised by
task-related ground truth information and can be trained jointly with the
underlying tasks. We further demonstrate our methods in various point-wise
analysis tasks, including semantic part segmentation, point cloud completion,
and keypoint detection. Our experiments show that joint learning of the
sampler and task brings better performance than using FPS in various
point-based networks.
Authors' comments: 14 pages, 13 figures and 14 tables
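For reference, the Farthest Point Sampling baseline that the abstract above compares against can be sketched in a few lines. This is the usual greedy formulation, not the learned sampler proposed in the paper; the function name and the `start_index` parameter are illustrative choices.

```python
import numpy as np


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    points: array of shape (N, D). Returns indices of shape (num_samples,).
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 0 < num_samples <= n:
        raise ValueError("num_samples must be in 1..N")

    selected = np.empty(num_samples, dtype=int)
    selected[0] = start_index
    # Squared distance from every point to its nearest selected point so far.
    min_dist = np.full(n, np.inf)

    for i in range(1, num_samples):
        last = points[selected[i - 1]]
        dist = np.sum((points - last) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        selected[i] = int(np.argmax(min_dist))
    return selected


# Five collinear points: the sampler spreads out along the line.
line = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
picked = farthest_point_sampling(line, 3)
print(picked)
```

FPS depends only on geometry, which is why it yields roughly uniform coverage regardless of the task; the abstract's argument is that a task such as segmentation may benefit from denser samples near boundaries instead.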
Francisco Eiras, Motasem Alfarra, M. Pawan Kumar, Philip H. S. Torr, Puneet K. Dokania, Bernard Ghanem, Adel Bibi
Randomized smoothing has recently emerged as an effective tool that enables
certification of deep neural network classifiers at scale. All prior art on
randomized smoothing has focused on isotropic $\ell_p$ certification, which has
the advantage of yielding certificates that can be easily compared among
isotropic methods via $\ell_p$-norm radius. However, isotropic certification
limits the region that can be certified around an input to worst-case
adversaries, i.e., it cannot reason about other "close", potentially large,
constant prediction safe regions. To alleviate this issue, (i) we theoretically
extend the isotropic randomized smoothing $\ell_1$ and $\ell_2$ certificates to
their generalized anisotropic counterparts following a simplified analysis.
Moreover, (ii) we propose evaluation metrics allowing for the comparison of
general certificates - a certificate is superior to another if it certifies a
superset region - with the quantification of each certificate through the
volume of the certified region. We introduce ANCER, a framework for obtaining
anisotropic certificates for a given test set sample via volume maximization.
We achieve it by generalizing memory-based certification of data-dependent
classifiers. Our empirical results demonstrate that ANCER achieves
state-of-the-art $\ell_1$ and $\ell_2$ certified accuracy on CIFAR-10 and
ImageNet in the data-dependence setting, while certifying larger regions in
terms of volume, highlighting the benefits of moving away from isotropic
analysis. Our code is available at https://github.com/MotasemAlfarra/ANCER.
Authors' comments: First two authors and the last one contributed equally to this work
Ankita Pasad, Ju-Chieh Chou, Karen Livescu
Recently proposed self-supervised learning approaches have been successful
for pre-training speech representation models. The utility of these learned
representations has been observed empirically, but not much has been studied
about the type or extent of information encoded in the pre-trained
representations themselves. Developing such insights can help understand the
capabilities and limits of these models and enable the research community to
more efficiently develop their usage for downstream applications. In this work,
we begin to fill this gap by examining one recent and successful pre-trained
model (wav2vec 2.0), via its intermediate representation vectors, using a suite
of analysis tools. We use the metrics of canonical correlation, mutual
information, and performance on simple downstream tasks with non-parametric
probes, in order to (i) query for acoustic and linguistic information content,
(ii) characterize the evolution of information across model layers, and (iii)
understand how fine-tuning the model for automatic speech recognition (ASR)
affects these observations. Our findings motivate modifying the fine-tuning
protocol for ASR, which produces improved word error rates in a low-resource
setting.
Authors' comments: Accepted to ASRU 2021. Code:
https://github.com/ankitapasad/layerwise-analysis
Chen Dun, Cameron R. Wolfe, Christopher M. Jermaine, Anastasios Kyrillidis
We propose ResIST, a novel distributed training protocol for Residual
Networks (ResNets). ResIST randomly decomposes a global ResNet into several
shallow sub-ResNets that are trained independently in a distributed manner for
several local iterations, before having their updates synchronized and
aggregated into the global model. In the next round, new sub-ResNets are
randomly generated and the process repeats until convergence. By construction,
per iteration, ResIST communicates only a small portion of network parameters
to each machine and never uses the full model during training. Thus, ResIST
reduces the per-iteration communication, memory, and time requirements of
ResNet training to only a fraction of the requirements of full-model training.
In comparison to common protocols, like data-parallel training and
data-parallel training with local SGD, ResIST yields a decrease in
communication and compute requirements, while being competitive with respect to
model performance.
Authors' comments: 26 pages, 8 figures, pre-print under review
Huajun Liu, Fuqiang Liu, Xinyi Fan, Dong Huang
Pixel-wise regression is probably the most common problem in fine-grained computer vision tasks, such as estimating keypoint heatmaps and segmentation masks. These regression problems are very challenging, particularly because they require, at low computation overheads, modeling long-range dependencies on high-resolution inputs/outputs to estimate the highly nonlinear pixel-wise semantics. While attention mechanisms in Deep Convolutional Neural Networks (DCNNs) have become popular for boosting long-range dependencies, element-specific attention, such as Nonlocal blocks, is highly complex and noise-sensitive to learn, and most simplified attention hybrids try to reach the best compromise among multiple types of tasks. In this paper, we present the Polarized Self-Attention (PSA) block that incorporates two critical designs towards high-quality pixel-wise regression: (1) Polarized filtering: keeping high internal resolution in both channel and spatial attention computation while completely collapsing input tensors along their counterpart dimensions. (2) Enhancement: composing non-linearity that directly fits the output distribution of typical fine-grained regression, such as the 2D Gaussian distribution (keypoint heatmaps), or the 2D Binomial distribution (binary segmentation masks). PSA appears to have exhausted the representation capacity within its channel-only and spatial-only branches, such that there are only marginal metric differences between its sequential and parallel layouts. Experimental results show that PSA boosts standard baselines by $2-4$ points, and boosts state-of-the-art methods by $1-2$ points on 2D pose estimation and semantic segmentation benchmarks.
Cédric Rommel, Thomas Moreau, Joseph Paillard, Alexandre Gramfort
Data augmentation is a key element of deep learning pipelines, as it informs the network during training about transformations of the input data that keep the label unchanged. Manually finding adequate augmentation methods and parameters for a given pipeline, however, quickly becomes cumbersome. In particular, while intuition can guide this decision for images, the design and choice of augmentation policies remain unclear for more complex types of data, such as neuroscience signals. Moreover, class-dependent augmentation strategies have been surprisingly unexplored in the literature, although they are quite intuitive: changing the color of a car image does not change the object class to be predicted, but doing the same to the picture of an orange does. This paper investigates gradient-based automatic data augmentation algorithms amenable to class-wise policies with exponentially larger search spaces. Motivated by supervised learning applications using EEG signals for which good augmentation policies are mostly unknown, we propose a new differentiable relaxation of the problem. In the class-agnostic setting, results show that our new relaxation leads to optimal performance with faster training than competing gradient-based methods, while also outperforming gradient-free methods in the class-wise setting. This work also proposes novel differentiable augmentation operations relevant for sleep stage classification.
J. Davy Kirkpatrick, Federico Marocco, Dan Caselden, Aaron M. Meisner, Jacqueline K. Faherty, Adam C. Schneider, Marc J. Kuchner, S. L. Casewell et al.
Continued follow-up of WISEA J153429.75-104303.3, announced in Meisner et al.
(2020), has proven it to have an unusual set of properties. New imaging data
from Keck/MOSFIRE and HST/WFC3 show that this object is one of the few faint
proper motion sources known with J-ch2 > 8 mag, indicating a very cold
temperature consistent with the latest known Y dwarfs. Despite this, it has
W1-W2 and ch1-ch2 colors ~1.6 mag bluer than a typical Y dwarf. A new
trigonometric parallax measurement from a combination of WISE, Spitzer, and HST
astrometry confirms a nearby distance of $16.3^{+1.4}_{-1.2}$ pc and a large
transverse velocity of $207.4{\pm}15.9$ km/s. The absolute J, W2, and ch2
magnitudes are in line with the coldest known Y dwarfs, despite the highly
discrepant W1-W2 and ch1-ch2 colors. We explore possible reasons for the unique
traits of this object and conclude that it is most likely an old, metal-poor
brown dwarf and possibly the first Y subdwarf. Given that the object has an HST
F110W magnitude of 24.7 mag, broad-band spectroscopy and photometry from JWST
are the best options for testing this hypothesis.
Authors' comments: 8 pages, 4 figures, accepted for publication in The Astrophysical
Journal Letters
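The link between the quoted distance and transverse velocity can be made concrete with the standard relation $v_t = 4.74\,\mu\,d$ (with $v_t$ in km/s, $\mu$ in arcsec/yr, and $d$ in pc). The abstract does not quote a proper motion, so the value computed below is only what the two quoted central values jointly imply under that relation, not a measurement taken from the paper. Uncertainties are ignored and the function names are illustrative.

```python
# Conversion factor: 1 au/yr expressed in km/s.
KM_S_PER_ARCSEC_YR_PC = 4.74047


def transverse_velocity(mu_arcsec_yr, distance_pc):
    """Transverse velocity in km/s from proper motion and distance."""
    return KM_S_PER_ARCSEC_YR_PC * mu_arcsec_yr * distance_pc


def implied_proper_motion(v_tan_km_s, distance_pc):
    """Proper motion in arcsec/yr implied by a velocity and a distance."""
    return v_tan_km_s / (KM_S_PER_ARCSEC_YR_PC * distance_pc)


distance_pc = 16.3    # central value quoted in the abstract
v_tan_km_s = 207.4    # central value quoted in the abstract

mu = implied_proper_motion(v_tan_km_s, distance_pc)
print(f"implied total proper motion: {mu:.2f} arcsec/yr")
```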
An-phi Nguyen, Maria Rodriguez Martinez
Interpretability has become a necessary feature for machine learning models deployed in critical scenarios, e.g., the legal system and healthcare. In these situations, algorithmic decisions may have (potentially negative) long-lasting effects on the end-user affected by the decision. In many cases, the representational power of deep learning models is not needed, and simple, interpretable models (e.g., linear models) should therefore be preferred. However, in high-dimensional and/or complex domains (e.g., computer vision), the universal approximation capabilities of neural networks are required. Inspired by linear models and the Kolmogorov-Arnold representation theorem, we propose a novel class of structurally-constrained neural networks, which we call FLANs (Feature-wise Latent Additive Networks). Crucially, FLANs process each input feature separately, computing for each of them a representation in a common latent space. These feature-wise latent representations are then simply summed, and the aggregated representation is used for prediction. These constraints (which are at the core of the interpretability of linear models) allow a user to estimate the effect of each individual feature independently from the others, enhancing interpretability. In a set of experiments across different domains, we show how, without excessively compromising test performance, the structural constraints proposed in FLANs indeed facilitate the interpretability of deep learning models. We quantitatively compare FLANs' interpretability to post-hoc methods using recently introduced metrics, discussing the advantages of natively interpretable models over a post-hoc analysis.
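The structural constraint described above (one sub-network per feature, a sum in a shared latent space, then a prediction head) can be sketched as a forward pass. This is a toy illustration with random, untrained weights, not the authors' architecture; the layer sizes, activation, and names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, latent_dim, hidden = 4, 8, 16

# One small (hypothetical) network per input feature: scalar -> latent vector.
feature_nets = [
    (rng.standard_normal((1, hidden)), rng.standard_normal((hidden, latent_dim)))
    for _ in range(num_features)
]
head = rng.standard_normal(latent_dim)  # shared prediction head


def feature_latent(i, value):
    """Latent representation of feature i, computed from that feature alone."""
    w1, w2 = feature_nets[i]
    hidden_act = np.tanh(np.array([[value]]) @ w1)
    return (hidden_act @ w2)[0]  # shape (latent_dim,)


def predict(x):
    """Sum the feature-wise latents, then apply the prediction head."""
    latents = [feature_latent(i, v) for i, v in enumerate(x)]
    aggregated = np.sum(latents, axis=0)
    return float(np.tanh(aggregated) @ head), latents


y_a, latents_a = predict([0.5, -1.0, 2.0, 0.0])
y_b, latents_b = predict([0.5, -1.0, 3.0, 0.0])  # only feature 2 changed
print(y_a, y_b)
```

Because each latent depends on a single feature, changing one input alters exactly one term of the sum, which is what lets a user attribute a change in the aggregated representation to an individual feature.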
Yifan Wu, Min Zeng, Ying Yu, Min Li
Automatic International Classification of Diseases (ICD) coding is defined as a multi-label text classification problem, which is difficult because the number of labels is very large and the label distribution is unbalanced. The label-wise attention mechanism is widely used in automatic ICD coding because it can assign weights to every word in full Electronic Medical Records (EMR) for different ICD codes. However, the label-wise attention mechanism is computationally redundant and costly. In this paper, we propose a pseudo label-wise attention mechanism to tackle the problem. Instead of computing different attention modes for different ICD codes, the pseudo label-wise attention mechanism automatically merges similar ICD codes and computes only one attention mode for the similar ICD codes, which greatly compresses the number of attention modes and improves the prediction accuracy. In addition, we apply a more convenient and effective way to obtain the ICD vectors, and thus our model can predict new ICD codes by calculating the similarities between EMR vectors and ICD vectors. Extensive experiments show the superior performance of our model. On the public MIMIC-III dataset and the private Xiangya dataset, our model achieves micro-F1 scores of 0.583 and 0.806, respectively, outperforming other competing models. Furthermore, we verify the ability of our model to predict new ICD codes. The case study shows how pseudo label-wise attention works and demonstrates the effectiveness of the pseudo label-wise attention mechanism.
Zhong Ji, Kexin Chen, Haoran Wang
Image-text matching plays a central role in bridging the semantic gap between
vision and language. The key point to achieve precise visual-semantic alignment
lies in capturing the fine-grained cross-modal correspondence between image and
text. Most previous methods rely on single-step reasoning to discover the
visual-semantic interactions, and thus lack the ability to exploit the
multi-level information to locate the hierarchical fine-grained relevance.
In contrast, in this work, we propose a step-wise hierarchical
alignment network (SHAN) that decomposes image-text matching into a multi-step
cross-modal reasoning process. Specifically, we first achieve local-to-local
alignment at the fragment level, followed by global-to-local and
global-to-global alignment at the context level sequentially. This progressive
alignment strategy supplies our model with more complementary and sufficient
semantic clues to understand the hierarchical correlations between image and
text. The experimental results on two benchmark datasets demonstrate the
superiority of our proposed method.
Authors' comments: Accepted by IJCAI 2021
Zefan Li, Chenxi Liu, Alan Yuille, Bingbing Ni, Wenjun Zhang, Wen Gao
Unsupervised learning methods have recently shown their competitiveness
against supervised training. Typically, these methods use a single objective to
train the entire network. But one distinct advantage of unsupervised over
supervised learning is that the former possesses more variety and freedom in
designing the objective. In this work, we explore new dimensions of
unsupervised learning by proposing the Progressive Stage-wise Learning (PSL)
framework. For a given unsupervised task, we design multilevel tasks and define
different learning stages for the deep network. Early learning stages are
forced to focus on low-level tasks while later stages are guided to extract
deeper information through harder tasks. We discover that by progressive
stage-wise learning, unsupervised feature representation can be effectively
enhanced. Our extensive experiments show that PSL consistently improves results
for the leading unsupervised learning methods.
Authors' comments: Accepted by the IEEE Conference on Computer Vision and Pattern
Recognition, 2021
Jianqiang Huang, Ke Hu, Qingtao Tang, Mingjian Chen, Yi Qi, Jia Cheng, Jun Lei
Click-through rate (CTR) prediction plays an important role in online
advertising and recommender systems. In practice, the training of CTR models
depends on click data, which is intrinsically biased towards higher positions
since higher positions have higher CTR by nature. Existing methods such as actual
position training with fixed position inference and inverse propensity weighted
training with no position inference alleviate the bias problem to some extent.
However, the different treatment of position information between training and
inference will inevitably lead to inconsistency and sub-optimal online
performance. Meanwhile, the basic assumption of these methods, i.e., the click
probability is the product of examination probability and relevance
probability, is oversimplified and insufficient to model the rich interaction
between position and other information. In this paper, we propose a Deep
Position-wise Interaction Network (DPIN) to efficiently combine all candidate
items and positions for estimating CTR at each position, achieving consistency
between offline and online as well as modeling the deep non-linear interaction
among position, user, context and item under the limit of serving performance.
Following our new treatment of the position bias in CTR prediction, we propose
a new evaluation metric named PAUC (position-wise AUC) that is suitable for
measuring the ranking quality at a given position. Through extensive
experiments on a real-world dataset, we show empirically that our method is
both effective and efficient in solving the position bias problem. We have also
deployed our method in production and observed statistically significant
improvement over a highly optimized baseline in a rigorous A/B test.
Authors' comments: Accepted by SIGIR 2021
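The simplifying assumption that the abstract above criticizes, namely that click probability factorizes as examination probability times relevance probability, can be simulated directly. The sketch below is not DPIN or PAUC: it only shows how raw CTR is biased by position under that assumption and how inverse propensity weighting undoes the bias when the examination probabilities are known. All numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values: P(examined | position) and P(click | examined).
examination = np.array([1.0, 0.5, 0.25])
relevance = 0.3
n = 200_000

# Show the same item at a random position and draw clicks under the
# assumption P(click) = P(examined | position) * P(relevant).
positions = rng.integers(0, 3, size=n)
clicks = rng.random(n) < examination[positions] * relevance

# Raw CTR per position: biased towards higher (earlier) positions.
naive_ctr = np.array([clicks[positions == p].mean() for p in range(3)])

# Inverse propensity weighting: divide each click by its examination probability.
ipw_relevance = np.array(
    [(clicks[positions == p] / examination[p]).mean() for p in range(3)]
)

print("naive CTR by position:      ", naive_ctr)
print("IPW relevance by position:  ", ipw_relevance)
```

The correction works here only because the simulated clicks obey the factorized model exactly; the abstract's point is that real interactions among position, user, context, and item are richer than this product form.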
Zichuan Lin, Jing Huang, Bowen Zhou, Xiaodong He, Tengyu Ma
Recent work (Takanobu et al., 2020) proposed system-wise evaluation of
dialog systems and found that improvement on individual components (e.g., NLU,
policy) in prior work may not necessarily bring benefit to pipeline systems in
system-wise evaluation. To improve the system-wise performance, in this paper,
we propose new joint system-wise optimization techniques for the pipeline
dialog system. First, we propose a new data augmentation approach which
automates the labeling process for NLU training. Second, we propose a novel
stochastic policy parameterization with Poisson distribution that enables
better exploration and offers a principled way to compute policy gradient.
Third, we propose a reward bonus to help policy explore successful dialogs. Our
approaches outperform the competitive pipeline systems from Takanobu et al.
(2020) by large margins of 12% in success rate in automatic system-wise evaluation
and 16% in success rate in human evaluation on the standard multi-domain
benchmark dataset MultiWOZ 2.1, and also outperform the recent state-of-the-art
end-to-end trained model from DSTC9.
Authors' comments: 13 pages