Dan Liu, Libo Zhang, Tiejian Luo, Lili Tao, Yanjun Wu
The lack of interpretability of existing CNN-based hand detection methods
makes it difficult to understand the rationale behind their predictions. In
this paper, we propose a novel neural network model, which introduces
interpretability into hand detection for the first time. The main improvements
include: (1) Hands are detected at the pixel level to show which pixels form the basis
for the decision and to improve the transparency of the model. (2) The explainable
Highlight Feature Fusion block highlights distinctive features among multiple
layers and learns discriminative ones to gain robust performance. (3) We
introduce a transparent representation, the rotation map, to learn rotation
features instead of complex and non-transparent rotation and derotation layers.
(4) Auxiliary supervision accelerates the training process, which saves more
than 10 hours in our experiments. Experimental results on the VIVA and Oxford
hand detection and tracking datasets show competitive accuracy of our method
compared with state-of-the-art methods with higher speed.
Authors' comments: Accepted to Pattern Recognition
Xin Zhou, Dejing Dou, Boyang Li
Search space is a key consideration for neural architecture search. Recently, Xie et al. (2019) found that randomly generated networks from the same distribution perform similarly, which suggests we should search for random graph distributions instead of graphs. We propose the graphon as a new search space. A graphon is the limit of a Cauchy sequence of graphs and a scale-free probabilistic distribution, from which graphs with different numbers of nodes can be drawn. By utilizing properties of the graphon space and the associated cut-distance metric, we develop theoretically motivated techniques that search for and scale up small-capacity stage-wise graphs found on small datasets to large-capacity graphs that can handle ImageNet. The scaled stage-wise graphs outperform DenseNet and randomly wired Watts-Strogatz networks, indicating the benefits of graphon theory in NAS applications.
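As an illustration of how graphs with different numbers of nodes can be drawn from one graphon, the following minimal sketch uses the standard sampling construction: each node receives a latent value drawn uniformly from [0, 1], and each pair of nodes is connected with probability W(u_i, u_j). The function names and the example graphon are assumptions made for illustration; this is not the search or scaling procedure of the paper.

```python
import random

def sample_graph_from_graphon(w, n, seed=0):
    """Draw an n-node undirected graph from a graphon w: [0,1]^2 -> [0,1]."""
    rng = random.Random(seed)
    # One latent position per node, uniform on [0, 1).
    latent = [rng.random() for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Connect i and j with probability w(u_i, u_j).
            if rng.random() < w(latent[i], latent[j]):
                edges.add((i, j))
    return edges

def example_graphon(x, y):
    # Hypothetical graphon chosen for illustration: symmetric, values in [0, 1].
    return (x + y) / 2.0

# The same graphon yields graphs of different sizes.
small_graph = sample_graph_from_graphon(example_graphon, 8)
large_graph = sample_graph_from_graphon(example_graphon, 32)
```

Because the graphon is independent of the node count, the same object can describe both a small graph and a larger one, which is the property the abstract relies on for scaling.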
Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
We present an end-to-end joint training framework that explicitly models
6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular
camera setup without supervision. Our technical contributions are three-fold.
First, we propose a differentiable forward rigid projection module that plays a
key role in our instance-wise depth and motion learning. Second, we design an
instance-wise photometric and geometric consistency loss that effectively
decomposes background and moving object regions. Lastly, we introduce a new
auto-annotation scheme to produce video instance segmentation maps that will be
utilized as input to our training pipeline. These proposed elements are
validated in a detailed ablation study. Through extensive experiments conducted
on the KITTI dataset, our framework is shown to outperform the state-of-the-art
depth and motion estimation methods. Our code and dataset will be available at
https://github.com/SeokjuLee/Insta-DM.
Authors' comments: Project page at https://sites.google.com/site/seokjucv/home/instadm
Bingqing Xie, Pei Niu, Ting Su, Valérie Kaftandjian, Loic Boussel, Philippe Douek, Feng Yang, Philippe Duvauchelle, Yuemin Zhu
Spectral photon-counting X-ray CT (sCT) opens up new possibilities for the quantitative measurement of materials in an object, compared to conventional energy-integrating CT or dual energy CT. However, achieving reliable and accurate material decomposition in sCT is extremely challenging, due to the similarity between different basis materials, strong quantum noise and photon-counting detector limitations. We propose a novel material decomposition method that works in a region-wise manner. The method consists in optimizing basis materials based on spatio-energy segmentation of regions of interest (ROIs) in sCT images and performing a fine material decomposition involving an optimized decomposition matrix and sparsity regularization. The effectiveness of the proposed method was validated on both digital and physical data. The results showed that the proposed ROI-wise material decomposition method presents clearly higher reliability and accuracy compared to common decomposition methods based on total variation (TV) or L1-norm (lasso) regularization.
Sebastian Guendel, Andreas Maier
The current accessibility to large medical datasets for training
convolutional neural networks is tremendously high. The associated dataset
labels are always considered to be the real "ground truth". However, the
labeling procedures often seem to be inaccurate and many wrong labels are
integrated. This may have fatal consequences on the performance of both
training and evaluation. In this paper, we show the impact of label noise in
the training set on a specific medical problem based on chest X-ray images.
With a simple one-class problem, the classification of tuberculosis, we measure
the performance on a clean evaluation set when training with label-corrupt
data. We develop a method to cope with incorrectly labeled data during
training by randomly attacking labels on individual epochs. The network tends
to be robust when flipping correct labels for a single epoch and initiates a
good step to the optimal minimum on the error surface when flipping noisy
labels. On a baseline with an AUC (area under the curve) score of 0.924, the
performance drops to 0.809 when 30% of our training data is mislabeled. With
our approach, the baseline performance could almost be maintained: the
performance rose to 0.918.
Authors' comments: Accepted at BVM 2020
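The idea of randomly attacking labels on individual epochs can be sketched as follows. This is a minimal illustration under stated assumptions: the flip fraction, the uniform choice of samples, and the function name are hypothetical, since the abstract does not specify them.

```python
import random

def flip_labels_for_epoch(labels, flip_fraction, rng):
    """Return a copy of binary labels with a random fraction flipped.

    The stored labels are left untouched, so each epoch starts from the
    same (possibly noisy) label set and receives its own random flips.
    """
    flipped = list(labels)
    n_flip = int(round(flip_fraction * len(labels)))
    for idx in rng.sample(range(len(labels)), n_flip):
        flipped[idx] = 1 - flipped[idx]
    return flipped

rng = random.Random(0)
stored_labels = [0, 1] * 50  # hypothetical binary training labels
for epoch in range(3):
    # flip_fraction=0.1 is an assumed value, not one reported in the abstract.
    epoch_labels = flip_labels_for_epoch(stored_labels, 0.1, rng)
    # A training step would consume epoch_labels here.
```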
Tejus Gupta, Abhishek Sinha, Nupur Kumari, Mayank Singh, Balaji Krishnamurthy
We present an algorithm for computing class-specific universal adversarial perturbations for deep neural networks. Such perturbations can induce misclassification in a large fraction of images of a specific class. Unlike previous methods that use iterative optimization for computing a universal perturbation, the proposed method employs a perturbation that is a linear function of weights of the neural network and hence can be computed much faster. The method does not require any training data and has no hyper-parameters. The attack obtains 34% to 51% fooling rate on state-of-the-art deep neural networks on ImageNet and transfers across models. We also study the characteristics of the decision boundaries learned by standard and adversarially trained models to understand the universal adversarial perturbations.
Yousef Atoum, Mao Ye, Liu Ren, Ying Tai, Xiaoming Liu
Absence of nearby light sources while capturing an image will degrade the
visibility and quality of the captured image, making computer vision tasks
difficult. In this paper, a color-wise attention network (CWAN) is proposed for
low-light image enhancement based on convolutional neural networks. Motivated
by the human visual system when looking at dark images, CWAN learns an
end-to-end mapping between low-light and enhanced images while searching for
any useful color cues in the low-light image to aid in the color enhancement
process. Once these regions are identified, CWAN's attention is mainly
focused on synthesizing these local regions, as well as the global image. Both
quantitative and qualitative experiments on challenging datasets demonstrate
the advantages of our method in comparison with state-of-the-art methods.
Authors' comments: 8 pages, 9 figures
Yiyao Shi, Jian Wang, Xiangyang Xue
In this paper, a learning-free color constancy algorithm called the
Patch-wise Bright Pixels (PBP) is proposed. In this algorithm, an input image
is first downsampled and then cut equally into a few patches. After that,
according to the modified brightness of each patch, a proper fraction of
brightest pixels in the patch is selected. Finally, Gray World (GW)-based
methods are applied to the selected bright pixels to estimate the illuminant of
the scene. Experiments on the NUS $8$-Camera Dataset show that the PBP algorithm
outperforms the state-of-the-art learning-free methods as well as a broad range
of learning-based ones. In particular, PBP processes a $1080$p image within two
milliseconds, which is hundreds of times faster than the existing learning-free
ones. Our algorithm offers a potential solution for the full-screen smartphones
whose screen-to-body ratio is $100$\%.
Authors' comments: 7 figures and 4 tables
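The pipeline described in the PBP abstract (downsample, cut into patches, keep the brightest pixels of each patch, apply a Gray World estimate) can be sketched roughly as below. Several details are assumptions made for illustration: brightness is taken as R + G + B, the kept fraction is a fixed constant rather than being derived from a "modified brightness", and the image is assumed to be at least stride * grid pixels in each dimension.

```python
def estimate_illuminant(image, stride=2, grid=2, fraction=0.1):
    """Estimate a unit-norm RGB illuminant from a list of rows of (r, g, b) tuples."""
    # Step 1: downsample by keeping every stride-th pixel.
    small = [row[::stride] for row in image[::stride]]
    height, width = len(small), len(small[0])
    patch_h, patch_w = height // grid, width // grid

    selected = []
    # Step 2: cut the downsampled image into grid x grid equal patches.
    for gy in range(grid):
        for gx in range(grid):
            patch = [small[y][x]
                     for y in range(gy * patch_h, (gy + 1) * patch_h)
                     for x in range(gx * patch_w, (gx + 1) * patch_w)]
            # Step 3: keep the brightest pixels (assumed brightness: r + g + b).
            patch.sort(key=sum, reverse=True)
            keep = max(1, int(len(patch) * fraction))
            selected.extend(patch[:keep])

    # Step 4: Gray World on the selected pixels, i.e. the per-channel mean.
    means = [sum(p[c] for p in selected) / len(selected) for c in range(3)]
    norm = sum(m * m for m in means) ** 0.5
    if norm == 0:
        return [3 ** -0.5] * 3  # all-black input: fall back to a neutral estimate
    return [m / norm for m in means]

# Usage: a uniformly tinted image yields an estimate proportional to its color.
tinted = [[(200, 100, 50)] * 8 for _ in range(8)]
illuminant = estimate_illuminant(tinted)
```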
Lu Wang, Jie Yang
Large-scale cross-modal hashing similarity retrieval has attracted more and
more attention in modern search applications such as search engines and
autopilot, showing great superiority in computation and storage. However,
current unsupervised cross-modal hashing methods still have some limitations:
(1) many methods relax the discrete constraints to solve the optimization
objective, which may significantly degrade the retrieval performance; (2) most
existing hashing models project heterogeneous data into a common latent space,
which may lose sight of the diversity in heterogeneous data; (3) transforming
real-valued data points to binary codes always results in abundant loss of
information, producing a suboptimal continuous latent space. To overcome the
above problems, in this paper, a novel Cluster-wise Unsupervised Hashing (CUH)
method is proposed. Specifically, CUH jointly performs multi-view
clustering, which projects the original data points from each modality
into its own low-dimensional latent semantic space and finds the cluster
centroid points and the common clustering indicators in that low-dimensional
space, and learns the compact hash codes and the corresponding linear hash
functions. A discrete optimization framework is developed to learn the unified
binary codes across modalities under the guidance of cluster-wise code-prototypes.
The reasonableness and effectiveness of CUH are well demonstrated by
comprehensive experiments on diverse benchmark datasets.
Authors' comments: 13 pages, 26 figures
Pavel Sulimov, Elena Sukmanova, Roman Chereshnev, Attila Kertesz-Farkas
Training of deep models for classification tasks is hindered by local minima problems and vanishing gradients, while unsupervised layer-wise pretraining does not exploit information from class labels. Here, we propose a new regularization technique, called diversifying regularization (DR), which applies a penalty on hidden units at any layer if they obtain similar features for different types of data. For generative models, DR is defined as divergence over the variational posterior distributions and included in the maximum likelihood estimation as a prior. Thus, DR includes class label information for greedy pretraining of deep belief networks, which results in a better weight initialization for fine-tuning methods. On the other hand, for discriminative training of deep neural networks, DR is defined as a distance over the features and included in the learning objective. With our experimental tests, we show that DR can help the backpropagation to cope with vanishing gradient problems and to provide faster convergence and smaller generalization errors.
Yuhu Shan
Among the neural network compression techniques, knowledge distillation is an effective one which forces a simpler student network to mimic the output of a larger teacher network. However, most of such model distillation methods focus on the image-level classification task. Directly adapting these methods to the task of semantic segmentation only brings marginal improvements. In this paper, we propose a simple, yet effective knowledge representation referred to as pixel-wise feature similarities (PFS) to tackle the challenging distillation problem of semantic segmentation. The developed PFS encodes spatial structural information for each pixel location of the high-level convolutional features, which helps guide the distillation process in an easier way. Furthermore, a novel weighted pixel-level soft prediction imitation approach is proposed to enable the student network to selectively mimic the teacher network's output, according to their pixel-wise knowledge-gaps. Extensive experiments are conducted on the challenging datasets of Pascal VOC 2012, ADE20K and Pascal Context. Our approach brings significant performance improvements compared to several strong baselines and achieves new state-of-the-art results.
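One plausible reading of a pixel-wise similarity representation is sketched below: every pixel's feature vector is compared with every other pixel's by cosine similarity, and the student is penalized when its similarity matrix differs from the teacher's. The cosine measure, the squared-error loss, and the function names are assumptions for illustration; the abstract does not give the exact PFS definition.

```python
import math

def pairwise_cosine(features):
    """Map N per-pixel feature vectors to an N x N cosine similarity matrix."""
    # Zero vectors get norm 1.0 to avoid division by zero.
    norms = [math.sqrt(sum(v * v for v in f)) or 1.0 for f in features]
    return [[sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
             for fj, nj in zip(features, norms)]
            for fi, ni in zip(features, norms)]

def similarity_distillation_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher similarity matrices."""
    s = pairwise_cosine(student_feats)
    t = pairwise_cosine(teacher_feats)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# The two networks may use different channel counts: only the N x N
# similarity structure over pixel locations is compared.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
student = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
loss = similarity_distillation_loss(student, teacher)
```

A design point worth noting in this sketch is that the similarity matrix has a size that depends only on the number of pixel locations, so no projection layer is needed to match teacher and student channel dimensions.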
Shiyu Chang, Yang Zhang, Mo Yu, Tommi S. Jaakkola
Selection of input features such as relevant pieces of text has become a
common technique of highlighting how complex neural predictors operate. The
selection can be optimized post-hoc for trained models or incorporated directly
into the method itself (self-explaining). However, an overall selection does
not properly capture the multi-faceted nature of useful rationales such as pros
and cons for decisions. To this end, we propose a new game theoretic approach
to class-dependent rationalization, where the method is specifically trained to
highlight evidence supporting alternative conclusions. Each class involves
three players set up competitively to find evidence for factual and
counterfactual scenarios. We show theoretically in a simplified scenario how
the game drives the solution towards meaningful class-dependent rationales. We
evaluate the method in single- and multi-aspect sentiment classification tasks
and demonstrate that the proposed method is able to identify both factual
(justifying the ground truth label) and counterfactual (countering the ground
truth label) rationales consistent with human rationalization. The code for our
method is publicly available.
Authors' comments: Accepted by Neural Information Processing Systems (NeurIPS 2019),
Vancouver, Canada
Pratik Mazumder, Pravendra Singh, Vinay Namboodiri
Convolutional layers are a major driving force behind the successes of deep
learning. Pointwise convolution (PWC) is a 1x1 convolutional filter that is
primarily used for parameter reduction. However, the PWC ignores the spatial
information around the points it is processing. This design is by choice, in
order to reduce the overall parameters and computations. However, we
hypothesize that this shortcoming of PWC has a significant impact on the
network performance. We propose an alternative design for pointwise
convolution, which uses spatial information from the input efficiently. Our
design significantly improves the performance of the networks without
substantially increasing the number of parameters and computations. We
experimentally show that our design results in significant improvement in the
performance of the network for classification as well as detection.
Authors' comments: Accepted in ICASSP 2020
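To make the limitation mentioned in the abstract concrete, the sketch below implements a standard pointwise (1x1) convolution in plain Python: each output pixel is a linear map of the channels at the same location only, so no neighboring pixels are consulted. This shows the conventional PWC, not the alternative design proposed by the authors; the function name and data layout are assumptions for illustration.

```python
def pointwise_conv(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map: H x W x C_in nested lists.
    weights:     C_out x C_in nested lists (bias omitted for brevity).
    """
    return [[[sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
             for pixel in row]
            for row in feature_map]

# A 1 x 2 feature map with 3 input channels, mapped to 2 output channels.
features = [[[1, 2, 3], [4, 5, 6]]]
weights = [[1, 0, 0],   # output channel 0 copies input channel 0
           [1, 1, 1]]   # output channel 1 sums all input channels
output = pointwise_conv(features, weights)
```

Changing one input pixel alters only the output at that same position, which is exactly the absence of spatial context that the abstract describes.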
Maciej Paszynski
We focus on the finite element method computations with higher-order C1
continuity basis functions that preserve the partition of unity. We show that
the rows of the system of linear equations can be combined, and the test
functions can be summed up to 1 using the partition of unity property at the
quadrature points. Thus, the test functions in higher continuity isogeometric analysis (IGA) can be set
to piece-wise constants. This formulation is equivalent to testing with
piece-wise constant basis functions, with supports spanning some parts of the
domain. The resulting method is a Petrov-Galerkin formulation with piece-wise
constant test functions. This observation has the following consequences. The
numerical integration cost can be reduced because we do not need to evaluate
the test functions since they are equal to 1. This observation is valid for any
basis functions preserving the partition of unity property. It is independent
of the problem dimension and geometry of the computational domain. It also can
be used in time-dependent problems, e.g., in the explicit dynamics
computations, where we can reduce the cost of generation of the right-hand
side. This summation of test functions can be performed for an arbitrary linear
differential operator resulting from the Galerkin method applied to a PDE where
we discretize with C1 continuity basis functions. The resulting method is
equivalent to a linear combination of the collocations at points and with
weights resulting from applied quadrature over the spans defined by supports of
the piece-wise constant test functions.
Authors' comments: 32 pages, 8 figures
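The partition of unity property that the argument relies on can be checked numerically. The sketch below uses uniform quadratic B-splines, which are C1 continuous, on a single element with local coordinate t in [0, 1], and evaluates them at two-point Gauss-Legendre nodes. The choice of basis and quadrature rule is an assumption for illustration and is not taken from the paper.

```python
import math

def quadratic_bspline_basis(t):
    """Three uniform quadratic B-spline basis functions on one element, t in [0, 1]."""
    return [
        0.5 * (1.0 - t) ** 2,
        0.5 * (-2.0 * t * t + 2.0 * t + 1.0),
        0.5 * t * t,
    ]

# Two-point Gauss-Legendre nodes mapped from [-1, 1] to [0, 1].
GAUSS_POINTS = [0.5 - 0.5 / math.sqrt(3.0), 0.5 + 0.5 / math.sqrt(3.0)]

def basis_sums_at_quadrature_points():
    """Sum of all basis functions at each quadrature point (should be 1)."""
    return [sum(quadratic_bspline_basis(t)) for t in GAUSS_POINTS]

sums = basis_sums_at_quadrature_points()
```

Since the basis functions sum to 1 at every quadrature point, adding the corresponding rows of the linear system amounts to testing with the constant function 1 on that element, which is the observation the abstract builds on.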
Limeng Qiao, Yemin Shi, Jia Li, Yaowei Wang, Tiejun Huang, Yonghong Tian
Few-shot learning, which aims at extracting new concepts rapidly from extremely few examples of novel classes, has recently been incorporated into the meta-learning paradigm. Yet, the key challenge of how to learn a generalizable classifier with the capability of adapting to specific tasks with severely limited data still remains in this domain. To this end, we propose a Transductive Episodic-wise Adaptive Metric (TEAM) framework for few-shot learning, by integrating the meta-learning paradigm with both deep metric learning and transductive inference. By exploring the pairwise constraints and regularization prior within each task, we explicitly formulate the adaptation procedure into a standard semi-definite programming problem. By solving the problem with its closed-form solution on the fly with the setup of transduction, our approach efficiently tailors an episodic-wise metric for each task to adapt all features from a shared task-agnostic embedding space into a more discriminative task-specific metric space. Moreover, we further leverage an attention-based bi-directional similarity strategy for extracting a more robust relationship between queries and prototypes. Extensive experiments on three benchmark datasets show that our framework is superior to other existing approaches and achieves state-of-the-art performance in the few-shot literature.
Yuqing Ma, Xianglong Liu, Shihao Bai, Lei Wang, Aishan Liu, Dacheng Tao, Edwin Hancock
Recently, deep neural networks have achieved promising performance for
filling large missing regions in image inpainting tasks. They usually adopt
the standard convolutional architecture over the corrupted image, leading to
meaningless contents, such as color discrepancy, blur and artifacts. Moreover,
most inpainting approaches cannot handle large continuous missing areas
well. To address these problems, we propose a generic inpainting framework
capable of handling incomplete images with both continuous and discontinuous
large missing areas, in an adversarial manner. Within it, region-wise
convolution is deployed in both generator and discriminator to separately
handle the different regions, namely existing regions and missing ones.
Moreover, a correlation loss is introduced to capture the non-local
correlations between different patches, and thus guides the generator to obtain
more information during inference. With the help of our proposed framework, we
can restore semantically reasonable and visually realistic images. Extensive
experiments on three widely-used datasets for image inpainting tasks have been
conducted, and both qualitative and quantitative experimental results
demonstrate that the proposed model significantly outperforms the
state-of-the-art approaches, both on the large continuous and discontinuous
missing areas.
Authors' comments: 13 pages, 8 figures, 3 tables
Jinsung Yoon, Sercan O. Arik, Tomas Pfister
Understanding black-box machine learning models is crucial for their
widespread adoption. Learning globally interpretable models is one approach,
but achieving high performance with them is challenging. An alternative
approach is to explain individual predictions using locally interpretable
models. For locally interpretable modeling, various methods have been proposed
and indeed commonly used, but they suffer from low fidelity, i.e. their
explanations do not approximate the predictions well. In this paper, our goal
is to push the state-of-the-art in high-fidelity locally interpretable
modeling. We propose a novel framework, Locally Interpretable Modeling using
Instance-wise Subsampling (LIMIS). LIMIS utilizes a policy gradient to select a
small number of instances and distills the black-box model into a low-capacity
locally interpretable model using those selected instances. Training is guided
with a reward obtained directly by measuring the fidelity of the locally
interpretable models. We show on multiple tabular datasets that LIMIS
near-matches the prediction accuracy of black-box models, significantly
outperforming state-of-the-art locally interpretable models in terms of
fidelity and prediction accuracy.
Authors' comments: Published in Transactions on Machine Learning Research (TMLR) -
September, 2022 - https://openreview.net/forum?id=S8eABAy8P3
Luxuan Li, Tao Kong, Fuchun Sun, Huaping Liu
Detecting actions in videos is an important yet challenging task. Previous
works usually utilize (a) sliding window paradigms, or (b) per-frame action
scoring and grouping to enumerate the possible temporal locations. Their
performance is also limited by the designs of sliding windows or grouping
strategies. In this paper, we present a simple and effective method for
temporal action proposal generation, named Deep Point-wise Prediction (DPP).
DPP simultaneously predicts the probability that an action exists and the
corresponding temporal locations, without the utilization of any handcrafted
sliding window or grouping. The whole system is end-to-end trained with a joint
loss of temporal action proposal classification and location prediction. We
conduct extensive experiments to verify its effectiveness, generality and
robustness on the standard THUMOS14 dataset. DPP runs at more than 1000 frames per
second, which largely satisfies the real-time requirement. The code is
available at https://github.com/liluxuan1997/DPP.
Authors' comments: accepted by ICONIP2019 oral presentation (International Conference on
Neural Information Processing)
Jue Jiang, Elguindi Sharif, Hyemin Um, Sean Berry, Harini Veeraraghavan
We developed a new and computationally simple local block-wise self attention based normal structures segmentation approach applied to head and neck computed tomography (CT) images. Our method uses the insight that normal organs exhibit regularity in their spatial location and inter-relation within images, which can be leveraged to simplify the computations required to aggregate feature information. We accomplish this by using local self attention blocks that pass information between each other to derive the attention map. We show that adding attention layers increases the contextual field and captures focused attention from relevant structures. We developed our approach using U-net and compared it against multiple state-of-the-art self attention methods. All models were trained on 48 internal head and neck CT scans and tested on 48 CT scans from the external public domain database of computational anatomy dataset. Our method achieved the highest Dice similarity coefficient segmentation accuracy of 0.85$\pm$0.04 and 0.86$\pm$0.04 for left and right parotid glands, 0.79$\pm$0.07 and 0.77$\pm$0.05 for left and right submandibular glands, 0.93$\pm$0.01 for mandible and 0.88$\pm$0.02 for the brain stem, with the lowest increase of 66.7\% computing time per image and 0.15\% increase in model parameters compared with standard U-net. The best state-of-the-art method, called point-wise spatial attention, achieved comparable accuracy but with 516.7\% increase in computing time and 8.14\% increase in parameters compared with standard U-net. Finally, we performed ablation tests and studied the impact of attention block size, overlap of the attention blocks, additional attention layers, and attention block placement on segmentation performance.
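For reference, the Dice similarity coefficient reported above is twice the overlap of two masks divided by the sum of their sizes. The sketch below computes it for flat binary masks; the function name and the convention for two empty masks are assumptions for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |intersection| / (|pred| + |truth|) for flat 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# One of the two predicted voxels overlaps the single ground-truth voxel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```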
Kun Zhang, Peng He, Ping Yao, Ge Chen, Rui Wu, Min Du, Huimin Li, Li Fu et al.
Recently, multi-resolution networks (such as Hourglass, CPN, HRNet, etc.)
have achieved significant performance on pose estimation by combining feature
maps of various resolutions. In this paper, we propose a Resolution-wise
Attention Module (RAM) and Gradual Pyramid Refinement (GPR), to learn enhanced
resolution-wise feature maps for precise pose estimation. Specifically, RAM
learns a group of weights to represent the different importance of feature maps
across resolutions, and the GPR gradually merges every two feature maps from
low to high resolutions to regress final human keypoint heatmaps. With the
enhanced resolution-wise features learnt by CNN, we obtain more accurate human
keypoint locations. The efficacy of our proposed methods is demonstrated on
the MS-COCO dataset, achieving state-of-the-art performance with average precision
of 77.7 on the COCO val2017 set and 77.0 on the test-dev2017 set without using an extra
human keypoint training dataset.
Authors' comments: Published at ICIP 2020