Pratik Mazumder, Pravendra Singh, Vinay Namboodiri
Convolutional layers are a major driving force behind the successes of deep
learning. Pointwise convolution (PWC) is a 1x1 convolutional filter that is
primarily used for parameter reduction. However, the PWC ignores the spatial
information around the points it is processing. This design is by choice, in
order to reduce the overall parameters and computations. However, we
hypothesize that this shortcoming of PWC has a significant impact on the
network performance. We propose an alternative design for pointwise
convolution, which uses spatial information from the input efficiently. Our
design significantly improves the performance of the networks without
substantially increasing the number of parameters and computations. We
experimentally show that our design results in significant improvement in the
performance of the network for classification as well as detection.
Authors' comments: Accepted in ICASSP 2020
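To illustrate the point the abstract makes about standard pointwise convolution (not the alternative design the authors propose), here is a minimal pure-Python sketch. The function name `pointwise_conv`, the nested-list layout, and the toy values are assumptions made for this example only.

```python
def pointwise_conv(x, weights):
    """Apply a 1x1 convolution to an H x W x C_in input.

    x       : nested lists indexed as x[row][col][channel]
    weights : nested lists indexed as weights[out_channel][in_channel]
    """
    return [
        [
            # Each output pixel depends only on the input pixel at the
            # same (row, col) position: channels are mixed, neighbours are not.
            [sum(w * v for w, v in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in x
    ]


# Toy 2x2 image with 2 input channels (values are illustrative only).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
weights = [[1.0, 1.0]]  # one output channel that sums the two inputs

y = pointwise_conv(x, weights)
print(y)  # [[[3.0], [7.0]], [[11.0], [15.0]]]

# Changing a neighbouring pixel leaves the output at position (0, 0) untouched.
x_changed = [[[1.0, 2.0], [30.0, 40.0]],
             [[5.0, 6.0], [7.0, 8.0]]]
print(pointwise_conv(x_changed, weights)[0][0] == y[0][0])  # True
```

The second call shows the limitation the abstract refers to: the output at a position is blind to whatever surrounds that position.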
Maciej Paszynski
We focus on the finite element method computations with higher-order C1
continuity basis functions that preserve the partition of unity. We show that
the rows of the system of linear equations can be combined, and the test
functions can be summed up to 1 using the partition of unity property at the
quadrature points. Thus, the test functions in higher continuity isogeometric
analysis (IGA) can be set to piece-wise constants. This formulation is
equivalent to testing with piece-wise constant basis functions whose supports
span some parts of the domain. The resulting method is a Petrov-Galerkin formulation with piece-wise
constant test functions. This observation has the following consequences. The
numerical integration cost can be reduced because we do not need to evaluate
the test functions since they are equal to 1. This observation is valid for any
basis functions preserving the partition of unity property. It is independent
of the problem dimension and geometry of the computational domain. It also can
be used in time-dependent problems, e.g., in the explicit dynamics
computations, where we can reduce the cost of generation of the right-hand
side. This summation of test functions can be performed for an arbitrary linear
differential operator resulting from the Galerkin method applied to a PDE where
we discretize with C1 continuity basis functions. The resulting method is
equivalent to a linear combination of the collocations at points and with
weights resulting from applied quadrature over the spans defined by supports of
the piece-wise constant test functions.
Authors' comments: 32 pages, 8 figures
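The row-summation argument in the abstract above can be sketched as follows. This is a hedged illustration, not the paper's own derivation: the symbols $\mathcal{L}$, $f$, $\phi_i$, $I$, $x_q$ and $w_q$ are notation introduced here, and the strong-form residual tested against $\phi_i$ is an assumed setting.

```latex
% Assumed Galerkin rows, one per test function \phi_i:
\int_\Omega (\mathcal{L} u_h)\,\phi_i \,dx = \int_\Omega f\,\phi_i \,dx ,
\qquad i = 1,\dots,N .

% Summing the rows over an index set I and using linearity in the test function:
\int_\Omega (\mathcal{L} u_h) \Big(\sum_{i \in I} \phi_i\Big) dx
  = \int_\Omega f \Big(\sum_{i \in I} \phi_i\Big) dx .

% With quadrature points x_q and weights w_q, on the region where
% \sum_{i \in I} \phi_i(x_q) = 1 (partition of unity), each side reduces to
\sum_q w_q \,(\mathcal{L} u_h)(x_q) = \sum_q w_q \, f(x_q) ,
```

which is a weighted combination of collocation conditions at the quadrature points, and no test function values need to be evaluated there.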
Limeng Qiao, Yemin Shi, Jia Li, Yaowei Wang, Tiejun Huang, Yonghong Tian
Few-shot learning, which aims at extracting new concepts rapidly from extremely few examples of novel classes, has recently been cast into the meta-learning paradigm. Yet the key challenge of how to learn a generalizable classifier capable of adapting to specific tasks with severely limited data still remains in this domain. To this end, we propose a Transductive Episodic-wise Adaptive Metric (TEAM) framework for few-shot learning, which integrates the meta-learning paradigm with both deep metric learning and transductive inference. By exploring the pairwise constraints and regularization prior within each task, we explicitly formulate the adaptation procedure as a standard semi-definite programming problem. By solving the problem with its closed-form solution on the fly in the transductive setup, our approach efficiently tailors an episodic-wise metric for each task to adapt all features from a shared task-agnostic embedding space into a more discriminative task-specific metric space. Moreover, we leverage an attention-based bi-directional similarity strategy to extract a more robust relationship between queries and prototypes. Extensive experiments on three benchmark datasets show that our framework is superior to other existing approaches and achieves state-of-the-art performance in the few-shot literature.
Yuqing Ma, Xianglong Liu, Shihao Bai, Lei Wang, Aishan Liu, Dacheng Tao, Edwin Hancock
Recently, deep neural networks have achieved promising performance for
filling large missing regions in image inpainting tasks. They usually adopt
the standard convolutional architecture over the corrupted image, leading to
meaningless contents, such as color discrepancy, blur and artifacts. Moreover,
most inpainting approaches cannot handle large continuous missing areas
well. To address these problems, we propose a generic inpainting framework
capable of handling incomplete images with both continuous and discontinuous
large missing areas, in an adversarial manner. Within this framework, region-wise
convolution is deployed in both the generator and the discriminator to separately
handle the different regions, namely existing regions and missing ones.
Moreover, a correlation loss is introduced to capture the non-local
correlations between different patches, and thus guides the generator to obtain
more information during inference. With the help of our proposed framework, we
can restore semantically reasonable and visually realistic images. Extensive
experiments on three widely-used datasets for image inpainting tasks have been
conducted, and both qualitative and quantitative experimental results
demonstrate that the proposed model significantly outperforms the
state-of-the-art approaches, both on the large continuous and discontinuous
missing areas.
Authors' comments: 13 pages, 8 figures, 3 tables
Jinsung Yoon, Sercan O. Arik, Tomas Pfister
Understanding black-box machine learning models is crucial for their
widespread adoption. Learning globally interpretable models is one approach,
but achieving high performance with them is challenging. An alternative
approach is to explain individual predictions using locally interpretable
models. For locally interpretable modeling, various methods have been proposed
and indeed commonly used, but they suffer from low fidelity, i.e. their
explanations do not approximate the predictions well. In this paper, our goal
is to push the state-of-the-art in high-fidelity locally interpretable
modeling. We propose a novel framework, Locally Interpretable Modeling using
Instance-wise Subsampling (LIMIS). LIMIS utilizes a policy gradient to select a
small number of instances and distills the black-box model into a low-capacity
locally interpretable model using those selected instances. Training is guided
with a reward obtained directly by measuring the fidelity of the locally
interpretable models. We show on multiple tabular datasets that LIMIS
near-matches the prediction accuracy of black-box models, significantly
outperforming state-of-the-art locally interpretable models in terms of
fidelity and prediction accuracy.
Authors' comments: Published in Transactions on Machine Learning Research (TMLR) -
September, 2022 - https://openreview.net/forum?id=S8eABAy8P3
Luxuan Li, Tao Kong, Fuchun Sun, Huaping Liu
Detecting actions in videos is an important yet challenging task. Previous
works usually utilize (a) sliding window paradigms, or (b) per-frame action
scoring and grouping to enumerate the possible temporal locations. Their
performance is also limited by the design of the sliding windows or grouping
strategies. In this paper, we present a simple and effective method for
temporal action proposal generation, named Deep Point-wise Prediction (DPP).
DPP simultaneously predicts the probability that an action exists and the
corresponding temporal locations, without the utilization of any handcrafted
sliding window or grouping. The whole system is end-to-end trained with joint
loss of temporal action proposal classification and location prediction. We
conduct extensive experiments to verify its effectiveness, generality and
robustness on the standard THUMOS14 dataset. DPP runs at more than 1000 frames per
second, which largely satisfies the real-time requirement. The code is
available at https://github.com/liluxuan1997/DPP.
Authors' comments: accepted by ICONIP2019 oral presentation (International Conference on
Neural Information Processing)
Jue Jiang, Elguindi Sharif, Hyemin Um, Sean Berry, Harini Veeraraghavan
We developed a new and computationally simple approach, based on local block-wise self attention, for segmenting normal structures in head and neck computed tomography (CT) images. Our method uses the insight that normal organs exhibit regularity in their spatial location and inter-relation within images, which can be leveraged to simplify the computations required to aggregate feature information. We accomplish this by using local self attention blocks that pass information between each other to derive the attention map. We show that adding attention layers increases the contextual field and captures focused attention from relevant structures. We developed our approach using U-net and compared it against multiple state-of-the-art self attention methods. All models were trained on 48 internal head and neck CT scans and tested on 48 CT scans from the external public domain database of computational anatomy dataset. Our method achieved the highest Dice similarity coefficient segmentation accuracy of 0.85$\pm$0.04 and 0.86$\pm$0.04 for the left and right parotid glands, 0.79$\pm$0.07 and 0.77$\pm$0.05 for the left and right submandibular glands, 0.93$\pm$0.01 for the mandible, and 0.88$\pm$0.02 for the brain stem, with the lowest increase in computing time per image (66.7%) and a 0.15% increase in model parameters compared with standard U-net. The best state-of-the-art method, called point-wise spatial attention, achieved comparable accuracy but with a 516.7% increase in computing time and an 8.14% increase in parameters compared with standard U-net. Finally, we performed ablation tests and studied the impact of attention block size, overlap of the attention blocks, additional attention layers, and attention block placement on segmentation performance.
Kun Zhang, Peng He, Ping Yao, Ge Chen, Rui Wu, Min Du, Huimin Li, Li Fu et al.
Recently, multi-resolution networks (such as Hourglass, CPN, HRNet, etc.)
have achieved significant performance on pose estimation by combining feature
maps of various resolutions. In this paper, we propose a Resolution-wise
Attention Module (RAM) and Gradual Pyramid Refinement (GPR), to learn enhanced
resolution-wise feature maps for precise pose estimation. Specifically, RAM
learns a group of weights to represent the different importance of feature maps
across resolutions, and the GPR gradually merges every two feature maps from
low to high resolutions to regress final human keypoint heatmaps. With the
enhanced resolution-wise features learnt by CNN, we obtain more accurate human
keypoint locations. The efficacy of our proposed methods is demonstrated on the
MS-COCO dataset, achieving state-of-the-art performance with an average precision
of 77.7 on the COCO val2017 set and 77.0 on the test-dev2017 set without using an
extra human keypoint training dataset.
Authors' comments: Published on ICIP 2020
Weidi Xu, Xingyi Cheng, Kunlong Chen, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Luo Si et al.
The ability to reason semantically over a sentence pair is essential for
many natural language understanding tasks, e.g., natural language inference and
machine reading comprehension. A recent significant improvement in these tasks
comes from BERT. As reported, the next sentence prediction (NSP) in BERT, which
learns the contextual relationship between two sentences, is of great
significance for downstream problems with sentence-pair input. Despite the
effectiveness of NSP, we suggest that NSP still lacks the essential signal to
distinguish between entailment and shallow correlation. To remedy this, we
propose to augment the NSP task to a 3-class categorization task, which
includes a category for previous sentence prediction (PSP). The involvement of
PSP encourages the model to focus on the informative semantics to determine the
sentence order, thereby improving its semantic understanding. This
simple modification yields remarkable improvement over vanilla BERT. To
further incorporate the document-level information, the scope of NSP and PSP is
expanded into a broader range, i.e., NSP and PSP also include close but
nonsuccessive sentences, the noise of which is mitigated by the label-smoothing
technique. Both qualitative and quantitative experimental results demonstrate
the effectiveness of the proposed method. Our method consistently improves the
performance on the NLI and MRC benchmarks, including the challenging HANS
dataset \cite{hans}, suggesting that the document-level task is still promising
for the pre-training.
Authors' comments: 8 pages, 3 figures, 6 tables
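As a rough illustration of the 3-class target with label smoothing described in the abstract above, the sketch below builds a smoothed target vector. The class ordering, the value of `epsilon`, and the particular smoothing variant (spreading `epsilon` evenly over the non-target classes) are assumptions for this example; the abstract does not specify them.

```python
# Hypothetical class indices for the 3-way task; the ordering is an assumption.
NEXT_SENTENCE = 0      # NSP
PREVIOUS_SENTENCE = 1  # PSP
UNRELATED = 2          # neither


def smooth_labels(true_class, num_classes=3, epsilon=0.1):
    """Return a smoothed target distribution over `num_classes` classes.

    The true class receives 1 - epsilon; the remaining mass epsilon is
    split evenly among the other classes (one common variant, assumed here).
    """
    if not 0 <= true_class < num_classes:
        raise ValueError("true_class out of range")
    off_value = epsilon / (num_classes - 1)
    return [
        1.0 - epsilon if k == true_class else off_value
        for k in range(num_classes)
    ]


target = smooth_labels(PREVIOUS_SENTENCE)
print(target)  # approximately [0.05, 0.9, 0.05]
```

A softer target like this is one way to tolerate noisy labels, such as those arising when close but nonsuccessive sentences are treated as NSP or PSP pairs.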
F. Din-Houn Lau, Sebastian Krumscheid
Markov chain Monte Carlo (MCMC) methods are sampling methods that have become
a commonly used tool in statistics, for example to perform Monte Carlo
integration. As a consequence of the increase in computational power, many
variations of MCMC methods exist for generating samples from arbitrary,
possibly complex, target distributions. The performance of an MCMC method is
predominantly governed by the choice of the so-called proposal distribution
used. In this paper, we introduce a new type of proposal distribution for
use in MCMC methods that operates component-wise and with multiple trials per
iteration. Specifically, the proposal distributions in this novel class, called
Plateau distributions, do not overlap, thus ensuring that the multiple trials
are drawn from different regions of the state space. Furthermore, the Plateau
proposal distributions allow for a bespoke adaptation procedure that lends
itself to a Markov chain with efficient problem-dependent state space
exploration and improved burn-in properties. Simulation studies show that our
novel MCMC algorithm outperforms competitors when sampling from distributions
with a complex shape, highly correlated components or multiple modes.
Authors' comments: 24 pages, 12 figures
Kei Takemura, Shinji Ito
Combinatorial linear semi-bandits (CLS) are widely applicable frameworks of sequential decision-making, in which a learner chooses a subset of arms from a given set of arms associated with feature vectors. Existing algorithms work poorly for the clustered case, in which the feature vectors form several large clusters. This shortcoming is critical in practice because it can be found in many applications, including recommender systems. In this paper, we clarify why such a shortcoming occurs, and we introduce a key technique of arm-wise randomization to overcome it. We propose two algorithms with this technique: the perturbed C${}^2$UCB (PC${}^2$UCB) and the Thompson sampling (TS). Our empirical evaluation with artificial and real-world datasets demonstrates that the proposed algorithms with the arm-wise randomization technique outperform the existing algorithms without this technique, especially for the clustered case. Our contributions also include theoretical analyses that provide high-probability asymptotic regret bounds for our algorithms.
Xin Li, Tianwei Lin, Xiao Liu, Chuang Gan, Wangmeng Zuo, Chao Li, Xiang Long, Dongliang He et al.
Existing action localization approaches adopt shallow temporal convolutional networks (i.e., TCN) on 1D feature maps extracted from video frames. In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution. To address this issue, we introduce a novel concept-wise temporal convolution (CTC) layer as an alternative to the conventional temporal convolution layer for training deeper action localization networks. Instead of recombining latent concepts, the CTC layer applies a number of temporal filters to each concept separately, with filter parameters shared across concepts. It can thus capture common temporal patterns of different concepts and significantly enrich representation ability. By stacking CTC layers, we propose a deep concept-wise temporal convolutional network (C-TCN), which boosts the state-of-the-art action localization performance on THUMOS'14 from 42.8 to 52.1 in terms of mAP (%), achieving a relative improvement of 21.7%. Favorable results are also obtained on ActivityNet.
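The relative improvement quoted in the abstract above can be checked from the two mAP values it reports. The helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# mAP values on THUMOS'14 as stated in the abstract.
gain = relative_improvement(42.8, 52.1)
print(f"{gain:.1f}%")  # 21.7%
```

The absolute gain is 9.3 mAP points; dividing by the 42.8 baseline gives the 21.7% relative figure.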
Yaman Dang, Deepak Anand, Amit Sethi
One of the first steps in the diagnosis of most cardiac diseases, such as
pulmonary hypertension and coronary heart disease, is the segmentation of
ventricles from cardiac magnetic resonance (MRI) images. Manual segmentation of
the right ventricle requires diligence and time, while its automated
segmentation is challenging due to shape variations and ill-defined borders. We
propose a deep learning based method for the accurate segmentation of the right
ventricle, which does not require post-processing and yet achieves
state-of-the-art performance of 0.86 Dice coefficient and 6.73 mm Hausdorff
distance on the RVSC-MICCAI 2012 dataset. We use a novel adaptive cost function to
counter extreme class-imbalance in the dataset. We present a comprehensive
comparative study of loss functions, architectures, and ensembling techniques
to build a principled approach for biomedical segmentation tasks.
Authors' comments: Accepted at IEEE TENCON 2019
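For readers unfamiliar with the Dice coefficient reported in the abstract above, here is a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name, the flat-list representation, and the convention of returning 1.0 for two empty masks are assumptions of this example, not details from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a
        # convention assumed for this sketch.
        return 1.0
    return 2.0 * intersection / total


# Toy masks (illustrative only): 2 overlapping pixels, 2 + 3 foreground pixels.
pred = [1, 1, 0, 0]
truth = [1, 1, 1, 0]
print(dice_coefficient(pred, truth))  # 0.8
```

Because the score is normalised by the foreground sizes rather than the image size, it remains informative when the structure of interest occupies only a small fraction of the image.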
Trent J. Dupuy, Michael C. Liu, William M. J. Best, Andrew W. Mann, Michael A. Tucker, Zhoujian Zhang, Isabelle Baraffe, Gilles Chabrier et al.
We present individual dynamical masses for the nearby M9.5+T5.5 binary WISE
J072003.20$-$084651.2AB, a.k.a. Scholz's star. Combining high-precision
CFHT/WIRCam photocenter astrometry and Keck adaptive optics resolved imaging,
we measure the first high-quality parallactic distance ($6.80_{-0.06}^{+0.05}$
pc) and orbit ($8.06_{-0.25}^{+0.24}$ yr period) for this system composed of a
low-mass star and brown dwarf. We find a moderately eccentric orbit ($e =
0.240_{-0.010}^{+0.009}$), incompatible with previous work based on less data,
and dynamical masses of $99\pm6$ $M_{\rm Jup}$ and $66\pm4$ $M_{\rm Jup}$ for
the two components. The primary mass is marginally inconsistent (2.1$\sigma$)
with the empirical mass$-$magnitude$-$metallicity relation and models of
main-sequence stars. The relatively high mass of the cold ($T_{\rm eff} =
1250\pm40$ K) brown dwarf companion indicates an age older than a few Gyr, in
accord with age estimates for the primary star, and is consistent with our
recent estimate of $\approx$70 $M_{\rm Jup}$ for the stellar/substellar
boundary among the field population. Our improved parallax and proper motion,
as well as an orbit-corrected system velocity, improve the accuracy of the
system's close encounter with the solar system by an order of magnitude. WISE
J0720$-$0846AB passed within $68.7\pm2.0$ kAU of the Sun $80.5\pm0.7$ kyr ago,
passing through the outer Oort cloud where comets can have stable orbits.
Authors' comments: accepted to AJ
Junyu Gao, Qi Wang, Yuan Yuan
Crowd counting has recently become a hot topic in crowd analysis. Many CNN-based
counting algorithms attain good performance. However, these methods only focus
on the local appearance features of crowd scenes but ignore the large-range
pixel-wise contextual and crowd attention information. To remedy the above
problems, in this paper, we introduce the Spatial-/Channel-wise Attention
Models into the traditional Regression CNN to estimate the density map. The
resulting network is named "SCAR". It consists of two modules, namely Spatial-wise Attention
Model (SAM) and Channel-wise Attention Model (CAM). The former can encode the
pixel-wise context of the entire image to more accurately predict density maps
at the pixel level. The latter attempts to extract more discriminative features
among different channels, which helps the model pay attention to the head region,
the core of crowd scenes. Intuitively, CAM alleviates the mistaken estimation
for background regions. Finally, the two types of attention information and the
traditional CNN's feature maps are integrated by a concatenation operation.
Extensive experiments are conducted on four popular datasets:
Shanghai Tech Part A/B, GCC, and UCF_CC_50. The results show that the
proposed method achieves state-of-the-art results.
Authors' comments: accepted by Neurocomputing
Ronald Barber, Christian Garcia-Arellano, Ronen Grosman, Guy Lohman, C. Mohan, Rene Muller, Hamid Pirahesh, Vijayshankar Raman et al.
In a classic transactional distributed database management system (DBMS), write transactions invariably synchronize with a coordinator before final commitment. While enforcing serializability, this model has long been criticized for not satisfying the applications' availability requirements. With the arrival of the Internet of Things (IoT) era, this problem has become more severe, as an increasing number of applications call for the capability of hybrid transactional and analytical processing (HTAP), where aggregation constraints need to be enforced as part of transactions. Current systems work around this by creating escrows, allowing occasional overshoots of constraints, which are handled via compensating application logic. The WiSer DBMS targets consistency with availability by splitting the database commit into two steps. The first is a PROMISE step, which corresponds to what humans are used to as commitment and runs without talking to a coordinator. The second is a SERIALIZE step, which fixes transactions' positions in the serializable order via a consensus procedure. We achieve this split via a novel data representation that embeds read-sets into transaction deltas, and serialization sequence numbers into table rows. WiSer does no sharding (all nodes can run transactions that modify the entire database), and yet enforces aggregation constraints. Both read-write conflicts and aggregation constraint violations are resolved lazily in the serialized data. WiSer also covers node joins and departures as database tables, thus simplifying correctness and failure handling. We present the design of WiSer as well as experiments suggesting this approach has promise.
Erhan Bilal
Stochastic gradient descent (SGD) has been the dominant optimization method
for training deep neural networks due to its many desirable properties. One of
the more remarkable and least understood qualities of SGD is that it generalizes
relatively well on unseen data even when the neural network has millions of
parameters. We hypothesize that in certain cases it is desirable to relax its
intrinsic generalization properties and introduce an extension of SGD called
deep gradient boosting (DGB). The key idea of DGB is that back-propagated
gradients inferred using the chain rule can be viewed as pseudo-residual
targets of a gradient boosting problem. Thus at each layer of a neural network
the weight update is calculated by solving the corresponding boosting problem
using a linear base learner. The resulting weight update formula can also be
viewed as a normalization procedure of the data that arrives at each layer
during the forward pass. When implemented as a separate input normalization
layer (INN) the new architecture shows improved performance on image
recognition tasks when compared to the same architecture without normalization
layers. As opposed to batch normalization (BN), INN has no learnable parameters;
however, it matches BN's performance on CIFAR10 and ImageNet classification
tasks.
Authors' comments: Solving the pseudo-inverse with SVD and splitting this into two
separate papers. There are too many changes to just update this version
Chuanjian Liu, Yunhe Wang, Kai Han, Chunjing Xu, Chang Xu
Exploring deep convolutional neural networks of high efficiency and low
memory usage is essential for a wide variety of machine learning tasks.
Most existing approaches accelerate deep models by manipulating
parameters or filters without data, e.g., pruning and decomposition. In
contrast, we study this problem from a different perspective by respecting the
differences between data instances. An instance-wise feature pruning is developed by
identifying informative features for different instances. Specifically, by
investigating a feature decay regularization, we expect intermediate feature
maps of each instance in deep neural networks to be sparse while preserving the
overall network performance. During online inference, subtle features of input
images extracted by intermediate layers of a well-trained neural network can be
eliminated to accelerate the subsequent calculations. We further take the
coefficient of variation as a measure to select the layers that are appropriate
for acceleration. Extensive experiments conducted on benchmark datasets and
networks demonstrate the effectiveness of the proposed method.
Authors' comments: Accepted by IJCAI 2019
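The coefficient of variation mentioned in the abstract above is the ratio of standard deviation to mean. The sketch below shows only that statistic; how the authors compute it over feature maps and what threshold they use to select layers is not stated in the abstract, so the function name, the use of the population standard deviation, and the sample values are assumptions.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean of `values`."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return statistics.pstdev(values) / mean


# Illustrative activations only: mean 5, population standard deviation 2.
activations = [2, 4, 4, 4, 5, 5, 7, 9]
print(coefficient_of_variation(activations))  # 0.4
```

Because it is scale-free, this ratio allows dispersion to be compared across layers whose activations have very different magnitudes.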
Gen Li, Inyoung Yun, Jonghyun Kim, Joongkyu Kim
As a pixel-level prediction task, semantic segmentation requires a large
computational cost and an enormous number of parameters to obtain high performance.
Recently, due to the increasing demand for autonomous systems and robots, it is
important to make a tradeoff between accuracy and inference speed. In this
paper, we propose a novel Depthwise Asymmetric Bottleneck (DAB) module to
address this dilemma, which efficiently adopts depth-wise asymmetric
convolution and dilated convolution to build a bottleneck structure. Based on
the DAB module, we design a Depth-wise Asymmetric Bottleneck Network (DABNet)
especially for real-time semantic segmentation, which creates a sufficient
receptive field and densely utilizes the contextual information. Experiments on
the Cityscapes and CamVid datasets demonstrate that the proposed DABNet achieves a
balance between speed and precision. Specifically, without any pretrained model
or postprocessing, it achieves 70.1% Mean IoU on the Cityscapes test dataset
with only 0.76 million parameters and a speed of 104 FPS on a single GTX 1080Ti
card.
Authors' comments: Accepted to BMVC 2019
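To give a sense of why depth-wise asymmetric convolution is parameter-efficient, the sketch below counts weights (ignoring biases) for three layer types. The kernel size of 3, the channel count of 64, and the k x 1 followed by 1 x k factorisation are assumptions chosen for illustration; the abstract above does not give these details.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a dense k x k convolution (no bias)."""
    return c_in * c_out * k * k


def depthwise_conv_params(channels, k_h, k_w):
    """Weights in a depth-wise convolution: one k_h x k_w filter per channel."""
    return channels * k_h * k_w


def depthwise_asymmetric_params(channels, k):
    """Weights in a depth-wise k x 1 convolution followed by a depth-wise 1 x k one."""
    return depthwise_conv_params(channels, k, 1) + depthwise_conv_params(channels, 1, k)


channels, k = 64, 3  # hypothetical layer width and kernel size
print(standard_conv_params(channels, channels, k))   # 36864
print(depthwise_conv_params(channels, k, k))         # 576
print(depthwise_asymmetric_params(channels, k))      # 384
```

Under these assumptions the asymmetric depth-wise pair uses 2k weights per channel instead of k squared, on top of the saving already obtained by not mixing channels.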
Anjith George, Sebastien Marcel
Face recognition has evolved as a prominent biometric authentication
modality. However, vulnerability to presentation attacks curtails its reliable
deployment. Automatic detection of presentation attacks is essential for secure
use of face recognition technology in unattended scenarios. In this work, we
introduce a Convolutional Neural Network (CNN) based framework for presentation
attack detection, with deep pixel-wise supervision. The framework uses only
frame level information making it suitable for deployment in smart devices with
minimal computational and time overhead. We demonstrate the effectiveness of
the proposed approach on public datasets for both intra- and
cross-dataset experiments. The proposed approach achieves an HTER of 0% on the
Replay Mobile dataset and an ACER of 0.42% in Protocol-1 of the OULU dataset,
outperforming state-of-the-art methods.
Authors' comments: 8 pages, 5 figures, To appear in : International Conference on
Biometrics, ICB 2019
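The two error measures named in the abstract above are both simple averages of a pair of error rates: HTER is commonly defined as the mean of the false acceptance and false rejection rates, and ACER as the mean of APCER and BPCER. The sketch below encodes those common definitions; the function names and the input values are illustrative assumptions, not figures from the paper.

```python
def hter(far_percent, frr_percent):
    """Half Total Error Rate: mean of false acceptance and false rejection rates."""
    return (far_percent + frr_percent) / 2.0


def acer(apcer_percent, bpcer_percent):
    """Average Classification Error Rate: mean of APCER and BPCER."""
    return (apcer_percent + bpcer_percent) / 2.0


# Hypothetical error rates in percent, chosen only to show the arithmetic.
print(hter(2.0, 4.0))  # 3.0
print(acer(0.0, 1.0))  # 0.5
```

Both measures reach 0% only when neither attacks are accepted nor genuine presentations rejected at the chosen threshold.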