Weidi Xu, Xingyi Cheng, Kunlong Chen, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Luo Si et al.
The ability of semantic reasoning over the sentence pair is essential for
many natural language understanding tasks, e.g., natural language inference and
machine reading comprehension. A recent significant improvement in these tasks
comes from BERT. As reported, the next sentence prediction (NSP) in BERT, which
learns the contextual relationship between two sentences, is of great
significance for downstream problems with sentence-pair input. Despite the
effectiveness of NSP, we suggest that NSP still lacks the essential signal to
distinguish between entailment and shallow correlation. To remedy this, we
propose to augment the NSP task to a 3-class categorization task, which
includes a category for previous sentence prediction (PSP). The involvement of
PSP encourages the model to focus on the informative semantics to determine the
sentence order, thereby improving its semantic understanding. This
simple modification yields a remarkable improvement over vanilla BERT. To
further incorporate the document-level information, the scope of NSP and PSP is
expanded into a broader range, i.e., NSP and PSP also include close but
nonsuccessive sentences, the noise of which is mitigated by the label-smoothing
technique. Both qualitative and quantitative experimental results demonstrate
the effectiveness of the proposed method. Our method consistently improves the
performance on the NLI and MRC benchmarks, including the challenging HANS
dataset \cite{hans}, suggesting that document-level tasks remain promising
for pre-training.
Authors' comments: 8 pages, 3 figures, 6 tables
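
The 3-class target with label smoothing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class ids, the `pair_target` helper, and the smoothing value `eps=0.1` are assumptions introduced here. Adjacent sentence pairs receive a hard target, while close but nonsuccessive pairs receive a smoothed one.

```python
# Class ids for the 3-way task (names are illustrative, not from the paper).
NEXT, PREVIOUS, RANDOM = 0, 1, 2


def smoothed_target(label, num_classes=3, eps=0.0):
    """Return a target distribution with eps of the mass spread over other classes."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if c == label else off for c in range(num_classes)]


def pair_target(offset, eps=0.1):
    """Build the target for a sentence pair (A, B).

    offset is the position of B minus the position of A inside the same
    document, or None when B is sampled from a different document.
    eps is a hypothetical smoothing strength for nonsuccessive pairs.
    """
    if offset is None:
        return smoothed_target(RANDOM)
    if offset == 0:
        raise ValueError("A and B must be different sentences")
    label = NEXT if offset > 0 else PREVIOUS
    # Only noisy (non-adjacent) pairs are smoothed; adjacent pairs stay hard.
    smoothing = 0.0 if abs(offset) == 1 else eps
    return smoothed_target(label, eps=smoothing)
```

Calling `pair_target(1)` gives a hard "next" target, whereas `pair_target(-2)` gives a "previous" target with some mass moved to the other two classes.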
F. Din-Houn Lau, Sebastian Krumscheid
Markov chain Monte Carlo (MCMC) methods are sampling methods that have become
a commonly used tool in statistics, for example to perform Monte Carlo
integration. As a consequence of the increase in computational power, many
variations of MCMC methods exist for generating samples from arbitrary,
possibly complex, target distributions. The performance of an MCMC method is
predominantly governed by the choice of the so-called proposal distribution
used. In this paper, we introduce a new type of proposal distribution for
use in MCMC methods that operates component-wise and with multiple trials per
iteration. Specifically, the novel class of proposal distributions, called
Plateau distributions, do not overlap, thus ensuring that the multiple trials
are drawn from different regions of the state space. Furthermore, the Plateau
proposal distributions allow for a bespoke adaptation procedure that leads
to a Markov chain with efficient, problem-dependent state-space
exploration and improved burn-in properties. Simulation studies show that our
novel MCMC algorithm outperforms competitors when sampling from distributions
with a complex shape, highly correlated components or multiple modes.
Authors' comments: 24 pages, 12 figures
Kei Takemura, Shinji Ito
Combinatorial linear semi-bandits (CLS) are widely applicable frameworks of sequential decision-making, in which a learner chooses a subset of arms from a given set of arms associated with feature vectors. Existing algorithms work poorly for the clustered case, in which the feature vectors form several large clusters. This shortcoming is critical in practice because it can be found in many applications, including recommender systems. In this paper, we clarify why such a shortcoming occurs, and we introduce a key technique of arm-wise randomization to overcome it. We propose two algorithms with this technique: the perturbed C${}^2$UCB (PC${}^2$UCB) and Thompson sampling (TS). Our empirical evaluation with artificial and real-world datasets demonstrates that the proposed algorithms with the arm-wise randomization technique outperform the existing algorithms without this technique, especially for the clustered case. Our contributions also include theoretical analyses that provide high-probability asymptotic regret bounds for our algorithms.
Xin Li, Tianwei Lin, Xiao Liu, Chuang Gan, Wangmeng Zuo, Chao Li, Xiang Long, Dongliang He et al.
Existing action localization approaches adopt shallow temporal convolutional networks (i.e., TCN) on 1D feature maps extracted from video frames. In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution. To address this issue, we introduce a novel concept-wise temporal convolution (CTC) layer as an alternative to the conventional temporal convolution layer for training deeper action localization networks. Instead of recombining latent concepts, the CTC layer applies a number of temporal filters to each concept separately, with filter parameters shared across concepts. It can thus capture common temporal patterns of different concepts and significantly enrich representation ability. By stacking CTC layers, we propose a deep concept-wise temporal convolutional network (C-TCN), which boosts the state-of-the-art action localization performance on THUMOS'14 from 42.8 to 52.1 in terms of mAP (%), achieving a relative improvement of 21.7%. Favorable results are also obtained on ActivityNet.
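
A concept-wise temporal convolution, as described above, applies the same bank of temporal filters to every channel separately instead of mixing channels. The following is a minimal pure-Python sketch under stated assumptions: zero padding, odd kernel lengths, and the function names are choices made here, not details taken from the paper.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation; output has the same length as signal.

    Assumes an odd kernel length so the kernel is centred on each time step.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def concept_wise_conv(feature_map, filters):
    """Apply K shared temporal filters to each of C concepts (channels).

    feature_map: C lists of length T; filters: K kernels shared by all concepts.
    Returns a C x K x T nested list; channels are never recombined.
    """
    return [[conv1d_same(channel, f) for f in filters] for channel in feature_map]
```

With C concepts and K filters the output holds C x K temporal responses, while the number of filter weights stays independent of C because the kernels are shared.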
Yaman Dang, Deepak Anand, Amit Sethi
One of the first steps in the diagnosis of most cardiac diseases, such as
pulmonary hypertension and coronary heart disease, is the segmentation of
ventricles from cardiac magnetic resonance imaging (MRI) scans. Manual
segmentation of the right ventricle requires diligence and time, while its
automated segmentation is challenging due to shape variations and ill-defined
borders. We propose a deep learning based method for the accurate segmentation
of the right ventricle, which does not require post-processing yet achieves
state-of-the-art performance of 0.86 Dice coefficient and 6.73 mm Hausdorff
distance on the RVSC-MICCAI 2012 dataset. We use a novel adaptive cost function to
counter extreme class-imbalance in the dataset. We present a comprehensive
comparative study of loss functions, architectures, and ensembling techniques
to build a principled approach for biomedical segmentation tasks.
Authors' comments: Accepted at IEEE TENCON 2019
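
The two reported metrics can be illustrated with a short sketch. It assumes segmentations are given as sets of foreground pixel coordinates and measures the Hausdorff distance in pixel units over all foreground points; converting to millimetres would additionally need the pixel spacing, and evaluation protocols often use contour points instead. The function names are illustrative.

```python
import math


def dice(pred, truth):
    """Dice coefficient 2|A n B| / (|A| + |B|) for two sets of pixel coordinates."""
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        # Largest distance from a point in p to its nearest point in q.
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))
```

A Dice value of 1 and a Hausdorff distance of 0 both indicate identical masks; Dice measures overlap, while Hausdorff penalizes the single worst boundary error.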
Trent J. Dupuy, Michael C. Liu, William M. J. Best, Andrew W. Mann, Michael A. Tucker, Zhoujian Zhang, Isabelle Baraffe, Gilles Chabrier et al.
We present individual dynamical masses for the nearby M9.5+T5.5 binary WISE
J072003.20$-$084651.2AB, a.k.a. Scholz's star. Combining high-precision
CFHT/WIRCam photocenter astrometry and Keck adaptive optics resolved imaging,
we measure the first high-quality parallactic distance ($6.80_{-0.06}^{+0.05}$
pc) and orbit ($8.06_{-0.25}^{+0.24}$ yr period) for this system composed of a
low-mass star and brown dwarf. We find a moderately eccentric orbit ($e =
0.240_{-0.010}^{+0.009}$), incompatible with previous work based on less data,
and dynamical masses of $99\pm6$ $M_{\rm Jup}$ and $66\pm4$ $M_{\rm Jup}$ for
the two components. The primary mass is marginally inconsistent (2.1$\sigma$)
with the empirical mass$-$magnitude$-$metallicity relation and models of
main-sequence stars. The relatively high mass of the cold ($T_{\rm eff} =
1250\pm40$ K) brown dwarf companion indicates an age older than a few Gyr, in
accord with age estimates for the primary star, and is consistent with our
recent estimate of $\approx$70 $M_{\rm Jup}$ for the stellar/substellar
boundary among the field population. Our improved parallax and proper motion,
as well as an orbit-corrected system velocity, improve the accuracy of the
system's close encounter with the solar system by an order of magnitude. WISE
J0720$-$0846AB passed within $68.7\pm2.0$ kAU of the Sun $80.5\pm0.7$ kyr ago,
passing through the outer Oort cloud where comets can have stable orbits.
Authors' comments: accepted to AJ
Junyu Gao, Qi Wang, Yuan Yuan
Crowd counting has recently become a hot topic in crowd analysis. Many CNN-based
counting algorithms attain good performance. However, these methods only focus
on the local appearance features of crowd scenes but ignore the large-range
pixel-wise contextual and crowd attention information. To remedy the above
problems, in this paper, we introduce Spatial-/Channel-wise Attention
Models into the traditional Regression CNN to estimate the density map; the
resulting model is named "SCAR". It consists of two modules, namely the
Spatial-wise Attention Model (SAM) and the Channel-wise Attention Model (CAM).
The former can encode the
pixel-wise context of the entire image to more accurately predict density maps
at the pixel level. The latter attempts to extract more discriminative features
among different channels, which helps the model attend to the head region,
the core of crowd scenes. Intuitively, CAM alleviates the mistaken estimation
for background regions. Finally, the two types of attention information and
the traditional CNN's feature maps are integrated by a concatenation operation.
Extensive experiments are conducted on four popular datasets:
Shanghai Tech Part A/B, GCC, and UCF_CC_50. The results show that the
proposed method achieves state-of-the-art results.
Authors' comments: accepted by Neurocomputing
Ronald Barber, Christian Garcia-Arellano, Ronen Grosman, Guy Lohman, C. Mohan, Rene Muller, Hamid Pirahesh, Vijayshankar Raman et al.
In a classic transactional distributed database management system (DBMS), write transactions invariably synchronize with a coordinator before final commitment. While enforcing serializability, this model has long been criticized for not satisfying the applications' availability requirements. With the era of the Internet of Things (IoT), this problem has become more severe, as an increasing number of applications call for the capability of hybrid transactional and analytical processing (HTAP), where aggregation constraints need to be enforced as part of transactions. Current systems work around this by creating escrows, allowing occasional overshoots of constraints, which are handled via compensating application logic. The WiSer DBMS targets consistency with availability by splitting the database commit into two steps. The first is a PROMISE step, which corresponds to what humans are used to as commitment and runs without talking to a coordinator. The second is a SERIALIZE step, which fixes transactions' positions in the serializable order via a consensus procedure. We achieve this split via a novel data representation that embeds read-sets into transaction deltas, and serialization sequence numbers into table rows. WiSer does no sharding (all nodes can run transactions that modify the entire database), and yet enforces aggregation constraints. Both read-write conflicts and aggregation constraint violations are resolved lazily in the serialized data. WiSer also covers node joins and departures as database tables, thus simplifying correctness and failure handling. We present the design of WiSer as well as experiments suggesting this approach has promise.
Erhan Bilal
Stochastic gradient descent (SGD) has been the dominant optimization method
for training deep neural networks due to its many desirable properties. One of
the most remarkable and least understood qualities of SGD is that it generalizes
relatively well on unseen data even when the neural network has millions of
parameters. We hypothesize that in certain cases it is desirable to relax its
intrinsic generalization properties and introduce an extension of SGD called
deep gradient boosting (DGB). The key idea of DGB is that back-propagated
gradients inferred using the chain rule can be viewed as pseudo-residual
targets of a gradient boosting problem. Thus at each layer of a neural network
the weight update is calculated by solving the corresponding boosting problem
using a linear base learner. The resulting weight update formula can also be
viewed as a normalization procedure of the data that arrives at each layer
during the forward pass. When implemented as a separate input normalization
layer (INN) the new architecture shows improved performance on image
recognition tasks when compared to the same architecture without normalization
layers. Unlike batch normalization (BN), INN has no learnable parameters,
yet it matches BN's performance on CIFAR10 and ImageNet classification
tasks.
Authors' comments: Solving the pseudo-inverse with SVD and splitting this into two
separate papers. There are too many changes to just update this version
Chuanjian Liu, Yunhe Wang, Kai Han, Chunjing Xu, Chang Xu
Exploring deep convolutional neural networks of high efficiency and low
memory usage is essential for a wide variety of machine learning tasks.
Most existing approaches accelerate deep models by manipulating
parameters or filters without data, e.g., pruning and decomposition. In
contrast, we study this problem from a different perspective by accounting for
the differences between data instances. An instance-wise feature pruning method
is developed by
identifying informative features for different instances. Specifically, by
investigating a feature decay regularization, we expect intermediate feature
maps of each instance in deep neural networks to be sparse while preserving the
overall network performance. During online inference, subtle features of input
images extracted by intermediate layers of a well-trained neural network can be
eliminated to accelerate the subsequent calculations. We further take
coefficient of variation as a measure to select the layers that are appropriate
for acceleration. Extensive experiments conducted on benchmark datasets and
networks demonstrate the effectiveness of the proposed method.
Authors' comments: Accepted by IJCAI 2019
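
The two ingredients named above, eliminating subtle features at inference time and using the coefficient of variation as a selection statistic, can be sketched as follows. The abstract does not say what the statistic is computed over or how the threshold is chosen, so the inputs and the `threshold` parameter here are assumptions for illustration only.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean must be non-zero)."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean


def prune_features(features, threshold):
    """Zero out features whose magnitude falls below a (hypothetical) threshold."""
    return [f if abs(f) >= threshold else 0.0 for f in features]
```

Zeroed entries can be skipped by subsequent computations, which is where a speed-up from sparse intermediate feature maps would come from.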
Gen Li, Inyoung Yun, Jonghyun Kim, Joongkyu Kim
As a pixel-level prediction task, semantic segmentation incurs a large
computational cost and requires an enormous number of parameters to obtain high
performance. Recently, due to the increasing demand for autonomous systems and
robots, it has become important to strike a tradeoff between accuracy and
inference speed. In this
paper, we propose a novel Depthwise Asymmetric Bottleneck (DAB) module to
address this dilemma, which efficiently adopts depth-wise asymmetric
convolution and dilated convolution to build a bottleneck structure. Based on
the DAB module, we design a Depth-wise Asymmetric Bottleneck Network (DABNet)
especially for real-time semantic segmentation, which creates sufficient
receptive field and densely utilizes the contextual information. Experiments on
Cityscapes and CamVid datasets demonstrate that the proposed DABNet achieves a
balance between speed and precision. Specifically, without any pretrained model
and postprocessing, it achieves 70.1% Mean IoU on the Cityscapes test dataset
with only 0.76 million parameters and a speed of 104 FPS on a single GTX 1080Ti
card.
Authors' comments: Accepted to BMVC 2019
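
To see why depth-wise asymmetric and dilated convolutions are cheap, the weight counts can be compared directly. This sketch ignores biases and assumes a k x 1 depth-wise convolution followed by a 1 x k depth-wise convolution; the exact layout of the DAB module (for example, any point-wise layers) is not given in the abstract and is not modelled here.

```python
def standard_conv_params(channels, k=3):
    """Weights in a standard k x k convolution with equal input/output channels."""
    return k * k * channels * channels


def depthwise_asymmetric_params(channels, k=3):
    """Weights in a k x 1 depth-wise conv followed by a 1 x k depth-wise conv."""
    return 2 * k * channels


def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)
```

For 64 channels and k = 3, the standard convolution needs 36864 weights and the asymmetric depth-wise pair 384, while dilation enlarges the receptive field without adding any weights.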
Anjith George, Sebastien Marcel
Face recognition has evolved as a prominent biometric authentication
modality. However, vulnerability to presentation attacks curtails its reliable
deployment. Automatic detection of presentation attacks is essential for secure
use of face recognition technology in unattended scenarios. In this work, we
introduce a Convolutional Neural Network (CNN) based framework for presentation
attack detection, with deep pixel-wise supervision. The framework uses only
frame level information making it suitable for deployment in smart devices with
minimal computational and time overhead. We demonstrate the effectiveness of
the proposed approach on public datasets in both intra-dataset and
cross-dataset experiments. The proposed approach achieves an HTER of 0% on the
Replay Mobile dataset and an ACER of 0.42% on Protocol-1 of the OULU dataset,
outperforming state-of-the-art methods.
Authors' comments: 8 pages, 5 figures, to appear in: International Conference on
Biometrics, ICB 2019
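
The two error measures quoted above are averages of two error rates: HTER is the mean of the false acceptance and false rejection rates, and ACER is the mean of APCER and BPCER. The sketch below computes them from scores at a fixed threshold; the label convention (1 for bona fide, 0 for attack) and the function names are assumptions made here, and dataset protocols may define APCER per attack type before averaging.

```python
def error_rates(scores, labels, threshold):
    """Return (FAR, FRR) when samples with score >= threshold are accepted.

    labels: 1 for bona fide presentations, 0 for attacks.
    """
    attacks = [s for s, l in zip(scores, labels) if l == 0]
    bona_fide = [s for s, l in zip(scores, labels) if l == 1]
    far = sum(s >= threshold for s in attacks) / len(attacks)
    frr = sum(s < threshold for s in bona_fide) / len(bona_fide)
    return far, frr


def hter(far, frr):
    """Half total error rate: mean of false acceptance and false rejection rates."""
    return (far + frr) / 2.0


def acer(apcer, bpcer):
    """Average classification error rate: mean of APCER and BPCER."""
    return (apcer + bpcer) / 2.0
```

An HTER of 0% therefore means that, at the chosen threshold, no attack was accepted and no bona fide sample was rejected on the evaluation set.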
Hongyang Gao, Shuiwang Ji
Attention operators have been widely applied in various fields, including
computer vision, natural language processing, and network embedding learning.
Attention operators on graph data enable learnable weights when aggregating
information from neighboring nodes. However, graph attention operators (GAOs)
consume excessive computational resources, preventing their applications on
large graphs. In addition, GAOs belong to the family of soft attention, instead
of hard attention, which has been shown to yield better performance. In this
work, we propose a novel hard graph attention operator (hGAO) and a channel-wise
graph attention operator (cGAO). hGAO uses the hard attention mechanism by
attending to only important nodes. Compared to GAO, hGAO improves performance
and saves computational cost by only attending to important nodes. To further
reduce the requirements on computational resources, we propose the cGAO that
performs attention operations along channels. cGAO avoids the dependency on the
adjacency matrix, leading to dramatic reductions in computational resource
requirements. Experimental results demonstrate that our proposed deep models
with the new operators achieve consistently better performance. Comparison
results also indicate that hGAO achieves significantly better performance than
GAO on both node and graph embedding tasks. Efficiency comparison shows that
our cGAO leads to dramatic savings in computational resources, making it
applicable to large graphs.
Authors' comments: 9 pages, KDD19
Marc Brockschmidt
This paper presents a new Graph Neural Network (GNN) type using feature-wise
linear modulation (FiLM). Many standard GNN variants propagate information
along the edges of a graph by computing "messages" based only on the
representation of the source of each edge. In GNN-FiLM, the representation of
the target node of an edge is additionally used to compute a transformation
that can be applied to all incoming messages, allowing feature-wise modulation
of the passed information.
Results of experiments comparing different GNN architectures on three tasks
from the literature are presented, based on re-implementations of baseline
methods. Hyperparameters for all methods were found using extensive search,
yielding somewhat surprising results: differences between baseline models are
smaller than reported in the literature. Nonetheless, GNN-FiLM outperforms
baseline methods on a regression task on molecular graphs and performs
competitively on other tasks.
Authors' comments: As published in ICML 2020 proceedings
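
The message-passing idea described above can be sketched in a few lines: the message is computed from the source node only, and an element-wise scale and shift are computed from the target node. This is a simplified single-edge-type sketch; the weight matrices `w`, `g_gamma`, `g_beta`, the ReLU nonlinearity, and the plain sum aggregation are assumptions for illustration rather than the paper's exact layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def film_layer(h, edges, w, g_gamma, g_beta):
    """One propagation step with feature-wise linear modulation.

    h: dict mapping node -> feature list; edges: list of (source, target).
    """
    dim = len(w)
    new_h = {v: [0.0] * dim for v in h}
    for u, v in edges:
        message = matvec(w, h[u])      # depends on the source node only
        gamma = matvec(g_gamma, h[v])  # scale, computed from the target node
        beta = matvec(g_beta, h[v])    # shift, computed from the target node
        for i in range(dim):
            new_h[v][i] += max(0.0, gamma[i] * message[i] + beta[i])
    return new_h
```

Because `gamma` and `beta` depend only on the target node, the same modulation is applied to every message arriving at that node.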
Grégoire Clarté, Christian P. Robert, Robin Ryder, Julien Stoehr
Approximate Bayesian computation methods are useful for generative models
with intractable likelihoods. These methods are however sensitive to the
dimension of the parameter space, requiring exponentially increasing resources
as this dimension grows. To tackle this difficulty, we explore a Gibbs version
of the ABC approach that runs component-wise approximate Bayesian computation
steps aimed at the corresponding conditional posterior distributions, and based
on summary statistics of reduced dimensions. While lacking the standard
justifications for the Gibbs sampler, the resulting Markov chain is shown to
converge in distribution under some partial independence conditions. The
associated stationary distribution can further be shown to be close to the true
posterior distribution and some hierarchical versions of the proposed mechanism
enjoy a closed form limiting distribution. Experiments also demonstrate the
gain in efficiency brought by the Gibbs version over the standard solution.
Authors' comments: 28 pages, 13 figures, third revision (accepted for publication in
Biometrika on 17 September, 2020)
Andrea Testa, Francesco Farina, Giuseppe Notarstefano
In this paper we deal with a network of computing agents with local processing and neighboring communication capabilities that aim at solving (without any central unit) a submodular optimization problem. The cost function is the sum of many local submodular functions and each agent in the network has access to one function in the sum only. In this distributed set-up, in order to preserve their own privacy, agents communicate with neighbors but do not share their local cost functions. We propose a distributed algorithm in which agents resort to the Lovász extension of their local submodular functions and perform local updates and communications in terms of single blocks of the entire optimization variable. Updates are performed by means of a greedy algorithm which is run only until the selected block is computed, thus resulting in a reduced computational burden. The proposed algorithm is shown to converge in expected value to the optimal cost of the problem, and an approximate solution to the submodular problem is retrieved by a thresholding operation. As an application, we consider a distributed image segmentation problem in which each agent has access only to a portion of the entire image. While agents cannot segment the entire image on their own, they correctly complete the task by cooperating through the proposed distributed algorithm.
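
The greedy evaluation of the Lovász extension and the final thresholding step can be sketched as below. This is the standard centralized greedy pass for a set function with F(empty set) = 0, shown for a single agent; the block-wise early stopping and the distributed updates of the proposed algorithm are not reproduced, and the function names are illustrative.

```python
def lovasz_extension(F, x):
    """Evaluate the Lovász extension of F at x and return (value, subgradient).

    F maps a frozenset of indices to a number, with F(frozenset()) == 0.
    """
    order = sorted(range(len(x)), key=lambda i: -x[i])  # largest entry first
    subgradient = [0.0] * len(x)
    value = 0.0
    chosen = set()
    previous = F(frozenset())
    for i in order:
        chosen.add(i)
        current = F(frozenset(chosen))
        subgradient[i] = current - previous  # marginal gain of adding i
        value += x[i] * subgradient[i]
        previous = current
    return value, subgradient


def threshold_set(x, tau=0.5):
    """Recover a set from a relaxed solution by thresholding its entries."""
    return {i for i, xi in enumerate(x) if xi > tau}
```

For the cut function of a single edge between elements 0 and 1, the extension evaluates to |x0 - x1|, which matches F on indicator vectors.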
Cao Vien Phung, Jasenka Dizdarevic, Admela Jukan
CoAP (Constrained Application Protocol) with block-wise transfer (BWT) option
is a known protocol choice for large data transfer in general lossy IoT network
environments. Lossy transmission environments on the other hand lead to CoAP
resending multiple blocks, which creates overheads. To tackle this problem, we
design a BWT with network coding (NC), with the goal of reducing the number of
unnecessary retransmissions. The results show a reduction in the number of
block retransmissions for different values of blocksize, implying a reduced
transfer time. For the maximum blocksize of 1024 bytes and total probability
loss of 0.5, CoAP with NC can resend up to 5 times fewer blocks.
Authors' comments: 4 pages, 2 figures, submitted to Euro-Par 2019
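
The abstract does not specify the coding scheme, so the following only illustrates the general idea of network coding with the simplest possible code: one XOR-coded block sent alongside a group of equal-sized blocks lets the receiver rebuild any single lost block without a retransmission. The function names and the grouping are assumptions made here.

```python
def xor_blocks(blocks):
    """XOR a list of byte blocks together (shorter blocks are zero-padded)."""
    size = max(len(b) for b in blocks)
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover_missing(received, coded):
    """Rebuild the single missing block from the received blocks and the coded block."""
    return xor_blocks(list(received) + [coded])
```

If blocks 1 and 3 of a three-block group arrive together with the coded block, XOR-ing all three yields block 2, since every received block cancels itself out.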
Sathyaprakash Narayanan, Yeshwanth Bethi, Chetan Singh Thakur
A vast amount of video data is generated every minute, for purposes ranging from surveillance to broadcasting. Two roadblocks restrain us from using this data as is. The first is storage: hardware constraints limit how much information can be stored. The second is computation: processing this data is highly expensive, which makes it infeasible to work on. Compressive sensing (CS) [2] is a signal processing technique [11] in which, through optimization, the sparsity of a signal is exploited to recover it from far fewer samples than required by the Shannon-Nyquist sampling theorem. There are two conditions under which recovery is possible. The first is sparsity, which requires the signal to be sparse in some domain. The second is incoherence, which is applied through the isometric property, which is sufficient for sparse signals [9][10]. To sustain these characteristics, preserving all attributes in the uncompressed domain would help any kind of work in this field. However, existing datasets fall short in terms of continuous tracking of all the objects present in the scene; very few video datasets have comprehensive continuous tracking of objects. To address these problems collectively, in this work we propose a new comprehensive video dataset in which the data is compressed using pixel-wise coded exposure [3], which resolves various other impediments.
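
Pixel-wise coded exposure can be pictured as collapsing several video frames into one coded image, where a per-pixel binary mask decides during which frames each pixel integrates light. The sketch below assumes this simple sum-of-masked-frames model and uses nested lists for the frames; the mask pattern and the function name are illustrative and not taken from the dataset description.

```python
def coded_exposure(video, masks):
    """Compress T frames into one coded image.

    video, masks: T x H x W nested lists; masks hold 0/1 exposure codes.
    Each output pixel sums the frames during which its mask is 1.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    coded = [[0.0] * width for _ in range(height)]
    for t in range(frames):
        for y in range(height):
            for x in range(width):
                coded[y][x] += masks[t][y][x] * video[t][y][x]
    return coded
```

The T-fold reduction in stored samples is what makes a compressive-sensing reconstruction step necessary to recover the individual frames.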
Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. Humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid such pixel jittering problems and to encourage the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thorough experiments on several datasets and real-world samples demonstrate that our method obtains significantly better results than the state-of-the-art methods in both quantitative and qualitative comparisons.
Kui Jia, Jiehong Lin, Mingkui Tan, Dacheng Tao
Many machine learning problems are concerned with discovering or associating common patterns in data of multiple views or modalities. Multi-view learning is one of the methods to achieve such goals. Recent methods propose deep multi-view networks via adaptation of generic Deep Neural Networks (DNNs), which concatenate features of individual views at intermediate network layers (i.e., fusion layers). In this work, we study the problem of multi-view learning in such end-to-end networks. We take a regularization approach via multi-view learning criteria, and propose a novel, effective, and efficient neuron-wise correlation-maximizing regularizer. We implement our proposed regularizers collectively as a correlation-regularized network layer (CorrReg). CorrReg can be applied to either fully-connected or convolutional fusion layers, simply by replacing them with their CorrReg counterparts. By partitioning neurons of a hidden layer in generic DNNs into multiple subsets, we also consider a multi-view feature learning perspective of generic DNNs. Such a perspective enables us to study deep multi-view learning in the context of regularized network training, for which we present control experiments on benchmark image classification to show the efficacy of our proposed CorrReg. To investigate how CorrReg is useful for practical multi-view learning problems, we conduct experiments on RGB-D object/scene recognition and multi-view based 3D object recognition, using networks with fusion layers that concatenate intermediate features of individual modalities or views for subsequent classification. Applying CorrReg to fusion layers of these networks consistently improves classification performance. In particular, we achieve the new state of the art on the benchmark RGB-D object and RGB-D scene datasets. We make the implementation of CorrReg publicly available.