Penghui Fu, Zhiqiang Tan
For multivariate nonparametric regression, doubly penalized ANOVA modeling (DPAM) has recently been proposed, using hierarchical total variations (HTVs) and empirical norms as penalties on the component functions such as main effects and multi-way interactions in a functional ANOVA decomposition of the underlying regression function. The two penalties play complementary roles: the HTV penalty promotes sparsity in the selection of basis functions within each component function, whereas the empirical-norm penalty promotes sparsity in the selection of component functions. We adopt backfitting or block minimization for training DPAM, and develop two suitable primal-dual algorithms, including both batch and stochastic versions, for updating each component function in single-block optimization. Existing applications of primal-dual algorithms are intractable in our setting with both HTV and empirical-norm penalties. Through extensive numerical experiments, we demonstrate the validity and advantage of our stochastic primal-dual algorithms, compared with their batch versions and a previous active-set algorithm, in large-scale scenarios.
Suchetana Sadhukhan, Poulomi Sadhukhan
This paper, for the first time, focuses on the sector-wise analysis of a
stock market through multifractal analysis. We have considered the Bombay Stock
Exchange, India, and identified two time scales for investment: short ($<200$
days) and long ($>200$ days). We infer that long-term investment
will be more profitable. For the long time scale, sectors can be separated into two
categories based on the Hurst exponent values; one corresponds to stable
sectors with small fluctuations, and the other with dominance of large
fluctuations leading to possible downturns in those sectors.
Authors' comments: 15 pages, 3 figures, 2 tables
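As a rough illustration of the quantity used to separate the sectors, the sketch below estimates a single (monofractal) Hurst exponent from the scaling of lagged differences, std(x[t+lag] - x[t]) ~ lag^H. This is not the multifractal procedure of the paper; the function name `hurst_exponent` and the `max_lag` default are assumptions made for the example.

```python
import numpy as np

def hurst_exponent(series, max_lag=100):
    """Estimate H from the slope of log(std of lagged differences) vs log(lag).

    Assumes len(series) is much larger than max_lag (illustrative choice).
    """
    x = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    # Spread of the increments at each lag
    spreads = np.array([np.std(x[lag:] - x[:-lag]) for lag in lags])
    # Linear fit in log-log space; the slope is the Hurst estimate
    slope, _intercept = np.polyfit(np.log(lags), np.log(spreads), 1)
    return slope
```

Under this estimator a plain random walk gives a value near 0.5, values above 0.5 indicate persistent behaviour, and values below 0.5 indicate anti-persistent behaviour. It would typically be applied to a log-price series of one sector index.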
Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki
We propose Dense FixMatch, a simple method for online semi-supervised learning of dense and structured prediction tasks combining pseudo-labeling and consistency regularization via strong data augmentation. We enable the application of FixMatch in semi-supervised learning problems beyond image classification by adding a matching operation on the pseudo-labels. This allows us to still use the full strength of data augmentation pipelines, including geometric transformations. We evaluate it on semi-supervised semantic segmentation on Cityscapes and Pascal VOC with different percentages of labeled data and ablate design choices and hyper-parameters. Dense FixMatch significantly improves results compared to supervised learning using only labeled data, approaching its performance with 1/4 of the labeled samples.
Marco Landt-Hayen, Peer Kröger, Martin Claus, Willi Rath
Artificial neural networks (ANNs) are known to be powerful methods for many
hard problems (e.g. image classification, speech recognition or time series
prediction). However, these models tend to produce black-box results and are
often difficult to interpret. Layer-wise relevance propagation (LRP) is a
widely used technique to understand how ANN models come to their conclusion and
to understand what a model has learned. Here, we focus on Echo State Networks
(ESNs) as a certain type of recurrent neural networks, also known as reservoir
computing. ESNs are easy to train and only require a small number of trainable
parameters, but are still black-box models. We show how LRP can be applied to
ESNs in order to open the black box. We also show how ESNs can be used not only
for time series prediction but also for image classification: Our ESN model
serves as a detector for El Niño Southern Oscillation (ENSO) from sea surface
temperature anomalies. ENSO is actually a well-known problem and has been
extensively discussed before. But here we use this simple problem to
demonstrate how LRP can significantly enhance the explainability of ESNs.
Authors' comments: Shortened title, corrected author affiliation, added citation
reference: Accepted at 3rd International Conference on Machine Learning
Techniques (MLTEC 2022), Zurich, Switzerland
Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky
Self-supervised learning of speech representations from large amounts of
unlabeled data has enabled state-of-the-art results in several speech
processing tasks. Aggregating these speech representations across time is
typically approached by using descriptive statistics, and in particular, using
the first- and second-order statistics of representation coefficients. In this
paper, we examine an alternative way of extracting speaker and emotion
information from self-supervised trained models, based on the correlations
between the coefficients of the representations - correlation pooling. We show
improvements over mean pooling and further gains when the pooling methods are
combined via fusion. The code is available at
github.com/Lamomal/s3prl_correlation.
Authors' comments: Accepted at IEEE-SLT 2022
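The pooling idea described in this abstract can be sketched as follows: given frame-level representations of shape (time, dim), mean pooling averages over time, while correlation pooling computes the Pearson correlation between every pair of coefficients across time and keeps the upper triangle as the utterance-level feature. This is a minimal sketch under assumptions; the function names, the `eps` stabiliser, and the absence of any dimensionality reduction before pooling are choices made for the example and are not taken from the paper or its repository.

```python
import numpy as np

def mean_pooling(frames):
    """First-order statistic: average each coefficient over time."""
    return frames.mean(axis=0)

def correlation_pooling(frames, eps=1e-8):
    """Vector of pairwise correlations between coefficients (assumed form).

    frames: array of shape (time, dim). Returns dim * (dim - 1) / 2 values.
    """
    centered = frames - frames.mean(axis=0, keepdims=True)
    std = centered.std(axis=0, keepdims=True)
    normalized = centered / (std + eps)  # eps avoids division by zero
    corr = normalized.T @ normalized / frames.shape[0]  # (dim, dim) matrix
    # The matrix is symmetric with a unit diagonal, so keep only the
    # strictly upper triangle.
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return corr[rows, cols]
```

One simple way to combine both views is to concatenate the two pooled vectors before the classifier. Because the output size grows quadratically with the representation dimension, a projection to a smaller dimension before pooling is a common practical step.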
Zhiyuan Zhang, Qi Su, Xu Sun
Despite the potential of federated learning, it is known to be vulnerable to
backdoor attacks. Many robust federated aggregation methods have been proposed to
reduce the potential backdoor risk. However, they are mainly validated in the
CV field. In this paper, we find that NLP backdoors are harder to defend against
than CV backdoors, and we provide a theoretical analysis that the malicious update
detection error probabilities are determined by the relative backdoor
strengths. NLP attacks tend to have small relative backdoor strengths, which
may result in the failure of robust federated aggregation methods for NLP
attacks. Inspired by the theoretical results, we can choose some dimensions
with higher backdoor strengths to settle this issue. We propose a novel
federated aggregation algorithm, Dim-Krum, for NLP tasks, and experimental
results validate its effectiveness.
Authors' comments: Accepted by Findings of EMNLP 2022
Alexandre Martin
We show that two-dimensional Artin groups satisfy a strengthening of the Tits
alternative: their subgroups either contain a non-abelian free group or are
virtually free abelian of rank at most $2$.
When in addition the associated Coxeter group is hyperbolic, we answer in the
affirmative a question of Wise on the subgroups generated by large powers of
two elements: given any two elements $a, b$ of a two-dimensional Artin group of
hyperbolic type, there exists an integer $n\geq 1$ such that $a^n$ and $b^n$
either commute or generate a non-abelian free subgroup.
Authors' comments: 24 pages, 7 figures. Final version accepted for publication
Alex J. Chan, Mihaela van der Schaar
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data - instead given access to a set of expert models and their predictions alongside some limited information about the dataset used to train them. In scenarios from finance to the medical sciences, and even consumer practice, stakeholders have developed models on private data they either cannot, or do not want to, share. Given the value and legislation surrounding personal information, it is not surprising that only the models, and not the data, will be released - the pertinent question becoming: how best to use these models? Previous work has focused on global model selection or ensembling, with the result of a single final model across the feature space. Machine learning models perform notoriously poorly on data outside their training domain however, and so we argue that when ensembling models the weightings for individual instances must reflect their respective domains - in other words models that are more likely to have seen information on that instance should have more attention paid to them. We introduce a method for such an instance-wise ensembling of models, including a novel representation learning step for handling sparse high-dimensional domains. Finally, we demonstrate the need and generalisability of our method on classical machine learning tasks as well as highlighting a real world use case in the pharmacological setting of vancomycin precision dosing.
Huiyang Shao, Qianqian Xu, Zhiyong Yang, Shilong Bao, Qingming Huang
The Partial Area Under the ROC Curve (PAUC), typically including One-way
Partial AUC (OPAUC) and Two-way Partial AUC (TPAUC), measures the average
performance of a binary classifier within a specific false positive rate and/or
true positive rate interval, which is a widely adopted measure when decision
constraints must be considered. Consequently, PAUC optimization has naturally
attracted increasing attention in the machine learning community within the
last few years. Nonetheless, most of the existing methods could only optimize
PAUC approximately, leading to inevitable biases that are not controllable.
Fortunately, a recent work presents an unbiased formulation of the PAUC
optimization problem via distributionally robust optimization. However, it is
based on the pair-wise formulation of AUC, which suffers from the limited
scalability w.r.t. sample size and a slow convergence rate, especially for
TPAUC. To address this issue, we present a simpler reformulation of the problem
in an asymptotically unbiased and instance-wise manner. For both OPAUC and
TPAUC, we come to a nonconvex strongly concave minimax regularized problem of
instance-wise functions. On top of this, we employ an efficient solver that enjoys a
linear per-iteration computational complexity w.r.t. the sample size and a
time-complexity of $O(\epsilon^{-1/3})$ to reach an $\epsilon$-stationary point.
Furthermore, we find that the minimax reformulation also facilitates the
theoretical analysis of generalization error as a byproduct. Compared with the
existing results, we present new error bounds that are much easier to prove and
could deal with hypotheses with real-valued outputs. Finally, extensive
experiments on several benchmark datasets demonstrate the effectiveness of our
method.
Authors' comments: NeurIPS 2022
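For context, the pair-wise view of one-way partial AUC that the abstract contrasts with can be sketched as below: restrict attention to the highest-scoring negatives that fall within the false positive rate budget, then count how often a positive outranks one of them. This is only the plain empirical estimator, not the instance-wise minimax reformulation proposed in the paper; the function name, the normalisation to [0, 1], and the strict-inequality tie handling are assumptions.

```python
import numpy as np

def one_way_partial_auc(pos_scores, neg_scores, max_fpr):
    """Empirical OPAUC over FPR in [0, max_fpr], normalised to [0, 1]."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    k = int(np.floor(max_fpr * neg.size))  # negatives inside the FPR budget
    if k < 1:
        raise ValueError("max_fpr is too small for the number of negatives")
    hardest_neg = np.sort(neg)[::-1][:k]  # top-k scored negatives
    # Pairwise comparison: this matrix has len(pos) * k entries, which is
    # the scalability issue of the pair-wise formulation.
    correct = pos[:, None] > hardest_neg[None, :]
    return correct.mean()
```

With `max_fpr=1.0` the function reduces to the ordinary empirical AUC (ignoring ties).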
Misbah Shafi, Rakesh Kumar Jha, Sanjeev Jain
The advancement in wireless communication technologies is becoming more
demanding and pervasive. One of the fundamental factors limiting the
efficiency of the network is its security challenges. The communication
network is vulnerable to security attacks such as spoofing attacks and signal
strength attacks. Intrusion detection signifies a central approach to ensuring
the security of the communication network. In this paper, an Intrusion
Detection System based on the framework of graph theory is proposed. A
Layerwise Graph Theory-Based Intrusion Detection System (LGTBIDS) algorithm is
designed to detect the attacked node. The algorithm performs the layer-wise
analysis to extract the vulnerable nodes and ultimately the attacked node(s).
For each layer, every node is scanned for the possibility of susceptible
node(s). The strategy of the IDS is based on the analysis of energy efficiency
and secrecy rate. The nodes with the energy efficiency and secrecy rate beyond
the range of upper and lower thresholds are detected as the nodes under attack.
Further, detected node(s) are transmitted with a random sequence of bits
followed by the process of re-authentication. The obtained results show
better performance, low computation time, and low complexity. Finally, the
proposed approach is compared with the conventional solution of intrusion
detection.
Authors' comments: in IEEE Transactions on Network and Service Management, 2022
Valeri V. Makarov, Nathan J. Secrest
Making use of strong correlations between closely separated multiple or
double sources and photometric and astrometric metadata in Gaia EDR3, we
generate a catalog of candidate double and multiply imaged lensed quasars and
AGNs, comprising 3140 systems. It includes two partially overlapping parts, a
sample of distant (redshifts mostly greater than 1) sources with perturbed
data, and systems resolved into separate components by Gaia at separations less
than $2\arcsec$. For the first part, which is roughly one third of the
published catalog, we synthesized 0.617 million redshifts by multiple machine
learning prediction and classification methods, using independent photometric
and astrometric data from Gaia EDR3 and WISE with accurate spectroscopic
redshifts from SDSS as a training set. Using these synthetic redshifts, we
estimate a rate of 4.9\% of interlopers with spectroscopic redshift below 1 in
this part of the catalog. Unresolved candidate double and dual AGNs and quasars
are selected as sources with marginally high BP/RP excess factor
(phot_bp_rp_excess_factor), which is sensitive to source extent, limiting our
search to high-redshift quasars. For the second part of the catalog, additional
filters on measured parallax and near-neighbor statistics are applied to
diminish the propagation of remaining stellar contaminants. The estimated rate
of positives (double or multiple sources) is 98\%, and the estimated rate of
dual (physically related quasars) is greater than 54\%. A few dozen
serendipitously found objects of interest are discussed in more detail,
including known and new lensed images, planetary nebulae and young infrared
stars of peculiar morphology, and quasars with catastrophic redshift errors in
SDSS.
Authors' comments: Accepted in ApJS
Skander Karkar, Ibrahim Ayed, Emmanuel de Bézenac, Patrick Gallinari
End-to-end backpropagation has a few shortcomings: it requires loading the
entire model during training, which can be impossible in constrained settings,
and suffers from three locking problems (forward locking, update locking and
backward locking), which prohibit training the layers in parallel. Solving
layer-wise optimization problems can address these problems and has been used
in on-device training of neural networks. We develop a layer-wise training
method, particularly well-adapted to ResNets, inspired by the minimizing
movement scheme for gradient flows in distribution space. The method amounts to
a kinetic energy regularization of each block that makes the blocks optimal
transport maps and endows them with regularity. It works by alleviating the
stagnation problem observed in layer-wise training, whereby greedily-trained
early layers overfit and deeper layers stop increasing test accuracy after a
certain depth. We show on classification tasks that the test accuracy of
block-wise trained ResNets is improved when using our method, whether the
blocks are trained sequentially or in parallel.
Authors' comments: 1st International Workshop on Practical Deep Learning in the Wild at
AAAI 2022
Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao
Layer-wise distillation is a powerful tool to compress large models (i.e.,
teacher models) into small ones (i.e., student models). The student distills
knowledge from the teacher by mimicking the hidden representations of the
teacher at every intermediate layer. However, layer-wise distillation is
difficult. Since the student has a smaller model capacity than the teacher, it
is often under-fitted. Furthermore, the hidden representations of the teacher
contain redundant information that the student does not necessarily need for
the target task's learning. To address these challenges, we propose a novel
Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to
align the hidden representations of the student and the teacher at each layer.
The filters select the knowledge that is useful for the target task from the
hidden representations. As such, TED reduces the knowledge gap between the two
models and helps the student to fit better on the target task. We evaluate TED
in two scenarios: continual pre-training and fine-tuning. TED demonstrates
significant and consistent improvements over existing distillation methods in
both scenarios. Code is available at
https://github.com/cliang1453/task-aware-distillation.
Authors' comments: Proceedings of ICML 2023
Maaike M. Galama, Hao Wu, Andreas Krämer, Mohsen Sadeghi, Frank Noé
The dynamics of molecules are governed by rare event transitions between long-lived (metastable) states. To explore these transitions efficiently, many enhanced sampling protocols have been introduced that involve using simulations with biases or changed temperatures. Two established statistically optimal estimators for obtaining unbiased equilibrium properties from such simulations are the multistate Bennett Acceptance Ratio (MBAR) and the transition-based reweighting analysis method (TRAM). Both MBAR and TRAM are solved iteratively and can suffer from long convergence times. Here we introduce stochastic approximators (SA) for both estimators, resulting in SAMBAR and SATRAM, which are shown to converge faster than their deterministic counterparts, without significant accuracy loss. Both methods are demonstrated on different molecular systems.
Muhammad ElNokrashy, Badr AlKhamissi, Mona Diab
Language Models pretrained on large textual data have been shown to encode
different types of knowledge simultaneously. Traditionally, only the features
from the last layer are used when adapting to new tasks or data. We put forward
that, when using or finetuning deep pretrained models, intermediate layer
features that may be relevant to the downstream task are buried too deep to be
used efficiently in terms of needed samples or steps. To test this, we propose
a new layer fusion method: Depth-Wise Attention (DWAtt), to help re-surface
signals from non-final layers. We compare DWAtt to a basic concatenation-based
layer fusion method (Concat), and compare both to a deeper model baseline --
all kept within a similar parameter budget. Our findings show that DWAtt and
Concat are more step- and sample-efficient than the baseline, especially in the
few-shot setting. DWAtt outperforms Concat on larger data sizes. On CoNLL-03
NER, layer fusion shows 3.68-9.73% F1 gain at different few-shot sizes. The
layer fusion models presented significantly outperform the baseline in various
training scenarios with different data sizes, architectures, and training
constraints.
Authors' comments: 7 pages, 7 figures
Renzo Andri, Beatrice Bussolino, Antonio Cipolletta, Lukas Cavigelli, Zhe Wang
Most of today's computer vision pipelines are built around deep neural
networks, where convolution operations require most of the generally high
compute effort. The Winograd convolution algorithm computes convolutions with
fewer MACs compared to the standard algorithm, reducing the operation count by
a factor of 2.25x for 3x3 convolutions when using the version with 2x2-sized
tiles $F_2$. Even though the gain is significant, the Winograd algorithm with
larger tile sizes, i.e., $F_4$, offers even more potential in improving
throughput and energy efficiency, as it reduces the required MACs by 4x.
Unfortunately, the Winograd algorithm with larger tile sizes introduces
numerical issues that prevent its use on integer domain-specific accelerators
and higher computational overhead to transform input and output data between
spatial and Winograd domains.
To unlock the full potential of Winograd $F_4$, we propose a novel tap-wise
quantization method that overcomes the numerical issues of using larger tiles,
enabling integer-only inference. Moreover, we present custom hardware units
that process the Winograd transformations in a power- and area-efficient way,
and we show how to integrate such custom modules in an industrial-grade,
programmable DSA. An extensive experimental evaluation on a large set of
state-of-the-art computer vision benchmarks reveals that the tap-wise
quantization algorithm makes the quantized Winograd $F_4$ network almost as
accurate as the FP32 baseline. The Winograd-enhanced DSA achieves up to 1.85x
gain in energy efficiency and up to 1.83x end-to-end speed-up for
state-of-the-art segmentation and detection networks.
Authors' comments: Accepted at IEEE/ACM MICRO 2022 (1-5 October 2022)
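The 2.25x and 4x figures quoted in this abstract follow from a simple count, sketched below. For an output tile of m x m pixels and an r x r kernel, direct convolution needs m^2 * r^2 multiplications per tile, whereas the Winograd algorithm needs one multiplication per element of the (m + r - 1) x (m + r - 1) transformed tile. The function names are illustrative, and the count deliberately ignores the cost of the input, weight, and output transforms, which is the overhead the abstract mentions.

```python
def direct_macs(out_tile, kernel):
    """Multiplications for one out_tile x out_tile output tile, direct method."""
    return (out_tile ** 2) * (kernel ** 2)

def winograd_multiplies(out_tile, kernel):
    """Element-wise multiplications in the Winograd domain for the same tile."""
    return (out_tile + kernel - 1) ** 2

def reduction_factor(out_tile, kernel=3):
    """Ratio of direct to Winograd multiplication counts."""
    return direct_macs(out_tile, kernel) / winograd_multiplies(out_tile, kernel)

print(reduction_factor(2))  # F_2 with 3x3 kernels: 36 / 16 = 2.25
print(reduction_factor(4))  # F_4 with 3x3 kernels: 144 / 36 = 4.0
```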
Shoichiro Mizukoshi, Takeo Minezaki, Shoichi Tsunetsugu, Atsuhiro Yoshida, Hiroaki Sameshima, Mitsuru Kokubo, Hirofumi Noda
We present the measurement of the line-of-sight extinction of the dusty torus
for a large number of obscured active galactic nuclei (AGNs) based on the
reddening of the colour of the variable flux component in near-infrared (NIR)
wavelengths. We collected long-term monitoring data by the Wide-field
Infrared Survey Explorer (WISE) for 513 local AGNs catalogued by the
Swift/BAT AGN Spectroscopic Survey (BASS) and found that the
multi-epoch NIR flux data in two different bands (WISE $W1$ and $W2$) are
tightly correlated for more than 90% of the targets. The flux variation
gradient (FVG) in the $W1$ and $W2$ bands was derived by applying linear
regression analysis, and we reported that those for unobscured AGNs fall in a
relatively narrow range, whereas those for obscured AGNs are distributed in a
redder and broader range. The AGN's line-of-sight dust extinction ($A_V$) is
calculated using the amount of the reddening in the FVG and is compared with
the neutral hydrogen column density ($N_{\rm{}H}$) of the BASS catalogue. We
found that the $N_{\rm{}H}/A_V$ ratios of obscured AGNs are greater than those
of the Galactic diffuse interstellar medium (ISM) and are distributed with a
large scatter by at most two orders of magnitude. Furthermore, we found that
the lower envelope of the $N_{\rm{}H}/A_V$ of obscured AGNs is comparable to
the Galactic diffuse ISM. These properties of the $N_{\rm{}H}/A_V$ can be
explained by an increase in the $N_{\rm{}H}$ attributed to the dust-free gas
clouds covering the line of sight in the broad-line region.
Authors' comments: 11 pages, 7 figures, published in MNRAS
Guoqing Wang, Abhirup Datta, Martin A. Lindquist
In recent years, neuroimaging has undergone a paradigm shift, moving away from the traditional brain mapping approach toward developing integrated, multivariate brain models that can predict categories of mental events. However, large interindividual differences in brain anatomy and functional localization after standard anatomical alignment remain a major limitation in performing this analysis, as it leads to feature misalignment across subjects in subsequent predictive models.
Dongsuk Oh, Yejin Kim, Hodong Lee, H. Howie Huang, Heuiseok Lim
Recent pre-trained language models (PLMs) have achieved great success on many
natural language processing tasks through learning linguistic features and
contextualized sentence representation. Since attributes captured in stacked
layers of PLMs are not clearly identified, straightforward approaches such as
embedding the last layer are commonly preferred to derive sentence
representations from PLMs. This paper introduces the attention-based pooling
strategy, which enables the model to preserve layer-wise signals captured in
each layer and learn digested linguistic features for downstream tasks. The
contrastive learning objective can adapt the layer-wise attention pooling to
both unsupervised and supervised settings. This regularizes the
anisotropic space of pre-trained embeddings and makes it more uniform. We evaluate
our model on standard semantic textual similarity (STS) and semantic search
tasks. Our method improves the performance of the contrastive-learned
BERT_base baseline and its variants.
Authors' comments: Accepted to COLING 2022
Jiapeng Wang, Ming Ma, Zhenhua Yu
Existing differentiable channel pruning methods often attach scaling factors or masks behind channels to prune filters with less importance, and implicitly assume uniform contribution of input samples to filter importance. Specifically, the effects of instance complexity on pruning performance are not yet fully investigated in static network pruning. In this paper, we propose a simple yet effective differentiable network pruning method, CWP, based on instance-complexity-weighted filter importance scores. We define an instance-complexity-related weight for each instance by giving higher weights to hard instances, and measure the weighted sum of instance-specific soft masks to model the non-uniform contribution of different inputs, which encourages hard instances to dominate the pruning process and the model performance to be well preserved. In addition, we introduce a regularizer to maximize polarization of the masks, such that a sweet spot can be easily found to identify the filters to be pruned. Performance evaluations on various network architectures and datasets demonstrate that CWP has advantages over the state of the art in pruning large networks. For instance, CWP improves the accuracy of ResNet56 on the CIFAR-10 dataset by 0.32% after removing 64.11% FLOPs, and prunes 87.75% FLOPs of ResNet50 on the ImageNet dataset with only 0.93% Top-1 accuracy loss.