Asim Naveed, Syed S. Naqvi, Tariq M. Khan, Imran Razzak
Skin cancer holds the highest incidence rate among all cancers globally. The importance of early detection cannot be overstated, as late-stage cases can be lethal. Classifying skin lesions, however, presents several challenges due to the many variations they can exhibit, such as differences in colour, shape, and size, significant variation within the same class, and notable similarities between different classes. This paper introduces a novel class-wise attention technique that equally regards each class while unearthing more specific details about skin lesions. This attention mechanism is progressively used to amalgamate discriminative feature details from multiple scales. The introduced technique demonstrated impressive performance, surpassing more than 15 cutting-edge methods including the winners of the HAM10000 and ISIC 2019 leaderboards. It achieved an accuracy of 97.40% on the HAM10000 dataset and 94.9% on the ISIC 2019 dataset.
Zachary Robertson, Oluwasanmi Koyejo
In the quest to enhance the efficiency and bio-plausibility of training deep
neural networks, Feedback Alignment (FA), which replaces the backward pass
weights with random matrices in the training process, has emerged as an
alternative to traditional backpropagation. While the appeal of FA lies in its
circumvention of computational challenges and its plausible biological
alignment, the theoretical understanding of this learning rule remains partial.
This paper uncovers a set of conservation laws underpinning the learning
dynamics of FA, revealing intriguing parallels between FA and Gradient Descent
(GD). Our analysis reveals that FA harbors implicit biases akin to those
exhibited by GD, challenging the prevailing narrative that these learning
algorithms are fundamentally different. Moreover, we demonstrate that these
conservation laws elucidate sufficient conditions for layer-wise alignment with
feedback matrices in ReLU networks. We further show that this implies
over-parameterized two-layer linear networks trained with FA converge to
minimum-norm solutions. The implications of our findings offer avenues for
developing more efficient and biologically plausible alternatives to
backpropagation through an understanding of the principles governing learning
dynamics in deep networks.
Authors' comments: 8 pages, 2 figures
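The FA rule described in this abstract, where the backward pass uses a fixed random matrix in place of the forward weights, can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration and not the authors' setup: the two-layer linear network, the zero initialisation, the synthetic target `w_true`, the step size, and the function name `train_fa`.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def train_fa(steps=2000, lr=0.02, h=4, n=20, seed=0):
    """Full-batch Feedback Alignment on y = w2 . (W1 x) with squared loss.

    Returns the loss before training and after every step.
    """
    rng = random.Random(seed)
    w_true = [1.0, -2.0, 0.5]            # hypothetical target weights
    d = len(w_true)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ts = [dot(w_true, x) for x in xs]

    W1 = [[0.0] * d for _ in range(h)]   # forward weights, first layer
    w2 = [0.0] * h                       # forward weights, second layer
    B = [rng.uniform(-1.0, 1.0) for _ in range(h)]  # fixed random feedback

    def loss():
        total = 0.0
        for x, t in zip(xs, ts):
            hid = [dot(row, x) for row in W1]
            total += 0.5 * (dot(w2, hid) - t) ** 2
        return total / n

    history = [loss()]
    for _ in range(steps):
        g_w2 = [0.0] * h
        g_W1 = [[0.0] * d for _ in range(h)]
        for x, t in zip(xs, ts):
            hid = [dot(row, x) for row in W1]
            err = dot(w2, hid) - t
            for i in range(h):
                g_w2[i] += err * hid[i] / n
                for j in range(d):
                    # Backpropagation would use w2[i] here; FA uses B[i].
                    g_W1[i][j] += err * B[i] * x[j] / n
        for i in range(h):
            w2[i] -= lr * g_w2[i]
            for j in range(d):
                W1[i][j] -= lr * g_W1[i][j]
        history.append(loss())
    return history
```

With both layers initialised at zero, plain gradient descent would not move in this toy problem, because each layer's gradient is proportional to the other layer's weights. The FA update still moves the first layer because `B` is non-zero.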
Tobias Cord-Landwehr, Christoph Boeddeker, Cătălin Zorilă, Rama Doddipatla, Reinhold Haeb-Umbach
Using a Teacher-Student training approach we developed a speaker embedding
extraction system that outputs embeddings at frame rate. Given this high
temporal resolution and the fact that the student produces sensible speaker
embeddings even for segments with speech overlap, the frame-wise embeddings
serve as an appropriate representation of the input speech signal for an
end-to-end neural meeting diarization (EEND) system. We show in experiments
that this representation helps mitigate a well-known problem of EEND systems:
the drop in diarization performance as the number of speakers increases is
significantly reduced. We also introduce block-wise processing to be able to
diarize arbitrarily long meetings.
Authors' comments: ICASSP 2023
Karim Lounici, Grégoire Pacreau
Large datasets are often affected by cell-wise outliers in the form of missing or erroneous data. However, discarding any samples containing outliers may result in a dataset that is too small to accurately estimate the covariance matrix. Moreover, the robust procedures designed to address this problem require the invertibility of the covariance operator and thus are not effective on high-dimensional data. In this paper, we propose an unbiased estimator for the covariance in the presence of missing values that does not require any imputation step and still achieves near minimax statistical accuracy with the operator norm. We also advocate for its use in combination with cell-wise outlier detection methods to tackle cell-wise contamination in a high-dimensional and low-rank setting, where state-of-the-art methods may suffer from numerical instability and long computation times. To complement our theoretical findings, we conducted an experimental study which demonstrates the superiority of our approach over the state of the art in both low- and high-dimensional settings.
Edmund Dervakos, Konstantinos Thomas, Giorgos Filandrianos, Giorgos Stamou
Counterfactual explanations have been argued to be one of the most intuitive
forms of explanation. They are typically defined as a minimal set of edits on a
given data sample that, when applied, changes the output of a model on that
sample. However, a minimal set of edits is not always clear and understandable
to an end-user, as it could, for instance, constitute an adversarial example
(which is indistinguishable from the original data sample to an end-user).
Instead, there are recent ideas that the notion of minimality in the context of
counterfactuals should refer to the semantics of the data sample, and not to
the feature space. In this work, we build on these ideas, and propose a
framework that provides counterfactual explanations in terms of knowledge
graphs. We provide an algorithm for computing such explanations (given some
assumptions about the underlying knowledge), and quantitatively evaluate the
framework with a user study.
Authors' comments: To appear at IJCAI 2023
Shuai Wang, Zipei Yan, Daoan Zhang, Zhongsen Li, Sirui Wu, Wenxuan Chen, Rui Li
Deep neural networks (DNNs) achieve promising performance in visual
recognition under the independent and identically distributed (IID) hypothesis.
In contrast, the IID hypothesis is not universally guaranteed in numerous
real-world applications, especially in medical image analysis. Medical image
segmentation is typically formulated as a pixel-wise classification task in
which each pixel is classified into a category. However, this formulation
ignores the hard-to-classify pixels, e.g., some pixels near the boundary
area, as they usually confuse DNNs. In this paper, we first show that
hard-to-classify pixels are associated with high uncertainty. Based on this,
we propose a novel framework that utilizes uncertainty estimation to highlight
hard-to-classify pixels for DNNs, thereby improving their generalization. We
evaluate our method on two popular benchmarks: prostate and fundus datasets.
The results of the experiment demonstrate that our method outperforms
state-of-the-art methods.
Authors' comments: 10 pages, 3 figures
Junrui Xiao, Zhikai Li, Lianwei Yang, Qingyi Gu
As emerging hardware begins to support mixed bit-width arithmetic computation, mixed-precision quantization is widely used to reduce the complexity of neural networks. However, Vision Transformers (ViTs) require complex self-attention computation to guarantee the learning of powerful feature representations, which makes mixed-precision quantization of ViTs still challenging. In this paper, we propose a novel patch-wise mixed-precision quantization (PMQ) for efficient inference of ViTs. Specifically, we design a lightweight global metric, which is faster than existing methods, to measure the sensitivity of each component in ViTs to quantization errors. Moreover, we introduce a Pareto frontier approach to automatically allocate the optimal bit-precision according to the sensitivity. To further reduce the computational complexity of self-attention in the inference stage, we propose a patch-wise module to reallocate the bit-width of patches in each layer. Extensive experiments on the ImageNet dataset show that our method greatly reduces the search cost and facilitates the application of mixed-precision quantization to ViTs.
Raúl Vargas, Lenny A. Romero, Song Zhang, Andres G. Marrugo
This Letter presents a novel structured light system model that effectively
considers local lens distortion by pixel-wise rational functions. We leverage
the stereo method for initial calibration and then estimate the rational model
for each pixel. Our proposed model can achieve high measurement accuracy within
and outside the calibration volume, demonstrating its robustness and accuracy.
Authors' comments: 4 pages, 5 figures
Fan Zhang, Mei Tu, Sangha Kim, Song Liu, Jinyao Yan
Most multi-domain machine translation models rely on domain-annotated data. Unfortunately, domain labels are usually unavailable in both training processes and real translation scenarios. In this work, we propose a label-free multi-domain machine translation model which requires little or no domain-annotated data in training and no domain labels in inference. Our model is composed of three parts: a backbone model, a domain discriminator responsible for discriminating data from different domains, and a set of experts that transfer the decoded features from generic to specific. We design a stage-wise training strategy and train the three parts sequentially. To leverage the extra domain knowledge and improve the training stability, in the discriminator training stage, domain differences are modeled explicitly with clustering and distilled into the discriminator through a multi-classification task. Meanwhile, Gumbel-Max sampling is adopted as the routing scheme in the expert training stage to balance each expert's specialization and generalization. Experimental results on the German-to-English translation task show that our model significantly improves BLEU scores on six different domains and even outperforms most of the models trained with domain-annotated data.
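The Gumbel-Max sampling mentioned above as a routing scheme can be sketched generically as follows. The sketch shows only the standard Gumbel-Max trick for drawing an index from a categorical distribution given unnormalised log-probabilities. How the paper wires it into expert routing is not stated in the abstract, so the name `gumbel_max_sample` and the example logits are hypothetical.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Return index i with probability softmax(logits)[i].

    Adds independent Gumbel(0, 1) noise to each logit and takes the argmax.
    """
    best_index, best_value = 0, -math.inf
    for i, logit in enumerate(logits):
        # rng.random() lies in [0, 1); clamp away from 0 so log() is finite.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        value = logit + gumbel_noise
        if value > best_value:
            best_index, best_value = i, value
    return best_index


def empirical_frequencies(logits, draws, seed=0):
    """Estimate selection frequencies over repeated draws."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(draws):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [c / draws for c in counts]
```

Because the noise is resampled on every call, lower-scoring experts are still chosen occasionally. This is one plausible reading of how such a scheme trades off specialization against generalization.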
Xuan Kien Phung, Sylvie Hamel
Kemeny's rule is one of the most studied and well-known voting schemes with
various important applications in computational social choice and biology.
Recently, Kemeny's rule was generalized via a set-wise approach by Gilbert et
al. This paradigm presents interesting advantages in comparison with Kemeny's
rule since not only pairwise comparisons but also the discordance between the
winners of subsets of three alternatives are taken into account in the
definition of the $3$-wise Kendall-tau distance between two rankings. In spite
of the NP-hardness of the $3$-wise Kemeny problem which consists of computing the
set of $3$-wise consensus rankings, namely rankings whose total $3$-wise
Kendall-tau distance to a given voting profile is minimized, we establish in
this paper several generalizations of the Major Order Theorems, as obtained by
Milosz and Hamel for Kemeny's rule, for the $3$-wise Kemeny voting schemes to
achieve a substantial search space reduction by efficiently determining in
polynomial time the relative orders of pairs of alternatives. Essentially, our
theorems quantify precisely the nontrivial property that if the preference for
an alternative over another one in an election is strong enough, not only in
the head-to-head competition but even when taking into account one or two more
alternatives, then the relative order of these two alternatives in all $3$-wise
consensus rankings must be as expected. As an application, we also obtain an
improvement of the Major Order Theorems for Kemeny's rule. Moreover, we show
that the well-known $3/4$-majority rule of Betzler et al. for Kemeny's rule is
only valid in general for elections with no more than $5$ alternatives with
respect to the $3$-wise Kemeny scheme. Several simulations and tests of our
algorithms on real-world and uniform data are provided.
Authors' comments: several improvements included
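The $3$-wise Kendall-tau distance described in this abstract can be illustrated with a small sketch. It assumes the distance counts every subset of two or three alternatives whose winner (the top-ranked alternative within the subset) differs between the two rankings. That is one reading of the set-wise definition attributed to Gilbert et al., and the paper's exact formulation may differ. The function name `setwise_kendall_tau` is hypothetical.

```python
from itertools import combinations


def setwise_kendall_tau(ranking_a, ranking_b, k=3):
    """Count subsets of size 2..k whose winner differs between two rankings.

    Each ranking lists the same alternatives from most to least preferred.
    With k=2 this reduces to the usual pairwise Kendall-tau distance.
    """
    pos_a = {alt: i for i, alt in enumerate(ranking_a)}
    pos_b = {alt: i for i, alt in enumerate(ranking_b)}
    if set(pos_a) != set(pos_b):
        raise ValueError("rankings must contain the same alternatives")

    distance = 0
    for size in range(2, k + 1):
        for subset in combinations(ranking_a, size):
            winner_a = min(subset, key=pos_a.__getitem__)
            winner_b = min(subset, key=pos_b.__getitem__)
            if winner_a != winner_b:
                distance += 1
    return distance
```

For example, swapping the top two of three alternatives changes the winner of one pair and of the single triple, so `setwise_kendall_tau(["a", "b", "c"], ["b", "a", "c"])` gives 2, while the pairwise distance (`k=2`) is 1.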
Norihide Tokushige
Let ${\mathcal G}$ be a family of subsets of an $n$-element set. The family ${\mathcal G}$ is called $3$-wise $t$-intersecting if the intersection of any three subsets in ${\mathcal G}$ is of size at least $t$. For a real number $p\in(0,1)$ we define the measure of the family by the sum of $p^{|G|}(1-p)^{n-|G|}$ over all $G\in{\mathcal G}$. For example, if ${\mathcal G}$ consists of all subsets containing a fixed $t$-element set, then it is a $3$-wise $t$-intersecting family with the measure $p^t$. Let $0<p\leq 2/(\sqrt{4t+9}-1)$, $\delta>0$, and let ${\mathcal G}$ be a $3$-wise $t$-intersecting family. It is known that the measure of ${\mathcal G}$ is at most $p^t$. Suppose, moreover, that ${\mathcal G}$ has the measure at least $(\frac12+\delta)p^t$. We show that, by choosing $t$ sufficiently large depending on $\delta$, the structure of ${\mathcal G}$ is one of (i) and (ii): (i) every subset in ${\mathcal G}$ contains a fixed $t$-element set, (ii) every subset in ${\mathcal G}$ contains at least $t+2$ elements from a fixed $(t+3)$-element set.
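The definitions above can be checked by brute force on a small ground set. The sketch below uses exact rational arithmetic. The values $n=5$, $t=2$, $p=1/3$ and all function names are choices made here for illustration only. It builds the two extremal structures (i) and (ii), confirms that both are $3$-wise $t$-intersecting, and evaluates their measures.

```python
from fractions import Fraction
from itertools import combinations, combinations_with_replacement


def all_subsets(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for size in range(n + 1):
        for combo in combinations(range(n), size):
            yield frozenset(combo)


def measure(family, n, p):
    """Sum of p^|G| * (1-p)^(n-|G|) over all G in the family."""
    return sum(p ** len(g) * (1 - p) ** (n - len(g)) for g in family)


def is_3wise_t_intersecting(family, t):
    """Check that any three members (repeats allowed) share >= t elements."""
    members = list(family)
    return all(len(a & b & c) >= t
               for a, b, c in combinations_with_replacement(members, 3))


n, t, p = 5, 2, Fraction(1, 3)

# Structure (i): all subsets containing a fixed t-element set.
fixed_t_set = frozenset(range(t))
family_i = [g for g in all_subsets(n) if fixed_t_set <= g]

# Structure (ii): at least t+2 elements from a fixed (t+3)-element set.
fixed_big_set = frozenset(range(t + 3))
family_ii = [g for g in all_subsets(n) if len(g & fixed_big_set) >= t + 2]

measure_i = measure(family_i, n, p)    # equals p**t = 1/9
measure_ii = measure(family_ii, n, p)  # equals 11/243 for these values
```

For these particular values, family (i) attains the bound $p^t$ and family (ii) has a strictly smaller measure.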
Jianfeng Zhang
For time inconsistent optimal control problems, a quite popular approach is the equilibrium approach, taken by the sophisticated agents. In this short note we construct a deterministic continuous time example where the unique equilibrium is dominated by another control. Therefore, in this situation it may not be wise to take the equilibrium strategy.
Lin Li, Michael Spratling
Deep neural networks can be easily fooled into making incorrect predictions
through corruption of the input by adversarial perturbations:
human-imperceptible artificial noise. So far adversarial training has been the
most successful defense against such adversarial attacks. This work focuses on
improving adversarial training to boost adversarial robustness. We first
analyze, from an instance-wise perspective, how adversarial vulnerability
evolves during adversarial training. We find that during training an overall
reduction of adversarial loss is achieved by sacrificing a considerable
proportion of training samples to be more vulnerable to adversarial attack,
which results in an uneven distribution of adversarial vulnerability among
data. Such "uneven vulnerability" is prevalent across several popular robust
training methods and, more importantly, relates to overfitting in adversarial
training. Motivated by this observation, we propose a new adversarial training
method: Instance-adaptive Smoothness Enhanced Adversarial Training (ISEAT). It
jointly smooths both input and weight loss landscapes in an adaptive,
instance-specific way to enhance robustness more for those samples with higher
adversarial vulnerability. Extensive experiments demonstrate the superiority of
our method over existing defense methods. Notably, our method, when combined
with the latest data augmentation and semi-supervised learning techniques,
achieves state-of-the-art robustness against $\ell_{\infty}$-norm constrained
attacks on CIFAR10 of 59.32% for Wide ResNet34-10 without extra data, and
61.55% for Wide ResNet28-10 with extra data. Code is available at
https://github.com/TreeLLi/Instance-adaptive-Smoothness-Enhanced-AT.
Authors' comments: 12 pages, work in submission
Jie Wang, Zhihao Shi, Xize Liang, Defu Lian, Shuiwang Ji, Bin Li, Enhong Chen, Feng Wu
Subgraph-wise sampling -- a promising class of mini-batch training techniques
for graph neural networks (GNNs) -- is critical for real-world applications.
During the message passing (MP) in GNNs, subgraph-wise sampling methods discard
messages outside the mini-batches in backward passes to avoid the well-known
neighbor explosion problem, i.e., the exponentially increasing dependencies of
nodes with the number of MP iterations. However, discarding messages may
sacrifice the gradient estimation accuracy, posing significant challenges to
their convergence analysis and convergence speeds. To address this challenge,
we propose a novel subgraph-wise sampling method with a convergence guarantee,
namely Local Message Compensation (LMC). To the best of our knowledge, LMC is
the first subgraph-wise sampling method with provable convergence. The key idea
is to retrieve the discarded messages in backward passes based on a message
passing formulation of backward passes. By efficient and effective
compensations for the discarded messages in both forward and backward passes,
LMC computes accurate mini-batch gradients and thus accelerates convergence.
Moreover, LMC is applicable to various MP-based GNN architectures, including
convolutional GNNs (finite message passing iterations with different layers)
and recurrent GNNs (infinite message passing iterations with a shared layer).
Experiments on large-scale benchmarks demonstrate that LMC is significantly
faster than state-of-the-art subgraph-wise sampling methods.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2302.00924
Gaochen Dong, Wei Chen
Recent Transformer-based models such as BERT, GPT-3 and ChatGPT have achieved
state-of-the-art performance in a range of natural language processing tasks.
However, the massive computations, huge memory footprint, and thus high latency
of Transformer-based models are an inevitable challenge for cloud services
with high real-time requirements. To tackle
the issue, we propose BBCT, a method of block-wise bit-compression for
transformer without retraining. Our method achieves more fine-grained
compression of the whole transformer, including embedding, matrix
multiplication, GELU, softmax, layer normalization, and all the intermediate
results. As a case study, we compress an efficient BERT with BBCT. Our
benchmark test results on General Language Understanding Evaluation (GLUE) show
that BBCT can achieve less than 1% accuracy drop in most tasks.
Authors' comments: Need to add figures and adjust languages to improve readability
Gil Keren
Standard Recurrent Neural Network Transducer (RNN-T) decoding algorithms for
speech recognition iterate over the time axis, such that one time step is
decoded before moving on to the next time step. Those algorithms result in a
large number of calls to the joint network, which were shown in previous work
to be an important factor that reduces decoding speed. We present a decoding
beam search algorithm that batches the joint network calls across a segment of
time steps, which results in 20%-96% decoding speedups consistently across all
models and settings experimented with. In addition, aggregating emission
probabilities over a segment may be seen as a better approximation to finding
the most likely model output, causing our algorithm to improve oracle word
error rate by up to 11% relative as the segment size increases, and to slightly
improve general word error rate.
Authors' comments: Accepted for Presentation at ASRU 2023
Sophie Potts, Elisabeth Bergherr, Constantin Reinke, Colin Griesbach
Model-based component-wise gradient boosting is a popular tool for data-driven variable selection. In order to improve its prediction and selection qualities even further, several modifications of the original algorithm have been developed that mainly focus on different stopping criteria, leaving the actual variable selection mechanism untouched. We investigate different prediction-based mechanisms for the variable selection step in model-based component-wise gradient boosting. These approaches include Akaike's Information Criterion (AIC) as well as a selection rule relying on the component-wise test error computed via cross-validation. We implemented the AIC and cross-validation routines for Generalized Linear Models and evaluated them regarding their variable selection properties and predictive performance. An extensive simulation study revealed improved selection properties, whereas the prediction error could be lowered in a real-world application with age-standardized COVID-19 incidence rates.
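For orientation, the variable selection step that this abstract proposes to modify can be sketched for the simplest case. The sketch shows classical component-wise boosting with squared-error loss, where each iteration picks the single covariate whose univariate least-squares fit to the current residuals has the lowest in-sample residual sum of squares. It does not implement the AIC-based or cross-validation-based rules studied in the paper. The function name, the step length `nu`, and the assumption of centred data without an intercept are illustrative choices.

```python
import math


def componentwise_l2_boost(X, y, n_iter=100, nu=0.1):
    """Component-wise L2 boosting with one linear base learner per column.

    X is a list of rows, y a list of responses (both assumed centred).
    Returns the coefficient vector and the sequence of selected columns.
    """
    n_features = len(X[0])
    coef = [0.0] * n_features
    residual = list(y)
    selected = []
    for _ in range(n_iter):
        best_j, best_beta, best_rss = None, 0.0, math.inf
        for j in range(n_features):
            column = [row[j] for row in X]
            sxx = sum(v * v for v in column)
            if sxx == 0.0:
                continue  # a constant-zero column cannot explain anything
            beta = sum(v * r for v, r in zip(column, residual)) / sxx
            rss = sum((r - beta * v) ** 2 for v, r in zip(column, residual))
            if rss < best_rss:  # selection step: lowest in-sample RSS wins
                best_j, best_beta, best_rss = j, beta, rss
        if best_j is None:
            break
        # Weak update: move only a fraction nu towards the best fit.
        coef[best_j] += nu * best_beta
        residual = [r - nu * best_beta * row[best_j]
                    for r, row in zip(residual, X)]
        selected.append(best_j)
    return coef, selected
```

Replacing the `rss` criterion inside the inner loop is where a prediction-based rule such as those investigated in the paper would plug in.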
Najeeb Moharram Jebreel, Josep Domingo-Ferrer, Yiming Li
Training deep neural networks (DNNs) usually requires massive training data
and computational resources. Users who cannot afford this may prefer to
outsource training to a third party or resort to publicly available pre-trained
models. Unfortunately, doing so facilitates a new training-time attack (i.e.,
backdoor attack) against DNNs. This attack aims to induce misclassification of
input samples containing adversary-specified trigger patterns. In this paper,
we first conduct a layer-wise feature analysis of poisoned and benign samples
from the target class. We find that the feature difference between benign
and poisoned samples tends to be maximum at a critical layer, which is not
always the one typically used in existing defenses, namely the layer before
fully-connected layers. We also demonstrate how to locate this critical layer
based on the behaviors of benign samples. We then propose a simple yet
effective method to filter poisoned samples by analyzing the feature
differences between suspicious and benign samples at the critical layer. We
conduct extensive experiments on two benchmark datasets, which confirm the
effectiveness of our defense.
Authors' comments: This paper is accepted by PAKDD 2023
Adam X. Yang, Laurence Aitchison, Henry B. Moss
In Bayesian optimisation, we often seek to minimise the black-box objective functions that arise in real-world physical systems. A primary contributor to the cost of evaluating such black-box objective functions is often the effort required to prepare the system for measurement. We consider a common scenario where preparation costs grow as the distance between successive evaluations increases. In this setting, smooth optimisation trajectories are preferred and the jumpy paths produced by the standard myopic (i.e., one-step-optimal) Bayesian optimisation methods are sub-optimal. Our algorithm, MONGOOSE, uses a meta-learnt parametric policy to generate smooth optimisation trajectories, achieving performance gains over existing methods when optimising functions with large movement costs.
Maxime Darrin, Guillaume Staerman, Eduardo Dadalto Câmara Gomes, Jackie CK Cheung, Pablo Piantanida, Pierre Colombo
Out-of-distribution (OOD) detection is a rapidly growing field due to new robustness and security requirements driven by an increased number of AI-based systems. Existing OOD textual detectors often rely on an anomaly score (e.g., Mahalanobis distance) computed on the embedding output of the last layer of the encoder. In this work, we observe that OOD detection performance varies greatly depending on the task and layer output. More importantly, we show that the usual choice (the last layer) is rarely the best one for OOD detection and that far better results could be achieved if the best layer were picked. To leverage this observation, we propose a data-driven, unsupervised method to combine layer-wise anomaly scores. In addition, we extend classical textual OOD benchmarks by including classification tasks with a greater number of classes (up to 77), which reflects more realistic settings. On this augmented benchmark, we show that the proposed post-aggregation methods achieve robust and consistent results while removing manual feature selection altogether. Their performance comes close to that of an oracle that always picks the best layer.