Fan Zhang, Mei Tu, Sangha Kim, Song Liu, Jinyao Yan
Most multi-domain machine translation models rely on domain-annotated data. Unfortunately, domain labels are usually unavailable in both training processes and real translation scenarios. In this work, we propose a label-free multi-domain machine translation model which requires little or no domain-annotated data in training and no domain labels in inference. Our model is composed of three parts: a backbone model, a domain discriminator that distinguishes data from different domains, and a set of experts that transfer the decoded features from generic to specific. We design a stage-wise training strategy and train the three parts sequentially. To leverage the extra domain knowledge and improve the training stability, in the discriminator training stage, domain differences are modeled explicitly with clustering and distilled into the discriminator through a multi-class classification task. Meanwhile, Gumbel-Max sampling is adopted as the routing scheme in the expert training stage to balance each expert's specialization and generalization. Experimental results on the German-to-English translation task show that our model significantly improves BLEU scores on six different domains and even outperforms most of the models trained with domain-annotated data.
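The Gumbel-Max sampling mentioned in the abstract can be illustrated with a minimal sketch. This is the generic Gumbel-Max trick, not the authors' routing code: the function names, the example logits, and the use of plain Python lists in place of expert scores are all assumptions made for illustration.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw one index i with probability softmax(logits)[i] (Gumbel-Max trick)."""
    best_index, best_value = 0, -math.inf
    for i, logit in enumerate(logits):
        u = max(rng.random(), 1e-300)  # clamp: rng.random() may return 0.0
        gumbel_noise = -math.log(-math.log(u))  # standard Gumbel(0, 1) sample
        if logit + gumbel_noise > best_value:
            best_index, best_value = i, logit + gumbel_noise
    return best_index


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_routing(logits, n_draws, seed=0):
    """Fraction of draws routed to each (hypothetical) expert."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n_draws):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [c / n_draws for c in counts]
```

Because the perturbed argmax follows the softmax distribution, low-scoring experts are still selected occasionally, which is one plausible reading of how sampling trades off specialization against generalization.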
Xuan Kien Phung, Sylvie Hamel
Kemeny's rule is one of the most studied and well-known voting schemes with
various important applications in computational social choice and biology.
Recently, Kemeny's rule was generalized via a set-wise approach by Gilbert et
al. This paradigm presents interesting advantages in comparison with Kemeny's
rule since not only pairwise comparisons but also the discordance between the
winners of subsets of three alternatives is taken into account in the
definition of the $3$-wise Kendall-tau distance between two rankings. In spite
of the NP-hardness of the 3-wise Kemeny problem which consists of computing the
set of $3$-wise consensus rankings, namely rankings whose total $3$-wise
Kendall-tau distance to a given voting profile is minimized, we establish in
this paper several generalizations of the Major Order Theorems, as obtained by
Milosz and Hamel for Kemeny's rule, for the $3$-wise Kemeny voting schemes to
achieve a substantial search space reduction by efficiently determining in
polynomial time the relative orders of pairs of alternatives. Essentially, our
theorems quantify precisely the nontrivial property that if the preference for
an alternative over another one in an election is strong enough, not only in
the head-to-head competition but even when taking into account one or two more
alternatives, then the relative order of these two alternatives in all $3$-wise
consensus rankings must be as expected. As an application, we also obtain an
improvement of the Major Order Theorems for Kemeny's rule. Moreover, we show
that the well-known $3/4$-majority rule of Betzler et al. for Kemeny's rule is
only valid in general for elections with no more than $5$ alternatives with
respect to the $3$-wise Kemeny scheme. Several simulations and tests of our
algorithms on real-world and uniform data are provided.
Authors' comments: several improvements included
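For orientation, the classical (pairwise) Kemeny rule that the abstract generalizes can be sketched by brute force. This sketch covers only the standard Kendall-tau distance, not the $3$-wise distance of Gilbert et al., and the function names and toy profile are illustrative assumptions. Exhaustive search is feasible only for a handful of alternatives, which is why search space reductions such as the Major Order Theorems matter.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Number of pairs of alternatives ordered differently by rankings r1 and r2."""
    pos1 = {a: i for i, a in enumerate(r1)}
    pos2 = {a: i for i, a in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def kemeny_consensus(profile):
    """Return (minimum total distance, sorted list of consensus rankings).

    Brute force over all permutations of the alternatives.
    """
    best_score, best = None, []
    for candidate in permutations(profile[0]):
        score = sum(kendall_tau(candidate, vote) for vote in profile)
        if best_score is None or score < best_score:
            best_score, best = score, [candidate]
        elif score == best_score:
            best.append(candidate)
    return best_score, sorted(best)
```

For the toy profile of two votes `("a", "b", "c")` and one vote `("b", "a", "c")`, the only consensus ranking is `("a", "b", "c")`, at total distance 1.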
Norihide Tokushige
Let ${\mathcal G}$ be a family of subsets of an $n$-element set. The family ${\mathcal G}$ is called $3$-wise $t$-intersecting if the intersection of any three subsets in ${\mathcal G}$ is of size at least $t$. For a real number $p\in(0,1)$ we define the measure of the family by the sum of $p^{|G|}(1-p)^{n-|G|}$ over all $G\in{\mathcal G}$. For example, if ${\mathcal G}$ consists of all subsets containing a fixed $t$-element set, then it is a $3$-wise $t$-intersecting family with the measure $p^t$. Let $0<p\leq 2/(\sqrt{4t+9}-1)$, $\delta>0$, and let ${\mathcal G}$ be a $3$-wise $t$-intersecting family. It is known that the measure of ${\mathcal G}$ is at most $p^t$. Suppose, moreover, that ${\mathcal G}$ has the measure at least $(\frac12+\delta)p^t$. We show that, by choosing $t$ sufficiently large depending on $\delta$, the structure of ${\mathcal G}$ is one of (i) and (ii): (i) every subset in ${\mathcal G}$ contains a fixed $t$-element set, (ii) every subset in ${\mathcal G}$ contains at least $t+2$ elements from a fixed $(t+3)$-element set.
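The definitions above can be checked numerically on a small ground set. The sketch below is only a sanity check under assumed values ($n=6$, $t=2$, $p=0.3$, chosen for illustration). It builds the two extremal structures (i) and (ii), verifies that both are $3$-wise $t$-intersecting, and evaluates their measures. Triples are taken with repetition, which is the stricter reading of "any three subsets".

```python
import math
from itertools import combinations_with_replacement


def subsets_of(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for mask in range(1 << n):
        yield frozenset(i for i in range(n) if (mask >> i) & 1)


def measure(family, n, p):
    """Sum of p^|G| (1-p)^(n-|G|) over all G in the family."""
    return sum(p ** len(g) * (1 - p) ** (n - len(g)) for g in family)


def is_3wise_t_intersecting(family, t):
    """Check |A & B & C| >= t for all triples (repetition allowed)."""
    return all(
        len(a & b & c) >= t
        for a, b, c in combinations_with_replacement(list(family), 3)
    )


# Illustrative parameters (assumptions, not taken from the paper).
n, t, p = 6, 2, 0.3
fixed_t_set = frozenset(range(t))        # the fixed t-element set of (i)
fixed_t3_set = frozenset(range(t + 3))   # the fixed (t+3)-element set of (ii)

family_i = [g for g in subsets_of(n) if fixed_t_set <= g]
family_ii = [g for g in subsets_of(n) if len(g & fixed_t3_set) >= t + 2]
```

Family (i) has measure exactly $p^t$, while family (ii) has measure $(t+3)p^{t+2}(1-p)+p^{t+3}$, since each member misses at most one element of the fixed $(t+3)$-element set.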
Jianfeng Zhang
For time inconsistent optimal control problems, a quite popular approach is the equilibrium approach, taken by the sophisticated agents. In this short note we construct a deterministic continuous time example where the unique equilibrium is dominated by another control. Therefore, in this situation it may not be wise to take the equilibrium strategy.
Lin Li, Michael Spratling
Deep neural networks can be easily fooled into making incorrect predictions
through corruption of the input by adversarial perturbations:
human-imperceptible artificial noise. So far adversarial training has been the
most successful defense against such adversarial attacks. This work focuses on
improving adversarial training to boost adversarial robustness. We first
analyze, from an instance-wise perspective, how adversarial vulnerability
evolves during adversarial training. We find that during training an overall
reduction of adversarial loss is achieved by sacrificing a considerable
proportion of training samples to be more vulnerable to adversarial attack,
which results in an uneven distribution of adversarial vulnerability among
data. Such "uneven vulnerability" is prevalent across several popular robust
training methods and, more importantly, relates to overfitting in adversarial
training. Motivated by this observation, we propose a new adversarial training
method: Instance-adaptive Smoothness Enhanced Adversarial Training (ISEAT). It
jointly smooths both input and weight loss landscapes in an adaptive,
instance-specific way to enhance robustness more for those samples with higher
adversarial vulnerability. Extensive experiments demonstrate the superiority of
our method over existing defense methods. Notably, our method, when combined
with the latest data augmentation and semi-supervised learning techniques,
achieves state-of-the-art robustness against $\ell_{\infty}$-norm constrained
attacks on CIFAR10 of 59.32% for Wide ResNet34-10 without extra data, and
61.55% for Wide ResNet28-10 with extra data. Code is available at
https://github.com/TreeLLi/Instance-adaptive-Smoothness-Enhanced-AT.
Authors' comments: 12 pages, work in submission
Jie Wang, Zhihao Shi, Xize Liang, Defu Lian, Shuiwang Ji, Bin Li, Enhong Chen, Feng Wu
Subgraph-wise sampling -- a promising class of mini-batch training techniques
for graph neural networks (GNNs) -- is critical for real-world applications.
During the message passing (MP) in GNNs, subgraph-wise sampling methods discard
messages outside the mini-batches in backward passes to avoid the well-known
neighbor explosion problem, i.e., the exponentially increasing dependencies of
nodes with the number of MP iterations. However, discarding messages may
sacrifice the gradient estimation accuracy, posing significant challenges to
their convergence analysis and convergence speeds. To address this challenge,
we propose a novel subgraph-wise sampling method with a convergence guarantee,
namely Local Message Compensation (LMC). To the best of our knowledge, LMC is
the first subgraph-wise sampling method with provable convergence. The key idea
is to retrieve the discarded messages in backward passes based on a message
passing formulation of backward passes. By efficient and effective
compensations for the discarded messages in both forward and backward passes,
LMC computes accurate mini-batch gradients and thus accelerates convergence.
Moreover, LMC is applicable to various MP-based GNN architectures, including
convolutional GNNs (finite message passing iterations with different layers)
and recurrent GNNs (infinite message passing iterations with a shared layer).
Experiments on large-scale benchmarks demonstrate that LMC is significantly
faster than state-of-the-art subgraph-wise sampling methods.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2302.00924
Gaochen Dong, Wei Chen
Recent Transformer-based models, represented by BERT, GPT-3 and ChatGPT,
have achieved state-of-the-art performance in a range
of natural language processing tasks. However, the massive computations, huge
memory footprint, and thus high latency of Transformer-based models are an
inevitable challenge for the cloud with high real-time requirements. To tackle
the issue, we propose BBCT, a method of block-wise bit-compression for
transformer without retraining. Our method achieves more fine-grained
compression of the whole transformer, including embedding, matrix
multiplication, GELU, softmax, layer normalization, and all the intermediate
results. As a case study, we compress an efficient BERT with BBCT. Our
benchmark test results on General Language Understanding Evaluation (GLUE) show
that BBCT can achieve less than 1% accuracy drop in most tasks.
Authors' comments: Need to add figures and adjust languages to improve readability
Gil Keren
Standard Recurrent Neural Network Transducer (RNN-T) decoding algorithms for
speech recognition iterate over the time axis, such that one time step is
decoded before moving on to the next time step. Those algorithms result in a
large number of calls to the joint network, which was shown in previous work
to be an important factor that reduces decoding speed. We present a decoding
beam search algorithm that batches the joint network calls across a segment of
time steps, which results in 20%-96% decoding speedups consistently across all
models and settings experimented with. In addition, aggregating emission
probabilities over a segment may be seen as a better approximation to finding
the most likely model output, causing our algorithm to improve oracle word
error rate by up to 11% relative as the segment size increases, and to slightly
improve general word error rate.
Authors' comments: Accepted for Presentation at ASRU 2023
Sophie Potts, Elisabeth Bergherr, Constantin Reinke, Colin Griesbach
Model-based component-wise gradient boosting is a popular tool for data-driven variable selection. In order to improve its prediction and selection qualities even further, several modifications of the original algorithm have been developed that mainly focus on different stopping criteria, leaving the actual variable selection mechanism untouched. We investigate different prediction-based mechanisms for the variable selection step in model-based component-wise gradient boosting. These approaches include Akaike's Information Criterion (AIC) as well as a selection rule relying on the component-wise test error computed via cross-validation. We implemented the AIC and cross-validation routines for Generalized Linear Models and evaluated them regarding their variable selection properties and predictive performance. An extensive simulation study revealed improved selection properties, whereas the prediction error could be lowered in a real-world application with age-standardized COVID-19 incidence rates.
Najeeb Moharram Jebreel, Josep Domingo-Ferrer, Yiming Li
Training deep neural networks (DNNs) usually requires massive training data
and computational resources. Users who cannot afford this may prefer to
outsource training to a third party or resort to publicly available pre-trained
models. Unfortunately, doing so facilitates a new training-time attack (i.e.,
backdoor attack) against DNNs. This attack aims to induce misclassification of
input samples containing adversary-specified trigger patterns. In this paper,
we first conduct a layer-wise feature analysis of poisoned and benign samples
from the target class. We find that the feature difference between benign
and poisoned samples tends to be maximum at a critical layer, which is not
always the one typically used in existing defenses, namely the layer before
fully-connected layers. We also demonstrate how to locate this critical layer
based on the behaviors of benign samples. We then propose a simple yet
effective method to filter poisoned samples by analyzing the feature
differences between suspicious and benign samples at the critical layer. We
conduct extensive experiments on two benchmark datasets, which confirm the
effectiveness of our defense.
Authors' comments: This paper is accepted by PAKDD 2023
Adam X. Yang, Laurence Aitchison, Henry B. Moss
In Bayesian optimisation, we often seek to minimise the black-box objective functions that arise in real-world physical systems. A primary contributor to the cost of evaluating such black-box objective functions is often the effort required to prepare the system for measurement. We consider a common scenario where preparation costs grow as the distance between successive evaluations increases. In this setting, smooth optimisation trajectories are preferred and the jumpy paths produced by the standard myopic (i.e.\ one-step-optimal) Bayesian optimisation methods are sub-optimal. Our algorithm, MONGOOSE, uses a meta-learnt parametric policy to generate smooth optimisation trajectories, achieving performance gains over existing methods when optimising functions with large movement costs.
Maxime Darrin, Guillaume Staerman, Eduardo Dadalto Câmara Gomes, Jackie CK Cheung, Pablo Piantanida, Pierre Colombo
Out-of-distribution (OOD) detection is a rapidly growing field due to new robustness and security requirements driven by an increased number of AI-based systems. Existing OOD textual detectors often rely on an anomaly score (e.g., Mahalanobis distance) computed on the embedding output of the last layer of the encoder. In this work, we observe that OOD detection performance varies greatly depending on the task and layer output. More importantly, we show that the usual choice (the last layer) is rarely the best one for OOD detection and that far better results could be achieved if the best layer were picked. To leverage this observation, we propose a data-driven, unsupervised method to combine layer-wise anomaly scores. In addition, we extend classical textual OOD benchmarks by including classification tasks with a greater number of classes (up to 77), which reflects more realistic settings. On this augmented benchmark, we show that the proposed post-aggregation methods achieve robust and consistent results while removing manual feature selection altogether. Their performance is close to that of the oracle's best layer.
Yuanjie Yan, Jian Zhao, Furao Shen
Image manipulation on the latent space of the pre-trained StyleGAN can control the semantic attributes of the generated images. Recently, some studies have focused on detecting channels with specific properties to directly manipulate the latent code, which is limited by the entanglement of the latent space. To detect the attribute-specific channels, we propose a novel detection method in the context of pre-trained classifiers. We analyse the gradients layer by layer on the style space. The intensities of the gradients indicate the channel's responses to specific attributes. The latent style codes of channels control separate attributes in the layers. We choose channels with top-$k$ gradients to control specific attributes in the maximum response layer. We implement single-channel and multi-channel manipulations with a certain attribute. Our methods can accurately detect relevant channels for a large number of face attributes. Extensive qualitative and quantitative results demonstrate that the proposed methods outperform state-of-the-art methods in generalization and scalability.
Nuoya Xiong, Yihan Du, Longbo Huang
In this paper, we investigate a novel safe reinforcement learning problem with step-wise violation constraints. Our problem differs from existing works in that we consider stricter step-wise violation constraints and do not assume the existence of safe actions, making our formulation more suitable for safety-critical applications which need to ensure safety in all decision steps and may not always possess safe actions, e.g., robot control and autonomous driving. We propose a novel algorithm SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^3SAT})$ regret. Lower bounds are provided to validate the optimality in both violation and regret performance with respect to $S$ and $T$. Moreover, we further study a novel safe reward-free exploration problem with step-wise violation constraints. For this problem, we design an $(\varepsilon,\delta)$-PAC algorithm SRF-UCRL, which achieves nearly state-of-the-art sample complexity $\widetilde{O}((\frac{S^2AH^2}{\varepsilon}+\frac{H^4SA}{\varepsilon^2})(\log(\frac{1}{\delta})+S))$, and guarantees $\widetilde{O}(\sqrt{ST})$ violation during the exploration. The experimental results demonstrate the superiority of our algorithms in safety performance, and corroborate our theoretical results.
Giovanni Araujo Bacochina, Rodrigo Clemente Thom de Souza
The use of attention layers has become a trend since the popularization of Transformer-based models, and attention is the key element of many state-of-the-art models developed in recent years. However, one of the biggest obstacles to implementing these architectures, as well as many others in the deep learning field, is the enormous number of parameters to optimize, which makes their use conditional on the availability of powerful hardware. In this paper, we propose a new attention mechanism that adapts Dot-Product Attention, which uses matrix multiplications, to become element-wise through the use of array multiplications. To test the effectiveness of this approach, two models (one with a VGG-like architecture and one with the proposed method) were trained on a classification task using the Fashion MNIST and CIFAR10 datasets. Each model was trained for 10 epochs on a single Tesla T4 GPU from Google Colaboratory. The results show that this mechanism reaches 92% of the accuracy of the VGG-like counterpart on the Fashion MNIST dataset while reducing the number of parameters by 97%. For CIFAR10, the accuracy is still equivalent to 60% of that of the VGG-like counterpart while using 50% fewer parameters.
Rohit Yadav, François-Xavier Dupé, S. Takerkart, Guillaume Auzias
Population-wise matching of the cortical folds is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins (termed sulcal graphs) makes it possible to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of the multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us to benchmark several state-of-the-art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques in obtaining a population-wise consistent labeling of cortical folds at the sulcal basin level.
Sungmin Cha, Sungjun Cho, Dasol Hwang, Honglak Lee, Taesup Moon, Moontae Lee
Since the recent advent of regulations for data protection (e.g., the General
Data Protection Regulation), there has been increasing demand for deleting
information learned from sensitive data in pre-trained models without
retraining from scratch. The inherent vulnerability of neural networks towards
adversarial attacks and unfairness also calls for a robust method to remove or
correct information in an instance-wise fashion, while retaining the predictive
performance across remaining data. To this end, we consider instance-wise
unlearning, of which the goal is to delete information on a set of instances
from a pre-trained model, by either misclassifying each instance away from its
original prediction or relabeling the instance to a different label. We also
propose two methods that reduce forgetting on the remaining data: 1) utilizing
adversarial examples to overcome forgetting at the representation-level and 2)
leveraging weight importance metrics to pinpoint network parameters guilty of
propagating unwanted information. Both methods only require the pre-trained
model and data instances to forget, allowing painless application to real-life
settings where the entire training set is unavailable. Through extensive
experimentation on various image classification benchmarks, we show that our
approach effectively preserves knowledge of remaining data while unlearning
given instances in both single-task and continual unlearning scenarios.
Authors' comments: AAAI 2024 camera ready version
Zanjia Tong, Yuhang Chen, Zewei Xu, Rong Yu
The loss function for bounding box regression (BBR) is essential to object detection. A well-defined loss brings significant performance improvements to the model. Most existing works assume that the examples in the training data are high-quality and focus on strengthening the fitting ability of BBR loss. If we blindly strengthen BBR on low-quality examples, it will jeopardize localization performance. Focal-EIoU v1 was proposed to solve this problem, but due to its static focusing mechanism (FM), the potential of non-monotonic FM was not fully exploited. Based on this idea, we propose an IoU-based loss with a dynamic non-monotonic FM named Wise-IoU (WIoU). The dynamic non-monotonic FM uses the outlier degree instead of IoU to evaluate the quality of anchor boxes and provides a wise gradient gain allocation strategy. This strategy reduces the competitiveness of high-quality anchor boxes while also reducing the harmful gradient generated by low-quality examples. This allows WIoU to focus on ordinary-quality anchor boxes and improve the detector's overall performance. When WIoU is applied to the state-of-the-art real-time detector YOLOv7, the AP-75 on the MS-COCO dataset is improved from 53.03% to 54.50%. Code is available at https://github.com/Instinct323/wiou.
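As background for IoU-based losses, the plain intersection-over-union of two axis-aligned boxes can be computed as in the sketch below. This is only the standard IoU quantity; it does not implement WIoU, its outlier degree, or its focusing mechanism, and the `(x1, y1, x2, y2)` corner format is an assumption made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

A basic IoU loss is then `1 - iou(pred, target)`; IoU-based BBR losses typically start from this term.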
Basel Barakat, Qiang Huang
Finetuning can be used to tackle domain-specific tasks by transferring
knowledge. Previous studies on finetuning focused on adapting only the weights
of a task-specific classifier or re-optimizing all layers of the pre-trained
model using the new task data. Methods of the first type cannot mitigate the
mismatch between a pre-trained model and the new task data, and methods of the
second type easily over-fit when processing tasks with limited data.
To explore the effectiveness of fine-tuning, we propose a novel block-wise
optimization mechanism, which adapts the weights of a group of layers of a
pre-trained model. In our work, the layer selection can be done in four
different ways. The first is layer-wise adaptation, which aims to search for
the most salient single layer according to the classification performance. The
second way is based on the first one, jointly adapting a small number of
top-ranked layers instead of using an individual layer. The third is
block-based segmentation, where the layers of a deep network are segmented into blocks
by non-weighting layers, such as the MaxPooling layer and Activation layer. The
last one is to use a fixed-length sliding window to group layers block by
block. To identify which group of layers is the most suitable for finetuning,
the search starts from the target end and is conducted by freezing other layers
excluding the selected layers and the classification layers. The most salient
group of layers is determined in terms of classification performance. In our
experiments, the proposed approaches are tested on an often-used dataset,
Tf_flower, by finetuning five typical pre-trained models, VGG16, MobileNet-v1,
MobileNet-v2, MobileNet-v3, and ResNet50v2, respectively. The obtained results
show that the use of our proposed block-wise approaches can achieve better
performances than the two baseline methods and the layer-wise method.
Authors' comments: 10 pages
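Two of the layer-grouping strategies described in the abstract, block-based segmentation and the fixed-length sliding window, can be sketched on a toy layer list. This is an illustration under assumptions, not the authors' implementation: the `(name, kind)` layer representation, the kind labels, and the default stride of 1 are all hypothetical choices.

```python
def segment_by_non_weight_layers(layers, non_weight_kinds=("pool", "activation")):
    """Split a list of (name, kind) layers into blocks of weighted layers.

    Layers whose kind is in non_weight_kinds act as separators and are
    not included in any block.
    """
    blocks, current = [], []
    for name, kind in layers:
        if kind in non_weight_kinds:
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(name)
    if current:
        blocks.append(current)
    return blocks


def sliding_window_groups(layer_names, window, stride=1):
    """Group layer names with a fixed-length sliding window (stride is assumed)."""
    return [
        layer_names[i:i + window]
        for i in range(0, len(layer_names) - window + 1, stride)
    ]


# Hypothetical toy network used only to demonstrate the grouping.
toy_layers = [
    ("conv1", "conv"), ("conv2", "conv"), ("pool1", "pool"),
    ("conv3", "conv"), ("act1", "activation"), ("fc", "dense"),
]
```

Each resulting group would then be the only trainable part of the backbone during one fine-tuning trial, with all other layers except the classification layers frozen, and the group with the best classification performance would be kept.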
Kislay Raj, Aditya Singh, Abhishek Mandal, Teerath Kumar, Arunabha M. Roy
In a world of growing technology, psychological disorders have become a challenge
to be solved. The methods used for cognitive stimulation are very conventional
and based on one-way communication, which relies only on the material or method
used for training an individual. They do not use any kind of feedback from
the individual to analyze the progress of the training process. We
propose a closed-loop methodology to improve the cognitive state of a person
with ID (intellectual disability). We use a platform named 'Armoni' for
providing training to intellectually disabled individuals. The learning is
performed in a closed loop by using feedback in the form of change in affective
state. For feedback to the Armoni, an EEG (Electroencephalograph) headband is
used. All the changes in EEG are observed and classified against the change in
the mean and standard deviation value of all frequency bands of signal. This
comparison helps define every activity with respect to the change
in brain signals. In this paper, we discuss the process of treating the
EEG signal and its definition against the different activities of Armoni. We
have tested it on 6 different systems with different age groups and cognitive
levels.
Authors' comments: Submitted to SN Computer Science journal