Yuri Lavinas, Marcelo Ladeira, Gabriela Ochoa, Claus Aranha
The performance of multiobjective algorithms varies across problems, making it hard to develop new algorithms or apply existing ones to new problems. To simplify the development and application of new multiobjective algorithms, there has been an increasing interest in their automatic design from component parts. These automatically designed metaheuristics can outperform their human-developed counterparts. However, it remains unclear which components contribute most to their performance improvement. This study introduces a new methodology to investigate the effects of the final configuration of an automatically designed algorithm. We apply this methodology to a well-performing Multiobjective Evolutionary Algorithm Based on Decomposition (MOEA/D) designed by the irace package on nine constrained problems. We then contrast the impact of the algorithm components in terms of their Search Trajectory Networks (STNs), the diversity of the population, and the hypervolume. Our results indicate that the most influential components were the restart and update strategies, with higher increments in performance and more distinct metric values. Also, their relative influence depends on the problem difficulty: not using the restart strategy was more influential in problems where MOEA/D performs better, while the update strategy was more influential in problems where MOEA/D performs worst.
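The hypervolume mentioned in the abstract above can be illustrated with a minimal sketch for the two-objective case. It assumes both objectives are minimized and that a reference point is supplied by the user; the function name `hypervolume_2d` and the sample points are hypothetical and are not taken from the paper.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded by `reference`.

    Assumes both objectives are minimized (an assumption of this sketch).
    """
    ref_x, ref_y = reference
    # Keep only points that are strictly better than the reference point.
    candidates = sorted((x, y) for x, y in points if x < ref_x and y < ref_y)
    area = 0.0
    best_y = ref_y
    for x, y in candidates:
        # A point adds area only if it improves the second objective
        # over every point with a smaller first objective.
        if y < best_y:
            area += (ref_x - x) * (best_y - y)
            best_y = y
    return area


# Hypothetical front of three mutually non-dominated points.
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, (4.0, 4.0)))
```

A larger value means the front covers more of the objective space below the reference point; dominated points do not change the result.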
Bingxin Zhao, Shurong Zheng, Hongtu Zhu
Genetic prediction of complex traits and diseases has attracted enormous
attention in precision medicine, mainly because it has the potential to
translate discoveries from genome-wide association studies (GWAS) into medical
advances. As the high-dimensional covariance matrix (or the linkage
disequilibrium (LD) pattern) of genetic variants has a block-diagonal
structure, many existing methods attempt to account for the dependence among
variants in predetermined local LD blocks/regions. Moreover, due to privacy
restrictions and data protection concerns, genetic variant dependence in each
LD block is typically estimated from external reference panels rather than the
original training dataset. This paper presents a unified analysis of block-wise
and reference panel-based estimators in a high-dimensional prediction framework
without sparsity restrictions. We find that, surprisingly, even when the
covariance matrix has a block-diagonal structure with well-defined boundaries,
block-wise estimation methods adjusting for local dependence can be
substantially less accurate than methods controlling for the whole covariance
matrix. Further, estimation methods built on the original training dataset and
external reference panels are likely to have varying performance in high
dimensions, which may reflect the cost of having only access to summary level
data from the training dataset. This analysis is based on our novel results
in random matrix theory for block-diagonal covariance matrices. We numerically
evaluate our results using extensive simulations and a large-scale UK Biobank
real data analysis of 36 complex traits.
Authors' comments: 27 pages, 5 figures
Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang
Polyphonic sound event localization and detection (SELD) aims at detecting
types of sound events with corresponding temporal activities and spatial
locations. In this paper, a track-wise ensemble event independent network with
a novel data augmentation method is proposed. The proposed model is based on
our previously proposed Event-Independent Network V2 and is extended by conformer
blocks and dense blocks. A track-wise ensemble model is proposed to address a
problem specific to ensembling with the track-wise output format: track
permutation may occur among different models. The data
augmentation approach contains several data augmentation chains, which are
composed of random combinations of several data augmentation operations. The
method also utilizes log-mel spectrograms, intensity vectors, and Spatial
Cues-Augmented Log-Spectrogram (SALSA) for different models. We evaluate our
proposed method in the Task of the L3DAS22 challenge and obtain the top-ranking
solution, with a location-dependent F-score of 0.699. Source code is
released.
Authors' comments: 6 pages, 2 figures, submitted to IEEE ICASSP 2022
Naoki Masuyama, Yusuke Nojima, Farhan Dawood, Zongying Liu
This paper proposes a supervised classification algorithm capable of
continual learning by utilizing an Adaptive Resonance Theory (ART)-based
growing self-organizing clustering algorithm. The ART-based clustering
algorithm is theoretically capable of continual learning, and the proposed
algorithm independently applies it to each class of training data for
generating classifiers. Whenever an additional training data set from a new
class is given, a new ART-based clustering will be defined in a different
learning space. Thanks to the above-mentioned features, the proposed algorithm
realizes continual learning capability. Simulation experiments showed that the
proposed algorithm has superior classification performance compared with
state-of-the-art clustering-based classification algorithms capable of
continual learning.
Authors' comments: This paper is currently under review. arXiv admin note: substantial
text overlap with arXiv:2201.10713
Yan Di, Ruida Zhang, Zhiqiang Lou, Fabian Manhardt, Xiangyang Ji, Nassir Navab, Federico Tombari
While 6D object pose estimation has recently made a huge leap forward, most
methods can still only handle a single or a handful of different objects, which
limits their applications. To circumvent this problem, category-level object
pose estimation has recently been revamped, which aims at predicting the 6D
pose as well as the 3D metric size for previously unseen instances from a given
set of object classes. This is, however, a much more challenging task due to
severe intra-class shape variations. To address this issue, we propose
GPV-Pose, a novel framework for robust category-level pose estimation,
harnessing geometric insights to enhance the learning of category-level
pose-sensitive features. First, we introduce a decoupled confidence-driven
rotation representation, which allows geometry-aware recovery of the associated
rotation matrix. Second, we propose a novel geometry-guided point-wise voting
paradigm for robust retrieval of the 3D object bounding box. Finally,
leveraging these different output streams, we can enforce several geometric
consistency terms, further increasing performance, especially for non-symmetric
categories. GPV-Pose produces superior results to state-of-the-art competitors
on common public benchmarks, whilst almost achieving real-time inference speed
at 20 FPS.
Authors' comments: CVPR 2022
Yangming Shi, Haisong Ding, Kai Chen, Qiang Huo
Style-guided text image generation tries to synthesize a text image by imitating a reference image's appearance while keeping the text content unaltered. The text image appearance includes many aspects. In this paper, we focus on transferring the style image's background and foreground color patterns to the content image to generate a photo-realistic text image. To achieve this goal, we propose 1) a content-style cross-attention-based pixel sampling approach to roughly mimic the style text image's background; 2) a pixel-wise style modulation technique to transfer varying color patterns of the style image to the content image in a spatially adaptive manner; 3) a cross-attention-based multi-scale style fusion approach to solve the text foreground misalignment issue between style and content images; 4) an image patch shuffling strategy to create style, content, and ground-truth image tuples for training. Experimental results on Chinese handwriting text image synthesis with the SCUT-HCCDoc and CASIA-OLHWDB datasets demonstrate that the proposed method can improve the quality of synthetic text images and make them more photo-realistic.
Man Luo, Kazuma Hashimoto, Semih Yavuz, Zhiwei Liu, Chitta Baral, Yingbo Zhou
While both extractive and generative readers have been successfully applied to the Question Answering (QA) task, little attention has been paid to comparing them systematically. Characterizing the strengths and weaknesses of the two readers is crucial not only for making a more informed reader selection in practice but also for developing a deeper understanding to foster further research on improving readers in a principled manner. Motivated by this goal, we make the first attempt to systematically compare extractive and generative readers for question answering. To be aligned with the state of the art, we explore nine transformer-based large pre-trained language models (PrLMs) as backbone architectures. Furthermore, we organize our findings under two main categories: (1) keeping the architecture invariant, and (2) varying the underlying PrLMs. Among several interesting findings, it is important to highlight that (1) the generative readers perform better in long-context QA, (2) the extractive readers perform better in short contexts while also showing better out-of-domain generalization, and (3) the encoder of encoder-decoder PrLMs (e.g., T5) turns out to be a strong extractive reader and outperforms the standard choice of encoder-only PrLMs (e.g., RoBERTa). We also study the effect of multi-task learning on the two types of readers while varying the underlying PrLMs, and perform qualitative and quantitative diagnosis to provide further insights into future directions in modeling better readers.
Tengpeng Li, Hanli Wang, Bin He, Chang Wen Chen
As a technically challenging topic, visual storytelling aims at generating an imaginative and coherent multi-sentence story from a group of relevant images. Existing methods often generate direct and rigid descriptions of apparent image-based contents, because they are not capable of exploring implicit information beyond images. Hence, these schemes cannot capture consistent dependencies from a holistic representation, impairing the generation of reasonable and fluent stories. To address these problems, a novel knowledge-enriched attention network with a group-wise semantic model is proposed. Three main novel components are designed and supported by substantial experiments to reveal practical advantages. First, a knowledge-enriched attention network is designed to extract implicit concepts from an external knowledge system, and these concepts are followed by a cascade cross-modal attention mechanism to characterize imaginative and concrete representations. Second, a group-wise semantic module with second-order pooling is developed to explore globally consistent guidance. Third, a unified one-stage story generation model with an encoder-decoder structure is proposed to simultaneously train and infer the knowledge-enriched attention network, group-wise semantic module and multi-modal story generation decoder in an end-to-end fashion. Substantial experiments on the popular Visual Storytelling dataset with both objective and subjective evaluation metrics demonstrate the superior performance of the proposed scheme as compared with other state-of-the-art methods.
Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang
In recent years, exploring effective sound separation (SSep) techniques to
improve overlapping sound event detection (SED) has attracted increasing
attention. Creating accurate separation signals to avoid catastrophic error
accumulation during SED model training is both important and challenging. In
this study, we first propose a novel selective pseudo-labeling approach, termed
SPL, to produce high-confidence separated target events from blind sound
separation outputs. These target events are then used to fine-tune the original
SED model that was pre-trained on the sound mixtures in a multi-objective learning
style. Then, to further leverage the SSep outputs, a class-wise discriminative
fusion is proposed to improve the final SED performances, by combining multiple
frame-level event predictions of both sound mixtures and their separated
signals. All experiments are performed on the public DCASE 2021 Task 4 dataset,
and results show that our approaches significantly outperform the official
baseline: the collar-based F1, PSDS1 and PSDS2 performances are improved from
44.3%, 37.3% and 54.9% to 46.5%, 44.5% and 75.4%, respectively.
Authors' comments: This article was submitted to Interspeech 2022
Haonan Dong, Jian Yao
Learning-based multi-view stereo (MVS) has achieved fine reconstructions on popular datasets. However, supervised learning methods require ground truth for training, which is hard to collect, especially for large-scale datasets. Although unsupervised learning methods have been proposed and have produced gratifying results, they still fail to reconstruct complete results in challenging scenes, such as weakly-textured surfaces, because they primarily depend on pixel-wise photometric consistency, which is affected by varying illumination. To alleviate matching ambiguity in those challenging scenes, this paper proposes robust loss functions leveraging constraints beneath multi-view images: 1) a patch-wise photometric consistency loss, which expands the receptive field of the features in multi-view similarity measuring, and 2) a robust two-view geometric consistency, which includes a cross-view depth consistency check with minimum occlusion. Our unsupervised strategy can be implemented with arbitrary depth estimation frameworks and can be trained with arbitrary large-scale MVS datasets. Experiments show that our method can decrease the matching ambiguity and particularly improve the completeness of weakly-textured reconstruction. Moreover, our method reaches the performance of the state-of-the-art methods on popular benchmarks, such as DTU, Tanks and Temples, and ETH3D. The code will be released soon.
Yiu-ming Cheung, Juyong Jiang, Feng Yu, Jian Lou
Despite enormous research interest and rapid application of federated learning (FL) to various areas, existing studies mostly focus on supervised federated learning under the horizontally partitioned local dataset setting. This paper studies unsupervised FL under the vertically partitioned dataset setting. Accordingly, we propose the federated principal component analysis for vertically partitioned dataset (VFedPCA) method, which reduces the dimensionality across the joint datasets over all the clients and extracts the principal component feature information for downstream data analysis. We further take advantage of nonlinear dimensionality reduction and propose the vertical federated advanced kernel principal component analysis (VFedAKPCA) method, which can effectively and collaboratively model the nonlinear nature existing in many real datasets. In addition, we study two communication topologies. The first is a server-client topology where a semi-trusted server coordinates the federated training, while the second is a fully decentralized topology which further eliminates the requirement of the server by allowing clients themselves to communicate with their neighbors. Extensive experiments conducted on five types of real-world datasets corroborate the efficacy of VFedPCA and VFedAKPCA under the vertically partitioned FL setting. Code is available at: https://github.com/juyongjiang/VFedPCA-VFedAKPCA
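To make the vertically partitioned setting above concrete, the following is a minimal sketch of a generic distributed power iteration in which each client holds different feature columns for the same samples. It is not the VFedPCA algorithm from the paper or its repository: the function names, the all-ones starting vector, and the toy data are assumptions for illustration, and the local blocks are assumed to be column-centered.

```python
import math


def local_projection(block, weights):
    """Client side: project local features onto local weights (X_i v_i)."""
    return [sum(x * w for x, w in zip(row, weights)) for row in block]


def local_update(block, scores):
    """Client side: back-project the shared sample scores (X_i^T u)."""
    n_features = len(block[0])
    return [sum(row[j] * s for row, s in zip(block, scores))
            for j in range(n_features)]


def vertical_power_iteration(client_blocks, iterations=100):
    """Leading principal direction of the joint data, kept split per client."""
    # Hypothetical starting point: all-ones local weight vectors.
    local_weights = [[1.0] * len(block[0]) for block in client_blocks]
    for _ in range(iterations):
        partials = [local_projection(b, w)
                    for b, w in zip(client_blocks, local_weights)]
        # Only per-sample scores are aggregated: u = sum_i X_i v_i.
        scores = [sum(values) for values in zip(*partials)]
        local_weights = [local_update(b, scores) for b in client_blocks]
        norm = math.sqrt(sum(w * w for ws in local_weights for w in ws))
        if norm == 0.0:
            raise ValueError("degenerate data or starting vector")
        local_weights = [[w / norm for w in ws] for ws in local_weights]
    return local_weights


# Toy data: two clients, one centered feature each, three shared samples.
client_a = [[-1.0], [0.0], [1.0]]
client_b = [[-2.0], [0.0], [2.0]]
weights_a, weights_b = vertical_power_iteration([client_a, client_b])
print(weights_a, weights_b)
```

In this sketch the raw feature columns never leave a client; only sample-level scores and a squared norm are shared, which is the structural property the vertical setting relies on.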
Chanyong Jung, Gihyun Kwon, Jong Chul Ye
Recently, contrastive learning-based image translation methods have been
proposed, which contrast different spatial locations to enhance the spatial
correspondence. However, these methods often ignore the diverse semantic
relations within the images. To address this, we propose a novel semantic relation
consistency (SRC) regularization along with decoupled contrastive learning,
which utilize the diverse semantics by focusing on the heterogeneous semantics
between the image patches of a single image. To further improve the
performance, we present a hard negative mining scheme that exploits the semantic
relation. We verified our method on three tasks: single-modal and multi-modal
image translation, and GAN compression for image translation.
Experimental results confirmed the state-of-the-art performance of our method in
all three tasks.
Authors' comments: CVPR 2022
Zhi-Yuan Zhang, Di Liu
Recent works reveal that re-calibrating the intermediate activation of adversarial examples can improve the adversarial robustness of a CNN model. The state-of-the-art methods [Bai et al., 2021] and [Yan et al., 2021] explore this feature at the channel level, i.e., the activation of a channel is uniformly scaled by a factor. In this paper, we investigate intermediate activation manipulation at a more fine-grained level. Instead of uniformly scaling the activation, we individually adjust each element within an activation and thus propose Element-Wise Activation Scaling, dubbed EWAS, to improve CNNs' adversarial robustness. Experimental results on ResNet-18 and WideResNet with CIFAR10 and SVHN show that EWAS significantly improves the robustness accuracy. In particular, for ResNet-18 on CIFAR10, EWAS increases the adversarial accuracy by 37.65% to 82.35% against the C&W attack. EWAS is simple yet very effective in terms of improving robustness. The code is anonymously available at https://anonymous.4open.science/r/EWAS-DD64.
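The contrast drawn above between channel-level and element-level scaling can be sketched as follows. The activation is represented as a nested list of shape channels x height x width, and the scaling factors are supplied directly; how EWAS obtains its factors is not described in the abstract, so the factor values and function names here are purely hypothetical.

```python
def channel_wise_scale(activation, channel_factors):
    """Scale every element of a channel by that channel's single factor."""
    return [[[value * factor for value in row] for row in channel]
            for channel, factor in zip(activation, channel_factors)]


def element_wise_scale(activation, element_factors):
    """Scale each element by its own factor (same shape as the activation)."""
    return [[[value * factor for value, factor in zip(row, factor_row)]
             for row, factor_row in zip(channel, factor_channel)]
            for channel, factor_channel in zip(activation, element_factors)]


# Hypothetical 1 x 2 x 2 activation and hand-picked factors.
activation = [[[1.0, 2.0], [3.0, 4.0]]]
print(channel_wise_scale(activation, [0.5]))
print(element_wise_scale(activation, [[[1.0, 0.0], [0.5, 2.0]]]))
```

Channel-wise scaling is the special case of element-wise scaling in which all factors within a channel are equal, which is why the element-wise form is the more fine-grained of the two.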
Dazhao Du, Bing Su, Zhewei Wei
Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since the corresponding key and value are transformed from the same point. This paper proposes a predictive Transformer-based model called {\em Preformer}. Preformer introduces a novel efficient {\em Multi-Scale Segment-Correlation} mechanism that divides time series into segments and utilizes segment-wise correlation-based attention for encoding time series. A multi-scale structure is developed to aggregate dependencies at different temporal scales and facilitate the selection of segment length. Preformer further designs a predictive paradigm for decoding, where the key and value come from two successive segments rather than the same segment. In this way, if a key segment has a high correlation score with the query segment, its successive segment contributes more to the prediction of the query segment. Extensive experiments demonstrate that our Preformer outperforms other Transformer-based methods.
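The predictive paradigm described above, in which a key segment's score weights the contribution of its successor segment, can be sketched in a heavily simplified form. This is not the Preformer implementation: it has no learned projections and no multi-scale structure, it uses a scaled dot product as the correlation score, and the function names and toy series are assumptions for illustration.

```python
import math


def split_segments(series, length):
    """Cut a series into consecutive full segments, dropping any remainder."""
    count = len(series) // length
    return [series[i * length:(i + 1) * length] for i in range(count)]


def predictive_segment_attention(series, length):
    """Forecast the next segment from successors of well-matching segments."""
    segments = split_segments(series, length)
    if len(segments) < 2:
        raise ValueError("need at least two full segments")
    query = segments[-1]      # most recent segment
    keys = segments[:-1]      # earlier segments
    values = segments[1:]     # each key's successor segment
    # Segment-wise correlation score (scaled dot product, an assumption).
    scores = [sum(q * k for q, k in zip(query, key)) / length for key in keys]
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * value[t] for w, value in zip(weights, values))
            for t in range(length)]


# Toy series alternating between the segments [1, 0] and [0, 1].
series = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
forecast = predictive_segment_attention(series, 2)
print(forecast)
```

Because the query segment `[1, 0]` correlates most with earlier `[1, 0]` segments, their successors `[0, 1]` receive the larger weights, so the forecast leans toward `[0, 1]`.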
Lyu Bing, Wu Qingwen, Yan Zhen, Yu Wenfei, Liu Hao
The discovery of changing-look active galactic nuclei (CLAGNs) with the
significant change of optical broad emission lines (optical CLAGNs) and/or
strong variation of line-of-sight column densities (X-ray CLAGNs) challenges
the orientation-based AGN unification model. We explore mid-infrared (mid-IR)
properties for a sample of 57 optical CLAGNs and 11 X-ray CLAGNs based on the
{\it Wide-field Infrared Survey Explorer} ({\it WISE}) archive data. We find
that Eddington-scaled mid-IR luminosities of both optical and X-ray CLAGNs stay
just between low-luminosity AGNs (LLAGNs) and luminous QSOs. The average
Eddington-scaled mid-IR luminosities for optical and X-ray CLAGNs are $\sim
0.4$\% and $\sim 0.5$\%, respectively, which roughly correspond to the bolometric
luminosity of transition between a radiatively inefficient accretion flow
(RIAF) and Shakura-Sunyaev disk (SSD). We estimate the time lags of the
variation in the mid-IR behind that in the optical band for 13 CLAGNs with
strong mid-IR variability, where the tight correlation between the time lag and
the bolometric luminosity ($\tau - L$) for CLAGNs roughly follows that found in
the luminous QSOs.
Authors' comments: 18 pages, accepted in APJ
Mohammad Khalooei, Mohammad Mehdi Homayounpour, Maryam Amirmazlaghani
Deep neural network models are used today in various applications of
artificial intelligence, and strengthening them against adversarial
attacks is of particular importance. An appropriate solution to adversarial
attacks is adversarial training, which reaches a trade-off between robustness
and generalization. This paper introduces a novel framework (Layer
Sustainability Analysis (LSA)) for the analysis of layer vulnerability in an
arbitrary neural network in the scenario of adversarial attacks. LSA can be a
helpful toolkit to assess deep neural networks and to extend the adversarial
training approaches towards improving the sustainability of model layers via
layer monitoring and analysis. The LSA framework identifies a list of Most
Vulnerable Layers (MVL list) of the given network. The relative error, as a
comparison measure, is used to evaluate representation sustainability of each
layer against adversarial inputs. The proposed approach for obtaining robust
neural networks to fend off adversarial attacks is based on a layer-wise
regularization (LR) over LSA proposal(s) for adversarial training (AT); i.e.
the AT-LR procedure. AT-LR could be used with any benchmark adversarial attack
to reduce the vulnerability of network layers and to improve conventional
adversarial training approaches. The proposed idea performs well theoretically
and experimentally for state-of-the-art multilayer perceptron and convolutional
neural network architectures. Comparing AT-LR with its corresponding
base adversarial training, the classification accuracy under more significant
perturbations increased by 16.35%, 21.79%, and 10.730% on Moon, MNIST, and
CIFAR-10 benchmark datasets, respectively. The LSA framework is available and
published at https://github.com/khalooei/LSA.
Authors' comments: Layers Sustainability Analysis (LSA) framework
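The layer-wise relative error used as a comparison measure in the abstract above can be sketched as follows. The choice of the Euclidean norm, the top-k selection rule for the vulnerable layers, and all names and activation values are assumptions made for this illustration; they are not taken from the LSA repository.

```python
import math


def relative_error(clean, perturbed):
    """Relative change of a layer output, using the Euclidean norm (assumed)."""
    difference = math.sqrt(sum((p - c) ** 2 for c, p in zip(clean, perturbed)))
    baseline = math.sqrt(sum(c * c for c in clean))
    if baseline == 0.0:
        return 0.0 if difference == 0.0 else float("inf")
    return difference / baseline


def most_vulnerable_layers(clean_outputs, adversarial_outputs, top_k):
    """Rank layers by relative error and keep the top_k (hypothetical rule)."""
    errors = {name: relative_error(clean_outputs[name], adversarial_outputs[name])
              for name in clean_outputs}
    ranked = sorted(errors, key=errors.get, reverse=True)
    return ranked[:top_k]


# Hypothetical flattened layer outputs for one clean and one adversarial input.
clean_outputs = {"fc1": [3.0, 4.0], "fc2": [1.0, 0.0]}
adversarial_outputs = {"fc1": [3.0, 4.5], "fc2": [1.0, 1.0]}
print(most_vulnerable_layers(clean_outputs, adversarial_outputs, top_k=1))
```

A layer whose representation moves a lot relative to its clean magnitude ranks higher, which matches the intuition of a less sustainable layer under adversarial inputs.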
Pedro R. A. S. Bassi, Sergio S. J. Dertkigil, Andrea Cavalli
Features in images' backgrounds can spuriously correlate with the images'
classes, representing background bias. They can influence the classifier's
decisions, causing shortcut learning (Clever Hans effect). The phenomenon
generates deep neural networks (DNNs) that perform well on standard evaluation
datasets but generalize poorly to real-world data. Layer-wise Relevance
Propagation (LRP) explains DNNs' decisions. Here, we show that the optimization
of LRP heatmaps can minimize the background bias influence on deep classifiers,
hindering shortcut learning. By not increasing run-time computational cost, the
approach is light and fast. Furthermore, it applies to virtually any
classification architecture. After injecting synthetic bias in images'
backgrounds, we compared our approach (dubbed ISNet) to eight state-of-the-art
DNNs, quantitatively demonstrating its superior robustness to background bias.
Mixed datasets are common for COVID-19 and tuberculosis classification with
chest X-rays, fostering background bias. By focusing on the lungs, the ISNet
reduced shortcut learning. Thus, its generalization performance on external
(out-of-distribution) test databases significantly surpassed all implemented
benchmark models.
Authors' comments: Text and style improvements. Included reference to the published
article (Nature Communications, https://doi.org/10.1038/s41467-023-44371-z)
Chau Tran, Pedro Cisneros-Velarde, Sang-Yun Oh, Alexander Petersen
Recently, a special case of precision matrix estimation based on a distributionally robust optimization (DRO) framework has been shown to be equivalent to the graphical lasso. From this formulation, a method for choosing the regularization term, i.e., for graphical model selection, was proposed. In this work, we establish a theoretical connection between the confidence level of graphical model selection via the DRO formulation and the asymptotic family-wise error rate of estimating false edges. Simulation experiments and real data analyses illustrate the utility of the asymptotic family-wise error rate control behavior even in finite samples.
Piotr Garbaczewski, Mariusz Żaba
We discuss the impact of various (path-wise) reflection-from-the-barrier
scenarios upon confining properties of a paradigmatic family of symmetric
$\alpha$-stable L\'{e}vy processes, whose permanent residence in a finite
interval on a line is secured by a two-sided reflection. Depending on the
specific reflection "mechanism", the inferred jump-type processes differ in
their spectral and statistical characteristics, e.g. relaxation
properties and functional shapes of invariant (equilibrium, or asymptotic
near-equilibrium) probability density functions in the interval. The analysis
is carried out in conjunction with attempts to give meaning to the notion of a
reflecting L\'{e}vy process, in terms of the domain of its motion generator, to
which an invariant pdf (actually an eigenfunction) belongs.
Authors' comments: 20 pp, 8 figures, Text amendments, Abstract and Section I modified
Ian Mason, Sebastian Starke, Taku Komura
Controlling the manner in which a character moves in a real-time animation system is a challenging task with useful applications. Existing style transfer systems require access to a reference content motion clip; however, in real-time systems the future motion content is unknown and liable to change with user input. In this work we present a style modelling system that uses an animation synthesis network to model motion content based on local motion phases. An additional style modulation network uses feature-wise transformations to modulate style in real-time. To evaluate our method, we create and release a new style modelling dataset, 100STYLE, containing over 4 million frames of stylised locomotion data in 100 different styles that present a number of challenges for existing systems. To model these styles, we extend the local phase calculation with a contact-free formulation. In comparison to other methods for real-time style modelling, we show our system is more robust and efficient in its style representation while improving motion quality.