Ze Wang, Guogang Liao, Xiaowen Shi, Xiaoxu Wu, Chuheng Zhang, Yongkang Wang, Xingxing Wang, Dong Wang
With the recent prevalence of reinforcement learning (RL), there has been
tremendous interest in utilizing RL for ads allocation in recommendation
platforms (e.g., e-commerce and news feed sites). To achieve better allocation,
the input of recent RL-based ads allocation methods is upgraded from point-wise
single item to list-wise item arrangement. However, this also results in a
high-dimensional space of state-action pairs, making it difficult to learn
list-wise representations with good generalization ability. This further
hinders the exploration of RL agents and causes poor sample efficiency. To
address this problem, we propose a novel RL-based approach for ads allocation
which learns better list-wise representations by leveraging task-specific
signals on the Meituan food delivery platform. Specifically, we propose three
different auxiliary tasks based on reconstruction, prediction, and contrastive
learning respectively according to prior domain knowledge on ads allocation. We
conduct extensive experiments on the Meituan food delivery platform to evaluate the
effectiveness of the proposed auxiliary tasks. Both offline and online
experimental results show that the proposed method can learn better list-wise
representations and achieve higher revenue for the platform compared to the
state-of-the-art baselines.
Authors' comments: Accepted by CIKM-22
Arshdeep Singh
This paper presents an alternative representation framework to the commonly used
time-frequency representation for acoustic scene classification (ASC). A raw
audio signal is represented using a pre-trained convolutional neural network
(CNN) using its various intermediate layers. The study assumes that the
representations obtained from the intermediate layers are intrinsically
low-dimensional. To obtain low-dimensional embeddings, principal component
analysis is performed, and the study finds that only a few principal
components are significant. However, the appropriate number of significant
components is not known. To address this, an automatic dictionary learning
framework is utilized that approximates the underlying subspace. Further, the
low-dimensional embeddings are aggregated in a late-fusion manner in the
ensemble framework to incorporate hierarchical information learned at various
intermediate layers. The experimental evaluation is performed on publicly
available DCASE 2017 and 2018 ASC datasets on a pre-trained 1-D CNN, SoundNet.
Empirically, it is observed that deeper layers show a higher compression ratio than
others. At a 70% compression ratio across different datasets, the performance is
similar to that obtained without performing any dimensionality reduction. The
proposed framework outperforms the time-frequency representation based methods.
Zhifang Fan, Dan Ou, Yulong Gu, Bairan Fu, Xiang Li, Wentian Bao, Xin-Yu Dai, Xiaoyi Zeng et al.
Modeling users' historical feedback is essential for Click-Through Rate Prediction in personalized search and recommendation. Existing methods usually model only users' positive feedback, such as click sequences, which neglects the context information of the feedback. In this paper, we propose a new perspective for context-aware user behavior modeling by including the whole set of page-wise exposed products and the corresponding feedback as a contextualized page-wise feedback sequence. The intra-page context information and inter-page interest evolution can be captured to learn more specific user preferences. We design a novel neural ranking model, RACP (i.e., Recurrent Attention over Contextualized Page sequence), which utilizes page-context aware attention to model the intra-page context. A recurrent attention process is used to model the cross-page interest convergence evolution as denoising the interest in the previous pages. Experiments on public and real-world industrial datasets verify our model's effectiveness.
Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji, Andros Tjandra, Sakriani Sakti
Several solutions for lightweight TTS have shown promising results. Still,
they either rely on a hand-crafted design that reaches a suboptimal size or use
neural architecture search but often incur high training costs. We present
Nix-TTS, a lightweight TTS achieved via knowledge distillation to a
high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free)
TTS teacher model. Specifically, we offer module-wise distillation, enabling
flexible and independent distillation to the encoder and decoder modules. The
resulting Nix-TTS inherits the advantageous properties of being
non-autoregressive and end-to-end from the teacher, yet is significantly smaller
in size, with only 5.23M parameters or up to 89.34% reduction of the teacher
model; it also achieves over 3.04x and 8.36x inference speedup on an Intel i7 CPU
and a Raspberry Pi 3B respectively and still retains fair voice naturalness and
intelligibility compared to the teacher model. We provide pretrained models and
audio samples of Nix-TTS.
Authors' comments: Accepted at SLT 2022 (https://slt2022.org/). Associated materials are
available at https://github.com/rendchevi/nix-tts
Minghao Chen, Fangyun Wei, Chong Li, Deng Cai
Prior works on action representation learning mainly focus on designing
various architectures to extract the global representations for short video
clips. In contrast, many practical applications such as video alignment have
strong demand for learning dense representations for long videos. In this
paper, we introduce a novel contrastive action representation learning (CARL)
framework to learn frame-wise action representations, especially for long
videos, in a self-supervised manner. Concretely, we introduce a simple yet
efficient video encoder that considers spatio-temporal context to extract
frame-wise representations. Inspired by the recent progress of self-supervised
learning, we present a novel sequence contrastive loss (SCL) applied on two
correlated views obtained through a series of spatio-temporal data
augmentations. SCL optimizes the embedding space by minimizing the
KL-divergence between the sequence similarity of two augmented views and a
prior Gaussian distribution of timestamp distance. Experiments on FineGym,
PennAction and Pouring datasets show that our method outperforms the previous
state of the art by a large margin for downstream fine-grained action
classification. Surprisingly, despite not being trained on paired videos, our
approach also shows outstanding performance on video alignment and fine-grained
frame retrieval tasks. Code and models are available at
https://github.com/minghchen/CARL_code.
Authors' comments: Accepted by CVPR 2022
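The sequence contrastive loss in the CARL abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the KL direction (prior relative to predicted similarity), the cosine similarity, the softmax normalization, and the names `sigma` and `tau` are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sequence_contrastive_loss(view1, view2, t1, t2, sigma=1.0, tau=0.1):
    """Mean KL(prior || predicted) over the frames of view1.

    view1, view2: lists of frame embeddings from two augmented views.
    t1, t2: timestamps of those frames in the original video.
    sigma, tau: hypothetical prior width and softmax temperature.
    """
    total = 0.0
    for z, s in zip(view1, t1):
        # Predicted distribution: similarity of this frame to every frame of view2.
        pred = softmax([cosine(z, w) / tau for w in view2])
        # Prior distribution: Gaussian in the timestamp distance, normalized.
        prior = softmax([-((s - r) ** 2) / (2 * sigma ** 2) for r in t2])
        total += sum(p * math.log(p / q) for p, q in zip(prior, pred) if p > 0)
    return total / len(view1)
```

Under these assumptions, embeddings whose similarity pattern follows temporal proximity yield a lower loss than embeddings matched to temporally distant frames.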
Yuri Lavinas, Marcelo Ladeira, Gabriela Ochoa, Claus Aranha
The performance of multiobjective algorithms varies across problems, making it hard to develop new algorithms or apply existing ones to new problems. To simplify the development and application of new multiobjective algorithms, there has been increasing interest in their automatic design from component parts. These automatically designed metaheuristics can outperform their human-developed counterparts. However, it is still uncertain which components are most influential in their performance improvement. This study introduces a new methodology to investigate the effects of the final configuration of an automatically designed algorithm. We apply this methodology to a well-performing Multiobjective Evolutionary Algorithm Based on Decomposition (MOEA/D) designed by the irace package on nine constrained problems. We then contrast the impact of the algorithm components in terms of their Search Trajectory Networks (STNs), the diversity of the population, and the hypervolume. Our results indicate that the most influential components were the restart and update strategies, with higher increments in performance and more distinct metric values. Also, their relative influence depends on the problem difficulty: not using the restart strategy was more influential in problems where MOEA/D performs better, while the update strategy was more influential in problems where MOEA/D performs worst.
Bingxin Zhao, Shurong Zheng, Hongtu Zhu
Genetic prediction of complex traits and diseases has attracted enormous
attention in precision medicine, mainly because it has the potential to
translate discoveries from genome-wide association studies (GWAS) into medical
advances. As the high dimensional covariance matrix (or the linkage
disequilibrium (LD) pattern) of genetic variants has a block-diagonal
structure, many existing methods attempt to account for the dependence among
variants in predetermined local LD blocks/regions. Moreover, due to privacy
restrictions and data protection concerns, genetic variant dependence in each
LD block is typically estimated from external reference panels rather than the
original training dataset. This paper presents a unified analysis of block-wise
and reference panel-based estimators in a high-dimensional prediction framework
without sparsity restrictions. We find that, surprisingly, even when the
covariance matrix has a block-diagonal structure with well-defined boundaries,
block-wise estimation methods adjusting for local dependence can be
substantially less accurate than methods controlling for the whole covariance
matrix. Further, estimation methods built on the original training dataset and
external reference panels are likely to have varying performance in high
dimensions, which may reflect the cost of having only access to summary level
data from the training dataset. This analysis is based on our novel results
in random matrix theory for block-diagonal covariance matrices. We numerically
evaluate our results using extensive simulations and the large-scale UK Biobank
real data analysis of 36 complex traits.
Authors' comments: 27 pages, 5 figures
Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang
Polyphonic sound event localization and detection (SELD) aims at detecting
types of sound events with corresponding temporal activities and spatial
locations. In this paper, a track-wise ensemble event independent network with
a novel data augmentation method is proposed. The proposed model is based on
our previously proposed Event-Independent Network V2 and is extended by conformer
blocks and dense blocks. The track-wise ensemble model addresses a problem that
arises when ensembling models with a track-wise output format: track
permutation may occur among different models. The data
augmentation approach contains several data augmentation chains, which are
composed of random combinations of several data augmentation operations. The
method also utilizes log-mel spectrograms, intensity vectors, and Spatial
Cues-Augmented Log-Spectrogram (SALSA) for different models. We evaluate our
proposed method in the Task of the L3DAS22 challenge and obtain the top-ranking
solution with a location-dependent F-score of 0.699. Source code is
released.
Authors' comments: 6 pages, 2 figures, submitted to IEEE ICASSP 2022
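The track permutation problem mentioned in the SELD abstract above can be sketched as follows. This is only an illustration, not the authors' method: it assumes that each model emits one vector per track, that the first model serves as the reference, and that tracks are matched by minimum squared distance before averaging. The function names are hypothetical.

```python
from itertools import permutations

def align_tracks(reference, other):
    """Reorder the tracks of `other` to best match `reference`.

    Tries every permutation, which is only practical for a small
    number of tracks.
    """
    n = len(reference)

    def cost(perm):
        return sum(
            sum((a - b) ** 2 for a, b in zip(reference[i], other[perm[i]]))
            for i in range(n)
        )

    best = min(permutations(range(n)), key=cost)
    return [other[j] for j in best]

def ensemble_tracks(model_outputs):
    """Average track-wise outputs after aligning each model to the first."""
    reference = model_outputs[0]
    aligned = [reference] + [align_tracks(reference, o) for o in model_outputs[1:]]
    n_models = len(aligned)
    return [
        [sum(m[t][k] for m in aligned) / n_models for k in range(len(reference[t]))]
        for t in range(len(reference))
    ]
```

Without the alignment step, averaging two models that report the same events on swapped tracks would blur both tracks.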
Naoki Masuyama, Yusuke Nojima, Farhan Dawood, Zongying Liu
This paper proposes a supervised classification algorithm capable of
continual learning by utilizing an Adaptive Resonance Theory (ART)-based
growing self-organizing clustering algorithm. The ART-based clustering
algorithm is theoretically capable of continual learning, and the proposed
algorithm independently applies it to each class of training data for
generating classifiers. Whenever an additional training data set from a new
class is given, a new ART-based clustering will be defined in a different
learning space. Thanks to the above-mentioned features, the proposed algorithm
realizes continual learning capability. Simulation experiments showed that the
proposed algorithm has superior classification performance compared with
state-of-the-art clustering-based classification algorithms capable of
continual learning.
Authors' comments: This paper is currently under review. arXiv admin note: substantial
text overlap with arXiv:2201.10713
Yan Di, Ruida Zhang, Zhiqiang Lou, Fabian Manhardt, Xiangyang Ji, Nassir Navab, Federico Tombari
While 6D object pose estimation has recently made a huge leap forward, most
methods can still only handle a single or a handful of different objects, which
limits their applications. To circumvent this problem, category-level object
pose estimation has recently been revamped, which aims at predicting the 6D
pose as well as the 3D metric size for previously unseen instances from a given
set of object classes. This is, however, a much more challenging task due to
severe intra-class shape variations. To address this issue, we propose
GPV-Pose, a novel framework for robust category-level pose estimation,
harnessing geometric insights to enhance the learning of category-level
pose-sensitive features. First, we introduce a decoupled confidence-driven
rotation representation, which allows geometry-aware recovery of the associated
rotation matrix. Second, we propose a novel geometry-guided point-wise voting
paradigm for robust retrieval of the 3D object bounding box. Finally,
leveraging these different output streams, we can enforce several geometric
consistency terms, further increasing performance, especially for non-symmetric
categories. GPV-Pose produces superior results to state-of-the-art competitors
on common public benchmarks, whilst almost achieving real-time inference speed
at 20 FPS.
Authors' comments: CVPR 2022
Yangming Shi, Haisong Ding, Kai Chen, Qiang Huo
Style-guided text image generation tries to synthesize a text image by imitating a reference image's appearance while keeping the text content unaltered. The text image appearance includes many aspects. In this paper, we focus on transferring the style image's background and foreground color patterns to the content image to generate a photo-realistic text image. To achieve this goal, we propose 1) a content-style cross attention based pixel sampling approach to roughly mimic the style text image's background; 2) a pixel-wise style modulation technique to transfer varying color patterns of the style image to the content image in a spatially adaptive manner; 3) a cross attention based multi-scale style fusion approach to solve the text foreground misalignment issue between style and content images; 4) an image patch shuffling strategy to create style, content and ground truth image tuples for training. Experimental results on Chinese handwriting text image synthesis with SCUT-HCCDoc and CASIA-OLHWDB datasets demonstrate that the proposed method can improve the quality of synthetic text images and make them more photo-realistic.
Man Luo, Kazuma Hashimoto, Semih Yavuz, Zhiwei Liu, Chitta Baral, Yingbo Zhou
While both extractive and generative readers have been successfully applied to the Question Answering (QA) task, little attention has been paid toward the systematic comparison of them. Characterizing the strengths and weaknesses of the two readers is crucial not only for making a more informed reader selection in practice but also for developing a deeper understanding to foster further research on improving readers in a principled manner. Motivated by this goal, we make the first attempt to systematically study the comparison of extractive and generative readers for question answering. To be aligned with the state-of-the-art, we explore nine transformer-based large pre-trained language models (PrLMs) as backbone architectures. Furthermore, we organize our findings under two main categories: (1) keeping the architecture invariant, and (2) varying the underlying PrLMs. Among several interesting findings, it is important to highlight that (1) the generative readers perform better in long context QA, (2) the extractive readers perform better in short context while also showing better out-of-domain generalization, and (3) the encoder of encoder-decoder PrLMs (e.g., T5) turns out to be a strong extractive reader and outperforms the standard choice of encoder-only PrLMs (e.g., RoBERTa). We also study the effect of multi-task learning on the two types of readers varying the underlying PrLMs and perform qualitative and quantitative diagnosis to provide further insights into future directions in modeling better readers.
Tengpeng Li, Hanli Wang, Bin He, Chang Wen Chen
As a technically challenging topic, visual storytelling aims at generating an imaginative and coherent multi-sentence narrative story from a group of relevant images. Existing methods often generate direct and rigid descriptions of apparent image-based contents, because they are not capable of exploring implicit information beyond images. Hence, these schemes cannot capture consistent dependencies from a holistic representation, impairing the generation of reasonable and fluent stories. To address these problems, a novel knowledge-enriched attention network with a group-wise semantic model is proposed. Three main novel components are designed and supported by substantial experiments to reveal practical advantages. First, a knowledge-enriched attention network is designed to extract implicit concepts from an external knowledge system, and these concepts are followed by a cascade cross-modal attention mechanism to characterize imaginative and concrete representations. Second, a group-wise semantic module with second-order pooling is developed to explore the globally consistent guidance. Third, a unified one-stage story generation model with an encoder-decoder structure is proposed to simultaneously train and infer the knowledge-enriched attention network, group-wise semantic module and multi-modal story generation decoder in an end-to-end fashion. Substantial experiments on the popular Visual Storytelling dataset with both objective and subjective evaluation metrics demonstrate the superior performance of the proposed scheme as compared with other state-of-the-art methods.
Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang
In recent years, exploring effective sound separation (SSep) techniques to
improve overlapping sound event detection (SED) has attracted more and more
attention. Creating accurate separation signals to avoid the catastrophic error
accumulation during SED model training is very important and challenging. In
this study, we first propose a novel selective pseudo-labeling approach, termed
SPL, to produce high confidence separated target events from blind sound
separation outputs. These target events are then used to fine-tune the original
SED model that was pre-trained on the sound mixtures in a multi-objective learning
style. Then, to further leverage the SSep outputs, a class-wise discriminative
fusion is proposed to improve the final SED performances, by combining multiple
frame-level event predictions of both sound mixtures and their separated
signals. All experiments are performed on the public DCASE 2021 Task 4 dataset,
and results show that our approaches significantly outperform the official
baseline: the collar-based F1, PSDS1 and PSDS2 performances are improved from
44.3%, 37.3% and 54.9% to 46.5%, 44.5% and 75.4%, respectively.
Authors' comments: This article was submitted to Interspeech 2022
Haonan Dong, Jian Yao
Learning-based multi-view stereo (MVS) has achieved fine reconstructions on popular datasets. However, supervised learning methods require ground truth for training, which is hard to collect, especially for large-scale datasets. Although unsupervised learning methods have been proposed and have achieved gratifying results, those methods still fail to reconstruct intact results in challenging scenes, such as weakly-textured surfaces, because they primarily depend on pixel-wise photometric consistency, which is sensitive to varying illumination. To alleviate matching ambiguity in those challenging scenes, this paper proposes robust loss functions leveraging constraints beneath multi-view images: 1) a patch-wise photometric consistency loss, which expands the receptive field of the features in multi-view similarity measuring, and 2) a robust two-view geometric consistency, which includes a cross-view depth consistency check with the minimum occlusion. Our unsupervised strategy can be implemented with arbitrary depth estimation frameworks and can be trained with arbitrary large-scale MVS datasets. Experiments show that our method can decrease the matching ambiguity and particularly improve the completeness of weakly-textured reconstruction. Moreover, our method reaches the performance of the state-of-the-art methods on popular benchmarks, such as DTU, Tanks and Temples, and ETH3D. The code will be released soon.
Yiu-ming Cheung, Juyong Jiang, Feng Yu, Jian Lou
Despite enormous research interest and rapid application of federated learning (FL) to various areas, existing studies mostly focus on supervised federated learning under the horizontally partitioned local dataset setting. This paper studies unsupervised FL under the vertically partitioned dataset setting. Accordingly, we propose the federated principal component analysis for vertically partitioned datasets (VFedPCA) method, which reduces the dimensionality across the joint datasets over all the clients and extracts the principal component feature information for downstream data analysis. We further take advantage of nonlinear dimensionality reduction and propose the vertical federated advanced kernel principal component analysis (VFedAKPCA) method, which can effectively and collaboratively model the nonlinear nature existing in many real datasets. In addition, we study two communication topologies. The first is a server-client topology where a semi-trusted server coordinates the federated training, while the second is a fully decentralized topology which further eliminates the requirement of the server by allowing clients themselves to communicate with their neighbors. Extensive experiments conducted on five types of real-world datasets corroborate the efficacy of VFedPCA and VFedAKPCA under the vertically partitioned FL setting. Code is available at: https://github.com/juyongjiang/VFedPCA-VFedAKPCA
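To make the vertically partitioned setting in the abstract above concrete, the sketch below runs a power iteration for the leading principal direction when each client holds a different block of feature columns for the same samples. It is a generic illustration under stated assumptions and should not be read as the VFedPCA protocol: the data are assumed to be centered, a coordinator is assumed to sum the per-client partial products, and the function names are hypothetical.

```python
import math

def matvec(block, v):
    # Local product X_k v_k: one value per sample.
    return [sum(x * w for x, w in zip(row, v)) for row in block]

def rmatvec(block, u):
    # Local product X_k^T u: one value per local feature.
    return [
        sum(row[j] * u[i] for i, row in enumerate(block))
        for j in range(len(block[0]))
    ]

def vertical_power_iteration(client_blocks, iters=100):
    """Leading eigenvector of X^T X for X = [X_1 | X_2 | ...].

    Clients exchange only per-sample partial sums and a scalar norm,
    never raw feature columns. The all-ones start must not be
    orthogonal to the leading direction.
    """
    vs = [[1.0] * len(block[0]) for block in client_blocks]
    for _ in range(iters):
        # Each client computes X_k v_k; the coordinator sums them to get X v.
        partials = [matvec(b, v) for b, v in zip(client_blocks, vs)]
        u = [sum(col) for col in zip(*partials)]
        # Each client updates its own slice of the direction: v_k = X_k^T u.
        vs = [rmatvec(b, u) for b in client_blocks]
        norm = math.sqrt(sum(w * w for v in vs for w in v))
        vs = [[w / norm for w in v] for v in vs]
    return vs
```

Each returned list is the slice of the joint principal direction that belongs to one client's features.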
Chanyong Jung, Gihyun Kwon, Jong Chul Ye
Recently, contrastive learning-based image translation methods have been
proposed, which contrast different spatial locations to enhance the spatial
correspondence. However, the methods often ignore the diverse semantic relations
within the images. To address this, here we propose a novel semantic relation
consistency (SRC) regularization along with decoupled contrastive learning,
which utilizes the diverse semantics by focusing on the heterogeneous semantics
between the image patches of a single image. To further improve the
performance, we present a hard negative mining scheme that exploits the semantic
relation. We verified our method on three tasks: single-modal and multi-modal
image translation, and GAN compression for image translation.
Experimental results confirmed the state-of-the-art performance of our method in
all three tasks.
Authors' comments: CVPR 2022
Zhi-Yuan Zhang, Di Liu
Recent works reveal that re-calibrating the intermediate activation of adversarial examples can improve the adversarial robustness of a CNN model. The state-of-the-art methods [Bai et al., 2021] and [Yan et al., 2021] explore this feature at the channel level, i.e., the activation of a channel is uniformly scaled by a factor. In this paper, we investigate intermediate activation manipulation at a more fine-grained level. Instead of uniformly scaling the activation, we individually adjust each element within an activation and thus propose Element-Wise Activation Scaling, dubbed EWAS, to improve CNNs' adversarial robustness. Experimental results on ResNet-18 and WideResNet with CIFAR10 and SVHN show that EWAS significantly improves the robustness accuracy. Especially for ResNet-18 on CIFAR10, EWAS increases the adversarial accuracy by 37.65% to 82.35% against the C&W attack. EWAS is simple yet very effective in terms of improving robustness. The code is anonymously available at https://anonymous.4open.science/r/EWAS-DD64.
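The difference between channel-level and element-level scaling described in the abstract above can be shown with a small sketch. How EWAS actually produces its scaling factors is not stated in the abstract, so the factors here are plain inputs, and the function names are hypothetical.

```python
def channel_wise_scale(activation, factors):
    """Scale every element of channel c by the single factor factors[c]."""
    return [[x * f for x in channel] for channel, f in zip(activation, factors)]

def element_wise_scale(activation, factors):
    """Scale each element by its own factor; factors has the activation's shape."""
    return [
        [x * f for x, f in zip(channel, channel_factors)]
        for channel, channel_factors in zip(activation, factors)
    ]
```

For an activation with two channels of two elements each, channel-wise scaling takes two factors, while element-wise scaling takes four and can therefore suppress one element of a channel while amplifying another.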
Dazhao Du, Bing Su, Zhewei Wei
Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since the corresponding key and value are transformed from the same point. This paper proposes a predictive Transformer-based model called Preformer. Preformer introduces a novel efficient Multi-Scale Segment-Correlation mechanism that divides time series into segments and utilizes segment-wise correlation-based attention for encoding time series. A multi-scale structure is developed to aggregate dependencies at different temporal scales and facilitate the selection of segment length. Preformer further designs a predictive paradigm for decoding, where the key and value come from two successive segments rather than the same segment. In this way, if a key segment has a high correlation score with the query segment, its successive segment contributes more to the prediction of the query segment. Extensive experiments demonstrate that our Preformer outperforms other Transformer-based methods.
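The predictive paradigm in the Preformer abstract above, where the value is the segment that follows the key segment, can be sketched on a univariate series. This is a simplified illustration, not the Preformer model: it omits learned projections and the multi-scale structure, and it assumes a scaled dot product as the correlation score. The function names are hypothetical.

```python
import math

def split_segments(series, length):
    # Non-overlapping segments; a trailing remainder shorter than `length` is dropped.
    return [series[i:i + length] for i in range(0, len(series) - length + 1, length)]

def predictive_segment_attention(query, segments):
    """Weight each key segment by its correlation with the query,
    then aggregate the segments that come right after the keys."""
    keys, values = segments[:-1], segments[1:]
    scores = [sum(q * k for q, k in zip(query, key)) / len(query) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[t] for w, v in zip(weights, values)) for t in range(len(query))]
```

With segments of length L, the number of attention scores scales with the number of segments rather than the number of points, which is the source of the efficiency gain the abstract mentions.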
Lyu Bing, Wu Qingwen, Yan Zhen, Yu Wenfei, Liu Hao
The discovery of changing-look active galactic nuclei (CLAGNs) with the
significant change of optical broad emission lines (optical CLAGNs) and/or
strong variation of line-of-sight column densities (X-ray CLAGNs) challenges
the orientation-based AGN unification model. We explore mid-infrared (mid-IR)
properties for a sample of 57 optical CLAGNs and 11 X-ray CLAGNs based on the
Wide-field Infrared Survey Explorer (WISE) archive data. We find
that Eddington-scaled mid-IR luminosities of both optical and X-ray CLAGNs stay
just between low-luminosity AGNs (LLAGNs) and luminous QSOs. The average
Eddington-scaled mid-IR luminosities for optical and X-ray CLAGNs are $\sim
0.4$\% and $\sim 0.5$\%, respectively, which roughly correspond to the bolometric
luminosity of transition between a radiatively inefficient accretion flow
(RIAF) and Shakura-Sunyaev disk (SSD). We estimate the time lags of the
variation in the mid-IR behind that in the optical band for 13 CLAGNs with
strong mid-IR variability, where the tight correlation between the time lag and
the bolometric luminosity ($\tau - L$) for CLAGNs roughly follows that found in
the luminous QSOs.
Authors' comments: 18 pages, accepted in ApJ