Yuanjie Yan, Jian Zhao, Furao Shen
Image manipulation on the latent space of the pre-trained StyleGAN can control the semantic attributes of the generated images. Recently, some studies have focused on detecting channels with specific properties to directly manipulate the latent code, an approach that is limited by the entanglement of the latent space. To detect the attribute-specific channels, we propose a novel detection method in the context of pre-trained classifiers. We analyse the gradients layer by layer on the style space. The intensities of the gradients indicate the channels' responses to specific attributes. The latent style codes of channels control separate attributes in the layers. We choose channels with top-$k$ gradients to control specific attributes in the maximum response layer. We implement single-channel and multi-channel manipulations for a given attribute. Our methods can accurately detect relevant channels for a large number of face attributes. Extensive qualitative and quantitative results demonstrate that the proposed methods outperform state-of-the-art methods in generalization and scalability.
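The channel selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the input format (a dictionary of per-layer, per-channel gradients), the function name `select_channels`, and the reading of "maximum response layer" as the layer holding the largest absolute gradient are all assumptions made for illustration.

```python
def select_channels(gradients_by_layer, k):
    """Return (layer name, indices of the k channels with the largest |gradient|).

    gradients_by_layer maps a layer name to a list of per-channel gradients
    of a classifier output with respect to the style code (hypothetical format).
    """
    # Assumption: the "maximum response layer" is the layer that contains
    # the single largest absolute gradient.
    best_layer = max(
        gradients_by_layer,
        key=lambda name: max(abs(g) for g in gradients_by_layer[name]),
    )
    grads = gradients_by_layer[best_layer]
    # Rank channel indices by gradient intensity, strongest first.
    ranked = sorted(range(len(grads)), key=lambda c: abs(grads[c]), reverse=True)
    return best_layer, ranked[:k]


# Toy gradients for two layers (made-up numbers).
example = {
    "layer0": [0.1, -0.5, 0.2],
    "layer1": [0.9, -0.05, 0.7, -1.2],
}
layer, channels = select_channels(example, k=2)
```

With `k=1` the sketch corresponds to single-channel manipulation, and with `k>1` to multi-channel manipulation of the selected layer.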
Nuoya Xiong, Yihan Du, Longbo Huang
In this paper, we investigate a novel safe reinforcement learning problem with step-wise violation constraints. Our problem differs from existing works in that we consider stricter step-wise violation constraints and do not assume the existence of safe actions, making our formulation more suitable for safety-critical applications which need to ensure safety in all decision steps and may not always possess safe actions, e.g., robot control and autonomous driving. We propose a novel algorithm SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^3SAT})$ regret. Lower bounds are provided to validate the optimality in both violation and regret performance with respect to $S$ and $T$. Moreover, we further study a novel safe reward-free exploration problem with step-wise violation constraints. For this problem, we design an $(\varepsilon,\delta)$-PAC algorithm SRF-UCRL, which achieves nearly state-of-the-art sample complexity $\widetilde{O}((\frac{S^2AH^2}{\varepsilon}+\frac{H^4SA}{\varepsilon^2})(\log(\frac{1}{\delta})+S))$, and guarantees $\widetilde{O}(\sqrt{ST})$ violation during the exploration. The experimental results demonstrate the superiority of our algorithms in safety performance, and corroborate our theoretical results.
Giovanni Araujo Bacochina, Rodrigo Clemente Thom de Souza
The use of attention layers has become a trend since the popularization of Transformer-based models, and attention is the key element of many state-of-the-art models developed in recent years. However, one of the biggest obstacles to implementing these architectures, as well as many others in the deep learning field, is the enormous number of trainable parameters they possess, which makes their use conditional on the availability of robust hardware. In this paper, we propose a new attention mechanism that adapts Dot-Product Attention, which uses matrix multiplications, to become element-wise through the use of array multiplications. To test the effectiveness of this approach, two models (one with a VGG-like architecture and one with the proposed method) were trained on a classification task using the Fashion MNIST and CIFAR10 datasets. Each model was trained for 10 epochs on a single Tesla T4 GPU from Google Colaboratory. The results show that this mechanism reaches 92% of the accuracy of the VGG-like counterpart on Fashion MNIST while reducing the number of parameters by 97%. For CIFAR10, the accuracy is still equivalent to 60% of that of the VGG-like counterpart while using 50% fewer parameters.
Rohit Yadav, François-Xavier Dupé, S. Takerkart, Guillaume Auzias
Population-wise matching of the cortical folds is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins, termed sulcal graphs, makes it possible to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of the multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us to benchmark several state-of-the-art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques in obtaining a population-wise consistent labeling of cortical folds at the sulcal basin level.
Sungmin Cha, Sungjun Cho, Dasol Hwang, Honglak Lee, Taesup Moon, Moontae Lee
Since the recent advent of regulations for data protection (e.g., the General
Data Protection Regulation), there has been increasing demand for deleting
information learned from sensitive data in pre-trained models without
retraining from scratch. The inherent vulnerability of neural networks towards
adversarial attacks and unfairness also calls for a robust method to remove or
correct information in an instance-wise fashion, while retaining the predictive
performance across remaining data. To this end, we consider instance-wise
unlearning, of which the goal is to delete information on a set of instances
from a pre-trained model, by either misclassifying each instance away from its
original prediction or relabeling the instance to a different label. We also
propose two methods that reduce forgetting on the remaining data: 1) utilizing
adversarial examples to overcome forgetting at the representation-level and 2)
leveraging weight importance metrics to pinpoint network parameters guilty of
propagating unwanted information. Both methods only require the pre-trained
model and data instances to forget, allowing painless application to real-life
settings where the entire training set is unavailable. Through extensive
experimentation on various image classification benchmarks, we show that our
approach effectively preserves knowledge of remaining data while unlearning
given instances in both single-task and continual unlearning scenarios.
Authors' comments: AAAI 2024 camera ready version
Zanjia Tong, Yuhang Chen, Zewei Xu, Rong Yu
The loss function for bounding box regression (BBR) is essential to object detection. A well-defined loss brings significant performance improvement to the model. Most existing works assume that the examples in the training data are high-quality and focus on strengthening the fitting ability of the BBR loss. However, blindly strengthening BBR on low-quality examples jeopardizes localization performance. Focal-EIoU v1 was proposed to solve this problem, but due to its static focusing mechanism (FM), the potential of the non-monotonic FM was not fully exploited. Based on this idea, we propose an IoU-based loss with a dynamic non-monotonic FM named Wise-IoU (WIoU). The dynamic non-monotonic FM uses the outlier degree instead of IoU to evaluate the quality of anchor boxes and provides a wise gradient gain allocation strategy. This strategy reduces the competitiveness of high-quality anchor boxes while also reducing the harmful gradient generated by low-quality examples. This allows WIoU to focus on ordinary-quality anchor boxes and improve the detector's overall performance. When WIoU is applied to the state-of-the-art real-time detector YOLOv7, the AP-75 on the MS-COCO dataset is improved from 53.03% to 54.50%. Code is available at https://github.com/Instinct323/wiou.
Basel Barakat, Qiang Huang
Finetuning can be used to tackle domain-specific tasks by transferring
knowledge. Previous studies on finetuning focused on adapting only the weights
of a task-specific classifier or re-optimizing all layers of the pre-trained
model using the new task data. The first type of method cannot mitigate the
mismatch between a pre-trained model and the new task data, and the second type
of method easily causes over-fitting when processing tasks with limited data.
To explore the effectiveness of fine-tuning, we propose a novel block-wise
optimization mechanism, which adapts the weights of a group of layers of a
pre-trained model. In our work, the layer selection can be done in four
different ways. The first is layer-wise adaptation, which aims to search for
the most salient single layer according to the classification performance. The
second way is based on the first one, jointly adapting a small number of
top-ranked layers instead of using an individual layer. The third is
block-based segmentation, where the layers of a deep network are segmented into blocks
by non-weighting layers, such as the MaxPooling layer and Activation layer. The
last one is to use a fixed-length sliding window to group layers block by
block. To identify which group of layers is the most suitable for finetuning,
the search starts from the target end and is conducted by freezing other layers
excluding the selected layers and the classification layers. The most salient
group of layers is determined in terms of classification performance. In our
experiments, the proposed approaches are tested on an often-used dataset,
Tf_flower, by finetuning five typical pre-trained models, VGG16, MobileNet-v1,
MobileNet-v2, MobileNet-v3, and ResNet50v2, respectively. The obtained results
show that the use of our proposed block-wise approaches can achieve better
performances than the two baseline methods and the layer-wise method.
Authors' comments: 10 pages
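Two of the layer-grouping strategies described in the abstract above (segmentation by non-weighting layers and a fixed-length sliding window) can be sketched as follows. This is only an illustration under stated assumptions: the layer names, the `(name, kind)` tuple format, the function names, and the default stride of 1 are hypothetical, and the fine-tuning search itself is not shown.

```python
def split_by_non_weight_layers(layers, non_weight_kinds):
    """Group weighted layers into blocks separated by non-weighting layers.

    layers is a list of (name, kind) tuples (hypothetical format).
    """
    blocks, current = [], []
    for name, kind in layers:
        if kind in non_weight_kinds:
            # A pooling or activation layer closes the current block.
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(name)
    if current:
        blocks.append(current)
    return blocks


def sliding_window_blocks(names, window, stride=1):
    """Group layer names with a fixed-length sliding window (stride is an assumption)."""
    return [names[i:i + window] for i in range(0, len(names) - window + 1, stride)]


# Toy network description (made-up layer names).
network = [
    ("conv1", "conv"), ("conv2", "conv"), ("pool1", "pool"),
    ("conv3", "conv"), ("act1", "activation"),
    ("conv4", "conv"), ("conv5", "conv"),
]
segmented = split_by_non_weight_layers(network, {"pool", "activation"})
weighted = [name for name, kind in network if kind == "conv"]
windows = sliding_window_blocks(weighted, window=2)
```

Each resulting block is a candidate group of layers to unfreeze while the remaining layers stay frozen; the block with the best classification performance would then be selected.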
Kislay Raj, Aditya Singh, Abhishek Mandal, Teerath Kumar, Arunabha M. Roy
In a world of growing technology, psychological disorders have become a challenge
to be solved. The methods used for cognitive stimulation are very conventional
and based on one-way communication, relying only on the material or method
used to train an individual. They do not use any feedback from
the individual to analyze the progress of the training process. We
propose a closed-loop methodology to improve the cognitive state of a person
with intellectual disability (ID). We use a platform named 'Armoni' to
provide training to intellectually disabled individuals. The learning is
performed in a closed loop, using feedback in the form of changes in affective
state. For feedback to Armoni, an electroencephalograph (EEG) headband is
used. All changes in the EEG are observed and classified against the changes in
the mean and standard deviation of all frequency bands of the signal. This
comparison helps to characterize every activity with respect to changes
in brain signals. In this paper, we discuss the processing of the
EEG signal and its characterization against the different activities of Armoni. We
have tested it on 6 different systems with different age groups and cognitive
levels.
Authors' comments: Submitted to SN Computer Science journal
Neri Merhav
We propose a universal ensemble for random selection of rate-distortion
codes, which is asymptotically optimal in a sample-wise sense. According to
this ensemble, each reproduction vector, $\hat{\boldsymbol{x}}$, is selected independently at
random under the probability distribution that is proportional to
$2^{-LZ(\hat{\boldsymbol{x}})}$, where $LZ(\hat{\boldsymbol{x}})$ is the code-length of $\hat{\boldsymbol{x}}$ pertaining to
the 1978 version of the Lempel-Ziv (LZ) algorithm. We show that, with high
probability, the resulting codebook gives rise to an asymptotically optimal
variable-rate lossy compression scheme under an arbitrary distortion measure,
in the sense that a matching converse theorem also holds. According to the
converse theorem, even if the decoder knew the $\ell$-th order type of the source
vector in advance ($\ell$ being a large but fixed positive integer), the
performance of the above-mentioned code could not have been improved
essentially, for the vast majority of codewords that represent all source
vectors in the same type. Finally, we provide a discussion of our results,
which includes, among other things, a comparison to a coding scheme that
selects the reproduction vector with the shortest LZ code length among all
vectors that are within the allowed distortion from the source vector.
Authors' comments: 22 pages, submitted for publication
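To make the idea of weighting reproduction vectors by $2^{-LZ}$ concrete, the following sketch computes an LZ78-style code length via incremental parsing and turns it into selection weights. It is an illustration only. The exact code-length convention is an assumption (phrase $j$ is charged $\lceil \log_2 j \rceil$ bits for a pointer to an earlier phrase plus $\lceil \log_2 |\mathcal{A}| \rceil$ bits for the new symbol), and the weights are normalized over a small hypothetical candidate list rather than over all vectors of a given length.

```python
import math


def lz78_phrases(seq):
    """Incremental (LZ78) parsing: each phrase is the shortest prefix not seen before."""
    seen = {""}
    phrases = []
    current = ""
    for symbol in seq:
        candidate = current + symbol
        if candidate in seen:
            current = candidate
        else:
            seen.add(candidate)
            phrases.append(candidate)
            current = ""
    if current:
        # The final phrase may repeat an earlier one.
        phrases.append(current)
    return phrases


def lz78_code_length(seq, alphabet_size):
    """Code length in bits under the assumed pointer-plus-symbol convention."""
    symbol_bits = math.ceil(math.log2(alphabet_size))
    count = len(lz78_phrases(seq))
    return sum(math.ceil(math.log2(j)) + symbol_bits for j in range(1, count + 1))


def selection_weights(candidates, alphabet_size):
    """Weights proportional to 2^(-code length), normalized over the candidates."""
    raw = [2.0 ** -lz78_code_length(c, alphabet_size) for c in candidates]
    total = sum(raw)
    return [r / total for r in raw]


candidates = ["aaaaaaaa", "abcdabcd"]  # hypothetical reproduction vectors
weights = selection_weights(candidates, alphabet_size=4)
```

Under this convention the highly repetitive string receives a shorter code length and therefore a much larger selection weight, which is the qualitative behaviour the ensemble relies on.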
Joo Chan Lee, Daniel Rho, Jong Hwan Ko, Eunbyung Park
Neural fields, also known as coordinate-based or implicit neural
representations, have shown a remarkable capability of representing,
generating, and manipulating various forms of signals. For video
representations, however, mapping pixel-wise coordinates to RGB colors has
shown relatively low compression performance and slow convergence and inference
speed. Frame-wise video representation, which maps a temporal coordinate to its
entire frame, has recently emerged as an alternative method to represent
videos, improving compression rates and encoding speed. While promising, it has
still failed to reach the performance of state-of-the-art video compression
algorithms. In this work, we propose FFNeRV, a novel method for incorporating
flow information into frame-wise representations to exploit the temporal
redundancy across the frames in videos inspired by the standard video codecs.
Furthermore, we introduce a fully convolutional architecture, enabled by
one-dimensional temporal grids, improving the continuity of spatial features.
Experimental results show that FFNeRV yields the best performance for video
compression and frame interpolation among the methods using frame-wise
representations or neural fields. To reduce the model size even further, we
devise a more compact convolutional architecture using group and pointwise
convolutions. With model compression techniques, including quantization-aware
training and entropy coding, FFNeRV outperforms widely used standard video
codecs (H.264 and HEVC) and performs on par with state-of-the-art video
compression algorithms.
Authors' comments: Our project page including code is available at
https://maincold2.github.io/ffnerv/
Odysseas S. Chlapanis, Georgios Paraskevopoulos, Alexandros Potamianos
Multimodal learning pipelines have benefited from the success of pretrained language models. However, this comes at the cost of increased model parameters. In this work, we propose Adapted Multimodal BERT (AMB), a BERT-based architecture for multimodal tasks that uses a combination of adapter modules and intermediate fusion layers. The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. During the adaptation process the pretrained language model parameters remain frozen, allowing for fast, parameter-efficient training. In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise. Our experiments on sentiment analysis with CMU-MOSEI show that AMB outperforms the current state-of-the-art across metrics, with 3.4% relative reduction in the resulting error and 2.1% relative improvement in 7-class classification accuracy.
Stefan Braun, Erik McDermott, Roger Hsiao
The neural transducer is an end-to-end model for automatic speech recognition
(ASR). While the model is well-suited for streaming ASR, the training process
remains challenging. During training, the memory requirements may quickly
exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence
lengths. In this work, we analyze the time and space complexity of a typical
transducer training setup. We propose a memory-efficient training method that
computes the transducer loss and gradients sample by sample. We present
optimizations to increase the efficiency and parallelism of the sample-wise
method. In a set of thorough benchmarks, we show that our sample-wise method
significantly reduces memory usage, and performs at competitive speed when
compared to the default batched computation. As a highlight, we manage to
compute the transducer loss and gradients for a batch size of 1024, and audio
length of 40 seconds, using only 6 GB of memory.
Authors' comments: 5 pages, 4 figures, 1 table, 1 algorithm
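The general principle behind sample-by-sample loss and gradient computation can be shown with a toy example. This sketch does not implement the transducer loss; it uses a squared-error loss on a linear model (an assumption chosen for brevity) to show that accumulating per-sample gradients gives the same batch gradient while only one sample's intermediate values need to be held at a time. All function names are hypothetical.

```python
def sample_gradient(weights, features, target):
    """Gradient of the squared error (prediction - target)^2 for one sample."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    return [2.0 * error * x for x in features]


def samplewise_batch_gradient(weights, batch):
    """Mean gradient over the batch, accumulated one sample at a time."""
    total = [0.0] * len(weights)
    for features, target in batch:
        grad = sample_gradient(weights, features, target)
        # Only the running sum is kept; per-sample intermediates are discarded.
        total = [t + g for t, g in zip(total, grad)]
    return [t / len(batch) for t in total]


weights = [1.0, -1.0]
batch = [([1.0, 2.0], 0.0), ([3.0, 1.0], 1.0)]  # (features, target) pairs
gradient = samplewise_batch_gradient(weights, batch)
```

The trade-off is the usual one: peak memory scales with a single sample rather than the whole batch, at the cost of less parallelism unless the per-sample computations are themselves optimized.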
Adrian L. H. Lam, Jean-Luc Margot, Emily Whittaker, Nathan Myhrvold
We used 22 $\mu$m (W4) Wide-field Infrared Survey Explorer (WISE)
observations of 4420 asteroids to analyze lightcurves and determined spin
period estimates for 1929 asteroids. We fit second-order Fourier models at a
large number of trial frequencies to the W4 data and analyzed the resulting
periodograms. We initially excluded rotational frequencies exceeding 7.57
rotations per day (P < 3.17 hr), which are not sampled adequately by WISE, and
periods that exceed twice the WISE observation interval, which is typically 36
hr. Three solutions accurately capture the vast majority of the rotational
frequencies in our sample: the best-fit frequency and its mirrors around 3.78
and 7.57 rotations per day. By comparing our solutions to a high-quality
control group of 752 asteroid spin periods, we found that one of our solutions
is accurate (within 5%) in 88% of the cases. The best-fit, secondary, and
tertiary solutions are accurate in 55%, 27%, and 6% of the cases, respectively.
We also observed that suppression of aliased solutions was more effective with
non-uniform sampling than with quasi-uniform sampling.
Authors' comments: 19 pages, 14 figures, in press at The Planetary Science Journal
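The procedure of fitting second-order Fourier models at many trial frequencies can be sketched as a least-squares periodogram. This is a minimal illustration on synthetic, noise-free data and not the authors' pipeline: the frequency grid, the random sampling times, the synthetic lightcurve, and all function names are assumptions. Times are in days and frequencies in rotations per day.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def design_row(t, freq):
    """Basis of a second-order Fourier model: constant plus two harmonics."""
    w = 2.0 * math.pi * freq * t
    return [1.0, math.cos(w), math.sin(w), math.cos(2 * w), math.sin(2 * w)]


def fit_residual(times, values, freq):
    """Sum of squared residuals of the least-squares fit at one trial frequency."""
    rows = [design_row(t, freq) for t in times]
    k = len(rows[0])
    # Normal equations (A^T A) c = A^T y.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    coeffs = solve_linear(ata, aty)
    return sum(
        (v - sum(c * x for c, x in zip(coeffs, r))) ** 2 for r, v in zip(rows, values)
    )


# Synthetic lightcurve with non-uniform sampling over 1.5 days (36 hr).
rng = random.Random(0)
true_freq = 3.0
times = sorted(rng.uniform(0.0, 1.5) for _ in range(40))
values = [
    1.0
    + 0.3 * math.cos(2 * math.pi * true_freq * t)
    + 0.5 * math.sin(4 * math.pi * true_freq * t)
    for t in times
]
trial_freqs = [0.5 + 0.01 * i for i in range(701)]  # 0.5 to 7.5 rotations per day
residuals = [fit_residual(times, values, f) for f in trial_freqs]
best_freq = trial_freqs[residuals.index(min(residuals))]
```

The periodogram here is the residual as a function of trial frequency, and the best-fit frequency is its minimum. The synthetic signal contains both harmonics so that half and double the true frequency do not fit equally well; with real, quasi-uniformly sampled data, aliased solutions would also appear as competing minima.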
Quan Quan, Qingsong Yao, Jun Li, S. Kevin Zhou
Contrastive learning (CL) is a form of self-supervised learning and has been widely used for various tasks. Unlike the widely studied instance-level contrastive learning, pixel-wise contrastive learning mainly helps with pixel-wise tasks such as medical landmark detection. The counterpart to an instance in instance-level CL is a pixel, along with its neighboring context, in pixel-wise CL. Aiming to build better feature representations, there is a vast literature on designing instance augmentation strategies for instance-level CL, but there is little similar work on pixel augmentation for pixel-wise CL with a pixel granularity. In this paper, we attempt to bridge this gap. We first classify each pixel into one of three categories, namely low-, medium-, and high-informative, based on the information quantity the pixel contains. Inspired by the ``InfoMin'' principle, we then design separate augmentation strategies for each category in terms of augmentation intensity and sampling ratio. Extensive experiments validate that our information-guided pixel augmentation strategy succeeds in encoding more discriminative representations and surpassing other competitive approaches in unsupervised local feature matching. Furthermore, our pretrained model improves the performance of both one-shot and fully supervised models. To the best of our knowledge, we are the first to propose a pixel augmentation method with a pixel granularity for enhancing unsupervised pixel-wise contrastive learning.
Ankita Pasad, Bowen Shi, Karen Livescu
Many self-supervised speech models, varying in their pre-training objective,
input modality, and pre-training data, have been proposed in the last few
years. Despite impressive successes on downstream tasks, we still have a
limited understanding of the properties encoded by the models and the
differences across models. In this work, we examine the intermediate
representations for a variety of recent models. Specifically, we measure
acoustic, phonetic, and word-level properties encoded in individual layers,
using a lightweight analysis tool based on canonical correlation analysis
(CCA). We find that these properties evolve across layers differently depending
on the model, and the variations relate to the choice of pre-training
objective. We further investigate the utility of our analyses for downstream
tasks by comparing the property trends with performance on speech recognition
and spoken language understanding tasks. We discover that CCA trends provide
reliable guidance to choose layers of interest for downstream tasks and that
single-layer performance often matches or improves upon using all layers,
suggesting implications for more efficient use of pre-trained models.
Authors' comments: Accepted to ICASSP 2023. Code:
https://github.com/ankitapasad/layerwise-analysis
Le Xia, Yao Sun, Chengsi Liang, Daquan Feng, Runze Cheng, Yang Yang, Muhammad Ali Imran
Virtual reality (VR) over wireless is expected to be one of the killer
applications in next-generation communication networks. Nevertheless, the huge
data volume along with stringent requirements on latency and reliability under
limited bandwidth resources makes untethered wireless VR delivery increasingly
challenging. Such bottlenecks, therefore, motivate this work to seek the
potential of using semantic communication, a new paradigm that promises to
significantly ease the resource pressure, for efficient VR delivery. To this
end, we propose a novel framework, namely WIreless SEmantic deliveRy for VR
(WiserVR), for delivering consecutive $360^\circ$ video frames to VR users.
Specifically, multiple deep learning-based modules are devised for the
transceiver in WiserVR to realize high-performance feature extraction and
semantic recovery. Among them, we develop the concept of a semantic
location graph and leverage the joint-semantic-channel-coding method with
knowledge sharing not only to substantially reduce communication latency, but
also to guarantee adequate transmission reliability and resilience under
various channel states. Moreover, implementation of WiserVR is presented,
followed by corresponding initial simulations for performance evaluation
compared with benchmarks. Finally, we discuss several open issues and offer
feasible solutions to unlock the full potential of WiserVR.
Authors' comments: This magazine article has been accepted for publication by IEEE
Wireless Communications. Copyright may be transferred without notice, after
which this version may no longer be accessible
Anton Vasiliuk, Daria Frolova, Mikhail Belyaev, Boris Shirokikh
When applying a deep learning model to medical images, it is crucial to estimate the model uncertainty. Voxel-wise uncertainty is a useful visual marker for human experts and could be used to improve the model's voxel-wise output, such as segmentation. Moreover, uncertainty provides a solid foundation for out-of-distribution (OOD) detection, improving the model performance on the image-wise level. However, one of the frequent tasks in medical imaging is the segmentation of distinct, local structures such as tumors or lesions. Here, structure-wise uncertainty allows more precise operations than image-wise uncertainty and is more semantically aware than voxel-wise uncertainty. How to produce uncertainty for individual structures remains poorly explored. We propose a framework to measure the structure-wise uncertainty and evaluate the impact of OOD data on the model performance. With it, we identify the best uncertainty estimation (UE) method to improve the segmentation quality. The proposed framework is tested on three datasets with the tumor segmentation task: LIDC-IDRI, LiTS, and a private one with multiple brain metastases cases.
Qihan Wang, Chen Dun, Fangshuo Liao, Chris Jermaine, Anastasios Kyrillidis
Recent work on the Lottery Ticket Hypothesis (LTH) shows that there exist ``\textit{winning tickets}'' in large neural networks. These tickets represent ``sparse'' versions of the full model that can be trained independently to achieve comparable accuracy with respect to the full model. However, finding the winning tickets requires one to \emph{pretrain} the large model for at least a number of epochs, which can be a burdensome task, especially when the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms. For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that well-represents the model convergence. As our theory dictates, our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by these observations, we present the \emph{LOttery ticket through Filter-wise Training} algorithm, dubbed as \textsc{LoFT}. \textsc{LoFT} is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that \textsc{LoFT} $i)$ preserves and finds good lottery tickets, while $ii)$ it achieves non-trivial computation and communication savings, and maintains comparable or even better accuracy than other pretraining methods.
Przemysław Berk, Frank Trujillo
The goal of this article is to show a rigidity property of conjugacies of generalized interval exchange transformations (GIETs). More precisely, we show that if two piecewise $C^3$ GIETs $f$ and $g$ of generic rotation number with mean-non-linearity 0 are homeomorphic, boundary-equivalent and their renormalizations approach in an appropriate way the set of affine interval exchange transformations, then their respective renormalizations converge to each other and the conjugating map is $C^1$. Moreover, if $f$ and $g$ are GIETs with rotation type combinatorial data, generic rotation number and they are break-equivalent as piecewise circle diffeomorphisms, they are actually $C^1$-conjugated as circle diffeomorphisms. These results generalize the work of K. Cunha and D. Smania \cite{cunha_rigidity_2014} in the case of piecewise $C^3$ circle maps, where the authors prove an analogous result for GIETs with rotation type combinatorial data and bounded rotation number.
Weirui Ye, Pieter Abbeel, Yang Gao
One of the most important AI research questions is to trade off computation versus performance since ``perfect rationality'' exists in theory but is impossible to achieve in practice. Recently, Monte-Carlo tree search (MCTS) has attracted considerable attention due to the significant performance improvement in various challenging domains. However, the expensive time cost during search severely restricts its scope for applications. This paper proposes the Virtual MCTS (V-MCTS), a variant of MCTS that spends more search time on harder states and less search time on simpler states adaptively. We give theoretical bounds of the proposed method and evaluate the performance and computations on $9 \times 9$ Go board games and Atari games. Experiments show that our method can achieve comparable performances to the original search algorithm while requiring less than $50\%$ search time on average. We believe that this approach is a viable alternative for tasks under limited time and resources. The code is available at \url{https://github.com/YeWR/V-MCTS.git}.