Neri Merhav
We propose a universal ensemble for random selection of rate-distortion
codes, which is asymptotically optimal in a sample-wise sense. According to
this ensemble, each reproduction vector, $\hbx$, is selected independently at
random under the probability distribution that is proportional to
$2^{-LZ(\hbx)}$, where $LZ(\hbx)$ is the code-length of $\hbx$ pertaining to
the 1978 version of the Lempel-Ziv (LZ) algorithm. We show that, with high
probability, the resulting codebook gives rise to an asymptotically optimal
variable-rate lossy compression scheme under an arbitrary distortion measure,
in the sense that a matching converse theorem also holds. According to the
converse theorem, even if the decoder knew the $\ell$-th order type of the source
vector in advance ($\ell$ being a large but fixed positive integer), the
performance of the above-mentioned code could not have been improved
essentially, for the vast majority of codewords that represent all source
vectors in the same type. Finally, we provide a discussion of our results,
which includes, among other things, a comparison to a coding scheme that
selects the reproduction vector with the shortest LZ code length among all
vectors that are within the allowed distortion from the source vector.
Authors' comments: 22 pages, submitted for publication
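The weighting $2^{-LZ(\hbx)}$ described above can be illustrated with a minimal Python sketch. It parses a string with LZ78 incremental parsing and assigns each candidate an unnormalized weight. The function names and the per-phrase bit count (a pointer of $\lceil \log_2 j \rceil$ bits for the $j$-th phrase plus one fixed-width symbol) are assumptions made for illustration; they are not the paper's exact code-length definition.

```python
import math


def lz78_phrases(seq):
    """LZ78 incremental parsing: each phrase is the shortest string not seen before."""
    seen = set()
    phrases = []
    current = ""
    for symbol in seq:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # The final phrase may repeat an earlier one if the input ends mid-phrase.
        phrases.append(current)
    return phrases


def lz78_length_bits(seq, alphabet_size=2):
    """Illustrative code length (an assumption): pointer bits plus one symbol per phrase."""
    symbol_bits = math.ceil(math.log2(alphabet_size))
    total = 0
    for j in range(1, len(lz78_phrases(seq)) + 1):
        # The j-th phrase points to one of j candidates (empty phrase + j-1 earlier ones).
        total += math.ceil(math.log2(j)) + symbol_bits
    return total


def unnormalized_weight(seq, alphabet_size=2):
    """Weight proportional to 2^{-LZ(x)}; normalization over all vectors is omitted."""
    return 2.0 ** (-lz78_length_bits(seq, alphabet_size))


for candidate in ("aaaaaa", "abbaab"):
    print(candidate, lz78_phrases(candidate), lz78_length_bits(candidate))
```

Under this convention, a highly repetitive string parses into fewer phrases and therefore receives a larger weight than a less compressible string of the same length.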
Joo Chan Lee, Daniel Rho, Jong Hwan Ko, Eunbyung Park
Neural fields, also known as coordinate-based or implicit neural
representations, have shown a remarkable capability of representing,
generating, and manipulating various forms of signals. For video
representations, however, mapping pixel-wise coordinates to RGB colors has
shown relatively low compression performance and slow convergence and inference
speed. Frame-wise video representation, which maps a temporal coordinate to its
entire frame, has recently emerged as an alternative method to represent
videos, improving compression rates and encoding speed. While promising, it has
still failed to reach the performance of state-of-the-art video compression
algorithms. In this work, we propose FFNeRV, a novel method for incorporating
flow information into frame-wise representations to exploit the temporal
redundancy across the frames in videos, inspired by standard video codecs.
Furthermore, we introduce a fully convolutional architecture, enabled by
one-dimensional temporal grids, improving the continuity of spatial features.
Experimental results show that FFNeRV yields the best performance for video
compression and frame interpolation among the methods using frame-wise
representations or neural fields. To reduce the model size even further, we
devise a more compact convolutional architecture using group and pointwise
convolutions. With model compression techniques, including quantization-aware
training and entropy coding, FFNeRV outperforms widely-used standard video
codecs (H.264 and HEVC) and performs on par with state-of-the-art video
compression algorithms.
Authors' comments: Our project page including code is available at
https://maincold2.github.io/ffnerv/
Odysseas S. Chlapanis, Georgios Paraskevopoulos, Alexandros Potamianos
Multimodal learning pipelines have benefited from the success of pretrained language models. However, this comes at the cost of increased model parameters. In this work, we propose Adapted Multimodal BERT (AMB), a BERT-based architecture for multimodal tasks that uses a combination of adapter modules and intermediate fusion layers. The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. During the adaptation process the pretrained language model parameters remain frozen, allowing for fast, parameter-efficient training. In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise. Our experiments on sentiment analysis with CMU-MOSEI show that AMB outperforms the current state-of-the-art across metrics, with a 3.4% relative reduction in the resulting error and a 2.1% relative improvement in 7-class classification accuracy.
Stefan Braun, Erik McDermott, Roger Hsiao
The neural transducer is an end-to-end model for automatic speech recognition
(ASR). While the model is well-suited for streaming ASR, the training process
remains challenging. During training, the memory requirements may quickly
exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence
lengths. In this work, we analyze the time and space complexity of a typical
transducer training setup. We propose a memory-efficient training method that
computes the transducer loss and gradients sample by sample. We present
optimizations to increase the efficiency and parallelism of the sample-wise
method. In a set of thorough benchmarks, we show that our sample-wise method
significantly reduces memory usage, and performs at competitive speed when
compared to the default batched computation. As a highlight, we manage to
compute the transducer loss and gradients for a batch size of 1024, and audio
length of 40 seconds, using only 6 GB of memory.
Authors' comments: 5 pages, 4 figures, 1 table, 1 algorithm
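To make the memory pressure concrete, here is a rough back-of-the-envelope sketch in Python. It relies on the usual transducer formulation, in which the joint network produces one score per (frame, token, vocabulary entry) triple for every utterance. The shapes below (frames, tokens, vocabulary size) are hypothetical values chosen only for illustration. They are not taken from the paper, and the sketch does not attempt to reproduce the reported 6 GB figure.

```python
def joint_tensor_bytes(batch, frames, tokens, vocab, bytes_per_value=4):
    """Size of a dense joint-network output of shape (batch, frames, tokens, vocab)."""
    return batch * frames * tokens * vocab * bytes_per_value


# Hypothetical shapes (assumptions for illustration, not values from the paper).
BATCH, FRAMES, TOKENS, VOCAB = 1024, 1000, 100, 1024

# Batched computation materializes the joint output for all utterances at once.
batched = joint_tensor_bytes(BATCH, FRAMES, TOKENS, VOCAB)

# Sample-wise computation only needs the joint output of one utterance at a time.
sample_wise = joint_tensor_bytes(1, FRAMES, TOKENS, VOCAB)

GIB = 1024 ** 3
print(f"batched joint tensor:     {batched / GIB:.1f} GiB")
print(f"sample-wise joint tensor: {sample_wise / GIB:.2f} GiB")
```

The point of the sketch is only the scaling: the peak size of this one tensor grows linearly with batch size in the batched setup, but stays constant when samples are processed one by one.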
Adrian L. H. Lam, Jean-Luc Margot, Emily Whittaker, Nathan Myhrvold
We used 22 $\mu$m (W4) Wide-field Infrared Survey Explorer (WISE)
observations of 4420 asteroids to analyze lightcurves and determined spin
period estimates for 1929 asteroids. We fit second-order Fourier models at a
large number of trial frequencies to the W4 data and analyzed the resulting
periodograms. We initially excluded rotational frequencies exceeding 7.57
rotations per day (P < 3.17 hr), which are not sampled adequately by WISE, and
periods that exceed twice the WISE observation interval, which is typically 36
hr. Three solutions accurately capture the vast majority of the rotational
frequencies in our sample: the best-fit frequency and its mirrors around 3.78
and 7.57 rotations per day. By comparing our solutions to a high-quality
control group of 752 asteroid spin periods, we found that one of our solutions
is accurate (within 5%) in 88% of the cases. The best-fit, secondary, and
tertiary solutions are accurate in 55%, 27%, and 6% of the cases, respectively.
We also observed that suppression of aliased solutions was more effective with
non-uniform sampling than with quasi-uniform sampling.
Authors' comments: 19 pages, 14 figures, in press at The Planetary Science Journal
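The numerical limits quoted in the abstract above can be checked with a short Python sketch. The helper names are hypothetical, and interpreting "within 5%" as a relative error with respect to the reference period is an assumption.

```python
def period_hours(freq_rot_per_day):
    """Convert a rotational frequency in rotations per day to a period in hours."""
    return 24.0 / freq_rot_per_day


def is_accurate(estimate, reference, tolerance=0.05):
    """Assumed accuracy criterion: relative error no larger than the tolerance."""
    return abs(estimate - reference) <= tolerance * abs(reference)


MAX_FREQ = 7.57          # rotations per day, upper frequency limit from the abstract
OBS_INTERVAL_HR = 36.0   # typical WISE observation interval from the abstract

min_period = period_hours(MAX_FREQ)   # shortest period kept, about 3.17 hr
max_period = 2 * OBS_INTERVAL_HR      # longest period kept, twice the interval

# Success rates of the three solutions, as percentages quoted in the abstract.
solution_accuracy = {"best-fit": 55, "secondary": 27, "tertiary": 6}
combined = sum(solution_accuracy.values())

print(f"period window: {min_period:.2f} hr to {max_period:.0f} hr")
print(f"at least one solution accurate: {combined}%")
```

The three per-solution rates add up to the combined 88% quoted for "one of our solutions", which is consistent with at most one of the three candidates matching the reference period for a given asteroid.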
Quan Quan, Qingsong Yao, Jun Li, S. Kevin Zhou
Contrastive learning (CL) is a form of self-supervised learning and has been widely used for various tasks. Unlike the widely studied instance-level contrastive learning, pixel-wise contrastive learning mainly helps with pixel-wise tasks such as medical landmark detection. The counterpart to an instance in instance-level CL is a pixel, along with its neighboring context, in pixel-wise CL. There is a vast literature on designing instance augmentation strategies for instance-level CL that aim to build better feature representations, but there is little similar work on pixel augmentation for pixel-wise CL with a pixel granularity. In this paper, we attempt to bridge this gap. We first classify each pixel into one of three categories, namely low-, medium-, and high-informative, based on the information quantity the pixel contains. Inspired by the ``InfoMin'' principle, we then design separate augmentation strategies for each category in terms of augmentation intensity and sampling ratio. Extensive experiments validate that our information-guided pixel augmentation strategy succeeds in encoding more discriminative representations and surpassing other competitive approaches in unsupervised local feature matching. Furthermore, our pretrained model improves the performance of both one-shot and fully supervised models. To the best of our knowledge, we are the first to propose a pixel augmentation method with a pixel granularity for enhancing unsupervised pixel-wise contrastive learning.
Ankita Pasad, Bowen Shi, Karen Livescu
Many self-supervised speech models, varying in their pre-training objective,
input modality, and pre-training data, have been proposed in the last few
years. Despite impressive successes on downstream tasks, we still have a
limited understanding of the properties encoded by the models and the
differences across models. In this work, we examine the intermediate
representations for a variety of recent models. Specifically, we measure
acoustic, phonetic, and word-level properties encoded in individual layers,
using a lightweight analysis tool based on canonical correlation analysis
(CCA). We find that these properties evolve across layers differently depending
on the model, and the variations relate to the choice of pre-training
objective. We further investigate the utility of our analyses for downstream
tasks by comparing the property trends with performance on speech recognition
and spoken language understanding tasks. We discover that CCA trends provide
reliable guidance to choose layers of interest for downstream tasks and that
single-layer performance often matches or improves upon using all layers,
suggesting implications for more efficient use of pre-trained models.
Authors' comments: Accepted to ICASSP 2023. Code:
https://github.com/ankitapasad/layerwise-analysis
Le Xia, Yao Sun, Chengsi Liang, Daquan Feng, Runze Cheng, Yang Yang, Muhammad Ali Imran
Virtual reality (VR) over wireless is expected to be one of the killer
applications in next-generation communication networks. Nevertheless, the huge
data volume along with stringent requirements on latency and reliability under
limited bandwidth resources makes untethered wireless VR delivery increasingly
challenging. Such bottlenecks, therefore, motivate this work to seek the
potential of using semantic communication, a new paradigm that promises to
significantly ease the resource pressure, for efficient VR delivery. To this
end, we propose a novel framework, namely WIreless SEmantic deliveRy for VR
(WiserVR), for delivering consecutive 360$^\circ$ video frames to VR users.
Specifically, multiple deep learning-based modules are devised for the
transceiver in WiserVR to realize high-performance feature extraction and
semantic recovery. Among them, we develop the concept of a semantic
location graph and leverage the joint-semantic-channel-coding method with
knowledge sharing to not only substantially reduce communication latency, but
also to guarantee adequate transmission reliability and resilience under
various channel states. Moreover, we present an implementation of WiserVR,
followed by initial simulations that evaluate its performance against
benchmarks. Finally, we discuss several open issues and offer
feasible solutions to unlock the full potential of WiserVR.
Authors' comments: This magazine article has been accepted for publication by IEEE
Wireless Communications. Copyright may be transferred without notice, after
which this version may no longer be accessible
Anton Vasiliuk, Daria Frolova, Mikhail Belyaev, Boris Shirokikh
When applying a Deep Learning model to medical images, it is crucial to estimate the model uncertainty. Voxel-wise uncertainty is a useful visual marker for human experts and could be used to improve the model's voxel-wise output, such as segmentation. Moreover, uncertainty provides a solid foundation for out-of-distribution (OOD) detection, improving the model performance on the image-wise level. However, one of the frequent tasks in medical imaging is the segmentation of distinct, local structures such as tumors or lesions. Here, structure-wise uncertainty allows more precise operations than image-wise uncertainty and is more semantically aware than voxel-wise uncertainty. The way to produce uncertainty for individual structures remains poorly explored. We propose a framework to measure the structure-wise uncertainty and evaluate the impact of OOD data on the model performance. Thus, we identify the best uncertainty estimation (UE) method to improve the segmentation quality. The proposed framework is tested on three datasets with the tumor segmentation task: LIDC-IDRI, LiTS, and a private one with multiple brain metastases cases.
Qihan Wang, Chen Dun, Fangshuo Liao, Chris Jermaine, Anastasios Kyrillidis
Recent work on the Lottery Ticket Hypothesis (LTH) shows that there exist ``\textit{winning tickets}'' in large neural networks. These tickets represent ``sparse'' versions of the full model that can be trained independently to achieve comparable accuracy with respect to the full model. However, finding the winning tickets requires one to \emph{pretrain} the large model for at least a number of epochs, which can be a burdensome task, especially when the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms. For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that represents model convergence well. As our theory dictates, our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by these observations, we present the \emph{LOttery ticket through Filter-wise Training} algorithm, dubbed \textsc{LoFT}. \textsc{LoFT} is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that \textsc{LoFT} $i)$ preserves and finds good lottery tickets, while $ii)$ it achieves non-trivial computation and communication savings, and maintains comparable or even better accuracy than other pretraining methods.
Przemysław Berk, Frank Trujillo
The goal of this article is to show a rigidity property of conjugacies of generalized interval exchange transformations (GIETs). More precisely, we show that if two piecewise $C^3$ GIETs $f$ and $g$ of generic rotation number with mean-non-linearity 0 are homeomorphic, boundary-equivalent and their renormalizations approach in an appropriate way the set of affine interval exchange transformations, then their respective renormalizations converge to each other and the conjugating map is $C^1$. Moreover, if $f$ and $g$ are GIETs with rotation type combinatorial data, generic rotation number and they are break-equivalent as piecewise circle diffeomorphisms, they are actually $C^1$-conjugated as circle diffeomorphisms. These results generalize the work of K. Cunha and D. Smania \cite{cunha_rigidity_2014} in the case of piecewise $C^3$ circle maps, where the authors prove an analogous result for GIETs with rotation type combinatorial data and bounded rotation number.
Weirui Ye, Pieter Abbeel, Yang Gao
One of the most important AI research questions is how to trade off computation versus performance, since ``perfect rationality'' exists in theory but is impossible to achieve in practice. Recently, Monte-Carlo tree search (MCTS) has attracted considerable attention due to the significant performance improvement in various challenging domains. However, the expensive time cost during search severely restricts its scope for applications. This paper proposes the Virtual MCTS (V-MCTS), a variant of MCTS that adaptively spends more search time on harder states and less search time on simpler states. We give theoretical bounds of the proposed method and evaluate the performance and computations on $9 \times 9$ Go board games and Atari games. Experiments show that our method can achieve performance comparable to the original search algorithm while requiring less than $50\%$ of the search time on average. We believe that this approach is a viable alternative for tasks under limited time and resources. The code is available at \url{https://github.com/YeWR/V-MCTS.git}.
Xiaojun Xu, Linyi Li, Bo Li
Recent studies show that training deep neural networks (DNNs) with Lipschitz
constraints can enhance adversarial robustness and other model
properties such as stability. In this paper, we propose a layer-wise orthogonal
training method (LOT) to effectively train 1-Lipschitz convolution layers via
parametrizing an orthogonal matrix with an unconstrained matrix. We then
efficiently compute the inverse square root of a convolution kernel by
transforming the input domain to the Fourier frequency domain. On the other
hand, as existing works show that semi-supervised training helps improve
empirical robustness, we aim to bridge the gap and prove that semi-supervised
learning also improves the certified robustness of Lipschitz-bounded models. We
conduct comprehensive evaluations for LOT under different settings. We show
that LOT significantly outperforms baselines regarding deterministic $\ell_2$
certified robustness, and scales to deeper neural networks. Under the
supervised scenario, we improve the state-of-the-art certified robustness for
all architectures (e.g., from 59.04% to 63.50% on CIFAR-10 and from 32.57% to
34.59% on CIFAR-100 at radius $\rho = 36/255$ for 40-layer networks). With
semi-supervised learning over unlabelled data, we are able to improve
state-of-the-art certified robustness on CIFAR-10 at $\rho = 108/255$ from 36.04%
to 42.39%. In addition, LOT consistently outperforms baselines on different
model architectures with only 1/3 evaluation time.
Authors' comments: NeurIPS 2022
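One standard way to obtain an orthogonal matrix from an unconstrained one via an inverse square root is $W = (V V^\top)^{-1/2} V$. The NumPy sketch below applies this to a small dense matrix only. The abstract describes convolution kernels handled in the Fourier domain, which this sketch does not attempt, and the function name and the eigendecomposition-based inverse square root are assumptions made for illustration.

```python
import numpy as np


def orthogonalize(V):
    """Return W = (V V^T)^{-1/2} V for a square, full-rank matrix V."""
    gram = V @ V.T                              # symmetric positive definite
    eigvals, eigvecs = np.linalg.eigh(gram)     # gram = Q diag(eigvals) Q^T
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return inv_sqrt @ V


rng = np.random.default_rng(0)
V = rng.standard_normal((4, 4))   # unconstrained parameters
W = orthogonalize(V)              # orthogonal, hence 1-Lipschitz in the l2 norm

print(np.round(W @ W.T, 6))       # close to the identity matrix
print(np.linalg.norm(W, 2))       # spectral norm close to 1
```

Because $W W^\top = (V V^\top)^{-1/2} V V^\top (V V^\top)^{-1/2} = I$, gradient-based training can update $V$ freely while the layer that uses $W$ stays 1-Lipschitz.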
Dong-Hee Paek, Kevin Tirta Wijaya, Seung-Hyun Kong
Lane detection is one of the most important functions for autonomous driving.
In recent years, deep learning-based lane detection networks with RGB camera
images have shown promising performance. However, camera-based methods are
inherently vulnerable to adverse lighting conditions such as poor or dazzling
lighting. Unlike cameras, LiDAR sensors are robust to lighting conditions. In
this work, we propose a novel two-stage LiDAR lane detection network with a
row-wise detection approach. The first-stage network produces lane proposals
through a global feature correlator backbone and a row-wise detection head.
Meanwhile, the second-stage network refines the feature map of the first-stage
network via an attention-based mechanism between the local features around the
lane proposals, and outputs a set of new lane proposals. Experimental results
on the K-Lane dataset show that the proposed network advances the
state-of-the-art in terms of F1-score with 30% fewer GFLOPs. In addition, the
second-stage network is found to be especially robust to lane occlusions, thus
demonstrating the robustness of the proposed network for driving in crowded
environments.
Authors' comments: Accepted at 2022 IEEE Conference on Intelligent Transportation
Systems (ITSC)
Tran Van Sang, Mhd Irvan, Rie Shigetomi Yamaguchi, Toshiyuki Nakata
Natural Gradient Descent (NGD) is a second-order neural network training method that preconditions the gradient descent with the inverse of the Fisher Information Matrix (FIM). Although NGD provides an efficient preconditioner, it is impractical due to the expensive computation required when inverting the FIM. This paper proposes a new NGD variant algorithm named Component-Wise Natural Gradient Descent (CW-NGD). CW-NGD is composed of two steps. Similar to several existing works, the first step is to consider the FIM as a block-diagonal matrix whose diagonal blocks correspond to the FIM of each layer's weights. In the second step, unique to CW-NGD, we analyze the layer's structure and further decompose the layer's FIM into smaller segments whose derivatives are approximately independent. As a result, individual layers' FIMs are approximated in a block-diagonal form that trivially supports the inversion. The segment decomposition strategy varies by layer structure. Specifically, we analyze the dense and convolutional layers and design their decomposition strategies appropriately. In an experiment training a network containing these two types of layers, we empirically show that CW-NGD requires fewer iterations to converge compared to the state-of-the-art first-order and second-order methods.
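The reason a block-diagonal approximation "trivially supports the inversion" is that each block can be inverted on its own. The NumPy sketch below preconditions a gradient block by block. The block sizes, matrix values, and function name are hypothetical, and the sketch does not show how CW-NGD chooses its segments.

```python
import numpy as np


def blockwise_precondition(blocks, grad):
    """Compute F^{-1} g for a block-diagonal F given as a list of square blocks."""
    out = np.empty_like(grad, dtype=float)
    start = 0
    for block in blocks:
        n = block.shape[0]
        # Solve each small system independently instead of inverting the full matrix.
        out[start:start + n] = np.linalg.solve(block, grad[start:start + n])
        start += n
    return out


# Hypothetical Fisher blocks (symmetric positive definite) and gradient.
blocks = [
    np.array([[2.0, 0.5], [0.5, 1.0]]),
    np.array([[3.0]]),
    np.array([[1.0, 0.2], [0.2, 4.0]]),
]
grad = np.array([1.0, -2.0, 0.5, 3.0, -1.0])

step = blockwise_precondition(blocks, grad)
print(step)
```

For blocks of size $n_1, \dots, n_k$, the cost of the solves scales with $\sum_i n_i^3$ rather than $(\sum_i n_i)^3$, which is why smaller segments make the preconditioner cheaper.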
Avraham Chapman, Lingqiao Liu
It is well-known that a deep neural network has a strong fitting capability
and can easily achieve a low training error even with randomly assigned class
labels. When the number of training samples is small, or the class labels are
noisy, networks tend to memorize patterns specific to individual instances to
minimize the training error. This leads to the issue of overfitting and poor
generalisation performance. This paper explores a remedy by suppressing the
network's tendency to rely on instance-specific patterns for empirical error
minimisation. The proposed method is based on an adversarial training
framework. It suppresses features that can be utilized to identify individual
instances among samples within each class. This leads to classifiers only using
features that are both discriminative across classes and common within each
class. We call our method Adversarial Suppression of Identity Features (ASIF),
and demonstrate the usefulness of this technique in boosting generalisation
accuracy when faced with small datasets or noisy labels. Our source code is
available.
Authors' comments: DICTA 2022
Katelinh Jones, Yuya Jeremy Ong, Yi Zhou, Nathalie Baracaldo
Federated Learning (FL) is a paradigm for jointly training machine learning
algorithms in a decentralized manner which allows for parties to communicate
with an aggregator to create and train a model, without exposing the underlying
raw data distribution of the local parties involved in the training process.
Most research in FL has focused on neural network-based approaches;
however, tree-based methods, such as XGBoost, have been underexplored in
Federated Learning due to the challenges in overcoming the iterative and
additive characteristics of the algorithm. Decision tree-based models, in
particular XGBoost, can handle non-IID data, which is significant for
algorithms used in Federated Learning frameworks since the underlying
characteristics of the data are decentralized and have risks of being non-IID
by nature. In this paper, we investigate how Federated XGBoost is impacted by
non-IID distributions by performing experiments on various sample size-based
data skew scenarios. We conduct a set of extensive experiments across multiple
datasets and data skew partitions. Our experimental results demonstrate that,
despite the various partition ratios, the performance of the models stayed
consistent and was close or equal to that of models trained in a centralized
manner.
Authors' comments: 9 Pages, 1 figure, 3 tables
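A sample size-based data skew can be simulated by splitting one dataset among parties according to unequal ratios. The Python sketch below is a minimal illustration. The function name, the 1:2:7 ratio, and the dataset size are hypothetical and are not the partitions used in the paper.

```python
import random


def partition_by_ratio(num_samples, ratios, seed=0):
    """Split sample indices among parties in proportion to integer ratios."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # shuffle so every party sees a random subset

    total = sum(ratios)
    sizes = [num_samples * r // total for r in ratios]
    sizes[-1] += num_samples - sum(sizes)  # give any rounding remainder to the last party

    parts, start = [], 0
    for size in sizes:
        parts.append(indices[start:start + size])
        start += size
    return parts


# Hypothetical skew: three parties holding 10%, 20%, and 70% of the data.
parties = partition_by_ratio(1000, [1, 2, 7])
print([len(p) for p in parties])
```

Each party would then train on its own index subset, while a centralized baseline trains on all indices at once.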
Hao-Wei Chen, Ting-Hsuan Liao, Hsuan-Kung Yang, Chun-Yi Lee
This paper introduces pixel-wise prediction based visual odometry (PWVO), which is a dense prediction task that evaluates the values of translation and rotation for every pixel in its input observations. PWVO employs uncertainty estimation to identify the noisy regions in the input observations, and adopts a selection mechanism to integrate pixel-wise predictions based on the estimated uncertainty maps to derive the final translation and rotation. In order to train PWVO in a comprehensive fashion, we further develop a data generation workflow for generating synthetic training data. The experimental results show that PWVO is able to deliver favorable results. In addition, our analyses validate the effectiveness of the designs adopted in PWVO, and demonstrate that the uncertainty maps estimated by PWVO are capable of capturing the noise in its input observations.
Faranak Tohidi, Manoranjan Paul, Anwaar Ulhaq
With the fast growth of immersive video sequences, achieving seamless and high-quality compressed 3D content is even more critical. MPEG recently developed a video-based point cloud compression (V-PCC) standard for dynamic point cloud coding. However, reconstructed point clouds using V-PCC suffer from different artifacts, including losing data during pre-processing before applying existing video coding techniques, e.g., High-Efficiency Video Coding (HEVC). Patch generation and self-occluded points in the 3D-to-2D projection are the main reasons for missing data when using V-PCC. This paper proposes a new method that introduces overlapping slicing as an alternative to patch generation to decrease the number of patches generated and the amount of data lost. In the proposed method, the entire point cloud is cross-sectioned into variable-sized slices based on the number of self-occluded points so that data loss can be minimized in the patch generation process and projection. For this, a variable number of layers are considered, partially overlapped to retain the self-occluded points. The proposed method's added advantage is to reduce the bit requirement and to encode geometric data using the slicing base position. The experimental results show that the proposed method is much more flexible than the standard V-PCC method, improves the rate-distortion performance, and decreases the data loss significantly compared to the standard V-PCC method.
Yunpeng Bai, Chao Dong, Cairong Wang
We study how to represent a video with implicit neural representations
(INRs). Classical INR methods generally utilize MLPs to map input coordinates
to output pixels, while some recent works have tried to directly reconstruct
the whole image with CNNs. However, we argue that both the above pixel-wise and
image-wise strategies are not favorable to video data. Instead, we propose a
patch-wise solution, PS-NeRV, which represents videos as a function of patches
and the corresponding patch coordinate. It naturally inherits the advantages of
image-wise methods, and achieves excellent reconstruction performance with fast
decoding speed. The whole method includes conventional modules, like positional
embedding, MLPs and CNNs, while also introducing AdaIN to enhance intermediate
features. These simple yet essential changes could help the network easily fit
high-frequency details. Extensive experiments have demonstrated its
effectiveness in several video-related tasks, such as video compression and
video inpainting.
Authors' comments: 9 pages, 11 figures
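To illustrate what indexing a video by patches can look like, the Python sketch below enumerates one coordinate per patch and extracts the matching pixel block from a frame. The helper names, the (time, row, column) layout, and the normalization to [0, 1] are assumptions for illustration. The abstract does not specify how PS-NeRV encodes its patch coordinates.

```python
def patch_coordinates(num_frames, height, width, patch_size):
    """Return one normalized (t, row, col) coordinate per non-overlapping patch."""
    rows = height // patch_size
    cols = width // patch_size
    coords = []
    for t in range(num_frames):
        for r in range(rows):
            for c in range(cols):
                coords.append((
                    t / max(num_frames - 1, 1),
                    r / max(rows - 1, 1),
                    c / max(cols - 1, 1),
                ))
    return coords


def extract_patch(frame, r, c, patch_size):
    """Slice the patch at grid position (r, c) out of a frame given as nested lists."""
    top, left = r * patch_size, c * patch_size
    return [row[left:left + patch_size] for row in frame[top:top + patch_size]]


# A toy 4x4 single-channel frame with pixel values 0..15.
frame = [[4 * y + x for x in range(4)] for y in range(4)]

coords = patch_coordinates(num_frames=2, height=4, width=6, patch_size=2)
print(len(coords), coords[0], coords[-1])
print(extract_patch(frame, 1, 1, 2))
```

In a patch-wise representation, a network would be trained to map each such coordinate to the pixels of its patch, which sits between mapping single pixels and mapping whole frames.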