Benjamin F. Maier
We discuss multiple classes of piece-wise Pareto-like power law probability
density functions $p(x)$ with two regimes, a non-pathological core with
non-zero, finite values for support $0\leq x\leq x_{\mathrm{min}}$ and a
power-law tail with exponent $-\alpha$ for $x>x_{\mathrm{min}}$. The cores take
the respective shapes (i) $p(x)\propto (x/x_{\mathrm{min}})^\beta$, (ii)
$p(x)\propto\exp(-\beta[x/x_{\mathrm{min}}-1])$, and (iii) $p(x)\propto
[2-(x/x_{\mathrm{min}})^\beta]$, including the special case $\beta=0$ leading
to core $p(x)=\mathrm{const}$. We derive explicit maximum-likelihood estimators
and/or efficient numerical methods to find the best-fit parameter values for
empirical data. Solutions for the special cases $\alpha=\beta$ are presented
as well. The results are made available as a Python package.
Authors' comments: 9 pages, 3 figures
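A minimal sketch of core shape (i), assuming the density is continuous at $x_{\mathrm{min}}$ and that $\alpha>1$, $\beta\geq 0$, so both pieces are integrable. The function names are hypothetical and are not the API of the accompanying Python package. Under these assumptions, integrating the two pieces analytically gives the normalization constant $C=(\alpha-1)(\beta+1)/[x_{\mathrm{min}}(\alpha+\beta)]$.

```python
def piecewise_powerlaw_pdf(x, alpha, beta, xmin):
    """Density with core C*(x/xmin)**beta on [0, xmin] and tail C*(x/xmin)**(-alpha) beyond.

    Assumes continuity at xmin (an assumption of this sketch).
    """
    if alpha <= 1 or beta < 0 or xmin <= 0:
        raise ValueError("this sketch needs alpha > 1, beta >= 0, xmin > 0")
    # Core mass is C*xmin/(beta+1), tail mass is C*xmin/(alpha-1); they must sum to 1.
    c = (alpha - 1.0) * (beta + 1.0) / (xmin * (alpha + beta))
    if x < 0:
        return 0.0
    if x <= xmin:
        return c * (x / xmin) ** beta
    return c * (x / xmin) ** (-alpha)


def tail_mass(alpha, beta):
    """Probability of observing x > xmin under the density above."""
    return (beta + 1.0) / (alpha + beta)
```

The tail mass depends only on the two exponents, which gives a quick sanity check against the empirical fraction of samples above $x_{\mathrm{min}}$.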
Shaocong Xu, Pengfei Li, Qianpu Sun, Xinyu Liu, Yang Li, Shihui Guo, Zhen Wang, Bo Jiang et al.
LiDAR-based semantic scene understanding is an important module in the modern
autonomous driving perception stack. However, identifying outlier points in a
LiDAR point cloud is challenging as LiDAR point clouds lack semantically-rich
information. While former SOTA methods adopt heuristic architectures, we
revisit this problem from the perspective of Selective Classification, which
introduces a selective function into the standard closed-set classification
setup. Our solution builds on the basic idea of abstaining from choosing
any inlier category, and additionally learns a point-wise abstaining penalty
with a margin-based loss. Apart from learning paradigms, synthesizing outliers
to approximate unlimited real outliers is also critical, so we propose a strong
synthesis pipeline that generates outliers originating from various factors:
object categories, sampling patterns and sizes. We demonstrate that learning
different abstaining penalties, in addition to the point-wise penalty, for
different types of (synthesized) outliers can further improve the performance.
We benchmark our method on SemanticKITTI and nuScenes and achieve SOTA results.
Code is available at https://github.com/Daniellli/LiON/.
Authors' comments: Accepted by AAAI2025.
Junwen Duan, Han Jiang, Ying Yu
International Classification of Diseases (ICD) coding is the task of
assigning ICD diagnosis codes to clinical notes. This can be challenging given
the large quantity of labels (nearly 9,000) and lengthy texts (up to 8,000
tokens). However, unlike the single-pass reading process in previous works,
humans tend to read the text and label definitions again to get more confident
answers. Moreover, although pretrained language models have been used to
address these problems, they suffer from huge memory usage. To address the
above problems, we propose a simple but effective model called the Multi-Hop
Label-wise ATtention (MHLAT), in which multi-hop label-wise attention is
deployed to get more precise and informative representations. Extensive
experiments on three benchmark MIMIC datasets indicate that our method achieves
significantly better or competitive performance on all seven metrics, with far
fewer parameters to optimize.
Authors' comments: 5 pages, 1 figure, accepted in ICASSP 2023
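As a rough illustration of label-wise attention, the sketch below computes one label-specific document vector per label from token representations. It shows a single hop of the standard mechanism only; the multi-hop update used in MHLAT is not specified in the abstract and is not reproduced here. The function names and the toy inputs are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_wise_attention(hidden, label_queries):
    """hidden: n tokens x d, label_queries: L labels x d.

    Returns L vectors of size d, one attention-weighted summary per label.
    """
    d = len(hidden[0])
    outputs = []
    for q in label_queries:
        # Score every token against this label's query vector.
        scores = [sum(qk * hk for qk, hk in zip(q, h)) for h in hidden]
        weights = softmax(scores)
        # Weighted sum of token vectors gives the label-specific representation.
        outputs.append([sum(w * h[k] for w, h in zip(weights, hidden)) for k in range(d)])
    return outputs
```

Each label attends to the text independently, so a label can focus on the few tokens relevant to it even in a long clinical note.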
Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi
We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating the control feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input, performs the best out of the four proposed systems. Audio examples are available at https://erl-j.github.io/neural-guitar-web-supplement.
Yakun Yu, Shi-ang Qi, Jiuding Yang, Liyao Jiang, Di Niu
Current recommender systems employ large embedding tables with uniform
dimensions for all features, leading to overfitting, high computational cost,
and suboptimal generalization performance. Many techniques aim to solve this
issue by feature selection or embedding dimension search. However, these
techniques typically select a fixed subset of features or embedding dimensions
for all instances and feed all instances into one recommender model without
considering heterogeneity between items or users. This paper proposes a novel
instance-wise Hierarchical Architecture Search framework, iHAS, which automates
neural architecture search at the instance level. Specifically, iHAS
incorporates three stages: searching, clustering, and retraining. The searching
stage identifies optimal instance-wise embedding dimensions across different
field features via carefully designed Bernoulli gates with stochastic selection
and regularizers. After obtaining these dimensions, the clustering stage
divides samples into distinct groups via a deterministic selection approach of
Bernoulli gates. The retraining stage then constructs different recommender
models, each one designed with optimal dimensions for the corresponding group.
We conduct extensive experiments to evaluate the proposed iHAS on two public
benchmark datasets from a real-world recommender system. The experimental
results demonstrate the effectiveness of iHAS and its outstanding
transferability to widely-used deep recommendation models.
Authors' comments: Accepted as CIKM23 Long paper
Tetsuya Abe, Ryusuke Sagawa, Ko Ayusawa, Wataru Takano
The present paper proposes an encoder-decoder model for extracting the structures of human motions represented by frame-wise discrete features in a self-supervised manner. In the proposed method, features are extracted as codes in a motion codebook without the use of human knowledge, and the relationship between these codes can be visualized on a graph. Since the codes are expected to be temporally sparse compared to the captured frame rate and can be shared by multiple sequences, the proposed network model also addresses the need for training constraints. Specifically, the model consists of self-attention layers and a vector clustering block. The attention layers contribute to finding sparse keyframes and discrete features as motion codes, which are then extracted by vector clustering. The constraints are realized as training losses so that the same motion codes can be as contiguous as possible and can be shared by multiple sequences. In addition, we propose the use of causal self-attention as a method by which to calculate attention for long sequences consisting of numerous frames. In our experiments, the sparse structures of motion codes were used to compile a graph that facilitates visualization of the relationship between the codes and the differences between sequences. We then evaluated the effectiveness of the extracted motion codes by applying them to multiple recognition tasks and found that performance levels comparable to task-optimized methods could be achieved by linear probing.
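The causal self-attention mentioned above can be sketched as follows: each frame attends only to itself and to earlier frames. This is a simplified illustration, assuming the input features serve directly as queries, keys, and values (no learned projections, single head); it is not the network described in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """x: T frames x d features. Returns T x d outputs where frame t sees frames 0..t only."""
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for t in range(len(x)):
        # Restricting the range to s <= t is the causal mask.
        scores = [scale * sum(a * b for a, b in zip(x[t], x[s])) for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * x[s][k] for s, w in enumerate(weights)) for k in range(d)])
    return out
```

Because frame t never depends on later frames, outputs for a prefix of the sequence stay unchanged when more frames are appended.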
Pengfei Guo, Warren Richard Morningstar, Raviteja Vemulapalli, Karan Singhal, Vishal M. Patel, Philip Andrew Mansfield
Large machine learning models trained on diverse data have recently seen unprecedented success. Federated learning enables training on private data that may otherwise be inaccessible, such as domain-specific datasets decentralized across many clients. However, federated learning can be difficult to scale to large models when clients have limited resources. This challenge often results in a trade-off between model size and access to diverse data. To mitigate this issue and facilitate training of large models on edge devices, we introduce a simple yet effective strategy, Federated Layer-wise Learning, to simultaneously reduce per-client memory, computation, and communication costs. Clients train just a single layer each round, reducing resource costs considerably with minimal performance degradation. We also introduce Federated Depth Dropout, a complementary technique that randomly drops frozen layers during training, to further reduce resource usage. Coupling these two techniques enables us to effectively train significantly larger models on edge devices. Specifically, we reduce training memory usage by 5x or more in federated self-supervised representation learning and demonstrate that performance in downstream tasks is comparable to conventional federated self-supervised learning.
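A toy sketch of one round in which clients train a single layer, assuming a model stored as a list of weight lists and a hypothetical local update that nudges weights toward a client-specific target. Only the active layer is updated, averaged, and replaced; this illustrates the idea and is not the authors' implementation.

```python
def client_update(layer, client_target, lr=0.5):
    """Hypothetical local step: move each weight of the active layer toward a client target."""
    return [w - lr * (w - t) for w, t in zip(layer, client_target)]


def federated_layerwise_round(model, active, client_targets):
    """Train and average only model[active]; every other layer stays frozen.

    model: list of layers, each a list of floats.
    active: index of the single layer trained this round.
    client_targets: one target vector per client (stand-in for private data).
    """
    updates = [client_update(model[active], tgt) for tgt in client_targets]
    # Server-side averaging over clients, applied to the active layer only.
    averaged = [sum(ws) / len(ws) for ws in zip(*updates)]
    new_model = [list(layer) for layer in model]
    new_model[active] = averaged
    return new_model
```

In this toy setup each client computes and returns updates for one layer only, which is the source of the memory, computation, and communication savings described above.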
Shaojie Zhang, Jianqin Yin, Yonghao Dang, Jiajun Fu
Graph convolution networks (GCNs) have achieved remarkable performance in
skeleton-based action recognition. However, previous GCN-based methods rely
excessively on elaborate human priors and construct complex feature aggregation
mechanisms, which limits the generalizability and effectiveness of networks. To
solve these problems, we propose a novel Spatial Topology Gating Unit (STGU),
an MLP-based variant without extra priors, to capture the co-occurrence
topology features that encode the spatial dependency across all joints. In
STGU, to learn the point-wise topology features, a new gate-based feature
interaction mechanism is introduced to activate the features point-to-point by
the attention map generated from the input sample. Based on the STGU, we
propose the first MLP-based model, SiT-MLP, for skeleton-based action
recognition in this work. Compared with previous methods on three large-scale
datasets, SiT-MLP achieves competitive performance. In addition, SiT-MLP
reduces the parameters significantly with favorable results. The code will be
available at https://github.com/BUPTSJZhang/SiT-MLP.
Authors' comments: Accepted by IEEE TCSVT 2024
Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang et al.
The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a \textit{frozen} pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.
Junjie Wang, Meng Guo, Zhongkui Li
Multi-agent systems outperform single agents in complex collaborative tasks.
However, in large-scale scenarios, ensuring timely information exchange during
decentralized task execution remains a challenge. This work presents an online
decentralized coordination scheme for multi-agent systems under complex local
tasks and intermittent communication constraints. Unlike existing strategies
that enforce all-time or intermittent connectivity, our approach allows agents
to join or leave communication networks at aperiodic intervals, as deemed
optimal by their online task execution. This scheme concurrently determines
local plans and refines the communication strategy, i.e., where and when to
communicate as a team. A decentralized potential game is modeled among agents,
for which a Nash equilibrium is generated iteratively through online local
search. It guarantees local task completion and satisfaction of the intermittent
communication constraints. Extensive numerical simulations are conducted against several
strong baselines.
Authors' comments: 6 pages, 2 figures
Yuta Sato
Population aging is one of the most serious problems in certain countries. Implementing countermeasures urgently requires understanding its rapid progress at a granular resolution. However, a detailed and rigorous survey with high frequency is not feasible due to the constraints of financial and human resources. Deep learning is now prevalent in pattern recognition, where it achieves significant accuracy, including in remote sensing applications. This paper proposes a multi-head Convolutional Neural Network model with transfer learning from a pre-trained ResNet50 for estimating mesh-wise demographics of Japan, one of the most aged countries in the world, with satellite images from Landsat-8/OLI and Suomi NPP/VIIRS-DNS as inputs and census demographics as labels. The trained model was evaluated on a testing dataset, achieving a test score of at least 0.8914 in $\text{R}^2$ for all the demographic composition groups, and the estimated demographic composition was generated and visualised for 2022, a non-census year.
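For reference, the $\text{R}^2$ score reported per demographic group can be computed as one minus the ratio of residual to total sum of squares. The sketch below assumes one list of census labels and one list of predictions per group; the helper name and the group names in the usage note are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def worst_group_r_squared(labels_by_group, preds_by_group):
    """Lowest R^2 across groups, i.e. the 'at least' figure for a multi-head model."""
    return min(r_squared(labels_by_group[g], preds_by_group[g]) for g in labels_by_group)
```

A call such as `worst_group_r_squared({"young": [...], "elderly": [...]}, {...})` would return the minimum score over the heads, which corresponds to an "at least" statement like the one above.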
Yifan Sun, Feihan Li, Weiye Zhao, Rui Chen, Tianhao Wei, Changliu Liu
Deep reinforcement learning (RL) excels in various control tasks, yet the
absence of safety guarantees hampers its real-world applicability. In
particular, exploration during learning usually results in safety violations,
while the RL agent learns from those mistakes. On the other hand, safe control
techniques ensure persistent safety satisfaction but demand strong priors on
system dynamics, which are usually hard to obtain in practice. To address these
problems, we present Safe Set Guided State-wise Constrained Policy Optimization
(S-3PO), a pioneering algorithm generating state-wise safe optimal policies
with zero training violations, i.e., learning without mistakes. S-3PO first
employs a safety-oriented monitor with black-box dynamics to ensure safe
exploration. It then enforces an "imaginary" cost for the RL agent to converge
to optimal behaviors within safety constraints. S-3PO outperforms existing
methods in high-dimensional robotics tasks, managing state-wise constraints
with zero training violations. This innovation marks a significant stride
towards real-world safe RL deployment.
Authors' comments: arXiv admin note: text overlap with arXiv:2306.12594
Soonwoo Kwon, Sojung Kim, Seunghyun Lee, Jin-Young Kim, Suyeong An, Kyuseok Kim
Computerized Adaptive Testing (CAT) is a widely used, efficient test mode
that adapts to the examinee's proficiency level in the test domain. CAT
requires pre-trained item profiles, because it iteratively assesses the student
in real time based on the registered items' profiles, and selects the next item
to administer using candidate items' profiles. However, obtaining such item
profiles is a costly process that involves gathering large, dense
item-response data, then training a diagnostic model on the collected data. In
this paper, we explore the possibility of leveraging response data collected in
the CAT service. We first show that this poses a unique challenge due to the
inherent selection bias introduced by CAT, i.e., more proficient students will
receive harder questions. Indeed, when naively training the diagnostic model
using CAT response data, we observe that item profiles deviate significantly
from the ground-truth. To tackle the selection bias issue, we propose the
user-wise aggregate influence function method. Our intuition is to filter out
users whose response data is heavily biased in an aggregate manner, as judged
by how much perturbation the added data will introduce during parameter
estimation. This way, we may enhance the performance of CAT while introducing
minimal bias to the item profiles. We provide extensive experiments to
demonstrate the superiority of our proposed method on three public
datasets and one dataset that contains real-world CAT response data.
Authors' comments: CIKM 2023
Leander Weber, Jim Berend, Moritz Weckbecker, Alexander Binder, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Gradient-based optimization has been a cornerstone of machine learning enabling the vast advances of AI development over the past decades. However, since this type of optimization requires differentiation, it reduces flexibility in the choice of model and objective. With recent evidence of the benefits of non-differentiable (e.g. neuromorphic) architectures over classical models, such constraints can become limiting in the future. We present Layer-wise Feedback Propagation (LFP), a novel training principle for neural network-like predictors utilizing methods from the domain of explainability to decompose a reward to individual neurons based on their respective contributions to solving a given task without imposing any differentiability requirements. Leveraging these neuron-wise rewards, our method then implements a greedy approach reinforcing helpful parts of the network and weakening harmful ones. While having comparable computational complexity to gradient descent, LFP offers the advantage that it obtains sparse models due to an implicit weight scaling. We establish the convergence of LFP theoretically and empirically, demonstrating its effectiveness on various models and datasets. We further investigate two applications for LFP: Firstly, neural network pruning, and secondly, the optimization of neuromorphic architectures such as Heaviside step function activated Spiking Neural Networks (SNNs). In the first setting, LFP naturally generates sparse models that are easily prunable and thus efficiently encode and compute information. In the second setting, LFP achieves comparable performance to surrogate gradient descent, but provides approximation-free training, which eases the implementation on neuromorphic hardware. Consequently, LFP combines efficiency in terms of computation and representation with flexibility w.r.t. model architecture and objective function. Our code is available.
Mohammad Mahdi Abedi, David Pardo, Tariq Alkhalifah
Trace-wise noise is a type of noise often seen in seismic data, which is characterized by vertical coherency and horizontal incoherency. To attenuate this type of noise with self-supervised deep learning, the conventional blind-trace approach trains a network to blindly reconstruct each trace in the data from its surrounding traces; it attenuates isolated trace-wise noise but causes signal leakage in clean and noisy traces and reconstruction errors next to each noisy trace. To reduce signal leakage and improve denoising, we propose a new loss function and masking procedure in semi-blind-trace deep learning. Our hybrid loss function has weighted active zones that cover masked and non-masked traces. Therefore, the network is not blinded to clean traces during their reconstruction. During training, we dynamically change the masks' characteristics. The goal is to train the network to learn the characteristics of the signal instead of noise. The proposed algorithm enables the designed U-net to detect and attenuate trace-wise noise without having prior information about the noise. A new hyperparameter of our method is the relative weight between the masked and non-masked traces' contribution to the loss function. Numerical experiments show that selecting a small value for this parameter is enough to significantly decrease signal leakage. The proposed algorithm is tested on synthetic and real off-shore and land datasets with different noises. The results show the superb ability of the method to attenuate trace-wise noise while preserving other events. A Python implementation of the proposed algorithm is also made available.
Taeryung Lee, Yeonguk Oh, Kyoung Mu Lee
In this paper, we propose P3D, a human part-wise motion context learning
framework for sign language recognition. Our main contributions lie in two
dimensions: learning the part-wise motion context and employing the pose
ensemble to utilize 2D and 3D poses jointly. First, our empirical observation
implies that part-wise context encoding benefits the performance of sign
language recognition. While previous methods of sign language recognition
learned motion context from the sequence of the entire pose, we argue that such
methods cannot exploit part-specific motion context. In order to utilize
part-wise motion context, we propose the alternating combination of a part-wise
encoding Transformer (PET) and a whole-body encoding Transformer (WET). PET
encodes the motion contexts from a part sequence, while WET merges them into a
unified context. By learning part-wise motion context, our P3D achieves
superior performance on WLASL compared to previous state-of-the-art methods.
Second, our framework is the first to ensemble 2D and 3D poses for sign
language recognition. Since the 3D pose holds rich motion context and depth
information to distinguish the words, our P3D outperforms the previous
state-of-the-art methods employing a pose ensemble.
Authors' comments: ICCV 2023
Yuhao Yang, Jun Wu, Guangjian Zhang, Rong Xiong
Traditional geometric registration based estimation methods only exploit the CAD model implicitly, which makes them dependent on observation quality and vulnerable to occlusion. To address the problem, the paper proposes a bidirectional correspondence prediction network with a point-wise attention-aware mechanism. This network not only requires the model points to predict the correspondence but also explicitly models the geometric similarities between observations and the model prior. Our key insight is that the correlations between each model point and scene point provide essential information for learning point-pair matches. To further tackle the correlation noise brought by feature distribution divergence, we design a simple but effective pseudo-siamese network to improve feature homogeneity. Experimental results on the public datasets of LineMOD, YCB-Video, and Occ-LineMOD show that the proposed method achieves better performance than other state-of-the-art methods under the same evaluation criteria. Its robustness in estimating poses is greatly improved, especially in an environment with severe occlusions.
Anton Baumann, Thomas Roßberg, Michael Schmitt
Uncertainty estimation in machine learning is paramount for enhancing the
reliability and interpretability of predictive models, especially in
high-stakes real-world scenarios. Despite the availability of numerous methods,
they often pose a trade-off between the quality of uncertainty estimation and
computational efficiency. Addressing this challenge, we present an adaptation
of the Multiple-Input Multiple-Output (MIMO) framework -- an approach
exploiting the overparameterization of deep neural networks -- for pixel-wise
regression tasks. Our MIMO variant expands the applicability of the approach
from simple image classification to broader computer vision domains. For that
purpose, we adapted the U-Net architecture to train multiple subnetworks within
a single model, harnessing the overparameterization in deep neural networks.
Additionally, we introduce a novel procedure for synchronizing subnetwork
performance within the MIMO framework. Our comprehensive evaluations of the
resulting MIMO U-Net on two orthogonal datasets demonstrate comparable accuracy
to existing models, superior calibration on in-distribution data, robust
out-of-distribution detection capabilities, and considerable improvements in
parameter size and inference time. Code available at
github.com/antonbaumann/MIMO-Unet
Authors' comments: 8 pages (references do not count), Accepted at UnCV (Workshop on
Uncertainty Quantification for Computer Vision at ICCV)
Jun Zhou, Kai Chen, Linlin Xu, Qi Dou, Jing Qin
One critical challenge in 6D object pose estimation from a single RGBD image
is efficient integration of two different modalities, i.e., color and depth. In
this work, we tackle this problem by a novel Deep Fusion Transformer (DFTr)
block that can aggregate cross-modality features for improving pose estimation.
Unlike existing fusion methods, the proposed DFTr can better model
cross-modality semantic correlation by leveraging their semantic similarity,
such that globally enhanced features from different modalities can be better
integrated for improved information extraction. Moreover, to further improve
robustness and efficiency, we introduce a novel weighted vector-wise voting
algorithm that employs a non-iterative global optimization strategy for precise
3D keypoint localization while achieving near real-time inference. Extensive
experiments show the effectiveness and strong generalization capability of our
proposed 3D keypoint voting algorithm. Results on four widely used benchmarks
also demonstrate that our method outperforms the state-of-the-art methods by
large margins.
Authors' comments: Accepted by ICCV2023
Guillermo Carbajal, Patricia Vitoria, José Lezama, Pablo Musé
In recent years, the removal of motion blur in photographs has seen impressive progress in the hands of deep learning-based methods, trained to map directly from blurry to sharp images. For this reason, approaches that explicitly use a forward degradation model have received significantly less attention. However, a well-defined specification of the blur genesis, as an intermediate step, promotes the generalization and explainability of the method. Towards this goal, we propose a learning-based motion deblurring method based on dense non-uniform motion blur estimation followed by a non-blind deconvolution approach. Specifically, given a blurry image, a first network estimates the dense per-pixel motion blur kernels using a lightweight representation composed of a set of image-adaptive basis motion kernels and the corresponding mixing coefficients. Then, a second network, trained jointly with the first one, unrolls a non-blind deconvolution method using the motion kernel field estimated by the first network. The model-driven aspect is further promoted by training the networks on sharp/blurry pairs synthesized according to a convolution-based, non-uniform motion blur degradation model. Qualitative and quantitative evaluation shows that the kernel prediction network produces accurate motion blur estimates, and that the deblurring pipeline leads to restorations of real blurred images that are competitive or superior to those obtained with existing end-to-end deep learning-based methods. Code and trained models are available at https://github.com/GuillermoCarbajal/J-MKPD/.
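The lightweight kernel representation can be illustrated with a small sketch: each pixel's blur kernel is a weighted sum of shared basis kernels, and the blurry value is that kernel applied to the pixel's neighbourhood. The function names, the correlation-style indexing, and the edge clamping are assumptions of this sketch, not details taken from the released code.

```python
def mix_kernels(basis, coeffs):
    """Per-pixel kernel as a weighted sum of B basis kernels (each k x k)."""
    k = len(basis[0])
    return [[sum(c * kern[i][j] for c, kern in zip(coeffs, basis)) for j in range(k)]
            for i in range(k)]


def blur_nonuniform(image, basis, coeff_maps):
    """Apply a spatially varying blur to a 2D image given as a list of rows.

    coeff_maps[b][r][c] is the mixing weight of basis kernel b at pixel (r, c).
    Border pixels are handled by clamping indices to the image (an assumption).
    """
    h, w = len(image), len(image[0])
    k = len(basis[0])
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            kernel = mix_kernels(basis, [m[r][c] for m in coeff_maps])
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr = min(max(r + i - half, 0), h - 1)
                    cc = min(max(c + j - half, 0), w - 1)
                    acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out
```

Storing B coefficient maps plus B small kernels is much cheaper than storing a full k x k kernel at every pixel, which is what makes the representation lightweight.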