Tobias Cord-Landwehr, Christoph Boeddeker, Cătălin Zorilă, Rama Doddipatla, Reinhold Haeb-Umbach
We propose a modified teacher-student training for the extraction of
frame-wise speaker embeddings that allows for an effective diarization of
meeting scenarios containing partially overlapping speech. To this end, a
geodesic distance loss is used that enforces the embeddings computed from
regions with two active speakers to lie on the shortest path on a sphere
between the points given by the d-vectors of each of the active speakers. Using
those frame-wise speaker embeddings in clustering-based diarization outperforms
segment-level clustering-based diarization systems such as VBx and Spectral
Clustering. By extending our approach to a mixture-model-based diarization, the
performance can be further improved, approaching the diarization error rates of
diarization systems that use a dedicated overlap detection, and outperforming
these systems when also employing an additional overlap detection.
Authors' comments: Accepted at ICASSP 2024
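The geometric idea behind the loss can be illustrated with a minimal NumPy sketch. It is not the authors' loss function: the helper names `geodesic_distance`, `slerp` and `detour`, and the toy 3-dimensional vectors standing in for d-vectors, are assumptions made for illustration. A point on the shortest path between two unit vectors splits the arc length exactly, so the "detour" through it is zero, while any point off the arc incurs a positive detour.

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit sphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def geodesic_distance(u, v):
    """Great-circle distance (angle in radians) between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def slerp(u, v, t):
    """Point at fraction t along the shortest arc from u to v (unit vectors)."""
    omega = geodesic_distance(u, v)
    if omega < 1e-12:  # identical points: the arc is degenerate
        return u.copy()
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

# Toy stand-ins for the d-vectors of two active speakers (hypothetical values).
d1 = normalize([1.0, 0.2, 0.0])
d2 = normalize([0.1, 1.0, 0.3])

# One embedding lying on the arc between d1 and d2, and one pushed off it.
on_arc = slerp(d1, d2, 0.4)
off_arc = normalize(on_arc + np.array([0.0, 0.0, 0.5]))

def detour(e):
    """Extra path length incurred by passing through e; zero if e is on the arc."""
    return (geodesic_distance(d1, e) + geodesic_distance(e, d2)
            - geodesic_distance(d1, d2))

print(detour(on_arc), detour(off_arc))
```

A quantity such as `detour` could in principle serve as a penalty for embeddings from two-speaker regions, but whether the paper's loss takes this form is not stated in the abstract.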
Punit Vadher, Devsi Bantva
Let $G$ be a simple finite connected graph with vertex set $V(G) =
\{v_1,v_2,\ldots,v_n\}$. Denote the degree of vertex $v_i$ by $d_i$ for all $1
\leq i \leq n$. The Randi\'c matrix of $G$, denoted by $R(G) = [r_{i,j}]$, is
the $n \times n$ matrix whose $(i,j)$-entry $r_{i,j}$ is $r_{i,j} =
1/\sqrt{d_id_j}$ if $v_i$ and $v_j$ are adjacent in $G$ and 0 otherwise. A tree
is a connected acyclic graph. A level-wise regular tree is a tree $T$ rooted at one
vertex $r$ or two (adjacent) vertices $r$ and $r'$ in which all vertices with
the minimum distance $i$ from $r$ or $r'$ have the same degree $m_i$ for $0
\leq i \leq h$, where $h$ is the height of $T$. In this paper, we give a
complete characterization of the eigenvalues with their multiplicity of the
Randi\'c matrix of level-wise regular trees. We prove that the eigenvalues of
the Randi\'c matrix of a level-wise regular tree are the eigenvalues of
particular tridiagonal matrices, which are formed using the degree sequence
$(m_0,m_1,\ldots,m_{h-1})$ of level-wise regular trees.
Authors' comments: 20 pages, 2 figures
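As a small worked instance of the definition above, the following NumPy sketch builds the Randić matrix of the star $K_{1,3}$, which is a level-wise regular tree rooted at its centre with $m_0 = 3$ and height $h = 1$. The function name `randic_matrix` and the choice of example are illustrative assumptions; the sketch computes the spectrum numerically and does not reproduce the paper's tridiagonal construction.

```python
import numpy as np

def randic_matrix(adj):
    """Randić matrix: r_ij = 1/sqrt(d_i * d_j) if v_i and v_j are adjacent, else 0.

    Assumes a connected graph, so every vertex degree is positive.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * np.outer(inv_sqrt, inv_sqrt)

# Star K_{1,3}: vertex 0 is the root (degree 3), vertices 1..3 are leaves.
adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])

R = randic_matrix(adj)
# R is symmetric, so eigvalsh applies and returns real eigenvalues.
eigenvalues = np.sort(np.linalg.eigvalsh(R))
print(np.round(eigenvalues, 6))
```

Here $R$ equals the adjacency matrix scaled by $1/\sqrt{3}$, and the adjacency spectrum of $K_{1,3}$ is $\{\pm\sqrt{3}, 0, 0\}$, so the Randić spectrum is $\{-1, 0, 0, 1\}$.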
Rongyu Zhang, Yulin Luo, Jiaming Liu, Huanrui Yang, Zhen Dong, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata et al.
The Mixture-of-Experts (MoE) approach has demonstrated outstanding
scalability in multi-task learning including low-level upstream tasks such as
concurrent removal of multiple adverse weather effects. However, the
conventional MoE architecture with parallel Feed Forward Network (FFN) experts
leads to significant parameter and computational overheads that hinder its
efficient deployment. In addition, the naive MoE linear router is suboptimal in
assigning task-specific features to multiple experts which limits its further
scalability. In this work, we propose an efficient MoE architecture with weight
sharing across the experts. Inspired by the idea of linear feature modulation
(FM), our architecture implicitly instantiates multiple experts via learnable
activation modulations on a single shared expert block. The proposed Feature
Modulated Expert (FME) serves as a building block for the novel
Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up
the number of experts with low overhead. We further propose an
Uncertainty-aware Router (UaR) to assign task-specific features to different FM
modules with well-calibrated weights. This enables MoFME to effectively learn
diverse expert functions for multiple tasks. The conducted experiments on the
multi-deweather task show that our MoFME outperforms the baselines in the image
restoration quality by 0.1-0.2 dB and achieves SOTA-compatible performance
while saving more than 72% of parameters and 39% of inference time over the
conventional MoE counterpart. Experiments on the downstream segmentation and
classification tasks further demonstrate the generalizability of MoFME to real
open-world applications.
Authors' comments: AAAI 2024
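The idea of instantiating several experts through modulations of one shared block can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the MoFME architecture itself: the modulation is taken to be a per-expert scale `gamma` and shift `beta` on the hidden activation, the router is a plain softmax over a linear map (the uncertainty-aware router is not modelled), and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 16, 4

# One shared expert block (a two-layer FFN); its weights are reused by all experts.
W1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, d_model))

# Per-expert modulation parameters (assumed form): scale and shift on the hidden
# activation. Initialised to the identity modulation.
gamma = np.ones((n_experts, d_hidden))
beta = np.zeros((n_experts, d_hidden))

# Plain linear router, used here only as a stand-in.
W_router = rng.normal(scale=0.1, size=(d_model, n_experts))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def shared_expert(x):
    """The unmodulated shared FFN."""
    return np.maximum(x @ W1, 0.0) @ W2

def modulated_moe(x):
    """x: (tokens, d_model) -> (tokens, d_model)."""
    h = np.maximum(x @ W1, 0.0)                        # hidden activation, computed once
    weights = softmax(x @ W_router)                    # (tokens, n_experts)
    # Expert k sees gamma_k * h + beta_k; broadcasting forms all experts at once.
    h_mod = gamma[None] * h[:, None, :] + beta[None]   # (tokens, n_experts, d_hidden)
    out = h_mod @ W2                                   # (tokens, n_experts, d_model)
    return (weights[..., None] * out).sum(axis=1)      # router-weighted mixture

x = rng.normal(size=(5, d_model))
y = modulated_moe(x)

# Extra parameters per expert: 2 * d_hidden, versus 2 * d_model * d_hidden
# for a separate FFN expert of the same shape.
print(y.shape, 2 * d_hidden, 2 * d_model * d_hidden)
```

With the identity modulation every expert coincides with the shared block, so the mixture reduces to `shared_expert(x)`; the experts only diverge once `gamma` and `beta` are trained.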
Anzhe Cheng, Zhenkun Wang, Chenzhong Yin, Mingxi Cheng, Heng Ping, Xiongye Xiao, Shahin Nazarian, Paul Bogdan
Backpropagation (BP) has been a successful optimization technique for deep
learning models. However, its limitations, such as backward- and
update-locking, and its biological implausibility, hinder the concurrent
updating of layers and do not mimic the local learning processes observed in
the human brain. To address these issues, recent research has suggested using
local error signals to asynchronously train network blocks. However, this
approach often involves extensive trial-and-error iterations to determine the
best configuration for local training. This includes decisions on how to
decouple network blocks and which auxiliary networks to use for each block. In
our work, we introduce a novel BP-free approach: a block-wise BP-free (BWBPF)
neural network that leverages local error signals to optimize distinct
sub-neural networks separately, where the global loss is only responsible for
updating the output layer. The local error signals used in the BP-free model
can be computed in parallel, enabling a potential speed-up in the weight update
process through parallel implementation. Our experimental results consistently
show that this approach can identify transferable decoupled architectures for
VGG and ResNet variations, outperforming models trained with end-to-end
backpropagation and other state-of-the-art block-wise learning techniques on
datasets such as CIFAR-10 and Tiny-ImageNet. The code is released at
https://github.com/Belis0811/BWBPF.
Authors' comments: The paper has been accepted by ICASSP 2024
Lihao Zhang, Haijian Sun, Jin Sun, Rose Qingyang Hu
The accurate modeling of indoor radio propagation is crucial for
localization, monitoring, and device coordination, yet remains a formidable
challenge, due to the complex nature of indoor environments where radio can
propagate along hundreds of paths. These paths result from the room
layout, furniture, appliances and even small objects like a glass cup. They are
also influenced by the object material and surface roughness. Advanced machine
learning (ML) techniques have the potential to take such non-linear and
hard-to-model factors into consideration. However, extensive and fine-grained
datasets are urgently required. This paper presents WiSegRT, an open-source
dataset for indoor radio propagation modeling. Generated by a differentiable
ray tracer within the segmented 3-dimensional (3D) indoor environments, WiSegRT
provides site-specific channel impulse responses for each grid point relative
to the given transmitter location. We expect WiSegRT to support a wide range of
applications, such as ML-based channel prediction, accurate indoor
localization, radio-based object detection, wireless digital twin, and more.
Authors' comments: accepted by IEEE ICNC 2024
Junsu Kim, Sumin Hong, Chanwoo Kim, Jihyeon Kim, Yihalem Yimolal Tiruneh, Jeongwan On, Jihyun Song, Sunhwa Choi et al.
Class incremental learning aims to solve a problem that arises when
continuously adding unseen class instances to an existing model. This approach
has been extensively studied in the context of image classification; however,
its applicability to object detection is not well established yet. Existing
frameworks using replay methods mainly collect replay data without considering
the model being trained and tend to rely on randomness or the number of labels
of each sample. Also, despite the effectiveness of replay, it has not yet been
optimized for the object detection task. In this paper, we introduce an
effective buffer training strategy (eBTS) that creates the optimized replay
buffer on object detection. Our approach incorporates guarantee minimum and
hierarchical sampling to establish the buffer customized to the trained model.
Furthermore, we use the circular experience replay training to optimally
utilize the accumulated buffer data. Experiments on the MS COCO dataset
demonstrate that our eBTS achieves state-of-the-art performance compared to the
existing replay schemes.
Authors' comments: 5 pages, 3 figures, Accepted at ICASSP 2024
Yang Zhang, Huilin Pan, Mingying Li, An Wang, Yang Zhou, Hongliang Ren
Despite the successful application of convolutional neural networks (CNNs) in
object detection tasks, their efficiency in detecting faults from freight train
images remains inadequate for implementation in real-world engineering
scenarios. The spatial invariance and pooling layers of conventional CNNs
often neglect crucial global information, resulting in localization errors
in fault detection tasks for freight trains. To solve these problems, we
design a spatial-wise dynamic
distillation framework based on multi-layer perceptron (MLP) for visual fault
detection of freight trains. We initially present the axial shift strategy,
which allows the MLP-like architecture to overcome the challenge of spatial
invariance and effectively incorporate both local and global cues. We propose a
dynamic distillation method without a pre-trained teacher, including a dynamic
teacher mechanism that can effectively eliminate the semantic discrepancy with
the student model. Such an approach mines more abundant details from
lower-level feature appearances and higher-level label semantics as the extra
supervision signal, which utilizes efficient instance embedding to model the
global spatial and semantic information. In addition, the proposed dynamic
teacher can jointly train with students to further enhance the distillation
efficiency. Extensive experiments executed on six typical fault datasets reveal
that our approach outperforms the current state-of-the-art detectors and
achieves the highest accuracy with real-time detection at a lower computational
cost. The source code will be available at
https://github.com/MVME-HBUT/SDD-FTI-FDet.
Authors' comments: 10 pages, 6 figures
Risab Biswas, Swalpa Kumar Roy, Umapada Pal
Document image enhancement is a fundamental and important stage for attaining
the best performance in any document analysis assignment because there are many
degradation situations that could harm document images, making it more
difficult to recognize and analyze them. In this paper, we propose
T2T-BinFormer, a novel document binarization encoder-decoder
architecture based on a Tokens-to-token vision transformer. Each image is
divided into a set of tokens with a defined length using the ViT model, which
is then applied several times to model the global relationship between the
tokens. However, the conventional tokenization of input data does not
adequately reflect the crucial local structure between adjacent pixels of the
input image, which results in low efficiency. Instead of using a simple ViT and
hard splitting of images for the document image enhancement task, we employed a
progressive tokenization technique to capture this local information from an
image to achieve more effective results. Experiments on various DIBCO and
H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing
CNN and ViT-based state-of-the-art methods. In this research, the primary area
of examination is the application of the proposed architecture to the task of
document binarization. The source code will be made available at
https://github.com/RisabBiswas/T2T-BinFormer.
Authors' comments: arXiv admin note: text overlap with arXiv:2312.03568
Florian Kofler, Hendrik Möller, Josef A. Buchner, Ezequiel de la Rosa, Ivan Ezhov, Marcel Rosier, Isra Mekki, Suprosanna Shit et al.
This paper introduces panoptica, a versatile and performance-optimized
package designed for computing instance-wise segmentation quality metrics from
2D and 3D segmentation maps. panoptica addresses the limitations of existing
metrics and provides a modular framework that complements the original
intersection over union-based panoptic quality with other metrics, such as the
distance metric Average Symmetric Surface Distance. The package is open-source,
implemented in Python, and accompanied by comprehensive documentation and
tutorials. panoptica employs a three-step metrics computation process to cover
diverse use cases. The efficacy of panoptica is demonstrated on various
real-world biomedical datasets, where an instance-wise evaluation is
instrumental for an accurate representation of the underlying clinical task.
Overall, we envision panoptica as a valuable tool facilitating in-depth
evaluation of segmentation methods.
Authors' comments: 15 pages, 6 figures, 3 tables
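For readers unfamiliar with the intersection over union-based panoptic quality mentioned above, the following is a minimal sketch of the standard definition, PQ = (sum of IoUs over matched pairs) / (TP + 0.5 FP + 0.5 FN), on integer instance maps. It does not use or imitate panoptica's API; the function name `panoptic_quality`, the greedy matching loop, and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def panoptic_quality(pred, ref, iou_threshold=0.5):
    """Panoptic quality for two integer instance maps (0 = background).

    A reference and a predicted instance are matched when their IoU exceeds
    the threshold. With a threshold of at least 0.5 each instance can take
    part in at most one match, so the greedy loop below is sufficient.
    """
    pred_ids = [i for i in np.unique(pred) if i != 0]
    ref_ids = [i for i in np.unique(ref) if i != 0]
    matched_ious = []
    matched_pred = set()
    for r in ref_ids:
        ref_mask = ref == r
        for p in pred_ids:
            if p in matched_pred:
                continue
            pred_mask = pred == p
            inter = np.logical_and(ref_mask, pred_mask).sum()
            union = np.logical_or(ref_mask, pred_mask).sum()
            iou = inter / union
            if iou > iou_threshold:
                matched_ious.append(iou)
                matched_pred.add(p)
                break
    tp = len(matched_ious)
    fp = len(pred_ids) - tp   # predicted instances without a match
    fn = len(ref_ids) - tp    # reference instances without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom > 0 else float("nan")

# Toy 1 x 6 maps: reference has instances 1 and 2, prediction has 1 and 3.
ref = np.array([[1, 1, 0, 2, 2, 2]])
pred = np.array([[1, 1, 1, 0, 0, 3]])
pq = panoptic_quality(pred, ref)
print(pq)
```

In the toy example, reference instance 1 matches predicted instance 1 with IoU 2/3, while reference instance 2 and predicted instance 3 overlap with IoU 1/3 and stay unmatched, giving TP = FP = FN = 1 and PQ = (2/3) / 2 = 1/3.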
Tianhao Peng, Ge Gao, Heming Sun, Fan Zhang, David Bull
In recent years, end-to-end learnt video codecs have demonstrated their potential to compete with conventional coding algorithms in terms of compression efficiency. However, most learning-based video compression models are associated with high computational complexity and latency, in particular at the decoder side, which limits their deployment in practical applications. In this paper, we present a novel model-agnostic pruning scheme based on gradient decay and adaptive layer-wise distillation. Gradient decay enhances parameter exploration during sparsification whilst preventing runaway sparsity and is superior to the standard Straight-Through Estimation. The adaptive layer-wise distillation regulates the sparse training in various stages based on the distortion of intermediate features. This stage-wise design efficiently updates parameters with minimal computational overhead. The proposed approach has been applied to three popular end-to-end learnt video codecs, FVC, DCVC, and DCVC-HEM. Results confirm that our method yields up to 65% reduction in MACs and 2x speed-up with less than 0.3 dB drop in BD-PSNR. Supporting code and supplementary material can be downloaded from: https://jasminepp.github.io/lightweightdvc/
Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, Guannan Zhang
Language model alignment is a cutting-edge technique in large language model training that aligns the model output with the user's intent, e.g., being helpful and harmless. Recent alignment frameworks consist of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios where human feedback is intrinsically point-wise, these methods suffer from information loss or even fail. To fill this gap, we first develop a preference learning method called point-wise DPO to tackle point-wise preference data. Further analysis of the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference datasets. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. A new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.
Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, Yebin Liu
We present a new approach, termed GPS-Gaussian, for synthesizing novel views
of a character in a real-time manner. The proposed method enables 2K-resolution
rendering under a sparse-view camera setting. Unlike the original Gaussian
Splatting or neural implicit rendering methods that necessitate per-subject
optimizations, we introduce Gaussian parameter maps defined on the source views
and directly regress Gaussian Splatting properties for instant novel view
synthesis without any fine-tuning or optimization. To this end, we train our
Gaussian parameter regression module on a large amount of human scan data,
jointly with a depth estimation module to lift 2D parameter maps to 3D space.
The proposed framework is fully differentiable and experiments on several
datasets demonstrate that our method outperforms state-of-the-art methods while
achieving a higher rendering speed.
Authors' comments: Accepted by CVPR 2024. Project page:
https://shunyuanzheng.github.io/GPS-Gaussian
Shuchi Wu, Chuan Ma, Kang Wei, Xiaogang Xu, Ming Ding, Yuwen Qian, Tao Xiang
This paper introduces RDA, a pioneering approach designed to address two
primary deficiencies prevalent in previous endeavors aiming at stealing
pre-trained encoders: (1) suboptimal performances attributed to biased
optimization objectives, and (2) elevated query costs stemming from the
end-to-end paradigm that necessitates querying the target encoder every epoch.
Specifically, we initially Refine the representations of the target encoder for
each training sample, thereby establishing a less biased optimization objective
before the steal-training phase. This is accomplished via a sample-wise
prototype, which consolidates the target encoder's representations for a given
sample's various perspectives. Demanding exponentially fewer queries compared
to the end-to-end approach, prototypes can be instantiated to guide subsequent
query-free training. For more potent efficacy, we develop a multi-relational
extraction loss that trains the surrogate encoder to Discriminate mismatched
embedding-prototype pairs while Aligning those matched ones in terms of both
amplitude and angle. In this way, the trained surrogate encoder achieves
state-of-the-art results across the board in various downstream datasets with
limited queries. Moreover, RDA is shown to be robust to multiple widely-used
defenses.
Authors' comments: 25 pages, 12 figures, 15 tables
Yan Guo, C. Sengupta, T. C. Scott, P. Lagos, Y. Luo
We present resolved GMRT HI observations of the high gas-phase metallicity
dwarf galaxy WISEA J230615.06+143927.9 (z = 0.005) (hereafter J2306) and
investigate whether it could be a Tidal Dwarf Galaxy (TDG) candidate. TDGs are
observed to have higher metallicities than normal dwarfs. J2306 has an unusual
combination of a blue g -- r colour of 0.23 mag, irregular optical morphology
and high-metallicity (12 + log(O/H) = 8.68$\pm$0.14), making it an interesting
galaxy to study in more detail. We find J2306 to be an HI rich galaxy with a
large extended, unperturbed rotating HI disk. Using our HI data we estimated
its dynamical mass and found the galaxy to be dark matter (DM) dominated within
its HI radius. The quantity of DM, inferred from its dynamical mass, appears to
rule out J2306 as an evolved TDG. A wide area environment search reveals J2306
to be isolated from any larger galaxies which could have been the source of its
high gas metallicity. Additionally, the HI morphology and kinematics of the
galaxy show no indication of a recent merger to explain the high-metallicity.
Further detailed optical spectroscopic observations of J2306 might provide an
answer to how a seemingly ordinary irregular dwarf galaxy achieved such a high
level of metal enrichment.
Authors' comments: 11 pages, 5 figures, Accepted in RAA
Jixuan Leng, Yijiang Li, Haohan Wang
Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically the CLIP model, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module. This module seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.
Chenyang Gao, Yue Gu, Ivan Marsic
In supervised speech separation, permutation invariant training (PIT) is
widely used to handle label ambiguity by selecting the best permutation to
update the model. Despite its success, previous studies showed that PIT is
plagued by excessive label assignment switching in adjacent epochs, preventing
the model from learning better label assignments. To address this issue, we propose
a novel training strategy, dynamic sample dropout (DSD), which considers
previous best label assignments and evaluation metrics to exclude the samples
that may negatively impact the learned label assignments during training.
Additionally, we include layer-wise optimization (LO) to improve the
performance by solving layer-decoupling. Our experiments showed that combining
DSD and LO outperforms the baseline and solves excessive label assignment
switching and layer-decoupling issues. The proposed DSD and LO approach is easy
to implement, requires no extra training sets or steps, and shows generality to
various speech separation tasks.
Authors' comments: Accepted by INTERSPEECH 2023
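The baseline technique, permutation invariant training, can be sketched in a few lines of NumPy. The sketch covers only the standard PIT selection step with a mean-squared-error criterion; it does not implement DSD or LO, and the function name `pit_mse` and the toy signals are illustrative assumptions.

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Return (loss, permutation) minimising the MSE over speaker orderings.

    estimates, targets: arrays of shape (n_speakers, n_samples).
    permutation[i] is the index of the estimate assigned to target i.
    """
    n_speakers = targets.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_speakers)):
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
targets = rng.normal(size=(2, 100))

# A separator that recovers both sources but emits them in swapped order.
estimates = targets[::-1] + 0.01 * rng.normal(size=(2, 100))

loss, perm = pit_mse(estimates, targets)
print(loss, perm)
```

The returned permutation is the label assignment for that sample. Recording it per sample at every epoch is one way to observe the assignment switching described above: a sample whose best permutation keeps changing between adjacent epochs has an unstable assignment.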
Yueyuan Li, Wei Yuan, Songan Zhang, Weihao Yan, Qiyuan Shen, Chunxiang Wang, Ming Yang
Simulators play a crucial role in autonomous driving, offering significant
time, cost, and labor savings. Over the past few years, the number of
simulators for autonomous driving has grown substantially. However, there is a
growing concern about the validity of algorithms developed and evaluated in
simulators, indicating a need for a thorough analysis of the development status
of the simulators.
To bridge the gap in research, this paper analyzes the evolution of
simulators and explains how the functionalities and utilities have developed.
Then, the existing simulators are categorized based on their task
applicability, providing researchers with a taxonomy to swiftly assess a
simulator's suitability for specific tasks. Recommendations for select
simulators are presented, considering factors such as accessibility,
maintenance status, and quality. Recognizing potential hazards in simulators
that could impact the confidence of simulation experiments, the paper dedicates
substantial effort to identifying and justifying critical issues in actively
maintained open-source simulators. Moreover, the paper reviews potential
solutions to address these issues, serving as a guide for enhancing the
credibility of simulators.
Authors' comments: 18 pages, 5 figures, 8 tables
László Antal, Hana Masara, Erika Ábrahám
In this paper, we extend an available neural network verification technique
to support a wider class of piece-wise linear activation functions.
Furthermore, we extend the algorithms, which in their original form provide
exact or over-approximative results for bounded input sets
represented as star sets, to also allow unbounded input sets. We implemented
our algorithms and demonstrated their effectiveness in some case studies.
Authors' comments: In Proceedings FMAS 2023, arXiv:2311.08987
Rita Kuznetsova, Alizée Pace, Manuel Burger, Hugo Yèche, Gunnar Rätsch
Recent advances in deep learning architectures for sequence modeling have not
fully transferred to tasks handling time-series from electronic health records.
In particular, in problems related to the Intensive Care Unit (ICU), the
state-of-the-art remains to tackle sequence classification in a tabular manner
with tree-based methods. Recent findings in deep learning for tabular data are
now surpassing these classical methods by better handling the severe
heterogeneity of data input features. Given the similar level of feature
heterogeneity exhibited by ICU time-series and motivated by these findings, we
explore these novel methods' impact on clinical sequence modeling tasks. By
jointly using such advances in deep learning for tabular data, our primary
objective is to underscore the importance of step-wise embeddings in
time-series modeling, which remain unexplored in machine learning methods for
clinical data. On a variety of clinically relevant tasks from two large-scale
ICU datasets, MIMIC-III and HiRID, our work provides an exhaustive analysis of
state-of-the-art methods for tabular time-series as time-step embedding models,
showing overall performance improvement. In particular, we evidence the
importance of feature grouping in clinical time-series, with significant
performance gains when considering features within predefined semantic groups
in the step-wise embedding module.
Authors' comments: Machine Learning for Health (ML4H) 2023 in Proceedings of Machine
Learning Research 225
Silpa Babu, Namrata Vaswani
This paper studies the following low rank + sparse (LR+S) column-wise
compressive sensing problem. We aim to recover an $n \times q$ matrix, $\X^* =[
\x_1^*, \x_2^*, \cdots , \x_q^*]$ from $m$ independent linear projections of
each of its $q$ columns, given by $\y_k :=\A_k\x_k^*$, $k \in [q]$. Here,
$\y_k$ is an $m$-length vector with $m < n$. We assume that the matrix $\X^*$
can be decomposed as $\X^*=\L^*+\S^*$, where $\L^*$ is a low rank matrix of
rank $r \ll \min(n,q)$ and $\S^*$ is a sparse matrix. Each column of $\S^*$
contains $\rho$ non-zero entries. The matrices $\A_k$ are known and mutually
independent for different $k$. To address this recovery problem, we propose a
novel fast GD-based solution called AltGDmin-LR+S, which is memory and
communication efficient. We numerically evaluate its performance by conducting
a detailed simulation-based study.
Authors' comments: 6 pages, 2 figures, conference
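The measurement model described above can be simulated with a short NumPy sketch. It only generates data according to $\y_k = \A_k \x_k^*$ with $\X^* = \L^* + \S^*$; it does not implement AltGDmin-LR+S. The problem sizes, the Gaussian distribution of the $\A_k$, and the variable names are assumptions chosen for illustration, since the abstract only states that the $\A_k$ are known and mutually independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, m, r, rho = 30, 20, 15, 2, 3   # m < n and r much smaller than min(n, q)

# Low-rank component L* = U B with U (n x r) and B (r x q).
L_star = rng.normal(size=(n, r)) @ rng.normal(size=(r, q))

# Sparse component S*: exactly rho non-zero entries in every column.
S_star = np.zeros((n, q))
for k in range(q):
    support = rng.choice(n, size=rho, replace=False)
    S_star[support, k] = rng.normal(size=rho)

X_star = L_star + S_star

# Known, mutually independent measurement matrices (Gaussian is an assumption).
A = rng.normal(size=(q, m, n))       # A[k] is the m x n matrix for column k

# Column-wise projections y_k = A_k x_k*, stacked as the columns of Y (m x q).
Y = np.einsum("kmn,nk->mk", A, X_star)

print(Y.shape, np.linalg.matrix_rank(L_star))
```

Each column is observed through its own matrix and with fewer measurements than unknowns ($m < n$), so recovery has to exploit the low-rank structure shared across columns together with the sparsity of $\S^*$.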