Chenyang Gao, Yue Gu, Ivan Marsic
In supervised speech separation, permutation invariant training (PIT) is
widely used to handle label ambiguity by selecting the best permutation to
update the model. Despite its success, previous studies showed that PIT is
plagued by excessive label assignment switching in adjacent epochs, which prevents
the model from learning better label assignments. To address this issue, we propose
a novel training strategy, dynamic sample dropout (DSD), which considers
previous best label assignments and evaluation metrics to exclude the samples
that may negatively impact the learned label assignments during training.
Additionally, we include layer-wise optimization (LO) to improve the
performance by solving layer-decoupling. Our experiments showed that combining
DSD and LO outperforms the baseline and solves excessive label assignment
switching and layer-decoupling issues. The proposed DSD and LO approach is easy
to implement, requires no extra training sets or steps, and shows generality to
various speech separation tasks.
Authors' comments: Accepted by INTERSPEECH 2023
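As a rough illustration of the permutation selection that PIT performs, the sketch below scores every assignment of estimated sources to reference sources and keeps the cheapest one. The names `mse` and `pit_loss` and the toy signals are assumptions made for this example; the DSD criterion is not specified in the abstract and is not implemented here.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, targets):
    """Return (best_loss, best_perm) over all estimate-to-target assignments.

    best_perm[i] is the index of the target paired with estimates[i].
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model emitted the two sources in swapped order.
targets = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
estimates = [[-0.9, -1.1, -1.0], [1.0, 0.9, 1.1]]
loss, perm = pit_loss(estimates, targets)
print(perm, loss)
```

Storing `perm` for each sample at every epoch is one simple way to observe the label assignment switching the abstract refers to: a sample whose best permutation changes between adjacent epochs has switched.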
Yueyuan Li, Wei Yuan, Songan Zhang, Weihao Yan, Qiyuan Shen, Chunxiang Wang, Ming Yang
Simulators play a crucial role in autonomous driving, offering significant
time, cost, and labor savings. Over the past few years, the number of
simulators for autonomous driving has grown substantially. However, there is a
growing concern about the validity of algorithms developed and evaluated in
simulators, indicating a need for a thorough analysis of the development status
of the simulators.
To bridge the gap in research, this paper analyzes the evolution of
simulators and explains how the functionalities and utilities have developed.
Then, the existing simulators are categorized based on their task
applicability, providing researchers with a taxonomy to swiftly assess a
simulator's suitability for specific tasks. Recommendations for select
simulators are presented, considering factors such as accessibility,
maintenance status, and quality. Recognizing potential hazards in simulators
that could impact the confidence of simulation experiments, the paper dedicates
substantial effort to identifying and justifying critical issues in actively
maintained open-source simulators. Moreover, the paper reviews potential
solutions to address these issues, serving as a guide for enhancing the
credibility of simulators.
Authors' comments: 18 pages, 5 figures, 8 tables
László Antal, Hana Masara, Erika Ábrahám
In this paper, we extend an available neural network verification technique
to support a wider class of piece-wise linear activation functions.
Furthermore, we extend the algorithms, which in their original form provide
exact and over-approximative results, respectively, for bounded input sets
represented as star sets, to also allow unbounded input sets. We implemented
our algorithms and demonstrated their effectiveness in some case studies.
Authors' comments: In Proceedings FMAS 2023, arXiv:2311.08987
Rita Kuznetsova, Alizée Pace, Manuel Burger, Hugo Yèche, Gunnar Rätsch
Recent advances in deep learning architectures for sequence modeling have not
fully transferred to tasks handling time-series from electronic health records.
In particular, in problems related to the Intensive Care Unit (ICU), the
state of the art is still to tackle sequence classification in a tabular manner
with tree-based methods. Recent findings in deep learning for tabular data are
now surpassing these classical methods by better handling the severe
heterogeneity of data input features. Given the similar level of feature
heterogeneity exhibited by ICU time-series and motivated by these findings, we
explore these novel methods' impact on clinical sequence modeling tasks. By
jointly using such advances in deep learning for tabular data, our primary
objective is to underscore the importance of step-wise embeddings in
time-series modeling, which remain unexplored in machine learning methods for
clinical data. On a variety of clinically relevant tasks from two large-scale
ICU datasets, MIMIC-III and HiRID, our work provides an exhaustive analysis of
state-of-the-art methods for tabular time-series as time-step embedding models,
showing overall performance improvement. In particular, we evidence the
importance of feature grouping in clinical time-series, with significant
performance gains when considering features within predefined semantic groups
in the step-wise embedding module.
Authors' comments: Machine Learning for Health (ML4H) 2023 in Proceedings of Machine
Learning Research 225
Silpa Babu, Namrata Vaswani
This paper studies the following low rank + sparse (LR+S) column-wise
compressive sensing problem. We aim to recover an $n \times q$ matrix, $\X^* =[
\x_1^*, \x_2^*, \cdots , \x_q^*]$ from $m$ independent linear projections of
each of its $q$ columns, given by $\y_k :=\A_k\x_k^*$, $k \in [q]$. Here,
$\y_k$ is an $m$-length vector with $m < n$. We assume that the matrix $\X^*$
can be decomposed as $\X^*=\L^*+\S^*$, where $\L^*$ is a low rank matrix of
rank $r \ll \min(n,q)$ and $\S^*$ is a sparse matrix. Each column of $\S^*$
contains $\rho$ non-zero entries. The matrices $\A_k$ are known and mutually
independent for different $k$. To address this recovery problem, we propose a
novel fast GD-based solution called AltGDmin-LR+S, which is memory and
communication efficient. We numerically evaluate its performance by conducting
a detailed simulation-based study.
Authors' comments: 6 pages, 2 figures, conference
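The measurement model above can be made concrete with a small synthetic generator. This is only a sketch of the problem setup, not of AltGDmin-LR+S: the Gaussian entries, the factorization of the low-rank part as `U @ B`, and the name `make_lrs_problem` are all assumptions made for illustration.

```python
import random


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def make_lrs_problem(n, q, m, r, rho, seed=0):
    """Generate columns x_k = l_k + s_k and measurements y_k = A_k x_k.

    Assumptions: Gaussian A_k, low-rank part L = U B with Gaussian factors,
    and exactly rho non-zero Gaussian entries per sparse column.
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]  # n x r
    B = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(r)]  # r x q
    X_cols, A_list, y_list = [], [], []
    for k in range(q):
        low_rank = [sum(U[i][j] * B[j][k] for j in range(r)) for i in range(n)]
        sparse = [0.0] * n
        for i in rng.sample(range(n), rho):
            sparse[i] = rng.gauss(0.0, 1.0)
        x_k = [l + s for l, s in zip(low_rank, sparse)]
        A_k = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
        X_cols.append(x_k)
        A_list.append(A_k)
        y_list.append(matvec(A_k, x_k))  # m-length measurement, m < n
    return X_cols, A_list, y_list


X_cols, A_list, y_list = make_lrs_problem(n=8, q=5, m=4, r=2, rho=2)
print(len(y_list), len(y_list[0]))
```

Each column is observed through its own matrix `A_k`, so only `m` numbers per column are available to the recovery algorithm.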
Yuto Watanabe, Kazunori Sakurama
This study explores distributed optimization problems with clique-wise
coupling via operator splitting and how we can utilize this framework for
performance analysis and enhancement. This framework extends beyond
conventional pairwise coupled problems (e.g., consensus optimization) and is
applicable to broader examples. To this end, we first introduce a new
distributed optimization algorithm by leveraging a clique-based matrix and the
Davis-Yin splitting (DYS), a versatile three-operator splitting method. We then
demonstrate that this approach sheds new light on conventional algorithms in
the following way: (i) Existing algorithms (NIDS, Exact diffusion, diffusion,
and our previous work) can be derived from our proposed method; (ii) We present
a new mixing matrix based on clique-wise coupling, which surfaces when deriving
the NIDS. We prove its preferable distribution of eigenvalues, enabling fast
consensus; (iii) These observations yield a new linear convergence rate for the
NIDS with non-smooth objective functions. Remarkably, our linear rate is first
established for the general DYS with a projection for a subspace. This case is
not covered by any prior results, to our knowledge. Finally, numerical examples
showcase the efficacy of our proposed approach.
Authors' comments: 32 pages
Ali Javidani, Mohammad Amin Sadeghi, Babak Nadjar Araabi
Self-supervised visual representation learning traditionally focuses on
image-level instance discrimination. Our study introduces an innovative,
fine-grained dimension by integrating patch-level discrimination into these
methodologies. This integration allows for the simultaneous analysis of local
and global visual features, thereby enriching the quality of the learned
representations. Initially, the original images undergo spatial augmentation.
Subsequently, we employ a distinctive photometric patch-level augmentation,
where each patch is individually augmented, independent from other patches
within the same view. This approach generates a diverse training dataset with
distinct color variations in each segment. The augmented images are then
processed through a self-distillation learning framework, utilizing the Vision
Transformer (ViT) as its backbone. The proposed method minimizes the
representation distances across both image and patch levels to capture details
from macro to micro perspectives. To this end, we present a simple yet
effective patch-matching algorithm to find the corresponding patches across the
augmented views. Thanks to the efficient structure of the patch-matching
algorithm, our method reduces computational complexity compared to similar
approaches. Consequently, we achieve an advanced understanding of the model
without adding significant computational requirements. We have extensively
pretrained our method on datasets of varied scales, such as Cifar10,
ImageNet-100, and ImageNet-1K. It demonstrates superior performance over
state-of-the-art self-supervised representation learning methods in image
classification and downstream tasks, such as copy detection and image
retrieval. The implementation of our method is accessible on GitHub.
Authors' comments: 15 pages
Rongrong Lin, Shimin Li, Yulan Liu
Computing the proximal operator of the sparsity-promoting piece-wise exponential (PiE) penalty $1-e^{-|x|/\sigma}$ with a given shape parameter $\sigma>0$, a popular nonconvex surrogate of the $\ell_0$-norm, is fundamental in feature selection via support vector machines, image reconstruction, zero-one programming problems, compressed sensing, etc. Because PiE is nonconvex, its proximal operator has long been evaluated via an iteratively reweighted $\ell_1$ algorithm, which substitutes PiE with its first-order approximation; however, the solutions obtained in this way are only critical points. Based on the exact characterization of the proximal operator of PiE, we explore how the iteratively reweighted $\ell_1$ solution deviates from the true proximal operator in certain regions, which can be explicitly identified in terms of $\sigma$, the initial value, and the regularization parameter in the definition of the proximal operator. Moreover, the initial value can be chosen adaptively and simply to ensure that the iteratively reweighted $\ell_1$ solution belongs to the proximal operator of PiE.
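The gap between the two approaches can be seen numerically in one dimension. The sketch below compares a brute-force grid minimizer of $\frac{1}{2}(x-y)^2 + \lambda(1-e^{-|x|/\sigma})$ with the iteratively reweighted $\ell_1$ iteration, in which each step is a soft-thresholding with weight $e^{-|x_k|/\sigma}/\sigma$. The grid search stands in for the paper's exact characterization, which the abstract does not give; the function names and the values of `y`, `lam`, and `sigma` are assumptions chosen for illustration.

```python
import math


def objective(x, y, lam, sigma):
    """Proximal objective 0.5*(x - y)^2 + lam * PiE(x)."""
    return 0.5 * (x - y) ** 2 + lam * (1.0 - math.exp(-abs(x) / sigma))


def prox_grid(y, lam, sigma, steps=20001):
    """Brute-force minimizer on a grid between 0 and y (one element of the prox)."""
    lo, hi = min(0.0, y), max(0.0, y)
    best_x, best_val = 0.0, objective(0.0, y, lam, sigma)
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        val = objective(x, y, lam, sigma)
        if val < best_val:
            best_x, best_val = x, val
    return best_x


def soft_threshold(y, t):
    """Proximal operator of t * |x| evaluated at y."""
    return math.copysign(max(abs(y) - t, 0.0), y)


def irl1(y, lam, sigma, x0, iters=500):
    """Iteratively reweighted l1: linearize PiE at x_k, then soft-threshold."""
    x = x0
    for _ in range(iters):
        weight = math.exp(-abs(x) / sigma) / sigma  # derivative of PiE at |x_k|
        x = soft_threshold(y, lam * weight)
    return x


y, lam, sigma = 1.2, 1.0, 0.5
x_exact = prox_grid(y, lam, sigma)
x_from_y = irl1(y, lam, sigma, x0=y)
x_from_zero = irl1(y, lam, sigma, x0=0.0)
print(x_exact, x_from_y, x_from_zero)
```

With these values, the iteration started at `y` settles at a critical point with a higher objective value than the grid minimizer, while the iteration started at zero agrees with it. This illustrates why the choice of initial value matters.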
Caixin Wang, Jie Zhang, Matthew A. Wilson, Ralph Etienne-Cummings
Accurately capturing dynamic scenes with wide-ranging motion and light
intensity is crucial for many vision applications. However, acquiring
high-speed high dynamic range (HDR) video is challenging because the camera's
frame rate restricts its dynamic range. Existing methods sacrifice speed to
acquire multi-exposure frames. Yet, misaligned motion in these frames can still
pose complications for HDR fusion algorithms, resulting in artifacts. Instead
of frame-based exposures, we sample the videos using individual pixels at
varying exposures and phase offsets. Implemented on a pixel-wise programmable
image sensor, our sampling pattern simultaneously captures fast motion at a
high dynamic range. We then transform pixel-wise outputs into an HDR video
using end-to-end learned weights from deep neural networks, achieving high
spatiotemporal resolution with minimized motion blurring. We demonstrate
aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under
low-light conditions and against bright backgrounds - both challenging
conditions for conventional cameras. By combining the versatility of pixel-wise
sampling patterns with the strength of deep neural networks at decoding complex
scenes, our method greatly enhances the vision system's adaptability and
performance in dynamic conditions.
Authors' comments: 14 pages, 14 figures
Kushal Chawla, Ian Wu, Yu Rong, Gale M. Lucas, Jonathan Gratch
A natural way to design a negotiation dialogue system is via self-play RL:
train an agent that learns to maximize its performance by interacting with a
simulated user that has been designed to imitate human-human dialogue data.
Although this procedure has been adopted in prior work, we find that it results
in a fundamentally flawed system that fails to learn the value of compromise in
a negotiation, which can often lead to no agreements (i.e., the partner walking
away without a deal), ultimately hurting the model's overall performance. We
investigate this observation in the context of the DealOrNoDeal task, a
multi-issue negotiation over books, hats, and balls. Grounded in negotiation
theory from Economics, we modify the training procedure in two novel ways to
design agents with diverse personalities and analyze their performance with
human partners. We find that although both techniques show promise, a selfish
agent, which maximizes its own performance while also avoiding walkaways,
outperforms the other variants by implicitly learning to generate value
for both itself and the negotiation partner. We discuss the implications of our
findings for what it means to be a successful negotiation dialogue system and
how these systems should be designed in the future.
Authors' comments: Accepted at EMNLP 2023 (Main)
Shuaiyi Li, Yang Deng, Wai Lam
Spatial reasoning in text plays a crucial role in various real-world
applications. Existing approaches for spatial reasoning typically infer spatial
relations from pure text, which overlooks the gap between natural language and
symbolic structures. Graph neural networks (GNNs) have showcased exceptional
proficiency in inducing and aggregating symbolic structures. However, classical
GNNs face challenges in handling multi-hop spatial reasoning due to the
over-smoothing issue, i.e., the performance decreases substantially as the
number of graph layers increases. To cope with these challenges, we propose a
novel Depth-Wise Graph Neural Network (DepWiGNN). Specifically, we design a
novel node memory scheme and aggregate the information over the depth dimension
instead of the breadth dimension of the graph, which makes it possible to
collect long dependencies without stacking multiple layers. Experimental
results on two challenging multi-hop spatial reasoning datasets show that
DepWiGNN outperforms existing spatial reasoning methods. The comparisons with
the other three GNNs further demonstrate its superiority in capturing long
dependency in the graph.
Authors' comments: EMNLP 2023 Findings
Pascal Pernot
Binwise Variance Scaling (BVS) has recently been proposed as a post hoc
recalibration method for prediction uncertainties of machine learning
regression problems that is capable of more efficient corrections than uniform
variance (or temperature) scaling. The original version of BVS uses
uncertainty-based binning, which aims to improve calibration conditionally
on uncertainty, i.e. consistency. I explore here several adaptations of BVS, in
particular with alternative loss functions and a binning scheme based on an
input-feature (X) in order to improve adaptivity, i.e. calibration conditional
on X. The performances of BVS and its proposed variants are tested on a
benchmark dataset for the prediction of atomization energies and compared to
the results of isotonic regression.
Authors' comments: This version corrects an error in the estimation of the Sx scores for
the test set, affecting Fig. 2 and Tables I-III of the initial version. The
main points of the discussion and the conclusions are unchanged
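A minimal sketch of uncertainty-based binwise scaling follows. It assumes that each bin's scaling factor is the root-mean-square of the z-scores (error divided by uncertainty) in that bin, so that the rescaled z-scores have unit mean square per bin. That choice, the function name `bvs_scale`, and the toy data are assumptions; the alternative loss functions and the X-based binning explored in the paper are not reproduced.

```python
import math


def bvs_scale(errors, uncertainties, n_bins):
    """Return recalibrated uncertainties using one scaling factor per bin.

    Points are sorted by uncertainty and split into contiguous bins.
    Assumption: the factor for a bin is the RMS of the z-scores e/u in it.
    """
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    scaled = list(uncertainties)
    size = math.ceil(len(order) / n_bins)
    for start in range(0, len(order), size):
        bin_idx = order[start:start + size]
        mean_z2 = sum((errors[i] / uncertainties[i]) ** 2 for i in bin_idx) / len(bin_idx)
        factor = math.sqrt(mean_z2)
        for i in bin_idx:
            scaled[i] = factor * uncertainties[i]
    return scaled


# Toy calibration set: prediction errors and their predicted uncertainties.
errors = [0.1, -0.3, 0.2, 0.5, -1.5, 2.0, -0.8, 1.2]
uncertainties = [0.1, 0.1, 0.2, 0.2, 0.5, 0.5, 1.0, 1.0]
scaled = bvs_scale(errors, uncertainties, n_bins=2)
mean_z2_after = sum((e / u) ** 2 for e, u in zip(errors, scaled)) / len(errors)
print(mean_z2_after)
```

Binning on an input feature instead of the uncertainty would only change the sort key in `order`, which is the kind of adaptation the abstract describes for improving adaptivity.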
Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng et al.
Resistive random access memory (ReRAM)-based processing-in-memory (PIM)
architectures have demonstrated great potential to accelerate Deep Neural
Network (DNN) training/inference. However, the computational accuracy of analog
PIM is compromised due to the non-idealities, such as the conductance variation
of ReRAM cells. The impact of these non-idealities worsens as the number of
concurrently activated wordlines and bitlines increases. To guarantee
computational accuracy, only a limited number of wordlines and bitlines of the
crossbar array can be turned on concurrently, significantly reducing the
achievable parallelism of the architecture.
While the constraints on parallelism limit the efficiency of the
accelerators, they also provide a new opportunity for fine-grained
mixed-precision quantization. To enable efficient DNN inference on practical
ReRAM-based accelerators, we propose an algorithm-architecture co-design
framework called \underline{B}lock-\underline{W}ise mixed-precision
\underline{Q}uantization (BWQ). At the algorithm level, BWQ-A introduces a
mixed-precision quantization scheme at the block level, which achieves a high
weight and activation compression ratio with negligible accuracy degradation.
We also present the hardware architecture design BWQ-H, which leverages the
low-bit-width models achieved by BWQ-A to perform high-efficiency DNN inference
on ReRAM devices. BWQ-H also adopts a novel precision-aware weight mapping
method to increase the ReRAM crossbar's throughput. Our evaluation demonstrates
the effectiveness of BWQ, which achieves a 6.08x speedup and a 17.47x energy
saving on average compared to existing ReRAM-based architectures.
Authors' comments: 12 pages, 13 figures
Uri Stern, Daniel Shwartz, Daphna Weinshall
Deep neural networks have become the method of choice for solving many classification tasks, largely because they can fit very complex functions defined over raw data. The downside of such powerful learners is the danger of overfit. In this paper, we introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting by combining models generated at specific intermediate epochs during training. Our method allows for the incorporation of useful knowledge obtained by the models during the overfitting phase without deterioration of the general performance, which is usually missed when early stopping is used. To motivate this approach, we begin with the theoretical analysis of a regression model, whose prediction -- that the variance among classifiers increases when overfit occurs -- is demonstrated empirically in deep networks in common use. Guided by these results, we construct a new ensemble-based prediction method, where the prediction is determined by the class that attains the most consensual prediction throughout the training epochs. Using multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. It is thus a practical and useful tool to overcome overfit. Code is available at https://github.com/uristern123/United-We-Stand-Using-Epoch-wise-Agreement-of-Ensembles-to-Combat-Overfit.
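One simple reading of "the class that attains the most consensual prediction throughout the training epochs" is a vote over checkpoints. The sketch below implements that reading only; the paper's actual agreement score may differ, and the function name `epochwise_vote` and the toy predictions are assumptions.

```python
from collections import Counter


def epochwise_vote(predictions_per_epoch):
    """Pick, for each sample, the label predicted most often across checkpoints.

    predictions_per_epoch[t][i] is the label predicted for sample i at checkpoint t.
    """
    n_samples = len(predictions_per_epoch[0])
    voted = []
    for i in range(n_samples):
        counts = Counter(epoch_preds[i] for epoch_preds in predictions_per_epoch)
        voted.append(counts.most_common(1)[0][0])
    return voted


# Toy predictions for three samples at four checkpoints (rows are checkpoints).
preds = [
    [0, 1, 2],
    [0, 1, 1],
    [0, 1, 1],
    [1, 2, 1],  # last checkpoint, where two predictions have drifted
]
voted = epochwise_vote(preds)
print(voted)
```

In this toy case the vote keeps the labels that were stable during training for the first two samples, whereas the final checkpoint alone would predict differently.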
Nayoung Choi
Contextual word embeddings obtained from pre-trained language models (PLMs) have proven effective for various natural language processing tasks at the word level. However, interpreting the hidden aspects within embeddings, such as syntax and semantics, remains challenging. Disentangled representation learning has emerged as a promising approach, which separates specific aspects into distinct embeddings. Furthermore, different linguistic knowledge is believed to be stored in different layers of a PLM. This paper aims to disentangle semantic sense from BERT by applying a binary mask to intermediate outputs across the layers, without updating pre-trained parameters. The disentangled embeddings are evaluated through binary classification to determine if the target word in two different sentences has the same meaning. Experiments with cased BERT$_{\texttt{base}}$ show that leveraging layer-wise information is effective and that disentangling semantic sense further improves performance.
Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
Large Vision-Language Models (LVLMs) can understand the world comprehensively
by integrating rich information from different modalities, achieving remarkable
advancements on various multimodal downstream tasks. However, deploying LVLMs
is often problematic due to their massive computational/energy costs and carbon
consumption. Such issues make it infeasible to adopt conventional iterative
global pruning, which is costly due to computing the Hessian matrix of the
entire large model for sparsification. Alternatively, several studies have
recently proposed layer-wise pruning approaches to avoid the expensive
computation of global pruning and efficiently compress model weights according
to their importance within a layer. However, they often suffer from suboptimal
model compression due to their lack of a global perspective. To address this
limitation in recent efficient pruning methods for large models, we propose
Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage
coarse-to-fine weight pruning approach for LVLMs. We first determine the
sparsity ratios of different layers or blocks by leveraging the global
importance score, which is efficiently computed based on the zeroth-order
approximation of the global model gradients. Then, the model performs local
layer-wise unstructured weight pruning based on globally-informed sparsity
ratios. We validate our proposed method across various multimodal and unimodal
models and datasets, demonstrating significant performance improvements over
prevalent pruning techniques in the high-sparsity regime.
Authors' comments: ICLR 2024 (project page: https://ecoflap.github.io/)
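A zeroth-order gradient estimate needs only loss evaluations, with no back-propagation. The sketch below uses a standard two-point estimator with random Gaussian directions on a toy quadratic loss. It is a generic illustration of the idea, not ECoFLaP's exact procedure, and the function names, step size, and sample count are assumptions.

```python
import random


def zeroth_order_grad(loss, weights, eps=1e-3, n_samples=4000, seed=0):
    """Estimate the gradient of `loss` at `weights` from forward evaluations only.

    Each sample perturbs the weights along a random direction z and uses the
    central difference (loss(w + eps*z) - loss(w - eps*z)) / (2*eps) as the
    directional slope; averaging slope * z approximates the gradient.
    """
    rng = random.Random(seed)
    d = len(weights)
    grad = [0.0] * d
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = loss([w + eps * zi for w, zi in zip(weights, z)])
        minus = loss([w - eps * zi for w, zi in zip(weights, z)])
        slope = (plus - minus) / (2.0 * eps)
        for j in range(d):
            grad[j] += slope * z[j] / n_samples
    return grad


def quadratic_loss(w):
    """Toy loss whose exact gradient is 2 * w."""
    return sum(x * x for x in w)


weights = [1.0, -2.0, 0.5]
estimate = zeroth_order_grad(quadratic_loss, weights)
print(estimate)
```

In a pruning setting, such estimates could be aggregated per layer into an importance score. How ECoFLaP turns those scores into layer-wise sparsity ratios is not described in the abstract and is left out here.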
Skander Karkar, Ibrahim Ayed, Emmanuel de Bézenac, Patrick Gallinari
Greedy layer-wise or module-wise training of neural networks is compelling in
constrained and on-device settings where memory is limited, as it circumvents a
number of problems of end-to-end back-propagation. However, it suffers from a
stagnation problem, whereby early layers overfit and deeper layers stop
increasing the test accuracy after a certain depth. We propose to solve this
issue by introducing a module-wise regularization inspired by the minimizing
movement scheme for gradient flows in distribution space. We call the method
TRGL for Transport Regularized Greedy Learning and study it theoretically,
proving that it leads to greedy modules that are regular and that progressively
solve the task. Experimentally, we show improved accuracy of module-wise
training of various architectures such as ResNets, Transformers and VGG, when
our regularization is added, superior to that of other module-wise training
methods and often to end-to-end training, with as much as 60% less memory
usage.
Authors' comments: NeurIPS 2023. arXiv admin note: text overlap with arXiv:2210.00949
Yu Gao, Chong Chen
Motivated by a class of nonlinear imaging inverse problems, for instance,
multispectral computed tomography (MSCT), this paper studies the convergence
theory of the nonlinear Kaczmarz method (NKM) for solving the system of
nonlinear equations with component-wise convex mapping, namely, the function
corresponding to each equation being convex. However, this kind of nonlinear
mapping may not satisfy the commonly used component-wise tangential cone
condition (TCC). To address this, we propose a novel condition named relative
gradient discrepancy condition (RGDC), and make use of it to prove the
convergence and even the convergence rate of the NKM with several general index
selection strategies, including the cyclic strategy and the
maximum residual strategy. In particular, we investigate the application of the
NKM for solving nonlinear systems in MSCT image reconstruction. We prove that
the nonlinear mapping in this context fulfills the proposed RGDC rather than
the component-wise TCC, and provide a global convergence of the NKM based on
the previously obtained results. Numerical experiments further illustrate the
numerical convergence of the NKM for MSCT image reconstruction.
Authors' comments: 34 pages, 10 figures, 1 table
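For readers unfamiliar with the method, the sketch below shows the commonly used nonlinear Kaczmarz update $x \leftarrow x - \frac{f_i(x)}{\|\nabla f_i(x)\|^2}\nabla f_i(x)$ with the cyclic index strategy. The two-equation system is a toy example with convex components, chosen here for illustration; it is not the MSCT model, and the function name and starting point are assumptions.

```python
import math


def nonlinear_kaczmarz(funcs, grads, x0, sweeps=300):
    """Cyclic nonlinear Kaczmarz: linearize one equation at a time and project."""
    x = list(x0)
    for _ in range(sweeps):
        for f, grad in zip(funcs, grads):
            g = grad(x)
            norm2 = sum(gj * gj for gj in g)
            if norm2 == 0.0:
                continue  # no usable direction for this equation
            step = f(x) / norm2
            x = [xj - step * gj for xj, gj in zip(x, g)]
    return x


# Toy system with convex components; (1, 1) is one of its roots.
f1 = lambda x: x[0] ** 2 + x[1] ** 2 - 2.0
g1 = lambda x: [2.0 * x[0], 2.0 * x[1]]
f2 = lambda x: math.exp(x[0]) + x[1] - (math.e + 1.0)
g2 = lambda x: [math.exp(x[0]), 1.0]

x = nonlinear_kaczmarz([f1, f2], [g1, g2], x0=[2.0, 2.0])
residuals = [abs(f1(x)), abs(f2(x))]
print(x, residuals)
```

A maximum residual strategy would replace the inner loop by picking, at each step, the equation with the largest `abs(f(x))`.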
Di Liang, Nian Shao, Xiaofei Li
This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method that operates in a frame-in-frame-out fashion. To detect a flexible number of speakers frame by frame and extract/update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism is adopted so that some future frames can be used to detect new speakers effectively in real time and to update speaker attractors adaptively. The proposed method processes the audio stream frame by frame and incurs only a low inference latency, caused by the look-ahead frames. Experiments show that, compared with the recently proposed block-wise online methods, our method FS-EEND achieves state-of-the-art diarization results, with a low inference latency and computational cost.
Jaroslav Schmidt, Alena Zemanová, Jan Zeman
Laminated glass achieves improved post-critical response through the
composite effect of stiff glass layers and more compliant polymer films,
manifested in progressive layer failure by multiple localized cracks. As a
result, laminated glass exhibits greater ductility than non-laminated glass,
making structures made with it suitable for safety-critical applications while
maintaining their aesthetic qualities. However, such post-critical response is
challenging to reproduce using deterministic failure models, which mostly
predict failure through a single through-thickness crack localized
simultaneously in all layers. This numerical-experimental study explores the
extent to which progressive failure can be predicted by a simple randomized
model, where layer-wise tensile strength is modeled by independent, identically
distributed Weibull variables. On the numerical side, we employ a
computationally efficient, dimensionally-reduced phase field formulation --
with each layer considered to be a Timoshenko beam -- to study progressive
failure through combinatorial analysis and detailed Monte Carlo simulations.
The reference experimental data were obtained from displacement-controlled
four-point bending tests performed on multi-layer laminated glass beams. For
certain combinations of the glass layer strengths, results show that the
randomized model can reproduce progressive structural failure and the formation
of multiple localized cracks in the glass layers. However, the predicted
response was less ductile than that observed in experiments, and the model
could not reproduce the most frequent glass layer failure sequence. These
findings highlight the need to consider strength variability along the length
of a beam and to include it in phase-field formulations.
Authors' comments: 30 pages, 18 figures, and 2 tables
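The randomized ingredient of the model, layer-wise strengths drawn as independent, identically distributed Weibull variables, can be sketched with the standard library. The code below only samples strengths and counts which layer receives the lowest one; it is not a failure model, since the stress distribution across layers in bending is ignored. The scale and shape values are placeholders, not parameters from the study.

```python
import random
from collections import Counter


def weakest_layer_frequencies(n_layers, scale, shape, n_trials, seed=0):
    """Sample i.i.d. Weibull strengths per layer and count the weakest layer."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        strengths = [rng.weibullvariate(scale, shape) for _ in range(n_layers)]
        weakest = min(range(n_layers), key=lambda i: strengths[i])
        counts[weakest] += 1
    return {layer: counts[layer] / n_trials for layer in range(n_layers)}


# Placeholder parameters chosen only for illustration.
freqs = weakest_layer_frequencies(n_layers=3, scale=45.0, shape=5.0, n_trials=3000)
print(freqs)
```

Because the strengths are identically distributed, each layer is the weakest about equally often. Any preferred failure sequence in a simulation must therefore come from the mechanics rather than from the sampling.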