Yuto Watanabe, Kazunori Sakurama
This study explores distributed optimization problems with clique-wise
coupling via operator splitting and how we can utilize this framework for
performance analysis and enhancement. This framework extends beyond
conventional pairwise coupled problems (e.g., consensus optimization) and is
applicable to broader examples. To this end, we first introduce a new
distributed optimization algorithm by leveraging a clique-based matrix and the
Davis-Yin splitting (DYS), a versatile three-operator splitting method. We then
demonstrate that this approach sheds new light on conventional algorithms in
the following way: (i) Existing algorithms (NIDS, Exact diffusion, diffusion,
and our previous work) can be derived from our proposed method; (ii) We present
a new mixing matrix based on clique-wise coupling, which surfaces when deriving
the NIDS. We prove its preferable distribution of eigenvalues, enabling fast
consensus; (iii) These observations yield a new linear convergence rate for the
NIDS with non-smooth objective functions. Remarkably, our linear rate is first
established for the general DYS with a projection onto a subspace. This case is
not covered by any prior results, to our knowledge. Finally, numerical examples
showcase the efficacy of our proposed approach.
Authors' comments: 32 pages
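As a point of reference for the abstract above, the generic Davis-Yin iteration for minimizing f(x) + g(x) + h(x), with h smooth, can be sketched as follows. This is a minimal illustration and not the authors' clique-based algorithm: the toy problem (g as the indicator of the consensus subspace, f as an l1 penalty, h as a quadratic), the function names, and the step size are all assumptions chosen for the example.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def davis_yin_consensus(a, lam=1.0, gamma=0.5, iters=100):
    """Davis-Yin splitting on a toy problem (assumed for illustration):

        minimize  sum_i lam*|x_i|  +  indicator{x_1 = ... = x_n}
                  + sum_i 0.5*(x_i - a_i)^2
    """
    n = len(a)
    z = [0.0] * n
    x_g = [0.0] * n
    for _ in range(iters):
        # prox of g: projection onto the consensus subspace (averaging)
        mean_z = sum(z) / n
        x_g = [mean_z] * n
        # gradient of the smooth term h at x_g
        grad = [x_g[i] - a[i] for i in range(n)]
        # prox of gamma*f at the reflected, gradient-corrected point
        x_f = [
            soft_threshold(2.0 * x_g[i] - z[i] - gamma * grad[i], gamma * lam)
            for i in range(n)
        ]
        # update of the governing sequence (relaxation parameter 1)
        z = [z[i] + x_f[i] - x_g[i] for i in range(n)]
    return x_g
```

For a = [1, 2, 6] and lam = 1, the consensus minimizer is the soft-thresholded mean, soft(3, 1) = 2, and the iterates approach that value.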
Ali Javidani, Mohammad Amin Sadeghi, Babak Nadjar Araabi
Self-supervised visual representation learning traditionally focuses on
image-level instance discrimination. Our study introduces an innovative,
fine-grained dimension by integrating patch-level discrimination into these
methodologies. This integration allows for the simultaneous analysis of local
and global visual features, thereby enriching the quality of the learned
representations. Initially, the original images undergo spatial augmentation.
Subsequently, we employ a distinctive photometric patch-level augmentation,
where each patch is individually augmented, independent from other patches
within the same view. This approach generates a diverse training dataset with
distinct color variations in each segment. The augmented images are then
processed through a self-distillation learning framework, utilizing the Vision
Transformer (ViT) as its backbone. The proposed method minimizes the
representation distances across both image and patch levels to capture details
from macro to micro perspectives. To this end, we present a simple yet
effective patch-matching algorithm to find the corresponding patches across the
augmented views. Thanks to the efficient structure of the patch-matching
algorithm, our method reduces computational complexity compared to similar
approaches. Consequently, we achieve an advanced understanding of the model
without adding significant computational requirements. We have extensively
pretrained our method on datasets of varied scales, such as CIFAR-10,
ImageNet-100, and ImageNet-1K. It demonstrates superior performance over
state-of-the-art self-supervised representation learning methods in image
classification and downstream tasks, such as copy detection and image
retrieval. The implementation of our method is accessible on GitHub.
Authors' comments: 15 pages
Rongrong Lin, Shimin Li, Yulan Liu
Computing the proximal operator of the sparsity-promoting piece-wise exponential (PiE) penalty $1-e^{-|x|/\sigma}$ with a given shape parameter $\sigma>0$, a popular nonconvex surrogate of the $\ell_0$-norm, is fundamental in feature selection via support vector machines, image reconstruction, zero-one programming problems, compressed sensing, etc. Due to the nonconvexity of PiE, its proximal operator has long been evaluated via an iteratively reweighted $\ell_1$ algorithm, which substitutes PiE with its first-order approximation; however, the solutions obtained in this way are only guaranteed to be critical points. Based on the exact characterization of the proximal operator of PiE, we explore how the iteratively reweighted $\ell_1$ solution deviates from the true proximal operator in certain regions, which can be explicitly identified in terms of $\sigma$, the initial value, and the regularization parameter in the definition of the proximal operator. Moreover, the initial value can be chosen adaptively and simply to ensure that the iteratively reweighted $\ell_1$ solution belongs to the proximal operator of PiE.
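The gap between the iteratively reweighted $\ell_1$ solution and the true proximal operator can be illustrated numerically in the scalar case. The sketch below is not the paper's characterization: it compares a brute-force grid minimizer of $\frac{1}{2}(x-y)^2 + \lambda(1-e^{-|x|/\sigma})$ with the reweighted iteration started from two different initial values. The function names, the grid resolution, and the choice of starting at `x0 = y` are assumptions made for this example.

```python
import math


def pie_objective(x, y, lam, sigma):
    """0.5*(x - y)^2 + lam*(1 - exp(-|x|/sigma)): the prox objective at x."""
    return 0.5 * (x - y) ** 2 + lam * (1.0 - math.exp(-abs(x) / sigma))


def prox_pie_grid(y, lam, sigma, n=20000):
    """Brute-force reference: the minimizer lies between 0 and y, so scan a grid there."""
    candidates = [y * i / n for i in range(n + 1)]
    return min(candidates, key=lambda x: pie_objective(x, y, lam, sigma))


def irl1(y, lam, sigma, x0, iters=100):
    """Iteratively reweighted l1: linearize PiE at x_k, then soft-threshold y."""
    x = x0
    for _ in range(iters):
        w = math.exp(-abs(x) / sigma) / sigma        # slope of PiE at |x_k|
        t = lam * w                                  # weighted l1 threshold
        x = math.copysign(max(abs(y) - t, 0.0), y)   # soft-thresholding of y
    return x


y, lam, sigma = 1.5, 1.0, 0.5
reference = prox_pie_grid(y, lam, sigma)
from_zero = irl1(y, lam, sigma, x0=0.0)   # stays at the critical point 0
from_y = irl1(y, lam, sigma, x0=y)        # reaches the nonzero minimizer
```

With these illustrative parameters, the iteration started at 0 never leaves 0, which is a critical point but has a larger objective value than the grid minimizer, while the iteration started at `y` agrees with the grid minimizer up to the grid resolution.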
Caixin Wang, Jie Zhang, Matthew A. Wilson, Ralph Etienne-Cummings
Accurately capturing dynamic scenes with wide-ranging motion and light
intensity is crucial for many vision applications. However, acquiring
high-speed high dynamic range (HDR) video is challenging because the camera's
frame rate restricts its dynamic range. Existing methods sacrifice speed to
acquire multi-exposure frames. Yet, misaligned motion in these frames can still
pose complications for HDR fusion algorithms, resulting in artifacts. Instead
of frame-based exposures, we sample the videos using individual pixels at
varying exposures and phase offsets. Implemented on a pixel-wise programmable
image sensor, our sampling pattern simultaneously captures fast motion at a
high dynamic range. We then transform pixel-wise outputs into an HDR video
using end-to-end learned weights from deep neural networks, achieving high
spatiotemporal resolution with minimized motion blurring. We demonstrate
aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under
low-light conditions and against bright backgrounds - both challenging
conditions for conventional cameras. By combining the versatility of pixel-wise
sampling patterns with the strength of deep neural networks at decoding complex
scenes, our method greatly enhances the vision system's adaptability and
performance in dynamic conditions.
Authors' comments: 14 pages, 14 figures
Kushal Chawla, Ian Wu, Yu Rong, Gale M. Lucas, Jonathan Gratch
A natural way to design a negotiation dialogue system is via self-play RL:
train an agent that learns to maximize its performance by interacting with a
simulated user that has been designed to imitate human-human dialogue data.
Although this procedure has been adopted in prior work, we find that it results
in a fundamentally flawed system that fails to learn the value of compromise in
a negotiation, which can often lead to no agreements (i.e., the partner walking
away without a deal), ultimately hurting the model's overall performance. We
investigate this observation in the context of the DealOrNoDeal task, a
multi-issue negotiation over books, hats, and balls. Grounded in negotiation
theory from Economics, we modify the training procedure in two novel ways to
design agents with diverse personalities and analyze their performance with
human partners. We find that although both techniques show promise, a selfish
agent, which maximizes its own performance while also avoiding walkaways,
outperforms the other variants by implicitly learning to generate value
for both itself and the negotiation partner. We discuss the implications of our
findings for what it means to be a successful negotiation dialogue system and
how these systems should be designed in the future.
Authors' comments: Accepted at EMNLP 2023 (Main)
Shuaiyi Li, Yang Deng, Wai Lam
Spatial reasoning in text plays a crucial role in various real-world
applications. Existing approaches for spatial reasoning typically infer spatial
relations from pure text, which overlooks the gap between natural language and
symbolic structures. Graph neural networks (GNNs) have showcased exceptional
proficiency in inducing and aggregating symbolic structures. However, classical
GNNs face challenges in handling multi-hop spatial reasoning due to the
over-smoothing issue, i.e., the performance decreases substantially as the
number of graph layers increases. To cope with these challenges, we propose a
novel Depth-Wise Graph Neural Network (DepWiGNN). Specifically, we design a
novel node memory scheme and aggregate the information over the depth dimension
instead of the breadth dimension of the graph, which makes it possible to
collect long-range dependencies without stacking multiple layers. Experimental
results on two challenging multi-hop spatial reasoning datasets show that
DepWiGNN outperforms existing spatial reasoning methods. Comparisons with
three other GNNs further demonstrate its superiority in capturing long-range
dependencies in the graph.
Authors' comments: EMNLP 2023 Findings
Pascal Pernot
Binwise Variance Scaling (BVS) has recently been proposed as a post hoc
recalibration method for prediction uncertainties of machine learning
regression problems that is capable of more efficient corrections than uniform
variance (or temperature) scaling. The original version of BVS uses
uncertainty-based binning, which aims to improve calibration conditionally
on uncertainty, i.e. consistency. I explore here several adaptations of BVS, in
particular with alternative loss functions and a binning scheme based on an
input-feature (X) in order to improve adaptivity, i.e. calibration conditional
on X. The performances of BVS and its proposed variants are tested on a
benchmark dataset for the prediction of atomization energies and compared to
the results of isotonic regression.
Authors' comments: This version corrects an error in the estimation of the Sx scores for
the test set, affecting Fig. 2 and Tables I-III of the initial version. The
main points of the discussion and the conclusions are unchanged
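A minimal sketch of uncertainty-binned variance scaling, as commonly described, is given below: samples are sorted by predicted uncertainty, split into equal-count bins, and each bin's uncertainties are multiplied by the root mean square of its z-scores (error divided by uncertainty). The function name, the equal-count binning, and this particular scale factor are assumptions for illustration; the loss functions and the X-based binning studied in the abstract are not reproduced here.

```python
import math


def binwise_variance_scaling(errors, uncertainties, n_bins):
    """Return (recalibrated uncertainties, per-bin scale factors).

    Assumes len(errors) >= n_bins and strictly positive uncertainties.
    """
    n = len(errors)
    order = sorted(range(n), key=lambda i: uncertainties[i])  # sort by uncertainty
    recalibrated = list(uncertainties)
    factors = []
    for b in range(n_bins):
        # indices of the samples falling into the b-th equal-count bin
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        mean_z2 = sum((errors[i] / uncertainties[i]) ** 2 for i in members) / len(members)
        s = math.sqrt(mean_z2)  # scale that sets the bin's mean squared z-score to 1
        factors.append(s)
        for i in members:
            recalibrated[i] = s * uncertainties[i]
    return recalibrated, factors
```

After scaling, the mean squared z-score within each bin of the calibration set equals 1 by construction; how well this transfers to a test set is exactly what such a study has to evaluate.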
Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng et al.
Resistive random access memory (ReRAM)-based processing-in-memory (PIM)
architectures have demonstrated great potential to accelerate Deep Neural
Network (DNN) training/inference. However, the computational accuracy of analog
PIM is compromised due to the non-idealities, such as the conductance variation
of ReRAM cells. The impact of these non-idealities worsens as the number of
concurrently activated wordlines and bitlines increases. To guarantee
computational accuracy, only a limited number of wordlines and bitlines of the
crossbar array can be turned on concurrently, significantly reducing the
achievable parallelism of the architecture.
While the constraints on parallelism limit the efficiency of the
accelerators, they also provide a new opportunity for fine-grained
mixed-precision quantization. To enable efficient DNN inference on practical
ReRAM-based accelerators, we propose an algorithm-architecture co-design
framework called \underline{B}lock-\underline{W}ise mixed-precision
\underline{Q}uantization (BWQ). At the algorithm level, BWQ-A introduces a
mixed-precision quantization scheme at the block level, which achieves a high
weight and activation compression ratio with negligible accuracy degradation.
We also present the hardware architecture design BWQ-H, which leverages the
low-bit-width models achieved by BWQ-A to perform high-efficiency DNN inference
on ReRAM devices. BWQ-H also adopts a novel precision-aware weight mapping
method to increase the ReRAM crossbar's throughput. Our evaluation demonstrates
the effectiveness of BWQ, which achieves a 6.08x speedup and a 17.47x energy
saving on average compared to existing ReRAM-based architectures.
Authors' comments: 12 pages, 13 figures
Uri Stern, Daniel Shwartz, Daphna Weinshall
Deep neural networks have become the method of choice for solving many classification tasks, largely because they can fit very complex functions defined over raw data. The downside of such powerful learners is the danger of overfit. In this paper, we introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting by combining models generated at specific intermediate epochs during training. Our method allows for the incorporation of useful knowledge obtained by the models during the overfitting phase without deterioration of the general performance, which is usually missed when early stopping is used. To motivate this approach, we begin with the theoretical analysis of a regression model, whose prediction -- that the variance among classifiers increases when overfit occurs -- is demonstrated empirically in deep networks in common use. Guided by these results, we construct a new ensemble-based prediction method, where the prediction is determined by the class that attains the most consensual prediction throughout the training epochs. Using multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. It is thus a practical and useful tool to overcome overfit. Code is available at https://github.com/uristern123/United-We-Stand-Using-Epoch-wise-Agreement-of-Ensembles-to-Combat-Overfit.
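The idea of picking the class with the most agreement across training epochs can be sketched as a plain majority vote over per-epoch predictions. This is a simplified illustration under stated assumptions, not the authors' exact scoring rule (see their repository for that): the function name and the input layout (`predictions_per_epoch[e][i]` is the class predicted for sample `i` by the checkpoint from epoch `e`) are hypothetical.

```python
from collections import Counter


def epochwise_agreement(predictions_per_epoch):
    """Majority vote over epochs for each sample.

    predictions_per_epoch[e][i] is the class predicted for sample i at epoch e.
    """
    n_samples = len(predictions_per_epoch[0])
    final = []
    for i in range(n_samples):
        votes = Counter(epoch_preds[i] for epoch_preds in predictions_per_epoch)
        final.append(votes.most_common(1)[0][0])  # most frequently predicted class
    return final
```

A prediction that only the late, overfitted checkpoints produce is outvoted by the earlier epochs, which is the intuition behind using agreement over the whole training trajectory.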
Nayoung Choi
Contextual word embeddings obtained from pre-trained language models (PLMs) have proven effective for various natural language processing tasks at the word level. However, interpreting the hidden aspects within embeddings, such as syntax and semantics, remains challenging. Disentangled representation learning has emerged as a promising approach, which separates specific aspects into distinct embeddings. Furthermore, different linguistic knowledge is believed to be stored in different layers of a PLM. This paper aims to disentangle semantic sense from BERT by applying a binary mask to intermediate outputs across the layers, without updating pre-trained parameters. The disentangled embeddings are evaluated through binary classification to determine whether the target word in two different sentences has the same meaning. Experiments with cased BERT$_{\texttt{base}}$ show that leveraging layer-wise information is effective and that disentangling semantic sense further improves performance.
Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
Large Vision-Language Models (LVLMs) can understand the world comprehensively
by integrating rich information from different modalities, achieving remarkable
advancements on various multimodal downstream tasks. However, deploying LVLMs
is often problematic due to their massive computational/energy costs and carbon
consumption. Such issues make it infeasible to adopt conventional iterative
global pruning, which is costly due to computing the Hessian matrix of the
entire large model for sparsification. Alternatively, several studies have
recently proposed layer-wise pruning approaches to avoid the expensive
computation of global pruning and efficiently compress model weights according
to their importance within a layer. However, they often suffer from suboptimal
model compression due to their lack of a global perspective. To address this
limitation in recent efficient pruning methods for large models, we propose
Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage
coarse-to-fine weight pruning approach for LVLMs. We first determine the
sparsity ratios of different layers or blocks by leveraging the global
importance score, which is efficiently computed based on the zeroth-order
approximation of the global model gradients. Then, the model performs local
layer-wise unstructured weight pruning based on globally-informed sparsity
ratios. We validate our proposed method across various multimodal and unimodal
models and datasets, demonstrating significant performance improvements over
prevalent pruning techniques in the high-sparsity regime.
Authors' comments: ICLR 2024 (project page: https://ecoflap.github.io/)
Skander Karkar, Ibrahim Ayed, Emmanuel de Bézenac, Patrick Gallinari
Greedy layer-wise or module-wise training of neural networks is compelling in
constrained and on-device settings where memory is limited, as it circumvents a
number of problems of end-to-end back-propagation. However, it suffers from a
stagnation problem, whereby early layers overfit and deeper layers stop
increasing the test accuracy after a certain depth. We propose to solve this
issue by introducing a module-wise regularization inspired by the minimizing
movement scheme for gradient flows in distribution space. We call the method
TRGL for Transport Regularized Greedy Learning and study it theoretically,
proving that it leads to greedy modules that are regular and that progressively
solve the task. Experimentally, we show improved accuracy of module-wise
training of various architectures such as ResNets, Transformers and VGG, when
our regularization is added, superior to that of other module-wise training
methods and often to end-to-end training, with as much as 60% less memory
usage.
Authors' comments: NeurIPS 2023. arXiv admin note: text overlap with arXiv:2210.00949
Yu Gao, Chong Chen
Motivated by a class of nonlinear imaging inverse problems, for instance,
multispectral computed tomography (MSCT), this paper studies the convergence
theory of the nonlinear Kaczmarz method (NKM) for solving the system of
nonlinear equations with component-wise convex mapping, namely, the function
corresponding to each equation being convex. However, this kind of nonlinear
mapping may not satisfy the commonly used component-wise tangential cone
condition (TCC). To address this, we propose a novel condition, named the relative
gradient discrepancy condition (RGDC), and use it to prove the
convergence and even the convergence rate of the NKM with several general index
selection strategies, where these strategies include cyclic strategy and
maximum residual strategy. Particularly, we investigate the application of the
NKM for solving nonlinear systems in MSCT image reconstruction. We prove that
the nonlinear mapping in this context fulfills the proposed RGDC rather than
the component-wise TCC, and provide a global convergence of the NKM based on
the previously obtained results. Numerical experiments further illustrate the
numerical convergence of the NKM for MSCT image reconstruction.
Authors' comments: 34 pages, 10 figures, 1 table
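For readers unfamiliar with the method, a nonlinear Kaczmarz step picks one equation f_i(x) = 0 and moves x along the gradient of f_i by f_i(x) / ||grad f_i(x)||^2. The sketch below shows this update with the cyclic and maximum residual selection rules mentioned in the abstract. The two-equation toy system (both components convex), the function names, and the starting point are assumptions for illustration and are unrelated to the MSCT model.

```python
def nonlinear_kaczmarz(funcs, grads, x0, iters=100, rule="cyclic"):
    """Nonlinear Kaczmarz iteration with a cyclic or maximum residual index rule."""
    x = list(x0)
    m = len(funcs)
    for k in range(iters):
        if rule == "cyclic":
            i = k % m
        else:  # "max_residual": pick the equation with the largest |f_i(x)|
            i = max(range(m), key=lambda j: abs(funcs[j](x)))
        r = funcs[i](x)
        g = grads[i](x)
        g2 = sum(gj * gj for gj in g)
        if g2 == 0.0:
            continue  # no usable direction for this equation at x
        # step onto the zero set of the linearization of f_i at x
        x = [xj - (r / g2) * gj for xj, gj in zip(x, g)]
    return x


# Toy system with component-wise convex equations; (1, 1) is a solution.
funcs = [lambda x: x[0] ** 2 + x[1] ** 2 - 2.0, lambda x: x[0] - x[1]]
grads = [lambda x: [2.0 * x[0], 2.0 * x[1]], lambda x: [1.0, -1.0]]

x_cyclic = nonlinear_kaczmarz(funcs, grads, [2.0, 0.5], rule="cyclic")
x_maxres = nonlinear_kaczmarz(funcs, grads, [2.0, 0.5], rule="max_residual")
```

On this toy system both rules approach the solution (1, 1) from the chosen starting point; whether and how fast the iteration converges in general is what conditions such as the TCC or the proposed RGDC are meant to establish.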
Di Liang, Nian Shao, Xiaofei Li
This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method that operates in a frame-in-frame-out fashion. To detect a flexible number of speakers frame by frame and to extract and update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism is adopted so that some future frames can be used to detect new speakers effectively in real time and to update speaker attractors adaptively. The proposed method processes the audio stream frame by frame and incurs only a low inference latency, which stems from the look-ahead frames. Experiments show that, compared with the recently proposed block-wise online methods, our method FS-EEND achieves state-of-the-art diarization results with low inference latency and computational cost.
Jaroslav Schmidt, Alena Zemanová, Jan Zeman
Laminated glass achieves improved post-critical response through the
composite effect of stiff glass layers and more compliant polymer films,
manifested in progressive layer failure by multiple localized cracks. As a
result, laminated glass exhibits greater ductility than non-laminated glass,
making structures made with it suitable for safety-critical applications while
maintaining their aesthetic qualities. However, such post-critical response is
challenging to reproduce using deterministic failure models, which mostly
predict failure through a single through-thickness crack localized
simultaneously in all layers. This numerical-experimental study explores the
extent to which progressive failure can be predicted by a simple randomized
model, where layer-wise tensile strength is modeled by independent, identically
distributed Weibull variables. On the numerical side, we employ a
computationally efficient, dimensionally-reduced phase field formulation --
with each layer considered to be a Timoshenko beam -- to study progressive
failure through combinatorial analysis and detailed Monte Carlo simulations.
The reference experimental data were obtained from displacement-controlled
four-point bending tests performed on multi-layer laminated glass beams. For
certain combinations of the glass layer strengths, results show that the
randomized model can reproduce progressive structural failure and the formation
of multiple localized cracks in the glass layers. However, the predicted
response was less ductile than that observed in experiments, and the model
could not reproduce the most frequent glass layer failure sequence. These
findings highlight the need to consider strength variability along the length
of a beam and to include it in phase-field formulations.
Authors' comments: 30 pages, 18 figures, and 2 tables
Shida Wang, Beichen Xue
State-space models have gained popularity in sequence modelling due to their
simple and efficient network structures. However, the absence of nonlinear
activation along the temporal direction limits the model's capacity. In this
paper, we prove that stacking state-space models with layer-wise nonlinear
activation is sufficient to approximate any continuous sequence-to-sequence
relationship. Our findings demonstrate that the addition of layer-wise
nonlinear activation enhances the model's capacity to learn complex sequence
patterns. Meanwhile, it can be seen both theoretically and empirically that the
state-space models do not fundamentally resolve the exponential decaying memory
issue. Theoretical results are justified by numerical verifications.
Authors' comments: 17 pages, 6 figures
Bokyeong Yoon, Yoonsang Han, Gordon Euhyun Moon
Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have either used a fixed pattern or data-driven approach to reduce the number of operations involving the computation of multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from inevitable problems, such as the potential loss of essential sequence features due to the uniform fixed pattern applied across all layers, and an increase in the model size resulting from the use of additional parameters to learn sparsity patterns in attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training. Efficient implementations of the layer-wise sparsified attention algorithm on GPUs are developed, demonstrating a new SPION that achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer models, with better evaluation quality.
Sebastian Eliassen, Raghavendra Selvan
Efficient training of large-scale graph neural networks (GNNs) has been
studied with a specific focus on reducing their memory consumption. Work by Liu
et al. (2022) proposed extreme activation compression (EXACT) which
demonstrated drastic reduction in memory consumption by performing quantization
of the intermediate activation maps down to INT2 precision. They showed
little to no reduction in performance while achieving large reductions in GPU
memory consumption. In this work, we present an improvement to the EXACT
strategy by using block-wise quantization of the intermediate activation maps.
We experimentally analyze different block sizes and show a further reduction in
memory consumption (>15%) and a runtime speedup per epoch (about 5%), even under
extreme quantization, with performance trade-offs similar to those of
the original EXACT. Further, we present a correction to the assumptions
on the distribution of intermediate activation maps in EXACT (assumed to be
uniform) and show improved variance estimations of the quantization and
dequantization steps.
Authors' comments: Accepted to be presented at the International Conference on
Acoustics, Speech and Signal Processing (ICASSP-2024). Source code at
https://github.com/saintslab/i-Exact
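The effect of quantizing in blocks can be sketched with a small 2-bit round trip in which each block stores its own minimum and scale. This is a simplified illustration, not the i-Exact implementation: it uses deterministic round-to-nearest (an assumption made here so the example is reproducible), a flat list instead of activation tensors, and hypothetical function names.

```python
def quantize(values, bits=2):
    """Uniform quantization of a block to 2**bits levels between its min and max."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def blockwise_roundtrip(values, block_size, bits=2):
    """Quantize and dequantize each block with its own range."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        out.extend(dequantize(*quantize(block, bits)))
    return out


values = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
per_block = blockwise_roundtrip(values, block_size=4)
whole = blockwise_roundtrip(values, block_size=len(values))  # one global range
err_block = max(abs(a - b) for a, b in zip(values, per_block))
err_whole = max(abs(a - b) for a, b in zip(values, whole))
```

On this hand-picked input the per-block ranges are much narrower than the global range, so the four available levels resolve each block far better. The trade-off is that every block carries its own `lo` and `scale`, so the block size controls how much metadata is stored.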
Ofir Gordon, Elad Cohen, Hai Victor Habi, Arnon Netzer
Quantization is a key method for deploying deep neural networks on edge devices with limited memory and computation resources. Recent improvements in Post-Training Quantization (PTQ) methods were achieved by an additional local optimization process for learning the weight quantization rounding policy. However, a gap exists when employing network-wise optimization with small representative datasets. In this paper, we propose a new method for enhanced PTQ (EPTQ) that employs a network-wise quantization optimization process, which benefits from considering cross-layer dependencies during optimization. EPTQ enables network-wise optimization with a small representative dataset using a novel sample-layer attention score based on a label-free Hessian matrix upper bound. The label-free approach makes our method suitable for the PTQ scheme. We give a theoretical analysis for the said bound and use it to construct a knowledge distillation loss that guides the optimization to focus on the more sensitive layers and samples. In addition, we leverage the Hessian upper bound to improve the weight quantization parameters selection by focusing on the more sensitive elements in the weight tensors. Empirically, by employing EPTQ we achieve state-of-the-art results on various models, tasks, and datasets, including ImageNet classification, COCO object detection, and Pascal-VOC for semantic segmentation.
Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
Large Language Models (LLMs) are trained with a pre-defined context length,
restricting their use in scenarios requiring long inputs. Previous efforts to
adapt LLMs to a longer length usually require fine-tuning with this target
length (Full-length fine-tuning), which incurs an intensive training cost. To
decouple train length from target length for efficient context window
extension, we propose Positional Skip-wisE (PoSE) training that smartly
simulates long inputs using a fixed context window. This is achieved by first
dividing the original context window into several chunks, then designing
distinct skipping bias terms to manipulate the position indices of each chunk.
These bias terms and the lengths of each chunk are altered for every training
example, allowing the model to adapt to all positions within target length.
Experimental results show that PoSE greatly reduces memory and time overhead
compared with Full-length fine-tuning, with minimal impact on performance.
Leveraging this advantage, we have successfully extended the LLaMA model to
128k tokens using a 2k training context window. Furthermore, we empirically
confirm that PoSE is compatible with all RoPE-based LLMs and position
interpolation strategies. Notably, our method can potentially support infinite
length, limited only by memory usage in inference. With ongoing progress for
efficient inference, we believe PoSE can further scale the context window
beyond 128k.
Authors' comments: ICLR 2024
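The chunk-and-skip construction described above can be sketched as a function that produces position indices for one training example. This is a rough illustration under stated assumptions, not the authors' implementation: the function name, the uniform sampling of chunk boundaries and bias terms, and the use of non-decreasing biases are choices made for this example.

```python
import random


def pose_position_ids(train_len, target_len, n_chunks, rng):
    """Position indices for one example: train_len tokens spread over target_len."""
    # Random chunk boundaries inside the fixed training window.
    cuts = sorted(rng.sample(range(1, train_len), n_chunks - 1))
    bounds = [0] + cuts + [train_len]
    # Non-decreasing skipping biases; the last index stays below target_len.
    budget = target_len - train_len
    biases = sorted(rng.randint(0, budget) for _ in range(n_chunks))
    ids = []
    for c in range(n_chunks):
        # Positions inside a chunk stay contiguous; the bias shifts the whole chunk.
        ids.extend(p + biases[c] for p in range(bounds[c], bounds[c + 1]))
    return ids


ids = pose_position_ids(train_len=16, target_len=128, n_chunks=4, rng=random.Random(0))
```

Each call draws new boundaries and biases, so over many training examples the model sees relative positions spanning the whole target range while only ever processing `train_len` tokens at a time.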