Navin Ranjan, Andreas Savakis
Vision transformers (ViTs) have demonstrated remarkable performance across various visual tasks. However, ViT models have substantial computational and memory requirements, making them challenging to deploy on resource-constrained platforms. Quantization is a popular approach for reducing model size, but most studies focus on equal bit-width quantization for the entire network, resulting in sub-optimal solutions. While there are a few works on mixed-precision quantization (MPQ) for ViTs, they typically rely on search space-based methods or assign mixed precision arbitrarily. In this paper, we introduce LRP-QViT, an explainability-based method for assigning mixed-precision bit allocations to different layers based on their importance during classification. Specifically, to measure the contribution score of each layer in predicting the target class, we employ the Layer-wise Relevance Propagation (LRP) method. LRP assigns local relevance at the output layer and propagates it through all layers, distributing the relevance until it reaches the input layers. These relevance scores serve as indicators for computing the layer contribution score. Additionally, we introduce a clipped channel-wise quantization aimed at eliminating outliers from post-LayerNorm activations to alleviate severe inter-channel variations. To validate our approach, we apply LRP-QViT to ViT, DeiT, and Swin transformer models on various datasets. Our experiments demonstrate that both our fixed-bit and mixed-bit post-training quantization methods surpass existing models in 4-bit and 6-bit quantization.
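To illustrate the idea of clipped channel-wise quantization mentioned in the abstract above, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that clipping is done by clamping each channel to a quantile range before uniform quantization; the function names and the `clip_q` parameter are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def quantize_channel(values, bits=4, clip_q=0.99):
    """Uniformly quantize one channel after clipping its tails.

    clip_q is a hypothetical knob (an assumption of this sketch): values
    outside the [1 - clip_q, clip_q] quantile range are clamped before the
    quantization scale is computed.
    """
    s = sorted(values)
    lo, hi = quantile(s, 1 - clip_q), quantile(s, clip_q)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        v = min(max(v, lo), hi)          # clamp outliers to the clip range
        code = round((v - lo) / scale)   # integer code in [0, levels]
        out.append(lo + code * scale)    # dequantized value
    return out


def quantize_channelwise(activations, bits=4, clip_q=0.99):
    """activations: list of tokens, each a list of per-channel values.

    Every channel gets its own clip range and scale, so a channel with a
    large dynamic range does not dictate the step size of the others.
    """
    channels = list(zip(*activations))   # transpose to channel-major
    q_channels = [quantize_channel(list(c), bits, clip_q) for c in channels]
    return [list(tok) for tok in zip(*q_channels)]
```

Because each channel is scaled independently, a single extreme activation in one channel only affects that channel's clip range rather than the whole tensor.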
Pawan Kumar, Prateek Dwivedi, Sobiya Ashraf, Dipin Pillai, Rahul Mangal
Self-propelled droplets serve as ideal model systems for deepening our
understanding of the motion of biological micro-swimmers by simulating their
motility. Biological microorganisms are renowned for showcasing a diverse array
of dynamic swimming behaviors when confronted with physical constraints. This
study aims to elucidate the impact of physical constraints on swimming
characteristics of biological microorganisms. To achieve this, we present
observations on the individual and pair-wise behavior of micellar solubilized
self-propelled 4-Cyano-4'-pentyl-biphenyl (5CB) oil droplets in a square
capillary channel filled with a surfactant trimethyl ammonium bromide (TTAB)
aqueous solution. To explore the effect of the underlying P\'eclet ($Pe$)
number of the swimming droplets, the study is also performed in the presence of
additives such as high molecular weight polymer Polyethylene oxide (PEO) and
molecular solute glycerol. The capillary confinement restricts the droplets to
predominantly one-dimensional (1D) motion, albeit with noticeable differences
in their motion across the three scenarios. Through a characterization of the
chemical and hydrodynamic flow fields surrounding the droplets, we illustrate
that the modification of the droplets' chemical field due to confinement varies
significantly based on the underlying differences in the P\'eclet number ($Pe$)
in these cases. This alteration in the chemical field distribution notably
affects the individual droplets' motion. Moreover, these distinct chemical
field interactions between the droplets also lead to variations in their
pair-wise motion, ranging from behaviors like chasing to scattering.
Authors' comments: 13 pages, 9 figures
Haonan Yu, Wei Xu
Unsupervised video object learning seeks to decompose video scenes into
structural object representations without any supervision from depth, optical
flow, or segmentation. We present VONet, an innovative approach that is
inspired by MONet. While utilizing a U-Net architecture, VONet employs an
efficient and effective parallel attention inference process, generating
attention masks for all slots simultaneously. Additionally, to enhance the
temporal consistency of each mask across consecutive video frames, VONet
develops an object-wise sequential VAE framework. The integration of these
innovative encoder-side techniques, in conjunction with an expressive
transformer-based decoder, establishes VONet as the leading unsupervised method
for object learning across five MOVI datasets, encompassing videos of diverse
complexities. Code is available at https://github.com/hnyu/vonet.
Authors' comments: ICLR 2024
Daniel Campbell
We present three novel classifications of the weak sequential (and strong) limits in $W^{1,p}$ of planar diffeomorphisms. We introduce a concept called the QM condition which is a kind of separation property for pre-images of closed connected sets and show that $u$ satisfies this property exactly when it is the limit of Sobolev homeomorphisms. Further, we prove that $u\in W^{1,p}_{\operatorname{id}}((-1,1)^2,\mathbb{R}^2)$ is the limit of a sequence of homeomorphisms exactly when there are classically monotone mappings $g_{\delta}:[-1,1]^2\to \mathbb{R}^2$ and very small open sets $U_{\delta}$ such that $g_{\delta} = u$ on $[-1,1]^2 \setminus U_{\delta}$. Also, we introduce the so-called three curve condition, which is in some sense reminiscent of the NCL condition of \cite{CPR} but for $u^{-1}$ instead of for $u$, and prove that a map is the $W^{1,p}$ limit of planar Sobolev homeomorphisms exactly when it satisfies this property. This improves on results in \cite{DPP} answering the question from \cite{IO2}.
Yadong Guan, Jiqing Han, Hongwei Song, Wenjie Song, Guibin Zheng, Tieran Zheng, Yongjun He
Overlapping sound events are ubiquitous in real-world environments, but
existing end-to-end sound event detection (SED) methods still struggle to
detect them effectively. A critical reason is that these methods represent
overlapping events using shared and entangled frame-wise features, which
degrades the feature discrimination. To solve the problem, we propose a
disentangled feature learning framework to learn a category-specific
representation. Specifically, we employ different projectors to learn the
frame-wise features for each category. To ensure that these features do not
contain information of other categories, we maximize the common information
between frame-wise features within the same category and propose a frame-wise
contrastive loss. In addition, considering that the labeled data used by the
proposed method is limited, we propose a semi-supervised frame-wise contrastive
loss that can leverage large amounts of unlabeled data to achieve feature
disentanglement. The experimental results demonstrate the effectiveness of our
method.
Authors' comments: Accepted by ICASSP 2024
Tobias Cord-Landwehr, Christoph Boeddeker, Cătălin Zorilă, Rama Doddipatla, Reinhold Haeb-Umbach
We propose a modified teacher-student training for the extraction of
frame-wise speaker embeddings that allows for an effective diarization of
meeting scenarios containing partially overlapping speech. To this end, a
geodesic distance loss is used that enforces the embeddings computed from
regions with two active speakers to lie on the shortest path on a sphere
between the points given by the d-vectors of each of the active speakers. Using
those frame-wise speaker embeddings in clustering-based diarization outperforms
segment-level clustering-based diarization systems such as VBx and Spectral
Clustering. By extending our approach to a mixture-model-based diarization, the
performance can be further improved, approaching the diarization error rates of
diarization systems that use a dedicated overlap detection, and outperforming
these systems when also employing an additional overlap detection.
Authors' comments: Accepted at ICASSP 2024
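The geometric constraint described in the abstract above (embeddings from two-speaker regions lying on the shortest path on a sphere between the two d-vectors) can be sketched as follows. This is one possible formulation under our own assumptions, not necessarily the loss used by the authors: it uses the fact that a point lies on the shortest arc between two non-antipodal points exactly when the spherical triangle inequality holds with equality. The function names are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def geodesic(a, b):
    """Great-circle distance (in radians) between two unit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error


def on_path_loss(e, d1, d2):
    """Non-negative penalty that vanishes when e lies on the shortest arc
    between d1 and d2 (assumed formulation: triangle-inequality slack)."""
    e, d1, d2 = normalize(e), normalize(d1), normalize(d2)
    return geodesic(d1, e) + geodesic(e, d2) - geodesic(d1, d2)
```

For example, with `d1 = [1, 0, 0]` and `d2 = [0, 1, 0]`, the embedding `[1, 1, 0]` gives a penalty of approximately zero, while `[0, 0, 1]` is penalized.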
Punit Vadher, Devsi Bantva
Let $G$ be a simple finite connected graph with vertex set $V(G) =
\{v_1,v_2,\ldots,v_n\}$. Denote the degree of vertex $v_i$ by $d_i$ for all $1
\leq i \leq n$. The Randi\'c matrix of $G$, denoted by $R(G) = [r_{i,j}]$, is
the $n \times n$ matrix whose $(i,j)$-entry $r_{i,j}$ is $r_{i,j} =
1/\sqrt{d_id_j}$ if $v_i$ and $v_j$ are adjacent in $G$ and 0 otherwise. A tree
is a connected acyclic graph. A level-wise regular tree $T$ is a tree rooted at one
vertex $r$ or two (adjacent) vertices $r$ and $r'$ in which all vertices with
the minimum distance $i$ from $r$ or $r'$ have the same degree $m_i$ for $0
\leq i \leq h$, where $h$ is the height of $T$. In this paper, we give a
complete characterization of the eigenvalues with their multiplicity of the
Randi\'c matrix of level-wise regular trees. We prove that the eigenvalues of
the Randi\'c matrix of a level-wise regular tree are the eigenvalues of the
particular tridiagonal matrices, which are formed using the degree sequence
$(m_0,m_1,\ldots,m_{h-1})$ of level-wise regular trees.
Authors' comments: 20 pages, 2 figures
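As a small worked instance of the definition in the abstract above, the sketch below builds the Randi\'c matrix of a level-wise regular tree from an adjacency list. The example tree (height 2, degree sequence $m_0 = 2$, $m_1 = 3$, $m_2 = 1$) and all function names are our own illustrative choices. The check at the end relies on a standard fact: since $R = D^{-1/2} A D^{-1/2}$, the vector with entries $\sqrt{d_i}$ is an eigenvector of $R$ with eigenvalue 1 for any connected graph.

```python
import math


def randic_matrix(adj):
    """Return (R, deg) where R[i][j] = 1/sqrt(d_i d_j) if i ~ j, else 0.

    adj is an adjacency list: adj[i] is the list of neighbours of vertex i.
    """
    n = len(adj)
    deg = [len(neigh) for neigh in adj]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in adj[i]:
            R[i][j] = 1.0 / math.sqrt(deg[i] * deg[j])
    return R, deg


def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


# Illustrative level-wise regular tree rooted at vertex 0:
# root degree 2, both children degree 3, four leaves of degree 1.
tree = [[1, 2], [0, 3, 4], [0, 5, 6], [1], [1], [2], [2]]

R, deg = randic_matrix(tree)
x = [math.sqrt(d) for d in deg]
Rx = matvec(R, x)  # should reproduce x, i.e. eigenvalue 1
```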
Rongyu Zhang, Yulin Luo, Jiaming Liu, Huanrui Yang, Zhen Dong, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata et al.
The Mixture-of-Experts (MoE) approach has demonstrated outstanding
scalability in multi-task learning including low-level upstream tasks such as
concurrent removal of multiple adverse weather effects. However, the
conventional MoE architecture with parallel Feed Forward Network (FFN) experts
leads to significant parameter and computational overheads that hinder its
efficient deployment. In addition, the naive MoE linear router is suboptimal in
assigning task-specific features to multiple experts which limits its further
scalability. In this work, we propose an efficient MoE architecture with weight
sharing across the experts. Inspired by the idea of linear feature modulation
(FM), our architecture implicitly instantiates multiple experts via learnable
activation modulations on a single shared expert block. The proposed Feature
Modulated Expert (FME) serves as a building block for the novel
Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up
the number of experts with low overhead. We further propose an
Uncertainty-aware Router (UaR) to assign task-specific features to different FM
modules with well-calibrated weights. This enables MoFME to effectively learn
diverse expert functions for multiple tasks. The conducted experiments on the
multi-deweather task show that our MoFME outperforms the baselines in the image
restoration quality by 0.1-0.2 dB and achieves SOTA-compatible performance
while saving more than 72% of parameters and 39% inference time over the
conventional MoE counterpart. Experiments on the downstream segmentation and
classification tasks further demonstrate the generalizability of MoFME to real
open-world applications.
Authors' comments: AAAI 2024
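The weight-sharing idea in the abstract above (multiple experts instantiated implicitly via learnable modulations on a single shared expert block) can be sketched in a few lines. This is an illustration under our own assumptions, not the MoFME implementation: we assume a per-expert affine modulation (a scale `gamma` and a shift `beta`) applied to the output of one shared layer, and a plain softmax router instead of the uncertainty-aware router. All names are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def shared_expert(x, W):
    """One shared linear layer followed by ReLU; W is a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def mofme_forward(x, W, gammas, betas, router_logits):
    """Mix K implicit experts that differ only in (gamma_k, beta_k).

    The shared block is evaluated once; each expert adds only 2 * dim
    modulation parameters instead of a full copy of W.
    """
    h = shared_expert(x, W)              # computed once, shared by all experts
    weights = softmax(router_logits)     # assumed plain softmax routing
    out = [0.0] * len(h)
    for w, g, b in zip(weights, gammas, betas):
        for i, hi in enumerate(h):
            out[i] += w * (g[i] * hi + b[i])
    return out
```

With identity modulation (`gamma = 1`, `beta = 0`) for every expert, the output reduces to the shared expert's output regardless of the routing weights.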
Anzhe Cheng, Zhenkun Wang, Chenzhong Yin, Mingxi Cheng, Heng Ping, Xiongye Xiao, Shahin Nazarian, Paul Bogdan
Backpropagation (BP) has been a successful optimization technique for deep
learning models. However, its limitations, such as backward- and
update-locking, and its biological implausibility, hinder the concurrent
updating of layers and do not mimic the local learning processes observed in
the human brain. To address these issues, recent research has suggested using
local error signals to asynchronously train network blocks. However, this
approach often involves extensive trial-and-error iterations to determine the
best configuration for local training. This includes decisions on how to
decouple network blocks and which auxiliary networks to use for each block. In
our work, we introduce a novel BP-free approach: a block-wise BP-free (BWBPF)
neural network that leverages local error signals to optimize distinct
sub-neural networks separately, where the global loss is only responsible for
updating the output layer. The local error signals used in the BP-free model
can be computed in parallel, enabling a potential speed-up in the weight update
process through parallel implementation. Our experimental results consistently
show that this approach can identify transferable decoupled architectures for
VGG and ResNet variations, outperforming models trained with end-to-end
backpropagation and other state-of-the-art block-wise learning techniques on
datasets such as CIFAR-10 and Tiny-ImageNet. The code is released at
https://github.com/Belis0811/BWBPF.
Authors' comments: The paper has been accepted by ICASSP2024
Lihao Zhang, Haijian Sun, Jin Sun, Rose Qingyang Hu
The accurate modeling of indoor radio propagation is crucial for
localization, monitoring, and device coordination, yet remains a formidable
challenge, due to the complex nature of indoor environments where radio can
propagate along hundreds of paths. These paths result from the room
layout, furniture, appliances and even small objects like a glass cup. They are
also influenced by the object material and surface roughness. Advanced machine
learning (ML) techniques have the potential to take such non-linear and
hard-to-model factors into consideration. However, extensive and fine-grained
datasets are urgently required. This paper presents WiSegRT, an open-source
dataset for indoor radio propagation modeling. Generated by a differentiable
ray tracer within the segmented 3-dimensional (3D) indoor environments, WiSegRT
provides site-specific channel impulse responses for each grid point relative
to the given transmitter location. We expect WiSegRT to support a wide range of
applications, such as ML-based channel prediction, accurate indoor
localization, radio-based object detection, wireless digital twin, and more.
Authors' comments: accepted by IEEE ICNC 2024
Junsu Kim, Sumin Hong, Chanwoo Kim, Jihyeon Kim, Yihalem Yimolal Tiruneh, Jeongwan On, Jihyun Song, Sunhwa Choi et al.
Class incremental learning aims to solve a problem that arises when
continuously adding unseen class instances to an existing model. This approach
has been extensively studied in the context of image classification; however,
its applicability to object detection is not well established yet. Existing
frameworks using replay methods mainly collect replay data without considering
the model being trained and tend to rely on randomness or the number of labels
of each sample. Also, despite its effectiveness, replay has not yet been
optimized for the object detection task. In this paper, we introduce an
effective buffer training strategy (eBTS) that creates the optimized replay
buffer on object detection. Our approach incorporates guarantee minimum and
hierarchical sampling to establish the buffer customized to the trained model.
Furthermore, we use the circular experience replay training to optimally
utilize the accumulated buffer data. Experiments on the MS COCO dataset
demonstrate that our eBTS achieves state-of-the-art performance compared to the
existing replay schemes.
Authors' comments: 5 pages, 3 figures, Accepted at ICASSP 2024
Yang Zhang, Huilin Pan, Mingying Li, An Wang, Yang Zhou, Hongliang Ren
Despite the successful application of convolutional neural networks (CNNs) in
object detection tasks, their efficiency in detecting faults from freight train
images remains inadequate for implementation in real-world engineering
scenarios. The spatial invariance and pooling layers of conventional CNNs
often cause crucial global information to be neglected, resulting in
localization errors in fault detection tasks for freight trains. To solve
these problems, we design a spatial-wise dynamic
distillation framework based on multi-layer perceptron (MLP) for visual fault
detection of freight trains. We initially present the axial shift strategy,
which allows the MLP-like architecture to overcome the challenge of spatial
invariance and effectively incorporate both local and global cues. We propose a
dynamic distillation method without a pre-trained teacher, including a dynamic
teacher mechanism that can effectively eliminate the semantic discrepancy with
the student model. Such an approach mines more abundant details from
lower-level feature appearances and higher-level label semantics as the extra
supervision signal, which utilizes efficient instance embedding to model the
global spatial and semantic information. In addition, the proposed dynamic
teacher can jointly train with students to further enhance the distillation
efficiency. Extensive experiments executed on six typical fault datasets reveal
that our approach outperforms the current state-of-the-art detectors and
achieves the highest accuracy with real-time detection at a lower computational
cost. The source code will be available at
https://github.com/MVME-HBUT/SDD-FTI-FDet.
Authors' comments: 10 pages, 6 figures
Risab Biswas, Swalpa Kumar Roy, Umapada Pal
Document image enhancement is a fundamental and important stage for attaining
the best performance in any document analysis assignment because there are many
degradation situations that could harm document images, making it more
difficult to recognize and analyze them. In this paper, we propose
T2T-BinFormer, a novel document binarization encoder-decoder
architecture based on a Tokens-to-token vision transformer. Each image is
divided into a set of tokens with a defined length using the ViT model, which
is then applied several times to model the global relationship between the
tokens. However, the conventional tokenization of input data does not
adequately reflect the crucial local structure between adjacent pixels of the
input image, which results in low efficiency. Instead of using a simple ViT and
hard splitting of images for the document image enhancement task, we employed a
progressive tokenization technique to capture this local information from an
image to achieve more effective results. Experiments on various DIBCO and
H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing
CNN and ViT-based state-of-the-art methods. In this research, the primary area
of examination is the application of the proposed architecture to the task of
document binarization. The source code will be made available at
https://github.com/RisabBiswas/T2T-BinFormer.
Authors' comments: arXiv admin note: text overlap with arXiv:2312.03568
Florian Kofler, Hendrik Möller, Josef A. Buchner, Ezequiel de la Rosa, Ivan Ezhov, Marcel Rosier, Isra Mekki, Suprosanna Shit et al.
This paper introduces panoptica, a versatile and performance-optimized
package designed for computing instance-wise segmentation quality metrics from
2D and 3D segmentation maps. panoptica addresses the limitations of existing
metrics and provides a modular framework that complements the original
intersection over union-based panoptic quality with other metrics, such as the
distance metric Average Symmetric Surface Distance. The package is open-source,
implemented in Python, and accompanied by comprehensive documentation and
tutorials. panoptica employs a three-step metrics computation process to cover
diverse use cases. The efficacy of panoptica is demonstrated on various
real-world biomedical datasets, where an instance-wise evaluation is
instrumental for an accurate representation of the underlying clinical task.
Overall, we envision panoptica as a valuable tool facilitating in-depth
evaluation of segmentation methods.
Authors' comments: 15 pages, 6 figures, 3 tables
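For readers unfamiliar with the intersection over union-based panoptic quality mentioned in the abstract above, the following is a minimal sketch of the standard definition, PQ = (sum of IoU over matched pairs) / (TP + FP/2 + FN/2), with a match requiring IoU > 0.5. It does not use or reproduce the panoptica API; instances are represented here as plain sets of pixel coordinates, and all function names are our own.

```python
def iou(a, b):
    """Intersection over union of two instances given as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(pred_instances, ref_instances, threshold=0.5):
    """Standard panoptic quality over two lists of instances.

    A prediction and a reference instance count as a true positive when
    their IoU exceeds the threshold; for non-overlapping instances and a
    threshold of 0.5 or more, such a match is unique.
    """
    matched_pred = set()
    iou_sum, tp = 0.0, 0
    for ref in ref_instances:
        for i, pred in enumerate(pred_instances):
            if i in matched_pred:
                continue
            value = iou(ref, pred)
            if value > threshold:
                matched_pred.add(i)
                iou_sum += value
                tp += 1
                break
    fp = len(pred_instances) - tp   # unmatched predictions
    fn = len(ref_instances) - tp    # unmatched references
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Two reference instances, two predictions: one good match, one miss each way.
ref = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5)}]
pred = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
pq = panoptic_quality(pred, ref)  # 0.75 / (1 + 0.5 + 0.5) = 0.375
```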
Tianhao Peng, Ge Gao, Heming Sun, Fan Zhang, David Bull
In recent years, end-to-end learnt video codecs have demonstrated their potential to compete with conventional coding algorithms in terms of compression efficiency. However, most learning-based video compression models are associated with high computational complexity and latency, in particular at the decoder side, which limits their deployment in practical applications. In this paper, we present a novel model-agnostic pruning scheme based on gradient decay and adaptive layer-wise distillation. Gradient decay enhances parameter exploration during sparsification whilst preventing runaway sparsity and is superior to the standard Straight-Through Estimation. The adaptive layer-wise distillation regulates the sparse training in various stages based on the distortion of intermediate features. This stage-wise design efficiently updates parameters with minimal computational overhead. The proposed approach has been applied to three popular end-to-end learnt video codecs, FVC, DCVC, and DCVC-HEM. Results confirm that our method yields up to 65% reduction in MACs and 2x speed-up with less than 0.3 dB drop in BD-PSNR. Supporting code and supplementary material can be downloaded from: https://jasminepp.github.io/lightweightdvc/
Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, Guannan Zhang
Language model alignment is a cutting-edge technique in large language model training that aligns the model output with the user's intent, e.g., being helpful and harmless. Recent alignment frameworks consist of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios where human feedback is intrinsically point-wise, these methods suffer from information loss or even fail. To fill this gap, we first develop a preference learning method called point-wise DPO to handle point-wise preference data. Further analysis of the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference datasets. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. A new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.
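To make concrete why existing methods need pair-wise data, the sketch below implements the standard pair-wise DPO loss for a single preference pair. This is the well-known pair-wise objective that the abstract above contrasts against, not the point-wise DPO method it proposes. The inputs are sequence log-probabilities under the trained policy and a frozen reference model; the function and argument names are our own.

```python
import math


def dpo_pairwise_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard pair-wise DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and the
    dispreferred response; ref_logp_*: the same under the reference model.
    Both responses are required, which is why a single point-wise label
    cannot be used directly.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated stably
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; it decreases as the policy raises the preferred response's likelihood relative to the dispreferred one.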
Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, Yebin Liu
We present a new approach, termed GPS-Gaussian, for synthesizing novel views
of a character in real time. The proposed method enables 2K-resolution
rendering under a sparse-view camera setting. Unlike the original Gaussian
Splatting or neural implicit rendering methods that necessitate per-subject
optimizations, we introduce Gaussian parameter maps defined on the source views
and directly regress Gaussian Splatting properties for instant novel view
synthesis without any fine-tuning or optimization. To this end, we train our
Gaussian parameter regression module on a large amount of human scan data,
jointly with a depth estimation module to lift 2D parameter maps to 3D space.
The proposed framework is fully differentiable and experiments on several
datasets demonstrate that our method outperforms state-of-the-art methods while
achieving a superior rendering speed.
Authors' comments: Accepted by CVPR 2024. Project page:
https://shunyuanzheng.github.io/GPS-Gaussian
Shuchi Wu, Chuan Ma, Kang Wei, Xiaogang Xu, Ming Ding, Yuwen Qian, Tao Xiang
This paper introduces RDA, a pioneering approach designed to address two
primary deficiencies prevalent in previous endeavors aiming at stealing
pre-trained encoders: (1) suboptimal performances attributed to biased
optimization objectives, and (2) elevated query costs stemming from the
end-to-end paradigm that necessitates querying the target encoder every epoch.
Specifically, we initially Refine the representations of the target encoder for
each training sample, thereby establishing a less biased optimization objective
before the steal-training phase. This is accomplished via a sample-wise
prototype, which consolidates the target encoder's representations for a given
sample's various perspectives. Demanding exponentially fewer queries compared
to the end-to-end approach, prototypes can be instantiated to guide subsequent
query-free training. For more potent efficacy, we develop a multi-relational
extraction loss that trains the surrogate encoder to Discriminate mismatched
embedding-prototype pairs while Aligning those matched ones in terms of both
amplitude and angle. In this way, the trained surrogate encoder achieves
state-of-the-art results across the board in various downstream datasets with
limited queries. Moreover, RDA is shown to be robust to multiple widely-used
defenses.
Authors' comments: 25 pages, 12 figures, 15 tables
Yan Guo, C. Sengupta, T. C. Scott, P. Lagos, Y. Luo
We present resolved GMRT HI observations of the high gas-phase metallicity
dwarf galaxy WISEA J230615.06+143927.9 (z = 0.005) (hereafter J2306) and
investigate whether it could be a Tidal Dwarf Galaxy (TDG) candidate. TDGs are
observed to have higher metallicities than normal dwarfs. J2306 has an unusual
combination of a blue g -- r colour of 0.23 mag, irregular optical morphology
and high metallicity (12 + log(O/H) = 8.68$\pm$0.14), making it an interesting
galaxy to study in more detail. We find J2306 to be an HI-rich galaxy with a
large, extended, unperturbed rotating HI disk. Using our HI data we estimated
its dynamical mass and found the galaxy to be dark matter (DM) dominated within
its HI radius. The quantity of DM, inferred from its dynamical mass, appears to
rule out J2306 as an evolved TDG. A wide area environment search reveals J2306
to be isolated from any larger galaxies which could have been the source of its
high gas metallicity. Additionally, the HI morphology and kinematics of the
galaxy show no indication of a recent merger that would explain the high metallicity.
Further detailed optical spectroscopic observations of J2306 might provide an
answer to how a seemingly ordinary irregular dwarf galaxy achieved such a high
level of metal enrichment.
Authors' comments: 11 pages, 5 figures, Accepted in RAA
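The dark-matter argument in the abstract above rests on a dynamical mass estimated from the HI rotation. A common first-order estimate, assuming a spherically symmetric mass distribution and circular orbits, is M_dyn = V_rot^2 R / G. The sketch below evaluates it in convenient units; the formula choice is our assumption about the general approach, and the input numbers in the usage note are purely illustrative, not measurements of J2306.

```python
# Gravitational constant in kpc (km/s)^2 / Msun
G_KPC = 4.30091e-6


def dynamical_mass(v_rot_kms, radius_kpc):
    """Dynamical mass in solar masses enclosed within radius_kpc,
    assuming circular rotation at v_rot_kms (inclination-corrected)."""
    return v_rot_kms ** 2 * radius_kpc / G_KPC


def dark_matter_fraction(m_dyn, m_baryon):
    """Fraction of the dynamical mass not accounted for by baryons
    (stars plus gas), both masses in the same units."""
    return 1.0 - m_baryon / m_dyn
```

For example, a hypothetical rotation velocity of 50 km/s at an HI radius of 5 kpc gives a dynamical mass of roughly 3 x 10^9 solar masses; a baryonic mass well below that value would indicate a dark-matter-dominated system, which is not expected for a tidal dwarf galaxy.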
Jixuan Leng, Yijiang Li, Haohan Wang
Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically the CLIP model, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module. This module seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.