Shuai Chen, Fanman Meng, Chenhao Wu, Haoran Wei, Runtong Zhang, Qingbo Wu, Linfeng Xu, Hongliang Li
Few-Shot Segmentation (FSS) aims to segment novel classes using only a few
annotated images. Despite considerable progress under pixel-wise support
annotation, current FSS methods still face three issues: the inflexibility of
backbone upgrade without re-training, the inability to uniformly handle various
types of annotations (e.g., scribble, bounding box, mask, and text), and the
difficulty in accommodating different annotation quantities. To address these
issues simultaneously, we propose DiffUp, a novel framework that conceptualizes
the FSS task as a conditional generative problem using a diffusion process. For
the first issue, we introduce a backbone-agnostic feature transformation module
that converts different segmentation cues into unified coarse priors,
facilitating seamless backbone upgrade without re-training. For the second
issue, due to the varying granularity of transformed priors from diverse
annotation types (scribble, bounding box, mask, and text), we conceptualize
these multi-granular transformed priors as analogous to noisy intermediates at
different steps of a diffusion model. This is implemented via a
self-conditioned modulation block coupled with a dual-level quality modulation
branch. For the third issue, we incorporate an uncertainty-aware information
fusion module to harmonize the variability across zero-shot, one-shot, and
many-shot scenarios. Evaluated through rigorous benchmarks, DiffUp
significantly outperforms existing FSS models in terms of flexibility and
accuracy.
Authors' comments: 9 figures
Andrew Jeong
This letter presents KGpose, a novel end-to-end framework for 6D pose estimation of multiple objects. Our approach combines a keypoint-based method with learnable pose regression through a "keypoint-graph", which is a graph representation of the keypoints. KGpose first estimates 3D keypoints for each object using an attentional multi-modal feature fusion of RGB and point cloud features. These keypoints are estimated from each point of the point cloud and converted into a graph representation. The network directly regresses 6D pose parameters for each point through a sequence of keypoint-graph embedding and local graph embedding, which are designed with graph convolutions, followed by rotation and translation heads. The final pose for each object is selected from the candidates of point-wise predictions. The method achieves competitive results on the benchmark dataset, demonstrating the effectiveness of our model. KGpose enables multi-object pose estimation without requiring an extra localization step, offering a unified and efficient solution for understanding geometric contexts in complex scenes for robotic applications.
Haoyu Wang, Tianci Liu, Tuo Zhao, Jing Gao
Pre-trained language models, trained on large-scale corpora, demonstrate strong generalizability across various NLP tasks. Fine-tuning these models for specific tasks typically involves updating all parameters, which is resource-intensive. Parameter-efficient fine-tuning (PEFT) methods, such as the popular LoRA family, introduce low-rank matrices to learn only a few parameters efficiently. However, during inference, the product of these matrices updates all pre-trained parameters, complicating tasks like knowledge editing that require selective updates. We propose a novel PEFT method, which conducts row and column-wise sparse low-rank adaptation (RoseLoRA), to address this challenge. RoseLoRA identifies and updates only the most important parameters for a specific task, maintaining efficiency while preserving other model knowledge. By adding a sparsity constraint on the product of low-rank matrices and converting it to row and column-wise sparsity, we ensure efficient and precise model updates. Our theoretical analysis guarantees the lower bound of the sparsity with respect to the matrix product. Extensive experiments on five benchmarks across twenty datasets demonstrate that RoseLoRA outperforms baselines in both general fine-tuning and knowledge editing tasks.
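The link between sparse factors and a sparse product can be illustrated with a small sketch. It is not the RoseLoRA algorithm: the matrices B and A below are made-up toy values, and the code only checks the elementary fact that BA is a sum of outer products, so its number of non-zeros is bounded by the per-column and per-row counts of the factors.

```python
# Toy check: sparse factors give a sparse low-rank update.
# B (d x r) and A (r x k) are made-up matrices, not taken from the paper.

def matmul(B, A):
    """Plain-Python matrix product of B (d x r) and A (r x k)."""
    r, k = len(A), len(A[0])
    return [[sum(row[t] * A[t][j] for t in range(r)) for j in range(k)] for row in B]

def nnz(values):
    """Count the non-zero entries of a flat iterable."""
    return sum(1 for v in values if v != 0)

# Each column of B and each row of A has only a few non-zeros.
B = [
    [1.0, 0.0],
    [0.0, 0.0],
    [0.5, 0.0],
    [0.0, 2.0],
    [0.0, 0.0],
]
A = [
    [3.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 0.0],
]

delta_w = matmul(B, A)

# BA is a sum of r outer products (column t of B times row t of A), so
# nnz(BA) <= sum_t nnz(B[:, t]) * nnz(A[t, :]).
bound = sum(nnz(row[t] for row in B) * nnz(A[t]) for t in range(len(A)))
actual = nnz(v for row in delta_w for v in row)
total = len(delta_w) * len(delta_w[0])
sparsity = 1 - actual / total  # fraction of entries left untouched
```

In this toy case 5 of the 20 entries of the update are non-zero, which meets the bound exactly; the remaining entries of the pre-trained weight would be left unchanged.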
Zhigao Cai, Xing-Ming Zhao
Automatic segmentation of the fetal brain remains challenging because of the varying health state of fetal development, motion artifacts, and variability across gestational ages, and because existing methods rely on high-quality datasets of healthy fetuses. In this work, we propose a novel cascade network called CasUNext to enhance the accuracy and generalization of fetal brain MRI segmentation. CasUNext incorporates depth-wise separable convolution, attention mechanisms, and a two-step cascade architecture for efficient high-precision segmentation. The first network localizes the fetal brain region, while the second network focuses on detailed segmentation. We evaluate CasUNext on 150 fetal MRI scans between 20 and 36 weeks from two scanners made by Philips and Siemens, including axial, coronal, and sagittal views, and also validate it on a dataset of 50 abnormal fetuses. Results demonstrate that CasUNext achieves improved segmentation performance compared to U-Nets and other state-of-the-art approaches. It obtains an average Dice coefficient of 96.1% and mean intersection over union of 95.9% across diverse scenarios. CasUNext shows promising capabilities for handling the challenges of multi-view fetal MRI and abnormal cases, which could facilitate various quantitative analyses and apply to multi-site data.
Yanshu Wang, Wenyang He, Tong Yang
Large Language Models (LLMs) have significantly advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. However, their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment, particularly in resource-constrained environments like mobile devices and edge computing platforms. Effective compression and quantization techniques are crucial for addressing these issues, reducing memory footprint and computational requirements without significantly compromising performance. Traditional methods that uniformly map parameters to compressed spaces fail to account for the uneven distribution of parameters, leading to substantial accuracy loss. In this work, we propose Athena, a novel algorithm for efficient block-wise post-training quantization of LLMs. Athena leverages Second-Order Matrix Derivative Information to guide the quantization process using the curvature information of the loss landscape. By grouping parameters by columns or rows and iteratively optimizing the quantization process, Athena updates the model parameters and Hessian matrix to achieve significant compression while maintaining high accuracy. This makes Athena a practical solution for deploying LLMs in various settings.
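The cost of mapping all parameters uniformly, and the benefit of grouping them by rows, can be seen in a minimal sketch. It uses plain symmetric round-to-nearest quantization, not Athena's second-order procedure, and the weight values and the 4-bit setting are illustrative assumptions.

```python
# Sketch: per-tensor vs. per-row symmetric round-to-nearest quantization.
# This is plain uniform quantization, not Athena's second-order procedure.

def quantize_dequantize(values, scale, bits=4):
    """Round values to a signed integer grid with step `scale`, then map back."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy weight matrix whose rows live on very different scales.
W = [
    [0.01, -0.02, 0.03, 0.00],
    [10.0, -8.0, 5.0, -10.0],
]
bits = 4
qmax = 2 ** (bits - 1) - 1
flat = [v for row in W for v in row]

# One scale for the whole tensor: the small-magnitude row collapses to zero.
global_scale = max(abs(v) for v in flat) / qmax
per_tensor = quantize_dequantize(flat, global_scale, bits)

# One scale per row: each group uses its own range.
per_row = []
for row in W:
    row_scale = max(abs(v) for v in row) / qmax
    per_row.extend(quantize_dequantize(row, row_scale, bits))

err_per_tensor = mse(flat, per_tensor)
err_per_row = mse(flat, per_row)
small_row_err_tensor = mse(W[0], per_tensor[:4])
small_row_err_row = mse(W[0], per_row[:4])
```

With a single scale, every entry of the small-magnitude row rounds to zero, while a per-row scale keeps them distinguishable; the large-magnitude row is quantized identically in both cases.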
Zong-Wei Hong, Yen-Yang Hung, Chu-Song Chen
In this work, we introduce a novel method for calculating the 6DoF pose of an
object using a single RGB-D image. Unlike existing methods that either directly
predict objects' poses or rely on sparse keypoints for pose recovery, our
approach addresses this challenging task using dense correspondence, i.e., we
regress the object coordinates for each visible pixel. Our method leverages
existing object detection methods. We incorporate a re-projection mechanism to
adjust the camera's intrinsic matrix to accommodate cropping in RGB-D images.
Moreover, we transform the 3D object coordinates into a residual
representation, which can effectively reduce the output space and yield
superior performance. We conducted extensive experiments to validate the
efficacy of our approach for 6D pose estimation. Our approach outperforms most
previous methods, especially in occlusion scenarios, and demonstrates notable
improvements over the state-of-the-art methods. Our code is available at
https://github.com/AI-Application-and-Integration-Lab/RDPN6D.
Authors' comments: Accepted by CVPR Workshop DLGC, 2024
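Adjusting the intrinsic matrix for a cropped and resized image patch can be sketched with the standard pinhole relations. This is a generic illustration rather than the paper's re-projection mechanism: the function names, the intrinsics, and the crop values are hypothetical, and half-pixel offset conventions are ignored.

```python
# Sketch: updating pinhole intrinsics after cropping and resizing an image.
# All numbers below are illustrative, not taken from the paper.

def adjust_intrinsics(fx, fy, cx, cy, x0, y0, scale_x, scale_y):
    """Return (fx, fy, cx, cy) for an image cropped at top-left (x0, y0)
    and then resized by (scale_x, scale_y). Ignores half-pixel offsets."""
    return (fx * scale_x, fy * scale_y, (cx - x0) * scale_x, (cy - y0) * scale_y)

def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point (X, Y, Z) to pixel coordinates."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)

K = (600.0, 600.0, 320.0, 240.0)           # original fx, fy, cx, cy
crop_x0, crop_y0, crop_size = 280, 100, 128
out_size = 256                              # the crop is resized to 256 x 256
s = out_size / crop_size

K_crop = adjust_intrinsics(*K, crop_x0, crop_y0, s, s)

# Consistency check: projecting with K and then cropping/resizing the pixel
# must agree with projecting directly with the adjusted intrinsics.
p = (0.05, -0.05, 1.0)
u, v = project(p, *K)
u_expected, v_expected = (u - crop_x0) * s, (v - crop_y0) * s
u_crop, v_crop = project(p, *K_crop)
```

The focal lengths scale with the resize factor, and the principal point is shifted by the crop origin before scaling, so a 3D point lands on the same patch pixel either way.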
Johan Öfverstedt, Elin Lundström, Göran Bergström, Joel Kullberg, Håkan Ahlström
The study of associations between an individual's age and imaging and
non-imaging data is an active research area that attempts to aid understanding
of the effects and patterns of aging. In this work we have conducted a
supervoxel-wise association study between both volumetric and tissue density
features in coronary computed tomography angiograms and the chronological age
of a subject, to understand the localized changes in morphology and tissue
density with age. To enable a supervoxel-wise study of volume and tissue
density, we developed a novel method based on image segmentation, inter-subject
image registration, and robust supervoxel-based correlation analysis, to
achieve a statistical association study between the images and age. We evaluate
the registration methodology in terms of the Dice coefficient for the heart
chambers and myocardium, and the inverse consistency of the transformations,
showing that the method works well in most cases with high overlap and inverse
consistency. In a sex-stratified study conducted on a subset of $n=1388$ images
from the SCAPIS study, the supervoxel-wise analysis was able to find localized
associations with age outside of the commonly segmented and analyzed
sub-regions, and several substantial differences between the sexes in the
association of age and volume.
Authors' comments: 35 pages
Tao Feng, Lizhen Qu, Zhuang Li, Haolan Zhan, Yuncheng Hua, Gholamreza Haffari
Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains. This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training. We propose IMO: Invariant features Masks for Out-of-Distribution text classification, to achieve OOD generalization by learning invariant features. During training, IMO learns sparse mask layers that remove features irrelevant for prediction, so that the remaining features stay invariant. Additionally, IMO has an attention module at the token level to focus on tokens that are useful for prediction. Our comprehensive experiments show that IMO substantially outperforms strong baselines in terms of various evaluation metrics and settings.
Yiqiang Zhu, Lida Zhu, Yiheng Lim, Shuichi Makita, Yu Guo, Yoshiaki Yasuno
We demonstrate a method that reduces the noise caused by multi-scattering (MS) photons in an in vivo optical coherence tomography image. This method combines a specially designed image acquisition (i.e., optical coherence tomography scan) scheme and subsequent complex signal processing. For the acquisition, multiple cross-sectional images (frames) are sequentially acquired while the depth position of the focus is altered for each frame by an electrically tunable lens. In the signal processing, the frames are numerically defocus-corrected and complex averaged. Because of the inconsistency in the MS-photon trajectories among the different electrically tunable lens-induced defocus, this averaging reduces the MS signal. This method was validated using a scattering phantom and in vivo unanesthetized small fish samples, and was found to reduce MS noise even for unanesthetized in vivo measurement.
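Why complex averaging suppresses a signal whose phase is inconsistent between frames can be shown with a toy phasor model. The model is an assumption for illustration only: it treats the MS contribution as a unit phasor with an independent random phase in each frame and the remaining signal as phase-stable, which is a simplification of the physics described above.

```python
# Toy model: complex averaging keeps phase-stable signal, suppresses the rest.
# Assumption (not from the paper): the multiply scattered contribution has an
# independent random phase in every frame; the phase-stable one does not.
import cmath
import math
import random

rng = random.Random(0)
n_frames = 1000

# Phase-stable component: the same unit phasor in every frame.
stable = [cmath.exp(1j * 0.3) for _ in range(n_frames)]
# Phase-unstable component: unit amplitude, random phase per frame.
unstable = [cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi)) for _ in range(n_frames)]

def complex_average_magnitude(frames):
    """Magnitude of the complex (coherent) mean over frames."""
    return abs(sum(frames) / len(frames))

stable_mag = complex_average_magnitude(stable)      # stays close to 1
unstable_mag = complex_average_magnitude(unstable)  # shrinks roughly as 1/sqrt(n_frames)
```

Averaging magnitudes instead of complex values would leave both components at 1, which is why the averaging has to be done on the complex signal.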
Likun Li, Haoqi Zeng, Changpeng Yang, Haozhe Jia, Di Xu
The objective of personalization and stylization in text-to-image (T2I) generation is to instruct a pre-trained diffusion model to analyze new concepts introduced by users and incorporate them into expected styles. Recently, parameter-efficient fine-tuning (PEFT) approaches have been widely adopted to address this task and have greatly propelled the development of this field. Despite their popularity, existing efficient fine-tuning methods still struggle to achieve effective personalization and stylization in T2I generation. To address this issue, we propose block-wise Low-Rank Adaptation (LoRA) to perform fine-grained fine-tuning for different blocks of Stable Diffusion (SD), which can generate images that are faithful to the input prompts and target identity and that have the desired style. Extensive experiments demonstrate the effectiveness of the proposed method.
Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, Ievgen Redko
Transformer-based architectures have achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. We further identify the attention of transformers as being responsible for this low generalization capacity. Building upon this insight, we propose SAMformer, a shallow lightweight transformer model that successfully escapes bad local minima when optimized with sharpness-aware optimization. We empirically demonstrate that this result extends to all commonly used real-world multivariate time series datasets. In particular, SAMformer surpasses the current state-of-the-art model TSMixer by 14.33% on average, while having ~4 times fewer parameters. The code is available at https://github.com/romilbert/samformer.
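For readers unfamiliar with sharpness-aware optimization, the following is a minimal sketch of the generic two-step sharpness-aware minimization (SAM) update on a toy quadratic loss. It is not the SAMformer training code: the loss, the learning rate `lr`, and the neighborhood radius `rho` are illustrative assumptions.

```python
# Minimal sketch of a sharpness-aware minimization (SAM) step on a toy loss.
# The quadratic loss, rho and lr are illustrative choices, not the paper's setup.
import math

def loss(w):
    return 0.5 * (4.0 * w[0] ** 2 + 0.25 * w[1] ** 2)

def grad(w):
    return [4.0 * w[0], 0.25 * w[1]]

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # 1) Move to the approximate worst-case point within an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # 2) Take the descent step from w using the gradient evaluated at w_adv.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
```

Each update costs two gradient evaluations: one to find the perturbed point and one to compute the descent direction there.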
Keitaro Sakamoto, Issei Sato
End-to-end (E2E) training, optimizing the entire model through error backpropagation, fundamentally supports the advancements of deep learning. Despite its high performance, E2E training faces the problems of memory consumption, parallel computing, and discrepancy with the functionalities of the actual brain. Various alternative methods have been proposed to overcome these difficulties; however, none can yet match the performance of E2E training, thereby falling short in practicality. Furthermore, there is no deep understanding regarding differences in the trained model properties beyond the performance gap. In this paper, we reconsider why E2E training demonstrates a superior performance through a comparison with layer-wise training, a non-E2E method that locally sets errors. On the basis of the observation that E2E training has an advantage in propagating input information, we analyze the information plane dynamics of intermediate representations based on the Hilbert-Schmidt independence criterion (HSIC). The results of our normalized HSIC value analysis reveal that E2E training exhibits different information dynamics across layers, in addition to propagating information efficiently. Furthermore, we show that this layer-role differentiation leads to the final representation following the information bottleneck principle. This suggests the need to consider the cooperative interactions between layers, not just the final layer, when analyzing the information bottleneck of deep learning.
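As a point of reference for the HSIC-based analysis, here is a small sketch of the standard biased HSIC estimator, tr(KHLH)/(n-1)^2, together with one common normalization, HSIC(X,Y)/sqrt(HSIC(X,X) HSIC(Y,Y)). The Gaussian kernel, its bandwidth, and the 1-D toy data are assumptions, and the paper's exact normalized HSIC definition may differ.

```python
# Sketch of a (biased) HSIC estimate and a normalized form on 1-D toy data.
# Gaussian kernel, bandwidth and data are illustrative, not the paper's settings.
import math

def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs] for a in xs]

def center(K):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased estimator tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc, Lc = center(gaussian_gram(xs, sigma)), center(gaussian_gram(ys, sigma))
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2

def normalized_hsic(xs, ys, sigma=1.0):
    """HSIC(X, Y) scaled by the self-dependence of X and of Y."""
    return hsic(xs, ys, sigma) / math.sqrt(hsic(xs, xs, sigma) * hsic(ys, ys, sigma))

x = [0.1 * i for i in range(30)]
y = [v ** 2 for v in x]              # a deterministic function of x

n_self = normalized_hsic(x, x)       # equals 1 up to rounding
n_dep = normalized_hsic(x, y)        # lies in [0, 1] by the Cauchy-Schwarz inequality
```

In a layer-wise analysis, `x` and `y` would be replaced by the inputs (or labels) and the intermediate representations of a given layer.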
Jongmin Yu, Chen Bene Chi, Sebastiano Fichera, Paolo Paoletti, Devansh Mehta, Shan Luo
Road pavement detection and segmentation are critical for developing
autonomous road repair systems. However, developing an instance segmentation
method that simultaneously performs multi-class defect detection and
segmentation is challenging due to the textural simplicity of road pavement
images, the diversity of defect geometries, and the morphological ambiguity
between classes. We propose a novel end-to-end method for multi-class road
defect detection and segmentation. The proposed method comprises multiple
spatial and channel-wise attention blocks that are able to learn global
representations across spatial and channel-wise dimensions. Through these
attention blocks, more globally generalised representations of morphological
information (spatial characteristics) of road defects and colour and depth
information of images can be learned. To demonstrate the effectiveness of our
framework, we conducted various ablation studies and comparisons with prior
methods on a newly collected dataset annotated with nine road defect classes.
The experiments show that our proposed method outperforms existing
state-of-the-art methods for multi-class road defect detection and
segmentation.
Authors' comments: Accepted to ICRA 2024
Dong Yang, Tomoki Koriyama, Yuki Saito
Developing Text-to-Speech (TTS) systems that can synthesize natural breath is essential for human-like voice agents but requires extensive manual annotation of breath positions in training data. To this end, we propose a self-training method for training a breath detection model that can automatically detect breath positions in speech. Our method trains the model using a large speech corpus and involves: 1) annotation of limited breath sounds utilizing a rule-based approach, and 2) iterative augmentation of these annotations through pseudo-labeling based on the model's predictions. Our detection model employs Conformer blocks with down-/up-sampling layers, enabling accurate frame-wise breath detection. We investigate its effectiveness in multi-speaker TTS using text transcripts with detected breath marks. The results indicate that using our proposed model for breath detection and breath mark insertion synthesizes breath-contained speech more naturally than a baseline model.
Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima
Self-supervised learning (SSL) has attracted increased attention for learning
meaningful speech representations. Speech SSL models, such as WavLM, employ
masked prediction training to encode general-purpose representations. In
contrast, speaker SSL models, exemplified by DINO-based models, adopt
utterance-level training objectives primarily for speaker representation.
Understanding how these models represent information is essential for refining
model efficiency and effectiveness. Unlike the various analyses of speech SSL,
there has been limited investigation into what information speaker SSL captures
and how its representation differs from speech SSL or other fully-supervised
speaker models. This paper addresses these fundamental questions. We explore
the capacity to capture various speech properties by applying SUPERB evaluation
probing tasks to speech and speaker SSL models. We also examine which layers
are predominantly utilized for each task to identify differences in how speech
is represented. Furthermore, we conduct direct comparisons to measure the
similarities between layers within and across models. Our analysis unveils that
1) the capacity to represent content information is somewhat unrelated to
enhanced speaker representation, 2) specific layers of speech SSL models would
be partly specialized in capturing linguistic information, and 3) speaker SSL
models tend to disregard linguistic information but exhibit more sophisticated
speaker representation.
Authors' comments: Accepted at ICASSP 2024
Ali Tofik, Roy Partha Pratim
In this paper, we introduce Fast&Focused-Net, a novel deep neural network architecture tailored for efficiently encoding small objects into fixed-length feature vectors. Contrary to conventional Convolutional Neural Networks (CNNs), Fast&Focused-Net employs a series of our newly proposed Volume-wise Dot Product (VDP) layers, designed to address several inherent limitations of CNNs. Specifically, CNNs often exhibit a smaller effective receptive field than their theoretical counterparts, limiting their vision span. Additionally, the initial layers in CNNs produce low-dimensional feature vectors, presenting a bottleneck for subsequent learning. Lastly, the computational overhead of CNNs, particularly in capturing diverse image regions by parameter sharing, is significantly high. The VDP layer, at the heart of Fast&Focused-Net, aims to remedy these issues by efficiently covering the entire image patch information with reduced computational demand. Experimental results demonstrate the prowess of Fast&Focused-Net in a variety of applications. For small object classification tasks, our network outperformed state-of-the-art methods on datasets such as CIFAR-10, CIFAR-100, STL-10, SVHN-Cropped, and Fashion-MNIST. In the context of larger image classification, when combined with a transformer encoder (ViT), Fast&Focused-Net produced competitive results for OpenImages V6, ImageNet-1K, and Places365 datasets. Moreover, the same combination showcased unparalleled performance in text recognition tasks across SVT, IC15, SVTP, and HOST datasets. This paper presents the architecture, the underlying motivation, and extensive empirical evidence suggesting that Fast&Focused-Net is a promising direction for efficient and focused deep learning.
Mulomba Mukendi Christian, Yun Seon Kim, Hyebong Choi, Jaeyoung Lee, SongHee You
Accurate prediction of wind speed and power is vital for enhancing the efficiency of wind energy systems. Numerous solutions have been implemented to date, demonstrating their potential to improve forecasting. Among these, deep learning is perceived as a revolutionary approach in the field. However, despite the effectiveness of these methods, the noise present in the collected data remains a significant challenge. This noise can diminish the performance of these algorithms, leading to inaccurate predictions. In response, this study explores a novel feature engineering approach that alters the data input shape in both Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) and Autoregressive models for various forecasting horizons. The results reveal substantial enhancements in model resilience against noise resulting from step increases in data. The approach was able to achieve 83% accuracy in predicting unseen data up to the 24th step. Furthermore, this method consistently provides high accuracy for short, mid, and long-term forecasts, outperforming individual models. These findings pave the way for further research on noise reduction strategies at different forecasting horizons through shape-wise feature engineering.
Takuya Kurihana, Kyongmin Yeo, Daniela Szwarcman, Bruce Elmegreen, Karthik Mukkavilli, Johannes Schmude, Levente Klein
To mitigate global warming, greenhouse gas sources need to be resolved at a
high spatial resolution and monitored in time to ensure the reduction and
ultimately the elimination of the pollution source. However, the computational
complexity of resolving high-resolution wind fields has made it impractical to
test different time lengths and model configurations in simulation. This study
presents a preliminary development of a physics-informed super-resolution (SR)
generative adversarial network (GAN) that super-resolves the three-dimensional
(3D) low-resolution wind fields by a factor of 9. We develop a pixel-wise
self-attention (PWA) module that learns 3D weather dynamics via a
self-attention computation followed by a 2D convolution. We also employ a loss
term that regularizes the self-attention map during pretraining, capturing the
vertical convection process from input wind data. The new PWA SR-GAN produces
high-fidelity super-resolved 3D wind data, learns wind structure in the
high-frequency domain, and reduces the computational cost of a high-resolution
wind simulation by a factor of 89.7.
Authors' comments: 7 pages, 4 figures, NeurIPS 2023 Workshop: Tackling Climate Change with Machine Learning
G. Camacho-Ciurana, P. Lee, N. Arsenov, A. Kovács, I. Szapudi, I. Csabai
The cross-correlation of cosmic voids with the lensing convergence ($\kappa$)
map of the CMB fluctuations offers a powerful tool to refine our understanding
of the dark sector in the consensus cosmological model. Our principal aim is to
compare the lensing signature of our galaxy data set with simulations based on
the concordance model and characterize the results with an $A_{\kappa}$
consistency parameter. In particular, our measurements contribute to the
understanding of the "lensing-is-low" tension of the $\Lambda$CDM model. We
selected luminous red galaxies from the WISE-Pan-STARRS data set, covering an
extended 14,200 deg$^2$ sky area that offers a more precise measurement
compared to previous studies. We created 2D and 3D void catalogs to
cross-correlate their locations with the Planck lensing map and studied their
average imprint signal using a stacking methodology. Applying the same
procedure, we also generated a mock catalog from the WebSky simulation for
comparison. The 2D void analysis revealed good agreement with the standard
cosmological model with $A_{\kappa}\approx1.06 \pm 0.08$, i.e. $S/N=13.3$,
showing a higher $S/N$ than previous studies using voids detected in the Dark
Energy Survey data set. The 3D void analysis exhibited a lower $S/N$ and
demonstrated worse agreement with our mock catalog than the 2D voids. These
deviations might be attributed to limitations in the mock catalog, such as
imperfections in the LRG selection, as well as a potential asymmetry between
the North and South patches of the WISE-Pan-STARRS data set in terms of data
quality. Overall, we present a significant detection of a CMB lensing signal
associated with cosmic voids, largely consistent with the concordance model.
Future analyses using even larger data sets also hold great promise of further
sharpening these results, given their complementary nature to large-scale
structure analyses.
Authors' comments: 10 pages, 6 figures, submitted to A&A
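As a quick consistency check on the quoted 2D result, one can assume that the signal-to-noise ratio is the best-fit amplitude divided by its $1\sigma$ uncertainty (the abstract does not state the definition, so this is an assumption). With the rounded values from the abstract:

```latex
\[
  \frac{S}{N} \;=\; \frac{A_{\kappa}}{\sigma_{A_{\kappa}}}
  \;\approx\; \frac{1.06}{0.08} \;=\; 13.25 ,
  \qquad
  \frac{A_{\kappa} - 1}{\sigma_{A_{\kappa}}}
  \;\approx\; \frac{0.06}{0.08} \;=\; 0.75 .
\]
```

The first ratio is consistent with the quoted $S/N=13.3$ given that both inputs are rounded, and the second shows that the measured amplitude lies within $1\sigma$ of the concordance value $A_{\kappa}=1$.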
Hongda Wu, Ping Wang, C V Aswartha Narayana
Federated Learning (FL) enables many resource-limited devices to train a model collaboratively without data sharing. However, many existing works focus on model-homogeneous FL, where the global and local models are the same size, ignoring the inherently heterogeneous computational capabilities of different devices and restricting resource-constrained devices from contributing to FL. In this paper, we consider model-heterogeneous FL and propose Federated Partial Model Training (FedPMT), where devices with smaller computational capabilities work on partial models (subsets of the global model) and contribute to the global model. Different from Dropout-based partial model generation, which removes neurons in hidden layers at random, model training in FedPMT is achieved from the back-propagation perspective. As such, all devices in FedPMT prioritize the most crucial parts of the global model. Theoretical analysis shows that the proposed partial model training design has a similar convergence rate to the widely adopted Federated Averaging (FedAvg) algorithm, $\mathcal{O}(1/T)$, with the sub-optimality gap enlarged by a constant factor related to the model splitting design in FedPMT. Empirical results show that FedPMT significantly outperforms the existing benchmark FedDrop. Meanwhile, compared to the popular model-homogeneous benchmark, FedAvg, FedPMT reaches the learning target in a shorter completion time, thus achieving a better trade-off between learning accuracy and completion time.