Yiqiang Zhu, Lida Zhu, Yiheng Lim, Shuichi Makita, Yu Guo, Yoshiaki Yasuno
We demonstrate a method that reduces the noise caused by multi-scattering (MS) photons in an in vivo optical coherence tomography image. This method combines a specially designed image acquisition (i.e., optical coherence tomography scan) scheme and subsequent complex signal processing. For the acquisition, multiple cross-sectional images (frames) are sequentially acquired while the depth position of the focus is altered for each frame by an electrically tunable lens. In the signal processing, the frames are numerically defocus-corrected and complex-averaged. Because the MS-photon trajectories are inconsistent across the different defocus states induced by the electrically tunable lens, this averaging reduces the MS signal. This method was validated using a scattering phantom and in vivo unanesthetized small fish samples, and was found to reduce MS noise even for unanesthetized in vivo measurement.
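The effect of complex averaging can be illustrated with a toy model. The sketch below is not the authors' pipeline: it assumes, purely for illustration, that after defocus correction the single-scattering term has the same phase in every frame while the MS term has an independent random phase per frame. All names (`complex_average`, `magnitude_average`, `N_FRAMES`, `N_PIXELS`) are hypothetical.

```python
import cmath
import random


def complex_average(frames):
    """Average complex-valued frames pixel by pixel, keeping the phase."""
    n = len(frames)
    return [sum(values) / n for values in zip(*frames)]


def magnitude_average(frames):
    """Average only the magnitudes, discarding the phase."""
    n = len(frames)
    return [sum(abs(v) for v in values) / n for values in zip(*frames)]


random.seed(0)
N_FRAMES, N_PIXELS = 64, 200  # hypothetical sizes

# Assumed toy model: phase-stable single-scattering term of unit amplitude.
single = [cmath.rect(1.0, 0.3)] * N_PIXELS

# Each frame adds an MS term of unit amplitude with a random phase.
frames = [
    [s + cmath.rect(1.0, random.uniform(-cmath.pi, cmath.pi)) for s in single]
    for _ in range(N_FRAMES)
]

coherent = complex_average(frames)
incoherent = magnitude_average(frames)

# Mean residual MS amplitude left after complex averaging.
ms_residual = sum(abs(c - s) for c, s in zip(coherent, single)) / N_PIXELS
# Mean amplitude after magnitude-only averaging (MS term not suppressed).
mean_incoherent = sum(incoherent) / N_PIXELS

print(f"residual MS amplitude after complex averaging: {ms_residual:.3f}")
print(f"mean amplitude after magnitude averaging:      {mean_incoherent:.3f}")
```

Under these assumptions the random-phase term shrinks roughly as one over the square root of the number of frames, whereas magnitude-only averaging keeps it.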
Likun Li, Haoqi Zeng, Changpeng Yang, Haozhe Jia, Di Xu
The objective of personalization and stylization in text-to-image (T2I) generation is to instruct a pre-trained diffusion model to analyze new concepts introduced by users and incorporate them into expected styles. Recently, parameter-efficient fine-tuning (PEFT) approaches have been widely adopted to address this task and have greatly propelled the development of this field. Despite their popularity, existing efficient fine-tuning methods still struggle to achieve effective personalization and stylization in T2I generation. To address this issue, we propose block-wise Low-Rank Adaptation (LoRA) to perform fine-grained fine-tuning for different blocks of Stable Diffusion (SD), which can generate images that are faithful to the input prompts and target identity while also exhibiting the desired style. Extensive experiments demonstrate the effectiveness of the proposed method.
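For readers unfamiliar with LoRA, the following is a minimal sketch of the generic low-rank update, W = W0 + (alpha / r) * B A, applied to a single weight matrix. It is not the paper's implementation: the abstract does not state how adapters or ranks are assigned to each block, and the helper names (`matmul`, `lora_merged_weight`) and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merged_weight(w0, b, a, alpha):
    """Return W0 + (alpha / r) * B @ A, where A is r x k and B is d x r."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


# Frozen pre-trained weight (d = 2, k = 3); toy values only.
w0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 0.0]]
a = [[0.5, -1.0, 0.0]]        # trainable, r x k with r = 1
b_zero = [[0.0], [0.0]]       # B starts at zero, so training starts from W0
b_trained = [[2.0], [1.0]]    # hypothetical values after fine-tuning

unchanged = lora_merged_weight(w0, b_zero, a, alpha=1.0)
adapted = lora_merged_weight(w0, b_trained, a, alpha=1.0)

print(unchanged)  # identical to w0
print(adapted)
```

Only A and B are trained, which is r * (d + k) parameters per adapted matrix instead of d * k; a block-wise scheme would presumably choose which blocks receive such adapters, but that choice is an assumption here.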
Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, Ievgen Redko
Transformer-based architectures achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. We further identify the attention of transformers as being responsible for this low generalization capacity. Building upon this insight, we propose a shallow lightweight transformer model that successfully escapes bad local minima when optimized with sharpness-aware optimization. We empirically demonstrate that this result extends to all commonly used real-world multivariate time series datasets. In particular, SAMformer surpasses the current state-of-the-art model TSMixer by 14.33% on average, while having ~4 times fewer parameters. The code is available at https://github.com/romilbert/samformer.
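The abstract does not give the optimizer details, so the sketch below shows only the standard two-step sharpness-aware minimization (SAM) update on a toy quadratic loss: first perturb the weights along the normalized gradient by a radius `rho`, then descend using the gradient taken at the perturbed point. The function names, the loss, and the hyperparameter values are illustrative assumptions, not settings from the paper.

```python
import math


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update: ascend to w + eps, then descend with that gradient."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Worst-case perturbation within an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    # Apply the perturbed gradient to the original weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def loss(w):
    """Toy ill-conditioned quadratic (hypothetical stand-in for a model loss)."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)


def grad(w):
    return [10.0 * w[0], w[1]]


w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(200):
    w = sam_step(w, grad)
final_loss = loss(w)

print(f"loss: {initial_loss:.3f} -> {final_loss:.5f}")
```

With a fixed `rho`, the iterate settles in a small neighborhood of the minimum rather than converging exactly, which is expected for this form of the update.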
Keitaro Sakamoto, Issei Sato
End-to-end (E2E) training, optimizing the entire model through error backpropagation, fundamentally supports the advancements of deep learning. Despite its high performance, E2E training faces the problems of memory consumption, parallel computing, and discrepancy with the functionalities of the actual brain. Various alternative methods have been proposed to overcome these difficulties; however, none can yet match the performance of E2E training, and they therefore fall short in practicality. Furthermore, there is no deep understanding of how the properties of the trained models differ beyond the performance gap. In this paper, we reconsider why E2E training demonstrates superior performance through a comparison with layer-wise training, a non-E2E method that sets errors locally. On the basis of the observation that E2E training has an advantage in propagating input information, we analyze the information plane dynamics of intermediate representations based on the Hilbert-Schmidt independence criterion (HSIC). The results of our normalized HSIC value analysis reveal that E2E training exhibits different information dynamics across layers, in addition to efficient information propagation. Furthermore, we show that this layer-role differentiation leads to the final representation following the information bottleneck principle. This suggests the need to consider the cooperative interactions between layers, not just the final layer, when analyzing the information bottleneck of deep learning.
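As background on the dependence measure, here is a minimal sketch of the standard biased empirical HSIC estimator, tr(K H L H) / (n - 1)^2, with Gaussian kernels on scalar samples. The normalization used below (dividing by the square root of the two self-HSIC values, as in kernel alignment) is one common choice and is an assumption; the paper's normalized HSIC may be defined differently. All function names are hypothetical.

```python
import math
import random


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]


def center(k):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    col_mean = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - col_mean[j] + total for j in range(n)]
            for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    kc = center(gaussian_gram(xs, sigma))
    lc = center(gaussian_gram(ys, sigma))
    trace = sum(kc[i][j] * lc[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


def normalized_hsic(xs, ys, sigma=1.0):
    """Assumed normalization: HSIC(X, Y) / sqrt(HSIC(X, X) * HSIC(Y, Y))."""
    return hsic(xs, ys, sigma) / math.sqrt(hsic(xs, xs, sigma) * hsic(ys, ys, sigma))


random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(60)]
dependent = [x + 0.1 * random.gauss(0.0, 1.0) for x in xs]
independent = [random.gauss(0.0, 1.0) for _ in xs]

score_self = normalized_hsic(xs, xs)
score_dep = normalized_hsic(xs, dependent)
score_indep = normalized_hsic(xs, independent)
print(score_self, score_dep, score_indep)
```

In an information-plane analysis, such scores would be computed between a layer's representation and the input, and between the representation and the labels.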
Jongmin Yu, Chen Bene Chi, Sebastiano Fichera, Paolo Paoletti, Devansh Mehta, Shan Luo
Road pavement detection and segmentation are critical for developing
autonomous road repair systems. However, developing an instance segmentation
method that simultaneously performs multi-class defect detection and
segmentation is challenging due to the textural simplicity of road pavement
images, the diversity of defect geometries, and the morphological ambiguity
between classes. We propose a novel end-to-end method for multi-class road
defect detection and segmentation. The proposed method comprises multiple
spatial and channel-wise attention blocks able to learn global
representations across spatial and channel-wise dimensions. Through these
attention blocks, more globally generalised representations of morphological
information (spatial characteristics) of road defects and colour and depth
information of images can be learned. To demonstrate the effectiveness of our
framework, we conducted various ablation studies and comparisons with prior
methods on a newly collected dataset annotated with nine road defect classes.
The experiments show that our proposed method outperforms existing
state-of-the-art methods for multi-class road defect detection and
segmentation.
Authors' comments: Accepted to ICRA 2024
Dong Yang, Tomoki Koriyama, Yuki Saito
Developing Text-to-Speech (TTS) systems that can synthesize natural breath is essential for human-like voice agents but requires extensive manual annotation of breath positions in training data. To this end, we propose a self-training method for training a breath detection model that can automatically detect breath positions in speech. Our method trains the model using a large speech corpus and involves: 1) annotation of limited breath sounds utilizing a rule-based approach, and 2) iterative augmentation of these annotations through pseudo-labeling based on the model's predictions. Our detection model employs Conformer blocks with down-/up-sampling layers, enabling accurate frame-wise breath detection. We investigate its effectiveness in multi-speaker TTS using text transcripts with detected breath marks. The results indicate that using our proposed model for breath detection and breath mark insertion synthesizes breath-contained speech more naturally than a baseline model.
Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima
Self-supervised learning (SSL) has attracted increased attention for learning
meaningful speech representations. Speech SSL models, such as WavLM, employ
masked prediction training to encode general-purpose representations. In
contrast, speaker SSL models, exemplified by DINO-based models, adopt
utterance-level training objectives primarily for speaker representation.
Understanding how these models represent information is essential for refining
model efficiency and effectiveness. Unlike the various analyses of speech SSL,
there has been limited investigation into what information speaker SSL captures
and how its representation differs from speech SSL or other fully-supervised
speaker models. This paper addresses these fundamental questions. We explore
the capacity to capture various speech properties by applying SUPERB evaluation
probing tasks to speech and speaker SSL models. We also examine which layers
are predominantly utilized for each task to identify differences in how speech
is represented. Furthermore, we conduct direct comparisons to measure the
similarities between layers within and across models. Our analysis unveils that
1) the capacity to represent content information is somewhat unrelated to
enhanced speaker representation, 2) specific layers of speech SSL models would
be partly specialized in capturing linguistic information, and 3) speaker SSL
models tend to disregard linguistic information but exhibit more sophisticated
speaker representation.
Authors' comments: Accepted at ICASSP 2024
Ali Tofik, Roy Partha Pratim
In this paper, we introduce Fast&Focused-Net, a novel deep neural network architecture tailored for efficiently encoding small objects into fixed-length feature vectors. Contrary to conventional Convolutional Neural Networks (CNNs), Fast&Focused-Net employs a series of our newly proposed Volume-wise Dot Product (VDP) layers, designed to address several inherent limitations of CNNs. Specifically, CNNs often exhibit a smaller effective receptive field than their theoretical counterparts, limiting their vision span. Additionally, the initial layers in CNNs produce low-dimensional feature vectors, presenting a bottleneck for subsequent learning. Lastly, the computational overhead of CNNs, particularly in capturing diverse image regions by parameter sharing, is significantly high. The VDP layer, at the heart of Fast&Focused-Net, aims to remedy these issues by efficiently covering the entire image patch information with reduced computational demand. Experimental results demonstrate the prowess of Fast&Focused-Net in a variety of applications. For small object classification tasks, our network outperformed state-of-the-art methods on datasets such as CIFAR-10, CIFAR-100, STL-10, SVHN-Cropped, and Fashion-MNIST. In the context of larger image classification, when combined with a transformer encoder (ViT), Fast&Focused-Net produced competitive results for OpenImages V6, ImageNet-1K, and Places365 datasets. Moreover, the same combination showcased unparalleled performance in text recognition tasks across SVT, IC15, SVTP, and HOST datasets. This paper presents the architecture, the underlying motivation, and extensive empirical evidence suggesting that Fast&Focused-Net is a promising direction for efficient and focused deep learning.
Mulomba Mukendi Christian, Yun Seon Kim, Hyebong Choi, Jaeyoung Lee, SongHee You
Accurate prediction of wind speed and power is vital for enhancing the efficiency of wind energy systems. Numerous solutions have been implemented to date, demonstrating their potential to improve forecasting. Among these, deep learning is perceived as a revolutionary approach in the field. However, despite the effectiveness of these methods, the noise present in the collected data remains a significant challenge. This noise can diminish the performance of these algorithms, leading to inaccurate predictions. In response, this study explores a novel feature engineering approach that alters the data input shape in both Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) and Autoregressive models for various forecasting horizons. The results reveal substantial enhancements in model resilience against noise resulting from step increases in data. The approach achieved 83% accuracy in predicting unseen data up to the 24th step. Furthermore, this method consistently provides high accuracy for short-, mid-, and long-term forecasts, outperforming individual models. These findings pave the way for further research on noise reduction strategies at different forecasting horizons through shape-wise feature engineering.
Takuya Kurihana, Kyongmin Yeo, Daniela Szwarcman, Bruce Elmegreen, Karthik Mukkavilli, Johannes Schmude, Levente Klein
To mitigate global warming, greenhouse gas sources need to be resolved at a
high spatial resolution and monitored in time to ensure the reduction and
ultimately elimination of the pollution source. However, the computational
complexity of resolving high-resolution wind fields has made simulations
impractical for testing different time lengths and model configurations. This study
presents a preliminary development of a physics-informed super-resolution (SR)
generative adversarial network (GAN) that super-resolves the three-dimensional
(3D) low-resolution wind fields by a factor of 9. We develop a pixel-wise
self-attention (PWA) module that learns 3D weather dynamics via a
self-attention computation followed by a 2D convolution. We also employ a loss
term that regularizes the self-attention map during pretraining, capturing the
vertical convection process from input wind data. The new PWA SR-GAN produces
high-fidelity super-resolved 3D wind data, learns wind structure in the
high-frequency domain, and reduces the computational cost of a high-resolution
wind simulation by a factor of 89.7.
Authors' comments: 7 pages, 4 figures, NeurIPS 2023 Workshop: Tackling Climate Change
with Machine Learning
G. Camacho-Ciurana, P. Lee, N. Arsenov, A. Kovács, I. Szapudi, I. Csabai
The cross-correlation of cosmic voids with the lensing convergence ($\kappa$)
map of the CMB fluctuations offers a powerful tool to refine our understanding
of the dark sector in the consensus cosmological model. Our principal aim is to
compare the lensing signature of our galaxy data set with simulations based on
the concordance model and characterize the results with an $A_{\kappa}$
consistency parameter. In particular, our measurements contribute to the
understanding of the "lensing-is-low" tension of the $\Lambda$CDM model. We
selected luminous red galaxies from the WISE-Pan-STARRS data set, allowing an
extended 14,200 deg$^2$ sky area, that offers a more precise measurement
compared to previous studies. We created 2D and 3D void catalogs to
cross-correlate their locations with the Planck lensing map and studied their
average imprint signal using a stacking methodology. Applying the same
procedure, we also generated a mock catalog from the WebSky simulation for
comparison. The 2D void analysis revealed good agreement with the standard
cosmological model with $A_{\kappa}\approx1.06 \pm 0.08$, i.e. $S/N=13.3$,
showing a higher $S/N$ than previous studies using voids detected in the Dark
Energy Survey data set. The 3D void analysis exhibited a lower $S/N$ and
demonstrated worse agreement with our mock catalog than the 2D voids. These
deviations might be attributed to limitations in the mock catalog, such as
imperfections in the LRG selection, as well as a potential asymmetry between
the North and South patches of the WISE-Pan-STARRS data set in terms of data
quality. Overall, we present a significant detection of a CMB lensing signal
associated with cosmic voids, largely consistent with the concordance model.
Future analyses using even larger data sets also hold great promise of further
sharpening these results, given their complementary nature to large-scale
structure analyses.
Authors' comments: 10 pages, 6 figures, submitted to A&A
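The quoted significance for the 2D void analysis can be checked with simple arithmetic. The sketch below assumes that S/N is defined as the best-fit amplitude divided by its 1-sigma uncertainty, which the abstract implies but does not state; the variable names are illustrative.

```python
# Values quoted in the abstract for the 2D void analysis.
a_kappa = 1.06
sigma_a_kappa = 0.08

# Assumed definition: S/N = amplitude / 1-sigma uncertainty.
snr = a_kappa / sigma_a_kappa

# Distance of the measurement from the concordance expectation A_kappa = 1,
# expressed in units of the quoted uncertainty.
deviation_in_sigma = abs(a_kappa - 1.0) / sigma_a_kappa

print(f"S/N ~ {snr:.2f}")
print(f"offset from A_kappa = 1: {deviation_in_sigma:.2f} sigma")
```

The rounded inputs give about 13.25, in line with the quoted 13.3 (the small difference is presumably due to rounding of the published values), and the amplitude sits within one sigma of unity.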
Hongda Wu, Ping Wang, C V Aswartha Narayana
Federated Learning (FL) enables many resource-limited devices to train a model collaboratively without data sharing. However, many existing works focus on model-homogeneous FL, where the global and local models are the same size, ignoring the inherently heterogeneous computational capabilities of different devices and restricting resource-constrained devices from contributing to FL. In this paper, we consider model-heterogeneous FL and propose Federated Partial Model Training (FedPMT), where devices with smaller computational capabilities work on partial models (subsets of the global model) and contribute to the global model. Different from Dropout-based partial model generation, which removes neurons in hidden layers at random, model training in FedPMT is achieved from the back-propagation perspective. As such, all devices in FedPMT prioritize the most crucial parts of the global model. Theoretical analysis shows that the proposed partial model training design has a similar convergence rate to the widely adopted Federated Averaging (FedAvg) algorithm, $\mathcal{O}(1/T)$, with the sub-optimality gap enlarged by a constant factor related to the model splitting design in FedPMT. Empirical results show that FedPMT significantly outperforms the existing benchmark FedDrop. Meanwhile, compared to the popular model-homogeneous benchmark, FedAvg, FedPMT reaches the learning target in a shorter completion time, thus achieving a better trade-off between learning accuracy and completion time.
Romain Thoreau, Laurent Risser, Véronique Achard, Béatrice Berthelot, Xavier Briottet
Airborne hyperspectral images can be used to map the land cover in large
urban areas, thanks to their very high spatial and spectral resolutions on a
wide spectral domain. While the spectral dimension of hyperspectral images is
highly informative of the chemical composition of the land surface, the use of
state-of-the-art machine learning algorithms to map the land cover has been
dramatically limited by the availability of training data. To cope with the
scarcity of annotations, semi-supervised and self-supervised techniques have
lately raised a lot of interest in the community. Yet, the publicly available
hyperspectral data sets commonly used to benchmark machine learning models are
not totally suited to evaluate their generalization performances due to one or
several of the following properties: a limited geographical coverage (which
does not reflect the spectral diversity in metropolitan areas), a small number
of land cover classes and a lack of appropriate standard train / test splits
for semi-supervised and self-supervised learning. Therefore, we release in this
paper the Toulouse Hyperspectral Data Set that stands out from other data sets
in the above-mentioned respects in order to meet key issues in spectral
representation learning and classification over large-scale hyperspectral
images with very few labeled pixels. In addition, we discuss and experiment with
self-supervised techniques for spectral representation learning, including the
Masked Autoencoder, and establish a baseline for pixel-wise classification
achieving 85% overall accuracy and 77% F1 score. The Toulouse Hyperspectral
Data Set and our code are publicly available at
https://www.toulouse-hyperspectral-data-set.com and
https://www.github.com/Romain3Ch216/tlse-experiments, respectively.
Authors' comments: 17 pages, 13 figures
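Overall accuracy and F1 score can differ noticeably when land cover classes are imbalanced, as the two baseline numbers above suggest. The sketch below computes both on a toy example. It assumes a macro-averaged F1 (the abstract does not say which averaging is used), and the class names and labels are made up for illustration.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted class matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: one frequent class and one rare class.
y_true = ["roof"] * 8 + ["tree"] * 2
y_pred = ["roof"] * 8 + ["roof", "tree"]

oa = overall_accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"overall accuracy: {oa:.3f}, macro F1: {f1:.3f}")
```

A single error on the rare class barely moves the overall accuracy but pulls the macro F1 down, which is why both metrics are usually reported together.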
Sung-Jin Kim, Heon-Gyu Kwak, Hyeon-Taek Han, Dae-Hyeok Lee, Ji-Hoon Jeong, Seong-Whan Lee
Brain-computer interfaces (BCIs) have garnered significant attention for their potential in various applications, with event-related potentials (ERPs) playing a considerable role in BCI systems. This paper introduces a novel Distributed Inference System tailored for detecting task-wise single-trial ERPs in a stream of satellite images. Unlike traditional methodologies that employ a single model for target detection, our system utilizes multiple models, each optimized for specific tasks, ensuring enhanced performance across varying image transition times and target onset times. Our experiments, conducted on four participants, employed two paradigms: the Normal paradigm and an AI paradigm with bounding boxes. Results indicate that our proposed system outperforms the conventional methods in both paradigms, achieving the highest $F_{\beta}$ scores. Furthermore, including bounding boxes in the AI paradigm significantly improved target recognition. This study underscores the potential of our Distributed Inference System in advancing the field of ERP detection in satellite image streams.
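For reference, the $F_{\beta}$ score weights recall beta times as heavily as precision: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below implements this standard definition; the abstract does not state which beta the authors use, so the values shown are illustrative only.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical detector: every target is found, but half the detections are false alarms.
precision, recall = 0.5, 1.0

f1 = f_beta(precision, recall, beta=1.0)   # balanced
f2 = f_beta(precision, recall, beta=2.0)   # recall-weighted
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

For a detector with high recall and low precision, the recall-weighted score is higher than the balanced one, so the choice of beta materially affects how systems are ranked.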
Daehee Kim, Yoonsik Kim, DongHyun Kim, Yumin Lim, Geewook Kim, Taeho Kil
Inspired by the great success of language model (LM)-based pre-training,
recent studies in visual document understanding have explored LM-based
pre-training methods for modeling text within document images. Among them,
pre-training that reads all text from an image has shown promise, but often
exhibits instability and even fails when applied to broader domains, such as
those involving both visual documents and scene text images. This is a
substantial limitation for real-world scenarios, where the processing of text
image inputs in diverse domains is essential. In this paper, we investigate
effective pre-training tasks in the broader domains and also propose a novel
pre-training method called SCOB that leverages character-wise supervised
contrastive learning with online text rendering to effectively pre-train
document and scene text domains by bridging the domain gap. Moreover, SCOB
enables weakly supervised learning, significantly reducing annotation costs.
Extensive benchmarks demonstrate that SCOB generally improves vanilla
pre-training methods and achieves comparable performance to state-of-the-art
methods. Our findings suggest that SCOB can serve as a general and effective
approach for read-type pre-training methods. The code will be available at
https://github.com/naver-ai/scob.
Authors' comments: ICCV 2023
Meng Han, Xiangde Luo, Wenjun Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
Multi-organ segmentation in abdominal Computed Tomography (CT) images is of
great importance for diagnosis of abdominal lesions and subsequent treatment
planning. Though deep learning based methods have attained high performance,
they rely heavily on large-scale pixel-level annotations that are
time-consuming and labor-intensive to obtain. Due to its low dependency on
annotation, weakly supervised segmentation has attracted great attention.
However, there is still a large performance gap between current
weakly-supervised methods and fully supervised learning, leaving room for
exploration. In this work, we propose a novel 3D framework with two consistency
constraints for scribble-supervised multiple abdominal organ segmentation from
CT. Specifically, we employ a Triple-branch multi-Dilated network (TDNet) with
one encoder and three decoders using different dilation rates to capture
features from different receptive fields that are complementary to each other
to generate high-quality soft pseudo labels. For more stable unsupervised
learning, we use voxel-wise uncertainty to rectify the soft pseudo labels and
then supervise the outputs of each decoder. To further regularize the network,
class relationship information is exploited by encouraging the generated class
affinity matrices to be consistent across different decoders under multi-view
projection. Experiments on the public WORD dataset show that our method
outperforms five existing scribble-supervised methods.
Authors' comments: 10 pages, 3 figures, MICCAI2023
Xueyuan Li, Ruining Deng, Yucheng Tang, Shunxing Bao, Haichun Yang, Yuankai Huo
Precise identification of multiple cell classes in high-resolution Giga-pixel whole slide imaging (WSI) is critical for various clinical scenarios. Building an AI model for this purpose typically requires pixel-level annotations, which are often unscalable and must be done by skilled domain experts (e.g., pathologists). However, these annotations can be prone to errors, especially when distinguishing between intricate cell types (e.g., podocytes and mesangial cells) using only visual inspection. Interestingly, a recent study showed that lay annotators, when using extra immunofluorescence (IF) images for reference (referred to as molecular-empowered learning), can sometimes outperform domain experts in labeling. Despite this, the resource-intensive task of manual delineation remains a necessity during the annotation process. In this paper, we explore the potential of bypassing pixel-level delineation by employing the recent segment anything model (SAM) on weak box annotation in a zero-shot learning approach. Specifically, we harness SAM's ability to produce pixel-level annotations from box annotations and utilize these SAM-generated labels to train a segmentation model. Our findings show that the proposed SAM-assisted molecular-empowered learning (SAM-L) can diminish the labeling efforts for lay annotators by only requiring weak box annotations. This is achieved without compromising annotation accuracy or the performance of the deep learning-based segmentation. This research represents a significant advancement in democratizing the annotation process for training pathological image segmentation, relying solely on non-expert annotators.
Minghui Liwang, Bingshuo Guo, Zhanxi Ma, Yuhan Su, Jian Jin, Seyyedali Hosseinalipour, Xianbin Wang, Huaiyu Dai
To effectively process high volumes of data across a fleet of dynamic and distributed vehicles, it is crucial to implement resource provisioning techniques that can provide reliable, cost-effective, and timely computing services. This article explores computation-intensive task scheduling over mobile vehicular clouds (MVCs). We use undirected weighted graphs (UWGs) to model both the execution of tasks and communication patterns among vehicles in an MVC. We then study reliable and timely scheduling of UWG tasks through a novel mechanism, operating on two complementary decision-making stages: Plan A and Plan B. Plan A entails a proactive decision-making approach, leveraging historical statistical data for the preemptive creation of an optimal mapping ($\alpha$) between tasks and the MVC prior to practical task scheduling. In contrast, Plan B explores a real-time decision-making paradigm, functioning as a reliable contingency plan. It seeks a viable mapping ($\beta$) if $\alpha$ encounters failures during task scheduling due to the unpredictable nature of the network. Furthermore, we provide an in-depth exploration of the procedural intricacies and key contributing factors that underpin the success of our mechanism. Additionally, we present a case study showcasing the superior performance in terms of time efficiency and computation overhead. We further discuss a series of open directions for future research.
Rodrigo Córdova Rosado, Brandon S. Hensley, Susan E. Clark, Adriaan J. Duivenvoorden, Zachary Atkins, Elia Stefano Battistelli, Steve K. Choi, Jo Dunkley et al.
We present a cross-correlation analysis between $1'$ resolution total
intensity and polarization observations from the Atacama Cosmology Telescope
(ACT) at 150 and 220 GHz and 15$''$ mid-infrared photometry from the Wide-field
Infrared Survey Explorer (WISE) over 107 12.5$^\circ\times$12.5$^\circ$ patches
of sky. We detect a spatially isotropic signal in the WISE$\times$ACT $TT$
cross power spectrum at 30$\sigma$ significance that we interpret as the
correlation between the cosmic infrared background at ACT frequencies and
polycyclic aromatic hydrocarbon (PAH) emission from galaxies in WISE, i.e., the
cosmic PAH background. Within the Milky Way, the Galactic dust $TT$ spectra are
generally well-described by power laws in $\ell$ over the range
$10^3 < \ell < 10^4$, but there is evidence both for variability in the power law index and
for non-power law behavior in some regions. We measure a positive correlation
between WISE total intensity and ACT $E$-mode polarization at $1000 < \ell \lesssim 6000$
at $>3\sigma$ in each of 35 distinct $\sim$100 deg$^2$ regions
of the sky, suggesting alignment between Galactic density structures and the
local magnetic field persists to sub-parsec physical scales in these regions.
The distribution of $TE$ amplitudes in this $\ell$ range across all 107 regions
is biased to positive values, while there is no evidence for such a bias in the
$TB$ spectra. This work constitutes the highest-$\ell$ measurements of the
Galactic dust $TE$ spectrum to date and indicates that cross-correlation with
high-resolution mid-infrared measurements of dust emission is a promising tool
for constraining the spatial statistics of dust emission at millimeter
wavelengths.
Authors' comments: 20 pages, 14 figures, submitted to ApJ
Angela Andreella, Anna Vesely, Weeda Wouter, Jelle Goeman
Two permutation-based methods for simultaneous inference on the proportion of active voxels in cluster-wise brain imaging analysis have recently been published: Notip (Blain et al. 2022) and pARI (Andreella et al. 2023). Both rely on the definition of a critical vector of ordered p-values, chosen from a family of candidate vectors, but differ in how the family is defined: computed from randomization of external data for Notip and determined a priori for pARI. These procedures were compared to other proposals in the literature, but an extensive comparison between the two methods is missing due to their parallel publication. We provide such a comparison and find that pARI outperforms Notip if both methods are applied under their recommended settings. However, each method carries different advantages and drawbacks.