Romain Thoreau, Laurent Risser, Véronique Achard, Béatrice Berthelot, Xavier Briottet
Airborne hyperspectral images can be used to map the land cover in large
urban areas, thanks to their very high spatial and spectral resolutions on a
wide spectral domain. While the spectral dimension of hyperspectral images is
highly informative of the chemical composition of the land surface, the use of
state-of-the-art machine learning algorithms to map the land cover has been
dramatically limited by the availability of training data. To cope with the
scarcity of annotations, semi-supervised and self-supervised techniques have
lately raised a lot of interest in the community. Yet, the publicly available
hyperspectral data sets commonly used to benchmark machine learning models are
not totally suited to evaluate their generalization performances due to one or
several of the following properties: a limited geographical coverage (which
does not reflect the spectral diversity in metropolitan areas), a small number
of land cover classes and a lack of appropriate standard train / test splits
for semi-supervised and self-supervised learning. Therefore, we release in this
paper the Toulouse Hyperspectral Data Set that stands out from other data sets
in the above-mentioned respects in order to meet key issues in spectral
representation learning and classification over large-scale hyperspectral
images with very few labeled pixels. In addition, we discuss and experiment with
self-supervised techniques for spectral representation learning, including the
Masked Autoencoder, and establish a baseline for pixel-wise classification
achieving 85% overall accuracy and 77% F1 score. The Toulouse Hyperspectral
Data Set and our code are publicly available at
https://www.toulouse-hyperspectral-data-set.com and
https://www.github.com/Romain3Ch216/tlse-experiments, respectively.
Authors' comments: 17 pages, 13 figures
Sung-Jin Kim, Heon-Gyu Kwak, Hyeon-Taek Han, Dae-Hyeok Lee, Ji-Hoon Jeong, Seong-Whan Lee
Brain-computer interfaces (BCIs) have garnered significant attention for their potential in various applications, with event-related potentials (ERPs) playing a considerable role in BCI systems. This paper introduces a novel Distributed Inference System tailored for detecting task-wise single-trial ERPs in a stream of satellite images. Unlike traditional methodologies that employ a single model for target detection, our system utilizes multiple models, each optimized for specific tasks, ensuring enhanced performance across varying image transition times and target onset times. Our experiments, conducted with four participants, employed two paradigms: the Normal paradigm and an AI paradigm with bounding boxes. Results indicate that our proposed system outperforms conventional methods in both paradigms, achieving the highest $F_{\beta}$ scores. Furthermore, including bounding boxes in the AI paradigm significantly improved target recognition. This study underscores the potential of our Distributed Inference System in advancing the field of ERP detection in satellite image streams.
Daehee Kim, Yoonsik Kim, DongHyun Kim, Yumin Lim, Geewook Kim, Taeho Kil
Inspired by the great success of language model (LM)-based pre-training,
recent studies in visual document understanding have explored LM-based
pre-training methods for modeling text within document images. Among them,
pre-training that reads all text from an image has shown promise, but often
exhibits instability and even fails when applied to broader domains, such as
those involving both visual documents and scene text images. This is a
substantial limitation for real-world scenarios, where the processing of text
image inputs in diverse domains is essential. In this paper, we investigate
effective pre-training tasks in the broader domains and also propose a novel
pre-training method called SCOB that leverages character-wise supervised
contrastive learning with online text rendering to effectively pre-train
document and scene text domains by bridging the domain gap. Moreover, SCOB
enables weakly supervised learning, significantly reducing annotation costs.
Extensive benchmarks demonstrate that SCOB generally improves vanilla
pre-training methods and achieves comparable performance to state-of-the-art
methods. Our findings suggest that SCOB can serve generally and effectively
for read-type pre-training methods. The code will be available at
https://github.com/naver-ai/scob.
Authors' comments: ICCV 2023
Meng Han, Xiangde Luo, Wenjun Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
Multi-organ segmentation in abdominal Computed Tomography (CT) images is of
great importance for diagnosis of abdominal lesions and subsequent treatment
planning. Though deep learning based methods have attained high performance,
they rely heavily on large-scale pixel-level annotations that are
time-consuming and labor-intensive to obtain. Due to its low dependency on
annotation, weakly supervised segmentation has attracted great attention.
However, there is still a large performance gap between current
weakly-supervised methods and fully supervised learning, leaving room for
exploration. In this work, we propose a novel 3D framework with two consistency
constraints for scribble-supervised multiple abdominal organ segmentation from
CT. Specifically, we employ a Triple-branch multi-Dilated network (TDNet) with
one encoder and three decoders using different dilation rates to capture
features from different receptive fields that are complementary to each other
to generate high-quality soft pseudo labels. For more stable unsupervised
learning, we use voxel-wise uncertainty to rectify the soft pseudo labels and
then supervise the outputs of each decoder. To further regularize the network,
class relationship information is exploited by encouraging the generated class
affinity matrices to be consistent across different decoders under multi-view
projection. Experiments on the public WORD dataset show that our method
outperforms five existing scribble-supervised methods.
Authors' comments: 10 pages, 3 figures, MICCAI2023
Xueyuan Li, Ruining Deng, Yucheng Tang, Shunxing Bao, Haichun Yang, Yuankai Huo
Precise identification of multiple cell classes in high-resolution Giga-pixel whole slide imaging (WSI) is critical for various clinical scenarios. Building an AI model for this purpose typically requires pixel-level annotations, which are often unscalable and must be done by skilled domain experts (e.g., pathologists). However, these annotations can be prone to errors, especially when distinguishing between intricate cell types (e.g., podocytes and mesangial cells) using only visual inspection. Interestingly, a recent study showed that lay annotators, when using extra immunofluorescence (IF) images for reference (referred to as molecular-empowered learning), can sometimes outperform domain experts in labeling. Despite this, the resource-intensive task of manual delineation remains a necessity during the annotation process. In this paper, we explore the potential of bypassing pixel-level delineation by employing the recent segment anything model (SAM) on weak box annotation in a zero-shot learning approach. Specifically, we harness SAM's ability to produce pixel-level annotations from box annotations and utilize these SAM-generated labels to train a segmentation model. Our findings show that the proposed SAM-assisted molecular-empowered learning (SAM-L) can diminish the labeling efforts for lay annotators by only requiring weak box annotations. This is achieved without compromising annotation accuracy or the performance of the deep learning-based segmentation. This research represents a significant advancement in democratizing the annotation process for training pathological image segmentation, relying solely on non-expert annotators.
Minghui Liwang, Bingshuo Guo, Zhanxi Ma, Yuhan Su, Jian Jin, Seyyedali Hosseinalipour, Xianbin Wang, Huaiyu Dai
To effectively process a high volume of data across a fleet of dynamic and distributed vehicles, it is crucial to implement resource provisioning techniques that can provide reliable, cost-effective, and timely computing services. This article explores computation-intensive task scheduling over mobile vehicular clouds (MVCs). We use undirected weighted graphs (UWGs) to model both the execution of tasks and communication patterns among vehicles in an MVC. We then study reliable and timely scheduling of UWG tasks through a novel mechanism, operating on two complementary decision-making stages: Plan A and Plan B. Plan A entails a proactive decision-making approach, leveraging historical statistical data for the preemptive creation of an optimal mapping ($\alpha$) between tasks and the MVC prior to practical task scheduling. In contrast, Plan B explores a real-time decision-making paradigm, functioning as a reliable contingency plan. It seeks a viable mapping ($\beta$) if $\alpha$ encounters failures during task scheduling due to the unpredictable nature of the network. Furthermore, we provide an in-depth exploration of the procedural intricacies and key contributing factors that underpin the success of our mechanism. Additionally, we present a case study showcasing the superior performance of the mechanism in terms of time efficiency and computation overhead. We further discuss a series of open directions for future research.
Rodrigo Córdova Rosado, Brandon S. Hensley, Susan E. Clark, Adriaan J. Duivenvoorden, Zachary Atkins, Elia Stefano Battistelli, Steve K. Choi, Jo Dunkley et al.
We present a cross-correlation analysis between $1'$ resolution total
intensity and polarization observations from the Atacama Cosmology Telescope
(ACT) at 150 and 220 GHz and 15$''$ mid-infrared photometry from the Wide-field
Infrared Survey Explorer (WISE) over 107 12.5$^\circ\times$12.5$^\circ$ patches
of sky. We detect a spatially isotropic signal in the WISE$\times$ACT $TT$
cross power spectrum at 30$\sigma$ significance that we interpret as the
correlation between the cosmic infrared background at ACT frequencies and
polycyclic aromatic hydrocarbon (PAH) emission from galaxies in WISE, i.e., the
cosmic PAH background. Within the Milky Way, the Galactic dust $TT$ spectra are
generally well-described by power laws in $\ell$ over the range 10$^3 < \ell <
$10$^4$, but there is evidence both for variability in the power law index and
for non-power law behavior in some regions. We measure a positive correlation
between WISE total intensity and ACT $E$-mode polarization at 1000$ < \ell
\lesssim $6000 at $>$3$\sigma$ in each of 35 distinct $\sim$100 deg$^2$ regions
of the sky, suggesting alignment between Galactic density structures and the
local magnetic field persists to sub-parsec physical scales in these regions.
The distribution of $TE$ amplitudes in this $\ell$ range across all 107 regions
is biased to positive values, while there is no evidence for such a bias in the
$TB$ spectra. This work constitutes the highest-$\ell$ measurements of the
Galactic dust $TE$ spectrum to date and indicates that cross-correlation with
high-resolution mid-infrared measurements of dust emission is a promising tool
for constraining the spatial statistics of dust emission at millimeter
wavelengths.
Authors' comments: 20 pages, 14 figures, submitted to ApJ
Angela Andreella, Anna Vesely, Weeda Wouter, Jelle Goeman
Two permutation-based methods for simultaneous inference on the proportion of active voxels in cluster-wise brain imaging analysis have recently been published: Notip (Blain et al. 2022) and pARI (Andreella et al. 2023). Both rely on the definition of a critical vector of ordered p-values, chosen from a family of candidate vectors, but differ in how the family is defined: computed from randomization of external data for Notip and determined a priori for pARI. These procedures were compared to other proposals in the literature, but an extensive comparison between the two methods is missing due to their parallel publication. We provide such a comparison and find that pARI outperforms Notip if both methods are applied under their recommended settings. However, each method carries different advantages and drawbacks.
Zihong Yan, Xiaoyi Wu, Zhuozhu Jian, Bin Lan, Xueqian Wang, Bin Liang
Mobile robots navigating in outdoor environments frequently encounter undesired traces left by dynamic objects, which appear as obstacles on the map and impede accurate localization and effective navigation. To tackle this problem, a novel map construction framework based on a 3D region-wise hash map structure (RH-Map) is proposed, consisting of a front-end scan fresher and a back-end removal module, which realizes real-time map construction and online dynamic object removal (DOR). First, a two-layer 3D region-wise hash map structure for map management is proposed for effective online DOR. Then, in the scan fresher, region-wise ground plane estimation (R-GPE) is adopted for estimating and preserving ground information, and Scan-to-Map Removal (S2M-R) is proposed to discriminate and remove dynamic regions. Moreover, a lightweight back-end removal module that maintains keyframes is proposed for further DOR. As experimentally verified on SemanticKITTI, the proposed framework yields promising performance on online DOR during map construction compared with state-of-the-art methods. We also validate the proposed framework in real-world environments.
Manorama Jha, Bhaskar Banerjee
We present an efficient and robust point cloud registration (PCR) workflow
for part-wise rigid point cloud alignment using the Microsoft HoloLens 2. Point
Cloud Registration (PCR) is an important problem in Augmented and Mixed Reality
use cases, and we present a study for a special class of non-rigid
transformations. Many commonly encountered objects, such as robots with
manipulators and machines with hinges, are composed of rigid parts that move
relative to one another about joints, resulting in non-rigid deformation of the
whole object. The presented workflow allows us to register the point cloud of
such an object with its various configurations.
Authors' comments: Accepted for presentation at WiCV @ CVPR 2023
Marco Braun, Moritz Luszek, Mirko Meuter, Dominic Spata, Kevin Kollek, Anton Kummert
Current Deep Learning methods for environment segmentation and velocity
estimation rely on Convolutional Recurrent Neural Networks to exploit
spatio-temporal relationships within obtained sensor data. These approaches
derive scene dynamics implicitly by correlating novel input and memorized data
utilizing ConvNets. We show how ConvNets suffer from architectural restrictions
for this task. Based on these findings, we then provide solutions to various
issues on exploiting spatio-temporal correlations in a sequence of sensor
recordings by presenting a novel Recurrent Neural Network unit utilizing
Transformer mechanisms. Within this unit, object encodings are tracked across
consecutive frames by correlating key-query pairs derived from sensor inputs
and memory states, respectively. We then use resulting tracking patterns to
obtain scene dynamics and regress velocities. In a last step, the memory state
of the Recurrent Neural Network is projected based on extracted velocity
estimates to resolve aforementioned spatio-temporal misalignment.
Authors' comments: Preprint submitted to 2022 IEEE 25th International Conference on
Intelligent Transportation Systems (ITSC), Macau, China, 7 pages
Shivang Rawat, Stefano Martiniani
Stochasticity plays a central role in nearly every biological process, and the noise power spectral density (PSD) is a critical tool for understanding variability and information processing in living systems. In steady-state, many such processes can be described by stochastic linear time-invariant (LTI) systems driven by Gaussian white noise, whose PSD is a complex rational function of the frequency that can be concisely expressed in terms of their Jacobian, dispersion, and diffusion matrices, fully defining the statistical properties of the system's dynamics at steady-state. Here, we arrive at compact element-wise solutions of the rational function coefficients for the auto- and cross-spectrum that enable the explicit analytical computation of the PSD in dimensions n=2,3,4. We further present a recursive Leverrier-Faddeev-type algorithm for the exact computation of the rational function coefficients. Crucially, both solutions are free of matrix inverses. We illustrate our element-wise and recursive solutions by considering the stochastic dynamics of neural systems models, namely Fitzhugh-Nagumo (n=2), Hindmarsh-Rose (n=3), Wilson-Cowan (n=4), and the Stabilized Supralinear Network (n=22), as well as an evolutionary game-theoretic model with mutations (n=5, 31). We extend our approach to derive a recursive method for calculating the coefficients in the power series expansion of the integrated covariance matrix for interacting spiking neurons modeled as Hawkes processes on arbitrary directed graphs.
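The inverse-free idea can be illustrated with the classical Faddeev-LeVerrier recursion. The following is a minimal sketch, not the authors' implementation: it assumes a linear system dx = J x dt + L dW with noise covariance Q = L L^T and uses the convention S(w) = (iwI - J)^{-1} Q (iwI - J)^{-H}, without any 2*pi normalisation. The function names `faddeev_leverrier` and `psd` are hypothetical.

```python
import numpy as np

def faddeev_leverrier(A):
    """Return (coeffs, mats) such that
         det(sI - A) = sum_k coeffs[k] * s**k
         adj(sI - A) = sum_k mats[k]   * s**k
    using only matrix products and traces (no matrix inverse)."""
    n = A.shape[0]
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0                      # characteristic polynomial is monic
    mats = [None] * n
    M = np.eye(n)                        # M_1 = I
    for k in range(1, n + 1):
        mats[n - k] = M                  # matrix coefficient of s**(n-k)
        coeffs[n - k] = -np.trace(A @ M) / k
        M = A @ M + coeffs[n - k] * np.eye(n)
    return coeffs, mats

def psd(J, Q, omega):
    """Spectral density matrix at angular frequency omega.
    Assumed convention: S = H Q H^H with H = (i*omega*I - J)^{-1}."""
    coeffs, mats = faddeev_leverrier(J)
    s = 1j * omega
    num = sum(Mk * s**k for k, Mk in enumerate(mats))    # adj(i*omega*I - J)
    den = sum(ck * s**k for k, ck in enumerate(coeffs))  # det(i*omega*I - J)
    return num @ Q @ num.conj().T / abs(den) ** 2

# Usage: a stable 2-D system with diagonal noise covariance (illustrative values).
J = np.array([[-1.0, 2.0], [-3.0, -4.0]])
Q = np.diag([1.0, 0.5])
S = psd(J, Q, omega=1.3)   # S[0, 0] is an auto-spectrum, S[0, 1] a cross-spectrum
```

Because the numerator and denominator are polynomials in i*omega, their coefficients only need to be computed once and can then be evaluated at any number of frequencies.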
Domenico Iuso, Soumick Chatterjee, Sven Cornelissen, Dries Verhees, Jan De Beenhouwer, Jan Sijbers
Additive Manufacturing (AM) has emerged as a manufacturing process that allows the direct production of samples from digital models. To ensure that quality standards are met in all manufactured samples of a batch, X-ray computed tomography (X-CT) is often used combined with automated anomaly detection. For the latter, deep learning (DL) anomaly detection techniques are increasingly used, as they can be trained to be robust to the material being analysed and resilient towards poor image quality. Unfortunately, most recent and popular DL models have been developed for 2D image processing, thereby disregarding valuable volumetric information. This study revisits recent supervised (UNet, UNet++, UNet 3+, MSS-UNet) and unsupervised (VAE, ceVAE, gmVAE, vqVAE) DL models for porosity analysis of AM samples from X-CT images and extends them to accept 3D input data with a 3D-patch pipeline for lower computational requirements, improved efficiency and generalisability. The supervised models were trained using the Focal Tversky loss to address class imbalance that arises from the low porosity in the training datasets. The output of the unsupervised models is post-processed to reduce misclassifications caused by their inability to adequately represent the object surface. The findings were cross-validated in a 5-fold fashion and include: a performance benchmark of the DL models, an evaluation of the post-processing algorithm, and an evaluation of the effect of training supervised models with the output of unsupervised models. In a final performance benchmark on a test set with poor image quality, the best performing supervised model was UNet++ with an average precision of 0.751 $\pm$ 0.030, while the best unsupervised model was the post-processed ceVAE with 0.830 $\pm$ 0.003. The VAE/ceVAE models demonstrated superior capabilities, particularly when leveraging post-processing techniques.
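To make the class-imbalance argument concrete, here is a minimal NumPy sketch of a Focal Tversky loss for a binary pore mask. It is an illustration under stated assumptions, not the study's training code: the hyperparameter values (`alpha`, `beta`, `gamma`) are placeholders, and the exponent convention `(1 - TI) ** gamma` is one of several used in the literature.

```python
import numpy as np

def focal_tversky_loss(probs, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """Focal Tversky loss for a binary segmentation volume.

    probs  : predicted foreground probabilities in [0, 1]
    target : binary ground-truth mask (1 = pore, 0 = background)
    alpha  : weight on false negatives (assumed value)
    beta   : weight on false positives (assumed value)
    gamma  : focusing exponent (assumed value and convention)
    """
    probs = np.asarray(probs, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(probs * target)            # soft true positives
    fn = np.sum((1.0 - probs) * target)    # soft false negatives
    fp = np.sum(probs * (1.0 - target))    # soft false positives
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - tversky) ** gamma

# Usage: with alpha > beta, missing the sparse pore voxels is penalised more
# heavily than over-segmenting them.
mask = np.array([0, 0, 1, 1])
loss_missed = focal_tversky_loss(np.zeros(4), mask)   # all pores missed
loss_overseg = focal_tversky_loss(np.ones(4), mask)   # everything flagged as pore
```

With `alpha = beta = 0.5` and `gamma = 1` the expression reduces to one minus the soft Dice coefficient, which is a convenient sanity check.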
Di Hong, Jiangrong Shen, Yu Qi, Yueming Wang
Spiking Neural Networks (SNNs) are biologically realistic and practically promising for low-power computation because of their event-driven mechanism. Usually, the training of SNNs suffers accuracy loss on various tasks, yielding inferior performance compared with artificial neural networks (ANNs). A conversion scheme has been proposed to obtain competitive accuracy by mapping trained ANNs' parameters to SNNs with the same structures. However, an enormous number of time steps is required for these converted SNNs, thus losing the energy-efficiency benefit. To combine the accuracy advantages of ANNs with the computing efficiency of SNNs, a novel SNN training framework is proposed, namely layer-wise ANN-to-SNN knowledge distillation (LaSNN). In order to achieve competitive accuracy and reduced inference latency, LaSNN transfers the learning from a well-trained ANN to a small SNN by distilling the knowledge rather than converting the parameters of the ANN. The information gap between the heterogeneous ANN and SNN is bridged by introducing an attention scheme; the knowledge in an ANN is effectively compressed and then efficiently transferred using our layer-wise distillation paradigm. We conduct detailed experiments to demonstrate the effectiveness, efficacy, and scalability of LaSNN on three benchmark data sets (CIFAR-10, CIFAR-100, and Tiny ImageNet). We achieve competitive top-1 accuracy compared to ANNs and 20x faster inference than converted SNNs with similar performance. More importantly, LaSNN is dexterous and extensible, and can be readily developed for SNNs with different architectures/depths and input encoding methods, contributing to their potential development.
Muhammad Febrian Rachmadi, Charissa Poon, Henrik Skibbe
In this paper, we propose a novel two-component loss for biomedical image
segmentation tasks called the Instance-wise and Center-of-Instance (ICI) loss,
a loss function that addresses the instance imbalance problem commonly
encountered when using pixel-wise loss functions such as the Dice loss. The
Instance-wise component improves the detection of small instances or "blobs"
in image datasets with both large and small instances. The Center-of-Instance
component improves the overall detection accuracy. We compared the ICI loss
with two existing losses, the Dice loss and the blob loss, in the task of
stroke lesion segmentation using the ATLAS R2.0 challenge dataset from MICCAI
2022. Compared to the other losses, the ICI loss provided a better balanced
segmentation, and significantly outperformed the Dice loss with an improvement
of $1.7-3.7\%$ and the blob loss by $0.6-5.0\%$ in terms of the Dice similarity
coefficient on both validation and test set, suggesting that the ICI loss is a
potential solution to the instance imbalance problem.
Authors' comments: conference
Amirhossein Rasoulian, Soorena Salari, Yiming Xiao
Intracranial hemorrhage (ICH) is a life-threatening medical emergency caused by various factors. Timely and precise diagnosis of ICH is crucial for administering effective treatment and improving patient survival rates. While deep learning techniques have emerged as the leading approach for medical image analysis and processing, the most commonly employed supervised learning often requires large, high-quality annotated datasets that can be costly to obtain, particularly for pixel/voxel-wise image segmentation. To address this challenge and facilitate ICH treatment decisions, we propose a novel weakly supervised ICH segmentation method that leverages a hierarchical combination of head-wise gradient-infused self-attention maps obtained from a Swin transformer. The transformer is trained using an ICH classification task with categorical labels. To build and validate the proposed technique, we used two publicly available clinical CT datasets, namely RSNA 2019 Brain CT hemorrhage and PhysioNet. Additionally, we conducted an exploratory study comparing two learning strategies - binary classification and full ICH subtyping - to assess their impact on self-attention and our weakly supervised ICH segmentation framework. The proposed algorithm was compared against the popular U-Net with full supervision, as well as a similar weakly supervised approach using Grad-CAM for ICH segmentation. With a mean Dice score of 0.47, our technique achieved ICH segmentation performance similar to that of the U-Net and outperformed the Grad-CAM based approach, demonstrating the excellent potential of the proposed framework in challenging medical image segmentation tasks.
Anja Rappl, Thomas Kneib, Stefan Lang, Elisabeth Bergherr
Joint models for longitudinal and time-to-event data have seen many developments in recent years. However, spatial joint models are still rare, and the traditional proportional hazards formulation of the time-to-event part of the model is accompanied by computational challenges. We propose a joint model with a piece-wise exponential formulation of the hazard using the counting process representation of a hazard and structured additive predictors able to estimate (non-)linear, spatial and random effects. Its capabilities are assessed in a simulation study comparing our approach to an established one and highlighted by an example on physical functioning after cardiovascular events from the German Ageing Survey. The Structured Piecewise Additive Joint Model yielded good estimation performance, also and especially for spatial effects, while being twice as fast as the chosen benchmark approach and performing stably in imbalanced data settings with few events.
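As a small illustration of what a piece-wise exponential hazard implies, the sketch below evaluates the cumulative hazard and survival function when the hazard is constant within each time interval. It covers only this building block, not the proposed joint model; the function names, cut points and rates are hypothetical.

```python
import numpy as np

def pwe_cumulative_hazard(t, cuts, rates):
    """Cumulative hazard at time t for a piece-wise constant hazard.

    cuts  : increasing interval start points, beginning at 0
    rates : hazard rate on [cuts[j], cuts[j+1]); the last interval is open-ended
    """
    cuts = np.asarray(cuts, dtype=float)
    rates = np.asarray(rates, dtype=float)
    upper = np.append(cuts[1:], np.inf)
    # Time at risk spent in each interval up to t (zero for intervals not yet reached).
    exposure = np.clip(t - cuts, 0.0, upper - cuts)
    return float(np.sum(rates * exposure))

def pwe_survival(t, cuts, rates):
    """Survival probability S(t) = exp(-cumulative hazard)."""
    return float(np.exp(-pwe_cumulative_hazard(t, cuts, rates)))

# Usage: hazard 0.5 on [0, 1), 0.2 on [1, 3), 0.1 afterwards (illustrative values).
s_two_years = pwe_survival(2.0, cuts=[0.0, 1.0, 3.0], rates=[0.5, 0.2, 0.1])
```

The per-interval exposure times computed here are the quantities that typically enter as offsets when a piece-wise exponential model is fitted through a Poisson-type likelihood.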
Gokularam Muthukrishnan, Sheetal Kalyani
Conventionally, in a differentially private additive noise mechanism, independent and identically distributed (i.i.d.) noise samples are added to each coordinate of the response. In this work, we formally present the addition of noise that is independent but not identically distributed (i.n.i.d.) across the coordinates to achieve a tighter privacy-accuracy trade-off by exploiting coordinate-wise disparity in privacy leakage. In particular, we study the i.n.i.d. Gaussian and Laplace mechanisms and obtain the conditions under which these mechanisms guarantee privacy. The optimal choices of parameters that ensure these conditions are derived considering (weighted) mean squared and $\ell_{p}^{p}$-errors as measures of accuracy. Theoretical analyses and numerical simulations demonstrate that the i.n.i.d. mechanisms achieve higher utility for the given privacy requirements compared to their i.i.d. counterparts. One of the interesting observations is that the Laplace mechanism outperforms the Gaussian mechanism even in high dimensions, contrary to popular belief, if the irregularity in coordinate-wise sensitivities is exploited. We also demonstrate how the i.n.i.d. noise can improve the performance in private (a) coordinate descent, (b) principal component analysis, and (c) deep learning with group clipping.
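The benefit of coordinate-wise noise scales can be sketched for the Laplace case. The example relies on a standard sufficient bound, which is not necessarily the condition derived in the paper: if coordinate i changes by at most `sensitivities[i]` between neighbouring data sets and receives Laplace noise of scale `scales[i]`, the privacy loss is at most the sum of `sensitivities[i] / scales[i]`. The cube-root allocation in `inid_scales` comes from a simple Lagrangian calculation made for this illustration (minimising total noise variance under that bound, assuming strictly positive sensitivities); all function names are hypothetical.

```python
import numpy as np

def privacy_loss_bound(sensitivities, scales):
    """Upper bound on epsilon for independent Laplace noise per coordinate."""
    s = np.asarray(sensitivities, dtype=float)
    b = np.asarray(scales, dtype=float)
    return float(np.sum(s / b))

def iid_scales(sensitivities, epsilon):
    """Same scale for every coordinate, calibrated to the l1 sensitivity."""
    s = np.asarray(sensitivities, dtype=float)
    return np.full(s.shape, s.sum() / epsilon)

def inid_scales(sensitivities, epsilon):
    """Coordinate-wise scales b_i proportional to sensitivity_i ** (1/3).
    Illustrative allocation only; assumes strictly positive sensitivities."""
    s = np.asarray(sensitivities, dtype=float)
    return np.cbrt(s) * np.sum(s ** (2.0 / 3.0)) / epsilon

def expected_squared_error(scales):
    """Total noise variance: a Laplace variable of scale b has variance 2 * b**2."""
    return float(np.sum(2.0 * np.asarray(scales, dtype=float) ** 2))

def laplace_mechanism(response, scales, rng):
    """Add independent Laplace noise with per-coordinate scales to the response."""
    response = np.asarray(response, dtype=float)
    return response + rng.laplace(loc=0.0, scale=scales, size=response.shape)

# Usage: one highly sensitive coordinate and two mildly sensitive ones.
sens = [1.0, 0.1, 0.1]
b_iid, b_inid = iid_scales(sens, 1.0), inid_scales(sens, 1.0)
noisy = laplace_mechanism([3.0, 0.2, 0.5], b_inid, np.random.default_rng(0))
```

Both allocations meet the same bound on epsilon, but the non-identical scales add less total noise variance whenever the sensitivities differ across coordinates.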
Baichuan Mo, Qing Yi Wang, Xiaotong Guo, Matthias Winkenbach, Jinhua Zhao
In last-mile delivery, drivers frequently deviate from planned delivery routes because of their tacit knowledge of the road and curbside infrastructure, customer availability, and other characteristics of the respective service areas. Hence, the actual stop sequences chosen by an experienced human driver may be potentially preferable to the theoretical shortest-distance routing under real-life operational conditions. Thus, being able to predict the actual stop sequence that a human driver would follow can help to improve route planning in last-mile delivery. This paper proposes a pair-wise attention-based pointer neural network for this prediction task using drivers' historical delivery trajectory data. In addition to the commonly used encoder-decoder architecture for sequence-to-sequence prediction, we propose a new attention mechanism based on an alternative specific neural network to capture the local pair-wise information for each pair of stops. To further capture the global efficiency of the route, we propose a new iterative sequence generation algorithm that is used after model training to identify the first stop of a route that yields the lowest operational cost. Results from an extensive case study on real operational data from Amazon's last-mile delivery operations in the US show that our proposed method can significantly outperform traditional optimization-based approaches and other machine learning methods (such as the Long Short-Term Memory encoder-decoder and the original pointer network) in finding stop sequences that are closer to high-quality routes executed by experienced drivers in the field. Compared to benchmark models, the proposed model can increase the average prediction accuracy of the first four stops from around 0.2 to 0.312, and reduce the disparity between the predicted route and the actual route by around 15%.
Chunyu Qiang, Peng Yang, Hao Che, Xiaorui Wang, Zhongyuan Wang
Cross-speaker style transfer in speech synthesis aims at transferring a style
from a source speaker to synthesised speech with a target speaker's timbre. Most
previous approaches rely on data with style labels, but manually-annotated
labels are expensive and not always reliable. In response to this problem, we
propose Style-Label-Free, a cross-speaker style transfer method, which can
realize the style transfer from a source speaker to a target speaker without style
labels. Firstly, a reference encoder structure based on quantized variational
autoencoder (Q-VAE) and style bottleneck is designed to extract discrete style
representations. Secondly, a speaker-wise batch normalization layer is proposed
to reduce the source speaker leakage. In order to improve the style extraction
ability of the reference encoder, a style invariant and contrastive data
augmentation method is proposed. Experimental results show that the method
outperforms the baseline. We provide a website with audio samples.
Authors' comments: Published to ISCSLP 2022