Se-In Jang, Tinsu Pan, Ye Li, Pedram Heidari, Junyu Chen, Quanzheng Li, Kuang Gong
Positron emission tomography (PET) is widely used in clinics and research due
to its quantitative merits and high sensitivity, but suffers from low
signal-to-noise ratio (SNR). Recently, convolutional neural networks (CNNs) have
been widely used to improve PET image quality. Though successful and efficient
in local feature extraction, CNNs cannot capture long-range dependencies well
due to their limited receptive field. Global multi-head self-attention (MSA) is a
popular approach to capture long-range information. However, the calculation of
global MSA for 3D images has high computational costs. In this work, we
propose an efficient spatial and channel-wise encoder-decoder transformer,
Spach Transformer, that can leverage spatial and channel information based on
local and global MSAs. Experiments based on datasets of different PET tracers,
i.e., $^{18}$F-FDG, $^{18}$F-ACBC, $^{18}$F-DCFPyL, and $^{68}$Ga-DOTATATE,
were conducted to evaluate the proposed framework. Quantitative results show
that the proposed Spach Transformer framework outperforms state-of-the-art deep
learning architectures. Our code is available at
https://github.com/sijang/SpachTransformer
Authors' comments: 15 pages
S. Rahmani, C. W. Xiao
The branching ratios of the semileptonic decays of the charm mesons, such as ${D^0} \to {K^ - }{l^ + }\nu $, ${D^0} \to {\pi ^ - }{l^ + }\nu $, ${D_s} \to {K^0}{l^ + }\nu $ and ${D_s} \to \eta {l^ + }\nu$, are analyzed using three different models for the Isgur-Wise function, and the form factors of these decays are discussed. The mass spectra of the charm mesons are obtained. We use a potential quark model and consider the non-relativistic Hamiltonian of the charm meson as a bound state of the quark-antiquark system. We take into account a harmonic-type confinement and also the Hellmann potential, which is a superposition of the Coulomb and the Yukawa potentials. Using the variational approach along with the harmonic oscillator wave functions, we evaluate the mass spectra of the charm mesons, the form factors and the semileptonic decay widths of $D_{(s)}$. We present our results for the masses of $D, D_s$ and $\eta$, the Isgur-Wise functions, the form factors of the semileptonic decays, and the branching fractions of the semileptonic decays of $D$ and $D_s$. Our results are motivating.
Toshihiro Kasuga, Joseph R. Masiero
We present space-based thermal infrared observations of the presumably
Geminid-associated asteroids: (3200) Phaethon, 2005 UD and 1999 YC using
WISE/NEOWISE. The images were taken at the four wavelength bands
3.4$\mu$m (W1), 4.6$\mu$m (W2), 12$\mu$m (W3), and 22$\mu$m (W4). We find no evidence
of lasting mass-loss in the asteroids over the decadal multi-epoch datasets. We
set an upper limit to the mass-loss rate in dust of Q<2kg s$^{-1}$ for Phaethon
and <0.1kg s$^{-1}$ for each of 2005 UD and 1999 YC, with little
dependence on the observed heliocentric distances of R=1.0$-$2.3au. For
Phaethon, even if the maximum mass-loss was sustained over the 1000(s)yr
dynamical age of the Geminid stream, it is more than two orders of magnitude
too small to supply the reported stream mass (1e13$-$14kg). The
Phaethon-associated dust trail (Geminid stream) is not detected at R=2.3au,
corresponding to an upper limit on the optical depth of $\tau$<7e-9. Additionally,
no co-moving asteroids with radii r<650m were found. The DESTINY+ dust analyzer
would be capable of detecting several of the 10$\mu$m-sized interplanetary dust
particles when at large distances (>50,000km) from Phaethon. For 2005 UD, if the
mass-loss rate lasted over the 10,000yr dynamical age of the Daytime Sextantid
meteoroid stream, the mass of the stream would be ~1e10kg. The 1999 YC images
showed neither the related dust trail ($\tau$<2e-8) nor co-moving objects with
radii r<170m at R=1.6au. Estimated physical parameters from these limits do not
explain the production mechanism of the Geminid meteoroid stream. Lastly, to
explore the origin of the Geminids, we discuss the implications for our data in
relation to the possibly sodium (Na)-driven perihelion activity of Phaethon.
Authors' comments: Accepted for publication in The Astronomical Journal, 8 tables, 7
figures
Eeshan Bhaduri, Shagufta Pal, Arkopal Kishore Goswami
The study examines heterogeneity in travel behaviour among ride-hailing services (RHS) users by including attitudes, in order to reinforce conventional user-segmentation approaches. Simultaneously, prioritization of ride-hailing specific attributes was carried out to assess how RHS will operate in a sustainable way.
Katsuhisa Ouchi, Hiroyuki Masuyama
This paper considers the level-increment (LI) truncation approximation of
M/G/1-type Markov chains. The LI truncation approximation is useful for
implementing the M/G/1 paradigm, which is the framework for computing the
stationary distribution of M/G/1-type Markov chains. The main result of this
paper is a subgeometric convergence formula for the total variation distance
between the original stationary distribution and its LI truncation
approximation. Suppose that the equilibrium level-increment distribution is
subexponential, and that the downward transition matrix is rank one. We then
show that the convergence rate of the total variation error of the LI
truncation approximation is equal to that of the tail of the equilibrium
level-increment distribution and that of the tail of the original stationary
distribution.
Authors' comments: 20 pages. This is a revised version of the paper to appear in JORSJ,
Vol. 65, No. 4, 2022
Fabian Duffhauss, Tobias Demmler, Gerhard Neumann
Estimating 6D poses of objects is an essential computer vision task. However,
most conventional approaches rely on camera data from a single perspective and
therefore suffer from occlusions. We overcome this issue with our novel
multi-view 6D pose estimation method called MV6D, which accurately predicts the
6D poses of all objects in a cluttered scene based on RGB-D images from
multiple perspectives. We base our approach on the PVN3D network that uses a
single RGB-D image to predict keypoints of the target objects. We extend this
approach by using a combined point cloud from multiple views and fusing the
images from each view with a DenseFusion layer. In contrast to current
multi-view pose detection networks such as CosyPose, our MV6D can learn the
fusion of multiple perspectives in an end-to-end manner and does not require
multiple prediction stages or subsequent fine-tuning of the prediction.
Furthermore, we present three novel photorealistic datasets of cluttered scenes
with heavy occlusions. All of them contain RGB-D images from multiple
perspectives and the ground truth for instance semantic segmentation and 6D
pose estimation. MV6D significantly outperforms the state-of-the-art in
multi-view 6D pose estimation even in cases where the camera poses are known
inaccurately. Furthermore, we show that our approach is robust towards dynamic
camera setups and that its accuracy increases incrementally with an increasing
number of perspectives.
Authors' comments: Accepted at IROS 2022
Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura
In this paper, we investigate the semi-supervised joint training of text to
speech (TTS) and automatic speech recognition (ASR), where a small amount of
paired data and a large amount of unpaired text data are available.
Conventional studies form a cycle called the TTS-ASR pipeline, where the
multispeaker TTS model synthesizes speech from text with a reference speech and
the ASR model reconstructs the text from the synthesized speech, after which
both models are trained with a cycle-consistency loss. However, the synthesized
speech does not reflect the speaker characteristics of the reference speech and
the synthesized speech becomes overly easy for the ASR model to recognize after
training. This not only decreases the TTS model quality but also limits the ASR
model improvement. To solve this problem, we propose improving the
cycle-consistency-based training with a speaker consistency loss and step-wise
optimization. The speaker consistency loss brings the speaker characteristics
of the synthesized speech closer to those of the reference speech. In the
step-wise optimization, we first freeze the parameters of the TTS model before
both models are trained to avoid over-adaptation of the TTS model to the ASR
model. Experimental results demonstrate the efficacy of the proposed method.
Authors' comments: Accepted to INTERSPEECH 2022
Geonho Cha, Chaehun Shin, Sungroh Yoon, Dongyoon Wee
To estimate the volume density and color of a 3D point in multi-view image-based rendering, a common approach is to check whether a consensus exists among the given source image features, which is one of the informative cues for the estimation procedure. To this end, most previous methods use equally weighted aggregation features. However, this can make it hard to check for a consensus when the source image feature set includes outliers, which frequently arise from occlusions. In this paper, we propose a novel source-view-wise feature aggregation method, which allows the consensus to be found robustly by leveraging local structures in the feature set. For the proposed aggregation, we first calculate the source-view-wise distance distribution for each source feature. After that, the distance distribution is converted to several similarity distributions with the proposed learnable similarity mapping functions. Finally, for each element in the feature set, the aggregation features are extracted by calculating the weighted means and variances, where the weights are derived from the similarity distributions. In experiments, we validate the proposed method on various benchmark datasets, including synthetic and real image scenes. The experimental results demonstrate that incorporating the proposed features improves the performance by a large margin, achieving state-of-the-art performance.
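The aggregation steps described in this abstract (distances, then similarities, then weighted means and variances) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `aggregate_features`, the Euclidean distance, and the fixed mapping `exp(-d / temperature)` are assumptions standing in for the learnable similarity mapping functions that the abstract mentions.

```python
import math


def aggregate_features(features, temperature=1.0):
    """Similarity-weighted mean and variance for each source view.

    features: list of n_views feature vectors (lists of floats, equal length).
    Returns a list of n_views vectors, each the weighted mean followed by
    the weighted variance (length 2 * dim).
    """
    dim = len(features[0])
    aggregated = []
    for anchor in features:
        # Distance from this source view's feature to every source feature.
        dists = [math.dist(anchor, other) for other in features]
        # Assumed fixed similarity mapping; the paper learns this mapping.
        sims = [math.exp(-d / temperature) for d in dists]
        total = sum(sims)  # at least 1.0, since the self-distance is zero
        weights = [s / total for s in sims]
        # Weighted mean and variance over the source views, per dimension.
        mean = [sum(w * f[d] for w, f in zip(weights, features))
                for d in range(dim)]
        var = [sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, features))
               for d in range(dim)]
        aggregated.append(mean + var)
    return aggregated


# Three agreeing views plus one outlier (e.g. an occluded view).
views = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [10.0, 10.0]]
robust = aggregate_features(views, temperature=1.0)
uniform = aggregate_features(views, temperature=1e9)  # nearly equal weights
```

With a small temperature, the outlier barely affects the statistics computed for the agreeing views. A very large temperature approaches the equally weighted mean and variance, which the outlier distorts.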
Xiaofeng Gao, Xingwei Wu, Samson Ho, Teruhisa Misu, Kumar Akash
Although partially autonomous driving (AD) systems are already available in
production vehicles, drivers are still required to maintain a sufficient level
of situational awareness (SA) during driving. Previous studies have shown that
providing information about the AD's capability using user interfaces can
improve the driver's SA. However, displaying too much information increases the
driver's workload and can distract or overwhelm the driver. Therefore, to
design an efficient user interface (UI), it is necessary to understand its
effect under different circumstances. In this paper, we focus on a UI based on
augmented reality (AR), which can highlight potential hazards on the road. To
understand the effect of highlighting on drivers' SA for objects with different
types and locations under various traffic densities, we conducted an in-person
experiment with 20 participants on a driving simulator. Our study results show
that the effects of highlighting on drivers' SA varied by traffic densities,
object locations and object types. We believe our study can provide guidance in
selecting which object to highlight for the AR-based driver-assistance
interface to optimize SA for drivers driving and monitoring partially
autonomous vehicles.
Authors' comments: 10 pages, 11 figures, IV 2022
Iasson Karafyllis, Pierdomenico Pepe, Yuan Wang, Antoine Chaillet
For time-delay systems, it is known that global asymptotic stability is guaranteed by the existence of a Lyapunov-Krasovskii functional that dissipates in a point-wise manner along solutions, namely whose dissipation rate involves only the current value of the solution's norm. So far, the extension of this result to global exponential stability (GES) holds only for systems ruled by a globally Lipschitz vector field and remains largely open for the input-to-state stability (ISS) property. In this paper, we rely on the notion of exponential ISS to extend the class of systems for which GES or ISS can be concluded from a point-wise dissipation. Our results in turn show that these properties still hold in the presence of a sufficiently small additional term involving the whole state history norm. We provide explicit estimates of the tolerable magnitude of this extra term and show through an example how it can be used to assess robustness with respect to modeling uncertainties.
Qiuyi Wu, Julie Bessac, Whitney Huang, Jiali Wang
This study develops a statistical conditional approach to evaluate climate model performance in wind speed and direction and to project their future changes under the representative concentration pathway 8.5 scenario over inland and offshore locations across the continental United States. The proposed conditional approach extends the scope of existing studies by characterizing the changes of the full range of the joint wind speed and direction distribution. Directional wind speed distributions are estimated using two statistical methods: a Weibull distributional regression model and a quantile regression model, both of which enforce the circular constraint on their resulting estimates of directional distributions. Projected uncertainties associated with different climate models and model internal variability are investigated and compared with the climate change signal to quantify the statistical significance of the future projections. In particular, this work extends the concept of internal variability to the standard deviation and high quantiles to assess their magnitudes relative to the projected changes. The evaluation results show that the studied climate models capture historical wind speed, wind direction, and their dependencies reasonably well over both inland and offshore locations. In the future, most of the locations show no significant changes in mean wind speeds in either winter or summer, although the standard deviation and the 95th quantile show some robust changes over certain locations in winter. The proposed conditional approach enables the characterization of the directional wind speed distributions, which offers additional insights for the joint assessment of speed and direction.
Christoph Wehner, Francis Powlesland, Bashar Altakrouri, Ute Schmid
Artificial Intelligence and Digital Twins play an integral role in driving innovation in the domain of intelligent driving. Long short-term memory (LSTM) is a leading driver in the field of lane change prediction for manoeuvre anticipation. However, the decision-making process of such models is complex and non-transparent, hence reducing the trustworthiness of the smart solution. This work presents an innovative approach and a technical implementation for explaining lane change predictions of layer normalized LSTMs using Layer-wise Relevance Propagation (LRP). The core implementation includes consuming live data from a digital twin on a German highway, live predictions and explanations of lane changes by extending LRP to layer normalized LSTMs, and an interface for communicating and explaining the predictions to a human user. We aim to demonstrate faithful, understandable, and adaptable explanations of lane change prediction to increase the adoption and trustworthiness of AI systems that involve humans. Our research also emphasizes that explainability and state-of-the-art performance of ML models for manoeuvre anticipation go hand in hand without negatively affecting predictive effectiveness.
Chengxin Chen, Pengyuan Zhang
Previous research has looked into ways to improve speech emotion recognition
(SER) by utilizing both acoustic and linguistic cues of speech. However, the
potential association between state-of-the-art ASR models and the SER task has
yet to be investigated. In this paper, we propose a novel channel and
temporal-wise attention RNN (CTA-RNN) architecture based on the intermediate
representations of pre-trained ASR models. Specifically, the embeddings of a
large-scale pre-trained end-to-end ASR encoder contain both acoustic and
linguistic information, as well as the ability to generalize to different
speakers, making them well suited for the downstream SER task. To further exploit
the embeddings from different layers of the ASR encoder, we propose a novel
CTA-RNN architecture to capture the emotionally salient parts of embeddings in
both the channel and temporal directions. We evaluate our approach on two
popular benchmark datasets, IEMOCAP and MSP-IMPROV, using both within-corpus
and cross-corpus settings. Experimental results show that our proposed method
can achieve excellent performance in terms of accuracy and robustness.
Authors' comments: 5 pages, 2 figures, submitted to INTERSPEECH 2022
Ehsan Kamalloo, Mehdi Rezagholizadeh, Ali Ghodsi
Data Augmentation (DA) is known to improve the generalizability of deep
neural networks. Most existing DA techniques naively add a certain number of
augmented samples without considering the quality and the added computational
cost of these samples. To tackle this problem, a common strategy, adopted by
several state-of-the-art DA methods, is to adaptively generate or re-weight
augmented samples with respect to the task objective during training. However,
these adaptive DA methods: (1) are computationally expensive and not
sample-efficient, and (2) are designed merely for a specific setting. In this
work, we present a universal DA technique, called Glitter, to overcome both
issues. Glitter can be plugged into any DA method, making training
sample-efficient without sacrificing performance. From a pre-generated pool of
augmented samples, Glitter adaptively selects a subset of worst-case samples
with maximal loss, analogous to adversarial DA. Without altering the training
strategy, the task objective can be optimized on the selected subset. Our
thorough experiments on the GLUE benchmark, SQuAD, and HellaSwag in three
widely used training setups including consistency training, self-distillation
and knowledge distillation reveal that Glitter is substantially faster to train
and achieves a competitive performance, compared to strong baselines.
Authors' comments: ACL 2022 Findings
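The selection step described in the Glitter abstract (keeping, from a pre-generated pool, the augmented samples with maximal loss) can be sketched as below. This is an illustrative sketch only: the function name `select_worst_case`, the per-example list layout, and the parameter `k` are assumptions, and the losses are taken as already computed by whatever model is being trained.

```python
def select_worst_case(pool_losses, k):
    """Pick the k highest-loss augmented samples for each training example.

    pool_losses: one inner list per training example, holding the current
                 loss of every pre-generated augmented sample of that example.
    Returns, per example, the indices of the k samples with the largest
    loss, ordered from highest to lowest loss.
    """
    selected = []
    for losses in pool_losses:
        # Sort sample indices by descending loss and keep the first k.
        ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
        selected.append(ranked[:k])
    return selected


# Two training examples, three augmented candidates each (hypothetical losses).
losses = [[0.1, 0.9, 0.5],
          [2.0, 0.3, 1.0]]
chosen = select_worst_case(losses, k=2)
```

The task objective would then be optimized only on the chosen subset, so the per-step cost depends on `k` rather than on the size of the pool.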
Ilias Chalkidis, Anders Søgaard
In document classification for, e.g., legal and biomedical text, we often
deal with hundreds of classes, including very infrequent ones, as well as
temporal concept drift caused by the influence of real world events, e.g.,
policy changes, conflicts, or pandemics. Class imbalance and drift can
sometimes be mitigated by resampling the training data to simulate (or
compensate for) a known target distribution, but what if the target
distribution is determined by unknown future events? Instead of simply
resampling uniformly to hedge our bets, we focus on the underlying optimization
algorithms used to train such document classifiers and evaluate several
group-robust optimization algorithms, initially proposed to mitigate
group-level disparities. Reframing group-robust algorithms as adaptation
algorithms under concept drift, we find that Invariant Risk Minimization and
Spectral Decoupling outperform sampling-based approaches to class imbalance and
concept drift, and lead to much better performance on minority classes. The
effect is more pronounced the larger the label set.
Authors' comments: 9 pages, long paper at ACL 2022 Findings
Shivani Kumar, Atharva Kulkarni, Md Shad Akhtar, Tanmoy Chakraborty
Indirect speech such as sarcasm achieves a constellation of discourse goals
in human communication. While the indirectness of figurative language allows
speakers to achieve certain pragmatic goals, it is challenging for AI agents to
comprehend such idiosyncrasies of human communication. Though sarcasm
identification has been a well-explored topic in dialogue analysis, for
conversational systems to truly grasp a conversation's innate meaning and
generate appropriate responses, simply detecting sarcasm is not enough; it is
vital to explain its underlying sarcastic connotation to capture its true
essence. In this work, we study the discourse structure of sarcastic
conversations and propose a novel task - Sarcasm Explanation in Dialogue (SED).
Set in a multimodal and code-mixed setting, the task aims to generate natural
language explanations of satirical conversations. To this end, we curate WITS,
a new dataset to support our task. We propose MAF (Modality Aware Fusion), a
multimodal context-aware attention and global information fusion module to
capture multimodality and use it to benchmark WITS. The proposed attention
module surpasses the traditional multimodal fusion baselines and reports the
best performance on almost all metrics. Lastly, we carry out detailed analyses
both quantitatively and qualitatively.
Authors' comments: Accepted in ACL 2022. 13 pages, 4 figures, 12 tables
Azadeh Khaleghi, Lukas Zierahn
We introduce PyChEst, a Python package which provides tools for the simultaneous estimation of multiple changepoints in the distribution of piece-wise stationary time series. The nonparametric algorithms implemented are provably consistent in a general framework: when the samples are generated by unknown piece-wise stationary processes. In this setting, samples may have long-range dependencies of arbitrary form and the finite-dimensional marginals of any (unknown) fixed size before and after the changepoints may be the same. The strength of the algorithms included in the package is in their ability to consistently detect the changes without imposing any assumptions beyond stationarity on the underlying process distributions. We illustrate this distinguishing feature by comparing the performance of the package against state-of-the-art models designed for a setting where the samples are independently and identically distributed.
Yaxu Xie, Fangwen Shu, Jason Rambach, Alain Pagani, Didier Stricker
Piece-wise 3D planar reconstruction provides holistic scene understanding of
man-made environments, especially for indoor scenarios. Most recent approaches
focused on improving the segmentation and reconstruction results by introducing
advanced network architectures but overlooked the dual characteristics of
piece-wise planes as objects and geometric models. Different from other
existing approaches, we start from enforcing cross-task consistency for our
multi-task convolutional neural network, PlaneRecNet, which integrates a
single-stage instance segmentation network for piece-wise planar segmentation
and a depth decoder to reconstruct the scene from a single RGB image. To
achieve this, we introduce several novel loss functions (geometric constraint)
that jointly improve the accuracy of piece-wise planar segmentation and depth
estimation. Meanwhile, a novel Plane Prior Attention module is used to guide
depth estimation with the awareness of plane instances. Exhaustive experiments
are conducted in this work to validate the effectiveness and efficiency of our
method.
Authors' comments: accepted to BMVC 2021, code open source:
https://github.com/EryiXie/PlaneRecNet
Mahsa N. Shirazi
Two perfect matchings $P$ and $Q$ of the complete graph on $2k$ vertices are
said to be set-wise $t$-intersecting if there exist edges $P_{1}, \cdots,
P_{t}$ in $P$ and $Q_{1}, \cdots, Q_{t}$ in $Q$ such that the union of edges
$P_{1}, \cdots, P_{t}$ has the same set of vertices as the union of $Q_{1},
\cdots, Q_{t}$. In this paper we prove an extension of the famous
Erd\H{o}s-Ko-Rado (EKR) theorem to set-wise $2$-intersecting families of
perfect matchings for all values of $k$, and we conjecture a similar statement for
all $t\geq 2$.
Authors' comments: arXiv admin note: text overlap with arXiv:2008.08503
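The definition of set-wise $t$-intersecting perfect matchings given in this abstract can be checked by brute force for small $k$. The sketch below is only an illustration of the definition; the function names and the representation of a matching as a list of vertex pairs are assumptions made for the example.

```python
from itertools import combinations


def vertex_set(edges):
    """Set of vertices covered by a collection of edges."""
    return frozenset(v for edge in edges for v in edge)


def is_setwise_t_intersecting(P, Q, t):
    """True if some t edges of P cover the same vertices as some t edges of Q.

    P, Q: perfect matchings on the same vertex set, given as lists of
          2-tuples of vertices.
    """
    covered_by_P = {vertex_set(choice) for choice in combinations(P, t)}
    return any(vertex_set(choice) in covered_by_P
               for choice in combinations(Q, t))


# Perfect matchings of the complete graph on 6 vertices (k = 3).
P = [(0, 1), (2, 3), (4, 5)]
Q = [(0, 2), (1, 3), (4, 5)]   # edges (0,2),(1,3) cover {0,1,2,3}, as do (0,1),(2,3)
R = [(0, 2), (1, 4), (3, 5)]   # no pair of edges covers the same set as a pair from P

print(is_setwise_t_intersecting(P, Q, 2))  # True
print(is_setwise_t_intersecting(P, R, 2))  # False
```

Any two perfect matchings are trivially set-wise $k$-intersecting, since all $k$ edges of each cover the whole vertex set.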
Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar
Most recent work on probing representations has focused on BERT,
with the presumption that the findings might be similar for other models. In
this work, we extend the probing studies to two other models in the family,
namely ELECTRA and XLNet, showing that variations in the pre-training
objectives or architectural choices can result in different behaviors in
encoding linguistic information in the representations. Most notably, we
observe that ELECTRA tends to encode linguistic knowledge in the deeper layers,
whereas XLNet instead concentrates it in the earlier layers. Also, the former
model undergoes a slight change during fine-tuning, whereas the latter
experiences significant adjustments. Moreover, we show that drawing conclusions
based on the weight mixing evaluation strategy -- which is widely used in the
context of layer-wise probing -- can be misleading given the norm disparity of
the representations across different layers. Instead, we adopt an alternative
information-theoretic probing with minimum description length, which has
recently been proven to provide more reliable and informative results.
Authors' comments: Accepted to BlackboxNLP Workshop at EMNLP 2021