Arya Bangun, Oleh Melnyk, Benjamin März, Benedikt Diederichs, Alexander Clausen, Dieter Weber, Frank Filbir, Knut Müller-Caspary
We propose algorithms based on an optimisation method for inverse multislice ptychography in, e.g. electron microscopy. The multislice method is widely used to model the interaction between relativistic electrons and thick specimens. Since only the intensity of diffraction patterns can be recorded, the challenge in applying inverse multislice ptychography is to uniquely reconstruct the electrostatic potential in each slice up to some ambiguities. In this conceptual study, we show that a unique separation of atomic layers for simulated data is possible when considering a low acceleration voltage. We also introduce an adaptation for estimating the illuminating probe. For the sake of practical application, we finally present slice reconstructions using experimental 4D scanning transmission electron microscopy (STEM) data.
Zhaofeng Si, Honggang Qi, Xiaoyu Song
Convolutional neural networks are prevailing in deep learning tasks. However,
they suffer from massive cost issues when working on mobile devices. Network
pruning is an effective method of model compression to handle such problems.
This paper presents a novel structured network pruning method with auxiliary
gating structures that assign importance marks to blocks in the backbone network
as a pruning criterion. Block-wise pruning is then realized by the proposed
voting strategy, which differs from prevailing methods that prune a model
at a small granularity, such as channel-wise. We further develop a three-stage
training schedule for the proposed architecture, incorporating knowledge
distillation for better performance. Our experiments demonstrate that our
method can achieve state-of-the-art compression performance on
classification tasks. In addition, our approach can integrate synergistically
with other pruning methods by providing pretrained models, thus achieving
better performance than the unpruned model with over 93% of FLOPs removed.
Authors' comments: 7 pages, 7 figures, 2 tables
Alberto Mellone, Giordano Scarciotti
We address the path-wise control of systems described by a set of nonlinear stochastic differential equations. For this class of systems, we introduce a notion of stochastic relative degree and a change of coordinates which transforms the dynamics to a stochastic normal form. The normal form is instrumental for the design of a state-feedback control which linearises the dynamics and makes them deterministic. We observe that this control is idealistic, i.e. it is not practically implementable because it employs a feedback of the Brownian motion (which is never available) to cancel the noise. Using the idealistic control as a starting point, we introduce a hybrid control architecture which achieves practical path-wise control. This hybrid controller uses measurements of the state to perform periodic compensations for the noise contribution to the dynamics. We prove that the hybrid controller recovers the idealistic performance in the limit as the compensating period approaches zero. We address the problem of asymptotic output tracking, solving it in both the idealistic and the practical framework. We finally validate the theory by means of a numerical example.
Yajing Feng, Qian Hu, Zhenzhou Tang
Vacant parking space (VPS) prediction is one of the key issues of intelligent parking guidance systems. Accurate VPS predictions help drivers find parking spaces quickly, reducing wasted time and unnecessary environmental pollution. Through a simple analysis of historical data, we found not only an obvious temporal correlation within each parking lot, but also a clear spatial correlation between different parking lots. In view of this, this paper proposes a graph-based model, ST-GBGRU (Spatial-Temporal Graph Based Gated Recurrent Unit), which predicts the number of VPSs both in the short term (i.e., within 30 min) and in the long term (i.e., over 30 min). The temporal correlation of historical VPS data is extracted by a GRU, while the spatial correlation is extracted by a GCN inside the GRU. Two prediction methods, namely direct prediction and iterative prediction, are combined with the proposed model. Finally, the prediction model is applied to predict the number of VPSs in 8 public parking lots in Santa Monica. The results show that, in both the short-term and long-term prediction tasks, the ST-GBGRU model achieves high accuracy and has good application prospects.
E. Glikman, M. Lacy, S. LaMassa, C. Bradley, S. G. Djorgovski, T. Urrutia, E. L. Gates, M. J. Graham et al.
We present a highly complete sample of broad-line (Type 1) QSOs out to z ~ 3
selected by their mid-infrared colors, a method that is minimally affected by
dust reddening. We remove host galaxy emission from the spectra and fit for
excess reddening in the residual QSOs, resulting in a Gaussian distribution of
colors for unreddened (blue) QSOs, with a tail extending toward heavily
reddened (red) QSOs, defined as having E(B - V) > 0.25. This radio-independent
selection method enables us to compare red and blue QSO radio properties in
both the FIRST (1.4 GHz) and VLASS (2 - 4 GHz) surveys. Consistent with recent
results from optically-selected QSOs from SDSS, we find that red QSOs have a
significantly higher detection fraction and a higher fraction of compact radio
morphologies at both frequencies. We employ radio stacking to investigate the
median radio properties of the QSOs including those that are undetected in
FIRST and VLASS, finding that red QSOs have significantly brighter radio
emission and steeper radio spectral slopes compared with blue QSOs. Finally, we
find that the incidence of red QSOs is strongly luminosity dependent, where red
QSOs make up > 40% of all QSOs at the highest luminosities. Overall, red QSOs
comprise ~ 40% of higher luminosity QSOs, dropping to only a few percent at
lower luminosities. Furthermore, red QSOs make up a larger percentage of the
radio-detected QSO population. We argue that dusty AGN-driven winds are
responsible for both the obscuration as well as excess radio emission seen in
red QSOs.
Authors' comments: Accepted for publication in ApJ; 35 pages, 24 Figures, 6 Tables
Sarem Seitz
Gaussian Processes (GPs) are a versatile and popular method in Bayesian Machine Learning. A common modification is the Sparse Variational Gaussian Process (SVGP), which is well suited to large datasets. While GPs allow Gaussian-distributed target variables to be handled elegantly in closed form, their applicability can be extended to non-Gaussian data as well. These extensions are usually impossible to treat in closed form and hence require approximate solutions. This paper proposes to approximate the inverse-link function, which is necessary when working with non-Gaussian likelihoods, by a piece-wise constant function. It is shown that this yields a closed-form solution for the corresponding SVGP lower bound. In addition, it is demonstrated how the piece-wise constant function itself can be optimized, resulting in an inverse-link function that can be learnt from the data at hand.
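As a rough illustration of why a piece-wise constant function is convenient here, the sketch below computes the expectation of such a function under a Gaussian in closed form. It is not the paper's SVGP bound, only the basic building block: for f ~ N(mu, sigma^2) and g(f) = c_i on the i-th interval, E[g(f)] is the sum of c_i times the Gaussian mass of that interval. The function names, breakpoints, and levels are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_piecewise_constant(mu, sigma, breakpoints, levels):
    """E[g(f)] for f ~ N(mu, sigma^2).

    g takes the value levels[i] on [edges[i], edges[i + 1]), where
    edges = [-inf, *breakpoints, +inf] (an assumed parameterisation).
    """
    if len(levels) != len(breakpoints) + 1:
        raise ValueError("need exactly one more level than breakpoints")
    edges = [-math.inf, *breakpoints, math.inf]
    total = 0.0
    for lo, hi, level in zip(edges[:-1], edges[1:], levels):
        # Probability mass that the Gaussian assigns to [lo, hi).
        mass = normal_cdf((hi - mu) / sigma) - normal_cdf((lo - mu) / sigma)
        total += level * mass
    return total
```

For example, `expected_piecewise_constant(0.0, 1.0, [0.0], [0.0, 1.0])` evaluates a unit step at zero under a standard normal, which is simply the probability that f is non-negative.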
Chen Tang, Haoyu Zhai, Kai Ouyang, Zhi Wang, Yifei Zhu, Wenwu Zhu
Conventional model quantization methods apply a fixed quantization scheme to all data samples, ignoring the inherent differences in "recognition difficulty" between samples. We propose to process different data samples with different quantization schemes to achieve data-dependent dynamic inference at a fine-grained layer level. However, enabling this adaptive inference with changeable layer-wise quantization schemes is challenging because the number of combinations of bit-widths and layers grows exponentially, making it extremely difficult to train a single model in such a vast search space and use it in practice. To solve this problem, we present the Arbitrary Bit-width Network (ABN), where the bit-widths of a single deep network can change at runtime for different data samples, with a layer-wise granularity. Specifically, we first build a weight-shared layer-wise quantizable "super-network" in which each layer can be allocated multiple bit-widths and thus quantized differently on demand. The super-network provides a considerably large number of combinations of bit-widths and layers, each of which can be used during inference without retraining or storing myriad models. Second, based on the well-trained super-network, each layer's runtime bit-width selection decision is modeled as a Markov Decision Process (MDP) and solved by an adaptive inference strategy accordingly. Experiments show that the super-network can be built without accuracy degradation, and the bit-width allocation of each layer can be adjusted to deal with various inputs on the fly. On ImageNet classification, we achieve a 1.1% top-1 accuracy improvement while saving 36.2% BitOps.
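To make the idea of quantizing the same weights at different bit-widths concrete, here is a minimal sketch. The abstract does not specify the quantizer, so symmetric uniform quantization is assumed purely for illustration; `quantize_uniform` is a hypothetical helper and not part of ABN.

```python
def quantize_uniform(weights, bits):
    """Quantize and dequantize weights with a symmetric uniform grid.

    Assumption: values are mapped to signed integers in
    [-(2**(bits - 1) - 1), 2**(bits - 1) - 1] using a single scale.
    """
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    if scale == 0.0:
        return list(weights)  # all-zero input is already representable
    return [round(w / scale) * scale for w in weights]


def max_abs_error(weights, bits):
    """Largest absolute rounding error at the given bit-width."""
    quantized = quantize_uniform(weights, bits)
    return max(abs(w - q) for w, q in zip(weights, quantized))
```

Calling `max_abs_error` on the same weight list with, say, 2 and 8 bits shows the accuracy side of the trade-off that a per-sample, per-layer bit-width choice has to balance against compute.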
Sidike Paheding, Abel A. Reyes, Anush Kasaragod, Thomas Oommen
Hyperspectral image (HSI) classification is the most vibrant area of research
in the hyperspectral community because the rich spectral information contained
in HSI can greatly aid in identifying objects of interest. However, the inherent
non-linearity between materials and the corresponding spectral profiles brings
two major challenges in HSI classification: interclass similarity and
intraclass variability. Many advanced deep learning methods have attempted to
address these issues from the perspective of a region/patch-based approach,
instead of a pixel-based alternative. However, patch-based approaches
assume that the neighboring pixels of a target pixel in a fixed spatial
window belong to the same class, and this assumption is not always true. To
address this problem, we herein propose a new deep learning architecture,
namely Gramian Angular Field encoded Neighborhood Attention U-Net (GAF-NAU),
for pixel-based HSI classification. The proposed method does not require
regions or patches centered around a raw target pixel to perform 2D-CNN based
classification. Instead, our approach transforms the 1D pixel vector in HSI into a 2D
angular feature space using Gramian Angular Field (GAF) and then embeds it into a
new neighborhood attention network to suppress irrelevant angular features while
emphasizing pertinent features useful for the HSI classification task.
Evaluation results on three publicly available HSI datasets demonstrate the
superior performance of the proposed model.
Authors' comments: 8 Pages, 9 Figures
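The Gramian Angular Field step mentioned in the abstract above can be sketched as follows. This is a generic GAF encoding, not the authors' code; the abstract does not say which variant is used, so the summation form G[i][j] = cos(phi_i + phi_j) is assumed, and the function name is hypothetical.

```python
import math


def gramian_angular_summation_field(spectrum):
    """Encode a 1D sequence as a 2D matrix with G[i][j] = cos(phi_i + phi_j)."""
    lo, hi = min(spectrum), max(spectrum)
    if hi == lo:
        raise ValueError("a constant sequence cannot be rescaled to [-1, 1]")
    # Min-max rescale to [-1, 1] so that arccos is defined.
    scaled = [(2.0 * x - hi - lo) / (hi - lo) for x in spectrum]
    # Clamp to guard against floating-point overshoot before arccos.
    phi = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]
```

A spectral vector with B bands yields a symmetric B x B matrix, which is what allows a 2D network to operate on a single pixel without spatial neighbours.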
Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian
Accent variability has posed a huge challenge to automatic speech
recognition (ASR) modeling. Although one-hot accent vector based adaptation
systems are commonly used, they require prior knowledge about the target accent
and cannot handle unseen accents. Furthermore, simply concatenating accent
embeddings does not make good use of accent knowledge, which has limited
improvements. In this work, we aim to tackle these problems with a novel
layer-wise adaptation structure injected into the E2E ASR model encoder. The
adapter layer encodes an arbitrary accent in the accent space and assists the
ASR model in recognizing accented speech. Given an utterance, the adaptation
structure extracts the corresponding accent information and transforms the
input acoustic feature into an accent-related feature through the linear
combination of all accent bases. We further explore the injection position of
the adaptation layer, the number of accent bases, and different types of accent
bases to achieve better accent adaptation. Experimental results show that the
proposed adaptation structure brings 12% and 10% relative word error
rate (WER) reduction on the AESRC2020 accent dataset and the Librispeech
dataset, respectively, compared to the baseline.
Authors' comments: Accepted by Interspeech2021
Shana Moothedath, Namrata Vaswani
This work develops a provably accurate, fully decentralized alternating projected gradient descent (GD) algorithm for recovering a low rank (LR) matrix from mutually independent projections of each of its columns, in a fast and communication-efficient fashion. To the best of our knowledge, this work is the first attempt to develop a provably correct decentralized algorithm (i) for any problem involving the use of an alternating projected GD algorithm, and (ii) for any problem in which the constraint set to be projected onto is non-convex.
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, Rongrong Ji
Despite its exciting performance, the Transformer is criticized for its excessive parameters and computation cost. However, compressing the Transformer remains an open problem due to the internal complexity of its layer designs, i.e., Multi-Head Attention (MHA) and Feed-Forward Network (FFN). To address this issue, we introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer. LW-Transformer applies Group-wise Transformation to reduce both the parameters and computations of the Transformer, while also preserving its two main properties, i.e., the efficient attention modeling on diverse subspaces of MHA, and the expanding-scaling feature transformation of FFN. We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets. Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks. To examine the generalization ability, we also apply our optimization strategy to a recently proposed image Transformer called Swin-Transformer for image classification, where the effectiveness is also confirmed.
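The parameter saving from a group-wise transformation can be illustrated with a generic grouped linear layer. This is a sketch under the assumption that the input is split into equal groups, each with its own weight matrix; the actual LW-Transformer design (for example, any weight sharing between groups) is not described in the abstract, and all names below are hypothetical.

```python
def dense_linear_params(d_in, d_out):
    """Weight count of an ordinary linear map (bias ignored)."""
    return d_in * d_out


def groupwise_linear_params(d_in, d_out, groups):
    """Weight count when each group maps d_in/groups to d_out/groups features."""
    if d_in % groups or d_out % groups:
        raise ValueError("dimensions must be divisible by the number of groups")
    return groups * (d_in // groups) * (d_out // groups)


def groupwise_linear(x, group_weights):
    """Apply one weight matrix per group and concatenate the outputs.

    x is a flat list of features; group_weights[g] is a list of rows,
    each row having len(x) // len(group_weights) entries.
    """
    groups = len(group_weights)
    if len(x) % groups:
        raise ValueError("input length must be divisible by the number of groups")
    size = len(x) // groups
    out = []
    for g, weight in enumerate(group_weights):
        chunk = x[g * size:(g + 1) * size]
        out.extend(sum(w * v for w, v in zip(row, chunk)) for row in weight)
    return out
```

With `groups` groups, the weight count drops by a factor of `groups` relative to the dense layer, at the cost of no direct mixing between groups inside that layer.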
Saira Soomro, Arjumand Bano Soomro, Tarique Bhatti, Yonis Gulzar
Blended learning (BL) is a recent trend among the many options that can best fit
learners' needs, regardless of time and place. This study aimed to discover
students' perceptions of BL and the challenges they face while using
technology. This quantitative study used data gathered from 300 students
enrolled in four public universities in the Sindh province of Pakistan. The
findings show that students were comfortable with the use of technology, and that it
has a positive effect on their academic experience. The study also showed that
the use of technology encourages peer collaboration. The challenges found
include the following: neither teacher support nor a training program was provided to
students for courses that needed to shift from a traditional face-to-face
paradigm to a blended format; there was a lack of skills among laboratory
assistants for courses with a blended format; and there was a shortage of high-tech
computer laboratories and computer units to run these courses. Therefore, it is
recommended that the authorities develop and incorporate a comprehensive
mechanism for the effective implementation of BL in the
teaching-learning process. Heads of departments should also provide
additional computing infrastructure to their departments.
Authors' comments: 5 pages
Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, Stefano Soatto
In this paper, we study the challenging instance-wise vision-language tasks, where the free-form language is required to align with the objects instead of the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams are independent until the end and they are aligned using an efficient dot-product operation. The whole network is trained end-to-end, such that the detector is optimized for the vision-language tasks instead of an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other weak types of supervision to expand the knowledge coverage. This simple yet effective architecture of X-DETR shows good accuracy and fast speeds for multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ~20 frames per second without using any LVIS annotation during training.
Wenjing Chen, Ruida Zhou, Chao Tian, Cong Shen
We analyze the performance of the Borda counting algorithm in a non-parametric model. The algorithm needs to utilize probabilistic rankings of the items within $m$-sized subsets to accurately determine which items are the overall top-$k$ items in a total of $n$ items. The Borda counting algorithm simply counts the cumulative scores for each item from these partial ranking observations. This generalizes a previous work of a similar nature by Shah et al. using probabilistic pairwise comparison data. The performance of the Borda counting algorithm critically depends on the associated score separation $\Delta_k$ between the $k$-th item and the $(k+1)$-th item. Specifically, we show that if $\Delta_k$ is greater than a certain value, then the top-$k$ items selected by the algorithm are asymptotically accurate almost surely; if $\Delta_k$ is below a certain value, then the result will be inaccurate with a constant probability. In the special case of $m=2$, i.e., pairwise comparison, the resultant bound is tighter than that given by Shah et al., leading to a reduced gap between the error probability upper and lower bounds. These results are further extended to the approximate top-$k$ selection setting. Numerical experiments demonstrate the effectiveness and accuracy of the Borda counting algorithm, compared with the spectral MLE-based algorithm, particularly when the data does not necessarily follow an assumed parametric model.
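A minimal sketch of Borda counting over partial rankings may help fix ideas. The scoring rule is an assumption: an item at 0-based position p of an m-item ranking earns m - 1 - p points, and ties are broken by item name. The abstract only says that cumulative scores are counted, so the paper's exact scoring and normalisation may differ.

```python
from collections import defaultdict


def borda_top_k(partial_rankings, k):
    """Return the k items with the highest cumulative Borda score.

    partial_rankings is an iterable of lists, each ordering a subset of
    items from best to worst.
    """
    scores = defaultdict(int)
    for ranking in partial_rankings:
        m = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += m - 1 - position
    # Sort by descending score; break ties by name for determinism.
    ordered = sorted(scores, key=lambda item: (-scores[item], str(item)))
    return ordered[:k]
```

For pairwise data (m = 2) this reduces to counting wins, which corresponds to the pairwise comparison setting referred to in the abstract.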
Jianan Wang, Guansong Lu, Hang Xu, Zhenguo Li, Chunjing Xu, Yanwei Fu
Existing text-guided image manipulation methods aim to modify the appearance
of the image or to edit a few objects in a virtual or simple scenario, which is
far from practical application. In this work, we study a novel task of
text-guided image manipulation on the entity level in the real world. The task
imposes three basic requirements: (1) to edit the entity consistently with the
text descriptions, (2) to preserve the text-irrelevant regions, and (3) to
merge the manipulated entity into the image naturally. To this end, we propose
a new transformer-based framework built on the two-stage image synthesis
method, namely ManiTrans, which can not only edit the appearance of
entities but also generate new entities corresponding to the text guidance. Our
framework incorporates a semantic alignment module to locate the image regions
to be manipulated, and a semantic loss to help align the relationship between
vision and language. We conduct extensive experiments on the real-world
CUB, Oxford, and COCO datasets to verify that our method can distinguish the
relevant and irrelevant regions and achieve more precise and flexible
manipulation compared with baseline methods. The project homepage is
https://jawang19.github.io/manitrans.
Authors' comments: Accepted by CVPR2022 (Oral)
Ze Wang, Guogang Liao, Xiaowen Shi, Xiaoxu Wu, Chuheng Zhang, Yongkang Wang, Xingxing Wang, Dong Wang
With the recent prevalence of reinforcement learning (RL), there has been
tremendous interest in utilizing RL for ads allocation in recommendation
platforms (e.g., e-commerce and news feed sites). To achieve better allocation,
the input of recent RL-based ads allocation methods is upgraded from point-wise
single item to list-wise item arrangement. However, this also results in a
high-dimensional space of state-action pairs, making it difficult to learn
list-wise representations with good generalization ability. This further
hinders the exploration of RL agents and causes poor sample efficiency. To
address this problem, we propose a novel RL-based approach for ads allocation
which learns better list-wise representations by leveraging task-specific
signals on the Meituan food delivery platform. Specifically, we propose three
different auxiliary tasks based on reconstruction, prediction, and contrastive
learning, respectively, according to prior domain knowledge on ads allocation. We
conduct extensive experiments on the Meituan food delivery platform to evaluate the
effectiveness of the proposed auxiliary tasks. Both offline and online
experimental results show that the proposed method can learn better list-wise
representations and achieve higher revenue for the platform compared to the
state-of-the-art baselines.
Authors' comments: Accepted by CIKM-22
Arshdeep Singh
This paper presents an alternative to the commonly used
time-frequency representation for acoustic scene classification (ASC). A raw
audio signal is represented using the various intermediate layers of a
pre-trained convolutional neural network (CNN). The study assumes that the
representations obtained from the intermediate layers are intrinsically
low-dimensional. To obtain low-dimensional embeddings, principal component
analysis is performed, and the study finds that only a few principal
components are significant. However, the appropriate number of significant
components is not known. To address this, an automatic dictionary learning
framework is utilized that approximates the underlying subspace. Further, the
low-dimensional embeddings are aggregated in a late-fusion manner in the
ensemble framework to incorporate hierarchical information learned at various
intermediate layers. The experimental evaluation is performed on publicly
available DCASE 2017 and 2018 ASC datasets on a pre-trained 1-D CNN, SoundNet.
Empirically, it is observed that deeper layers admit a higher compression ratio than
others. At a 70% compression ratio across different datasets, the performance is
similar to that obtained without performing any dimensionality reduction. The
proposed framework outperforms time-frequency representation based methods.
Zhifang Fan, Dan Ou, Yulong Gu, Bairan Fu, Xiang Li, Wentian Bao, Xin-Yu Dai, Xiaoyi Zeng et al.
Modeling users' historical feedback is essential for Click-Through Rate Prediction in personalized search and recommendation. Existing methods usually model only users' positive feedback, such as click sequences, and neglect the context of that feedback. In this paper, we propose a new perspective for context-aware user behavior modeling that includes all products exposed on each page, together with the corresponding feedback, as a contextualized page-wise feedback sequence. The intra-page context information and inter-page interest evolution can be captured to learn more specific user preferences. We design a novel neural ranking model, RACP (i.e., Recurrent Attention over Contextualized Page sequence), which utilizes page-context aware attention to model the intra-page context. A recurrent attention process is used to model the cross-page interest convergence evolution as denoising the interest in the previous pages. Experiments on public and real-world industrial datasets verify our model's effectiveness.
Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji, Andros Tjandra, Sakriani Sakti
Several solutions for lightweight TTS have shown promising results. Still,
they either rely on a hand-crafted design that reaches a non-optimal size or use
neural architecture search, which often incurs high training costs. We present
Nix-TTS, a lightweight TTS model obtained via knowledge distillation from a
high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free)
TTS teacher model. Specifically, we offer module-wise distillation, enabling
flexible and independent distillation of the encoder and decoder modules. The
resulting Nix-TTS inherits the advantageous properties of being
non-autoregressive and end-to-end from the teacher, yet is significantly smaller
in size, with only 5.23M parameters, a reduction of up to 89.34% relative to the teacher
model; it also achieves over 3.04x and 8.36x inference speedup on an Intel i7 CPU
and a Raspberry Pi 3B, respectively, and still retains fair voice naturalness and
intelligibility compared to the teacher model. We provide pretrained models and
audio samples of Nix-TTS.
Authors' comments: Accepted at SLT 2022 (https://slt2022.org/). Associated materials are
available at https://github.com/rendchevi/nix-tts
Minghao Chen, Fangyun Wei, Chong Li, Deng Cai
Prior works on action representation learning mainly focus on designing
various architectures to extract the global representations for short video
clips. In contrast, many practical applications such as video alignment have
strong demand for learning dense representations for long videos. In this
paper, we introduce a novel contrastive action representation learning (CARL)
framework to learn frame-wise action representations, especially for long
videos, in a self-supervised manner. Concretely, we introduce a simple yet
efficient video encoder that considers spatio-temporal context to extract
frame-wise representations. Inspired by the recent progress of self-supervised
learning, we present a novel sequence contrastive loss (SCL) applied on two
correlated views obtained through a series of spatio-temporal data
augmentations. SCL optimizes the embedding space by minimizing the
KL-divergence between the sequence similarity of two augmented views and a
prior Gaussian distribution of timestamp distance. Experiments on the FineGym,
PennAction and Pouring datasets show that our method outperforms the previous
state of the art by a large margin for downstream fine-grained action
classification. Surprisingly, despite not being trained on paired videos, our
approach also shows outstanding performance on video alignment and fine-grained
frame retrieval tasks. Code and models are available at
https://github.com/minghchen/CARL_code.
Authors' comments: Accepted by CVPR 2022
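The sequence contrastive loss described in the CARL abstract above can be approximated by the following sketch. It is an interpretation of the abstract, not the released code: the prior over frames of the second view is taken to be a normalised Gaussian in timestamp distance, the predicted distribution is a softmax over cosine similarities, and `sigma` and `temperature` are assumed hyperparameters with arbitrary defaults.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sequence_contrastive_loss(view_a, times_a, view_b, times_b,
                              sigma=1.0, temperature=0.1):
    """Mean KL(prior || predicted) over the frames of view_a.

    view_a and view_b are lists of frame embeddings; times_a and times_b
    are the timestamps of those frames in the original video.
    """
    loss = 0.0
    for embedding, stamp in zip(view_a, times_a):
        # Gaussian prior: frames of view_b close in time should be similar.
        prior = softmax([-(stamp - t) ** 2 / (2.0 * sigma ** 2) for t in times_b])
        # Predicted distribution from embedding similarity.
        predicted = softmax([cosine(embedding, other) / temperature
                             for other in view_b])
        loss += sum(p * math.log(p / q)
                    for p, q in zip(prior, predicted) if p > 0.0)
    return loss / len(view_a)
```

Under these assumptions the loss is smaller when temporally close frames of the two views have similar embeddings than when the correspondence is shuffled, which is the behaviour a frame-wise representation needs for alignment tasks.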