Zhanchao Huang, Wei Li, Xiang-Gen Xia, Hao Wang, Ran Tao
Arbitrary-oriented object detection (AOOD) has been widely applied to locate
and classify objects with diverse orientations in remote sensing images.
However, the inconsistent features for the localization and classification
tasks in AOOD models may lead to ambiguity and low-quality object predictions,
which constrains the detection performance. In this article, an AOOD method
called task-wise sampling convolutions (TS-Conv) is proposed. TS-Conv
adaptively samples task-wise features from respective sensitive regions and
maps these features together in alignment to guide a dynamic label assignment
for better predictions. Specifically, sampling positions of the localization
convolution in TS-Conv are supervised by the oriented bounding box (OBB)
prediction associated with spatial coordinates, while sampling positions and
convolutional kernel of the classification convolution are designed to be
adaptively adjusted according to different orientations for improving the
orientation robustness of features. Furthermore, a dynamic
task-consistent-aware label assignment (DTLA) strategy is developed to select
optimal candidate positions and assign labels dynamically according to ranked
task-aware scores obtained from TS-Conv. Extensive experiments on several
public datasets covering multiple scenes, multimodal images, and multiple
categories of objects demonstrate the effectiveness, scalability, and superior
performance of the proposed TS-Conv.
Authors' comments: 15 pages, 13 figures, 11 tables
Frank Kulwa, Chen Li, Marcin Grzegorzek, Md Mamunur Rahaman, Kimiaki Shirahama, Sergey Kosov
The use of Environmental Microorganisms (EMs) offers a highly efficient, low
cost and harmless remedy to environmental pollution, by monitoring and
decomposing pollutants. This relies on the EMs being correctly segmented
and identified. With the aim of enhancing the segmentation of weakly visible EM
images which are transparent, noisy and have low contrast, a Pairwise Deep
Learning Feature Network (PDLF-Net) is proposed in this study. The use of PDLFs
enables the network to focus more on the foreground (EMs) by concatenating the
pairwise deep learning features of each image to different blocks of the base
model SegNet. Leveraging the Shi and Tomasi descriptors, we extract each image's
deep features on the patches, which are centered at each descriptor using the
VGG-16 model. Then, to learn the intermediate characteristics between the
descriptors, pairing of the features is performed based on the Delaunay
triangulation theorem to form pairwise deep learning features. In this
experiment, the PDLF-Net achieves outstanding segmentation results of 89.24%,
63.20%, 77.27%, 35.15%, 89.72%, 91.44% and 89.30% on the accuracy, IoU, Dice,
VOE, sensitivity, precision and specificity, respectively.
Authors' comments: arXiv admin note: text overlap with arXiv:2102.12147
Weimin Zhu, Yi Zhang, DuanCheng Zhao, Jianrong Xu, Ling Wang
Elucidating and accurately predicting the druggability and bioactivities of molecules play a pivotal role in drug design and discovery and remain an open challenge. Recently, graph neural networks (GNNs) have made remarkable advancements in graph-based molecular property prediction. However, current graph-based deep learning methods neglect the hierarchical information of molecules and the relationships between feature channels. In this study, we propose a well-designed hierarchical informative graph neural network framework (termed HiGNN) for predicting molecular properties by utilizing a co-representation learning of molecular graphs and chemically synthesizable BRICS fragments. Furthermore, a plug-and-play feature-wise attention block is first designed in the HiGNN architecture to adaptively recalibrate atomic features after the message passing phase. Extensive experiments demonstrate that HiGNN achieves state-of-the-art predictive performance on many challenging drug discovery-associated benchmark datasets. In addition, we devise a molecule-fragment similarity mechanism to comprehensively investigate the interpretability of the HiGNN model at the subgraph level, indicating that HiGNN as a powerful deep learning tool can help chemists and pharmacists identify the key components of molecules for designing better molecules with desired properties or functions. The source code is publicly available at https://github.com/idruglab/hignn.
Woon Hyung Cho, Jiseon Shin, Young Duck Kim, George J. Jung
Mechanical exfoliation of graphene and its identification by optical
inspection is one of the milestones in condensed matter physics that sparked
the field of 2D materials. Finding regions of interest in the entire sample
space and identifying the layer number are routine tasks potentially amenable
to automation. We propose supervised pixel-wise classification methods that
perform well even with a small number of training images and require only
short computation time without a GPU. We introduce four different
tree-based machine learning algorithms -- decision tree, random forest, extreme
gradient boost, and light gradient boosting machine. We train them with five
optical microscopy images of graphene, and evaluate their performances with
multiple metrics and indices. We also discuss combinatorial machine learning
models between the three single classifiers and assess their performances in
identification and reliability. The code developed in this paper is open to the
public and will be released at github.com/gjung-group/Graphene_segmentation.
Authors' comments: 12 pages, 6 figures
Yile Wang, Yue Zhang
Contextualized word embeddings in language models have substantially advanced NLP. Intuitively, sentential information is integrated into the representation of words, which can help model polysemy. However, context sensitivity also leads to variance in representations, which may break the semantic consistency for synonyms. We quantify how much the contextualized embeddings of each word sense vary across contexts in typical pre-trained models. Results show that contextualized embeddings can be highly consistent across contexts. In addition, part-of-speech, number of word senses, and sentence length have an influence on the variance of sense representations. Interestingly, we find that word representations are position-biased, where the first words in different contexts tend to be more similar. We analyze this phenomenon and also propose a simple way to alleviate such bias in distance-based word sense disambiguation settings.
Tesi Xiao, Xia Xiao, Ming Chen, Youlong Chen
Feature embedding is one of the most essential steps when training deep learning-based Click-Through Rate prediction models, mapping high-dimensional sparse features to dense embedding vectors. Classic human-crafted embedding size selection methods are shown to be "sub-optimal" in terms of the trade-off between memory usage and model capacity. The trending methods in Neural Architecture Search (NAS) have demonstrated their efficiency in searching for embedding sizes. However, most existing NAS-based works suffer from expensive computational costs, the curse of dimensionality of the search space, and the discrepancy between continuous search space and discrete candidate space. Other works that prune embeddings in an unstructured manner fail to reduce the computational costs explicitly. In this paper, to address those limitations, we propose a novel strategy that searches for the optimal mixed-dimension embedding scheme by structurally pruning a super-net via Hard Auxiliary Mask. Our method aims to directly search candidate models in the discrete space using a simple and efficient gradient-based method. Furthermore, we introduce orthogonal regularity on embedding tables to reduce correlations within embedding columns and enhance representation capacity. Extensive experiments demonstrate that it can effectively remove redundant embedding dimensions without great performance loss.
H. F. M. Yao, M. E. Cluver, T. H. Jarrett, Gyula I. G. Jozsa, M. G. Santos, L. Marchetti, M. J. I. Brown, Y. A. Gordon et al.
The identification of AGN in large surveys has been hampered by seemingly discordant classifications arising from differing diagnostic methods, usually tracing distinct processes specific to a particular wavelength regime. However, as shown in Yao et al. (2020), the combination of optical emission line measurements and mid-infrared photometry can be used to optimise the discrimination capability between AGN and star formation activity. In this paper we test our new classification scheme by combining the existing GAMA-WISE data with high-quality MeerKAT radio continuum data covering 8 deg$^2$ of the GAMA G23 region. Using this sample of 1 841 galaxies (z < 0.25), we investigate the total infrared (derived from 12$\mu$m) to radio luminosity ratio, q(TIR), and its relationship to optical-infrared AGN and star-forming (SF) classifications. We find that while q(TIR) is efficient at detecting AGN activity in massive galaxies generally appearing quiescent in the infrared, it becomes less reliable for cases where the emission from star formation in the host galaxy is dominant. However, we find that the q(TIR) can identify up to 70 % more AGNs not discernible at optical and/or infrared wavelengths. The median q(TIR) of our SF sample is 2.57 $\pm$ 0.23 consistent with previous local universe estimates.
Huimin Huang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, Ruofeng Tong
Recently, a variety of vision transformers have been developed owing to their
capability of modeling long-range dependency. In current transformer-based
backbones for medical image segmentation, convolutional layers were replaced
with pure transformers, or transformers were added to the deepest encoder to
learn global context. However, there are two main challenges from a scale-wise
perspective: (1) intra-scale problem: the existing methods fall short in extracting
local-global cues in each scale, which may impact the signal propagation of
small objects; (2) inter-scale problem: the existing methods fail to explore
distinctive information from multiple scales, which may hinder the
representation learning from objects with widely variable size, shape and
location. To address these limitations, we propose a novel backbone, namely
ScaleFormer, with two appealing designs: (1) A scale-wise intra-scale
transformer is designed to couple the CNN-based local features with the
transformer-based global cues in each scale, where the row-wise and column-wise
global dependencies can be extracted by a lightweight Dual-Axis MSA. (2) A
simple and effective spatial-aware inter-scale transformer is designed to
interact among consensual regions in multiple scales, which can highlight the
cross-scale dependency and resolve the complex scale variations. Experimental
results on different benchmarks demonstrate that our ScaleFormer outperforms
the current state-of-the-art methods. The code is publicly available at:
https://github.com/ZJUGiveLab/ScaleFormer.
Authors' comments: Accepted to IJCAI 2022
Qiang Chen, Xiaokang Chen, Jian Wang, Shan Zhang, Kun Yao, Haocheng Feng, Junyu Han, Errui Ding et al.
Detection transformer (DETR) relies on one-to-one assignment, assigning one
ground-truth object to one prediction, for end-to-end detection without NMS
post-processing. It is known that one-to-many assignment, assigning one
ground-truth object to multiple predictions, succeeds in detection methods such
as Faster R-CNN and FCOS. However, the naive one-to-many assignment does not work
for DETR, and it remains challenging to apply one-to-many assignment to DETR
training. In this paper, we introduce Group DETR, a simple yet efficient DETR
training approach that introduces a group-wise way for one-to-many assignment.
This approach involves using multiple groups of object queries, conducting
one-to-one assignment within each group, and performing decoder self-attention
separately. It resembles data augmentation with automatically-learned object
query augmentation. It is also equivalent to simultaneously training
parameter-sharing networks of the same architecture, introducing more
supervision and thus improving DETR training. The inference process is the same
as DETR trained normally and only needs one group of queries without any
architecture modification. Group DETR is versatile and is applicable to various
DETR variants. The experiments show that Group DETR significantly speeds up the
training convergence and improves the performance of various DETR-based models.
Code will be available at https://github.com/Atten4Vis/GroupDETR.
Authors' comments: ICCV23 camera ready version
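The group-wise assignment described in the Group DETR abstract above can be illustrated with a small sketch. This is not the authors' implementation: the cost matrix is random, the function names `one_to_one_match` and `group_wise_assignment` are hypothetical, and a brute-force search stands in for the Hungarian matching that DETR-style detectors normally use. It only shows that matching one-to-one inside each query group gives every ground-truth object one prediction per group, which is one-to-many overall.

```python
from itertools import permutations
import random


def one_to_one_match(cost):
    """Brute-force minimum-cost one-to-one matching (tiny problems only).

    cost[q][g] is the cost of assigning query q to ground-truth object g.
    Returns a list whose entry g is the query index matched to object g.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best, best_total = list(queries), total
    return best


def group_wise_assignment(cost, num_groups):
    """Split the queries into equal groups and match inside each group."""
    group_size = len(cost) // num_groups
    assignments = []  # (global query index, ground-truth index) pairs
    for k in range(num_groups):
        start = k * group_size
        group_cost = cost[start:start + group_size]
        for g, q in enumerate(one_to_one_match(group_cost)):
            assignments.append((start + q, g))
    return assignments


# Hypothetical sizes: 3 groups of 4 queries each, 2 ground-truth objects.
rng = random.Random(0)
NUM_GROUPS, QUERIES_PER_GROUP, NUM_GT = 3, 4, 2
cost = [[rng.random() for _ in range(NUM_GT)]
        for _ in range(NUM_GROUPS * QUERIES_PER_GROUP)]
assignments = group_wise_assignment(cost, NUM_GROUPS)
for query, gt in assignments:
    print(f"query {query} -> ground truth {gt}")
```

At inference time the abstract states that only one group of queries is needed, so a sketch of inference would simply keep the first `QUERIES_PER_GROUP` queries.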
Cheng-Yen Hsieh, Yu-Chuan Chuang, An-Yeu Wu
Most existing studies improve the efficiency of split learning (SL) by
compressing the transmitted features. However, most works focus on
dimension-wise compression that transforms high-dimensional features into a
low-dimensional space. In this paper, we propose circular convolution-based
batch-wise compression for SL (C3-SL) to compress multiple features into one
single feature. To avoid information loss while merging multiple features, we
exploit the quasi-orthogonality of features in high-dimensional space with
circular convolution and superposition. To the best of our knowledge, we are
the first to explore the potential of batch-wise compression under the SL
scenario. Based on the simulation results on CIFAR-10 and CIFAR-100, our method
achieves a 16x compression ratio with negligible accuracy drops compared with
the vanilla SL. Moreover, C3-SL reduces memory overhead by 1152x and
computation overhead by 2.25x compared to the state-of-the-art dimension-wise
compression method.
Authors' comments: 6 pages, IEEE MLSP 2022, Github:
https://github.com/WesleyHsieh0806/Split-Learning-Compression
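The idea of merging several features into one with circular convolution and superposition, as described in the C3-SL abstract above, can be sketched with NumPy. This is a generic illustration in the style of holographic reduced representations, not the authors' encoder and decoder: the random Gaussian keys, the batch size `B`, the dimension `D`, and the use of circular correlation for decoding are all assumptions made for the example.

```python
import numpy as np


def circular_convolve(a, b):
    """Circular convolution along the last axis, computed with the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.shape[-1])


def circular_correlate(key, signal):
    """Circular correlation along the last axis (approximate inverse of binding)."""
    spectrum = np.conj(np.fft.rfft(key)) * np.fft.rfft(signal)
    return np.fft.irfft(spectrum, n=signal.shape[-1])


def cosine_matrix(x, y):
    """Pairwise cosine similarity between the rows of x and the rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T


rng = np.random.default_rng(0)
B, D = 4, 1024                                  # hypothetical batch size and dimension
features = rng.standard_normal((B, D))          # stand-ins for B intermediate features
keys = rng.standard_normal((B, D)) / np.sqrt(D)  # random keys are nearly orthogonal

# Encode: bind each feature to its key, then superpose the B results into
# a single D-dimensional vector (B features are sent as one).
compressed = circular_convolve(keys, features).sum(axis=0)

# Decode: correlating with key i returns a noisy copy of feature i; the
# contributions of the other features behave like low-amplitude noise.
decoded = circular_correlate(keys, compressed)

similarity = cosine_matrix(decoded, features)
best_match = similarity.argmax(axis=1)
print("compressed shape:", compressed.shape)
print("best matching original per decoded feature:", best_match)
```

The decoded features are only approximate, which is why the quasi-orthogonality of high-dimensional vectors matters: the larger `D` is relative to `B`, the smaller the cross-talk between merged features.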
Paul Boniol, Mohammed Meftah, Emmanuel Remy, Themis Palpanas
Data series classification is an important and challenging problem in data science. Explaining the classification decisions by finding the discriminant parts of the input that led the algorithm to its decisions is a real need in many applications. Convolutional neural networks perform well for the data series classification task; however, the explanations provided by this type of algorithm are poor for the specific case of multivariate data series. Addressing this important limitation is a significant challenge. In this paper, we propose a novel method that solves this problem by highlighting both the temporal and dimensional discriminant information. Our contribution is two-fold: we first describe a convolutional architecture that enables the comparison of dimensions; then, we propose a method that returns dCAM, a Dimension-wise Class Activation Map specifically designed for multivariate time series (and CNN-based models). Experiments with several synthetic and real datasets demonstrate that dCAM is not only more accurate than previous approaches, but the only viable solution for discriminant feature discovery and classification explanation in multivariate time series. This paper has appeared in SIGMOD'22.
Zhengqi Gao, Fan-Keng Sun, Mingran Yang, Sucheng Ren, Zikai Xiong, Marc Engeler, Antonio Burazer, Linda Wildling et al.
Data lies at the core of modern deep learning. The impressive performance of
supervised learning is built upon a base of massive accurately labeled data.
However, in some real-world applications, accurate labeling might not be
viable; instead, multiple noisy labels (instead of one accurate label) are
provided by several annotators for each data sample. Learning a classifier on
such a noisy training dataset is a challenging task. Previous approaches
usually assume that all data samples share the same set of parameters related
to annotator errors, while we demonstrate that label error learning should be
both annotator and data sample dependent. Motivated by this observation, we
propose a novel learning algorithm. The proposed method displays superiority
compared with several state-of-the-art baseline methods on MNIST, CIFAR-100,
and ImageNet-100. Our code is available at:
https://github.com/zhengqigao/Learning-from-Multiple-Annotator-Noisy-Labels.
Authors' comments: Accepted by ECCV 2022
Samson B. Akintoye, Liangxiu Han, Huw Lloyd, Xin Zhang, Darren Dancey, Haoming Chen, Daoqiang Zhang
Deep Neural Network (DNN) models are usually trained sequentially from one layer to another, which causes forward, backward, and update locking problems, leading to poor performance in terms of training time. The existing parallel strategies to mitigate these problems provide suboptimal runtime performance. In this work, we propose a novel layer-wise partitioning and merging, forward and backward pass parallel framework to provide better training performance. The novelty of the proposed work consists of 1) a layer-wise partition and merging model which can minimise communication overhead between devices without the memory cost of existing strategies during the training process; and 2) a forward pass and backward pass parallelisation and optimisation to address the update locking problem and minimise the total training cost. The experimental evaluation on real use cases shows that the proposed method outperforms the state-of-the-art approaches in terms of training speed and achieves almost linear speedup without compromising the accuracy performance of the non-parallel approach.
Yao Chen, Junhao Pan, Xinheng Liu, Jinjun Xiong, Deming Chen
Quantization for CNN has shown significant progress with the intention of
reducing the cost of computation and storage with low-bitwidth data
representations. There are, however, no systematic studies on how an existing
full-bitwidth processing unit, such as ALU in CPUs and DSP in FPGAs, can be
better utilized to deliver significantly higher computation throughput for
convolution under various quantized bitwidths. In this study, we propose
HiKonv, a unified solution that maximizes the throughput of convolution on a
given underlying processing unit with low-bitwidth quantized data inputs
through novel bit-wise management and parallel computation. We establish a
theoretical framework and performance models using a full-bitwidth multiplier
for highly parallelized low-bitwidth convolution, and demonstrate new
breakthroughs for high-performance computing in this critical domain. For
example, a single 32-bit processing unit in CPU can deliver 128 binarized
convolution operations (multiplications and additions) and 13 4-bit convolution
operations with a single multiplication instruction, and a single 27x18
multiplier in the FPGA DSP can deliver 60, 8 or 2 convolution operations with
1, 4 or 8-bit inputs in one clock cycle. We demonstrate the effectiveness of
HiKonv on both CPU and FPGA. On CPU, HiKonv outperforms the baseline
implementation with 1 to 8-bit inputs and provides up to 7.6x and 1.4x
performance improvements for 1-D convolution, and performs 2.74x and 3.19x over
the baseline implementation for 4-bit signed and unsigned data inputs for 2-D
convolution. On FPGA, the HiKonv solution enables a single DSP to process multiple
convolutions with a shorter processing latency. For binarized input, each DSP
with HiKonv is equivalent to up to 76.6 LUTs. Compared to the DAC-SDC 2020
champion model, HiKonv achieves a 2.37x throughput improvement and 2.61x DSP
efficiency improvement, respectively.
Authors' comments: The conference version is pubilished in Proceedings of ASP-DAC 2022.
arXiv admin note: substantial text overlap with arXiv:2112.13972
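The principle behind computing several low-bitwidth convolution terms with one wide multiplication, as described in the HiKonv abstract above, can be sketched in Python. This is a minimal illustration for unsigned inputs only, assuming a simple slot layout with guard bits; the function names are hypothetical, and the paper's actual packing scheme, its handling of signed data, and its hardware bit budgets are not reproduced here.

```python
def pack(values, slot_bits):
    """Pack non-negative integers into one integer, one value per slot."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * slot_bits)
    return word


def conv1d_by_single_multiply(a, b, bits):
    """Full 1-D convolution of two unsigned `bits`-bit sequences
    obtained from a single integer multiplication."""
    terms = min(len(a), len(b))  # most partial products summed in one output
    # Each output is below terms * 2**(2 * bits), so this slot width
    # guarantees that neighbouring outputs never overflow into each other.
    slot_bits = 2 * bits + (terms - 1).bit_length()
    product = pack(a, slot_bits) * pack(b, slot_bits)  # the one multiplication
    mask = (1 << slot_bits) - 1
    n_out = len(a) + len(b) - 1
    return [(product >> (k * slot_bits)) & mask for k in range(n_out)]


def conv1d_direct(a, b):
    """Reference convolution with one multiplication per term."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


a = [3, 15, 7, 0, 9]  # hypothetical 4-bit unsigned inputs
b = [1, 12, 5]        # hypothetical 4-bit unsigned kernel
result = conv1d_by_single_multiply(a, b, bits=4)
print(result == conv1d_direct(a, b))
```

Python integers have arbitrary precision, so the sketch ignores the constraint that matters on real hardware: the packed operands and their product must fit the multiplier width (for example 32 bits on a CPU or 27x18 on an FPGA DSP), which limits how many values can share one multiplication.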
Weiguang Zhao, Yuyao Yan, Chaolong Yang, Jianan Ye, Xi Yang, Kaizhu Huang
Instance segmentation on point clouds is crucially important for 3D scene understanding. Most SOTAs adopt distance clustering, which is typically effective but does not perform well in segmenting adjacent objects with the same semantic label (especially when they share neighboring points). Due to the uneven distribution of offset points, these existing methods can hardly cluster all instance points. To this end, we design a novel divide-and-conquer strategy named PBNet that binarizes each point and clusters them separately to segment instances. Our binary clustering divides offset instance points into two categories: high and low density points (HPs vs. LPs). Adjacent objects can be clearly separated by removing LPs, and then be completed and refined by assigning LPs via a neighbor voting method. To suppress potential over-segmentation, we propose to construct local scenes with the weight mask for each instance. As a plug-in, the proposed binary clustering can replace traditional distance clustering and lead to consistent performance gains on many mainstream baselines. A series of experiments on ScanNetV2 and S3DIS datasets indicate the superiority of our model. In particular, PBNet ranks first on the ScanNetV2 official benchmark challenge, achieving the highest mAP. Code will be available publicly at https://github.com/weiguangzhao/PBNet.
Takato Yasuno, Junichiro Fujii, Masazumi Amakata
Urban rivers provide a water environment that influences residential living.
River surface monitoring has become crucial for making decisions about where to
prioritize cleaning and when to automatically start the cleaning treatment. We
focus on the organic mud, or "scum", that accumulates on the river's surface
and contributes to the river's odor and has external economic effects on the
landscape. Because scum forms sparsely distributed and unstable patterns of
organic shape, automating the monitoring process has proved
difficult. We propose a patch-wise classification pipeline to detect scum
features on the river surface using mixture image augmentation to increase the
diversity between the scum floating on the river and the entangled background
on the river surface reflected by nearby structures like buildings, bridges,
poles, and barriers. Furthermore, we propose a scum-index cover on rivers to
help monitor worse grade online, collect floating scum, and decide on chemical
treatment policies. Finally, we demonstrate the application of our method on a
time series dataset with frames every ten minutes recording river scum events
over several days. We discuss the significance of our pipeline and its
experimental findings.
Authors' comments: 15 figures, 3 tables
Jinyi Hu, Xiaoyuan Yi, Wenhao Li, Maosong Sun, Xing Xie
The past several years have witnessed Variational Auto-Encoder's superiority
in various text generation tasks. However, due to the sequential nature of the
text, auto-regressive decoders tend to ignore latent variables and then reduce
to simple language models, known as the KL vanishing problem, which would
further deteriorate when VAE is combined with Transformer-based structures. To
ameliorate this problem, we propose DELLA, a novel variational Transformer
framework. DELLA learns a series of layer-wise latent variables with each
inferred from those of lower layers and tightly coupled with the hidden states
by low-rank tensor product. In this way, DELLA forces these posterior latent
variables to be fused deeply with the whole computation path and hence
incorporate more information. We theoretically demonstrate that our method can
be regarded as entangling latent variables to avoid posterior information
decrease through layers, enabling DELLA to get higher non-zero KL values even
without any annealing or thresholding tricks. Experiments on four unconditional
and three conditional generation tasks show that DELLA could better alleviate
KL vanishing and improve both quality and diversity compared to several strong
baselines.
Authors' comments: NAACL 2022
Lalita Kumari, Sukhdeep Singh, VVS Rathore, Anuj Sharma
Cursive handwritten text recognition is a challenging research problem in the domain of pattern recognition. The current state-of-the-art approaches include models based on convolutional recurrent neural networks and multi-dimensional long short-term memory recurrent neural networks. These methods are computationally expensive, and the models are complex at the design level. In recent studies, models combining convolutional neural networks and gated convolutional neural networks have required fewer parameters than models based on convolutional recurrent neural networks. To reduce the total number of parameters to be trained, in this work we use depthwise convolution in place of standard convolutions, combined with a gated convolutional neural network and a bidirectional gated recurrent unit. Additionally, we include a lexicon-based word beam search decoder at the testing step, which helps improve the overall accuracy of the model. We obtain a 3.84% character error rate and a 9.40% word error rate on the IAM dataset, and a 4.88% character error rate and a 14.56% word error rate on the George Washington dataset.
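The parameter saving from replacing standard convolutions with depthwise convolutions, mentioned in the abstract above, follows from simple arithmetic. The sketch below is an illustration only: the layer sizes are hypothetical, biases are ignored, and pairing the depthwise filter with a 1x1 pointwise convolution (the usual depthwise separable form) is an assumption, since the abstract does not describe the exact layer design.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
standard = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, round(standard / separable, 1))
# prints: 73728 8768 8.4
```

For this hypothetical layer the depthwise separable form needs roughly one eighth of the weights, and the ratio grows with the kernel size and the number of output channels.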
Chanyong Jung, Joonhyung Lee, Sunkyoung You, Jong Chul Ye
The acquisition conditions for low-dose and high-dose CT images are usually
different, so shifts in the CT numbers often occur. Accordingly,
unsupervised deep learning-based approaches, which learn the target image
distribution, often introduce CT number distortions and result in detrimental
effects in diagnostic performance. To address this, here we propose a novel
unsupervised learning approach for low-dose CT reconstruction using patch-wise
deep metric learning. The key idea is to learn an embedding space by pulling together
the positive pairs of image patches which share the same anatomical structure, and
pushing apart the negative pairs which have the same noise level. Thereby, the
network is trained to suppress the noise level, while retaining the original
global CT number distributions even after the image translation. Experimental
results confirm that our deep metric learning plays a critical role in
producing high quality denoised images without CT number shift.
Authors' comments: MICCAI 2022
Dabao Wang, Hang Feng, Siwei Wu, Yajin Zhou, Lei Wu, Xingliang Yuan
The prosperity of decentralized finance motivates many investors to profit
via trading their crypto assets on decentralized applications (DApps for short)
of the Ethereum ecosystem. Apart from Ether (the native cryptocurrency of
Ethereum), many ERC20 (a widely used token standard on Ethereum) tokens obtain
vast market value in the ecosystem. Specifically, the approval mechanism is
used to delegate the privilege of spending users' tokens to DApps. By doing so,
the DApps can transfer these tokens to arbitrary receivers on behalf of the
users. To increase the usability, unlimited approval is commonly adopted by
DApps to reduce the required interaction between them and their users. However,
as shown in existing security incidents, this mechanism can be abused to steal
users' tokens.
In this paper, we present the first systematic study to quantify the risk of
unlimited approval of ERC20 tokens on Ethereum. Specifically, by evaluating
existing transactions up to 31st July 2021, we find that unlimited approval is
prevalent (60%, 15.2M/25.4M) in the ecosystem, and 22% of users face a high
risk of having their approved tokens stolen. After that, we investigate the
security issues that are involved in interacting with the UIs of 22
representative DApps and 9 famous wallets to prepare the approval transactions.
The result reveals the worrisome fact that all DApps request unlimited approval
from the front-end users and only 10% (3/31) of UIs provide explanatory
information for the approval mechanism. Meanwhile, only 16% (5/31) of UIs allow
users to modify their approval amounts. Finally, we take a further step to
characterize the user behavior into five modes and formalize the good practice,
i.e., on-demand approval and timely spending, towards securely spending
approved tokens. However, the evaluation result suggests that only 0.2% of
users follow the good practice to mitigate the risk.
Authors' comments: 16 pages, 12 figures. Conference: The 25th International Symposium on
Research in Attacks, Intrusions and Defenses (RAID 2022), October 26--28,
2022, Limassol, Cyprus
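The difference between unlimited approval and the on-demand approval practice described in the abstract above can be illustrated with a toy model of the ERC20 allowance mechanism. This is a simplified Python sketch, not Solidity and not any real token contract: the class `ToyERC20`, the account names, and the helper `max_stealable` are hypothetical, and the only detail taken from common practice is that "unlimited" approval is usually expressed as the maximum 256-bit unsigned integer.

```python
UINT256_MAX = 2**256 - 1  # amount commonly used to express "unlimited" approval


class ToyERC20:
    """Minimal allowance bookkeeping in the spirit of ERC20 (illustrative only)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.allowances = {}  # (owner, spender) -> remaining approved amount

    def approve(self, owner, spender, amount):
        self.allowances[(owner, spender)] = amount

    def transfer_from(self, spender, owner, receiver, amount):
        allowed = self.allowances.get((owner, spender), 0)
        if amount > allowed or amount > self.balances.get(owner, 0):
            raise ValueError("insufficient allowance or balance")
        self.allowances[(owner, spender)] = allowed - amount
        self.balances[owner] -= amount
        self.balances[receiver] = self.balances.get(receiver, 0) + amount


def max_stealable(token, owner, spender):
    """Tokens a compromised or malicious spender could still move."""
    allowed = token.allowances.get((owner, spender), 0)
    return min(token.balances.get(owner, 0), allowed)


# Scenario: alice holds 1000 tokens and wants the DApp to spend 100 of them.
unlimited = ToyERC20({"alice": 1000})
unlimited.approve("alice", "dapp", UINT256_MAX)
unlimited.transfer_from("dapp", "alice", "pool", 100)
exposure_unlimited = max_stealable(unlimited, "alice", "dapp")

on_demand = ToyERC20({"alice": 1000})
on_demand.approve("alice", "dapp", 100)  # approve only what is needed
on_demand.transfer_from("dapp", "alice", "pool", 100)  # and spend it promptly
exposure_on_demand = max_stealable(on_demand, "alice", "dapp")

print(exposure_unlimited, exposure_on_demand)
# prints: 900 0
```

With unlimited approval the remaining balance stays exposed after the intended spend, whereas approving only the required amount and spending it promptly leaves nothing for the spender to move.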