Yanpeng Sun, Zechao Li
Pixel-wise dense prediction tasks under weak supervision currently use Class Attention Maps (CAM) to generate pseudo masks as ground truth. However, existing methods typically depend on laborious training modules, which may introduce heavy computational overhead and complex training procedures. In this work, semantic structure aware inference (SSA) is proposed to explore the semantic structure information hidden in different stages of a CNN-based network and generate high-quality CAMs at model inference. Specifically, the semantic structure modeling module (SSM) is first proposed to generate the class-agnostic semantic correlation representation, where each item denotes the affinity degree between one category of objects and all the others. Then the structured feature representation is explored to polish an immature CAM via the dot product operation. Finally, the polished CAMs from different backbone stages are fused as the output. The proposed method has no parameters and does not need to be trained. Therefore, it can be applied to a wide range of weakly-supervised pixel-wise dense prediction tasks. Experimental results on both weakly-supervised object localization and weakly-supervised semantic segmentation demonstrate the effectiveness of the proposed method, which achieves new state-of-the-art results on these two tasks.
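The polishing step described in this abstract can be illustrated with a minimal sketch. It assumes the affinity is a clipped, row-normalized cosine similarity between per-pixel feature vectors and that the CAM is flattened to a list; both choices, and the names `cosine_affinity` and `polish_cam`, are assumptions for illustration, not the authors' SSM module. Multi-stage fusion is not shown.

```python
import math

def cosine_affinity(features):
    """Pairwise cosine similarity between per-pixel feature vectors,
    clipped at zero and row-normalized so every row sums to 1."""
    norms = [math.sqrt(sum(v * v for v in f)) or 1.0 for f in features]
    affinity = []
    for i, fi in enumerate(features):
        row = []
        for j, fj in enumerate(features):
            sim = sum(a * b for a, b in zip(fi, fj)) / (norms[i] * norms[j])
            row.append(max(sim, 0.0))
        total = sum(row) or 1.0
        affinity.append([r / total for r in row])
    return affinity

def polish_cam(cam, affinity):
    """Refine a flattened CAM with a matrix-vector (dot) product."""
    return [sum(a * c for a, c in zip(row, cam)) for row in affinity]

# Toy input: pixels 0 and 1 have similar features, as do pixels 2 and 3.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coarse_cam = [1.0, 0.0, 0.0, 0.0]  # only pixel 0 is activated
polished = polish_cam(coarse_cam, cosine_affinity(features))
```

In this toy case the activation of pixel 0 spreads to pixel 1, whose features are similar, while pixel 2, whose features are orthogonal to those of pixel 0, stays at zero.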
Awadelrahman M. A. Ahmed, Leen A. M. Ali
This paper contributes to automating medical image segmentation by proposing
generative adversarial network-based models to segment both polyps and
instruments in endoscopy images. A major contribution of this work is to
provide explanations for the predictions using a layer-wise relevance
propagation approach designating which input image pixels are relevant to the
predictions and to what extent. On the polyp segmentation task, the models
achieved an accuracy of 0.84 and a Jaccard index of 0.46. On the instrument
segmentation task, the models achieved an accuracy of 0.96 and a Jaccard index
of 0.70. The code is available at https://github.com/Awadelrahman/MedAI.
Authors' comments: Nordic Machine Intelligence
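The two reported metrics can diverge because pixel accuracy counts background pixels while the Jaccard index does not. A minimal sketch on flattened binary masks follows; the toy masks are made up for illustration and are not data from the paper.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target."""
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / len(target)

def jaccard_index(pred, target):
    """Intersection over union of the foreground (label 1) pixels."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, target) if p == 1 or t == 1)
    return inter / union if union else 1.0

# Ten pixels, two of which are foreground in the target mask.
target = [1, 1] + [0] * 8
pred = [1, 0, 1] + [0] * 7
acc = pixel_accuracy(pred, target)  # 8 of 10 pixels agree
iou = jaccard_index(pred, target)   # 1 shared foreground pixel out of 3
```

Here accuracy is 0.8 while the Jaccard index is only 1/3, since the many correctly predicted background pixels raise accuracy but do not enter the intersection or the union.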
Miao Zhang, Miaojing Shi, Li Li
In visual recognition tasks, few-shot learning requires the ability to learn
object categories with few support examples. Its renewed popularity in the
deep learning era has been mainly in image classification. This work focuses
on few-shot semantic segmentation, which is still a largely unexplored field.
The few recent advances are often restricted to single-class few-shot segmentation.
In this paper, we first present a novel multi-way (class) encoding and decoding
architecture which effectively fuses multi-scale query information and
multi-class support information into one query-support embedding. Multi-class
segmentation is directly decoded upon this embedding. For better feature
fusion, a multi-level attention mechanism is proposed within the architecture,
which includes the attention for support feature modulation and attention for
multi-scale combination. Last, to enhance the embedding space learning, an
additional pixel-wise metric learning module is introduced with triplet loss
formulated on the pixel-level embedding of the input image. Extensive
experiments on standard benchmarks PASCAL-5i and COCO-20i show clear benefits
of our method over the state of the art in few-shot segmentation.
Authors' comments: Accepted on IEEE Transactions on Circuits and Systems for Video
Technology
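The triplet loss mentioned for the pixel-wise metric learning module has a standard form, sketched below on plain embedding vectors. The Euclidean distance, the margin value, and the way triplets are picked are assumptions; the abstract does not specify them.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]      # embedding of a pixel
positive = [0.0, 1.0]    # pixel of the same class, distance 1.0
far_negative = [3.0, 4.0]   # other class, distance 5.0
near_negative = [0.0, 1.5]  # other class, distance 1.5

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets like the second one contribute gradient.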
Sung-En Chang, Yanyu Li, Mengshu Sun, Weiwen Jiang, Sijia Liu, Yanzhi Wang, Xue Lin
This work proposes a novel Deep Neural Network (DNN) quantization framework,
namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach.
Specifically, this is the first effort to assign mixed quantization schemes and
multiple precisions within layers -- among rows of the DNN weight matrix, for
simplified operations in hardware inference, while preserving accuracy.
Furthermore, this paper makes a different observation from the prior work that
the quantization error does not necessarily exhibit the layer-wise sensitivity,
and actually can be mitigated as long as a certain portion of the weights in
every layer are in higher precisions. This observation enables layer-wise
uniformity in the hardware implementation towards guaranteed inference
acceleration, while still enjoying row-wise flexibility of mixed schemes and
multiple precisions to boost accuracy. The candidates of schemes and precisions
are derived practically and effectively with a highly hardware-informative
strategy to reduce the problem search space. With the offline determined ratio
of different quantization schemes and precisions for all the layers, the RMSMP
quantization algorithm uses a Hessian- and variance-based method to
effectively assign schemes and precisions for each row. The proposed RMSMP is
tested on image classification and natural language processing (BERT)
applications and achieves the best accuracy among state-of-the-art methods
under the same equivalent precisions. The RMSMP is implemented on FPGA devices,
achieving 3.65x speedup in the end-to-end inference time for ResNet-18 on
ImageNet, compared with the 4-bit Fixed-point baseline.
Authors' comments: Accepted by International Conference on Computer Vision 2021 (ICCV
2021)
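The idea of fixing the ratio of precisions per layer while choosing which rows get the higher precision can be sketched as follows. This is a simplified illustration, not the RMSMP algorithm: it uses a single symmetric uniform scheme, ranks rows by weight variance only (the Hessian term is omitted), and the bit widths, the ratio, and all function names are assumptions.

```python
from statistics import pvariance

def quantize_row(row, bits):
    """Symmetric uniform fixed-point quantization of one weight row."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / levels or 1.0  # guard all-zero rows
    return [round(w / scale) * scale for w in row]

def assign_bits(weights, high_ratio=0.25, low_bits=4, high_bits=8):
    """Give the highest-variance rows `high_bits`; the rest get `low_bits`."""
    n_high = round(high_ratio * len(weights))
    order = sorted(range(len(weights)),
                   key=lambda i: pvariance(weights[i]), reverse=True)
    high_rows = set(order[:n_high])
    return [high_bits if i in high_rows else low_bits
            for i in range(len(weights))]

def quantize_matrix(weights, **kwargs):
    """Quantize each row with its assigned precision."""
    bits = assign_bits(weights, **kwargs)
    return [quantize_row(r, b) for r, b in zip(weights, bits)], bits

weights = [[0.1, -0.2, 0.05],
           [1.5, -2.0, 0.7],
           [0.3, 0.1, -0.1],
           [0.02, 0.01, -0.03]]
quantized, bits = quantize_matrix(weights, high_ratio=0.25)
```

Because the same `high_ratio` would be used for every layer, each layer has the same mix of row precisions, which is the property the abstract links to uniform hardware implementation.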
Maoyu Mao, Jun Yang
Deepfakes pose a serious threat to the reliability of judicial evidence and intellectual property protection. Despite the urgent need for Deepfake identification, existing pixel-level detection methods are increasingly unable to resist the growing realism of fake videos and lack generalization. In this paper, we propose a scheme to expose Deepfakes through faint signals hidden in face videos. This scheme extracts two types of minute information hidden between face pixels, photoplethysmography (PPG) features and auto-regressive (AR) features, which are used as the basis for forensics in the temporal and spatial domains, respectively. According to the principle of PPG, tracking the absorption of light by blood cells allows remote estimation of the heart rate (HR) of a face video in the temporal domain, and irregular HR fluctuations can be seen as traces of tampering. On the other hand, AR coefficients reflect the inter-pixel correlation and can also reveal the traces of smoothing caused by up-sampling in the process of generating fake faces. Furthermore, the scheme combines asymmetric convolution block (ACBlock)-based improved densely connected networks (DenseNets) to achieve face video authenticity forensics. Its asymmetric convolutional structure enhances the robustness of the network to upside-down and left-right flipping of the input feature image, so that the sequence of feature stitching does not affect detection results. Simulation results show that our proposed scheme provides more accurate authenticity detection results on multiple deep forgery datasets and has better generalization compared to the benchmark strategy.
Charlotte Ward, Suvi Gezari, Peter Nugent, Eric C. Bellm, Richard Dekany, Andrew Drake, Dmitry A. Duev, Matthew J. Graham et al.
While it is difficult to observe the first black hole seeds in the early
Universe, we can study intermediate mass black holes (IMBHs) in local dwarf
galaxies for clues about their origins. In this paper we present a sample of
variability--selected AGN in dwarf galaxies using optical photometry from the
Zwicky Transient Facility (ZTF) and forward--modeled mid--IR photometry of
time--resolved Wide--field Infrared Survey Explorer ({\it WISE}) coadded
images. We found that 44 out of 25,714 dwarf galaxies had optically variable
AGN candidates, and 148 out of 79,879 dwarf galaxies had mid--IR variable AGN
candidates, corresponding to active fractions of $0.17\pm0.03$\% and
$0.19\pm0.02$\% respectively. We found that spectroscopic approaches to AGN
identification would have missed 81\% of our ZTF IMBH candidates and 69\% of
our {\it WISE} IMBH candidates. Only $9$ candidates have been detected
previously in radio, X-ray, and variability searches for dwarf galaxy AGN. The
ZTF and {\it WISE} dwarf galaxy AGN with broad Balmer lines have virial masses
down to $10^{5.5}M_\odot$ and for the rest of the sample, BH masses predicted
from host galaxy mass range between
$10^{5.2}M_\odot<M_{\text{BH}}<10^{7.3}M_\odot$. We found that only 5 of 152
previously reported variability--selected AGN candidates from the Palomar
Transient Factory in common with our parent sample were variable in ZTF. We
also determined a nuclear supernova fraction of $0.05\pm0.01$\% year$^{-1}$ for
dwarf galaxies in ZTF. Our ZTF and {\it WISE} IMBH candidates show the promise
of variability searches for the discovery of otherwise hidden low mass AGN.
Authors' comments: Submitted to ApJ. 25 pages, 10 figures, 4 tables
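The quoted active fractions follow directly from the counts in the abstract. The sketch below reproduces them, assuming a simple binomial standard error; the abstract does not state how the uncertainties were derived, so the error formula is an assumption that happens to agree with the quoted values after rounding.

```python
import math

def active_fraction(n_agn, n_galaxies):
    """Return the active fraction and its binomial standard error, in percent."""
    p = n_agn / n_galaxies
    err = math.sqrt(p * (1 - p) / n_galaxies)
    return 100 * p, 100 * err

ztf_frac, ztf_err = active_fraction(44, 25714)     # optically variable (ZTF)
wise_frac, wise_err = active_fraction(148, 79879)  # mid-IR variable (WISE)
```

Rounded to two decimals, these give 0.17 and 0.03 for ZTF and 0.19 and 0.02 for WISE.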
Antoine Bodin, Nicolas Macris
Recent evidence has shown the existence of a so-called double-descent and even triple-descent behavior for the generalization error of deep-learning models. This important phenomenon commonly appears in implemented neural network architectures, and also seems to emerge in epoch-wise curves during the training process. A recent line of research has highlighted that random matrix tools can be used to obtain precise analytical asymptotics of the generalization (and training) errors of the random feature model. In this contribution, we analyze the whole temporal behavior of the generalization and training errors under gradient flow for the random feature model. We show that in the asymptotic limit of large system size the full time-evolution path of both errors can be calculated analytically. This allows us to observe how the double and triple descents develop over time, if and when early stopping is an option, and also observe time-wise descent structures. Our techniques are based on Cauchy complex integral representations of the errors together with recent random matrix methods based on linear pencils.
David Bonet, Antonio Ortega, Javier Ruiz-Hidalgo, Sarath Shekkizhar
Feature spaces in the deep layers of convolutional neural networks (CNNs) are
often very high-dimensional and difficult to interpret. However, convolutional
layers consist of multiple channels that are activated by different types of
inputs, which suggests that more insights may be gained by studying the
channels and how they relate to each other. In this paper, we first
theoretically analyze channel-wise non-negative kernel (CW-NNK) regression graphs,
which allow us to quantify the overlap between channels and, indirectly, the
intrinsic dimension of the data representation manifold. We find that
redundancy between channels is significant and varies with the layer depth and
the level of regularization during training. Additionally, we observe that
there is a correlation between channel overlap in the last convolutional layer
and generalization performance. Our experimental results demonstrate that these
techniques can lead to a better understanding of deep representations.
Authors' comments: Under review at ICASSP
Subharthi Chowdhuri, Khaled Ghannam, Tirtha Banerjee
We investigate the intermittent dynamics of momentum transport and its
underlying time scales in the near-wall region of the neutrally stratified
atmospheric boundary layer in the presence of a vegetation canopy. This is
achieved through an empirical analysis of the persistence time scales (periods
between successive zero-crossings) of momentum flux events, and their
connection to the ejection-sweep cycle. Using high-frequency measurements from
the GoAmazon campaign, spanning multiple heights within and above a dense
canopy, the analysis suggests that when the persistence time scales ($t_p$) of
momentum flux events from four different quadrants are separately normalized by
$\Gamma_{w}$ (integral time scale of the vertical velocity), their
distributions ($P(t_p/\Gamma_{w})$) remain height-invariant. This result points
to a persistent memory imposed by canopy-induced coherent structures, and to
their role as an efficient momentum transport mechanism between the canopy
airspace and the region immediately above. Moreover, $P(t_p/\Gamma_{w})$
exhibits a power-law scaling at times $t_{p}<\Gamma_{w}$ with an exponential
tail appearing for $t_{p} \geq \Gamma_{w}$. By separating the flux events based
on $t_p$, we discover that around 80\% of the momentum is transported through
the long-lived events ($t_{p} \geq \Gamma_{w}$) at heights immediately above
the canopy while the short-lived ones ($t_{p} < \Gamma_{w}$) only contribute
marginally ($\approx$ 20\%). To explain the role of instantaneous flux
amplitudes in momentum transport, we compare the measurements with
newly developed surrogate data and establish that the range of time scales
involved in amplitude variations of the fluxes tends to increase as one
transitions from within to above the canopy.
Authors' comments: 33 Pages, 12 figures
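The persistence time scale defined in this abstract (the period between successive zero-crossings) can be computed as in the sketch below. It works on a single sampled series and splits events at a given integral time scale; the sample values, the sampling interval, and the value used for $\Gamma_{w}$ are made up, and the quadrant-wise separation used in the paper is not reproduced.

```python
def zero_crossings(series):
    """Indices i where the sign changes between series[i - 1] and series[i]."""
    return [i for i in range(1, len(series))
            if (series[i - 1] >= 0) != (series[i] >= 0)]

def persistence_times(series, dt):
    """Durations between successive zero-crossings, in the units of dt."""
    idx = zero_crossings(series)
    return [(b - a) * dt for a, b in zip(idx, idx[1:])]

def split_by_timescale(times, gamma_w):
    """Separate short-lived (t_p < gamma_w) from long-lived (t_p >= gamma_w) events."""
    short = [t for t in times if t < gamma_w]
    long_lived = [t for t in times if t >= gamma_w]
    return short, long_lived

flux = [1, 2, -1, -2, -1, 3, 1, -1, 2]  # toy flux samples
t_p = persistence_times(flux, dt=0.5)   # crossings at indices 2, 5, 7, 8
short, long_lived = split_by_timescale(t_p, gamma_w=1.0)
```

Dividing each entry of `t_p` by the integral time scale gives the normalized variable $t_p/\Gamma_{w}$ whose distribution the abstract discusses.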
Jin Gyu Lee, Thomas Berger, Stephan Trenn, Hyungbo Shim
When a group of heterogeneous node dynamics are diffusively coupled with a
high coupling gain, the group exhibits a collective emergent behavior which is
governed by a simple algebraic average of the node dynamics called the blended
dynamics. This finding has been utilized for designing heterogeneous
multi-agent systems by building the desired blended dynamics first and then
splitting it into the node dynamics. However, to compute the magnitude of the
coupling gain, each agent needs to know global information such as the number
of participating nodes, the graph structure, and so on, which prevents a fully
decentralized design of the node dynamics in conjunction with the coupling
laws. To resolve this issue, the idea of funnel control, which is a method for
adaptive gain selection, can be exploited for a node-wise coupling, but the
price to pay is that the collective emergent behavior is no longer governed by
a simple average of the node dynamics. Our analysis reveals that this drawback
can be avoided by an edge-wise design premise, which is the idea that we
present in this paper. As a result, we gain benefits such as a fully
decentralized design without global information, collective emergent behavior
being governed by the blended dynamics, and the plug-and-play operation based
on edge-wise handshaking between two nodes.
Authors' comments: 14 pages, 3 figures
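The notion of blended dynamics can be illustrated with two scalar nodes under a fixed, high coupling gain. This sketch covers only the baseline setting from the first sentence of the abstract, not the funnel-based edge-wise design; the node dynamics, the gain, and the Euler step are hypothetical choices.

```python
def simulate(k, dt=0.001, steps=10000):
    """Forward-Euler simulation of two diffusively coupled scalar nodes.

    Node 1: dx1/dt = -x1 + 1 + k * (x2 - x1)
    Node 2: dx2/dt = -x2 + 3 + k * (x1 - x2)
    The blended dynamics, the average of the two vector fields, is
    ds/dt = -s + 2, with equilibrium s = 2.
    """
    x1, x2 = 0.0, 5.0
    for _ in range(steps):
        dx1 = (-x1 + 1.0) + k * (x2 - x1)
        dx2 = (-x2 + 3.0) + k * (x1 - x2)
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    return x1, x2

coupled = simulate(k=50.0)   # both states end up close to 2
uncoupled = simulate(k=0.0)  # states settle at their own equilibria, 1 and 3
```

With the gain at 50 the steady-state gap between the nodes is 2 / (1 + 2k), about 0.02, so both track the equilibrium of the averaged dynamics; without coupling each node follows its own dynamics.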
Nithin Rao Koluguri, Taejin Park, Boris Ginsburg
In this paper, we propose TitaNet, a novel neural network architecture for
extracting speaker representations. We employ 1D depth-wise separable
convolutions with Squeeze-and-Excitation (SE) layers with global context
followed by a channel-attention-based statistics pooling layer to map
variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a
scalable architecture and achieves state-of-the-art performance on the speaker
verification task with an equal error rate (EER) of 0.68% on the VoxCeleb1
trial file and also on speaker diarization tasks with a diarization error rate
(DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel and 1.11% on CH109.
Furthermore, we investigate various sizes of TitaNet and present a light
TitaNet-S model with only 6M parameters that achieves near state-of-the-art
results in diarization tasks.
Authors' comments: preprint. Submitted to ICASSP 2022
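How a statistics pooling layer turns a variable number of frames into a fixed-length vector can be sketched as below. It is an illustration, not TitaNet's layer: the attention logits are passed in as plain numbers rather than produced by a learned network, and the function names are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_stats_pooling(frames, scores):
    """Pool T x C frame features into a 2C vector of weighted means and stds.

    frames: list of T frames, each a list of C channel values.
    scores: list of T lists of C attention logits (one per frame and channel).
    """
    n_channels = len(frames[0])
    means, stds = [], []
    for c in range(n_channels):
        w = softmax([s[c] for s in scores])
        x = [f[c] for f in frames]
        mean = sum(wi * xi for wi, xi in zip(w, x))
        var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
        means.append(mean)
        stds.append(math.sqrt(var))
    return means + stds

short_utt = [[1.0, 2.0], [3.0, 4.0]]            # 2 frames, 2 channels
long_utt = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # 3 frames, 2 channels
emb_short = attentive_stats_pooling(short_utt, [[0.0, 0.0]] * 2)
emb_long = attentive_stats_pooling(long_utt, [[0.0, 0.0]] * 3)
```

Both utterances map to a vector of length 2C regardless of their number of frames; with equal logits the weights are uniform and the pooled values reduce to the ordinary mean and standard deviation.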
Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi
While Transformer-based models have shown impressive language modeling performance, the large computation cost is often prohibitive for practical use. Attention head pruning, which removes unnecessary attention heads in the multihead attention, is a promising technique to solve this problem. However, it does not evenly reduce the overall load because the heavy feedforward module is not affected by head pruning. In this paper, we apply layer-wise attention head pruning to the All-attention Transformer so that the entire computation and the number of parameters can be reduced proportionally to the number of pruned heads. While the architecture has the potential to fully utilize head pruning, we propose three training methods that are especially helpful to minimize performance degradation and stabilize the pruning process. Our pruned model shows consistently lower perplexity than Transformer-XL at a comparable parameter size on the WikiText-103 language modeling benchmark.
Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee
Self-supervised speech representation learning methods like wav2vec 2.0 and
Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and
offer good representations for numerous speech processing tasks. Despite the
success of these methods, they require large memory and high pre-training
costs, making them inaccessible for researchers in academia and small
companies. Therefore, this paper introduces DistilHuBERT, a novel multi-task
learning framework to distill hidden representations from a HuBERT model
directly. This method reduces HuBERT's size by 75% and makes it 73% faster while
retaining most of its performance on ten different tasks. Moreover, DistilHuBERT
requires little training time and data, opening the possibility of
pre-training personal and on-device SSL models for speech.
Authors' comments: Accepted to ICASSP 2022
Li Chen, Yulin Ding, Saeid Pirasteh, Han Hu, Qing Zhu, Haowei Zeng, Haojia Yu, Qisen Shang et al.
Predicting a landslide susceptibility map (LSM) is essential for risk recognition and disaster prevention. Despite the successful application of data-driven approaches to LSM prediction, most methods apply a single global model to predict the LSM for an entire target region. However, in large-scale areas with significant environmental change, different parts of the region have different landslide-inducing environments and should therefore be predicted with separate models. This study first segments target scenarios into blocks for individual analysis. The critical problem is then that, in each block with limited samples, training and testing a model that yields a satisfactory LSM prediction is impossible, especially in dangerous mountainous areas where landslide surveying is expensive. To solve this problem, we train an intermediate representation with the meta-learning paradigm, which is well suited to capturing information from LSM tasks that is valuable for few-shot adaptation. We hypothesize that there are more general and vital concepts concerning landslide causes that are sensitive to variations in input features. Thus, we can quickly adapt the models from the intermediate representation to different blocks or even unseen tasks using very few exemplar samples. Experimental results on two study areas demonstrate the validity of our block-wise analysis in large scenarios and reveal the top few-shot adaptation performance of the proposed methods.
Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays
Streaming end-to-end speech recognition models have been widely applied to
mobile devices and show significant improvement in efficiency. These models are
typically trained on the server using transcribed speech data. However, the
server data distribution can be very different from the data distribution on
user devices, which could affect the model performance. There are two main
challenges for on-device training: limited reliable labels and limited training
memory. While self-supervised learning algorithms can mitigate the mismatch
between domains using unlabeled data, they are not applicable on mobile devices
directly because of the memory constraint. In this paper, we propose an
incremental layer-wise self-supervised learning algorithm for efficient speech
domain adaptation on mobile devices, in which only one layer is updated at a
time. Extensive experimental results demonstrate that the proposed algorithm
obtains a Word Error Rate (WER) on the target domain $24.2\%$ better than the
supervised baseline and costs $89.7\%$ less training memory than the end-to-end
self-supervised learning algorithm.
Authors' comments: 5 pages
Yule Wang, Xin Xin, Yue Ding, Yunzhe Li, Dong Wang
Recommender systems based on historical user-item interactions are of vital importance for web-based services. However, the observed data used to train the recommender model suffers from severe bias issues. In practice, the item frequency distribution of the dataset is a highly skewed power-law distribution. Interactions with a small fraction of head items account for almost the whole training data. The normal training paradigm on such biased data tends to repeatedly generate recommendations from the head items, which further exacerbates the biases and affects the exploration of potentially interesting items from the niche set. In this work, we explore the central theme of recommendation debiasing from an item cluster-wise multi-objective optimization perspective. Aiming to balance the learning on various item clusters that differ in popularity during the training process, we propose a model-agnostic framework, namely Item Cluster-Wise Pareto-Efficient Recommendation (ICPE). In detail, we define our item cluster-wise optimization target as follows: the recommender model should balance all item clusters that differ in popularity, so we set the model learning on each item cluster as a unique optimization objective. To achieve this goal, we first explore items' popularity levels from a novel causal reasoning perspective. Then, we devise popularity discrepancy-based bisecting clustering to separate the item clusters. Next, we adaptively find the overall harmonious gradient direction for cluster-wise optimization objectives with a Pareto-efficient solver. Finally, in the prediction stage, we perform counterfactual inference to further eliminate the impact of global propensity. Extensive experimental results verify the superiority of ICPE in overall recommendation performance and bias elimination.
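The abstract does not say which Pareto-efficient solver is used. As one common way to find a shared descent direction, the sketch below uses the closed-form min-norm combination of two gradients (the two-objective case of MGDA-style multi-objective optimization); treating each item cluster's loss gradient as one input is an assumption, and the toy gradients are made up.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def min_norm_direction(g1, g2):
    """Minimum-norm point on the segment between gradients g1 and g2.

    Minimizes ||alpha * g1 + (1 - alpha) * g2||^2 over alpha in [0, 1]:
    alpha = ((g2 - g1) . g2) / ||g1 - g2||^2, clipped to [0, 1].
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: any weighting gives g1
        return 0.5, list(g1)
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = min(1.0, max(0.0, alpha))
    return alpha, [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]

# Hypothetical gradients of the losses on a head cluster and a niche cluster.
alpha, direction = min_norm_direction([1.0, 0.0], [0.0, 1.0])
alpha_same_dir, direction_same_dir = min_norm_direction([1.0, 0.0], [3.0, 0.0])
```

The resulting direction has a non-negative inner product with both gradients, so a small step along its negative does not increase either cluster's loss to first order; with more than two clusters a general min-norm solver would be needed.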
Negin Ghamsarian, Mario Taschwer, Doris Putzgruber-Adamitsch, Stephanie Sarny, Yosuf El-Shabrawi, Klaus Schoeffmann
Semantic segmentation in surgical videos is a prerequisite for a broad range
of applications towards improving surgical outcomes and surgical video
analysis. However, semantic segmentation in surgical videos involves many
challenges. In particular, in cataract surgery, various features of the
relevant objects such as blunt edges, color and context variation, reflection,
transparency, and motion blur pose a challenge for semantic segmentation. In
this paper, we propose a novel convolutional module termed the \textit{ReCal}
module, which can calibrate the feature maps by employing region
intra-and-inter-dependencies and channel-region cross-dependencies. This
calibration strategy can effectively enhance semantic representation by
correlating different representations of the same semantic label, considering a
multi-angle local view centering around each pixel. Thus the proposed module
can deal with distant visual characteristics of unique objects as well as
cross-similarities in the visual characteristics of different objects.
Moreover, we propose a novel network architecture based on the proposed module
termed ReCal-Net. Experimental results confirm the superiority of ReCal-Net
compared to rival state-of-the-art approaches for all relevant objects in
cataract surgery. Moreover, ablation studies reveal the effectiveness of the
ReCal module in boosting semantic segmentation accuracy.
Authors' comments: 12 pages, 5 figures, accepted at the 28th International Conference on
Neural Information Processing (ICONIP), 2021
Lianbo Ma, Nan Li, Guo Yu, Xiaoyu Geng, Min Huang, Xingwei Wang
In the deployment of deep neural models, how to effectively and automatically find feasible deep models under diverse design objectives is fundamental. Most existing neural architecture search (NAS) methods utilize surrogates to predict the detailed performance (e.g., accuracy and model size) of a candidate architecture during the search, which, however, is complicated and inefficient. In contrast, we aim to learn an efficient Pareto classifier to simplify the search process of NAS by transforming the complex multi-objective NAS task into a simple Pareto-dominance classification task. To this end, we propose a classification-wise Pareto evolution approach for one-shot NAS, where an online classifier is trained to predict the dominance relationship between the candidate and constructed reference architectures, instead of using surrogates to fit the objective functions. The main contribution of this study is to change supernet adaption into a Pareto classifier. Besides, we design two adaptive schemes to select the reference set of architectures for constructing the classification boundary and to regulate the rate of positive samples over negative ones, respectively. We compare the proposed evolution approach with state-of-the-art approaches on widely-used benchmark datasets, and experimental results indicate that the proposed approach outperforms other approaches and finds a number of neural architectures with different model sizes ranging from 2M to 6M under diverse objectives and constraints.
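The dominance relationship that the classifier is trained to predict can be stated in a few lines. The sketch below builds pairwise labels between candidates and reference architectures, assuming both objectives (error rate and model size in millions of parameters) are minimized; the labeling rule and the numbers are illustrative assumptions, not the paper's exact construction.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def dominance_labels(candidates, references):
    """Label each (candidate, reference) pair: 1 if the candidate dominates."""
    return [[1 if dominates(c, r) else 0 for r in references]
            for c in candidates]

# Objectives per architecture: (error rate, model size in M parameters).
references = [(0.25, 4.0)]
candidates = [(0.24, 3.5),  # better on both objectives
              (0.30, 2.0),  # smaller but less accurate
              (0.26, 5.0)]  # worse on both objectives
labels = dominance_labels(candidates, references)
```

Labels of this kind could serve as training targets for a binary classifier, so the search would only need a dominance prediction rather than an estimate of each objective value.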
Yang Zhang, Yao Wang, Zhi Han, Xi'ai Chen, Yandong Tang
In recent years, there have been an increasing number of applications of tensor completion based on the tensor train (TT) format because of its efficiency and effectiveness in dealing with higher-order tensor data. However, existing tensor completion methods using TT decomposition have two obvious drawbacks. One is that they only consider mode weights according to the degree of mode balance, even though some elements are recovered better in an unbalanced mode. The other is that serious blocking artifacts appear when the missing element rate is relatively large. To remedy these two issues, in this work, we propose a novel tensor completion approach via an element-wise weighted technique. Accordingly, a novel formulation for tensor completion and an effective optimization algorithm, called tensor completion by parallel weighted matrix factorization via tensor train (TWMac-TT), are proposed. In addition, we specifically consider the recovery quality of edge elements from adjacent blocks. Different from traditional reshaping and ket augmentation, we utilize a new tensor augmentation technique called overlapping ket augmentation, which can further avoid blocking artifacts. We then conduct extensive performance evaluations on synthetic data and several real image data sets. Our experimental results demonstrate that the proposed algorithm TWMac-TT outperforms several other competing tensor completion methods.
Ryoya Katafuchi, Terumasa Tokunaga
The utilization of prior knowledge about anomalies is an essential issue for anomaly detection. Recently, the visual attention mechanism has become a promising way to improve the performance of CNNs for some computer vision tasks. In this paper, we propose a novel model called Layer-wise External Attention Network (LEA-Net) for efficient image anomaly detection. The core idea relies on the integration of unsupervised and supervised anomaly detectors via the visual attention mechanism. Our strategy is as follows: (i) prior knowledge about anomalies is represented as the anomaly map generated by unsupervised learning of normal instances, (ii) the anomaly map is translated to an attention map by the external network, (iii) the attention map is then incorporated into intermediate layers of the anomaly detection network. Notably, this layer-wise external attention can be applied to any CNN model in an end-to-end training manner. As a pilot study, we validate LEA-Net on color anomaly detection tasks. Through extensive experiments on the PlantVillage, MVTec AD, and Cloud datasets, we demonstrate that the proposed layer-wise visual attention mechanism consistently boosts the anomaly detection performance of an existing CNN model, even on imbalanced datasets. Moreover, we show that our attention mechanism successfully boosts the performance of several CNN models.