Yao Zhao, Mohammad Saleh, Peter J. Liu
Most prior work in the sequence-to-sequence paradigm focused on datasets with input sequence lengths in the hundreds of tokens due to the computational constraints of common RNN and Transformer architectures. In this paper, we study long-form abstractive text summarization, a sequence-to-sequence setting with input sequence lengths up to 100,000 tokens and output sequence lengths up to 768 tokens. We propose SEAL, a Transformer-based model, featuring a new encoder-decoder attention that dynamically extracts/selects input snippets to sparsely attend to for each output segment. Using only the original documents and summaries, we derive proxy labels that provide weak supervision for extractive layers simultaneously with regular supervision from abstractive summaries. The SEAL model achieves state-of-the-art results on existing long-form summarization tasks, and outperforms strong baseline models on a new dataset/task we introduce, Search2Wiki, with much longer input text. Since content selection is explicit in the SEAL model, a desirable side effect is that the selection can be inspected for enhanced interpretability.
Bartosz Wójcik, Paweł Morawiecki, Marek Śmieja, Tomasz Krzyżek, Przemysław Spurek, Jacek Tabor
We present a mechanism for detecting adversarial examples based on data representations taken from the hidden layers of the target network. For this purpose, we train individual autoencoders at intermediate layers of the target network. This allows us to describe the manifold of true data and, in consequence, decide whether a given example has the same characteristics as true data. It also gives us insight into the behavior of adversarial examples and their flow through the layers of a deep neural network. Experimental results show that our method outperforms the state of the art in supervised and unsupervised settings.
Chieh Wu, Aria Masoomi, Arthur Gretton, Jennifer Dy
There is currently a debate within the neuroscience community over the
likelihood of the brain performing backpropagation (BP). To better mimic the
brain, training a network $\textit{one layer at a time}$ with only a "single
forward pass" has been proposed as an alternative to bypass BP; we refer to
these networks as "layer-wise" networks. We continue the work on layer-wise
networks by answering two outstanding questions. First, $\textit{do they have a
closed-form solution?}$ Second, $\textit{how do we know when to stop adding
more layers?}$ This work proves that the kernel Mean Embedding is the
closed-form weight that achieves the network global optimum while driving these
networks to converge towards a highly desirable kernel for classification; we
call it the $\textit{Neural Indicator Kernel}$.
Authors' comments: This version will be published in AISTATS 2022
Shariq Iqbal, Christian A. Schroeder de Witt, Bei Peng, Wendelin Böhmer, Shimon Whiteson, Fei Sha
Multi-agent settings in the real world often involve tasks with varying types
and quantities of agents and non-agent entities; however, common patterns of
behavior often emerge among these agents/entities. Our method aims to leverage
these commonalities by asking the question: ``What is the expected utility of
each agent when only considering a randomly selected sub-group of its observed
entities?'' By posing this counterfactual question, we can recognize
state-action trajectories within sub-groups of entities that we may have
encountered in another task and use what we learned in that task to inform our
prediction in the current one. We then reconstruct a prediction of the full
returns as a combination of factors considering these disjoint groups of
entities and train this ``randomly factorized'' value function as an auxiliary
objective for value-based multi-agent reinforcement learning. By doing so, our
model can recognize and leverage similarities across tasks to improve learning
efficiency in a multi-task setting. Our approach, Randomized Entity-wise
Factorization for Imagined Learning (REFIL), outperforms all strong baselines
by a significant margin in challenging multi-task StarCraft micromanagement
settings.
Authors' comments: ICML 2021 Camera Ready
Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu
Speaker diarization is an essential step for processing multi-speaker audio.
Although an end-to-end neural diarization (EEND) method achieved
state-of-the-art performance, it is limited to a fixed number of speakers. In
this paper, we solve this fixed-number-of-speakers issue with a novel speaker-wise
conditional inference method based on the probabilistic chain rule. In the
proposed method, each speaker's speech activity is regarded as a single random
variable, and is estimated sequentially conditioned on previously estimated
other speakers' speech activities. Similar to other sequence-to-sequence
models, the proposed method produces a variable number of speakers with a stop
sequence condition. We evaluated the proposed method on multi-speaker audio
recordings of a variable number of speakers. Experimental results show that the
proposed method can correctly produce diarization results with a variable
number of speakers and outperforms the state-of-the-art end-to-end speaker
diarization methods in terms of diarization error rate.
Authors' comments: Submitted to Interspeech 2020
Zaida Zhou, Chaoran Zhuge, Xinwei Guan, Wen Liu
Knowledge distillation transfers the knowledge that the teacher network has learned from the data to the student network, so that the student has the advantage of fewer parameters and less computation while achieving accuracy close to that of the teacher. In this paper, we propose a new distillation method, which contains two transfer distillation strategies and a loss decay strategy. The first transfer strategy is based on channel-wise attention, called Channel Distillation (CD). CD transfers the channel information from the teacher to the student. The second is Guided Knowledge Distillation (GKD). Unlike Knowledge Distillation (KD), which lets the student mimic the teacher's prediction distribution for every sample, GKD only lets the student mimic the correct outputs of the teacher. The last part is Early Decay Teacher (EDT). During the training process, we gradually decay the weight of the distillation loss. The purpose is to let the student, rather than the teacher, gradually take control of the optimization. Our proposed method is evaluated on ImageNet and CIFAR100. On ImageNet, we achieve a top-1 error of 27.68% with ResNet18, which outperforms state-of-the-art methods. On CIFAR100, we achieve the surprising result that the student outperforms the teacher. Code is available at https://github.com/zhouzaida/channel-distillation.
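The GKD and EDT ideas in this abstract can be illustrated with a minimal sketch. It assumes details the abstract does not state: a KL-divergence distillation term, a mask that keeps only samples the teacher classifies correctly, and an exponential decay schedule. The names `gkd_loss` and `edt_weight` are hypothetical and are not taken from the authors' repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gkd_loss(teacher_logits, student_logits, labels):
    """Average KL(teacher || student) over samples the teacher gets right.

    Assumption: samples the teacher misclassifies are simply skipped.
    """
    terms = []
    for t, s, y in zip(teacher_logits, student_logits, labels):
        teacher_prediction = max(range(len(t)), key=t.__getitem__)
        if teacher_prediction == y:
            terms.append(kl_divergence(softmax(t), softmax(s)))
    return sum(terms) / len(terms) if terms else 0.0

def edt_weight(epoch, initial=1.0, decay=0.9):
    """Hypothetical schedule: distillation weight shrinks every epoch."""
    return initial * decay ** epoch

teacher = [[2.0, 0.0], [0.0, 2.0]]
student = [[1.0, 0.0], [1.0, 0.0]]
labels = [0, 0]  # the teacher is right on sample 0 and wrong on sample 1
distill_term = edt_weight(epoch=5) * gkd_loss(teacher, student, labels)
```

In a full training loop, `distill_term` would be added to the usual cross-entropy loss, so the supervised term gradually dominates as the weight decays.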
Wennan Chang, Xinyu Zhou, Yong Zang, Chi Zhang, Sha Cao
Parameter estimation of a mixture regression model using the expectation maximization (EM) algorithm is highly sensitive to outliers. Here we propose a fast and efficient robust mixture regression algorithm, called the Component-wise Adaptive Trimming (CAT) method. We consider simultaneous outlier detection and robust parameter estimation to minimize the effect of outlier contamination. Robust mixture regression has many important applications, including human cancer genomics data, where the population often displays strong heterogeneity added by unwanted technological perturbations. Existing robust mixture regression methods suffer from outliers as they either conduct parameter estimation in the presence of outliers or rely on prior knowledge of the level of outlier contamination. CAT is implemented in the framework of classification expectation maximization, under which a natural definition of outliers can be derived. It implements a least trimmed squares (LTS) approach within each exclusive mixing component, so that the robustness issue is transformed from the mixture case to the simple linear regression case. The high breakdown point of the LTS approach allows us to avoid pre-specifying the trimming parameter. Compared with multiple existing algorithms, CAT is the most competitive one that can handle and adaptively trim off outliers as well as heavy-tailed noise, in different scenarios of simulated data and real genomic data. CAT has been implemented in the R package `RobMixReg', available on CRAN.
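To make the least trimmed squares idea concrete, here is a minimal sketch of LTS for a single simple linear regression, using repeated "fit, rank residuals, keep the best subset" steps. It is not the CAT algorithm: there is no mixture model, and the number of retained points `keep` is fixed by hand here, whereas CAT is described as trimming adaptively. The toy data are invented for illustration.

```python
def ols_fit(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def lts_fit(points, keep, steps=10):
    """Refit on the `keep` points with the smallest absolute residuals."""
    subset = list(points)
    for _ in range(steps):
        intercept, slope = ols_fit(subset)
        ranked = sorted(
            points, key=lambda p: abs(p[1] - (intercept + slope * p[0]))
        )
        subset = ranked[:keep]
    return ols_fit(subset)

# Toy data: nine points on y = 2x + 1 plus one gross outlier at x = 9.
data = [(float(x), 2.0 * x + 1.0) for x in range(9)] + [(9.0, 100.0)]
plain_intercept, plain_slope = ols_fit(data)
robust_intercept, robust_slope = lts_fit(data, keep=8)
```

On this toy data the plain fit is pulled strongly toward the outlier, while the trimmed fit recovers the slope and intercept of the clean line.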
Seungwoo Yoo, Heeseok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, Duck Hoon Kim
In autonomous driving, detecting reliable and accurate lane marker positions is a crucial yet challenging task. Conventional approaches to the lane marker detection problem perform a pixel-level dense prediction task followed by sophisticated post-processing, which is unavoidable since lane markers are typically represented by a collection of line segments without thickness. In this paper, we propose a method that performs direct lane marker vertex prediction in an end-to-end manner, i.e., without any of the post-processing steps required in the pixel-level dense prediction task. Specifically, we translate the lane marker detection problem into a row-wise classification task, which takes advantage of the innate shape of lane markers but, surprisingly, has not been explored well. In order to compactly extract sufficient information about lane markers, which spread from left to right in an image, we devise a novel layer that successively compresses horizontal components and thus enables an end-to-end lane marker detection system in which the final lane marker positions are simply obtained via argmax operations at test time. Experimental results demonstrate the effectiveness of the proposed method, which is on par with or outperforms the state-of-the-art methods on two popular lane marker detection benchmarks, i.e., TuSimple and CULane.
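The row-wise decoding step can be sketched as follows. This is an illustration only: it assumes the network has already produced, for one lane, a score for every column in every image row, and the function name `decode_lane` is hypothetical. The novel compression layer from the abstract is not modeled.

```python
def decode_lane(row_scores):
    """Return one (row, column) vertex per image row via argmax over columns."""
    vertices = []
    for row, scores in enumerate(row_scores):
        column = max(range(len(scores)), key=scores.__getitem__)
        vertices.append((row, column))
    return vertices

# Toy scores for a 2-row, 3-column grid (values invented for illustration).
scores = [
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
]
lane = decode_lane(scores)
```

Because each row yields exactly one column index, no clustering or curve fitting over a dense pixel mask is needed to obtain the lane vertices.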
Abhijith Punnappurath, Michael S. Brown
Imaging sensors digitize incoming scene light at a dynamic range of 10--12 bits (i.e., 1024--4096 tonal values). The sensor image is then processed onboard the camera and finally quantized to only 8 bits (i.e., 256 tonal values) to conform to prevailing encoding standards. There are a number of important applications, such as high-bit-depth displays and photo editing, where it is beneficial to recover the lost bit depth. Deep neural networks are effective at this bit-depth reconstruction task. Given the quantized low-bit-depth image as input, existing deep learning methods employ a single-shot approach that attempts to either (1) directly estimate the high-bit-depth image, or (2) directly estimate the residual between the high- and low-bit-depth images. In contrast, we propose a training and inference strategy that recovers the residual image bitplane-by-bitplane. Our bitplane-wise learning framework has the advantage of allowing for multiple levels of supervision during training and is able to obtain state-of-the-art results using a simple network architecture. We test our proposed method extensively on several image datasets and demonstrate an improvement of 0.5 dB to 2.3 dB PSNR over prior methods, depending on the quantization level.
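The relationship between a high-bit-depth value, its quantized version, and the residual bitplanes can be shown with plain integer arithmetic. The sketch below assumes quantization by dropping the least significant bits, which is a simplification of a real camera pipeline, and it uses the true bitplanes where a network would supply predictions. All function names are hypothetical.

```python
def quantize(value, high_bits, low_bits):
    """Drop the least significant (high_bits - low_bits) bits."""
    return value >> (high_bits - low_bits)

def residual_bitplanes(value, high_bits, low_bits):
    """Bits lost by quantization, most significant first."""
    dropped = high_bits - low_bits
    residual = value & ((1 << dropped) - 1)
    return [(residual >> i) & 1 for i in reversed(range(dropped))]

def reconstruct(low_value, planes):
    """Append recovered bitplanes one at a time, most significant first."""
    value = low_value
    for bit in planes:
        value = (value << 1) | bit
    return value

pixel_12bit = 0b101101110110                     # 2934
pixel_8bit = quantize(pixel_12bit, 12, 8)        # 183
planes = residual_bitplanes(pixel_12bit, 12, 8)  # [0, 1, 1, 0]
restored = reconstruct(pixel_8bit, planes)       # 2934
```

Each pass through the loop in `reconstruct` corresponds to one bitplane, which is where a bitplane-wise model could receive its own supervision signal.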
Daniella C. Bardalez Gagliuffi, Jacqueline K. Faherty, Adam C. Schneider, Aaron Meisner, Dan Caselden, Guilluame Colin, Sam Goodman, J. Davy Kirkpatrick et al.
We present the discovery of WISEA J083011.95+283716.0, the first Y dwarf
candidate identified through the Backyard Worlds: Planet 9 citizen science
project. We identified this object as a red, fast-moving source with a faint
$W2$ detection in multi-epoch \textit{AllWISE} and unWISE images. We have
characterized this object with Spitzer Space Telescope and \textit{Hubble Space
Telescope} follow-up imaging. With mid-infrared detections in
\textit{Spitzer}'s \emph{ch1} and \emph{ch2} bands and flux upper limits in
Hubble Space Telescope $F105W$ and $F125W$ filters, we find that this object is
both very faint and has extremely red colors ($ch1-ch2 = 3.25\pm0.23$ mag,
$F125W-ch2 \geq 9.36$ mag), consistent with a T$_{eff}\sim300$ K source, as
estimated from the known Y dwarf population. A preliminary parallax provides a
distance of $11.1^{+2.0}_{-1.5}$ pc, leading to a slightly warmer temperature
of $\sim350$ K. The extreme faintness and red Hubble Space Telescope and
Spitzer Space Telescope colors of this object suggest it may be a link between
the broader Y dwarf population and the coldest known brown dwarf WISE
J0855$-$0714, and highlight our limited knowledge of the true spread of Y dwarf
colors. We also present four additional Backyard Worlds: Planet 9 late-T brown
dwarf discoveries within 30 pc.
Authors' comments: 13 pages, 6 figures, 5 tables
Chun Jiang Zhu, Song Han, Kam-Yiu Lam
In this paper, we study the problem of fast constructions of source-wise
round-trip spanners in weighted directed graphs. For a source vertex set
$S\subseteq V$ in a graph $G(V,E)$, an $S$-sourcewise round-trip spanner of $G$
of stretch $k$ is a subgraph $H$ of $G$ such that for every pair of vertices
$(u,v)\in S\times V$, their round-trip distance in $H$ is at most $k$ times
their round-trip distance in $G$. We show that for a graph $G(V,E)$ with $n$
vertices and $m$ edges, an $s$-sized source vertex set $S\subseteq V$ and an
integer $k>1$, there exists an algorithm that in time $O(ms^{1/k}\log^5n)$
constructs an $S$-sourcewise round-trip spanner of stretch $O(k\log n)$ and
$O(ns^{1/k}\log^2n)$ edges with high probability. Compared to the fast
algorithms for constructing all-pairs round-trip spanners \cite{PRS+18,CLR+20},
our algorithm improves the running time and the number of edges in the spanner
when $k$ is super-constant. Compared with the existing algorithm for
constructing source-wise round-trip spanners \cite{ZL17}, our algorithm
significantly improves their construction time $\Omega(\min\{ms,n^\omega\})$
(where $\omega \in [2,2.373)$ and 2.373 is the matrix multiplication exponent)
to nearly linear $O(ms^{1/k}\log^5n)$, at the expense of paying an extra
$O(\log n)$ in the stretch. As an important building block of the algorithm, we
develop a graph partitioning algorithm to partition $G$ into clusters of
bounded radius and prove that for every $(u,v)\in S\times V$ at small round-trip
distance, the probability of separating them in different clusters is small.
The algorithm takes the size of $S$ as input and does not need the knowledge of
$S$. With the algorithm and a reachability vertex size estimation algorithm, we
show that the recursive algorithm for constructing standard round-trip spanners
\cite{PRS+18} can be adapted to the source-wise setting.
Authors' comments: Chun Jiang Zhu, Song Han and Kam-Yiu Lam. A Fast Algorithm for
Source-Wise Round-Trip Spanners. Theoretical Computer Science (TCS), 876,
34-44, 2021
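The definition of an $S$-sourcewise round-trip spanner can be checked by brute force on small graphs. The sketch below is not the paper's construction algorithm; it only verifies the stretch condition using Dijkstra's algorithm, and the graph representation and function names are assumptions made for illustration.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from `source` in a weighted directed graph.

    `graph` maps each vertex to a dict of {successor: edge_weight}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def round_trip(graph, u, v):
    """Round-trip distance d(u, v) + d(v, u)."""
    inf = float("inf")
    return dijkstra(graph, u).get(v, inf) + dijkstra(graph, v).get(u, inf)

def is_sourcewise_spanner(graph, subgraph, sources, stretch):
    """Check the stretch condition for every pair in sources x vertices."""
    vertices = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    return all(
        round_trip(subgraph, s, v) <= stretch * round_trip(graph, s, v)
        for s in sources
        for v in vertices
    )

# G is a directed triangle plus a shortcut a -> c; H drops the shortcut.
G = {"a": {"b": 1, "c": 1}, "b": {"c": 1}, "c": {"a": 1}}
H = {"a": {"b": 1}, "b": {"c": 1}, "c": {"a": 1}}
```

In this example the pair (a, c) has round-trip distance 2 in G and 3 in H, so H is a sourcewise spanner for S = {a} with stretch 2 but not with stretch 1.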
Ratnabali Pal, Arif Ahmed Sekh, Samarjit Kar, Dilip K. Prasad
The recent worldwide outbreak of the novel coronavirus (COVID-19) has opened up new challenges to the research community. Artificial intelligence (AI) driven methods can be useful to predict the parameters, risks, and effects of such an epidemic. Such predictions can be helpful to control and prevent the spread of such diseases. The main challenges of applying AI are the small volume of data and its uncertain nature. Here, we propose a shallow long short-term memory (LSTM) based neural network to predict the risk category of a country. We have used a Bayesian optimization framework to optimize and automatically design country-specific networks. The results show that the proposed pipeline outperforms state-of-the-art methods for data of 180 countries and can be a useful tool for such risk categorization. We have also experimented with the trend data and weather data combined for the prediction. The outcome shows that the weather does not have a significant role. The tool can be used to predict long-duration outbreaks of such an epidemic so that we can take preventive steps earlier.
Sukmin Yun, Jongjin Park, Kimin Lee, Jinwoo Shin
Deep neural networks with millions of parameters may suffer from poor
generalization due to overfitting. To mitigate the issue, we propose a new
regularization method that penalizes the predictive distribution between
similar samples. In particular, we distill the predictive distribution between
different samples of the same label during training. This results in
regularizing the dark knowledge (i.e., the knowledge on wrong predictions) of a
single network (i.e., a self-knowledge distillation) by forcing it to produce
more meaningful and consistent predictions in a class-wise manner.
Consequently, it mitigates overconfident predictions and reduces intra-class
variations. Our experimental results on various image classification tasks
demonstrate that the simple yet powerful method can significantly improve not
only the generalization ability but also the calibration performance of modern
convolutional neural networks.
Authors' comments: Accepted to CVPR 2020. Code is available at
https://github.com/alinlab/cs-kd
Qihang Yu, Yingwei Li, Jieru Mei, Yuyin Zhou, Alan L. Yuille
3D Convolution Neural Networks (CNNs) have been widely applied to 3D scene
understanding, such as video analysis and volumetric image recognition.
However, 3D networks can easily lead to over-parameterization which incurs
expensive computation cost. In this paper, we propose Channel-wise Automatic
KErnel Shrinking (CAKES), to enable efficient 3D learning by shrinking standard
3D convolutions into a set of economical operations, e.g., 1D and 2D convolutions.
Unlike previous methods, CAKES performs channel-wise kernel shrinkage, which
enjoys the following benefits: 1) enabling operations deployed in every layer
to be heterogeneous, so that they can extract diverse and complementary
information to benefit the learning process; and 2) allowing for an efficient
and flexible replacement design, which can be generalized to both
spatial-temporal and volumetric data. Further, we propose a new search space
based on CAKES, so that the replacement configuration can be determined
automatically for simplifying 3D networks. CAKES shows superior performance to
other methods with similar model size, and it also achieves comparable
performance to state-of-the-art with much fewer parameters and computational
costs on tasks including 3D medical imaging segmentation and video action
recognition. Codes and models are available at
https://github.com/yucornetto/CAKES
Authors' comments: AAAI 2021
Aliaksei L. Petsiuk, Joshua M. Pearce
This paper describes an open source computer vision-based hardware structure
and software algorithm, which analyzes 3-D printing processes layer-wise,
tracks printing errors, and generates appropriate printer actions to improve
reliability. This approach is built upon multiple-stage monocular image
examination, which allows monitoring both the external shape of the printed
object and internal structure of its layers. Starting with the side-view height
validation, the developed program analyzes the virtual top view for outer shell
contour correspondence using the multi-template matching and iterative closest
point algorithms, as well as inner layer texture quality by clustering the
spatial-frequency filter responses with Gaussian mixture models and segmenting
structural anomalies with the agglomerative hierarchical clustering algorithm.
This allows evaluation of both global and local parameters of the printing
modes. The experimentally-verified analysis time per layer is less than one
minute, which can be considered a quasi-real-time process for large prints. The
system can work as an intelligent printing suspension tool designed to save
time and material. However, the results show the algorithm provides a means to
systematize in situ printing data as a first step in a fully open source
failure correction algorithm for additive manufacturing.
Authors' comments: 29 pages, 19 figures
Khoa D. Doan, Saurav Manchanda, Sarkhan Badirli, Chandan K. Reddy
Image hashing is one of the fundamental problems that demand both efficient and effective solutions for various practical scenarios. Adversarial autoencoders have been shown to implicitly learn a robust, locality-preserving hash function that generates balanced and high-quality hash codes. However, the existing adversarial hashing methods are too inefficient to be employed for large-scale image retrieval applications. Specifically, they require an exponential number of samples to be able to generate optimal hash codes and a significantly high computational cost to train. In this paper, we show that the high sample-complexity requirement often results in sub-optimal retrieval performance of the adversarial hashing methods. To address this challenge, we propose a new adversarial-autoencoder hashing approach that has a much lower sample requirement and computational cost. Specifically, by exploiting the desired properties of the hash function in the low-dimensional, discrete space, our method efficiently estimates a better variant of the Wasserstein distance by averaging a set of easy-to-compute one-dimensional Wasserstein distances. The resulting hashing approach has an order-of-magnitude better sample complexity, and thus a better generalization property, compared to the other adversarial hashing methods. In addition, the computational cost is significantly reduced using our approach. We conduct experiments on several real-world datasets and show that the proposed method outperforms the competing hashing methods, achieving up to 10% improvement over the current state-of-the-art image hashing methods. The code accompanying this paper is available on GitHub (https://github.com/khoadoan/adversarial-hashing).
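The idea of averaging one-dimensional Wasserstein distances can be sketched as follows. The sketch assumes equal-size samples and random unit-vector projections in the style of a sliced Wasserstein distance; the abstract does not say how the one-dimensional problems are chosen, so the projection scheme and the function names are assumptions rather than the authors' method.

```python
import math
import random

def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size samples.

    In one dimension the optimal coupling matches sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

def project(points, direction):
    """Dot product of every point with a direction vector."""
    return [sum(p * d for p, d in zip(pt, direction)) for pt in points]

def sliced_wasserstein(points_a, points_b, num_projections=50, seed=0):
    """Average 1-D Wasserstein distance over random unit directions."""
    rng = random.Random(seed)
    dim = len(points_a[0])
    total = 0.0
    for _ in range(num_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        total += wasserstein_1d(
            project(points_a, direction), project(points_b, direction)
        )
    return total / num_projections

codes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 2.0, y) for x, y in codes]
```

Each one-dimensional distance needs only a sort, which is why an average of such terms is cheap compared with solving a transport problem in the full code space.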
Tomáš Dlask, Tomáš Werner
Coordinate-wise minimization is a simple popular method for large-scale
optimization. Unfortunately, for general (non-differentiable) convex problems
it may not find global minima. We present a class of linear programs that
coordinate-wise minimization solves exactly. We show that dual LP relaxations
of several well-known combinatorial optimization problems are in this class and
the method finds a global minimum with sufficient accuracy in reasonable
runtimes. Moreover, for extensions of these problems that no longer are in this
class the method yields reasonably good suboptima. Though the presented LP
relaxations can be solved by more efficient methods (such as max-flow), our
results are theoretically non-trivial and can lead to new large-scale
optimization algorithms in the future.
Authors' comments: The final authenticated version is available online at
https://doi.org/10.1007/978-3-030-53552-0_8
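The claim that coordinate-wise minimization may miss the global minimum of a non-differentiable convex problem can be seen on a two-variable example. The objective below is a standard textbook-style illustration chosen for this sketch, not one of the linear programs studied in the paper.

```python
def objective(x, y):
    """Convex and piecewise linear; the global minimum is 0 at (0, 0)."""
    return abs(x - y) + 0.5 * abs(x + y)

def minimize_coordinate(fixed):
    """Exact minimizer over one coordinate with the other held at `fixed`.

    The restricted function is convex and piecewise linear, so its minimum
    is attained at one of its two breakpoints, `fixed` or `-fixed`.
    The objective is symmetric in x and y, so one helper serves both.
    """
    return min((fixed, -fixed), key=lambda t: objective(t, fixed))

def coordinate_descent(x, y, sweeps=20):
    """Alternate exact minimization over x and y."""
    for _ in range(sweeps):
        x = minimize_coordinate(y)
        y = minimize_coordinate(x)
    return x, y

stuck = coordinate_descent(1.0, 1.0)
```

Starting from (1, 1), neither coordinate can be improved on its own, so the iterates stay at (1, 1) with objective value 1 even though the global minimum is 0. Moving along the diagonal, which changes both coordinates at once, would decrease the objective.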
Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, Ming-Hsuan Yang
Few-shot classification aims to recognize novel categories with only a few
labeled images in each class. Existing metric-based few-shot classification
algorithms predict categories by comparing the feature embeddings of query
images with those from a few labeled images (support examples) using a learned
metric function. While promising performance has been demonstrated, these
methods often fail to generalize to unseen domains due to the large discrepancy of
the feature distribution across domains. In this work, we address the problem
of few-shot classification under domain shifts for metric-based methods. Our
core idea is to use feature-wise transformation layers for augmenting the image
features using affine transforms to simulate various feature distributions
under different domains in the training stage. To capture variations of the
feature distributions under different domains, we further apply a
learning-to-learn approach to search for the hyper-parameters of the
feature-wise transformation layers. We conduct extensive experiments and
ablation studies under the domain generalization setting using five few-shot
classification datasets: mini-ImageNet, CUB, Cars, Places, and Plantae.
Experimental results demonstrate that the proposed feature-wise transformation
layer is applicable to various metric-based models, and provides consistent
improvements on the few-shot classification performance under domain shift.
Authors' comments: ICLR 2020 (Spotlight). Project page:
http://vllab.ucmerced.edu/ym41608/projects/CrossDomainFewShot Code:
https://github.com/hytseng0509/CrossDomainFewShot
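A feature-wise affine transformation of the kind described in this abstract can be sketched in a few lines. The sampling scheme below (per-channel scale drawn around 1 and shift drawn around 0, with standard deviations given by a softplus of two hyper-parameters) is an assumption made for this illustration, and the parameter names `theta_gamma` and `theta_beta` are hypothetical.

```python
import math
import random

def softplus(x):
    """Smooth positive mapping used to turn a parameter into a std-dev."""
    return math.log1p(math.exp(x))

def feature_wise_transform(features, theta_gamma, theta_beta, rng):
    """Apply a random affine map to each channel of a feature tensor.

    `features` is a list of channels, each a flat list of activations.
    Every activation in a channel shares the same sampled scale and shift.
    """
    transformed = []
    for channel in features:
        gamma = rng.gauss(1.0, softplus(theta_gamma))
        beta = rng.gauss(0.0, softplus(theta_beta))
        transformed.append([gamma * value + beta for value in channel])
    return transformed

rng = random.Random(0)
features = [[0.0, 1.0, 2.0], [3.0, 3.0, 3.0]]
augmented = feature_wise_transform(
    features, theta_gamma=-1.0, theta_beta=-1.0, rng=rng
)
```

The transformation would be applied during training only, so that the metric function sees a wider range of feature statistics than the source domains provide.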
Koki Madono, Masayuki Tanaka, Masaki Onishi, Tetsuji Ogawa
In this study, a perceptually hidden object-recognition method is
investigated to generate secure images recognizable by humans but not machines.
Hence, both the perceptual information hiding and the corresponding object
recognition methods should be developed. Block-wise image scrambling is
introduced to hide perceptual information from a third party. In addition, an
adaptation network is proposed to recognize those scrambled images.
Experimental comparisons conducted using CIFAR datasets demonstrated that the
proposed adaptation network performed well in incorporating simple perceptual
information hiding into DNN-based image classification.
Authors' comments: 6 pages, Artificial Intelligence of Things (AAAI-2020 WS)
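One simple form of block-wise scrambling is to permute the pixels inside every tile with a key-derived permutation. The sketch below assumes this particular scheme, which the abstract does not specify, and represents a grayscale image as a list of rows. The function name `scramble_blocks` is hypothetical.

```python
import random

def scramble_blocks(image, block, key, inverse=False):
    """Permute pixels inside each block x block tile.

    The same key-derived permutation is used for every tile. Rows and
    columns that do not fill a whole tile are left unchanged.
    """
    height, width = len(image), len(image[0])
    perm = list(range(block * block))
    random.Random(key).shuffle(perm)
    out = [row[:] for row in image]
    for top in range(0, height - height % block, block):
        for left in range(0, width - width % block, block):
            coords = [
                (top + i // block, left + i % block)
                for i in range(block * block)
            ]
            for src, dst in enumerate(perm):
                (sr, sc), (dr, dc) = coords[src], coords[dst]
                if inverse:
                    out[sr][sc] = image[dr][dc]
                else:
                    out[dr][dc] = image[sr][sc]
    return out

image = [[4 * r + c for c in range(4)] for r in range(4)]
scrambled = scramble_blocks(image, block=2, key=42)
restored = scramble_blocks(scrambled, block=2, key=42, inverse=True)
```

Because pixels never leave their tile, tile-level statistics are preserved, which is one plausible reason an adaptation network could still learn to classify scrambled inputs.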
S. J. Curran
Machine learning techniques, specifically the k-nearest neighbour algorithm
applied to optical band colours, have had some success in predicting
photometric redshifts of quasi-stellar objects (QSOs): Although the mean of
differences between the spectroscopic and photometric redshifts is close to
zero, the distribution of these differences remains wide and distinctly
non-Gaussian. As per our previous empirical estimate of photometric redshifts,
we find that the predictions can be significantly improved by adding colours
from other wavebands, namely the near-infrared and ultraviolet. Self-testing
this, by using half of the 33 643 strong QSO sample to train the algorithm,
results in a significantly narrower spread for the remaining half of the
sample. Using the whole QSO sample to train the algorithm, the same set of
magnitudes returns a similar spread for a sample of radio sources (quasars).
Although the matching coincidence is relatively low (739 of the 3663 sources
having photometry in the relevant bands), this is still significantly larger
than from the empirical method (2%) and thus may provide a method with which to
obtain redshifts for the vast number of continuum radio sources expected to be
detected with the next generation of large radio telescopes.
Authors' comments: Accepted by MNRAS
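A k-nearest neighbour redshift estimate can be sketched as a plain regression in colour space. The choice of Euclidean distance, an unweighted mean, and the toy colours and redshifts below are all assumptions for illustration and are not values from the paper.

```python
import math

def knn_redshift(train_colours, train_redshifts, query, k=3):
    """Mean redshift of the k training objects nearest to `query`."""
    order = sorted(
        range(len(train_colours)),
        key=lambda i: math.dist(train_colours[i], query),
    )
    nearest = order[:k]
    return sum(train_redshifts[i] for i in nearest) / len(nearest)

# Toy training set: two colour indices per object (invented values).
colours = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
redshifts = [0.5, 0.7, 2.0, 2.2]
estimate = knn_redshift(colours, redshifts, query=(0.05, 0.0), k=2)
```

Adding colours from further wavebands corresponds to appending coordinates to each tuple, which changes which training objects count as nearest.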