Shancong Mou, Xiaoyi Gu, Meng Cao, Haoping Bai, Ping Huang, Jiulong Shan, Jianjun Shi
Generative adversarial networks (GANs), trained on a large-scale image dataset, can be a good approximator of the natural image manifold. GAN-inversion, using a pre-trained generator as a deep generative prior, is a promising tool for image restoration under corruptions. However, the performance of GAN-inversion can be limited by a lack of robustness to unknown gross corruptions, i.e., the restored image might easily deviate from the ground truth. In this paper, we propose a Robust GAN-inversion (RGI) method with a provable robustness guarantee to achieve image restoration under unknown \textit{gross} corruptions, where a small fraction of pixels are completely corrupted. Under mild assumptions, we show that the restored image and the identified corrupted region mask converge asymptotically to the ground truth. Moreover, we extend RGI to Relaxed-RGI (R-RGI) for generator fine-tuning to mitigate the gap between the GAN learned manifold and the true image manifold while avoiding trivial overfitting to the corrupted input image, which further improves the image restoration and corrupted region mask identification performance. The proposed RGI/R-RGI method unifies two important applications with state-of-the-art (SOTA) performance: (i) mask-free semantic inpainting, where the corruptions are unknown missing regions and the restored background can be used to restore the missing content; (ii) unsupervised pixel-wise anomaly detection, where the corruptions are unknown anomalous regions and the retrieved mask can be used as the segmentation mask of the anomalous region.
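The idea of jointly recovering a clean signal and a sparse corruption mask can be illustrated with a toy sketch. The abstract does not state the RGI objective, so everything here is an assumption made for illustration: a linear map `A` stands in for the pre-trained generator, the mask is encouraged to be sparse with an L1 penalty of weight `lam`, and the two unknowns are updated alternately. The names `robust_inversion` and `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(r, lam):
    # Proximal step for the L1 penalty: shrink residuals toward zero.
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def robust_inversion(x, A, lam=1.0, iters=100):
    # Toy stand-in: alternately minimize 0.5*||x - A z - m||^2 + lam*||m||_1
    # over the latent code z (least squares) and the corruption m (shrinkage).
    m = np.zeros_like(x)
    for _ in range(iters):
        z, *_ = np.linalg.lstsq(A, x - m, rcond=None)
        m = soft_threshold(x - A @ z, lam)
    return z, m

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))          # assumed linear "generator"
z_true = np.array([1.0, -1.0, 2.0])
corrupted = np.array([5, 50, 120])     # indices of grossly corrupted entries
x = A @ z_true
x[corrupted] += 10.0                   # gross corruption on a few entries

z_hat, m_hat = robust_inversion(x, A)
print("estimated corrupted entries:", np.flatnonzero(m_hat))
print("latent code estimate:", z_hat)
```

With a real GAN the least-squares step would be replaced by gradient-based optimization over the latent code, and the guarantees mentioned in the abstract rest on the paper's own assumptions rather than on this toy setting.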
Matthew De Furio, Ben W. Lew, Charles A. Beichman, Thomas Roellig, Geoffrey Bryden, David R. Ciardi, Michael R. Meyer, Marcia J. Rieke et al.
The Y-dwarf WISE 1828+2650 is one of the coldest known brown dwarfs with an
effective temperature of $\sim$300 K. It lies at a distance of just 10 pc, and
previous model-based estimates suggest WISE 1828+2650 has a mass of $\sim$5-10
Mj, making it a valuable laboratory for understanding the formation, evolution
and physical characteristics of gas giant planets. However, previous photometry
and spectroscopy have presented a puzzle with the near-impossibility of
simultaneously fitting both the short (0.9-2.0 microns) and long wavelength
(3-5 microns) data. A potential solution to this problem has been the
suggestion that WISE 1828+2650 is a binary system whose composite spectrum
might provide a better match to the data. Alternatively, new models being
developed to fit JWST/NIRSpec and MIRI spectroscopy might provide new insights.
This article describes JWST/NIRCam observations of WISE 1828+2650 in 6 filters
to address the binarity question and to provide new photometry to be used in
model fitting. We also report Adaptive Optics imaging with the Keck 10 m
telescope. We find no evidence for multiplicity for a companion beyond 0.5 AU
with either JWST or Keck. Companion articles will present low and high
resolution spectra of WISE 1828+2650 obtained with both NIRSpec and MIRI.
Authors' comments: 15 pages, 9 figures, Accepted by ApJ on Feb. 21 2023
Shuai Tao, Himavanth Reddy, Jesper Rindom Jensen, Mads Græsbøll Christensen
In this work, we propose a frequency bin-wise method to estimate the
single-channel speech presence probability (SPP) with multiple deep neural
networks (DNNs) in the short-time Fourier transform domain. Since all frequency
bins are typically considered simultaneously as input features for conventional
DNN-based SPP estimators, high model complexity is inevitable. To reduce the
model complexity and the requirements on the training data, we take a single
frequency bin and some of its neighboring frequency bins into account to train
separate gated recurrent units. In addition, the noisy speech and the a
posteriori probability SPP representation are used to train our model. The
experiments were performed on the Deep Noise Suppression challenge dataset. The
experimental results show that the speech detection accuracy can be improved
when we employ the frequency bin-wise model. Finally, we also demonstrate that
our proposed method outperforms most of the state-of-the-art SPP estimation
methods in terms of speech detection accuracy and model complexity.
Authors' comments: Accepted for ICASSP 2023
Marco Landt-Hayen, Willi Rath, Martin Claus, Peer Kröger
Layer-wise relevance propagation (LRP) is a widely used and powerful
technique to reveal insights into various artificial neural network (ANN)
architectures. LRP is often used in the context of image classification. The
aim is to understand which parts of the input sample have the highest relevance
and hence most influence on the model prediction. Relevance can be traced back
through the network to attribute a certain score to each input pixel. Relevance
scores are then combined and displayed as heat maps and give humans an
intuitive visual understanding of classification models. Opening the black box
to understand the classification engine in great detail is essential for domain
experts to gain trust in ANN models. However, there are pitfalls in terms of
model-inherent artifacts included in the obtained relevance maps that can
easily be missed. But for a valid interpretation, these artifacts must not be
ignored. Here, we apply and revise LRP on various ANN architectures trained as
classifiers on geospatial and synthetic data. Depending on the network
architecture, we show techniques to control model focus and give guidance to
improve the quality of obtained relevance maps to separate facts from
artifacts.
Authors' comments: Fixed typo
Wei Tang, Kangning Cui, Raymond H. Chan
Diabetic retinopathy (DR) is a leading global cause of blindness. Early
detection of hard exudates plays a crucial role in identifying DR, which aids
in treating diabetes and preventing vision loss. However, the unique
characteristics of hard exudates, ranging from their inconsistent shapes to
indistinct boundaries, pose significant challenges to existing segmentation
techniques. To address these issues, we present a novel supervised contrastive
learning framework to optimize hard exudate segmentation. Specifically, we
introduce a patch-wise density contrasting scheme to distinguish between areas
with varying lesion concentrations, and therefore improve the model's
proficiency in segmenting small lesions. To handle the ambiguous boundaries, we
develop a discriminative edge inspection module to dynamically analyze the
pixels that lie around the boundaries and accurately delineate the exudates.
Upon evaluation using the IDRiD dataset and comparison with state-of-the-art
frameworks, our method exhibits its effectiveness and shows potential for
computer-assisted hard exudate detection. The code to replicate experiments is
available at github.com/wetang7/HECL/.
Authors' comments: 8 pages, 3 figures, 2 tables. To appear in ISBI 2024
Dingxin Zhang, Jianhui Yu, Chaoyi Zhang, Weidong Cai
Recent interest in point cloud analysis has led to rapid progress in designing
deep learning methods for 3D models. However, state-of-the-art models are not
robust to rotations, which remains an unknown prior to real applications and
harms the model performance. In this work, we introduce a novel Patch-wise
Rotation-invariant network (PaRot), which achieves rotation invariance via
feature disentanglement and produces consistent predictions for samples with
arbitrary rotations. Specifically, we design a siamese training module which
disentangles rotation invariance and equivariance from patches defined over
different scales, e.g., the local geometry and global shape, via a pair of
rotations. However, our disentangled invariant feature loses the intrinsic pose
information of each patch. To solve this problem, we propose a
rotation-invariant geometric relation to restore the relative pose with
equivariant information for patches defined over different scales. Utilising
the pose information, we propose a hierarchical module which implements
intra-scale and inter-scale feature aggregation for 3D shape learning.
Moreover, we introduce a pose-aware feature propagation process with the
rotation-invariant relative pose information embedded. Experiments show that
our disentanglement module extracts high-quality rotation-robust features and
the proposed lightweight model achieves competitive results in rotated 3D
object classification and part segmentation tasks. Our project page is released
at: https://patchrot.github.io/.
Authors' comments: Accepted by AAAI2023
Shashank Agnihotri, Steffen Jung, Margret Keuper
While neural networks allow highly accurate predictions in many tasks, their
lack of robustness towards even slight input perturbations often hampers their
deployment. Adversarial attacks such as the seminal projected gradient descent
(PGD) offer an effective means to evaluate a model's robustness and dedicated
solutions have been proposed for attacks on semantic segmentation or optical
flow estimation. While they attempt to increase the attack's efficiency, a
further objective is to balance its effect, so that it acts on the entire image
domain instead of isolated point-wise predictions. This often comes at the cost
of optimization stability and thus efficiency. Here, we propose CosPGD, an
attack that encourages more balanced errors over the entire image domain while
increasing the attack's overall efficiency. To this end, CosPGD leverages a
simple alignment score computed from any pixel-wise prediction and its target
to scale the loss in a smooth and fully differentiable way. It leads to
efficient evaluations of a model's robustness for semantic segmentation as well
as regression models (such as optical flow, disparity estimation, or image
restoration), and it outperforms the previous SotA attack on
semantic segmentation. We provide code for the CosPGD algorithm and example
usage at https://github.com/shashankskagnihotri/cospgd.
Authors' comments: Accepted at 41st International Conference on Machine Learning (ICML),
2024
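A rough sketch of the kind of loss scaling described in the CosPGD abstract follows. The abstract only says that an alignment score between each pixel-wise prediction and its target scales the loss; the specific choices below are assumptions for illustration: cosine similarity between softmax probabilities and one-hot targets as the score, and cross-entropy as the pixel-wise loss. The function name `cosine_scaled_loss` is hypothetical, and the attack's gradient step and projection are not shown.

```python
import numpy as np

def cosine_scaled_loss(logits, target):
    """Mean of per-pixel cross-entropy weighted by a cosine alignment score.

    logits: array of shape (H, W, C); target: integer labels of shape (H, W).
    """
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    onehot = np.eye(logits.shape[-1])[target]

    # Per-pixel cross-entropy between prediction and target.
    p_target = (probs * onehot).sum(axis=-1)
    ce = -np.log(p_target + 1e-12)

    # Cosine similarity between the probability vector and the one-hot target.
    denom = np.linalg.norm(probs, axis=-1) * np.linalg.norm(onehot, axis=-1)
    cos = p_target / (denom + 1e-12)

    # Scale each pixel's loss by its alignment score, then average.
    return float((cos * ce).mean())

# Uniform two-class predictions on a 2x2 image.
logits = np.zeros((2, 2, 2))
target = np.zeros((2, 2), dtype=int)
loss = cosine_scaled_loss(logits, target)
print(loss)
```

The authors' repository linked in the abstract is the reference for the actual algorithm.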
Grayson C. Petter, Ryan C. Hickox, David M. Alexander, Adam D. Myers, James E. Geach, Kelly E. Whalen, Carolina P. Andonie
Obscuration in quasars may arise from steep viewing angles along the dusty
torus, or instead may represent a distinct phase of supermassive black hole
growth. We test these scenarios by probing the host dark matter halo
environments of $\sim 1.4$ million WISE-selected obscured and unobscured
quasars at $\langle z \rangle = 1.4$ using angular clustering measurements as
well as cross-correlation measurements of quasar positions with the
gravitational lensing of the cosmic microwave background (CMB). We interpret
these signals within a halo occupation distribution (HOD) framework to conclude
that obscured systems reside in more massive effective halos ($ \sim 10^{12.9}
h^{-1} M_{\odot}$) than their unobscured counterparts ($ \sim 10^{12.6} h^{-1}
M_{\odot}$), though we do not detect a difference in the satellite fraction. We
find excellent agreement between the clustering and lensing analyses and show
that this implies the observed difference is robust to uncertainties in the
obscured quasar redshift distribution, highlighting the power of combining
angular clustering and weak lensing measurements. This finding appears in
tension with models that ascribe obscuration exclusively to orientation of the
dusty torus along the line-of-sight, and instead may be consistent with the
notion that some obscured quasars are attenuated by galaxy-scale or
circumnuclear material during an evolutionary phase.
Authors' comments: Accepted to The Astrophysical Journal
Hadi Mansourifar, Steven J. Simske
Conditional Generative Adversarial Nets (CGANs) need a very large dataset with detailed pixel-wise annotations to generate high-quality images. Unfortunately, any amount of missing pixel annotations may significantly impact the result, not only locally but also in annotated areas. To the best of our knowledge, such a challenge has never been investigated in the broader field of GANs. In this paper, we take the first step in this direction to study the problem of CGAN-based satellite image synthesis given partially annotated images. We first define the problem of image synthesis using partially annotated data, and we discuss a scenario in which we face such a challenge. We then propose an effective solution called detail augmentation to address this problem. To do so, we tested two different approaches to augment details to compensate for missing pixel-wise annotations. In the first approach, we augmented the original images with their Canny edges so that the CGAN could use them to compensate for the missing annotations. The second approach, however, attempted to assign a color to all pixels with missing annotations. Eventually, a different CGAN was trained to translate the new feature images into a final output.
Banghua Zhu, Jiantao Jiao, Michael I. Jordan
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max-entropy IRL.
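As an illustration of the setting in this abstract, the sketch below fits a linear reward by maximum likelihood under the Bradley-Terry-Luce model for pairwise comparisons, where the probability that item a is preferred over item b is the logistic function of the reward difference. It is a minimal sketch on synthetic data, not the paper's estimator or analysis: the feature arrays, step size, iteration count, and the name `btl_mle` are all assumptions.

```python
import numpy as np

def btl_mle(phi_a, phi_b, y, lr=0.5, steps=2000):
    """Gradient ascent on the BTL log-likelihood with a linear reward.

    phi_a, phi_b: (n, d) feature arrays of the two compared items.
    y: (n,) array with y[i] = 1 if item a was preferred, else 0.
    """
    x = phi_a - phi_b                      # reward difference is theta . x
    theta = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-x @ theta))   # P(a preferred over b)
        theta += lr * x.T @ (y - p) / len(y)   # average log-likelihood gradient
    return theta

# Synthetic comparisons generated from a known linear reward.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
phi_a = rng.normal(size=(5000, 3))
phi_b = rng.normal(size=(5000, 3))
p_true = 1.0 / (1.0 + np.exp(-(phi_a - phi_b) @ theta_true))
y = (rng.random(5000) < p_true).astype(float)

theta_hat = btl_mle(phi_a, phi_b, y)
print(theta_hat)
```

The pessimistic variant discussed in the abstract additionally accounts for uncertainty in the estimated reward when choosing a policy; that step is not sketched here.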
Kazuki Naganuma, Shunsuke Ono
This paper proposes a method for designing diagonal preconditioners for a
preconditioned primal-dual splitting method (P-PDS), an efficient algorithm
that solves nonsmooth convex optimization problems. To speed up the convergence
of P-PDS, a design method has been proposed to automatically determine
appropriate preconditioners from the problem structure. However, the existing
method has two limitations. One is that it directly accesses all elements of
matrices representing linear operators involved in a given problem, which is
inconvenient for handling linear operators implemented as procedures rather
than matrices. The other is that it takes an element-wise preconditioning
approach, which turns certain types of proximity operators into analytically
intractable forms. To overcome these limitations, we establish an Operator
norm-based design method of Variable-wise Diagonal Preconditioning (OVDP).
First, OVDP constructs diagonal preconditioners using only (upper bounds of)
the operator norms of linear operators, thus eliminating the need for their
explicit matrix representations. Furthermore, since OVDP takes a variable-wise
preconditioning approach, it keeps any proximity operator analytically
computable. We also prove that our preconditioners satisfy the convergence
condition of P-PDS. Finally, we demonstrate the effectiveness and usefulness of
OVDP through applications to mixed noise removal of hyperspectral images,
hyperspectral unmixing, and graph signal recovery.
Authors' comments: Submitted to IEEE Transactions on Signal Processing
Jisu Shin, Seunghyun Shin, Hae-Gon Jeon
Understanding the informative structures of scenes is essential for low-level
vision tasks. Unfortunately, it is difficult to obtain a concrete visual
definition of the informative structures because influences of visual features
are task-specific. In this paper, we propose a single general neural network
architecture for extracting task-specific structure guidance for scenes. To do
this, we first analyze traditional spectral clustering methods, which compute
a set of eigenvectors to model a segmented graph forming small compact
structures on image domains. We then unfold the traditional graph-partitioning
problem into a learnable network, named \textit{Scene Structure Guidance
Network (SSGNet)}, to represent the task-specific informative structures. The
SSGNet yields a set of coefficients of eigenvectors that produces explicit
feature representations of image structures. In addition, our SSGNet is
light-weight ($\sim$ 56K parameters), and can be used as a plug-and-play module
for off-the-shelf architectures. We optimize the SSGNet without any supervision
by proposing two novel training losses that enforce task-specific scene
structure generation during training. Our main contribution is to show that
such a simple network can achieve state-of-the-art results for several
low-level vision applications. We also demonstrate that our network generalizes
well on unseen datasets, compared to existing methods which use structural
embedding frameworks. We further propose a lighter version of SSGNet ($\sim$
29K parameters) for depth computation, SSGNet-D, and successfully execute it on
edge computing devices like Jetson AGX Orin, improving the performance of
the baseline network, even in the wild, with little computational delay.
Authors' comments: 35 pages, 14 figures, journal extension version of SSGNet
(https://ojs.aaai.org/index.php/AAAI/article/view/25322)
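The classical spectral clustering step that the SSGNet abstract starts from can be sketched on a tiny graph: build the graph Laplacian, take its eigenvectors, and use the second-smallest one (the Fiedler vector) to split the nodes. This shows only the traditional procedure, not the learnable network; the six-node graph, its weights, and the name `fiedler_partition` are assumptions for illustration.

```python
import numpy as np

# Two triangles joined by one weak edge (weights chosen for illustration).
W = np.zeros((6, 6))
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.1)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

def fiedler_partition(W):
    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(L)
    # The eigenvector of the second-smallest eigenvalue splits the graph.
    return eigvecs[:, 1] > 0

labels = fiedler_partition(W)
print(labels)
```

Which group receives `True` depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.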
Shishira R Maiya, Sharath Girish, Max Ehrlich, Hanyu Wang, Kwot Sin Lee, Patrick Poirson, Pengxiang Wu, Chen Wang et al.
Implicit Neural Representations (INR) have recently been shown to be a powerful tool for high-quality video compression. However, existing works are limited as they do not explicitly exploit the temporal redundancy in videos, leading to a long encoding time. Additionally, these methods have fixed architectures which do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits separate networks to each group performing patch-wise prediction. This design shares computation within each group, in the spatial and temporal dimensions, resulting in reduced encoding time of the video. The video representation is modeled autoregressively, with networks fit on a current group initialized using weights from the previous group's model. To further enhance efficiency, we perform quantization of the network parameters during training, requiring no post-hoc pruning or quantization. When compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in terms of PSNR) and the encoding speed by 12X, while maintaining the same compression rate. In contrast to prior video INR works which struggle with larger resolution and longer videos, we show that our algorithm is highly flexible and scales naturally due to its patch-wise and autoregressive designs. Moreover, our method achieves variable bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA achieves 6X decoding speed and scales well with more GPUs, making it practical for various deployment scenarios.
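The quality figures in this abstract are reported as PSNR. For reference, a minimal sketch of the standard peak signal-to-noise ratio computation is given below. The abstract does not state the peak value or evaluation protocol, so the 8-bit peak of 255 and the per-frame usage here are assumptions, and the name `psnr` is illustrative.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")   # identical inputs
    return 10.0 * np.log10(max_value ** 2 / mse)

# A flat 4x4 frame and a decoded copy that is off by one gray level everywhere.
frame = np.full((4, 4), 100, dtype=np.uint8)
decoded = frame + 1
value = psnr(frame, decoded)
print(value)
```

Because PSNR is logarithmic in the mean squared error, a gain of a fraction of a dB corresponds to a modest but measurable reduction in reconstruction error.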
Yuhang Zhang, Shishun Tian, Muxin Liao, Zhengyu Zhang, Wenbin Zou, Chen Xu
Video semantic segmentation (VSS) is beneficial for dealing with dynamic scenes due to the continuous property of the real-world environment. On the one hand, some methods alleviate the problem of inconsistent predictions between consecutive frames. On the other hand, other methods employ the previous frame as prior information to assist in segmenting the current frame. Although the previous methods achieve superior performance on independent and identically distributed (i.i.d.) data, they cannot generalize well to other unseen domains. Thus, we explore a new task, the video generalizable semantic segmentation (VGSS) task, which considers both continuous frames and domain generalization. In this paper, we propose a class-wise non-salient region generalized (CNSG) framework for the VGSS task. Concretely, we first define the class-wise non-salient feature, which describes features of the class-wise non-salient region that carry more generalizable information. Then, we propose a class-wise non-salient feature reasoning strategy to select and enhance the most generalized channels adaptively. Finally, we propose an inter-frame non-salient centroid alignment loss to alleviate the problem of inconsistent predictions in the VGSS task. We also extend our video-based framework to the image-based generalizable semantic segmentation (IGSS) task. Experiments demonstrate that our CNSG framework yields significant improvement in the VGSS and IGSS tasks.
Jonathan Balasingham, Viktor Zamaraev, Vitaliy Kurlin
Use of graphs to represent crystal structures has become popular in recent
years as they provide a natural translation from atoms and bonds to nodes and
edges. Graphs capture structure, while remaining invariant to the symmetries
that crystals display. Several works in property prediction, including those
with state-of-the-art results, make use of the Crystal Graph. The present work
offers a graph based on Point-wise Distance Distributions which retains
symmetrical invariance, decreases computational load, and yields similar or
better prediction accuracy on both experimental and simulated crystals.
Authors' comments: 8 pages, 5 tables, 5 figures (4 single column, 1 double column)
Silpa Babu, Sajan Goud Lingala, Namrata Vaswani
This work develops a novel set of algorithms, alternating Gradient Descent
(GD) and minimization for MRI (altGDmin-MRI1 and altGDmin-MRI2), for
accelerated dynamic MRI by assuming an approximate low-rank (LR) model on the
matrix formed by the vectorized images of the sequence. The LR model itself is
well-known in the MRI literature; our contribution is the novel GD-based
algorithms which are much faster, memory efficient, and general compared with
existing work; and careful use of a 3-level hierarchical LR model. By general,
we mean that, with a single choice of parameters, our method provides accurate
reconstructions for multiple accelerated dynamic MRI applications, multiple
sampling rates and sampling schemes.
We show that our methods outperform many of the popular existing approaches
while also being faster than all of them, on average. This claim is based on
comparisons on 8 different retrospectively undersampled multi-coil dynamic MRI
applications, sampled using either 1D Cartesian or 2D pseudo radial
undersampling, at multiple sampling rates. Evaluations on some prospectively
undersampled datasets are also provided. Our second contribution is a mini-batch
subspace tracking extension that can process new measurements and return
reconstructions within a short delay after they arrive. The recovery algorithm
itself is also faster than its batch counterpart.
Authors' comments: 16 Pages (Including Appendix), 9 Figures, 3 Tables. arXiv admin note:
substantial text overlap with arXiv:2206.13618
Md Ershadul Haque, Manoranjan Paul, Anwaar Ulhaq, Tanmoy Debnath
Quantum computing has drawn considerable attention because it promises faster
computation than classical computing for representing and compressing
classical image data in the quantum domain. The main idea of quantum domain
representation is to encode pixel intensities and their coordinates, i.e.,
state label preparation, using quantum bits (qubits). For a larger image, the
state label preparation takes more qubits. To address this growth in qubit
count, a novel SCMNEQR (State Connection Modification Novel Enhanced Quantum
Representation) approach has been proposed that uses fewer qubits to map a
grayscale image of arbitrary size using block-wise state label preparation.
The proposed SCMNEQR approach introduces the state connection using a reset
gate rather than repeating the use of the Toffoli gate used in the existing
approach. The experimental results show that the proposed approach outperforms
the existing methods in terms of compression.
Authors' comments: 11 pages, 12 figures
Qian Li, Yuxiao Hu, Ye Liu, Dongxiao Zhang, Xin Jin, Yuntian Chen
Classical adversarial attacks for Face Recognition (FR) models typically
generate discrete examples for target identity with a single state image.
However, such a paradigm of point-wise attack exhibits poor generalization
against numerous unknown states of identity and can be easily defended. In this
paper, by rethinking the inherent relationship between the face of target
identity and its variants, we introduce a new pipeline of Generalized Manifold
Adversarial Attack (GMAA) to achieve a better attack performance by expanding
the attack range. Specifically, this expansion lies in two aspects - GMAA not
only expands the target to be attacked from one to many to encourage a good
generalization ability for the generated adversarial examples, but it also
expands the latter from discrete points to a manifold by leveraging the domain
knowledge that facial expression changes can be continuous, which enhances the
attack effect as a data augmentation mechanism does. Moreover, we further design
a dual supervision with local and global constraints as a minor contribution to
improve the visual quality of the generated adversarial examples. We
demonstrate the effectiveness of our method based on extensive experiments, and
reveal that GMAA promises a semantically continuous adversarial space with a
higher generalization ability and visual quality.
Authors' comments: Accepted by CVPR2023
Minghao Chen, Renbo Tu, Chenxi Huang, Yuqi Lin, Boxi Wu, Deng Cai
Previous work on action representation learning focused on global
representations for short video clips. In contrast, many practical
applications, such as video alignment, strongly demand learning the intensive
representation of long videos. In this paper, we introduce a new framework of
contrastive action representation learning (CARL) to learn frame-wise action
representation in a self-supervised or weakly-supervised manner, especially for
long videos. Specifically, we introduce a simple but effective video encoder
that considers both spatial and temporal context by combining convolution and
transformer. Inspired by the recent massive progress in self-supervised
learning, we propose a new sequence contrast loss (SCL) applied to two related
views obtained by expanding a series of spatio-temporal data in two versions.
One is the self-supervised version that optimizes embedding space by minimizing
KL-divergence between sequence similarity of two augmented views and prior
Gaussian distribution of timestamp distance. The other is the weakly-supervised
version that builds more sample pairs among videos using video-level labels by
dynamic time warping (DTW). Experiments on FineGym, PennAction, and Pouring
datasets show that our method outperforms previous state-of-the-art by a large
margin for downstream fine-grained action classification, with even faster
inference. Surprisingly, although without training on paired videos like in
previous works, our self-supervised version also shows outstanding performance
in video alignment and fine-grained frame retrieval tasks.
Authors' comments: author conflicts
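Dynamic time warping, which the CARL abstract uses in its weakly-supervised variant, aligns two sequences of different lengths by finding the monotonic matching with the smallest accumulated cost. The sketch below is the textbook dynamic program on scalar sequences with absolute difference as the local cost; how the paper applies DTW to frame embeddings is not described in the abstract, so this is a generic illustration only, and the name `dtw_distance` is hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b[j-1], repeat a[i-1], or advance both.
            cost[i, j] = local + min(cost[i - 1, j],
                                     cost[i, j - 1],
                                     cost[i - 1, j - 1])
    return cost[n, m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```

For vector-valued frame features, the absolute difference would be replaced by a distance between embeddings.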
Honggyu Choi, Zhixiang Chen, Xuepeng Shi, Tae-Kyun Kim
Semi-supervised object detection (SSOD) aims to boost detection performance
by leveraging extra unlabeled data. The teacher-student framework has been
shown to be promising for SSOD, in which a teacher network generates
pseudo-labels for unlabeled data to assist the training of a student network.
Since the pseudo-labels are noisy, filtering the pseudo-labels is crucial to
exploit the potential of such a framework. Unlike existing suboptimal methods, we
propose a two-step pseudo-label filtering for the classification and regression
heads in a teacher-student framework. For the classification head, OCL
(Object-wise Contrastive Learning) regularizes the object representation
learning that utilizes unlabeled data to improve pseudo-label filtering by
enhancing the discriminativeness of the classification score. This is designed
to pull together objects in the same class and push away objects from different
classes. For the regression head, we further propose RUPL
(Regression-Uncertainty-guided Pseudo-Labeling) to learn the aleatoric
uncertainty of object localization for label filtering. By jointly filtering
the pseudo-labels for the classification and regression heads, the student
network receives better guidance from the teacher network for the object detection
task. Experimental results on Pascal VOC and MS-COCO datasets demonstrate the
superiority of our proposed method with competitive performance compared to
existing methods.
Authors' comments: Accepted to BMVC 2022