Dingxin Zhang, Jianhui Yu, Chaoyi Zhang, Weidong Cai
Recent interest in point cloud analysis has led to rapid progress in designing
deep learning methods for 3D models. However, state-of-the-art models are not
robust to rotations, which remain unknown a priori in real applications and
harm model performance. In this work, we introduce a novel Patch-wise
Rotation-invariant network (PaRot), which achieves rotation invariance via
feature disentanglement and produces consistent predictions for samples with
arbitrary rotations. Specifically, we design a siamese training module which
disentangles rotation invariance and equivariance from patches defined over
different scales, e.g., the local geometry and global shape, via a pair of
rotations. However, our disentangled invariant feature loses the intrinsic pose
information of each patch. To solve this problem, we propose a
rotation-invariant geometric relation to restore the relative pose with
equivariant information for patches defined over different scales. Utilising
the pose information, we propose a hierarchical module which implements
intra-scale and inter-scale feature aggregation for 3D shape learning.
Moreover, we introduce a pose-aware feature propagation process with the
rotation-invariant relative pose information embedded. Experiments show that
our disentanglement module extracts high-quality rotation-robust features and
the proposed lightweight model achieves competitive results in rotated 3D
object classification and part segmentation tasks. Our project page is released
at: https://patchrot.github.io/.
Authors' comments: Accepted by AAAI2023
Shashank Agnihotri, Steffen Jung, Margret Keuper
While neural networks allow highly accurate predictions in many tasks, their
lack of robustness towards even slight input perturbations often hampers their
deployment. Adversarial attacks such as the seminal projected gradient descent
(PGD) offer an effective means to evaluate a model's robustness and dedicated
solutions have been proposed for attacks on semantic segmentation or optical
flow estimation. While they attempt to increase the attack's efficiency, a
further objective is to balance its effect, so that it acts on the entire image
domain instead of isolated point-wise predictions. This often comes at the cost
of optimization stability and thus efficiency. Here, we propose CosPGD, an
attack that encourages more balanced errors over the entire image domain while
increasing the attack's overall efficiency. To this end, CosPGD leverages a
simple alignment score computed from any pixel-wise prediction and its target
to scale the loss in a smooth and fully differentiable way. It leads to
efficient evaluations of a model's robustness for semantic segmentation as well
as regression models (such as optical flow, disparity estimation, or image
restoration), and it outperforms the previous SotA attack on
semantic segmentation. We provide code for the CosPGD algorithm and example
usage at https://github.com/shashankskagnihotri/cospgd.
Authors' comments: Accepted at 41st International Conference on Machine Learning (ICML),
2024
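The loss scaling described above can be illustrated with a minimal sketch. It assumes, as the name CosPGD suggests but the abstract does not spell out, that the alignment score is the cosine similarity between a pixel's softmax prediction and its one-hot target. The function names are hypothetical and are not taken from the authors' repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_scaled_loss(pixel_logits, pixel_targets):
    """Mean over pixels of (alignment score * cross-entropy).

    Assumption: the alignment score is the cosine similarity between
    the softmax prediction and the one-hot target of each pixel.
    """
    total = 0.0
    for logits, target in zip(pixel_logits, pixel_targets):
        probs = softmax(logits)
        one_hot = [1.0 if c == target else 0.0 for c in range(len(logits))]
        score = cosine_similarity(probs, one_hot)
        cross_entropy = -math.log(probs[target])
        total += score * cross_entropy
    return total / len(pixel_logits)
```

In a PGD-style attack, a scaled loss of this kind would replace the plain pixel-wise loss in each gradient step. The sketch covers only the scaling and omits the perturbation update and projection.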
Grayson C. Petter, Ryan C. Hickox, David M. Alexander, Adam D. Myers, James E. Geach, Kelly E. Whalen, Carolina P. Andonie
Obscuration in quasars may arise from steep viewing angles along the dusty
torus, or instead may represent a distinct phase of supermassive black hole
growth. We test these scenarios by probing the host dark matter halo
environments of $\sim 1.4$ million WISE-selected obscured and unobscured
quasars at $\langle z \rangle = 1.4$ using angular clustering measurements as
well as cross-correlation measurements of quasar positions with the
gravitational lensing of the cosmic microwave background (CMB). We interpret
these signals within a halo occupation distribution (HOD) framework to conclude
that obscured systems reside in more massive effective halos ($ \sim 10^{12.9}
h^{-1} M_{\odot}$) than their unobscured counterparts ($ \sim 10^{12.6} h^{-1}
M_{\odot}$), though we do not detect a difference in the satellite fraction. We
find excellent agreement between the clustering and lensing analyses and show
that this implies the observed difference is robust to uncertainties in the
obscured quasar redshift distribution, highlighting the power of combining
angular clustering and weak lensing measurements. This finding appears in
tension with models that ascribe obscuration exclusively to orientation of the
dusty torus along the line-of-sight, and instead may be consistent with the
notion that some obscured quasars are attenuated by galaxy-scale or
circumnuclear material during an evolutionary phase.
Authors' comments: Accepted to The Astrophysical Journal
Hadi Mansourifar, Steven J. Simske
Conditional Generative Adversarial Nets (CGANs) need a very large dataset with detailed pixel-wise annotations to generate high-quality images. Unfortunately, any amount of missing pixel annotations may significantly impact the result, not only locally but also in annotated areas. To the best of our knowledge, this challenge has never been investigated in the broader field of GANs. In this paper, we take the first step in this direction and study the problem of CGAN-based satellite image synthesis given partially annotated images. We first define the problem of image synthesis using partially annotated data and discuss a scenario in which this challenge arises. We then propose an effective solution called detail augmentation to address this problem. To do so, we tested two different approaches to augmenting details to compensate for missing pixel-wise annotations. In the first approach, we augmented the original images with their Canny edges so that the CGAN could compensate for the missing annotations. The second approach instead attempted to assign a color to all pixels with missing annotations. Finally, a different CGAN was trained to translate the new feature images into a final output.
Banghua Zhu, Jiantao Jiao, Michael I. Jordan
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max-entropy IRL.
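As a rough illustration of the setting above, the following sketch fits a linear reward by maximum likelihood under the Bradley-Terry-Luce model, where the probability that item a is preferred over item b is sigmoid(theta . (x_a - x_b)). The plain gradient-ascent optimizer, the step size, and the function names are assumptions made for illustration. The sketch does not include the pessimistic variant discussed in the abstract.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def btl_log_likelihood(theta, comparisons):
    """Log-likelihood of (winner_features, loser_features) pairs under BTL."""
    total = 0.0
    for winner, loser in comparisons:
        margin = sum(t * (w - l) for t, w, l in zip(theta, winner, loser))
        total += math.log(sigmoid(margin))
    return total

def fit_btl_mle(comparisons, dim, steps=500, lr=0.1):
    """Gradient ascent on the mean BTL log-likelihood (hypothetical settings)."""
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for winner, loser in comparisons:
            diff = [w - l for w, l in zip(winner, loser)]
            margin = sum(t * d for t, d in zip(theta, diff))
            # d/dtheta log(sigmoid(margin)) = (1 - sigmoid(margin)) * diff
            weight = 1.0 - sigmoid(margin)
            for i in range(dim):
                grad[i] += weight * diff[i]
        theta = [t + lr * g / len(comparisons) for t, g in zip(theta, grad)]
    return theta
```

With one-dimensional features where the item with feature 1 beats the item with feature 0 in two out of three comparisons, the maximizer satisfies sigmoid(theta) = 2/3, so the fitted parameter should approach log 2.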
Kazuki Naganuma, Shunsuke Ono
This paper proposes a method for designing diagonal preconditioners for a
preconditioned primal-dual splitting method (P-PDS), an efficient algorithm
that solves nonsmooth convex optimization problems. To speed up the convergence
of P-PDS, a design method has been proposed to automatically determine
appropriate preconditioners from the problem structure. However, the existing
method has two limitations. One is that it directly accesses all elements of
matrices representing linear operators involved in a given problem, which is
inconvenient for handling linear operators implemented as procedures rather
than matrices. The other is that it takes an element-wise preconditioning
approach, which turns certain types of proximity operators into analytically
intractable forms. To overcome these limitations, we establish an Operator
norm-based design method of Variable-wise Diagonal Preconditioning (OVDP).
First, OVDP constructs diagonal preconditioners using only (upper bounds of)
the operator norms of linear operators, thus eliminating the need for their
explicit matrix representations. Furthermore, since OVDP takes a variable-wise
preconditioning approach, it keeps any proximity operator analytically
computable. We also prove that our preconditioners satisfy the convergence
condition of P-PDS. Finally, we demonstrate the effectiveness and usefulness of
OVDP through applications to mixed noise removal of hyperspectral images,
hyperspectral unmixing, and graph signal recovery.
Authors' comments: Submitted to IEEE Transactions on Signal Processing
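To make the point about operators "implemented as procedures rather than matrices" concrete, here is a minimal sketch that estimates an operator norm by power iteration using only a forward and an adjoint procedure. The first-order difference operator, the starting vector, and the function names are illustrative assumptions. The sketch does not reproduce the OVDP preconditioner design itself.

```python
import math

def difference(x):
    """Forward operator D: first-order differences of a signal."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]

def difference_adjoint(y):
    """Adjoint D^T, written as a procedure (no matrix is ever formed)."""
    out = [0.0] * (len(y) + 1)
    for i, v in enumerate(y):
        out[i] -= v
        out[i + 1] += v
    return out

def operator_norm(forward, adjoint, dim, iterations=200):
    """Estimate ||A|| by power iteration on A^T A."""
    x = [float(i % 3 + 1) for i in range(dim)]  # arbitrary non-constant start
    scale = math.sqrt(sum(v * v for v in x))
    x = [v / scale for v in x]
    for _ in range(iterations):
        y = adjoint(forward(x))
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return 0.0
        x = [v / norm for v in y]
    # x has unit norm, so ||A x|| is a lower estimate of ||A||.
    return math.sqrt(sum(v * v for v in forward(x)))
```

Power iteration approaches the norm from below, so a design that needs an upper bound would add a safety margin or use an analytic bound. For this difference operator the norm is known to be at most 2.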
Jisu Shin, Seunghyun Shin, Hae-Gon Jeon
Understanding the informative structures of scenes is essential for low-level
vision tasks. Unfortunately, it is difficult to obtain a concrete visual
definition of the informative structures because influences of visual features
are task-specific. In this paper, we propose a single general neural network
architecture for extracting task-specific structure guidance for scenes. To do
this, we first analyze traditional spectral clustering methods, which compute
a set of eigenvectors to model a segmented graph forming small compact
structures on image domains. We then unfold the traditional graph-partitioning
problem into a learnable network, named \textit{Scene Structure Guidance
Network (SSGNet)}, to represent the task-specific informative structures. The
SSGNet yields a set of coefficients of eigenvectors that produces explicit
feature representations of image structures. In addition, our SSGNet is
light-weight ($\sim$ 56K parameters), and can be used as a plug-and-play module
for off-the-shelf architectures. We optimize the SSGNet without any supervision
by proposing two novel training losses that enforce task-specific scene
structure generation during training. Our main contribution is to show that
such a simple network can achieve state-of-the-art results for several
low-level vision applications. We also demonstrate that our network generalizes
well on unseen datasets, compared to existing methods which use structural
embedding frameworks. We further propose a lighter version of SSGNet ($\sim$
29K parameters) for depth computation, SSGNet-D, and successfully execute it on
edge computing devices like Jetson AGX Orin, improving the performance of
the baseline network, even in the wild, with little computational delay.
Authors' comments: 35 pages, 14 figures, journal extension version of SSGNet
(https://ojs.aaai.org/index.php/AAAI/article/view/25322)
Shishira R Maiya, Sharath Girish, Max Ehrlich, Hanyu Wang, Kwot Sin Lee, Patrick Poirson, Pengxiang Wu, Chen Wang et al.
Implicit Neural Representations (INR) have recently been shown to be a powerful tool for high-quality video compression. However, existing works are limited as they do not explicitly exploit the temporal redundancy in videos, leading to a long encoding time. Additionally, these methods have fixed architectures which do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits separate networks to each group performing patch-wise prediction. This design shares computation within each group, in the spatial and temporal dimensions, resulting in reduced encoding time of the video. The video representation is modeled autoregressively, with networks fit on a current group initialized using weights from the previous group's model. To further enhance efficiency, we perform quantization of the network parameters during training, requiring no post-hoc pruning or quantization. When compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in terms of PSNR) and the encoding speed by 12X, while maintaining the same compression rate. In contrast to prior video INR works which struggle with larger resolution and longer videos, we show that our algorithm is highly flexible and scales naturally due to its patch-wise and autoregressive designs. Moreover, our method achieves variable bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA achieves 6X decoding speed and scales well with more GPUs, making it practical for various deployment scenarios.
Yuhang Zhang, Shishun Tian, Muxin Liao, Zhengyu Zhang, Wenbin Zou, Chen Xu
Video semantic segmentation (VSS) is beneficial for dealing with dynamic scenes due to the continuous property of the real-world environment. On the one hand, some methods alleviate the problem of inconsistent predictions between continuous frames. On the other hand, other methods employ the previous frame as prior information to assist in segmenting the current frame. Although the previous methods achieve superior performance on independent and identically distributed (i.i.d.) data, they cannot generalize well to other unseen domains. Thus, we explore a new task, the video generalizable semantic segmentation (VGSS) task, which considers both continuous frames and domain generalization. In this paper, we propose a class-wise non-salient region generalized (CNSG) framework for the VGSS task. Concretely, we first define the class-wise non-salient feature, which describes features of the class-wise non-salient region that carry more generalizable information. Then, we propose a class-wise non-salient feature reasoning strategy to select and enhance the most generalized channels adaptively. Finally, we propose an inter-frame non-salient centroid alignment loss to alleviate the problem of inconsistent predictions in the VGSS task. We also extend our video-based framework to the image-based generalizable semantic segmentation (IGSS) task. Experiments demonstrate that our CNSG framework yields significant improvement in the VGSS and IGSS tasks.
Jonathan Balasingham, Viktor Zamaraev, Vitaliy Kurlin
Use of graphs to represent crystal structures has become popular in recent
years as they provide a natural translation from atoms and bonds to nodes and
edges. Graphs capture structure, while remaining invariant to the symmetries
that crystals display. Several works in property prediction, including those
with state-of-the-art results, make use of the Crystal Graph. The present work
offers a graph based on Point-wise Distance Distributions which retains
symmetrical invariance, decreases computational load, and yields similar or
better prediction accuracy on both experimental and simulated crystals.
Authors' comments: 8 pages, 5 tables, 5 figures (4 single column, 1 double column)
Silpa Babu, Sajan Goud Lingala, Namrata Vaswani
This work develops a novel set of algorithms, alternating Gradient Descent
(GD) and minimization for MRI (altGDmin-MRI1 and altGDmin-MRI2), for
accelerated dynamic MRI by assuming an approximate low-rank (LR) model on the
matrix formed by the vectorized images of the sequence. The LR model itself is
well-known in the MRI literature; our contribution is the novel GD-based
algorithms which are much faster, memory efficient, and general compared with
existing work; and careful use of a 3-level hierarchical LR model. By general,
we mean that, with a single choice of parameters, our method provides accurate
reconstructions for multiple accelerated dynamic MRI applications, multiple
sampling rates and sampling schemes.
We show that our methods outperform many of the popular existing approaches
while also being faster than all of them, on average. This claim is based on
comparisons on 8 different retrospectively undersampled multi-coil dynamic MRI
applications, sampled using either 1D Cartesian or 2D pseudo-radial
undersampling, at multiple sampling rates. Evaluations on some prospectively
undersampled datasets are also provided. Our second contribution is a mini-batch
subspace tracking extension that can process new measurements and return
reconstructions within a short delay after they arrive. The recovery algorithm
itself is also faster than its batch counterpart.
Authors' comments: 16 Pages (Including Appendix), 9 Figures, 3 Tables. arXiv admin note:
substantial text overlap with arXiv:2206.13618
Md Ershadul Haque, Manoranjan Paul, Anwaar Ulhaq, Tanmoy Debnath
Quantum computing has drawn considerable attention for its faster computational
capability compared to classical computing in representing and compressing
classical image data in the quantum domain. The main idea of quantum domain
representation is to encode pixel intensities and their coordinates, i.e.,
state label preparation, using quantum bits (qubits). For a larger image, the
state label preparation takes more qubits. To reduce the number of qubits
required, a novel SCMNEQR (State Connection Modification Novel Enhanced Quantum
Representation) approach has been proposed that uses fewer qubits to map a
grayscale image of arbitrary size using block-wise state label preparation.
The proposed SCMNEQR approach introduces the state connection using a reset
gate rather than repeating the use of the Toffoli gate used in the existing
approach. The experimental results show that the proposed approach outperforms
the existing methods in terms of compression.
Authors' comments: 11 pages, 12 figures
Qian Li, Yuxiao Hu, Ye Liu, Dongxiao Zhang, Xin Jin, Yuntian Chen
Classical adversarial attacks for Face Recognition (FR) models typically
generate discrete examples for target identity with a single state image.
However, such a paradigm of point-wise attack exhibits poor generalization
against numerous unknown states of identity and can be easily defended. In this
paper, by rethinking the inherent relationship between the face of target
identity and its variants, we introduce a new pipeline of Generalized Manifold
Adversarial Attack (GMAA) to achieve a better attack performance by expanding
the attack range. Specifically, this expansion lies in two aspects: GMAA not
only expands the target to be attacked from one to many to encourage a good
generalization ability for the generated adversarial examples, but it also
expands the latter from discrete points to manifold by leveraging the domain
knowledge that facial expression change can be continuous, which enhances the
attack effect as a data augmentation mechanism does. Moreover, we further design
a dual supervision with local and global constraints as a minor contribution to
improve the visual quality of the generated adversarial examples. We
demonstrate the effectiveness of our method based on extensive experiments, and
reveal that GMAA promises a semantically continuous adversarial space with a
higher generalization ability and visual quality.
Authors' comments: Accepted by CVPR2023
Minghao Chen, Renbo Tu, Chenxi Huang, Yuqi Lin, Boxi Wu, Deng Cai
Previous work on action representation learning focused on global
representations for short video clips. In contrast, many practical
applications, such as video alignment, strongly demand learning the intensive
representation of long videos. In this paper, we introduce a new framework of
contrastive action representation learning (CARL) to learn frame-wise action
representation in a self-supervised or weakly-supervised manner, especially for
long videos. Specifically, we introduce a simple but effective video encoder
that considers both spatial and temporal context by combining convolution and
transformer. Inspired by the recent massive progress in self-supervised
learning, we propose a new sequence contrast loss (SCL) applied to two related
views obtained by expanding a series of spatio-temporal data in two versions.
One is the self-supervised version that optimizes embedding space by minimizing
KL-divergence between sequence similarity of two augmented views and prior
Gaussian distribution of timestamp distance. The other is the weakly-supervised
version that builds more sample pairs among videos using video-level labels by
dynamic time warping (DTW). Experiments on FineGym, PennAction, and Pouring
datasets show that our method outperforms the previous state-of-the-art by a
large margin for downstream fine-grained action classification, with even
faster inference. Surprisingly, although not trained on paired videos as in
previous works, our self-supervised version also shows outstanding performance
in video alignment and fine-grained frame retrieval tasks.
Authors' comments: author conflicts
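For readers unfamiliar with dynamic time warping, which the weakly-supervised version above relies on, here is a minimal sketch of the classic DTW distance between two scalar sequences. The absolute-difference cost and the function name are illustrative assumptions. How CARL applies DTW to frame embeddings is not specified in the abstract and is not reproduced here.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

Because the alignment may repeat elements, a sequence and a time-stretched copy of it have distance zero, for example `dtw_distance([1, 2, 3], [1, 2, 2, 3])`.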
Honggyu Choi, Zhixiang Chen, Xuepeng Shi, Tae-Kyun Kim
Semi-supervised object detection (SSOD) aims to boost detection performance
by leveraging extra unlabeled data. The teacher-student framework has been
shown to be promising for SSOD, in which a teacher network generates
pseudo-labels for unlabeled data to assist the training of a student network.
Since the pseudo-labels are noisy, filtering the pseudo-labels is crucial to
exploit the potential of such a framework. Unlike existing suboptimal methods, we
propose a two-step pseudo-label filtering for the classification and regression
heads in a teacher-student framework. For the classification head, OCL
(Object-wise Contrastive Learning) regularizes the object representation
learning that utilizes unlabeled data to improve pseudo-label filtering by
enhancing the discriminativeness of the classification score. This is designed
to pull together objects in the same class and push away objects from different
classes. For the regression head, we further propose RUPL
(Regression-Uncertainty-guided Pseudo-Labeling) to learn the aleatoric
uncertainty of object localization for label filtering. By jointly filtering
the pseudo-labels for the classification and regression heads, the student
network receives better guidance from the teacher network for the object detection
task. Experimental results on Pascal VOC and MS-COCO datasets demonstrate the
superiority of our proposed method with competitive performance compared to
existing methods.
Authors' comments: Accepted to BMVC 2022
Naoki Matsunaga, Masato Ishii, Akio Hayakawa, Kenji Suzuki, Takuya Narihira
Our goal is to develop fine-grained real-image editing methods suitable for
real-world applications. In this paper, we first summarize four requirements
for these methods and propose a novel diffusion-based image editing framework
with pixel-wise guidance that satisfies these requirements. Specifically, we
train pixel-classifiers with a few annotated data and then infer the
segmentation map of a target image. Users then manipulate the map to instruct
how the image will be edited. We utilize a pre-trained diffusion model to
generate edited images aligned with the user's intention with pixel-wise
guidance. The effective combination of the proposed guidance and other
techniques enables highly controllable editing while preserving the region
outside the edited area, which meets our requirements. The experimental results
demonstrate that our proposal outperforms the GAN-based method for editing
quality and speed.
Authors' comments: Accepted by AI for Content Creation (AI4CC) workshop at CVPR 2023
Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu et al.
Differentially private deep learning has recently witnessed advances in
computational efficiency and privacy-utility trade-off. We explore whether
further improvements along the two axes are possible and provide affirmative
answers leveraging two instantiations of \emph{group-wise clipping}. To reduce
the compute time overhead of private learning, we show that \emph{per-layer
clipping}, where the gradient of each neural network layer is clipped
separately, allows clipping to be performed in conjunction with backpropagation
in differentially private optimization. This results in private learning that
is as memory-efficient and almost as fast per training update as non-private
learning for many workflows of interest. While per-layer clipping with constant
thresholds tends to underperform standard flat clipping, per-layer clipping
with adaptive thresholds matches or outperforms flat clipping under given
training epoch constraints, hence attaining similar or better task performance
within less wall time. To explore the limits of scaling (pretrained) models in
differentially private deep learning, we privately fine-tune the 175
billion-parameter GPT-3. We bypass scaling challenges associated with clipping
gradients that are distributed across multiple devices with \emph{per-device
clipping} that clips the gradient of each model piece separately on its host
device. Privately fine-tuning GPT-3 with per-device clipping achieves a task
performance at $\epsilon=1$ better than what is attainable by non-privately
fine-tuning the largest GPT-2 on a summarization task.
Authors' comments: 25 pages
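The difference between flat clipping and per-layer clipping described above can be sketched as follows for a single example's gradient, stored as one list per layer. The thresholds and function names are hypothetical. The sketch omits noise addition, adaptive thresholds, and the interleaving with backpropagation that the abstract mentions.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def flat_clip(layer_grads, threshold):
    """Clip the whole gradient (all layers concatenated) to one L2 threshold."""
    total = math.sqrt(sum(v * v for grad in layer_grads for v in grad))
    scale = min(1.0, threshold / total) if total > 0 else 1.0
    return [[v * scale for v in grad] for grad in layer_grads]

def per_layer_clip(layer_grads, thresholds):
    """Clip each layer's gradient separately to its own L2 threshold."""
    clipped = []
    for grad, threshold in zip(layer_grads, thresholds):
        norm = l2_norm(grad)
        scale = min(1.0, threshold / norm) if norm > 0 else 1.0
        clipped.append([v * scale for v in grad])
    return clipped
```

Per-layer clipping needs only the norm of one layer's gradient at a time, which is why it can in principle be applied as soon as that layer's gradient is available. Flat clipping must wait for the norm over all layers.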
Qianwen Meng, Hangwei Qian, Yong Liu, Lizhen Cui, Yonghui Xu, Zhiqi Shen
Learning semantic-rich representations from raw unlabeled time series data is
critical for downstream tasks such as classification and forecasting.
Contrastive learning has recently shown its promising representation learning
capability in the absence of expert annotations. However, existing contrastive
approaches generally treat each instance independently, which leads to false
negative pairs that share the same semantics. To tackle this problem, we
propose MHCCL, a Masked Hierarchical Cluster-wise Contrastive Learning model,
which exploits semantic information obtained from the hierarchical structure
consisting of multiple latent partitions for multivariate time series.
Motivated by the observation that fine-grained clustering preserves higher
purity while coarse-grained clustering reflects higher-level semantics, we propose a
novel downward masking strategy to filter out false negatives and supplement
positives by incorporating the multi-granularity information from the
clustering hierarchy. In addition, a novel upward masking strategy is designed
in MHCCL to remove outliers of clusters at each partition to refine prototypes,
which helps speed up the hierarchical clustering process and improves the
clustering quality. We conduct experimental evaluations on seven widely-used
multivariate time series datasets. The results demonstrate the superiority of
MHCCL over the state-of-the-art approaches for unsupervised time series
representation learning.
Authors' comments: accepted by AAAI 2023
Yassine Kamri, Julien M. Hendrickx, François Glineur
We propose a unifying framework for the automated computer-assisted worst-case analysis of cyclic block coordinate algorithms in the unconstrained smooth convex optimization setup. We compute exact worst-case bounds for the cyclic coordinate descent and the alternating minimization algorithms over the class of smooth convex functions, and provide sublinear upper and lower bounds on the worst-case rate for the standard class of functions with coordinate-wise Lipschitz gradients. We obtain in particular a new upper bound for cyclic coordinate descent that outperforms the best available ones by an order of magnitude. We also demonstrate the flexibility of our approach by providing new numerical bounds using simpler and more natural assumptions than those normally made for the analysis of block coordinate algorithms. Finally, we provide numerical evidence for the fact that a standard scheme that provably accelerates random coordinate descent to an $O(1/k^2)$ complexity is actually inefficient when used in a (deterministic) cyclic algorithm.
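As a small illustration of the algorithm class analyzed above, the following sketch runs cyclic coordinate descent on a convex quadratic f(x) = 0.5 x^T A x - b^T x with blocks of size one. The step 1/A[i][i] uses the coordinate-wise Lipschitz constant of the gradient, which for a quadratic coincides with exact minimization along that coordinate. The test problem and function name are illustrative assumptions. The sketch does not reproduce the worst-case analysis framework.

```python
def cyclic_coordinate_descent(A, b, cycles=50):
    """Minimize 0.5 * x^T A x - b^T x by sweeping coordinates in fixed order.

    A is assumed symmetric positive definite, so A[i][i] > 0 is the
    coordinate-wise Lipschitz constant of the gradient.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(cycles):
        for i in range(n):  # deterministic cyclic order
            grad_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
            x[i] -= grad_i / A[i][i]
    return x
```

For A = [[2, 1], [1, 2]] and b = [1, 1], the minimizer solves A x = b, so the iterates should approach x = [1/3, 1/3].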
Yuyuan Liu, Choubo Ding, Yu Tian, Guansong Pang, Vasileios Belagiannis, Ian Reid, Gustavo Carneiro
Semantic segmentation models classify pixels into a set of known
(``in-distribution'') visual classes. When deployed in an open world, the
reliability of these models depends on their ability not only to classify
in-distribution pixels but also to detect out-of-distribution (OoD) pixels.
Historically, the poor OoD detection performance of these models has motivated
the design of methods based on model re-training using synthetic training
images that include OoD visual objects. Although successful, these re-trained
methods have two issues: 1) their in-distribution segmentation accuracy may
drop during re-training, and 2) their OoD detection accuracy does not
generalise well to new contexts (e.g., country surroundings) outside the
training set (e.g., city surroundings). In this paper, we mitigate these issues
with: (i) a new residual pattern learning (RPL) module that assists the
segmentation model to detect OoD pixels without affecting the inlier
segmentation performance; and (ii) a novel context-robust contrastive learning
(CoroCL) that enforces RPL to robustly detect OoD pixels among various
contexts. Our approach improves on the previous state-of-the-art by around
10\% FPR and 7\% AuPRC on the Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly
datasets. Our code is available at: https://github.com/yyliu01/RPL.
Authors' comments: The paper contains 16 pages and it is accepted by ICCV'23