Utkarsh Singhal, Carlos Esteves, Ameesh Makadia, Stella X. Yu
Computer vision research has long aimed to build systems that are robust to
spatial transformations found in natural data. Traditionally, this is done
using data augmentation or hard-coding invariances into the architecture.
However, too much or too little invariance can hurt, and the correct amount is
unknown a priori and dependent on the instance. Ideally, the appropriate
invariance would be learned from data and inferred at test-time.
We treat invariance as a prediction problem. Given any image, we use a
normalizing flow to predict a distribution over transformations and average the
predictions over them. Since this distribution only depends on the instance, we
can align instances before classifying them and generalize invariance across
classes. The same distribution can also be used to adapt to out-of-distribution
poses. This normalizing flow is trained end-to-end and can learn a much larger
range of transformations than Augerino and InstaAug. When used as data
augmentation, our method shows accuracy and robustness gains on CIFAR-10,
CIFAR10-LT, and TinyImageNet.
Authors' comments: Accepted to ICCV 2023
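Illustrative sketch (not taken from the paper): the abstract above describes predicting a distribution over transformations for each instance and averaging the classifier's predictions over samples from it. The toy below shows only that averaging step. The sampler `sample_angle` stands in for the learned normalizing flow, and `toy_classifier`, the restriction to 2D rotations, and the point-set "image" are all assumptions made for illustration.

```python
import math
import random

def rotate(points, angle):
    """Rotate a list of 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def toy_classifier(points):
    """Hypothetical 2-class model: softmax over the mean x and mean y."""
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    ex, ey = math.exp(mx), math.exp(my)
    return [ex / (ex + ey), ey / (ex + ey)]

def averaged_prediction(points, sample_angle, n_samples=64, seed=0):
    """Average class probabilities over transformations drawn from a sampler.

    `sample_angle(rng)` is a placeholder for an instance-dependent
    distribution over transformations (here: a rotation angle).
    """
    rng = random.Random(seed)
    total = [0.0, 0.0]
    for _ in range(n_samples):
        probs = toy_classifier(rotate(points, sample_angle(rng)))
        total = [t + p for t, p in zip(total, probs)]
    return [t / n_samples for t in total]

# Usage: a narrow Gaussian over angles gives mild invariance to rotation.
pts = [(1.0, 0.0), (2.0, 0.5)]
print(averaged_prediction(pts, lambda rng: rng.gauss(0.0, 0.2)))
```

A wider sampler yields more invariance and a point-mass sampler yields none, which is the trade-off the abstract says should be inferred per instance.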
Ankit Pratap Singh, Namrata Vaswani
This work considers two related learning problems in a federated, attack-prone
setting: federated principal components analysis (PCA) and federated low-rank
column-wise sensing (LRCS). The node attacks are assumed to be Byzantine, which
means that the attackers are omniscient and can collude. We introduce a novel,
provably Byzantine-resilient, communication-efficient, and sample-efficient
algorithm, called Subspace-Median, that solves the PCA problem and is a key
part of the solution for the LRCS problem. We also study the most natural
Byzantine-resilient solution for federated PCA, a geometric median based
modification of the federated power method, and explain why it is not useful.
Our second main contribution is a complete alternating gradient descent (GD)
and minimization (altGDmin) algorithm for Byzantine-resilient horizontally
federated LRCS and sample and communication complexity guarantees for it.
Extensive simulation experiments are used to corroborate our theoretical
guarantees. The ideas that we develop for LRCS are easily extendable to other
LR recovery problems as well.
Authors' comments: 36 pages
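Illustrative sketch (not taken from the paper): the abstract above mentions a geometric-median-based aggregation as the most natural Byzantine-resilient baseline. For readers unfamiliar with the geometric median, the code below computes it with the standard Weiszfeld iteration on plain vectors. It is not the Subspace-Median algorithm, and the iteration count and `eps` guard are arbitrary choices.

```python
import math

def geometric_median(points, n_iter=200, eps=1e-12):
    """Approximate the geometric median of `points` via Weiszfeld iterations.

    The geometric median minimizes the sum of Euclidean distances to the
    points, so a minority of arbitrarily bad points cannot drag it far.
    """
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(n_iter):
        # Weight each point by the inverse of its distance to the estimate.
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        wsum = sum(weights)
        est = [sum(w * p[i] for w, p in zip(weights, points)) / wsum
               for i in range(dim)]
    return est

# Usage: four honest nodes report values near (1, 1); one node is corrupted.
reports = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.05), (1000.0, 1000.0)]
print(geometric_median(reports))
```

The plain mean of the same reports is pulled toward the corrupted value, while the geometric median stays near the honest cluster.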
Kiana D. McFadden, Amy K. Mainzer, Joseph R. Masiero, James M. Bauer, Roc M. Cutri, Dar Dahlen, Frank J. Masci, Jana Pittichová et al.
Probing small main-belt asteroids provides insight into their formation and
evolution through multiple dynamical and collisional processes. These asteroids
also overlap in size with the potentially hazardous near-Earth object
population and supply the majority of these objects. The Lucy mission will
provide an opportunity for study of a small main-belt asteroid, (152830)
Dinkinesh. The spacecraft will perform a flyby of this object on November 1,
2023, in preparation for its mission to the Jupiter Trojan asteroids. We
employed aperture photometry on stacked frames of Dinkinesh obtained by the
Wide-field Infrared Survey Explorer and performed thermal modeling on a
detection at 12 $\mu$m to compute diameter and albedo values. Through this
method, we determined Dinkinesh has an effective spherical diameter of
$0.76^{+0.11}_{-0.21}$ km and a visual geometric albedo of
$0.27^{+0.25}_{-0.06}$ at the 16th and 84th percentiles. This albedo is
consistent with typical stony (S-type) asteroids.
Authors' comments: Submitted to Astrophysical Journal Letters
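Illustrative sketch (not taken from the paper): diameter and visual geometric albedo, as quoted in the abstract above, are tied together through the absolute magnitude H by the standard relation D = 1329 km / sqrt(p_V) * 10^(-H/5). The code below is only that conversion, not the thermal model used by the authors. The abstract does not state H, so the example computes the value implied by the quoted nominal diameter and albedo rather than assuming one.

```python
import math

def diameter_km(abs_mag_h, albedo):
    """Diameter in km from absolute magnitude H and visual geometric albedo."""
    return 1329.0 / math.sqrt(albedo) * 10 ** (-abs_mag_h / 5.0)

def implied_abs_mag(diameter, albedo):
    """Absolute magnitude H implied by a diameter (km) and albedo."""
    return -5.0 * math.log10(diameter * math.sqrt(albedo) / 1329.0)

# Usage with the nominal values quoted in the abstract.
h = implied_abs_mag(0.76, 0.27)
print(h, diameter_km(h, 0.27))
```

Because H is fixed by optical observations, a higher albedo implies a smaller diameter and vice versa, which is why the two quoted uncertainties are asymmetric in opposite directions.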
Hanjiang Hu, Zuxin Liu, Linyi Li, Jiacheng Zhu, Ding Zhao
In recent years, computer vision has made remarkable advancements in
autonomous driving and robotics. However, it has been observed that deep
learning-based visual perception models lack robustness when faced with camera
motion perturbations. The current certification process for assessing
robustness is costly and time-consuming due to the extensive number of image
projections required for Monte Carlo sampling in the 3D camera motion space. To
address these challenges, we present a novel, efficient, and practical
framework for certifying the robustness of 3D-2D projective transformations
against camera motion perturbations. Our approach leverages a smoothing
distribution over the 2D pixel space instead of in the 3D physical space,
eliminating the need for costly camera motion sampling and significantly
enhancing the efficiency of robustness certifications. With the pixel-wise
smoothed classifier, we are able to fully upper bound the projection errors
using a technique of uniform partitioning in camera motion space. Additionally,
we extend our certification framework to a more general scenario where only a
single-frame point cloud is required in the projection oracle. This is achieved
by deriving Lipschitz-based approximated partition intervals. Through extensive
experimentation, we validate the trade-off between effectiveness and efficiency
enabled by our proposed method. Remarkably, our approach achieves approximately
80% certified accuracy while utilizing only 30% of the projected image frames.
Authors' comments: 32 pages, 5 figures, 13 tables
Shiheng Zhang, Jiahao Zhang, Jie Shen, Guang Lin
We present a novel optimization algorithm, element-wise relaxed scalar
auxiliary variable (E-RSAV), that satisfies an unconditional energy dissipation
law and exhibits improved alignment between the modified and the original
energy. Our algorithm features rigorous proofs of linear convergence in the
convex setting. Furthermore, we present a simple accelerated algorithm that
improves the linear convergence rate to super-linear in the univariate case. We
also propose an adaptive version of E-RSAV with Steffensen step size. We
validate the robustness and fast convergence of our algorithm through ample
numerical experiments.
Authors' comments: 25 pages, 7 figures
Oscar Pina, Verónica Vilaplana
End-to-end training of graph neural networks (GNNs) on large graphs presents several memory and computational challenges, and limits the application to shallow architectures as depth exponentially increases the memory and space complexities. In this manuscript, we propose Layer-wise Regularized Graph Infomax, an algorithm to train GNNs layer by layer in a self-supervised manner. We decouple the feature propagation and feature transformation carried out by GNNs to learn node representations in order to derive a loss function based on the prediction of future inputs. We evaluate the algorithm on large inductive graphs and show performance similar to that of other end-to-end methods along with substantially increased efficiency, which enables the training of more sophisticated models on a single device. We also show that our algorithm avoids the oversmoothing of the representations, another common challenge of deep GNNs.
Caroline Brosse, Oscar Defrain, Kazuhiro Kurita, Vincent Limouzy, Takeaki Uno, Kunihiro Wasa
Enumeration problems are often encountered as key subroutines in the exact
computation of graph parameters such as chromatic number, treewidth, or
treedepth. In the case of treedepth computation, the enumeration of
inclusion-wise minimal separators plays a crucial role. However, and quite
surprisingly, the complexity status of this problem has not been settled since
it was posed as an open direction by Kloks and Kratsch in 1998. Recently,
at the PACE 2020 competition dedicated to treedepth computation, solvers
circumvented this by listing all minimal $a$-$b$ separators and filtering
out those that are not inclusion-wise minimal, at the cost of efficiency.
Naturally, having an efficient algorithm for listing inclusion-wise minimal
separators would drastically improve such practical algorithms. In this note,
however, we show that no efficient algorithm is to be expected from an
output-sensitive perspective, namely, we prove that there is no
output-polynomial time algorithm for inclusion-wise minimal separators
enumeration unless P = NP.
Authors' comments: 11 pages, 3 figures
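Illustrative sketch (not taken from the paper): the abstract above describes solvers that list all minimal a-b separators and then filter out those that are not inclusion-wise minimal. The code below shows only that filtering step on an already computed list of candidate vertex sets. The enumeration of candidates is assumed to be done elsewhere, and the function name is hypothetical.

```python
def inclusion_minimal(candidates):
    """Keep only the sets that contain no other candidate as a proper subset."""
    unique = {frozenset(s) for s in candidates}
    # `other < s` is the proper-subset test for frozensets.
    return [s for s in unique if not any(other < s for other in unique)]

# Usage: {1, 2} and {1, 2, 3} both contain {1}, so they are discarded.
separators = [{1, 2}, {1}, {2, 3}, {1, 2, 3}]
print(inclusion_minimal(separators))
```

The filter compares every pair of candidates, and the candidate list itself can be much longer than the final output, which is the efficiency cost the abstract refers to.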
Hasan Saribas, Cagri Yesil, Serdarcan Dilbaz, Halit Orenbas
With the increasing complexity and scale of click-through rate (CTR) prediction tasks in online advertising and recommendation systems, accurately estimating the importance of features has become a critical aspect of developing effective models. In this paper, we propose an attention-based approach that leverages max and mean pooling operations, along with a bit-wise attention mechanism, to enhance feature importance estimation in CTR prediction. Traditionally, pooling operations such as max and mean pooling have been widely used to extract relevant information from features. However, these operations can lead to information loss and hinder the accurate determination of feature importance. To address this challenge, we propose a novel attention architecture that utilizes a bit-based attention structure that emphasizes the relationships between all bits in features, together with maximum and mean pooling. By considering the fine-grained interactions at the bit level, our method aims to capture intricate patterns and dependencies that might be overlooked by traditional pooling operations. To examine the effectiveness of the proposed method, experiments have been conducted on three public datasets. The experiments demonstrated that the proposed method significantly improves the performance of the base models to achieve state-of-the-art results.
Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Haiyi Mei, Weiye Xiao, Lei Yang, Ziwei Liu
3D human generation from 2D images has achieved remarkable progress through
the synergistic utilization of neural rendering and generative models. Existing
3D human generative models mainly generate a clothed 3D human as an
undetectable 3D model in a single pass, while rarely considering the layer-wise
nature of a clothed human body, which often consists of the human body and
various clothes such as underwear, outerwear, trousers, shoes, etc. In this
work, we propose HumanLiff, the first layer-wise 3D human generative model with
a unified diffusion process. Specifically, HumanLiff first generates
minimal-clothed humans, represented by tri-plane features, in a canonical
space, and then progressively generates clothes in a layer-wise manner. In this
way, 3D human generation is formulated as a sequence of
diffusion-based 3D conditional generation steps. To reconstruct more fine-grained 3D
humans with tri-plane representation, we propose a tri-plane shift operation
that splits each tri-plane into three sub-planes and shifts these sub-planes to
enable feature grid subdivision. To further enhance the controllability of 3D
generation with 3D layered conditions, HumanLiff hierarchically fuses tri-plane
features and 3D layered conditions to facilitate the 3D diffusion model
learning. Extensive experiments on two layer-wise 3D human datasets, SynBody
(synthetic) and TightCap (real-world), validate that HumanLiff significantly
outperforms state-of-the-art methods in layer-wise 3D human generation. Our
code will be available at https://skhu101.github.io/HumanLiff.
Authors' comments: Project page: https://skhu101.github.io/HumanLiff/
Jian-Zhou Zhu
Component-wise dimensionally reduced flows (CWDRFs) are characterized by the uniform (over space and time) vanishing of some component(s) of the velocity gradient tensor, and they may be present in various situations under different conditions. A more universal method for specifying and computing barotropic CWDRFs associated with the Navier-Stokes equation is designed for situations besides that in a (cyclic) box. The method is \textit{local} in the sense that global relations involving volume integration are not used, and the enthalpy gradient is used as the primitive variable and computed directly. Such a local method is more useful for, say, testing the physical relevance of CWDRFs, including the real Schur flows proposed recently, or finding their practically meaningful realizations. The local and global methods are shown to be equivalent for CWDRFs in (cyclic) boxes.
Hitoshi Kiya, Ryota Iijima, Teru Nagamori
This article presents block-wise image encryption for the vision transformer
and its applications. Perceptual image encryption for deep learning enables us
not only to protect the visual information of plain images but also to embed
unique features controlled with a key into images and models. However, when
using conventional perceptual encryption methods, the performance of models is
degraded due to the influence of encryption. In this paper, we focus on
block-wise encryption for the vision transformer, and we introduce three
applications: privacy-preserving image classification, access control, and the
combined use of federated learning and encrypted images. Our scheme can have
the same performance as models without any encryption, and it does not require
any network modification. It also allows us to easily update the secret key. In
experiments, the effectiveness of the scheme is demonstrated in terms of
performance degradation and access control on the CIFAR-10 and CIFAR-100
datasets.
Authors' comments: 7 figures, 3 tables. arXiv admin note: substantial text overlap with
arXiv:2207.05366
Hui Kang, Sheng Liu, Huaxi Huang, Tongliang Liu
In real-world datasets, noisy labels are pervasive. The challenge of learning with noisy labels (LNL) is to train a classifier that discerns the actual classes from given instances. For this, the model must identify features indicative of the authentic labels. While research indicates that genuine label information is embedded in the learned features of even inaccurately labeled data, it's often intertwined with noise, complicating its direct application. Addressing this, we introduce channel-wise contrastive learning (CWCL). This method distinguishes authentic label information from noise by undertaking contrastive learning across diverse channels. Unlike conventional instance-wise contrastive learning (IWCL), CWCL tends to yield more nuanced and resilient features aligned with the authentic labels. Our strategy is twofold: firstly, using CWCL to extract pertinent features to identify cleanly labeled samples, and secondly, progressively fine-tuning using these samples. Evaluations on several benchmark datasets validate our method's superiority over existing approaches.
Rui Li, Shenglong Zhou, Dong Liu
Video analysis tasks rely heavily on identifying the pixels from different
frames that correspond to the same visual target. To tackle this problem,
recent studies have advocated feature learning methods that aim to learn
distinctive representations to match the pixels, especially in a
self-supervised fashion. Unfortunately, these methods have difficulty with
tiny or even single-pixel visual targets. Pixel-wise video correspondences were
traditionally related to optical flows, which, however, lead to deterministic
correspondences and lack robustness on real-world videos. We address the
problem of learning features for establishing pixel-wise correspondences.
Motivated by optical flows as well as the self-supervised feature learning, we
propose to use not only labeled synthetic videos but also unlabeled real-world
videos for learning fine-grained representations in a holistic framework. We
adopt an adversarial learning scheme to enhance the generalization ability of
the learned features. Moreover, we design a coarse-to-fine framework to pursue
high computational efficiency. Our experimental results on a series of
correspondence-based tasks demonstrate that the proposed method outperforms
state-of-the-art rivals in both accuracy and efficiency.
Authors' comments: Accepted to ICCV 2023. Code and models are available at
https://github.com/qianduoduolr/FGVC
Chang-Bin Jeon, Kyogu Lee
The loudness war, an ongoing phenomenon in the music industry characterized
by the increasing final loudness of music while reducing its dynamic range, has
been a controversial topic for decades. Music mastering engineers have used
limiters to heavily compress and make music louder, which can induce ear
fatigue and hearing loss in listeners. In this paper, we introduce music
de-limiter networks that estimate uncompressed music from heavily compressed
signals. Inspired by the principle of a limiter, which performs sample-wise
gain reduction of a given signal, we propose the framework of sample-wise gain
inversion (SGI). We also present the musdb-XL-train dataset, consisting of 300k
segments created by applying a commercial limiter plug-in for training
real-world friendly de-limiter networks. Our proposed de-limiter network
achieves excellent performance with a scale-invariant source-to-distortion
ratio (SI-SDR) of 23.8 dB in reconstructing musdb-HQ from musdb-XL data, a
limiter-applied version of musdb-HQ. The training data, codes, and model
weights are available in our repository
(https://github.com/jeonchangbin49/De-limiter).
Authors' comments: Accepted to IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA) 2023
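Illustrative sketch (not taken from the paper): the abstract above frames a limiter as sample-wise gain reduction and proposes sample-wise gain inversion (SGI). The toy below makes that relationship concrete: a limiter multiplies each sample by a gain in (0, 1], and dividing by the same gain undoes it. The instantaneous `toy_limiter` is an assumption for illustration (real limiters use attack and release smoothing), and the true gains are reused here where a de-limiter network would have to estimate them.

```python
def toy_limiter(signal, threshold=0.5):
    """Scale any sample whose magnitude exceeds `threshold` down to it."""
    gains = [1.0 if abs(x) <= threshold else threshold / abs(x)
             for x in signal]
    limited = [g * x for g, x in zip(gains, signal)]
    return limited, gains

def invert_gain(limited, gains):
    """Undo sample-wise gain reduction given a gain for every sample."""
    return [y / g for y, g in zip(limited, gains)]

# Usage: limit a short signal, then restore it from the per-sample gains.
original = [0.1, -0.9, 0.6, -0.2]
limited, gains = toy_limiter(original)
print(limited, invert_gain(limited, gains))
```

With exact gains the inversion is lossless, so the difficulty of de-limiting lies entirely in estimating the gain sequence from the compressed signal alone.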
Hanyu Peng, Guanhua Fang, Ping Li
Instance-wise feature selection and ranking methods can achieve a good
selection of task-friendly features for each sample in the context of neural
networks. However, existing approaches that assume feature subsets to be
independent are imperfect when considering the dependency between features. To
address this limitation, we propose to incorporate the Gaussian copula, a
powerful mathematical technique for capturing correlations between variables,
into the current feature selection framework with no additional changes needed.
Experimental results on both synthetic and real datasets, in terms of
performance comparison and interpretability, demonstrate that our method is
capable of capturing meaningful correlations.
Authors' comments: 15 pages, UAI poster
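Illustrative sketch (not taken from the paper): the abstract above proposes a Gaussian copula to capture dependency between feature-selection decisions. The code below shows the basic copula construction for two features only: draw correlated standard normals, map them through the normal CDF to get dependent uniforms, and threshold those against per-feature selection probabilities. The function names, the two-feature restriction, and the fixed correlation `rho` are assumptions for illustration.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def copula_mask(probs, rho, rng):
    """Sample a 2-feature selection mask with Gaussian-copula dependence.

    Each feature i is selected with marginal probability probs[i];
    `rho` controls how strongly the two selections move together.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    uniforms = (normal_cdf(z1), normal_cdf(z2))
    return tuple(int(u < p) for u, p in zip(uniforms, probs))

# Usage: with rho near 1 the two features tend to be selected together.
rng = random.Random(0)
print([copula_mask((0.5, 0.5), 0.9, rng) for _ in range(5)])
```

Setting `rho` to 0 recovers independent selection, which is the assumption the abstract identifies as a limitation of earlier approaches.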
Yajie Cui, Zhaoxiang Liu, Shiguo Lian
Anomaly detection without priors of the anomalies is challenging. In the field of unsupervised anomaly detection, the traditional auto-encoder (AE) tends to fail because it rests on the assumption that, by training only on normal images, the model will not be able to reconstruct abnormal images correctly. In contrast, we propose a novel patch-wise auto-encoder (Patch AE) framework, which aims at enhancing the reconstruction ability of the AE on anomalies instead of weakening it. Each image patch is reconstructed from the corresponding spatially distributed feature vector of the learned feature representation, i.e., patch-wise reconstruction, which ensures the anomaly sensitivity of the AE. Our method is simple and efficient. It advances state-of-the-art performance on the MVTec AD benchmark, which demonstrates the effectiveness of our model. It shows great potential in practical industrial application scenarios.
Chunjin Yang, Fanman Meng, Shuai Chen, Mingyu Liu, Runtong Zhang
Large-scale vision-language models (LVLMs) pretrained on massive image-text pairs have achieved remarkable success in visual representations. However, existing paradigms to transfer LVLMs to downstream tasks encounter two primary challenges. Firstly, the text features remain fixed after being calculated and cannot be adjusted according to image features, which decreases the model's adaptability. Secondly, the model's output solely depends on the similarity between the text and image features, leading to excessive reliance on LVLMs. To address these two challenges, we introduce a novel two-branch model named the Instance-Wise Adaptive Tuning and Caching (ATC). Specifically, one branch implements our proposed ConditionNet, which guides image features to form an adaptive textual cache that adjusts based on image features, achieving instance-wise inference and improving the model's adaptability. The other branch introduces the similarities between images and incorporates a learnable visual cache, designed to decouple new and previous knowledge, allowing the model to acquire new knowledge while preserving prior knowledge. The model's output is jointly determined by the two branches, thus overcoming the limitations of existing methods that rely solely on LVLMs. Additionally, our method requires limited computing resources to tune parameters, yet outperforms existing methods on 11 benchmark datasets.
Cheng Wen, Baosheng Yu, Rao Fu, Dacheng Tao
A generative model for high-fidelity point clouds is of great importance in synthesizing 3D environments for applications such as autonomous driving and robotics. Despite the recent success of deep generative models for 2D images, it is non-trivial to generate 3D point clouds without a comprehensive understanding of both local and global geometric structures. In this paper, we devise a new 3D point cloud generation framework using a divide-and-conquer approach, where the whole generation process can be divided into a set of patch-wise generation tasks. Specifically, all patch generators are based on learnable priors, which aim to capture the information of geometry primitives. We introduce point- and patch-wise transformers to enable the interactions between points and patches. Therefore, the proposed divide-and-conquer approach contributes to a new understanding of point cloud generation from the geometry constitution of 3D shapes. Experimental results on a variety of object categories from the most popular point cloud dataset, ShapeNet, show the effectiveness of the proposed patch-wise point cloud generation, where it clearly outperforms recent state-of-the-art methods for high-fidelity point cloud generation.
Yunhao Ge, Yuecheng Li, Shuo Ni, Jiaping Zhao, Ming-Hsuan Yang, Laurent Itti
Continual learning aims to emulate the human ability to continually
accumulate knowledge over sequential tasks. The main challenge is to maintain
performance on previously learned tasks after learning new tasks, i.e., to
avoid catastrophic forgetting. We propose a Channel-wise Lightweight
Reprogramming (CLR) approach that helps convolutional neural networks (CNNs)
overcome catastrophic forgetting during continual learning. We show that a CNN
model trained on an old task (or self-supervised proxy task) could be
"reprogrammed" to solve a new task by using our proposed lightweight (very
cheap) reprogramming parameter. With the help of CLR, we have a better
stability-plasticity trade-off to solve continual learning problems: To
maintain stability and retain previous task ability, we use a common
task-agnostic immutable part as the shared "anchor" parameter set. We then add
task-specific lightweight reprogramming parameters to reinterpret the outputs
of the immutable parts, to enable plasticity and integrate new knowledge. To
learn sequential tasks, we only train the lightweight reprogramming parameters
to learn each new task. Reprogramming parameters are task-specific and
exclusive to each task, which makes our method immune to catastrophic
forgetting. To minimize the parameter requirement of reprogramming to learn new
tasks, we make reprogramming lightweight by only adjusting essential kernels
and learning channel-wise linear mappings from anchor parameters to
task-specific domain knowledge. We show that, for general CNNs, the CLR
parameter increase is less than 0.6% for any new task. Our method outperforms
13 state-of-the-art continual learning baselines on a new challenging sequence
of 53 image classification datasets. Code and data are available at
https://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming
Authors' comments: ICCV 2023
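Illustrative sketch (not taken from the paper): the abstract above describes keeping a shared, immutable "anchor" network and adding task-specific, channel-wise linear mappings that reinterpret its outputs. The code below shows the simplest possible form of such a mapping, a per-channel scale and shift applied to a frozen feature map, with one parameter set per task. The parameterization in the paper may differ; `reprogram`, the nested-list feature map, and the task names are assumptions for illustration.

```python
def reprogram(feature_map, scales, biases):
    """Apply a per-channel linear map to a [channel][row][col] feature map."""
    return [[[s * v + b for v in row] for row in channel]
            for channel, s, b in zip(feature_map, scales, biases)]

# Output of the frozen anchor layer: 2 channels, each 1 x 2.
anchor_out = [[[1.0, 2.0]], [[3.0, 4.0]]]

# One small (scales, biases) pair per task; the anchor is never modified.
task_params = {
    "task_a": ([1.0, 1.0], [0.0, 0.0]),  # identity: reuse anchor as-is
    "task_b": ([2.0, 0.5], [1.0, 0.0]),  # reinterpret each channel
}

for name, (scales, biases) in task_params.items():
    print(name, reprogram(anchor_out, scales, biases))
```

Because each task trains and stores only its own small parameter set, learning a new task cannot overwrite what earlier tasks rely on, which is the property the abstract highlights.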
Yongjeong Oh, Jaeho Lee, Christopher G. Brinton, Yo-Seb Jeon
This paper proposes a novel communication-efficient split learning (SL) framework, named SplitFC, which reduces the communication overhead required for transmitting intermediate feature and gradient vectors during the SL training process. The key idea of SplitFC is to leverage different dispersion degrees exhibited in the columns of the matrices. SplitFC incorporates two compression strategies: (i) adaptive feature-wise dropout and (ii) adaptive feature-wise quantization. In the first strategy, the intermediate feature vectors are dropped with adaptive dropout probabilities determined based on the standard deviation of these vectors. Then, by the chain rule, the intermediate gradient vectors associated with the dropped feature vectors are also dropped. In the second strategy, the non-dropped intermediate feature and gradient vectors are quantized using adaptive quantization levels determined based on the ranges of the vectors. To minimize the quantization error, the optimal quantization levels of this strategy are derived in a closed-form expression. Simulation results on the MNIST, CIFAR-100, and CelebA datasets demonstrate that SplitFC outperforms state-of-the-art SL frameworks by significantly reducing communication overheads while maintaining high accuracy.
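Illustrative sketch (not taken from the paper): the SplitFC abstract above describes two compression steps, feature-wise dropout with probabilities driven by each feature's standard deviation, and quantization with levels driven by each feature's range. The code below is a toy version of both. The specific dropout rule (low-dispersion columns are dropped more often), the uniform quantizer, and the fixed number of levels are assumptions; the abstract states that the paper derives optimal quantization levels in closed form, which is not reproduced here.

```python
import random
import statistics

def drop_probabilities(columns, mean_drop=0.5):
    """Hypothetical rule: columns with a lower std get a higher drop chance."""
    stds = [statistics.pstdev(c) for c in columns]
    total = sum(stds)
    if total == 0:
        return [mean_drop] * len(columns)
    budget = (1.0 - mean_drop) * len(columns)
    keep = [min(1.0, budget * s / total) for s in stds]
    return [1.0 - k for k in keep]

def quantize(column, levels=4):
    """Uniformly quantize a column onto `levels` points spanning its range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return list(column)
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in column]

def compress(columns, rng, mean_drop=0.5, levels=4):
    """Drop columns at random, then quantize the surviving ones."""
    probs = drop_probabilities(columns, mean_drop)
    return {i: quantize(c, levels)
            for i, (c, p) in enumerate(zip(columns, probs))
            if rng.random() >= p}

# Usage: each inner list is one feature (a column of the feature matrix).
cols = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 10, 20, 30]]
print(drop_probabilities(cols))
print(compress(cols, random.Random(0)))
```

In a split-learning setting, the indices of dropped columns would also determine which gradient columns are skipped on the backward pass, as the abstract notes.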