Giorgio Saracco, Giorgio Stefani
We study Cheeger and $p$-eigenvalue partition problems depending on a given
evaluation function $\Phi$ for $p\in[1,\infty)$. We prove existence and
regularity of minima, relations among the problems, convergence, and stability
with respect to $p$ and to $\Phi$.
Authors' comments: 33 pages
Suneung Kim, Woo-Jeoung Nam, Seong-Whan Lee
Recently, appearance-based gaze estimation has been attracting attention in computer vision, and remarkable improvements have been achieved using various deep learning techniques. Despite such progress, most methods aim to infer gaze vectors from images directly, which causes overfitting to person-specific appearance factors. In this paper, we address these challenges and propose a novel framework: Stochastic subject-wise Adversarial gaZE learning (SAZE), which trains a network to generalize the appearance of subjects. We design a Face generalization Network (Fgen-Net) using a face-to-gaze encoder and face identity classifier and a proposed adversarial loss. The proposed loss generalizes face appearance factors so that the identity classifier infers a uniform probability distribution. In addition, the Fgen-Net is trained by a learning mechanism that optimizes the network by reselecting a subset of subjects at every training step to avoid overfitting. Our experimental results verify the robustness of the method in that it yields state-of-the-art performance, achieving 3.89 and 4.42 on the MPIIGaze and EyeDiap datasets, respectively. Furthermore, we demonstrate the positive generalization effect by conducting further experiments using face images involving different styles generated from a generative model.
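The two ideas in the abstract above (an identity classifier pushed toward a uniform output, and a subject subset redrawn at every step) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the loss, so the KL divergence to the uniform distribution, the function names, and the sampling helper are all assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_identity_loss(identity_logits):
    """KL(U || p), where U is uniform over identities and p = softmax(logits).

    Assumed form of the adversarial term: it is zero exactly when the
    identity classifier outputs a uniform distribution.
    """
    probs = softmax(identity_logits)
    u = 1.0 / len(probs)
    return sum(u * math.log(u / p) for p in probs)


def sample_subjects(subject_ids, subset_size, rng):
    """Draw a fresh subset of subjects for one training step (hypothetical helper)."""
    return sorted(rng.sample(subject_ids, subset_size))
```

Calling `sample_subjects(list(range(15)), 5, random.Random(0))` once per step gives a different subset each time, and the loss grows as the identity logits become more peaked.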
Ye Lin Tun, Chu Myaet Thwal, Le Quang Huy, Minh N. H. Nguyen, Choong Seon Hong
Many recent studies integrate federated learning (FL) with self-supervised learning (SSL) to take advantage of raw training data distributed across edge devices. However, edge devices often struggle with high computation and communication costs imposed by SSL and FL algorithms. To tackle this hindrance, we propose LW-FedSSL, a layer-wise federated self-supervised learning approach that allows edge devices to incrementally train one layer of the model at a time. LW-FedSSL comprises server-side calibration and representation alignment mechanisms to maintain comparable performance with end-to-end FedSSL while significantly lowering clients' resource requirements. The server-side calibration mechanism takes advantage of the resource-rich server in an FL environment to assist in global model training. Meanwhile, the representation alignment mechanism encourages closeness between representations of FL local models and those of the global model. Our experiments show that LW-FedSSL has a $3.3 \times$ lower memory requirement and a $3.2 \times$ cheaper communication cost than its end-to-end counterpart. We also explore a progressive training strategy called Prog-FedSSL that outperforms end-to-end training with a similar memory requirement and a $1.8 \times$ cheaper communication cost.
Nicolás Ayobi, Santiago Rodríguez, Alejandra Pérez, Isabela Hernández, Nicolás Aparicio, Eugénie Dessevres, Sebastián Peña, Jessica Santander et al.
This paper presents the Holistic and Multi-Granular Surgical Scene
Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that
models surgical scene understanding as a hierarchy of complementary tasks with
varying levels of granularity. Our approach encompasses long-term tasks, such
as surgical phase and step recognition, and short-term tasks, including
surgical instrument segmentation and atomic visual action detection. To
exploit our proposed benchmark, we introduce the Transformers for Actions,
Phases, Steps, and Instrument Segmentation (TAPIS) model, a general
architecture that combines a global video feature extractor with localized
region proposals from an instrument segmentation model to tackle the
multi-granularity of our benchmark. Through extensive experimentation on our own
and alternative benchmarks, we demonstrate TAPIS's versatility and
state-of-the-art performance across different tasks. This work represents a
foundational step forward in Endoscopic Vision, offering a novel framework for
future research towards holistic surgical scene understanding.
Authors' comments: Preprint submitted to Medical Image Analysis. Official extension of
previous MICCAI 2022
(https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42) and ISBI
2023 (https://ieeexplore.ieee.org/document/10230819) orals. Data and codes
are available at https://github.com/BCV-Uniandes/GraSP
Marco Pacini, Xiaowen Dong, Bruno Lepri, Gabriele Santin
Equivariant neural networks have shown improved performance, expressiveness
and sample complexity on symmetrical domains. But for some specific symmetries,
representations, and choice of coordinates, the most common point-wise
activations, such as ReLU, are not equivariant, hence they cannot be employed
in the design of equivariant neural networks. The theorem we present in this
paper describes all possible combinations of finite-dimensional
representations, choice of coordinates and point-wise activations to obtain an
exactly equivariant layer, generalizing and strengthening existing
characterizations. Notable cases of practical relevance are discussed as
corollaries. Indeed, we prove that rotation-equivariant networks can only be
invariant, as it happens for any network which is equivariant with respect to
connected compact groups. Then, we discuss implications of our findings when
applied to important instances of exactly equivariant networks. First, we
completely characterize permutation equivariant networks such as Invariant
Graph Networks with point-wise nonlinearities and their geometric counterparts,
highlighting a plethora of models whose expressive power and performance are
still unknown. Second, we show that feature spaces of disentangled steerable
convolutional neural networks are trivial representations.
Authors' comments: Accepted at the 12th International Conference on Learning
Representations (ICLR 2024)
Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu
Pre-trained Vision-Language (V-L) models set the benchmark for generalization
to downstream tasks among the noteworthy contenders. Many characteristics of
the V-L model have been explored in existing research including the challenge
of the sensitivity to text input and the tuning process across multi-modal
prompts. With the advanced utilization of the V-L model like CLIP, recent
approaches deploy learnable prompts instead of hand-crafted prompts to boost the
generalization performance and address the aforementioned challenges. Inspired
by layer-wise training, which is widely used in image fusion, we note that
using a sequential training process to adapt the different modality branches of
CLIP efficiently facilitates the improvement of generalization. In the context
of addressing the multi-modal prompting challenge, we propose Token-wise
Adaptive for Multi-modal Prompt Learning (APLe) for tuning both modalities'
prompts, vision and language, as tokens in a sequential manner. APLe addresses
the challenges in V-L models to promote prompt learning across both modalities,
which indicates a competitive generalization performance in line with the
state-of-the-art. Preeminently, APLe shows robustness and favourable
performance in prompt-length experiments with an absolute advantage in adopting
the V-L models.
Authors' comments: 7 pages, 3 figures
Jonathan Fischer, Martin Schulze, Paul Rosenthal, Lars Linsen
When employing Direct Volume Rendering (DVR) for visualizing volumetric scalar fields, classification is generally performed on a piecewise constant or piecewise linear approximation of the field on a viewing ray. Smoothed Particle Hydrodynamics (SPH) data sets define volumetric scalar fields as the sum of individual particle contributions, at highly varying spatial resolution. We present an approach for approximating SPH scalar fields along viewing rays with piece-wise polynomial functions of higher order. This is done by approximating each particle contribution individually and then efficiently summing the results, thus generating a higher-order representation of the field with a resolution adapting to the data resolution in the volume.
Firas Laakom, Yuheng Bu, Moncef Gabbouj
Existing generalization theories of supervised learning typically take a
holistic approach and provide bounds for the expected generalization over the
whole data distribution, which implicitly assumes that the model generalizes
similarly for all the classes. In practice, however, there are significant
variations in generalization performance among different classes, which cannot
be captured by the existing generalization bounds. In this work, we tackle this
problem by theoretically studying the class-generalization error, which
quantifies the generalization performance of each individual class. We derive a
novel information-theoretic bound for class-generalization error using the KL
divergence, and we further obtain several tighter bounds using the conditional
mutual information (CMI), which are significantly easier to estimate in
practice. We empirically validate our proposed bounds in different neural
networks and show that they accurately capture the complex class-generalization
error behavior. Moreover, we show that the theoretical tools developed in this
paper can be applied in several applications beyond this context.
Authors' comments: 26 pages
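To make the quantity studied in the abstract above concrete, the sketch below estimates a per-class generalization gap empirically. It assumes the 0-1 loss and uses a held-out test set as a stand-in for the data distribution; the function names are hypothetical and the paper's formal definition may differ in detail.

```python
from collections import defaultdict


def per_class_error(labels, predictions):
    """Fraction of misclassified samples for each true class."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for y, y_hat in zip(labels, predictions):
        totals[y] += 1
        if y != y_hat:
            wrong[y] += 1
    return {c: wrong[c] / totals[c] for c in totals}


def class_generalization_gap(train_labels, train_preds, test_labels, test_preds):
    """Test error minus training error, computed separately for each class."""
    train_err = per_class_error(train_labels, train_preds)
    test_err = per_class_error(test_labels, test_preds)
    return {c: test_err[c] - train_err[c] for c in train_err if c in test_err}
```

A model can have a small average gap while one class carries most of it, which is the variation a single bound over the whole distribution cannot express.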
Claire Greenwell, Poshak Gandhi, Daniel Stern, George Lansbury, Vincenzo Mainieri, Peter Boorman, Yoshiki Toba
The growth of active galactic nuclei (AGN) occurs under some form of
obscuration in a large fraction of the population. The difficulty in
constraining this population leads to high uncertainties in cosmic X-ray
background and galaxy evolution models. Using an SDSS-WISE cross-match, we
target infrared luminous AGN ($W1-W2$ > 0.8, and monochromatic rest-frame
luminosity above $\lambda L_{\lambda}$(12$\mu$m) $\approx$ 3 $\times$ 10$^{44}$
erg s$^{-1}$), but with passive galaxy-like optical spectra (Optically
Quiescent Quasars; OQQs). We find 47 objects that show no significant [O
III]$\lambda$5007 emission, a typically strong AGN optical emission line. As a
comparison sample, we examine SDSS-selected Type 2 quasars (QSO2s), which show
a significant [O III]$\lambda$5007 line by definition. We find a 1:16 ratio of
OQQs compared to QSO2s, suggesting that the OQQ duty cycle is likely much
shorter than that of QSO2s (though selection biases are not fully quantified).
We consider observed properties in comparison with other galaxy types, and
examine them for consistency with theories on their intrinsic nature: chiefly
(a) a high covering factor for surrounding obscuring matter, preventing the
detection of high-ionisation emission lines - `cocooned AGN'; or (b) ionised
gas being absent on the kpc scales of the Narrow Line Region (NLR), perhaps due
to a `switching on' or `young' AGN. OQQs do not obviously fit the standard
paradigm for merger-driven AGN and host galaxy evolution, implying we may be
missing part of the flow of AGN evolution.
Authors' comments: Accepted for publication in MNRAS (20 pages, 18 figures)
Andreas Papachristodoulou, Christos Kyrkou, Stelios Timotheou, Theocharis Theocharides
The Forward-Forward (FF) Algorithm has been recently proposed to alleviate
the issues of backpropagation (BP) commonly used to train deep neural networks.
However, its current formulation exhibits limitations such as the generation of
negative data, slower convergence, and inadequate performance on complex tasks.
In this paper, we take the main ideas of FF and improve them by leveraging
channel-wise competitive learning in the context of convolutional neural
networks for image classification tasks. A layer-wise loss function is
introduced that promotes competitive learning and eliminates the need for
negative data construction. To enhance both the learning of compositional
features and feature space partitioning, a channel-wise feature separator and
extractor block is proposed that complements the competitive learning process.
Our method outperforms recent FF-based models on image classification tasks,
achieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST,
Fashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the
performance gap between FF learning and BP methods, indicating the potential of
our proposed approach to learn useful representations in a layer-wise modular
fashion, enabling more efficient and flexible learning.
Authors' comments: To be published in AAAI 2024, 11 pages, 7 figures
Gwladys Kelodjou, Laurence Rozé, Véronique Masson, Luis Galárraga, Romaric Gaudel, Maurice Tchuente, Alexandre Termier
Machine learning techniques, such as deep learning and ensemble methods, are
widely used in various domains due to their ability to handle complex
real-world tasks. However, their black-box nature has raised multiple concerns
about the fairness, trustworthiness, and transparency of computer-assisted
decision-making. This has led to the emergence of local post-hoc explainability
methods, which offer explanations for individual decisions made by black-box
algorithms. Among these methods, Kernel SHAP is widely used due to its
model-agnostic nature and its well-founded theoretical framework. Despite these
strengths, Kernel SHAP suffers from high instability: different executions of
the method with the same inputs can lead to significantly different
explanations, which diminishes the utility of post-hoc explainability. The
contribution of this paper is two-fold. On the one hand, we show that Kernel
SHAP's instability is caused by its stochastic neighbor selection procedure,
which we adapt to achieve full stability without compromising explanation
fidelity. On the other hand, we show that by restricting the neighbors
generation to perturbations of size 1 -- which we call the coalitions of Layer
1 -- we obtain a novel feature-attribution method that is fully stable,
efficient to compute, and still meaningful.
Authors' comments: To appear in AAAI-24
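The stability argument in the abstract above rests on replacing random neighbor sampling with a fixed, enumerable set of perturbations. The sketch below illustrates that idea for size-1 coalitions only. It is a simplified stand-in rather than the paper's estimator: it assumes that a size-1 coalition keeps one feature of the instance and takes the rest from a background reference, and that the attribution is the resulting change in model output. All names are hypothetical.

```python
def layer1_coalitions(num_features):
    """Enumerate binary masks that keep exactly one feature of the instance."""
    return [[1 if j == i else 0 for j in range(num_features)]
            for i in range(num_features)]


def apply_mask(instance, background, mask):
    """Take instance values where the mask is 1 and background values elsewhere."""
    return [x if keep else b for x, b, keep in zip(instance, background, mask)]


def layer1_attributions(model, instance, background):
    """Change in model output when each feature alone is switched on.

    The enumeration is deterministic, so repeated calls with the same
    inputs return identical attributions.
    """
    base = model(background)
    return [model(apply_mask(instance, background, mask)) - base
            for mask in layer1_coalitions(len(instance))]
```

For a linear model the result is `w_i * (x_i - b_i)` per feature, and only `num_features + 1` model evaluations are needed.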
Chanyong Jung, Gihyun Kwon, Jong Chul Ye
Recently, patch-wise contrastive learning has drawn attention for image
translation by exploring the semantic correspondence between the input and
output images. To further explore the patch-wise topology for high-level
semantic understanding, here we exploit the graph neural network to capture the
topology-aware features. Specifically, we construct the graph based on the
patch-wise similarity from a pretrained encoder, whose adjacency matrix is
shared to enhance the consistency of patch-wise relation between the input and
the output. Then, we obtain the node feature from the graph neural network, and
enhance the correspondence between the nodes by increasing mutual information
using the contrastive loss. In order to capture the hierarchical semantic
structure, we further propose graph pooling. Experimental results
demonstrate state-of-the-art results for image translation thanks to the
semantic encoding by the constructed graphs.
Authors' comments: AAAI 2024
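The graph construction described in the abstract above can be sketched as follows. This assumes cosine similarity between patch features and a fixed threshold for creating edges; the abstract does not state the exact rule, and the function names and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similarity_adjacency(patch_features, threshold=0.5):
    """Binary adjacency matrix linking patches whose features are similar.

    Assumed rule: connect patches i != j when their cosine similarity
    reaches the threshold.
    """
    n = len(patch_features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and cosine_similarity(patch_features[i], patch_features[j]) >= threshold:
                adj[i][j] = 1
    return adj
```

In the setting of the abstract, the matrix would be computed once from the input image's encoder features and reused for the output image, so both graphs share the same topology.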
Jonggyu Jang, Hyeonsu Lyu, Hyun Jong Yang
Model inversion (MI) attacks aim to infer or reconstruct the training dataset
through reverse-engineering from the target model's weights. Recently,
significant advancements in generative models have enabled MI attacks to
overcome challenges in producing photo-realistic replicas of the training
dataset, a technique known as generative MI. The generative MI primarily
focuses on identifying latent vectors that correspond to specific target
labels, leveraging a generative model trained with an auxiliary dataset.
However, an important aspect is often overlooked: the MI attacks fail if the
pre-trained generative model lacks the coverage to create an image
corresponding to the target label, especially when there is a significant
difference between the target and auxiliary datasets. To address this gap, we
propose the Patch-MI method, inspired by a jigsaw puzzle, which offers a novel
probabilistic interpretation of MI attacks. Even with a dissimilar auxiliary
dataset, our method effectively creates images that closely mimic the
distribution of image patches in the target dataset by patch-based
reconstruction. Moreover, we numerically demonstrate that Patch-MI improves
Top-1 attack accuracy by 5 percentage points compared to existing methods.
Authors' comments: 12 pages
Ziyi Yin, Rafael Orozco, Mathias Louboutin, Felix J. Herrmann
We introduce a probabilistic technique for full-waveform inversion, employing variational inference and conditional normalizing flows to quantify uncertainty in migration-velocity models and its impact on imaging. Our approach integrates generative artificial intelligence with physics-informed common-image gathers, reducing reliance on accurate initial velocity models. The case studies considered demonstrate its efficacy in producing realizations of migration-velocity models conditioned by the data. These models are used to quantify amplitude and positioning effects during subsequent imaging.
Robert Mnatsakanov, Rafik Aramyan, Farhad Jafari
The problem of recovering a moment-determinate multivariate function $f$ via its moment sequence is studied. Under mild conditions on $f$, the point-wise and $L_1$-rates of convergence for the proposed constructions are established. The cases where $f$ is the indicator function of a set or represents a discrete probability mass function are also investigated. Calculations of the approximants and simulation studies are conducted to graphically illustrate the behavior of the approximations in several simple examples. Analytical and simulated errors of the proposed approximations are recorded in Tables 1-3.
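As a small illustration of recovering a function from its moments, the sketch below approximates a distribution function on $[0,1]$ from its first $\alpha$ moments. It is a univariate example based on the identity $E[\binom{\alpha}{k} X^k (1-X)^{\alpha-k}] = \sum_{j=k}^{\alpha} \binom{\alpha}{j}\binom{j}{k}(-1)^{j-k}\mu_j$. Whether this matches the multivariate constructions of the paper is not stated in the abstract, so treat the formula choice and the function name as assumptions.

```python
from fractions import Fraction
from math import comb, floor


def recover_cdf(moments, alpha, x):
    """Approximate F(x) on [0, 1] from moments[j] = E[X**j], j = 0..alpha.

    Sums the expected binomial weights for k = 0..floor(alpha * x), i.e. the
    expected value of P(Binomial(alpha, X) <= alpha * x). Exact rational
    arithmetic avoids cancellation in the alternating sum.
    """
    total = Fraction(0)
    for k in range(floor(alpha * x) + 1):
        for j in range(k, alpha + 1):
            total += comb(alpha, j) * comb(j, k) * (-1) ** (j - k) * moments[j]
    return total
```

For the uniform distribution, with `moments = [Fraction(1, j + 1) for j in range(alpha + 1)]`, every inner sum equals $1/(\alpha+1)$, so the approximation at $x$ is $(\lfloor \alpha x \rfloor + 1)/(\alpha + 1)$, which tends to the true value $x$ as $\alpha$ grows.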
Bringfried Stecklum
The Wide-field Infrared Survey Explorer (WISE, Wright et al. 2010) and its
follow-up Near-Earth Object (NEO) mission (NEOWISE, Mainzer et al. 2011) scan
the mid-infrared sky twice a year. The spatial and temporal coverage of the
resulting database is of utmost importance for variability studies, in
particular of young stellar objects (YSOs) which have red $W1{-}W2$ colors.
During such an effort, I noticed subarcsecond position offsets between
subsequent visits. The offsets do not appear for targets with small $W1{-}W2$
colors, which points to a chromatic origin in the optics, caused by the
spacecraft pointing alternating ``forward'' and ``backward'' from one visit to
another. It amounts to 0\farcs1 for targets with $W1{-}W2\approx2$.
Consideration of this chromatic offset will improve astrometry. This is of
particular importance for NEOs that are generally red.
Authors' comments: 3 pages, 1 figure, submitted to RNAAS
Huan Chen, Wangcai Zhao, Tingfa Xu, Shiyun Zhou, Peifu Liu, Jianan Li
Coded Aperture Snapshot Spectral Imaging (CASSI) reconstruction aims to
recover the 3D spatial-spectral signal from 2D measurement. Existing methods
for reconstructing Hyperspectral Image (HSI) typically involve learning
mappings from a 2D compressed image to a predetermined set of discrete spectral
bands. However, this approach overlooks the inherent continuity of the spectral
information. In this study, we propose an innovative method called
Spectral-wise Implicit Neural Representation (SINR) as a pioneering step toward
addressing this limitation. SINR introduces a continuous spectral amplification
process for HSI reconstruction, enabling spectral super-resolution with
customizable magnification factors. To achieve this, we leverage the concept of
implicit neural representation. Specifically, our approach introduces a
spectral-wise attention mechanism that treats individual channels as distinct
tokens, thereby capturing global spectral dependencies. Additionally, our
approach incorporates two components, namely a Fourier coordinate encoder and a
spectral scale factor module. The Fourier coordinate encoder enhances the
SINR's ability to emphasize high-frequency components, while the spectral scale
factor module guides the SINR to adapt to the variable number of spectral
channels. Notably, the SINR framework enhances the flexibility of CASSI
reconstruction by accommodating an unlimited number of spectral bands in the
desired output. Extensive experiments demonstrate that our SINR outperforms
baseline methods. By enabling continuous reconstruction within the CASSI
framework, we take the initial stride toward integrating implicit neural
representation into the field.
Authors' comments: Accepted by IEEE Transactions on Circuits and Systems for Video
Technology, has been published
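The Fourier coordinate encoder mentioned in the abstract above is not specified there. A common form of such an encoding, shown here only as an assumed illustration, maps a scalar spectral coordinate to sines and cosines at doubling frequencies. The function names, the number of frequency bands, and the normalization of coordinates to $[0,1]$ are hypothetical.

```python
import math


def fourier_encode(coord, num_bands=4):
    """Encode a scalar coordinate as [sin, cos] pairs at frequencies 2**k * pi."""
    features = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        features.append(math.sin(freq * coord))
        features.append(math.cos(freq * coord))
    return features


def spectral_coordinates(num_channels):
    """Evenly spaced coordinates in [0, 1] for a requested number of spectral bands."""
    if num_channels == 1:
        return [0.5]
    return [i / (num_channels - 1) for i in range(num_channels)]
```

Because the coordinates are continuous, asking for a different number of output bands only changes the list returned by `spectral_coordinates`, which matches the idea of customizable spectral magnification.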
Yefan Zhou, Tianyu Pang, Keqin Liu, Charles H. Martin, Michael W. Mahoney, Yaoqing Yang
Regularization in modern machine learning is crucial, and it can take various
forms in algorithmic design: training set, model family, error function,
regularization terms, and optimizations. In particular, the learning rate,
which can be interpreted as a temperature-like parameter within the statistical
mechanics of learning, plays a crucial role in neural network training. Indeed,
many widely adopted training strategies basically just define the decay of the
learning rate over time. This process can be interpreted as decreasing a
temperature, using either a global learning rate (for the entire model) or a
learning rate that varies for each parameter. This paper proposes TempBalance,
a straightforward yet effective layer-wise learning rate method. TempBalance is
based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which
characterizes the implicit self-regularization of different layers in trained
models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide
the scheduling and balancing of temperature across all network layers during
model training, resulting in improved performance during testing. We implement
TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using
ResNets, VGGs, and WideResNets with various depths and widths. Our results show
that TempBalance significantly outperforms ordinary SGD and carefully-tuned
spectral norm regularization. We also show that TempBalance outperforms a
number of state-of-the-art optimizers and learning rate schedulers.
Authors' comments: NeurIPS 2023 Spotlight, first two authors contributed equally
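A layer-wise learning rate driven by a heavy-tail metric could be sketched as below. The abstract names neither the metric nor the mapping, so everything here is an assumption: a Hill-type estimate of the power-law exponent of each layer's eigenvalue spectrum, and a linear map that gives layers with a larger exponent a larger rate, scaled between `s1` and `s2` times the base rate.

```python
import math


def hill_alpha(eigenvalues, k):
    """Hill-type estimate of a power-law exponent from the k largest eigenvalues.

    Assumes positive eigenvalues, 0 < k < len(eigenvalues), and that at least
    one of the k largest values exceeds the reference value just below them.
    """
    lam = sorted(eigenvalues)
    n = len(lam)
    ref = lam[n - k - 1]
    return 1.0 + k / sum(math.log(v / ref) for v in lam[n - k:])


def layerwise_learning_rates(alphas, base_lr, s1=0.5, s2=1.5):
    """Linearly map per-layer exponents to rates in [s1 * base_lr, s2 * base_lr]."""
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        return [base_lr for _ in alphas]
    return [base_lr * (s1 + (s2 - s1) * (a - lo) / (hi - lo)) for a in alphas]
```

The eigenvalues would come from each layer's weight correlation matrix and be recomputed periodically during training, so the rates rebalance as the layers evolve.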
Shpresim Sadiku, Moritz Wagner, Sebastian Pokutta
Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, often regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. We address this by presenting a two-phase algorithm that generates group-wise sparse attacks within semantically meaningful areas of an image. Initially, we optimize a quasinorm adversarial loss using the $1/2$-quasinorm proximal operator tailored for non-convex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2$-norm regularization applied to perturbation magnitudes. Rigorous evaluations on CIFAR-10 and ImageNet datasets demonstrate a remarkable increase in group-wise sparsity, e.g., $50.9\%$ on CIFAR-10 and $38.4\%$ on ImageNet (average case, targeted attack). This performance improvement is accompanied by significantly faster computation times, improved explainability, and a $100\%$ attack success rate.
Yixuan Luo, Mengye Ren, Sai Qian Zhang
Like masked language modeling (MLM) in natural language processing, masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN). Compared with other training paradigms like supervised learning and unsupervised contrastive learning, MIM pretraining typically demands significant computational resources in order to manage large training data batches (e.g., 4096). The significant memory and computation requirements pose a considerable challenge to its broad adoption. To mitigate this, we introduce a novel learning framework, termed Block-Wise Masked Image Modeling (BIM). This framework involves decomposing the MIM tasks into several sub-tasks with independent computation patterns, resulting in block-wise back-propagation operations instead of the traditional end-to-end approach. Our proposed BIM maintains superior performance compared to conventional MIM while greatly reducing peak memory consumption. Moreover, BIM naturally enables the concurrent training of numerous DNN backbones of varying depths. This leads to the creation of multiple trained DNN backbones, each tailored to different hardware platforms with distinct computing capabilities. This approach significantly reduces computational costs in comparison with training each DNN backbone individually. Our framework offers a promising solution for resource-constrained training of MIM.