Xiaoyun Li, Ping Li
Minwise hashing (MinHash) is a standard algorithm widely used in the industry, for large-scale search and learning applications with the binary (0/1) Jaccard similarity. One common use of MinHash is for processing massive n-gram text representations so that practitioners do not have to materialize the original data (which would be prohibitive). Another popular use of MinHash is for building hash tables to enable sub-linear time approximate near neighbor (ANN) search. MinHash has also been used as a tool for building large-scale machine learning systems. The standard implementation of MinHash requires applying $K$ random permutations. In comparison, the method of one permutation hashing (OPH), is an efficient alternative of MinHash which splits the data vectors into $K$ bins and generates hash values within each bin. OPH is substantially more efficient and also more convenient to use. In this paper, we combine the differential privacy (DP) with OPH (as well as MinHash), to propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH. A detailed roadmap to the algorithm design is presented along with the privacy analysis. An analytical comparison of our proposed DP-OPH methods with the DP minwise hashing (DP-MH) is provided to justify the advantage of DP-OPH. Experiments on similarity search confirm the merits of DP-OPH, and guide the choice of the proper variant in different practical scenarios. Our technique is also extended to bin-wise consistent weighted sampling (BCWS) to develop a new DP algorithm called DP-BCWS for non-binary data. Experiments on classification tasks demonstrate that DP-BCWS is able to achieve excellent utility at around $\epsilon = 5\sim 10$, where $\epsilon$ is the standard parameter in the language of $(\epsilon, \delta)$-DP.
Donghao Li, Ruiquan Huang, Cong Shen, Jing Yang
This paper investigates conservative exploration in reinforcement learning
where the performance of the learning agent is guaranteed to be above a certain
threshold throughout the learning process. It focuses on the tabular episodic
Markov Decision Process (MDP) setting that has finite states and actions. With
the knowledge of an existing safe baseline policy, an algorithm termed as
StepMix is proposed to balance the exploitation and exploration while ensuring
that the conservative constraint is never violated in each episode with high
probability. StepMix features a unique design of a mixture policy that
adaptively and smoothly interpolates between the baseline policy and the
optimistic policy. Theoretical analysis shows that StepMix achieves
near-optimal regret order as in the constraint-free setting, indicating that
obeying the stringent episode-wise conservative constraint does not compromise
the learning performance. Besides, a randomization-based EpsMix algorithm is
also proposed and shown to achieve the same performance as StepMix. The
algorithm design and theoretical analysis are further extended to the setting
where the baseline policy is not given a priori but must be learned from an
offline dataset, and it is proved that similar conservative guarantee and
regret can be achieved if the offline dataset is sufficiently large. Experiment
results corroborate the theoretical analysis and demonstrate the effectiveness
of the proposed conservative exploration strategies.
Authors' comments: Accepted by ICML2023
Yan Sun, Li Shen, Dacheng Tao
Federated learning (FL) is a distributed paradigm that coordinates massive
local clients to collaboratively train a global model via stage-wise local
training processes on the heterogeneous dataset. Previous works have implicitly
studied that FL suffers from the ``client-drift'' problem, which is caused by
the inconsistent optimum across local clients. However, till now it still lacks
solid theoretical analysis to explain the impact of this local inconsistency.
To alleviate the negative impact of the ``client drift'' and explore its
substance in FL, in this paper, we first design an efficient FL algorithm
\textit{FedInit}, which allows employing the personalized relaxed
initialization state at the beginning of each local training stage.
Specifically, \textit{FedInit} initializes the local state by moving away from
the current global state towards the reverse direction of the latest local
state. This relaxed initialization helps to revise the local divergence and
enhance the local consistency level. Moreover, to further understand how
inconsistency disrupts performance in FL, we introduce the excess risk analysis
and study the divergence term to investigate the test error of the proposed
\textit{FedInit} method. Our studies show that optimization error is not
sensitive to this local inconsistency, while it mainly affects the
generalization error bound in \textit{FedInit}. Extensive experiments are
conducted to validate this conclusion. Our proposed \textit{FedInit} could
achieve state-of-the-art~(SOTA) results compared to several advanced benchmarks
without any additional costs. Meanwhile, stage-wise relaxed initialization
could also be incorporated into the current advanced algorithms to achieve
higher performance in the FL paradigm.
Authors' comments: 32 pages
Jiachen Yao, Chang Su, Zhongkai Hao, Songming Liu, Hang Su, Jun Zhu
Physics-informed Neural Networks (PINNs) have recently achieved remarkable progress in solving Partial Differential Equations (PDEs) in various fields by minimizing a weighted sum of PDE loss and boundary loss. However, there are several critical challenges in the training of PINNs, including the lack of theoretical frameworks and the imbalance between PDE loss and boundary loss. In this paper, we present an analysis of second-order non-homogeneous PDEs, which are classified into three categories and applicable to various common problems. We also characterize the connections between the training loss and actual error, guaranteeing convergence under mild conditions. The theoretical analysis inspires us to further propose MultiAdam, a scale-invariant optimizer that leverages gradient momentum to parameter-wisely balance the loss terms. Extensive experiment results on multiple problems from different physical domains demonstrate that our MultiAdam solver can improve the predictive accuracy by 1-2 orders of magnitude compared with strong baselines.
Christine W. Bang, Vanessa Didelez
Equivalence classes of DAGs (represented by CPDAGs) may be too large to
provide useful causal information. Here, we address incorporating tiered
background knowledge yielding restricted equivalence classes represented by
'tiered MPDAGs'. Tiered knowledge leads to considerable gains in
informativeness and computational efficiency: We show that construction of
tiered MPDAGs only requires application of Meek's 1st rule, and that tiered
MPDAGs (unlike general MPDAGs) are chain graphs with chordal components. This
entails simplifications e.g. of determining valid adjustment sets for causal
effect estimation. Further, we characterise when one tiered ordering is more
informative than another, providing insights into useful aspects of background
knowledge.
Authors' comments: Accepted for the 39th Conference on Uncertainty in Artificial
Intelligence (UAI 2023)
Bernardo Carvalho, Elias Rego
We discuss different regularities on stable/unstable holonomies of
cw-hyperbolic homeomorphisms and prove that if a cw-hyperbolic homeomorphism
has continuous joint stable/unstable holonomies, then it has a dense set of
periodic points in its non-wandering set. For that, we prove that the
hyperbolic cw-metric (introduced in [9]) can be adapted to be self-similar (as
in [6]) and, in this case, continuous joint stable/unstable holonomies are
pseudo-isometric. We also prove transitivity of cw-hyperbolic homeomorphisms
assuming that the stable/unstable holonomies are isometric. In the case the
ambient space is a surface, we prove that a cw$_F$-hyperbolic homeomorphism has
continuous joint stable/unstable holonomies when every bi-asymptotic sector is
regular.
Authors' comments: 27 pages, 9 figures
Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee
Post-training quantization (PTQ) has been gaining popularity for the
deployment of deep neural networks on resource-limited devices since unlike
quantization-aware training, neither a full training dataset nor end-to-end
training is required at all. As PTQ schemes based on reconstructing each layer
or block output turn out to be effective to enhance quantized model
performance, recent works have developed algorithms to devise and learn a new
weight-rounding scheme so as to better reconstruct each layer or block output.
In this work, we propose a simple yet effective new weight-rounding mechanism
for PTQ, coined \emph{FlexRound}, based on element-wise division instead of
typical element-wise addition such that FlexRound enables jointly learning a
common quantization grid size as well as a different scale for each pre-trained
weight. Thanks to the reciprocal rule of derivatives induced by element-wise
division, FlexRound is inherently able to exploit pre-trained weights when
updating their corresponding scales, and thus, flexibly quantize pre-trained
weights depending on their magnitudes. We empirically validate the efficacy of
FlexRound on a wide range of models and tasks. To the best of our knowledge,
our work is the first to carry out comprehensive experiments on not only image
classification and natural language understanding but also natural language
generation. Moreover, we demonstrate, for the first time, that large language
models can be efficiently quantized, with only a negligible impact on
performance compared to half-precision baselines, achieved by reconstructing
the output in a block-by-block manner. Our code is available at
\url{https://github.com/onliwad101/FlexRound_LRQ}.
Authors' comments: Accepted to ICML 2023
Kangjun Liu, Ke Chen, Lihua Guo, Yaowei Wang, Kui Jia
Mixup style data augmentation algorithms have been widely adopted in various tasks as implicit network regularization on representation learning to improve model generalization, which can be achieved by a linear interpolation of labeled samples in input or feature space as well as target space. Inspired by good robustness of alternative dropout strategies against over-fitting on limited patterns of training samples, this paper introduces a novel concept of ShuffleMix -- Shuffle of Mixed hidden features, which can be interpreted as a kind of dropout operation in feature space. Specifically, our ShuffleMix method favors a simple linear shuffle of randomly selected feature channels for feature mixup in-between training samples to leverage semantic interpolated supervision signals, which can be extended to a generalized shuffle operation via additionally combining linear interpolations of intra-channel features. Compared to its direct competitor of feature augmentation -- the Manifold Mixup, the proposed ShuffleMix can gain superior generalization, owing to imposing more flexible and smooth constraints on generating samples and achieving regularization effects of channel-wise feature dropout. Experimental results on several public benchmarking datasets of single-label and multi-label visual classification tasks can confirm the effectiveness of our method on consistently improving representations over the state-of-the-art mixup augmentation.
Ana Carolina da Cruz, Camila P. E. de Souza
Change-point models deal with ordered data sequences. Their primary goal is
to infer the locations where an aspect of the data sequence changes. In this
paper, we propose and implement a nonparametric Bayesian model for clustering
observations based on their constant-wise change-point profiles via Gibbs
sampler. Our model incorporates a Dirichlet Process on the constant-wise
change-point structures to cluster observations while simultaneously performing
multiple change-point estimation. Additionally, our approach controls the
number of clusters in the model, not requiring the specification of the number
of clusters a priori. Satisfactory clustering and estimation results were
obtained when evaluating our method under various simulated scenarios and on a
real dataset from single-cell genomic sequencing. Our proposed methodology is
implemented as an R package called BayesCPclust and is available from the
Comprehensive R Archive Network at
https://CRAN.R-project.org/package=BayesCPclust.
Authors' comments: 30 pages, 12 figures
Minwoo Jung, Sangwoo Jung, Ayoung Kim
In recent years, multiple Light Detection and Ranging (LiDAR) systems have
grown in popularity due to their enhanced accuracy and stability from the
increased field of view (FOV). However, integrating multiple LiDARs can be
challenging, attributable to temporal and spatial discrepancies. Common
practice is to transform points among sensors while requiring strict time
synchronization or approximating transformation among sensor frames. Unlike
existing methods, we elaborate the inter-sensor transformation using
continuous-time (CT) inertial measurement unit (IMU) modeling and derive
associated ambiguity as a point-wise uncertainty. This uncertainty, modeled by
combining the state covariance with the acquisition time and point range,
allows us to alleviate the strict time synchronization and to overcome FOV
difference. The proposed method has been validated on both public and our
datasets and is compatible with various LiDAR manufacturers and scanning
patterns. We open-source the code for public access at
https://github.com/minwoo0611/MA-LIO.
Authors' comments: 9 pages, 10 figures
Achraf Bahamou, Donald Goldfarb
We propose a new per-layer adaptive step-size procedure for stochastic
first-order optimization methods for minimizing empirical loss functions in
deep learning, eliminating the need for the user to tune the learning rate
(LR). The proposed approach exploits the layer-wise stochastic curvature
information contained in the diagonal blocks of the Hessian in deep neural
networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer. The
method has memory requirements that are comparable to those of first-order
methods, while its per-iteration time complexity is only increased by an amount
that is roughly equivalent to an additional gradient computation. Numerical
experiments show that SGD with momentum and AdamW combined with the proposed
per-layer step-sizes are able to choose effective LR schedules and outperform
fine-tuned LR versions of these methods as well as popular first-order and
second-order algorithms for training DNNs on Autoencoder, Convolutional Neural
Network (CNN) and Graph Convolutional Network (GCN) models. Finally, it is
proved that an idealized version of SGD with the layer-wise step sizes
converges linearly when using full-batch gradients.
Authors' comments: requires revision
Na Dong, Yongqiang Zhang, Mingli Ding, Gim Hee Lee
Real-world data tends to follow a long-tailed distribution, where the class
imbalance results in dominance of the head classes during training. In this
paper, we propose a frustratingly simple but effective step-wise learning
framework to gradually enhance the capability of the model in detecting all
categories of long-tailed datasets. Specifically, we build smooth-tail data
where the long-tailed distribution of categories decays smoothly to correct the
bias towards head classes. We pre-train a model on the whole long-tailed data
to preserve discriminability between all categories. We then fine-tune the
class-agnostic modules of the pre-trained model on the head class dominant
replay data to get a head class expert model with improved decision boundaries
from all categories. Finally, we train a unified model on the tail class
dominant replay data while transferring knowledge from the head class expert
model to ensure accurate detection of all categories. Extensive experiments on
long-tailed datasets LVIS v0.5 and LVIS v1.0 demonstrate the superior
performance of our method, where we can improve the AP with ResNet-50 backbone
from 27.0% to 30.3% AP, and especially for the rare categories from 15.5% to
24.9% AP. Our best model using ResNet-101 backbone can achieve 30.7% AP, which
suppresses all existing detectors using the same backbone.
Authors' comments: 10 pages, 5 figures
Peize Li, Ruining Deng, Yuankai Huo
Tissue examination and quantification in a 3D context on serial section whole
slide images (WSIs) were laborintensive and time-consuming tasks. Our previous
study proposed a novel registration-based method (Map3D) to automatically align
WSIs to the same physical space, reducing the human efforts of screening serial
sections from WSIs. However, the registration performance of our Map3D method
was only evaluated on single-stain WSIs with large-scale kidney tissue samples.
In this paper, we provide a Docker for an end-to-end 3D slide-wise registration
pipeline on needle biopsy serial sections in a multi-stain paradigm. The
contribution of this study is three-fold: (1) We release a containerized Docker
for an end-to-end multi-stain WSI registration. (2) We prove that the Map3D
pipeline is capable of sectional registration from multi-stain WSI. (3) We
verify that the Map3D pipeline can also be applied to needle biopsy tissue
samples. The source code and the Docker have been made publicly available at
https://github.com/hrlblab/Map3D.
Authors' comments: 6 pages, 4 figures
Ming Hu, Zhihao Yue, Xiaofei Xie, Cheng Chen, Yihao Huang, Xian Wei, Xiang Lian, Yang Liu et al.
Although Federated Learning (FL) enables global model training across clients
without compromising their raw data, due to the unevenly distributed data among
clients, existing Federated Averaging (FedAvg)-based methods suffer from the
problem of low inference performance. Specifically, different data
distributions among clients lead to various optimization directions of local
models. Aggregating local models usually results in a low-generalized global
model, which performs worse on most of the clients. To address the above issue,
inspired by the observation from a geometric perspective that a
well-generalized solution is located in a flat area rather than a sharp area,
we propose a novel and heuristic FL paradigm named FedMR (Federated Model
Recombination). The goal of FedMR is to guide the recombined models to be
trained towards a flat area. Unlike conventional FedAvg-based methods, in
FedMR, the cloud server recombines collected local models by shuffling each
layer of them to generate multiple recombined models for local training on
clients rather than an aggregated global model. Since the area of the flat area
is larger than the sharp area, when local models are located in different
areas, recombined models have a higher probability of locating in a flat area.
When all recombined models are located in the same flat area, they are
optimized towards the same direction. We theoretically analyze the convergence
of model recombination. Experimental results show that, compared with
state-of-the-art FL methods, FedMR can significantly improve the inference
accuracy without exposing the privacy of each client.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2208.07677
Guiyu Zhao, Bo Qiu, A-Li Luo, Xiaoyu Guo, Lin Yao, Kun Wang, Yuanbo Liu
The Wide-field Infrared Survey Explorer (WISE) has detected hundreds of millions of sources over the entire sky. However, classifying them reliably is a great challenge due to degeneracies in WISE multicolor space and low detection levels in its two longest-wavelength bandpasses. In this paper, the deep learning classification network, IICnet (Infrared Image Classification network), is designed to classify sources from WISE images to achieve a more accurate classification goal. IICnet shows good ability on the feature extraction of the WISE sources. Experiments demonstrates that the classification results of IICnet are superior to some other methods; it has obtained 96.2% accuracy for galaxies, 97.9% accuracy for quasars, and 96.4% accuracy for stars, and the Area Under Curve (AUC) of the IICnet classifier can reach more than 99%. In addition, the superiority of IICnet in processing infrared images has been demonstrated in the comparisons with VGG16, GoogleNet, ResNet34, MobileNet, EfficientNetV2, and RepVGG-fewer parameters and faster inference. The above proves that IICnet is an effective method to classify infrared sources.
Byung-Doh Oh, William Schuler
While there is much recent interest in studying why Transformer-based large
language models make predictions the way they do, the complex computations
performed within each layer have made their behavior somewhat opaque. To
mitigate this opacity, this work presents a linear decomposition of final
hidden states from autoregressive language models based on each initial input
token, which is exact for virtually all contemporary Transformer architectures.
This decomposition allows the definition of probability distributions that
ablate the contribution of specific input tokens, which can be used to analyze
their influence on model probabilities over a sequence of upcoming words with
only one forward pass from the model. Using the change in next-word probability
as a measure of importance, this work first examines which context words make
the biggest contribution to language model predictions. Regression experiments
suggest that Transformer-based language models rely primarily on collocational
associations, followed by linguistic factors such as syntactic dependencies and
coreference relationships in making next-word predictions. Additionally,
analyses using these measures to predict syntactic dependencies and coreferent
mention spans show that collocational association and repetitions of the same
token largely explain the language models' predictions on these tasks.
Authors' comments: ACL 2023
Qingpeng Zhao, Yuanyang Zhu, Zichuan Liu, Zhi Wang, Chunlin Chen
In cooperative multi-agent reinforcement learning (MARL), the environmental stochasticity and uncertainties will increase exponentially when the number of agents increases, which puts hard pressure on how to come up with a compact latent representation from partial observation for boosting value decomposition. To tackle these issues, we propose a simple yet powerful method that alleviates partial observability and efficiently promotes coordination by introducing the UNit-wise attentive State Representation (UNSR). In UNSR, each agent learns a compact and disentangled unit-wise state representation outputted from transformer blocks, and produces its local action-value function. The proposed UNSR is used to boost the value decomposition with a multi-head attention mechanism for producing efficient credit assignment in the mixing network, providing an efficient reasoning path between the individual value function and joint value function. Experimental results demonstrate that our method achieves superior performance and data efficiency compared to solid baselines on the StarCraft II micromanagement challenge. Additional ablation experiments also help identify the key factors contributing to the performance of UNSR.
Chengqing Li, Sheng Liu
Joint encryption and compression is an ideal solution for protecting security
and privacy of image data in a real scenario, e.g. storing them on an existing
cloud-based service like Facebook. Recently, some block-wise
encryption-then-compression (ETC) schemes compatible with JPEG were proposed to
provide a reasonably high level of security without compromising compression
ratio much. This paper investigates recovering the block-wise relationship in
an ETC scheme exerting on single-color blocks of size $8\times 8$ in the
scenarios of ciphertext-only attack, known-plaintext attack and
chosen-plaintext attack. Then, the attacking targets are extended to the other
conventional ETC schemes exerting on multiple color channels and blocks of
various sizes. Especially, an elaborate jigsaw puzzle solver is designed to
recover enough visual information from multiple cipher-images encrypted by the
same secret key. Moreover, the nice attacking performance was verified over two
social media platforms, Facebook and Weibo.
Authors' comments: 12 pages
Jinseok Bae, Jungdam Won, Donggeun Lim, Cheol-Hui Min, Young Min Kim
We present a method to animate a character incorporating multiple part-wise
motion priors (PMP). While previous works allow creating realistic articulated
motions from reference data, the range of motion is largely limited by the
available samples. Especially for the interaction-rich scenarios, it is
impractical to attempt acquiring every possible interacting motion, as the
combination of physical parameters increases exponentially. The proposed PMP
allows us to assemble multiple part skills to animate a character, creating a
diverse set of motions with different combinations of existing data. In our
pipeline, we can train an agent with a wide range of part-wise priors.
Therefore, each body part can obtain a kinematic insight of the style from the
motion captures, or at the same time extract dynamics-related information from
the additional part-specific simulation. For example, we can first train a
general interaction skill, e.g. grasping, only for the dexterous part, and then
combine the expert trajectories from the pre-trained agent with the kinematic
priors of other limbs. Eventually, our whole-body agent learns a novel physical
interaction skill even with the absence of the object trajectories in the
reference motion sequence.
Authors' comments: 13 pages, 11 figures
Hanyu Sun, Xiao Huang, Wei Ma
To provide real-time parking information, existing studies focus on predicting parking availability, which seems an indirect approach to saving drivers' cruising time. In this paper, we first time propose an on-street parking recommendation (OPR) task to directly recommend a parking space for a driver. To this end, a learn-to-rank (LTR) based OPR model called OPR-LTR is built. Specifically, parking recommendation is closely related to the "turnover events" (state switching between occupied and vacant) of each parking space, and hence we design a highly efficient heterogeneous graph called ESGraph to represent historical and real-time meters' turnover events as well as geographical relations; afterward, a convolution-based event-then-graph network is used to aggregate and update representations of the heterogeneous graph. A ranking model is further utilized to learn a score function that helps recommend a list of ranked parking spots for a specific on-street parking query. The method is verified using the on-street parking meter data in Hong Kong and San Francisco. By comparing with the other two types of methods: prediction-only and prediction-then-recommendation, the proposed direct-recommendation method achieves satisfactory performance in different metrics. Extensive experiments also demonstrate that the proposed ESGraph and the recommendation model are more efficient in terms of computational efficiency as well as saving drivers' on-street parking time.