Byung-Doh Oh, William Schuler
While there is much recent interest in studying why Transformer-based large
language models make predictions the way they do, the complex computations
performed within each layer have made their behavior somewhat opaque. To
mitigate this opacity, this work presents a linear decomposition of final
hidden states from autoregressive language models based on each initial input
token, which is exact for virtually all contemporary Transformer architectures.
This decomposition allows the definition of probability distributions that
ablate the contribution of specific input tokens, which can be used to analyze
their influence on model probabilities over a sequence of upcoming words with
only one forward pass from the model. Using the change in next-word probability
as a measure of importance, this work first examines which context words make
the biggest contribution to language model predictions. Regression experiments
suggest that Transformer-based language models rely primarily on collocational
associations, followed by linguistic factors such as syntactic dependencies and
coreference relationships in making next-word predictions. Additionally,
analyses using these measures to predict syntactic dependencies and coreferent
mention spans show that collocational association and repetitions of the same
token largely explain the language models' predictions on these tasks.
Authors' comments: ACL 2023
Qingpeng Zhao, Yuanyang Zhu, Zichuan Liu, Zhi Wang, Chunlin Chen
In cooperative multi-agent reinforcement learning (MARL), environmental stochasticity and uncertainty increase exponentially with the number of agents, which makes it difficult to learn a compact latent representation from partial observations for boosting value decomposition. To tackle these issues, we propose a simple yet powerful method that alleviates partial observability and efficiently promotes coordination by introducing the UNit-wise attentive State Representation (UNSR). In UNSR, each agent learns a compact and disentangled unit-wise state representation output by transformer blocks, and produces its local action-value function. The proposed UNSR is used to boost the value decomposition with a multi-head attention mechanism for producing efficient credit assignment in the mixing network, providing an efficient reasoning path between the individual value functions and the joint value function. Experimental results demonstrate that our method achieves superior performance and data efficiency compared to solid baselines on the StarCraft II micromanagement challenge. Additional ablation experiments also help identify the key factors contributing to the performance of UNSR.
Chengqing Li, Sheng Liu
Joint encryption and compression is an ideal solution for protecting security
and privacy of image data in a real scenario, e.g. storing them on an existing
cloud-based service like Facebook. Recently, some block-wise
encryption-then-compression (ETC) schemes compatible with JPEG were proposed to
provide a reasonably high level of security without compromising compression
ratio much. This paper investigates recovering the block-wise relationship in
an ETC scheme operating on single-color blocks of size $8\times 8$ in the
scenarios of ciphertext-only attack, known-plaintext attack and
chosen-plaintext attack. Then, the attack targets are extended to other
conventional ETC schemes operating on multiple color channels and blocks of
various sizes. In particular, an elaborate jigsaw puzzle solver is designed to
recover enough visual information from multiple cipher-images encrypted by the
same secret key. Moreover, the attack performance was verified on two
social media platforms, Facebook and Weibo.
Authors' comments: 12 pages
Jinseok Bae, Jungdam Won, Donggeun Lim, Cheol-Hui Min, Young Min Kim
We present a method to animate a character incorporating multiple part-wise
motion priors (PMP). While previous works allow creating realistic articulated
motions from reference data, the range of motion is largely limited by the
available samples. Especially for the interaction-rich scenarios, it is
impractical to attempt acquiring every possible interacting motion, as the
combination of physical parameters increases exponentially. The proposed PMP
allows us to assemble multiple part skills to animate a character, creating a
diverse set of motions with different combinations of existing data. In our
pipeline, we can train an agent with a wide range of part-wise priors.
Therefore, each body part can obtain a kinematic insight of the style from the
motion captures, or at the same time extract dynamics-related information from
the additional part-specific simulation. For example, we can first train a
general interaction skill, e.g. grasping, only for the dexterous part, and then
combine the expert trajectories from the pre-trained agent with the kinematic
priors of other limbs. Eventually, our whole-body agent learns a novel physical
interaction skill even in the absence of object trajectories in the
reference motion sequence.
Authors' comments: 13 pages, 11 figures
Hanyu Sun, Xiao Huang, Wei Ma
To provide real-time parking information, existing studies focus on predicting parking availability, which seems an indirect approach to saving drivers' cruising time. In this paper, we propose, for the first time, an on-street parking recommendation (OPR) task to directly recommend a parking space for a driver. To this end, a learn-to-rank (LTR) based OPR model called OPR-LTR is built. Specifically, parking recommendation is closely related to the "turnover events" (state switching between occupied and vacant) of each parking space, and hence we design a highly efficient heterogeneous graph called ESGraph to represent historical and real-time meters' turnover events as well as geographical relations; afterward, a convolution-based event-then-graph network is used to aggregate and update representations of the heterogeneous graph. A ranking model is further utilized to learn a score function that helps recommend a list of ranked parking spots for a specific on-street parking query. The method is verified using on-street parking meter data from Hong Kong and San Francisco. Compared with two other types of methods, prediction-only and prediction-then-recommendation, the proposed direct-recommendation method achieves satisfactory performance in different metrics. Extensive experiments also demonstrate that the proposed ESGraph and the recommendation model are more efficient both computationally and in saving drivers' on-street parking time.
Jingquan Luo, Qisheng Wang, Lvzhou Li
We explore potential quantum speedups for the fundamental problem of testing
the properties of closeness and $k$-wise uniformity of probability
distributions.
\textit{Closeness testing} is the problem of distinguishing whether two
$n$-dimensional distributions are identical or at least $\varepsilon$-far in
$\ell^1$- or $\ell^2$-distance. We show that the quantum query complexities for
$\ell^1$- and $\ell^2$-closeness testing are $O(\sqrt{n}/\varepsilon)$ and
$O(1/\varepsilon)$, respectively, both of which achieve optimal dependence
on $\varepsilon$, improving the prior best results of
Gilyén and Li (2019).
\textit{$k$-wise uniformity testing} is the problem of distinguishing whether
a distribution over $\{0, 1\}^n$ is uniform when restricted to any $k$
coordinates or $\varepsilon$-far from any such distributions. We propose the
first quantum algorithm for this problem with query complexity
$O(\sqrt{n^k}/\varepsilon)$, achieving a quadratic speedup over the
state-of-the-art classical algorithm with sample complexity
$O(n^k/\varepsilon^2)$ by O'Donnell and
Zhao (2018). Moreover, when $k = 2$ our quantum algorithm outperforms any
classical one because of the classical lower bound
$\Omega(n/\varepsilon^2)$.
All our quantum algorithms are fairly simple and time-efficient, using only
basic quantum subroutines such as amplitude estimation.
Authors' comments: We have added the proof of lower bounds and have polished the
language
Remi Luschei, Werner Brannath
The population-wise error rate (PWER) is a type I error rate for clinical
trials with multiple target populations. In such trials, one treatment is
tested for its efficacy in each population. The PWER is defined as the
probability that a randomly selected, future patient will be exposed to an
inefficient treatment based on the study results. The PWER can be understood
and computed as an average of strata specific family-wise error rates and
involves the prevalences of these strata. A major issue of this concept is that
the population prevalences needed to determine this average are usually not
known in practice, so that the PWER cannot be directly controlled. Instead, one
could use an estimator of the prevalences based on the given sample, like their
maximum-likelihood estimator. In this paper we show in simulations that this
does not substantially inflate the true PWER. We differentiate between the
expected PWER, which is almost perfectly controlled, and study-specific values
of the PWER which are conditioned on given sample sizes and vary within a
narrow range. Thereby, we consider up to eight different overlapping patient
populations and moderate to large sample sizes.
Authors' comments: 10 pages, 5 figures
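The averaging described in the PWER abstract above can be sketched as follows. This is a minimal illustration, assuming disjoint strata with prevalences pi_j and strata-specific family-wise error rates FWER_j, so that PWER = sum_j pi_j * FWER_j. The function names and all numbers are hypothetical, and the prevalence estimate used here is the relative stratum frequency in the sample (the multinomial maximum-likelihood estimator).

```python
def estimate_prevalences(stratum_counts):
    """Relative frequencies of the strata in the sample (multinomial MLE)."""
    total = sum(stratum_counts)
    if total <= 0:
        raise ValueError("sample must contain at least one patient")
    return [count / total for count in stratum_counts]


def pwer(prevalences, strata_fwers):
    """Prevalence-weighted average of strata-specific family-wise error rates."""
    if len(prevalences) != len(strata_fwers):
        raise ValueError("need exactly one FWER per stratum")
    return sum(p * f for p, f in zip(prevalences, strata_fwers))


# Hypothetical example with three disjoint strata (illustrative numbers only).
counts = [120, 60, 20]          # patients observed per stratum
fwers = [0.025, 0.040, 0.050]   # assumed strata-specific FWERs
pi_hat = estimate_prevalences(counts)
print(pi_hat, pwer(pi_hat, fwers))
```

Because the estimated prevalences are plugged into the weighted sum, any sampling error in them carries over to the computed PWER, which is the effect the abstract studies in simulations.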
Kyu Beom Han, Olivia G. Odenthal, Woo Jae Kim, Sung-Eui Yoon
Auxiliary features such as geometric buffers (G-buffers) and path descriptors
(P-buffers) have been shown to significantly improve Monte Carlo (MC)
denoising. However, recent approaches implicitly learn to exploit auxiliary
features for denoising, which could lead to insufficient utilization of each
type of auxiliary features. To overcome such an issue, we propose a denoising
framework that relies on an explicit pixel-wise guidance for utilizing
auxiliary features. First, we train two denoisers, each trained by a different
auxiliary feature (i.e., G-buffers or P-buffers). Then we design our ensembling
network to obtain per-pixel ensembling weight maps, which represent pixel-wise
guidance for which auxiliary feature should be dominant at reconstructing each
individual pixel and use them to ensemble the two denoised results of our
denoisers. We also propagate our pixel-wise guidance to the denoisers by
jointly training the denoisers and the ensembling network, further guiding the
denoisers to focus on regions where G-buffers or P-buffers are relatively
important for denoising. Our results show considerable improvement in
denoising performance compared to the baseline denoising model using both
G-buffers and P-buffers.
Authors' comments: 19 pages
Pu Li, Marie Roch, Holger Klinck, Erica Fleishman, Douglas Gillespie, Eva-Marie Nosal, Yu Shiu, Xiaobai Liu
Whistle contour extraction aims to derive animal whistles from time-frequency
spectrograms as polylines. For toothed whales, whistle extraction results can
serve as the basis for analyzing animal abundance, species identity, and social
activities. During the last few decades, as long-term recording systems have
become affordable, automated whistle extraction algorithms have been proposed to
process large volumes of recording data. Recently, a deep learning-based method
demonstrated superior performance in extracting whistles under varying noise
conditions. However, training such networks requires a large amount of
labor-intensive annotation, which is not available for many species. To
overcome this limitation, we present a framework of stage-wise generative
adversarial networks (GANs), which compile new whistle data suitable for deep
model training via three stages: generation of background noise in the
spectrogram, generation of whistle contours, and generation of whistle signals.
By separating the generation of different components in the samples, our
framework composes visually promising whistle data and labels even when few
expert annotated data are available. Regardless of the amount of
human-annotated data, the proposed data augmentation framework leads to a
consistent improvement in performance of the whistle extraction model, with a
maximum increase of 1.69 in the whistle extraction mean F1-score. Our
stage-wise GAN also surpasses one single GAN in improving whistle extraction
models with augmented data. The data and code will be available at
https://github.com/Paul-LiPu/CompositeGAN_WhistleAugment.
Authors' comments: Accepted by IEEE Transactions on Multimedia (2023)
Per Calissendorff, Matthew De Furio, Michael Meyer, Loïc Albert, Christian Aganze, Mohamad Ali-Dib, Daniella C. Bardalez Gagliuffi, Frederique Baron et al.
We report the discovery of the first brown dwarf binary system with a Y dwarf
primary, WISE J033605.05$-$014350.4, observed with NIRCam on JWST with the
F150W and F480M filters. We employed an empirical point spread function binary
model to identify the companion, located at a projected separation of 84
milliarcseconds, position angle of 295 degrees, and with contrast of 2.8 and
1.8 magnitudes in F150W and F480M, respectively. At a distance of 10$\,$pc
based on its Spitzer parallax, and assuming a random inclination distribution,
the physical separation is approximately 1$\,$au. Evolutionary models predict
that, for an age of 1-5 Gyr, the companion mass is about 4-12.5 Jupiter masses
around the 7.5-20 Jupiter mass primary, corresponding to a companion-to-host
mass fraction of $q=0.61\pm0.05$. Under the assumption of a Keplerian orbit the
period for this extreme binary is in the range of 5-9 years. The system joins a
small but growing sample of ultracool dwarf binaries with effective
temperatures of a few hundreds of Kelvin. Brown dwarf binaries lie at the nexus
of importance for understanding the formation mechanisms of these elusive
objects, as they allow us to investigate whether the companions formed as stars
or as planets in a disk around the primary.
Authors' comments: 8 pages, 3 figures, 1 table. Accepted for publication in
Astrophysical Journal Letters
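The arithmetic behind the separation and period quoted in the brown dwarf binary abstract above can be checked with a short sketch. It uses two standard relations: 1 arcsecond at 1 pc subtends 1 au, and Kepler's third law in solar units, P^2 = a^3 / M_total. The function names are hypothetical, the Jupiter-to-solar mass ratio is approximate, and pairing the lowest (or highest) companion mass with the lowest (or highest) primary mass is an assumption made only for illustration.

```python
import math

M_JUP_IN_M_SUN = 1.0 / 1047.35  # approximate Jupiter-to-solar mass ratio


def projected_separation_au(sep_mas, distance_pc):
    """Small-angle relation: 1 arcsec at 1 pc subtends 1 au."""
    return (sep_mas / 1000.0) * distance_pc


def kepler_period_yr(a_au, total_mass_mjup):
    """Kepler's third law in solar units: P^2 = a^3 / M_total."""
    total_mass_msun = total_mass_mjup * M_JUP_IN_M_SUN
    return math.sqrt(a_au ** 3 / total_mass_msun)


# Projected separation from the quoted 84 mas at 10 pc.
print(projected_separation_au(84.0, 10.0))

# Periods for a = 1 au with the assumed lowest and highest total masses.
print(kepler_period_yr(1.0, 4.0 + 7.5), kepler_period_yr(1.0, 12.5 + 20.0))
```

With a semi-major axis of 1 au this gives periods of several years, the same order as the range quoted in the abstract; the exact range depends on the orbital and mass assumptions the authors used.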
Midia Reshadi, David Gregg
Sparse tensor computing is a core computational part of numerous applications in areas such as data science, graph processing, and scientific computing. Sparse tensors offer the potential of skipping unnecessary computations caused by zero values. In this paper, we propose a new strategy for extending row-wise product sparse tensor accelerators. We propose a new processing element called Maple that uses multiple multiply-accumulate (MAC) units to exploit local clusters of non-zero values to increase parallelism and reduce data movement. Maple works on the compressed sparse row (CSR) format and calculates only non-zero elements of the input matrices based on the sparsity pattern. Furthermore, we may employ Maple as a basic building block in a variety of spatial tensor accelerators that operate based on a row-wise product approach. As a proof of concept, we utilize Maple in two reference accelerators: Extensor and Matraptor. Our experiments show that using Maple in Matraptor and Extensor achieves 50% and 60% energy benefit and 15% and 22% speedup over the baseline designs, respectively. Employing Maple also results in 5.9x and 15.5x smaller area consumption in Matraptor and Extensor compared with the baseline structures, respectively.
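The row-wise product dataflow on CSR inputs mentioned in the abstract above can be illustrated in software. The sketch below is the classic row-wise (Gustavson-style) sparse matrix multiplication, not the Maple processing element itself; the function name and the tuple layout `(indptr, indices, data)` are assumptions chosen for illustration.

```python
def spgemm_rowwise(a, b):
    """Multiply two CSR matrices row by row.

    Each matrix is a tuple (indptr, indices, data). For every non-zero
    A[i, k], row k of B is scaled by A[i, k] and merged into output row i,
    so zero entries are never touched.
    """
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # sparse accumulator for output row i
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


# A = [[1, 0], [0, 2]] and B = [[0, 3], [4, 0]] in CSR form.
A = ([0, 1, 2], [0, 1], [1.0, 2.0])
B = ([0, 1, 2], [1, 0], [3.0, 4.0])
print(spgemm_rowwise(A, B))
```

The inner loop over row k of B is where independent multiply-accumulate operations arise, which is the kind of locality that multiple MAC units could exploit.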
Ziwei Liu, Yongtao Wang, Xiaojie Chu
Knowledge distillation is a popular technique for transferring the knowledge
from a large teacher model to a smaller student model by mimicking. However,
distillation by directly aligning the feature maps between teacher and student
may enforce overly strict constraints on the student and thus degrade the
performance of the student model. To alleviate the above feature misalignment
issue, existing works mainly focus on spatially aligning the feature maps of
the teacher and the student, with pixel-wise transformation. In this paper, we
find that aligning the feature maps between teacher and student along the
channel-wise dimension is also effective for addressing the feature
misalignment issue. Specifically, we propose a learnable nonlinear channel-wise
transformation to align the features of the student and the teacher model.
Based on it, we further propose a simple and generic framework for feature
distillation, with only one hyper-parameter to balance the distillation loss
and the task specific loss. Extensive experimental results show that our method
achieves significant performance improvements in various computer vision tasks
including image classification (+3.28% top-1 accuracy for MobileNetV1 on
ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster-RCNN
on MS COCO), instance segmentation (+2.8% Mask mAP for ResNet50-based
Mask-RCNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet
on Cityscapes), which demonstrates the effectiveness and
the versatility of the proposed method. The code will be made publicly
available.
Authors' comments: 13 pages
Shenghai Liao, Xuya Liu, Ruyi Han, Shujun Fu, Yuanfeng Zhou, Yuliang Li
Digital image inpainting is an interpolation problem, inferring the content
in the missing (unknown) region to agree with the known region data such that
the interpolated result fulfills some prior knowledge. Low-rank and nonlocal
self-similarity are two important priors for image inpainting. Based on the
nonlocal self-similarity assumption, an image is divided into overlapped square
target patches (submatrices) and the similar patches of any target patch are
reshaped as vectors and stacked into a patch matrix. Such a patch matrix
usually enjoys a property of low rank or approximately low rank, and its
missing entries are recovered by low-rank matrix approximation (LRMA)
algorithms. Traditionally, $n$ nearest neighbor similar patches are searched
within a local window centered at a target patch. However, for an image with
missing lines, the generated patch matrix is prone to having entirely-missing
rows such that the downstream low-rank model fails to reconstruct it well. To
address this problem, we propose a region-wise matching (RwM) algorithm that
divides the neighborhood of a target patch into multiple subregions and then
searches for the most similar patch within each subregion. A non-convex weighted
low-rank decomposition (NC-WLRD) model for LRMA is also proposed to reconstruct
all degraded patch matrices grouped by the proposed RwM algorithm. We solve the
proposed NC-WLRD model by the alternating direction method of multipliers
(ADMM) and analyze the convergence in detail. Numerous experiments on line
inpainting (entire-row/column missing) demonstrate the superiority of our
method over other competitive inpainting algorithms. Unlike other
low-rank-based matrix completion methods and inpainting algorithms, the
proposed model NC-WLRD is also effective for removing random-valued impulse
noise and structural noise (stripes).
Authors' comments: region-wise matching algorithm, image inpainting, 20 pages, 18
figures
Elliot Vincent, Jean Ponce, Mathieu Aubry
Improvements in Earth observation by satellites allow for imagery of ever
higher temporal and spatial resolution. Leveraging this data for agricultural
monitoring is key for addressing environmental and economic challenges. Current
methods for crop segmentation using temporal data either rely on annotated data
or are heavily engineered to compensate for the lack of supervision. In this paper,
we present and compare datasets and methods for both supervised and
unsupervised pixel-wise segmentation of satellite image time series (SITS). We
also introduce an approach to add invariance to spectral deformations and
temporal shifts to classical prototype-based methods such as K-means and
Nearest Centroid Classifier (NCC). We study different levels of supervision and
show this simple and highly interpretable method achieves the best performance
in the low data regime and significantly improves the state of the art for
unsupervised classification of agricultural time series on four recent SITS
datasets.
Authors' comments: Revised version. Added references and baselines. Corrected typos.
Added discussion section and Appendix A, B and C
Bencheng Liao, Shaoyu Chen, Bo Jiang, Tianheng Cheng, Qian Zhang, Wenyu Liu, Chang Huang, Xinggang Wang
Online lane graph construction is a promising but challenging task in
autonomous driving. Previous methods usually model the lane graph at the pixel
or piece level, and recover the lane graph by pixel-wise or piece-wise
connection, which breaks down the continuity of the lane and results in
suboptimal performance. Human drivers focus on and drive along the continuous
and complete paths instead of considering lane pieces. Autonomous vehicles also
require path-specific guidance from the lane graph for trajectory planning. We
argue that the path, which indicates the traffic flow, is the primitive of the
lane graph. Motivated by this, we propose to model the lane graph in a novel
path-wise manner, which well preserves the continuity of the lane and encodes
traffic information for planning. We present a path-based online lane graph
construction method, termed LaneGAP, which end-to-end learns the path and
recovers the lane graph via a Path2Graph algorithm. We qualitatively and
quantitatively demonstrate the superior accuracy and efficiency of LaneGAP over
conventional pixel-based and piece-based methods on the challenging nuScenes
and Argoverse2 datasets under controllable and fair conditions. Compared to the
recent state-of-the-art piece-wise method TopoNet on the OpenLane-V2 dataset,
LaneGAP still outperforms by 1.6 mIoU, further validating the effectiveness of
path-wise modeling. Abundant visualizations in the supplementary material show
LaneGAP can cope with diverse traffic conditions. Code is released at
https://github.com/hustvl/LaneGAP.
Authors' comments: Accepted to ECCV 2024
Kira Maag, Tobias Riedlinger
In recent years, deep neural networks have defined the state-of-the-art in semantic segmentation where their predictions are constrained to a predefined set of semantic classes. They are to be deployed in applications such as automated driving, although their categorically confined expressive power runs contrary to such open world scenarios. Thus, the detection and segmentation of objects from outside their predefined semantic space, i.e., out-of-distribution (OoD) objects, is of highest interest. Since uncertainty estimation methods like softmax entropy or Bayesian models are sensitive to erroneous predictions, these methods are a natural baseline for OoD detection. Here, we present a method for obtaining uncertainty scores from pixel-wise loss gradients which can be computed efficiently during inference. Our approach is simple to implement for a large class of models, does not require any additional training or auxiliary data and can be readily used on pre-trained segmentation models. Our experiments show the ability of our method to identify wrong pixel classifications and to estimate prediction quality at negligible computational overhead. In particular, we observe superior performance in terms of OoD segmentation to comparable baselines on the SegmentMeIfYouCan benchmark, clearly outperforming other methods.
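The softmax-entropy baseline named in the abstract above can be sketched per pixel as follows. This illustrates only that baseline, not the gradient-based uncertainty score the authors propose. The function names and the nested-list layout (height x width x classes) are assumptions, and the normalization by log(C) assumes at least two classes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_map(logit_image):
    """Normalized softmax entropy in [0, 1] for each pixel.

    logit_image is a nested list of shape H x W x C with C >= 2.
    """
    result = []
    for row in logit_image:
        out_row = []
        for logits in row:
            probs = softmax(logits)
            h = -sum(p * math.log(p) for p in probs if p > 0.0)
            out_row.append(h / math.log(len(logits)))
        result.append(out_row)
    return result


# One row with an undecided pixel and a confident pixel (toy logits).
print(entropy_map([[[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]]))
```

High values mark pixels where the prediction is spread across classes; thresholding such a map is one common way to flag candidate out-of-distribution regions.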
Zheqi Zhu, Yuchen Shi, Jiajun Luo, Fei Wang, Chenghui Peng, Pingyi Fan, Khaled B. Letaief
Federated learning (FL) has prevailed as an efficient and privacy-preserving scheme for distributed learning. In this work, we mainly focus on the optimization of computation and communication in FL from the viewpoint of pruning. By adopting layer-wise pruning in local training and federated updating, we formulate an explicit FL pruning framework, FedLP (Federated Layer-wise Pruning), which is model-agnostic and universal for different types of deep learning models. Two specific schemes of FedLP are designed for scenarios with homogeneous local models and heterogeneous ones. Both theoretical and experimental evaluations are developed to verify that FedLP relieves the system bottlenecks of communication and computation with marginal performance decay. To the best of our knowledge, FedLP is the first framework that formally introduces layer-wise pruning into FL. Within the scope of federated learning, more variants and combinations can be further designed based on FedLP.
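One way to picture layer-wise federated updating is a server that averages each layer only over the clients that uploaded it. The sketch below is an assumption made for illustration, not the FedLP aggregation rule: the abstract does not specify how layers are selected or weighted, and the function name and the dict-of-lists model format are hypothetical.

```python
def aggregate_layerwise(global_model, client_updates):
    """Average each layer over the clients that uploaded it.

    global_model: dict mapping layer name -> list of floats.
    client_updates: list of dicts in the same format, each holding only the
    layers that client kept after (hypothetical) layer-wise pruning.
    Layers that no client uploaded keep their previous global values.
    """
    new_model = {}
    for name, old_weights in global_model.items():
        received = [update[name] for update in client_updates if name in update]
        if not received:
            new_model[name] = list(old_weights)
            continue
        new_model[name] = [sum(vals) / len(received) for vals in zip(*received)]
    return new_model


# Toy example: client 1 pruned "fc" and uploads only "conv".
global_model = {"conv": [0.0, 0.0], "fc": [1.0]}
updates = [{"conv": [2.0, 4.0]}, {"conv": [4.0, 8.0], "fc": [3.0]}]
print(aggregate_layerwise(global_model, updates))
```

In this toy setting the communication saving comes from clients sending fewer layers, while the server still obtains an update for every layer that at least one client retained.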
ZongTan Li
Multi-Object Tracking (MOT) has gained extensive attention in recent years
due to its potential applications in traffic and pedestrian detection. We note
that tracking by detection may suffer from errors generated by noisy detectors,
such as an imprecise bounding box before an occlusion, and observe that in
most tracking scenarios, objects tend to move and get lost within specific
locations. To counter this, we present a novel tracker that deals with poor
detections and occlusions. First, we propose a location-wise sub-region
recognition method that divides the frame equally into sub-regions, which we
call a mesh. Then we propose corresponding location-wise loss management
strategies and different matching strategies. Ablation studies demonstrate the
effectiveness of the resulting tracker, Mesh-SORT, which reduces fragmentation
by 3% and ID switches by 7.2% and improves MOTA by 0.4% compared to the
baseline on the MOT17 dataset. Finally, we analyze its limitations in specific
scenes and discuss possible directions for future work.
Authors' comments: 14 pages 18 figs
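The idea of dividing the frame equally into a mesh can be sketched as a lookup from a bounding box to a grid cell. This is a minimal illustration under stated assumptions: the abstract does not give the grid size or say which point of the box is used, so the box center, the parameters `n_cols` and `n_rows`, and the function name are all hypothetical.

```python
def mesh_cell(box, frame_w, frame_h, n_cols, n_rows):
    """Return (row, col) of the mesh cell containing the box center.

    box is (x1, y1, x2, y2) in pixels. Centers outside the frame are
    clamped to the nearest border cell.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    col = min(int(cx * n_cols / frame_w), n_cols - 1)
    row = min(int(cy * n_rows / frame_h), n_rows - 1)
    return max(row, 0), max(col, 0)


# A 1920 x 1080 frame split into a 3 x 4 mesh (rows x columns).
print(mesh_cell((0, 0, 100, 100), 1920, 1080, n_cols=4, n_rows=3))
print(mesh_cell((1900, 1000, 1920, 1080), 1920, 1080, n_cols=4, n_rows=3))
```

A tracker could then keep per-cell statistics, for example how often tracks are lost in each cell, and apply different matching or track-retention rules per location.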
Kai Zhai, Qiang Nie, Bo Ouyang, Xiang Li, Shanlin Yang
2D-to-3D human pose lifting is fundamental for 3D human pose estimation
(HPE), for which graph convolutional networks (GCNs) have proven inherently
suitable for modeling the human skeletal topology. However, the current
GCN-based 3D HPE methods update the node features by aggregating their
neighbors' information without considering the interaction of joints in
different joint synergies. Although some studies have proposed importing limb
information to learn the movement patterns, the latent synergies among joints,
such as maintaining balance, are seldom investigated. We propose the Hop-wise
GraphFormer with Intragroup Joint Refinement (HopFIR) architecture to tackle
the 3D HPE problem. HopFIR mainly consists of a novel hop-wise GraphFormer
(HGF) module and an intragroup joint refinement (IJR) module. The HGF module
groups the joints by k-hop neighbors and applies a hop-wise transformer-like
attention mechanism to these groups to discover latent joint synergies. The IJR
module leverages the prior limb information for peripheral joint refinement.
Extensive experimental results show that HopFIR outperforms the SOTA methods by
a large margin, with a mean per-joint position error (MPJPE) on the Human3.6M
dataset of 32.67 mm. We also demonstrate that the state-of-the-art GCN-based
methods can benefit from the proposed hop-wise attention mechanism with a
significant improvement in performance: SemGCN and MGCN are improved by 8.9%
and 4.5%, respectively.
Authors' comments: Accepted by ICCV 2023
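The grouping of joints by k-hop neighbors described in the abstract above can be sketched with a breadth-first search over the skeleton graph. Only the grouping step is shown, not the attention mechanism; the function name and the six-joint toy skeleton are hypothetical and much smaller than a real pose skeleton.

```python
from collections import deque


def group_by_hops(adjacency, source, max_hops):
    """Group nodes by shortest-path hop distance from source using BFS.

    Returns a dict mapping k (1..max_hops) to the sorted list of nodes
    exactly k hops away from source.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    groups = {k: [] for k in range(1, max_hops + 1)}
    for node, d in dist.items():
        if d >= 1:
            groups[d].append(node)
    return {k: sorted(nodes) for k, nodes in groups.items()}


# Toy skeleton: 0 hip, 1 spine, 2 neck, 3 head, 4 left shoulder, 5 right shoulder.
skeleton = {0: [1], 1: [0, 2], 2: [1, 3, 4, 5], 3: [2], 4: [2], 5: [2]}
print(group_by_hops(skeleton, source=0, max_hops=3))
```

Each hop group could then be treated as one set of tokens for an attention layer, so that joints at the same graph distance from the target joint are processed together.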
Shenwei Xie, Wanfeng Zheng, Zhenglin Xian, Junli Yang, Chuang Zhang, Ming Wu
Automatically extracting roads from satellite imagery is a fundamental yet
challenging computer vision task in the field of remote sensing. Pixel-wise
semantic segmentation-based approaches and graph-based approaches are two
prevailing schemes. However, prior works show that semantic
segmentation-based approaches yield road graphs with low connectivity, while
graph-based methods with iterative exploring paradigms and smaller receptive
fields focus more on local information and are also time-consuming. In this
paper, we propose a new scheme for multi-task satellite imagery road
extraction, Patch-wise Road Keypoints Detection (PaRK-Detect). Building on top
of D-LinkNet architecture and adopting the structure of keypoint detection, our
framework predicts the position of patch-wise road keypoints and the adjacent
relationships between them to construct road graphs in a single pass.
Meanwhile, the multi-task framework also performs pixel-wise semantic
segmentation and generates road segmentation masks. We evaluate our approach
against the existing state-of-the-art methods on DeepGlobe, Massachusetts
Roads, and RoadTracer datasets and achieve competitive or better results. We
also demonstrate a considerable improvement in inference speed.
Authors' comments: Accepted at BMVC 2022 (Oral). 13 pages, 5 figures.
https://bmvc2022.mpi-inf.mpg.de/381/