Taeryung Lee, Yeonguk Oh, Kyoung Mu Lee
In this paper, we propose P3D, the human part-wise motion context learning
framework for sign language recognition. Our main contributions lie in two
dimensions: learning the part-wise motion context and employing the pose
ensemble to utilize 2D and 3D pose jointly. First, our empirical observation
implies that part-wise context encoding benefits the performance of sign
language recognition. While previous methods of sign language recognition
learned motion context from the sequence of the entire pose, we argue that such
methods cannot exploit part-specific motion context. In order to utilize
part-wise motion context, we propose the alternating combination of a part-wise
encoding Transformer (PET) and a whole-body encoding Transformer (WET). PET
encodes the motion contexts from a part sequence, while WET merges them into a
unified context. By learning part-wise motion context, our P3D achieves
superior performance on WLASL compared to previous state-of-the-art methods.
Second, our framework is the first to ensemble 2D and 3D poses for sign
language recognition. Since the 3D pose holds rich motion context and depth
information to distinguish the words, our P3D outperformed the previous
state-of-the-art methods employing a pose ensemble.
Authors' comments: ICCV 2023
Yuhao Yang, Jun Wu, Guangjian Zhang, Rong Xiong
Traditional geometric registration based estimation methods only exploit the CAD model implicitly, which makes them dependent on observation quality and vulnerable to occlusion. To address the problem, the paper proposes a bidirectional correspondence prediction network with a point-wise attention-aware mechanism. This network not only requires the model points to predict the correspondence but also explicitly models the geometric similarities between observations and the model prior. Our key insight is that the correlations between each model point and scene point provide essential information for learning point-pair matches. To further tackle the correlation noise brought by feature distribution divergence, we design a simple but effective pseudo-siamese network to improve feature homogeneity. Experimental results on the public datasets of LineMOD, YCB-Video, and Occ-LineMOD show that the proposed method achieves better performance than other state-of-the-art methods under the same evaluation criteria. Its robustness in estimating poses is greatly improved, especially in environments with severe occlusions.
Anton Baumann, Thomas Roßberg, Michael Schmitt
Uncertainty estimation in machine learning is paramount for enhancing the
reliability and interpretability of predictive models, especially in
high-stakes real-world scenarios. Despite the availability of numerous methods,
they often pose a trade-off between the quality of uncertainty estimation and
computational efficiency. Addressing this challenge, we present an adaptation
of the Multiple-Input Multiple-Output (MIMO) framework -- an approach
exploiting the overparameterization of deep neural networks -- for pixel-wise
regression tasks. Our MIMO variant expands the applicability of the approach
from simple image classification to broader computer vision domains. For that
purpose, we adapted the U-Net architecture to train multiple subnetworks within
a single model, harnessing the overparameterization in deep neural networks.
Additionally, we introduce a novel procedure for synchronizing subnetwork
performance within the MIMO framework. Our comprehensive evaluations of the
resulting MIMO U-Net on two orthogonal datasets demonstrate comparable accuracy
to existing models, superior calibration on in-distribution data, robust
out-of-distribution detection capabilities, and considerable improvements in
parameter size and inference time. Code available at
github.com/antonbaumann/MIMO-Unet
Authors' comments: 8 pages (references do not count), Accepted at UnCV (Workshop on
Uncertainty Quantification for Computer Vision at ICCV)
Jun Zhou, Kai Chen, Linlin Xu, Qi Dou, Jing Qin
One critical challenge in 6D object pose estimation from a single RGBD image
is efficient integration of two different modalities, i.e., color and depth. In
this work, we tackle this problem with a novel Deep Fusion Transformer (DFTr)
block that can aggregate cross-modality features for improving pose estimation.
Unlike existing fusion methods, the proposed DFTr can better model
cross-modality semantic correlation by leveraging their semantic similarity,
such that globally enhanced features from different modalities can be better
integrated for improved information extraction. Moreover, to further improve
robustness and efficiency, we introduce a novel weighted vector-wise voting
algorithm that employs a non-iterative global optimization strategy for precise
3D keypoint localization while achieving near real-time inference. Extensive
experiments show the effectiveness and strong generalization capability of our
proposed 3D keypoint voting algorithm. Results on four widely used benchmarks
also demonstrate that our method outperforms the state-of-the-art methods by
large margins.
Authors' comments: Accepted by ICCV2023
Guillermo Carbajal, Patricia Vitoria, José Lezama, Pablo Musé
In recent years, the removal of motion blur in photographs has seen impressive progress in the hands of deep learning-based methods, trained to map directly from blurry to sharp images. For this reason, approaches that explicitly use a forward degradation model received significantly less attention. However, a well-defined specification of the blur genesis, as an intermediate step, promotes the generalization and explainability of the method. Towards this goal, we propose a learning-based motion deblurring method based on dense non-uniform motion blur estimation followed by a non-blind deconvolution approach. Specifically, given a blurry image, a first network estimates the dense per-pixel motion blur kernels using a lightweight representation composed of a set of image-adaptive basis motion kernels and the corresponding mixing coefficients. Then, a second network trained jointly with the first one, unrolls a non-blind deconvolution method using the motion kernel field estimated by the first network. The model-driven aspect is further promoted by training the networks on sharp/blurry pairs synthesized according to a convolution-based, non-uniform motion blur degradation model. Qualitative and quantitative evaluation shows that the kernel prediction network produces accurate motion blur estimates, and that the deblurring pipeline leads to restorations of real blurred images that are competitive or superior to those obtained with existing end-to-end deep learning-based methods. Code and trained models are available at https://github.com/GuillermoCarbajal/J-MKPD/.
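As a rough illustration of the degradation model described in the abstract above, the sketch below blurs a tiny grayscale image with per-pixel kernels, each formed as a mixture of shared basis kernels weighted by per-pixel mixing coefficients. The function name `nonuniform_blur`, the zero padding at the image border, and the toy kernels are assumptions made for illustration; this is not the authors' implementation.

```python
def nonuniform_blur(sharp, basis, mix):
    """Blur `sharp` (H x W nested lists) with per-pixel kernels.

    basis: list of B square kernels, each K x K with K odd.
    mix:   B x H x W mixing coefficients; the kernel used at pixel (i, j)
           is sum_b mix[b][i][j] * basis[b].
    Pixels outside the image are treated as zero (an assumption).
    """
    h, w = len(sharp), len(sharp[0])
    r = len(basis[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for b, kern in enumerate(basis):
                m = mix[b][i][j]
                if m == 0.0:
                    continue  # this basis kernel does not contribute here
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        y, x = i - di, j - dj  # convolution indexing
                        if 0 <= y < h and 0 <= x < w:
                            acc += m * kern[di + r][dj + r] * sharp[y][x]
            out[i][j] = acc
    return out


# Toy data: an identity kernel and a 3-tap horizontal box kernel.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
horizontal = [[0, 0, 0], [1 / 3, 1 / 3, 1 / 3], [0, 0, 0]]
sharp = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
ones = [[1.0] * 3 for _ in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]

# Every pixel uses only the horizontal kernel: the impulse spreads along row 1.
blurred = nonuniform_blur(sharp, [identity, horizontal], [zeros, ones])
# Every pixel uses only the identity kernel: the image is unchanged.
unchanged = nonuniform_blur(sharp, [identity, horizontal], [ones, zeros])
```

In this toy case the central impulse of intensity 9 is spread evenly over the three pixels of its row. In the method described above, the basis kernels and mixing coefficients would be predicted by a network rather than fixed by hand.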
Helal El-Zaatari, Fei Yu, Michael R Kosorok
Statistical analysis of social networks provides valuable insights into
complex network interactions across various scientific disciplines. However,
accurate modeling of networks remains challenging due to the heavy
computational burden and the need to account for observed network dependencies.
Exponential Random Graph Models (ERGMs) have emerged as a promising technique
used in social network modeling to capture network dependencies by
incorporating endogenous variables. Nevertheless, using ERGMs poses multiple
challenges, including the occurrence of ERGM degeneracy, which generates
unrealistic and meaningless network structures. To address these challenges and
enhance the modeling of collaboration networks, we propose and test a novel
approach that focuses on endogenous variable selection within ERGMs. Our method
aims to overcome the computational burden and improve the accommodation of
observed network dependencies, thereby facilitating more accurate and
meaningful interpretations of network phenomena in various scientific fields.
We conduct empirical testing and rigorous analysis to contribute to the
advancement of statistical techniques and offer practical insights for network
analysis.
Authors' comments: 23 pages, 6 tables and 18 figures
Yosuke Shinya
Scale-wise evaluation of object detectors is important for real-world
applications. However, existing metrics are either coarse or not sufficiently
reliable. In this paper, we propose novel scale-wise metrics that strike a
balance between fineness and reliability, using a filter bank consisting of
triangular and trapezoidal band-pass filters. We conduct experiments with two
methods on two datasets and show that the proposed metrics can highlight the
differences between the methods and between the datasets. Code is available at
https://github.com/shinya7y/UniverseNet .
Authors' comments: Honorable Mention Solution Award in Small Object Detection Challenge
for Spotting Birds, International Conference on Machine Vision Applications
(MVA) 2023
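To make the idea of a filter bank over object scale concrete, here is a minimal sketch with triangular filters in the middle and trapezoidal (flat-shelf) filters at both ends. The abstract does not specify the filter placement, so the centers, the use of a log2 scale axis, and the names `make_filter` and `filter_bank` are all assumptions for illustration, not the paper's definition.

```python
import math


def make_filter(left, center, right):
    """Band-pass response on the scale axis.

    left/right are the neighbouring centers, or None at the ends of the bank.
    With both neighbours present the response is a triangle peaking at
    `center`; with one neighbour missing it is a trapezoid that stays at 1
    on the open side.
    """
    def response(x):
        if x == center:
            return 1.0
        if x < center:
            if left is None:
                return 1.0
            return max(0.0, (x - left) / (center - left))
        if right is None:
            return 1.0
        return max(0.0, (right - x) / (right - center))
    return response


def filter_bank(centers):
    """Build one filter per center; `centers` must be sorted ascending."""
    bank = []
    for i, c in enumerate(centers):
        left = centers[i - 1] if i > 0 else None
        right = centers[i + 1] if i + 1 < len(centers) else None
        bank.append(make_filter(left, c, right))
    return bank


# Hypothetical centers at object scales of 16, 32 and 64 pixels (log2 axis).
bank = filter_bank([4.0, 5.0, 6.0])


def weights(scale_px):
    """Per-band weights for an object of the given scale in pixels."""
    x = math.log2(scale_px)
    return [f(x) for f in bank]
```

With this construction the weights of any object sum to 1, so each detection or ground-truth box contributes fractionally to neighbouring scale bands instead of falling into a single hard bin.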
Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang
Efficiency is important for 3D lane detection because of practical deployment demands. In this work, we propose a simple, fast, end-to-end detector that still maintains high detection precision. Specifically, we devise a set of fully convolutional heads based on row-wise classification. In contrast to previous counterparts, ours supports recognizing both vertical and horizontal lanes. In addition, our method is the first to perform row-wise classification in bird's-eye view. In the heads, we split the features into multiple groups, and each group of features corresponds to a lane instance. During training, the predictions are associated with lane labels using the proposed single-win one-to-one matching to compute the loss, and no post-processing is required at inference. In this way, our proposed fully convolutional detector, GroupLane, realizes end-to-end detection like DETR. Evaluated on three real-world 3D lane benchmarks, OpenLane, Once-3DLanes, and OpenLane-Huawei, GroupLane with ConvNext-Base as the backbone outperforms the published state-of-the-art PersFormer by 13.6% F1 score on the OpenLane validation set. Moreover, GroupLane with ResNet18 still surpasses PersFormer by 4.9% F1 score, while its inference is nearly 7x faster and its FLOPs are only 13.3% of PersFormer's.
Liu Liu, Shuaifeng Zhi, Zhenhua Du, Li Liu, Xinyu Zhang, Kai Huo, Weidong Jiang
Radars, due to their robustness to adverse weather conditions and ability to measure object motions, have served in autonomous driving and intelligent agents for years. However, Radar-based perception suffers from its unintuitive sensing data, which lacks semantic and structural information about scenes. To tackle this problem, camera and Radar sensor fusion has been investigated as a trending strategy with low cost, high reliability, and strong maintainability. While most recent works explore how to exploit Radar point clouds and images, rich contextual information within Radar observations is discarded. In this paper, we propose a hybrid point-wise Radar-Optical fusion approach for object detection in autonomous driving scenarios. The framework benefits from dense contextual information from both the range-doppler spectrum and images, which are integrated to learn a multi-modal feature representation. Furthermore, we propose a novel local coordinate formulation, tackling the object detection task in an object-centric coordinate system. Extensive results show that with the information gained from optical images, we achieve leading performance in object detection (97.69% recall) compared to the recent state-of-the-art method FFT-RadNet (82.86% recall). Ablation studies verify the key design choices and the practicability of our approach given machine-generated imperfect detections. The code will be available at https://github.com/LiuLiu-55/ROFusion.
Yasar Abbas Ur Rehman, Yan Gao, Pedro Porto Buarque de Gusmão, Mina Alibeigi, Jiajun Shen, Nicholas D. Lane
The ubiquity of camera-enabled devices has led to large amounts of unlabeled image data being produced at the edge. The integration of self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially offer data privacy guarantees while also advancing the quality and robustness of the learned visual representations without needing to move data around. However, client bias and divergence during FL aggregation, caused by data heterogeneity, limit the performance of learned visual representations on downstream tasks. In this paper, we propose a new aggregation strategy termed Layer-wise Divergence Aware Weight Aggregation (L-DAWA) to mitigate the influence of client bias and divergence during FL aggregation. The proposed method aggregates weights at the layer level according to a measure of angular divergence between each client's model and the global model. Extensive experiments with cross-silo and cross-device settings on CIFAR-10/100 and Tiny ImageNet datasets demonstrate that our method is effective and obtains new SOTA performance on both contrastive and non-contrastive SSL approaches.
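The layer-level aggregation described above can be sketched as follows. The abstract only states that weights are aggregated per layer according to angular divergence, so the exact rule used here (cosine similarity to the previous global layer, clipped at zero and normalized into a weighted average) is an assumption, as are the names `cosine` and `aggregate_layerwise` and the flat-list model representation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two flat weight vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def aggregate_layerwise(global_model, client_models):
    """Aggregate client weights one layer at a time.

    Each model is a dict mapping a layer name to a flat list of floats.
    A client's layer is weighted by its cosine similarity to the same
    layer of the previous global model (assumed rule, see lead-in).
    """
    new_global = {}
    for name, g in global_model.items():
        sims = [max(cosine(c[name], g), 0.0) for c in client_models]
        total = sum(sims)
        if total == 0.0:
            new_global[name] = list(g)  # no aligned client: keep the old layer
            continue
        new_global[name] = [
            sum(s * c[name][i] for s, c in zip(sims, client_models)) / total
            for i in range(len(g))
        ]
    return new_global


# Toy example: client 0 is aligned with the global layer, client 1 is orthogonal.
global_model = {"fc": [1.0, 0.0]}
clients = [{"fc": [2.0, 0.0]}, {"fc": [0.0, 2.0]}]
updated = aggregate_layerwise(global_model, clients)
```

In the toy example the orthogonal client receives zero weight for that layer, so the updated layer follows the aligned client only. Plain federated averaging would instead mix both clients equally regardless of direction.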
Pierre-Louis Lions, Benjamin Seeger
We consider linear and nonlinear transport equations with irregular velocity fields, motivated by models coming from mean field games. The velocity fields are assumed to increase in each coordinate, and the divergence therefore fails to be absolutely continuous with respect to the Lebesgue measure in general. For such velocity fields, the well-posedness of first- and second-order linear transport equations in Lebesgue spaces is established, as well as the existence and uniqueness of regular ODE and SDE Lagrangian flows. These results are then applied to the study of certain nonconservative, nonlinear systems of transport type, which are used to model mean field games in a finite state space. A notion of weak solution is identified for which a unique minimal and maximal solution exist, which do not coincide in general. A selection-by-noise result is established for a relevant example to demonstrate that different types of noise can select any of the admissible solutions in the vanishing noise limit.
Weiliang Chan, Qianqian Ren
Urban region embedding is an important yet highly challenging problem due to the complexity and constantly changing nature of urban data. To address these challenges, we propose a Region-Wise Multi-View Representation Learning (ROMER) model to capture multi-view dependencies and learn expressive representations of urban regions without the constraints of rigid neighbourhood region conditions. Our model focuses on learning urban region representations from multi-source urban data. First, we capture the multi-view correlations from mobility flow patterns, POI semantics, and check-in dynamics. Then, we adopt global graph attention networks to learn the similarity of any two vertices in the graphs. To comprehensively consider and share features of multiple views, a two-stage fusion module is further proposed to learn weights with external attention and fuse the multi-view embeddings. Extensive experiments on two downstream tasks with real-world datasets demonstrate that our model outperforms state-of-the-art methods by up to 17%.
Yuyuan Li, Jiaming Zhang, Yixiu Liu, Chaochao Chen
Privacy concerns associated with machine learning models have driven research into machine unlearning, which aims to erase the memory of specific target training data from already trained models. This issue also arises in federated learning, creating the need to address the federated unlearning problem. However, federated unlearning remains a challenging task. On the one hand, current research primarily focuses on unlearning all data from a client, overlooking more fine-grained unlearning targets, e.g., class-wise and sample-wise removal. On the other hand, existing methods suffer from imprecise estimation of data influence and impose significant computational or storage burden. To address these issues, we propose a neuro-inspired federated unlearning framework based on active forgetting, which is independent of model architectures and suitable for fine-grained unlearning targets. Our framework distinguishes itself from existing methods by utilizing new memories to overwrite old ones. These new memories are generated through teacher-student learning. We further utilize refined elastic weight consolidation to mitigate catastrophic forgetting of non-target data. Extensive experiments on benchmark datasets demonstrate the efficiency and effectiveness of our method, achieving satisfactory unlearning completeness against backdoor attacks.
Peter A. Monkewitz
The scaling of Reynolds stresses in turbulent wall-bounded flows is the
subject of a long running debate. In the near-wall ``inner'' region, a sizeable
group, inspired by the ``attached eddy model'', has advocated the unlimited
growth of $\langle uu\rangle^+$ and in particular of its inner peak at
$y^+\approxeq 15$, with $\ln\Reytau$ \citep[see e.g.][and references
therein]{smitsetal2021}. Only recently, \citet{chen_sreeni2021,chen_sreeni2022}
have argued on the basis of bounded dissipation, that $\langle uu\rangle^+$
remains finite in the inner near-wall region for $\Reytau\rightarrow\infty$,
with finite Reynolds number corrections of order $\Reytau^{-1/4}$. In this
paper, the overlap between the two-term inner expansion $f_0(y^+) +
f_1(y^+)/\Reytau^{1/4}$ of \citet{monkewitz22} and the leading order outer
expansion for $\langle uu\rangle^+$ is shown to be of the form $C_0 +
C_1\,(y^+/\Reytau)^{1/4}$. With a new indicator function, overlaps of this form
are reliably identified in $\langle uu\rangle^+$ profiles for channels and
pipes, while the situation in boundary layers requires further clarification.
On the other hand, the standard logarithmic indicator function, evaluated for
the same data, shows no sign of a logarithmic law to connect an inner expansion
of $\langle uu\rangle^+$ growing as $\ln{\Reytau}$ to an outer expansion of
order unity.
Authors' comments: 10 pages, 5 figures
Can Cui, Ruining Deng, Quan Liu, Tianyuan Yao, Shunxing Bao, Lucas W. Remedios, Yucheng Tang, Yuankai Huo
The Segment Anything Model (SAM) is a recently proposed prompt-based model for generic zero-shot segmentation. With this zero-shot capability, SAM has achieved impressive flexibility and precision on various segmentation tasks. However, the current pipeline requires manual prompts during the inference stage, which is still resource intensive for biomedical image segmentation. In this paper, we introduce a pipeline called all-in-SAM that utilizes SAM throughout the entire AI development workflow (from annotation generation to model finetuning) without requiring manual prompts during the inference stage. Specifically, SAM is first employed to generate pixel-level annotations from weak prompts (e.g., points, bounding boxes). Then, the pixel-level annotations are used to finetune the SAM segmentation model rather than training from scratch. Our experimental results reveal two key findings: 1) the proposed pipeline surpasses the state-of-the-art (SOTA) methods in a nuclei segmentation task on the public MoNuSeg dataset, and 2) using weak and few annotations for SAM finetuning achieves competitive performance compared to using strong pixel-wise annotated data.
Dejene Zewdie, Roberto J. Assef, Chiara Mazzucchelli, Manuel Aravena, Andrew W. Blain, Tanio Díaz-Santos, Peter R. M. Eisenhardt, Hyunsung D. Jun et al.
We report the identification of Lyman Break Galaxy (LBG) candidates around
the most luminous Hot Dust-Obscured Galaxy (Hot DOG) known, WISE
J224607.56$-$052634.9 (W2246$-$0526) at $z=4.601$, using deep \textit{r}-,
\textit{i}-, and \textit{z}-band imaging from the Gemini Multi-Object
Spectrograph South (GMOS-S). We use the surface density of LBGs to probe the
Mpc-scale environment of W2246$-$0526 to characterize its richness and
evolutionary state. We identify LBG candidates in the vicinity of W2246$-$0526
using the selection criteria developed by \cite{2004VOuchi} and
\cite{2006Yoshida} in the Subaru Deep Field and in the Subaru XMM-Newton Deep
Field, slightly modified to account for the difference between the filters
used, and we find 37 and 55 LBG candidates, respectively. Matching to the
$z$-band depths of those studies, this corresponds to $\delta =
5.8^{+2.4}_{-1.9}$ times the surface density of LBGs expected in the field.
Interestingly, neither the Hot DOG itself nor a confirmed neighbor
satisfies either set of LBG selection criteria, suggesting we may be missing a large
number of companion galaxies. Our analysis shows that we are most likely only
finding those with higher-than-average IGM optical depth or moderately high
dust obscuration. The number density of LBG candidates is not concentrated
around W2246$-$0526, suggesting an early evolutionary stage for the
proto-cluster, that the Hot DOG may not be the most massive galaxy, or that
the Hot DOG may be affecting the IGM transparency in its vicinity. The
overdensity around W2246$-$0526 is comparable to overdensities found around
other Hot DOGs and is somewhat higher than typically found for radio galaxies
and luminous quasars at a similar redshift.
Authors' comments: 20 pages, 15 figures. The main results are in Figures 9 and 12.
Accepted for publication in A&A
Meng Xiao, Dongjie Wang, Min Wu, Kunpeng Liu, Hui Xiong, Yuanchun Zhou, Yanjie Fu
Feature transformation aims to reconstruct an effective representation space
by mathematically refining the existing features. It serves as a pivotal
approach to combat the curse of dimensionality, enhance model generalization,
mitigate data sparsity, and extend the applicability of classical models.
Existing research predominantly focuses on domain knowledge-based feature
engineering or learning latent representations. However, these methods, while
insightful, lack full automation and fail to yield a traceable and optimal
representation space. An indispensable question arises: Can we concurrently
address these limitations when reconstructing a feature space for a
machine-learning task? Our initial work took a pioneering step towards this
challenge by introducing a novel self-optimizing framework. This framework
leverages the power of three cascading reinforced agents to automatically
select candidate features and operations for generating improved feature
transformation combinations. Despite the impressive strides made, there was
room for enhancing its effectiveness and generalization capability. In this
extended journal version, we advance our initial work from two distinct yet
interconnected perspectives: 1) We propose a refinement of the original
framework, which integrates a graph-based state representation method to
capture the feature interactions more effectively and develop different
Q-learning strategies to alleviate Q-value overestimation further. 2) We
utilize a new optimization technique (actor-critic) to train the entire
self-optimizing framework in order to accelerate the model convergence and
improve the feature transformation performance. Finally, to validate the
improved effectiveness and generalization capability of our framework, we
perform extensive experiments and conduct comprehensive analyses.
Authors' comments: 21 pages, submitted to TKDD. arXiv admin note: text overlap with
arXiv:2209.08044, arXiv:2205.14526
Alireza Daneshyar, Leon Herrmann, Stefan Kollmannsberger
Ductile damage models and cohesive laws incorporate material plasticity, which entails the growth of irrecoverable deformations even after complete failure. This unrealistic growth remains concealed until the unilateral effects arising from crack closure emerge. We address this issue by proposing a new strategy to cope with the entire process of failure, from its inception in the form of diffuse damage to the final stage, i.e. the emergence of sharp cracks. To this end, we add a new strain field, termed discontinuity strain, to the conventional additive strain decomposition to account for discontinuities in a continuous sense so that the standard principle of virtual work applies. We treat this strain field similarly to a strong discontinuity, yet without introducing new kinematic variables and nonlinear boundary conditions. In this paper, we demonstrate the effectiveness of this new strategy on a simple ductile damage constitutive model. The model uses a scalar damage index to control the degradation process. The discontinuity strain field is injected into the strain decomposition if this damage index exceeds a certain threshold. The threshold corresponds to the limit at which the induced imperfections merge and form a discrete crack. With three-point bending tests under pure mode I and mixed-mode conditions, we demonstrate that this augmentation does not show the early crack closure artifact that is wrongly predicted by plastic damage formulations at load reversal. We also use the concrete damaged plasticity model provided in the commercial finite element program Abaqus for our comparison. Lastly, a high-intensity low-cycle fatigue test demonstrates the unilateral effects resulting from the complete closure of the induced crack.
Yulan Liu, Yuyang Zhou, Rongrong Lin
This paper characterizes the proximal operator of the piece-wise exponential function $1\!-\!e^{-|x|/\sigma}$ with a given shape parameter $\sigma\!>\!0$, which is a popular nonconvex surrogate of the $\ell_0$-norm in support vector machines, zero-one programming problems, compressed sensing, etc. Although Malek-Mohammadi et al. [IEEE Transactions on Signal Processing, 64(21):5657--5671, 2016] once worked on this problem, the expressions they derived were inaccurate: in a sense, a case was missing. Using the Lambert W function and an extensive study of the piece-wise exponential function, we have rectified the formulation of the proximal operator of the piece-wise exponential function in light of their work. We have also undertaken a thorough analysis of this operator. Finally, as an application in compressed sensing, an iterative shrinkage and thresholding algorithm (ISTA) for the piece-wise exponential function regularization problem is developed and fully investigated. A comparative study of ISTA with nine popular non-convex penalties in compressed sensing demonstrates the advantage of the piece-wise exponential penalty.
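For readers who want to see what this proximal operator does, the sketch below evaluates it numerically by brute-force grid search on the scalar objective $\tfrac{1}{2}(x-y)^2 + \lambda\,(1-e^{-|x|/\sigma})$. It is only a numerical check under stated assumptions, not the closed-form Lambert W expression derived in the paper; the weight `lam`, the grid size `n`, and the function names are introduced here for illustration.

```python
import math


def pe_penalty(x, sigma):
    """Piece-wise exponential penalty 1 - exp(-|x| / sigma)."""
    return 1.0 - math.exp(-abs(x) / sigma)


def prox_pe_numeric(y, lam, sigma, n=20001):
    """Approximate argmin_x 0.5 * (x - y)**2 + lam * pe_penalty(x, sigma).

    The penalty is even and nondecreasing in |x|, so a minimizer has the
    sign of y and magnitude in [0, |y|]; we search that interval on a
    uniform grid of n points.
    """
    a = abs(y)
    best_x, best_val = 0.0, 0.5 * a * a  # objective value at x = 0
    for k in range(1, n):
        x = a * k / (n - 1)
        val = 0.5 * (x - a) ** 2 + lam * pe_penalty(x, sigma)
        if val < best_val:
            best_x, best_val = x, val
    return math.copysign(best_x, y)


small = prox_pe_numeric(0.1, lam=1.0, sigma=1.0)  # small input is set to zero
large = prox_pe_numeric(5.0, lam=1.0, sigma=1.0)  # large input is shrunk slightly
```

The two calls show the thresholding behaviour typical of sparsity-promoting penalties: small inputs are mapped exactly to zero, while large inputs are shrunk only slightly because the penalty saturates. A nonzero minimizer satisfies the stationarity condition $x - |y| + (\lambda/\sigma)\,e^{-x/\sigma} = 0$, which is the kind of equation the Lambert W function is suited to solving.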
Mi Qian, Fei Ji, Yao Ge, Miaowen Wen, Xiang Cheng, H. Vincent Poor
As a promising technique for high-mobility wireless communications,
orthogonal time frequency space (OTFS) has been shown to offer clear
advantages over traditional orthogonal frequency division
multiplexing (OFDM). Although multiple studies have considered index modulation
(IM) based OTFS (IM-OTFS) schemes to further improve system performance, a
challenging and open problem is the development of effective IM schemes and
efficient receivers for practical OTFS systems that must operate in the
presence of channel delays and Doppler shifts. In this paper, we propose two
novel block-wise IM schemes for OTFS systems, named delay-IM with OTFS
(DeIM-OTFS) and Doppler-IM with OTFS (DoIM-OTFS), where a block of
delay/Doppler resource bins are activated simultaneously. Based on a maximum
likelihood (ML) detector, we analyze upper bounds on the average bit error
rates for the proposed DeIM-OTFS and DoIM-OTFS schemes, and verify their
performance advantages over the existing IM-OTFS systems. We also develop a
multi-layer joint symbol and activation pattern detection (MLJSAPD) algorithm
and a customized message passing detection (CMPD) algorithm for our proposed
DeIM-OTFS and DoIM-OTFS systems with low complexity. Simulation results
demonstrate that our proposed MLJSAPD and CMPD algorithms can achieve desired
performance with robustness to the imperfect channel state information (CSI).
Authors' comments: arXiv admin note: text overlap with arXiv:2210.13454