Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, Federico Tombari
Large visual-language models (VLMs), like CLIP, enable open-set image
segmentation to segment arbitrary concepts from an image in a zero-shot manner.
This goes beyond the traditional closed-set assumption, i.e., where models can
only segment classes from a pre-defined training set. More recently, first
works on open-set segmentation in 3D scenes have appeared in the literature.
These methods are heavily influenced by closed-set 3D convolutional approaches
that process point clouds or polygon meshes. However, these 3D scene
representations do not align well with the image-based nature of the
visual-language models. Indeed, point clouds and 3D meshes typically have a
lower resolution than images and the reconstructed 3D scene geometry might not
project well to the underlying 2D image sequences used to compute pixel-aligned
CLIP features. To address these challenges, we propose OpenNeRF which naturally
operates on posed images and directly encodes the VLM features within the NeRF.
This is similar in spirit to LERF; however, our work shows that using pixel-wise
VLM features (instead of global CLIP features) results in an overall less
complex architecture without the need for additional DINO regularization. Our
OpenNeRF further leverages NeRF's ability to render novel views and extract
open-set VLM features from areas that are not well observed in the initial
posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF
outperforms recent open-vocabulary methods such as LERF and OpenScene by at
least +4.9 mIoU.
Authors' comments: ICLR 2024, Project page: https://opennerf.github.io
Luca Comanducci, Fabio Antonacci, Augusto Sarti
Deep learning models are widely applied in the signal processing community, yet their inner workings are often treated as a black box. In this paper, we investigate the application of eXplainable Artificial Intelligence (XAI) techniques to learning-based end-to-end speech source localization models. We consider the Layer-wise Relevance Propagation (LRP) technique, which aims to determine which parts of the input are most important for the output prediction. Using LRP, we analyze two state-of-the-art models of differing architectural complexity that map audio signals acquired by the microphones to the Cartesian coordinates of the source. Specifically, we inspect the relevance associated with the input features of the two models and discover that both networks denoise and de-reverberate the microphone signals to compute more accurate statistical correlations between them and consequently localize the sources. To further demonstrate this fact, we estimate the Time Differences of Arrival (TDoAs) via the Generalized Cross Correlation with Phase Transform (GCC-PHAT) using both microphone signals and relevance signals extracted from the two networks and show that the latter yield more accurate time-delay estimates.
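For readers unfamiliar with GCC-PHAT, the following is a minimal NumPy sketch of time-delay estimation between two microphone signals. It is a generic textbook formulation, not the authors' code; the function name `gcc_phat_delay` and its parameters are hypothetical.

```python
import numpy as np

def gcc_phat_delay(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x (in seconds) with GCC-PHAT.

    Hypothetical helper for illustration; x and y are 1-D signals sampled at fs.
    """
    n = len(x) + len(y)                    # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)                 # cross-power spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)          # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that the lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs
```

A positive return value means that `y` lags `x`. The same function could be applied to relevance signals in place of raw microphone signals, which is the comparison the abstract describes.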
Yukun Yue
In this paper, we establish discrete versions of the Poincaré and trace inequalities for hybridizable finite element spaces. These spaces are made of piecewise polynomial functions defined both within the interiors of elements and across all faces in a mesh's skeleton, serving as the basis for both the hybridizable discontinuous Galerkin (HDG) and hybrid high-order (HHO) methods. Additionally, we present a specific adaptation of these inequalities for the HDG method and apply them to demonstrate the stability of the related numerical schemes for second-order elliptic equations under the minimal regularity assumptions for the source term and boundary data.
Natalie Lang, Alejandro Cohen, Nir Shlezinger
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning. It typically involves a set of heterogeneous devices locally training neural network (NN) models in parallel with periodic centralized aggregations. As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers. Conventional approaches discard incomplete intra-model updates done by stragglers, alter the amount of local workload and architecture, or resort to asynchronous settings, all of which affect the trained model performance under tight training latency constraints. In this work, we propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion. SALF allows stragglers to synchronously convey partial gradients, having each layer of the global model be updated independently with a different contributing set of users. We provide a theoretical analysis, establishing convergence guarantees for the global model under mild assumptions on the distribution of the participating devices, revealing that SALF converges at the same asymptotic rate as FL with no timing limitations. This insight is matched with empirical observations, demonstrating the performance gains of SALF compared to alternative mechanisms mitigating the device heterogeneity gap in FL.
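The layer-wise update idea can be sketched as follows. This is only an illustration, assuming plain averaging over whichever users delivered a gradient for a given layer; the exact SALF aggregation rule may differ, and the name `layerwise_aggregate` and the dictionary layout are hypothetical.

```python
import numpy as np

def layerwise_aggregate(global_layers, user_grads, lr=0.1):
    """Update each layer using only the users that delivered its gradient.

    global_layers: list of weight arrays, index 0 being the input-side layer.
    user_grads: one dict per user mapping layer index -> gradient array.
        Backpropagation reaches the last layers first, so a straggler's dict
        holds only the deepest layers it finished before the deadline.
    """
    new_layers = []
    for idx, weights in enumerate(global_layers):
        contributions = [g[idx] for g in user_grads if idx in g]
        if contributions:  # layers with no contributor are left unchanged
            weights = weights - lr * np.mean(contributions, axis=0)
        new_layers.append(weights)
    return new_layers
```

In this sketch, a layer for which no user finished its gradient simply keeps its previous value for that round.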
Khunanon Thongkham, Anthony H. Gonzalez, Mark Brodwin, Ariane Trudeau, Ripon Saha, Peter Eisenhardt, S. A. Stanford, Emily Moravec et al.
The Massive and Distant Clusters of WISE Survey 2 (MaDCoWS2) is a new survey
designed as the successor of the original MaDCoWS survey. MaDCoWS2 improves
upon its predecessor by using deeper optical and infrared data and a more
powerful detection algorithm (PZWav). As input to the search, we use grz
photometry from DECaLS in combination with W1 and W2 photometry from the
CatWISE2020 catalog to derive the photometric redshifts with full redshift
probability distribution functions for WISE-selected galaxies. Cluster
candidates are then detected using the PZWav algorithm to find
three-dimensional galaxy overdensities from the sky positions and photometric
redshifts. This paper provides the first MaDCoWS2 data release, covering 1461
(1838 without masking) deg^2 centered on the Hyper-SuprimeCam Subaru Strategic
Program equatorial fields. Within this region, we derive a catalog of 22,970
galaxy cluster candidates detected at S/N>5. These clusters span the redshift
range 0.1<z<2, including 1312 candidates at z>1.5. We compare MaDCoWS2 to six
existing catalogs in the area. We rediscover 60%-92% of the clusters in these
surveys at S/N>5. The medians of the absolute redshift offset are <0.02
relative to these surveys, while the standard deviations are less than 0.06.
The median offsets between the detection position from MaDCoWS2 and other
surveys are less than 0.25 Mpc. We quantify the relation between S/N and gas
mass, total mass, luminosity, and richness from other surveys using a
redshift-dependent power law relation. We find that the S/N-richness relation
exhibits the lowest scatter.
Authors' comments: 27 pages, 7 figures. Typo corrected. Accepted for publication in ApJ
Yongqiang Wang, Haisheng Fu, Qi Cao, Shang Wang, Zhenjiao Chen, Feng Liang
Recently, deep learning technology has been successfully applied in the field of image compression, leading to superior rate-distortion performance. It is crucial to design an effective and efficient entropy model to estimate the probability distribution of the latent representation. However, the majority of entropy models primarily focus on one-dimensional correlation processing between channel and spatial information. In this paper, we propose an Adaptive Channel-wise and Global-inter attention Context (ACGC) entropy model, which can efficiently achieve dual feature aggregation in both inter-slice and intra-slice contexts. Specifically, we divide the latent representation into different slices and then apply the ACGC model in a parallel checkerboard context to achieve faster decoding speed and higher rate-distortion performance. To capture redundant global features across different slices, we utilize deformable attention in adaptive global-inter attention to dynamically refine the attention weights based on the actual spatial relationships and context. Furthermore, in the main transformation structure, we propose a high-performance S2LIC model. We introduce the residual SwinV2 Transformer model to capture global feature information and utilize a dense block network as the feature enhancement module to improve the nonlinear representation of the image within the transformation structure. Experimental results demonstrate that our method achieves faster encoding and decoding speeds and outperforms VTM-17.1 and some recent learned image compression methods in both PSNR and MS-SSIM metrics.
Nazmul Hasan, Apurba Kumar Saha, Andrew Wessman, Mohammed Shafae
Overheating anomaly detection is essential for the quality and reliability of
parts produced by laser powder bed fusion (LPBF) additive manufacturing (AM).
In this research, we focus on the detection of overheating anomalies using
photodiode sensor data. Photodiode sensors can collect high-frequency data from
the melt pool, reflecting the process dynamics and thermal history. Hence, the
proposed method offers a machine learning (ML) framework to utilize photodiode
sensor data for layer-wise detection of overheating anomalies. In doing so,
three sets of features are extracted from the raw photodiode data: MSMM (mean,
standard deviation, median, maximum), MSQ (mean, standard deviation,
quartiles), and MSD (mean, standard deviation, deciles). These three datasets
are used to train several ML classifiers. Cost-sensitive learning is used to
handle the class imbalance between the "anomalous" layers (affected by
overheating) and "nominal" layers in the benchmark dataset. To boost detection
accuracy, our proposed ML framework involves utilizing the majority voting
ensemble (MVE) approach. The proposed method is demonstrated using a case study
including an open benchmark dataset of photodiode measurements from an LPBF
specimen with deliberate overheating anomalies at some layers. The results from
the case study demonstrate that the MSD features yield the best performance for
all classifiers, and the MVE classifier (with a mean F1-score of 0.8654)
surpasses the individual ML classifiers. Moreover, our machine learning
methodology achieves superior results (9.66% improvement in mean F1-score) in
detecting layer-wise overheating anomalies, surpassing the existing methods in
the literature that use the same benchmark dataset.
Authors' comments: 12 pages (including references); 5 figures; 4 tables
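The MSD feature set and the majority voting step described in this abstract can be sketched as below. This assumes that "deciles" means the nine cut points from the 10th to the 90th percentile and that labels are binary (1 = anomalous); both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def msd_features(signal):
    """Mean, standard deviation, and nine deciles of one layer's photodiode signal."""
    signal = np.asarray(signal, dtype=float)
    deciles = np.percentile(signal, np.arange(10, 100, 10))
    return np.concatenate(([signal.mean(), signal.std()], deciles))

def majority_vote(predictions):
    """Combine binary labels from several classifiers.

    predictions: array of shape (n_classifiers, n_layers) with 0/1 entries.
    A layer is flagged only if strictly more than half of the classifiers flag it.
    """
    predictions = np.asarray(predictions)
    return (predictions.sum(axis=0) * 2 > predictions.shape[0]).astype(int)
```

With an even number of classifiers, a tie resolves to the nominal class in this sketch; the paper does not state its tie-breaking rule.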
Peng Zhang, Ao Duan, Xianglu Zou, Yuhong Liu
Privacy-Preserving Neural Networks (PPNN) have been developed to perform inference without breaching user privacy, and can serve as an essential tool for medical diagnosis to simultaneously achieve big data utility and privacy protection. As one of the key techniques to enable PPNN, Fully Homomorphic Encryption (FHE) faces a major challenge: homomorphic operations cannot be easily adapted for non-linear activation calculations. In this paper, batch-oriented element-wise data packing and approximate activation are proposed, which train linear low-degree polynomials to approximate the non-linear activation function ReLU. Compared with other approximate activation methods, the proposed fine-grained, trainable approximation scheme can effectively reduce the accuracy loss caused by approximation errors. Meanwhile, due to element-wise data packing, a large batch of images can be packed and inferred concurrently, leading to a much higher utility ratio of ciphertext slots. Therefore, although the total inference time increases sharply, the amortized time for each image actually decreases, especially when the batch size increases. Furthermore, knowledge distillation is adopted in the training process to further enhance the inference accuracy. Experimental results show that when ciphertext inference is performed on 4096 input images, compared with the current most efficient channel-wise method, the inference accuracy is improved by 1.65%, and the amortized inference time is reduced by 99.5%.
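To illustrate why a low-degree polynomial is a usable stand-in for ReLU under FHE (only additions and multiplications are needed to evaluate it), here is a minimal sketch that fits a polynomial to ReLU by least squares. Note that the paper trains its polynomials; the one-shot least-squares fit, the interval, and the function name below are assumptions made for illustration.

```python
import numpy as np

def fit_relu_polynomial(degree=2, interval=(-4.0, 4.0), n_points=2001):
    """Least-squares polynomial approximation of ReLU on a fixed interval."""
    xs = np.linspace(interval[0], interval[1], n_points)
    coeffs = np.polyfit(xs, np.maximum(xs, 0.0), degree)
    return np.poly1d(coeffs)  # callable polynomial, highest power first

p = fit_relu_polynomial()
grid = np.linspace(-4.0, 4.0, 101)
max_err = np.max(np.abs(p(grid) - np.maximum(grid, 0.0)))
```

The approximation error grows quickly outside the fitting interval, so inputs to the activation would need to stay within the assumed range.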
Jiawei Li, Sitong Li, Shanshan Wang, Yicheng Zeng, Falong Tan, Chuanlong Xie
Deploying machine learning in open environments presents the challenge of encountering diverse test inputs that differ significantly from the training data. These out-of-distribution samples may exhibit shifts in local or global features compared to the training distribution. The machine learning (ML) community has responded with a number of methods aimed at distinguishing anomalous inputs from original training data. However, the majority of previous studies have primarily focused on the output layer or penultimate layer of pre-trained deep neural networks. In this paper, we propose a novel framework, Multitesting-based Layer-wise Out-of-Distribution (OOD) Detection (MLOD), to identify distributional shifts in test samples at different levels of features through a rigorous multiple testing procedure. Our approach distinguishes itself from existing methods as it does not require modifying the structure or fine-tuning of the pre-trained classifier. Through extensive experiments, we demonstrate that our proposed framework can seamlessly integrate with any existing distance-based inspection method while efficiently utilizing feature extractors of varying depths. Our scheme effectively enhances the performance of out-of-distribution detection when compared to baseline methods. In particular, MLOD-Fisher achieves superior performance in general. When trained using KNN on CIFAR10, MLOD-Fisher significantly lowers the false positive rate (FPR) from 24.09% to 7.47% on average compared to merely utilizing the features of the last layer.
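One way to picture layer-wise multiple testing is to turn each layer's detection score into a p-value and then combine the p-values. The sketch below uses empirical (conformal-style) p-values and Fisher's combination method, which the name MLOD-Fisher suggests; the exact procedure in the paper may differ, Fisher's method formally assumes independent p-values, and all function names are hypothetical.

```python
import math

def empirical_p_value(score, calibration_scores):
    """Fraction of in-distribution calibration scores at least as extreme
    as the test score (a larger score is assumed to mean more anomalous)."""
    n_extreme = sum(1 for s in calibration_scores if s >= score)
    return (n_extreme + 1) / (len(calibration_scores) + 1)

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(ln p) follows a chi-square law with 2k
    degrees of freedom under the null, assuming independent p-values."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # test statistic divided by 2
    # Closed-form chi-square survival function for an even number of d.o.f.
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )
```

A test sample would be flagged as out-of-distribution when the combined p-value falls below a chosen significance level.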
Nhan-Khanh Le, Erfaun Noorani, Sandra Hirche, John Baras
Real-world scenarios are characterized by timing uncertainties, e.g., delays, and disturbances. Algorithms with temporal robustness are crucial in guaranteeing the successful execution of tasks and missions in such scenarios. We study time-robust path planning for synthesizing robots' trajectories that adhere to spatial-temporal specifications expressed in Signal Temporal Logic (STL). In contrast to prior approaches that rely on discretized trajectories with fixed time steps, we leverage Piece-Wise Linear (PWL) signals for the synthesis. PWL signals represent a trajectory through a sequence of time-stamped waypoints. This allows us to encode the STL formula into a Mixed-Integer Linear Program (MILP) with fewer variables. This reduction is more pronounced for specifications with a long planning horizon. To that end, we define time-robustness for PWL signals. Subsequently, we propose quantitative semantics for PWL signals according to the recursive syntax of STL and prove their soundness. We then propose an encoding strategy to transform our semantics into a MILP. Our simulations showcase the soundness and the performance of our algorithm.
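A PWL signal can be pictured as a short list of time-stamped waypoints with linear interpolation in between. The sketch below only evaluates such a trajectory at a query time; it does not implement the STL semantics or the MILP encoding, and the function name and the 2-D layout are assumptions.

```python
import numpy as np

def pwl_position(waypoints, t):
    """Evaluate a 2-D piece-wise linear trajectory at time t.

    waypoints: sequence of (time, x, y) tuples with strictly increasing times.
    """
    wp = np.asarray(waypoints, dtype=float)
    x = np.interp(t, wp[:, 0], wp[:, 1])
    y = np.interp(t, wp[:, 0], wp[:, 2])
    return np.stack([x, y], axis=-1)

waypoints = [(0.0, 0.0, 0.0), (2.0, 2.0, 0.0), (5.0, 2.0, 3.0)]
position = pwl_position(waypoints, 3.5)
```

Here three waypoints describe a 5-second trajectory, whereas a fixed 0.1 s discretization of the same horizon would need 51 samples, which gives a rough sense of where the variable reduction comes from.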
Haokun Lin, Haoli Bai, Zhili Liu, Lu Hou, Muyi Sun, Linqi Song, Ying Wei, Zhenan Sun
Vision-language pre-trained models have achieved impressive performance on
various downstream tasks. However, their large model sizes hinder their
utilization on platforms with limited computational resources. We find that
directly using smaller pre-trained models and applying magnitude-based pruning
on CLIP models leads to inflexibility and inferior performance. Recent efforts
for VLP compression either adopt uni-modal compression metrics resulting in
limited performance or involve costly mask-search processes with learnable
masks. In this paper, we first propose the Module-wise Pruning Error (MoPE)
metric, accurately assessing CLIP module importance by performance decline on
cross-modal tasks. Using the MoPE metric, we introduce a unified pruning
framework applicable to both pre-training and task-specific fine-tuning
compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge
from the teacher model, significantly reducing pre-training costs while
maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning
from width to depth yields highly competitive task-specific models. Extensive
experiments in two stages demonstrate the effectiveness of the MoPE metric, and
MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
Authors' comments: 18 pages, 8 figures, Published in CVPR2024
Gabriel Toshio Hirokawa Higa, Rodrigo Stuqui Monzani, Jorge Fernando da Silva Cecatto, Maria Fernanda Balestieri Mariano de Souza, Vanessa Aparecida de Moraes Weber, Hemerson Pistori, Edson Takashi Matsubara
Smart indoor tourist attractions, such as smart museums and aquariums, usually require a significant investment in indoor localization devices. Smartphone Global Positioning System (GPS) receivers are unsuitable for scenarios where dense materials such as concrete and metal block or weaken the GPS signals, which is the most common scenario in an indoor tourist attraction. Deep learning makes it possible to perform region-wise indoor localization using smartphone images. This approach does not require any investment in infrastructure, reducing the cost and time to turn museums and aquariums into smart museums or smart aquariums. This paper proposes using deep learning algorithms to classify locations using smartphone camera images for indoor tourism attractions. We evaluate our proposal in a real-world scenario in Brazil. We extensively collect images from ten different smartphones to classify biome-themed fish tanks inside the Pantanal Biopark, creating a new dataset of 3654 images. We tested seven state-of-the-art neural networks, three being transformer-based, achieving precision around 90% on average and recall and F-score around 89% on average. The results indicate good feasibility of the proposal in most indoor tourist attractions.
Vinay Chakravarthi Gogineni, Esmaeil S. Nadimi
Machine unlearning has garnered significant attention due to its ability to
selectively erase knowledge obtained from specific training data samples in an
already trained machine learning model. This capability enables data holders to
adhere strictly to data protection regulations. However, existing unlearning
techniques face practical constraints, often causing performance degradation,
demanding brief fine-tuning post unlearning, and requiring significant storage.
In response, this paper introduces a novel class of machine unlearning
algorithms. The first method is partial amnesiac unlearning, which integrates
layer-wise pruning with amnesiac unlearning. In this method, updates made to
the model during training are pruned and stored, and subsequently used to forget
specific data from the trained model. The second method assimilates layer-wise
partial-updates into label-flipping and optimization-based unlearning to
mitigate the adverse effects of data deletion on model efficacy. Through a
detailed experimental evaluation, we showcase the effectiveness of the proposed
unlearning methods. Experimental results highlight that the partial amnesiac
unlearning not only preserves model efficacy but also eliminates the necessity
for brief post fine-tuning, unlike conventional amnesiac unlearning. Moreover,
employing layer-wise partial updates in label-flipping and optimization-based
unlearning techniques demonstrates superiority in preserving model efficacy
compared to their naive counterparts.
Authors' comments: 16 pages, 4 figures
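The first method can be pictured as follows: per-batch parameter updates are pruned layer by layer before being stored, and forgetting a batch amounts to subtracting its stored update from the trained weights. This sketch assumes magnitude-based pruning with a fixed keep ratio; the pruning criterion, the storage layout, and the function names are assumptions rather than the paper's specification.

```python
import numpy as np

def prune_update(update, keep_ratio):
    """Keep only the largest-magnitude entries of one layer's update."""
    flat = np.abs(update).ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    # Smallest magnitude among the k largest entries.
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return np.where(np.abs(update) >= threshold, update, 0.0)

def unlearn(weights, stored_updates):
    """Subtract the stored (pruned) updates of the batches to be forgotten.

    weights: list of per-layer arrays of the trained model.
    stored_updates: list of per-batch updates, each a list of per-layer arrays.
    """
    return [w - sum(u[i] for u in stored_updates) for i, w in enumerate(weights)]
```

Storing pruned rather than full updates is what reduces the storage cost mentioned in the abstract; ties at the threshold may keep slightly more than the requested fraction in this sketch.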
José A. Vélez-Marulanda
Let $\mathbf{k}$ be a field and let $V: \mathscr{C} \to \mathbf{k}\textup{-Mod}$ be a point-wise finite dimensional persistence module, where $\mathscr{C}$ is a small category. Assume that for all local Artinian $\mathbf{k}$-algebras $R$ with residue field isomorphic to $\mathbf{k}$, there is a generalized persistence module $M: \mathscr{C} \to R\textup{-Mod}$, such that for all $x\in \mathrm{Ob}(\mathscr{C})$, $M(x)$ is free over $R$ with finite rank and $\mathbf{k}\otimes_R M(x)\cong V(x)$. If $V$ is a direct sum of indecomposable persistence modules $V_I: \mathscr{C}\to \mathbf{k}\textup{-Mod}$ with endomorphism ring isomorphic to $\mathbf{k}$, then $M$ is a direct sum of indecomposables $M_I:\mathscr{C}\to R\textup{-Mod}$ with endomorphism ring isomorphic to $R$.
Haochen Shi, Zhiyuan Sun, Xingdi Yuan, Marc-Alexandre Côté, Bang Liu
Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these efforts, there exists a lack of a unified understanding regarding the impact of various components, ranging from visual perception to action execution, on task performance. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we innovate within this space by deploying a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.
Chenhui Deng, Zichao Yue, Cunxi Yu, Gokce Sarar, Ryan Carey, Rajeev Jain, Zhiru Zhang
While graph neural networks (GNNs) have gained popularity for learning
circuit representations in various electronic design automation (EDA) tasks,
they face challenges in scalability when applied to large graphs and exhibit
limited generalizability to new designs. These limitations make them less
practical for addressing large-scale, complex circuit problems. In this work we
propose HOGA, a novel attention-based model for learning circuit
representations in a scalable and generalizable manner. HOGA first computes
hop-wise features per node prior to model training. Subsequently, the hop-wise
features are solely used to produce node representations through a gated
self-attention module, which adaptively learns important features among
different hops without involving the graph topology. As a result, HOGA is
adaptive to various structures across different circuits and can be efficiently
trained in a distributed manner. To demonstrate the efficacy of HOGA, we
consider two representative EDA tasks: quality of results (QoR) prediction and
functional reasoning. Our experimental results indicate that (1) HOGA reduces
estimation error over conventional GNNs by 46.76% for predicting QoR after
logic synthesis; (2) HOGA improves reasoning accuracy by 10.0% over GNNs for
identifying functional blocks on unseen gate-level netlists after complex
technology mapping; (3) the training time for HOGA almost linearly decreases
with an increase in computing resources.
Authors' comments: Published as a conference paper at Design Automation Conference (DAC) 2024
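The hop-wise feature precomputation described for HOGA can be sketched as repeated multiplication of the node features by a normalized adjacency matrix, done once before training. The symmetric normalization and the undirected adjacency used here are assumptions for illustration (circuit graphs are typically directed), and `hop_features` is a hypothetical name.

```python
import numpy as np

def hop_features(adj, x, num_hops):
    """Stack X, A_hat X, A_hat^2 X, ... for each node.

    adj: (n, n) adjacency matrix; x: (n, d) node features.
    Returns an array of shape (n, num_hops + 1, d).
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    a_hat = inv_sqrt[:, None] * adj * inv_sqrt[None, :]  # D^-1/2 A D^-1/2
    feats = [np.asarray(x, dtype=float)]
    for _ in range(num_hops):
        feats.append(a_hat @ feats[-1])  # one more hop of propagation
    return np.stack(feats, axis=1)
```

After this step each node carries its own small sequence of hop features, so an attention module can process nodes independently of the graph topology, which is what makes distributed training straightforward.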
Qiao Han, Mingqian Li, Yao Yang, Yiteng Zhai
Block-wise missing data poses significant challenges in real-world data imputation tasks. Compared to scattered missing data, block-wise gaps exacerbate adverse effects on subsequent analytic and machine learning tasks, as the lack of local neighboring elements significantly reduces the interpolation capability and predictive power. However, this issue has not received adequate attention. Most SOTA matrix completion methods appear less effective, primarily due to overreliance on neighboring elements for predictions. We systematically analyze the issue and propose a novel matrix completion method, "BlockEcho", for a more comprehensive solution. This method integrates Matrix Factorization (MF) within Generative Adversarial Networks (GAN) to explicitly retain long-distance inter-element relationships in the original matrix. In addition, we incorporate an additional discriminator for the GAN, comparing the generator's intermediate progress with pre-trained MF results to constrain high-order feature distributions. Subsequently, we evaluate BlockEcho on public datasets across three domains. Results demonstrate superior performance over both traditional and SOTA methods when imputing block-wise missing data, especially at higher missing rates. The advantage also holds for scattered missing data at high missing rates. We also provide theoretical justification for the optimality and convergence of fusing MF and GAN for block-wise missing data.
Mohammad Sadil Khan, Elona Dupont, Sk Aziz Ali, Kseniya Cherenkova, Anis Kacem, Djamila Aouada
Reverse engineering in the realm of Computer-Aided Design (CAD) has been a longstanding aspiration, though not yet entirely realized. Its primary aim is to uncover the CAD process behind a physical object given its 3D scan. We propose CAD-SIGNet, an end-to-end trainable and auto-regressive architecture to recover the design history of a CAD model represented as a sequence of sketch-and-extrusion from an input point cloud. Our model learns visual-language representations by layer-wise cross-attention between point cloud and CAD language embedding. In particular, a new Sketch instance Guided Attention (SGA) module is proposed in order to reconstruct the fine-grained details of the sketches. Thanks to its auto-regressive nature, CAD-SIGNet not only reconstructs a unique full design history of the corresponding CAD model given an input point cloud but also provides multiple plausible design choices. This allows for an interactive reverse engineering scenario by providing designers with multiple next-step choices along with the design process. Extensive experiments on publicly available CAD datasets showcase the effectiveness of our approach against existing baseline models in two settings, namely, full design history recovery and conditional auto-completion from point clouds.
Tianjie Ju, Weiwei Sun, Wei Du, Xinwei Yuan, Zhaochun Ren, Gongshen Liu
Previous work has showcased the intriguing capability of large language
models (LLMs) in retrieving facts and processing context knowledge. However,
only limited research exists on the layer-wise capability of LLMs to encode
knowledge, which challenges our understanding of their internal mechanisms. In
this paper, we make a first attempt to investigate the layer-wise
capability of LLMs through probing tasks. We leverage the powerful generative
capability of ChatGPT to construct probing datasets, providing diverse and
coherent evidence corresponding to various facts. We employ $\mathcal V$-usable
information as the validation metric to better reflect the capability in
encoding context knowledge across different layers. Our experiments on
conflicting and newly acquired knowledge show that LLMs: (1) prefer to encode
more context knowledge in the upper layers; (2) primarily encode context
knowledge within knowledge-related entity tokens at lower layers while
progressively expanding more knowledge within other tokens at upper layers; and
(3) gradually forget the earlier context knowledge retained within the
intermediate layers when provided with irrelevant evidence. Code is publicly
available at https://github.com/Jometeorie/probing_llama.
Authors' comments: Accepted at LREC-COLING 2024 (Long Paper)
Jinxu Zhang
Understanding the contents of multimodal documents is essential to accurately extract relevant evidence and use it for reasoning. Existing document understanding models tend to generate answers with a single word or phrase directly, ignoring the source document's evidence and lacking interpretability. In this work, we address the lack of step-wise capabilities through data augmentation and extension. Specifically, we use Multi-modal Large Language Models (MLLMs), which have strong visual understanding and reasoning abilities, as data generators to generate step-wise question-and-answer pairs for document images and use a high-performance LLM as the error detector to filter out noisy data. This step-wise data generation pipeline is implemented using both template-based and few-shot methods. We then use the generated high-quality data to train a humanized document understanding and reasoning model, specifically designed to solve complex questions that require reasoning or multi-hop question answering, dubbed DocAssistant. Experimental results demonstrate the effectiveness and application value of step-wise generation, showing a 5 improvement on InfoVQA with complex layouts and a 7 improvement on ChartQA with complex reasoning, compared to directly generated answers. We hope our work highlights the potential of synthetic data and encourages further exploration of multi-modal document reasoning capabilities.