Haoran Li, Xingjian Li, Jiahua Shi, Huaming Chen, Bo Du, Daisuke Kihara, Johan Barthelemy, Jun Shen et al.
Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology facilitating
the study of macromolecular structures at near-atomic resolution. Recent
volumetric segmentation approaches on cryo-ET images have drawn widespread
interest in the biological sector. However, existing methods heavily rely on
manually labeled data, which requires highly professional skills, thereby
hindering the adoption of fully-supervised approaches for cryo-ET images. Some
unsupervised domain adaptation (UDA) approaches have been designed to enhance
the segmentation network performance using unlabeled data. However, applying
these methods directly to cryo-ET image segmentation tasks remains challenging
due to two main issues: 1) the source data, usually obtained through
simulation, contain a certain level of noise, while the target data, collected
directly as raw data from real-world scenarios, have unpredictable noise
levels; 2) the source data used for training typically consist of known
macromolecules, while the target domain data are often unknown, causing the
model's segmenter to be biased towards these known macromolecules, leading to a
domain shift problem. To address these challenges, in this work, we introduce
the first voxel-wise unsupervised domain adaptation approach, termed Vox-UDA,
specifically for cryo-ET subtomogram segmentation. Vox-UDA incorporates a noise
generation module to simulate target-like noises in the source dataset for
cross-noise level adaptation. Additionally, we propose a denoised
pseudo-labeling strategy based on an improved bilateral filter to alleviate the
domain shift problem. Experimental results on both simulated and real cryo-ET
subtomogram datasets demonstrate the superiority of our proposed approach
compared to state-of-the-art UDA methods.
Authors' comments: 11 pages
Ning Lin, Shaocong Wang, Yue Zhang, Yangu He, Kwunhang Wong, Arindam Basu, Dashan Shang, Xiaoming Chen et al.
Deep neural networks (DNNs), such as the widely-used GPT-3 with billions of
parameters, are often kept secret due to high training costs and privacy
concerns surrounding the data used to train them. Previous approaches to
securing DNNs typically require expensive circuit redesign, resulting in
additional overheads such as increased area, energy consumption, and latency.
To address these issues, we propose a novel hardware-software co-design
approach for DNN intellectual property (IP) protection that capitalizes on the
inherent aging characteristics of circuits and a novel differential orientation
fine-tuning (DOFT) to ensure effective protection. Hardware-wise, we employ
random aging to produce authorized chips. This process circumvents the need for
chip redesign, thereby eliminating any additional hardware overhead during the
inference procedure of DNNs. Moreover, the authorized chips demonstrate a
considerable disparity in DNN inference performance when compared to
unauthorized chips. Software-wise, we propose a novel DOFT, which allows
pre-trained DNNs to maintain their original accuracy on authorized chips with
minimal fine-tuning, while the model's performance on unauthorized chips is
reduced to random guessing. Extensive experiments on various models, including
MLP, VGG, ResNet, Mixer, and SwinTransformer, with lightweight binary and
practical multi-bit weights demonstrate that the proposed method achieves
effective IP protection, with only 10% accuracy on unauthorized chips, while
preserving nearly the original accuracy on authorized ones.
Authors' comments: Design Automation Conference 2024
Xiaoxiong Zhang, Zhiwei Zeng, Xin Zhou, Dusit Niyato, Zhiqi Shen
Federated Knowledge Graphs Embedding learning (FKGE) encounters challenges in communication efficiency stemming from the considerable size of parameters and the large number of communication rounds. However, existing FKGE methods focus only on reducing communication rounds by conducting multiple rounds of local training in each communication round, and ignore reducing the size of the parameters transmitted within each communication round. To tackle the problem, we first find that a universal reduction in embedding precision across all entities during compression can significantly impede convergence speed, underscoring the importance of maintaining embedding precision. We then propose bidirectional communication-efficient FedS, based on an Entity-Wise Top-K Sparsification strategy. During upload, clients dynamically identify and upload to the server only the Top-K entity embeddings with the greatest changes. During download, the server first performs personalized embedding aggregation for each client. It then identifies and transmits the Top-K aggregated embeddings to each client. In addition, FedS uses an Intermittent Synchronization Mechanism to mitigate the negative effect of embedding inconsistency among shared entities of clients caused by the heterogeneity of the Federated Knowledge Graph. Extensive experiments across three datasets show that FedS significantly enhances communication efficiency with negligible (or even no) performance degradation.
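As a rough illustration of the upload step described in this abstract, the sketch below keeps only the K entities whose embeddings changed most since the previous round. The function name `top_k_changed`, the dict-of-lists layout, and the use of Euclidean distance as the change measure are assumptions made for illustration; the abstract does not specify them, and this is not the FedS implementation.

```python
import math

def top_k_changed(prev, curr, k):
    """Return {entity: embedding} for the k entities with the largest change.

    prev, curr: dicts mapping entity id -> embedding (list of floats).
    Change is measured by Euclidean distance (an assumption).
    """
    def change(entity):
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(curr[entity], prev[entity]))
        )

    ranked = sorted(curr, key=change, reverse=True)
    return {entity: curr[entity] for entity in ranked[:k]}

# Hypothetical embeddings before and after one round of local training
prev = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [2.0, 2.0]}
curr = {"a": [0.1, 0.0], "b": [1.0, 3.0], "c": [2.5, 2.0]}
upload = top_k_changed(prev, curr, k=2)  # only "b" and "c" are sent
```

Sending K embeddings instead of all of them shrinks each message, which is the parameter-size reduction the abstract contrasts with methods that only reduce the number of rounds.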
Kosuke Doi, Yuka Ko, Mana Makinae, Katsuhito Sudoh, Satoshi Nakamura
This paper analyzes the features of monotonic translations, which follow the
word order of the source language, in simultaneous interpreting (SI). Word
order differences are one of the biggest challenges in SI, especially for
language pairs with significant structural differences like English and
Japanese. We analyzed the characteristics of chunk-wise monotonic translation
(CMT) sentences using the NAIST English-to-Japanese Chunk-wise Monotonic
Translation Evaluation Dataset and identified some grammatical structures that
make monotonic translation difficult in English-Japanese SI. We further
investigated the features of CMT sentences by evaluating the output from the
existing speech translation (ST) and simultaneous speech translation (simulST)
models on the NAIST English-to-Japanese Chunk-wise Monotonic Translation
Evaluation Dataset as well as on existing test sets. The results indicate the
possibility that the existing SI-based test set underestimates the model
performance. The results also suggest that using CMT sentences as references
gives higher scores to simulST models than ST models, and that using an
offline-based test set to evaluate the simulST models underestimates the model
performance.
Authors' comments: Accepted to IWSLT2024
Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Qiao Liang, Min-Jian Zhao, Jieping Ye
The field of 3D object detection from point clouds is rapidly advancing in
computer vision, aiming to accurately and efficiently detect and localize
objects in three-dimensional space. Current 3D detectors commonly fall short in
terms of flexibility and scalability, with ample room for advancements in
performance. In this paper, our objective is to address these limitations by
introducing two frameworks for 3D object detection with minimal hand-crafted
design. Firstly, we propose CT3D, which sequentially performs raw-point-based
embedding, a standard Transformer encoder, and a channel-wise decoder for point
features within each proposal. Secondly, we present an enhanced network called
CT3D++, which incorporates geometric and semantic fusion-based embedding to
extract more valuable and comprehensive proposal-aware information.
Additionally, CT3D++ utilizes a point-to-key bidirectional encoder for more
efficient feature encoding with reduced computational cost. By replacing the
corresponding components of CT3D with these novel modules, CT3D++ achieves
state-of-the-art performance on both the KITTI dataset and the large-scale
Waymo Open Dataset. The source code for our frameworks will be made
accessible at https://github.com/hlsheng1/CT3D-plusplus.
Authors' comments: 19 pages, 8 figures
Yuanjie Shi, Subhankar Ghosh, Taha Belkhouja, Janardhan Rao Doppa, Yan Yan
Conformal prediction (CP) is an emerging uncertainty quantification framework that allows us to construct a prediction set to cover the true label with a pre-specified marginal or conditional probability. Although the valid coverage guarantee has been extensively studied for classification problems, CP often produces large prediction sets which may not be practically useful. This issue is exacerbated for the setting of class-conditional coverage on classification tasks with many and/or imbalanced classes. This paper proposes the Rank Calibrated Class-conditional CP (RC3P) algorithm to reduce the prediction set sizes while achieving class-conditional coverage, where the valid coverage holds for each class. In contrast to the standard class-conditional CP (CCP) method that uniformly thresholds the class-wise conformity score for each class, the augmented label rank calibration step allows RC3P to selectively iterate this class-wise thresholding subroutine only for a subset of classes whose class-wise top-k error is small. We prove that, agnostic to the classifier and data distribution, RC3P achieves class-wise coverage. We also show that RC3P reduces the size of prediction sets compared to the CCP method. Comprehensive experiments on multiple real-world datasets demonstrate that RC3P achieves class-wise coverage and a 26.25% reduction in prediction set sizes on average.
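To make the class-wise thresholding baseline concrete, here is a minimal sketch of standard class-conditional conformal prediction (the CCP baseline, not RC3P itself). It assumes the nonconformity score is one minus the predicted probability of the label, and it uses the usual finite-sample quantile rank; the function names and toy data are hypothetical.

```python
import math

def class_thresholds(cal_probs, cal_labels, alpha):
    """Per-class quantile of the score 1 - p(true label) on calibration data."""
    scores = {}
    for probs, y in zip(cal_probs, cal_labels):
        scores.setdefault(y, []).append(1.0 - probs[y])
    thresholds = {}
    for y, s in scores.items():
        s.sort()
        n = len(s)
        # Finite-sample corrected rank; beyond n the threshold is infinite
        rank = math.ceil((n + 1) * (1 - alpha))
        thresholds[y] = s[rank - 1] if rank <= n else float("inf")
    return thresholds

def prediction_set(probs, thresholds):
    """All classes whose score for this input is within the class threshold."""
    return [y for y, q in sorted(thresholds.items()) if 1.0 - probs[y] <= q]

# Toy calibration set: three examples per class, two classes
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4],
             [0.3, 0.7], [0.5, 0.5], [0.25, 0.75]]
cal_labels = [0, 0, 0, 1, 1, 1]
th = class_thresholds(cal_probs, cal_labels, alpha=0.25)
```

Because each class gets its own threshold, rare or hard classes can end up with loose thresholds and large prediction sets, which is the inefficiency the abstract sets out to reduce.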
Peiyu Liang, Hongchang Gao, Xubin He
While Multi-view Graph Neural Networks (MVGNNs) excel at leveraging diverse modalities for learning object representations, existing methods assume identical local topology structures across modalities, overlooking real-world discrepancies. As a result, MVGNNs struggle with modality fusion and representation denoising. To address these issues, we propose adaptive modality-wise structure learning (AMoSL). AMoSL captures node correspondences between modalities via optimal transport and learns them jointly with the graph embedding. To enable end-to-end training, we employ an efficient solution for the resulting complex bilevel optimization problem. Furthermore, AMoSL adapts to downstream tasks through unsupervised learning on inter-modality distances. The effectiveness of AMoSL is demonstrated by its ability to train more accurate graph classifiers on six benchmark datasets.
Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
The Audio-Visual Video Parsing task aims to identify and temporally localize
the events that occur in either or both the audio and visual streams of audible
videos. It is often performed in a weakly supervised manner, where only video event
labels are provided, i.e., the modalities and the timestamps of the labels are
unknown. Due to the lack of densely annotated labels, recent work attempts to
leverage pseudo labels to enrich the supervision. A commonly used strategy is
to generate pseudo labels by categorizing the known video event labels for each
modality. However, the labels are still confined to the video level, and the
temporal boundaries of events remain unlabeled. In this paper, we propose a new
pseudo label generation strategy that can explicitly assign labels to each
video segment by utilizing prior knowledge learned from the open world.
Specifically, we exploit the large-scale pretrained models, namely CLIP and
CLAP, to estimate the events in each video segment and generate segment-level
visual and audio pseudo labels, respectively. We then propose a new loss
function to exploit these pseudo labels by taking into account their
category-richness and segment-richness. A label denoising strategy is also
adopted to further improve the visual pseudo labels by flipping them whenever
abnormally large forward losses occur. We perform extensive experiments on the
LLP dataset and demonstrate the effectiveness of each proposed design, achieving
state-of-the-art video parsing performance on all types of event
parsing, i.e., audio event, visual event, and audio-visual event. We also
examine the proposed pseudo label generation strategy on a relevant
weakly-supervised audio-visual event localization task and the experimental
results again verify the benefits and generalization of our method.
Authors' comments: IJCV 2024 Accepted. arXiv admin note: substantial text overlap with
arXiv:2303.02344
Qi Zhang, Yunfei Gong, Daijie Chen, Antoni B. Chan, Hui Huang
Recent deep learning-based multi-view people detection (MVD) methods have
shown promising results on existing datasets. However, current methods are
mainly trained and evaluated on small, single scenes with a limited number of
multi-view frames and fixed camera views. As a result, these methods may not be
practical for detecting people in larger, more complex scenes with severe
occlusions and camera calibration errors. This paper focuses on improving
multi-view people detection by developing a supervised view-wise contribution
weighting approach that better fuses multi-camera information under large
scenes. Besides, a large synthetic dataset is adopted to enhance the model's
generalization ability and enable more practical evaluation and comparison. The
model's performance on new testing scenes is further improved with a simple
domain adaptation technique. Experimental results demonstrate the effectiveness
of our approach in achieving promising cross-scene multi-view people detection
performance. See code here: https://vcc.tech/research/2024/MVD.
Authors' comments: AAAI 2024
Zheng Tracy Ke, Jingming Wang
Topic modeling is a widely utilized tool in text analysis. We investigate the
optimal rate for estimating a topic model. Specifically, we consider a scenario
with $n$ documents, a vocabulary of size $p$, and document lengths of order
$N$. When $N\geq c\cdot p$, referred to as the long-document case, the optimal
rate is established in the literature at $\sqrt{p/(Nn)}$. However, when
$N=o(p)$, referred to as the short-document case, the optimal rate remains
unknown. In this paper, we first provide new entry-wise large-deviation bounds
for the empirical singular vectors of a topic model. We then apply these bounds
to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by
comparing the improved error rate with the minimax lower bound, we conclude
that the optimal rate is still $\sqrt{p/(Nn)}$ in the short-document case.
Authors' comments: 50 pages
Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen et al.
The length generalization failure problem, namely that a large language model (LLM)
fails to generalize to texts longer than its maximum training length, greatly
restricts the application of LLMs in scenarios with streaming long inputs.
To address this problem, the existing methods either require substantial costs
or introduce precision loss. In this paper, we empirically find that the
accuracy of the LLM's prediction is highly correlated to its certainty. Based
on this, we propose an efficient training-free framework, named XL3M (short for
extra-long large language model), which enables LLMs trained on short
sequences to reason over extremely long sequences without any further training or
fine-tuning. Under the XL3M framework, the input context is first
decomposed into multiple short sub-contexts, where each sub-context contains an
independent segment and a common "question", which is a few tokens from the
end of the original context. Then XL3M gives a method to measure the relevance
between each segment and the "question", and constructs a concise key context
by splicing all the relevant segments in chronological order. The key context
is further used instead of the original context to complete the inference task.
Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our
framework, a Llama2-7B model is able to reason over 20M-long sequences on an 8-card
Huawei Ascend 910B NPU machine with 64GB memory per card.
Authors' comments: 11 pages, 5 figures
Oleksii Furman, Patryk Wielopolski, Łukasz Lenkiewicz, Jerzy Stefanowski, Maciej Zięba
The growing complexity of AI systems has intensified the need for transparency through Explainable AI (XAI). Counterfactual explanations (CFs) offer actionable "what-if" scenarios on three levels: Local CFs providing instance-specific insights, Global CFs addressing broader trends, and Group-wise CFs (GWCFs) striking a balance and revealing patterns within cohesive groups. Despite the availability of methods for each granularity level, the field lacks a unified method that integrates these complementary approaches. We address this limitation by proposing a gradient-based optimization method for differentiable models that generates Local, Global, and Group-wise Counterfactual Explanations in a unified manner. In particular, we enhance GWCF generation by combining instance grouping and counterfactual generation into a single efficient process, replacing traditional two-step methods. Moreover, to ensure trustworthiness, we introduce plausibility criteria into the GWCF domain, making explanations both valid and realistic. Our results demonstrate the method's effectiveness in balancing validity, proximity, and plausibility while optimizing group granularity, with its utility validated through practical use cases.
Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang et al.
Large language models (LLMs) need knowledge updates to keep up with ever-growing
world facts and to correct hallucinated responses, which motivates methods for
lifelong model editing. Where the updated knowledge resides in memory is a
fundamental question for model editing. In this paper, we find that editing
either long-term memory (direct model parameters) or working memory
(non-parametric knowledge of neural network activations/representations by
retrieval) will result in an impossible triangle -- reliability,
generalization, and locality cannot be realized together in the lifelong
editing settings. For long-term memory, directly editing the parameters will
cause conflicts with irrelevant pretrained knowledge or previous edits (poor
reliability and locality). For working memory, retrieval-based activations can
hardly make the model understand the edits and generalize (poor
generalization). Therefore, we propose WISE to bridge the gap between memories.
In WISE, we design a dual parametric memory scheme, which consists of the main
memory for the pretrained knowledge and a side memory for the edited knowledge.
We only edit the knowledge in the side memory and train a router to decide
which memory to go through when given a query. For continual editing, we devise
a knowledge-sharding mechanism where different sets of edits reside in distinct
subspaces of parameters, and are subsequently merged into a shared memory
without conflicts. Extensive experiments show that WISE can outperform previous
model editing methods and overcome the impossible triangle under lifelong model
editing of question answering, hallucination, and out-of-distribution settings
across trending LLM architectures, e.g., GPT, LLaMA, and Mistral. Code is
available at https://github.com/zjunlp/EasyEdit.
Authors' comments: NeurIPS 2024
Bart Jacobs
In probabilistic updating one transforms a prior distribution in the light of given evidence into a posterior distribution, via what is called conditioning, updating, belief revision or inference. This is the essence of learning, as Bayesian updating. It will be illustrated via a physical model involving (adapted) water flows through pipes with different diameters. Bayesian updating makes us wiser, in the sense that the posterior distribution makes the evidence more likely than the prior, since it incorporates the evidence. Things are less clear when one wishes to learn from multiple pieces of evidence / data. It turns out that there are (at least) two forms of updating for this, associated with Jeffrey and Pearl. The difference is not always clearly recognised. This paper provides an introduction and an overview in the setting of discrete probability theory. It starts from an elementary question, involving multiple pieces of evidence, that has been sent to a small group of academic specialists. Their answers show considerable differences. This is used as motivation and starting point to introduce the two forms of updating, of Jeffrey and Pearl, for multiple inputs and to elaborate their properties. In the end the account is related to the so-called variational free energy (VFE) update in the cognitive theory of predictive processing. It is shown that both Jeffrey and Pearl outperform VFE updating and that VFE updating need not decrease divergence (that is, correct errors) as it is supposed to do.
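The claim that the posterior makes the evidence more likely than the prior can be checked on a small discrete example. The sketch below covers only ordinary Bayesian updating with a single piece of evidence; it does not implement Jeffrey's or Pearl's rules for multiple inputs. The names `validity` and `update` and the toy numbers are assumptions chosen for illustration.

```python
from fractions import Fraction

def validity(dist, evidence):
    """Expected value of the evidence (a likelihood in [0, 1]) under dist."""
    return sum(dist[x] * evidence[x] for x in dist)

def update(prior, evidence):
    """Bayesian update: posterior(x) is proportional to prior(x) * evidence(x)."""
    z = validity(prior, evidence)
    return {x: prior[x] * evidence[x] / z for x in prior}

# Hypothetical prior and likelihood of a positive test in each state
prior = {"sick": Fraction(1, 10), "healthy": Fraction(9, 10)}
positive_test = {"sick": Fraction(9, 10), "healthy": Fraction(1, 20)}

posterior = update(prior, positive_test)
before = validity(prior, positive_test)      # 27/200
after = validity(posterior, positive_test)   # 37/60
```

Here the evidence has validity 27/200 under the prior and 37/60 under the posterior, so updating has made the observed evidence more likely, as the abstract states.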
Chenchen Liu, Wenjun Jiang, Xiaojun Yuan
In this paper, we propose a learning-based block-wise planar channel estimator (LBPCE) with high accuracy and low complexity to estimate the time-varying frequency-selective channel of a multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) system. First, we establish a block-wise planar channel model (BPCM) to characterize the correlation of the channel across subcarriers and OFDM symbols. Specifically, adjacent subcarriers and OFDM symbols are divided into several sub-blocks, and an affine function (i.e., a plane) with only three variables (namely, mean, time-domain slope, and frequency-domain slope) is used to approximate the channel in each sub-block, which significantly reduces the number of variables to be determined in channel estimation. Second, we design a 3D dilated residual convolutional network (3D-DRCN) that leverages the time-frequency-space-domain correlations of the channel to further improve the channel estimates of each user. Numerical results demonstrate that the proposed LBPCE significantly outperforms the state-of-the-art estimators and maintains a relatively low computational complexity.
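The three-variable planar approximation can be sketched as a least-squares fit over one sub-block. This is only an illustration of fitting a plane to known channel samples, not the estimator proposed in the abstract; the function name `fit_plane`, the centered coordinates, and the toy block are assumptions. With centered coordinates on a full rectangular grid the three regressors are orthogonal, so each coefficient has a closed form.

```python
def fit_plane(block):
    """Least-squares fit of block[t][f] ~ mean + st*(t - tc) + sf*(f - fc).

    block: 2D list (OFDM symbols x subcarriers) of real or complex samples,
    with at least two rows and two columns.
    Returns (mean, time-domain slope, frequency-domain slope).
    """
    n_t, n_f = len(block), len(block[0])
    tc, fc = (n_t - 1) / 2, (n_f - 1) / 2  # block centre
    mean = sum(sum(row) for row in block) / (n_t * n_f)
    num_t = sum((t - tc) * block[t][f] for t in range(n_t) for f in range(n_f))
    num_f = sum((f - fc) * block[t][f] for t in range(n_t) for f in range(n_f))
    den_t = n_f * sum((t - tc) ** 2 for t in range(n_t))
    den_f = n_t * sum((f - fc) ** 2 for f in range(n_f))
    return mean, num_t / den_t, num_f / den_f

# Toy 4x3 sub-block generated from an exact plane with complex coefficients
m, st, sf = 1 + 2j, 0.5, -0.25j
block = [[m + st * (t - 1.5) + sf * (f - 1.0) for f in range(3)]
         for t in range(4)]
est_mean, est_st, est_sf = fit_plane(block)
```

A 4x3 sub-block has 12 channel coefficients but only three plane parameters, which is the reduction in unknowns the abstract refers to.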
Ollie Ballinger
In the context of recent, highly destructive conflicts in Gaza and Ukraine, reliable estimates of building damage are essential for an informed public discourse, human rights monitoring, and humanitarian aid provision. Given the contentious nature of conflict damage assessment, these estimates must be fully reproducible, explainable, and derived from open access data. This paper introduces a new method for building damage detection, the Pixel-Wise T-Test (PWTT), that satisfies these conditions. Using a combination of freely-available synthetic aperture radar imagery and statistical change detection, the PWTT generates accurate conflict damage estimates across a wide area at regular time intervals. Accuracy is assessed using an original dataset of over half a million labeled building footprints spanning 12 cities across Ukraine, Palestine, Syria, and Iraq. Despite being simple and lightweight, the algorithm achieves building-level accuracy statistics (AUC=0.88 across Ukraine, 0.81 in Gaza) rivalling state of the art methods that use deep learning and high resolution imagery. The workflow is open source and deployed entirely within the Google Earth Engine environment, allowing for the generation of interactive Battle Damage Dashboards for Ukraine and Gaza that update in near-real time, so that the public and humanitarian practitioners can immediately get estimates of damaged buildings in a given area.
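The idea of a per-pixel statistical change test can be sketched as follows. This computes a Welch t-statistic for each pixel between a stack of pre-event images and a stack of post-event images; it is an assumption-laden illustration and not the PWTT workflow itself, whose preprocessing and exact statistic are not given in the abstract. The function name `pixelwise_t` and the toy data are hypothetical.

```python
import math
from statistics import mean, variance

def pixelwise_t(pre, post):
    """Welch t-statistic per pixel between two image stacks.

    pre, post: lists of 2D images (lists of rows) with identical shape,
    each stack holding at least two images.
    """
    rows, cols = len(pre[0]), len(pre[0][0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            a = [img[r][c] for img in pre]
            b = [img[r][c] for img in post]
            se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
            # A zero standard error means no variation at all; report 0
            out_row.append((mean(b) - mean(a)) / se if se > 0 else 0.0)
        result.append(out_row)
    return result

# Toy 1x2 scene: the first pixel is unchanged, the second shifts by 6
pre = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
post = [[[1.0, 7.0]], [[2.0, 8.0]], [[3.0, 9.0]]]
t = pixelwise_t(pre, post)
```

Pixels with a large absolute t-statistic are candidates for change; how such values are thresholded and aggregated to building footprints is outside this sketch.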
Lucas Gretta, William He, Angelos Pelecanos
We prove that the permutation computed by a reversible circuit with
$\tilde{O}(nk\cdot \log(1/\varepsilon))$ random $3$-bit gates is
$\varepsilon$-approximately $k$-wise independent. Our bound improves on
currently known bounds in the regime when the approximation error $\varepsilon$
is not too small. We obtain our results by analyzing the log-Sobolev constants
of appropriate Markov chains rather than their spectral gaps.
Authors' comments: 19 pages
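For readers unfamiliar with the object being analyzed, the sketch below builds a reversible circuit from random 3-bit gates and evaluates the permutation it computes on n-bit strings. It assumes each gate is a uniformly random permutation of the 8 values on three distinct, randomly chosen wires; the abstract does not spell out the gate distribution, and the function names are hypothetical. The sketch does not test k-wise independence.

```python
import random

def random_gate(n, rng):
    """Pick 3 distinct wires and a random permutation of the 8 three-bit values."""
    wires = rng.sample(range(n), 3)
    table = list(range(8))
    rng.shuffle(table)
    return wires, table

def apply_gate(x, gate):
    """Apply one 3-bit gate to the n-bit integer x."""
    wires, table = gate
    # Pack the three selected bits into a 3-bit index
    idx = sum(((x >> w) & 1) << i for i, w in enumerate(wires))
    out = table[idx]
    for i, w in enumerate(wires):
        x &= ~(1 << w)               # clear wire w
        x |= ((out >> i) & 1) << w   # write the permuted bit
    return x

def run_circuit(x, gates):
    for gate in gates:
        x = apply_gate(x, gate)
    return x

rng = random.Random(0)
n = 5
gates = [random_gate(n, rng) for _ in range(20)]
images = [run_circuit(x, gates) for x in range(2 ** n)]
```

Since every gate is a bijection on n-bit strings, the composed circuit is too, so `images` is a rearrangement of all 32 inputs.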
Kumar Shubham, Aishwarya Jayagopal, Syed Mohammed Danish, Prathosh AP, Vaibhav Rajan
Cancer, a leading cause of death globally, occurs due to genomic changes and
manifests heterogeneously across patients. To advance research on personalized
treatment strategies, the effectiveness of various drugs on cells derived from
cancers (`cell lines') is experimentally determined in laboratory settings.
Nevertheless, variations in the distribution of genomic data and drug responses
between cell lines and humans arise due to biological and environmental
differences. Moreover, while genomic profiles of many cancer patients are
readily available, the scarcity of corresponding drug response data limits the
ability to train machine learning models that can predict drug response in
patients effectively. Recent cancer drug response prediction methods have
largely followed the paradigm of unsupervised domain-invariant representation
learning followed by a downstream drug response classification step.
Introducing supervision in both stages is challenging due to heterogeneous
patient response to drugs and limited drug response data. This paper addresses
these challenges through a novel representation learning method in the first
phase and weak supervision in the second. Experimental results on real patient
data demonstrate the efficacy of our method (WISER) over state-of-the-art
alternatives on predicting personalized drug response.
Authors' comments: ICML 2024
Matías Suazo, Erik Zackrisson, Priyatam K. Mahto, Fabian Lundell, Carl Nettelblad, Andreas J. Korn, Jason T. Wright, Suman Majumdar
The search for extraterrestrial intelligence is currently being pursued using
multiple techniques and in different wavelength bands. Dyson spheres,
megastructures that could be constructed by advanced civilizations to harness
the radiation energy of their host stars, represent a potential
technosignature that in principle may be hiding in public data already
collected as part of large astronomical surveys. In this study, we present a
comprehensive search for partial Dyson spheres by analyzing optical and
infrared observations from Gaia, 2MASS, and WISE. We develop a pipeline that
employs multiple filters to identify potential candidates and reject
interlopers in a sample of five million objects, and that incorporates a
convolutional neural network to help identify confusion in WISE data. Finally,
the pipeline identifies 7 candidates deserving of further analysis. All of
these objects are M-dwarfs, for which astrophysical phenomena cannot easily
account for the observed infrared excess emission.
Authors' comments: Accepted to be published in MNRAS
Ziyi Yin, Rafael Orozco, Felix J. Herrmann
We present a semi-amortized variational inference framework designed for computationally feasible uncertainty quantification in 2D full-waveform inversion to explore the multimodal posterior distribution without dimensionality reduction. The framework is called WISER, short for full-Waveform variational Inference via Subsurface Extensions with Refinements. WISER leverages the power of generative artificial intelligence to perform approximate amortized inference that is low-cost albeit showing an amortization gap. This gap is closed through non-amortized refinements that make frugal use of acoustic wave physics. Case studies illustrate that WISER is capable of full-resolution, computationally feasible, and reliable uncertainty estimates of velocity models and imaged reflectivities.