Beatrice Alessandra Motetti, Matteo Risso, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari
The resource requirements of deep neural networks (DNNs) pose significant
challenges to their deployment on edge devices. Common approaches to address
this issue are pruning and mixed-precision quantization, which lead to latency
and memory occupation improvements. These optimization techniques are usually
applied independently. We propose a novel methodology to apply them jointly via
a lightweight gradient-based search, and in a hardware-aware manner, greatly
reducing the time required to generate Pareto-optimal DNNs in terms of accuracy
versus cost (i.e., latency or memory). We test our approach on three
edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny
ImageNet. When targeting the optimization of the memory footprint, we are able
to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the
baseline networks with all weights quantized at 8 and 2-bit, respectively. Our
method surpasses a previous state-of-the-art approach with up to 56.17% size
reduction at iso-accuracy. With respect to the sequential application of
state-of-the-art pruning and mixed-precision optimizations, we obtain
comparable or superior results, but with a significantly lowered training time.
In addition, we show how well-tailored cost models can improve the cost versus
accuracy trade-offs when targeting specific hardware for deployment.
Authors' comments: Accepted for publication in IEEE Transactions on Computers
Jingheng Ye, Shang Qin, Yinghui Li, Xuxin Cheng, Libo Qin, Hai-Tao Zheng, Ying Shen, Peng Xing et al.
Existing studies explore the explainability of Grammatical Error Correction
(GEC) in a limited scenario, where they ignore the interaction between
corrections and explanations and have not established a corresponding
comprehensive benchmark. To bridge the gap, this paper first introduces the
task of EXplainable GEC (EXGEC), which focuses on the integral role of
correction and explanation tasks. To facilitate the task, we propose EXCGEC, a
tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented
samples featuring the design of hybrid edit-wise explanations. We then
benchmark several series of LLMs in multi-task learning settings, including
post-explaining and pre-explaining. To promote the development of the task, we
also build a comprehensive evaluation suite by leveraging existing automatic
metrics and conducting human evaluation experiments to demonstrate the human
consistency of the automatic metrics for free-text explanations. Our
experiments reveal the effectiveness of evaluating free-text explanations using
traditional metrics like METEOR and ROUGE, and the inferior performance of
multi-task models compared to the pipeline solution, indicating that it is
challenging to establish positive effects when learning both tasks.
Authors' comments: Accepted to AAAI 2025. 19 pages with an appendix, 10 tables, and 9
figures
Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia
Mathematical reasoning presents a significant challenge for Large Language
Models (LLMs) due to the extensive and precise chain of reasoning required for
accuracy. Ensuring the correctness of each reasoning step is critical. To
address this, we aim to enhance the robustness and factuality of LLMs by
learning from human feedback. However, Direct Preference Optimization (DPO) has
shown limited benefits for long-chain mathematical reasoning, as models
employing DPO struggle to identify detailed errors in incorrect answers. This
limitation stems from a lack of fine-grained process supervision. We propose a
simple, effective, and data-efficient method called Step-DPO, which treats
individual reasoning steps as units for preference optimization rather than
evaluating answers holistically. Additionally, we have developed a data
construction pipeline for Step-DPO, enabling the creation of a high-quality
dataset containing 10K step-wise preference pairs. We also observe that in DPO,
self-generated data is more effective than data generated by humans or GPT-4,
due to the latter's out-of-distribution nature. Our findings demonstrate that
as few as 10K preference data pairs and fewer than 500 Step-DPO training steps
can yield a nearly 3% gain in accuracy on MATH for models with over 70B
parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves
scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively,
surpassing a series of closed-source models, including GPT-4-1106,
Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at
https://github.com/dvlab-research/Step-DPO.
Authors' comments: Code, data, and models are available at
https://github.com/dvlab-research/Step-DPO
Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu
We present a simple variable quantization approach that quantizes different
layers of a large language model (LLM) at different bit levels. Specifically,
we quantize the most important layers to higher bit precision and less
important layers to lower bits to achieve floating point quantization levels.
We propose two effective strategies to measure the importance of layers within
LLMs: the first measures the importance of a layer based on how different its
output embeddings are from the input embeddings (the higher the better); the
second estimates the importance of a layer using the number of layer weights
that are much larger than average (the smaller the better). We show that
quantizing different layers at varying bits according to our importance scores
results in minimal performance drop with a far more compressed model size.
Finally, we present several practical key takeaways from our variable
layer-wise quantization experiments: (a) LLM performance under variable
quantization remains close to the original model until 25-50% of layers are
moved to lower quantization using our proposed ordering but only until 5-10% if
moved using no specific ordering; (b) Quantizing LLMs to lower bits performs
substantially better than pruning unless extreme quantization (2-bit) is used;
and (c) Layer-wise quantization to lower bits works better in the case of
larger LLMs with more layers compared to smaller LLMs with fewer layers. The
code used to run the experiments is available at:
https://github.com/RazvanDu/LayerwiseQuant.
Authors' comments: submitted to EMNLP, 15 pages, 10 figures, 4 tables
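The first importance strategy described above (layers whose output embeddings differ more from their input embeddings are treated as more important) can be sketched as follows. This is a minimal illustration, not the authors' code: the use of cosine distance as the "difference", the helper names `layer_importance` and `assign_bits`, and the 8-bit/4-bit levels are all assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_importance(layer_inputs, layer_outputs):
    """Assumed score: 1 - mean cosine similarity between the input and output
    embeddings of one layer. A higher score means the layer changes its input more."""
    sims = [cosine_similarity(x, y) for x, y in zip(layer_inputs, layer_outputs)]
    return 1.0 - sum(sims) / len(sims)

def assign_bits(scores, fraction_low, high_bits=8, low_bits=4):
    """Move the least important fraction of layers to low_bits (hypothetical levels)."""
    n_low = int(round(fraction_low * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # least important first
    low = set(order[:n_low])
    return [low_bits if i in low else high_bits for i in range(len(scores))]

# Toy importance scores for a four-layer model; half of the layers go to lower precision.
scores = [0.30, 0.02, 0.10, 0.01]
bits = assign_bits(scores, fraction_low=0.5)
print(bits)
```

With these toy scores, the two layers that change their inputs least are the ones moved to the lower bit width, while the remaining layers keep the higher precision.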
Måns Williamson, Monika Eisenmann, Tony Stillfjord
Choosing the optimization algorithm that performs best on a given machine learning problem is often delicate, and there is no guarantee that current state-of-the-art algorithms will perform well across all tasks. Consequently, the more reliable methods that one has at hand, the larger the likelihood of a good end result. To this end, we introduce and analyze a large class of stochastic so-called soft-clipping schemes with a broad range of applications. Despite the wide adoption of clipping techniques in practice, soft-clipping methods have not been analyzed to a large extent in the literature. In particular, a rigorous mathematical analysis is lacking in the general, nonlinear case. Our analysis lays a theoretical foundation for a large class of such schemes, and motivates their usage. In particular, under standard assumptions such as Lipschitz continuous gradients of the objective function, we give rigorous proofs of convergence in expectation. These include rates in both the convex and the non-convex case, as well as almost sure convergence to a stationary point in the non-convex case. The computational cost of the analyzed schemes is essentially the same as that of stochastic gradient descent.
Haoran Li, Xingjian Li, Jiahua Shi, Huaming Chen, Bo Du, Daisuke Kihara, Johan Barthelemy, Jun Shen et al.
Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology facilitating
the study of macromolecular structures at near-atomic resolution. Recent
volumetric segmentation approaches on cryo-ET images have drawn widespread
interest in the biological sector. However, existing methods heavily rely on
manually labeled data, which requires highly professional skills, thereby
hindering the adoption of fully-supervised approaches for cryo-ET images. Some
unsupervised domain adaptation (UDA) approaches have been designed to enhance
the segmentation network performance using unlabeled data. However, applying
these methods directly to cryo-ET image segmentation tasks remains challenging
due to two main issues: 1) the source data, usually obtained through
simulation, contain a certain level of noise, while the target data, collected
directly as raw data from real-world scenarios, have unpredictable noise
levels; 2) the source data used for training typically consist of known
macromolecules, while those in the target domain are often unknown, causing the
model's segmenter to be biased towards the known macromolecules, leading to a
domain shift problem. To address these challenges, in this work, we introduce
the first voxel-wise unsupervised domain adaptation approach, termed Vox-UDA,
specifically for cryo-ET subtomogram segmentation. Vox-UDA incorporates a noise
generation module to simulate target-like noises in the source dataset for
cross-noise level adaptation. Additionally, we propose a denoised
pseudo-labeling strategy based on an improved bilateral filter to alleviate the
domain shift problem. Experimental results on both simulated and real cryo-ET
subtomogram datasets demonstrate the superiority of our proposed approach
compared to state-of-the-art UDA methods.
Authors' comments: 11 pages
Ning Lin, Shaocong Wang, Yue Zhang, Yangu He, Kwunhang Wong, Arindam Basu, Dashan Shang, Xiaoming Chen et al.
Deep neural networks (DNNs), such as the widely-used GPT-3 with billions of
parameters, are often kept secret due to high training costs and privacy
concerns surrounding the data used to train them. Previous approaches to
securing DNNs typically require expensive circuit redesign, resulting in
additional overheads such as increased area, energy consumption, and latency.
To address these issues, we propose a novel hardware-software co-design
approach for DNN intellectual property (IP) protection that capitalizes on the
inherent aging characteristics of circuits and a novel differential orientation
fine-tuning (DOFT) to ensure effective protection. Hardware-wise, we employ
random aging to produce authorized chips. This process circumvents the need for
chip redesign, thereby eliminating any additional hardware overhead during the
inference procedure of DNNs. Moreover, the authorized chips demonstrate a
considerable disparity in DNN inference performance when compared to
unauthorized chips. Software-wise, we propose a novel DOFT, which allows
pre-trained DNNs to maintain their original accuracy on authorized chips with
minimal fine-tuning, while the model's performance on unauthorized chips is
reduced to random guessing. Extensive experiments on various models, including
MLP, VGG, ResNet, Mixer, and SwinTransformer, with lightweight binary and
practical multi-bit weights demonstrate that the proposed method achieves
effective IP protection, with only 10% accuracy on unauthorized chips, while
preserving nearly the original accuracy on authorized ones.
Authors' comments: Design Automation Conference 2024
Xiaoxiong Zhang, Zhiwei Zeng, Xin Zhou, Dusit Niyato, Zhiqi Shen
Federated Knowledge Graph Embedding learning (FKGE) encounters challenges in communication efficiency stemming from the considerable size of parameters and extensive communication rounds. However, existing FKGE methods only focus on reducing communication rounds by conducting multiple rounds of local training in each communication round, and ignore reducing the size of parameters transmitted within each communication round. To tackle the problem, we first find that universal reduction in embedding precision across all entities during compression can significantly impede convergence speed, underscoring the importance of maintaining embedding precision. We then propose FedS, a bidirectional communication-efficient method based on an Entity-Wise Top-K Sparsification strategy. During upload, clients dynamically identify and upload to the server only the Top-K entity embeddings with the greatest changes. During download, the server first performs personalized embedding aggregation for each client. It then identifies and transmits the Top-K aggregated embeddings to each client. In addition, FedS uses an Intermittent Synchronization Mechanism to mitigate the negative effect of embedding inconsistency among the shared entities of clients, which is caused by the heterogeneity of the federated knowledge graph. Extensive experiments across three datasets show that FedS significantly enhances communication efficiency with negligible (even no) performance degradation.
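The upload-side selection described above can be sketched as follows. This is a hedged illustration only: measuring "change" as the L2 norm of the difference between the previous and current embedding is an assumption, and the function names and toy entities are hypothetical rather than taken from FedS.

```python
import math

def top_k_changed_entities(previous, current, k):
    """Return the ids of the k entities whose embeddings changed the most.
    Change is measured here (by assumption) as the L2 norm of the difference."""
    def change(entity):
        return math.sqrt(sum((c - p) ** 2
                             for p, c in zip(previous[entity], current[entity])))
    ranked = sorted(current, key=change, reverse=True)
    return ranked[:k]

def sparsified_upload(previous, current, k):
    """Build the payload a client would send: only the Top-K changed embeddings."""
    return {e: current[e] for e in top_k_changed_entities(previous, current, k)}

# Toy client state before and after one round of local training.
previous = {"paris": [0.0, 0.0], "rome": [1.0, 1.0], "oslo": [2.0, 2.0]}
current = {"paris": [3.0, 4.0], "rome": [1.0, 1.1], "oslo": [2.0, 3.0]}
upload = sparsified_upload(previous, current, k=2)
print(sorted(upload))
```

In this toy round only two of the three entity embeddings are transmitted, and the nearly unchanged one is skipped, which is the source of the communication saving.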
Kosuke Doi, Yuka Ko, Mana Makinae, Katsuhito Sudoh, Satoshi Nakamura
This paper analyzes the features of monotonic translations, which follow the
word order of the source language, in simultaneous interpreting (SI). Word
order differences are one of the biggest challenges in SI, especially for
language pairs with significant structural differences like English and
Japanese. We analyzed the characteristics of chunk-wise monotonic translation
(CMT) sentences using the NAIST English-to-Japanese Chunk-wise Monotonic
Translation Evaluation Dataset and identified some grammatical structures that
make monotonic translation difficult in English-Japanese SI. We further
investigated the features of CMT sentences by evaluating the output from the
existing speech translation (ST) and simultaneous speech translation (simulST)
models on the NAIST English-to-Japanese Chunk-wise Monotonic Translation
Evaluation Dataset as well as on existing test sets. The results indicate the
possibility that the existing SI-based test set underestimates the model
performance. The results also suggest that using CMT sentences as references
gives higher scores to simulST models than ST models, and that using an
offline-based test set to evaluate the simulST models underestimates the model
performance.
Authors' comments: Accepted to IWSLT2024
Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Qiao Liang, Min-Jian Zhao, Jieping Ye
The field of 3D object detection from point clouds is rapidly advancing in
computer vision, aiming to accurately and efficiently detect and localize
objects in three-dimensional space. Current 3D detectors commonly fall short in
terms of flexibility and scalability, with ample room for advancements in
performance. In this paper, our objective is to address these limitations by
introducing two frameworks for 3D object detection with minimal hand-crafted
design. Firstly, we propose CT3D, which sequentially performs raw-point-based
embedding, a standard Transformer encoder, and a channel-wise decoder for point
features within each proposal. Secondly, we present an enhanced network called
CT3D++, which incorporates geometric and semantic fusion-based embedding to
extract more valuable and comprehensive proposal-aware information.
Additionally, CT3D++ utilizes a point-to-key bidirectional encoder for more
efficient feature encoding with reduced computational cost. By replacing the
corresponding components of CT3D with these novel modules, CT3D++ achieves
state-of-the-art performance on both the KITTI dataset and the large-scale
Waymo Open Dataset. The source code for our frameworks will be made
accessible at https://github.com/hlsheng1/CT3D-plusplus.
Authors' comments: 19 pages, 8 figures
Yuanjie Shi, Subhankar Ghosh, Taha Belkhouja, Janardhan Rao Doppa, Yan Yan
Conformal prediction (CP) is an emerging uncertainty quantification framework that allows us to construct a prediction set to cover the true label with a pre-specified marginal or conditional probability. Although the valid coverage guarantee has been extensively studied for classification problems, CP often produces large prediction sets which may not be practically useful. This issue is exacerbated in the setting of class-conditional coverage on classification tasks with many and/or imbalanced classes. This paper proposes the Rank Calibrated Class-conditional CP (RC3P) algorithm to reduce the prediction set sizes while achieving class-conditional coverage, where the valid coverage holds for each class. In contrast to the standard class-conditional CP (CCP) method that uniformly thresholds the class-wise conformity score for each class, the augmented label rank calibration step allows RC3P to selectively iterate this class-wise thresholding subroutine only for a subset of classes whose class-wise top-k error is small. We prove that, agnostic to the classifier and data distribution, RC3P achieves class-wise coverage. We also show that RC3P reduces the size of prediction sets compared to the CCP method. Comprehensive experiments on multiple real-world datasets demonstrate that RC3P achieves class-wise coverage and a 26.25% reduction in prediction set sizes on average.
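For orientation, the baseline class-wise thresholding step that the abstract contrasts RC3P with can be sketched as standard split conformal prediction applied per class. This sketch does not implement RC3P's rank calibration. The nonconformity score `1 - p(label)`, the function names, and the toy calibration data are assumptions made for illustration.

```python
import math

def class_thresholds(cal_scores_by_class, alpha):
    """Per-class conformal quantile: for a class with n calibration scores, take the
    ceil((n + 1) * (1 - alpha))-th smallest score (infinite if n is too small)."""
    thresholds = {}
    for label, scores in cal_scores_by_class.items():
        n = len(scores)
        rank = math.ceil((n + 1) * (1 - alpha))
        ordered = sorted(scores)
        thresholds[label] = ordered[rank - 1] if rank <= n else float("inf")
    return thresholds

def prediction_set(probs, thresholds):
    """Include every label whose nonconformity score 1 - p(label) is within
    that label's own threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= thresholds[label]]

# Toy calibration scores (1 - predicted probability of the true class), grouped by class.
cal = {"cat": [0.1, 0.2, 0.3, 0.4], "dog": [0.05, 0.5, 0.6, 0.7]}
thresholds = class_thresholds(cal, alpha=0.25)
labels = prediction_set({"cat": 0.7, "dog": 0.2}, thresholds)
print(thresholds, labels)
```

Because every class gets its own threshold, rare or hard classes tend to receive loose thresholds and enlarge the prediction sets, which is the inefficiency the abstract says RC3P targets.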
Peiyu Liang, Hongchang Gao, Xubin He
While Multi-view Graph Neural Networks (MVGNNs) excel at leveraging diverse modalities for learning object representations, existing methods assume identical local topology structures across modalities, overlooking real-world discrepancies. This causes MVGNNs to struggle with modality fusion and representation denoising. To address these issues, we propose adaptive modality-wise structure learning (AMoSL). AMoSL captures node correspondences between modalities via optimal transport and learns them jointly with the graph embedding. To enable efficient end-to-end training, we employ an efficient solution for the resulting complex bilevel optimization problem. Furthermore, AMoSL adapts to downstream tasks through unsupervised learning on inter-modality distances. The effectiveness of AMoSL is demonstrated by its ability to train more accurate graph classifiers on six benchmark datasets.
Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
The Audio-Visual Video Parsing task aims to identify and temporally localize
the events that occur in either or both the audio and visual streams of audible
videos. It is often performed in a weakly-supervised manner, where only video event
labels are provided, i.e., the modalities and the timestamps of the labels are
unknown. Due to the lack of densely annotated labels, recent work attempts to
leverage pseudo labels to enrich the supervision. A commonly used strategy is
to generate pseudo labels by categorizing the known video event labels for each
modality. However, the labels are still confined to the video level, and the
temporal boundaries of events remain unlabeled. In this paper, we propose a new
pseudo label generation strategy that can explicitly assign labels to each
video segment by utilizing prior knowledge learned from the open world.
Specifically, we exploit the large-scale pretrained models, namely CLIP and
CLAP, to estimate the events in each video segment and generate segment-level
visual and audio pseudo labels, respectively. We then propose a new loss
function to exploit these pseudo labels by taking into account their
category-richness and segment-richness. A label denoising strategy is also
adopted to further improve the visual pseudo labels by flipping them whenever
abnormally large forward losses occur. We perform extensive experiments on the
LLP dataset and demonstrate the effectiveness of each proposed design, achieving
state-of-the-art video parsing performance on all types of event
parsing, i.e., audio event, visual event, and audio-visual event. We also
examine the proposed pseudo label generation strategy on a relevant
weakly-supervised audio-visual event localization task and the experimental
results again verify the benefits and generalization of our method.
Authors' comments: IJCV 2024 Accepted. arXiv admin note: substantial text overlap with
arXiv:2303.02344
Qi Zhang, Yunfei Gong, Daijie Chen, Antoni B. Chan, Hui Huang
Recent deep learning-based multi-view people detection (MVD) methods have
shown promising results on existing datasets. However, current methods are
mainly trained and evaluated on small, single scenes with a limited number of
multi-view frames and fixed camera views. As a result, these methods may not be
practical for detecting people in larger, more complex scenes with severe
occlusions and camera calibration errors. This paper focuses on improving
multi-view people detection by developing a supervised view-wise contribution
weighting approach that better fuses multi-camera information under large
scenes. Besides, a large synthetic dataset is adopted to enhance the model's
generalization ability and enable more practical evaluation and comparison. The
model's performance on new testing scenes is further improved with a simple
domain adaptation technique. Experimental results demonstrate the effectiveness
of our approach in achieving promising cross-scene multi-view people detection
performance. See code here: https://vcc.tech/research/2024/MVD.
Authors' comments: AAAI 2024
Zheng Tracy Ke, Jingming Wang
Topic modeling is a widely utilized tool in text analysis. We investigate the
optimal rate for estimating a topic model. Specifically, we consider a scenario
with $n$ documents, a vocabulary of size $p$, and document lengths at the order
$N$. When $N\geq c\cdot p$, referred to as the long-document case, the optimal
rate is established in the literature at $\sqrt{p/(Nn)}$. However, when
$N=o(p)$, referred to as the short-document case, the optimal rate remains
unknown. In this paper, we first provide new entry-wise large-deviation bounds
for the empirical singular vectors of a topic model. We then apply these bounds
to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by
comparing the improved error rate with the minimax lower bound, we conclude
that the optimal rate is still $\sqrt{p/(Nn)}$ in the short-document case.
Authors' comments: 50 pages
Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen et al.
The length generalization failure problem, namely that a large language model
(LLM) fails to generalize to texts longer than its maximum training length,
greatly restricts the application of LLMs in scenarios with streaming long
inputs. To address this problem, existing methods either require substantial
costs or introduce precision loss. In this paper, we empirically find that the
accuracy of the LLM's prediction is highly correlated to its certainty. Based
on this, we propose an efficient training-free framework, named XL3M (short for
extra-long large language model), which enables LLMs trained on short
sequences to reason over extremely long sequences without any further training
or fine-tuning. Under the XL3M framework, the input context is first
decomposed into multiple short sub-contexts, where each sub-context contains an
independent segment and a common "question", which is a few tokens from the
end of the original context. XL3M then measures the relevance
between each segment and the "question", and constructs a concise key context
by splicing all the relevant segments in chronological order. The key context
is further used instead of the original context to complete the inference task.
Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our
framework, a Llama2-7B model is able to reason over sequences of length 20M on
an 8-card Huawei Ascend 910B NPU machine with 64GB memory per card.
Authors' comments: 11 pages, 5 figures
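The decomposition and splicing steps described in this abstract can be sketched as follows. This is only an outline under stated assumptions: the abstract does not specify how relevance is computed, so `toy_overlap` is a hypothetical word-overlap stand-in rather than XL3M's measure, and the segment length, the number of kept segments, and the choice to split only the text before the "question" are likewise illustrative.

```python
def split_into_segments(tokens, segment_len):
    """Cut a token list into consecutive, non-overlapping segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

def build_key_context(tokens, segment_len, question_len, top_m, relevance):
    """Score each segment against the trailing "question" and splice the top_m
    segments back together in their original (chronological) order."""
    question = tokens[-question_len:]
    body = tokens[:-question_len]  # assumption: only the text before the question is split
    segments = split_into_segments(body, segment_len)
    scored = [(relevance(seg, question), idx) for idx, seg in enumerate(segments)]
    keep = sorted(idx for _, idx in sorted(scored, reverse=True)[:top_m])
    key_context = []
    for idx in keep:
        key_context.extend(segments[idx])
    return key_context

def toy_overlap(segment, question):
    """Hypothetical stand-in relevance: number of distinct shared tokens."""
    return len(set(segment) & set(question))

text = ("the sky is grey . lunch was soup too . the key is blue . "
        "what colour is the key ?")
tokens = text.split()
key_context = build_key_context(tokens, segment_len=5, question_len=6,
                                top_m=2, relevance=toy_overlap)
print(" ".join(key_context))
```

The most relevant segment here is the last one, yet it is emitted after the earlier kept segment, so the spliced key context preserves the original order while dropping the irrelevant middle segment.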
Oleksii Furman, Patryk Wielopolski, Łukasz Lenkiewicz, Jerzy Stefanowski, Maciej Zięba
The growing complexity of AI systems has intensified the need for transparency through Explainable AI (XAI). Counterfactual explanations (CFs) offer actionable "what-if" scenarios on three levels: Local CFs providing instance-specific insights, Global CFs addressing broader trends, and Group-wise CFs (GWCFs) striking a balance and revealing patterns within cohesive groups. Despite the availability of methods for each granularity level, the field lacks a unified method that integrates these complementary approaches. We address this limitation by proposing a gradient-based optimization method for differentiable models that generates Local, Global, and Group-wise Counterfactual Explanations in a unified manner. In particular, we enhance GWCF generation by combining instance grouping and counterfactual generation into a single efficient process, replacing traditional two-step methods. Moreover, to ensure trustworthiness, we integrate plausibility criteria into the GWCF domain, making explanations both valid and realistic. Our results demonstrate the method's effectiveness in balancing validity, proximity, and plausibility while optimizing group granularity, with its utility validated through practical use cases.
Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang et al.
Large language models (LLMs) need knowledge updates to keep up with ever-growing
world facts and to correct hallucinated responses, which motivates methods for
lifelong model editing. Where the updated knowledge resides in memories is a
fundamental question for model editing. In this paper, we find that editing
either long-term memory (direct model parameters) or working memory
(non-parametric knowledge of neural network activations/representations by
retrieval) will result in an impossible triangle -- reliability,
generalization, and locality cannot be realized together in the lifelong
editing settings. For long-term memory, directly editing the parameters will
cause conflicts with irrelevant pretrained knowledge or previous edits (poor
reliability and locality). For working memory, retrieval-based activations can
hardly make the model understand the edits and generalize (poor
generalization). Therefore, we propose WISE to bridge the gap between memories.
In WISE, we design a dual parametric memory scheme, which consists of the main
memory for the pretrained knowledge and a side memory for the edited knowledge.
We only edit the knowledge in the side memory and train a router to decide
which memory to go through when given a query. For continual editing, we devise
a knowledge-sharding mechanism where different sets of edits reside in distinct
subspaces of parameters, and are subsequently merged into a shared memory
without conflicts. Extensive experiments show that WISE can outperform previous
model editing methods and overcome the impossible triangle under lifelong model
editing of question answering, hallucination, and out-of-distribution settings
across trending LLM architectures, e.g., GPT, LLaMA, and Mistral. Code is
available at https://github.com/zjunlp/EasyEdit.
Authors' comments: NeurIPS 2024
Bart Jacobs
In probabilistic updating one transforms a prior distribution in the light of given evidence into a posterior distribution, via what is called conditioning, updating, belief revision or inference. This is the essence of learning, as Bayesian updating. It will be illustrated via a physical model involving (adapted) water flows through pipes with different diameters. Bayesian updating makes us wiser, in the sense that the posterior distribution makes the evidence more likely than the prior, since it incorporates the evidence. Things are less clear when one wishes to learn from multiple pieces of evidence / data. It turns out that there are (at least) two forms of updating for this, associated with Jeffrey and Pearl. The difference is not always clearly recognised. This paper provides an introduction and an overview in the setting of discrete probability theory. It starts from an elementary question, involving multiple pieces of evidence, that has been sent to a small group of academic specialists. Their answers show considerable differences. This is used as motivation and starting point to introduce the two forms of updating, of Jeffrey and Pearl, for multiple inputs and to elaborate their properties. In the end, the account is related to so-called variational free energy (VFE) update in the cognitive theory of predictive processing. It is shown that both Jeffrey and Pearl outperform VFE updating and that VFE updating need not decrease divergence - that is, correct errors - as it is supposed to do.
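The claim that the posterior makes the evidence more likely than the prior can be checked numerically for a single piece of evidence in the discrete setting. The sketch below covers only that single-evidence case and does not reproduce the Jeffrey or Pearl rules for multiple inputs; the state names, the numbers, and the helper names `validity` and `update` are illustrative assumptions.

```python
def validity(dist, evidence):
    """Expected value of the evidence (a [0, 1]-valued function on states)
    under the distribution dist."""
    return sum(dist[x] * evidence[x] for x in dist)

def update(prior, evidence):
    """Bayesian update: reweight each state by its evidence value and renormalise."""
    norm = validity(prior, evidence)
    return {x: prior[x] * evidence[x] / norm for x in prior}

# Toy example: a prior over two states and the likelihood of a positive test in each.
prior = {"disease": 0.1, "healthy": 0.9}
evidence = {"disease": 0.9, "healthy": 0.05}

posterior = update(prior, evidence)
before = validity(prior, evidence)
after = validity(posterior, evidence)
print(posterior, before, after)
```

The evidence is more likely under the posterior than under the prior; for this kind of single update that always holds, because the validity after updating is the ratio of the second moment of the evidence to its mean under the prior, which is never smaller than the mean itself.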
Chenchen Liu, Wenjun Jiang, Xiaojun Yuan
In this paper, we propose a learning-based block-wise planar channel estimator (LBPCE) with high accuracy and low complexity to estimate the time-varying frequency-selective channel of a multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) system. First, we establish a block-wise planar channel model (BPCM) to characterize the correlation of the channel across subcarriers and OFDM symbols. Specifically, adjacent subcarriers and OFDM symbols are divided into several sub-blocks, and an affine function (i.e., a plane) with only three variables (namely, mean, time-domain slope, and frequency-domain slope) is used to approximate the channel in each sub-block, which significantly reduces the number of variables to be determined in channel estimation. Second, we design a 3D dilated residual convolutional network (3D-DRCN) that leverages the time-frequency-space-domain correlations of the channel to further improve the channel estimates of each user. Numerical results demonstrate that the proposed LBPCE significantly outperforms the state-of-the-art estimators and maintains a relatively low computational complexity.
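The planar approximation of one sub-block can be sketched as follows. The three variables (mean, time-domain slope, frequency-domain slope) come from the abstract. Centring the plane on the middle of the sub-block so that the constant term equals the block mean is an assumption of this sketch, as are the function name, the block size, and the numerical values.

```python
def planar_block(mean, time_slope, freq_slope, num_symbols, num_subcarriers):
    """Evaluate an affine (planar) channel model over one sub-block.
    Rows index OFDM symbols (time), columns index subcarriers (frequency).
    Assumption: offsets are measured from the block centre."""
    t_center = (num_symbols - 1) / 2
    f_center = (num_subcarriers - 1) / 2
    return [[mean + time_slope * (t - t_center) + freq_slope * (f - f_center)
             for f in range(num_subcarriers)]
            for t in range(num_symbols)]

# A 4-symbol by 6-subcarrier sub-block with complex-valued channel coefficients.
block = planar_block(mean=1.0 + 0.5j, time_slope=0.02 - 0.01j, freq_slope=-0.03j,
                     num_symbols=4, num_subcarriers=6)

entries = sum(len(row) for row in block)
block_mean = sum(sum(row) for row in block) / entries
print(entries, block_mean)
```

In this toy sub-block, 24 channel coefficients are described by three complex variables, which illustrates the reduction in unknowns that the abstract attributes to the BPCM.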