Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.
Authors' comments: Accepted by EMNLP 2025
Sweta Kaman, Ankita Sharma, Romi Banerjee
Background: Wisdom is a superordinate construct that embraces perspective
taking, reflectiveness, prosocial orientation, reflective empathetic action,
and intellectual humility. Unlike conventional models of reasoning that are
rigidly bound by binary thinking, wisdom unfolds in shades of ambiguity,
requiring both graded evaluation and self-reflective humility. Current measures
depend on self-reports and seldom reflect the humility and uncertainty inherent
in wise reasoning. A computational framework that takes into account both
multidimensionality and confidence has the potential to improve psychological
science and allow humane AI. Method: We present a fuzzy inference system with Z
numbers, each of the decisions being expressed in terms of a wisdom score
(restriction) and confidence score (certainty). As part of this study,
participants (N = 100) were exposed to culturally neutral pictorial moral
dilemma tasks to which they generated think-aloud linguistic responses, which
were mapped into five theoretically based components of wisdom. The scores of
each individual component were combined using a base of 21 rules, with
membership functions tuned via Gaussian kernel density estimation. Results: In
a proof of concept study, the system produced dual attribute wisdom
representations that correlated modestly but significantly with established
scales while showing negligible relations with unrelated traits, supporting
convergent and divergent validity. Contribution: The contribution is to
formalize wisdom as a multidimensional, uncertainty-conscious construct,
operationalized in the form of Z-numbers. In addition to progressing
measurement in psychology, it calculates how fuzzy Z numbers can provide AI
systems with interpretable, confidence-sensitive reasoning that affords a safe,
middle ground between rigorous computation and human-like judgment.
Authors' comments: total 17 pages, main manuscript 12 pages, supplementary 5 pages, 6
tables in main manuscript, 5 figures in main manuscript, 2 tables in
supplementary, and 3 figures in supplementary
Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Generative models have been successfully used in the field of time series
generation. However, when dealing with long-term time series, which span over
extended periods and exhibit more complex long-term temporal patterns, the task
of generation becomes significantly more challenging. Long-term time series
exhibit long-range temporal dependencies, but their data distribution also
undergoes gradual changes over time. Finding a balance between these long-term
dependencies and the drift in data distribution is a key challenge. On the
other hand, long-term time series contain more complex interrelationships
between different feature sequences, making the task of effectively capturing
both intra-sequence and inter-sequence dependencies another important
challenge. To address these issues, we propose Stage-Diff, a staged generative
model for long-term time series based on diffusion models. First, through
stage-wise sequence generation and inter-stage information transfer, the model
preserves long-term sequence dependencies while enabling the modeling of data
distribution shifts. Second, within each stage, progressive sequence
decomposition is applied to perform channel-independent modeling at different
time scales, while inter-stage information transfer utilizes multi-channel
fusion modeling. This approach combines the robustness of channel-independent
modeling with the information fusion advantages of multi-channel modeling,
effectively balancing the intra-sequence and inter-sequence dependencies of
long-term time series. Extensive experiments on multiple real-world datasets
validate the effectiveness of Stage-Diff in long-term time series generation
tasks.
Authors' comments: 8 pages, 5 figures
Subham Kutum, Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Mahesh Chandra Govil
Numerous methods have been proposed to enhance Keyword Spotting (KWS) in
adult speech, but children's speech presents unique challenges for KWS systems
due to its distinct acoustic and linguistic characteristics. This paper
introduces a zero-shot KWS approach that leverages state-of-the-art
self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec.
Features are extracted layer-wise from these SSL models and used to train a
Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for
training, while the PFSTAR children's speech dataset was used for testing,
demonstrating the zero-shot capability of our method. Our approach achieved
state-of-the-art results across all keyword sets for children's speech.
Notably, the Wav2Vec2 model, particularly layer 22, performed the best,
delivering an ATWV score of 0.691, a MTWV score of 0.7003 and probability of
false alarm and probability of miss of 0.0164 and 0.0547 respectively, for a
set of 30 keywords. Furthermore, age-specific performance evaluation confirmed
the system's effectiveness across different age groups of children. To assess
the system's robustness against noise, additional experiments were conducted
using the best-performing layer of the best-performing Wav2Vec2 model. The
results demonstrated a significant improvement over traditional MFCC-based
baseline, emphasizing the potential of SSL embeddings even in noisy conditions.
To further generalize the KWS framework, the experiments were repeated for an
additional CMU dataset. Overall the results highlight the significant
contribution of SSL features in enhancing Zero-Shot KWS performance for
children's speech, effectively addressing the challenges associated with the
distinct characteristics of child speakers.
Authors' comments: Accepted
Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Automatic Speech Recognition (ASR) systems often struggle to accurately
process children's speech due to its distinct and highly variable acoustic and
linguistic characteristics. While recent advancements in self-supervised
learning (SSL) models have greatly enhanced the transcription of adult speech,
accurately transcribing children's speech remains a significant challenge. This
study investigates the effectiveness of layer-wise features extracted from
state-of-the-art SSL pre-trained models - specifically, Wav2Vec2, HuBERT,
Data2Vec, and WavLM in improving the performance of ASR for children's speech
in zero-shot scenarios. A detailed analysis of features extracted from these
models was conducted, integrating them into a simplified DNN-based ASR system
using the Kaldi toolkit. The analysis identified the most effective layers for
enhancing ASR performance on children's speech in a zero-shot scenario, where
WSJCAM0 adult speech was used for training and PFSTAR children speech for
testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model
achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64%
relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of
10.65%). Additionally, age group-wise analysis demonstrated consistent
performance improvements with increasing age, along with significant gains
observed even in younger age groups using the SSL features. Further experiments
on the CMU Kids dataset confirmed similar trends, highlighting the
generalizability of the proposed approach.
Authors' comments: Accepted
Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You
Large-batch training has become a cornerstone in accelerating the training of
deep neural networks, yet it poses challenges in optimization and
generalization. Existing optimizers like AdamW present performance degradation
during language models' large-batch training, due to the information bottleneck
in attention layers caused by the sharp increase of max attention logit. While
the LAMB optimizer partially addresses this issue, some attention layers still
face this issue. The reason is that $l_2$-norm-based trust ratios in LAMB are
less effective in directly influencing the max value of query/key weights.
Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks
relationships of weight values within rows or columns. Building on these
observations, we propose a novel optimizer, MERIT, which leverages the max-norm
to calculate the trust ratio to constrain the max attention logit more
effectively. Moreover, we further construct element-wise trust ratios to
provide more robust update scaling by focusing on local weight structures.
Extensive experiments of large-batch training across various sizes of GPT-2
models demonstrate the superior performance of MERIT. Notably, during the
training of GPT-2 Medium, MERIT enables a 6k batch size without any performance
degradation compared to the standard batch size (480) with 48B training tokens.
This work highlights the importance of considering the max attention logit and
finer-granularity trust ratio in large-batch training. It successfully improves
the training stability and paves the way for larger batch usage, enabling
faster development and iteration of large language models. Code is available at
https://github.com/NUS-HPC-AI-Lab/MERIT.
Authors' comments: ICML 2025
Tiandi Ye, Wenyan Liu, Kai Yao, Lichun Li, Shangchao Su, Cen Chen, Xiang Li, Shan Yin et al.
Federated learning (FL) is a privacy-preserving machine learning paradigm
that enables collaborative model training across multiple distributed clients
without disclosing their raw data. Personalized federated learning (pFL) has
gained increasing attention for its ability to address data heterogeneity.
However, most existing pFL methods assume that each client's data follows a
single distribution and learn one client-level personalized model for each
client. This assumption often fails in practice, where a single client may
possess data from multiple sources or domains, resulting in significant
intra-client heterogeneity and suboptimal performance. To tackle this
challenge, we propose pFedBayesPT, a fine-grained instance-wise pFL framework
based on visual prompt tuning. Specifically, we formulate instance-wise prompt
generation from a Bayesian perspective and model the prompt posterior as an
implicit distribution to capture diverse visual semantics. We derive a
variational training objective under the semi-implicit variational inference
framework. Extensive experiments on benchmark datasets demonstrate that
pFedBayesPT consistently outperforms existing pFL methods under both feature
and label heterogeneity settings.
Authors' comments: Accepted by CIKM2025
Md Anwar Hossen, Fatema Siddika, Wensheng Zhang, Anuj Sharma, Ali Jannesari
Heterogeneous Federated Learning (HFL) has gained attention for its ability
to accommodate diverse models and heterogeneous data across clients.
Prototype-based HFL methods emerge as a promising solution to address
statistical heterogeneity and privacy challenges, paving the way for new
advancements in HFL research. This method focuses on sharing only
class-representative prototypes among heterogeneous clients. However, these
prototypes are often aggregated on the server using weighted averaging, leading
to sub-optimal global knowledge; these cause the shrinking of aggregated
prototypes, which negatively affects the model performance in scenarios when
models are heterogeneous and data distributions are extremely non-IID. We
propose FedProtoKD in a Heterogeneous Federated Learning setting, using an
enhanced dual-knowledge distillation mechanism to improve the system
performance with clients' logits and prototype feature representation. We aim
to resolve the prototype margin-shrinking problem using a contrastive
learning-based trainable server prototype by leveraging a class-wise adaptive
prototype margin. Furthermore, we assess the importance of public samples using
the closeness of the sample's prototype to its class representative prototypes,
which enhances learning performance. FedProtoKD achieved average improvements
of 1.13% up to 34.13% accuracy across various settings and significantly
outperforms existing state-of-the-art HFL methods.
Authors' comments: 12 pages, 6 figures
Domenico De Cristofaro, Vincenzo Norman Vitale, Alessandro Vietti
Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
Kejian Zhang, Muxuan Liang, Robert Maile, Doudou Zhou
Multi-source and multi-modal datasets are increasingly common in scientific research, yet they often exhibit block-wise missingness, where entire data sources or modalities are systematically absent for subsets of subjects. This structured form of missingness presents significant challenges for statistical analysis, particularly for two-sample hypothesis testing. Standard approaches such as imputation or complete-case analysis can introduce bias or result in substantial information loss, especially when the missingness mechanism is not random. To address this methodological gap, we propose the Block-Pattern Enhanced Test (BPET), a general framework for two-sample testing that directly accounts for block-wise missingness without requiring imputation or deletion of observations. As a concrete instantiation, we develop the Block-wise Rank In Similarity graph Edge-count (BRISE) test, which extends rank-based similarity graph methods to settings with block-wise missing data. Under mild conditions, we establish that the null distribution of BRISE converges to a chi-squared distribution. Simulation studies show that BRISE consistently controls the type I error rate and achieves good statistical power under a wide range of alternatives. Applications to two real-world datasets with block-wise missingness further demonstrate the practical utility of our method in identifying meaningful distributional differences.
Austin Braniff, Yuhe Tian
This work presents a novel reinforcement learning (RL) algorithm based on Y-wise Affine Neural Networks (YANNs). YANNs provide an interpretable neural network which can exactly represent known piecewise affine functions of arbitrary input and output dimensions defined on any amount of polytopic subdomains. One representative application of YANNs is to reformulate explicit solutions of multi-parametric linear model predictive control. Built on this, we propose the use of YANNs to initialize RL actor and critic networks, which enables the resulting YANN-RL control algorithm to start with the confidence of linear optimal control. The YANN-actor is initialized by representing the multi-parametric control solutions obtained via offline computation using an approximated linear system model. The YANN-critic represents the explicit form of the state-action value function for the linear system and the reward function as the objective in an optimal control problem (OCP). Additional network layers are injected to extend YANNs for nonlinear expressions, which can be trained online by directly interacting with the true complex nonlinear system. In this way, both the policy and state-value functions exactly represent a linear OCP initially and are able to eventually learn the solution of a general nonlinear OCP. Continuous policy improvement is also implemented to provide heuristic confidence that the linear OCP solution serves as an effective lower bound to the performance of RL policy. The YANN-RL algorithm is demonstrated on a clipped pendulum and a safety-critical chemical-reactive system. Our results show that YANN-RL significantly outperforms the modern RL algorithm using deep deterministic policy gradient, especially when considering safety constraints.
Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas
Code completion entails the task of providing missing tokens given a
surrounding context. It can boost developer productivity while providing a
powerful code discovery tool. Following the Large Language Model (LLM) wave,
code completion has been approached with diverse LLMs fine-tuned on code (code
LLMs). The performance of code LLMs can be assessed with downstream and
intrinsic metrics. Downstream metrics are usually employed to evaluate the
practical utility of a model, but can be unreliable and require complex
calculations and domain-specific knowledge. In contrast, intrinsic metrics such
as perplexity, entropy, and mutual information, which measure model confidence
or uncertainty, are simple, versatile, and universal across LLMs and tasks, and
can serve as proxies for functional correctness and hallucination risk in
LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when
generating code by measuring code perplexity across programming languages,
models, and datasets using various LLMs, and a sample of 1008 files from 657
GitHub projects. We find that strongly-typed languages exhibit lower perplexity
than dynamically typed languages. Scripting languages also demonstrate higher
perplexity. Perl appears universally high in perplexity, whereas Java appears
low. Code perplexity depends on the employed LLM, but not on the code dataset.
Although code comments often increase perplexity, the language ranking based on
perplexity is barely affected by their presence. LLM researchers, developers,
and users can employ our findings to assess the benefits and suitability of
LLM-based code completion in specific software projects based on how language,
model choice, and code characteristics impact model confidence.
Authors' comments: 30 pages, 10 figures
Ninh Tran
Partial conjunction (PC) hypothesis testing is widely used to assess the replicability of scientific findings across multiple comparable studies. In high-throughput meta-analyses, testing a large number of PC hypotheses with k-family-wise error rate (k-FWER) control often suffers from low statistical power due to the multiplicity burden. The state-of-the-art AdaFilter-Bon procedure by Wang et al. (2022, Ann. Stat., 50(4), 1890-1909) alleviates this problem by filtering out hypotheses unlikely to be false before applying a rejection rule. However, a side effect of filtering is that it renders the rejection rule more stringent than necessary, leading to conservative k-FWER control. In this paper, we mitigate this conservativeness - and thereby improve the power of AdaFilter-Bon - by incorporating a post-filter null proportion estimate into the procedure. The resulting method, AdaFilter-AdaBon, has proven asymptotic k-FWER control under weak dependence and demonstrates empirical finite-sample control with higher power than the original AdaFilter-Bon in simulations.
Sebastian Musiał, Bartosz Zieliński, Tomasz Danel
Graph neural networks have demonstrated remarkable success in predicting molecular properties by leveraging the rich structural information encoded in molecular graphs. However, their black-box nature reduces interpretability, which limits trust in their predictions for important applications such as drug discovery and materials design. Furthermore, existing explanation techniques often fail to reliably quantify the contribution of individual atoms or substructures due to the entangled message-passing dynamics. We introduce SEAL (Substructure Explanation via Attribution Learning), a new interpretable graph neural network that attributes model predictions to meaningful molecular subgraphs. SEAL decomposes input graphs into chemically relevant fragments and estimates their causal influence on the output. The strong alignment between fragment contributions and model predictions is achieved by explicitly reducing inter-fragment message passing in our proposed model architecture. Extensive evaluations on synthetic benchmarks and real-world molecular datasets demonstrate that SEAL outperforms other explainability methods in both quantitative attribution metrics and human-aligned interpretability. A user study further confirms that SEAL provides more intuitive and trustworthy explanations to domain experts. By bridging the gap between predictive performance and interpretability, SEAL offers a promising direction for more transparent and actionable molecular modeling.
Hyun-Jic Oh, Junsik Kim, Zhiyi Shi, Yichen Wu, Yu-An Chen, Peter K. Sorger, Hanspeter Pfister, Won-Ki Jeong
Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.
Elie Thellier, Huiyu Li, Nicholas Ayache, Hervé Delingette
Data lakes enable the training of powerful machine learning models on sensitive, high-value medical datasets, but also introduce serious privacy risks due to potential leakage of protected health information. Recent studies show adversaries can exfiltrate training data by embedding latent representations into model parameters or inducing memorization via multi-task learning. These attacks disguise themselves as benign utility models while enabling reconstruction of high-fidelity medical images, posing severe privacy threats with legal and ethical implications. In this work, we propose a simple yet effective mitigation strategy that perturbs model parameters at export time through fine-tuning with a decaying layer-wise learning rate to corrupt embedded data without degrading task performance. Evaluations on DermaMNIST, ChestMNIST, and MIMIC-CXR show that our approach maintains utility task performance, effectively disrupts state-of-the-art exfiltration attacks, outperforms prior defenses, and renders exfiltrated data unusable for training. Ablations and discussions on adaptive attacks highlight challenges and future directions. Our findings offer a practical defense against data leakage in data lake-trained models and centralized federated learning.
Zehang Lin, Zheng Lin, Miao Yang, Jianhao Huang, Yuxin Zhang, Zihan Fang, Xia Du, Zhe Chen et al.
The increasing complexity of neural networks poses a significant barrier to
the deployment of distributed machine learning (ML) on resource-constrained
devices, such as federated learning (FL). Split learning (SL) offers a
promising solution by offloading the primary computing load from edge devices
to a server via model partitioning. However, as the number of participating
devices increases, the transmission of excessive smashed data (i.e.,
activations and gradients) becomes a major bottleneck for SL, slowing down the
model training. To tackle this challenge, we propose a communication-efficient
SL framework, named SL-ACC, which comprises two key components: adaptive
channel importance identification (ACII) and channel grouping compression
(CGC). ACII first identifies the contribution of each channel in the smashed
data to model training using Shannon entropy. Following this, CGC groups the
channels based on their entropy and performs group-wise adaptive compression to
shrink the transmission volume without compromising training accuracy.
Extensive experiments across various datasets validate that our proposed SL-ACC
framework takes considerably less time to achieve a target accuracy than
state-of-the-art benchmarks.
Authors' comments: 6 pages, 7 figures
Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu
Test-time scaling has proven effective in further enhancing the performance
of pretrained Large Language Models (LLMs). However, mainstream post-training
methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT)
reasoning) often incur substantial computational overhead due to auxiliary
models and overthinking. In this paper, we empirically reveal that the
incorrect answers partially stem from verbose reasoning processes lacking
correct self-fix, where errors accumulate across multiple reasoning steps. To
this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a
pluggable RL process supervision framework that enables fine-grained
optimization of each reasoning step. Specifically, SSPO requires neither
auxiliary models nor stepwise manual annotations. Instead, it leverages
step-wise preference signals generated by the model itself to guide the
optimization process for reasoning compression. Experiments demonstrate that
the generated reasoning sequences from SSPO are both accurate and succinct,
effectively mitigating overthinking behaviors without compromising model
performance across diverse domains and languages.
Authors' comments: Work in progress
Abhijit Sinha, Harishankar Kumar, Mohit Joshi, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri
Children's speech presents challenges for age and gender classification due
to high variability in pitch, articulation, and developmental traits. While
self-supervised learning (SSL) models perform well on adult speech tasks, their
ability to encode speaker traits in children remains underexplored. This paper
presents a detailed layer-wise analysis of four Wav2Vec2 variants using the
PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture
speaker-specific cues more effectively than deeper layers, which increasingly
focus on linguistic information. Applying PCA further improves classification,
reducing redundancy and highlighting the most informative components. The
Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU
Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These
results reveal how speaker traits are structured across SSL model depth and
support more targeted, adaptive strategies for child-aware speech interfaces.
Authors' comments: Accepted at Workshop on Child Computer Interaction (WOCCI 2025)
He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong
Large language models with billions of parameters are often over-provisioned:
many layers contribute little unique information yet dominate the memory and
energy footprint during inference. We present LieQ, a metric-driven
post-training quantization framework that addresses the critical challenge of
maintaining accuracy in sub-7B models under extreme low-bit compression. Our
method introduces three complementary layer-wise diagnostics-Perplexity Drop,
Representational Compactness, and Top-k Energy Gain -that reveal a canonical
division of labour across layers, enabling automatic bit-width allocation
without gradient updates. Unlike existing approaches that suffer severe
accuracy degradation at 2-3 bits precision, LieQ achieves state-of-the-art
compression-accuracy trade-offs: on Qwen3-4B, it recovers 95.9% of FP16
baseline performance at 2.05-bit quantization, outperforming GPTQ by 19.7% and
AWQ by 18.1% on average across seven zero-shot reasoning tasks. Applied to
LLaMA3.2-3B, LieQ maintains 98.2% of baseline accuracy at 2.07-bit precision
while enabling 4x memory reduction, establishing new paradigms for deploying
small language models on resource-constrained edge devices.
Authors' comments: low-bit quantization