Elie Thellier, Huiyu Li, Nicholas Ayache, Hervé Delingette
Data lakes enable the training of powerful machine learning models on sensitive, high-value medical datasets, but also introduce serious privacy risks due to potential leakage of protected health information. Recent studies show adversaries can exfiltrate training data by embedding latent representations into model parameters or inducing memorization via multi-task learning. These attacks disguise themselves as benign utility models while enabling reconstruction of high-fidelity medical images, posing severe privacy threats with legal and ethical implications. In this work, we propose a simple yet effective mitigation strategy that perturbs model parameters at export time through fine-tuning with a decaying layer-wise learning rate to corrupt embedded data without degrading task performance. Evaluations on DermaMNIST, ChestMNIST, and MIMIC-CXR show that our approach maintains utility task performance, effectively disrupts state-of-the-art exfiltration attacks, outperforms prior defenses, and renders exfiltrated data unusable for training. Ablations and discussions on adaptive attacks highlight challenges and future directions. Our findings offer a practical defense against data leakage in data lake-trained models and centralized federated learning.
Zehang Lin, Zheng Lin, Miao Yang, Jianhao Huang, Yuxin Zhang, Zihan Fang, Xia Du, Zhe Chen et al.
The increasing complexity of neural networks poses a significant barrier to
the deployment of distributed machine learning (ML), such as federated
learning (FL), on resource-constrained devices. Split learning (SL) offers a
promising solution by offloading the primary computing load from edge devices
to a server via model partitioning. However, as the number of participating
devices increases, the transmission of excessive smashed data (i.e.,
activations and gradients) becomes a major bottleneck for SL, slowing down the
model training. To tackle this challenge, we propose a communication-efficient
SL framework, named SL-ACC, which comprises two key components: adaptive
channel importance identification (ACII) and channel grouping compression
(CGC). ACII first identifies the contribution of each channel in the smashed
data to model training using Shannon entropy. Following this, CGC groups the
channels based on their entropy and performs group-wise adaptive compression to
shrink the transmission volume without compromising training accuracy.
Extensive experiments across various datasets validate that our proposed SL-ACC
framework takes considerably less time to achieve a target accuracy than
state-of-the-art benchmarks.
Authors' comments: 6 pages, 7 figures
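The two steps named in the abstract above (entropy-based channel scoring, then group-wise compression) can be illustrated with a minimal sketch. This is not the SL-ACC implementation: the histogram-based entropy estimate, the bin count, the equal-size grouping, and the bit-widths `(2, 4, 8)` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def channel_entropy(values, num_bins=16):
    """Shannon entropy (in bits) of a histogram of one channel's values.

    The histogram estimate and num_bins are assumptions for illustration.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant channel carries no information
    width = (hi - lo) / num_bins
    # Map each value to a bin index; the maximum value is clamped into the last bin.
    bins = Counter(min(int((v - lo) / width), num_bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())


def group_channels(entropies, bit_widths=(2, 4, 8)):
    """Rank channels by entropy and split them into len(bit_widths) groups.

    Returns the (hypothetical) quantization bit-width assigned to each channel:
    low-entropy channels get fewer bits, high-entropy channels get more.
    """
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    group_size = math.ceil(len(order) / len(bit_widths))
    bits = [0] * len(entropies)
    for rank, ch in enumerate(order):
        bits[ch] = bit_widths[rank // group_size]
    return bits


# Three toy channels: constant, two-valued, and eight distinct values.
channels = [[0.0] * 8, [0, 0, 0, 0, 1, 1, 1, 1], list(range(8))]
entropies = [channel_entropy(ch) for ch in channels]
assigned_bits = group_channels(entropies)
```

In this toy setup the constant channel receives the smallest bit-width and the most varied channel the largest; a real system would also need the quantizer itself, which is omitted here.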
Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu
Test-time scaling has proven effective in further enhancing the performance
of pretrained Large Language Models (LLMs). However, mainstream post-training
methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT)
reasoning) often incur substantial computational overhead due to auxiliary
models and overthinking. In this paper, we empirically reveal that the
incorrect answers partially stem from verbose reasoning processes lacking
correct self-fix, where errors accumulate across multiple reasoning steps. To
this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a
pluggable RL process supervision framework that enables fine-grained
optimization of each reasoning step. Specifically, SSPO requires neither
auxiliary models nor stepwise manual annotations. Instead, it leverages
step-wise preference signals generated by the model itself to guide the
optimization process for reasoning compression. Experiments demonstrate that
the generated reasoning sequences from SSPO are both accurate and succinct,
effectively mitigating overthinking behaviors without compromising model
performance across diverse domains and languages.
Authors' comments: Work in progress
Abhijit Sinha, Harishankar Kumar, Mohit Joshi, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri
Children's speech presents challenges for age and gender classification due
to high variability in pitch, articulation, and developmental traits. While
self-supervised learning (SSL) models perform well on adult speech tasks, their
ability to encode speaker traits in children remains underexplored. This paper
presents a detailed layer-wise analysis of four Wav2Vec2 variants using the
PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture
speaker-specific cues more effectively than deeper layers, which increasingly
focus on linguistic information. Applying PCA further improves classification,
reducing redundancy and highlighting the most informative components. The
Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU
Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These
results reveal how speaker traits are structured across SSL model depth and
support more targeted, adaptive strategies for child-aware speech interfaces.
Authors' comments: Accepted at Workshop on Child Computer Interaction (WOCCI 2025)
He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong
Large language models with billions of parameters are often over-provisioned:
many layers contribute little unique information yet dominate the memory and
energy footprint during inference. We present LieQ, a metric-driven
post-training quantization framework that addresses the critical challenge of
maintaining accuracy in sub-7B models under extreme low-bit compression. Our
method introduces three complementary layer-wise diagnostics (Perplexity Drop,
Representational Compactness, and Top-k Energy Gain) that reveal a canonical
division of labour across layers, enabling automatic bit-width allocation
without gradient updates. Unlike existing approaches that suffer severe
accuracy degradation at 2-3 bits precision, LieQ achieves state-of-the-art
compression-accuracy trade-offs: on Qwen3-4B, it recovers 95.9% of FP16
baseline performance at 2.05-bit quantization, outperforming GPTQ by 19.7% and
AWQ by 18.1% on average across seven zero-shot reasoning tasks. Applied to
LLaMA3.2-3B, LieQ maintains 98.2% of baseline accuracy at 2.07-bit precision
while enabling 4x memory reduction, establishing new paradigms for deploying
small language models on resource-constrained edge devices.
Authors' comments: low-bit quantization
Zhihao Shuai, Boyan Li, Siyu Yan, Yuyu Luo, Weikai Yang
Although data visualization is powerful for revealing patterns and communicating insights, creating effective visualizations requires familiarity with authoring tools and often disrupts the analysis flow. While large language models show promise for automatically converting analysis intent into visualizations, existing methods function as black boxes without transparent reasoning processes, which prevents users from understanding design rationales and refining suboptimal outputs. To bridge this gap, we propose integrating Chain-of-Thought (CoT) reasoning into the Natural Language to Visualization (NL2VIS) pipeline. First, we design a comprehensive CoT reasoning process for NL2VIS and develop an automatic pipeline to equip existing datasets with structured reasoning steps. Second, we introduce nvBench-CoT, a specialized dataset capturing detailed step-by-step reasoning from ambiguous natural language descriptions to finalized visualizations, which enables state-of-the-art performance when used for model fine-tuning. Third, we develop DeepVIS, an interactive visual interface that tightly integrates with the CoT reasoning process, allowing users to inspect reasoning steps, identify errors, and make targeted adjustments to improve visualization outcomes. Quantitative benchmark evaluations, two use cases, and a user study collectively demonstrate that our CoT framework effectively enhances NL2VIS quality while providing insightful reasoning steps to users.
Zihan Fang, Zhiyong Xu, Lan Du, Shide Du, Zhiling Cai, Shiping Wang
Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
Marc Hölle, Walter Kellermann, Vasileios Belagiannis
Semantic segmentation models trained on known object classes often fail in
real-world autonomous driving scenarios by confidently misclassifying unknown
objects. While pixel-wise out-of-distribution detection can identify unknown
objects, existing methods struggle in complex scenes where rare object classes
are often confused with truly unknown objects. We introduce an
uncertainty-aware likelihood ratio estimation method that addresses these
limitations. Our approach uses an evidential classifier within a likelihood
ratio test to distinguish between known and unknown pixel features from a
semantic segmentation model, while explicitly accounting for uncertainty.
Instead of producing point estimates, our method outputs probability
distributions that capture uncertainty from both rare training examples and
imperfect synthetic outliers. We show that by incorporating uncertainty in this
way, outlier exposure can be leveraged more effectively. Evaluated on five
standard benchmark datasets, our method achieves the lowest average false
positive rate (2.5%) among state-of-the-art methods while maintaining high average
precision (90.91%) and incurring only negligible computational overhead. Code
is available at https://github.com/glasbruch/ULRE.
Authors' comments: Accepted at ICCVW 2025, 11 pages, 4 figures
Yuan-Cheng Yu, Yen-Chieh Ouyang, Chun-An Lin
Time-series anomaly detection plays a central role across a wide range of
application domains. With the increasing proliferation of the Internet of
Things (IoT) and smart manufacturing, time-series data has dramatically
increased in both scale and dimensionality. This growth has exposed the
limitations of traditional statistical methods in handling the high
heterogeneity and complexity of such data. Inspired by the recent success of
large language models (LLMs) in multimodal tasks across language and vision
domains, we propose a novel unsupervised anomaly detection framework: A
Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly
Detection (TriP-LLM). TriP-LLM integrates local and global temporal features
through a tri-branch design (Patching, Selection, and Global) to encode the input
time series into patch-wise tokens, which are then processed by a frozen,
pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from
which anomaly scores are derived. We evaluate TriP-LLM on several public
benchmark datasets using PATE, a recently proposed threshold-free evaluation
metric, and conduct all comparisons within a unified open-source framework to
ensure fairness. Experimental results show that TriP-LLM consistently
outperforms recent state-of-the-art methods across all datasets, demonstrating
strong detection capabilities. Furthermore, through extensive ablation studies,
we verify the substantial contribution of the LLM to the overall architecture.
Compared to LLM-based approaches using Channel Independence (CI) patch
processing, TriP-LLM achieves significantly lower memory consumption, making it
more suitable for GPU memory-constrained environments. All code and model
checkpoints are publicly available at https://github.com/YYZStart/TriP-LLM.git
Authors' comments: 11 pages, 2 figures
Zhihui Guo, Xin Man, Hui Xu, Jie Shao
Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach that enhances generation consistency through hierarchical modulation and multi-layer fusion. LISA leverages the functional hierarchy within MLLMs, where shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, zone-specific spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\mathrm{CHAIR}_I$ and improves POPE F1 by 4.5%, demonstrating strong generalization across models and tasks.
Guangteng Liu, Xiayue Liu, Zhixiang Xu, Yufeng Yuan, Hui Zhao, Yuxuan Liu, Yufei Jiang
Wi-Fi sensing offers a promising technique for contactless human respiration
monitoring. A key challenge, however, is the blind spot problem caused by
random phase offsets that corrupt the complementarity of respiratory signals.
To address this challenge, we propose a single-antenna Wi-Fi sensing
(SA-WiSense) framework that improves the accuracy of human respiration
monitoring and is robust against random phase offsets. The proposed SA-WiSense
framework is cost-efficient, as it uses only a single antenna rather than the
multiple antennas used in previous works. The framework is therefore applicable
to the Internet of Things (IoT), where most sensors are equipped with a single
antenna. First, we propose a cross-subcarrier channel state information
(CSI) ratio (CSCR) based blind spot mitigation approach for IoT, where the
ratios of CSI values between subcarriers are leveraged to mitigate
random phase offsets. We prove that the random phase offsets can be cancelled
by the proposed CSCR approach, thereby restoring the inherent complementarity
of signals for blind-spot-free sensing. Second, we propose a genetic
algorithm (GA) based subcarrier selection (GASS) approach by formulating an
optimization problem in terms of the sensing-signal-to-noise ratio (SSNR) of
CSCR between subcarriers. GA is utilized to solve the formulated optimization
problem. We build an experimental testbed using commodity ESP32 microcontrollers.
The proposed methods are validated to achieve a detection rate of 91.2% for
respiration monitoring at distances up to 8.0 meters, substantially more
accurate than the state-of-the-art methods with a single antenna.
Authors' comments: 12 pages, 10 figures
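The cancellation claimed for the CSI ratio in the abstract above can be checked numerically with a minimal sketch. It assumes, for illustration only, that each packet's random phase offset multiplies the CSI of every subcarrier by the same unit-magnitude factor; the channel values and the helper `csi_ratio` are hypothetical and not taken from the paper.

```python
import cmath
import random


def csi_ratio(h_a, h_b):
    """Ratio of the CSI values measured on two subcarriers of the same packet."""
    return h_a / h_b


# Hypothetical offset-free channel values on two subcarriers.
h1_true = complex(0.8, 0.3)
h2_true = complex(0.5, -0.4)

rng = random.Random(0)
ratios = []
for _ in range(5):
    # Assumption: one random phase offset per packet, shared by both subcarriers.
    offset = cmath.exp(-1j * rng.uniform(0, 2 * cmath.pi))
    ratios.append(csi_ratio(h1_true * offset, h2_true * offset))

# Every packet yields the same ratio, because the common factor cancels:
# (h1 * e^{-j*theta}) / (h2 * e^{-j*theta}) = h1 / h2.
```

Under that assumption the ratio is independent of the offset, so a time series of ratios would reflect only changes in the channel itself.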
Minnan Pei, Gang Li, Junwen Si, Zeyu Zhu, Zitao Mo, Peisong Wang, Zhuoran Song, Xiaoyao Liang et al.
3D Gaussian Splatting (3DGS) has emerged as a leading neural rendering technique for high-fidelity view synthesis, prompting the development of dedicated 3DGS accelerators for mobile applications. Through in-depth analysis, we identify two major limitations in the conventional decoupled preprocessing-rendering dataflow adopted by existing accelerators: 1) a significant portion of preprocessed Gaussians are not used in rendering, and 2) the same Gaussian gets repeatedly loaded across different tile renderings, resulting in substantial computational and data movement overhead. To address these issues, we propose GCC, a novel accelerator designed for fast and energy-efficient 3DGS inference. At the dataflow level, GCC introduces: 1) cross-stage conditional processing, which interleaves preprocessing and rendering to dynamically skip unnecessary Gaussian preprocessing; and 2) Gaussian-wise rendering, ensuring that all rendering operations for a given Gaussian are completed before moving to the next, thereby eliminating duplicated Gaussian loading. We also propose an alpha-based boundary identification method to derive compact and accurate Gaussian regions, thereby reducing rendering costs. We implement our GCC accelerator in 28nm technology. Extensive experiments demonstrate that GCC significantly outperforms the state-of-the-art 3DGS inference accelerator, GSCore, in both performance and energy efficiency.
Enze Zhou, Wenjian Li, Wenting Xu, Yuwei Lu, Shangbin Chen, Shaoyang Wang, Gang Zheng, Tianwu Xie et al.
Photon-counting computed tomography (PCCT) has demonstrated significant
advancements in recent years; however, pixel-wise detector response
nonuniformity remains a key challenge, frequently manifesting as ring artifacts
in reconstructed images. Existing correction methods exhibit limited
generalizability in complex multi-material scenarios, such as contrast-enhanced
imaging. This study introduces a Signal-to-Uniformity Error Polynomial
Calibration (STEPC) framework to address this issue. STEPC first fits
multi-energy projections using a 2D polynomial surface to generate ideal
references, then applies a nonlinear multi-energy polynomial model to predict
and correct pixel-wise nonuniformity errors. The model is calibrated using
homogeneous slab phantoms of different materials, including PMMA, aluminum, and
iodinated contrast agents, enabling correction for both non-contrast and
contrast-enhanced imaging. Experiments were performed on a custom Micro-PCCT
system with phantoms and a mouse. Correction performance of STEPC was evaluated
using the mean local standard deviation (MLSD) in the projection domain and the
ring artifact deviation (RAD) on the reconstructed images. STEPC consistently
outperformed existing correction methods in both non-contrast and
contrast-enhanced scenarios. It achieved the lowest MLSD and RAD for both
phantoms and mouse scans. These results indicate that STEPC provides a robust
and practical solution for correcting detector nonuniformity in multi-material
PCCT imaging, which positions it as a promising general-purpose calibration
framework for photon-counting CT systems.
Authors' comments: 10 pages, 12 figures. Submitted to IEEE Transactions on Medical
Imaging
Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears
Crop mapping involves identifying and classifying crop types using spatial
data, primarily derived from remote sensing imagery. This study presents the
first comprehensive review of large-scale, pixel-wise crop mapping workflows,
encompassing both conventional supervised methods and emerging transfer
learning approaches. To identify the optimal supervised crop mapping workflows,
we conducted systematic experiments, comparing six widely adopted satellite
image-based preprocessing methods, alongside eleven supervised pixel-wise
classification models. Additionally, we assessed the synergistic impact of
varied training sample sizes and variable combinations. Moreover, we identified
optimal transfer learning techniques for different magnitudes of domain shift.
The evaluation of the best methods was conducted across five diverse agricultural
sites. Landsat 8 served as the primary satellite data source. Labels come from
CDL trusted pixels and field surveys.
Our findings reveal three key insights. First, fine-scale interval
preprocessing paired with Transformer models consistently delivered optimal
performance for both supervised and transferable workflows. RF offered rapid
training and competitive performance in conventional supervised learning and
direct transfer to similar domains. Second, transfer learning techniques
enhanced workflow adaptability, with UDA being effective for homogeneous crop
classes while fine-tuning remains robust across diverse scenarios. Finally,
workflow choice depends heavily on the availability of labeled samples. With a
sufficient sample size, supervised training typically delivers more accurate
and generalizable results. Below a certain threshold, transfer learning that
matches the level of domain shift is a viable alternative to achieve crop
mapping. Repository:
Best-Practices-for-Large-Scale-Pixel-Wise-Crop-Mapping-and-Transfer-Learning-Workflows
Authors' comments: A review article. 41 pages, 22 figures. Preprint
Malavika Vasist, Paul Mollière, Helena Kühnle, Olivier Absil, Gilles Louppe, Rens Waters, Manuel Güdel, Thomas Henning et al.
Cold brown dwarf atmospheres are good training grounds for analyzing temperate giant planets. WISEP J173835.52+273258.9 (WISE 1738) is an isolated Y0 brown dwarf with a temperature between 350-400 K, at the T-Y transition. While its near-infrared spectrum has been studied, bulk properties and chemistry remain uncertain. We analyze new JWST MIRI medium-resolution spectra (5-18 micron), combined with near-infrared spectra (0.98-2.2 micron) from HST/WFC3 and Gemini/GNIRS, to better constrain WISE 1738's atmosphere and physical parameters. We use Neural Posterior Estimation (NPE) with a cloud-free petitRADTRANS model and evaluate results using posterior checks, coverage, and L-C2ST diagnostics. Our retrieval confirms previous constraints on H2O, CH4, and NH3, and for the first time constrains CO, CO2, and 15NH3. We find evidence of disequilibrium chemistry through CO and CO2 abundances not expected under equilibrium. Estimated properties are temperature 402 (+12,-9) K, log g 4.43 (+0.26,-0.34) cm/s2, mass 13 (+11,-7) M_Jup, radius 1.14 (+0.03,-0.03) R_Jup, and bolometric luminosity -6.52 (+0.05,-0.04) log L/L_sun. Evolutionary models suggest an age between 1 and 4 Gyr, consistent with a 6-hour rotation. We place an upper bound on 15NH3, implying a 3-sigma lower limit on the 14N/15N ratio of 275. We also derive a C/O ratio of 1.35 (+0.39,-0.31) and metallicity of 0.34 (+0.12,-0.11), without accounting for oxygen sequestration.
Osama Hardan, Omar Elshenhabi, Tamer Khattab, Mohamed Mabrok
Vision Mamba models promise transformer-level performance at linear computational cost, but their reliance on serializing 2D images into 1D sequences introduces a critical, yet overlooked, design choice: the patch scan order. In medical imaging, where modalities like brain MRI contain strong anatomical priors, this choice is non-trivial. This paper presents the first systematic study of how scan order impacts MRI segmentation. We introduce Multi-Scan 2D (MS2D), a parameter-free module for Mamba-based architectures that facilitates exploring diverse scan paths without additional computational cost. We conduct a large-scale benchmark of 21 scan strategies on three public datasets (BraTS 2020, ISLES 2022, LGG), covering over 70,000 slices. Our analysis shows conclusively that scan order is a statistically significant factor (Friedman test: $\chi^{2}_{20}=43.9$, $p=0.0016$), with performance varying by as much as 27 Dice points. Spatially contiguous paths (simple horizontal and vertical rasters) consistently outperform disjointed diagonal scans. We conclude that scan order is a powerful, cost-free hyperparameter, and provide an evidence-based shortlist of optimal paths to maximize the performance of Mamba models in medical imaging.
Authors' comments: Submitted to the 2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS)
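To make "scan order" concrete, the sketch below serializes an h-by-w grid of patches into a 1D sequence of (row, column) indices in three ways. It is only an illustration of the idea: the function names are hypothetical, and the paper's 21 strategies and its MS2D module are not reproduced here.

```python
def horizontal_raster(h, w):
    """Row by row, left to right."""
    return [(r, c) for r in range(h) for c in range(w)]


def vertical_raster(h, w):
    """Column by column, top to bottom."""
    return [(r, c) for c in range(w) for r in range(h)]


def diagonal_scan(h, w):
    """Anti-diagonal by anti-diagonal (cells with equal r + c), top to bottom."""
    return [
        (r, s - r)
        for s in range(h + w - 1)
        for r in range(max(0, s - w + 1), min(h, s + 1))
    ]


# Each path visits all patches of a 2 x 3 grid exactly once, in a different order.
paths = {
    "horizontal": horizontal_raster(2, 3),
    "vertical": vertical_raster(2, 3),
    "diagonal": diagonal_scan(2, 3),
}
```

A sequence model would then consume the patch embeddings in the order given by the chosen path, so the same image can yield different 1D inputs.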
Casey Wall, Longwei Wang, Rodrigue Rizk, KC Santosh
Interpreting the decision-making process of Convolutional Neural Networks
(CNNs) is critical for deploying models in high-stakes domains.
Gradient-weighted Class Activation Mapping (Grad-CAM) is a widely used method
for visual explanations, yet it typically focuses on the final convolutional
layer or naively averages across layers, strategies that can obscure
important semantic cues or amplify irrelevant noise. We propose Winsor-CAM, a
novel, human-tunable extension of Grad-CAM that generates robust and coherent
saliency maps by aggregating information across all convolutional layers. To
mitigate the influence of noisy or extreme attribution values, Winsor-CAM
applies Winsorization, a percentile-based outlier attenuation technique. A
user-controllable threshold allows for semantic-level tuning, enabling flexible
exploration of model behavior across representational hierarchies. Evaluations
on standard architectures (ResNet50, DenseNet121, VGG16, InceptionV3) using the
PASCAL VOC 2012 dataset demonstrate that Winsor-CAM produces more interpretable
heatmaps and achieves superior performance in localization metrics, including
intersection-over-union and center-of-mass alignment, when compared to Grad-CAM
and uniform layer-averaging baselines. Winsor-CAM advances the goal of
trustworthy AI by offering interpretable, multi-layer insights with
human-in-the-loop control.
Authors' comments: 15 pages, 10 figures, 7 tables. Submitted to IEEE Transactions on
Pattern Analysis and Machine Intelligence
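Winsorization, mentioned in the abstract above, caps extreme values at a chosen percentile instead of discarding them. The sketch below shows a one-sided (upper) variant on a plain list of scores. How Winsor-CAM applies it (which quantities are winsorized, and whether the cap is one- or two-sided) is not stated in the abstract, so the setup here is an assumption and the helper names are hypothetical.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile of an already sorted list, p in [0, 100]."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def winsorize_upper(values, upper_pct=90.0):
    """Replace every value above the upper_pct percentile with that percentile."""
    cap = percentile(sorted(values), upper_pct)
    return [min(v, cap) for v in values]


# Hypothetical per-layer scores with one extreme outlier.
scores = [1.0, 2.0, 3.0, 4.0, 100.0]
capped = winsorize_upper(scores, upper_pct=75.0)
```

Lowering `upper_pct` attenuates more of the largest values, which is the kind of user-tunable threshold the abstract describes.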
Zihe Yan, Zhuosheng Zhang
Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across 12 types of pop-up perturbations and 4 different model backbones show that LaSM consistently enhances the defense success rate. When combined with prompt-level alerts, LaSM achieves over 98% robustness even under strong inductive attacks. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation.
Authors' comments: 10 pages, 9 figures
Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre
Environmental sound recordings often contain intelligible speech, raising privacy concerns that limit analysis, sharing and reuse of data. In this paper, we introduce a method that renders speech unintelligible while preserving both the integrity of the acoustic scene and the overall audio quality. Our approach involves reversing waveform segments to distort speech content. This process is enhanced through a voice activity detection and speech separation pipeline, which allows for more precise targeting of speech. To demonstrate the effectiveness of the proposed approach, we consider a three-part evaluation protocol that assesses: 1) speech intelligibility using Word Error Rate (WER), 2) sound source detectability using Sound source Classification Accuracy-Drop (SCAD) from a widely used pre-trained model, and 3) audio quality using the Fréchet Audio Distance (FAD), computed with our reference dataset that contains unaltered speech. Experiments on this simulated evaluation dataset, which consists of linear mixtures of speech and environmental sound scenes, show that our method achieves satisfactory speech intelligibility reduction (97.9% WER), minimal degradation of sound source detectability (2.7% SCAD), and high perceptual quality (FAD of 1.40). An ablation study further highlights the contribution of each component of the pipeline. We also show that incorporating random splicing into our speech content privacy enforcement method can enhance the algorithm's robustness to attempts to recover the clean speech, at a slight cost in audio quality.
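The core operation described above, reversing waveform segments, can be sketched on a plain list of samples. The fixed segment length and the function name are assumptions for illustration; the voice activity detection, speech separation, and random splicing stages of the pipeline are not modeled.

```python
def reverse_segments(samples, segment_len):
    """Split samples into consecutive segments and reverse each one in place.

    The last segment may be shorter than segment_len; it is reversed as well.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    out = []
    for start in range(0, len(samples), segment_len):
        out.extend(reversed(samples[start:start + segment_len]))
    return out


# Toy "waveform": seven samples, reversed in segments of three.
waveform = [1, 2, 3, 4, 5, 6, 7]
scrambled = reverse_segments(waveform, 3)
```

With fixed, known segment boundaries this sketch is its own inverse: applying it twice with the same `segment_len` returns the original samples.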
A. Bochkov
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
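The merging step described above, averaging the output logits of separately trained experts, can be sketched for a single next-token prediction. The sketch assumes the experts share one vocabulary, so that their logit vectors align index by index; the logit values and function names are hypothetical.

```python
import math


def average_logits(logits_per_expert):
    """Element-wise mean of the experts' logit vectors over a shared vocabulary."""
    n = len(logits_per_expert)
    return [sum(column) / n for column in zip(*logits_per_expert)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from two experts over a three-token vocabulary.
expert_a = [2.0, 0.0, -1.0]
expert_b = [0.0, 4.0, 1.0]
merged = average_logits([expert_a, expert_b])
probs = softmax(merged)
predicted_token = max(range(len(probs)), key=probs.__getitem__)
```

No weights are changed in this scheme: each expert runs as trained, and only their outputs are combined.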