Xiaozheng Jiang, Wei Zhang, Xuerui Mao
Detecting tiny objects in remote sensing (RS) imagery has been a long-standing challenge due to their extremely limited spatial information, weak feature representations, and dense distributions across complex backgrounds. Despite numerous efforts devoted, mainstream detectors still underperform in such scenarios. To bridge this gap, we introduce RS-TinyNet, a multi-stage feature fusion and enhancement model explicitly tailored for RS tiny object detection in various RS scenarios. RS-TinyNet comes with two novel designs: tiny object saliency modeling and feature integrity reconstruction. Guided by these principles, we design three step-wise feature enhancement modules. Among them, the multi-dimensional collaborative attention (MDCA) module employs multi-dimensional attention to enhance the saliency of tiny objects. Additionally, the auxiliary reversible branch (ARB) and a progressive fusion detection head (PFDH) module are introduced to preserve information flow and fuse multi-level features to bridge semantic gaps and retain structural detail. Comprehensive experiments on public RS dataset AI-TOD show that our RS-TinyNet surpasses existing state-of-the-art (SOTA) detectors by 4.0% AP and 6.5% AP75. Evaluations on DIOR benchmark dataset further validate its superior detection performance in diverse RS scenarios. These results demonstrate that the proposed multi-stage feature fusion strategy offers an effective and practical solution for tiny object detection in complex RS environments.
Authors' comments: The content of the thesis requires supplementation to make it more substantial
Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae
KV caching significantly improves the efficiency of Large Language Model
(LLM) inference by storing attention states from previously processed tokens,
enabling faster generation of subsequent tokens. However, as sequence length
increases, the KV cache quickly becomes a major memory bottleneck. To address
this, we propose PagedEviction, a novel fine-grained, structured KV cache
pruning strategy that enhances the memory efficiency of vLLM's PagedAttention.
Unlike existing approaches that rely on attention-based token importance or
evict tokens across different vLLM pages, PagedEviction introduces an efficient
block-wise eviction algorithm tailored for paged memory layouts. Our method
integrates seamlessly with PagedAttention without requiring any modifications
to its CUDA attention kernels. We evaluate PagedEviction across
Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models
on the LongBench benchmark suite, demonstrating improved memory usage with
better accuracy than baselines on long context tasks.
Authors' comments: Preprint
Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang
Layer-wise Gaussian mechanisms (LGM) enhance flexibility in differentially private deep learning by injecting noise into partitioned gradient vectors. However, existing methods often rely on heuristic noise allocation strategies, lacking a rigorous understanding of their theoretical grounding in connecting noise allocation to formal privacy-utility tradeoffs. In this paper, we present a unified analytical framework that systematically connects layer-wise noise injection strategies with their implicit optimization objectives and associated privacy budget allocations. Our analysis reveals that several existing approaches optimize ill-posed objectives -- either ignoring inter-layer signal-to-noise ratio (SNR) consistency or leading to inefficient use of the privacy budget. In response, we propose a SNR-Consistent noise allocation strategy that unifies both aspects, yielding a noise allocation scheme that achieves better signal preservation and more efficient privacy budget utilization. Extensive experiments in both centralized and federated learning settings demonstrate that our method consistently outperforms existing allocation strategies, achieving better privacy-utility tradeoffs. Our framework not only offers diagnostic insights into prior methods but also provides theoretical guidance for designing adaptive and effective noise injection schemes in deep models.
Jiequn Han, Kui Ren, Nathan Soedjak
We propose an instance-wise adaptive sampling framework for constructing compact and informative training datasets for supervised learning of inverse problem solutions. Typical learning-based approaches aim to learn a general-purpose inverse map from datasets drawn from a prior distribution, with the training process independent of the specific test instance. When the prior has a high intrinsic dimension or when high accuracy of the learned solution is required, a large number of training samples may be needed, resulting in substantial data collection costs. In contrast, our method dynamically allocates sampling effort based on the specific test instance, enabling significant gains in sample efficiency. By iteratively refining the training dataset conditioned on the latest prediction, the proposed strategy tailors the dataset to the geometry of the inverse map around each test instance. We demonstrate the effectiveness of our approach in the inverse scattering problem under two types of structured priors. Our results show that the advantage of the adaptive method becomes more pronounced in settings with more complex priors or higher accuracy requirements. While our experiments focus on a particular inverse problem, the adaptive sampling strategy is broadly applicable and readily extends to other inverse problems, offering a scalable and practical alternative to conventional fixed-dataset training regimes.
Guanzhong Hu, Wenpan Li, Rujing Zha, Ping Guo
Directed energy deposition (DED), a metal additive manufacturing process, is highly susceptible to process-induced defects such as geometric deviations, lack of fusion, and poor surface finish. This work presents a build-height-synchronized fringe projection system for in-situ, layer-wise surface reconstruction of laser-DED components, achieving a reconstruction accuracy of ${\pm}$46 $μ$m. From the reconstructed 3D morphology, two complementary geometry-based point cloud metrics are introduced: local point density, which highlights poor surface finish, and normal-change rate, which identifies lack-of-fusion features. These methods enable automated, annotation-free identification of common deposition anomalies directly from reconstructed surfaces, without the need for manual labeling. By directly linking geometric deviation to defect formation, the approach enables precise anomaly localization and advances the feasibility of closed-loop process control. This work establishes fringe projection as a practical tool for micrometer-scale monitoring in DED, bridging the gap between process signatures and part geometry for certifiable additive manufacturing.
Authors' comments: 26 pages, 15 figures
Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.
Authors' comments: Accepted by EMNLP 2025
Sweta Kaman, Ankita Sharma, Romi Banerjee
Background: Wisdom is a superordinate construct that embraces perspective
taking, reflectiveness, prosocial orientation, reflective empathetic action,
and intellectual humility. Unlike conventional models of reasoning that are
rigidly bound by binary thinking, wisdom unfolds in shades of ambiguity,
requiring both graded evaluation and self-reflective humility. Current measures
depend on self-reports and seldom reflect the humility and uncertainty inherent
in wise reasoning. A computational framework that takes into account both
multidimensionality and confidence has the potential to improve psychological
science and allow humane AI. Method: We present a fuzzy inference system with Z
numbers, each of the decisions being expressed in terms of a wisdom score
(restriction) and confidence score (certainty). As part of this study,
participants (N = 100) were exposed to culturally neutral pictorial moral
dilemma tasks to which they generated think-aloud linguistic responses, which
were mapped into five theoretically based components of wisdom. The scores of
each individual component were combined using a base of 21 rules, with
membership functions tuned via Gaussian kernel density estimation. Results: In
a proof of concept study, the system produced dual attribute wisdom
representations that correlated modestly but significantly with established
scales while showing negligible relations with unrelated traits, supporting
convergent and divergent validity. Contribution: The contribution is to
formalize wisdom as a multidimensional, uncertainty-conscious construct,
operationalized in the form of Z-numbers. In addition to progressing
measurement in psychology, it calculates how fuzzy Z numbers can provide AI
systems with interpretable, confidence-sensitive reasoning that affords a safe,
middle ground between rigorous computation and human-like judgment.
Authors' comments: total 17 pages, main manuscript 12 pages, supplementary 5 pages, 6
tables in main manuscript, 5 figures in main manuscript, 2 tables in
supplementary, and 3 figures in supplementary
Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Generative models have been successfully used in the field of time series
generation. However, when dealing with long-term time series, which span over
extended periods and exhibit more complex long-term temporal patterns, the task
of generation becomes significantly more challenging. Long-term time series
exhibit long-range temporal dependencies, but their data distribution also
undergoes gradual changes over time. Finding a balance between these long-term
dependencies and the drift in data distribution is a key challenge. On the
other hand, long-term time series contain more complex interrelationships
between different feature sequences, making the task of effectively capturing
both intra-sequence and inter-sequence dependencies another important
challenge. To address these issues, we propose Stage-Diff, a staged generative
model for long-term time series based on diffusion models. First, through
stage-wise sequence generation and inter-stage information transfer, the model
preserves long-term sequence dependencies while enabling the modeling of data
distribution shifts. Second, within each stage, progressive sequence
decomposition is applied to perform channel-independent modeling at different
time scales, while inter-stage information transfer utilizes multi-channel
fusion modeling. This approach combines the robustness of channel-independent
modeling with the information fusion advantages of multi-channel modeling,
effectively balancing the intra-sequence and inter-sequence dependencies of
long-term time series. Extensive experiments on multiple real-world datasets
validate the effectiveness of Stage-Diff in long-term time series generation
tasks.
Authors' comments: 8 pages, 5 figures
Subham Kutum, Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Mahesh Chandra Govil
Numerous methods have been proposed to enhance Keyword Spotting (KWS) in
adult speech, but children's speech presents unique challenges for KWS systems
due to its distinct acoustic and linguistic characteristics. This paper
introduces a zero-shot KWS approach that leverages state-of-the-art
self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec.
Features are extracted layer-wise from these SSL models and used to train a
Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for
training, while the PFSTAR children's speech dataset was used for testing,
demonstrating the zero-shot capability of our method. Our approach achieved
state-of-the-art results across all keyword sets for children's speech.
Notably, the Wav2Vec2 model, particularly layer 22, performed the best,
delivering an ATWV score of 0.691, a MTWV score of 0.7003 and probability of
false alarm and probability of miss of 0.0164 and 0.0547 respectively, for a
set of 30 keywords. Furthermore, age-specific performance evaluation confirmed
the system's effectiveness across different age groups of children. To assess
the system's robustness against noise, additional experiments were conducted
using the best-performing layer of the best-performing Wav2Vec2 model. The
results demonstrated a significant improvement over traditional MFCC-based
baseline, emphasizing the potential of SSL embeddings even in noisy conditions.
To further generalize the KWS framework, the experiments were repeated for an
additional CMU dataset. Overall the results highlight the significant
contribution of SSL features in enhancing Zero-Shot KWS performance for
children's speech, effectively addressing the challenges associated with the
distinct characteristics of child speakers.
Authors' comments: Accepted
Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Automatic Speech Recognition (ASR) systems often struggle to accurately
process children's speech due to its distinct and highly variable acoustic and
linguistic characteristics. While recent advancements in self-supervised
learning (SSL) models have greatly enhanced the transcription of adult speech,
accurately transcribing children's speech remains a significant challenge. This
study investigates the effectiveness of layer-wise features extracted from
state-of-the-art SSL pre-trained models - specifically, Wav2Vec2, HuBERT,
Data2Vec, and WavLM in improving the performance of ASR for children's speech
in zero-shot scenarios. A detailed analysis of features extracted from these
models was conducted, integrating them into a simplified DNN-based ASR system
using the Kaldi toolkit. The analysis identified the most effective layers for
enhancing ASR performance on children's speech in a zero-shot scenario, where
WSJCAM0 adult speech was used for training and PFSTAR children speech for
testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model
achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64%
relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of
10.65%). Additionally, age group-wise analysis demonstrated consistent
performance improvements with increasing age, along with significant gains
observed even in younger age groups using the SSL features. Further experiments
on the CMU Kids dataset confirmed similar trends, highlighting the
generalizability of the proposed approach.
Authors' comments: Accepted
Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You
Large-batch training has become a cornerstone in accelerating the training of
deep neural networks, yet it poses challenges in optimization and
generalization. Existing optimizers like AdamW present performance degradation
during language models' large-batch training, due to the information bottleneck
in attention layers caused by the sharp increase of max attention logit. While
the LAMB optimizer partially addresses this issue, some attention layers still
face this issue. The reason is that $l_2$-norm-based trust ratios in LAMB are
less effective in directly influencing the max value of query/key weights.
Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks
relationships of weight values within rows or columns. Building on these
observations, we propose a novel optimizer, MERIT, which leverages the max-norm
to calculate the trust ratio to constrain the max attention logit more
effectively. Moreover, we further construct element-wise trust ratios to
provide more robust update scaling by focusing on local weight structures.
Extensive experiments of large-batch training across various sizes of GPT-2
models demonstrate the superior performance of MERIT. Notably, during the
training of GPT-2 Medium, MERIT enables a 6k batch size without any performance
degradation compared to the standard batch size (480) with 48B training tokens.
This work highlights the importance of considering the max attention logit and
finer-granularity trust ratio in large-batch training. It successfully improves
the training stability and paves the way for larger batch usage, enabling
faster development and iteration of large language models. Code is available at
https://github.com/NUS-HPC-AI-Lab/MERIT.
Authors' comments: ICML 2025
Tiandi Ye, Wenyan Liu, Kai Yao, Lichun Li, Shangchao Su, Cen Chen, Xiang Li, Shan Yin et al.
Federated learning (FL) is a privacy-preserving machine learning paradigm
that enables collaborative model training across multiple distributed clients
without disclosing their raw data. Personalized federated learning (pFL) has
gained increasing attention for its ability to address data heterogeneity.
However, most existing pFL methods assume that each client's data follows a
single distribution and learn one client-level personalized model for each
client. This assumption often fails in practice, where a single client may
possess data from multiple sources or domains, resulting in significant
intra-client heterogeneity and suboptimal performance. To tackle this
challenge, we propose pFedBayesPT, a fine-grained instance-wise pFL framework
based on visual prompt tuning. Specifically, we formulate instance-wise prompt
generation from a Bayesian perspective and model the prompt posterior as an
implicit distribution to capture diverse visual semantics. We derive a
variational training objective under the semi-implicit variational inference
framework. Extensive experiments on benchmark datasets demonstrate that
pFedBayesPT consistently outperforms existing pFL methods under both feature
and label heterogeneity settings.
Authors' comments: Accepted by CIKM2025
Md Anwar Hossen, Fatema Siddika, Wensheng Zhang, Anuj Sharma, Ali Jannesari
Heterogeneous Federated Learning (HFL) has gained attention for its ability
to accommodate diverse models and heterogeneous data across clients.
Prototype-based HFL methods emerge as a promising solution to address
statistical heterogeneity and privacy challenges, paving the way for new
advancements in HFL research. This method focuses on sharing only
class-representative prototypes among heterogeneous clients. However, these
prototypes are often aggregated on the server using weighted averaging, leading
to sub-optimal global knowledge; these cause the shrinking of aggregated
prototypes, which negatively affects the model performance in scenarios when
models are heterogeneous and data distributions are extremely non-IID. We
propose FedProtoKD in a Heterogeneous Federated Learning setting, using an
enhanced dual-knowledge distillation mechanism to improve the system
performance with clients' logits and prototype feature representation. We aim
to resolve the prototype margin-shrinking problem using a contrastive
learning-based trainable server prototype by leveraging a class-wise adaptive
prototype margin. Furthermore, we assess the importance of public samples using
the closeness of the sample's prototype to its class representative prototypes,
which enhances learning performance. FedProtoKD achieved average improvements
of 1.13% up to 34.13% accuracy across various settings and significantly
outperforms existing state-of-the-art HFL methods.
Authors' comments: 12 pages, 6 figures
Domenico De Cristofaro, Vincenzo Norman Vitale, Alessandro Vietti
Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
Kejian Zhang, Muxuan Liang, Robert Maile, Doudou Zhou
Multi-source and multi-modal datasets are increasingly common in scientific research, yet they often exhibit block-wise missingness, where entire data sources or modalities are systematically absent for subsets of subjects. This structured form of missingness presents significant challenges for statistical analysis, particularly for two-sample hypothesis testing. Standard approaches such as imputation or complete-case analysis can introduce bias or result in substantial information loss, especially when the missingness mechanism is not random. To address this methodological gap, we propose the Block-Pattern Enhanced Test (BPET), a general framework for two-sample testing that directly accounts for block-wise missingness without requiring imputation or deletion of observations. As a concrete instantiation, we develop the Block-wise Rank In Similarity graph Edge-count (BRISE) test, which extends rank-based similarity graph methods to settings with block-wise missing data. Under mild conditions, we establish that the null distribution of BRISE converges to a chi-squared distribution. Simulation studies show that BRISE consistently controls the type I error rate and achieves good statistical power under a wide range of alternatives. Applications to two real-world datasets with block-wise missingness further demonstrate the practical utility of our method in identifying meaningful distributional differences.
Austin Braniff, Yuhe Tian
This work presents a novel reinforcement learning (RL) algorithm based on Y-wise Affine Neural Networks (YANNs). YANNs provide an interpretable neural network which can exactly represent known piecewise affine functions of arbitrary input and output dimensions defined on any amount of polytopic subdomains. One representative application of YANNs is to reformulate explicit solutions of multi-parametric linear model predictive control. Built on this, we propose the use of YANNs to initialize RL actor and critic networks, which enables the resulting YANN-RL control algorithm to start with the confidence of linear optimal control. The YANN-actor is initialized by representing the multi-parametric control solutions obtained via offline computation using an approximated linear system model. The YANN-critic represents the explicit form of the state-action value function for the linear system and the reward function as the objective in an optimal control problem (OCP). Additional network layers are injected to extend YANNs for nonlinear expressions, which can be trained online by directly interacting with the true complex nonlinear system. In this way, both the policy and state-value functions exactly represent a linear OCP initially and are able to eventually learn the solution of a general nonlinear OCP. Continuous policy improvement is also implemented to provide heuristic confidence that the linear OCP solution serves as an effective lower bound to the performance of RL policy. The YANN-RL algorithm is demonstrated on a clipped pendulum and a safety-critical chemical-reactive system. Our results show that YANN-RL significantly outperforms the modern RL algorithm using deep deterministic policy gradient, especially when considering safety constraints.
Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas
Code completion entails the task of providing missing tokens given a
surrounding context. It can boost developer productivity while providing a
powerful code discovery tool. Following the Large Language Model (LLM) wave,
code completion has been approached with diverse LLMs fine-tuned on code (code
LLMs). The performance of code LLMs can be assessed with downstream and
intrinsic metrics. Downstream metrics are usually employed to evaluate the
practical utility of a model, but can be unreliable and require complex
calculations and domain-specific knowledge. In contrast, intrinsic metrics such
as perplexity, entropy, and mutual information, which measure model confidence
or uncertainty, are simple, versatile, and universal across LLMs and tasks, and
can serve as proxies for functional correctness and hallucination risk in
LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when
generating code by measuring code perplexity across programming languages,
models, and datasets using various LLMs, and a sample of 1008 files from 657
GitHub projects. We find that strongly-typed languages exhibit lower perplexity
than dynamically typed languages. Scripting languages also demonstrate higher
perplexity. Perl appears universally high in perplexity, whereas Java appears
low. Code perplexity depends on the employed LLM, but not on the code dataset.
Although code comments often increase perplexity, the language ranking based on
perplexity is barely affected by their presence. LLM researchers, developers,
and users can employ our findings to assess the benefits and suitability of
LLM-based code completion in specific software projects based on how language,
model choice, and code characteristics impact model confidence.
Authors' comments: 30 pages, 10 figures
Ninh Tran
Partial conjunction (PC) hypothesis testing is widely used to assess the replicability of scientific findings across multiple comparable studies. In high-throughput meta-analyses, testing a large number of PC hypotheses with k-family-wise error rate (k-FWER) control often suffers from low statistical power due to the multiplicity burden. The state-of-the-art AdaFilter-Bon procedure by Wang et al. (2022, Ann. Stat., 50(4), 1890-1909) alleviates this problem by filtering out hypotheses unlikely to be false before applying a rejection rule. However, a side effect of filtering is that it renders the rejection rule more stringent than necessary, leading to conservative k-FWER control. In this paper, we mitigate this conservativeness - and thereby improve the power of AdaFilter-Bon - by incorporating a post-filter null proportion estimate into the procedure. The resulting method, AdaFilter-AdaBon, has proven asymptotic k-FWER control under weak dependence and demonstrates empirical finite-sample control with higher power than the original AdaFilter-Bon in simulations.
Sebastian Musiał, Bartosz Zieliński, Tomasz Danel
Graph neural networks have demonstrated remarkable success in predicting molecular properties by leveraging the rich structural information encoded in molecular graphs. However, their black-box nature reduces interpretability, which limits trust in their predictions for important applications such as drug discovery and materials design. Furthermore, existing explanation techniques often fail to reliably quantify the contribution of individual atoms or substructures due to the entangled message-passing dynamics. We introduce SEAL (Substructure Explanation via Attribution Learning), a new interpretable graph neural network that attributes model predictions to meaningful molecular subgraphs. SEAL decomposes input graphs into chemically relevant fragments and estimates their causal influence on the output. The strong alignment between fragment contributions and model predictions is achieved by explicitly reducing inter-fragment message passing in our proposed model architecture. Extensive evaluations on synthetic benchmarks and real-world molecular datasets demonstrate that SEAL outperforms other explainability methods in both quantitative attribution metrics and human-aligned interpretability. A user study further confirms that SEAL provides more intuitive and trustworthy explanations to domain experts. By bridging the gap between predictive performance and interpretability, SEAL offers a promising direction for more transparent and actionable molecular modeling.
Hyun-Jic Oh, Junsik Kim, Zhiyi Shi, Yichen Wu, Yu-An Chen, Peter K. Sorger, Hanspeter Pfister, Won-Ki Jeong
Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.