Gaurav Kumar Sharma
In this study, we perform a systematic analysis of the JARVIS-DFT bandgap dataset and identify and remove descriptors that may inadvertently encode band-structure information, such as effective masses. This process yields a curated, leakage-controlled subset of 2280 materials. Using this dataset, a three-phase modeling framework is implemented that incrementally incorporates basic physical descriptors, engineered features, and compositional attributes. The results show that tree-based models achieve R2 values of approximately 0.88 to 0.90 across all phases, indicating that expanding the descriptor space does not substantially improve predictive accuracy when leakage is controlled. SHAP analysis consistently identifies the dielectric tensor components as the dominant contributors. This work provides a curated dataset and baseline performance metrics for future leakage-aware bandgap prediction studies.
Authors' comments: 21 pages, 11 figures
Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present \textbf{SDAR-VL}, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an \emph{integrated framework for efficient and stable training}. This framework unifies three components: (1) \textbf{Asynchronous Block-wise Noise Scheduling} to diversify supervision within each batch; (2) \textbf{Effective Mask Ratio Scaling} for unbiased loss normalization under stochastic masking; and (3) a \textbf{Progressive Beta Noise Curriculum} that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves \emph{training efficiency}, \emph{convergence stability}, and \emph{task performance} over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, Huaizu Jiang
Recent feed-forward reconstruction models like VGGT and $π^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction %quality with offline models while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: $\href{https://neu-vi.github.io/LASER/}{\texttt{https://neu-vi.github.io/LASER/}}$
Authors' comments: 16 pages
Li Xia
Decentralized federated learning (DFL) faces critical challenges from statistical heterogeneity and communication overhead. While NTK-based methods achieve faster convergence, transmitting full Jacobian matrices is impractical for bandwidth-constrained edge networks. We propose SPARK, synergistically integrating random projection-based Jacobian compression, stage-wise annealed distillation, and Nesterov momentum acceleration. Random projections compress Jacobians while preserving spectral properties essential for convergence. Stage-wise annealed distillation transitions from pure NTK evolution to neighbor-regularized learning, counteracting compression noise. Nesterov momentum accelerates convergence through stable accumulation enabled by distillation smoothing. SPARK achieves 98.7% communication reduction compared to NTK-DFL while maintaining convergence speed and superior accuracy. With momentum, SPARK reaches target performance 3 times faster, establishing state-of-the-art results for communication-efficient decentralized learning and enabling practical deployment in bandwidth-limited edge environments.
Authors' comments: 11 pages, 8 figures
Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang
Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1) - proving it's a strong training-free alternative for complex reasoning.
Andreas Papageorgiou, Paulo Vitor Itaborai, Kostas Blekos, Karl Jansen
This paper introduces quantum circuit methodologies for pointwise multiplication and convolution of complex functions, conceptualized as "processing through encoding". Leveraging known techniques, we describe an approach where multiple complex functions are encoded onto auxiliary qubits. Applying the proposed scheme for two functions $f$ and $g$, their pointwise product $f(x)g(x)$ is shown to naturally form as the coefficients of part of the resulting quantum state. Adhering to the convolution theorem, we then demonstrate how the convolution $f*g$ can be constructed. Similarly to related work, this involves the encoding of the Fourier coefficients $\mathcal{F}[f]$ and $\mathcal{F}[g]$, which facilitates their pointwise multiplication, followed by the inverse Quantum Fourier Transform. We discuss the simulation of these techniques, their integration into an extended \verb|quantumaudio| package for audio signal processing, and present initial experimental validations. This work offers a promising avenue for quantum signal processing, with potential applications in areas such as quantum-enhanced audio manipulation and synthesis.
Authors' comments: Presented at ISQCMC '25: 3rd International Symposium on Quantum Computing and Musical Creativity
Kesheng Chen, Wenjian Luo, Zhenqian Zhu, Yamin Hu, Yiya Xi
Constructing a Pareto set is pivotal for navigating the capability-efficiency trade-offs in Large Language Models (LLMs); however, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the "curse of dimensionality," rendering the search space computationally intractable. To resolve this dichotomy, we propose BAMBO (Bayesian Adaptive Multi-objective Block-wise Optimization), a novel framework that automatically constructs the LLM Pareto set. BAMBO renders the search tractable by introducing a Hybrid Optimal Block Partitioning strategy. Formulated as a 1D clustering problem, this strategy leverages a dynamic programming approach to optimally balance intra-block homogeneity and inter-block information distribution, thereby dramatically reducing dimensionality without sacrificing critical granularity. The entire process is automated within an evolutionary loop driven by the q-Expected Hypervolume Improvement (qEHVI) acquisition function. Experiments demonstrate that BAMBO discovers a superior and more comprehensive Pareto frontier than baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: https://github.com/xin8coder/BAMBO.
Kevin Lee, Pablo Millan Arias
Layer-wise Relevance Propagation (LRP) provides principled attribution for neural networks through conservation properties and foundations in Deep Taylor Decomposition. However, existing implementations operate at the module level, requiring architecture-specific propagation rules and modifications. These limit the generality of target model and sustainability of implementations as architectures evolve. We introduce DynamicLRP, a model-agnostic LRP framework operating at the tensor operation level. By decomposing attribution to individual operations within computation graphs and introducing a novel mechanism for deferred activation resolution, named the Promise System, our approach achieves true architecture agnosticity while maintaining LRP's theoretical guarantees. This design operates independently of backpropagation machinery, enabling operation on arbitrary computation graphs without model modification and side-by-side execution with gradient backpropagation. Being based on computation graphs, this method is theoretically extensible to other deep learning libraries that support auto-differentiation. We demonstrate faithfulness matching or exceeding specialized implementations (1.77 vs 1.69 ABPC on VGG, equivalent performance on ViT, 93.70\% and 95.06\% top-1 attribution accuracy for explaining RoBERTa-large and Flan-T5-large answers on SQuADv2, respectively) while maintaining practical efficiency on models with hundreds of millions of parameters. We achieved 99.92\% node coverage across 31,465 computation graph nodes from 15 diverse architectures, including state-space models (Mamba), audio transformers (Whisper), and multimodal systems (DePlot) without any model-specific code with rules for 47 fundamental operations implemented. Our operation-level decomposition and Promise System establish a sustainable, extensible foundation for LRP across evolving architectures.
Authors' comments: Work in progress, (12 pages manuscript, 6 figures, 6 tables, 3 pages references, 14 pages appendix)
Yebo Wu, Jingguang Li, Zhijiang Guo, Li Li
Federated fine-tuning offers a promising solution for adapting Large Language Models (LLMs) to downstream tasks while safeguarding data privacy. However, its high computational and communication demands hinder its deployment on resource-constrained devices. In this paper, we propose SmartFed, a resource-efficient federated fine-tuning framework. SmartFed intelligently reuses knowledge embedded in existing LoRA modules, eliminating the need for expensive training from scratch when adapting LLMs to new tasks. To effectively exploit this knowledge and ensure scalability, we introduce the Mixture of Rank-Wise Experts (MoRE). MoRE decomposes LoRA modules into fine-grained rank-level experts. These experts are selectively activated and combined based on input semantics and resource budgets. Moreover, to optimize resource utilization, we present the Elastic Expert Quota Allocation (EEQA). EEQA adaptively allocates expert capacity across parameter matrices based on their contribution to model performance, focusing computing resources on the critical experts. Extensive evaluations across multiple benchmarks demonstrate that SmartFed significantly outperforms existing methods in model performance and training efficiency.
Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin
Aspect sentiment quad prediction (ASQP) is inherently challenging to predict a structured quadruple with four core sentiment elements, including aspect term (a), aspect category (c), opinion term (o), and sentiment polarity (s). Prior methods relying on marker-based prediction struggle with modeling the intricate relationships among elements and experience sharp performance declines when predicting higher-order elements (e.g., c and s) under standard supervised fine-tuning. To address these limitations, we employ reasoning-based generation to output both the quadruple and a natural language rationale under element prefixes within a unified template, encouraging explicit relational reasoning and interpretability. To further enhance element-wise alignment, we introduce a listwise preference optimization framework for improving structural validity and relational coherence. Specifically, we generate element-wise confusable candidates via syntactic and semantic proximity, then train the model with listwise objectives to prefer the gold candidates over closely competing alternatives. Extensive experiments on four benchmark datasets demonstrate that our framework effectively improves quadruple prediction accuracy and explanation consistency.
Authors' comments: 11 pages, 7 figures, and 6 tables
Amirreza Zamani, Ayfer Özgür, Mikael Skoglund
In this paper, we study an information-theoretic problem of designing a fair representation under a bounded point-wise statistical (demographic) parity constraint. More specifically, an agent uses some useful data (database) $X$ to solve a task $T$. Since both $X$ and $T$ are correlated with some latent sensitive attribute or secret $S$, the agent designs a representation $Y$ that satisfies a bounded point-wise statistical parity, that is, such that for all realizations of the representation $y\in\cal Y$, we have $χ^2(P_{S|y};P_S)\leq ε$. In contrast to our previous work, here we use the point-wise measure instead of a bounded mutual information, and we assume that the agent has no direct access to $S$ and $T$; hence, the Markov chains $S - X - Y$ and $T - X - Y$ hold. In this work, we design $Y$ that maximizes the mutual information $I(Y;T)$ about the task while satisfying a bounded compression rate constraint, that is, ensuring that $I(Y;X) \leq r$. Finally, $Y$ satisfies the point-wise bounded statistical parity constraint $χ^2(P_{S|y};P_S)\leq ε$. When $ε$ is small, concepts from information geometry allow us to locally approximate the KL-divergence and mutual information. To design the representation $Y$, we utilize this approximation and show that the main complex fairness design problem can be rewritten as a quadratic optimization problem that has simple closed-form solution under certain constraints. For the cases where the closed-form solution is not obtained we obtain lower bounds with low computational complexity. Here, we provide simple fairness designs with low complexity which are based on finding the maximum singular value and singular vector of a matrix. Finally, in a numerical example we compare our obtained results with the optimal solution.
Kyeongha Rho, Hyeongkeun Lee, Jae Won Cho, Joon Son Chung
In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at every transformer layer with a parallel, lightweight scheme that extracts and fuses layer-wise tokens only from the late layers. We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in a layer-wise manner. A token fusion module then dynamically fuses these layer-wise tokens by taking into account their relative significance. To prevent the redundancy of latent tokens, we apply an orthogonality regularization between latent tokens during training. Through the systematic analysis of the position of adaptation in the pre-trained transformers, we extract latent tokens only from the late layers of the transformers. This strategic adaptation approach avoids error propagation from the volatile early-layer features, thereby maximizing the adaptation performance while maintaining parameter and memory efficiency. Through extensive experiments, we demonstrate that MoLT outperforms existing methods on diverse audio-visual benchmarks, including Audio-Visual Question Answering, Audio-Visual Segmentation, and Audio-Visual Event Localization.
Authors' comments: 10 pages, 5 figures
Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng, Wentao Mo, Peiqiang Wang, Nan Zhang
In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
Jungyeon Koh, Hyeonho Noh, Hyun Jong Yang
In this letter, we propose a group-wise semantic splitting multiple access framework for multi-user semantic communication in downlink scenarios. The framework begins by applying a balanced clustering mechanism that groups users based on the similarity of their semantic characteristics, enabling the extraction of group-level common features and user-specific private features. The base station then transmits the common features via multicast and the private features via unicast, effectively leveraging both shared and user-dependent semantic information. To further enhance semantic separability and reconstruction fidelity, we design a composite loss function that integrates a reconstruction loss with a repulsion loss, improving both the accuracy of semantic recovery and the distinctiveness of common embeddings in the latent space. Simulation results demonstrate that the proposed method achieves up to 3.26% performance improvement over conventional schemes across various channel conditions, validating its robustness and semantic efficiency for next-generation wireless networks.
Huiyun Tang, Long Feng, Yang Li, Feifei Wang
Vertical Federated Learning (VFL) often suffers from client-wise missingness, where entire feature blocks from some clients are unobserved, and conventional approaches are vulnerable to privacy leakage. We propose a Gaussian copulabased framework for VFL data privatization under missingness constraints, which requires no prior specification of downstream analysis tasks and imposes no restriction on the number of analyses. To privately estimate copula parameters, we introduce a debiased randomized response mechanism for correlation matrix estimation from perturbed ranks, together with a nonparametric privatized marginal estimation that yields consistent CDFs even under MAR. The proposed methods comprise VCDS for MCAR data, EVCDS for MAR data, and IEVCDS, which iteratively refines copula parameters to mitigate MAR-induced bias. Notably, EVCDS and IEVCDS also apply under MCAR, and the framework accommodates mixed data types, including discrete variables. Theoretically, we introduce the notion of Vertical Distributed Attribute Differential Privacy (VDADP), tailored to the VFL setting, establish corresponding privacy and utility guarantees, and investigate the utility of privatized data for GLM coefficient estimation and variable selection. We further establish asymptotic properties including estimation and variable selection consistency for VFL-GLMs. Extensive simulations and a real-data application demonstrate the effectiveness of the proposed framework.
Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, Zsolt Kira
Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
Cuong Pham, Hoang Anh Dung, Cuong C. Nguyen, Trung Le, Gustavo Carneiro, Thanh-Toan Do
Large language models (LLMs) have significantly advanced natural language processing, but their massive parameter counts create substantial computational and memory challenges during deployment. Post-training quantization (PTQ) has emerged as a promising approach to mitigate these challenges with minimal overhead. While existing PTQ methods can effectively quantize LLMs, they experience substantial accuracy loss at extremely low bit-widths, primarily due to high-impact parameters that significantly influence quantization performance. Several approaches address these issues by identifying and retaining the high-impact parameters in FP16 format. However, they apply fixed ratios of high-impact parameters across all layers, overlooking layer-wise sensitivity variations. In this paper, we propose a quadratic optimization framework that determines layer-specific ratios of high-impact parameters while considering inter-layer dependencies. We quantize high-impact parameters to moderate bit-widths, which often result in negligible performance degradation in quantized LLMs, while the remaining parameters can be quantized to extremely low bit-widths. Under the same resource-constrained budget, this allows for preserving more high-impact parameters than methods that keep selecting a few in FP16 format. Additionally, the proposed framework allows us to leverage an advanced quantization method that often requires extensive learnable parameters solely for high-impact parameters, while applying a computationally efficient method to the rest. Our approach achieves an effective balance between computational efficiency and model accuracy while maintaining high performance compared to state-of-the-art methods.
Luc Bouteille, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen, Lukas Heine
Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, albeit at the cost of slightly more false positives. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.
Authors' comments: 5 pages, 2 figures, 2 tables
Edoardo Colombo, Vasil Dimitrov, Dario Martelli, Alberto Zaffaroni
In this paper, using equivariant localization for foliations, we compute the on-shell action of a general class of supersymmetric solutions of five-dimensional gauged supergravity with vector multiplets. Unlike previous literature, we also allow for a non-trivial topology for the space-time as well as for the gauge fields. In practice, we achieve this by covering the spacetime manifold with patches and localizing in each patch. We derive a general formula for the on-shell action that depends on topological data only and can be used without a detailed knowledge of the solution. Our final result is relevant for the physically interesting examples of multi-center black holes, black rings and black lenses, topological solitons and Euclidean black saddles. We also show how to connect the topological data with the thermodynamic data: electrostatic potential, the magnetic fluxes and the possible flat connections of the solutions. Our formula for the on-shell action is derived in the context of gauged supergravity, but it is straightforward to take the ungauged limit. Thus, we reproduce known results for a large class of explicit asymptotically flat supersymmetric solutions, and we provide predictions for AdS solutions still to be found.
Authors' comments: 88 pages, 2 figures
Ali Abu-Nada, Aryan Iliat, Russell Ceballo
Amplitude damping fundamentally limits qubit lifetimes by irreversibly leaking energy and information into the environment. Standard Wiseman--Milburn feedback offers only modest improvement because it acts on a single measured quadrature and its corrective drive is degraded by loop delay. We introduce a compact hybrid upgrade with two components: (i) a coherently coupled \emph{ancilla} qubit that receives the homodyne current and feeds back \emph{quantum-coherently} on the system, recovering information from \emph{both} field quadratures and intentionally engineered to decay much faster than the system; and (ii) a lightweight supervised predictor that forecasts the near-future homodyne current, phase-aligning the correction to overcome hardware latency. A Lindblad treatment yields closed-form effective decay rates: the ancilla suppresses the emission channel by a cooperativity factor, while the predictor further suppresses the residual decay in proportion to forecast quality. Using IBM-scale parameters (baseline \(T_1 = 50~μ\mathrm{s}\)), numerical simulations surpass the W--M limit, achieving \(\sim 3\!-\!4\times\) longer \(T_1\) together with improved population retention and integrated energy. The method is modular and hardware-compatible: ancilla coupling and supervised prediction can be added to existing W--M loops to convert leaked information into a precise, time-advanced corrective drive. We also include a detailed, student-friendly derivation of the effective rates for both ancilla-assisted and prediction-enhanced feedback, making the impact of each design element analytically transparent.
Authors' comments: 13 pages, 8 figures. Intended for submission to IEEE QCNC 2026. Comments welcome