Runlin Zhou, Letian Li, Zemin Zheng
We study personalized federated learning for multivariate responses where client models are heterogeneous yet share variable-level structure. Existing entry-wise penalties ignore cross-response dependence, while matrix-wise fusion over-couples clients. We propose a Sparse Row-wise Fusion (SROF) regularizer that clusters row vectors across clients and induces within-row sparsity, and we develop RowFed, a communication-efficient federated algorithm that embeds SROF into a linearized ADMM framework with privacy-preserving partial participation. Theoretically, we establish an oracle property for SROF (correct variable-level group recovery with asymptotic normality) and prove convergence of RowFed to a stationary solution. Under random client participation, the iterate gap contracts at a rate that improves with participation probability. Empirically, simulations in heterogeneous regimes show that RowFed consistently lowers estimation and prediction error and strengthens variable-level cluster recovery over NonFed, FedAvg, and a personalized matrix-fusion baseline. A real-data study further corroborates these gains while preserving interpretability. Together, our results position row-wise fusion as an effective and transparent paradigm for large-scale personalized federated multivariate learning, bridging the gap between entry-wise and matrix-wise formulations.
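To make the idea of a row-wise fusion penalty concrete, the following is a minimal sketch and not the SROF regularizer itself. It assumes a plain Euclidean norm on the difference between matching rows of each pair of client matrices and a plain L1 term for within-row sparsity; the paper's actual penalty functions and weights may differ, and `row_fusion_penalty`, `lam_fuse`, and `lam_sparse` are hypothetical names.

```python
import math


def row_fusion_penalty(coefs, lam_fuse, lam_sparse):
    """Illustrative row-wise fusion + sparsity penalty (assumed form).

    coefs: list of client coefficient matrices, each a list of rows
           (one row per variable, one column per response).
    """
    n_clients = len(coefs)
    n_rows = len(coefs[0])

    # Fusion term: Euclidean distance between the same row of two clients,
    # summed over all client pairs and all rows. A zero distance means the
    # two clients share that variable's coefficient row.
    fusion = 0.0
    for a in range(n_clients):
        for b in range(a + 1, n_clients):
            for j in range(n_rows):
                diff = [x - y for x, y in zip(coefs[a][j], coefs[b][j])]
                fusion += math.sqrt(sum(d * d for d in diff))

    # Sparsity term: L1 norm of every entry (assumed; encourages zeros
    # inside each row).
    sparsity = sum(abs(x) for m in coefs for row in m for x in row)

    return lam_fuse * fusion + lam_sparse * sparsity


client_a = [[1.0, 0.0], [0.0, 2.0]]
client_b = [[4.0, 4.0], [0.0, 2.0]]  # differs from client_a only in row 0
print(row_fusion_penalty([client_a, client_b], lam_fuse=1.0, lam_sparse=0.0))
```

In this toy input only the first row differs between the two clients, so only that row contributes to the fusion term. This is the sense in which clustering happens per variable rather than per entry or per whole matrix.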
Xiao Zheng, Wenchi Cheng, Jingqing Wang, Zhuohui Yao, Jiangzhou Wang
Active reconfigurable intelligent surface (RIS) emerges as an effective
technique to resist the double-fading attenuation of passive RIS. By embedding
a power-harvesting function, it further evolves into zero-power active RIS,
which can effectively enhance the flexibility of RIS deployment without
external power demand. Nevertheless, existing works neglected the inherent
difficulty of channel estimation (CE) for RIS-assisted systems, and the
discrete phase shift constraint in practical deployment. In this paper we
design a new element-wise RIS architecture and propose a distributed
location-aided transmission scheme with low complexity to enhance the reflected
gain for channel state information (CSI)-limited RIS-assisted near-field
communications. Specifically, the new element-wise RIS provides dynamic element
selection capability with low hardware resources. Based on Fresnel diffraction
theory, we construct the mapping from locations in space-domain to phase
distributions of waves in phase-domain and reveal the priority of elements for
harvesting and reflecting. Then, the distributed beamforming design with the
phase of determine-then-align is proposed, where the estimation overhead
reduction stems from removing the requirement of RIS-associated CE at the base
station (BS). The asymptotic analysis indicates that the proposed scheme can achieve
the optimal gain with a fixed proportion of reflective elements when RIS is
large, followed by simulations to verify its superiority to other protocols.
Authors' comments: 17 Pages
Hong-Kai Zheng, Piji Li
Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. The post-training codebook resampling method also achieves the desired flexibility in adjusting the codebook size.
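The basic quantization step that the abstract builds on, replacing a continuous vector with its closest codebook entry, can be sketched as follows. This is a generic nearest-neighbour lookup plus a simple utilization count, not the Group-VQ method. The function names and the toy codebook are assumptions made for illustration.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook vector closest to `vector`."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def utilization(vectors, codebook):
    """Fraction of codebook entries selected at least once.

    A low value is one common symptom of codebook collapse.
    """
    used = {quantize(v, codebook)[0] for v in vectors}
    return len(used) / len(codebook)


# Toy codebook (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(quantize([0.9, 1.2], codebook))
print(utilization([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8]], codebook))
```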
Cheng Gong, Chunyu Qiang, Tianrui Wang, Yu Jiang, Yuheng Lu, Ruihao Jing, Xiaoxiao Miao, Xiaolei Zhang et al.
Cross-lingual emotional text-to-speech (TTS) aims to produce speech in one
language that captures the emotion of a speaker from another language while
maintaining the target voice's timbre. This process of cross-lingual emotional
speech synthesis presents a complex challenge, necessitating flexible control
over emotion, timbre, and language. However, emotion and timbre are highly
entangled in speech signals, making fine-grained control challenging. To
address this issue, we propose EMM-TTS, a novel two-stage cross-lingual
emotional speech synthesis framework based on perturbed self-supervised
learning (SSL) representations. In the first stage, the model explicitly and
implicitly encodes prosodic cues to capture emotional expressiveness, while the
second stage restores the timbre from perturbed SSL representations. We further
investigate the effect of different speaker perturbation strategies (formant
shifting and speaker anonymization) on the disentanglement of emotion and
timbre. To strengthen speaker preservation and expressive control, we introduce
Speaker Consistency Loss (SCL) and Speaker-Emotion Adaptive Layer Normalization
(SEALN) modules. Additionally, we find that incorporating explicit acoustic
features (e.g., F0, energy, and duration) alongside pretrained latent features
improves voice cloning performance. Comprehensive multi-metric evaluations,
including both subjective and objective measures, demonstrate that EMM-TTS
achieves superior naturalness, emotion transferability, and timbre consistency
across languages.
Authors' comments: Submitted to Expert Systems with Applications, 11 pages
Shuwei Chen, Jiajun Cui, Zhengqi Xu, Fan Zhang, Jiangke Fan, Teng Zhang, Xingxing Wang
Click-through rate (CTR) prediction, which models behavior sequence and
non-sequential features (e.g., user/item profiles or cross features) to infer
user interest, underpins industrial recommender systems. However, most methods
face three forms of heterogeneity that degrade predictive performance: (i)
Feature Heterogeneity persists when limited sequence side features provide less
granular interest representation compared to extensive non-sequential features,
thereby impairing sequence modeling performance; (ii) Context Heterogeneity
arises because a user's interest in an item will be influenced by other items,
yet point-wise prediction neglects cross-item interaction context from the
entire item set; (iii) Architecture Heterogeneity stems from the fragmented
integration of specialized network modules, which compromises the model's
effectiveness, efficiency and scalability in industrial deployments. To tackle
the above limitations, we propose HoMer, a Homogeneous-Oriented TransforMer for
modeling sequential and set-wise contexts. First, we align sequence side
features with non-sequential features for accurate sequence modeling and
fine-grained interest representation. Second, we shift the prediction paradigm
from point-wise to set-wise, facilitating cross-item interaction in a highly
parallel manner. Third, HoMer's unified encoder-decoder architecture achieves
dual optimization through structural simplification and shared computation,
ensuring computational efficiency while maintaining scalability with model
size. Without arduous modification to the prediction pipeline, HoMer
successfully scales up and outperforms our industrial baseline by 0.0099 in the
AUC metric, and enhances online business metrics like CTR/RPM by 1.99%/2.46%.
Additionally, HoMer saves 27% of GPU resources via preliminary engineering
optimization, further validating its superiority and practicality.
Authors' comments: 10 pages, 6 figures
X. Xie, A. Bergamaschi, M. Brückner, M. Carulla, R. Dinapoli, S. Ebner, K. Ferjaoui, E. Fröjdh et al.
The MÖNCH hybrid pixel detector, with a 25 µm pixel pitch and fast charge-integrating readout, has demonstrated subpixel resolution capabilities for X-ray imaging and deep learning-based electron localization in electron microscopy. Fully exploiting this potential requires extensive calibration to ensure both linearity and uniformity of the pixel response, which is challenging for detectors with a large dynamic range. To overcome the limitations of conventional calibration methods, we developed an accurate and efficient correction method to achieve pixel-wise gain and nonlinearity calibration based on the backside pulsing technique. A three-dimensional lookup table was generated for all pixels across the full dynamic range, mapping the pixel response to a calibrated linear energy scale. Compared with conventional linear calibration, the proposed method yields negligible deviations between the calibrated and nominal energies for photons and electrons. The improvement in energy resolution ranges from 4% to 22% for 15-25 keV photons and from 16% to 23% for 60-200 keV electrons. Deep learning-based electron localization demonstrates a 4% improvement in spatial resolution when using the proposed calibration method. This approach further enables rapid diagnosis of the cause of bad pixels and estimation of bump-bonding yield.
Jiaye Li, Baoyou Chen, Hui Li, Zilong Dong, Jingdong Wang, Siyu Zhu
Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in areas such as fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
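A two-dimensional toy check shows why a linear map applied before the rotary mapping leaves the relative-position property intact. The sketch below is an illustration under assumptions, not the HARoPE implementation. The matrix values, the single rotary plane, the angle step `theta`, and all function names are hypothetical, and the same matrix is applied to both queries and keys.

```python
import math


def rotate(vec, angle):
    """Rotate a 2D vector by `angle` radians (one rotary plane)."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def transform(matrix, vec):
    """Apply a 2x2 linear map; stands in for the learnable transformation."""
    (a, b), (c, d) = matrix
    return (a * vec[0] + b * vec[1], c * vec[0] + d * vec[1])


def attention_score(q, k, pos_q, pos_k, matrix, theta=0.1):
    """Dot product of query and key after linear map, then rotary rotation."""
    q_rot = rotate(transform(matrix, q), pos_q * theta)
    k_rot = rotate(transform(matrix, k), pos_k * theta)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]


# Hypothetical values chosen only for the demonstration.
M = ((1.5, 0.2), (-0.3, 0.8))
q, k = (1.0, 2.0), (0.5, -1.0)

# Same offset (2) at different absolute positions gives the same score.
print(attention_score(q, k, 3, 5, M), attention_score(q, k, 10, 12, M))
```

Because the rotation is applied after the linear map, the score depends on the positions only through their difference, whatever the matrix is.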
Manjiang Yu, Hongji Li, Priyanka Singh, Xue Li, Di Wang, Lijie Hu
Reliable behavior control is central to deploying large language models
(LLMs) on the web. Activation steering offers a tuning-free route to align
attributes (e.g., truthfulness) that ensure trustworthy generation. Prevailing
approaches rely on coarse heuristics and lack a principled account of where to
steer and how strongly to intervene. To this end, we propose Position-wise
Injection with eXact Estimated Levels (PIXEL), a position-wise activation
steering framework that, in contrast to prior work, learns a property-aligned
subspace from dual views (tail-averaged and end-token) and selects intervention
strength via a constrained geometric objective with a closed-form solution,
thereby adapting to token-level sensitivity without global hyperparameter
tuning. PIXEL further performs sample-level orthogonal residual calibration to
refine the global attribute direction and employs a lightweight
position-scanning routine to identify receptive injection sites. We
additionally provide representation-level guarantees for the
minimal-intervention rule, supporting reliable alignment. Across diverse models
and evaluation paradigms, PIXEL consistently improves attribute alignment while
preserving model general capabilities, offering a practical and principled
method for LLMs' controllable generation. Our code is available at
https://github.com/V1centNevwake/PIXEL-Adaptive-Steering
Authors' comments: 18 pages, 3 figures
Xueyi Liu, He Wang, Li Yi
Achieving generalized in-hand object rotation remains a significant challenge
in robotics, largely due to the difficulty of transferring policies from
simulation to the real world. The complex, contact-rich dynamics of dexterous
manipulation create a "reality gap" that has limited prior work to constrained
scenarios involving simple geometries, limited object sizes and aspect ratios,
constrained wrist poses, or customized hands. We address this sim-to-real
challenge with a novel framework that enables a single policy, trained in
simulation, to generalize to a wide variety of objects and conditions in the
real world. The core of our method is a joint-wise dynamics model that learns
to bridge the reality gap by effectively fitting a limited amount of collected
real-world data and then adapting the sim policy's actions accordingly. The
model is highly data-efficient and generalizable across different whole-hand
interaction distributions by factorizing dynamics across joints, compressing
system-wide influences into low-dimensional variables, and learning each
joint's evolution from its own dynamic profile, implicitly capturing these net
effects. We pair this with a fully autonomous data collection strategy that
gathers diverse, real-world interaction data with minimal human intervention.
Our complete pipeline demonstrates unprecedented generality: a single policy
successfully rotates challenging objects with complex shapes (e.g., animals),
high aspect ratios (up to 5.33), and small sizes, all while handling diverse
wrist orientations and rotation axes. Comprehensive real-world evaluations and
a teleoperation application for complex tasks validate the effectiveness and
robustness of our approach. Website: https://meowuu7.github.io/DexNDM/
Authors' comments: Project Website: https://meowuu7.github.io/DexNDM/ Video:
https://youtu.be/tU2Mv8vWftU
Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery
Training vision-language models on cognitively-plausible amounts of data
requires rethinking how models integrate multimodal information. Within the
constraints of the Vision track for the BabyLM Challenge 2025, we propose a
lightweight decoder-based architecture with (1) token-wise dynamic gating for
adaptive fusion of linguistic and visual cues, (2) feature modulation and
channel attention to maximise the utility of limited visual information and (3)
auxiliary contrastive objectives for visual grounding. Evaluation on five
benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows
competitive or superior performance to multimodal baselines. More notably, our
dynamic gate discovers interpretable patterns without explicit supervision,
favouring visual cues for content words and linguistic cues for function words.
While we identify limitations in the Challenge constraints, such as the
information bottleneck created by global image embeddings and training
instability from the dataset split, our findings establish dynamic gating as a
powerful tool for efficient multimodal learning, offering both interpretability
and performance even under severe constraints.
Authors' comments: Accepted to the EMNLP 2025 BabyLM Workshop
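As a rough picture of token-wise dynamic gating, the sketch below mixes one token's text features with image features through a scalar sigmoid gate. The architecture is an assumption made for illustration: a single linear gate over the concatenated features, with the gate weighting the linguistic side. The submission's actual gate may be vector-valued or structured differently, and all names are hypothetical.

```python
import math


def gated_fusion(text_vec, image_vec, gate_weights, gate_bias=0.0):
    """Fuse one token's text and image features with a scalar gate in (0, 1).

    gate close to 1 -> rely on linguistic features
    gate close to 0 -> rely on visual features
    """
    features = list(text_vec) + list(image_vec)
    logit = sum(w * f for w, f in zip(gate_weights, features)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [gate * t + (1.0 - gate) * v for t, v in zip(text_vec, image_vec)]
    return gate, fused


# With zero weights and zero bias the gate is 0.5, so the fused vector is
# the average of the two inputs.
gate, fused = gated_fusion([2.0, 4.0], [0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
print(gate, fused)
```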
Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning
method for foundation models, but it suffers from parameter interference,
resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based
LoRA variants show promise in mitigating intra-task correlations in single-task
instruction tuning, they introduce additional router parameters and remain
ineffective in multi-task model merging where inter-task interference arises.
Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit
MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the
up-projection matrix, and (2) an implicit router that unifies expert routing
and down-projection, where a frozen sparse random projection matrix replaces
the traditional dense trainable version. This design resolves the trade-off
between intra-task decorrelation and computational efficiency by eliminating
the need for an explicit router, while inherently mitigating inter-task
interference due to the orthogonality property of random matrices. Extensive
experiments across four domains -- general knowledge understanding, scientific
question answering, mathematical reasoning, and code generation -- demonstrate
consistent performance improvements over existing methods. Beyond empirical
gains, FlyLoRA highlights how biological structures can inspire innovations in
AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
Authors' comments: NeurIPS 2025 accepted paper
Larissa Reichart, Cem Ata Baykara, Ali Burak Ünal, Mete Akgün, Harlin Lee
Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
Seungsu Han, Juyoung Hwang, Won Chang
Normalizing flows with a Gaussian base provide a computationally efficient way to approximate posterior distributions in Bayesian inference, but they often struggle to capture complex posteriors with multimodality and heavy tails. We propose a stick-breaking mixture base with component-wise tail adaptation (StiCTAF) for posterior approximation. The method first learns a flexible mixture base to mitigate the mode-seeking bias of reverse KL divergence through a weighted average of component-wise ELBOs. It then estimates local tail indices of unnormalized densities and finally refines each mixture component using a shared backbone combined with component-specific tail transforms calibrated by the estimated indices. This design enables accurate mode coverage and anisotropic tail modeling while retaining exact density evaluation and stable optimization. Experiments on synthetic posteriors demonstrate improved tail recovery and better coverage of multiple modes compared to benchmark models. We also present a real-data analysis illustrating the practical benefits of our approach for posterior inference.
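The stick-breaking construction behind such a mixture base turns a sequence of fractions in (0, 1) into mixture weights that sum to one. The sketch below shows only this standard construction, not the StiCTAF training procedure. The function name is hypothetical, and the final leftover piece is assigned to a last component by assumption, since a finite truncation is needed in practice.

```python
def stick_breaking_weights(fractions):
    """Map break fractions v_1..v_K in (0, 1) to K+1 mixture weights.

    Weight k takes a fraction v_k of whatever stick length remains; the
    last weight is the leftover piece, so the weights always sum to 1.
    """
    weights = []
    remaining = 1.0
    for v in fractions:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


print(stick_breaking_weights([0.5, 0.5]))  # [0.5, 0.25, 0.25]
```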
Linping Qu, Shenghui Song, Chi-Ying Tsui
In wireless federated learning (FL), the clients need to transmit high-dimensional deep neural network (DNN) parameters through bandwidth-limited channels, which causes high communication latency. In this paper, we propose a layer-wise adaptive modulation scheme to reduce the communication latency. Unlike existing works, which assign the same modulation level to all DNN layers, we consider the layers' importance, which provides more freedom to reduce the latency. The proposed scheme can automatically decide the optimal modulation levels for different DNN layers. Experimental results show that the proposed scheme can save up to 73.9% of communication latency compared with the existing schemes.
Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy
Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
Amir Hameed Mir
Large Language Models (LLMs) often produce fluent yet factually incorrect
statements (a phenomenon known as hallucination), posing serious risks in
high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric
framework for hallucination detection that analyzes the evolution of
hidden-state semantics across transformer layers. Unlike prior methods that
rely on multiple sampling passes or external verification sources, LSD operates
intrinsically within the model's representational space. Using margin-based
contrastive learning, LSD aligns hidden activations with ground-truth
embeddings derived from a factual encoder, revealing a distinct separation in
semantic trajectories: factual responses preserve stable alignment, while
hallucinations exhibit pronounced semantic drift across depth. Evaluated on the
TruthfulQA and synthetic factual-hallucination datasets, LSD achieves an
F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89, outperforming
SelfCheckGPT and Semantic Entropy baselines while requiring only a single
forward pass. This efficiency yields a 5-20x speedup over sampling-based
methods without sacrificing precision or interpretability. LSD offers a
scalable, model-agnostic mechanism for real-time hallucination monitoring and
provides new insights into the geometry of factual consistency within large
language models.
Authors' comments: 14 pages, 14 figures, 5 tables. Code available at:
https://github.com/sirraya-tech/Sirraya_LSD_Code
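The notion of a semantic trajectory across layers can be pictured with a small sketch: compute the cosine similarity between each layer's hidden state and a reference embedding, then summarize how much that alignment moves with depth. This is an illustration under assumptions, not the LSD method. The drift measure (maximum minus minimum alignment), the toy vectors, and the function names are hypothetical, and no contrastive training is involved.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_trajectory(layer_states, reference):
    """Alignment of each layer's hidden state with the reference embedding."""
    return [cosine(h, reference) for h in layer_states]


def drift(trajectory):
    """Assumed drift score: spread of the alignment across layers."""
    return max(trajectory) - min(trajectory)


reference = [1.0, 0.0]
stable = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]    # keeps pointing at the reference
drifting = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # rotates away with depth

print(drift(alignment_trajectory(stable, reference)))
print(drift(alignment_trajectory(drifting, reference)))
```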
Yitong Cui, Liu Liu, Baosheng Yu, Jiayan Qiu, Xikai Zhang, Likang Xiao, Yixing Liu, Quan Chen
Large language models (LLMs) have exhibited significant capabilities in addressing challenging problems throughout various fields, often through the use of agentic workflows that adhere to structured instructions and multi-step procedures. However, designing such workflows demands substantial manual effort, posing challenges to scalability and generalizability. Recent studies have aimed to minimize the human intervention needed for their construction, leading to advances in automated techniques for optimizing agentic workflows. However, current approaches are often constrained by their limited representational capacity, insufficient adaptability, weak scalability, and pairwise comparison paradigm -- issues that stem primarily from a dependence on discrete optimization techniques. To overcome these limitations, we introduce a new score-based preference approach, referred to as SPOGW, which operates directly on cardinal reward signals through group-wise comparison and enables more efficient and stable optimization in a continuous space. SPOGW incorporates Iterative offline GRPO (ioGRPO) with advantage-masked KL divergence (mKL), which regulates training updates by placing greater emphasis on the advantageous regions of the policy response. On five benchmark datasets covering mathematical reasoning, coding, and question answering, SPOGW matches or exceeds the performance of current state-of-the-art approaches, presenting a viable and forward-looking methodology for automated generation and optimization of agentic workflows.
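Group-wise comparison on cardinal scores is commonly implemented in GRPO-style methods by standardizing each candidate's score against the other candidates in its group. The sketch below shows only that generic step and assumes SPOGW's group-wise comparison resembles it. It does not reproduce ioGRPO or the advantage-masked KL term, and `group_advantages` and `eps` are hypothetical names.

```python
import statistics


def group_advantages(scores, eps=1e-8):
    """Standardize scores within one group of candidate workflows.

    Each advantage says how much better or worse a candidate scored than
    the group average, in units of the group's standard deviation.
    `eps` avoids division by zero when all scores are equal.
    """
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


# Scores of three candidates sampled for the same task (made-up values).
print(group_advantages([1.0, 2.0, 3.0]))
```

Unlike a pairwise preference, this uses the magnitude of every score in the group at once, so a candidate far above the group mean receives a proportionally larger signal.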
Yulong Zhang, Li Wang, Wei Du, Peilin Li, Yuqin Dai, Zhiyuan Zhao, Lingyong Fang, Ziniu Liu, Ru Zhang et al.
Verifying multi-step reasoning in large language models is difficult due to imprecise error localization and high token costs. Existing methods either assess entire reasoning chains, suffering attention dilution, or rely on expensive multi-sampling. We introduce Node-wise Consistency Verification (NCV), a training-free framework that recasts verification as lightweight binary consistency checks at the node level. By decomposing the chain of thought into interconnected verification nodes, NCV precisely localizes errors and avoids unnecessary long-form generation. Experiments demonstrate that our approach enhances interpretability and efficiency, presenting a scalable solution for reliable LLM reasoning verification. On public datasets, NCV achieves a 10% to 25% improvement in F1 scores over baselines while utilizing 6x to 58x fewer tokens than traditional methods like CoT-based verifiers.
Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi
The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India's most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
Authors' comments: It is work in progress
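For readers unfamiliar with Centered Kernel Alignment, the linear variant can be computed directly from two activation matrices with one row per example. The sketch below implements the standard linear CKA formula in plain Python. It is not the TRepLiNa training objective, and how the submission turns CKA into a loss and combines it with REPINA is not shown. The function names and toy activations are assumptions.

```python
import math


def center_columns(m):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[x - mu for x, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), after centering."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Toy activations for four sentences (made-up numbers).
acts_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
acts_b = [[2.0 * v for v in row] for row in acts_a]  # same geometry, rescaled
print(linear_cka(acts_a, acts_b))
```

The value lies in [0, 1] and is unchanged by rescaling or by an orthogonal change of basis in either representation, which is why it is a convenient measure of cross-lingual similarity between layers.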
Harshwardhan Fartale, Ashish Kattamuri, Rahul Raja, Arpita Vats, Ishita Prasad, Akshata Kishore Moharir
Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whether these abilities rely on distinct internal mechanisms remains unclear. Distinguishing recall from reasoning is crucial for predicting model generalization, designing targeted evaluations, and building safer interventions that affect one ability without disrupting the other. We approach this question through mechanistic interpretability, using controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our pipeline combines activation patching and structured ablations to causally measure component contributions to each task type. Across two model families (Qwen and LLaMA), we find that interventions on distinct layers and attention heads lead to selective impairments: disabling identified "recall circuits" reduces fact-retrieval accuracy by up to 15% while leaving reasoning intact, whereas disabling "reasoning circuits" reduces multi-step inference by a comparable margin. At the neuron level, we observe task-specific firing patterns, though these effects are less robust, consistent with neuronal polysemanticity. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models. These findings advance mechanistic interpretability by linking circuit-level structure to functional specialization and demonstrate how controlled datasets and causal interventions can yield mechanistic insights into model cognition, informing safer deployment of large language models.