X. Xie, A. Bergamaschi, M. Brückner, M. Carulla, R. Dinapoli, S. Ebner, K. Ferjaoui, E. Fröjdh et al.
The MÖNCH hybrid pixel detector, with a 25 µm pixel pitch and fast charge-integrating readout, has demonstrated subpixel resolution capabilities for X-ray imaging and deep learning-based electron localization in electron microscopy. Fully exploiting this potential requires extensive calibration to ensure both linearity and uniformity of the pixel response, which is challenging for detectors with a large dynamic range. To overcome the limitations of conventional calibration methods, we developed an accurate and efficient correction method to achieve pixel-wise gain and nonlinearity calibration based on the backside pulsing technique. A three-dimensional lookup table was generated for all pixels across the full dynamic range, mapping the pixel response to a calibrated linear energy scale. Compared with conventional linear calibration, the proposed method yields negligible deviations between the calibrated and nominal energies for photons and electrons. The improvement in energy resolution ranges from 4% to 22% for 15-25 keV photons and from 16% to 23% for 60-200 keV electrons. Deep learning-based electron localization demonstrates a 4% improvement in spatial resolution when using the proposed calibration method. This approach further enables rapid diagnosis of the cause of bad pixels and estimation of bump-bonding yield.
Jiaye Li, Baoyou Chen, Hui Li, Zilong Dong, Jingdong Wang, Siyu Zhu
Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
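The relative-position property mentioned above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes 1D positions, a single 2-D rotary plane, and one shared head map applied to both query and key before rotation. The names `svd_param_map`, `attention_score`, and the value of `theta` are hypothetical.

```python
import math

def rotation(angle):
    """2x2 rotation matrix as a tuple of rows."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s), (s, c))

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def matvec(mat, vec):
    """Multiply a 2x2 matrix by a 2-D vector."""
    return tuple(row[0] * vec[0] + row[1] * vec[1] for row in mat)

def svd_param_map(alpha, singular_values, beta):
    """Build A = U diag(s) V^T from two rotation angles and two singular values."""
    u = rotation(alpha)
    v_t = rotation(-beta)  # the transpose of a rotation is the inverse rotation
    s = ((singular_values[0], 0.0), (0.0, singular_values[1]))
    return matmul(matmul(u, s), v_t)

def attention_score(q, k, pos_q, pos_k, head_map, theta=0.3):
    """Dot product of q and k after the head map and the rotary rotation."""
    q_rot = matvec(rotation(pos_q * theta), matvec(head_map, q))
    k_rot = matvec(rotation(pos_k * theta), matvec(head_map, k))
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

# Hypothetical head-specific map; in a real model these would be learned.
head_map = svd_param_map(alpha=0.7, singular_values=(1.5, 0.5), beta=-0.2)
q, k = (1.0, 2.0), (-0.5, 0.8)

# Both pairs have the same offset (pos_k - pos_q = 4), shifted by 10 positions.
score_a = attention_score(q, k, pos_q=3, pos_k=7, head_map=head_map)
score_b = attention_score(q, k, pos_q=13, pos_k=17, head_map=head_map)
print(abs(score_a - score_b) < 1e-9)
```

Because the map is applied before the rotation, the score still depends only on the position offset, which is why shifting both positions by the same amount leaves it unchanged.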
Manjiang Yu, Hongji Li, Priyanka Singh, Xue Li, Di Wang, Lijie Hu
Reliable behavior control is central to deploying large language models
(LLMs) on the web. Activation steering offers a tuning-free route to align
attributes (e.g., truthfulness) that ensure trustworthy generation. Prevailing
approaches rely on coarse heuristics and lack a principled account of where to
steer and how strongly to intervene. To this end, we propose Position-wise
Injection with eXact Estimated Levels (PIXEL), a position-wise activation
steering framework that, in contrast to prior work, learns a property-aligned
subspace from dual views (tail-averaged and end-token) and selects intervention
strength via a constrained geometric objective with a closed-form solution,
thereby adapting to token-level sensitivity without global hyperparameter
tuning. PIXEL further performs sample-level orthogonal residual calibration to
refine the global attribute direction and employs a lightweight
position-scanning routine to identify receptive injection sites. We
additionally provide representation-level guarantees for the
minimal-intervention rule, supporting reliable alignment. Across diverse models
and evaluation paradigms, PIXEL consistently improves attribute alignment while
preserving model general capabilities, offering a practical and principled
method for LLMs' controllable generation. Our code is available at
https://github.com/V1centNevwake/PIXEL-Adaptive-Steering
Authors' comments: 18 pages, 3 figures
Xueyi Liu, He Wang, Li Yi
Achieving generalized in-hand object rotation remains a significant challenge
in robotics, largely due to the difficulty of transferring policies from
simulation to the real world. The complex, contact-rich dynamics of dexterous
manipulation create a "reality gap" that has limited prior work to constrained
scenarios involving simple geometries, limited object sizes and aspect ratios,
constrained wrist poses, or customized hands. We address this sim-to-real
challenge with a novel framework that enables a single policy, trained in
simulation, to generalize to a wide variety of objects and conditions in the
real world. The core of our method is a joint-wise dynamics model that learns
to bridge the reality gap by effectively fitting a limited amount of collected
real-world data and then adapting the sim policy's actions accordingly. The
model is highly data-efficient and generalizable across different whole-hand
interaction distributions by factorizing dynamics across joints, compressing
system-wide influences into low-dimensional variables, and learning each
joint's evolution from its own dynamic profile, implicitly capturing these net
effects. We pair this with a fully autonomous data collection strategy that
gathers diverse, real-world interaction data with minimal human intervention.
Our complete pipeline demonstrates unprecedented generality: a single policy
successfully rotates challenging objects with complex shapes (e.g., animals),
high aspect ratios (up to 5.33), and small sizes, all while handling diverse
wrist orientations and rotation axes. Comprehensive real-world evaluations and
a teleoperation application for complex tasks validate the effectiveness and
robustness of our approach. Website: https://meowuu7.github.io/DexNDM/
Authors' comments: Project Website: https://meowuu7.github.io/DexNDM/ Video:
https://youtu.be/tU2Mv8vWftU
Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery
Training vision-language models on cognitively-plausible amounts of data
requires rethinking how models integrate multimodal information. Within the
constraints of the Vision track for the BabyLM Challenge 2025, we propose a
lightweight decoder-based architecture with (1) token-wise dynamic gating for
adaptive fusion of linguistic and visual cues, (2) feature modulation and
channel attention to maximise the utility of limited visual information and (3)
auxiliary contrastive objectives for visual grounding. Evaluation on five
benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows
competitive or superior performance to multimodal baselines. More notably, our
dynamic gate discovers interpretable patterns without explicit supervision,
favouring visual cues for content words and linguistic cues for function words.
While we identify limitations in the Challenge constraints, such as the
information bottleneck created by global image embeddings and training
instability from the dataset split, our findings establish dynamic gating as a
powerful tool for efficient multimodal learning, offering both interpretability
and performance even under severe constraints.
Authors' comments: Accepted to the EMNLP 2025 BabyLM Workshop
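Token-wise gating of the kind described in this abstract can be illustrated with a scalar gate per token. This is a minimal sketch under stated assumptions, not the paper's architecture: the gate is a sigmoid over a linear function of the concatenated text and image features, and a gate near 1 favours the linguistic features. The function `gated_fusion`, the weights, and the toy feature vectors are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, image_vec, gate_weights, gate_bias=0.0):
    """Fuse one token's text features with image features via a scalar gate.

    The gate is computed from the concatenation [text_vec; image_vec].
    Convention assumed here: gate -> 1 keeps the text features,
    gate -> 0 keeps the image features.
    """
    concat = list(text_vec) + list(image_vec)
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(logit)
    fused = [gate * t + (1.0 - gate) * v for t, v in zip(text_vec, image_vec)]
    return gate, fused

# Toy inputs: one global image embedding shared by all tokens.
image_vec = [0.2, -0.4, 0.9]
# Hypothetical gate that reads only the first text feature.
gate_weights = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tokens = {"the": [4.0, 0.1, 0.0], "dog": [-4.0, 0.3, 0.5]}

results = {tok: gated_fusion(vec, image_vec, gate_weights) for tok, vec in tokens.items()}
for tok, (gate, fused) in results.items():
    print(tok, round(gate, 3), [round(f, 3) for f in fused])
```

In this toy setup the function word receives a gate close to 1 and the content word a gate close to 0. That split is built into the hand-picked weights here, whereas the abstract reports that such a pattern emerges during training without explicit supervision.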
Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning
method for foundation models, but it suffers from parameter interference,
resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based
LoRA variants show promise in mitigating intra-task correlations in single-task
instruction tuning, they introduce additional router parameters and remain
ineffective in multi-task model merging where inter-task interference arises.
Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit
MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the
up-projection matrix, and (2) an implicit router that unifies expert routing
and down-projection, where a frozen sparse random projection matrix replaces
the traditional dense trainable version. This design resolves the trade-off
between intra-task decorrelation and computational efficiency by eliminating
the need for an explicit router, while inherently mitigating inter-task
interference due to the orthogonality property of random matrices. Extensive
experiments across four domains -- general knowledge understanding, scientific
question answering, mathematical reasoning, and code generation -- demonstrate
consistent performance improvements over existing methods. Beyond empirical
gains, FlyLoRA highlights how biological structures can inspire innovations in
AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
Authors' comments: NeurIPS 2025 accepted paper
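The two ingredients named in this abstract, a frozen sparse random down-projection and rank-wise activation, can be sketched as follows. This is an illustrative sketch rather than the released FlyLoRA code: selecting active ranks by top-k magnitude, using random +/-1 entries, and the helper names and dimensions are all assumptions.

```python
import random

def sparse_random_projection(rank, dim_in, nonzeros_per_row, rng):
    """Frozen down-projection: each row has a few random +/-1 entries."""
    rows = []
    for _ in range(rank):
        row = [0.0] * dim_in
        for idx in rng.sample(range(dim_in), nonzeros_per_row):
            row[idx] = rng.choice((-1.0, 1.0))
        rows.append(row)
    return rows

def top_k_mask(values, k):
    """Keep the k entries with the largest magnitude and zero the rest."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def low_rank_update(x, down_frozen, up_trainable, k):
    """Compute up_trainable @ top_k(down_frozen @ x).

    The projection itself decides which ranks are active, so no separate
    router parameters are needed in this sketch.
    """
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in down_frozen]
    active = top_k_mask(hidden, k)
    delta = [sum(w * h for w, h in zip(row, active)) for row in up_trainable]
    return delta, active

rng = random.Random(0)
dim_in, rank, dim_out, k = 16, 8, 4, 2
down = sparse_random_projection(rank, dim_in, nonzeros_per_row=4, rng=rng)
up = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(dim_out)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim_in)]

delta, active = low_rank_update(x, down, up, k)
print(sum(1 for a in active if a != 0.0), len(delta))
```

Only the up-projection would receive gradients in this sketch. The down-projection stays fixed after initialization, matching the abstract's description of a frozen matrix replacing the dense trainable one.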
Larissa Reichart, Cem Ata Baykara, Ali Burak Ünal, Mete Akgün, Harlin Lee
Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
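A temperature-controlled, centroid-based weighting can be sketched as a softmax over negative centroid distances. This is a minimal sketch, assuming Euclidean distance between feature centroids and a plain softmax; the abstract does not specify either choice, and `source_weights` and the toy features are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def source_weights(source_features, target_features, temperature):
    """Softmax over negative centroid distances to the target.

    A lower temperature concentrates weight on the closest sources.
    """
    target_c = centroid(target_features)
    dists = [euclidean(centroid(f), target_c) for f in source_features]
    logits = [-d / temperature for d in dists]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy features: three sources at increasing distance from the target.
target = [[0.0, 0.0], [0.2, 0.0]]
sources = [
    [[0.1, 0.1], [0.1, -0.1]],  # centroid coincides with the target centroid
    [[1.0, 0.0], [1.2, 0.0]],   # centroid about 1 unit away
    [[3.1, 0.0], [3.1, 0.0]],   # centroid about 3 units away
]

soft = source_weights(sources, target, temperature=1.0)
sharp = source_weights(sources, target, temperature=0.1)
print([round(w, 3) for w in soft])
print([round(w, 3) for w in sharp])
```

Each source contributes a single centroid, so the cost of this weighting grows linearly with the number of sources.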
Seungsu Han, Juyoung Hwang, Won Chang
Normalizing flows with a Gaussian base provide a computationally efficient way to approximate posterior distributions in Bayesian inference, but they often struggle to capture complex posteriors with multimodality and heavy tails. We propose a stick-breaking mixture base with component-wise tail adaptation (StiCTAF) for posterior approximation. The method first learns a flexible mixture base to mitigate the mode-seeking bias of reverse KL divergence through a weighted average of component-wise ELBOs. It then estimates local tail indices of unnormalized densities and finally refines each mixture component using a shared backbone combined with component-specific tail transforms calibrated by the estimated indices. This design enables accurate mode coverage and anisotropic tail modeling while retaining exact density evaluation and stable optimization. Experiments on synthetic posteriors demonstrate improved tail recovery and better coverage of multiple modes compared to benchmark models. We also present a real-data analysis illustrating the practical benefits of our approach for posterior inference.
Linping Qu, Shenghui Song, Chi-Ying Tsui
In wireless federated learning (FL), clients must transmit high-dimensional deep neural network (DNN) parameters through bandwidth-limited channels, which causes communication latency. In this paper, we propose a layer-wise adaptive modulation scheme to reduce this latency. Unlike existing works, which assign the same modulation level to all DNN layers, we account for each layer's importance, which provides more freedom to reduce latency. The proposed scheme automatically decides the optimal modulation levels for different DNN layers. Experimental results show that the proposed scheme can reduce communication latency by up to 73.9% compared with existing schemes.
Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy
Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
Amir Hameed Mir
Large Language Models (LLMs) often produce fluent yet factually incorrect
statements (a phenomenon known as hallucination), posing serious risks in
high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric
framework for hallucination detection that analyzes the evolution of
hidden-state semantics across transformer layers. Unlike prior methods that
rely on multiple sampling passes or external verification sources, LSD operates
intrinsically within the model's representational space. Using margin-based
contrastive learning, LSD aligns hidden activations with ground-truth
embeddings derived from a factual encoder, revealing a distinct separation in
semantic trajectories: factual responses preserve stable alignment, while
hallucinations exhibit pronounced semantic drift across depth. Evaluated on the
TruthfulQA and synthetic factual-hallucination datasets, LSD achieves an
F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89, outperforming
SelfCheckGPT and Semantic Entropy baselines while requiring only a single
forward pass. This efficiency yields a 5-20x speedup over sampling-based
methods without sacrificing precision or interpretability. LSD offers a
scalable, model-agnostic mechanism for real-time hallucination monitoring and
provides new insights into the geometry of factual consistency within large
language models.
Authors' comments: 14 pages, 14 figures, 5 tables. Code available at:
https://github.com/sirraya-tech/Sirraya_LSD_Code
Yitong Cui, Liu Liu, Baosheng Yu, Jiayan Qiu, Xikai Zhang, Likang Xiao, Yixing Liu, Quan Chen
Large language models (LLMs) have exhibited significant capabilities in addressing challenging problems throughout various fields, often through the use of agentic workflows that adhere to structured instructions and multi-step procedures. However, designing such workflows demands substantial manual effort, posing challenges to scalability and generalizability. Recent studies have aimed to minimize the human intervention needed for their construction, leading to advances in automated techniques for optimizing agentic workflows. However, current approaches are often constrained by their limited representational capacity, insufficient adaptability, weak scalability, and pairwise comparison paradigm -- issues that stem primarily from a dependence on discrete optimization techniques. To overcome these limitations, we introduce a new score-based preference approach, referred to as SPOGW, which operates directly on cardinal reward signals through group-wise comparison and enables more efficient and stable optimization in a continuous space. SPOGW incorporates Iterative offline GRPO (ioGRPO) with advantage-masked KL divergence (mKL), which regulates training updates by placing greater emphasis on the advantageous regions of the policy response. On five benchmark datasets covering mathematical reasoning, coding, and question answering, SPOGW matches or exceeds the performance of current state-of-the-art approaches, presenting a viable and forward-looking methodology for automated generation and optimization of agentic workflows.
Yulong Zhang, Li Wang, Wei Du, Peilin Li, Yuqin Dai, Zhiyuan Zhao, Lingyong Fang, Ziniu Liu, Ru Zhang et al.
Verifying multi-step reasoning in large language models is difficult due to imprecise error localization and high token costs. Existing methods either assess entire reasoning chains, suffering from attention dilution, or rely on expensive multi-sampling. We introduce Node-wise Consistency Verification (NCV), a training-free framework that recasts verification as lightweight binary consistency checks at the node level. By decomposing the chain of thought into interconnected verification nodes, NCV precisely localizes errors and avoids unnecessary long-form generation. Experiments demonstrate that our approach enhances interpretability and efficiency, presenting a scalable solution for reliable LLM reasoning verification. On public datasets, NCV achieves a 10% to 25% improvement in F1 scores over baselines while using 6x to 58x fewer tokens than traditional methods such as CoT-based verifiers.
Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi
The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India's most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
Authors' comments: Work in progress
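The similarity metric named in this abstract can be made concrete with linear CKA between two sets of representations for the same inputs. The sketch below assumes the linear-kernel variant, which the abstract does not specify, and uses toy matrices rather than model activations. The helper names are hypothetical.

```python
def center_columns(mat):
    """Subtract the per-column mean from a row-major matrix."""
    n = len(mat)
    means = [sum(row[j] for row in mat) / n for j in range(len(mat[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in mat]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices with equal row counts."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            entry = sum(a[r][i] * b[r][j] for r in range(len(a)))
            total += entry * entry
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    x_c, y_c = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(y_c, x_c)
    denominator = (cross_gram_sq_norm(x_c, x_c) * cross_gram_sq_norm(y_c, y_c)) ** 0.5
    return numerator / denominator

# Rows are examples (e.g. parallel sentences), columns are features.
x = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.5]]
scaled = [[3.0 * a + 1.0, 3.0 * b - 2.0] for a, b in x]   # scaled and shifted copy
rotated = [[-b, a] for a, b in x]                          # 90-degree rotation

# A pair whose centered features are uncorrelated.
u = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
v = [[1.0], [-1.0], [1.0], [-1.0]]

cka_scaled = linear_cka(x, scaled)
cka_rotated = linear_cka(x, rotated)
cka_unrelated = linear_cka(u, v)
print(round(cka_scaled, 6), round(cka_rotated, 6), round(cka_unrelated, 6))
```

Linear CKA is invariant to isotropic scaling, shifts, and orthogonal transforms of the features. It therefore compares the geometry of two representation spaces without requiring them to share a coordinate system.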
Harshwardhan Fartale, Ashish Kattamuri, Rahul Raja, Arpita Vats, Ishita Prasad, Akshata Kishore Moharir
Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whether these abilities rely on distinct internal mechanisms remains unclear. Distinguishing recall from reasoning is crucial for predicting model generalization, designing targeted evaluations, and building safer interventions that affect one ability without disrupting the other. We approach this question through mechanistic interpretability, using controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our pipeline combines activation patching and structured ablations to causally measure component contributions to each task type. Across two model families (Qwen and LLaMA), we find that interventions on distinct layers and attention heads lead to selective impairments: disabling identified "recall circuits" reduces fact-retrieval accuracy by up to 15% while leaving reasoning intact, whereas disabling "reasoning circuits" reduces multi-step inference by a comparable margin. At the neuron level, we observe task-specific firing patterns, though these effects are less robust, consistent with neuronal polysemanticity. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models. These findings advance mechanistic interpretability by linking circuit-level structure to functional specialization and demonstrate how controlled datasets and causal interventions can yield mechanistic insights into model cognition, informing safer deployment of large language models.
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Recent advances in speech enhancement have shown that models combining Mamba
and attention mechanisms yield superior cross-corpus generalization
performance. At the same time, integrating Mamba in a U-Net structure has
yielded state-of-the-art enhancement performance, while reducing both model
size and computational complexity. Inspired by these insights, we propose
RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and
multi-head attention in a U-Net structure for improved cross-corpus
performance. Resolution-wise shared attention (RWSA) refers to layerwise
attention-sharing across corresponding time and frequency resolutions. Our
best-performing RWSA-MambaUNet model achieves state-of-the-art generalization
performance on two out-of-domain test sets. Notably, our smallest model
surpasses all baselines on the out-of-domain DNS 2020 test set in terms of
PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms
of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and
a fraction of the FLOPs.
Authors' comments: Submitted to IEEE for possible publication
Jhonatan Contreras, Thomas Bocklitz
Deep learning has achieved remarkable success in medical image analysis; however, its adoption in clinical practice is limited by a lack of interpretability. These models often make correct predictions without explaining their reasoning. They may also rely on image regions unrelated to the disease or visual cues, such as annotations, that are not present in real-world conditions. This can reduce trust and increase the risk of misleading diagnoses. We introduce the Guided Focus via Segment-Wise Relevance Network (GFSR-Net), an approach designed to improve interpretability and reliability in medical imaging. GFSR-Net uses a small number of human annotations to approximate where a person would focus within an image intuitively, without requiring precise boundaries or exhaustive markings, making the process fast and practical. During training, the model learns to align its focus with these areas, progressively emphasizing features that carry diagnostic meaning. This guidance works across different types of natural and medical images, including chest X-rays, retinal scans, and dermatological images. Our experiments demonstrate that GFSR-Net achieves comparable or superior accuracy while producing saliency maps that better reflect human expectations. This reduces the reliance on irrelevant patterns and increases confidence in automated diagnostic tools.
Sofiane Ennadir, Levente Zólyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao
Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.
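How the pooling choice affects the ability to distinguish similar inputs can be seen on a toy pair of sequences. This is an illustrative sketch only: the three pooling functions are common textbook variants and are not necessarily the set analysed in the paper. The token vectors are made up.

```python
def mean_pool(tokens):
    """Average each feature over the sequence."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]

def max_pool(tokens):
    """Take the per-feature maximum over the sequence."""
    return [max(t[d] for t in tokens) for d in range(len(tokens[0]))]

def last_pool(tokens):
    """Use the final token's representation as the summary."""
    return list(tokens[-1])

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two sequences that differ only in the last token (by 0.5 in feature 1).
seq_a = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
seq_b = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.5]]

gaps = {
    "mean": l2(mean_pool(seq_a), mean_pool(seq_b)),
    "max": l2(max_pool(seq_a), max_pool(seq_b)),
    "last": l2(last_pool(seq_a), last_pool(seq_b)),
}
for name, gap in gaps.items():
    print(name, round(gap, 4))
```

In this example mean pooling shrinks the perturbation by the sequence length, and last-token pooling passes it through unchanged. Max pooling hides it entirely because another token dominates that feature.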
Ali Shadman Yazdi, Annalisa Cappella, Benedetta Baldini, Riccardo Solazzo, Gianluca Tartaglia, Chiarella Sforza, Giuseppe Baselli
Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based pointwise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserved relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41 mm and a distance-wise error of 0.38 mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions. PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at https://github.com/Ali5hadman/PAL-Net-A-Point-Wise-CNN-with-Patch-Attention
Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao
Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable path to rich reward signals from diverse video sources.