Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Recent advances in speech enhancement have shown that models combining Mamba
and attention mechanisms yield superior cross-corpus generalization
performance. At the same time, integrating Mamba in a U-Net structure has
yielded state-of-the-art enhancement performance, while reducing both model
size and computational complexity. Inspired by these insights, we propose
RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and
multi-head attention in a U-Net structure for improved cross-corpus
performance. Resolution-wise shared attention (RWSA) refers to layerwise
attention-sharing across corresponding time- and frequency resolutions. Our
best-performing RWSA-MambaUNet model achieves state-of-the-art generalization
performance on two out-of-domain test sets. Notably, our smallest model
surpasses all baselines on the out-of-domain DNS 2020 test set in terms of
PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms
of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and
a fraction of the FLOPs.
Authors' comments: Submitted to IEEE for possible publication
Jhonatan Contreras, Thomas Bocklitz
Deep learning has achieved remarkable success in medical image analysis; however, its adoption in clinical practice is limited by a lack of interpretability. These models often make correct predictions without explaining their reasoning. They may also rely on image regions unrelated to the disease or visual cues, such as annotations, that are not present in real-world conditions. This can reduce trust and increase the risk of misleading diagnoses. We introduce the Guided Focus via Segment-Wise Relevance Network (GFSR-Net), an approach designed to improve interpretability and reliability in medical imaging. GFSR-Net uses a small number of human annotations to approximate where a person would focus within an image intuitively, without requiring precise boundaries or exhaustive markings, making the process fast and practical. During training, the model learns to align its focus with these areas, progressively emphasizing features that carry diagnostic meaning. This guidance works across different types of natural and medical images, including chest X-rays, retinal scans, and dermatological images. Our experiments demonstrate that GFSR-Net achieves comparable or superior accuracy while producing saliency maps that better reflect human expectations. This reduces the reliance on irrelevant patterns and increases confidence in automated diagnostic tools.
Sofiane Ennadir, Levente Zólyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao
Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.
Ali Shadman Yazdi, Annalisa Cappella, Benedetta Baldini, Riccardo Solazzo, Gianluca Tartaglia, Chiarella Sforza, Giuseppe Baselli
Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based pointwise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserved relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41 mm and a distance-wise error of 0.38 mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions. PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at https://github.com/Ali5hadman/PAL-Net-A-Point-Wise-CNN-with-Patch-Attention
Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao
Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 interactions per task with the environment. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable path to rich reward signals from diverse video sources.
Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, Xiangyang Ji
Pre-trained video generation models hold great potential for generative video
super-resolution (VSR). However, adapting them for full-size VSR, as most
existing methods do, suffers from unnecessarily intensive full-attention
computation and fixed output resolution. To overcome these limitations, we make
the first exploration into utilizing video diffusion priors for patch-wise VSR.
This is non-trivial because pre-trained video diffusion models are not native
for patch-level detail generation. To mitigate this challenge, we propose an
innovative approach, called PatchVSR, which integrates a dual-stream adapter
for conditional guidance. The patch branch extracts features from input patches
to maintain content fidelity while the global branch extracts context features
from the resized full video to bridge the generation gap caused by incomplete
semantics of patches. Particularly, we also inject the patch's location
information into the model to better contextualize patch synthesis within the
global video frame. Experiments demonstrate that our method can synthesize
high-fidelity, high-resolution details at the patch level. A tailor-made
multi-patch joint modulation is proposed to ensure visual consistency across
individually enhanced patches. Due to the flexibility of our patch-based
paradigm, we can achieve highly competitive 4K VSR based on a 512x512
resolution base model, with extremely high efficiency.
Authors' comments: CVPR 2025
Nhan T. Luu, Duong T. Luu, Pham Ngoc Nam, Truong Cong Thang
Spiking Neural Networks (SNNs) have gained significant traction in both
computational neuroscience and artificial intelligence for their potential in
energy-efficient computing. In contrast, artificial neural networks (ANNs)
excel at gradient-based optimization and high accuracy. This contrast has
consequently led to a growing subfield of hybrid ANN-SNN research. However,
existing hybrid approaches often rely on either a strict separation between ANN
and SNN components or employ SNN-only encoders followed by ANN classifiers due
to the constraints of non-differentiability of spike encoding functions,
causing prior hybrid architectures to lack deep layer-wise cooperation during
backpropagation. To address this gap, we propose a novel hybrid ANN-SNN
framework that integrates layer-wise encode-decode SNN blocks within
conventional ANN pipelines. Central to our method is the use of surrogate
gradients for a bit-plane-based spike encoding function, enabling end-to-end
differentiable training across ANN and SNN layers. This design achieves
competitive accuracy with state-of-the-art pure ANN and SNN models while
retaining the potential efficiency and temporal representation benefits of
spiking computation. To the best of our knowledge, this is the first use of a
surrogate gradient for bit-plane coding specifically, and for a spike encoder
interface in general, in hybrid ANN-SNN models, leading to a new class of
hybrid models that opens new directions for future research.
Authors' comments: Work under peer-review
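As a rough illustration of bit-plane coding in general (not the authors' encoder, and without the surrogate gradient the abstract describes), the sketch below splits integer intensities into binary planes, one per bit, and recombines them. The function names and the 8-bit width are assumptions made for this example.

```python
def bitplane_encode(values, num_bits=8):
    """Split non-negative integers into num_bits binary planes (MSB first)."""
    planes = []
    for bit in range(num_bits - 1, -1, -1):
        planes.append([(v >> bit) & 1 for v in values])
    return planes


def bitplane_decode(planes):
    """Recombine MSB-first binary planes into the original integers."""
    num_bits = len(planes)
    values = [0] * len(planes[0])
    for index, plane in enumerate(planes):
        weight = 1 << (num_bits - 1 - index)
        values = [v + weight * b for v, b in zip(values, plane)]
    return values


# Hypothetical 8-bit pixel intensities.
pixels = [0, 5, 128, 255]
spike_planes = bitplane_encode(pixels)
```

Each plane can be read as one time step of binary spikes; the hard bit extraction has zero gradient almost everywhere, which is the kind of obstacle a surrogate gradient is meant to work around.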
Zhangyao Song, Nanqing Jiang, Miaohong He, Xiaoyu Zhao, Tao Guo
Downsampling-based methods for time series forecasting have attracted increasing attention due to their superiority in capturing sequence trends. However, these approaches mainly capture dependencies within subsequences but neglect inter-subsequence and inter-channel interactions, which limits forecasting accuracy. To address these limitations, we propose CTPNet, a novel framework that explicitly learns representations from three perspectives: i) inter-channel dependencies, captured by a temporal query-based multi-head attention mechanism; ii) intra-subsequence dependencies, modeled via a Transformer to characterize trend variations; and iii) inter-subsequence dependencies, extracted by reusing the encoder with residual connections to capture global periodic patterns. By jointly integrating these levels, the proposed method provides a more holistic representation of temporal dynamics. Extensive experiments demonstrate the superiority of the proposed method.
Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
While large audio language models excel at tasks like ASR and emotion
recognition, they still struggle with complex reasoning due to the modality gap
between audio and text as well as the lack of structured intermediate
supervision. To address this, we propose a unified knowledge distillation
framework to transfer reasoning capabilities from a high-capacity textual
teacher model to a student audio model while preserving its acoustic
competence. Our method introduces two key dimensions: source-wise distillation,
which leverages both textual and acoustic teachers to provide complementary
modality-specific supervision; and layer-wise distillation, which aligns
teacher signals with appropriate student layers to improve transfer efficiency.
This dual-dimensional strategy enables fine-grained control over the
distillation process, effectively bridging the gap between symbolic reasoning
and speech representations. Experimental results show significant improvements
in audio reasoning performance, demonstrating the effectiveness of our
framework as a reasoning transfer solution for audio modeling.
Authors' comments: 5 pages; submitted to ICASSP 2026
Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan
Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.
Yuan-Ming Hsu, Xiaosheng Huang, Christopher J. Storfer, Jose Carlos Inchausti, David Schlegel, John Moustakas, J. Aguilar, S. Ahlen et al.
We present a new method to search for strong gravitational lensing systems by
pairing spectra that are close together on the sky in a spectroscopic survey.
We visually inspect 26,621 spectra in the Dark Energy Spectroscopic Instrument
(DESI) Data Release 1 that are selected in this way. We further inspect the
11,848 images corresponding to these spectra in the DESI Legacy Imaging Surveys
Data Release 10, and obtain 2046 conventional strong gravitational lens
candidates, of which 1906 are new. This constitutes the largest sample of lens
candidates identified to date in spectroscopic data. Besides the conventional
candidates, we identify a new class of systems that we term "dimple lenses".
These systems have a low-mass foreground galaxy as a lens, typically smaller in
angular extent and fainter compared with the lensed background source galaxy,
producing subtle surface brightness indentations in the latter. We report the
discovery of 318 of these "dimple-lens" candidates. We suspect that these
represent dwarf galaxy lensing. With follow-up observations, they could offer a
new avenue to test the cold dark matter model by probing their mass profiles,
stellar mass-halo mass relation, and halo mass function for $M_{\textrm{Halo}}
\lesssim 10^{13}\,M_\odot$. Thus, in total, we report 2164 new lens candidates.
Our method demonstrates the power of pairwise spectroscopic analysis and
provides a pathway complementary to imaging-based and single-spectrum lens
searches.
Authors' comments: 23 pages, 15 figures, 4 tables, submitted to ApJS
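The pairing step described above can be sketched as a search for targets that lie within a small angular separation of each other. This is only a minimal sketch: the abstract does not state the separation threshold or the pairing procedure, so the 3-arcsecond cut, the coordinates, and the function names below are hypothetical.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds; inputs in degrees (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def close_pairs(targets, max_sep_arcsec):
    """Return index pairs of targets closer than max_sep_arcsec (O(n^2) scan)."""
    pairs = []
    for i in range(len(targets)):
        for j in range(i + 1, len(targets)):
            sep = angular_separation_arcsec(*targets[i], *targets[j])
            if sep < max_sep_arcsec:
                pairs.append((i, j))
    return pairs


# Hypothetical (RA, Dec) positions in degrees.
targets = [(150.0000, 2.0000), (150.0005, 2.0003), (151.0000, 2.5000)]
pairs = close_pairs(targets, max_sep_arcsec=3.0)
```

For a survey-scale catalog, the quadratic scan would normally be replaced by a spatial index, but the selection logic stays the same.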
Linyang He, Qiaolin Wang, Xilin Jiang, Nima Mesgarani
Transformer-based speech language models (SLMs) have significantly improved
neural speech recognition and understanding. While existing research has
examined how well SLMs encode shallow acoustic and phonetic features, the
extent to which SLMs encode nuanced syntactic and conceptual features remains
unclear. By drawing parallels with linguistic competence assessments for large
language models, this study is the first to systematically evaluate the
presence of contextual syntactic and semantic features across SLMs for
self-supervised learning (S3M), automatic speech recognition (ASR), speech
compression (codec), and as the encoder for auditory large language models
(AudioLLMs). Through minimal pair designs and diagnostic feature analysis
across 71 tasks spanning diverse linguistic levels, our layer-wise and
time-resolved analysis uncovers that 1) all SLMs encode grammatical features
more robustly than conceptual ones.
Authors' comments: EMNLP 2025 Main Conference (Oral)
Cedric Gaberle, Manpreet Singh Jattana
We propose an optimization method for the Variational Quantum Eigensolver (VQE) that combines adaptive and physics-inspired ansatz design. Instead of optimizing multiple layers simultaneously, the ansatz is built incrementally from its operator subsets, enabling subspace optimization that provides better initialization for subsequent steps. This quasi-dynamical approach preserves expressivity and hardware efficiency while avoiding the overhead of operator selection associated with adaptive methods. Benchmarks on one- and two-dimensional Heisenberg and Hubbard models with up to 20 qubits show improved fidelities, reduced function evaluations, or both, compared to fixed-layer VQE. The method is simple, cost-effective, and particularly well-suited for current noisy intermediate-scale quantum (NISQ) devices.
Sumanta Bhattacharyya, Sara Riaz, Pedram Rooshenas
Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model's intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
BaiChen Fan, Sifan Zhou, Jian Li, Shibo Zhao, Muqing Cao, Qin Wang
LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics
and autonomous systems. Existing methods typically follow frame-wise motion
estimation or a sequence-based paradigm. However, the two-frame methods are
efficient but lack long-term temporal context, making them vulnerable in sparse
or occluded scenes, while sequence-based methods that process multiple point
clouds gain robustness at a significant computational cost. To resolve this
dilemma, we propose a novel trajectory-based paradigm and its instantiation,
TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame
tracker by implicitly learning motion continuity from historical bounding box
trajectories alone, without requiring additional, costly point cloud inputs. It
first generates a fast, explicit motion proposal and then uses an implicit
motion modeling module to predict the future trajectory, which in turn refines
and corrects the initial proposal. Extensive experiments on the large-scale
NuScenes benchmark show that TrajTrack achieves new state-of-the-art
performance, dramatically improving tracking precision by 4.48% over a strong
baseline while running at 56 FPS. Besides, we also demonstrate the strong
generalizability of TrajTrack across different base trackers. Video is
available at https://www.bilibili.com/video/BV1ahYgzmEWP.
Authors' comments: 9 pages, 7 figures
M. Ohishi, R. Oda
The kick-one-out (KOO) method is a variable selection method based on a model
selection criterion. The method is very simple, and yet it has consistency in
variable selection under a high-dimensional asymptotic framework with a
specific model selection criterion. This paper proposes the join-two-together
(JTT) method, which is a clustering method based on the KOO method for
group-wise linear regression with graph structure. The JTT method formulates
the clustering problem as an edge selection problem for a graph and determines
whether to select each edge based on the KOO method. We can employ network
Lasso to perform such a clustering. However, network Lasso is somewhat
cumbersome because there is no good algorithm for solving the associated
optimization problem and the tuning is complicated. Therefore, by deriving a
model selection criterion such that the JTT method has consistency in
clustering under a high-dimensional asymptotic framework, we propose a simple
yet powerful method that outperforms network Lasso.
Authors' comments: 18 pages, 1 figure
Shashidhar Reddy Javaji, Bhavul Gauri, Zining Zhu
Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, utilizing a variety of prompts ranging from vague "improve it" feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer-equivalence plus reasoning-soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early in ideas and code, but in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, while targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.
Juan Alberto Cano, Joaquín González-Nuevo, Laura Bonavera, Marcos M. Cueli, Tom Bakx, Jose M. Casas, Rebeca Fernández-Fernández, David Crespo
We present a new and independent methodology for identifying gravitational
lens candidates using data from the H-ATLAS and AllWISE surveys. Unlike
previous approaches, which are typically biased toward bright, strongly lensed
submillimeter galaxies (SMGs), our method uncovers fainter systems with lower
magnifications. This enables the identification and individual study of lensing
events that would otherwise only be accessible through statistical weak lensing
analyses. Our approach focuses on high-redshift SMGs from H-ATLAS in the range
$1.2 < z < 4.0$, and searches for associated AllWISE sources within an angular
distance of 18 arcsec. Candidate lenses are selected based on their WISE colors
($0.5 < \mathrm{W2} - \mathrm{W3} < 1.5$ mag), consistent with those of
elliptical galaxies, and further filtered using $J-\mathrm{W1}$ color and
photometric redshift cuts to reduce stellar contamination. This conservative
selection yields 68 new lens candidates. We then performed SED fitting with
CIGALE across UV to submillimeter wavelengths to estimate the physical
properties of both the lenses and the background SMGs, and to assess the
lensing nature of these candidates. Despite the uncertainties, we were able to
constrain key parameters such as stellar and dust masses, infrared
luminosities, and star formation rates. In addition, the estimated
magnifications for most candidates are modest, consistent with the weak lensing
regime ($\mu \simeq 1{-}1.5$), although a few sources may require more precise
modeling. Future efforts could refine this methodology to recover potential
candidates outside our selection, and high-resolution follow-up observations
will be essential to confirm the lensing nature of these sources and to further
investigate their physical properties.
Authors' comments: Accepted in A&A
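The cuts quoted in the abstract can be sketched as a simple filter. This sketch covers only the SMG redshift range, the 18 arcsec matching radius, and the W2 - W3 color window; the J - W1 and photometric-redshift cuts are left out because their thresholds are not given. The function name, the dictionary keys, and the catalog rows are hypothetical.

```python
def is_lens_candidate(smg_z, sep_arcsec, w2, w3):
    """Apply the redshift, separation, and W2 - W3 color cuts quoted above."""
    in_redshift_range = 1.2 < smg_z < 4.0
    within_radius = sep_arcsec <= 18.0
    elliptical_like = 0.5 < (w2 - w3) < 1.5
    return in_redshift_range and within_radius and elliptical_like


# Hypothetical catalog rows: SMG redshift, separation, W2 and W3 magnitudes.
rows = [
    {"smg_z": 2.1, "sep_arcsec": 6.0, "w2": 15.0, "w3": 14.0},   # passes all cuts
    {"smg_z": 0.8, "sep_arcsec": 6.0, "w2": 15.0, "w3": 14.0},   # redshift too low
    {"smg_z": 2.1, "sep_arcsec": 25.0, "w2": 15.0, "w3": 14.0},  # outside 18 arcsec
    {"smg_z": 2.1, "sep_arcsec": 6.0, "w2": 15.0, "w3": 12.5},   # color outside window
]
selected = [row for row in rows if is_lens_candidate(**row)]
```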
Shivam Akhauri
In production tool-use agents (e.g., retrieval -> summarization -> calculator), routers must know when to stop exploring while preserving local DP and leaving an auditable trail. We present run-wise early-stopping certificates for perturb-and-MAP (PaM) best-first search on context-indexed prefix DAGs whose children partition the leaves. We couple realized path scores and pruning keys to a single exponential race realized lazily via offset propagation. With exact leaf counts N(v), lazy reuse at winners and independent residuals yield an Exact mode with a sound halting rule based on Key(v) = M_tau(v) - log t(v), where t(v) is the minimum arrival time among leaves under v. With only upper bounds N_ub >= N, a Surrogate mode uses a parent-anchored surrogate race without winner reuse; because -log t_hat >= -log t, the frontier invariant holds and stopping remains sound. We add a compiler from shared-node DAGs to prefix DAGs, local finiteness checks, a SuffixCountDP routine for exact counts with safe downgrades, a validator-side tightening term kappa = log(N/N_ub), and an auditable ledger/validator that replays runs deterministically. We also give an absolute LogSumExp tail bound, an acyclicity certificate, and a fallback PRF-per-leaf scheme (NoCert) whose work matches a realized-score best-first baseline up to a small per-node overhead. Finally, we integrate a price/latency/(epsilon, delta)-aware multi-LLM controller and DP-trained LoRA adapters chosen at runtime; these choices do not affect the two-mode frontier invariants. We report Mac/commodity-hardware reproducible results, a small real tool-use pipeline, and validator-checked audit trails, with code and ledgers provided.
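To make the exponential-race idea concrete, the sketch below shows only the generic link between an exponential race and perturb-and-MAP sampling. It does not reproduce the paper's prefix-DAG search, offset propagation, or certificates. Each leaf i draws an arrival time from an exponential distribution with rate exp(score_i); the earliest arrival is then distributed according to the softmax of the scores, and -log t is the score plus standard Gumbel noise. The scores, trial count, and function names are assumptions for illustration.

```python
import math
import random


def race_winner(scores, rng):
    """Run one exponential race: leaf i arrives at t_i ~ Exp(rate=exp(score_i))."""
    arrivals = [rng.expovariate(math.exp(s)) for s in scores]
    winner = min(range(len(scores)), key=arrivals.__getitem__)
    # -log t_i equals score_i plus standard Gumbel noise, so the earliest
    # arrival is also the maximizer of the perturbed scores.
    return winner, -math.log(arrivals[winner])


def softmax(scores):
    """Numerically stable softmax over a list of log-scores."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
scores = [0.0, 1.0, 2.0]  # hypothetical leaf log-scores
trials = 20000
counts = [0] * len(scores)
for _ in range(trials):
    winner, _key = race_winner(scores, rng)
    counts[winner] += 1
frequencies = [c / trials for c in counts]
```

The empirical winner frequencies should track `softmax(scores)`. The `-log t` value returned for the winner loosely corresponds to the `-log t(v)` term in the key above, under the assumption that `t(v)` is the minimum arrival time of such a race.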
Andrei Baroian, Kasper Notebomer
Transformer-based language models traditionally use uniform (isotropic) layer
sizes, yet they ignore the diverse functional roles that different depths can
play and their computational capacity needs. Building on Layer-Wise Scaling
(LWS) and pruning literature, we introduce three new LWS variants - Framed,
Reverse, and Crown - that redistribute FFN widths and attention heads via two
or three-point linear interpolation in the pre-training stage. We present the
first systematic ablation of LWS and its variants, on a fixed budget of 180M
parameters, trained on 5B tokens. All models converge to similar losses and
achieve better performance compared to an equal-cost isotropic baseline,
without a substantial decrease in training throughput. This work represents an
initial step into the design space of layer-wise architectures for
pre-training, but future work should scale experiments to orders of magnitude
more tokens and parameters to fully assess their potential.
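The two- or three-point linear interpolation mentioned above can be sketched as a piecewise-linear width schedule over depth. This is an illustrative sketch only: it does not reproduce the Framed, Reverse, or Crown profiles, whose anchor values are not given here, and the function name, anchor positions, and widths are hypothetical.

```python
def interpolate_widths(num_layers, anchors):
    """Piecewise-linear widths across depth (requires num_layers >= 2).

    anchors: list of (relative_depth, width) pairs sorted by depth, with
    depths running from 0.0 to 1.0; two anchors give a straight ramp and
    three anchors give a profile with one kink.
    """
    widths = []
    for layer in range(num_layers):
        depth = layer / (num_layers - 1)
        for (d0, w0), (d1, w1) in zip(anchors, anchors[1:]):
            if d0 <= depth <= d1:
                frac = (depth - d0) / (d1 - d0)
                widths.append(round(w0 + frac * (w1 - w0)))
                break
    return widths


# Hypothetical FFN widths: a two-point ramp and a three-point profile
# that peaks in the middle of the network.
ramp = interpolate_widths(5, [(0.0, 1024), (1.0, 2048)])
peaked = interpolate_widths(5, [(0.0, 1024), (0.5, 2048), (1.0, 1024)])
```

In a fixed-budget comparison, the anchor widths would also need to be chosen so that the total parameter count matches the isotropic baseline; that step is omitted here.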
Authors' comments: The reported results are skewed due to a data type mismatch. The
dataset was saved with int32, but the data loader interpreted it as uint16.
As a result, each 32-bit token was incorrectly split into two 16-bit tokens.
Outcome: a consistent artifact where every other token is zero
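The artifact described in this comment can be reproduced in a few lines. The sketch assumes little-endian storage and token ids below 2**16; the token values are hypothetical.

```python
import struct

tokens = [17, 300, 65535]  # hypothetical token ids, all below 2**16

# Writer: store each id as a little-endian 32-bit integer.
raw = struct.pack(f"<{len(tokens)}i", *tokens)

# Reader: wrongly interpret the same bytes as unsigned 16-bit integers.
misread = list(struct.unpack(f"<{len(raw) // 2}H", raw))

# Reader with the matching type recovers the original ids.
correct = list(struct.unpack(f"<{len(raw) // 4}i", raw))
```

Under these assumptions, each 32-bit value is read as its low 16 bits followed by its high 16 bits, which are all zero, so every other token in `misread` is zero and the sequence is twice as long as intended.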