Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, Xiangyang Ji
Pre-trained video generation models hold great potential for generative video
super-resolution (VSR). However, adapting them for full-size VSR, as most
existing methods do, suffers from unnecessarily intensive full-attention
computation and fixed output resolution. To overcome these limitations, we make
the first exploration into utilizing video diffusion priors for patch-wise VSR.
This is non-trivial because pre-trained video diffusion models are not natively
suited to patch-level detail generation. To mitigate this challenge, we propose an
innovative approach, called PatchVSR, which integrates a dual-stream adapter
for conditional guidance. The patch branch extracts features from input patches
to maintain content fidelity while the global branch extracts context features
from the resized full video to bridge the generation gap caused by incomplete
semantics of patches. In particular, we also inject the patch's location
information into the model to better contextualize patch synthesis within the
global video frame. Experiments demonstrate that our method can synthesize
high-fidelity, high-resolution details at the patch level. A tailor-made
multi-patch joint modulation is proposed to ensure visual consistency across
individually enhanced patches. Due to the flexibility of our patch-based
paradigm, we can achieve highly competitive 4K VSR based on a 512x512
resolution base model, with extremely high efficiency.
Authors' comments: CVPR 2025
Nhan T. Luu, Duong T. Luu, Pham Ngoc Nam, Truong Cong Thang
Spiking Neural Networks (SNNs) have gained significant traction in both
computational neuroscience and artificial intelligence for their potential in
energy-efficient computing. In contrast, artificial neural networks (ANNs)
excel at gradient-based optimization and high accuracy. This contrast has
consequently led to a growing subfield of hybrid ANN-SNN research. However,
existing hybrid approaches often either rely on a strict separation between ANN
and SNN components or employ SNN-only encoders followed by ANN classifiers,
because spike encoding functions are non-differentiable. As a result, prior
hybrid architectures lack deep layer-wise cooperation during
backpropagation. To address this gap, we propose a novel hybrid ANN-SNN
framework that integrates layer-wise encode-decode SNN blocks within
conventional ANN pipelines. Central to our method is the use of surrogate
gradients for a bit-plane-based spike encoding function, enabling end-to-end
differentiable training across ANN and SNN layers. This design achieves
competitive accuracy with state-of-the-art pure ANN and SNN models while
retaining the potential efficiency and temporal representation benefits of
spiking computation. To the best of our knowledge, this is the first
use of a surrogate gradient for bit-plane coding specifically, and for a
spike encoder interface in general, in the context of hybrid ANN-SNN
models, leading to a new class of hybrid models that open new
directions for future research.
Authors' comments: Work under peer-review
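As a rough illustration of the bit-plane coding mentioned in the abstract above, the following sketch decomposes unsigned integer intensities into binary spike planes and reconstructs them. It covers only the forward encoding; the surrogate gradient used for training is not shown. The function names, the 8-bit width, and the most-significant-bit-first ordering are assumptions made for illustration, not the authors' implementation.

```python
def bit_plane_encode(values, num_bits=8):
    """Return num_bits binary planes (MSB first) for a list of unsigned ints."""
    planes = []
    for bit in range(num_bits - 1, -1, -1):
        # Each plane holds one bit of every value: 1 is a spike, 0 is silence.
        planes.append([(v >> bit) & 1 for v in values])
    return planes


def bit_plane_decode(planes):
    """Rebuild the integers by weighting each plane with its power of two."""
    num_bits = len(planes)
    values = [0] * len(planes[0])
    for index, plane in enumerate(planes):
        weight = 1 << (num_bits - 1 - index)
        values = [acc + weight * spike for acc, spike in zip(values, plane)]
    return values


pixels = [0, 5, 128, 255]          # hypothetical 8-bit intensities
spikes = bit_plane_encode(pixels)  # 8 planes, one binary value per pixel
restored = bit_plane_decode(spikes)
```

The encoding is lossless for inputs that fit in `num_bits`, so `restored` equals `pixels`. The step `(v >> bit) & 1` is not differentiable, which is the kind of operation a surrogate gradient would stand in for during backpropagation.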
Zhangyao Song, Nanqing Jiang, Miaohong He, Xiaoyu Zhao, Tao Guo
Downsampling-based methods for time series forecasting have attracted increasing attention due to their superiority in capturing sequence trends. However, these approaches mainly capture dependencies within subsequences but neglect inter-subsequence and inter-channel interactions, which limits forecasting accuracy. To address these limitations, we propose CTPNet, a novel framework that explicitly learns representations from three perspectives: i) inter-channel dependencies, captured by a temporal query-based multi-head attention mechanism; ii) intra-subsequence dependencies, modeled via a Transformer to characterize trend variations; and iii) inter-subsequence dependencies, extracted by reusing the encoder with residual connections to capture global periodic patterns. By jointly integrating these levels, the proposed method provides a more holistic representation of temporal dynamics. Extensive experiments demonstrate the superiority of the proposed method.
Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
While large audio language models excel at tasks like ASR and emotion
recognition, they still struggle with complex reasoning due to the modality gap
between audio and text as well as the lack of structured intermediate
supervision. To address this, we propose a unified knowledge distillation
framework to transfer reasoning capabilities from a high-capacity textual
teacher model to a student audio model while preserving its acoustic
competence. Our method introduces two key dimensions: source-wise distillation,
which leverages both textual and acoustic teachers to provide complementary
modality-specific supervision; and layer-wise distillation, which aligns
teacher signals with appropriate student layers to improve transfer efficiency.
This dual-dimensional strategy enables fine-grained control over the
distillation process, effectively bridging the gap between symbolic reasoning
and speech representations. Experimental results show significant improvements
in audio reasoning performance, demonstrating the effectiveness of our
framework as a reasoning transfer solution for audio modeling.
Authors' comments: 5 pages; submitted to ICASSP 2026
Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan
Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.
Yuan-Ming Hsu, Xiaosheng Huang, Christopher J. Storfer, Jose Carlos Inchausti, David Schlegel, John Moustakas, J. Aguilar, S. Ahlen et al.
We present a new method to search for strong gravitational lensing systems by
pairing spectra that are close together on the sky in a spectroscopic survey.
We visually inspect 26,621 spectra in the Dark Energy Spectroscopic Instrument
(DESI) Data Release 1 that are selected in this way. We further inspect the
11,848 images corresponding to these spectra in the DESI Legacy Imaging Surveys
Data Release 10, and obtain 2046 conventional strong gravitational lens
candidates, of which 1906 are new. This constitutes the largest sample of lens
candidates identified to date in spectroscopic data. Besides the conventional
candidates, we identify a new class of systems that we term "dimple lenses".
These systems have a low-mass foreground galaxy as a lens, typically smaller in
angular extent and fainter compared with the lensed background source galaxy,
producing subtle surface brightness indentations in the latter. We report the
discovery of 318 of these "dimple-lens" candidates. We suspect that these
represent dwarf galaxy lensing. With follow-up observations, they could offer a
new avenue to test the cold dark matter model by probing their mass profiles,
stellar mass-halo mass relation, and halo mass function for $M_{\textrm{Halo}}
\lesssim 10^{13}\,M_\odot$. Thus, in total, we report 2164 new lens candidates.
Our method demonstrates the power of pairwise spectroscopic analysis and
provides a pathway complementary to imaging-based and single-spectrum lens
searches.
Authors' comments: 23 pages, 15 figures, 4 tables, submitted to ApJS
Linyang He, Qiaolin Wang, Xilin Jiang, Nima Mesgarani
Transformer-based speech language models (SLMs) have significantly improved
neural speech recognition and understanding. While existing research has
examined how well SLMs encode shallow acoustic and phonetic features, the
extent to which SLMs encode nuanced syntactic and conceptual features remains
unclear. By drawing parallels with linguistic competence assessments for large
language models, this study is the first to systematically evaluate the
presence of contextual syntactic and semantic features across SLMs for
self-supervised learning (S3M), automatic speech recognition (ASR), speech
compression (codec), and as the encoder for auditory large language models
(AudioLLMs). Through minimal pair designs and diagnostic feature analysis
across 71 tasks spanning diverse linguistic levels, our layer-wise and
time-resolved analysis uncovers that 1) all speech models encode grammatical
features more robustly than conceptual ones.
Authors' comments: EMNLP 2025 Main Conference (Oral)
Cedric Gaberle, Manpreet Singh Jattana
We propose an optimization method for the Variational Quantum Eigensolver (VQE) that combines adaptive and physics-inspired ansatz design. Instead of optimizing multiple layers simultaneously, the ansatz is built incrementally from its operator subsets, enabling subspace optimization that provides better initialization for subsequent steps. This quasi-dynamical approach preserves expressivity and hardware efficiency while avoiding the overhead of operator selection associated with adaptive methods. Benchmarks on one- and two-dimensional Heisenberg and Hubbard models with up to 20 qubits show improved fidelities, reduced function evaluations, or both, compared to fixed-layer VQE. The method is simple, cost-effective, and particularly well-suited for current noisy intermediate-scale quantum (NISQ) devices.
Sumanta Bhattacharyya, Sara Riaz, Pedram Rooshenas
Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment: supervised fine-tuning (SFT) followed by direct preference optimization (DPO), which calibrates the model's intermediate reasoning with human-validated conceptual preferences and then conditions the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
BaiChen Fan, Sifan Zhou, Jian Li, Shibo Zhao, Muqing Cao, Qin Wang
LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics
and autonomous systems. Existing methods typically follow frame-wise motion
estimation or a sequence-based paradigm. However, the two-frame methods are
efficient but lack long-term temporal context, making them vulnerable in sparse
or occluded scenes, while sequence-based methods that process multiple point
clouds gain robustness at a significant computational cost. To resolve this
dilemma, we propose a novel trajectory-based paradigm and its instantiation,
TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame
tracker by implicitly learning motion continuity from historical bounding box
trajectories alone, without requiring additional, costly point cloud inputs. It
first generates a fast, explicit motion proposal and then uses an implicit
motion modeling module to predict the future trajectory, which in turn refines
and corrects the initial proposal. Extensive experiments on the large-scale
NuScenes benchmark show that TrajTrack achieves new state-of-the-art
performance, dramatically improving tracking precision by 4.48% over a strong
baseline while running at 56 FPS. We also demonstrate the strong
generalizability of TrajTrack across different base trackers. Video is
available at https://www.bilibili.com/video/BV1ahYgzmEWP.
Authors' comments: 9 pages, 7 figures
M. Ohishi, R. Oda
The kick-one-out (KOO) method is a variable selection method based on a model
selection criterion. The method is very simple, and yet it has consistency in
variable selection under a high-dimensional asymptotic framework with a
specific model selection criterion. This paper proposes the join-two-together
(JTT) method, which is a clustering method based on the KOO method for
group-wise linear regression with graph structure. The JTT method formulates
the clustering problem as an edge selection problem for a graph and determines
whether to select each edge based on the KOO method. We can employ network
Lasso to perform such a clustering. However, network Lasso is somewhat
cumbersome because there is no good algorithm for solving the associated
optimization problem and the tuning is complicated. Therefore, by deriving a
model selection criterion such that the JTT method has consistency in
clustering under a high-dimensional asymptotic framework, we propose a simple
yet powerful method that outperforms network Lasso.
Authors' comments: 18 pages, 1 figure
Shashidhar Reddy Javaji, Bhavul Gauri, Zining Zhu
Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, utilizing a variety of prompts ranging from vague "improve it" feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer-equivalence plus reasoning-soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early in ideas and code, but in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, while targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.
Juan Alberto Cano, Joaquín González-Nuevo, Laura Bonavera, Marcos M. Cueli, Tom Bakx, Jose M. Casas, Rebeca Fernández-Fernández, David Crespo
We present a new and independent methodology for identifying gravitational
lens candidates using data from the H-ATLAS and AllWISE surveys. Unlike
previous approaches, which are typically biased toward bright, strongly lensed
submillimeter galaxies (SMGs), our method uncovers fainter systems with lower
magnifications. This enables the identification and individual study of lensing
events that would otherwise only be accessible through statistical weak lensing
analyses. Our approach focuses on high-redshift SMGs from H-ATLAS in the range
$1.2 < z < 4.0$, and searches for associated AllWISE sources within an angular
distance of 18 arcsec. Candidate lenses are selected based on their WISE colors
($0.5 < \mathrm{W2} - \mathrm{W3} < 1.5$ mag), consistent with those of
elliptical galaxies, and further filtered using $J-\mathrm{W1}$ color and
photometric redshift cuts to reduce stellar contamination. This conservative
selection yields 68 new lens candidates. We then performed SED fitting with
CIGALE across UV to submillimeter wavelengths to estimate the physical
properties of both the lenses and the background SMGs, and to assess the
lensing nature of these candidates. Despite the uncertainties, we were able to
constrain key parameters such as stellar and dust masses, infrared
luminosities, and star formation rates. In addition, the estimated
magnifications for most candidates are modest, consistent with the weak lensing
regime ($\mu \simeq 1{-}1.5$), although a few sources may require more precise
modeling. Future efforts could refine this methodology to recover potential
candidates outside our selection, and high-resolution follow-up observations
will be essential to confirm the lensing nature of these sources and to further
investigate their physical properties.
Authors' comments: Accepted in A&A
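The selection cuts quoted in the abstract above can be written as a simple filter. This is a minimal sketch under stated assumptions: the function and argument names are hypothetical, the 18 arcsec limit is treated as inclusive, and the $J-\mathrm{W1}$ and photometric redshift cuts are left out because their thresholds are not given in the abstract. The example pairs are invented values for illustration only.

```python
def passes_selection(z_smg, separation_arcsec, w2, w3):
    """Apply the redshift, separation, and W2 - W3 colour cuts from the abstract."""
    in_redshift_range = 1.2 < z_smg < 4.0           # high-redshift SMG
    is_nearby = separation_arcsec <= 18.0           # AllWISE counterpart within 18 arcsec
    has_elliptical_colour = 0.5 < (w2 - w3) < 1.5   # colour consistent with ellipticals
    return in_redshift_range and is_nearby and has_elliptical_colour


# Hypothetical SMG / AllWISE pairs: (z_smg, separation_arcsec, W2, W3)
pairs = [
    (2.1, 7.5, 15.0, 14.0),   # passes all three cuts
    (0.8, 7.5, 15.0, 14.0),   # redshift too low
    (2.5, 25.0, 15.0, 14.0),  # counterpart too far away
    (3.0, 10.0, 15.0, 12.0),  # W2 - W3 = 3.0, too red
]
selected = [passes_selection(*pair) for pair in pairs]
```

Only the first pair survives all three cuts, so `selected` is `[True, False, False, False]`.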
Shivam Akhauri
In production tool-use agents (e.g., retrieval -> summarization -> calculator), routers must know when to stop exploring while preserving local DP and leaving an auditable trail. We present run-wise early-stopping certificates for perturb-and-MAP (PaM) best-first search on context-indexed prefix DAGs whose children partition the leaves. We couple realized path scores and pruning keys to a single exponential race realized lazily via offset propagation. With exact leaf counts N(v), lazy reuse at winners and independent residuals yield an Exact mode with a sound halting rule based on Key(v) = M_tau(v) - log t(v), where t(v) is the minimum arrival time among leaves under v. With only upper bounds N_ub >= N, a Surrogate mode uses a parent-anchored surrogate race without winner reuse; because -log t_hat >= -log t, the frontier invariant holds and stopping remains sound. We add a compiler from shared-node DAGs to prefix DAGs, local finiteness checks, a SuffixCountDP routine for exact counts with safe downgrades, a validator-side tightening term kappa = log(N/N_ub), and an auditable ledger/validator that replays runs deterministically. We also give an absolute LogSumExp tail bound, an acyclicity certificate, and a fallback PRF-per-leaf scheme (NoCert) whose work matches a realized-score best-first baseline up to a small per-node overhead. Finally, we integrate a price/latency/(epsilon, delta)-aware multi-LLM controller and DP-trained LoRA adapters chosen at runtime; these choices do not affect the two-mode frontier invariants. We report Mac/commodity-hardware reproducible results, a small real tool-use pipeline, and validator-checked audit trails, with code and ledgers provided.
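The `- log t(v)` term in the key above relies on a standard identity: if each leaf carries an exponential clock, the first clock to ring is the same leaf that wins a Gumbel-perturbed argmax. The sketch below checks that identity numerically. It assumes each leaf's rate is `exp(score)`, which the abstract does not state, and it does not model `M_tau(v)`, the prefix DAG, offset propagation, or the stopping certificates. All names are hypothetical.

```python
import math
import random


def race_and_gumbel_winners(scores, rng):
    """Return (first-arrival leaf, perturbed-argmax leaf) using shared randomness."""
    arrivals = []
    keys = []
    for score in scores:
        e = rng.expovariate(1.0)              # unit-rate exponential draw
        arrivals.append(e / math.exp(score))  # arrival time of a clock with rate exp(score)
        keys.append(score - math.log(e))      # equals -log(arrival time): score + Gumbel noise
    race_winner = min(range(len(scores)), key=arrivals.__getitem__)
    gumbel_winner = max(range(len(scores)), key=keys.__getitem__)
    return race_winner, gumbel_winner


rng = random.Random(0)
scores = [0.3, -1.2, 2.0]  # hypothetical leaf scores
results = [race_and_gumbel_winners(scores, rng) for _ in range(1000)]
agree = all(race == gumbel for race, gumbel in results)
```

Because `-log` is strictly decreasing, the smallest arrival time and the largest key always point at the same leaf, which is why a single race can drive both the realized score and the pruning key.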
Andrei Baroian, Kasper Notebomer
Transformer-based language models traditionally use uniform (isotropic) layer
sizes, which ignores the diverse functional roles that different depths can
play and their computational capacity needs. Building on Layer-Wise Scaling
(LWS) and pruning literature, we introduce three new LWS variants (Framed,
Reverse, and Crown) that redistribute FFN widths and attention heads via two-
or three-point linear interpolation in the pre-training stage. We present the
first systematic ablation of LWS and its variants, on a fixed budget of 180M
parameters, trained on 5B tokens. All models converge to similar losses and
achieve better performance compared to an equal-cost isotropic baseline,
without a substantial decrease in training throughput. This work represents an
initial step into the design space of layer-wise architectures for
pre-training, but future work should scale experiments to orders of magnitude
more tokens and parameters to fully assess their potential.
Authors' comments: The reported results are skewed due to a data type mismatch. The
dataset was saved with int32, but the data loader interpreted it as uint16.
As a result, each 32-bit token was incorrectly split into two 16-bit tokens.
Outcome: a consistent artifact where every other token is zero
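The artifact described in the comment above is easy to reproduce. The sketch below writes a few token ids as 32-bit integers and reads the same bytes back as 16-bit unsigned integers. It assumes little-endian storage and token ids below 65536; the ids themselves are made up for illustration.

```python
import struct

token_ids = [17, 4242, 50000]              # hypothetical token ids, all below 65536
raw = struct.pack("<3i", *token_ids)       # saved as little-endian int32 (12 bytes)
misread = list(struct.unpack("<6H", raw))  # loaded as uint16: twice as many values

# Each int32 splits into its low 16 bits followed by its high 16 bits.
# For ids below 65536 the high half is always zero.
```

Here `misread` is `[17, 0, 4242, 0, 50000, 0]`: the sequence doubles in length and every other token is zero, matching the comment.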
Xiaozheng Jiang, Wei Zhang, Xuerui Mao
Detecting tiny objects in remote sensing (RS) imagery has been a long-standing challenge due to their extremely limited spatial information, weak feature representations, and dense distributions across complex backgrounds. Despite numerous efforts, mainstream detectors still underperform in such scenarios. To bridge this gap, we introduce RS-TinyNet, a multi-stage feature fusion and enhancement model explicitly tailored for RS tiny object detection in various RS scenarios. RS-TinyNet comes with two novel designs: tiny object saliency modeling and feature integrity reconstruction. Guided by these principles, we design three step-wise feature enhancement modules. Among them, the multi-dimensional collaborative attention (MDCA) module employs multi-dimensional attention to enhance the saliency of tiny objects. Additionally, the auxiliary reversible branch (ARB) and a progressive fusion detection head (PFDH) module are introduced to preserve information flow and fuse multi-level features to bridge semantic gaps and retain structural detail. Comprehensive experiments on the public RS dataset AI-TOD show that our RS-TinyNet surpasses existing state-of-the-art (SOTA) detectors by 4.0% AP and 6.5% AP75. Evaluations on the DIOR benchmark dataset further validate its superior detection performance in diverse RS scenarios. These results demonstrate that the proposed multi-stage feature fusion strategy offers an effective and practical solution for tiny object detection in complex RS environments.
Authors' comments: The content of the thesis requires supplementation to make it more substantial
Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae
KV caching significantly improves the efficiency of Large Language Model
(LLM) inference by storing attention states from previously processed tokens,
enabling faster generation of subsequent tokens. However, as sequence length
increases, the KV cache quickly becomes a major memory bottleneck. To address
this, we propose PagedEviction, a novel fine-grained, structured KV cache
pruning strategy that enhances the memory efficiency of vLLM's PagedAttention.
Unlike existing approaches that rely on attention-based token importance or
evict tokens across different vLLM pages, PagedEviction introduces an efficient
block-wise eviction algorithm tailored for paged memory layouts. Our method
integrates seamlessly with PagedAttention without requiring any modifications
to its CUDA attention kernels. We evaluate PagedEviction across
Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models
on the LongBench benchmark suite, demonstrating improved memory usage with
better accuracy than baselines on long context tasks.
Authors' comments: Preprint
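To make the idea of block-wise eviction in a paged layout concrete, here is a minimal sketch. It is not the PagedEviction algorithm: the abstract does not say how blocks are ranked, so the per-block scores below are a hypothetical stand-in, and the function name and block size are assumptions. The sketch only shows that whole blocks are dropped, so the surviving blocks stay intact and no tokens need to move between pages.

```python
def evict_blocks(block_scores, max_blocks):
    """Keep at most max_blocks blocks, dropping whole lowest-scoring blocks."""
    kept = list(range(len(block_scores)))
    while len(kept) > max_blocks:
        victim = min(kept, key=lambda b: block_scores[b])  # lowest hypothetical score
        kept.remove(victim)
    return kept


block_size = 4
tokens = list(range(14))  # 14 cached token positions
blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]

scores = [0.9, 0.1, 0.5, 0.7]              # hypothetical per-block scores
kept = evict_blocks(scores, max_blocks=3)  # block 1 is evicted as a unit
remaining = [t for b in kept for t in blocks[b]]
```

With these scores, block 1 (positions 4 to 7) is removed in one step and the other three blocks are untouched. A token-level scheme would instead leave partially filled pages behind.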
Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang
Layer-wise Gaussian mechanisms (LGM) enhance flexibility in differentially private deep learning by injecting noise into partitioned gradient vectors. However, existing methods often rely on heuristic noise allocation strategies and lack a rigorous theoretical grounding that connects noise allocation to formal privacy-utility tradeoffs. In this paper, we present a unified analytical framework that systematically connects layer-wise noise injection strategies with their implicit optimization objectives and associated privacy budget allocations. Our analysis reveals that several existing approaches optimize ill-posed objectives -- either ignoring inter-layer signal-to-noise ratio (SNR) consistency or leading to inefficient use of the privacy budget. In response, we propose an SNR-consistent noise allocation strategy that unifies both aspects, yielding a noise allocation scheme that achieves better signal preservation and more efficient privacy budget utilization. Extensive experiments in both centralized and federated learning settings demonstrate that our method consistently outperforms existing allocation strategies, achieving better privacy-utility tradeoffs. Our framework not only offers diagnostic insights into prior methods but also provides theoretical guidance for designing adaptive and effective noise injection schemes in deep models.
Jiequn Han, Kui Ren, Nathan Soedjak
We propose an instance-wise adaptive sampling framework for constructing compact and informative training datasets for supervised learning of inverse problem solutions. Typical learning-based approaches aim to learn a general-purpose inverse map from datasets drawn from a prior distribution, with the training process independent of the specific test instance. When the prior has a high intrinsic dimension or when high accuracy of the learned solution is required, a large number of training samples may be needed, resulting in substantial data collection costs. In contrast, our method dynamically allocates sampling effort based on the specific test instance, enabling significant gains in sample efficiency. By iteratively refining the training dataset conditioned on the latest prediction, the proposed strategy tailors the dataset to the geometry of the inverse map around each test instance. We demonstrate the effectiveness of our approach in the inverse scattering problem under two types of structured priors. Our results show that the advantage of the adaptive method becomes more pronounced in settings with more complex priors or higher accuracy requirements. While our experiments focus on a particular inverse problem, the adaptive sampling strategy is broadly applicable and readily extends to other inverse problems, offering a scalable and practical alternative to conventional fixed-dataset training regimes.
Guanzhong Hu, Wenpan Li, Rujing Zha, Ping Guo
Directed energy deposition (DED), a metal additive manufacturing process, is highly susceptible to process-induced defects such as geometric deviations, lack of fusion, and poor surface finish. This work presents a build-height-synchronized fringe projection system for in-situ, layer-wise surface reconstruction of laser-DED components, achieving a reconstruction accuracy of $\pm 46\,\mu$m. From the reconstructed 3D morphology, two complementary geometry-based point cloud metrics are introduced: local point density, which highlights poor surface finish, and normal-change rate, which identifies lack-of-fusion features. These methods enable automated, annotation-free identification of common deposition anomalies directly from reconstructed surfaces, without the need for manual labeling. By directly linking geometric deviation to defect formation, the approach enables precise anomaly localization and advances the feasibility of closed-loop process control. This work establishes fringe projection as a practical tool for micrometer-scale monitoring in DED, bridging the gap between process signatures and part geometry for certifiable additive manufacturing.
Authors' comments: 26 pages, 15 figures