Marco Foschini, Marianne Defresne, Emilio Gamba, Bart Bogaerts, Tias Guns
Step-wise explanations can explain logic puzzles and other satisfaction problems by showing how to derive decisions step by step. Each step consists of a set of constraints that derive an assignment to one or more decision variables. However, many candidate explanation steps exist, with different sets of constraints and different decisions they derive. To identify the most comprehensible one, a user-defined objective function is required to quantify the quality of each step. However, defining a good objective function is challenging. Here, interactive preference elicitation methods from the wider machine learning community can offer a way to learn user preferences from pairwise comparisons. We investigate the feasibility of this approach for step-wise explanations and address several limitations that distinguish it from elicitation for standard combinatorial problems. First, because the explanation quality is measured using multiple sub-objectives that can vary a lot in scale, we propose two dynamic normalization techniques to rescale these features and stabilize the learning process. We also observed that many generated comparisons involve similar explanations. For this reason, we introduce MACHOP (Multi-Armed CHOice Perceptron), a novel query generation strategy that integrates non-domination constraints with upper confidence bound-based diversification. We evaluate the elicitation techniques on Sudokus and Logic-Grid puzzles using artificial users, and validate them with a real-user evaluation. In both settings, MACHOP consistently produces higher-quality explanations than the standard approach.
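As a rough illustration of learning an objective from pairwise comparisons, the sketch below uses a plain perceptron-style update on rescaled sub-objectives. It is not MACHOP and not the authors' normalization scheme: the max-absolute-value rescaling, the candidate steps, the two features, and the simulated user weights are all assumptions made up for this example.

```python
from itertools import combinations
from typing import List, Sequence


def rescale(features: Sequence[float], scale: Sequence[float]) -> List[float]:
    """Divide each sub-objective by its scale so all features are comparable."""
    return [f / s if s > 0 else 0.0 for f, s in zip(features, scale)]


def cost(weights: Sequence[float], features: Sequence[float]) -> float:
    """Weighted sum of sub-objectives; lower means a better explanation step."""
    return sum(w * f for w, f in zip(weights, features))


def perceptron_update(weights, preferred, rejected, lr=1.0):
    """Shift the weights only if `preferred` is not already ranked strictly lower."""
    if cost(weights, preferred) >= cost(weights, rejected):
        return [w + lr * (r - p) for w, p, r in zip(weights, preferred, rejected)]
    return list(weights)


# Hypothetical candidate steps: (number of constraints, total constraint size).
raw_steps = [[2, 300], [5, 40], [3, 120], [1, 500]]
# Assumed normalization: divide each feature by its largest observed magnitude.
scale = [max(abs(step[i]) for step in raw_steps) for i in range(2)]
steps = [rescale(step, scale) for step in raw_steps]

hidden_user = [1.0, 2.0]  # simulated user preferences, unknown to the learner
learned = [0.0, 0.0]
for _ in range(2000):
    changed = False
    for a, b in combinations(steps, 2):
        # The simulated user answers the pairwise query.
        if cost(hidden_user, a) < cost(hidden_user, b):
            preferred, rejected = a, b
        else:
            preferred, rejected = b, a
        updated = perceptron_update(learned, preferred, rejected)
        changed = changed or updated != learned
        learned = updated
    if not changed:  # every comparison is already ranked correctly
        break

user_ranking = sorted(range(len(steps)), key=lambda i: cost(hidden_user, steps[i]))
learned_ranking = sorted(range(len(steps)), key=lambda i: cost(learned, steps[i]))
```

Without the rescaling, the second feature (hundreds) would dominate the first (single digits) in every update, which is the kind of scale mismatch the abstract describes.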
Ignace Bleukx, Maarten Flippo, Bart Bogaerts, Emir Demirović, Tias Guns
In the field of Explainable Constraint Solving, it is common to explain to a user why a problem is unsatisfiable. A recently proposed method for this is to compute a sequence of explanation steps. Such a step-wise explanation shows individual reasoning steps involving constraints from the original specification, that in the end explain a conflict. However, computing a step-wise explanation is computationally expensive, limiting the scope of problems for which it can be used. We investigate how we can use proofs generated by a constraint solver as a starting point for computing step-wise explanations, instead of computing them step-by-step. More specifically, we define a framework of abstract proofs, in which both proofs and step-wise explanations can be represented. We then propose several methods for converting a proof to a step-wise explanation sequence, with special attention to trimming and simplification techniques to keep the sequence and its individual steps small. Our results show our method significantly speeds up the generation of step-wise explanation sequences, while the resulting step-wise explanation has a quality similar to the current state-of-the-art.
Authors' comments: Accepted for publication at AAAI 2026
Haoran Zhang, Wenhao Zhang, Xianping Wu
We study finite horizon linear quadratic control with additive noise in a
perturbancewise framework that unifies the classical model, a constraint
embedded affine policy class, and a distributionally robust formulation with a
Wasserstein ambiguity set. Based on an augmented affine representation, we
model feasibility as an affine perturbation and unknown noise as distributional
perturbation from samples, thereby addressing constrained implementation and
model uncertainty in a single scheme. First, we construct an implementable
policy gradient method that accommodates nonzero noise means estimated from
data. Second, we analyze its convergence under constant stepsizes chosen as
simple polynomials of problem parameters, ensuring global decrease of the value
function. Finally, we report numerical studies on mean-variance portfolio
allocation and dynamic benchmark tracking on real data, which validate stable
convergence and illuminate sensitivity tradeoffs across horizon length, trading
cost intensity, state penalty scale, and estimation window.
Authors' comments: 40 pages, 14 figures
Landson Guo, Andres M. Diaz Aguilar, William Talbot, Turcan Tuna, Marco Hutter, Cesar Cadena
Accurate point-wise velocity estimation in 3D is crucial for robot interaction with non-rigid, dynamic agents, such as humans, enabling robust performance in path planning, collision avoidance, and object manipulation in dynamic environments. To this end, this paper proposes a novel RADAR, LiDAR, and camera fusion pipeline for point-wise 3D velocity estimation named CaRLi-V. This pipeline leverages raw RADAR measurements to create a novel RADAR representation, the velocity cube, which densely represents radial velocities within the RADAR's field-of-view. By combining the velocity cube for radial velocity extraction, optical flow for tangential velocity estimation, and LiDAR for point-wise range measurements through a closed-form solution, our approach can produce 3D velocity estimates for a dense array of points. Developed as an open-source ROS2 package, CaRLi-V has been field-tested against a custom dataset and proven to produce low velocity error metrics relative to ground truth, enabling point-wise velocity estimation for robotic applications.
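The way a radial speed, a tangential cue, and a range can combine into a 3D velocity follows from a general kinematic identity. For a point at range r along unit bearing u, the position is p = r u, so its velocity is v = (dr/dt) u + r (du/dt). The sketch below applies only that identity; it is not the CaRLi-V closed-form solution. The function names and the finite-difference bearing rate (standing in for an optical-flow estimate) are assumptions for illustration.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def norm(p: Vec) -> float:
    return math.sqrt(sum(c * c for c in p))


def bearing_of(p: Vec) -> Vec:
    """Unit vector from the sensor origin towards point p."""
    r = norm(p)
    return tuple(c / r for c in p)


def compose_velocity(bearing: Vec, bearing_rate: Vec,
                     range_m: float, radial_speed: float) -> Vec:
    """v = r_dot * u + r * u_dot (product rule applied to p = r * u)."""
    return tuple(radial_speed * u + range_m * du
                 for u, du in zip(bearing, bearing_rate))


# Hypothetical ground truth used only to generate synthetic measurements.
p0: Vec = (4.0, 3.0, 0.0)
true_v: Vec = (1.0, -2.0, 0.5)

u = bearing_of(p0)
range_m = norm(p0)                                   # stands in for the LiDAR range
radial_speed = sum(a * b for a, b in zip(true_v, u))  # stands in for the RADAR reading

# Bearing rate by central finite difference (stands in for an optical-flow estimate).
dt = 1e-5
p_plus = tuple(p + v * dt for p, v in zip(p0, true_v))
p_minus = tuple(p - v * dt for p, v in zip(p0, true_v))
bearing_rate = tuple((a - b) / (2 * dt)
                     for a, b in zip(bearing_of(p_plus), bearing_of(p_minus)))

estimate = compose_velocity(u, bearing_rate, range_m, radial_speed)
```

Each sensor contributes the quantity it measures best in this decomposition, which is presumably why a closed-form combination is attractive.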
Jaehyun Park, Konyul Park, Daehun Kim, Junseo Park, Jun Won Choi
In autonomous driving, transparency in the decision-making of perception
models is critical, as even a single misperception can be catastrophic. Yet
with multi-sensor inputs, it is difficult to determine how each modality
contributes to a prediction because sensor information becomes entangled within
the fusion network. We introduce Layer-Wise Modality Decomposition (LMD), a
post-hoc, model-agnostic interpretability method that disentangles
modality-specific information across all layers of a pretrained fusion model.
To our knowledge, LMD is the first approach to attribute the predictions of a
perception model to individual input modalities in a sensor-fusion system for
autonomous driving. We evaluate LMD on pretrained fusion models under
camera-radar, camera-LiDAR, and camera-radar-LiDAR settings for autonomous
driving. Its effectiveness is validated using structured perturbation-based
metrics and modality-wise visual decompositions, demonstrating practical
applicability to interpreting high-capacity multimodal architectures. Code is
available at https://github.com/detxter-jvb/Layer-Wise-Modality-Decomposition.
Authors' comments: Accepted to NeurIPS 2025
Shuyan Lyu, Zhanzimo Wu, Junliang Du
Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based R\'enyi's $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang et al.
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
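To make the idea of a step-wise similarity reward concrete, here is a minimal sketch that scores each model action against the expert action at the same step. The abstract does not say which similarity measure SRL uses, so `difflib.SequenceMatcher` is an assumption, as are the function names and the example actions.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def step_reward(model_action: str, expert_action: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two actions are identical strings."""
    return SequenceMatcher(None, model_action, expert_action).ratio()


def stepwise_rewards(model_actions: Sequence[str],
                     expert_actions: Sequence[str]) -> List[float]:
    """One reward per step, so partially correct rollouts still get a signal."""
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]


# Hypothetical expert demonstration and model rollout.
expert = ["expand (x+1)^2", "collect terms"]
rollout = ["expand (x+1)^2", "collect like terms"]
rewards = stepwise_rewards(rollout, expert)
```

Unlike a binary final-answer reward, every step here yields a graded value, so a rollout that ends with a wrong answer can still be credited for the steps that match the demonstration.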
Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan
The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single layer and optimizes it sequentially, while freezing the other layers to save optimizer states and activations. Although effective, these methods ignore the varying importance of the modules within each layer, leading to suboptimal performance. Moreover, layer-wise sampling provides only limited memory savings, as at least one full layer must remain active during optimization. To overcome these limitations, we propose Module-wise Importance SAmpling (MISA), a novel method that divides each layer into smaller modules and assigns importance scores to each module. MISA uses a weighted random sampling mechanism to activate modules, provably reducing gradient variance compared to layer-wise sampling. Additionally, we establish an \(\mathcal{O}(1/\sqrt{K})\) convergence rate under non-convex and stochastic conditions, where $K$ is the total number of block updates, and provide a detailed memory analysis showcasing MISA's superiority over existing baseline methods. Experiments on diverse learning tasks validate the effectiveness of MISA. Source code is available at https://github.com/pkumelon/MISA.
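The weighted sampling of modules can be sketched as follows. This is only an illustration of importance-proportional sampling without replacement. The module names, the scores, and the budget `k` are hypothetical, and MISA's actual scoring and sampling rules are not specified in the abstract.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_modules(scores: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw k distinct modules; each draw is proportional to the remaining scores."""
    pool = dict(scores)
    chosen: List[str] = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


# Hypothetical importance scores for the modules of one transformer block.
scores = {"attn.q": 10.0, "attn.k": 1.0, "mlp.up": 1.0, "mlp.down": 0.1}
rng = random.Random(0)

# Only the sampled modules would be unfrozen for the next optimization block.
active = sample_modules(scores, k=3, rng=rng)

# Over many draws, high-importance modules are activated far more often.
counts = Counter(sample_modules(scores, k=1, rng=rng)[0] for _ in range(2000))
```

Sampling at module granularity means the active set can be smaller than one full layer, which is the memory argument made in the abstract.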
Mikayla J. Wilson, Mary Anne Limbach, Andrew J. Skemer, Johanna M. Vos, Brittany E. Miles, Melanie J. Rowland, Andrew Vanderburg, Adam C. Schneider et al.
JWST is collecting time-series observations of many free-floating planets
(FFPs) to study their weather, and these light curves are also ideal datasets to
search for exomoons that transit the FFP during observations. In this paper, we
present observations of the planetary-mass Y dwarf ($T=250-285K$, $M =
6.5\pm3.5 M_{Jup}$, d = 2.3$\,$pc) WISE J085510.83-071442.5 (WISE 0855), whose
proximity and brightness make it ideal for a transiting exomoon search. We
examine 11 hours of time-series spectra from the JWST Near-Infrared
Spectrograph (NIRSpec) whose sensitivity, in combination with Gaussian process
(GP) modeling, allows for the disentanglement of exomoon transits from WISE
0855's variability. We do not find statistically significant evidence of an
exomoon transit in this dataset. Using injection and recovery tests of
artificial transits for depths ranging between 0.1-1% (0.35-1.12 $R_{\oplus}$)
we explore the exomoon parameter space where we could successfully detect
transits. For transit depths $\geq 0.5\%$ (1.96$\,R_{\text{Titan}}$), our
detection rate is 96%, which, for WISE 0855, corresponds to a moon with a
companion-to-host mass ratio similar to that of Titan and Saturn. Given our
sensitivity, transit probabilities, and our observational duration, we
determine a $\sim$91% probability of detecting a Titan mass analog exomoon
after 18 such observations if every observed system hosts a Titan mass analog
exomoon in a Galilean-like system. This suggests that JWST observations of
dozens of FFPs could yield meaningful constraints on the occurrence rate of
exomoons. This paper is the first demonstration that JWST is sensitive to
Galilean moon mass analogs around FFPs.
Authors' comments: 19 pages, 18 figures, accepted for publication in AJ
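The cumulative detection probability quoted above can be related to a per-observation probability with simple arithmetic. The sketch assumes, purely for illustration, that the 18 observations are independent and share the same detection probability. The paper's own calculation may combine sensitivity and transit probability differently.

```python
def prob_at_least_one(p_single: float, n_obs: int) -> float:
    """Chance of at least one detection in n_obs independent observations."""
    return 1.0 - (1.0 - p_single) ** n_obs


def implied_single_probability(p_total: float, n_obs: int) -> float:
    """Invert the formula above: per-observation probability implied by p_total."""
    return 1.0 - (1.0 - p_total) ** (1.0 / n_obs)


# Figures from the abstract: ~91% after 18 observations.
p_single = implied_single_probability(0.91, 18)
p_check = prob_at_least_one(p_single, 18)
```

Under these assumptions the implied per-observation probability is roughly one in eight, which is why dozens of targets are needed for a meaningful occurrence-rate constraint.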
Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He
Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
Ejafa Bassam, Dawei Zhu, Kaigui Bian
Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, it has become the de facto approach to augment cross-entropy with a distillation term. Typically, this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In every case, it acts as an additional term to cross-entropy. This term has its own weight, which must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce "Plackett-Luce Distillation (PLD)", a weighted list-wise ranking loss. In PLD, the teacher model transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single "teacher-optimal" ranking. The true label is placed first, followed by the remaining classes in descending teacher confidence. This process yields a convex and translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, across CIFAR-100, ImageNet-1K, and MS-COCO, PLD achieves consistent gains across diverse architectures and distillation objectives, including divergence-based, correlation-based, and feature-based methods, in both homogeneous and heterogeneous teacher-student pairs.
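A small sketch may help show how a list-wise Plackett-Luce loss relates to cross-entropy. It builds the ranking described in the abstract (true label first, then the other classes by descending teacher logit) and evaluates a position-weighted Plackett-Luce negative log-likelihood on the student logits. The per-position weights are left as a free parameter because the abstract does not give PLD's exact weighting. The function names and the logits are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def logsumexp(xs: Sequence[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def teacher_optimal_ranking(teacher_logits: Sequence[float], true_label: int) -> List[int]:
    """True label first, then the remaining classes by descending teacher logit."""
    rest = sorted((c for c in range(len(teacher_logits)) if c != true_label),
                  key=lambda c: teacher_logits[c], reverse=True)
    return [true_label] + rest


def weighted_pl_loss(student_logits: Sequence[float], ranking: Sequence[int],
                     weights: Optional[Sequence[float]] = None) -> float:
    """Sum over positions k of w_k * (logsumexp of remaining logits - chosen logit)."""
    if weights is None:
        weights = [1.0] * len(ranking)
    loss = 0.0
    for k, chosen in enumerate(ranking):
        remaining = [student_logits[c] for c in ranking[k:]]
        loss += weights[k] * (logsumexp(remaining) - student_logits[chosen])
    return loss


# Hypothetical logits for a 4-class problem.
teacher = [2.0, 0.5, 3.0, -1.0]
student = [0.1, 1.2, -0.3, 0.7]
ranking = teacher_optimal_ranking(teacher, true_label=1)

full_loss = weighted_pl_loss(student, ranking)
shifted_loss = weighted_pl_loss([s + 5.0 for s in student], ranking)
first_only = weighted_pl_loss(student, ranking, weights=[1.0, 0.0, 0.0, 0.0])
cross_entropy = logsumexp(student) - student[1]
```

Putting all weight on the first position recovers ordinary cross-entropy on the true label, and adding a constant to every student logit leaves the loss unchanged. These match the "subsumes weighted cross-entropy" and translation-invariance properties claimed in the abstract.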
Xue Chen, Shengtang Huang, Xin Li
We study explicit constructions of min-wise hash families and their extension to $k$-min-wise hash families. Informally, a min-wise hash family guarantees that for any fixed subset $X\subseteq[N]$, every element in $X$ has an equal chance to have the smallest value among all elements in $X$; a $k$-min-wise hash family guarantees this for every subset of size $k$ in $X$. Min-wise hashing is widely used in many areas of computer science such as sketching, web page detection, and $\ell_0$ sampling. The classical works by Indyk and by P\u{a}tra\c{s}cu and Thorup have shown that $\Theta(\log(1/\delta))$-wise independent families give min-wise hash families of multiplicative (relative) error $\delta$, resulting in a construction with $\Theta(\log(1/\delta)\log N)$ random bits. Based on a reduction from pseudorandom generators for combinatorial rectangles by Saks, Srinivasan, Zhou and Zuckerman, Gopalan and Yehudayoff improved the number of bits to $O(\log N\log\log N)$ for polynomially small errors $\delta$. However, no construction with $O(\log N)$ bits (a polynomial-size family) and sub-constant error was known before. In this work, we continue and extend the study of constructing ($k$-)min-wise hash families from pseudorandomness for combinatorial rectangles and read-once branching programs. Our main result gives the first explicit min-wise hash families that use an optimal (up to a constant factor) number of random bits and achieve a sub-constant (in fact, almost polynomially small) error: specifically, an explicit family of $k$-min-wise hash with $O(k\log N)$ bits and $2^{-O(\log N/\log\log N)}$ error. This improves all previous results for any $k=\log^{O(1)}N$ under $O(k \log N)$ bits. Our main techniques involve several new ideas to adapt the classical Nisan-Zuckerman pseudorandom generator to fool min-wise hashing with a multiplicative error.
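The min-wise property and its relative error can be checked by brute force on tiny universes. The sketch below measures, for a given family and subset $X$, how often each element attains the minimum, and reports the largest relative deviation from $1/|X|$. The family of all permutations is exactly min-wise. The small linear family $h_{a,b}(x) = (ax + b) \bmod p$ is included only as an example of a compact family to measure. It is not one of the constructions from the paper, and the helper names are assumptions.

```python
from fractions import Fraction
from itertools import permutations
from typing import Callable, Dict, List, Sequence


def min_frequencies(family: Sequence[Callable[[int], int]],
                    subset: Sequence[int]) -> Dict[int, Fraction]:
    """Fraction of hash functions under which each element attains the minimum."""
    counts = {x: 0 for x in subset}
    for h in family:
        counts[min(subset, key=h)] += 1
    return {x: Fraction(c, len(family)) for x, c in counts.items()}


def relative_error(freqs: Dict[int, Fraction]) -> Fraction:
    """Largest multiplicative deviation from the ideal probability 1/|X|."""
    target = Fraction(1, len(freqs))
    return max(abs(f - target) / target for f in freqs.values())


X = [0, 2, 3]

# All N! permutations of [N]: perfectly min-wise, but needs about N log N random bits.
N = 5
all_perms: List[Callable[[int], int]] = [perm.__getitem__ for perm in permutations(range(N))]
perm_freqs = min_frequencies(all_perms, X)

# A compact linear family over Z_p (a != 0, so distinct inputs never collide).
P = 7
linear = [(lambda x, a=a, b=b: (a * x + b) % P) for a in range(1, P) for b in range(P)]
linear_freqs = min_frequencies(linear, X)
linear_error = relative_error(linear_freqs)
```

Exact rational arithmetic via `Fraction` keeps the comparison with $1/|X|$ free of rounding noise, which matters when the errors of interest are very small.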
Ashutosh Tomar, Suvendu Rakshit, Amit Kumar Mandal, Shivangi Pandey
We present measurements of the dusty torus sizes of 51 active galactic nuclei
(AGNs) with a redshift of $z<$ 0.8. Our analysis utilizes about 16 years of
optical photometric data of 146 AGNs from various time-domain surveys,
including ASAS-SN, CRTS, and ZTF, along with 14 years of infrared data in the
$W$1 ($\sim$ 3.4 $\mu$m) and $W$2 ($\sim$ 4.6 $\mu$m) bands obtained from the
Wide-Field Infrared Survey Explorer (WISE). The estimated dust torus size
ranges from 1000 to 3000 days, using both the cross-correlation analysis and
lightcurve modeling through `MICA'. The measured lag has been corrected by
$(1+z)^{-0.37}$, to account for cosmological time dilation and the torus
temperature-gradient scaling. We conduct a linear regression analysis for both
the $W$1 and $W$2 bands to examine the radius--luminosity ($R$--$L_{BOL}$)
relationship under two conditions: one where the slope is fixed at 0.5 and one
where it is allowed to vary. For the fixed slope of 0.5, we find the ratio of
R$_{\mathrm{BLR}}$: R$_{W1}$: R$_{W2}$ to be 1: 9: 12, indicating that the
torus lies outside the BLR and that its size increases with wavelength.
Furthermore, we determine the relationship between torus size and L$_{BOL}$,
yielding best-fit slopes of $0.413\pm0.047$ for the $W$1 band and
$0.397\pm0.058$ for the $W$2 band. Both slopes are shallower than predicted by
the dust radiation equilibrium model. Furthermore, our findings indicate that
the torus size systematically decreases as the Eddington ratio increases, a
trend that can be explained by the self-shadowing effects of slim disks.
Authors' comments: 16 pages, 11 figures, Accepted for publication in ApJ
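The lag correction and the two regression setups described above can be sketched in a few lines. The sketch assumes the correction is applied multiplicatively to the observed lag, and it fits $\log R = \alpha + \beta \log L_{BOL}$ by ordinary least squares, once with a free slope and once with the slope fixed at 0.5. The data points are synthetic and the function names are hypothetical. The paper's actual fitting procedure, including its treatment of uncertainties, is not described in the abstract.

```python
from typing import List, Sequence, Tuple


def rest_frame_lag(observed_lag_days: float, z: float, index: float = -0.37) -> float:
    """Apply the (1 + z)^index correction to an observed lag."""
    return observed_lag_days * (1.0 + z) ** index


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_intercept_fixed_slope(x: Sequence[float], y: Sequence[float],
                              slope: float = 0.5) -> float:
    """Least-squares intercept when the slope is held fixed."""
    return sum(yi - slope * xi for xi, yi in zip(x, y)) / len(x)


# Synthetic example: log luminosity and log lag generated with slope 0.4.
log_l: List[float] = [44.0, 44.5, 45.0, 45.5, 46.0]
log_r: List[float] = [2.0 + 0.4 * (l - 44.0) for l in log_l]

free_intercept, free_slope = fit_line(log_l, log_r)
fixed_intercept = fit_intercept_fixed_slope(log_l, log_r, slope=0.5)
corrected = rest_frame_lag(2000.0, z=0.5)
```

Fixing the slope changes only the intercept, which is what makes size ratios between bands easy to compare at a common slope.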
Yejin Kim, Youngbin Lee, Juhyeong Kim, Yongjae Lee
This study demonstrates that GuruAgents, prompt-guided AI agents, can
systematically operationalize the strategies of legendary investment gurus. We
develop five distinct GuruAgents, each designed to emulate an iconic investor,
by encoding their distinct philosophies into LLM prompts that integrate
financial tools and a deterministic reasoning pipeline. In a backtest on
NASDAQ-100 constituents from Q4 2023 to Q2 2025, the GuruAgents exhibit unique
behaviors driven by their prompted personas. The Buffett GuruAgent achieves the
highest performance, delivering a 42.2% CAGR that significantly outperforms
benchmarks, while other agents show varied results. These findings confirm that
prompt engineering can successfully translate the qualitative philosophies of
investment gurus into reproducible, quantitative strategies, highlighting a
novel direction for automated systematic investing. The source code and data
are available at https://github.com/yejining99/GuruAgents.
Authors' comments: 7 Pages, 2 figures
Zhexiong Liu, Diane Litman
Large Language Models (LLMs) have shown extraordinary success across various
text generation tasks; however, their potential for simple yet essential text
classification remains underexplored, as LLM pre-training tends to emphasize
generation over classification. While LLMs with instruction tuning can
transform classification into a generation task, they often struggle to
categorize nuanced texts. One such example is text revision, which involves
nuanced edits between pairs of texts. Although simply fine-tuning LLMs for
revision classification seems plausible, it requires a large amount of revision
annotations, which are exceptionally expensive and scarce in the community. To
address this issue, we introduce a plug-and-play layer-wise parameter-efficient
fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of
important LLM layers that are dynamically selected based on their gradient norm
distribution, while freezing those of redundant layers. Extensive experiments
suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse
text revisions, while achieving fast convergence, low GPU memory consumption,
and effectiveness on small revision corpora.
Authors' comments: In The Conference on Empirical Methods in Natural Language Processing (EMNLP), November 2025
Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang
Large language models (LLMs) have rapidly scaled in size, bringing severe
memory and computational challenges that hinder their deployment. Singular
Value Decomposition (SVD)-based compression has emerged as an appealing
post-training compression technique for LLMs, yet most existing methods apply a
uniform compression ratio across all layers, implicitly assuming that
information is distributed homogeneously across layers. This overlooks the substantial
intra-layer heterogeneity observed in LLMs, where middle layers tend to encode
richer information while early and late layers are more redundant. In this
work, we revisit the existing SVD-based compression method and propose D-Rank,
a framework with layer-wise balanced Dynamic Rank allocation for LLM
compression. We first introduce effective rank as a principled metric to
measure the information density of weight matrices, and then allocate ranks via
a Lagrange multiplier-based optimization scheme to adaptively assign more
capacity to groups with higher information density under a fixed compression
ratio. Moreover, we rebalance the allocated ranks across attention layers to
account for their varying importance and extend D-Rank to the latest LLMs with
grouped-query attention. Extensive experiments on various LLMs with different
scales across multiple compression ratios demonstrate that D-Rank consistently
outperforms SVD-LLM, ASVD, and Basis Sharing, achieving a perplexity more than
15 lower with the LLaMA-3-8B model on the C4 dataset at a 20% compression ratio
and up to 5% higher zero-shot reasoning accuracy with the LLaMA-7B model at a
40% compression ratio, while achieving even higher throughput.
Authors' comments: 10 pages, 5 figures
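For readers unfamiliar with effective rank, the sketch below uses the common entropy-based definition: the exponential of the Shannon entropy of the normalized singular values. Whether D-Rank uses exactly this definition is an assumption, and the example spectra are invented.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(H(p)) with p_i = sigma_i / sum(sigma); returns 0.0 for an all-zero spectrum."""
    positive = [s for s in singular_values if s > 0]
    if not positive:
        return 0.0
    total = sum(positive)
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)


# Hypothetical singular-value spectra of two weight matrices.
flat_spectrum = [1.0, 1.0, 1.0, 1.0]      # information spread evenly
decaying_spectrum = [8.0, 4.0, 2.0, 1.0]  # information concentrated in a few directions

flat_rank = effective_rank(flat_spectrum)
decaying_rank = effective_rank(decaying_spectrum)
```

A matrix with a flat spectrum scores close to its full rank, while a rapidly decaying spectrum scores lower. This is the sense in which the metric can indicate which weight groups deserve more of a fixed rank budget.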
Cheng Jin, Qitan Shi, Yuantao Gu
Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity
in diffusion models, but its impact on sampling dynamics remains poorly
understood. Prior studies, often restricted to unimodal conditional
distributions or simplified cases, provide only a partial picture. We analyze
CFG under multimodal conditionals and show that the sampling process unfolds in
three successive stages. In the Direction Shift stage, guidance accelerates
movement toward the weighted mean, introducing initialization bias and norm
growth. In the Mode Separation stage, local dynamics remain largely neutral,
but the inherited bias suppresses weaker modes, reducing global diversity. In
the Concentration stage, guidance amplifies within-mode contraction,
diminishing fine-grained variability. This unified view explains a widely
observed phenomenon: stronger guidance improves semantic alignment but
inevitably reduces diversity. Experiments support these predictions, showing
that early strong guidance erodes global diversity, while late strong guidance
suppresses fine-grained variation. Moreover, our theory naturally suggests a
time-varying guidance schedule, and empirical results confirm that it
consistently improves both quality and diversity.
Authors' comments: 24 pages, 10 figures
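For context, the standard CFG update combines the unconditional and conditional predictions as `uncond + w * (cond - uncond)`. The sketch below pairs that formula with one possible time-varying weight that is weak at both ends of sampling and strongest in the middle. That shape is only one reading of the abstract's observation that strong guidance hurts both early and late. The triangular schedule, its parameters, and the function names are assumptions, not the authors' schedule.

```python
from typing import List, Sequence


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], w: float) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction towards
    (and, for w > 1, beyond) the conditional one."""
    return [u + w * (c - u) for u, c in zip(uncond, cond)]


def triangular_schedule(t: float, w_min: float = 1.0, w_max: float = 7.5) -> float:
    """Guidance weight for sampling progress t in [0, 1]; peaks at t = 0.5."""
    return w_min + (w_max - w_min) * (1.0 - abs(2.0 * t - 1.0))


# Hypothetical two-dimensional predictions at a single sampling step.
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]

progress = 0.25  # a quarter of the way through sampling
guided = cfg_combine(uncond_pred, cond_pred, triangular_schedule(progress))
```

With `w = 1` the update reduces to the conditional prediction, and larger weights extrapolate past it, which is where the trade-off between alignment and diversity arises.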
Soham Bonnerjee, Sayar Karmakar, Subhrajyoty Roy
With the increasing popularity of large language models, concerns over content authenticity have led to the development of myriad watermarking schemes. These schemes can be used to detect machine-generated text via an appropriate key, while being imperceptible to readers without such keys. The corresponding detection mechanisms usually take the form of statistical hypothesis testing for the existence of watermarks, spurring extensive research in this direction. However, the finer-grained problem of identifying which segments of a mixed-source text are actually watermarked is much less explored; the existing approaches lack either scalability or theoretical guarantees that are robust to paraphrasing and post-editing. In this work, we introduce a unique perspective on such watermark segmentation problems through the lens of epidemic change-points. By highlighting the similarities as well as the differences between these two problems, we motivate and propose WISER: a novel, computationally efficient watermark segmentation algorithm. We theoretically validate our algorithm by deriving finite-sample error bounds and establishing its consistency in detecting multiple watermarked segments in a single text. Complementing these theoretical results, our extensive numerical experiments show that WISER outperforms state-of-the-art baseline methods, in terms of both computational speed and accuracy, on various benchmark datasets embedded with diverse watermarking schemes. Our theoretical and empirical findings establish WISER as an effective tool for watermark localization in most settings. They also show how insights from a classical statistical problem can lead to a theoretically valid and computationally efficient solution to a modern and pertinent problem.
Chunxu Zhang, Weipeng Zhang, Guodong Long, Zhiheng Xue, Riting Xia, Bo Yang
Federated Recommendation (FR) is a new learning paradigm for tackling the learning-to-rank problem in a privacy-preserving manner. How to integrate multi-modality features into federated recommendation is still an open challenge in terms of efficiency, distribution heterogeneity, and fine-grained alignment. To address these challenges, we propose a novel multimodal fusion mechanism in federated recommendation settings (GFMFR). Specifically, it offloads multimodal representation learning to the server, which stores item content and employs a high-capacity encoder to generate expressive representations, alleviating client-side overhead. Moreover, a group-aware item representation fusion approach enables fine-grained knowledge sharing among similar users while retaining individual preferences. The proposed fusion loss can be plugged into any existing federated recommender system, extending its capability with multi-modality features. Extensive experiments on five public benchmark datasets demonstrate that GFMFR consistently outperforms state-of-the-art multimodal FR baselines.
Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou et al.
World models allow agents to simulate the consequences of actions in imagined
environments for planning, control, and long-horizon decision-making. However,
existing autoregressive world models struggle with visually coherent
predictions due to disrupted spatial structure, inefficient decoding, and
inadequate motion modeling. In response, we propose Scale-wise
Autoregression with Motion PrOmpt
(SAMPO), a hybrid framework that combines visual autoregressive
modeling for intra-frame generation with causal modeling for next-frame
generation. Specifically, SAMPO integrates temporal causal decoding with
bidirectional spatial attention, which preserves spatial locality and supports
parallel decoding within each scale. This design significantly enhances both
temporal consistency and rollout efficiency. To further improve dynamic scene
understanding, we devise an asymmetric multi-scale tokenizer that preserves
spatial details in observed frames and extracts compact dynamic representations
for future frames, optimizing both memory usage and model performance.
Additionally, we introduce a trajectory-aware motion prompt module that injects
spatiotemporal cues about object and robot trajectories, focusing attention on
dynamic regions and improving temporal consistency and physical realism.
Extensive experiments show that SAMPO achieves competitive performance in
action-conditioned video prediction and model-based control, improving
generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's
zero-shot generalization and scaling behavior, demonstrating its ability to
generalize to unseen tasks and benefit from larger model sizes.
Authors' comments: 22 pages, 15 figures