Mikayla J. Wilson, Mary Anne Limbach, Andrew J. Skemer, Johanna M. Vos, Brittany E. Miles, Melanie J. Rowland, Andrew Vanderburg, Adam C. Schneider et al.
JWST is collecting time-series observations of many free-floating planets
(FFPs) to study their weather, but these light curves are the ideal datasets to
search for exomoons that transit the FFP during observations. In this paper, we
present observations of the planetary-mass Y dwarf ($T=250-285K$, $M =
6.5\pm3.5 M_{Jup}$, d = 2.3$\,$pc) WISE J085510.83-071442.5 (WISE 0855), whose
proximity and brightness make it ideal for a transiting exomoon search. We
examine 11 hours of time-series spectra from the JWST Near-Infrared
Spectrograph (NIRSpec) whose sensitivity, in combination with Gaussian process
(GP) modeling, allows for the disentanglement of exomoon transits from WISE
0855's variability. We do not find statistically significant evidence of an
exomoon transit in this dataset. Using injection and recovery tests of
artificial transits for depths ranging between 0.1-1% (0.35-1.12 $R_{\oplus}$)
we explore the exomoon parameter space where we could successfully detect
transits. For transit depths $\geq 0.5\%$ (1.96$\,R_{\text{Titan}}$), our
detection rate is 96%, which, for WISE 0855, corresponds to a moon with a
companion-to-host mass ratio similar to that of Titan and Saturn. Given our
sensitivity, transit probabilities, and our observational duration, we
determine a $\sim$91% probability of detecting a Titan mass analog exomoon
after 18 such observations if every observed system hosts a Titan mass analog
exomoon in a Galilean-like system. This suggests that JWST observations of
dozens of FFPs could yield meaningful constraints on the occurrence rate of
exomoons. This paper is the first demonstration that JWST is sensitive to
Galilean moon mass analogs around FFPs.
Authors' comments: 19 pages, 18 figures, accepted for publication in AJ
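The depth-to-radius figures quoted above follow from the usual relation depth ≈ (R_moon / R_host)^2. The sketch below reproduces them under the assumption, not stated in the abstract, that WISE 0855 has a radius of about 1 R_Jup. The constants and helper names are illustrative. It also inverts the "91% after 18 observations" figure into an implied per-observation detection probability, assuming independent and identical observations.

```python
import math

# Assumed radii in km; these values are not taken from the paper.
R_JUP_KM = 71492.0
R_EARTH_KM = 6378.1
R_TITAN_KM = 2574.7


def moon_radius_km(depth, host_radius_km=R_JUP_KM):
    """Radius of a fully transiting moon for a fractional transit depth."""
    return math.sqrt(depth) * host_radius_km


def cumulative_detection(p_single, n_obs):
    """Chance of at least one detection in n_obs independent observations."""
    return 1.0 - (1.0 - p_single) ** n_obs


def implied_single_probability(p_total, n_obs):
    """Invert cumulative_detection to get the per-observation probability."""
    return 1.0 - (1.0 - p_total) ** (1.0 / n_obs)


if __name__ == "__main__":
    for depth in (0.001, 0.005, 0.01):
        r = moon_radius_km(depth)
        print(f"depth {depth:.1%}: {r / R_EARTH_KM:.2f} R_Earth, "
              f"{r / R_TITAN_KM:.2f} R_Titan")
    p = implied_single_probability(0.91, 18)
    print(f"implied per-observation probability: {p:.3f}")
```

With a 1 R_Jup host, depths of 0.1% and 1% map to roughly 0.35 and 1.12 Earth radii, consistent with the range quoted in the abstract.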
Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He
Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
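The token-to-subspace assignment described above can be sketched with entropic optimal transport (Sinkhorn iterations) followed by top-k masking. This is a minimal illustration and not the TokenCLIP implementation: the uniform marginals, the temperature `eps`, the iteration count, and the row renormalisation after masking are all assumptions.

```python
import math


def sinkhorn_plan(sim, eps=0.1, n_iter=200):
    """Entropic OT plan between n tokens (rows) and m subspaces (columns).

    sim[i][j] is the similarity of token i to subspace j. Both marginals are
    assumed uniform: each token carries mass 1/n, each subspace receives 1/m.
    """
    n, m = len(sim), len(sim[0])
    kernel = [[math.exp(s / eps) for s in row] for row in sim]
    a, b = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def top_k_mask(plan, k):
    """Keep the k largest entries of each row and renormalise the row to sum to 1."""
    out = []
    for row in plan:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        total = sum(row[j] for j in keep)
        out.append([row[j] / total if j in keep else 0.0
                    for j in range(len(row))])
    return out


if __name__ == "__main__":
    # Toy similarities: 4 visual tokens, 3 textual subspaces
    sim = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.1, 0.7, 0.2], [0.2, 0.1, 0.9]]
    for row in top_k_mask(sinkhorn_plan(sim), k=2):
        print([round(x, 3) for x in row])
```

The column constraint is what forces every subspace to receive some tokens, which corresponds to the abstract's point that the transport constraints keep all subspaces sufficiently optimized.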
Ejafa Bassam, Dawei Zhu, Kaigui Bian
Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, it has become the de facto approach to augment cross-entropy with a distillation term. Typically, this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In every case, it acts as an additional term to cross-entropy. This term has its own weight, which must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce "Plackett-Luce Distillation (PLD)", a weighted list-wise ranking loss. In PLD, the teacher model transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single "teacher-optimal" ranking. The true label is placed first, followed by the remaining classes in descending teacher confidence. This process yields a convex and translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, across CIFAR-100, ImageNet-1K, and MS-COCO, PLD achieves consistent gains across diverse architectures and distillation objectives, including divergence-based, correlation-based, and feature-based methods, in both homogeneous and heterogeneous teacher-student pairs.
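A weighted Plackett-Luce loss over the "teacher-optimal" ranking can be sketched as follows. The ranking rule (true label first, then descending teacher logits) comes from the abstract. Using the teacher's softmax probability as the per-position weight is an assumption about what "its own confidence" means, and the function names are illustrative.

```python
import math


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax(values):
    lse = log_sum_exp(values)
    return [math.exp(v - lse) for v in values]


def teacher_optimal_ranking(teacher_logits, label):
    """True label first, then the other classes by descending teacher logit."""
    rest = [c for c in range(len(teacher_logits)) if c != label]
    rest.sort(key=lambda c: teacher_logits[c], reverse=True)
    return [label] + rest


def pl_distillation_loss(student_logits, teacher_logits, label):
    """Weighted Plackett-Luce negative log-likelihood of the ranking."""
    ranking = teacher_optimal_ranking(teacher_logits, label)
    weights = softmax(teacher_logits)  # assumed weighting: teacher probability
    loss = 0.0
    for k, cls in enumerate(ranking):
        remaining = [student_logits[c] for c in ranking[k:]]
        # -log P(cls is picked first among the classes not yet ranked)
        loss += weights[cls] * (log_sum_exp(remaining) - student_logits[cls])
    return loss


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teacher = [1.0, 3.0, 0.2]
    print(pl_distillation_loss(student, teacher, label=0))
```

The first term of the sum is a weighted cross-entropy on the true label, and each term depends only on differences of student logits, which is why the loss is unchanged when a constant is added to all of them.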
Xue Chen, Shengtang Huang, Xin Li
We study explicit constructions of min-wise hash families and their extension to $k$-min-wise hash families. Informally, a min-wise hash family guarantees that for any fixed subset $X\subseteq[N]$, every element in $X$ has an equal chance to have the smallest value among all elements in $X$; a $k$-min-wise hash family guarantees this for every subset of size $k$ in $X$. Min-wise hash is widely used in many areas of computer science such as sketching, web page detection, and $\ell_0$ sampling. The classical works by Indyk and by P\u{a}tra\c{s}cu and Thorup have shown that $\Theta(\log(1/\delta))$-wise independent families give min-wise hash of multiplicative (relative) error $\delta$, resulting in a construction with $\Theta(\log(1/\delta)\log N)$ random bits. Based on a reduction from pseudorandom generators for combinatorial rectangles by Saks, Srinivasan, Zhou and Zuckerman, Gopalan and Yehudayoff improved the number of bits to $O(\log N\log\log N)$ for polynomially small errors $\delta$. However, no construction with $O(\log N)$ bits (polynomial size family) and sub-constant error was known before. In this work, we continue and extend the study of constructing ($k$-)min-wise hash families from pseudorandomness for combinatorial rectangles and read-once branching programs. Our main result gives the first explicit min-wise hash families that use an optimal (up to a constant factor) number of random bits and achieve a sub-constant (in fact, almost polynomially small) error: specifically, an explicit family of $k$-min-wise hash with $O(k\log N)$ bits and $2^{-O(\log N/\log\log N)}$ error. This improves all previous results for any $k=\log^{O(1)}N$ under $O(k \log N)$ bits. Our main techniques involve several new ideas to adapt the classical Nisan-Zuckerman pseudorandom generator to fool min-wise hashing with a multiplicative error.
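To make the min-wise property and the relative error $\delta$ concrete, the sketch below computes, by exhaustive enumeration, the probability that each element of a subset receives the minimum hash value. It compares the family of all permutations (exactly min-wise, but expensive to sample) with a small family of linear permutations that needs only about $2\log p$ random bits and may deviate from $1/|X|$. Neither family is the construction from the paper, and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def min_probabilities(hash_family, subset):
    """For each x in subset, the fraction of hash functions that make x the minimum."""
    counts = {x: 0 for x in subset}
    for h in hash_family:
        counts[min(subset, key=h)] += 1
    return {x: Fraction(c, len(hash_family)) for x, c in counts.items()}


def relative_error(probs):
    """Largest multiplicative deviation from the ideal value 1/|X|."""
    ideal = Fraction(1, len(probs))
    return max(abs(p - ideal) / ideal for p in probs.values())


def all_permutations(n):
    """Every permutation of range(n), each exposed as a hash function."""
    return [perm.__getitem__ for perm in permutations(range(n))]


def linear_family(p):
    """x -> (a*x + b) mod p with a != 0, for a prime p (a permutation of Z_p)."""
    return [lambda x, a=a, b=b: (a * x + b) % p
            for a in range(1, p) for b in range(p)]


if __name__ == "__main__":
    exact = min_probabilities(all_permutations(5), (0, 2, 3))
    print("all permutations:", relative_error(exact))
    approx = min_probabilities(linear_family(7), (1, 2, 3))
    print("linear family:   ", relative_error(approx))
```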
Ashutosh Tomar, Suvendu Rakshit, Amit Kumar Mandal, Shivangi Pandey
We present measurements of the dusty torus sizes of 51 active galactic nuclei
(AGNs) with a redshift of $z<$ 0.8. Our analysis utilizes about 16 years of
optical photometric data of 146 AGNs from various time-domain surveys,
including ASAS-SN, CRTS, and ZTF, along with 14 years of infrared data in the
$W$1 ($\sim$ 3.4 $\mu$m) and $W$2 ($\sim$ 4.6 $\mu$m) bands obtained from the
Wide-Field Infrared Survey Explorer (WISE). The estimated dust torus size
ranges from 1000 to 3000 days, using both the cross-correlation analysis and
lightcurve modeling through `MICA'. The measured lag has been corrected by
$(1+z)^{-0.37}$, to account for cosmological time dilation and the torus
temperature-gradient scaling. We conduct a linear regression analysis for both
the $W$1 and $W$2 bands to examine the radius--luminosity ($R$--$L_{BOL}$)
relationship under two conditions: one where the slope is fixed at 0.5 and one
where it is allowed to vary. For the fixed slope of 0.5, we find the ratio of
R$_{\mathrm{BLR}}$: R$_{W1}$: R$_{W2}$ to be 1: 9: 12, indicating that the
torus lies outside the BLR and that its size increases with wavelength.
Furthermore, we determine the relationship between torus size and L$_{BOL}$,
yielding best-fit slopes of $0.413\pm0.047$ for the $W$1 band and
$0.397\pm0.058$ for the $W$2 band. Both slopes are shallower than predicted by
the dust radiation equilibrium model. Furthermore, our findings indicate that
the torus size systematically decreases as the Eddington ratio increases, a
trend that can be explained by the self-shadowing effects of slim disks.
Authors' comments: 16 pages, 11 figures, Accepted for publication in ApJ
Yejin Kim, Youngbin Lee, Juhyeong Kim, Yongjae Lee
This study demonstrates that GuruAgents, prompt-guided AI agents, can
systematically operationalize the strategies of legendary investment gurus. We
develop five distinct GuruAgents, each designed to emulate an iconic investor,
by encoding their distinct philosophies into LLM prompts that integrate
financial tools and a deterministic reasoning pipeline. In a backtest on
NASDAQ-100 constituents from Q4 2023 to Q2 2025, the GuruAgents exhibit unique
behaviors driven by their prompted personas. The Buffett GuruAgent achieves the
highest performance, delivering a 42.2\% CAGR that significantly outperforms
benchmarks, while other agents show varied results. These findings confirm that
prompt engineering can successfully translate the qualitative philosophies of
investment gurus into reproducible, quantitative strategies, highlighting a
novel direction for automated systematic investing. The source code and data
are available at https://github.com/yejining99/GuruAgents.
Authors' comments: 7 Pages, 2 figures
Zhexiong Liu, Diane Litman
Large Language Models (LLMs) have shown extraordinary success across various
text generation tasks; however, their potential for simple yet essential text
classification remains underexplored, as LLM pre-training tends to emphasize
generation over classification. While LLMs with instruction tuning can
transform classification into a generation task, they often struggle to
categorize nuanced texts. One such example is text revision, which involves
nuanced edits between pairs of texts. Although simply fine-tuning LLMs for
revision classification seems plausible, it requires a large amount of revision
annotations, which are exceptionally expensive and scarce in the community. To
address this issue, we introduce a plug-and-play layer-wise parameter-efficient
fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of
important LLM layers that are dynamically selected based on their gradient norm
distribution, while freezing those of redundant layers. Extensive experiments
suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse
text revisions, while achieving fast convergence, low GPU memory consumption,
and effectiveness on small revision corpora.
Authors' comments: In The Conference on Empirical Methods in Natural Language Processing
(EMNLP), November 2025
Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang
Large language models (LLMs) have rapidly scaled in size, bringing severe
memory and computational challenges that hinder their deployment. Singular
Value Decomposition (SVD)-based compression has emerged as an appealing
post-training compression technique for LLMs, yet most existing methods apply a
uniform compression ratio across all layers, implicitly assuming that
information is distributed homogeneously across layers. This overlooks the
substantial inter-layer heterogeneity observed in LLMs, where middle layers tend
to encode richer information while early and late layers are more redundant. In
this work, we revisit the existing SVD-based compression method and propose
D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLM
compression. We first introduce effective rank as a principled metric to
measure the information density of weight matrices, and then allocate ranks via
a Lagrange multiplier-based optimization scheme to adaptively assign more
capacity to groups with higher information density under a fixed compression
ratio. Moreover, we rebalance the allocated ranks across attention layers to
account for their varying importance and extend D-Rank to the latest LLMs with
grouped-query attention. Extensive experiments on LLMs of different scales
across multiple compression ratios demonstrate that D-Rank consistently
outperforms SVD-LLM, ASVD, and Basis Sharing, lowering perplexity by more than
15 with the LLaMA-3-8B model on the C4 dataset at a 20% compression ratio and
raising zero-shot reasoning accuracy by up to 5% with the LLaMA-7B model at a
40% compression ratio, while also achieving higher throughput.
Authors' comments: 10 pages, 5 figures
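As a rough illustration of the effective-rank idea above, the sketch below assumes the standard entropy-based definition (the exponential of the Shannon entropy of the normalised singular values) and splits a fixed rank budget across groups in proportion to it. The proportional split is a simplified stand-in for the Lagrange-multiplier scheme named in the abstract, and the function names are hypothetical.

```python
import math


def effective_rank(singular_values):
    """exp of the Shannon entropy of the normalised singular-value distribution."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def allocate_ranks(spectra, total_rank):
    """Split a rank budget across groups in proportion to effective rank.

    Assumes total_rank is at least the number of groups; does not cap a
    group's rank at the size of its matrix.
    """
    scores = [effective_rank(s) for s in spectra]
    scale = total_rank / sum(scores)
    ranks = [max(1, int(score * scale)) for score in scores]
    # Hand any rounding remainder to the groups with the highest scores
    order = sorted(range(len(ranks)), key=lambda i: scores[i], reverse=True)
    i = 0
    while sum(ranks) < total_rank:
        ranks[order[i % len(order)]] += 1
        i += 1
    return ranks


if __name__ == "__main__":
    flat_spectrum = [1.0, 1.0, 1.0, 1.0]    # information spread evenly
    peaked_spectrum = [3.0, 0.1, 0.1, 0.1]  # dominated by one direction
    print(allocate_ranks([flat_spectrum, peaked_spectrum], total_rank=4))
```

A flat spectrum has an effective rank equal to its dimension, while a spectrum dominated by one singular value has an effective rank close to 1, so the flat group receives more of the budget.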
Cheng Jin, Qitan Shi, Yuantao Gu
Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity
in diffusion models, but its impact on sampling dynamics remains poorly
understood. Prior studies, often restricted to unimodal conditional
distributions or simplified cases, provide only a partial picture. We analyze
CFG under multimodal conditionals and show that the sampling process unfolds in
three successive stages. In the Direction Shift stage, guidance accelerates
movement toward the weighted mean, introducing initialization bias and norm
growth. In the Mode Separation stage, local dynamics remain largely neutral,
but the inherited bias suppresses weaker modes, reducing global diversity. In
the Concentration stage, guidance amplifies within-mode contraction,
diminishing fine-grained variability. This unified view explains a widely
observed phenomenon: stronger guidance improves semantic alignment but
inevitably reduces diversity. Experiments support these predictions, showing
that early strong guidance erodes global diversity, while late strong guidance
suppresses fine-grained variation. Moreover, our theory naturally suggests a
time-varying guidance schedule, and empirical results confirm that it
consistently improves both quality and diversity.
Authors' comments: 24 pages, 10 figures
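For reference, the standard CFG update and one possible time-varying weight can be sketched as below. The guidance formula is the usual extrapolation from the unconditional prediction. The schedule shape (weak at the start and end of sampling, strongest in between) is a hypothetical choice motivated by the observations above and is not claimed to be the schedule used in the paper. The parameters `w_max` and `w_min` are illustrative.

```python
def guided_prediction(eps_uncond, eps_cond, weight):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guidance_weight(t, w_max=7.5, w_min=1.0):
    """Hypothetical schedule on t in [0, 1] (0 = start of sampling, 1 = end).

    Weak early (to limit loss of global diversity), strongest mid-way, and
    weak again late (to limit loss of fine-grained variation).
    """
    return w_min + (w_max - w_min) * 4.0 * t * (1.0 - t)


if __name__ == "__main__":
    eps_u, eps_c = [0.0, 1.0], [1.0, 3.0]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        w = guidance_weight(t)
        print(t, w, guided_prediction(eps_u, eps_c, w))
```

A weight of 1 recovers the conditional prediction and a weight of 0 the unconditional one, so larger weights push the sample further along the conditional direction.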
Soham Bonnerjee, Sayar Karmakar, Subhrajyoty Roy
With the increasing popularity of large language models, concerns over content authenticity have led to the development of myriad watermarking schemes. These schemes can be used to detect machine-generated text via an appropriate key, while being imperceptible to readers without such a key. The corresponding detection mechanisms usually take the form of statistical hypothesis tests for the existence of watermarks, spurring extensive research in this direction. However, the finer-grained problem of identifying which segments of a mixed-source text are actually watermarked is much less explored; existing approaches lack either scalability or theoretical guarantees that are robust to paraphrasing and post-editing. In this work, we introduce a unique perspective on such watermark segmentation problems through the lens of epidemic change-points. By highlighting the similarities and differences between these two problems, we motivate and propose WISER: a novel, computationally efficient watermark segmentation algorithm. We theoretically validate our algorithm by deriving finite-sample error bounds and establishing its consistency in detecting multiple watermarked segments in a single text. Complementing these theoretical results, our extensive numerical experiments show that WISER outperforms state-of-the-art baseline methods in both computational speed and accuracy on various benchmark datasets embedded with diverse watermarking schemes. Our theoretical and empirical findings establish WISER as an effective tool for watermark localization in most settings. They also show how insights from a classical statistical problem can lead to a theoretically valid and computationally efficient solution to a modern and pertinent problem.
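The epidemic change-point view can be illustrated with a textbook scan for a single segment whose mean differs from the rest of the sequence. This is a generic brute-force sketch and not the WISER algorithm. The input `x` stands for hypothetical per-token detection scores, and the normalised CUSUM statistic is one common choice.

```python
import math


def epidemic_segment(x):
    """Return (start, end, stat) for the half-open segment [start, end) that
    maximises a normalised epidemic CUSUM statistic. Brute force, O(n^2)."""
    n = len(x)
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)
    mean = prefix[-1] / n
    best = (0, 0, 0.0)
    for start in range(n):
        for end in range(start + 1, n + 1):
            length = end - start
            if length == n:
                continue  # the whole sequence carries no change-point information
            # Segment sum minus what the global mean would predict
            excess = (prefix[end] - prefix[start]) - length * mean
            stat = abs(excess) / math.sqrt(length * (1.0 - length / n))
            if stat > best[2]:
                best = (start, end, stat)
    return best


if __name__ == "__main__":
    # Noise-free toy scores: tokens 30..49 are "watermarked"
    scores = [0.0] * 30 + [1.0] * 20 + [0.0] * 50
    print(epidemic_segment(scores))
```

A quadratic scan of this kind finds only one segment and scales poorly with text length, which is the sort of limitation that motivates a more efficient, multi-segment method.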
Chunxu Zhang, Weipeng Zhang, Guodong Long, Zhiheng Xue, Riting Xia, Bo Yang
Federated Recommendation (FR) is a new learning paradigm that tackles the learn-to-rank problem in a privacy-preserving manner. How to integrate multi-modality features into federated recommendation is still an open challenge in terms of efficiency, distribution heterogeneity, and fine-grained alignment. To address these challenges, we propose a novel multimodal fusion mechanism in federated recommendation settings (GFMFR). Specifically, it offloads multimodal representation learning to the server, which stores item content and employs a high-capacity encoder to generate expressive representations, alleviating client-side overhead. Moreover, a group-aware item representation fusion approach enables fine-grained knowledge sharing among similar users while retaining individual preferences. The proposed fusion loss can be plugged into any existing federated recommender system, extending it with multi-modality features. Extensive experiments on five public benchmark datasets demonstrate that GFMFR consistently outperforms state-of-the-art multimodal FR baselines.
Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou et al.
World models allow agents to simulate the consequences of actions in imagined
environments for planning, control, and long-horizon decision-making. However,
existing autoregressive world models struggle with visually coherent
predictions due to disrupted spatial structure, inefficient decoding, and
inadequate motion modeling. In response, we propose \textbf{S}cale-wise
\textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt
(\textbf{SAMPO}), a hybrid framework that combines visual autoregressive
modeling for intra-frame generation with causal modeling for next-frame
generation. Specifically, SAMPO integrates temporal causal decoding with
bidirectional spatial attention, which preserves spatial locality and supports
parallel decoding within each scale. This design significantly enhances both
temporal consistency and rollout efficiency. To further improve dynamic scene
understanding, we devise an asymmetric multi-scale tokenizer that preserves
spatial details in observed frames and extracts compact dynamic representations
for future frames, optimizing both memory usage and model performance.
Additionally, we introduce a trajectory-aware motion prompt module that injects
spatiotemporal cues about object and robot trajectories, focusing attention on
dynamic regions and improving temporal consistency and physical realism.
Extensive experiments show that SAMPO achieves competitive performance in
action-conditioned video prediction and model-based control, improving
generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's
zero-shot generalization and scaling behavior, demonstrating its ability to
generalize to unseen tasks and benefit from larger model sizes.
Authors' comments: 22 pages, 15 figures
Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
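The threshold-triggered reuse described above can be sketched as a small cache wrapper. This is an assumption-laden illustration, not BWCache itself: it measures change on the block's input features as a proxy, uses a relative L1 difference, and the threshold value is hypothetical.

```python
def relative_change(prev, curr):
    """Summed absolute change between two feature vectors, relative to prev."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev) or 1.0
    return num / den


class BlockCache:
    """Reuse a block's cached output while its features change slowly."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold  # hypothetical value, to be tuned per model
        self.prev_input = None
        self.cached_output = None

    def run(self, block_fn, features):
        """Return (output, reused) for one diffusion timestep."""
        if (self.prev_input is not None
                and relative_change(self.prev_input, features) < self.threshold):
            return self.cached_output, True  # cache hit: skip the block
        # Cache miss: compute the block and remember its input and output
        self.prev_input = list(features)
        self.cached_output = block_fn(features)
        return self.cached_output, False


if __name__ == "__main__":
    cache = BlockCache(threshold=0.05)
    double = lambda feats: [2.0 * v for v in feats]  # stand-in for a DiT block
    for feats in ([1.0, 2.0], [1.01, 2.0], [2.0, 4.0]):
        print(cache.run(double, feats))
```

Because the reference input is only refreshed on a miss, small changes accumulate against the last computed timestep, so the cache eventually recomputes even when every individual step is small.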
Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui
Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu
KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava.
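Once cache entries have scores that are comparable across heads, dynamic head budgets fall out of a single layer-wide top-B selection. The sketch below illustrates only that selection step. The scores are placeholders, since the abstract does not spell out the metric, and `evict_layer` is a hypothetical helper name.

```python
import heapq


def evict_layer(scores, layer_budget):
    """Keep the layer_budget highest-scoring cache entries across all heads.

    scores[h][t] is a comparable importance score for token t in head h.
    Returns, per head, the sorted token indices that are kept.
    """
    flat = [(s, h, t) for h, row in enumerate(scores) for t, s in enumerate(row)]
    kept = heapq.nlargest(layer_budget, flat)
    per_head = [[] for _ in scores]
    for _, h, t in kept:
        per_head[h].append(t)
    return [sorted(tokens) for tokens in per_head]


if __name__ == "__main__":
    # Two heads, four cached tokens each; head 0 holds most of the important entries
    scores = [[0.9, 0.8, 0.7, 0.1],
              [0.2, 0.6, 0.05, 0.0]]
    kept = evict_layer(scores, layer_budget=4)
    print(kept, [len(tokens) for tokens in kept])
```

In this toy input the budget of four splits unevenly between the heads rather than two each, which is the behaviour a fixed per-head budget cannot express.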
Qihua Zhu, Mingshuo Liu, Yuefeng Han, Doudou Zhou
We propose a nonparametric test for serial independence that aggregates pairwise similarities of observations with lag-dependent weights. The resulting statistic is powerful against general forms of temporal dependence, including nonlinear and uncorrelated alternatives, and applies to ultra-high-dimensional and non-Euclidean data. We derive asymptotic normality under both permutation and population nulls, and establish consistency in classical large-sample and high-dimension-low-sample-size (HDLSS) regimes. The test therefore provides the first theoretical power guarantees for serial independence in the HDLSS setting. Simulations demonstrate accurate size and strong power against a wide range of alternatives, showing significant power improvement over existing methods under various high-dimensional time series models. An application to spatio-temporal data illustrates the method's utility for non-Euclidean observations.
Erica Cooper, Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai
While supervised quality predictors for synthesized speech have demonstrated
strong correlations with human ratings, their requirement for in-domain labeled
training data hinders their generalization to new domains. Unsupervised
approaches based on pretrained self-supervised learning (SSL) models and
automatic speech recognition (ASR) models are a promising alternative; however,
little is known about how these models encode information about speech quality.
Towards the goal of better understanding how different aspects of speech
quality are encoded in a multilingual setting, we present a layer-wise analysis
of multilingual pretrained speech models based on reference modeling. We find
that features extracted from early SSL layers show correlations with human
ratings of synthesized speech, and later layers of ASR models can predict
quality of non-neural systems as well as intelligibility. We also demonstrate
the importance of using well-matched reference data.
Authors' comments: Copyright 2025 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
Bo Wu, Zhiqi Ai, Jun Jiang, Congcong Zhu, Shugong Xu
Label ambiguity poses a significant challenge in age estimation tasks. Most existing methods address this issue by modeling correlations between adjacent age groups through label distribution learning. However, they often overlook the varying degrees of ambiguity present across different age stages. In this paper, we propose a Stage-wise Adaptive Label Distribution Learning (SA-LDL) algorithm, which leverages the observation -- revealed through our analysis of embedding similarities between an anchor and all other ages -- that label ambiguity exhibits clear stage-wise patterns. By jointly employing stage-wise adaptive variance modeling and a weighted loss function, SA-LDL effectively captures the complex and structured nature of label ambiguity, leading to more accurate and robust age estimation. Extensive experiments demonstrate that SA-LDL achieves competitive performance, with MAEs of 1.74 and 2.15 on the MORPH-II and FG-NET datasets, respectively.
Authors' comments: 14 pages, 3 figures
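Label distribution learning with a stage-dependent variance can be sketched as a discretised Gaussian whose width depends on the age stage. The stage boundaries and standard deviations below are purely illustrative assumptions, as are the function names. The abstract does not give the values or the adaptation rule used by SA-LDL.

```python
import math


def stage_sigma(age):
    """Hypothetical stage-wise standard deviations in years (illustrative only)."""
    if age < 18:
        return 1.0
    if age < 50:
        return 2.0
    return 3.0


def label_distribution(age, min_age=0, max_age=100):
    """Discretised Gaussian over integer ages, centred on the true age."""
    sigma = stage_sigma(age)
    weights = [math.exp(-0.5 * ((a - age) / sigma) ** 2)
               for a in range(min_age, max_age + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_age(distribution, min_age=0):
    """Point estimate: the mean of the predicted age distribution."""
    return sum((min_age + i) * p for i, p in enumerate(distribution))


if __name__ == "__main__":
    for age in (10, 30, 60):
        dist = label_distribution(age)
        print(age, round(max(dist), 3), round(expected_age(dist), 3))
```

A wider target distribution spreads probability over more neighbouring ages, which is how a larger variance encodes greater label ambiguity for a given stage.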
Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized modules increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques often rely on interference heuristics, importance weighting, or activation matching while treating each layer independently, thereby failing to account for the inter-layer dependencies inherent in deep networks. This simplification leads to distributional mismatches, especially in activation-based methods, when changes in early layers are not properly reflected in downstream ones. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural network training. To address it, we propose Chain of Merges (CoM), a layer-wise merging procedure that updates activation statistics in an auto-regressive fashion, explicitly accounting for cross-layer interactions. CoM produces a coherent merged model through a series of conditionally optimal updates, effectively mitigating degradation caused by covariate shift. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
Kaiqi Zhao
Neural network quantization aims to reduce the bit-widths of weights and activations, making it a critical technique for deploying deep neural networks on resource-constrained hardware. Most Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions by replacing their derivatives with that of the identity function. While effective, STE overlooks discretization errors between continuous and quantized values, which can lead to accuracy degradation -- especially at extremely low bit-widths. In this paper, we propose Progressive Element-wise Gradient Estimation (PEGE), a simple yet effective alternative to STE, which can be seamlessly integrated with any forward propagation method and improves quantized model accuracy. PEGE progressively replaces full-precision weights and activations with their quantized counterparts via a novel logarithmic curriculum-driven mixed-precision replacement strategy. It then formulates QAT as a co-optimization problem that simultaneously minimizes the task loss for prediction and the discretization error for quantization, providing a unified and generalizable framework. Extensive experiments on CIFAR-10 and ImageNet across various architectures (e.g., ResNet, VGG) demonstrate that PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.