Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.
Seyed Mohsen Hosseini
Class imbalance and the difficulty imbalance are the two types of data imbalance that affect the performance of neural networks in medical segmentation tasks. In class imbalance the loss is dominated by the majority classes and in difficulty imbalance the loss is dominated by easy to classify pixels. This leads to an ineffective training. Dice loss, which is based on a geometrical metric, is very effective in addressing the class imbalance compared to the cross entropy (CE) loss, which is adopted directly from classification tasks. To address the difficulty imbalance, the common approach is employing a re-weighted CE loss or a modified Dice loss to focus the training on difficult to classify areas. The existing modification methods are computationally costly and with limited success. In this study we propose a simple modification to the Dice loss with minimal computational cost. With a pixel level modulating term, we take advantage of the effectiveness of Dice loss in handling the class imbalance to also handle the difficulty imbalance. Results on three commonly used medical segmentation tasks show that the proposed Pixel-wise Modulated Dice loss (PM Dice loss) outperforms other methods, which are designed to tackle the difficulty imbalance problem.
Kangye Ji, Yuan Meng, Hanyun Cui, Ye Li, Shengjia Hua, Lei Chen, Zhi Wang
Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose Block-wise Adaptive Caching(BAC), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities vary non-uniformly across timesteps and locks. To operationalize this insight, we first propose the Adaptive Caching Scheduler, designed to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to signiffcant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with signiffcant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3x inference speedup for free.
YaChen Yan, Liubo Li, Ravi Choudhary
In modern recommender systems, CTR/CVR models are increasingly trained with ranking objectives to improve item ranking quality. While this shift aligns training more closely with serving goals, most existing methods rely on in-batch negative sampling, which predominantly surfaces easy negatives. This limits the model's ability to capture fine-grained user preferences and weakens overall ranking performance. To address this, we propose a Hierarchical Group-wise Ranking Framework with two key components. First, we apply residual vector quantization to user embeddings to generate hierarchical user codes that partition users into hierarchical, trie-structured clusters. Second, we apply listwise ranking losses to user-item pairs at each level of the hierarchy, where shallow levels group loosely similar users and deeper levels group highly similar users, reinforcing learning-to-rank signals through progressively harder negatives. Since users with similar preferences and content exposure tend to yield more informative negatives, applying ranking losses within these hierarchical user groups serves as an effective approximation of hard negative mining. Our approach improves ranking performance without requiring complex real-time context collection or retrieval infrastructure. Extensive experiments demonstrate that the proposed framework consistently enhances both model calibration and ranking accuracy, offering a scalable and practical solution for industrial recommender systems.
Fuhan Cai, Yong Guo, Jie Li, Wenbo Li, Xiangzhong Fang, Jian Chen
Recent advancements in text-to-image (T2I) generation have led to the
emergence of highly expressive models such as diffusion transformers (DiTs),
exemplified by FLUX. However, their massive parameter sizes lead to slow
inference, high memory usage, and poor deployability. Existing acceleration
methods (e.g., single-step distillation and attention pruning) often suffer
from significant performance degradation and incur substantial training costs.
To address these limitations, we propose FastFLUX, an architecture-level
pruning framework designed to enhance the inference efficiency of FLUX. At its
core is the Block-wise Replacement with Linear Layers (BRLL) method, which
replaces structurally complex residual branches in ResBlocks with lightweight
linear layers while preserving the original shortcut connections for stability.
Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning
strategy that leverages LoRA to supervise neighboring blocks, mitigating
performance drops caused by structural replacement. Experiments show that our
FastFLUX maintains high image quality under both qualitative and quantitative
evaluations, while significantly improving inference speed, even with 20\% of
the hierarchy pruned. Our code will be available soon.
Authors' comments: 14 pages
Weijie Shi, Han Zhu, Jiaming Ji, Mengze Li, Jipeng Zhang, Ruiyuan Zhang, Jia Zhu, Jiajie Xu et al.
Legal judgment prediction (LJP) aims to function as a judge by making final rulings based on case claims and facts, which plays a vital role in the judicial domain for supporting court decision-making and improving judicial efficiency. However, existing methods often struggle with logical errors when conducting complex legal reasoning. We propose LegalReasoner, which enhances LJP reliability through step-wise verification and correction of the reasoning process. Specifically, it first identifies dispute points to decompose complex cases, and then conducts step-wise reasoning while employing a process verifier to validate each step's logic from correctness, progressiveness, and potential perspectives. When errors are detected, expert-designed attribution and resolution strategies are applied for correction. To fine-tune LegalReasoner, we release the LegalHK dataset, containing 58,130 Hong Kong court cases with detailed annotations of dispute points, step-by-step reasoning chains, and process verification labels. Experiments demonstrate that LegalReasoner significantly improves concordance with court decisions from 72.37 to 80.27 on LLAMA-3.1-70B. The data is available at https://huggingface.co/datasets/weijiezz/LegalHK.
Yuling Wang, Zihui Chen, Pengfei Jiao, Xiao Wang
Heterogeneous Graph Neural Networks (HGNNs) are vulnerable, highlighting the
need for tailored attacks to assess their robustness and ensure security.
However, existing HGNN attacks often require complex retraining of parameters
to generate specific perturbations for new scenarios. Recently, foundation
models have opened new horizons for the generalization of graph neural networks
by capturing shared semantics across various graph distributions. This leads us
to ask:Can we design a foundation attack model for HGNNs that enables
generalizable perturbations across different HGNNs, and quickly adapts to new
heterogeneous graphs (HGs)? Empirical findings reveal that, despite significant
differences in model design and parameter space, different HGNNs surprisingly
share common vulnerability patterns from a relation-aware perspective.
Therefore, we explore how to design foundation HGNN attack criteria by mining
shared attack units. In this paper, we propose a novel relation-wise
heterogeneous graph foundation attack model, HeTa. We introduce a foundation
surrogate model to align heterogeneity and identify the importance of shared
relation-aware attack units. Building on this, we implement a serialized
relation-by-relation attack based on the identified relational weights. In this
way, the perturbation can be transferred to various target HGNNs and easily
fine-tuned for new HGs. Extensive experiments exhibit powerful attack
performances and generalizability of our method.
Authors' comments: Accepted by IJCAI 2025
Chunyuan Deng, Ruidi Chang, Hanjie Chen
Interventions in language models (LMs) are applied strategically to steer
model behavior during the forward pass. Learnable interventions, also known as
representation fine-tuning, aim to apply pointwise control within the concept
subspace and have proven effective in altering high-level behaviors. In this
work, we extend this approach to the distribution level, enabling the model to
learn not only pointwise transformations but also the surrounding regions of
the concept subspace. We demonstrate that these methods perform effectively in
early layers, with larger standard deviations correlating strongly with
improved performance. Across eight commonsense reasoning and seven arithmetic
reasoning benchmarks, our distribution-wise interventions consistently
outperform pointwise interventions in controllability and robustness. These
results illustrate that distribution-wise interventions provide a more
comprehensive method for steering model behavior and enabling finer-grained
control over language models. The code is at:
\href{https://github.com/chili-lab/D-Intervention}{https://github.com/chili-lab/D-Intervention}.
Authors' comments: ICML 2025
Hans W. A. Hanley, Zakir Durumeric
Contextual large language model embeddings are increasingly utilized for
topic modeling and clustering. However, current methods often scale poorly,
rely on opaque similarity metrics, and struggle in multilingual settings. In
this work, we present a novel, scalable, interpretable, hierarchical, and
multilingual approach to clustering news articles and social media data. To do
this, we first train multilingual Matryoshka embeddings that can determine
story similarity at varying levels of granularity based on which subset of the
dimensions of the embeddings is examined. This embedding model achieves
state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson
$\rho$ = 0.816). Once trained, we develop an efficient hierarchical clustering
algorithm that leverages the hierarchical nature of Matryoshka embeddings to
identify unique news stories, narratives, and themes. We conclude by
illustrating how our approach can identify and cluster stories, narratives, and
overarching themes within real-world news datasets.
Authors' comments: Accepted to The 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025)
Ronghuan Wu, Wanchao Su, Jing Liao
Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored optimization-based and learning-based layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler's success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.
Authors' comments: Project Page: https://layerpeeler.github.io/
Kuo Zhou, Lu Zhang
Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still commit logical reasoning and computational errors during the problem-solving process. Thus, this paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic, for formally verifying the correctness of the solutions generated by large language models. Our framework first utilizes a Formalizer which employs an LLM to translate a natural language solution into a formal context. Afterward, our Critic (which integrates various external tools such as a Computer Algebra System and an SMT solver) evaluates the correctness of each statement within the formal context, and when a statement is incorrect, our Critic provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is utilized to determine the correctness of a solution to a given problem. 2) Refinement: When MATH-VF identifies errors in the solution generated by an LLM-based solution generator for a given problem, it submits the corrective suggestions proposed by the Critic to the solution generator to regenerate the solution. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.
Seongryong Jung, Suwan Yoon, DongGeon Kim, Hwanhee Lee
Large language models (LLMs) offer impressive performance but are impractical
for resource-constrained deployment due to high latency and energy consumption.
Knowledge distillation (KD) addresses this by transferring knowledge from a
large teacher to a smaller student model. However, conventional KD, notably
approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence
loss across the entire vocabulary, neglecting token-level prediction
discrepancies. By investigating these representative divergences via gradient
analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses
overestimated ones, showing their complementary roles. Based on this
observation, we propose Token-wise Distillation (ToDi), a novel method that
adaptively combines FKL and RKL per token using a sigmoid-based weighting
function derived from the teacher-student probability log-ratio. ToDi
dynamically emphasizes the appropriate divergence for each token, enabling
precise distribution alignment. We demonstrate that ToDi consistently
outperforms recent distillation baselines using uniform or less granular
strategies across instruction-following benchmarks. Extensive ablation studies
and efficiency analysis further validate ToDi's effectiveness and practicality.
Authors' comments: 13 pages, 7 figures
Peilin Wu, Mian Zhang, Xinlu Zhang, Xinya Du, Zhiyu Zoey Chen
Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with model's uncertainty in its search decisions. To address this, we propose $\beta$-GRPO, a reinforcement learning-based training method that incorporates confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that $\beta$-GRPO enable a 3B model with better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
The table reasoning task, crucial for efficient data acquisition, aims to answer questions based on the given table. Recently, reasoning large language models (RLLMs) with Long Chain-of-Thought (Long CoT) significantly enhance reasoning capabilities, leading to brilliant performance on table reasoning. However, Long CoT suffers from high cost for training and exhibits low reliability due to table content hallucinations. Therefore, we propose Row-of-Thought (RoT), which performs iteratively row-wise table traversal, allowing for reasoning extension and reflection-based refinement at each traversal. Scaling reasoning length by row-wise traversal and leveraging reflection capabilities of LLMs, RoT is training-free. The sequential traversal encourages greater attention to the table, thus reducing hallucinations. Experiments show that RoT, using non-reasoning models, outperforms RLLMs by an average of 4.3%, and achieves state-of-the-art results on WikiTableQuestions and TableBench with comparable models, proving its effectiveness. Also, RoT outperforms Long CoT with fewer reasoning tokens, indicating higher efficiency.
Dong Huang, Qiong Liu, Mark C. Wyatt, Grant M. Kennedy
The discovery of extra-terrestrial life is one of the ultimate goals for
future exoplanet-seeking missions, with one major challenge being the presence
of 'exozodiacal' dust near target stars or within their habitable zone.
Therefore, it is critical to identify which stars possess exozodiacal dust and
quantify their emission levels. In this study, we conducted a search for
exozodi candidates within 10 parsecs using the Reyl'e sample. We performed
proper motion calculations and cross-matched the sample with the WISE and 2MASS
database, resulting in 339 preliminary target samples. We further analysed the
infrared radiation characteristics of these targets, using spectral energy
distribution (SED) fitting to predict photometric flux levels in the infrared
and searching for 3sigma excesses in the WISE W3 band. During further selection
processes, we applied various analysis methods to perform rigorous validation.
We identified five exozodi candidates all of which are brown dwarfs (BDs).
Given the clustering in candidate spectral types, we expect that these are not
true exozodi candidates, rather the apparent excess arises from the inability
of the BD photosphere models to accurately represent the SEDs of objects at the
L-T transition. Indeed, for the object DENIS J025503.3-470049, excess is likely
due to silicate clouds in the BD atmosphere. We suggest that a more stringent
5sigma excess is required to infer excess for this spectral type. The detection
rate (0/339) in our sample shows that less than 1% M stars have exozodi above
21% excess levels. This is consistent with the rate of exozodi at similar level
towards FGK stars in the Kennedy & Wyatt sample (25/24,174). We provide upper
limits on the 12 micron exozodi emission for the sample, which is typically at
21% relative to the star. For most stars, in particular the low mass M stars,
this is the first such upper limit in the literature.
Authors' comments: 14 pages, 11 figures, accepted for publication in A&A
Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions of the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code available at https://github.com/javiersc1/ALPCAH.
Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen et al.
Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.
Hui Sun, Feng Bao
In this work, we study the stochastic optimal control problem (SOC) mainly
from the probabilistic view point, i.e. via the Stochastic Maximum principle
(SMP) \cite{Peng4}. We adopt the sample-wise backpropagation scheme proposed in
\cite{Hui1} to solve the SOC problem under the strong convexity assumption.
Importantly, in the Stochastic Gradient Descent (SGD) procedure, we use batch
samples with higher order scheme in the forward SDE to improve the convergence
rate in \cite{Hui1} from $\sim \mathcal{O}(\sqrt{\frac{N}{K} + \frac{1}{N}})$
to $\sim \mathcal{O}(\sqrt{\frac{1}{K} + \frac{1}{N^2}})$ and note that the
main source of uncertainty originates from the scheme for the simulation of $Z$
term in the BSDE. In the meantime, we note the SGD procedure uses only the
necessary condition of the SMP, while the batch simulation of the approximating
solution of BSDEs allows one to obtain a more accurate estimate of the control
$u$ that minimizes the Hamiltonian. We then propose a damped contraction
algorithm to solve the SOC problem whose proof of convergence for a special
case is attained under some appropriate assumption. We then show numerical
results to check the first order convergence rate of the projection algorithm
and analyze the convergence behavior of the damped contraction algorithm.
Lastly, we briefly discuss how to incorporate the proposed scheme in solving
practical problems especially when the Randomized Neural Networks are used. We
note that in this special case, the error backward propagation can be avoided
and parameter update can be achieved via purely algebraic computation (vector
algebra) which will potentially improve the efficiency of the whole training
procedure. Such idea will require further exploration and we will leave it as
our future work.
Authors' comments: Follow up work for a previous SINUM paper
Daneul Kim, Jaeah Lee, Jaesik Park
Most real-world image editing tasks require multiple sequential edits to
achieve desired results. Current editing approaches, primarily designed for
single-object modifications, struggle with sequential editing: especially with
maintaining previous edits along with adapting new objects naturally into the
existing content. These limitations significantly hinder complex editing
scenarios where multiple objects need to be modified while preserving their
contextual relationships. We address this fundamental challenge through two key
proposals: enabling rough mask inputs that preserve existing content while
naturally integrating new elements and supporting consistent editing across
multiple modifications. Our framework achieves this through layer-wise memory,
which stores latent representations and prompt embeddings from previous edits.
We propose Background Consistency Guidance that leverages memorized latents to
maintain scene coherence and Multi-Query Disentanglement in cross-attention
that ensures natural adaptation to existing content. To evaluate our method, we
present a new benchmark dataset incorporating semantic alignment metrics and
interactive editing scenarios. Through comprehensive experiments, we
demonstrate superior performance in iterative image editing tasks with minimal
user effort, requiring only rough masks while maintaining high-quality results
throughout multiple editing steps.
Authors' comments: CVPR 2025. Project page :
https://carpedkm.github.io/projects/improving_edit/index.html
Yijie Jin, Junjie Peng, Xuanchao Lin, Haochen Yuan, Lan Wang, Cangzhi Zheng
Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although act as the paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). It is formally equivalent to MulTs which achieves an efficient weight-sharing mechanism without information disorder through IM, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A Triton kernel called Decomposition is implemented to ensure avoiding additional computational overhead. Moreover, it achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT itself and the HMHG concept, we integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reduction on widely used MSA datasets.