Woonsang Kang, Joohyung Lee, Seungjun Kim, Jungchan Cho, Yoonseon Oh
Grasp pose detection (GPD) is a fundamental capability for robotic autonomy,
but its reliance on large, diverse datasets creates significant data privacy
and centralization challenges. Federated Learning (FL) offers a
privacy-preserving solution, but its application to GPD is hindered by the
substantial communication overhead of large models, a key issue for
resource-constrained robots. To address this, we propose a novel module-wise FL
framework that begins by analyzing the learning dynamics of the GPD model's
functional components. This analysis identifies slower-converging modules, to
which our framework then allocates additional communication effort. This is
realized through a two-phase process: a standard full-model training phase is
followed by a communication-efficient phase where only the identified subset of
slower-converging modules is trained and their partial updates are aggregated.
Extensive experiments on the GraspNet-1B dataset demonstrate that our method
outperforms standard FedAvg and other baselines, achieving higher accuracy for
a given communication budget. Furthermore, real-world experiments on a physical
robot validate our approach, showing a superior grasp success rate compared to
baseline methods in cluttered scenes. Our work presents a
communication-efficient framework for training robust, generalized GPD models
in a decentralized manner, effectively improving the trade-off between
communication cost and model performance.
Authors' comments: 8 pages, 5 figures. Submitted to IEEE Robotics and Automation Letters
(RA-L)
Jiawei Sun, Hongkang Li, Meng Wang
Jumping connections enable Graph Convolutional Networks (GCNs) to overcome
over-smoothing, while graph sparsification reduces computational demands by
selecting a sub-matrix of the graph adjacency matrix during neighborhood
aggregation. Learning GCNs with graph sparsification has shown empirical
success across various applications, but a theoretical understanding of the
generalization guarantees remains limited, with existing analyses ignoring
either graph sparsification or jumping connections. This paper presents the
first learning dynamics and generalization analysis of GCNs with jumping
connections using graph sparsification. Our analysis demonstrates that the
generalization accuracy of the learned model closely approximates the highest
achievable accuracy within a broad class of target functions dependent on the
proposed sparse effective adjacency matrix $A^*$. Thus, graph sparsification
maintains generalization performance when $A^*$ preserves the essential edges
that support meaningful message propagation. We reveal that jumping connections
lead to different sparsification requirements across layers. In a
two-hidden-layer GCN, the generalization is more affected by the sparsified
matrix deviations from $A^*$ of the first layer than the second layer. To the
best of our knowledge, this marks the first theoretical characterization of
jumping connections' role in sparsification requirements. We validate our
theoretical results on benchmark datasets in deep GCNs.
Authors' comments: TMLR
Yan Dong, Enci Xu, Shaoqiang Qiu, Wenxuan Li, Yang Liu, Bin Han
High-speed ground robots moving on unstructured terrains generate intense
high-frequency vibrations, leading to LiDAR scan distortions in Lidar-inertial
odometry (LIO). Accurate and efficient undistortion is extremely challenging
due to (1) rapid and non-smooth state changes during intense vibrations and (2)
unpredictable IMU noise coupled with a limited IMU sampling frequency. To
address this issue, this paper introduces post-undistortion uncertainty. First,
we model the undistortion errors caused by linear and angular vibrations and
assign post-undistortion uncertainty to each point. We then leverage this
uncertainty to guide point-to-map matching, compute uncertainty-aware
residuals, and update the odometry states using an iterated Kalman filter. We
conduct vibration-platform and mobile-platform experiments on multiple public
datasets as well as our own recordings, demonstrating that our method achieves
better performance than other methods when LiDAR undergoes intense vibration.
Authors' comments: 8 pages, 10 figures, 5 tables. Accepted by Robotics and Automation
Letters at June 30
Lorenzo Giaretto, Nicola Soave
In this paper we establish existence and properties of minimal energy solutions for the weakly coupled system $$ \begin{cases}
-Îu_i + λ_i u_i = μ_i|u_i|^{Kq-2}u_i + β|u_i|^{q-2}u_i\prod_{j\neq i}|u_j|^q & \text{in }\mathbb{R}^d, \qquad
u_i \in H^1(\mathbb{R}^d), \end{cases}\qquad i=1,\dots, K, $$ characterized by $K$-wise interaction (namely the interaction term involves the product of all the components). We consider both attractive ($β>0$) and repulsive cases ($β<0$), and we give sufficient conditions on $β$ in order to have least energy fully non-trivial solutions, if necessary under a radial constraint. We also study the asymptotic behavior of least energy fully non-trivial radial solutions in the limit of strong competition $β\to -\infty$, showing partial segregation phenomena which differ substantially from those arising in pairwise interaction models.
Authors' comments: 26 pages, no figures
Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, Jongwon Choi
We introduce a deepfake video detection approach that exploits pixel-wise
temporal inconsistencies, which traditional spatial frequency-based detectors
often overlook. Traditional detectors represent temporal information merely by
stacking spatial frequency spectra across frames, resulting in the failure to
detect temporal artifacts in the pixel plane. Our approach performs a 1D
Fourier transform on the time axis for each pixel, extracting features highly
sensitive to temporal inconsistencies, especially in areas prone to unnatural
movements. To precisely locate regions containing the temporal artifacts, we
introduce an attention proposal module trained in an end-to-end manner.
Additionally, our joint transformer module effectively integrates pixel-wise
temporal frequency features with spatio-temporal context features, expanding
the range of detectable forgery artifacts. Our framework represents a
significant advancement in deepfake video detection, providing robust
performance across diverse and challenging detection scenarios.
Authors' comments: accepted by iccv 2025. code is will be available at
https://github.com/rama0126/PwTF-DVD
Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng
Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency beyond 20% and inference throughput over 15% while maintaining strong performance.
Dake Guo, Jixun Yao, Linhan Ma, Wang He, Lei Xie
Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms.\footnote{Speech samples: https://dukguo.github.io/StreamFlow/}
Yash Vardhan Tomar
Bias in predictive machine learning (ML) models is a fundamental challenge due to the skewed or unfair outcomes produced by biased models. Existing mitigation strategies rely on either post-hoc corrections or rigid constraints. However, emerging research claims that these techniques can limit scalability and reduce generalizability. To address this, this paper introduces a feature-wise mixing framework to mitigate contextual bias. This was done by redistributing feature representations across multiple contextual datasets. To assess feature-wise mixing's effectiveness, four ML classifiers were trained using cross-validation and evaluated with bias-sensitive loss functions, including disparity metrics and mean squared error (MSE), which served as a standard measure of predictive performance. The proposed method achieved an average bias reduction of 43.35% and a statistically significant decrease in MSE across all classifiers trained on mixed datasets. Additionally, benchmarking against established bias mitigation techniques found that feature-wise mixing consistently outperformed SMOTE oversampling and demonstrated competitive effectiveness without requiring explicit bias attribute identification. Feature-wise mixing efficiently avoids the computational overhead typically associated with fairness-aware learning algorithms. Future work could explore applying feature-wise mixing for real-world fields where accurate predictions are necessary.
Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu
Recently, autoregressive (AR) language models have emerged as a dominant
approach in speech synthesis, offering expressive generation and scalable
training. However, conventional AR speech synthesis models relying on the
next-token prediction paradigm often encounter significant challenges when
handling long speech sequences. These models often struggle to construct stable
frame-to-frame attention, leading to increased latency and degraded synthesis
quality, thereby limiting their feasibility for real-time applications. To
address these limitations, we introduce a novel dynamic chunk-wise
autoregressive synthesis framework, termed DCAR, designed to enhance both
efficiency and intelligibility robustness in AR speech generation. DCAR
introduces a chunk-to-frame attention mechanism through training with
multi-token prediction, enabling dynamic chunk prediction in variable speech
contexts using a lightweight module trained on-policy. DCAR dynamically adjusts
the token prediction span, significantly reducing the sequence length
dependency while obtaining high synthesis quality. Comprehensive empirical
evaluations demonstrate that DCAR substantially outperforms traditional
next-token prediction models, achieving up to 72.27% intelligibility
improvement and 2.61x inference speedup simultaneously on the test set.
Furthermore, we conduct comprehensive analysis to support it as a versatile
foundation for next-generation speech synthesis systems.
Authors' comments: 17 pages, 8 figures, 5 tables
Taiga Someya, Ryo Yoshida, Hitomi Yanaka, Yohei Oseki
Recent work has demonstrated that neural language models encode syntactic structures in their internal representations, yet the derivations by which these structures are constructed across layers remain poorly understood. In this paper, we propose Derivational Probing to investigate how micro-syntactic structures (e.g., subject noun phrases) and macro-syntactic structures (e.g., the relationship between the root verbs and their direct dependents) are constructed as word embeddings propagate upward across layers. Our experiments on BERT reveal a clear bottom-up derivation: micro-syntactic structures emerge in lower layers and are gradually integrated into a coherent macro-syntactic structure in higher layers. Furthermore, a targeted evaluation on subject-verb number agreement shows that the timing of constructing macro-syntactic structures is critical for downstream performance, suggesting an optimal timing for integrating global syntactic information.
Haofeng Wang, Fangtao Zhou, Qi Zhang, Zeyuan Chen, Enci Zhang, Zhao Wang, Xiaofeng Huang, Siwei Ma
RGB-IR(RGB-Infrared) image pairs are frequently applied simultaneously in
various applications like intelligent surveillance. However, as the number of
modalities increases, the required data storage and transmission costs also
double. Therefore, efficient RGB-IR data compression is essential. This work
proposes a joint compression framework for RGB-IR image pair. Specifically, to
fully utilize cross-modality prior information for accurate context probability
modeling within and between modalities, we propose a Channel-wise
Cross-modality Entropy Model (CCEM). Among CCEM, a Low-frequency Context
Extraction Block (LCEB) and a Low-frequency Context Fusion Block (LCFB) are
designed for extracting and aggregating the global low-frequency information
from both modalities, which assist the model in predicting entropy parameters
more accurately. Experimental results demonstrate that our approach outperforms
existing RGB-IR image pair and single-modality compression methods on LLVIP and
KAIST datasets. For instance, the proposed framework achieves a 23.1% bit rate
saving on LLVIP dataset compared to the state-of-the-art RGB-IR image codec
presented at CVPR 2022.
Authors' comments: IEEE International Conference on Systems, Man, and Cybernetics 2025.
(SMC), under review
Lucas Tendick, Costantino Budroni, Marco Túlio Quintino
The incompatibility of quantum measurements, i.e. the fact that certain
observable quantities cannot be measured jointly is widely regarded as a
distinctive quantum feature with important implications for the foundations and
the applications of quantum information theory. While the standard
incompatibility of multiple measurements has been the focus of attention since
the inception of quantum theory, its generalizations, such as measurement
simulability, $n$-wise incompatibility, and mulit-copy incompatibility have
only been proposed recently. Here, we point out that all these generalizations
are differing notions of the question of how many measurements are genuinely
contained in a measurement device. We then show, that all notions do differ not
only in their operational meaning but also mathematically in the set of
measurement assemblages they describe. We then fully resolve the relations
between these different generalizations, by showing a strict hierarchy between
these notions. Hence, we provide a general framework for generalized
measurement incompatibility. Finally, we consider the implications our results
have for recent works using these different notions.
Authors' comments: 24 pages, 6 figures, comments are welcome!
Lingfei Wang, Yu Xing, Yuhao Yi, Ming Cao, Karl H. Johansson
This paper investigates the problem of leadership development for an external influencer using the Friedkin-Johnsen (FJ) opinion dynamics model, where the influencer is modeled as a fully stubborn agent and leadership is quantified by social power. The influencer seeks to maximize her social power by strategically adding a limited number of links to regular agents. This optimization problem is shown to be equivalent to maximizing the absorbing probability to the influencer in an augmented Markov chain. The resulting objective function is both monotone and submodular, enabling the use of a greedy algorithm to compute an approximate solution. To handle large-scale networks efficiently, a random walk sampling over the Markov chain is employed to reduce computational complexity. Analytical characterizations of the solution are provided for both low and high stubbornness of regular agents. Specific network topologies are also examined: for complete graphs with rank-one weight matrices, the problem reduces to a hyperbolic 0-1 programmming problem, which is solvable in polynomial time; for symmetric ring graphs with circulant weight matrices and uniform agent stubbornness, the optimal strategy involves selecting agents that are sufficiently dispersed across the network. Numerical simulations are presented for illustration.
Yu-Yang Qian, Yuan-Ze Xu, Zhen-Yu Zhang, Peng Zhao, Zhi-Hua Zhou
Many real-world applications collect data in a streaming environment, where
learning tasks are encountered sequentially. This necessitates continual
learning (CL) to update models online, enabling adaptation to new tasks while
preserving past knowledge to prevent catastrophic forgetting. Nowadays, with
the flourish of large pre-trained models (LPMs), efficiency has become
increasingly critical for CL, due to their substantial computational demands
and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of
Low-Rank Adapters), a novel approach that constructs layer-wise adapters by
leveraging hierarchical gradient similarity to enable efficient CL,
particularly for LPMs. To reduce the computational burden of task similarity
estimation, we employ bandit techniques to develop an algorithm based on lower
confidence bounds to efficiently explore the task structure. Furthermore, we
use sparse gradient updates to facilitate parameter optimization, making the
approach better suited for LPMs. Theoretical analysis is provided to justify
the rationale behind our approach, and experiments on both vision transformers
(ViTs) and large language models (LLMs) demonstrate the effectiveness and
efficiency of our approach across various domains, including vision and natural
language processing tasks.
Authors' comments: ICML 2025
Cheng Wang, Siqi Chen, Donghua Mi, Yang Chen, Yudong Zhang, Yinsheng Li
Recent advances in medical imaging have established deep learning-based
segmentation as the predominant approach, though it typically requires large
amounts of manually annotated data. However, obtaining annotations for
intracranial hemorrhage (ICH) remains particularly challenging due to the
tedious and costly labeling process. Semi-supervised learning (SSL) has emerged
as a promising solution to address the scarcity of labeled data, especially in
volumetric medical image segmentation. Unlike conventional SSL methods that
primarily focus on high-confidence pseudo-labels or consistency regularization,
we propose SWDL-Net, a novel SSL framework that exploits the complementary
advantages of Laplacian pyramid and deep convolutional upsampling. The
Laplacian pyramid excels at edge sharpening, while deep convolutions enhance
detail precision through flexible feature mapping. Our framework achieves
superior segmentation of lesion details and boundaries through a difference
learning mechanism that effectively integrates these complementary approaches.
Extensive experiments on a 271-case ICH dataset and public benchmarks
demonstrate that SWDL-Net outperforms current state-of-the-art methods in
scenarios with only 2% labeled data. Additional evaluations on the publicly
available Brain Hemorrhage Segmentation Dataset (BHSD) with 5% labeled data
further confirm the superiority of our approach. Code and data have been
released at https://github.com/SIAT-CT-LAB/SWDL.
Authors' comments: 11 pages, 4 figures, 6 Tables
Shiji Zhao, Chi Chen, Ranjie Duan, Xizhe Wang, Xingxing Wei
Adversarial Training (AT) is widely recognized as an effective approach to
enhance the adversarial robustness of Deep Neural Networks. As a variant of AT,
Adversarial Robustness Distillation (ARD) has shown outstanding performance in
enhancing the robustness of small models. However, both AT and ARD face robust
fairness issue: these models tend to display strong adversarial robustness
against some classes (easy classes) while demonstrating weak adversarial
robustness against others (hard classes). This paper explores the underlying
factors of this problem and points out the smoothness degree of soft labels for
different classes significantly impacts the robust fairness from both empirical
observation and theoretical analysis. Based on the above exploration, we
propose Anti-Bias Soft Label Distillation (ABSLD) within the Knowledge
Distillation framework to enhance the adversarial robust fairness.
Specifically, ABSLD adaptively reduces the student's error risk gap between
different classes, which is accomplished by adjusting the class-wise smoothness
degree of teacher's soft labels during the training process, and the adjustment
is managed by assigning varying temperatures to different classes.
Additionally, as a label-based approach, ABSLD is highly adaptable and can be
integrated with the sample-based methods. Extensive experiments demonstrate
ABSLD outperforms state-of-the-art methods on the comprehensive performance of
robustness and fairness.
Authors' comments: arXiv admin note: text overlap with arXiv:2312.05508
Zengjue Chen, Runliang Niu, He Kong, Qi Wang
Recent advances in Vision-Language-Action (VLA) model have demonstrated strong generalization capabilities across diverse scenes, tasks, and robotic platforms when pretrained at large-scale datasets. However, these models still require task-specific fine-tuning in novel environments, a process that relies almost exclusively on supervised fine-tuning (SFT) using static trajectory datasets. Such approaches neither allow robot to interact with environment nor do they leverage feedback from live execution. Also, their success is critically dependent on the size and quality of the collected trajectories. Reinforcement learning (RL) offers a promising alternative by enabling closed-loop interaction and aligning learned policies directly with task objectives. In this work, we draw inspiration from the ideas of GRPO and propose the Trajectory-wise Group Relative Policy Optimization (TGRPO) method. By fusing step-level and trajectory-level advantage signals, this method improves GRPO's group-level advantage estimation, thereby making the algorithm more suitable for online reinforcement learning training of VLA. Experimental results on ten manipulation tasks from the libero-object benchmark demonstrate that TGRPO consistently outperforms various baseline methods, capable of generating more robust and efficient policies across multiple tested scenarios. Our source codes are available at: https://github.com/hahans/TGRPO
Tarushri N. S.
Universal Differential Equations (UDEs), which blend neural networks with physical differential equations, have emerged as a powerful framework for scientific machine learning (SciML), enabling data-efficient, interpretable, and physically consistent modeling. In the context of smart grid systems, modeling node-wise battery dynamics remains a challenge due to the stochasticity of solar input and variability in household load profiles. Traditional approaches often struggle with generalization and fail to capture unmodeled residual dynamics. This work proposes a UDE-based approach to learn node-specific battery evolution by embedding a neural residual into a physically inspired battery ODE. Synthetic yet realistic solar generation and load demand data are used to simulate battery dynamics over time. The neural component learns to model unobserved or stochastic corrections arising from heterogeneity in node demand and environmental conditions. Comprehensive experiments reveal that the trained UDE aligns closely with ground truth battery trajectories, exhibits smooth convergence behavior, and maintains stability in long-term forecasts. These findings affirm the viability of UDE-based SciML approaches for battery modeling in decentralized energy networks and suggest broader implications for real-time control and optimization in renewable-integrated smart grids.
Yao Yan
Multi-digit addition is a clear probe of the computational power of large
language models. To dissect the internal arithmetic processes in
LLaMA-3-8B-Instruct, we combine linear probing with logit-lens inspection.
Inspired by the step-by-step manner in which humans perform addition, we
propose and analyze a coherent four-stage trajectory in the forward
pass:Formula-structure representations become linearly decodable first, while
the answer token is still far down the candidate list.Core computational
features then emerge prominently.At deeper activation layers, numerical
abstractions of the result become clearer, enabling near-perfect detection and
decoding of the individual digits in the sum.Near the output, the model
organizes and generates the final content, with the correct token reliably
occupying the top rank.This trajectory suggests a hierarchical process that
favors internal computation over rote memorization. We release our code and
data to facilitate reproducibility.
Authors' comments: 12 pages, including appendix, 7 figures. EMNLP 2025 submission (ARR
May 2025 cycle, reviews pending)
Haosong Liu, Yuge Cheng, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng et al.
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).