Yujia Wu, Wei Lan, Long Feng, Chih-Ling Tsai
The stochastic block model (SBM) has been widely used to analyze network data. Various goodness-of-fit tests have been proposed to assess the adequacy of model structures. To the best of our knowledge, however, none of the existing approaches is applicable to sparse networks in which the connection probability of any two communities is of order log n/n and the number of communities is divergent. To fill this gap, we propose a novel goodness-of-fit test for the stochastic block model. The key idea is to construct statistics by sampling the maximum entry-deviations of the adjacency matrix, so that the negative impacts of network sparsity are alleviated by the sampling process. We demonstrate theoretically that the proposed test statistic converges to the Type-I extreme value distribution under the null hypothesis regardless of the network structure. Accordingly, it can be applied to both dense and sparse networks. In addition, we obtain the asymptotic power against alternatives. Moreover, we introduce a bootstrap-corrected test statistic to improve the finite sample performance, recommend an augmented test statistic to increase the power, and extend the proposed test to the degree-corrected SBM. Simulation studies and two empirical examples with both dense and sparse networks indicate that the proposed method performs well.
Jisoo Kim, Sungmin Kang, Sunwoo Lee
Expensive communication cost is a common performance bottleneck in Federated Learning (FL), which makes it less appealing in real-world applications. Many communication-efficient FL methods focus on discarding a part of model updates, mostly based on gradient magnitude. In this study, we find that recycling previous updates, rather than simply dropping them, more effectively reduces the communication cost while maintaining FL performance. We propose FedLUAR, a Layer-wise Update Aggregation with Recycling scheme for communication-efficient FL. We first define a useful metric that quantifies the extent to which the aggregated gradients influence the model parameter values in each layer. FedLUAR selects a few layers based on the metric and recycles their previous updates on the server side. Our extensive empirical study demonstrates that the update recycling scheme significantly reduces the communication cost while maintaining model accuracy. For example, our method achieves nearly the same AG News accuracy as FedAvg, while reducing the communication cost to just 17%.
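As a rough illustration of the recycling idea, the sketch below assumes the layer metric is the ratio of the aggregated update norm to the weight norm. That formula is an assumption, since the abstract does not state it, and the names `layer_metric`, `select_recycled_layers`, and `apply_round` are hypothetical. Layers with the smallest ratio reuse the previous round's update, so clients do not need to send a fresh one for them.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def layer_metric(update, weights, eps=1e-12):
    # Assumed metric (not taken from the paper): how large the aggregated
    # update is relative to the layer's current weights.
    return l2_norm(update) / (l2_norm(weights) + eps)


def select_recycled_layers(updates, weights, k):
    """Return the names of the k layers whose updates matter least."""
    scores = {name: layer_metric(updates[name], weights[name]) for name in updates}
    return set(sorted(scores, key=scores.get)[:k])


def apply_round(weights, fresh_updates, previous_updates, recycled):
    """Server step: recycled layers reuse the previous update, others use the fresh one."""
    new_weights = {}
    for name, w in weights.items():
        update = previous_updates[name] if name in recycled else fresh_updates[name]
        new_weights[name] = [wi + ui for wi, ui in zip(w, update)]
    return new_weights


weights = {"conv": [1.0, 1.0], "fc": [10.0, 10.0]}
previous = {"conv": [0.5, 0.5], "fc": [0.1, 0.1]}
recycled = select_recycled_layers(previous, weights, k=1)  # {"fc"}
fresh = {"conv": [0.2, 0.2]}  # "fc" is not transmitted this round
updated = apply_round(weights, fresh, previous, recycled)
```

In this toy setup only the `conv` update travels over the network, which is where the communication saving would come from.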
Shu-Xun Yang, Cunxiang Wang, Yidong Wang, Xiaotao Gu, Minlie Huang, Jie Tang
Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, which yields highly inaccurate and uninterpretable evaluation outcomes and fails to assess proof or open-ended problems. To address these issues, we propose a novel mathematical process evaluation agent based on Tree-of-Error, called StepMathAgent. This agent incorporates four internal core operations: logical step segmentation, step scoring, score aggregation and error tree generation, along with four external extension modules: difficulty calibration, simplicity evaluation, completeness validation and format assessment. Furthermore, we introduce StepMathBench, a benchmark comprising 1,000 step-divided process evaluation instances, derived from 200 high-quality math problems grouped by problem type, subject category and difficulty level. Experiments on StepMathBench show that our proposed StepMathAgent outperforms all state-of-the-art methods, demonstrating human-aligned evaluation preferences and broad applicability to various scenarios. Our data and code are available at https://github.com/SHU-XUN/StepMathAgent.
JunYong Choi, Min-Cheol Sagong, SeokYeong Lee, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho
We propose a diffusion-based inverse rendering framework that decomposes a
single RGB image into geometry, material, and lighting. Inverse rendering is
inherently ill-posed, making it difficult to predict a single accurate
solution. To address this challenge, recent generative model-based methods aim
to present a range of possible solutions. However, finding a single accurate
solution and generating diverse solutions can be conflicting. In this paper, we
propose a channel-wise noise scheduling approach that allows a single diffusion
model architecture to achieve two conflicting objectives. The resulting two
diffusion models, trained with different channel-wise noise schedules, can
predict a single highly accurate solution and present multiple possible
solutions. The experimental results demonstrate the superiority of our two
models in terms of both diversity and accuracy, which translates to enhanced
performance in downstream applications such as object insertion and material
editing.
Authors' comments: Accepted by CVPR 2025
Haiyan Wei, Hangrui Xu, Bingxu Zhu, Yulian Geng, Aolei Liu, Wenfei Yin, Jian Liu
Virtual stain transfer leverages computer-assisted technology to transform the histochemical staining patterns of tissue samples into other staining types. However, existing methods often lose detailed pathological information due to the limitations of the cycle consistency assumption. To address this challenge, we propose STNHCL, a hypergraph-based patch-wise contrastive learning method. STNHCL captures higher-order relationships among patches through hypergraph modeling, ensuring consistent higher-order topology between input and output images. Additionally, we introduce a novel negative sample weighting strategy that leverages discriminator heatmaps to apply different weights based on the Gaussian distribution for tissue and background, thereby enhancing traditional weighting methods. Experiments demonstrate that STNHCL achieves state-of-the-art performance in the two main categories of stain transfer tasks. Furthermore, our model also performs excellently in downstream tasks. Code will be made available.
Yu Feng, Dingxin Zhang, Runkai Zhao, Yong Xia, Heng Huang, Weidong Cai
Backdoor attacks pose a severe threat to deep neural networks (DNNs) by
implanting hidden backdoors that can be activated with predefined triggers to
manipulate model behaviors maliciously. Existing 3D point cloud backdoor
attacks primarily rely on sample-wise global modifications, resulting in
suboptimal stealthiness. To address this limitation, we propose Stealthy
Patch-Wise Backdoor Attack (SPBA), which employs the first patch-wise trigger
for 3D point clouds and restricts perturbations to local regions, significantly
enhancing stealthiness. Specifically, SPBA decomposes point clouds into local
patches and evaluates their geometric complexity using a curvature-based patch
imperceptibility score, ensuring that the trigger remains less perceptible to
the human eye by strategically applying it across multiple geometrically
complex patches with lower visual sensitivity. By leveraging the Graph Fourier
Transform (GFT), SPBA optimizes a patch-wise spectral trigger that perturbs the
spectral features of selected patches, enhancing attack effectiveness while
preserving the global geometric structure of the point cloud. Extensive
experiments on ModelNet40 and ShapeNetPart demonstrate that SPBA consistently
achieves an attack success rate (ASR) exceeding 96.5% across different models
while achieving state-of-the-art imperceptibility compared to existing backdoor
attack methods.
Authors' comments: 12 pages, 8 figures, 6 tables
Minyue Dai, Jingbo Wang, Ke Fan, Bin Ji, Haoyu Zhao, Junting Dong, Bo Dai
Styled motion in-betweening is crucial for computer animation and gaming.
However, existing methods typically encode motion styles by modeling whole-body
motions, often overlooking the representation of individual body parts. This
limitation reduces the flexibility of infilled motion, particularly in
adjusting the motion styles of specific limbs independently. To overcome this
challenge, we propose a novel framework that models motion styles at the
body-part level, enhancing both the diversity and controllability of infilled
motions. Our approach enables more nuanced and expressive animations by
allowing precise modifications to individual limb motions while maintaining
overall motion coherence. Leveraging phase-related insights, our framework
employs periodic autoencoders to automatically extract the phase of each body
part, capturing distinctive local style features. Additionally, we effectively
decouple the motion source from synthesis control by integrating motion
manifold learning and conditional generation techniques from both image and
motion domains. This allows the motion source to generate high-quality motions
across various styles, with extracted motion and style features readily
available for controlled synthesis in subsequent tasks. Comprehensive
evaluations demonstrate that our method achieves superior speed, robust
generalization, and effective generation of extended motion sequences.
Authors' comments: 10 pages, 5 figures
Arghya Pal, Sailaja Rajanala, CheeMing Ting, Raphael Phan
Medical image denoising is essential for improving the reliability of clinical diagnosis and guiding subsequent image-based tasks. In this paper, we propose a multi-scale approach that integrates anisotropic Gaussian filtering with progressive Bezier-path redrawing. Our method constructs a scale-space pyramid to mitigate noise while preserving critical structural details. Starting at the coarsest scale, we segment partially denoised images into coherent components and redraw each using a parametric Bezier path with representative color. Through iterative refinements at finer scales, small and intricate structures are accurately reconstructed, while large homogeneous regions remain robustly smoothed. We employ both mean square error and self-intersection constraints to maintain shape coherence during path optimization. Empirical results on multiple MRI datasets demonstrate consistent improvements in PSNR and SSIM over competing methods. This coarse-to-fine framework offers a robust, data-efficient solution for cross-domain denoising, reinforcing its potential clinical utility and versatility. Future work extends this technique to three-dimensional data.
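For readers unfamiliar with parametric Bezier paths, the sketch below evaluates a Bezier curve with de Casteljau's algorithm, the standard way to compute a point on such a curve. It illustrates only the curve primitive, not the authors' segmentation or path optimization, and the function names are illustrative.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bezier curve of any degree at parameter t in [0, 1]."""
    points = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(points) > 1:
        points = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


def sample_path(control_points, n):
    """Sample n >= 2 evenly spaced parameter values along the curve."""
    return [de_casteljau(control_points, i / (n - 1)) for i in range(n)]


# Cubic example: the curve starts at the first control point,
# ends at the last one, and is pulled towards the two inner ones.
cubic = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
midpoint = de_casteljau(cubic, 0.5)  # (0.5, 0.75)
outline = sample_path(cubic, 5)
```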
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng et al.
Text-to-Image (T2I) models are capable of generating high-quality artistic
creations and visual content. However, existing research and evaluation
standards predominantly focus on image realism and shallow text-image
alignment, lacking a comprehensive assessment of complex semantic understanding
and world knowledge integration in text to image generation. To address this
challenge, we propose $\textbf{WISE}$, the first benchmark specifically
designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic
$\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by
challenging models with 1,000 meticulously crafted prompts across 25 sub-domains
in cultural common sense, spatio-temporal reasoning, and natural science. To
overcome the limitations of the traditional CLIP metric, we introduce
$\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image
alignment. Through comprehensive testing of 20 models (10 dedicated T2I models
and 10 unified multimodal models) using 1,000 structured prompts spanning 25
subdomains, our findings reveal significant limitations in their ability to
effectively integrate and apply world knowledge during image generation,
highlighting critical pathways for enhancing knowledge incorporation and
application in next-generation T2I models. Code and data are available at
https://github.com/PKU-YuanGroup/WISE.
Authors' comments: Code, data and leaderboard: https://github.com/PKU-YuanGroup/WISE
Francesco Battista, Sergio Chibbaro, Paolo Gualtieri
We study the settling of suspensions of relatively large particles with a
diameter of the order of ten Kolmogorov scales and density slightly larger than
the carrier fluid in statistically steady homogeneous isotropic turbulence. The
particle-to-fluid density ratio is varied to obtain a wide range of Galileo
numbers, which are the ratios between buoyancy and viscous forces. We analyse
the problem through highly resolved one-way coupled direct numerical simulations
where the particles are modeled as material points. The physical parameters are
chosen in the same range used in recent particle-resolved simulations, against
which we compare (PRS, [1, 2]). The results of the point-wise simulations are
in good agreement with the PRS ones, showing a reduced settling speed for the
range of parameters under investigation, relevant to suspensions settling in
aqueous media, at volume fractions up to a few percent for density ratios of
order one. Results are obtained neglecting the inter-particle and particle-fluid
interactions while purposely including or excluding the different forces (e.g.
Stokes drag, added mass, lift force) in the particles' equations of motion to
highlight their respective contributions. At a high Galileo number, the mean
settling velocity is only slightly affected by turbulent fluctuations, and it
is the same as that obtained for the settling velocity of a single particle in a
quiescent fluid. When the Galileo number is reduced, the settling velocity is
progressively affected by turbulent fluctuations that cause a substantial
decrease in the particle sedimentation speed. The present results are
particularly relevant for applications. Point-wise models endowed with an
accurate description of the hydrodynamic force are effective in capturing the
particle settling speed and other higher-order statistics as demonstrated by
direct comparison against particle-resolved simulations [1, 2].
Authors' comments: 23 pages, 6 figures
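To make the control parameter concrete, the sketch below computes the Galileo number under one common convention, Ga = sqrt(|rho_p/rho_f - 1| g d^3) / nu. The abstract does not state which convention the authors use (some works define the square of this quantity), so the formula and the function name are assumptions.

```python
import math


def galileo_number(density_ratio, diameter, kinematic_viscosity, gravity=9.81):
    """Galileo number under the assumed convention
    Ga = sqrt(|rho_p / rho_f - 1| * g * d**3) / nu  (SI units)."""
    buoyancy_term = abs(density_ratio - 1.0) * gravity * diameter ** 3
    return math.sqrt(buoyancy_term) / kinematic_viscosity


# Varying the particle-to-fluid density ratio at fixed diameter and viscosity
# sweeps the Galileo number, as described in the abstract.
# The numerical values below are illustrative, not taken from the paper.
for ratio in (1.01, 1.1, 1.5):
    ga = galileo_number(ratio, diameter=1e-3, kinematic_viscosity=1e-6)
    print(f"density ratio {ratio}: Ga = {ga:.1f}")
```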
Feng Chen, Yiran Meng, Kegan Li, Chaoran Yang, Jiong Yang
Recently, physics-informed neural networks (PINNs) and their variants have gained significant popularity as a scientific computing method for solving partial differential equations (PDEs), although accuracy remains their main shortcoming. Despite numerous development efforts, there is no literature demonstrating that these methods surpass classic numerical algorithms in solving the forward problem. In this paper, by analyzing the disparities between PINNs and traditional numerical methods based on mesh discretization, we investigate the underlying causes of the inadequate precision of PINNs and introduce a novel approach named global physics-informed neural networks (GPINNs). Inspired by the crucial concept of global nodal association in conventional numerical algorithms, GPINNs leverage the prior field distribution information from pre-trained PINNs to estimate the association weights between arbitrary nodes in space. GPINNs can not only be regarded as a meshless approach but can also be shown, both theoretically and in practice, to achieve second-order convergence when trained with equidistant nodes. Overall, GPINNs may be seen as an ideal approach to inheriting the merits of scientific machine learning (SciML) and conventional numerical computing, and they represent the first SciML algorithm to surpass standard numerical methods in terms of accuracy.
Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi et al.
Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerable. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks relative to the FP16 baseline to 1.3 points on the Llama-2-70B model. On the Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding and up to 2.06x speedup end-to-end compared to the FP16 baseline.
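The sketch below shows generic per-channel static quantization to 4 bits, to make the contrast with per-token dynamic calibration concrete: scales are computed once from calibration data and then reused for every input. It does not implement QSM, dimensional reconstruction, or adaptive clipping, and the function names and symmetric max-abs calibration are assumptions.

```python
def calibrate_scales(calibration_rows, n_bits=4):
    """Compute one fixed scale per channel from calibration data (done offline)."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for signed 4-bit
    n_channels = len(calibration_rows[0])
    scales = []
    for c in range(n_channels):
        max_abs = max(abs(row[c]) for row in calibration_rows)
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    return scales


def quantize(row, scales, n_bits=4):
    """Map each channel value to a signed integer, clipping to the n-bit range."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return [max(qmin, min(qmax, round(x / s))) for x, s in zip(row, scales)]


def dequantize(q_row, scales):
    """Recover approximate real values from the integer codes."""
    return [q * s for q, s in zip(q_row, scales)]


# Two channels with very different ranges get very different scales.
calibration = [[7.0, 70.0], [-3.5, 14.0]]
scales = calibrate_scales(calibration)      # [1.0, 10.0]
codes = quantize([3.4, -26.0], scales)      # [3, -3]
approx = dequantize(codes, scales)          # [3.0, -30.0]
```

Because the scales are fixed ahead of time, no per-token statistics need to be computed during decoding; the price is that out-of-range values are clipped.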
Hetarth Chopra, Vidhi Rambhia, Vikram Adve
As specialized large language models (LLMs) become increasingly prevalent,
model merging methods are being used to combine them to create a single
multi-task model without requiring any additional data or training. However,
these approaches fall short when the objective of merging is to increase the
downstream model's performance on a particular task-specific benchmark. In this
work, we propose LEWIS (Layer Wise Sparsity), a guided model-merging framework
that uses activation-based layer importance to dynamically adjust layer-wise
task-vector sparsity required for the merge process. LEWIS uses a calibration
dataset to prioritize critical layers during the task-vector pruning process
required for model merging. This approach guides existing merging methods by
preserving essential layer-wise task-specific knowledge while ensuring the
merged model performs the best at benchmarks resembling the calibration
dataset. Our experiments demonstrate the effectiveness of LEWIS with
performance improvements of code instruction-following and math-solving models
created through model merging up to 4 percent and 11.3 percent, respectively,
outperforming unguided data-less model merging approaches that use
uniform-sparsity.
Authors' comments: Accepted at ICLR 2025 Workshop: SLLM (Sparsity in Large Language
Models)
Tianyu Jia, Zongxia Xie, Yanru Sun, Dilfira Kudrat, Qinghua Hu
Non-stationarity is an intrinsic property of real-world time series and plays a crucial role in time series forecasting. Previous studies primarily adopt instance normalization to attenuate the non-stationarity of original series for better predictability. However, instance normalization that directly removes the inherent non-stationarity can lead to three issues: (1) disrupting global temporal dependencies, (2) ignoring channel-specific differences, and (3) producing over-smoothed predictions. To address these issues, we theoretically demonstrate that variance can be a valid and interpretable proxy for quantifying non-stationarity of time series. Based on the analysis, we propose a novel lightweight \textit{C}hannel-wise \textit{D}ynamic \textit{F}usion \textit{M}odel (\textit{CDFM}), which selectively and dynamically recovers intrinsic non-stationarity of the original series, while keeping the predictability of normalized series. First, we design a Dual-Predictor Module, which involves two branches: a Time Stationary Predictor for capturing stable patterns and a Time Non-stationary Predictor for modeling global dynamics patterns. Second, we propose a Fusion Weight Learner to dynamically characterize the intrinsic non-stationary information across different samples based on variance. Finally, we introduce a Channel Selector to selectively recover non-stationary information from specific channels by evaluating their non-stationarity, similarity, and distribution consistency, enabling the model to capture relevant dynamic features and avoid overfitting. Comprehensive experiments on seven time series datasets demonstrate the superiority and generalization capabilities of CDFM.
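For context, the sketch below shows plain instance normalization of a single channel, the preprocessing step the abstract argues removes non-stationarity, together with the inverse step that restores the original statistics. It is not CDFM; the function names are illustrative, and the per-window variance is printed only to show the kind of quantity the abstract proposes as a non-stationarity proxy.

```python
import math


def variance(series):
    """Population variance of one channel over a window."""
    mean = sum(series) / len(series)
    return sum((x - mean) ** 2 for x in series) / len(series)


def instance_normalize(series, eps=1e-8):
    """Remove the window's own mean and scale; return the statistics for later."""
    mean = sum(series) / len(series)
    std = math.sqrt(variance(series) + eps)
    return [(x - mean) / std for x in series], mean, std


def denormalize(normalized, mean, std):
    """Put the removed mean and scale back onto a normalized series."""
    return [x * std + mean for x in normalized]


calm_window = [1.0, 2.0, 3.0, 4.0, 5.0]
volatile_window = [1.0, 20.0, 3.0, 40.0, 5.0]
print(variance(calm_window), variance(volatile_window))

normalized, mean, std = instance_normalize(calm_window)
restored = denormalize(normalized, mean, std)
```

After normalization both windows have roughly zero mean and unit variance, which is exactly the information about their differing dynamics that is lost.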
Dimitri Ognibene, Gregor Donabauer, Emily Theophilou, Cansu Koyuturk, Mona Yavari, Sathya Bursic, Alessia Telari, Alessia Testa et al.
The use of large language model (LLM)-powered chatbots, such as ChatGPT, has
become popular across various domains, supporting a range of tasks and
processes. However, due to the intrinsic complexity of LLMs, effective
prompting is more challenging than it may seem. This highlights the need for
innovative educational and support strategies that are both widely accessible
and seamlessly integrated into task workflows. Yet, LLM prompting is highly
task- and domain-dependent, limiting the effectiveness of generic approaches.
In this study, we explore whether LLM-based methods can facilitate learning
assessments by using ad-hoc guidelines and a minimal number of annotated prompt
samples. Our framework transforms these guidelines into features that can be
identified within learners' prompts. Using these feature descriptions and
annotated examples, we create few-shot learning detectors. We then evaluate
different configurations of these detectors, testing three state-of-the-art
LLMs and ensembles. We run experiments with cross-validation on a sample of
original prompts, as well as tests on prompts collected from task-naive
learners. Our results show how LLMs perform on feature detection. Notably,
GPT-4 demonstrates strong performance on most features, while closely related
models, such as GPT-3 and GPT-3.5 Turbo (Instruct), show inconsistent behaviors
in feature classification. These differences highlight the need for further
research into how design choices impact feature selection and prompt detection.
Our findings contribute to the fields of generative AI literacy and
computer-supported learning assessment, offering valuable insights for both
researchers and practitioners.
Authors' comments: Preprint accepted for Publication in Educational Technology & Society
(ET&S)
Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai et al.
Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling across each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x training steps to achieve the baselines' performance, highlighting the substantial potential of SampleMix to optimize pre-training data.
Yueyang Wu, Sinan Yang, Yanming Wang, Jiajie He, Muhammad Mohsin Pathan, Bensheng Qiu, Xiaoxiao Wang
In recent years, the application of deep learning in task functional Magnetic
Resonance Imaging (tfMRI) decoding has led to significant advancements.
However, most studies remain constrained by the assumption of temporal stationarity
in neural activity, resulting in predominantly block-wise analysis with limited
temporal resolution on the order of tens of seconds. This limitation restricts
the ability to decode cognitive functions in detail. To address these
limitations, this study proposes a deep neural network designed for volume-wise
identification of task states within tfMRI data, thereby overcoming the
constraints of conventional methods. Evaluated on Human Connectome Project
(HCP) motor and gambling tfMRI datasets, the model achieved impressive mean
accuracy rates of 94.0% and 79.6%, respectively. These results demonstrate a
substantial enhancement in temporal resolution, enabling more detailed
exploration of cognitive processes. The study further employs visualization
algorithms to investigate dynamic brain mappings during different tasks, marking
a significant step forward in deep learning-based frame-level tfMRI decoding.
This approach offers new methodologies and tools for examining dynamic changes
in brain activities and understanding the underlying cognitive mechanisms.
Authors' comments: 8 pages, 11 figures
Ukcheol Shin, Kyunghyun Lee, Jean Oh
Deploying depth estimation networks in the real world requires high-level
robustness against various adverse conditions to ensure safe and reliable
autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor
systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar.
They mainly adopt two strategies to use multiple sensors: modality-wise and
multi-modal fused inference. The former method is flexible but
memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide
high-level reliability, yet it needs a specialized architecture. In this paper,
we propose an effective solution, named the align-and-fuse strategy, for depth
estimation from multi-spectral images. In the align stage, we align embedding
spaces between multiple spectrum bands to learn shareable representations across
multi-spectral images by minimizing the contrastive loss of global and spatially
aligned local features with geometry cues. After that, in the fuse stage, we
train an attachable feature fusion module that can selectively aggregate the
multi-spectral features for reliable and robust prediction results. Based on
the proposed method, a single-depth network can achieve both spectral-invariant
and multi-spectral fused depth estimation while preserving reliability, memory
efficiency, and flexibility.
Authors' comments: Accepted at ICRA 2025, Github link:
https://github.com/UkcheolShin/BridgeMultiSpectralDepth
Jiawei Kong, Hao Fang, Sihang Guo, Chenxi Qing, Bin Chen, Bin Wang, Shu-Tao Xia
While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit excellent representational capabilities for multimodal data, recent studies have shown that they are vulnerable to backdoor attacks. To alleviate the threat, existing defense strategies primarily focus on fine-tuning the entire suspicious model, yet offer only marginal resistance to state-of-the-art attacks and often result in a decrease in clean accuracy, particularly in data-limited scenarios. Their failure may be attributed to the mismatch between insufficient fine-tuning data and massive parameters in VLMs. To address this challenge, we propose Class-wise Backdoor Prompt Tuning (CBPT) defense, an efficient and effective method that operates on the text prompts to indirectly purify the poisoned VLMs. Specifically, we first employ contrastive learning with carefully crafted positive and negative samples to invert the backdoor triggers that are potentially adopted by the attacker. Once the dummy trigger is established, we utilize the efficient prompt tuning technique to optimize these class-wise text prompts, modifying the model's decision boundary to reclassify the feature regions of backdoor triggers. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g. an average Clean Accuracy (CA) of 58.86\% and an Attack Success Rate (ASR) of 0.39\% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt purifying design to strengthen model robustness against backdoor attacks.
Xuan Ding, Yao Zhu, Yunjian Zhang, Chuanlong Xie
Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance by indiscriminately discarding the entire information of the layer. This paper reveals the "Patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, in the experiment with 35\% pruning on the Vicuna-7B model, our method achieved a 1.654\% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our codes are available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
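The top-to-bottom selection step can be pictured with the simplified sketch below. It assumes a precomputed layer-to-layer similarity matrix and a fixed anchor layer per window; the paper's actual similarity measure (computed in a reproducing kernel Hilbert space) and its fusion rule are not reproduced here, and `group_layers` is a hypothetical name.

```python
def group_layers(similarity, threshold):
    """Group consecutive layers, scanning from the top layer downwards.

    similarity[i][j] is an assumed, precomputed similarity between the outputs
    of layers i and j. A window starts at the current top layer and keeps
    absorbing the next lower layer while that layer stays at least `threshold`
    similar to the window's top layer. Each returned group would then be fused
    into a single layer.
    """
    groups = []
    top = len(similarity) - 1
    while top >= 0:
        bottom = top
        while bottom - 1 >= 0 and similarity[top][bottom - 1] >= threshold:
            bottom -= 1
        groups.append(list(range(bottom, top + 1)))
        top = bottom - 1  # slide the window below the group just formed
    return groups


similarity = [
    [1.00, 0.90, 0.40, 0.30],
    [0.90, 1.00, 0.60, 0.50],
    [0.40, 0.60, 1.00, 0.95],
    [0.30, 0.50, 0.95, 1.00],
]
merged = group_layers(similarity, threshold=0.8)      # [[2, 3], [0, 1]]
untouched = group_layers(similarity, threshold=0.99)  # [[3], [2], [1], [0]]
```

A higher threshold leaves more layers unmerged, so the threshold trades compression against fidelity.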