Zifu Zhang, Shengxi Li, Henan Liu, Mai Xu, Ce Zhu
Recently, learned image compression methods have outpaced traditional
hand-crafted standard codecs. However, their inference typically requires the
whole image as input, at the cost of heavy computing resources, especially for
high-resolution image compression; otherwise, block artefacts can appear when
images are compressed block by block with existing learned methods. To
address this issue, we propose a novel continuous patch stitching (CPS)
framework for block-wise image compression that achieves seamless
patch stitching and mathematically eliminates block artefacts, thereby
significantly reducing the computing resources required to compress
images. More specifically, the proposed CPS framework relies on
padding-free operations throughout, with a newly established parallel
overlapping stitching strategy that provides a general upper bound for ensuring
continuity. Building on this, we further propose functional residual blocks with
even-sized kernels to achieve down-sampling and up-sampling, together with
bottleneck residual blocks that retain the feature size to increase network depth.
Experimental results demonstrate that our CPS framework achieves
state-of-the-art performance against existing baselines, whilst requiring less
than half the computing resources of existing models. Our code shall be released
upon acceptance.
Authors' comments: 5 pages, 8 figures
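The role of padding-free operations in the CPS abstract can be illustrated with a small 1-D example. The sketch below is not the authors' framework; the helper names `valid_conv1d` and `stitched_conv1d`, the kernel values, and the patch size are assumptions chosen for illustration. It only shows the general property that, without padding, each output sample depends solely on real input samples, so patches that overlap by `kernel size - stride` inputs reproduce the whole-signal result exactly.

```python
def valid_conv1d(x, kernel, stride):
    """Padding-free ('valid') 1-D strided correlation."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return [
        sum(kernel[i] * x[j * stride + i] for i in range(k))
        for j in range(n_out)
    ]


def stitched_conv1d(x, kernel, stride, outputs_per_patch):
    """Process x patch by patch and concatenate the patch outputs.

    Adjacent patches overlap by len(kernel) - stride input samples,
    which is exactly what a padding-free operator needs at the seam.
    """
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    result = []
    for j0 in range(0, n_out, outputs_per_patch):
        j1 = min(j0 + outputs_per_patch, n_out)
        # Input samples required to compute outputs j0 .. j1 - 1.
        patch = x[j0 * stride:(j1 - 1) * stride + k]
        result.extend(valid_conv1d(patch, kernel, stride))
    return result


# Hypothetical test signal and an arbitrary even-sized kernel.
signal = [float((7 * i) % 11) for i in range(40)]
kernel = [0.25, 0.5, 0.5, 0.25]

whole = valid_conv1d(signal, kernel, stride=2)
blocks = stitched_conv1d(signal, kernel, stride=2, outputs_per_patch=5)
print(whole == blocks)  # block-wise result matches the whole-signal result
```

With padded operators the same comparison would generally fail near patch borders, because padding injects values that do not exist in the full signal.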
Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong
Visual grounding aims to ground an image region through natural language,
which heavily relies on cross-modal alignment. Most existing methods transfer
visual/linguistic knowledge separately by fully fine-tuning uni-modal
pre-trained models, followed by a simple stack of visual-language transformers
for multimodal fusion. However, these approaches not only limit adequate
interaction between visual and linguistic contexts, but also incur significant
computational costs. To address these issues, we explore a step-wise
multimodal fusion and adaptation framework, namely SwimVG. Specifically, SwimVG
proposes step-wise multimodal prompts (Swip) and cross-modal interactive
adapters (CIA) for visual grounding, replacing the cumbersome transformer
stacks for multimodal fusion. Swip can improve the alignment between the
vision and language representations step by step, in a token-level fusion
manner. In addition, weight-level CIA further promotes multimodal fusion by
cross-modal interaction. Swip and CIA are both parameter-efficient paradigms,
and they fuse the cross-modal features from shallow to deep layers gradually.
Experimental results on four widely used benchmarks demonstrate that SwimVG
achieves strong grounding performance and considerable efficiency benefits.
Our code is available at https://github.com/liuting20/SwimVG.
Authors' comments: 12 pages, 7 figures
Marzi Heidari, Yuhong Guo
Single Domain Generalization (SDG) remains a formidable challenge in the field of machine learning, particularly when models are deployed in environments that differ significantly from their training domains. In this paper, we propose a novel data augmentation approach, named Model-aware Parametric Batch-wise Mixup (MPBM), to tackle the challenge of SDG. MPBM deploys adversarial queries generated with stochastic gradient Langevin dynamics, and produces model-aware augmenting instances with a parametric batch-wise mixup generator network that is carefully designed through an innovative attention mechanism. By exploiting inter-feature correlations, the parameterized mixup generator introduces additional versatility in combining features across a batch of instances, thereby enhancing the capacity to generate highly adaptive and informative synthetic instances for specific queries. The synthetic data produced by this adaptable generator network, guided by informative queries, is expected to significantly enrich the representation space covered by the original training dataset and subsequently enhance the prediction model's generalizability across diverse and previously unseen domains. To prevent excessive deviation from the training data, we further incorporate a real-data alignment-based adversarial loss into the learning process of MPBM, regularizing any tendencies toward undesirable expansions. We conduct extensive experiments on several benchmark datasets. The empirical results demonstrate that by augmenting the training set with informative synthetic data, our proposed MPBM method achieves state-of-the-art performance for single domain generalization.
Jiayu Qin, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Wei Wang
The remarkable performance of large language models (LLMs) in various language tasks has attracted considerable attention. However, the ever-increasing size of these models presents growing challenges for deployment and inference. Structured pruning, an effective model compression technique, is gaining increasing attention due to its ability to enhance inference efficiency. Nevertheless, most previous optimization-based structured pruning methods sacrifice the uniform structure across layers for greater flexibility to maintain performance. The heterogeneous structure hinders the effective utilization of off-the-shelf inference acceleration techniques and impedes efficient configuration for continued training. To address this issue, we propose a novel masking learning paradigm based on minimax optimization to obtain the uniform pruned structure by optimizing the masks under sparsity regularization. Extensive experimental results demonstrate that our method can maintain high performance while ensuring the uniformity of the pruned model structure, thereby outperforming existing SOTA methods.
Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao et al.
Transformer-based large language models (LLMs) demonstrate impressive performance in long-context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise offloading strategy, keeping only the KV cache of selected attention heads on the GPU while computing attention output dynamically. Through roofline analysis, we demonstrate that HEADINFER maintains computational efficiency while significantly reducing memory footprint. We evaluate HEADINFER on the Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory footprint of the KV cache from 128 GB to 1 GB and the total GPU memory usage from 207 GB to 17 GB, achieving a 92% reduction compared to BF16 baseline inference. Notably, HEADINFER enables 4-million-token inference with an 8B model on a single consumer GPU with 24 GB of memory (e.g., NVIDIA RTX 4090) without approximation methods.
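The 128 GB and 92% figures can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the publicly documented Llama-3-8B configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128), BF16 storage at 2 bytes per value, and reads "1 million tokens" as 2^20 tokens; none of these values are stated in the abstract, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and every token."""
    # Factor 2 accounts for storing both a key and a value vector per head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Assumed Llama-3-8B configuration (not given in the abstract).
total_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                             num_tokens=2**20)
kv_cache_gib = total_bytes / 2**30

# Total GPU memory reduction reported in the abstract: 207 GB -> 17 GB.
total_reduction = (207 - 17) / 207

print(f"KV cache: {kv_cache_gib:.0f} GiB")
print(f"Total memory reduction: {100 * total_reduction:.1f}%")
```

Under these assumptions the full KV cache comes to 128 GiB, and the drop from 207 GB to 17 GB corresponds to roughly 91.8%, consistent with the rounded 92% in the abstract.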
Jing Xu, Jiazheng Li, Jingzhao Zhang
Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with improvements of up to 6.14% and 6.61% on vision and NLU tasks, respectively. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal
Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang et al.
Most pruning methods concentrate on unimportant filters of neural networks.
However, they face the loss of statistical information due to a lack of
consideration for class-wise data. In this paper, from the perspective of
leveraging precise class-wise information for model pruning, we utilize
structured lasso with guidance from Information Bottleneck theory. Our approach
ensures that statistical information is retained during the pruning process.
With these techniques, we introduce two innovative adaptive network pruning
schemes: sparse graph-structured lasso pruning with Information Bottleneck
(sGLP-IB) and sparse tree-guided lasso pruning with Information
Bottleneck (sTLP-IB). The key aspect is pruning model filters using
sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
multiple state-of-the-art methods, our approaches demonstrate superior
performance across three datasets and six model architectures in extensive
experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
reduce parameters by 85% and FLOPs by 61% while maintaining
an accuracy of 94.10% (0.14% higher than the original model); we reduce the
parameters by 55% with an accuracy of 76.12% using the ResNet architecture on
ImageNet (a drop of only 0.03%). In summary, we successfully reduce model size and
computational resource usage while maintaining accuracy. Our code is available at
https://anonymous.4open.science/r/IJCAI-8104.
Authors' comments: 11 pages, 2 figures
Zhengjian Kang, Ye Zhang, Xiaoyu Deng, Xintao Li, Yongzhe Zhang
This paper presents LP-DETR (Layer-wise Progressive DETR), a novel approach
that enhances DETR-based object detection through multi-scale relation
modeling. Our method introduces learnable spatial relationships between object
queries through a relation-aware self-attention mechanism, which adaptively
learns to balance different scales of relations (local, medium and global)
across decoder layers. This progressive design enables the model to effectively
capture evolving spatial dependencies throughout the detection pipeline.
Extensive experiments on the COCO 2017 dataset demonstrate that our method improves
both convergence speed and detection accuracy compared to the standard
self-attention module. The proposed method achieves competitive results,
reaching 52.3% AP with 12 epochs and 52.5% AP with 24 epochs using a ResNet-50
backbone, and further improving to 58.0% AP with a Swin-L backbone. Furthermore,
our analysis reveals an interesting pattern: the model naturally learns to
prioritize local spatial relations in early decoder layers while gradually
shifting attention to broader contexts in deeper layers, providing valuable
insights for future research in object detection.
Authors' comments: 12 pages, 4 figures
Yasaman Saadati, Mohammad Rostami, M. Hadi Amini
Traditional Federated Learning (FL) methods encounter significant challenges
when dealing with heterogeneous data and providing personalized solutions for
non-IID scenarios. Personalized Federated Learning (PFL) approaches aim to
address these issues by balancing generalization and personalization, often
through parameter decoupling or partial models that freeze some neural network
layers for personalization while aggregating other layers globally. However,
existing methods still face challenges of global-local model discrepancy,
client drift, and catastrophic forgetting, which degrade model accuracy. To
overcome these limitations, we propose pMixFed, a dynamic, layer-wise PFL
approach that integrates mixup between shared global and personalized local
models. Our method introduces an adaptive strategy for partitioning between
personalized and shared layers, a gradual transition of personalization degree
to enhance local client adaptation, improved generalization across clients, and
a novel aggregation mechanism to mitigate catastrophic forgetting. Extensive
experiments demonstrate that pMixFed outperforms state-of-the-art PFL methods,
showing faster model training, increased robustness, and improved handling of
data heterogeneity under different heterogeneous settings.
Authors' comments: 20 pages, 9 images
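The idea of mixing shared global and personalized local models layer by layer can be sketched as a per-layer convex combination of parameters. This is only an illustration under stated assumptions: the abstract does not give pMixFed's actual mixing rule or schedule, so the function names `layerwise_mix` and `linear_schedule`, the linear ramp from fully shared to fully personalized, and the toy parameter values are all hypothetical.

```python
def layerwise_mix(global_layers, local_layers, mix_degrees):
    """Per-layer convex combination: lam * global + (1 - lam) * local."""
    mixed = []
    for g_layer, l_layer, lam in zip(global_layers, local_layers, mix_degrees):
        mixed.append([lam * g + (1.0 - lam) * l for g, l in zip(g_layer, l_layer)])
    return mixed


def linear_schedule(num_layers):
    """Hypothetical schedule: first layer fully shared, last fully personalized."""
    if num_layers == 1:
        return [1.0]
    return [1.0 - i / (num_layers - 1) for i in range(num_layers)]


# Toy models: three layers with two parameters each (illustrative values only).
global_model = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
local_model = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]

degrees = linear_schedule(3)
mixed_model = layerwise_mix(global_model, local_model, degrees)
print(degrees)
print(mixed_model)
```

A gradual schedule like this avoids a hard cut between frozen and aggregated layers; an adaptive method would presumably adjust the mixing degrees per client or per round rather than fixing them in advance.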
Georgios Antonopoulos, Shammi More, Simon B. Eickhoff, Federico Raimondo, Kaustubh R. Patil
Predictive modeling using structural magnetic resonance imaging (MRI) data is
a prominent approach to studying brain aging. Machine learning algorithms and
feature extraction methods have been employed to improve predictions and
explore healthy and accelerated aging, e.g., in neurodegenerative and psychiatric
disorders. The high-dimensional MRI data pose challenges for building
generalizable and interpretable models as well as for data privacy. Common
practices are resampling or averaging voxels within predefined parcels, which
reduces anatomical specificity and biological interpretability as voxels within
a region may differently relate to aging. Effectively, naive fusion by
averaging can result in information loss and reduced accuracy. We present a
conceptually novel two-level stacking ensemble (SE) approach. The first level
comprises regional models for predicting individuals' age based on voxel-wise
information, fused by a second-level model yielding final predictions. Eight
data fusion scenarios were explored using gray matter volume (GMV)
estimates from four datasets covering the adult lifespan as input. Performance, measured
using mean absolute error (MAE), R^2, correlation and prediction bias, showed
that SE outperformed the region-wise averages. The best performance was
obtained when first-level regional predictions were obtained as out-of-sample
predictions on the application site with second-level models trained on
independent and site-specific data (MAE=4.75 vs baseline regional mean GMV
MAE=5.68). Performance improved as more datasets were used for training.
First-level predictions showed an improved and more robust aging signal, providing
new biological insights and enhanced data privacy. Overall, the SE improves
accuracy compared to the baseline while preserving or enhancing data privacy.
Authors' comments: version1
Will Hartog, Lihua Lei
The closure principle is a standard tool for achieving family-wise error rate
(FWER) control in multiple testing problems. In general, the computational cost
for closed testing can be exponential in the number of hypotheses. The
celebrated graphical approach to FWER control overcomes the computational
hurdle by using weighted Bonferroni local tests on p-values with appropriately
chosen weights. In this study, we extend the graphical approach to e-values.
With valid e-values -- common in settings of sequential hypothesis testing or
universal inference for irregular parametric models -- we can derive strictly
more powerful local tests based on weighted averages of e-values. Consequently,
this e-value-based closed test is more powerful than the corresponding
graphical approach with inverse e-values as p-values. Although the
computational shortcuts for the p-value-based graphical approach are not
applicable, we develop efficient polynomial-time algorithms using dynamic
programming for e-value-based graphical approaches with any directed acyclic
graph. For special graphs, such as those used in Holm's procedure and the
fallback procedure, we develop tailored algorithms with computational cost linear
in the number of hypotheses, up to logarithmic factors.
Authors' comments: 19 pages, 5 figures, 4 algorithms
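The power gain from averaging e-values can be seen on a single intersection hypothesis. The sketch below is not the paper's algorithm; it only contrasts two local tests under the assumption of fixed nonnegative weights summing to at most 1, with hypothetical function names and e-values. A weighted average of e-values is itself an e-value, so by Markov's inequality rejecting when it reaches 1/alpha is valid; the weighted Bonferroni test on p = min(1, 1/e) rejects only if a single term alone clears that threshold.

```python
def weighted_bonferroni_rejects(e_values, weights, alpha):
    """Weighted Bonferroni on p_i = min(1, 1/e_i): reject if some p_i <= w_i * alpha."""
    for e, w in zip(e_values, weights):
        p = 1.0 if e <= 1.0 else 1.0 / e
        if w > 0 and p <= w * alpha:
            return True
    return False


def weighted_average_rejects(e_values, weights, alpha):
    """Reject if the weighted average e-value is at least 1 / alpha."""
    combined = sum(w * e for e, w in zip(e_values, weights))
    return combined >= 1.0 / alpha


alpha = 0.05
weights = [0.5, 0.5]

# Moderate evidence spread over both hypotheses (illustrative e-values).
spread = [30.0, 30.0]
# Strong evidence concentrated on one hypothesis.
concentrated = [50.0, 1.0]

print(weighted_bonferroni_rejects(spread, weights, alpha),
      weighted_average_rejects(spread, weights, alpha))
print(weighted_bonferroni_rejects(concentrated, weights, alpha),
      weighted_average_rejects(concentrated, weights, alpha))
```

Whenever the Bonferroni test rejects, some `w_i * e_i` is already at least `1 / alpha`, so the weighted average rejects as well; the first example shows a case where only the average-based test rejects.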
Zebo Yang, Ali Ghubaish, Raj Jain, Ala Al-Fuqaha, Aiman Erbad, Ramana Kompella, Hassan Shapourian, Reza Nejabati
With its significant security potential, the quantum internet is poised to
revolutionize technologies like cryptography and communications. Although it
promises enhanced security over traditional networks, the quantum internet still
faces unique security challenges that must be addressed to safeguard its
Confidentiality, Integrity, and Availability (CIA). This study explores these
challenges by analyzing the vulnerabilities and the corresponding mitigation
strategies across different layers of the quantum internet, including physical,
link, network, and application layers. We assess the severity of potential
attacks, evaluate the expected effectiveness of mitigation strategies, and
identify vulnerabilities within diverse network configurations, integrating
both classical and quantum approaches. Our research highlights the dynamic
nature of these security issues and emphasizes the necessity for adaptive
security measures. The findings underline the need for ongoing research into
the security dimension of the quantum internet to ensure its robustness,
encourage its adoption, and maximize its impact on society.
Authors' comments: This article has been accepted for publication in the IEEE Journal on
Selected Areas in Communications (JSAC) UCP-QuantumEra special issue
Hongxin Zhi, Hongtao Yu, Shaome Li, Xiuming Zhao, Yiteng Wu
Adversarial training has proven to be a highly effective method for improving the robustness of deep neural networks against adversarial attacks. Nonetheless, it has been observed to exhibit a limitation in terms of robust fairness, characterized by a significant disparity in robustness across different classes. Recent efforts to mitigate this problem have turned to class-wise reweighted methods. However, these methods lack rigorous theoretical analysis and are limited in their exploration of the weight space, as they mainly rely on existing heuristic algorithms or intuition to compute weights. In addition, these methods fail to guarantee the consistency of the optimization direction because the weights and the model parameters are optimized separately. This can lead to suboptimal weight assignments and, consequently, a suboptimal model. To address these problems, this paper proposes a novel min-max training framework, Class Optimal Distribution Adversarial Training (CODAT), which employs distributionally robust optimization to fully explore the class-wise weight space, thus enabling the identification of the optimal weight with theoretical guarantees. Furthermore, we derive a closed-form optimal solution to the internal maximization and obtain a deterministic equivalent objective function, which provides a theoretical basis for the joint optimization of weights and model parameters. Meanwhile, we propose a fairness elasticity coefficient for evaluating the algorithm with regard to both robustness and robust fairness. Experimental results on various datasets show that the proposed method effectively improves the robust fairness of the model and outperforms state-of-the-art approaches.
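To make the notion of a closed-form inner maximization concrete, the sketch below works through one textbook instance of distributionally robust class reweighting. It is an assumption, not CODAT's formulation: the abstract does not specify the uncertainty set, so here the class weights are penalized by a KL divergence from the uniform distribution with a hypothetical temperature `tau`. In that setting the maximizing weights are a softmax of the class losses and the optimal value is `tau * log(mean(exp(loss / tau)))`.

```python
import math


def worst_case_weights(class_losses, tau):
    """Maximizer of sum(w * loss) - tau * KL(w || uniform): softmax(loss / tau)."""
    m = max(class_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - m) / tau) for loss in class_losses]
    z = sum(exps)
    return [e / z for e in exps]


def penalized_objective(weights, class_losses, tau):
    """Inner objective: weighted loss minus tau times KL(weights || uniform)."""
    k = len(class_losses)
    expected = sum(w * loss for w, loss in zip(weights, class_losses))
    kl = sum(w * math.log(w * k) for w in weights if w > 0)
    return expected - tau * kl


def closed_form_value(class_losses, tau):
    """Optimal inner value: tau * log(mean(exp(loss / tau)))."""
    k = len(class_losses)
    m = max(class_losses)
    return m + tau * math.log(sum(math.exp((loss - m) / tau) for loss in class_losses) / k)


# Illustrative per-class robust losses and temperature (hypothetical values).
losses = [0.2, 0.9, 0.5]
tau = 0.5

w_star = worst_case_weights(losses, tau)
value_at_w_star = penalized_objective(w_star, losses, tau)
value_closed = closed_form_value(losses, tau)
value_uniform = penalized_objective([1 / 3] * 3, losses, tau)
print(w_star, value_at_w_star, value_closed, value_uniform)
```

Because the inner maximum has a deterministic expression in the class losses, the outer minimization over model parameters can be run directly on that expression, which is the kind of joint optimization the abstract refers to.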
Daniel Silver, Ron Kimmel
In the field of video compression, the pursuit of better quality at lower bit rates remains a long-standing goal. Recent developments have demonstrated the potential of Implicit Neural Representation (INR) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods according to the structure of the network's output. While pixel-wise methods are better suited to upsampling and parallelization, frame-wise methods have demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and performance on par with leading frame-wise techniques. The method is based on the separation of the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer's motion. When integrated, a byproduct is an unsupervised segmentation of the video sequence. Object motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.
Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu
Large language models (LLMs) have garnered significant attention for their remarkable capabilities across various domains, but their vast parameter scales present challenges for practical deployment. Structured pruning is an effective method to balance model performance with efficiency, but performance restoration under computational resource constraints is a principal challenge in pruning LLMs. Therefore, we present a low-cost and fast structured pruning method for LLMs named SlimGPT, based on the Optimal Brain Surgeon framework. We propose Batched Greedy Pruning for rapid and near-optimal pruning, which enhances the accuracy of head-wise pruning error estimation through grouped Cholesky decomposition and improves the pruning efficiency of FFN via Dynamic Group Size, thereby achieving approximately locally optimal pruning results within one hour. In addition, we explore the limitations of layer-wise pruning from the perspective of error accumulation and propose Incremental Pruning Ratio, a non-uniform pruning strategy to reduce performance degradation. Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.
Xiaopeng Li, Jingtong Gao, Pengyue Jia, Xiangyu Zhao, Yichao Wang, Wanyu Wang, Yejing Wang, Yuhao Wang et al.
Multi-Scenario Recommendation (MSR) tasks, which involve building a unified model to enhance performance across all recommendation scenarios, have recently gained much attention. However, current research in MSR faces two significant challenges that hinder the field's development: the absence of uniform procedures for multi-scenario dataset processing, which hinders fair comparisons, and the fact that most models are closed-source, which complicates comparisons with current SOTA models. Consequently, we introduce our benchmark, Scenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline. Additionally, we validate the benchmark using an industrial advertising dataset, reinforcing its reliability and applicability in real-world scenarios. We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models based on our benchmark and thereby fostering a collaborative research ecosystem in MSR. Our source code is also publicly available.
Hao Gui, Lin Hu, Rui Chen, Mingxiao Huang, Yuxin Yin, Jin Yang, Yong Wu, Chen Liu et al.
3D Gaussian Splatting (3DGS) is increasingly attracting attention in both academia and industry owing to its superior visual quality and rendering speed. However, training a 3DGS model remains a time-intensive task, especially in load-imbalance scenarios where workload diversity among pixels and Gaussian spheres causes poor renderCUDA kernel performance. We introduce Balanced 3DGS, a Gaussian-wise parallel rendering approach with fine-grained tiling for the 3DGS training process, which resolves load-imbalance issues. First, we introduce an inter-block dynamic workload distribution technique to map workloads to Streaming Multiprocessor (SM) resources within a single GPU dynamically, which constitutes the foundation of load balancing. Second, we are the first to propose a Gaussian-wise parallel rendering technique that significantly reduces workload divergence inside a warp, which serves as a critical component in addressing load imbalance. Based on these two methods, we further put forward a fine-grained combined load balancing technique to uniformly distribute workload across all SMs, which boosts the forward renderCUDA kernel performance by up to 7.52x. In addition, we present a self-adaptive render kernel selection strategy for the 3DGS training process based on different load-balance situations, which effectively improves training efficiency.
Yujin Kim, Sol Choi, Bum-Jae You, Keunwoo Jang, Yisoo Lee
Articulated object manipulation is a challenging task, requiring constrained
motion and adaptive control to handle the unknown dynamics of the manipulated
objects. While reinforcement learning (RL) has been widely employed to tackle
various scenarios and types of articulated objects, the complexity of these
tasks, stemming from multiple intertwined objectives, makes learning a control
policy in the full task space highly difficult. To address this issue, we
propose a Subspace-wise hybrid RL (SwRL) framework that learns policies for
each divided task space, or subspace, based on independent objectives. This
approach enables adaptive force modulation to accommodate the unknown dynamics
of objects. Additionally, it effectively leverages the previously overlooked
redundant subspace, thereby maximizing the robot's dexterity. Our method
enhances both learning efficiency and task execution performance, as validated
through simulations and real-world experiments. A supplementary video is
available at https://youtu.be/PkNxv0P8Atk
Authors' comments: 14 pages, 10 figures, Submitted to Robotics and Autonomous Systems
Weihang Chen, Jie Ren, Zhiqiang Li, Ling Gao, Zheng Wang
Real-life deployment of federated learning (FL) often faces non-IID data, which leads to poor accuracy and slow convergence. Personalized FL (pFL) tackles these issues by tailoring local models to individual data sources and using weighted aggregation methods for client-specific learning. However, existing pFL methods often fail to provide each local model with global knowledge on demand while maintaining low computational overhead. Additionally, local models tend to over-personalize to their own data during training, potentially dropping previously acquired global information. We propose FLAYER, a novel layer-wise learning method for pFL that optimizes local model personalization performance. FLAYER considers the different roles and learning abilities of the neural network layers of individual local models. It incorporates global information for each local model as needed to initialize the local model cost-effectively. It then dynamically adjusts learning rates for each layer during local training, optimizing the personalized learning process for each local model while preserving global knowledge. Additionally, to enhance global representation in pFL, FLAYER selectively uploads parameters for global aggregation in a layer-wise manner. We evaluate FLAYER on four representative datasets in the computer vision and natural language processing domains. Compared to six state-of-the-art pFL methods, FLAYER improves inference accuracy by 5.42% on average (up to 14.29%).