Aishik Mandal, Dana Atzil-Slonim, Thamar Solorio, Iryna Gurevych
Depression is a highly prevalent and disabling condition that incurs
substantial personal and societal costs. Current depression diagnosis involves
determining the depression severity of a person through self-reported
questionnaires or interviews conducted by clinicians. This often leads to
delayed treatment and involves substantial human resources. Thus, several works
try to automate the process using multimodal data. However, they usually
overlook the following: i) the variable contribution of each modality to each
question in the questionnaire and ii) the use of ordinal classification for the
task. This results in sub-optimal fusion and training methods. In this work, we
propose a novel Question-wise Modality Fusion (QuestMF) framework trained with
a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues.
The performance of our framework is comparable to the current state-of-the-art
models on the E-DAIC dataset and enhances interpretability by predicting scores
for each question. This will help clinicians identify an individual's symptoms,
allowing them to customise their interventions accordingly. We also make the
code for the QuestMF framework publicly available.
Authors' comments: 18 pages, 5 figures, The 10th Workshop on Computational Linguistics
and Clinical Psychology
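To make the ordinal-classification point above concrete, here is a minimal sketch of a class-weighted ordinal log-loss in plain Python. It is an illustration only: the abstract does not state the ImbOLL formula, so the distance-weighted `-log(1 - p)` term follows the general ordinal log-loss idea, and the per-class weights, the `alpha` exponent, the four score levels, and the function name are all assumptions rather than the authors' definition.

```python
import math


def weighted_ordinal_log_loss(probs, target, class_weights, alpha=1.0):
    """Hypothetical class-weighted ordinal log-loss for one sample.

    probs: predicted probabilities over ordered classes (four levels here).
    target: index of the true class.
    class_weights: per-class weights, e.g. larger for rare scores (assumption).
    """
    eps = 1e-12  # avoids log(0) when a wrong class gets probability 1
    loss = 0.0
    for k, p in enumerate(probs):
        # Probability mass placed far from the true score is penalised more.
        distance = abs(k - target) ** alpha
        loss -= distance * math.log(max(1.0 - p, eps))
    return class_weights[target] * loss


weights = [1.0, 1.0, 1.0, 1.0]
# True score is 0; the first prediction misses by one level, the second by three.
near_miss = weighted_ordinal_log_loss([0.1, 0.7, 0.1, 0.1], 0, weights)
far_miss = weighted_ordinal_log_loss([0.1, 0.1, 0.1, 0.7], 0, weights)
print(near_miss < far_miss)  # True
```

Unlike plain cross-entropy, which would score both predictions identically because each puts 0.1 on the true class, this loss separates a near miss from a far miss.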
Yu Mao, Jun Wang, Nan Guan, Chun Jason Xue
Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress the WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we developed a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image, and then adopts a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress gigapixel WSI images by 36 times on average and by up to 136 times.
Huitong Chen, Yu Wang, Yan Fan, Guosong Jiang, Qinghua Hu
Class incremental learning (CIL) aims to enable models to continuously learn
new classes without catastrophically forgetting old ones. A promising direction
is to learn and use prototypes of classes during incremental updates. Despite
their simplicity and intuitiveness, we find that such methods suffer from inadequate
representation capability and undesirable feature overlap. These two factors
cause class-wise confusion and limited performance. In this paper, we develop a
Confusion-REduced AuTo-Encoder classifier (CREATE) for CIL. Specifically, our
method employs a lightweight auto-encoder module to learn a compact manifold for
each class in the latent subspace, constraining samples to be well
reconstructed only on the semantically correct auto-encoder. Thus, the
representation stability and capability of class distributions are enhanced,
alleviating the potential class-wise confusion problem. To further distinguish
the overlapped features, we propose a confusion-aware latent space separation
loss that ensures samples are closely distributed in their corresponding
low-dimensional manifold while keeping away from the distributions of features
from other classes. Our method demonstrates stronger representational capacity
and discrimination ability by learning disentangled manifolds and reduces class
confusion. Extensive experiments on multiple datasets and settings show that
CREATE outperforms other state-of-the-art methods by up to 5.41%.
Authors' comments: Accepted to CVPR 2025
Maoji Zheng, Ziyu Xu, Qiming Xia, Hai Wu, Chenglu Wen, Cheng Wang
LiDAR-based 3D object detection and semantic segmentation are critical tasks
in 3D scene understanding. Traditional detection and segmentation methods
supervise their models through bounding box labels and semantic mask labels.
However, these two independent labels inherently contain significant
redundancy. This paper aims to eliminate the redundancy by supervising 3D
object detection using only semantic labels. However, the challenge arises due
to the incomplete geometry structure and boundary ambiguity of point-cloud
instances, leading to inaccurate pseudo labels and poor detection results. To
address these challenges, we propose a novel method, named Seg2Box. We first
introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages
the spatio-temporal consistency of point clouds to generate accurate box-level
pseudo-labels. Additionally, the Semantic-Guiding Iterative-Mining
Self-Training (SGIM-ST) module is proposed to enhance the performance by
progressively refining the pseudo-labels and mining the instances without
generating pseudo-labels. Experiments on the Waymo Open Dataset and nuScenes
Dataset show that our method significantly outperforms other competitive
methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the
great label-efficient potential and advancement of our method.
Authors' comments: 8 pages, 6 figures
Fatemeh Amerehi, Patrick Healy
Efforts to address declining accuracy as a result of data shifts often
involve various data-augmentation strategies. Adversarial training is one such
method, designed to improve robustness to worst-case distribution shifts caused
by adversarial examples. While this method can improve robustness, it may also
hinder generalization to clean examples and exacerbate performance imbalances
across different classes. This paper explores the impact of adversarial
training on both overall and class-specific performance, as well as its
spill-over effects. We observe that enhanced labeling during training boosts
adversarial robustness by 53.50% and mitigates class imbalances by 5.73%,
leading to improved accuracy in both clean and adversarial settings compared to
standard adversarial training.
Authors' comments: 4 figures, ICLR 2025 Workshop on Foundation Models in the Wild
Changlong Shi, Jinmeng Li, He Zhao, Dandan Guo, Yi Chang
In Federated Learning (FL), weighted aggregation of local models is conducted
to generate a new global model, and the aggregation weights are typically
normalized to 1. A recent study identifies the global weight shrinking effect
in FL, indicating an enhancement in the global model's generalization when the
sum of weights (i.e., the shrinking factor) is smaller than 1, which makes
learning the shrinking factor crucial. However, principled approaches to
this problem that adequately account for privacy concerns and layer-wise
distinctions have not been carefully studied. To this end, we propose a
novel model aggregation strategy, Federated Learning with Adaptive Layer-wise
Weight Shrinking (FedLWS), which adaptively designs the shrinking factor in a
layer-wise manner and avoids optimizing the shrinking factors on a proxy
dataset. We first explore the factors affecting the shrinking factor
during the training process. Then we calculate the layer-wise shrinking factors
by considering the distinctions among the layers of the global model. FedLWS
can be easily incorporated with various existing methods due to its
flexibility. Extensive experiments under diverse scenarios demonstrate the
superiority of our method over several state-of-the-art approaches, providing a
promising tool for enhancing the global model in FL.
Authors' comments: Accepted in ICLR 2025
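The aggregation step described in this abstract can be sketched as follows: client models are averaged with weights normalized to 1, and each layer of the result is then scaled by its own shrinking factor. This is a minimal illustration under stated assumptions. The abstract does not say how FedLWS computes the layer-wise factors, so they are passed in as a plain dictionary here, and the function name, layer names, and numeric values are hypothetical.

```python
def aggregate_with_layerwise_shrinking(client_models, client_weights, shrink_factors):
    """Weighted average of client models, then per-layer shrinking.

    client_models: list of dicts mapping layer name -> list of floats.
    client_weights: unnormalized aggregation weights (e.g. local data sizes).
    shrink_factors: dict mapping layer name -> factor, typically below 1.
        How these factors are obtained is NOT shown here (assumed given).
    """
    total = sum(client_weights)
    normalized = [w / total for w in client_weights]  # weights now sum to 1

    global_model = {}
    for layer in client_models[0]:
        size = len(client_models[0][layer])
        averaged = [
            sum(nw * model[layer][i] for nw, model in zip(normalized, client_models))
            for i in range(size)
        ]
        gamma = shrink_factors[layer]
        # Scaling by gamma makes the effective weights sum to gamma, not 1.
        global_model[layer] = [gamma * value for value in averaged]
    return global_model


clients = [
    {"conv": [1.0, 2.0], "fc": [0.5]},
    {"conv": [3.0, 4.0], "fc": [1.5]},
]
new_global = aggregate_with_layerwise_shrinking(
    clients, client_weights=[1, 3], shrink_factors={"conv": 0.9, "fc": 1.0}
)
print(new_global)
```

Setting every factor to 1.0 recovers ordinary weighted averaging, which makes the shrinking step easy to switch off for comparison.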
Quang Trung Truong, Wong Yuk Kwan, Duc Thanh Nguyen, Binh-Son Hua, Sai-Kit Yeung
Underwater video analysis, hampered by the dynamic marine environment and
camera motion, remains a challenging task in computer vision. Existing
training-free video generation techniques, learning motion dynamics on a
frame-by-frame basis, often produce poor results with noticeable motion
interruptions and misalignments. To address these issues, we propose AUTV, a
framework for synthesizing marine video data with pixel-wise annotations. We
demonstrate the effectiveness of this framework by constructing two video
datasets, namely UTV, a real-world dataset comprising 2,000 video-text pairs,
and SUTV, a synthetic video dataset including 10,000 videos with segmentation
masks for marine objects. UTV provides diverse underwater videos with
comprehensive annotations including appearance, texture, camera intrinsics,
lighting, and animal behavior. SUTV can be used to improve underwater
downstream tasks, as we demonstrate on video inpainting and video object
segmentation.
Authors' comments: under review
Minje Kim, Minjun Kim, Xu Yang
Spiking Neural Networks (SNNs) present a more energy-efficient alternative to
Artificial Neural Networks (ANNs) by harnessing spatio-temporal dynamics and
event-driven spikes. Effective utilization of temporal information is crucial
for SNNs, leading to the exploration of attention mechanisms to enhance this
capability. Conventional attention operations either apply identical operations
or employ non-identical operations across target dimensions. We identify that
these approaches provide distinct perspectives on temporal information. To
leverage the strengths of both operations, we propose a novel Dual
Temporal-channel-wise Attention (DTA) mechanism that integrates both
identical/non-identical attention strategies. To the best of our knowledge,
this is the first attempt to concentrate on both the correlation and dependency
of temporal-channel using both identical and non-identical attention
operations. Experimental results demonstrate that the DTA mechanism achieves
state-of-the-art performance on both static datasets (CIFAR10, CIFAR100,
ImageNet-1k) and a dynamic dataset (CIFAR10-DVS), elevating spike representation
and capturing complex temporal-channel relationships. We open-source our code:
https://github.com/MnJnKIM/DTA-SNN.
Authors' comments: Accepted by IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV) 2025
Jikai Chen, Leilei Gan
Recent advancements in Text-to-SQL systems have improved the conversion of natural language queries into SQL, but challenges remain in ensuring accuracy and reliability. While self-correction techniques refine outputs, they often introduce new errors. Existing methods focused on execution feedback mainly address syntax issues, leaving semantic errors -- where the query's logic fails to align with the user's intent -- largely unaddressed. We propose a novel approach combining structured execution feedback with a trained critic agent that provides detailed, interpretable critiques. This method effectively identifies and corrects both syntactic and semantic errors, enhancing accuracy and interpretability. Experimental results show significant improvements on two major Text-to-SQL benchmarks, Spider and BIRD, demonstrating the effectiveness of our approach.
Shinnosuke Matsuo, Riku Togashi, Ryoma Bise, Seiichi Uchida, Masahiro Nomura
Active learning (AL) is a label-efficient machine learning paradigm that
focuses on selectively annotating high-value instances to maximize learning
efficiency. Its effectiveness can be further enhanced by incorporating weak
supervision, which uses rough yet cost-effective annotations instead of exact
(i.e., full) but expensive annotations. We introduce a novel AL framework,
Instance-wise Supervision-Level Optimization (ISO), which not only selects the
instances to annotate but also determines their optimal annotation level within
a fixed annotation budget. Its optimization criterion leverages the
value-to-cost ratio (VCR) of each instance while ensuring diversity among the
selected instances. In classification experiments, ISO consistently outperforms
traditional AL methods and surpasses a state-of-the-art AL approach that
combines full and weak supervision, achieving higher accuracy at a lower
overall cost. The code is available at
https://github.com/matsuo-shinnosuke/ISOAL.
Authors' comments: Accepted at CVPR2025
Dilfira Kudrat, Zongxia Xie, Yanru Sun, Tianyu Jia, Qinghua Hu
Time-series forecasting has gained significant attention in machine learning due to its crucial role in various domains. However, most existing forecasting models rely heavily on point-wise loss functions like Mean Square Error, which treat each time step independently and neglect the structural dependencies inherent in time series data, making it challenging to capture complex temporal patterns accurately. To address these challenges, we propose a novel Patch-wise Structural (PS) loss, designed to enhance structural alignment by comparing time series at the patch level. Through leveraging local statistical properties, such as correlation, variance, and mean, PS loss captures nuanced structural discrepancies overlooked by traditional point-wise losses. Furthermore, it integrates seamlessly with point-wise loss, simultaneously addressing local structural inconsistencies and individual time-step errors. PS loss establishes a novel benchmark for accurately modeling complex time series data and provides a new perspective on time series loss function design. Extensive experiments demonstrate that PS loss significantly improves the performance of state-of-the-art models across diverse real-world datasets.
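The patch-level comparison described in this abstract can be illustrated with a small sketch. It splits both series into patches and compares correlation, spread, and mean within each patch. The exact form of the terms in PS loss is not given in the abstract, so the equal-weight sum of `1 - correlation`, absolute standard-deviation gap, and absolute mean gap used here is an assumption, as are the function names and the fixed patch length and stride.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values, mean):
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def patch_structural_loss(pred, true, patch_len, stride, eps=1e-8):
    """Hypothetical patch-wise structural loss (assumes len(true) >= patch_len)."""
    losses = []
    for start in range(0, len(true) - patch_len + 1, stride):
        p = pred[start:start + patch_len]
        t = true[start:start + patch_len]
        mean_p, mean_t = _mean(p), _mean(t)
        std_p, std_t = _std(p, mean_p), _std(t, mean_t)
        cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t)) / patch_len
        corr = cov / (std_p * std_t + eps)  # Pearson correlation within the patch
        losses.append((1.0 - corr) + abs(std_p - std_t) + abs(mean_p - mean_t))
    return sum(losses) / len(losses)


true = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# Same mean and spread in every patch, but the local trend is reversed.
flipped = [3.0, 2.0, 1.0, 0.0, 7.0, 6.0, 5.0, 4.0]
print(patch_structural_loss(true, true, patch_len=4, stride=4))
print(patch_structural_loss(flipped, true, patch_len=4, stride=4))
```

The second call is penalized almost entirely through the correlation term, which is the kind of shape mismatch a purely point-wise error does not single out. In practice such a term would be added to a point-wise loss rather than replace it.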
Haozhong Sun, Zhongsen Li, Chenlin Du, Haokun Li, Yajie Wang, Huijun Chen
Quantitative magnetic resonance imaging (qMRI) requires multi-phase
acquisition, often relying on reduced data sampling and reconstruction
algorithms to accelerate scans, which inherently poses an ill-posed inverse
problem. While many studies focus on measuring uncertainty during this process,
few explore how to leverage it to enhance reconstruction performance. In this
paper, we introduce PUQ, a novel approach that pioneers the use of uncertainty
information for qMRI reconstruction. PUQ employs a two-stage reconstruction
and parameter fitting framework, where phase-wise uncertainty is estimated
during reconstruction and utilized in the fitting stage. This design allows
uncertainty to reflect the reliability of different phases and guide
information integration during parameter fitting. We evaluated PUQ on in vivo
T1 and T2 mapping datasets from healthy subjects. Compared to existing qMRI
reconstruction methods, PUQ achieved state-of-the-art performance in
parameter mappings, demonstrating the effectiveness of uncertainty guidance.
Our code is available at https://anonymous.4open.science/r/PUQ-75B2/.
Authors' comments: Submitted to MICCAI2025
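As a generic illustration of how per-phase uncertainty can guide parameter fitting, the sketch below fits a mono-exponential T2 decay with inverse-variance weights, so that unreliable phases contribute less. It is not PUQ's fitting procedure, which the abstract does not specify. The signal model, the log-linear weighted least-squares fit, the function name, and all numbers are assumptions chosen for the example.

```python
import math


def weighted_t2_fit(echo_times, signals, uncertainties):
    """Weighted log-linear fit of S(TE) = S0 * exp(-TE / T2).

    uncertainties: one positive value per phase; larger means less reliable.
    Returns the estimated T2 in the same unit as echo_times.
    """
    weights = [1.0 / (u * u) for u in uncertainties]  # inverse-variance weights
    ys = [math.log(s) for s in signals]  # log S = log S0 - TE / T2
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, echo_times)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, echo_times, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, echo_times))
    slope = sxy / sxx
    return -1.0 / slope


echo_times = [10.0, 20.0, 30.0, 40.0]
true_t2 = 50.0
signals = [100.0 * math.exp(-te / true_t2) for te in echo_times]
signals[2] *= 1.5  # simulate one badly reconstructed phase

uniform_fit = weighted_t2_fit(echo_times, signals, [1.0, 1.0, 1.0, 1.0])
guided_fit = weighted_t2_fit(echo_times, signals, [1.0, 1.0, 100.0, 1.0])
print(uniform_fit, guided_fit)
```

With uniform weights the corrupted phase pulls the estimate away from 50, while flagging that phase as uncertain keeps the estimate close to the true value.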
Zifu Zhang, Shengxi Li, Henan Liu, Mai Xu, Ce Zhu
Most recently, learned image compression methods have outpaced traditional
hand-crafted standard codecs. However, their inference typically requires the
whole image as input at the cost of heavy computing resources, especially for
high-resolution image compression; otherwise, block artefacts can appear when
images are compressed block by block with existing learned image compression
methods. To address this issue, we propose a novel continuous patch stitching (CPS)
framework for block-wise image compression that achieves seamless
patch stitching and mathematically eliminates block artefacts, thus
significantly reducing the required computing resources when compressing
images. More specifically, the proposed CPS framework is achieved by
padding-free operations throughout, with a newly established parallel
overlapping stitching strategy to provide a general upper bound for ensuring
the continuity. Upon this, we further propose functional residual blocks with
even-sized kernels to achieve down-sampling and up-sampling, together with
bottleneck residual blocks retaining feature size to increase network depth.
Experimental results demonstrate that our CPS framework achieves
state-of-the-art performance against existing baselines, whilst requiring less
than half the computing resources of existing models. Our code will be released
upon acceptance.
Authors' comments: 5 pages, 8 figures
Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong
Visual grounding aims to ground an image region through natural language,
which heavily relies on cross-modal alignment. Most existing methods transfer
visual/linguistic knowledge separately by fully fine-tuning uni-modal
pre-trained models, followed by a simple stack of visual-language transformers
for multimodal fusion. However, these approaches not only limit adequate
interaction between visual and linguistic contexts, but also incur significant
computational costs. Therefore, to address these issues, we explore a step-wise
multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG
proposes step-wise multimodal prompts (Swip) and cross-modal interactive
adapters (CIA) for visual grounding, replacing the cumbersome transformer
stacks for multimodal fusion. Swip can improve the alignment between the
vision and language representations step by step, in a token-level fusion
manner. In addition, weight-level CIA further promotes multimodal fusion by
cross-modal interaction. Swip and CIA are both parameter-efficient paradigms,
and they fuse the cross-modal features from shallow to deep layers gradually.
Experimental results on four widely-used benchmarks demonstrate that SwimVG
achieves remarkable abilities and considerable benefits in terms of efficiency.
Our code is available at https://github.com/liuting20/SwimVG.
Authors' comments: 12 pages, 7 figures
Marzi Heidari, Yuhong Guo
Single Domain Generalization (SDG) remains a formidable challenge in the field of machine learning, particularly when models are deployed in environments that differ significantly from their training domains. In this paper, we propose a novel data augmentation approach, named Model-aware Parametric Batch-wise Mixup (MPBM), to tackle the challenge of SDG. MPBM deploys adversarial queries generated with stochastic gradient Langevin dynamics, and produces model-aware augmenting instances with a parametric batch-wise mixup generator network that is carefully designed through an innovative attention mechanism. By exploiting inter-feature correlations, the parameterized mixup generator introduces additional versatility in combining features across a batch of instances, thereby enhancing the capacity to generate highly adaptive and informative synthetic instances for specific queries. The synthetic data produced by this adaptable generator network, guided by informative queries, is expected to significantly enrich the representation space covered by the original training dataset and subsequently enhance the prediction model's generalizability across diverse and previously unseen domains. To prevent excessive deviation from the training data, we further incorporate a real-data alignment-based adversarial loss into the learning process of MPBM, regularizing any tendencies toward undesirable expansions. We conduct extensive experiments on several benchmark datasets. The empirical results demonstrate that by augmenting the training set with informative synthetic data, our proposed MPBM method achieves state-of-the-art performance for single domain generalization.
Jiayu Qin, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Wei Wang
The remarkable performance of large language models (LLMs) in various language tasks has attracted considerable attention. However, the ever-increasing size of these models presents growing challenges for deployment and inference. Structured pruning, an effective model compression technique, is gaining increasing attention due to its ability to enhance inference efficiency. Nevertheless, most previous optimization-based structured pruning methods sacrifice the uniform structure across layers for greater flexibility to maintain performance. The heterogeneous structure hinders the effective utilization of off-the-shelf inference acceleration techniques and impedes efficient configuration for continued training. To address this issue, we propose a novel masking learning paradigm based on minimax optimization to obtain the uniform pruned structure by optimizing the masks under sparsity regularization. Extensive experimental results demonstrate that our method can maintain high performance while ensuring the uniformity of the pruned model structure, thereby outperforming existing SOTA methods.
Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao et al.
Transformer-based large language models (LLMs) demonstrate impressive performance in long context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise offloading strategy, keeping only the KV cache of selected attention heads on the GPU while computing attention output dynamically. Through roofline analysis, we demonstrate that HEADINFER maintains computational efficiency while significantly reducing memory footprint. We evaluate HEADINFER on the Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory footprint of the KV cache from 128 GB to 1 GB and the total GPU memory usage from 207 GB to 17 GB, achieving a 92% reduction compared to BF16 baseline inference. Notably, HEADINFER enables 4-million-token inference with an 8B model on a single consumer GPU with 24 GB memory (e.g., NVIDIA RTX 4090) without approximation methods.
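The 128 GB baseline figure quoted in this abstract can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama-3-8B shape (32 layers, 8 key-value heads under grouped-query attention, head dimension 128), 2 bytes per BF16 value, "1 million tokens" read as 2^20, and "GB" read as GiB. These readings are assumptions made for the illustration, not details stated in the abstract, and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of a full KV cache for one sequence.

    The leading factor 2 counts one key tensor plus one value tensor.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2 ** 30
SEQ_LEN = 2 ** 20  # "1 million tokens", read as 1,048,576 (assumption)

# Assumed Llama-3-8B configuration: 32 layers, 8 KV heads, head dimension 128.
full_cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
print(full_cache / GIB)  # 128.0

# Share held by a single KV head of a single layer, the unit of head-wise offloading.
one_head = kv_cache_bytes(num_layers=1, num_kv_heads=1, head_dim=128, seq_len=SEQ_LEN)
print(one_head / GIB)  # 0.5
```

Because the cache grows linearly with sequence length, keeping only a few head-sized slices resident is what lets the GPU footprint stay small as the context grows.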
Jing Xu, Jiazheng Li, Jingzhao Zhang
Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14% and 6.61% improvements in vision and NLU tasks, respectively. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal
Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang et al.
Most pruning methods concentrate on unimportant filters of neural networks.
However, they face the loss of statistical information due to a lack of
consideration for class-wise data. In this paper, from the perspective of
leveraging precise class-wise information for model pruning, we utilize
structured lasso with guidance from Information Bottleneck theory. Our approach
ensures that statistical information is retained during the pruning process.
With these techniques, we introduce two innovative adaptive network pruning
schemes: sparse graph-structured lasso pruning with Information Bottleneck
(sGLP-IB) and sparse tree-guided lasso pruning with Information
Bottleneck (sTLP-IB). The key aspect is pruning model filters using
sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
multiple state-of-the-art methods, our approaches demonstrate superior
performance across three datasets and six model architectures in extensive
experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain
an accuracy of 94.10% (0.14% higher than the original model); we reduce the
parameters by 55% with the accuracy at 76.12% using the ResNet architecture on
ImageNet (a drop of only 0.03%). In summary, we successfully reduce model size and
computational resource usage while maintaining accuracy. Our code is available at
https://anonymous.4open.science/r/IJCAI-8104.
Authors' comments: 11 pages, 2 figures