Yi Lu, Wanxu Zhao, Xin Zhou, Chenxin An, Chenglong Wang, Shuo Li, Yuming Yang, Jun Zhao et al.
Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but incur expensive overhead to train large-scale models on longer contexts. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework to extrapolate the context window of LLMs by diving into RoPE's different hidden dimensions. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions' position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its impressive extrapolation capability, DPE also dramatically improves models' performance within the training length, for example improving Llama3.1 70B by over 18 points on the popular long-context benchmark RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.
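As a rough illustration of why RoPE dimensions can behave differently at long range, the sketch below computes the wavelength of each RoPE frequency pair using the standard RoPE frequency schedule. It is not DPE's detection procedure, which the abstract does not specify; the function names, `head_dim`, `base`, and `train_len` values are assumptions chosen for illustration.

```python
import math


def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength (in tokens) of each RoPE frequency pair.

    Pair i rotates by theta_i = base ** (-2 * i / head_dim) radians per token,
    so it completes one full period every 2 * pi / theta_i tokens.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def pairs_with_full_period(head_dim, train_len, base=10000.0):
    """Indices of frequency pairs that complete at least one period within train_len."""
    return [i for i, w in enumerate(rope_wavelengths(head_dim, base)) if w <= train_len]


if __name__ == "__main__":
    # head_dim, base, and train_len are illustrative values, not taken from the paper.
    seen = pairs_with_full_period(head_dim=128, train_len=8192)
    print(f"{len(seen)} of 64 frequency pairs see a full period within 8192 tokens")
```

Low-index pairs rotate quickly and see many periods during pre-training, while high-index pairs may never complete one, which is one plausible reason for treating dimensions individually.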
Qi Yang, Weichen Bi, Haiyang Shen, Yaoqi Guo, Yun Ma
Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets often generate annotation information through automatic labeling, which commonly results in inaccurate GUI element BBox annotations, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, existing GUI datasets only provide BBox annotations visually, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation ensures accurate localization of GUI elements in cases of occlusion and overlapping elements by extracting BGRA four-channel bitmap annotations. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that PixelWeb achieves performance on the mAP95 metric that is 3-7 times better than existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.
Saniya Karwa, Navpreet Singh
Understanding the inner workings of neural embeddings, particularly in models such as BERT, remains a challenge because of their high-dimensional and opaque nature. This paper proposes a framework for uncovering the specific dimensions of vector embeddings that encode distinct linguistic properties (LPs). We introduce the Linguistically Distinct Sentence Pairs (LDSP-10) dataset, which isolates ten key linguistic features such as synonymy, negation, tense, and quantity. Using this dataset, we analyze BERT embeddings with various methods, including the Wilcoxon signed-rank test, mutual information, and recursive feature elimination, to identify the most influential dimensions for each LP. We introduce a new metric, the Embedding Dimension Impact (EDI) score, which quantifies the relevance of each embedding dimension to an LP. Our findings show that certain properties, such as negation and polarity, are robustly encoded in specific dimensions, while others, like synonymy, exhibit more complex patterns. This study provides insights into the interpretability of embeddings, which can guide the development of more transparent and optimized language models, with implications for model bias mitigation and the responsible deployment of AI systems.
Yamato Arai, Yuma Ichikawa
Layer-wise PTQ is a promising technique for compressing large language models
(LLMs), due to its simplicity and effectiveness without requiring retraining.
However, recent progress in this area is saturating, underscoring the need to
revisit its core limitations and explore further improvements. We address this
challenge by identifying a key limitation of existing layer-wise PTQ methods:
the growth of quantization errors across layers significantly degrades
performance, particularly in low-bit regimes. To address this fundamental
issue, we propose Quantization Error Propagation (QEP), a general, lightweight,
and scalable framework that enhances layer-wise PTQ by explicitly propagating
quantization errors and compensating for accumulated errors. QEP also offers a
tunable propagation mechanism that prevents overfitting and controls
computational overhead, enabling the framework to adapt to various
architectures and resource budgets. Extensive experiments on several LLMs
demonstrate that QEP-enhanced layer-wise PTQ achieves substantially higher
accuracy than existing methods. Notably, the gains are most pronounced in the
extremely low-bit quantization regime.
Authors' comments: 28 pages, 3 figures
Aidan Tiruvan
Robust low-rank approximation under row-wise adversarial corruption can be
achieved with a single pass, randomized procedure that detects and removes
outlier rows by thresholding their projected norms. We propose a scalable,
non-iterative algorithm that efficiently recovers the underlying low-rank
structure in the presence of row-wise adversarial corruption. By first
compressing the data with a Johnson Lindenstrauss projection, our approach
preserves the geometry of clean rows while dramatically reducing
dimensionality. Robust statistical techniques based on the median and median
absolute deviation then enable precise identification and removal of outlier
rows with abnormally high norms. The subsequent rank-k approximation achieves
near-optimal error bounds with a one pass procedure that scales linearly with
the number of observations. Empirical results confirm that combining random
sketches with robust statistics yields efficient, accurate decompositions even
in the presence of large fractions of corrupted rows.
Authors' comments: 27 pages, 9 figures, preprint
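The row-filtering step described in the abstract above (random projection, then median and MAD thresholding of projected norms) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's algorithm: the function name, the Gaussian sketch, `sketch_dim`, the cutoff multiplier `threshold`, and the 1.4826 consistency constant are all choices made here, and the subsequent rank-k approximation is omitted.

```python
import math
import random
import statistics


def flag_outlier_rows(rows, sketch_dim=8, threshold=3.0, seed=0):
    """Return indices of rows whose sketched norm is abnormally high."""
    rng = random.Random(seed)
    dim = len(rows[0])
    # Gaussian sketch, scaled so squared norms are preserved in expectation.
    scale = 1.0 / math.sqrt(sketch_dim)
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(sketch_dim)] for _ in range(dim)]

    norms = []
    for row in rows:
        sketched = [sum(row[i] * proj[i][j] for i in range(dim)) for j in range(sketch_dim)]
        norms.append(math.sqrt(sum(v * v for v in sketched)))

    med = statistics.median(norms)
    mad = statistics.median([abs(n - med) for n in norms])
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data.
    cutoff = med + threshold * 1.4826 * mad
    return [idx for idx, n in enumerate(norms) if n > cutoff]
```

Because the cutoff is built from the median and MAD rather than the mean and standard deviation, a minority of very large rows has little influence on where the threshold lands.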
Alfonso Artigue
We show that for a compact surface without boundary $M$ the set of cw-expansive homeomorphisms is dense in the set of all the homeomorphisms of $M$ with respect to the $C^0$ topology. After this we show that for a generic homeomorphism $f$ of $M$ it holds that: for all $\epsilon>0$ there is a cw-expansive homeomorphism $g$ of $M$ which is $\epsilon$-close to $f$ and is semiconjugate to $f$; moreover, if $\pi\colon M\to M$ is this semiconjugacy then $\pi^{-1}(x)$ is connected, does not separate $M$ and has diameter less than $\epsilon$ for all $x\in M$.
Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, Yu Wang
Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.
Ding Zhu, Zhiqun Zuo, Mohammad Mahdi Khalili
Large-scale machine learning (ML) models are increasingly being used in
critical domains like education, lending, recruitment, healthcare, criminal
justice, etc. However, the training, deployment, and utilization of these
models demand substantial computational resources. To decrease computation and
memory costs, machine learning models with sparse weight matrices are widely
used in the literature. Among sparse models, those with special sparse
structures (e.g., models with block-wise sparse weight matrices) fit better
with hardware accelerators and can decrease the memory and computation
costs during inference. Unfortunately, while there are several efficient
training methods, none of them are designed to train a block-wise sparse model
efficiently. As a result, the current methods for training block-wise sparse
models start with full and dense models leading to inefficient training. In
this work, we focus on training models with block-wise sparse matrices
and propose an efficient training algorithm to decrease both computation and
memory costs during training and inference. In addition, we show that our
proposed method enables us to efficiently find the right block size for the
sparsity pattern during the training process. Our extensive empirical and
theoretical analyses show that our algorithms can decrease the computation and
memory costs significantly without a performance drop compared to baselines.
Authors' comments: 24 pages, submitted to Transactions on Machine Learning Research
Aishik Mandal, Dana Atzil-Slonim, Thamar Solorio, Iryna Gurevych
Depression is a highly prevalent and disabling condition that incurs
substantial personal and societal costs. Current depression diagnosis involves
determining the depression severity of a person through self-reported
questionnaires or interviews conducted by clinicians. This often leads to
delayed treatment and involves substantial human resources. Thus, several works
try to automate the process using multimodal data. However, they usually
overlook the following: i) The variable contribution of each modality for each
question in the questionnaire and ii) Using ordinal classification for the
task. This results in sub-optimal fusion and training methods. In this work, we
propose a novel Question-wise Modality Fusion (QuestMF) framework trained with
a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues.
The performance of our framework is comparable to the current state-of-the-art
models on the E-DAIC dataset and enhances interpretability by predicting scores
for each question. This will help clinicians identify an individual's symptoms,
allowing them to customise their interventions accordingly. We also make the
code for the QuestMF framework publicly available.
Authors' comments: 18 pages, 5 figures, The 10th Workshop on Computational Linguistics
and Clinical Psychology
Yu Mao, Jun Wang, Nan Guan, Chun Jason Xue
Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress the WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we developed a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image, and then adopts a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress gigapixel WSI images by a factor of 36 on average and up to 136.
Huitong Chen, Yu Wang, Yan Fan, Guosong Jiang, Qinghua Hu
Class incremental learning (CIL) aims to enable models to continuously learn
new classes without catastrophically forgetting old ones. A promising direction
is to learn and use prototypes of classes during incremental updates. Despite
their simplicity and intuitiveness, we find that such methods suffer from inadequate
representation capability and unresolved feature overlap. These two factors
cause class-wise confusion and limited performance. In this paper, we develop a
Confusion-REduced AuTo-Encoder classifier (CREATE) for CIL. Specifically, our
method employs a lightweight auto-encoder module to learn a compact manifold for
each class in the latent subspace, constraining samples to be well
reconstructed only on the semantically correct auto-encoder. Thus, the
representation stability and capability of class distributions are enhanced,
alleviating the potential class-wise confusion problem. To further distinguish
the overlapped features, we propose a confusion-aware latent space separation
loss that ensures samples are closely distributed in their corresponding
low-dimensional manifold while keeping away from the distributions of features
from other classes. By learning disentangled manifolds, our method demonstrates
stronger representational capacity and discrimination ability and reduces class
confusion. Extensive experiments on multiple datasets and settings show that
CREATE outperforms other state-of-the-art methods by up to 5.41%.
Authors' comments: Accepted to CVPR 2025
Maoji Zheng, Ziyu Xu, Qiming Xia, Hai Wu, Chenglu Wen, Cheng Wang
LiDAR-based 3D object detection and semantic segmentation are critical tasks
in 3D scene understanding. Traditional detection and segmentation methods
supervise their models through bounding box labels and semantic mask labels.
However, these two independent labels inherently contain significant
redundancy. This paper aims to eliminate the redundancy by supervising 3D
object detection using only semantic labels. However, the challenge arises due
to the incomplete geometry structure and boundary ambiguity of point-cloud
instances, leading to inaccurate pseudo labels and poor detection results. To
address these challenges, we propose a novel method, named Seg2Box. We first
introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages
the spatio-temporal consistency of point clouds to generate accurate box-level
pseudo-labels. Additionally, the Semantic-Guiding Iterative-Mining
Self-Training (SGIM-ST) module is proposed to enhance the performance by
progressively refining the pseudo-labels and mining the instances without
generating pseudo-labels. Experiments on the Waymo Open Dataset and nuScenes
Dataset show that our method significantly outperforms other competitive
methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the
great label-efficient potential and advancement of our method.
Authors' comments: 8 pages, 6 figures
Fatemeh Amerehi, Patrick Healy
Efforts to address declining accuracy as a result of data shifts often
involve various data-augmentation strategies. Adversarial training is one such
method, designed to improve robustness to worst-case distribution shifts caused
by adversarial examples. While this method can improve robustness, it may also
hinder generalization to clean examples and exacerbate performance imbalances
across different classes. This paper explores the impact of adversarial
training on both overall and class-specific performance, as well as its
spill-over effects. We observe that enhanced labeling during training boosts
adversarial robustness by 53.50% and mitigates class imbalances by 5.73%,
leading to improved accuracy in both clean and adversarial settings compared to
standard adversarial training.
Authors' comments: 4 figures, ICLR 2025 Workshop on Foundation Models in the Wild
Changlong Shi, Jinmeng Li, He Zhao, Dandan Guo, Yi Chang
In Federated Learning (FL), weighted aggregation of local models is conducted
to generate a new global model, and the aggregation weights are typically
normalized to 1. A recent study identifies the global weight shrinking effect
in FL, indicating an enhancement in the global model's generalization when the
sum of weights (i.e., the shrinking factor) is smaller than 1, which makes
learning the shrinking factor crucial. However, principled approaches to
this problem that adequately account for privacy concerns and layer-wise
distinctions have not been carefully studied. To this end, we propose a
novel model aggregation strategy, Federated Learning with Adaptive Layer-wise
Weight Shrinking (FedLWS), which adaptively designs the shrinking factor in a
layer-wise manner and avoids optimizing the shrinking factors on a proxy
dataset. We first explore the factors that affect the shrinking factor
during training. We then calculate the layer-wise shrinking factors
by considering the distinctions among the layers of the global model. FedLWS
can be easily incorporated with various existing methods due to its
flexibility. Extensive experiments under diverse scenarios demonstrate the
superiority of our method over several state-of-the-art approaches, providing a
promising tool for enhancing the global model in FL.
Authors' comments: Accepted in ICLR 2025
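To make the notion of a layer-wise shrinking factor in the FedLWS abstract above concrete, the sketch below averages client models with weights normalized to sum to 1 and then scales each layer by its own factor. How FedLWS actually computes these factors is not stated in the abstract, so `shrink_factors` is simply passed in; the function name and the dict-of-lists model representation are assumptions for illustration.

```python
def aggregate_with_shrinking(local_models, num_samples, shrink_factors):
    """Weighted average of client models, then per-layer shrinking.

    local_models: list of dicts mapping layer name -> list of floats.
    num_samples: per-client sample counts used to build normalized weights.
    shrink_factors: dict mapping layer name -> factor (hypothetical input;
        the abstract does not say how these values are derived).
    """
    total = float(sum(num_samples))
    weights = [n / total for n in num_samples]  # normalized to sum to 1
    global_model = {}
    for layer in local_models[0]:
        size = len(local_models[0][layer])
        averaged = [
            sum(w * model[layer][k] for w, model in zip(weights, local_models))
            for k in range(size)
        ]
        gamma = shrink_factors[layer]
        # Scaling by gamma < 1 makes the effective sum of weights smaller than 1.
        global_model[layer] = [gamma * v for v in averaged]
    return global_model
```

Setting every factor to 1 recovers the usual normalized weighted average.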
Quang Trung Truong, Wong Yuk Kwan, Duc Thanh Nguyen, Binh-Son Hua, Sai-Kit Yeung
Underwater video analysis, hampered by the dynamic marine environment and
camera motion, remains a challenging task in computer vision. Existing
training-free video generation techniques, learning motion dynamics on a
frame-by-frame basis, often produce poor results with noticeable motion
interruptions and misalignments. To address these issues, we propose AUTV, a
framework for synthesizing marine video data with pixel-wise annotations. We
demonstrate the effectiveness of this framework by constructing two video
datasets, namely UTV, a real-world dataset comprising 2,000 video-text pairs,
and SUTV, a synthetic video dataset including 10,000 videos with segmentation
masks for marine objects. UTV provides diverse underwater videos with
comprehensive annotations including appearance, texture, camera intrinsics,
lighting, and animal behavior. SUTV can be used to improve underwater
downstream tasks, as demonstrated in video inpainting and video object
segmentation.
Authors' comments: under review
Minje Kim, Minjun Kim, Xu Yang
Spiking Neural Networks (SNNs) present a more energy-efficient alternative to
Artificial Neural Networks (ANNs) by harnessing spatio-temporal dynamics and
event-driven spikes. Effective utilization of temporal information is crucial
for SNNs, leading to the exploration of attention mechanisms to enhance this
capability. Conventional attention operations apply either identical
or non-identical operations across target dimensions. We identify that
these approaches provide distinct perspectives on temporal information. To
leverage the strengths of both operations, we propose a novel Dual
Temporal-channel-wise Attention (DTA) mechanism that integrates
identical and non-identical attention strategies. To the best of our knowledge,
this is the first attempt to concentrate on both the correlation and dependency
of temporal-channel using both identical and non-identical attention
operations. Experimental results demonstrate that the DTA mechanism achieves
state-of-the-art performance on both static datasets (CIFAR10, CIFAR100,
ImageNet-1k) and a dynamic dataset (CIFAR10-DVS), elevating spike representation
and capturing complex temporal-channel relationships. We open-source our code:
https://github.com/MnJnKIM/DTA-SNN.
Authors' comments: Accepted by IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV) 2025
Jikai Chen, Leilei Gan
Recent advancements in Text-to-SQL systems have improved the conversion of natural language queries into SQL, but challenges remain in ensuring accuracy and reliability. While self-correction techniques refine outputs, they often introduce new errors. Existing methods focused on execution feedback mainly address syntax issues, leaving semantic errors -- where the query's logic fails to align with the user's intent -- largely unaddressed. We propose a novel approach combining structured execution feedback with a trained critic agent that provides detailed, interpretable critiques. This method effectively identifies and corrects both syntactic and semantic errors, enhancing accuracy and interpretability. Experimental results show significant improvements on two major Text-to-SQL benchmarks, Spider and BIRD, demonstrating the effectiveness of our approach.
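The distinction the abstract above draws between syntax errors and semantic errors can be seen with a small execution-feedback helper. This is only an illustrative sketch using Python's standard `sqlite3` module; the function name, the feedback dictionary layout, and the toy `singer` table are assumptions, and the trained critic agent from the paper is not modeled.

```python
import sqlite3


def execution_feedback(conn, sql):
    """Run a candidate query and return structured feedback about the outcome."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # Syntax and schema errors surface here as database exceptions.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}", "rows": None}
    return {"ok": True, "error": None, "rows": rows}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 45)])
    # Misspelled keyword: execution fails, so feedback can flag it.
    print(execution_feedback(conn, "SELEC name FROM singer"))
    # If the question asked for singers older than 40, this query is semantically
    # wrong, yet it executes without error and returns rows.
    print(execution_feedback(conn, "SELECT name FROM singer WHERE age < 40"))
```

Execution alone catches the first query but not the second, which is the gap a critique step is meant to cover.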
Shinnosuke Matsuo, Riku Togashi, Ryoma Bise, Seiichi Uchida, Masahiro Nomura
Active learning (AL) is a label-efficient machine learning paradigm that
focuses on selectively annotating high-value instances to maximize learning
efficiency. Its effectiveness can be further enhanced by incorporating weak
supervision, which uses rough yet cost-effective annotations instead of exact
(i.e., full) but expensive annotations. We introduce a novel AL framework,
Instance-wise Supervision-Level Optimization (ISO), which not only selects the
instances to annotate but also determines their optimal annotation level within
a fixed annotation budget. Its optimization criterion leverages the
value-to-cost ratio (VCR) of each instance while ensuring diversity among the
selected instances. In classification experiments, ISO consistently outperforms
traditional AL methods and surpasses a state-of-the-art AL approach that
combines full and weak supervision, achieving higher accuracy at a lower
overall cost. The code is available at
https://github.com/matsuo-shinnosuke/ISOAL.
Authors' comments: Accepted at CVPR2025
Dilfira Kudrat, Zongxia Xie, Yanru Sun, Tianyu Jia, Qinghua Hu
Time-series forecasting has gained significant attention in machine learning due to its crucial role in various domains. However, most existing forecasting models rely heavily on point-wise loss functions like Mean Square Error, which treat each time step independently and neglect the structural dependencies inherent in time series data, making it challenging to capture complex temporal patterns accurately. To address these challenges, we propose a novel Patch-wise Structural (PS) loss, designed to enhance structural alignment by comparing time series at the patch level. Through leveraging local statistical properties, such as correlation, variance, and mean, PS loss captures nuanced structural discrepancies overlooked by traditional point-wise losses. Furthermore, it integrates seamlessly with point-wise loss, simultaneously addressing local structural inconsistencies and individual time-step errors. PS loss establishes a novel benchmark for accurately modeling complex time series data and provides a new perspective on time series loss function design. Extensive experiments demonstrate that PS loss significantly improves the performance of state-of-the-art models across diverse real-world datasets.
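The idea of comparing series patch by patch through correlation, variance, and mean can be sketched as below. The abstract does not give the actual PS loss formula, so this is an assumed, simplified stand-in: the function names, the use of non-overlapping patches, and the unweighted sum of the three gaps are illustrative choices only.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def _var(xs):
    m = _mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def _corr(xs, ys):
    mx, my = _mean(xs), _mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    denom = math.sqrt(_var(xs) * _var(ys))
    # Treat a constant patch as perfectly correlated to avoid dividing by zero.
    return cov / denom if denom > 0 else 1.0


def patch_structural_gap(pred, target, patch_len):
    """Average per-patch gap in correlation, variance, and mean."""
    starts = range(0, len(target) - patch_len + 1, patch_len)
    if len(starts) == 0:
        raise ValueError("patch_len is longer than the series")
    total = 0.0
    for s in starts:
        p, t = pred[s:s + patch_len], target[s:s + patch_len]
        total += (1.0 - _corr(p, t)) + abs(_var(p) - _var(t)) + abs(_mean(p) - _mean(t))
    return total / len(starts)
```

For example, a forecast that reverses a rising patch has the same mean and variance as the target, so only the correlation term signals the structural mismatch.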
Haozhong Sun, Zhongsen Li, Chenlin Du, Haokun Li, Yajie Wang, Huijun Chen
Quantitative magnetic resonance imaging (qMRI) requires multi-phase
acquisition, often relying on reduced data sampling and reconstruction
algorithms to accelerate scans, which inherently poses an ill-posed inverse
problem. While many studies focus on measuring uncertainty during this process,
few explore how to leverage it to enhance reconstruction performance. In this
paper, we introduce PUQ, a novel approach that pioneers the use of uncertainty
information for qMRI reconstruction. PUQ employs a two-stage reconstruction
and parameter fitting framework, where phase-wise uncertainty is estimated
during reconstruction and utilized in the fitting stage. This design allows
uncertainty to reflect the reliability of different phases and guide
information integration during parameter fitting. We evaluated PUQ on in vivo
T1 and T2 mapping datasets from healthy subjects. Compared to existing qMRI
reconstruction methods, PUQ achieved state-of-the-art performance in
parameter mapping, demonstrating the effectiveness of uncertainty guidance.
Our code is available at https://anonymous.4open.science/r/PUQ-75B2/.
Authors' comments: Submitted to MICCAI2025