Hans W. A. Hanley, Zakir Durumeric
Contextual large language model embeddings are increasingly utilized for
topic modeling and clustering. However, current methods often scale poorly,
rely on opaque similarity metrics, and struggle in multilingual settings. In
this work, we present a novel, scalable, interpretable, hierarchical, and
multilingual approach to clustering news articles and social media data. To do
this, we first train multilingual Matryoshka embeddings that can determine
story similarity at varying levels of granularity based on which subset of the
dimensions of the embeddings is examined. This embedding model achieves
state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson
$\rho$ = 0.816). We then develop an efficient hierarchical clustering
algorithm that leverages the hierarchical nature of Matryoshka embeddings to
identify unique news stories, narratives, and themes. We conclude by
illustrating how our approach can identify and cluster stories, narratives, and
overarching themes within real-world news datasets.
Authors' comments: Accepted to The 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025)
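The idea of reading similarity at different granularities from different subsets of the embedding dimensions can be illustrated with a minimal sketch. It assumes that the relevant subset is a leading prefix of the vector, which is the usual Matryoshka convention but is not stated in the abstract. The toy vectors and the helper name `prefix_similarity` are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prefix_similarity(u, v, dims):
    """Cosine similarity using only the first `dims` coordinates (assumed prefix subset)."""
    return cosine(u[:dims], v[:dims])

# Toy 4-dimensional embeddings (hypothetical values, for illustration only).
article_a = [1.0, 0.0, 1.0, 0.0]
article_b = [1.0, 0.0, 0.0, 1.0]

coarse = prefix_similarity(article_a, article_b, 2)  # short prefix: the two agree
fine = prefix_similarity(article_a, article_b, 4)    # full vector: they differ
```

In this toy case the two articles look identical on the short prefix and only half-similar on the full vector. That is the kind of behaviour a hierarchical clustering step could exploit, though which prefix length maps to stories, narratives, or themes is left open here.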
Ronghuan Wu, Wanchao Su, Jing Liao
Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored optimization-based and learning-based layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler's success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.
Authors' comments: Project Page: https://layerpeeler.github.io/
Kuo Zhou, Lu Zhang
Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still commit logical reasoning and computational errors during the problem-solving process. Thus, this paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic, for formally verifying the correctness of the solutions generated by large language models. Our framework first utilizes a Formalizer which employs an LLM to translate a natural language solution into a formal context. Afterward, our Critic (which integrates various external tools such as a Computer Algebra System and an SMT solver) evaluates the correctness of each statement within the formal context, and when a statement is incorrect, our Critic provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is utilized to determine the correctness of a solution to a given problem. 2) Refinement: When MATH-VF identifies errors in the solution generated by an LLM-based solution generator for a given problem, it submits the corrective suggestions proposed by the Critic to the solution generator to regenerate the solution. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.
Seongryong Jung, Suwan Yoon, DongGeon Kim, Hwanhee Lee
Large language models (LLMs) offer impressive performance but are impractical
for resource-constrained deployment due to high latency and energy consumption.
Knowledge distillation (KD) addresses this by transferring knowledge from a
large teacher to a smaller student model. However, conventional KD, notably
approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence
loss across the entire vocabulary, neglecting token-level prediction
discrepancies. By investigating these representative divergences via gradient
analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses
overestimated ones, showing their complementary roles. Based on this
observation, we propose Token-wise Distillation (ToDi), a novel method that
adaptively combines FKL and RKL per token using a sigmoid-based weighting
function derived from the teacher-student probability log-ratio. ToDi
dynamically emphasizes the appropriate divergence for each token, enabling
precise distribution alignment. We demonstrate that ToDi consistently
outperforms recent distillation baselines using uniform or less granular
strategies across instruction-following benchmarks. Extensive ablation studies
and efficiency analysis further validate ToDi's effectiveness and practicality.
Authors' comments: 13 pages, 7 figures
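A per-token blend of forward and reverse KL can be sketched as follows. This is only an illustration of the general idea in the abstract: the exact weighting function used by ToDi is not given there, so the form `sigmoid(beta * log(p/q))`, the `beta` parameter, and the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def token_weight(p, q, beta=1.0, eps=1e-12):
    """Weight on the forward-KL term for one token (assumed form).

    Exceeds 0.5 when the student probability q is below the teacher
    probability p, i.e. when the token is underestimated.
    """
    return sigmoid(beta * math.log((p + eps) / (q + eps)))

def tokenwise_kd_loss(p_teacher, q_student, beta=1.0, eps=1e-12):
    """Sum over the vocabulary of a per-token mix of FKL and RKL terms."""
    total = 0.0
    for p, q in zip(p_teacher, q_student):
        log_ratio = math.log((p + eps) / (q + eps))
        alpha = sigmoid(beta * log_ratio)
        fkl_term = p * log_ratio       # forward KL contribution: p * log(p / q)
        rkl_term = -q * log_ratio      # reverse KL contribution: q * log(q / p)
        total += alpha * fkl_term + (1.0 - alpha) * rkl_term
    return total

# Toy next-token distributions over a 2-word vocabulary (hypothetical values).
teacher = [0.7, 0.3]
student = [0.5, 0.5]
loss = tokenwise_kd_loss(teacher, student)
```

Under this assumed weighting, underestimated tokens lean on the forward-KL term and overestimated tokens lean on the reverse-KL term, matching the complementary roles described above.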
Peilin Wu, Mian Zhang, Xinlu Zhang, Xinya Du, Zhiyu Zoey Chen
Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with the model's uncertainty in its search decisions. To address this, we propose $\beta$-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that $\beta$-GRPO gives a 3B model stronger agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
The table reasoning task, crucial for efficient data acquisition, aims to answer questions based on a given table. Recently, reasoning large language models (RLLMs) with Long Chain-of-Thought (Long CoT) have significantly enhanced reasoning capabilities, leading to strong performance on table reasoning. However, Long CoT is costly to train and exhibits low reliability due to table content hallucinations. Therefore, we propose Row-of-Thought (RoT), which performs iterative row-wise table traversal, allowing for reasoning extension and reflection-based refinement at each traversal. Because it scales reasoning length through row-wise traversal and leverages the reflection capabilities of LLMs, RoT is training-free. The sequential traversal encourages greater attention to the table, thus reducing hallucinations. Experiments show that RoT, using non-reasoning models, outperforms RLLMs by an average of 4.3% and achieves state-of-the-art results on WikiTableQuestions and TableBench with comparable models, demonstrating its effectiveness. RoT also outperforms Long CoT with fewer reasoning tokens, indicating higher efficiency.
Dong Huang, Qiong Liu, Mark C. Wyatt, Grant M. Kennedy
The discovery of extra-terrestrial life is one of the ultimate goals for
future exoplanet-seeking missions, with one major challenge being the presence
of 'exozodiacal' dust near target stars or within their habitable zone.
Therefore, it is critical to identify which stars possess exozodiacal dust and
quantify their emission levels. In this study, we conducted a search for
exozodi candidates within 10 parsecs using the Reylé sample. We performed
proper motion calculations and cross-matched the sample with the WISE and 2MASS
databases, resulting in 339 preliminary target samples. We further analysed the
infrared radiation characteristics of these targets, using spectral energy
distribution (SED) fitting to predict photometric flux levels in the infrared
and searching for 3$\sigma$ excesses in the WISE W3 band. During further selection
processes, we applied various analysis methods to perform rigorous validation.
We identified five exozodi candidates, all of which are brown dwarfs (BDs).
Given the clustering in candidate spectral types, we expect that these are not
true exozodi candidates; rather, the apparent excess arises from the inability
of the BD photosphere models to accurately represent the SEDs of objects at the
L-T transition. Indeed, for the object DENIS J025503.3-470049, excess is likely
due to silicate clouds in the BD atmosphere. We suggest that a more stringent
5$\sigma$ excess is required to infer excess for this spectral type. The detection
rate (0/339) in our sample shows that less than 1% of M stars have exozodi above
21% excess levels. This is consistent with the rate of exozodi at a similar level
towards FGK stars in the Kennedy & Wyatt sample (25/24,174). We provide upper
limits on the 12 micron exozodi emission for the sample, which is typically at
21% relative to the star. For most stars, in particular the low mass M stars,
this is the first such upper limit in the literature.
Authors' comments: 14 pages, 11 figures, accepted for publication in A&A
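The excess search and the "less than 1%" statement can be illustrated with a short sketch. It assumes the common definition of excess significance, namely observed minus predicted photospheric flux divided by the combined uncertainty, and a one-sided binomial upper limit for zero detections. The 95% confidence level and all flux values are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math

def excess_significance(f_obs, sigma_obs, f_phot, sigma_phot):
    """Significance of an infrared excess over the SED-predicted photosphere."""
    return (f_obs - f_phot) / math.sqrt(sigma_obs ** 2 + sigma_phot ** 2)

def fractional_excess(f_obs, f_phot):
    """Excess relative to the stellar photosphere (0.21 means 21%)."""
    return (f_obs - f_phot) / f_phot

def zero_detection_upper_limit(n_stars, confidence=0.95):
    """Upper limit on the occurrence rate when 0 of n_stars show an excess.

    Solves (1 - rate) ** n_stars = 1 - confidence for rate.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_stars)

# Hypothetical W3 fluxes in arbitrary units.
chi = excess_significance(f_obs=1.21, sigma_obs=0.03, f_phot=1.00, sigma_phot=0.04)
frac = fractional_excess(1.21, 1.00)
is_candidate = chi > 3.0  # 3-sigma criterion; the text suggests 5-sigma for L-T transition BDs

rate_limit = zero_detection_upper_limit(339)  # 0 detections among 339 targets
```

With 339 targets and no detections, the assumed 95% limit comes out slightly below 1%, in line with the statement in the abstract.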
Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions about the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require the subspace dimension to be known. Additionally, this paper develops a matrix-factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring the subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code is available at https://github.com/javiersc1/ALPCAH.
Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen et al.
Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.
Hui Sun, Feng Bao
In this work, we study the stochastic optimal control problem (SOC) mainly
from the probabilistic viewpoint, i.e. via the Stochastic Maximum Principle
(SMP) \cite{Peng4}. We adopt the sample-wise backpropagation scheme proposed in
\cite{Hui1} to solve the SOC problem under the strong convexity assumption.
Importantly, in the Stochastic Gradient Descent (SGD) procedure, we use batch
samples with a higher-order scheme in the forward SDE to improve the convergence
rate in \cite{Hui1} from $\sim \mathcal{O}(\sqrt{\frac{N}{K} + \frac{1}{N}})$
to $\sim \mathcal{O}(\sqrt{\frac{1}{K} + \frac{1}{N^2}})$ and note that the
main source of uncertainty originates from the scheme for the simulation of the $Z$
term in the BSDE. In the meantime, we note the SGD procedure uses only the
necessary condition of the SMP, while the batch simulation of the approximating
solution of BSDEs allows one to obtain a more accurate estimate of the control
$u$ that minimizes the Hamiltonian. We then propose a damped contraction
algorithm to solve the SOC problem, and we prove its convergence in a special
case under appropriate assumptions. We then present numerical
results to check the first-order convergence rate of the projection algorithm
and analyze the convergence behavior of the damped contraction algorithm.
Lastly, we briefly discuss how to incorporate the proposed scheme in solving
practical problems, especially when randomized neural networks are used. We
note that in this special case, backward error propagation can be avoided
and parameter updates can be achieved via purely algebraic computation (vector
algebra), which could improve the efficiency of the whole training
procedure. This idea requires further exploration, which we leave to
future work.
Authors' comments: Follow-up work for a previous SINUM paper
Daneul Kim, Jaeah Lee, Jaesik Park
Most real-world image editing tasks require multiple sequential edits to
achieve desired results. Current editing approaches, primarily designed for
single-object modifications, struggle with sequential editing, especially with
maintaining previous edits while adapting new objects naturally into the
existing content. These limitations significantly hinder complex editing
scenarios where multiple objects need to be modified while preserving their
contextual relationships. We address this fundamental challenge through two key
proposals: enabling rough mask inputs that preserve existing content while
naturally integrating new elements and supporting consistent editing across
multiple modifications. Our framework achieves this through layer-wise memory,
which stores latent representations and prompt embeddings from previous edits.
We propose Background Consistency Guidance that leverages memorized latents to
maintain scene coherence and Multi-Query Disentanglement in cross-attention
that ensures natural adaptation to existing content. To evaluate our method, we
present a new benchmark dataset incorporating semantic alignment metrics and
interactive editing scenarios. Through comprehensive experiments, we
demonstrate superior performance in iterative image editing tasks with minimal
user effort, requiring only rough masks while maintaining high-quality results
throughout multiple editing steps.
Authors' comments: CVPR 2025. Project page:
https://carpedkm.github.io/projects/improving_edit/index.html
Yijie Jin, Junjie Peng, Xuanchao Lin, Haochen Yuan, Lan Wang, Cangzhi Zheng
Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although they serve as the dominant paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). GsiT is formally equivalent to MulTs and achieves efficient weight sharing without information disorder through IM, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A Triton kernel called Decomposition is implemented to avoid additional computational overhead. Moreover, GsiT achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT itself and the HMHG concept, we integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reduction on widely used MSA datasets.
Yi Lu, Wanxu Zhao, Xin Zhou, Chenxin An, Chenglong Wang, Shuo Li, Yuming Yang, Jun Zhao et al.
Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework that extrapolates the context window of LLMs by examining the different hidden dimensions of RoPE. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions' position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its extrapolation capability, DPE also substantially improves performance within the training length, for example raising Llama3.1 70B by over 18 points on the popular long-context benchmark RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.
Qi Yang, Weichen Bi, Haiyang Shen, Yaoqi Guo, Yun Ma
Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets often generate annotation information through automatic labeling, which commonly results in inaccurate GUI element BBox annotations, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, existing GUI datasets only provide BBox annotations visually, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation ensures accurate localization of GUI elements in cases of occlusion and overlapping elements by extracting BGRA four-channel bitmap annotations. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that PixelWeb achieves performance on the mAP95 metric that is 3-7 times better than existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.
Saniya Karwa, Navpreet Singh
Understanding the inner workings of neural embeddings, particularly in models such as BERT, remains a challenge because of their high-dimensional and opaque nature. This paper proposes a framework for uncovering the specific dimensions of vector embeddings that encode distinct linguistic properties (LPs). We introduce the Linguistically Distinct Sentence Pairs (LDSP-10) dataset, which isolates ten key linguistic features such as synonymy, negation, tense, and quantity. Using this dataset, we analyze BERT embeddings with various methods, including the Wilcoxon signed-rank test, mutual information, and recursive feature elimination, to identify the most influential dimensions for each LP. We introduce a new metric, the Embedding Dimension Impact (EDI) score, which quantifies the relevance of each embedding dimension to an LP. Our findings show that certain properties, such as negation and polarity, are robustly encoded in specific dimensions, while others, like synonymy, exhibit more complex patterns. This study provides insights into the interpretability of embeddings, which can guide the development of more transparent and optimized language models, with implications for model bias mitigation and the responsible deployment of AI systems.
Yamato Arai, Yuma Ichikawa
Layer-wise post-training quantization (PTQ) is a promising technique for
compressing large language models (LLMs), due to its simplicity and
effectiveness without requiring retraining.
However, recent progress in this area is saturating, underscoring the need to
revisit its core limitations and explore further improvements. We address this
challenge by identifying a key limitation of existing layer-wise PTQ methods:
the growth of quantization errors across layers significantly degrades
performance, particularly in low-bit regimes. To address this fundamental
issue, we propose Quantization Error Propagation (QEP), a general, lightweight,
and scalable framework that enhances layer-wise PTQ by explicitly propagating
quantization errors and compensating for accumulated errors. QEP also offers a
tunable propagation mechanism that prevents overfitting and controls
computational overhead, enabling the framework to adapt to various
architectures and resource budgets. Extensive experiments on several LLMs
demonstrate that QEP-enhanced layer-wise PTQ achieves substantially higher
accuracy than existing methods. Notably, the gains are most pronounced in the
extremely low-bit quantization regime.
Authors' comments: 28 pages, 3 figures
Aidan Tiruvan
Robust low-rank approximation under row-wise adversarial corruption can be
achieved with a single-pass, randomized procedure that detects and removes
outlier rows by thresholding their projected norms. We propose a scalable,
non-iterative algorithm that efficiently recovers the underlying low-rank
structure in the presence of row-wise adversarial corruption. By first
compressing the data with a Johnson-Lindenstrauss projection, our approach
preserves the geometry of clean rows while dramatically reducing
dimensionality. Robust statistical techniques based on the median and median
absolute deviation then enable precise identification and removal of outlier
rows with abnormally high norms. The subsequent rank-k approximation achieves
near-optimal error bounds with a one-pass procedure that scales linearly with
the number of observations. Empirical results confirm that combining random
sketches with robust statistics yields efficient, accurate decompositions even
in the presence of large fractions of corrupted rows.
Authors' comments: 27 pages, 9 figures, preprint
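The outlier-row detection step described above (random projection, then thresholding projected norms with the median and median absolute deviation) can be sketched as follows. This is a minimal illustration, not the author's implementation: the Gaussian projection, the cutoff of 3 scaled MADs, the sketch dimension of 8, and all function names are assumptions. The final rank-k approximation of the retained rows is omitted.

```python
import math
import random
import statistics

def jl_project(rows, target_dim, seed=0):
    """Project each row with a random Gaussian matrix scaled by 1/sqrt(target_dim)."""
    rng = random.Random(seed)
    d = len(rows[0])
    scale = 1.0 / math.sqrt(target_dim)
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)] for _ in range(d)]
    return [
        [sum(row[i] * proj[i][j] for i in range(d)) for j in range(target_dim)]
        for row in rows
    ]

def flag_outlier_rows(rows, target_dim=8, cutoff=3.0, seed=0):
    """Return indices of rows whose projected norm is abnormally high."""
    sketch = jl_project(rows, target_dim, seed)
    norms = [math.sqrt(sum(x * x for x in r)) for r in sketch]
    med = statistics.median(norms)
    mad = statistics.median([abs(n - med) for n in norms])
    # 1.4826 rescales the MAD to match a standard deviation under normality.
    threshold = med + cutoff * 1.4826 * mad
    return [i for i, n in enumerate(norms) if n > threshold]

# Toy data: 20 clean rows plus 2 corrupted rows with inflated entries.
data_rng = random.Random(1)
clean = [[data_rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(20)]
corrupted = [[100.0 * data_rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(2)]
data = clean + corrupted

flagged = flag_outlier_rows(data)
kept = [row for i, row in enumerate(data) if i not in set(flagged)]
```

The sketch makes one pass over the rows to compute projected norms, so the cost grows linearly with the number of rows. A clean row may occasionally be flagged as well, depending on the cutoff.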
Alfonso Artigue
We show that for a compact surface without boundary $M$ the set of cw-expansive homeomorphisms is dense in the set of all the homeomorphisms of $M$ with respect to the $C^0$ topology. After this we show that for a generic homeomorphism $f$ of $M$ it holds that: for all $\epsilon>0$ there is a cw-expansive homeomorphism $g$ of $M$ which is $\epsilon$-close to $f$ and is semiconjugate to $f$; moreover, if $\pi\colon M\to M$ is this semiconjugacy then $\pi^{-1}(x)$ is connected, does not separate $M$ and has diameter less than $\epsilon$ for all $x\in M$.
Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, Yu Wang
Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.
Ding Zhu, Zhiqun Zuo, Mohammad Mahdi Khalili
Large-scale machine learning (ML) models are increasingly being used in
critical domains such as education, lending, recruitment, healthcare, and
criminal justice. However, the training, deployment, and utilization of these
models demand substantial computational resources. To decrease computation and
memory costs, machine learning models with sparse weight matrices are widely
used in the literature. Among sparse models, those with special sparse
structures (e.g., models with block-wise sparse weight matrices) fit better
with the hardware accelerators and can decrease the memory and computation
costs during the inference. Unfortunately, while there are several efficient
training methods, none of them are designed to train a block-wise sparse model
efficiently. As a result, the current methods for training block-wise sparse
models start with full and dense models leading to inefficient training. In
this work, we focus on training models with \textit{block-wise sparse matrices}
and propose an efficient training algorithm to decrease both computation and
memory costs during training and inference. In addition, we show that our
proposed method enables us to efficiently find the right block size for the
sparsity pattern during the training process. Our extensive empirical and
theoretical analyses show that our algorithms can decrease the computation and
memory costs significantly without a performance drop compared to baselines.
Authors' comments: 24 pages, submitted to Transactions on Machine Learning Research
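To make the notion of a block-wise sparse weight matrix concrete, the sketch below stores only the nonzero blocks and multiplies them with a vector. It illustrates the data structure and why it saves memory and computation at inference. It does not reproduce the training algorithm proposed in the paper, and the storage layout and function name are hypothetical.

```python
def block_sparse_matvec(blocks, block_size, n_rows, x):
    """Multiply a block-wise sparse matrix by a vector.

    `blocks` maps (block_row, block_col) to a dense block_size x block_size
    block; blocks that are absent are treated as all zeros and skipped.
    """
    y = [0.0] * n_rows
    for (block_row, block_col), block in blocks.items():
        row0 = block_row * block_size
        col0 = block_col * block_size
        for i in range(block_size):
            acc = 0.0
            for j in range(block_size):
                acc += block[i][j] * x[col0 + j]
            y[row0 + i] += acc
    return y

# A 4x4 weight matrix with block size 2 where only the two diagonal
# blocks are nonzero, so half of the entries are never stored or touched.
blocks = {
    (0, 0): [[1.0, 2.0], [3.0, 4.0]],
    (1, 1): [[5.0, 6.0], [7.0, 8.0]],
}
x = [1.0, 1.0, 1.0, 1.0]
y = block_sparse_matvec(blocks, block_size=2, n_rows=4, x=x)
```

Because whole blocks are skipped, both the stored parameters and the multiply-accumulate operations scale with the number of nonzero blocks rather than with the full matrix size.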