Ahmed Ali Abbasi, Namrata Vaswani
We precisely formulate, and provide a solution for, the Low Rank Column-wise Sensing (LRCS) problem when some of the observed data is scrambled/permuted/unlabeled. This problem, which we refer to as permuted LRCS, lies at the intersection of two distinct topics of recent research: unlabeled sensing and low rank column-wise (matrix) sensing. We introduce a novel generalization of the recently developed Alternating Gradient Descent and Minimization (AltGDmin) algorithm to solve this problem. We also develop an alternating minimization (AltMin) solution. We show, using simulation experiments, that both converge but Permuted-AltGDmin is much faster than Permuted-AltMin.
Li Lin, Xiaojun Wan
A natural and intuitive idea in model quantization is to approximate each
component's quantized output to match its original. Layer-wise post-training
quantization (PTQ), though based on this idea, adopts a strictly local view and
can achieve, at best, only activation-aware approximations of weights. As a
result, it often leads to insufficient approximations and practical deviations
from this guiding intuition. Recent work has achieved a more accurate
approximation of linear-layer outputs within the framework of layer-wise PTQ,
but such refinements remain inadequate for achieving alignment with the full
model output. Based on a deeper understanding of the structural characteristics
of mainstream LLMs, we propose $LoaQ$, an output-approximation method for
layer-wise PTQ that explicitly targets output-level consistency. It better
aligns with this intuition and can feature a simple closed-form solution,
making it orthogonal to existing techniques and readily integrable into
existing quantization pipelines. Experiments on the LLaMA and Qwen model
families demonstrate that LoaQ performs effectively in both weight-only and
weight-activation joint quantization. By integrating seamlessly with existing
quantization strategies, it further enhances overall quantization quality and
shows strong potential to advance the frontier of post-training quantization.
Authors' comments: 7 pages, under review
Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) provides better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu
Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a
Yunrui Zhang, Gustavo Batista, Salil S. Kanhere
Deep neural networks often produce miscalibrated probability estimates,
leading to overconfident predictions. A common approach for calibration is
fitting a post-hoc calibration map on unseen validation data that transforms
predicted probabilities. A key desirable property of the calibration map is
instance-wise monotonicity (i.e., preserving the ranking of probability
outputs). However, most existing post-hoc calibration methods do not guarantee
monotonicity. Previous monotonic approaches either use an under-parameterized
calibration map with limited expressive ability or rely on black-box neural
networks, which lack interpretability and robustness. In this paper, we propose
a family of novel monotonic post-hoc calibration methods, which employs a
constrained calibration map parameterized linearly with respect to the number
of classes. Our proposed approach ensures expressiveness, robustness, and
interpretability while preserving the relative ordering of the probability
output by formulating the proposed calibration map as a constrained
optimization problem. Our proposed methods achieve state-of-the-art performance
across datasets with different deep neural network models, outperforming
existing calibration methods while being data and computation-efficient. Our
code is available at
https://github.com/YunruiZhang/Calibration-by-Constrained-Transformation
Authors' comments: Accepted to Conference on Uncertainty in Artificial Intelligence
(UAI)
Edward L. Wright, Jack Foley
WISE 0855-0714 is the coldest known brown dwarf, located 2.28 pc from the
solar system. Discovered by the Wide-Field Infrared Survey Explorer (WISE) in
2014 (Luhman 2014), the object is of interest to scientists because of its low
temperature ($\approx270$ K), proximity to the solar system, small mass
($\sim3-10\: M_{J}$), and high proper motion. The first observations of W0855
by WISE in 2010 are heavily contaminated by a background source. With 10.5
years of observations following the NEOWISE reactivation in 2013 (Mainzer et
al., 2014), we present a robust analysis of W0855's flux and color unobstructed
by this background source. We obtain W1 = 19.3 and W1-W2 = 5.4 magnitudes with
an error of 0.37 magnitudes.
Authors' comments: 7 pages, 5 figures. v2 adds the author E-mail and fixes some wording
Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval
but struggle with frame-wise audio understanding. Prior works use
temporal-aware labels or unsupervised training to improve frame-wise
capabilities, but they still lack fine-grained labeling capability to pinpoint
when an event occurs. While traditional sound event detection models can
precisely localize events, they are limited to pre-defined categories, making
them ineffective for real-world scenarios with out-of-distribution events. In
this work, we introduce FLAM, an open-vocabulary contrastive audio-language
model capable of localizing specific sound events. FLAM employs a
memory-efficient and calibrated frame-wise objective with logit adjustment to
address spurious correlations, such as event dependencies and label imbalances
during training. To enable frame-wise supervision, we leverage a large-scale
dataset with diverse audio events, LLM-generated captions and simulation.
Experimental results and case studies demonstrate that FLAM significantly
improves the open-vocabulary localization capability while maintaining strong
performance in global retrieval and downstream tasks.
Authors' comments: Accepted at ICML 2025. V2: fixed a small typo in eq. 15 and eq. 17
Mark Hagen, Alexandre Martin, Giovanni Sartori
We show that Wise's power alternative is stable under certain group
constructions, use this to prove the power alternative for new classes of
groups, and recover known results from a unified perspective.
For groups acting on trees, we introduce a dynamical condition that allows us
to deduce the power alternative for the group from the power alternative for
its stabilisers of points. As an application, we reduce the power alternative
for Artin groups to the power alternative for free-of-infinity Artin groups,
under some conditions on their parabolic subgroups. We also introduce a uniform
version of the power alternative and prove it, among other things, for a large
family of two-dimensional Artin groups. As a corollary, we deduce that these
Artin groups have uniform exponential growth.
Finally, we prove that the power alternative is stable under taking
relatively hyperbolic groups. We apply this to show that various examples,
including all free-by-$\mathbb{Z}$ groups and a natural subclass of
hierarchically hyperbolic groups, satisfy the uniform power alternative.
Authors' comments: 24 pages, 2 figures
Zak Buzzard
Extending deep Q-learning to cooperative multi-agent settings is challenging
due to the exponential growth of the joint action space, the non-stationary
environment, and the credit assignment problem. Value decomposition allows deep
Q-learning to be applied at the joint agent level, at the cost of reduced
expressivity. Building on past work in this direction, our paper proposes
PairVDN, a novel method for decomposing the value function into a collection of
pair-wise, rather than per-agent, functions, improving expressivity at the cost
of requiring a more complex (but still efficient) dynamic programming
maximisation algorithm. Our method enables the representation of value
functions which cannot be expressed as a monotonic combination of per-agent
functions, unlike past approaches such as VDN and QMIX. We implement a novel
many-agent cooperative environment, Box Jump, and demonstrate improved
performance over these baselines in this setting. We open-source our code and
environment at https://github.com/zzbuzzard/PairVDN.
Authors' comments: 8 pages, 5 figures
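To make the dynamic programming maximisation mentioned above concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that the pairs form a chain (agent 0 with 1, 1 with 2, and so on), so the joint value is a sum of tables `pair_q[i][a_i][a_{i+1}]`; the paper's actual pairing structure may differ. The names `chain_argmax` and `pair_q` are hypothetical.

```python
def chain_argmax(pair_q):
    """Maximise sum_i pair_q[i][a_i][a_{i+1}] over joint actions by dynamic programming."""
    num_actions = len(pair_q[0])
    # best[b]: best value of the remaining pair terms given the next agent plays b
    best = [0.0] * num_actions
    choice = []  # after reversal, choice[i][a] = best action of agent i+1 given agent i plays a
    for table in reversed(pair_q):
        new_best, step = [], []
        for a in range(num_actions):
            scores = [table[a][b] + best[b] for b in range(num_actions)]
            b_star = max(range(num_actions), key=scores.__getitem__)
            new_best.append(scores[b_star])
            step.append(b_star)
        best = new_best
        choice.append(step)
    choice.reverse()
    # Pick the first agent's action, then follow the stored choices forward.
    first = max(range(num_actions), key=best.__getitem__)
    actions = [first]
    for step in choice:
        actions.append(step[actions[-1]])
    return best[first], actions


# Toy example (assumed values): 3 agents, 2 actions each, 2 pairwise tables.
pair_q = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
value, actions = chain_argmax(pair_q)
```

Under this chain assumption the search costs on the order of (number of pairs) x (number of actions) squared, instead of enumerating every joint action.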
Bernardo Carvalho, Piotr Oprocha, Elias Rego
We prove that cw-hyperbolic homeomorphisms with jointly continuous
stable/unstable holonomies satisfy the periodic shadowing property and, if they
are topologically mixing, the periodic specification property. We discuss
difficulties in adapting Bowen's techniques to obtain a measure of maximal entropy
for cw-hyperbolic homeomorphisms, exhibit the unique measure of maximal entropy
for Walter's pseudo-Anosov diffeomorphism of $\mathbb{S}^2$, and prove it can
be obtained, as in the expansive case, as the weak* limit of an average of
Dirac measures on periodic orbits. As an application, we exhibit the unique
measure of maximal entropy for the homeomorphism on the Sierpi\'nski Carpet
defined in [12], which does not satisfy the specification property.
Authors' comments: 22 pages
Hangyu Liu, Bo Peng, Can Cui, Pengxiang Ding, Donglin Wang
Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples,
which pose significant challenges in security-sensitive applications. Among
various adversarial attack strategies, input transformation-based attacks have
demonstrated remarkable effectiveness in enhancing adversarial transferability.
However, existing methods still perform poorly across different architectures,
even though they have achieved promising results within the same architecture.
This limitation arises because, while models of the same architecture may focus
on different regions of the object, the variation is even more pronounced
across different architectures. Unfortunately, current approaches fail to
effectively guide models to attend to these diverse regions. To address this
issue, this paper proposes a novel input transformation-based attack method,
termed Component-Wise Transformation (CWT). CWT applies interpolation and
selective rotation to individual image blocks, ensuring that each transformed
image highlights different target regions, thereby improving the
transferability of adversarial examples. Extensive experiments on the standard
ImageNet dataset show that CWT consistently outperforms state-of-the-art
methods in both attack success rates and stability across CNN- and
Transformer-based models.
Authors' comments: 15 pages
Guoxin Feng
The self-attention (SA) mechanism has demonstrated superior performance across various domains, yet it suffers from substantial complexity during both training and inference. Next-generation architectures, which aim to retain the competitive performance of SA while achieving low-cost inference and efficient long-sequence training, primarily follow three approaches: linear attention, linear RNNs, and state space models. Although these approaches achieve lower complexity than SA, they all have built-in performance degradation factors, such as diminished "spikiness" and compression of historical information. In contrast to these approaches, we propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial. This design achieves remarkable efficiency: during training, the element-wise attention has a complexity of $\mathcal{O}(tLD)$, making long-sequence training both computationally and memory efficient, where $L$ is the sequence length, $D$ is the feature dimension, and $t$ is the highest order of the polynomial; during inference, it can be reformulated as a recurrent neural network, achieving an inference complexity of $\mathcal{O}(tD)$. Furthermore, the element-wise attention circumvents the performance degradation factors present in these approaches and achieves performance comparable to SA in both causal and non-causal forms.
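As a rough illustration of why a Taylor polynomial removes the quadratic term, the sketch below works on a single channel with scalar queries and keys in the non-causal case. It is a simplification under stated assumptions, not the paper's formulation: it uses plain $\exp(qk)$ weights rather than the squared-distance similarity, and the function names are hypothetical. The point is that $\exp(qk) \approx \sum_{n=0}^{t} q^n k^n / n!$ separates the query from the key, so sums over the sequence can be computed once and reused for every query.

```python
import math


def exact_attention(q, keys, values):
    """Reference: exp(q*k)-weighted average, O(L) work for each query."""
    weights = [math.exp(q * k) for k in keys]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def taylor_summaries(keys, values, t):
    """Precompute per-order sums over the sequence once: O(t * L) in total."""
    num = [sum((k ** n) * v for k, v in zip(keys, values)) / math.factorial(n)
           for n in range(t + 1)]
    den = [sum(k ** n for k in keys) / math.factorial(n) for n in range(t + 1)]
    return num, den


def taylor_attention(q, num, den):
    """Evaluate one query in O(t) using exp(q*k) ~= sum_n (q*k)^n / n!."""
    top = sum((q ** n) * s for n, s in enumerate(num))
    bottom = sum((q ** n) * s for n, s in enumerate(den))
    return top / bottom


# Toy inputs (assumed values); small |q*k| keeps the truncation error low.
keys = [0.1, -0.3, 0.5]
values = [1.0, 2.0, 3.0]
num, den = taylor_summaries(keys, values, t=4)
approx = taylor_attention(0.4, num, den)
exact = exact_attention(0.4, keys, values)
```

In a causal setting the same per-order sums could be kept as running totals updated one token at a time, which is in the spirit of the recurrent reformulation described in the abstract. An even order `t` is used here because an even-order Taylor polynomial of the exponential stays positive, which keeps the denominator well behaved.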
Albert Manuel Orozco Camacho, Stefan Horoi, Guy Wolf, Eugene Belilovsky
Combining multiple machine learning models has long been a technique for
enhancing performance, particularly in distributed settings. Traditional
approaches, such as model ensembles, work well, but are expensive in terms of
memory and compute. Recently, methods based on averaging model parameters have
achieved good results in some settings and have gained popularity. However,
merging models initialized differently that do not share a part of their
training trajectories can yield worse results than simply using the base
models, even after aligning their neurons. In this paper, we introduce a novel
approach, Non-uniform Parameter-wise Model Merging, or NP Merge, which merges
models by learning the contribution of each parameter to the final model using
gradient-based optimization. We empirically demonstrate the effectiveness of
our method for merging models of various architectures in multiple settings,
outperforming past methods. We also extend NP Merge to handle the merging of
multiple models, showcasing its scalability and robustness.
Authors' comments: 9 pages, 1 figure, to be published in the Proceedings of the 9th IEEE
Special Session on Machine Learning on Big Data (MLBD 2024)
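The idea of learning a per-parameter contribution can be sketched as follows. This is a toy illustration under assumptions, not the NP Merge implementation: each merged parameter is an interpolation `alpha_i * a_i + (1 - alpha_i) * b_i` with `alpha_i` given by a sigmoid of a learnable logit, and the training loss is replaced by a stand-in squared error to a hypothetical `target` vector so that the gradient can be written by hand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def merge(theta_a, theta_b, logits):
    """Per-parameter interpolation: alpha_i * a_i + (1 - alpha_i) * b_i."""
    return [sigmoid(z) * a + (1.0 - sigmoid(z)) * b
            for a, b, z in zip(theta_a, theta_b, logits)]


def learn_logits(theta_a, theta_b, target, steps=500, lr=0.5):
    """Fit the interpolation logits by gradient descent on a toy squared error."""
    logits = [0.0] * len(theta_a)  # alpha = 0.5 everywhere, i.e. plain averaging
    for _ in range(steps):
        merged = merge(theta_a, theta_b, logits)
        for i, (a, b, z) in enumerate(zip(theta_a, theta_b, logits)):
            alpha = sigmoid(z)
            # d/dz of (merged_i - target_i)^2 via the chain rule
            grad = 2.0 * (merged[i] - target[i]) * (a - b) * alpha * (1.0 - alpha)
            logits[i] = z - lr * grad
    return logits


# Toy example (assumed values): two "models" with two parameters each.
theta_a = [1.0, 0.0]
theta_b = [0.0, 1.0]
target = [0.8, 0.5]  # stand-in for whatever the real task loss would prefer
logits = learn_logits(theta_a, theta_b, target)
merged = merge(theta_a, theta_b, logits)
```

In a realistic setting the squared error would be replaced by the task loss evaluated on data, with gradients obtained by automatic differentiation rather than by hand.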
Siyang Zhang, Ser-Nam Lim
Generating long-duration videos has always been a significant challenge due to the inherent complexity of the spatio-temporal domain and the substantial GPU memory required to compute very large tensors. While diffusion-based generative models achieve state-of-the-art performance on video generation tasks, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with a specific resolution and length must be specified first, and the model denoises the entire video tensor, with all frames together, simultaneously. Such an approach easily runs into out-of-memory (OOM) problems when the specified resolution and/or length exceeds a certain limit. One solution to this problem is to generate many short video chunks autoregressively with strong inter-chunk spatio-temporal relations and then concatenate them to form a long video. In this approach, a long video generation task is divided into multiple short video generation subtasks, and the cost of each subtask is reduced to a feasible level. In this paper, we conduct a detailed survey on long video generation with the autoregressive chunk-by-chunk strategy. We address common problems caused by applying short image-to-video models to long video tasks and design an efficient $k$-step search solution to mitigate these problems.
Francesco Della Santa, Antonio Mastropietro, Sandra Pieraccini, Francesco Vaccarino
The problem of multi-task regression over graph nodes has recently been approached through the Graph-Instructed Neural Network (GINN), a promising architecture belonging to the subset of message-passing graph neural networks. In this work, we discuss the limitations of the Graph-Instructed (GI) layer and formalize a novel edge-wise GI (EWGI) layer. We discuss the advantages of the EWGI layer and provide numerical evidence that EWGINNs perform better than GINNs over some graph-structured input data, such as those inferred from the Barabási-Albert graph, and improve the training regularization on graphs with chaotic connectivity, such as those inferred from the Erdős-Rényi graph.
Ankit Pratap Singh, Namrata Vaswani
This letter studies the AltGDmin algorithm for solving the noisy low rank
column-wise sensing (LRCS) problem. Our sample complexity guarantee improves
upon the best existing one by a factor $\max(r, \log(1/\epsilon))/r$ where $r$
is the rank of the unknown matrix and $\epsilon$ is the final desired accuracy.
A second contribution of this work is a detailed comparison of guarantees from
all work that studies the exact same mathematical problem as LRCS, but refers
to it by different names.
Authors' comments: 8 pages
Rolf van der Hulst, Matthias Walter
Given a $\{0,1\}$-matrix $M$, the graph realization problem for $M$ asks if
there exists a spanning forest such that the columns of $M$ are incidence
vectors of paths in the forest. The problem is closely related to the
recognition of network matrices, which are a large subclass of totally
unimodular matrices and have many applications in mixed-integer programming.
Previously, Bixby and Wagner have designed an efficient algorithm for graph
realization that grows a submatrix in a column-wise fashion whilst maintaining
a graphic realization. This paper complements their work by providing an
algorithm that works in a row-wise fashion and uses similar data structures.
The main challenge in designing efficient algorithms for the graph realization
problem is ambiguity as there may exist many graphs realizing $M$. The key
insight for designing an efficient row-wise algorithm is that a graphic matrix
is uniquely represented by an SPQR tree, a graph decomposition that stores all
graphs with the same set of cycles. The developed row-wise algorithm uses data
structures that are compatible with the column-wise algorithm and can be
combined with the latter to detect maximal graphic submatrices.
Authors' comments: 40 pages, 10 figures
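The definition of a realization can be made concrete with a small checker. The sketch below only verifies a candidate forest; it is not the row-wise or column-wise recognition algorithm discussed above. It assumes that row `i` of the matrix corresponds to `edges[i]`, that `edges` really is a forest (this is not checked), and that an all-zero column counts as an empty path. The function names are hypothetical.

```python
from collections import defaultdict


def column_is_path(edges, column):
    """True if the edges selected by the 0/1 column form a single simple path."""
    chosen = [edges[i] for i, bit in enumerate(column) if bit]
    if not chosen:
        return True  # convention assumed here: an all-zero column is an empty path
    degree = defaultdict(int)
    for u, v in chosen:
        degree[u] += 1
        degree[v] += 1
    if any(d > 2 for d in degree.values()):
        return False
    # In a forest the chosen edges are acyclic, so with maximum degree 2 they
    # form disjoint paths; exactly two endpoints means there is only one path.
    return sum(1 for d in degree.values() if d == 1) == 2


def realizes(edges, matrix):
    """Check every column of the row-major 0/1 matrix against the forest."""
    return all(column_is_path(edges, col) for col in zip(*matrix))


# Toy example (assumed values): a path forest 0-1-2-3 with three edges.
edges = [(0, 1), (1, 2), (2, 3)]
ok = realizes(edges, [[1, 0], [1, 1], [0, 1]])   # columns {e0, e1} and {e1, e2}
bad = realizes(edges, [[1], [0], [1]])           # column {e0, e2} is disconnected
```

The second matrix fails only for this particular forest; assigning its rows to adjacent edges of a different forest would realize it, which is a small instance of the ambiguity the abstract describes.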
Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang
The advent of large language models has revolutionized natural language
processing, but their increasing complexity has led to substantial training
costs, resource demands, and environmental impacts. In response, sparse
Mixture-of-Experts (MoE) models have emerged as a promising alternative to
dense models. Since training MoE models from scratch can be prohibitively
expensive, recent studies have explored leveraging knowledge from pre-trained
non-MoE models. However, existing approaches have limitations, such as
requiring significant hardware resources and data. We propose a novel
algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model
into a MoE model with minimal additional training cost. LaDiMo consists of two
stages: layer-wise expert construction and routing policy decision. By
harnessing the concept of Knowledge Distillation, we compress the model and
rapidly recover its performance. Furthermore, we develop an adaptive router
that optimizes inference efficiency by profiling the distribution of routing
weights and determining a layer-wise policy that balances accuracy and latency.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B
model to a MoE model using only 100K tokens, reducing activated parameters by
over 20% while keeping accuracy. Our approach offers a flexible and efficient
solution for building and deploying MoE models.
Authors' comments: 21 pages, 10 figures
Yusuf Sale, Paul Hofman, Timo Löhr, Lisa Wimmer, Thomas Nagler, Eyke Hüllermeier
We present a novel approach to uncertainty quantification in classification
tasks based on label-wise decomposition of uncertainty measures. This
label-wise perspective allows uncertainty to be quantified at the individual
class level, thereby improving cost-sensitive decision-making and helping
understand the sources of uncertainty. Furthermore, it allows total,
aleatoric, and epistemic uncertainty to be defined on the basis of non-categorical measures
such as variance, going beyond common entropy-based measures. In particular,
variance-based measures address some of the limitations associated with
established methods that have recently been discussed in the literature. We
show that our proposed measures adhere to a number of desirable properties.
Through empirical evaluation on a variety of benchmark data sets -- including
applications in the medical domain where accurate uncertainty quantification is
crucial -- we establish the effectiveness of label-wise uncertainty
quantification.
Authors' comments: Uncertainty in Artificial Intelligence. arXiv admin note: substantial
text overlap with arXiv:2401.00276
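A variance-based, label-wise decomposition can be sketched as follows. This is an illustrative reading under assumptions rather than the paper's exact definitions: the second-order distribution is approximated by an equally weighted ensemble, and for each class the total uncertainty is taken as the variance of the label indicator under the mean prediction, the aleatoric part as the expected Bernoulli variance, and the epistemic part as the variance of the predicted probability. By the law of total variance the two parts sum to the total. The function name is hypothetical.

```python
def labelwise_variance_uncertainty(ensemble_probs):
    """ensemble_probs[m][k]: probability of class k under ensemble member m."""
    members = len(ensemble_probs)
    classes = len(ensemble_probs[0])
    result = []
    for k in range(classes):
        thetas = [p[k] for p in ensemble_probs]
        mean = sum(thetas) / members
        total = mean * (1.0 - mean)                                  # variance of the label indicator
        aleatoric = sum(t * (1.0 - t) for t in thetas) / members     # E[theta * (1 - theta)]
        epistemic = sum((t - mean) ** 2 for t in thetas) / members   # Var(theta)
        result.append({"total": total, "aleatoric": aleatoric, "epistemic": epistemic})
    return result


# Toy example (assumed values): two ensemble members, two classes.
scores = labelwise_variance_uncertainty([[0.9, 0.1], [0.5, 0.5]])
```

Because the measures are computed per class, a cost-sensitive application could inspect only the labels it cares about instead of a single aggregate score.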
Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui
Evaluating the quality of free-text explanations is a multifaceted,
subjective, and labor-intensive task. Large language models (LLMs) present an
appealing alternative due to their potential for consistency, scalability, and
cost-efficiency. In this work, we present ACORN, a new dataset of 3,500
free-text explanations and aspect-wise quality ratings, and use it to evaluate
how LLMs rate explanations. We observed that larger models outputted labels
that maintained or increased the inter-annotator agreement, suggesting that
they are within the expected variance between human raters. However, their
correlation with majority-voted human ratings varied across different quality
aspects, indicating that they are not a complete replacement. In turn, using
LLMs as a supplement to a smaller group of human raters in some cases improved
the correlation with the original majority labels. However, the effect was
limited to cases where human raters were scarce, and an additional human rater
had a more pronounced effect in all cases. Overall, we recommend against using
LLMs as a complete replacement for human raters but encourage using them in
configurations that end with targeted human involvement. Data available here:
https://github.com/a-brassard/ACORN
Authors' comments: 18 pages, 7 figures, accepted to COLM 2024.