Quan Mai, Susan Gauch, Douglas Adams
We introduce SetBERT, a fine-tuned BERT-based model designed to enhance query
embeddings for set operations and Boolean logic queries, such as Intersection
(AND), Difference (NOT), and Union (OR). SetBERT significantly improves
retrieval performance for logic-structured queries, an area where both
traditional and neural retrieval methods typically underperform. We propose an
innovative use of inversed-contrastive loss, which focuses on identifying the
negative sentence, and fine-tune BERT on a dataset generated by prompting
GPT. Furthermore, we demonstrate that, unlike other BERT-based models,
fine-tuning with triplet loss actually degrades performance for this specific
task. Our experiments reveal that SetBERT-base not only significantly
outperforms BERT-base (up to a 63% improvement in Recall) but also achieves
performance comparable to the much larger BERT-large model, despite being only
one-third the size.
Authors' comments: 10 pages, 1 figure
Weichu Xie, Haozhe Zhao, Wenpu Liu, Yongfu Zhu, Liang Chen, Minghao Ye, Zirong Chen, Yuqi Xu et al.
Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.
Authors' comments: Code available at https://github.com/akarinmoe/SRaR
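The cross-rollout normalization in step (ii) can be illustrated with a minimal sketch. The abstract does not say how steps are aligned across rollouts, so the layout here is an assumption: each rubric item gets one score per rollout, the (hypothetical) `attribution` table records which step the judge assigned it to, and each item is z-scored across rollouts before being credited to that step. All names are illustrative and not taken from the SRaR code.

```python
from statistics import mean, pstdev

def per_step_advantages(scores, attribution, num_steps, eps=1e-8):
    """scores[r][k]: score of rubric item k in rollout r (assumed layout).
    attribution[r][k]: index of the step that item k was attributed to.
    num_steps[r]: number of reasoning steps in rollout r."""
    num_rollouts = len(scores)
    num_items = len(scores[0])
    adv = [[0.0] * num_steps[r] for r in range(num_rollouts)]
    for k in range(num_items):
        column = [scores[r][k] for r in range(num_rollouts)]
        mu, sigma = mean(column), pstdev(column)
        if sigma < eps:
            # Every rollout scored the same on this item: no learning signal.
            continue
        for r in range(num_rollouts):
            adv[r][attribution[r][k]] += (scores[r][k] - mu) / sigma
    return adv

# Four rollouts, two rubric items. Item 0 is satisfied everywhere; item 1 varies.
scores = [[1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
attribution = [[0, 1], [0, 2], [0, 1], [0, 1]]
num_steps = [2, 3, 2, 2]
advantages = per_step_advantages(scores, attribution, num_steps)
print(advantages)
```

In this toy input, the item that every rollout satisfies contributes nothing, while the item whose score varies yields a positive signal on the steps that met it and a negative signal on the steps that did not.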
Hyunseo Shin, Wonseok Hwang
Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, fine-tuned on their respective instruction datasets, tend to exhibit higher unsafety rates than their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transfer ability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves performance comparable to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury
Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
Authors' comments: This is the author's preprint. Accepted to WEB&GRAPH 2026 (co-located with WSDM 2026), Boise, Idaho, USA, Feb 26, 2026. Final version will appear in WSDM 2026 Companion Proceedings. Conf: https://wsdm-conference.org/2026/ Workshop: https://aiimlab.org/events/WSDM_2026_WEB_and_GRAPH_2026_Workshop_on_Web_and_Graphs_Responsible_Intelligence_and_Social_Media.html
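Two of the calibration metrics named above, the Brier score and Expected Calibration Error, have standard definitions that are easy to state in code. The sketch below is for the binary case (fake news vs. satire) and assumes equal-width confidence bins with `n_bins=10`; the abstract does not state the binning the authors used, and the probabilities here are made-up toy values, not results from the paper.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted P(class 1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin samples by the confidence of the predicted class, then
    average |accuracy - mean confidence| over bins, weighted by bin size."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Toy predictions (illustrative only): three correct, one confident mistake.
probs = [0.95, 0.85, 0.25, 0.65]
labels = [1, 1, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
print(round(brier, 4), round(ece, 4))
```

Both metrics are lower-is-better, which is why they complement accuracy and ROC-AUC when comparing models that will be deployed with probability thresholds.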
Amirhoseein Afsharrad, Ahmadreza Moradipari, Sanjay Lall
In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network's second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.
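The role of $|\lambda_2|$ in the bound can be illustrated with a generic averaging-consensus sketch. This is not the MA-SCLUCB consensus phase, whose details the abstract does not give. It only assumes each agent repeatedly replaces its estimate with a weighted average of its own and its neighbors' estimates, using a hypothetical doubly stochastic weight matrix on a 4-agent ring.

```python
def consensus_round(values, weights):
    """One communication round: agent i mixes the estimates of its neighbors.
    weights[i][j] is nonzero only if j is i itself or an immediate neighbor."""
    n, d = len(values), len(values[0])
    return [
        [sum(weights[i][j] * values[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]

def run_consensus(values, weights, rounds):
    for _ in range(rounds):
        values = consensus_round(values, weights)
    return values

# Ring of 4 agents; eigenvalues of this matrix are 1, 0.5, 0, 0.5, so the
# second largest eigenvalue magnitude is 0.5 and the error halves each round.
ring_weights = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
local_estimates = [[4.0, 0.0], [0.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
global_average = [1.0, 1.0]  # average of the local parameters

after = run_consensus(local_estimates, ring_weights, rounds=20)
max_error = max(abs(v[c] - global_average[c]) for v in after for c in range(2))
print(max_error)
```

Because the disagreement shrinks geometrically at rate $|\lambda_2|$, the number of rounds needed for a fixed accuracy scales like $1/\log(1/|\lambda_2|)$, which is small for well-connected networks.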
Ahmed Ali Abbasi, Namrata Vaswani
We precisely formulate, and provide a solution for, the Low Rank Columnwise Sensing (LRCS) problem when some of the observed data is scrambled/permuted/unlabeled. This problem, which we refer to as permuted LRCS, lies at the intersection of two distinct topics of recent research: unlabeled sensing and low rank column-wise (matrix) sensing. We introduce a novel generalization of the recently developed Alternating Gradient Descent and Minimization (AltGDMin) algorithm to solve this problem. We also develop an alternating minimization (AltMin) solution. We show, using simulation experiments, that both converge but Permuted-AltGDMin is much faster than Permuted-AltMin.
Li Lin, Xiaojun Wan
A natural and intuitive idea in model quantization is to approximate each
component's quantized output to match its original. Layer-wise post-training
quantization (PTQ), though based on this idea, adopts a strictly local view and
can achieve, at best, only activation-aware approximations of weights. As a
result, it often leads to insufficient approximations and practical deviations
from this guiding intuition. Recent work has achieved a more accurate
approximation of linear-layer outputs within the framework of layer-wise PTQ,
but such refinements remain inadequate for achieving alignment with the full
model output. Based on a deeper understanding of the structural characteristics
of mainstream LLMs, we propose $LoaQ$, an output-approximation method for
layer-wise PTQ that explicitly targets output-level consistency. It better
aligns with this intuition and admits a simple closed-form solution,
making it orthogonal to existing techniques and readily integrable into
existing quantization pipelines. Experiments on the LLaMA and Qwen model
families demonstrate that LoaQ performs effectively in both weight-only and
weight-activation joint quantization. By integrating seamlessly with existing
quantization strategies, it further enhances overall quantization quality and
shows strong potential to advance the frontier of post-training quantization.
Authors' comments: 7 pages, under review
Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) provides better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu
Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a
Yunrui Zhang, Gustavo Batista, Salil S. Kanhere
Deep neural networks often produce miscalibrated probability estimates,
leading to overconfident predictions. A common approach for calibration is
fitting a post-hoc calibration map on unseen validation data that transforms
predicted probabilities. A key desirable property of the calibration map is
instance-wise monotonicity (i.e., preserving the ranking of probability
outputs). However, most existing post-hoc calibration methods do not guarantee
monotonicity. Previous monotonic approaches either use an under-parameterized
calibration map with limited expressive ability or rely on black-box neural
networks, which lack interpretability and robustness. In this paper, we propose
a family of novel monotonic post-hoc calibration methods, which employs a
constrained calibration map parameterized linearly with respect to the number
of classes. Our proposed approach ensures expressiveness, robustness, and
interpretability while preserving the relative ordering of the probability
output by formulating the proposed calibration map as a constrained
optimization problem. Our proposed methods achieve state-of-the-art performance
across datasets with different deep neural network models, outperforming
existing calibration methods while being data and computation-efficient. Our
code is available at
https://github.com/YunruiZhang/Calibration-by-Constrained-Transformation
Authors' comments: Accepted to Conference on Uncertainty in Artificial Intelligence
(UAI)
Edward L. Wright, Jack Foley
WISE 0855-0714 is the coldest known brown dwarf, located 2.28 pc from the
solar system. Discovered by the Wide-Field Infrared Survey Explorer (WISE) in
2014 (Luhman 2014), the object is of interest to scientists because of its low
temperature ($\approx270$ K), proximity to the solar system, small mass
($\sim3-10\: M_{J}$), and high proper motion. The first observations of W0855
by WISE in 2010 are heavily contaminated by a background source. With 10.5
years of observations following the NEOWISE reactivation in 2013 (Mainzer et
al., 2014), we present a robust analysis of W0855's flux and color unobstructed
by this background source. We obtain W1 = 19.3 and W1-W2 = 5.4 magnitudes with
an error of 0.37 magnitudes.
Authors' comments: 7 pages, 5 figures. v2 adds the author E-mail and fixes some wording
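The two reported numbers imply a W2 magnitude and a flux contrast between the bands, which the short calculation below works out. It uses only the standard magnitude relation (a difference of $\Delta m$ magnitudes corresponds to a factor of $10^{0.4\,\Delta m}$ in flux); the ratio is relative to each band's own zero point, and the variable names are illustrative.

```python
w1 = 19.3         # reported W1 magnitude
w1_minus_w2 = 5.4 # reported W1-W2 color

# Implied W2 magnitude.
w2 = w1 - w1_minus_w2

# Flux in W2 relative to W1, each measured against its band's zero point.
flux_ratio = 10 ** (0.4 * w1_minus_w2)

print(round(w2, 1), round(flux_ratio, 1))
```

A color this red means the object is over a hundred times brighter in W2 than in W1 in zero-point-relative terms, which is why the faint W1 measurement is the one most exposed to background contamination.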
Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval
but struggle with frame-wise audio understanding. Prior works use
temporal-aware labels or unsupervised training to improve frame-wise
capabilities, but they still lack fine-grained labeling capability to pinpoint
when an event occurs. While traditional sound event detection models can
precisely localize events, they are limited to pre-defined categories, making
them ineffective for real-world scenarios with out-of-distribution events. In
this work, we introduce FLAM, an open-vocabulary contrastive audio-language
model capable of localizing specific sound events. FLAM employs a
memory-efficient and calibrated frame-wise objective with logit adjustment to
address spurious correlations, such as event dependencies and label imbalances
during training. To enable frame-wise supervision, we leverage a large-scale
dataset with diverse audio events, LLM-generated captions and simulation.
Experimental results and case studies demonstrate that FLAM significantly
improves the open-vocabulary localization capability while maintaining strong
performance in global retrieval and downstream tasks.
Authors' comments: Accepted at ICML 2025. V2: fixed small typos in eq. 15 and eq. 17
Mark Hagen, Alexandre Martin, Giovanni Sartori
We show that Wise's power alternative is stable under certain group
constructions, use this to prove the power alternative for new classes of
groups, and recover known results from a unified perspective.
For groups acting on trees, we introduce a dynamical condition that allows us
to deduce the power alternative for the group from the power alternative for
its stabilisers of points. As an application, we reduce the power alternative
for Artin groups to the power alternative for free-of-infinity Artin groups,
under some conditions on their parabolic subgroups. We also introduce a uniform
version of the power alternative and prove it, among other things, for a large
family of two-dimensional Artin groups. As a corollary, we deduce that these
Artin groups have uniform exponential growth.
Finally, we prove that the power alternative is stable under taking
relatively hyperbolic groups. We apply this to show that various examples,
including all free-by-$\mathbb{Z}$ groups and a natural subclass of
hierarchically hyperbolic groups, satisfy the uniform power alternative.
Authors' comments: 24 pages, 2 figures
Zak Buzzard
Extending deep Q-learning to cooperative multi-agent settings is challenging
due to the exponential growth of the joint action space, the non-stationary
environment, and the credit assignment problem. Value decomposition allows deep
Q-learning to be applied at the joint agent level, at the cost of reduced
expressivity. Building on past work in this direction, our paper proposes
PairVDN, a novel method for decomposing the value function into a collection of
pair-wise, rather than per-agent, functions, improving expressivity at the cost
of requiring a more complex (but still efficient) dynamic programming
maximisation algorithm. Our method enables the representation of value
functions which cannot be expressed as a monotonic combination of per-agent
functions, unlike past approaches such as VDN and QMIX. We implement a novel
many-agent cooperative environment, Box Jump, and demonstrate improved
performance over these baselines in this setting. We open-source our code and
environment at https://github.com/zzbuzzard/PairVDN.
Authors' comments: 8 pages, 5 figures
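To see why a pairwise decomposition can still be maximized efficiently, consider the simplified case where the pairs form a chain, so the joint value is $\sum_i Q_i(a_i, a_{i+1})$. This chain structure is an assumption made for illustration; the pairing and the algorithm used in PairVDN may differ. Under that assumption, a Viterbi-style dynamic program finds the best joint action in time linear in the number of agents rather than exponential.

```python
def chain_argmax(pair_q):
    """pair_q[i][a][b]: value when agent i takes action a and agent i+1 takes b.
    Returns (best total value, list of actions for all agents)."""
    n_actions = len(pair_q[0])
    # best[b]: best value of the processed prefix if the latest agent plays b.
    best = [0.0] * n_actions
    back = []
    for table in pair_q:
        new_best, pointers = [], []
        for b in range(n_actions):
            a_star = max(range(n_actions), key=lambda a: best[a] + table[a][b])
            new_best.append(best[a_star] + table[a_star][b])
            pointers.append(a_star)
        best = new_best
        back.append(pointers)
    # Trace back from the best final action.
    last = max(range(n_actions), key=lambda b: best[b])
    actions = [last]
    for pointers in reversed(back):
        actions.append(pointers[actions[-1]])
    actions.reverse()
    return best[last], actions

# Three agents, two actions each (toy values).
pair_q = [
    [[1, 0], [0, 2]],  # Q_0(a_0, a_1)
    [[0, 3], [1, 0]],  # Q_1(a_1, a_2)
]
best_value, best_actions = chain_argmax(pair_q)
print(best_value, best_actions)
```

In this toy instance, greedily maximizing the first pair picks actions (1, 1) and caps the total at 3, whereas the dynamic program finds the joint optimum of 4. The per-agent greedy argmax used with VDN-style decompositions does not capture this kind of coupling.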
Bernardo Carvalho, Piotr Oprocha, Elias Rego
We prove that cw-hyperbolic homeomorphisms with jointly continuous
stable/unstable holonomies satisfy the periodic shadowing property and, if they
are topologically mixing, the periodic specification property. We discuss
difficulties in adapting Bowen's techniques to obtain a measure of maximal entropy
for cw-hyperbolic homeomorphisms, exhibit the unique measure of maximal entropy
for Walter's pseudo-Anosov diffeomorphism of $\mathbb{S}^2$, and prove it can
be obtained, as in the expansive case, as the weak* limit of an average of
Dirac measures on periodic orbits. As an application, we exhibit the unique
measure of maximal entropy for the homeomorphism on the Sierpiński Carpet
defined in [12], which does not satisfy the specification property.
Authors' comments: 22 pages
Hangyu Liu, Bo Peng, Can Cui, Pengxiang Ding, Donglin Wang
Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples,
which pose significant challenges in security-sensitive applications. Among
various adversarial attack strategies, input transformation-based attacks have
demonstrated remarkable effectiveness in enhancing adversarial transferability.
However, existing methods still perform poorly across different architectures,
even though they have achieved promising results within the same architecture.
This limitation arises because, while models of the same architecture may focus
on different regions of the object, the variation is even more pronounced
across different architectures. Unfortunately, current approaches fail to
effectively guide models to attend to these diverse regions. To address this
issue, this paper proposes a novel input transformation-based attack method,
termed Component-Wise Transformation (CWT). CWT applies interpolation and
selective rotation to individual image blocks, ensuring that each transformed
image highlights different target regions, thereby improving the
transferability of adversarial examples. Extensive experiments on the standard
ImageNet dataset show that CWT consistently outperforms state-of-the-art
methods in both attack success rates and stability across CNN- and
Transformer-based models.
Authors' comments: 15 pages
Guoxin Feng
The self-attention (SA) mechanism has demonstrated superior performance across various domains, yet it suffers from substantial complexity during both training and inference. The next-generation architecture, aiming at retaining the competitive performance of SA while achieving low-cost inference and efficient long-sequence training, primarily focuses on three approaches: linear attention, linear RNNs, and state space models. Although these approaches achieve lower complexity than SA, they all have built-in performance degradation factors, such as diminished "spikiness" and compression of historical information. In contrast to these approaches, we propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial. This design achieves remarkable efficiency: during training, the element-wise attention has a complexity of $\mathcal{O}(tLD)$, making long-sequence training both computationally and memory efficient, where $L$ is the sequence length, $D$ is the feature dimension, and $t$ is the highest order of the polynomial; during inference, it can be reformulated as recurrent neural networks, achieving an inference complexity of $\mathcal{O}(tD)$. Furthermore, the element-wise attention circumvents the performance degradation factors present in these approaches and achieves performance comparable to SA in both causal and non-causal forms.
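The reason a Taylor polynomial removes the quadratic cost can be sketched for a single channel $c$ in the non-causal case. Since $\exp(q k) \approx \sum_{n=0}^{t} q^n k^n / n!$, the sum over keys factorizes into $t+1$ key-side moments that are computed once and reused for every query. The sketch below shows only this factorization; it omits normalization and the squared-distance terms of the actual mechanism, and the function names are illustrative.

```python
import math

def exact_weighted_sum(queries, keys, values):
    """Quadratic reference: for each query, sum_j exp(q * k_j) * v_j."""
    return [sum(math.exp(q * k) * v for k, v in zip(keys, values)) for q in queries]

def taylor_weighted_sum(queries, keys, values, order):
    """Linear-time approximation using a Taylor polynomial of the given order."""
    # Key-side moments S_n = sum_j k_j^n * v_j, computed once: O(order * L).
    moments = [sum(k ** n * v for k, v in zip(keys, values)) for n in range(order + 1)]
    # Each query then needs only O(order) work.
    return [
        sum(q ** n / math.factorial(n) * moments[n] for n in range(order + 1))
        for q in queries
    ]

queries = [0.5, -0.3]
keys = [0.2, -0.4, 0.1]
values = [1.0, 2.0, 3.0]

exact = exact_weighted_sum(queries, keys, values)
approx = taylor_weighted_sum(queries, keys, values, order=6)
max_gap = max(abs(e - a) for e, a in zip(exact, approx))
print(max_gap)
```

The same moments can be updated incrementally as each new key and value arrives, which is consistent with the recurrent reformulation and the per-step $\mathcal{O}(tD)$ inference cost described above.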
Albert Manuel Orozco Camacho, Stefan Horoi, Guy Wolf, Eugene Belilovsky
Combining multiple machine learning models has long been a technique for
enhancing performance, particularly in distributed settings. Traditional
approaches, such as model ensembles, work well, but are expensive in terms of
memory and compute. Recently, methods based on averaging model parameters have
achieved good results in some settings and have gained popularity. However,
merging models initialized differently that do not share a part of their
training trajectories can yield worse results than simply using the base
models, even after aligning their neurons. In this paper, we introduce a novel
approach, Non-uniform Parameter-wise Model Merging, or NP Merge, which merges
models by learning the contribution of each parameter to the final model using
gradient-based optimization. We empirically demonstrate the effectiveness of
our method for merging models of various architectures in multiple settings,
outperforming past methods. We also extend NP Merge to handle the merging of
multiple models, showcasing its scalability and robustness.
Authors' comments: 9 pages, 1 figure, to be published in the Proceedings of the 9th IEEE
Special Session on Machine Learning on Big Data (MLBD 2024)
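A toy version of learning a per-parameter contribution by gradient descent is sketched below. The abstract does not give NP Merge's parameterization or objective, so this assumes a sigmoid-gated interpolation $w_i = \sigma(\alpha_i)\,w^{A}_i + (1-\sigma(\alpha_i))\,w^{B}_i$ between two linear models, trained with mean squared error on made-up data. All names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def merge(alpha, w_a, w_b):
    """Interpolate each parameter with its own learned coefficient."""
    return [sigmoid(a) * wa + (1.0 - sigmoid(a)) * wb for a, wa, wb in zip(alpha, w_a, w_b)]

def mse(w, xs, ys):
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learn_coefficients(w_a, w_b, xs, ys, lr=0.5, steps=500):
    alpha = [0.0] * len(w_a)  # sigmoid(0) = 0.5, i.e. start from uniform averaging
    for _ in range(steps):
        w = merge(alpha, w_a, w_b)
        residuals = [sum(wi * xi for wi, xi in zip(w, x)) - y for x, y in zip(xs, ys)]
        for i in range(len(alpha)):
            grad_w = 2.0 * sum(r * x[i] for r, x in zip(residuals, xs)) / len(xs)
            s = sigmoid(alpha[i])
            # Chain rule: dL/dalpha_i = dL/dw_i * (w_a_i - w_b_i) * s * (1 - s)
            alpha[i] -= lr * grad_w * (w_a[i] - w_b[i]) * s * (1.0 - s)
    return alpha

# Toy setup: model A is right about the first weight, model B about the second.
w_a, w_b = [1.0, 0.0], [0.0, -2.0]
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
ys = [x0 - 2.0 * x1 for x0, x1 in xs]

uniform_loss = mse(merge([0.0, 0.0], w_a, w_b), xs, ys)
alpha = learn_coefficients(w_a, w_b, xs, ys)
learned_loss = mse(merge(alpha, w_a, w_b), xs, ys)
print(uniform_loss, learned_loss)
```

In this example the learned coefficients drift toward model A for the first parameter and toward model B for the second, which uniform averaging cannot do.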
Siyang Zhang, Ser-Nam Lim
Generating long-duration videos has always been a significant challenge due to the inherent complexity of the spatio-temporal domain and the substantial GPU memory required to compute very large tensors. While diffusion-based generative models achieve state-of-the-art performance in video generation, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with a specific resolution and length must be specified first, and the model denoises the entire video tensor, all frames together, simultaneously. Such an approach easily runs out of memory (OOM) when the specified resolution and/or length exceeds a certain limit. One solution to this problem is to generate many short video chunks autoregressively with strong inter-chunk spatio-temporal relations and then concatenate them to form a long video. In this approach, a long video generation task is divided into multiple short video generation subtasks, and the cost of each subtask is reduced to a feasible level. In this paper, we conduct a detailed survey on long video generation with the autoregressive chunk-by-chunk strategy. We address common problems caused by applying short image-to-video models to long video tasks and design an efficient $k$-step search solution to mitigate these problems.
Francesco Della Santa, Antonio Mastropietro, Sandra Pieraccini, Francesco Vaccarino
The problem of multi-task regression over graph nodes has been recently approached through Graph-Instructed Neural Network (GINN), which is a promising architecture belonging to the subset of message-passing graph neural networks. In this work, we discuss the limitations of the Graph-Instructed (GI) layer, and we formalize a novel edge-wise GI (EWGI) layer. We discuss the advantages of the EWGI layer and we provide numerical evidence that EWGINNs perform better than GINNs over some graph-structured input data, like the ones inferred from the Barabasi-Albert graph, and improve the training regularization on graphs with chaotic connectivity, like the ones inferred from the Erdos-Renyi graph.