Sofia Casarin, Sergio Escalera, Oswald Lanz
Training-free Neural Architecture Search (NAS) efficiently identifies
high-performing neural networks using zero-cost (ZC) proxies. Unlike multi-shot
and one-shot NAS approaches, ZC-NAS is both (i) time-efficient, eliminating the
need for model training, and (ii) interpretable, with proxy designs often
theoretically grounded. Despite rapid developments in the field, current SOTA
ZC proxies are typically constrained to well-established convolutional search
spaces. With the rise of Large Language Models shaping the future of deep
learning, this work extends ZC proxy applicability to Vision Transformers
(ViTs). We present a new benchmark using the Autoformer search space evaluated
on 6 distinct tasks and propose Layer-Sample Wise Activation with Gradients
information (L-SWAG), a novel, generalizable metric that characterizes both
convolutional and transformer architectures across 14 tasks. Additionally,
previous works highlighted how different proxies contain complementary
information, motivating the need for an ML model to identify useful
combinations. To further enhance ZC-NAS, we therefore introduce LIBRA-NAS (Low
Information gain and Bias Re-Alignment), a method that strategically combines
proxies to best represent a specific benchmark. Integrated into the NAS search,
LIBRA-NAS outperforms evolution and gradient-based NAS techniques by
identifying an architecture with a 17.0% test error on ImageNet1k in just 0.1
GPU days.
Authors' comments: accepted at CVPR 2025
Austin Braniff, Yuhe Tian
This work formally introduces Y-wise Affine Neural Networks (YANNs), a fully explainable network architecture that continuously and efficiently represents piecewise affine functions with polytopic subdomains. Following from the proofs, it is shown that the development of YANNs requires no training to achieve the functionally equivalent representation. YANNs thus maintain all mathematical properties of the original formulations. Multi-parametric model predictive control is utilized as an application showcase of YANNs, which theoretically computes optimal control laws as a piecewise affine function of states, outputs, setpoints, and disturbances. With the exact representation of multi-parametric control laws, YANNs retain essential control-theoretic guarantees such as recursive feasibility and stability. This sets YANNs apart from the existing works which apply neural networks for approximating optimal control laws instead of exactly representing them. By optimizing the inference speed of the networks, YANNs can evaluate substantially faster in real-time compared to traditional piecewise affine function calculations. Numerical case studies are presented to demonstrate the algorithmic scalability with respect to the input/output dimensions and the number of subdomains. YANNs represent a significant advancement in control as the first neural network-based controller that inherently ensures both feasibility and stability. Future applications can leverage them as an efficient and interpretable starting point for data-driven modeling/control.
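For context, the "traditional piecewise affine function calculation" mentioned above is commonly done as a sequential point-location search: find the polytope that contains the input, then apply that region's affine law. The sketch below illustrates that baseline only, not the YANN construction itself. The region data (a one-dimensional saturated feedback law) and the names `Region` and `evaluate_pwa` are hypothetical and chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
# One region: polytope {x : A x <= b} paired with the affine law u = F x + g.
Region = Tuple[Matrix, Vector, Matrix, Vector]


def matvec(m: Matrix, x: Sequence[float]) -> Vector:
    """Plain matrix-vector product on nested lists."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def evaluate_pwa(regions: List[Region], x: Sequence[float],
                 tol: float = 1e-9) -> Optional[Vector]:
    """Return F_i x + g_i for the first region i containing x, else None."""
    for A, b, F, g in regions:
        inside = all(ax <= bi + tol for ax, bi in zip(matvec(A, x), b))
        if inside:
            return [fx + gi for fx, gi in zip(matvec(F, x), g)]
    return None  # x lies outside every subdomain


# Hypothetical law: u = clip(-2x, -1, 1) defined on x in [-2, 2].
regions: List[Region] = [
    ([[1.0], [-1.0]], [-0.5, 2.0], [[0.0]], [1.0]),    # -2 <= x <= -0.5: u = 1
    ([[1.0], [-1.0]], [0.5, 0.5], [[-2.0]], [0.0]),    # -0.5 <= x <= 0.5: u = -2x
    ([[1.0], [-1.0]], [2.0, -0.5], [[0.0]], [-1.0]),   # 0.5 <= x <= 2: u = -1
]

if __name__ == "__main__":
    for point in ([-1.0], [0.25], [1.5], [3.0]):
        print(point, evaluate_pwa(regions, point))
```

The cost of this search grows with the number of subdomains, which is the scaling dimension the abstract refers to.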
Moritz Vandenhirtz, Julia E. Vogt
Understanding the decision-making process of machine learning models provides
valuable insights into the task, the data, and the reasons behind a model's
failures. In this work, we propose a method that performs inherently
interpretable predictions through the instance-wise sparsification of input
images. To align the sparsification with human perception, we learn the masking
in the space of semantically meaningful pixel regions rather than on
pixel-level. Additionally, we introduce an explicit way to dynamically
determine the required level of sparsity for each instance. We show empirically
on semi-synthetic and natural image datasets that our inherently interpretable
classifier produces more meaningful, human-understandable predictions than
state-of-the-art benchmarks.
Authors' comments: International Conference on Machine Learning
Ingrid Pelisoli, T. R. Marsh, G. Tovmassian, L. A. Amaral, Amornrat Aungwerojwit, M. J. Green, R. P. Ashley, David A. H. Buckley et al.
After its discovery in 2016, the white dwarf binary AR Scorpii (AR Sco)
remained for several years the only white dwarf system to show pulsed radio
emission associated with a fast-spinning white dwarf. The evolutionary origin
and the emission mechanism for AR Sco are not completely understood, with
different models proposed. Testing and improving these models requires
observational input. Here we report the results of a targeted search for other
binary white dwarf pulsars like AR Sco. Using data from Gaia and WISE, we
identified 56 candidate systems with similar properties to AR Sco, of which 26
were previously uncharacterised. These were subject to spectroscopic and
photometric follow-up observations. Aside from one new binary white dwarf
pulsar found, J191213.72-441045.1, which was reported in a separate work, we
find no other systems whose characteristics are akin to AR Sco. The newly
characterised systems are primarily young stellar objects (with 10 found) or
cataclysmic variables (7 identifications), with the remaining being either
blended or non-variable on short timescales.
Authors' comments: 17 pages, 21 figures. Accepted for publication in MNRAS
Shiwei Guo, Ziang Chen, Yupeng Ma, Yunfei Han, Yi Wang
The Transformer model has shown strong performance in multivariate time series forecasting by leveraging channel-wise self-attention. However, this approach lacks temporal constraints when computing temporal features and does not utilize cumulative historical series effectively. To address these limitations, we propose the Structured Channel-wise Transformer with Cumulative Historical state (SCFormer). SCFormer introduces temporal constraints to all linear transformations, including the query, key, and value matrices, as well as the fully connected layers within the Transformer. Additionally, SCFormer employs High-order Polynomial Projection Operators (HiPPO) to deal with cumulative historical time series, allowing the model to incorporate information beyond the look-back window during prediction. Extensive experiments on multiple real-world datasets demonstrate that SCFormer significantly outperforms mainstream baselines, highlighting its effectiveness in enhancing time series forecasting. The code is publicly available at https://github.com/ShiweiGuo1995/SCFormer
Avalpreet Singh Brar, Rong Su, Christos G. Cassandras, Gioele Zardini
Traditional rebalancing methods in ride-hailing systems direct idle drivers to fixed destinations, overlooking the fact that ride allocations frequently occur while cruising. This destination-centric view fails to exploit the path-dependent nature of modern platforms, where real-time matching depends on the entire trajectory rather than a static endpoint. We propose the Wise Goose Chase (WGC) algorithm, an event-triggered, driver-specific path planning framework that anticipates future matching opportunities by forecasting spatio-temporal supply and demand dynamics. WGC uses a system of Retarded Functional Differential Equations (RFDEs) to model the evolution of idle driver density and passenger queues at the road-segment level, incorporating both en-route matching and competition among drivers. Upon request, WGC computes personalized cruising paths that minimize each driver's expected time to allocation. Monte Carlo simulations on synthetic urban networks show that WGC consistently outperforms baseline strategies, highlighting the advantage of predictive, context-aware rebalancing in dynamic mobility systems.
Yu-Hsiang Lan, Eric K. Oermann
There has been a recent surge of interest in time series modeling using the Transformer architecture. However, forecasting multivariate time series with Transformer presents a unique challenge as it requires modeling both temporal (cross-time) and variate (cross-variate) dependencies. While Transformer-based models have gained popularity for their flexibility in capturing both sequential and cross-variate relationships, it is unclear how to best integrate these two sources of information in the context of the Transformer architecture while optimizing for both performance and efficiency. We re-purpose the Transformer architecture to effectively model both cross-time and cross-variate dependencies. Our approach begins by embedding each variate independently into a variate-wise representation that captures its cross-time dynamics, and then models cross-variate dependencies through attention mechanisms on these learned embeddings. Gating operations in both cross-time and cross-variate modeling phases regulate information flow, allowing the model to focus on the most relevant features for accurate predictions. Our method achieves state-of-the-art performance across 13 real-world datasets and can be seamlessly integrated into other Transformer-based and LLM-based forecasters, delivering performance improvements up to 20.7% over original models. Code is available at this repository: https://github.com/nyuolab/Gateformer.
Raffaele Di Santo, Dikran Dikranjan, Anna Giordano Bruno, Hans Weber
According to Cartan, given an ideal $\mathcal I$ of $\mathbb N$, a sequence $(x_n)_{n\in\mathbb N}$ in the circle group $\mathbb T$ is said to {\em $\mathcal I$-converge} to a point $x\in \mathbb T$ if $\{n\in \mathbb N: x_n \not \in U\}\in \mathcal I$ for every neighborhood $U$ of $x$ in $\mathbb T$. For a sequence $\mathbf u=(u_n)_{n\in\mathbb N}$ in $\mathbb Z$, let $$t_{\mathbf u}^\mathcal I(\mathbb T) :=\{x\in \mathbb T: u_nx \ \text{$\mathcal I$-converges to}\ 0 \}.$$ This set is a Borel (hence, Polishable) subgroup of $\mathbb T$ with many nice properties, largely studied in the case when $\mathcal I = \mathcal F in$ is the ideal of all finite subsets of $\mathbb N$ (so $\mathcal F in$-convergence coincides with the usual one) for its remarkable connection to topological algebra, descriptive set theory and harmonic analysis. We give a complete element-wise description of $t_{\mathbf u}^\mathcal I(\mathbb T)$ when $u_n\mid u_{n+1}$ for every $n\in\mathbb N$ and under suitable hypotheses on $\mathcal I$. In the special case when $\mathcal I =\mathcal F in$, we obtain an alternative proof of a simplified version of a known result.
Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu et al.
Multimodal agents, which integrate a controller (e.g., a vision language
model) with external tools, have demonstrated remarkable capabilities in
tackling complex multimodal tasks. Existing approaches for training these
agents, both supervised fine-tuning and reinforcement learning, depend on
extensive human-annotated task-answer pairs and tool trajectories. However, for
complex multimodal tasks, such annotations are prohibitively expensive or
impractical to obtain. In this paper, we propose an iterative tool usage
exploration method for multimodal agents without any pre-collected data, namely
SPORT, via step-wise preference optimization to refine the trajectories of tool
usage. Our method enables multimodal agents to autonomously discover effective
tool usage strategies through self-exploration and optimization, eliminating
the bottleneck of human annotation. SPORT has four iterative components: task
synthesis, step sampling, step verification, and preference tuning. We first
synthesize multimodal tasks using language models. Then, we introduce a novel
trajectory exploration scheme, where step sampling and step verification are
executed alternately to solve synthesized tasks. In step sampling, the agent
tries different tools and obtains corresponding results. In step verification,
we employ a verifier to provide AI feedback to construct step-wise preference
data. The data is subsequently used to update the controller for tool usage
through preference tuning, producing a SPORT agent. By interacting with real
environments, the SPORT agent gradually evolves into a more refined and capable
system. Evaluation in the GTA and GAIA benchmarks shows that the SPORT agent
achieves 6.41% and 3.64% improvements, underscoring the generalization and
effectiveness introduced by our method. The project page is
https://SPORT-Agents.github.io.
Authors' comments: 24 pages
Changjun Li, Runqing Jiang, Zhuo Song, Pengpeng Yu, Ye Zhang, Yulan Guo
Post-training quantization (PTQ) has evolved as a prominent solution for compressing complex models, which advocates a small calibration dataset and avoids end-to-end retraining. However, most existing PTQ methods employ block-wise reconstruction, which neglects cross-block dependency and exhibits a notable accuracy drop in low-bit cases. To address these limitations, this paper presents a novel PTQ method, dubbed Pack-PTQ. First, we design a Hessian-guided adaptive packing mechanism to partition blocks into non-overlapping packs, which serve as the base unit for reconstruction, thereby preserving the cross-block dependency and enabling accurate estimation of quantization parameters. Second, based on the pack configuration, we propose a mixed-precision quantization approach to assign varied bit-widths to packs according to their distinct sensitivities, thereby further enhancing performance. Extensive experiments on 2D image and 3D point cloud classification tasks, using various network architectures, demonstrate the superiority of our method over the state-of-the-art PTQ methods.
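To make the role of bit-width concrete, the sketch below applies plain symmetric uniform quantization to a small weight vector at several bit-widths and reports the resulting error. This is a generic textbook quantizer used only to illustrate why low-bit settings are harder. It is not the Pack-PTQ method, and the sample weights and function names are assumptions.

```python
from typing import Dict, List


def quantize_dequantize(weights: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization to `bits` bits, then dequantization.

    The single quantization parameter here is the scale (step size),
    derived from the largest absolute weight. Requires bits >= 2.
    """
    qmax = 2 ** (bits - 1) - 1          # largest representable integer level
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)            # nothing to quantize
    scale = max_abs / qmax
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code, clamped
        restored.append(q * scale)
    return restored


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights, chosen only for illustration.
weights = [0.9, -0.35, 0.12, 0.03, -0.6]
errors: Dict[int, float] = {
    bits: mse(weights, quantize_dequantize(weights, bits)) for bits in (2, 4, 8)
}

if __name__ == "__main__":
    for bits, err in errors.items():
        print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

For this sample the error shrinks as the bit-width grows, which is the trade-off a mixed-precision scheme has to balance against model size.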
Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu et al.
Existing large language model (LLM) serving systems fall into two categories:
1) a unified system where prefill phase and decode phase are co-located on the
same GPU, sharing the unified computational resource and storage, and 2) a
disaggregated system where the two phases are disaggregated to different GPUs.
The design of the disaggregated system addresses the latency interference and
sophisticated scheduling issues in the unified system but leads to storage
challenges including 1) replicated weights for both phases that prevent
flexible deployment, 2) KV cache transfer overhead between the two phases, 3)
storage imbalance that causes substantial wasted space of the GPU capacity, and
4) suboptimal resource adjustment arising from the difficulties in migrating KV
cache. Such storage inefficiency delivers poor serving performance under high
request rates.
In this paper, we identify that the advantage of the disaggregated system
lies in the disaggregated computation, i.e., partitioning the computational
resource to enable the asynchronous computation of two phases. Thus, we propose
a novel LLM serving system, semi-PD, characterized by disaggregated computation
and unified storage. In semi-PD, we introduce a computation resource controller
to achieve disaggregated computation at the streaming multi-processor (SM)
level, and a unified memory manager to manage the asynchronous memory access
from both phases. semi-PD has a low-overhead resource adjustment mechanism
between the two phases, and a service-level objective (SLO) aware dynamic
partitioning algorithm to optimize the SLO attainment. Compared to
state-of-the-art systems, semi-PD maintains lower latency at higher request
rates, reducing the average end-to-end latency per request by 1.27-2.58x on
DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency
constraints on Llama series models.
Authors' comments: 18 pages, 16 figures
Brian K. S. Isaac-Medina, Toby P. Breckon
Deep neural networks have demonstrated great generalization capabilities for
tasks whose training and test sets are drawn from the same distribution.
Nevertheless, out-of-distribution (OOD) detection remains a challenging task
that has received significant attention in recent years. Specifically, OOD
detection refers to the detection of instances that do not belong to the
training distribution, while still having good performance on the
in-distribution task (e.g., classification or object detection). Recent work
has focused on generating synthetic outliers and using them to train an outlier
detector, generally achieving better OOD detection than traditional OOD
methods. In this regard, outliers can be generated either in feature or pixel
space. Feature space driven methods have shown strong performance on both the
classification and object detection tasks, at the expense that the
visualization of training outliers remains unknown, making further analysis on
OOD failure modes challenging. On the other hand, pixel space outlier
generation techniques enabled by diffusion models have been used for image
classification, providing improved OOD detection performance and outlier
visualization, although their adaptation to the object detection task is as yet
unexplored. We therefore introduce Dream-Box, a method that provides a link to
object-wise outlier generation in the pixel space for OOD detection.
Specifically, we use diffusion models to generate object-wise outliers that are
used to train an object detector for an in-distribution task and OOD detection.
Our method achieves comparable performance to previous traditional methods
while being the first technique to provide concrete visualization of generated
OOD objects.
Authors' comments: 9 pages, 6 figures, 2 tables, LatinX in AI CVPR 2025 Workshop
Dasol Jeong, Donggoo Kang, Jiwon Park, Hyebean Lee, Joonki Paik
We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. By introducing a stage-wise latent injection strategy (shape injection in early steps and attribute injection in later steps), we enable precise, fine-grained modifications while maintaining global consistency. Cross-attention with reference latents facilitates semantic alignment between the source and reference. Extensive experiments across expression transfer, texture transformation, and style infusion demonstrate state-of-the-art performance, confirming the method's scalability and adaptability to diverse image editing scenarios.
Hanyu Zhang, Zhen Xing, Wenxuan Yang, Chenxi Ma, Weimin Tan, Bo Yan
As transfer learning models and datasets grow larger, efficient adaptation
and storage optimization have become critical needs. Coreset selection
addresses these challenges by identifying and retaining the most informative
samples, constructing a compact subset for target domain training. However,
current methods primarily rely on instance-level difficulty assessments,
overlooking crucial category-level characteristics and consequently
under-representing minority classes. To overcome this limitation, we propose
Non-Uniform Class-Wise Coreset Selection (NUCS), a novel framework that
integrates both class-level and instance-level criteria. NUCS automatically
allocates data selection budgets for each class based on intrinsic category
difficulty and adaptively selects samples within optimal difficulty ranges. By
explicitly incorporating category-specific insights, our approach achieves a
more balanced and representative coreset, addressing key shortcomings of prior
methods. Comprehensive theoretical analysis validates the rationale behind
adaptive budget allocation and sample selection, while extensive experiments
across 14 diverse datasets and model architectures demonstrate NUCS's
consistent improvements over state-of-the-art methods, achieving superior
accuracy and computational efficiency. Notably, on CIFAR100 and Food101, NUCS
matches full-data training accuracy while retaining just 30% of samples and
reducing computation time by 60%. Our work highlights the importance of
characterizing category difficulty in coreset selection, offering a robust and
data-efficient solution for transfer learning.
Authors' comments: 11 pages
Thanh-Dung Le, Vu Nguyen Ha, Ti Ti Nguyen, Geoffrey Eappen, Prabhu Thiruvasagam, Hong-fu Chou, Duc-Dung Tran, Hung Nguyen-Kha et al.
This study introduces ResNet-GLUSE, a lightweight ResNet variant enhanced
with Gated Linear Unit-enhanced Squeeze-and-Excitation (GLUSE), an adaptive
channel-wise attention mechanism. By integrating dynamic gating into the
traditional SE framework, GLUSE improves feature recalibration while
maintaining computational efficiency. Experiments on EuroSAT and PatternNet
datasets confirm its effectiveness, exceeding 94% and 98% accuracy,
respectively. While MobileViT achieves 99% accuracy, ResNet-GLUSE offers 33x
fewer parameters, 27x fewer FLOPs, 33x smaller model size (MB), approximately
6x lower power consumption (W), and approximately 3x faster inference time (s),
making it significantly more efficient for onboard satellite deployment.
Furthermore, due to its simplicity, ResNet-GLUSE can be easily mimicked for
neuromorphic computing, enabling ultra-low power inference at just 852.30 mW
on Akida Brainchip. This balance between
high accuracy and ultra-low resource consumption establishes ResNet-GLUSE as a
practical solution for real-time Earth Observation (EO) tasks. Reproducible
codes are available in our shared repository.
Authors' comments: Under review. arXiv admin note: text overlap with arXiv:2411.00209
Yuibi Gomi, Akira Sato, Waleed Madany, Kenichi Okada, Satoshi Adachi, Masatoshi Itoh, Masanori Hashimoto
We developed a 55 nm CMOS SRAM chip that scans all data every 125 ns and outputs timestamped soft error data via an SPI interface through a FIFO. The proposed system, consisting of the developed chip and particle detectors, enables event-wise soft error measurement and precise identification of SBUs and MCUs, thus resolving misclassifications such as Pseudo- and Distant MCUs that conventional methods cannot distinguish. An 80-MeV proton irradiation experiment at RASiS, Tohoku University verified the system operation. Timestamps between the SRAM chip and the particle detectors were successfully synchronized, accounting for PLL disturbances caused by radiation. Event building was achieved by determining a reset offset with sub-ns resolution, and spatial synchronization was maintained within several tens of micrometers.
Javier Muñoz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez
In an increasingly digitalized world, verifying the authenticity of ID documents has become a critical challenge for real-life applications such as digital banking, crypto-exchanges, renting, etc. This study focuses on the topic of fake ID detection, covering several limitations in the field. In particular, no publicly available data from real ID documents exists, and most studies rely on proprietary in-house databases that are not available due to privacy reasons. In order to shed some light on this critical challenge that makes it difficult to advance in the field, we explore a trade-off between privacy (i.e., amount of sensitive data available) and performance, proposing a novel patch-wise approach for privacy-preserving fake ID detection. Our proposed approach explores how privacy can be enhanced through: i) two levels of anonymization for an ID document (i.e., fully- and pseudo-anonymized), and ii) different patch size configurations, varying the amount of sensitive data visible in the patch image. Also, state-of-the-art methods such as Vision Transformers and Foundation Models are considered in the analysis. The experimental framework shows that, on an unseen database (DLC-2021), our proposal achieves 13.91% and 0% EERs at patch and ID document level, showing a good generalization to other databases. In addition to this exploration, another key contribution of our study is the release of the first publicly available database that contains 48,400 patches from both real and fake ID documents, along with the experimental framework and models, which will be available in our GitHub.
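The results above are reported as Equal Error Rates (EERs), the operating point where the false acceptance rate equals the false rejection rate. As a reading aid, here is a minimal sketch of how an EER can be estimated from two lists of detector scores with a threshold sweep. The score convention (higher means "more likely a real document"), the sample scores, and the function name are assumptions and do not come from the study.

```python
from math import inf
from typing import List


def equal_error_rate(genuine: List[float], impostor: List[float]) -> float:
    """Estimate the EER by sweeping thresholds over all observed scores.

    A sample is accepted when its score is >= threshold.
    FAR = share of impostor (fake) samples accepted.
    FRR = share of genuine (real) samples rejected.
    With finite data FAR and FRR rarely match exactly, so the threshold
    with the smallest |FAR - FRR| is used and their mean is returned.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap, best_eer = inf, 1.0
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Hypothetical scores: the first pair is perfectly separable,
    # the second pair overlaps.
    print(equal_error_rate([0.8, 0.9, 0.95], [0.1, 0.2, 0.3]))
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6]))
```

An EER of 0% at document level, as reported, corresponds to the first case: some threshold separates all real documents from all fake ones in the evaluated set.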
G. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona, Tanguy Urvoy
The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as image or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of the columns may vary wildly from one table to another. We tackle the tough problem of detecting synthetic tabular data "in the wild", i.e. when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.
Samy-Melwan Vilhes, Gilles Gasso, Mokhtar Z Alaya
Time series anomaly detection (TSAD) focuses on identifying whether observations in streaming data deviate significantly from normal patterns. With the prevalence of connected devices, anomaly detection on time series has become paramount, as it enables real-time monitoring and early detection of irregular behaviors across various application domains. In this work, we introduce PatchTrAD, a Patch-based Transformer model for time series anomaly detection. Our approach leverages a Transformer encoder along with the use of patches under a reconstruction-based framework for anomaly detection. Empirical evaluations on multiple benchmark datasets show that PatchTrAD is on par, in terms of detection performance, with state-of-the-art deep learning models for anomaly detection while being time efficient during inference.
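The two ingredients named above, patching and reconstruction-based scoring, can be illustrated without any model. In the sketch below a window is cut into patches and each patch is scored by its mean squared reconstruction error. The reconstructor is a deliberately trivial stand-in (it replaces a patch by its own mean) used in place of the Transformer encoder, and all names and values are hypothetical.

```python
from typing import Callable, List

Patch = List[float]


def make_patches(window: List[float], patch_len: int, stride: int) -> List[Patch]:
    """Slice a window into patches of length `patch_len`, stepping by `stride`."""
    last_start = len(window) - patch_len
    return [window[i:i + patch_len] for i in range(0, last_start + 1, stride)]


def reconstruction_scores(patches: List[Patch],
                          reconstruct: Callable[[Patch], Patch]) -> List[float]:
    """Anomaly score per patch: mean squared error between patch and reconstruction."""
    scores = []
    for patch in patches:
        rebuilt = reconstruct(patch)
        scores.append(sum((a - b) ** 2 for a, b in zip(patch, rebuilt)) / len(patch))
    return scores


def mean_reconstructor(patch: Patch) -> Patch:
    """Placeholder model (an assumption): rebuilds a patch as its constant mean."""
    mean = sum(patch) / len(patch)
    return [mean] * len(patch)


if __name__ == "__main__":
    # Hypothetical window with a single spike in the second patch.
    window = [1.0, 1.0, 1.0, 1.0, 1.0, 9.0, 1.0, 1.0]
    patches = make_patches(window, patch_len=4, stride=4)
    print(reconstruction_scores(patches, mean_reconstructor))
```

A patch that the model reconstructs poorly receives a high score and is flagged once the score exceeds a chosen threshold. In a learned setting the reconstructor would be trained on normal data only.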
Rita Sevastjanova, Robin Gerling, Thilo Spinner, Mennatallah El-Assady
Large language models (LLMs) represent words through contextual word embeddings encoding different language properties like semantics and syntax. Understanding these properties is crucial, especially for researchers investigating language model capabilities, employing embeddings for tasks related to text similarity, or evaluating the reasons behind token importance as measured through attribution methods. Applications for embedding exploration frequently involve dimensionality reduction techniques, which reduce high-dimensional vectors to two dimensions used as coordinates in a scatterplot. This data transformation step introduces uncertainty that can be propagated to the visual representation and influence users' interpretation of the data. To communicate such uncertainties, we present LayerFlow - a visual analytics workspace that displays embeddings in an interlinked projection design and communicates the transformation, representation, and interpretation uncertainty. In particular, to hint at potential data distortions and uncertainties, the workspace includes several visual components, such as convex hulls showing 2D and HD clusters, data point pairwise distances, cluster summaries, and projection quality metrics. We show the usability of the presented workspace through replication and expert case studies that highlight the need to communicate uncertainty through multiple visual components and different data perspectives.