Jingyu Xiao, Jiantong Qin, Shuoqi Li, Man Ho Lam, Yuxuan Wan, Jen-tse Huang, Yintong Huo, Michael R. Lyu
Multimodal Large Language Models (MLLMs) have demonstrated strong performance on the UI-to-code task, which aims to generate UI code from design mock-ups. However, when applied to long and complex websites, they often struggle with fragmented segmentation, redundant code generation for repetitive components, and frequent UI inconsistencies. To systematically investigate and address these challenges, we introduce ComUIBench, a new multi-page complex webpage benchmark with component annotations, designed to evaluate MLLMs' ability to generate reusable UI code in realistic website scenarios. Building upon this benchmark, we propose ComUICoder, a component-based UI code generation framework that emphasizes semantic-aware segmentation, code reuse, and fine-grained refinement. Specifically, ComUICoder incorporates (1) Hybrid Semantic-aware Block Segmentation for accurate detection of semantically coherent UI blocks, (2) Visual-aware Graph-based Block Merge to consolidate structurally similar components within and across webpages for reusable implementation, and (3) Priority-based Element-wise Feedback to refine generated code and reduce element-level inconsistencies. Extensive experiments demonstrate that ComUICoder significantly improves overall generation quality and code reusability on complex multi-page websites. Our datasets and code are publicly available at https://github.com/WebPAI/ComUICoder.
Jihun Kim, Javad Lavaei
The least-squares estimator has achieved considerable success in learning linear dynamical systems from a single trajectory of length $T$. While it attains an optimal error of $\mathcal{O}(1/\sqrt{T})$ under independent zero-mean noise, it lacks robustness and is particularly susceptible to adversarial corruption. In this paper, we consider the identification of a networked system in which every node is subject to both noise and adversarial attacks. We assume that every node is independently corrupted with probability smaller than $0.5$ at each time, placing the overall system under almost-persistent local attack. We first show that no convex one-stage estimator can achieve a consistent estimate as $T$ grows under both noise and attacks. This motivates the development of a two-stage estimation method applied across nodes. In Stage I, we leverage the $\ell_1$-norm estimator and derive an estimation error bound proportional to the noise level $σ_w$. This bound is subsequently used to detect and filter out attacks, producing a clean dataset for each node, to which we apply the least-squares estimator in Stage II. The resulting estimation error is on the order $\mathcal{O}(1/\sqrt{T})$ plus the product of $σ_w$ and the number of misclassifications. In the event of perfect separability between attack and non-attack data, which occurs when injected attacks are sufficiently large relative to the noise scale, our two-stage estimator is consistent for the true system.
Authors' comments: 33 pages
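The two-stage idea in the abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the toy system matrix `A_true`, the attack model (random-sign jumps of size 5 with probability 0.3 per node), the fixed `threshold`, and the use of a linear program via `scipy.optimize.linprog` to compute the $\ell_1$-norm fit.

```python
import numpy as np
from scipy.optimize import linprog


def l1_regression(X, y):
    """Stage I fit: minimize sum_t |y_t - X_t @ a|, written as a linear program."""
    T, n = X.shape
    # Variables are a (n entries, free) followed by slacks u (T entries, u_t >= |residual_t|).
    c = np.concatenate([np.zeros(n), np.ones(T)])
    A_ub = np.block([[X, -np.eye(T)], [-X, -np.eye(T)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * n + [(0, None)] * T
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]


def two_stage(X, Y, threshold):
    """Per node: l1 fit, drop samples with large residuals, then least squares."""
    n = Y.shape[1]
    A_hat = np.zeros((n, X.shape[1]))
    for i in range(n):
        a_stage1 = l1_regression(X, Y[:, i])
        keep = np.abs(Y[:, i] - X @ a_stage1) <= threshold  # samples flagged as clean
        A_hat[i] = np.linalg.lstsq(X[keep], Y[keep, i], rcond=None)[0]
    return A_hat


# Toy data (hypothetical): x_{t+1} = A x_t + noise + attack.
rng = np.random.default_rng(0)
n, T, sigma_w = 2, 300, 0.05
A_true = np.array([[0.5, 0.2], [-0.1, 0.4]])
x = np.zeros((T + 1, n))
for t in range(T):
    noise = sigma_w * rng.standard_normal(n)
    attacked = rng.random(n) < 0.3  # each node attacked independently
    attack = attacked * rng.choice([-5.0, 5.0], size=n)
    x[t + 1] = A_true @ x[t] + noise + attack
X, Y = x[:-1], x[1:]

A_ls = np.linalg.lstsq(X, Y, rcond=None)[0].T  # plain least squares on all data
A_two = two_stage(X, Y, threshold=2.5)
err_ls = np.linalg.norm(A_ls - A_true)
err_two = np.linalg.norm(A_two - A_true)
```

In this toy setting the attacks are much larger than the noise, so the residual threshold separates attacked from clean samples; with smaller attacks the filter would misclassify some samples, which is the regime the abstract's error bound addresses.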
A. Trudeau, Anthony H. Gonzalez, K. Thongkham, M. Brodwin, Thomas Connor, Peter R. M. Eisenhardt, Emily Moravec, S. A. Stanford et al.
The splashback radius, the radius of the apocenter of the first orbit of infalling material, is a measurable quantity marking the boundary between a galaxy cluster and its infalling region. We report detections of splashback radii in total light stacks, i.e. image stacks centered on the cores of galaxy clusters. Our analysis uses Wide-field Infrared Survey Explorer (WISE) W1 and W2 images of 83,345 candidate clusters at $0.5 \lesssim z \lesssim 1.9$ from the Massive and Distant Clusters of WISE Survey 2 (MaDCoWS2). The clusters are organized in stacks by redshift and signal-to-noise ($S\slash N$) ratios. We adopt a statistical approach, using 1000 bootstrap realizations to determine the median projected splashback radius and its confidence interval in a given bin. We compare our splashback radii with the measurements made by K. Thongkham et al. on a similar sample of MaDCoWS2 clusters using galaxy-cluster cross-correlation and find that they are consistent, although our method yields larger error bars. Our main systematic error is the accuracy of the background subtraction, but its impact remains small: the consistency of K. Thongkham et al. and our results suggests that neither method suffers from large systematics. The sensitivity of total light stacking to the contribution of faint galaxies can be advantageous to locate splashback radii when only the brightest galaxies are detected in individual images, such as at high redshifts. We present a potential application of this new technique to probe the evolution of the stellar mass in cluster infalling regions.
Authors' comments: Accepted by ApJ; 16 pages, 6 figures, 3 tables
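As a rough illustration of the bootstrap step mentioned in the abstract above, the sketch below resamples a sample with replacement 1000 times, recomputes a statistic on each realization, and reports the median with a percentile interval. The function name `bootstrap_median`, the 68% level, and the toy data are assumptions for illustration; the paper's actual pipeline measures a splashback radius on each resampled image stack, which is not reproduced here.

```python
import numpy as np


def bootstrap_median(sample, statistic, n_boot=1000, level=68.0, seed=0):
    """Median and percentile interval of `statistic` over bootstrap realizations."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    # Each realization draws n items with replacement and recomputes the statistic.
    estimates = np.array(
        [statistic(sample[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    )
    lo, hi = np.percentile(estimates, [50 - level / 2, 50 + level / 2])
    return float(np.median(estimates)), float(lo), float(hi)


# Toy stand-in data (hypothetical values, arbitrary units).
toy_rng = np.random.default_rng(1)
toy_radii = toy_rng.normal(loc=1.0, scale=0.2, size=200)
median, lo, hi = bootstrap_median(toy_radii, np.mean)
```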
Khunanon Thongkham, Anthony H. Gonzalez, Mark Brodwin, Ariane Trudeau, Peter Eisenhardt, S. A. Stanford, Emily Moravec, Thomas Connor et al.
The Massive and Distant Clusters of WISE Survey 2 (MaDCoWS2) is a WISE-selected catalog of galaxy clusters at $0.1<z<2$ covering an effective area of $>6000$ deg$^2$. In this paper, we derive splashback radii for this cluster ensemble from galaxy density profiles and constrain the mass threshold of the survey as a function of redshift. We use MaDCoWS2 cluster candidates at $0.4\leq z \leq 1.65$ divided into subsamples with different signal-to-noise (S/N$_{\rm P}$) and redshifts, cross-correlated with galaxies from the CatWISE2020 catalog, to obtain average surface density profiles. We perform a Markov Chain Monte Carlo analysis to derive parameter estimates for theoretical models consisting of orbiting and infalling terms. A distinct splashback feature is detected in all subsamples. The measured splashback radii span from $0.89^{+0.02}_{-0.02}h^{-1}$ comoving Mpc/cMpc ($0.61^{+0.02}_{-0.02}h^{-1}$ proper Mpc/pMpc) at $\overline{z}=0.45$ to $1.27^{+0.05}_{-0.05}h^{-1}$ cMpc ($0.53^{+0.04}_{-0.04}h^{-1}$ pMpc) at $\overline{z}=1.54$. We also find that splashback radii increase with $S/N_{\rm P}$ at fixed redshift. The resultant splashback radii constrain the redshift dependence of the mass of MaDCoWS2 clusters at fixed $S/N_{\rm P}$. We calculate $M_{\rm 200m}$ from the radii using a relation based on a cosmological simulation. MaDCoWS2 $M_{\rm 200m}$ values derived from the simulation-based relation are lower than the expected values based on weak-lensing observations. More robust mass constraints will come from calibrating splashback radii derived from galaxy density profiles with weak lensing shear profiles from facilities such as $\textit{Euclid}$, Rubin, and $\textit{Roman}$.
Authors' comments: 22 pages, 7 figures, 9 tables, submitted to ApJ
Lei Deng, Wenhao Huang, Chao Yang, Haoyuan Zheng, Yinbin Tian, Yue Ma
Defect depth quantification in additively manufactured (AM) components remains a significant challenge for non-destructive testing (NDT). This study proposes a Pixel-wise Quantitative Thermography Neural Network (PQT-Net) to address this challenge for polylactic acid (PLA) parts. A key innovation is a novel data augmentation strategy that reconstructs thermal sequence data into two-dimensional stripe images, preserving the complete temporal evolution of heat diffusion for each pixel. The PQT-Net architecture incorporates a pre-trained EfficientNetV2-S backbone and a custom Residual Regression Head (RRH) with learnable parameters to refine outputs. Comparative experiments demonstrate the superiority of PQT-Net over other deep learning models, achieving a minimum Mean Absolute Error (MAE) of 0.0094 mm and a coefficient of determination ($R^2$) exceeding 99%. The high precision of PQT-Net underscores its potential for robust quantitative defect characterization in AM.
Authors' comments: Under review
Adel Javanmard, David P. Woodruff
The Courtade-Kumar conjecture posits that dictatorship functions maximize the mutual information between the function's output and a noisy version of its input over the Boolean hypercube. We present two significant advancements related to this conjecture. First, we resolve an open question posed by Courtade and Kumar, proving that for any Boolean function (regardless of bias), the sum of mutual information between the function's output and the individual noisy input coordinates is bounded by $1-H(α)$, where $α$ is the noise parameter of the Binary Symmetric Channel. This generalizes their previous result which was restricted to balanced Boolean functions. Second, we advance the study of the main conjecture in the high noise regime. We establish an optimal error bound of $O(λ^2)$ for the asymptotic entropy expansion, where $λ= (1-2α)^2$, improving upon the previous best-known bounds. This refined analysis leads to a sharp, linear Fourier concentration bound for highly informative functions and significantly extends the range of the noise parameter $λ$ for which the conjecture is proven to hold.
Authors' comments: 16 pages
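For orientation, the quantities named in the abstract above are easy to evaluate numerically. The sketch below computes the bound $1-H(α)$, which is also the mutual information achieved by a dictatorship function through a Binary Symmetric Channel, and the parameter $λ=(1-2α)^2$. The leading-order comparison $1-H(α)\approx λ/(2\ln 2)$ is a standard Taylor expansion around $α=1/2$ and is not the paper's refined bound; the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def dictator_information(alpha):
    """I(X_1; Y) for a dictator f(X) = X_1 observed through BSC(alpha): 1 - H(alpha)."""
    return 1.0 - binary_entropy(alpha)


def lam(alpha):
    """Noise parameter lambda = (1 - 2*alpha)^2 used in the abstract."""
    return (1 - 2 * alpha) ** 2


def leading_order(alpha):
    """First term of the Taylor expansion of 1 - H(alpha) around alpha = 1/2."""
    return lam(alpha) / (2 * math.log(2))


# High-noise regime: alpha close to 1/2, so lambda is small.
alpha = 0.49
exact = dictator_information(alpha)
approx = leading_order(alpha)
```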
Tomoki Kubo, Ryuken Uda, Yusuke Iida
Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, which is delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon is referred to as "outliers," "massive activations," and "super activations" in recent large language models and evolves with re-generalization. The magnitude of the large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation," and support the proposal of a novel scenario for understanding deep double descent.
Authors' comments: 17 pages, 9 figures
Prasanjit Dubey, Xiaoming Huo
Identifying the most powerful test in multiple hypothesis testing under strong family-wise error rate (FWER) control is a fundamental problem in statistical methodology. State-of-the-art approaches formulate this as a constrained optimisation problem, for which a dual problem with strong duality has been established in a general sense. However, a constructive method for solving the dual problem is lacking, leaving a significant computational gap. This paper fills this gap by deriving novel, necessary optimality conditions for the dual optimisation. We show that these conditions motivate an efficient coordinate-wise algorithm for computing the optimal dual solution, which, in turn, provides the most powerful test for the primal problem. We prove the linear convergence of our algorithm, i.e., the computational complexity of our proposed algorithm is proportional to the logarithm of the reciprocal of the target error. To the best of our knowledge, this is the first time such a fast and computationally efficient algorithm has been proposed for finding the most powerful test with family-wise error rate control. The method's superior power is demonstrated through simulation studies, and its practical utility is shown by identifying new, significant findings in both clinical and financial data applications.
Hossein Teimoori Faal, Hasan Khodakarami
We study the Pascal determinantal arrays $\PD_k$, whose entries $\PD_k(i,j)$ are the $k\times k$ minors of the lower-triangular Pascal matrix $P=( \binom{a}{b} )_{a,b\ge 0}$. We prove an exact factorization of the row-wise log-concavity operator: \[ \LC(\PD_k)=\PD_{k-1}\Had\PD_{k+1}, \] where $\LC(a)_j=a_j^2-a_{j-1}a_{j+1}$ and $\Had$ denotes the Hadamard (entrywise) product. This identity is established by an elementary manipulation of the Desnanot--Jacobi (Dodgson) identity in two adjacent positions. We further prove a general inequality asserting that the log-concavity operator is submultiplicative under Hadamard products of log-concave arrays: $\LC(A\Had X)\ge\LC(A)\Had\LC(X)$. Combining the factorization with this inequality yields a uniform algebraic proof that every row of every array $\PD_k$ ($k\ge 1$) is infinitely log-concave, extending the celebrated theorem of McNamara and Sagan from Pascal's triangle ($\PD_1$) to the entire determinantal hierarchy. Applications include the log-convexity of $\{\PD_k(i,j)\}_{k\ge 0}$ in the determinantal order $k$ and a family of determinantal Hadamard inequalities.
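The log-concavity operator defined in the abstract above, $\LC(a)_j=a_j^2-a_{j-1}a_{j+1}$, can be tried directly on rows of Pascal's triangle (the case $\PD_1$). The sketch below is only an illustration: it treats entries outside the row as zero, which is an assumed boundary convention, and it checks non-negativity for a handful of iterations rather than proving infinite log-concavity.

```python
from math import comb


def lc(row):
    """Apply LC(a)_j = a_j^2 - a_{j-1} * a_{j+1}, with out-of-range entries taken as 0."""
    padded = [0] + list(row) + [0]
    return [
        padded[j] ** 2 - padded[j - 1] * padded[j + 1]
        for j in range(1, len(padded) - 1)
    ]


def stays_nonnegative(row, iterations):
    """Check that repeated application of LC keeps every entry non-negative."""
    current = list(row)
    for _ in range(iterations):
        current = lc(current)
        if any(value < 0 for value in current):
            return False
    return True


row4 = [comb(4, j) for j in range(5)]  # 1, 4, 6, 4, 1
once = lc(row4)                        # 1, 10, 20, 10, 1
twice = lc(once)                       # 1, 80, 300, 80, 1
checked = all(
    stays_nonnegative([comb(n, j) for j in range(n + 1)], 5) for n in range(11)
)
```

Python integers are arbitrary precision, so the rapidly growing entries produced by repeated application do not overflow.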
Barbara Steffen, Edward A. Lee, Moshe Y. Vardi, Bernhard Steffen
Artificial intelligence (AI) is no longer futuristic; it is a daily companion shaping our private and work lives. While AI simplifies our lives, its rise also invites us to rethink who we are - and who we wish to remain - as humans. Even if AI does not think, feel, or desire, it learns from our behavior, mirroring our collective values, biases, and aspirations. The question, then, is not what AI is, but what we are allowing it to become through data, computing power, and other parameters "teaching" it - and, even more importantly, who we are becoming through our relationship with AI. As the EU AI Act and the Vienna Manifesto on Digital Humanism emphasize, technology must serve human dignity, social well-being, and democratic accountability. In our opinion, responsible use of AI is not only a matter of code or law, but also of conscientious practice: how each of us engages with AI and teaches others to use it at home and at work. We propose Ten Commandments for the Wise and Responsible Use of AI, meant as a guideline for this very engagement. They closely align with Floridi and Cowls' five guiding principles for AI in society - beneficence, non-maleficence, autonomy, justice, and explicability.
Anuab Sen, Mir Sayeed Mohammad, Saibal Mukhopadhyay
We introduce SSMRadNet, the first multi-scale State Space Model (SSM) based detector for Frequency Modulated Continuous Wave (FMCW) radar that sequentially processes raw ADC samples through two SSMs. One SSM learns a chirp-wise feature by sequentially processing samples from all receiver channels within one chirp, and a second SSM learns a representation of a frame by sequentially processing chirp-wise features. The latent representations of a radar frame are decoded to perform segmentation and detection tasks. Comprehensive evaluations on the RADIal dataset show SSMRadNet has 10-33x fewer parameters and 60-88x less computation (GFLOPs) while being 3.7x faster than state-of-the-art transformer and convolution-based radar detectors at competitive performance for segmentation tasks.
Tsung-En Lin, Kuan-Yi Lee, Hung-Yi Lee
Large Audio-Language Models and Multi-Modal Large Language Models have demonstrated strong capabilities in tasks such as Audio Question Answering (AQA), Audio Captioning, and Automatic Speech Recognition (ASR). However, there is growing evidence that these models can hallucinate about the content of the audio. To address this issue, we probe the models' internal states and propose Adaptive Vector Steering (AVS), a method that better grounds generation in audio content. We also identify a strong correlation between output correctness and internal representations. Experiments show consistent performance gains across two models and two benchmarks. On the Audio Hallucination QA dataset, our method boosts the F1-score of Gemma from 0.550 to 0.619 and Qwen from 0.626 to 0.632. Furthermore, our method increases the accuracy of Qwen on MMAU from 0.548 to 0.592, marking an 8% relative increase. To the best of our knowledge, this is the first work to apply vector steering to mitigate hallucination in audio.
Authors' comments: Note: This preprint is a version of the paper submitted to ICASSP 2026. The author list here includes contributors who provided additional supervision and guidance. The official ICASSP submission may differ slightly in author composition
Shahriar Kabir Nahin, Wenxiao Xiao, Joshua Liu, Anshuman Chhabra, Hongfu Liu
Data-centric learning seeks to improve model performance from the perspective of data quality, and has been drawing increasing attention in the machine learning community. Among its key tools, influence functions provide a powerful framework to quantify the impact of individual training samples on model predictions, enabling practitioners to identify detrimental samples and retrain models on a cleaner dataset for improved performance. However, most existing work focuses on the question: "what data benefits the learning model?" In this paper, we take a step further and investigate a more fundamental question: "what is the performance ceiling of the learning model?" Unlike prior studies that primarily measure improvement through overall accuracy, we emphasize category-wise accuracy and aim for Pareto improvements, ensuring that every class benefits, rather than allowing tradeoffs where some classes improve at the expense of others. To address this challenge, we propose category-wise influence functions and introduce an influence vector that quantifies the impact of each training sample across all categories. Leveraging these influence vectors, we develop a principled criterion to determine whether a model can still be improved, and further design a linear programming-based sample reweighting framework to achieve Pareto performance improvements. Through extensive experiments on synthetic datasets, vision, and text benchmarks, we demonstrate the effectiveness of our approach in estimating and achieving a model's performance improvement across multiple categories of interest.
Yonghan Shin, SeungKyu Kim, Won-Ki Jeong
Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks, making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.
Yachao Yuan, Zhen Yu, Jin Wang, Zhipeng Cheng, Jianhua Hu
Federated Learning (FL) has shown considerable promise in Computing Power Networks (CPNs) for privacy protection, efficient data utilization, and dynamic collaboration. Although it offers practical benefits, applying FL in CPNs continues to encounter a major obstacle, i.e., multi-task deployment. However, existing work mainly focuses on mitigating FL's computation and communication overhead of a single task while overlooking the computing resource wastage issue of heterogeneous devices across multiple tasks in FL under CPNs. To tackle this, we design FedAPTA, a federated multi-task learning framework in CPNs. FedAPTA alleviates computing resource wastage through the developed layer-wise model pruning technique, which reduces local model size while considering both data and device heterogeneity. To aggregate structurally heterogeneous local models of different tasks, we introduce a heterogeneous model recovery strategy and a task-aware model aggregation method that enables the aggregation through infilling local model architecture with the shared global model and clustering local models according to their specific tasks. We deploy FedAPTA on a realistic FL platform and benchmark it against nine SOTA FL methods. The experimental outcomes demonstrate that the proposed FedAPTA considerably outperforms the state-of-the-art FL methods by up to 4.23%. Our code is available at https://github.com/Zhenzovo/FedCPN.
Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang et al.
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these along with the keyframe try-on images are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page is available at https://virtu-lab.github.io/.
Authors' comments: 18 pages, 12 figures
Jiyong Kim, Sunwoong Yang, Namwoo Kang
This study introduces a novel point-wise diffusion model that processes spatio-temporal points independently to efficiently predict complex physical systems with shape variations. The methodological contribution lies in applying forward and backward diffusion processes at individual spatio-temporal points, coupled with a point-wise diffusion transformer architecture for denoising. Unlike conventional image-based diffusion models that operate on structured data representations, this framework enables direct processing of any data format, including meshes and point clouds, while preserving geometric fidelity. We validate our approach across three distinct physical domains with complex geometric configurations: two 2D spatio-temporal systems (cylinder fluid flow and an OLED drop impact test) and a large-scale 3D system for road-car external aerodynamics. To make the point-wise approach suitable for real-time prediction applications, we employ denoising diffusion implicit models (DDIM) for efficient deterministic sampling, requiring only 5-10 steps compared to the traditional 1000 steps and providing a computational speedup of 100 to 200 times during inference without compromising accuracy. In addition, our proposed model achieves superior performance compared to an image-based diffusion model: it reduces training time by 94.4% and requires 89.0% fewer parameters while achieving over 28% improvement in prediction accuracy. Comprehensive comparisons against data-flexible surrogate models including DeepONet and Meshgraphnet demonstrate consistent superiority of our approach across all three physical systems. To further refine the proposed model, we investigate two key aspects: 1) predicting final physical states versus predicting incremental changes, and 2) computational efficiency across varying subsampling ratios (10%-100%).
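The deterministic DDIM sampling mentioned in the abstract above can be sketched in a few lines. This is a generic DDIM update (with the stochasticity parameter set to zero) applied to an array of independent points; it is not the paper's transformer. The linear beta schedule, the names `ddim_sample` and `eps_model`, and the oracle noise predictor used in place of a trained network are all assumptions for illustration.

```python
import numpy as np


def ddim_sample(eps_model, x_T, alpha_bar, n_steps):
    """Deterministic DDIM sampling over a subsequence of the training timesteps."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, n_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        eps = eps_model(x, t)  # predicted noise at step t
        # Estimate of the clean sample implied by the current noise prediction.
        x0_pred = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Jump to the next retained timestep; the last jump lands on clean data.
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < n_steps else 1.0
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps
    return x


# Assumed linear beta schedule with 1000 training steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1 - betas)

# Toy check: 64 independent points with 3 channels each, and an oracle that
# returns the true noise instead of a trained denoiser.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 3))
noise = rng.standard_normal((64, 3))
x_T = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * noise


def oracle(x, t):
    return (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1 - alpha_bar[t])


recovered_5 = ddim_sample(oracle, x_T, alpha_bar, n_steps=5)
recovered_10 = ddim_sample(oracle, x_T, alpha_bar, n_steps=10)
```

Each sampling step costs one denoiser evaluation, so going from 1000 steps to 10 or 5 reduces the number of evaluations by a factor of 100 or 200, which matches the speedup range quoted in the abstract.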
Abdoul O. Diakité, Claudia Moreau, Gleb Bezgin, Nikhil Bhagwat, Pedro Rosa-Neto, Jean-Baptiste Poline, Simon Girard, Amadou Barry et al.
Multimodal high-dimensional data are increasingly prevalent in biomedical
research, yet they are often compromised by block-wise missingness and
measurement errors, posing significant challenges for statistical inference and
prediction. We propose AdapDISCOM, a novel adaptive direct sparse regression
method that simultaneously addresses these two pervasive issues. Building on
the DISCOM framework, AdapDISCOM introduces modality-specific weighting schemes
to account for heterogeneity in data structures and error magnitudes across
modalities. We establish the theoretical properties of AdapDISCOM, including
model selection consistency and convergence rates under sub-Gaussian and
heavy-tailed settings, and develop robust and computationally efficient
variants (AdapDISCOM-Huber and Fast-AdapDISCOM). Extensive simulations
demonstrate that AdapDISCOM consistently outperforms existing methods such as
DISCOM, SCOM, and CoCoLasso, particularly under heterogeneous contamination and
heavy-tailed distributions. Finally, we apply AdapDISCOM to Alzheimer's Disease
Neuroimaging Initiative (ADNI) data, demonstrating improved prediction of
cognitive scores and reliable selection of established biomarkers, even with
substantial missingness and measurement errors. AdapDISCOM provides a flexible,
robust, and scalable framework for high-dimensional multimodal data analysis
under realistic data imperfections.
Authors' comments: 49 pages, 4 figures
Maxim Henry, Adrien Deliège, Anthony Cioppa, Marc Van Droogenbroeck
Convolutional Neural Networks (CNNs) are widely used in many computer vision
tasks. Yet, their increasing size and complexity pose significant challenges
for efficient deployment on resource-constrained platforms. Hence, network
pruning has emerged as an effective way of reducing the size and computational
requirements of neural networks by removing redundant or unimportant
parameters. However, a fundamental challenge with pruning consists in optimally
removing redundancies without degrading performance. Most existing pruning
techniques overlook structural dependencies across feature maps within a layer,
resulting in suboptimal pruning decisions. In this work, we introduce LinDeps,
a novel post-pruning method, i.e., a pruning method that can be applied on top
of any pruning technique, which systematically identifies and removes redundant
filters via linear dependency analysis. Particularly, LinDeps applies pivoted
QR decomposition to feature maps to detect and prune linearly dependent
filters. Then, a novel signal recovery mechanism adjusts the next layer's
kernels to preserve compatibility and performance without requiring any
fine-tuning. Our experiments on CIFAR-10 and ImageNet with VGG and ResNet
backbones demonstrate that LinDeps improves compression rates of existing
pruning techniques while preserving performance, leading to a new state of the
art in CNN pruning. We also benchmark LinDeps in low-resource setups where no
retraining can be performed, which shows significant pruning improvements and
inference speedups over a state-of-the-art method. LinDeps therefore
constitutes an essential add-on for any current or future pruning technique.
Authors' comments: 10 pages, 4 figures, 5 tables, 45 references
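The linear-dependency idea in the abstract above can be illustrated with a small sketch built on `scipy.linalg.qr` with column pivoting. The setup is an assumption for illustration only: feature maps are flattened into the columns of a matrix, the next layer is modeled as a plain linear map `W`, and the recovery step folds the dropped columns' weights into the kept ones. The function names and the tolerance are hypothetical and the paper's actual recovery mechanism for convolution kernels is not reproduced.

```python
import numpy as np
from scipy.linalg import qr


def split_filters(features, tol=1e-8):
    """Return (kept, dropped) column indices using pivoted QR rank detection."""
    _, R, perm = qr(features, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int(np.sum(diag > tol * diag[0]))  # pivoting sorts |diag(R)| in decreasing order
    return perm[:rank], perm[rank:]


def recover_next_layer(features, W, kept, dropped):
    """Fold weights of dropped filters into kept ones so the layer output is unchanged."""
    # Express each dropped column as a linear combination of the kept columns.
    coeffs = np.linalg.lstsq(features[:, kept], features[:, dropped], rcond=None)[0]
    return W[kept] + coeffs @ W[dropped]


# Toy features: 5 filters, of which two are linear combinations of the others.
rng = np.random.default_rng(0)
base = rng.standard_normal((50, 3))
features = np.column_stack(
    [base[:, 0], base[:, 1], base[:, 2], 2 * base[:, 0] - base[:, 1], base[:, 2]]
)
W = rng.standard_normal((5, 4))  # hypothetical next-layer weights

kept, dropped = split_filters(features)
W_pruned = recover_next_layer(features, W, kept, dropped)
original_out = features @ W
pruned_out = features[:, kept] @ W_pruned
```

Which of several mutually dependent columns ends up in `dropped` depends on the pivoting order, so the sketch checks the number of removed filters and the preserved output rather than specific indices.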
Gabriel Bo, Koa Chang, Justin Gu
We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a novel
reinforcement learning framework that teaches large language models to explore
diverse tool usage patterns beyond conventional high-temperature sampling.
Building on recent advances in step-wise reinforcement learning, we introduce a
dual-objective reward system that simultaneously optimizes for answer quality
and tool diversity, training a Llama-3.1 8B model through offline PPO on
synthetically generated trajectories from the MMLU-Pro dataset. Our approach
uniquely employs a rarity-first exploitation strategy where a GPT-4o judge
scores candidate actions across eight distinct tools plus chain-of-thought
reasoning, with the policy favoring less-frequently used but still viable tools
to encourage systematic exploration. Empirical results demonstrate that SPaRK
achieves competitive performance across 14 MMLU-Pro categories while exhibiting
significantly higher entropy in tool selection compared to both baseline and
supervised fine-tuning approaches, suggesting that algorithmic exploration
through explicit tool diversity can enhance reasoning capabilities without
sacrificing accuracy.
Authors' comments: 12 pages, 4 figures
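The abstract above reports higher entropy in tool selection as its diversity measure. A minimal sketch of such a measurement is the Shannon entropy of the empirical distribution of chosen tools; the function name and the tool labels below are hypothetical, and the paper may compute its metric differently.

```python
import math
from collections import Counter


def selection_entropy(choices):
    """Shannon entropy (in bits) of the empirical distribution of selected tools."""
    counts = Counter(choices)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical tool logs: one policy always picks the same tool, another
# spreads its choices evenly over eight tools.
collapsed = ["calculator"] * 16
uniform = [f"tool_{i}" for i in range(8)] * 2
h_collapsed = selection_entropy(collapsed)
h_uniform = selection_entropy(uniform)
```

With eight tools plus chain-of-thought reasoning as a ninth action, the largest possible value of this measure would be $\log_2 9 \approx 3.17$ bits.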