Shahriar Kabir Nahin, Wenxiao Xiao, Joshua Liu, Anshuman Chhabra, Hongfu Liu
Data-centric learning seeks to improve model performance from the perspective of data quality, and has been drawing increasing attention in the machine learning community. Among its key tools, influence functions provide a powerful framework to quantify the impact of individual training samples on model predictions, enabling practitioners to identify detrimental samples and retrain models on a cleaner dataset for improved performance. However, most existing work focuses on the question: "what data benefits the learning model?" In this paper, we take a step further and investigate a more fundamental question: "what is the performance ceiling of the learning model?" Unlike prior studies that primarily measure improvement through overall accuracy, we emphasize category-wise accuracy and aim for Pareto improvements, ensuring that every class benefits, rather than allowing tradeoffs where some classes improve at the expense of others. To address this challenge, we propose category-wise influence functions and introduce an influence vector that quantifies the impact of each training sample across all categories. Leveraging these influence vectors, we develop a principled criterion to determine whether a model can still be improved, and further design a linear programming-based sample reweighting framework to achieve Pareto performance improvements. Through extensive experiments on synthetic datasets, vision, and text benchmarks, we demonstrate the effectiveness of our approach in estimating and achieving a model's performance improvement across multiple categories of interest.
Yonghan Shin, SeungKyu Kim, Won-Ki Jeong
Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks, making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.
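The coarse-level selection step described above can be pictured with a small sketch. It is not the WISE-FUSE implementation: the toy embeddings, the cosine scoring, the max-over-classes rule, and the fixed top-k cutoff are all assumptions made for illustration, and the knowledge distillation mechanism is not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_patches(patch_embs, text_embs, k):
    """Return the indices of the k patches most similar to any class description.

    Assumption: each patch is scored by its best match over all class texts.
    """
    scores = [max(cosine(p, t) for t in text_embs) for p in patch_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical 2-D embeddings of four low-resolution patches and one class text.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
texts = [[1.0, 0.1]]
selected = select_patches(patches, texts, k=2)
print(selected)
```

Only the patches returned by `select_patches` would then be encoded at high resolution, which is where the reported time saving would come from.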
Yachao Yuan, Zhen Yu, Jin Wang, Zhipeng Cheng, Jianhua Hu
Federated Learning (FL) has shown considerable promise in Computing Power Networks (CPNs) for privacy protection, efficient data utilization, and dynamic collaboration. Although it offers practical benefits, applying FL in CPNs continues to encounter a major obstacle, i.e., multi-task deployment. However, existing work mainly focuses on mitigating FL's computation and communication overhead of a single task while overlooking the computing resource wastage issue of heterogeneous devices across multiple tasks in FL under CPNs. To tackle this, we design FedAPTA, a federated multi-task learning framework in CPNs. FedAPTA alleviates computing resource wastage through the developed layer-wise model pruning technique, which reduces local model size while considering both data and device heterogeneity. To aggregate structurally heterogeneous local models of different tasks, we introduce a heterogeneous model recovery strategy and a task-aware model aggregation method that enables the aggregation through infilling local model architecture with the shared global model and clustering local models according to their specific tasks. We deploy FedAPTA on a realistic FL platform and benchmark it against nine SOTA FL methods. The experimental outcomes demonstrate that the proposed FedAPTA considerably outperforms the state-of-the-art FL methods by up to 4.23%. Our code is available at https://github.com/Zhenzovo/FedCPN.
Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang et al.
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these along with the keyframe try-on images are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page is at https://virtu-lab.github.io/
Authors' comments: 18 pages, 12 figures
Jiyong Kim, Sunwoong Yang, Namwoo Kang
This study introduces a novel point-wise diffusion model that processes spatio-temporal points independently to efficiently predict complex physical systems with shape variations. The methodological contribution lies in applying forward and backward diffusion processes at individual spatio-temporal points, coupled with a point-wise diffusion transformer architecture for denoising. Unlike conventional image-based diffusion models that operate on structured data representations, this framework enables direct processing of any data format, including meshes and point clouds, while preserving geometric fidelity. We validate our approach across three distinct physical domains with complex geometric configurations: 2D spatio-temporal systems, including cylinder fluid flow and an OLED drop impact test, and a 3D large-scale system for road-car external aerodynamics. To justify the necessity of our point-wise approach for real-time prediction applications, we employ denoising diffusion implicit models (DDIM) for efficient deterministic sampling, requiring only 5-10 steps compared to the traditional 1000 steps and providing a computational speedup of 100 to 200 times during inference without compromising accuracy. In addition, our proposed model achieves superior performance compared to image-based diffusion models, reducing training time by 94.4% and requiring 89.0% fewer parameters while achieving over 28% improvement in prediction accuracy. Comprehensive comparisons against data-flexible surrogate models including DeepONet and Meshgraphnet demonstrate consistent superiority of our approach across all three physical systems. To further refine the proposed model, we investigate two key aspects: 1) a comparison of final physical state prediction versus incremental change prediction, and 2) computational efficiency evaluation across varying subsampling ratios (10%-100%).
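The quoted speedup and reduction figures can be sanity-checked with simple arithmetic. The sketch below assumes that every sampling step costs the same, so that the inference speedup equals the ratio of step counts; the helper names are hypothetical and do not come from the paper.

```python
def step_speedup(baseline_steps, reduced_steps):
    """Speedup from using fewer sampling steps, assuming equal cost per step."""
    return baseline_steps / reduced_steps

def reduction_factor(fraction_saved):
    """Convert a fractional saving (e.g. 0.944) into an 'x times smaller' factor."""
    return 1.0 / (1.0 - fraction_saved)

print(step_speedup(1000, 10))   # 10 DDIM steps instead of 1000
print(step_speedup(1000, 5))    # 5 DDIM steps instead of 1000
print(round(reduction_factor(0.944), 1))  # training-time reduction as a factor
print(round(reduction_factor(0.890), 1))  # parameter reduction as a factor
```

Under that assumption, 5-10 steps against 1000 gives the stated range of 100 to 200 times, and the 94.4% and 89.0% savings correspond to roughly 18 times less training time and 9 times fewer parameters.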
Abdoul O. Diakité, Claudia Moreau, Gleb Bezgin, Nikhil Bhagwat, Pedro Rosa-Neto, Jean-Baptiste Poline, Simon Girard, Amadou Barry et al.
Multimodal high-dimensional data are increasingly prevalent in biomedical
research, yet they are often compromised by block-wise missingness and
measurement errors, posing significant challenges for statistical inference and
prediction. We propose AdapDISCOM, a novel adaptive direct sparse regression
method that simultaneously addresses these two pervasive issues. Building on
the DISCOM framework, AdapDISCOM introduces modality-specific weighting schemes
to account for heterogeneity in data structures and error magnitudes across
modalities. We establish the theoretical properties of AdapDISCOM, including
model selection consistency and convergence rates under sub-Gaussian and
heavy-tailed settings, and develop robust and computationally efficient
variants (AdapDISCOM-Huber and Fast-AdapDISCOM). Extensive simulations
demonstrate that AdapDISCOM consistently outperforms existing methods such as
DISCOM, SCOM, and CoCoLasso, particularly under heterogeneous contamination and
heavy-tailed distributions. Finally, we apply AdapDISCOM to Alzheimer's Disease
Neuroimaging Initiative (ADNI) data, demonstrating improved prediction of
cognitive scores and reliable selection of established biomarkers, even with
substantial missingness and measurement errors. AdapDISCOM provides a flexible,
robust, and scalable framework for high-dimensional multimodal data analysis
under realistic data imperfections.
Authors' comments: 49 pages, 4 figures
Maxim Henry, Adrien Deliège, Anthony Cioppa, Marc Van Droogenbroeck
Convolutional Neural Networks (CNNs) are widely used in many computer vision
tasks. Yet, their increasing size and complexity pose significant challenges
for efficient deployment on resource-constrained platforms. Hence, network
pruning has emerged as an effective way of reducing the size and computational
requirements of neural networks by removing redundant or unimportant
parameters. However, a fundamental challenge with pruning consists in optimally
removing redundancies without degrading performance. Most existing pruning
techniques overlook structural dependencies across feature maps within a layer,
resulting in suboptimal pruning decisions. In this work, we introduce LinDeps,
a novel post-pruning method, i.e., a pruning method that can be applied on top
of any pruning technique, which systematically identifies and removes redundant
filters via linear dependency analysis. Particularly, LinDeps applies pivoted
QR decomposition to feature maps to detect and prune linearly dependent
filters. Then, a novel signal recovery mechanism adjusts the next layer's
kernels to preserve compatibility and performance without requiring any
fine-tuning. Our experiments on CIFAR-10 and ImageNet with VGG and ResNet
backbones demonstrate that LinDeps improves compression rates of existing
pruning techniques while preserving performance, leading to a new state of the
art in CNN pruning. We also benchmark LinDeps in low-resource setups where no
retraining can be performed, which shows significant pruning improvements and
inference speedups over a state-of-the-art method. LinDeps therefore
constitutes an essential add-on for any current or future pruning technique.
Authors' comments: 10 pages, 4 figures, 5 tables, 45 references
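The linear dependency analysis in the LinDeps abstract can be illustrated with a small sketch. It is not the authors' code: it is a pure-Python, column-pivoted Gram-Schmidt procedure, which follows the same greedy selection rule as pivoted QR, and it assumes that each feature map has already been flattened into a vector. The function name, the tolerance, and the toy data are hypothetical, and the signal recovery step that adjusts the next layer's kernels is not shown.

```python
import math

def independent_columns(columns, tol=1e-8):
    """Split columns into a linearly independent set and the dependent rest.

    Greedy column pivoting: repeatedly keep the column with the largest
    residual norm, then project it out of all remaining columns.
    """
    residuals = [list(c) for c in columns]
    scale = max(math.sqrt(sum(x * x for x in c)) for c in columns) or 1.0
    remaining = set(range(len(columns)))
    kept = []
    while remaining:
        norms = {j: math.sqrt(sum(x * x for x in residuals[j])) for j in remaining}
        pivot = max(sorted(remaining), key=lambda j: norms[j])
        if norms[pivot] <= tol * scale:
            break  # every remaining column lies in the span of the kept ones
        q = [x / norms[pivot] for x in residuals[pivot]]
        kept.append(pivot)
        remaining.remove(pivot)
        for j in remaining:
            proj = sum(a * b for a, b in zip(q, residuals[j]))
            residuals[j] = [a - proj * b for a, b in zip(residuals[j], q)]
    return sorted(kept), sorted(remaining)

# Four hypothetical flattened feature maps; map 2 equals map 0 plus map 1.
maps = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 2.0, 0.0]]
kept, dependent = independent_columns(maps)
print(kept, dependent)
```

Because pivoting favors columns with large norms, the map flagged as dependent need not be the one that was constructed as a sum; any one member of a dependent group can be expressed through the others. In practice a library routine for pivoted QR would replace this hand-written loop.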
Gabriel Bo, Koa Chang, Justin Gu
We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a novel
reinforcement learning framework that teaches large language models to explore
diverse tool usage patterns beyond conventional high-temperature sampling.
Building on recent advances in step-wise reinforcement learning, we introduce a
dual-objective reward system that simultaneously optimizes for answer quality
and tool diversity, training a Llama-3.1 8B model through offline PPO on
synthetically generated trajectories from the MMLU-Pro dataset. Our approach
uniquely employs a rarity-first exploitation strategy where a GPT-4o judge
scores candidate actions across eight distinct tools plus chain-of-thought
reasoning, with the policy favoring less-frequently used but still viable tools
to encourage systematic exploration. Empirical results demonstrate that SPaRK
achieves competitive performance across 14 MMLU-Pro categories while exhibiting
significantly higher entropy in tool selection compared to both baseline and
supervised fine-tuning approaches, suggesting that algorithmic exploration
through explicit tool diversity can enhance reasoning capabilities without
sacrificing accuracy.
Authors' comments: 12 pages, 4 figures
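The tool-selection entropy mentioned in the SPaRK abstract can be sketched as the Shannon entropy of the empirical tool-usage distribution. The abstract does not state the exact formula, so this definition, the base-2 logarithm, and the tool names below are assumptions for illustration.

```python
import math
from collections import Counter

def selection_entropy(choices):
    """Shannon entropy (in bits) of the empirical distribution over choices."""
    counts = Counter(choices)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical tool traces: one policy always picks the same tool,
# the other spreads its calls evenly over four tools.
narrow = ["search"] * 8
diverse = ["search", "calculator", "python", "wiki"] * 2
print(selection_entropy(narrow), selection_entropy(diverse))
```

A policy that always calls one tool scores zero, while an even spread over n tools scores log2(n) bits, so higher entropy indicates more diverse tool use.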
Shehroz S. Khan, Ali Abedi, Charlene H. Chu
Interpreting large volumes of high-dimensional, unlabeled data in a manner
that is comprehensible to humans remains a significant challenge across various
domains. In unsupervised healthcare data analysis, interpreting clustered data
can offer meaningful insights into patients' health outcomes, which hold direct
implications for healthcare providers. This paper addresses the problem of
interpreting clustered sensor data collected from older adult patients
recovering from lower-limb fractures in the community. A total of 560 days of
multimodal sensor data, including acceleration, step count, ambient motion, GPS
location, heart rate, and sleep, alongside clinical scores, were remotely
collected from patients at home. Clustering was first carried out separately
for each data modality to assess the impact of feature sets extracted from each
modality on patients' recovery trajectories. Then, using context-aware
prompting, a large language model was employed to infer meaningful cluster
labels for the clusters derived from each modality. The quality of these
clusters and their corresponding labels was validated through rigorous
statistical testing and visualization against clinical scores collected
alongside the multimodal sensor data. The results demonstrated the statistical
significance of most modality-specific cluster labels generated by the large
language model with respect to clinical scores, confirming the efficacy of the
proposed method for interpreting sensor data in an unsupervised manner. This
unsupervised data analysis approach, relying solely on sensor data, enables
clinicians to identify at-risk patients and take timely measures to improve
health outcomes.
Authors' comments: 15 pages, 2 figures, 3 tables
Jung Hyun Lee, Seungjae Shin, Vinnam Kim, Jaeseong You, An Chen
As the rapid scaling of large language models (LLMs) poses significant
challenges for deployment on resource-constrained devices, there is growing
interest in extremely low-bit quantization, such as 2-bit. Although prior works
have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller
counterparts in both accuracy and latency, these advancements have been limited
to pre-trained LLMs and have not yet been extended to instruction-tuned models.
To bridge this gap, we propose Unified Progressive Quantization (UPQ), a novel
progressive quantization framework (FP16$\rightarrow$INT4$\rightarrow$INT2)
that unifies block-wise post-training quantization (PTQ) with
distillation-based quantization-aware training (Distill-QAT) for INT2
instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned
models to INT4 using block-wise PTQ to significantly reduce the quantization
error introduced by subsequent INT2 quantization. Next, UPQ applies Distill-QAT
to enable INT2 instruction-tuned LLMs to generate responses consistent with
their original FP16 counterparts by minimizing the generalized Jensen-Shannon
divergence (JSD) between the two. To the best of our knowledge, we are the
first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs
to INT2 without relying on proprietary post-training data, while achieving
state-of-the-art performance on MMLU and IFEval, two of the most
representative benchmarks for evaluating instruction-tuned LLMs.
Authors' comments: Preprint
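The generalized Jensen-Shannon divergence named in the UPQ abstract can be sketched as follows. The abstract does not give the formula, so this uses one common parameterization with a mixing weight `beta`; the weight value, the argument order, and the toy distributions are assumptions, not the paper's settings.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, beta=0.5):
    """beta * KL(p || m) + (1 - beta) * KL(q || m), with m = beta*p + (1-beta)*q.

    Requires 0 < beta < 1 so that the mixture m is positive wherever p or q is.
    """
    m = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1 - beta) * kl(q, m)

# Hypothetical next-token distributions of an FP16 model and its INT2 version.
fp16_probs = [0.7, 0.2, 0.1]
int2_probs = [0.5, 0.3, 0.2]
print(generalized_jsd(fp16_probs, int2_probs, beta=0.5))
```

With `beta = 0.5` this reduces to the standard Jensen-Shannon divergence, which is symmetric and bounded by log 2; other values of `beta` weight one model's distribution more heavily.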
Xiao Chen, Sihang Zhou, Ke Liang, Xiaoyu Sun, Xinwang Liu
Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning) over-smoothed as they contribute a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internally semantically coherent chunks and focuses the SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve core reasoning tokens (e.g., summary and transitional chunks) from the SLM's learning of reasoning chunks, increasing the fraction of core reasoning tokens in the corresponding iteration. Based on CWT, we propose skip-thinking training (STT). STT makes the SLM automatically skip non-reasoning intermediate chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, Wenju Xu, Chen Luo, Sheikh Muhammad Sarwar, Yang Li et al.
Goal-oriented script planning, or the ability to devise coherent sequences of
actions toward specific goals, is commonly employed by humans to plan for
typical activities. In e-commerce, customers increasingly seek LLM-based
assistants to generate scripts and recommend products at each step, thereby
facilitating convenient and efficient shopping experiences. However, this
capability remains underexplored due to several challenges, including the
inability of LLMs to simultaneously conduct script planning and product
retrieval, difficulties in matching products caused by semantic discrepancies
between planned actions and search queries, and a lack of methods and benchmark
data for evaluation. In this paper, we step forward by formally defining the
task of E-commerce Script Planning (EcomScript) as three sequential subtasks.
We propose a novel framework that enables the scalable generation of
product-enriched scripts by associating products with each step based on the
semantic similarity between the actions and their purchase intentions. By
applying our framework to real-world e-commerce data, we construct the very
first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229
scripts sourced from 2.4 million products. Human annotations are then conducted
to provide gold labels for a sampled subset, forming an evaluation benchmark.
Extensive experiments reveal that current (L)LMs face significant challenges
with EcomScript tasks, even after fine-tuning, while injecting product purchase
intentions improves their performance.
Authors' comments: ACL2025
Ya Li, Bin Zhou, Bo Hu
In speaker verification, traditional models often emphasize modeling long-term contextual features to capture global speaker characteristics. However, this approach can neglect fine-grained voiceprint information, which contains highly discriminative features essential for robust speaker embeddings. This paper introduces a novel model architecture, termed MGFF-TDNN, based on multi-granularity feature fusion. The MGFF-TDNN leverages a two-dimensional depth-wise separable convolution module, enhanced with local feature modeling, as a front-end feature extractor to effectively capture time-frequency domain features. To achieve comprehensive multi-granularity feature fusion, we propose the M-TDNN structure, which integrates global contextual modeling with fine-grained feature extraction by combining time-delay neural networks and phoneme-level feature pooling. Experiments on the VoxCeleb dataset demonstrate that the MGFF-TDNN achieves outstanding performance in speaker verification while remaining efficient in terms of parameters and computational resources.
Jong Chul Lee, Joon Hyeop Lee, Hyunjin Jeong, Mina Pak, Sree Oh
We study star formation rate (SFR) indicators and dust attenuation of 74
nearby star-forming galaxies on kiloparsec scales, based on GALEX
far-ultraviolet (FUV) and WISE mid-infrared (MIR) images with CALIFA optical
integral field spectroscopic data. We obtain hybrid SFR indicators by combining
the observed FUV and MIR luminosities and calibrate them using the
dust-corrected H$\alpha$ luminosity as a reference SFR. The simple linear
combination appears to follow well the reference SFR, but the calibration
residual shows a significant dependence on the specific SFR (sSFR), which can
be removed by employing the combination coefficient or conversion offset that
varies with the sSFR. In the plane of gas versus stellar attenuation, the
median trend line's slope ($\approx$ stellar-to-gas attenuation ratio) changes
from 0.44 to 1.0 with increasing attenuation. The differential attenuation,
defined as the deviation of stellar attenuation from the median trend line, is
strongly correlated with the SFR surface density and sSFR, compatible with the
two-component dust model. The differential attenuation seems to be affected by
both local and global factors.
Authors' comments: 18 pages, 13 figures, To appear in ApJ
Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou
As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, offering new insights into developing better data exploration strategies for post-training.
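The two spectral quantities named above can be sketched from a list of singular values. The nuclear norm is the sum of the singular values. For effective rank, this sketch assumes the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values); the paper may define it differently, and the example spectra are made up.

```python
import math

def nuclear_norm(singular_values):
    """Nuclear norm: the sum of the singular values."""
    return sum(singular_values)

def effective_rank(singular_values):
    """exp(Shannon entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Hypothetical gradient spectra with the same nuclear norm but different shapes.
flat_spectrum = [1.0, 1.0, 1.0, 1.0]
peaked_spectrum = [3.7, 0.1, 0.1, 0.1]
print(nuclear_norm(flat_spectrum), effective_rank(flat_spectrum))
print(nuclear_norm(peaked_spectrum), effective_rank(peaked_spectrum))
```

Both spectra sum to the same value, yet the flat one has a much higher effective rank, which illustrates how effective rank can separate cases that the nuclear norm alone cannot.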
Yuanhong A, Guoyu Zhang, Yongcheng Zeng, Bo Zhang
In this study, we establish a unified framework to deal with the high-dimensional matrix completion problem under flexible nonignorable missing mechanisms. Although the matrix completion problem has attracted much attention over the years, very few works consider the nonignorable missing mechanism. To address this problem, we derive a row- and column-wise matrix U-statistics type loss function, with the nuclear norm for regularization. A singular value proximal gradient algorithm is developed to solve the proposed optimization problem. We prove the non-asymptotic upper bound of the estimation error's Frobenius norm and show the performance of our method through numerical simulations and real data analysis.
Stephen Meisenbacher, Chaeeun Joy Lee, Florian Matthes
The task of $\textit{Differentially Private Text Rewriting}$ is a class of
text privatization techniques in which (sensitive) input textual documents are
$\textit{rewritten}$ under Differential Privacy (DP) guarantees. The motivation
behind such methods is to hide both explicit and implicit identifiers that
could be contained in text, while still retaining the semantic meaning of the
original text, thus preserving utility. Recent years have seen an uptick in
research output in this field, offering a diverse array of word-, sentence-,
and document-level DP rewriting methods. Common to these methods is the
selection of a privacy budget (i.e., the $\varepsilon$ parameter), which
governs the degree to which a text is privatized. One major limitation of
previous works, stemming directly from the unique structure of language itself,
is the lack of consideration of $\textit{where}$ the privacy budget should be
allocated, as not all aspects of language, and therefore text, are equally
sensitive or personal. In this work, we are the first to address this
shortcoming, asking the question of how a given privacy budget can be
intelligently and sensibly distributed amongst a target document. We construct
and evaluate a toolkit of linguistics- and NLP-based methods used to allocate a
privacy budget to constituent tokens in a text document. In a series of privacy
and utility experiments, we empirically demonstrate that given the same privacy
budget, intelligent distribution leads to higher privacy levels and more
positive trade-offs than a naive distribution of $\varepsilon$. Our work
highlights the intricacies of text privatization with DP, and furthermore, it
calls for further work on finding more efficient ways to maximize the
privatization benefits offered by DP in text rewriting.
Authors' comments: 14 pages, 1 figure, 6 tables. Accepted to CODASPY 2025
Shu Yang, Chengting Yu, Lei Liu, Hanzhi Ma, Aili Wang, Erping Li
Spiking Neural Networks (SNNs) have garnered considerable attention as a potential alternative to Artificial Neural Networks (ANNs). Recent studies have highlighted SNNs' potential on large-scale datasets. For SNN training, two main approaches exist: direct training and ANN-to-SNN (ANN2SNN) conversion. To fully leverage existing ANN models in guiding SNN learning, either direct ANN-to-SNN conversion or ANN-SNN distillation training can be employed. In this paper, we propose an ANN-SNN distillation framework from the ANN-to-SNN perspective, designed with a block-wise replacement strategy for ANN-guided learning. By generating intermediate hybrid models that progressively align SNN feature spaces to those of ANN through rate-based features, our framework naturally incorporates rate-based backpropagation as a training method. Our approach achieves results comparable to or better than state-of-the-art SNN distillation methods, showing both training and learning efficiency.
Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, Dacheng Tao
Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.
Yuqi Wu, Guangya Wan, Jingjing Li, Shengming Zhao, Lingfeng Ma, Tianyi Ye, Ion Pop, Yanbo Zhang et al.
Translating state-of-the-art NLP into practice often stalls at the "last
mile" owing to insufficient contextualization of the target domain's knowledge,
processes, and evaluation. Psychiatric differential diagnosis exemplifies this
challenge: accurate assessments depend on nuanced clinical knowledge, a
delicate cognitive-affective interview process, and downstream outcomes that
extend far beyond benchmark accuracy. We present WiseMind, a systematic
interdisciplinary contextualization framework that delivers both instrumental
(diagnostic precision) and humanistic (empathy) gains. WiseMind comprises three
components: (i) structured knowledge-guided proactive reasoning, which embeds
DSM-5 criteria in a knowledge graph to steer questioning; (ii) a
theory-informed dual-agent architecture that coordinates a "reasonable-mind"
reasoning agent and an "emotional-mind" empathy agent, inspired by Dialectical
Behavior Therapy; and (iii) a multi-faceted evaluation strategy covering
simulated patients, user studies, clinician review, and ethical assessment.
Tested on depression, anxiety, and bipolar disorder, WiseMind attains up to
84.2% diagnostic accuracy, which is comparable to human experts, while
outperforming single-agent baselines in perceived empathy and trustworthiness.
These results show that deep contextualization across knowledge, process, and
evaluation layers can transform benchmark-driven NLP into clinically meaningful
impact.
Authors' comments: 27 pages, 13 figures