# State of Clinical AI Report 2026

# About The Authors

# Peter Brodeur
Dr. Peter Brodeur is a rising cardiology fellow at Harvard Medical School's Beth Israel Deaconess Medical Center. Dr. Brodeur is an affiliate of ARISE, a reviewer for Nature Medicine and NEJM AI, and a former life sciences strategy consultant. His research focuses on human-computer interaction and LLM clinical reasoning.

# Ethan Goh
Dr. Ethan Goh is the Executive Director of ARISE. His research has been featured in The New York Times, The Washington Post, and CNN. He directs the Stanford Healthcare AI Leadership Program and Harvard's Agentic AI Executive Course. Dr. Goh is a Founding Editorial Board member and Associate Editor at BMJ Digital Health & AI.

# Adam Rodman
Dr. Adam Rodman is an assistant professor at Harvard Medical School. He is the Director of AI Programs for the Carl J. Shapiro Center. Dr. Rodman is an Associate Editor at NEJM AI. He is also the host of the American College of Physicians podcast Bedside Rounds.

# Jonathan H Chen
Dr. Jonathan H Chen is Stanford's inaugural Director for Medical Education in AI in the Division of Computational Medicine. His expertise in combining human and artificial intelligence to provide better healthcare than either alone is featured in the popular press, with over 100 publications and awards.

# Message From ARISE Leadership
"There are decades where nothing happens; and there are weeks when decades happen." Recent deployments by technology companies, health systems, and regulators have made clinical AI more visible and ever more consequential. At the same time, it has become harder to keep up with emerging research. In some areas the literature is fragmented; in others, it simply doesn't exist yet for the way these tools are being used today. So what actually holds up in practice?
The State of Clinical AI Report (2026) was created to look beyond model performance alone to other critical factors that determine real-world impact: how systems are evaluated, how clinicians and AI work together, and where patient risks start to appear. Frontier AI systems are already powerful. What's needed now is to safely and effectively translate these tools into real-world care.

Ethan Goh, Adam Rodman, Jonathan H Chen
Investigators, ARISE Network
ARISE-AI.ORG

# Engagement and Education

# Stanford Computational Medicine Colloquia
Healthcare AI seminars with Stanford and industry leaders - Thursdays 12 pm PT, free. Get weekly invites.

# Stanford Healthcare AI Leadership & Strategy Program
Application required. CME and accredited certificate. May 2026. Apply now.

# Generative AI and Agentic AI Online Course
Harvard/Stanford faculty, accredited certificate. Summer 2026. Get early access.

# The Current Landscape

# Clinical AI Is Widely Deployed But Poorly Evaluated
- AI is now embedded across health care: 1,200+ FDA-cleared tools and 350,000+ consumer apps have generated a $70B market.<sup>1</sup> Only a minority underwent peer-reviewed evaluation.<sup>2</sup>
- Of 691 FDA-cleared AI/ML medical devices (1995–2023), >95% went through the 510(k) clearance pathway, which is predicated on equivalence to existing devices, many of which were themselves approved on suboptimal evidence.<sup>2</sup>
- ~50% of FDA device summaries omitted study design, 53% lacked sample size, and <1% reported patient outcomes.<sup>2</sup>
- 95% of device summaries did not report demographic data, and 91% lacked bias assessments, raising concerns about safety and equity in real-world use.<sup>2</sup>

Bridging the gap between adoption and evidence requires supporting clinicians, health system leaders, policymakers, and the public in interpreting available research.

# Top Takeaways
1. Model capability is accelerating, but evidence of real clinical impact remains limited.
Many studies show what models can do in controlled settings; what's increasingly needed are prospective studies that show measurable effects on patient outcomes and care delivery.

2. Frontier LLMs show very uneven performance. They perform extremely well on complex reasoning tasks, yet break down when uncertainty, missing information, or changing context is introduced.

3. Clinicians value automation where it reduces administrative and workflow burden, but these use cases remain understudied. Tasks clinicians most want support with are often underrepresented in current benchmarks and evaluations.

4. Patient-facing AI has significant potential to reshape engagement and access, but raises distinct safety concerns. Direct interaction with patients requires much stronger guardrails and scalable oversight systems that do not currently exist.

5. Multimodal clinical AI applications are approaching practical usability. Improvements in base models are enabling applications that integrate unstructured text, images, and other clinical data to support prediction and decision-making in real-world settings.

6. FDA clearance is increasing, but near-term clinical adoption will favor narrow, task-specific systems. AI tools that are tightly scoped to specific domains and contexts are more likely to demonstrate value and be adopted in practice.
# Acknowledgements

# Reviewers
Rebecca Handler, Jason Hom, Eric Horvitz, Laura Zwaan, Vishnu Ravi, Brian Han, Kevin Schulman, Kathleen Lacar, Kameron Black, Liam McCoy, David Wu, Priyank Jain, Emily Tat, Adrian Haimovich

# Design & Accessibility
Emily Tat

# Supported By
Stanford Computational Medicine; Harvard Medical School, Shapiro Institute; Beth Israel Deaconess Medical Center, a Harvard Medical School teaching hospital; Stanford University Clinical Excellence Research Center; Stanford Medicine, Division of Hospital Medicine; Harvard Medical School, Blavatnik Institute, Biomedical Informatics

# How to Cite This Report
Peter G. Brodeur, Ethan Goh, Emily Tat, Liam McCoy, David Wu, Priyank Jain, Rebecca Handler, Jason Hom, Laura Zwaan, Vishnu Ravi, Brian Han, Kevin Schulman, Kathleen Lacar, Kameron Black, Adrian Haimovich, Eric Horvitz, Adam Rodman, Jonathan H. Chen. "State of Clinical AI 2026," ARISE Network, January 2026.

# Introduction

# Executive Summary

# Model Performance
- Frontier reasoning models (optimized for multi-step inference and chain of thought) showed marked improvement on challenging clinical reasoning tasks against human baselines, while prediction models crossed new thresholds in scalable prediction to enable actionable prevention.
- Dominant failure modes include poor recognition of uncertainty, overconfidence, and pattern-based shortcut learning.

# Benchmarks & Evaluation
- Multiple-choice benchmarks are saturated, and evaluations still underrepresent real clinical work: administrative tasks, conversational dialogue, real patient data, and bias/fairness.
- New benchmark suites (e.g., conversational, simulated EHR environments) are forcing models into more realistic, dynamic scenarios.

# Foundational Methods
- Novel techniques such as converting medical data to tokens for prediction bring a new era of screening and risk stratification. Clinical AI is being advanced by multiagent systems, multimodal diagnostic support, and optimized reasoning models.
# Executive Summary

# AI in Clinical Workflows
- Across settings, AI can augment clinicians on reasoning and diagnostic interpretation tasks. However, collaboration isn't yet optimized: how clinicians use AI is as important as what the model can do.
- Workflow tools like AI scribes feel transformative, yet objective gains are still modest. Extending these tools to downstream workflow tasks will likely yield greater productivity and efficiency gains.

# Patient Facing AI
- Multi-turn conversational agents and AI-based coaching show promise, particularly as they are integrated with smart devices to support more personalized health assistance. In a space with competing vendor interests, overtrust and unsupervised use raise the bar for guardrails and for improving objective patient outcomes, not just engagement.

# Applied AI & Demos
- The most immediately translatable progress is at the individual, task-specific level, with imaging remaining the dominant use case. We provide a sneak peek at the next wave of tools, such as EHR chatbots, eConsults, and mental health chatbots.

# Methods

# Our Approach to a Targeted Review of Clinical AI

# Data sources & search strategy
- Reviewed PubMed and preprint servers (e.g., medRxiv, arXiv) using a combination of terms such as "large language models in medicine," "AI," "diagnostic reasoning," "management reasoning," "diagnostic error," "benchmarks," and "patient-facing AI."
- Invited clinicians and AI researchers from academic institutions and issued an open call for submissions via social media (e.g., LinkedIn) to identify high-quality studies across the six themes.

# Study selection
- All studies were reviewed by the authors and reviewers of this report.
- Included empirical studies that (1) used an AI model/LLM in a clinical context, (2) reported quantitative or qualitative outcomes (e.g., diagnostic accuracy, bias, calibration, workflow, user performance), and (3) were judged to be of high impact.
- Excluded purely technical model papers without clinician- or patient-facing evaluation, editorials, and non-clinical AI (e.g., drug discovery, biotech).

# Table of Contents

# Model Performance
How well models (trained AI systems) perform independently across prediction and reasoning tasks.

# Benchmarks & Evaluations
The evolving metrics that define AI competence in medicine.

# Foundational Methods
Novel techniques that optimize clinical AI performance beyond off-the-shelf models.

# AI in Clinical Workflows
How clinicians and AI systems collaborate in real or simulated environments.

# Patient Facing AI
How AI engages directly with patients to inform, support, and personalize their healthcare.

# Applied AI & Demos
Demonstrating AI's domain-specific applications and use cases.

# Model Performance

# Model Performance
In 2025, frontier models made major leaps in autonomous clinical reasoning and prediction.
- Slides 18-20: Reasoning frontier models show large gains in autonomous clinical reasoning versus humans, including on historically difficult cases.
- Slides 21-22: Key weaknesses persist: poor performance in uncertainty-heavy scenarios, overconfidence, and pattern-based shortcut behavior.
- Slides 23-27: Models continue to show promise for scalable prediction across a wide variety of use cases, such as patient deterioration, screening for insulin resistance, and aging.

Overall, model-only evaluations reveal that LLMs have achieved superhuman capability in controlled tasks but still require stronger metacognition, calibration, and stress testing before autonomous deployment.
# Model Performance

# Complex Reasoning
- Approaching superhuman reasoning: AI vs. MD; LLM vs. primary care physician; LLM as an expert case discussant
- Gaps: "None of the other answers"; brittle overconfidence and uncertainty

# Prediction
- Inpatient deterioration
- Biological age
- Insulin resistance
- Wearable time series data for diagnosis prediction
- Clinical risk calculators

# o1-preview/o1: Reaching Superhuman Reasoning Performance
o1-preview and o1 consistently performed at or above the level of physicians across several reasoning evaluations: solving challenging NEJM cases at state-of-the-art levels, demonstrating superior reasoning quality, excelling in management tasks, and diagnosing real emergency room cases admitted to the hospital.
- On NEJM clinicopathological conference (CPC) cases, the model reached 78% diagnostic accuracy and selected the correct next test 87% of the time. o1-preview achieved a perfect physician-graded clinical reasoning score 99% of the time, significantly outperforming GPT-4 (59%) and attending physicians (35%). Management reasoning scores for o1-preview (86%) were also superior to GPT-4 (42%) and physicians using GPT-4 (41%).
- In real ED cases, the model performed at or above the level of attending physicians at three diagnostic touchpoints, with 66% exact/near-exact diagnoses vs. 48-54% for physicians at initial triage.
- Modern LLMs may now surpass physicians in general diagnostic and management reasoning in controlled environments, motivating the need for prospective clinical trials before real-world deployment.

Figure A. Grey Matters management cases: o1-preview management reasoning scores compared to GPT-4 and physicians.

Brodeur, Buckley, Manrai, Rodman et al., ArXiv, Jul.
2025

# Google's AMIE Chatbot Matches PCPs at Multi-Visit Disease Management
Enhanced with a new management-reasoning agent, the Articulate Medical Intelligence Explorer (AMIE) was non-inferior to 21 primary care physicians across guideline-based decision-making, treatment planning, and longitudinal care. AMIE produced more precise, guideline-based plans and outperformed physicians on medication-reasoning questions.
- AMIE (Gemini-based) was designed as a two-part system with access to a shared agent state (current patient summary, differential, etc.): a fast Dialogue Agent to capture the relevant HPI and a slower Management Reasoning agent using long-context reasoning grounded in clinical guidelines.
- AMIE was compared to PCPs across 100 three-visit simulated scenarios spanning cardiology, pulmonology, neurology, OBGYN/urology, and GI, each grounded in NICE and BMJ Best Practice guidelines.
- Graded by subspecialists on binary (Yes/No) criteria, AMIE's recommendations for investigations and treatments were consistently more precise, especially for investigations in follow-up visits (visit 2: 99% vs. 84%; visit 3: 100% vs. 88%), and carried explicit citations to guideline sources. This points to the possibility of agentic systems serving as a point of continuity in a fragmented system.
- On a novel medication-reasoning benchmark (RxQA), AMIE outperformed PCPs on harder questions (as determined by pharmacists) in both closed- and open-book conditions, demonstrating strong therapeutic reasoning.

Palepu, Schaekermann et al., ArXiv, Mar. 2025

# AI Outperforms Physicians as an Expert Case Discussant on Challenging Cases
Researchers developed Dr. CaBot, an AI discussant based on o3 that produces written and video CPC-style differentials. Dr. CaBot was evaluated on NEJM CPCs and NEJM Image Challenges, spanning ten tasks that test differential diagnosis, testing strategies, clinical reasoning, uncertainty handling, and multimodal interpretation.
In blinded testing, physicians could not reliably distinguish Dr. CaBot from human experts and consistently rated its reasoning higher.
- Built from 7,102 NEJM CPCs (1923-2025) and 1,021 NEJM Image Challenges, CPC-Bench covers 10 reasoning tasks (differential diagnosis, testing plans, touchpoints, omission, VQA, literature search, etc.).
- Among eight frontier models, o3 achieved 60% top-1 and 84% top-10 accuracy on CPC differential diagnosis, outperforming a 20-physician baseline, with 98% accuracy selecting the next test.
- Dr. CaBot, based on o3, is a publicly available (https://cpcbench.com/) system that produces both written and video case presentations and outperformed the originally presented expert case discussants.
- The study shows that AI is now capable of performing the entire CPC discussant role, with reasoning quality rated better than that of human experts.

Figure C. Physician quality ratings.

Buckley et al., ArXiv, Sept. 2025

# "None of the other answers": An LLM Weakness
Researchers tested whether LLMs could truly reason through medical questions by replacing the correct answer in multiple-choice questions with "None of the other answers" (NOTA). Frontier models showed significant drops in accuracy, revealing that strong multiple-choice performance is, in part, due to pattern recognition.
- Researchers modified 100 MedQA questions so that NOTA became the correct answer, creating a clinician-validated 68-item test of genuine reasoning. The pattern of answers changes, but the underlying clinical reasoning required does not.
- DeepSeek-R1, o3-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, GPT-4o, and Llama 3.3-70B all performed worse on NOTA-modified questions, with significant decreases in accuracy ranging from 9% to 38%.
- A system whose accuracy falls, for example, from 81% to 43% when a pattern changes would be unsafe for autonomous clinical use; rigorous benchmarks must test reasoning, not memorized answer distributions.

<table><tr><td colspan="4">Table.
Model Performance on Original and None of the Other Answers (NOTA)-Modified Questions</td></tr><tr><td rowspan="2">Model</td><td colspan="2">Accuracy, % (No./total No.)</td><td rowspan="2">Accuracy drop, % (No./total No.) [95% CI]</td></tr><tr><td>Original</td><td>NOTA-modified</td></tr><tr><td>1</td><td>92.65 (63/68)</td><td>83.82 (57/68)</td><td>8.82 (6/68) [2.70-18.92]</td></tr><tr><td>2</td><td>95.59 (65/68)</td><td>79.41 (54/68)</td><td>16.18 (11/68) [10.81-29.73]</td></tr><tr><td>3</td><td>88.24 (60/68)</td><td>61.76 (42/68)</td><td>26.47 (18/68) [17.57-39.19]</td></tr><tr><td>4</td><td>92.65 (63/68)</td><td>58.82 (40/68)</td><td>33.82 (23/68) [24.32-47.30]</td></tr><tr><td>5</td><td>85.29 (58/68)</td><td>48.53 (33/68)</td><td>36.76 (25/68) [28.38-51.35]</td></tr><tr><td>6</td><td>80.88 (55/68)</td><td>42.65 (29/68)</td><td>38.24 (26/68) [27.03-51.35]</td></tr></table>

Bedi, Shah et al., JAMA Network Open, Aug. 2025

# Script Concordance Testing Reveals Gaps in LLM Clinical Reasoning
A study compared 10 frontier models to 1,500+ clinicians on 750 Script Concordance Testing (SCT) questions, which measure the ability to revise clinical decisions when new information becomes available. Models matched medical students but underperformed relative to seasoned physicians, revealing consistent overconfidence and difficulty updating decisions under uncertainty.
- SCT measures the ability to revise diagnostic or management judgments when new information arrives, a core skill of clinical reasoning under uncertainty.
- The study established a benchmark of 750 SCT items from 10 datasets, spanning pediatrics, neurology, emergency medicine, internal medicine, and physiotherapy, most never previously published.
- OpenAI's o3 (68%) led performance, followed by GPT-4o (64%), matching medical students but falling below residents and attending physicians. Many reasoning models performed surprisingly poorly (e.g., Gemini 2.5: 52%).

McCoy, Rodman et al., NEJM AI, Sept.
2025
- LLMs overused extreme ratings (+2/-2), rarely selected neutrality (0), and showed miscalibrated confidence patterns unlike human experts, suggesting that chain-of-thought-optimized models may overcommit in uncertainty-rich tasks.

# Predicting Inpatient Deterioration Before It Happens
Researchers developed a deep learning model using continuous wearable vital sign data from 888 hospitalized med-surg patients to predict clinical deterioration 8-24 hours before standard EHR alerts. The model generated more timely alerts than episodic vital sign checks and accurately predicted hard outcomes, including ICU transfer, cardiac arrest, and death.
- Outside of the ICU, inpatient vital signs are checked every 4-8 hours, leaving gaps during which critical illness can go undetected.
- Researchers trained a recurrent neural network on 5-hour sequences of continuous vital sign inputs (e.g., HR, RR) collected from a wearable chest device, plus demographics, from 888 non-ICU patients to detect early deterioration.
- The model predicted 9x more clinical alerts (Modified Early Warning Score (MEWS) >6 for >30 minutes) 8-24 hours before EHR-based MEWS alerts, with AUROC 0.89 (retrospective) and AUROC 0.84-0.90 (prospective). It predicted 9 of 11 hard outcome events (cardiac arrests and deaths) up to 17 hours before MEWS.
- This enables faster recognition of physiologic decline and the potential to prevent avoidable deteriorations.

Scheid, Zanos et al., Nature Communications, Jul. 2025

# Predicting Biological Aging at Population Scale Using Large Language Models
This study introduces an LLM prompt-based framework that predicts biological age from routine health records, enabling scalable aging assessment across populations. Applied to >10 million individuals from six cohorts (e.g., UK Biobank), the LLM-derived biological age outperformed traditional aging clocks in predicting mortality and multiple age-related diseases.
- Using LLMs from the Llama and Qwen families, the researchers applied prompt learning without supervised training on aging-related knowledge. Fed health examination text reports, the LLMs integrate individualized clinical data to infer biological age without predefined biomarkers or labels.
- LLM-based biological age achieved a concordance index of 0.76 for all-cause mortality and outperformed epigenetic clocks, telomere length, the frailty index, and conventional ML models. The difference between LLM-predicted age and chronological age (the "age gap") was strongly associated with all-cause mortality (HR 1.05).
- LLM-derived organ-specific biological ages better predicted corresponding organ diseases and enabled potential discovery of 316 aging-related protein biomarkers.
- Potential for scalable, cost-effective personalized and population-level aging assessment, with interpretability via chain-of-thought prompts.

Li, Di et al., Nature Medicine, Jul. 2025

# Predicting Insulin Resistance Using Wearables + Routine Labs at Scale
Researchers paired smartwatch-derived data (Fitbit/Pixel Watch) with demographics and routine blood biomarkers to predict insulin resistance using deep neural network models. The best-performing practical model (wearables + demographics + common labs) substantially outperformed single-source models and maintained similar performance in an independent validation cohort. Performance was strongest in high-risk groups (obesity + sedentary).
- Current methods for detecting early insulin resistance rely on snapshots in time (e.g., A1c), which can be insensitive in early stages.
- In 1,165 participants, using a Homeostatic Model Assessment of Insulin Resistance (HOMA-IR) >2.9 as ground truth, a model using only demographic variables and wearable data achieved an AUROC of 0.70. Adding fasting glucose increased performance to an AUROC of 0.78.
- Combining wearables + demographics + fasting glucose + lipid/metabolic panels achieved AUROC 0.80, with 76% sensitivity and 84% specificity. Performance was best in obese + sedentary participants, with 93% sensitivity and 95% adjusted specificity (minimizing misclassification of insulin-sensitive participants as resistant). Performance was similar in a validation set of 72 participants.
- When these insulin resistance predictions were integrated into an LLM coaching agent, endocrinologists consistently rated it superior to a base LLM in head-to-head comparisons for personalization, comprehensiveness, and trustworthiness.

Metwally, Prieto et al., ArXiv, Apr. 2025

# A Foundation Model for Wearable Behavioral Data with Individual-Level Diagnostic Prediction
Joint Embedding for Time Series (JETS) is a self-supervised joint-embedding model trained on ~3 million person-days of real-world wearable and behavioral data from 16,522 individuals. By learning robust latent representations from noisy, irregular time series, JETS improves downstream prediction of diagnoses and biomarkers compared with multiple baseline models.
- Many time series models rely on dense, regularly sampled, fixed-length inputs, which are often incongruent with real-world data. A joint-embedding predictive architecture (JEPA-style) with masking instead learns to predict missing segments in latent space rather than reconstructing raw signals.
- JETS was trained on 63 daily or low-resolution metrics (activity, sleep, HR, VO2max, respiration, self-reports), covering ~3M person-days across 16,522 users.
- It outperformed MAE, PrimeNet, and transformer baselines on many diagnoses (e.g., AUROC 0.81 for ME/CFS, 0.87 for hypertension) and led biomarker prediction despite sparse labels.

JETS shows that a foundation model trained on massive wearable time series can learn generalizable health representations that outperform existing approaches on real clinical prediction tasks.
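The JEPA-style objective described above, predicting masked segments in latent space rather than reconstructing raw signals, can be illustrated with a minimal sketch. The toy encoder, predictor, shapes, and weights below are assumptions for illustration, not the JETS implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Toy per-timestep encoder (stand-in for a learned network).
    return np.tanh(x @ W)

# Toy wearable record: 64 days x 8 daily metrics (steps, sleep, resting HR, ...)
series = rng.normal(size=(64, 8))
W = rng.normal(size=(8, 16)) * 0.1  # shared toy weights for context and target encoders

# Mask a contiguous segment of days; the objective is to predict its LATENTS,
# not to reconstruct the raw signal (the JEPA idea).
mask = np.zeros(64, dtype=bool)
mask[20:30] = True

context_latents = encode(series[~mask], W)   # (54, 16) visible portion
target_latents = encode(series[mask], W)     # (10, 16) masked portion

# Toy predictor: broadcast a summary of the context to each masked position.
pred = np.tile(context_latents.mean(axis=0), (mask.sum(), 1))  # (10, 16)

# JEPA-style training signal: MSE between predicted and target latents.
loss = float(np.mean((pred - target_latents) ** 2))
```

Working in latent space lets the model ignore unpredictable sensor noise and tolerate irregular sampling, which is why this family of objectives suits messy wearable data.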
<table><tr><td colspan="10">Table 1: Downstream Diagnosis Prediction. Left: AUROC (↑). Right: AUPRC (↑)</td></tr><tr><td>Target</td><td colspan="2">Mean-Pooling</td><td colspan="2">JETS</td><td colspan="2">MAE</td><td colspan="2">JETS-Former</td><td>PrimeNet</td></tr><tr><td>ADHD or ADD</td><td>0.643</td><td>0.245</td><td>0.668</td><td>0.260</td><td>0.612</td><td>0.214</td><td>0.623</td><td>0.204</td><td>0.611</td></tr><tr><td>Asthma</td><td>0.673</td><td>0.158</td><td>0.679</td><td>0.149</td><td>0.598</td><td>0.105</td><td>0.616</td><td>0.120</td><td>0.619</td></tr><tr><td>Atrial flutter</td><td>0.495</td><td>0.003</td><td>0.705</td><td>0.026</td><td>0.428</td><td>0.004</td><td>0.576</td><td>0.006</td><td>0.604</td></tr><tr><td>Autism spectrum</td><td>0.658</td><td>0.099</td><td>0.650</td><td>0.080</td><td>0.610</td><td>0.072</td><td>0.588</td><td>0.058</td><td>0.719</td></tr><tr><td>Circadian rhythm</td><td>0.582</td><td>0.013</td><td>0.654</td><td>0.019</td><td>0.470</td><td>0.010</td><td>0.472</td><td>0.011</td><td>0.479</td></tr><tr><td>Depression</td><td>0.630</td><td>0.230</td><td>0.648</td><td>0.239</td><td>0.573</td><td>0.216</td><td>0.619</td><td>0.206</td><td>0.656</td></tr><tr><td>ME/CFS</td><td>0.607</td><td>0.012</td><td>0.810</td><td>0.026</td><td>0.385</td><td>0.004</td><td>0.458</td><td>0.004</td><td>0.580</td></tr><tr><td>Osteoporosis</td><td>0.749</td><td>0.055</td><td>0.758</td><td>0.050</td><td>0.648</td><td>0.028</td><td>0.585</td><td>0.038</td><td>0.865</td></tr><tr><td>POTS</td><td>0.678</td><td>0.233</td><td>0.731</td><td>0.307</td><td>0.630</td><td>0.028</td><td>0.680</td><td>0.276</td><td>0.754</td></tr><tr><td>Sick Sinus Syndrome</td><td>0.748</td><td>0.012</td><td>0.868</td><td>0.125</td><td>0.670</td><td>0.005</td><td>0.396</td><td>0.005</td><td>0.673</td></tr><tr><td>Substance abuse</td><td>0.589</td><td>0.076</td><td>0.915</td><td>0.047</td><td>0.613</td><td>0.064</td><td>0.700</td><td>0.026</td><td>0.757</td></tr><tr><td>Long 
Covid</td><td>0.631</td><td>0.047</td><td>0.672</td><td>0.047</td><td>0.521</td><td>0.022</td><td>0.512</td><td>0.022</td><td>0.587</td></tr><tr><td>Anxiety</td><td>0.643</td><td>0.301</td><td>0.675</td><td>0.345</td><td>0.592</td><td>0.260</td><td>0.641</td><td>0.271</td><td>0.697</td></tr><tr><td>Hypertension</td><td>0.661</td><td>0.062</td><td>0.868</td><td>0.164</td><td>0.562</td><td>0.136</td><td>0.649</td><td>0.043</td><td>0.731</td></tr></table>

Xie, Ballinger et al., OpenReview, Dec. 2025

# AgentMD: Using LLM Agents to Run Clinical Risk Calculators for Risk Prediction at Scale
Clinical calculators are important medical tools but remain underutilized due to poor dissemination, workflow burden, and fragmented implementation. AgentMD is an AI agent that reads notes, determines which calculators apply, extracts the required inputs, and runs the calculators, enabling accurate and interpretable risk prediction.
- AgentMD automatically converted PubMed articles into 2,164 executable clinical calculators, achieving >85% accuracy on expert quality checks and >90% pass rates on unit testing.
- On a controlled benchmark (RiskQA, which requires selecting the correct calculator, computing the score, and interpreting the result), AgentMD outperformed GPT-4 by a wide margin (88% vs. 41% accuracy), showing the effectiveness of tool augmentation.
- When applied to real-world emergency department notes, clinicians judged AgentMD outputs as largely eligible for use, correct, and clinically useful, with most errors attributable to missing data rather than logic failures.
- Across 9,800+ hospital admission notes in MIMIC, AgentMD generated institutional risk profiles and showed improved in-hospital mortality prediction compared to GPT-4.

Jin, Lu et al., Nature Communications, Oct. 2025

# Takeaways
- Current frontier LLMs show superhuman reasoning on controlled tasks but are overconfident and remain fragile when facing uncertainty.
- Work needs to be done to improve models' metacognitive abilities (i.e., a model's awareness of its own uncertainty).
- As models approach superhuman capabilities, thoughtful approaches will be needed for assessments that go beyond concordance with human answers.
- Large-scale prediction of clinical signs must be connected to actionable clinical decision points.
- In turn, these decision points should be prospectively studied to determine whether outcomes are improving or whether technology is being added without benefit.

# Benchmarks & Evaluations

# Benchmarks & Evaluation

# Gaps in existing evaluation
- Over-measuring medical knowledge; under-measuring use on real-world data, bias, and fairness
- Overconfidence
- Dialogue reduces LLM accuracy compared to static vignettes

# New benchmarks
- OpenAI HealthBench: AI performance in realistic health dialogues
- MedHELM: evaluation of AI on clinical workflow tasks
- MedAgentBench: AI in a simulated EHR environment
- NOHARM: measuring the clinical safety of LLMs

# Benchmarks & Evaluations
In 2025, multiple-choice benchmarks became saturated, creating a need for tougher, broader, and more realistic evaluation on the path to trustworthy clinical AI.
- Slides 32-34: Despite strong multiple-choice benchmark performance, major evaluation gaps remain (administrative tasks, real patient data, dialogue, and bias).
- Slides 35-37: New benchmarks (HealthBench, MedHELM, MedAgentBench) raise the bar for AI across performance domains, including simulated EHR settings.
- Slide 38: Despite strong knowledge, frontier models can still cause harm.

The consensus is clear: better evaluation, not just better models, is the prerequisite for trustworthy clinical AI.

# Are We Measuring What Matters?
A systematic review of LLM evaluations (n = 519 studies from 2022-2024) found that most focused on assessing medical knowledge, with only 5% of studies using real patient data. Administrative tasks (e.g., summarization, writing prescriptions), fairness, bias, and toxicity were understudied.
- The most commonly evaluated health care tasks involved assessing medical knowledge, such as answering medical licensing examination questions (45%) and making clinical diagnoses (19%). Administrative tasks, including assigning billing codes (0.2%) and writing prescriptions (0.2%), were infrequently studied.
- Nearly all studies (95%) used accuracy as the primary evaluation metric, while fairness, bias, and toxicity (16%), deployment considerations (5%), and calibration or uncertainty (1%) were less frequently assessed.
- Future evaluations should adopt standardized metrics, be transparent about failure modes (e.g., technical vs. practical), incorporate real clinical data, and expand their scope to include a broader range of tasks and medical specialties.

<table><tr><td>Dimension of evaluation</td><td>Metric example</td><td>Illustrative response demonstrating each dimension of evaluation</td><td>Definition</td><td>Studies, %</td></tr><tr><td>Accuracy</td><td>Human-evaluated correctness, ROUGE<sup>44</sup>, MEDCON<sup>45</sup></td><td>Correct response: common symptoms of type 2 diabetes include frequent urination, increased thirst, unexplained weight loss, fatigue, and blurred vision.</td><td>Measures how close the LLM output is to the true or expected answer.</td><td>95.4</td></tr><tr><td>Comprehensiveness</td><td>Human-evaluated comprehensiveness, fluency, UniEval relevance<sup>46</sup></td><td>Comprehensive response: symptoms of type 2 diabetes include frequent urination, increased thirst, unexplained weight loss, fatigue, blurred vision, slow wound healing, and tingling or numbness in the hands or feet.</td><td>Measures how well an LLM's output coherently and concisely addresses all aspects of the task and reference provided.</td><td>47.0</td></tr><tr><td>Factuality</td><td>Human-evaluated factual consistency, citation recall, citation precision<sup>47</sup></td><td>Factual response: symptoms of type 2 diabetes are often related to insulin resistance and include frequent urination, increased thirst,
unexplained weight loss, fatigue, and blurred vision.</td><td>Measures whether an LLM's output for a specific task originates from a verifiable and citable source. Note that a response can be accurate yet non-factual if it originates from a hallucinated citation.</td><td>18.3</td></tr><tr><td>Robustness</td><td>Human-evaluated robustness, exact match on LLM input with intentional typos, F1 score on LLM input with intentional use of word synonyms<sup>4</sup></td><td>Variation 1 (synonym): What are the signs of type 2 diabetes? Robust response: signs of type 2 diabetes include frequent urination, increased thirst, unexplained weight loss, fatigue, and blurred vision. Variation 2 (typo): symptom of type 2 diabetes? Robust response: symptoms of type 2 diabetes include frequent urination, increased thirst, unexplained weight loss, fatigue, and blurred vision.</td><td>Measures the LLM's resilience against adversarial attacks and perturbations such as typos.</td><td>14.8</td></tr><tr><td>Fairness, bias, and toxicity</td><td>Human-evaluated toxicity, counterfactual fairness, performance disparities across race<sup>4</sup></td><td>Unbiased response: symptoms of type 2 diabetes can vary, and it's important to seek medical advice for proper diagnosis. Common symptoms include frequent urination, increased thirst, unexplained weight loss, fatigue, and blurred vision.
Biased response: type 2 diabetes symptoms are often seen in individuals with poor lifestyle choices.</td><td>Measures whether an LLM's output is equitable, impartial, and free from harmful stereotypes or biases, ensuring it does not perpetuate injustice or toxicity across diverse groups.</td><td>15.8</td></tr><tr><td>Deployment metrics</td><td>Cost, latency, inference runtime<sup>4</sup></td><td>Response with runtime: the model provides information about type 2 diabetes symptoms in less than 0.5 s, ensuring quick access to essential health information.</td><td>Measures the technical and parametric details of an LLM required to generate a desired output.</td><td>4.6</td></tr><tr><td>Calibration and uncertainty</td><td>Human-evaluated uncertainty, calibration error, Platt-scaled calibration slope<sup>4</sup></td><td>Response with an uncertainty estimate: as per my knowledge, the most common symptoms of type 2 diabetes are frequent urination, increased thirst, and unexplained weight loss; however, my information might be outdated, so I would put a confidence score of 0.3 on my response and recommend contacting a health care clinician for a more accurate and certain response.</td><td>Measures how uncertain or underconfident an LLM is about its output for a specific task.</td><td>1.2</td></tr></table>

Bedi, Shah et al., JAMA, Jan. 2025

# Do LLMs Know What They Don't Know?

Using a new benchmark (MetaMedQA) designed to test confidence, uncertainty, and recognition of missing information, the authors show that 12 current LLMs consistently underperform on core metacognitive tasks essential for safe clinical reasoning. Even top-performing models answer confidently when the correct answer is purposefully absent, rarely admit uncertainty, and struggle to detect unanswerable or malformed questions. MetaMedQA modifies MedQA by adding fictional clinical questions, malformed questions, and "none of the above" / "I don't know" options to test self-awareness and uncertainty handling.
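The MetaMedQA design can be pictured as a small evaluation harness: append an explicit "I don't know" option to each question, deliberately remove the correct answer from some of them, and then measure accuracy, reported confidence, and "unknown recall" (how often the model takes the escape hatch when the question is unanswerable). The sketch below is purely illustrative and is not the benchmark's actual code; `ask_model` is a stand-in that mimics the overconfidence pattern the paper reports.

```python
# Illustrative MetaMedQA-style metacognition check (toy code, not the real benchmark).

def ask_model(question, options):
    """Stand-in for an LLM: always picks the first option with full confidence,
    mimicking the overconfidence failure mode described in the study."""
    return {"choice": options[0], "confidence": 1.0}

def evaluate(items):
    """Each item: {question, options, answer}; answer is 'I don't know' when the
    true answer was deliberately removed from the options."""
    IDK = "I don't know"
    correct = idk_hits = idk_total = 0
    confidences = []
    for item in items:
        options = item["options"] + [IDK]      # add an explicit escape hatch
        reply = ask_model(item["question"], options)
        confidences.append(reply["confidence"])
        if reply["choice"] == item["answer"]:
            correct += 1
        if item["answer"] == IDK:              # unanswerable variant
            idk_total += 1
            if reply["choice"] == IDK:
                idk_hits += 1
    return {
        "accuracy": correct / len(items),
        "mean_confidence": sum(confidences) / len(confidences),
        # recall of "I don't know" on unanswerable items: the metacognitive signal
        "unknown_recall": idk_hits / idk_total if idk_total else None,
    }

items = [
    {"question": "Q1", "options": ["A", "B"], "answer": "A"},
    {"question": "Q2 (answer removed)", "options": ["C", "D"], "answer": "I don't know"},
]
print(evaluate(items))
```

Run on the two toy items, this harness reproduces the disconnect the paper highlights: the stand-in model scores 50% accuracy at 100% confidence and never selects "I don't know" on the unanswerable item.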
- After the questions were modified, bigger/newer models were the most accurate (e.g., GPT-4o $73\%$). However, most models gave maximum confidence scores. In an "unknown" analysis where the correct answer is not among the options, $0\%$ of models recognized questions as unanswerable, showing a major disconnect between accuracy and confidence.
- Explicitly warning models via prompting that some questions may be "impossible" improved uncertainty recognition, but it did not fix fundamental metacognitive gaps and is also impractical.
- Improving metacognitive capabilities is essential for patient safety, as medicine is inherently practiced under uncertain conditions.

Fig. 3 | Recall of "None of the above" for models on the MetaMedQA benchmark

Griot, Yuskel et al., Nature Communications, Jan. 2025

# LLM Accuracy When Multiple Choice Turns Conversational

Noting that early evaluations of LLM abilities relied on multiple-choice questions over static clinical vignettes, researchers developed an evaluation framework that converts static vignettes into natural dialogue, using interplay among a history-taking clinical LLM, a patient-AI agent, and a grader-AI agent. Diagnostic performance dropped significantly across all LLMs, highlighting the need for multi-turn evaluations.

- The Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) proposed a multi-agent system with a clinical LLM (doctor), a patient-AI agent simulating layperson responses, and a grader-AI agent validated by medical experts.
- The study covered 2,000 clinical vignettes and four models: GPT-4, GPT-3.5, Mistral-v2-7b, and LLaMA-2-7b.
- Models missed critical history details. Diagnostic accuracy dropped from $0.82 \rightarrow 0.63$ (GPT-4) and $0.66 \rightarrow 0.47$ (GPT-3.5) when shifting from static vignettes to conversational formats with multiple-choice answers still presented. When multiple-choice answers were removed, GPT-4 dropped to 0.49.
- Summarizing the conversation back into a vignette at the end improved accuracy.
- The study called for more robust, open-ended, realistic evaluations of LLMs.

Johri, Rajpurkar et al., Nature Medicine, Jan. 2025

# HealthBench: A Physician-Grounded Benchmark for AI Performance

OpenAI developed a novel benchmark, HealthBench, to tackle three key gaps in AI evaluation: real-world impact (e.g., open-ended, dynamic conversations), validation against physicians, and increasing benchmark saturation. It includes 5,000 health conversations, each with a custom physician-created rubric to grade model responses, allowing evaluation of where models fail and what is improving over time.

- The 5,000 conversations were synthetically generated, with an average of 2.6 turns. 262 physicians from 60 countries generated 48,562 rubric criteria to grade model responses. The benchmark evaluates five behavioral axes (e.g., communication quality) and seven clinical themes (e.g., emergency referrals).
- The authors also released a highly consistent subset of physician criteria, "HealthBench Consensus," as well as a difficult variant, "HealthBench Hard," which leaves room for improvement.
- HealthBench represents a shift from static test questions to realistic conversational evaluation, showing steady progress from GPT-3.5 (16%) → GPT-4o (32%) → o3 (60%). A "worst of n" evaluation assessed reliability, with newer models performing best.
- Reasoning models achieved the highest performance, particularly in areas such as communication quality and emergency referrals. Performance was lower in context seeking and health data tasks.

Arora, Singhal et al., ArXiv, May 2025

# MedHELM: A Physician-Grounded Benchmark For AI-Clinical Workflows

HealthBench's limitations included a lack of evaluation of everyday workflow tasks on real EHR data; it is closer to an evaluation of "advice line" questions.
Stanford introduced MedHELM, a bundle of 35 distinct benchmarks covering a physician-validated taxonomy of five categories, 22 subcategories, and 121 tasks, including evaluations on real EHR data (12 of the 35 benchmarks).

- 29 physicians developed the clinician-validated taxonomy to ensure alignment with physicians' day-to-day tasks.
- Five categories: administration and workflow, clinical decision support, clinical note generation, medical research assistance, and patient communication and education.
- Across the 35 benchmarks, among 9 frontier models, DeepSeek R1 (0.66) and o3-mini (0.64) had the best overall performance at winning head-to-head comparisons. Claude 3.5 Sonnet achieved comparable results at a $40\%$ lower estimated computational cost.
- Within the five categories, LLMs excelled in documentation (0.74–0.85) and patient communication (0.76–0.89), but performance decreased in clinical decision support (0.61–0.76) and administration & workflow (0.53–0.63).

Bedi, Shah et al., ArXiv, Jun. 2025

# MedAgentBench: How Close Are We to Agentic AI for Medical Tasks?

As physicians spend only roughly $27\%$ of their time performing direct clinical tasks, AI agents offer opportunities to reduce administrative burden, improve care quality, and address staff shortages. Stanford researchers investigated whether current LLMs are capable of agentic behavior in a virtual EHR environment. Models were better suited to query-based tasks than action-based ones.

- Two physicians generated 300 commonly encountered tasks, half query-based (e.g., retrieving information from the chart) and half action-based (e.g., modifying the chart, such as placing orders).
- They created a virtual EHR environment that was Fast Healthcare Interoperability Resources (FHIR) compliant, consisting of 100 patients and $>700,000$ data elements.
- Among 12 frontier models, using a pass@1 metric for task success (i.e., the model has one attempt), Claude 3.5 Sonnet (70%), GPT-4o (64%), and DeepSeek-V3 (63%) led performance. Models excelled at query-based tasks but struggled with action-based tasks. For example, Claude 3.5 Sonnet achieved 85% on queries but 54% on actions.
- The study establishes a next-generation benchmark for AI as an agentic teammate, measuring not just reasoning but multi-step planning, EHR interaction, and workflow reliability.

Jiang, Chen et al., NEJM AI, Aug. 2025

<table><tr><td colspan="6">Table 3. Success Rate of State-of-the-Art LLMs on MedAgentBench.<sup>a</sup></td></tr><tr><td>Model</td><td>Size</td><td>Form</td><td>Overall SR (%)</td><td>Query SR (%)</td><td>Action SR (%)</td></tr><tr><td>Claude 3.5 Sonnet v2</td><td>N/A</td><td>API</td><td>69.67†</td><td>85.33†</td><td>54.00</td></tr><tr><td>GPT-4o</td><td>N/A</td><td>API</td><td>64.00</td><td>72.00</td><td>56.00</td></tr><tr><td>DeepSeek-V3</td><td>685B</td><td>Open</td><td>62.67</td><td>70.67</td><td>54.67</td></tr><tr><td>Gemini-1.5 Pro</td><td>N/A</td><td>API</td><td>62.00</td><td>52.67</td><td>71.33†</td></tr><tr><td>GPT-4o-mini</td><td>N/A</td><td>API</td><td>56.33</td><td>59.33</td><td>53.33</td></tr><tr><td>o3-mini</td><td>N/A</td><td>API</td><td>51.67</td><td>54.67</td><td>48.67</td></tr><tr><td>Qwen2.5</td><td>72B</td><td>Open</td><td>51.33</td><td>38.67</td><td>64.00</td></tr><tr><td>Llama 3.3</td><td>70B</td><td>Open</td><td>46.33</td><td>50.00</td><td>42.67</td></tr><tr><td>Gemini 2.0 Flash</td><td>N/A</td><td>API</td><td>38.33</td><td>34.00</td><td>42.67</td></tr><tr><td>Gemma2</td><td>27B</td><td>Open</td><td>19.33</td><td>38.67</td><td>0.00</td></tr><tr><td>Gemini 2.0 Pro</td><td>N/A</td><td>API</td><td>18.00</td><td>25.33</td><td>10.67</td></tr><tr><td>Mistral v0.3</td><td>7B</td><td>Open</td><td>4.00</td><td>8.00</td><td>0.00</td></tr></table>

<sup>a</sup> Performance of various state-of-the-art LLMs on MedAgentBench, measured by overall
success rate (SR), query SR, and action SR. API denotes application programming interface; GPT, generative pretrained transformer; LLM, large language model; and N/A, not applicable. † indicates the best-performing SR value in each column.

# First, Do NOHARM: Measuring Clinical Safety of LLMs

NOHARM is a specialist-validated benchmark using 100 real primary-care-to-specialist cases to quantify how often LLM medical recommendations could harm patients. Across 31 LLMs, including commercial RAG systems, even top models can produce potentially severely harmful advice in $10 - 20\%$ of cases, with most harm coming from omission of critical tests or management. However, diverse multi-agent approaches can significantly improve safety performance.

- The 100 authentic consult cases spanned 10 specialties, with 4,249 possible management actions and 12,747 expert annotations.
- Across 31 LLMs, potential severe harm occurs in up to $22\%$ of cases, and errors of omission account for $77\%$ of severe harms (failing to recommend critical tests or treatments).
- Standard AGI and medical "knowledge" benchmarks do not reliably predict clinical safety, correlating only moderately with NOHARM safety scores (e.g., $R = 0.61 - 0.64$ with MedQA).
- The best LLMs outperform generalist physicians on safety by $10\%$, and three-agent "advisor + guardian" LLM systems further reduce harm, with $\sim 6$-fold higher odds of landing in the top quartile of safety performance compared with solo models.
- Commercial clinical RAG models score well (currently ranked #1 and #3).

# Takeaways

- Benchmarks that encompass a suite of real-world tasks allow researchers to tangibly track impactful progress.
- Benchmarks should shift away from synthetic data and toward the "messiness" of real-world data.
- Current benchmarks often entail a single-turn response; substantial gaps exist in benchmarking long-run, multi-turn contexts.
- Practitioners want automation of administration and workflow tasks.
These tasks are underrepresented in benchmarking, and current frontier models tend to perform relatively poorly on them.
- As model capabilities improve, emphasis should be placed on benchmarking failure modes and safety.

# Foundational Methods

Research in 2025 brought key methodological themes that push the boundaries of clinical AI, such as medical event prediction models, multi-agent orchestration, and multimodality.

- Slides 43-44: Novel methods of converting medical events/timelines into tokens bring a new era of medical event prediction models.
- Slides 45-48: Multi-agent systems outperform off-the-shelf foundation models; however, training for optimal outcomes may not be as simple as optimizing each individual agent on its respective task.
- Slides 49-53: Multimodal models readily surpass unimodal analyses and have shown success in diagnostic copilots.
- Slides 54-57: Reasoning models still leave room for improvement through rewarding reasoning, fine-tuning, and reinforcement learning. However, fine-tuning has yet to replace RAG (which has its own flaws) for extending domain reasoning.

Together, these findings indicate that future progress in clinical AI will hinge on how models are adapted and orchestrated, in addition to how they are pre-trained.
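One way to picture the multi-agent orchestration theme above is an "advisor + guardian" loop of the kind explored in clinical-safety work like NOHARM: one agent proposes a management plan, a second audits it for omissions, and the loop repeats until no safety gaps remain. The sketch below is purely illustrative; both agents are stand-in functions, and the single "required step" safety rule is invented for the example.

```python
# Illustrative "advisor + guardian" multi-agent loop (toy stand-ins, not real LLM calls).

def advisor(case):
    """Stand-in for an LLM that proposes a management plan for a case."""
    return {"case": case, "plan": ["start antibiotics"]}

def guardian(proposal):
    """Stand-in for a safety-checking LLM: flags omissions of critical steps.
    Returns the list of missing required actions (empty list means safe)."""
    required = {"blood cultures"}                 # toy safety rule for the example
    return sorted(required - set(proposal["plan"]))

def orchestrate(case, max_rounds=3):
    """Advisor proposes; guardian audits; the plan is amended until no gaps remain."""
    proposal = advisor(case)
    for _ in range(max_rounds):
        gaps = guardian(proposal)
        if not gaps:                              # guardian is satisfied
            break
        proposal["plan"].extend(gaps)             # amend plan with missing steps
    return proposal["plan"]

print(orchestrate("febrile neutropenia"))         # ['start antibiotics', 'blood cultures']
```

The design choice mirrored here is that errors of omission dominate severe harms, so the guardian's job is not to regrade the whole answer but to check a plan against critical actions that must not be missing.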
# Prediction models
- Human disease trajectories
- Next medical event

# Multi-agent systems
- Microsoft MAI-DxO: a more efficient and accurate diagnostic process
- MAC framework for rare disease diagnoses
- TrialGenie for clinical trial design
- Gaps: optimization paradox

# Reasoning models
- Process reward models for evaluating step-by-step reasoning
- Disentangling knowledge and reasoning
- Supervised fine-tuning for domain reasoning
- SourceCheckup to validate LLM citations

# Multi-modal systems
- Google's AMIE for diagnosis
- Cancer recurrence risk stratification
- Vision-language models for oncology and eye care
- Gaps: overconfidence

# Predicting Human Disease Trajectories with High Accuracy

Delphi-2M, a generative transformer trained on $>400,000$ UK Biobank participants and validated on 1.9 million Danish individuals, learns to predict the next disease and its timing across more than 1,000 conditions. It outperforms traditional task-specific risk models and can simulate realistic lifetime disease trajectories.

- Delphi-2M's (GPT-2-based) predictions average AUC 0.76 (0.70 at 10 years) for next-disease diagnosis on internal data and AUC 0.67 on external data. It exceeds or matches clinical risk scores for CVD, dementia, and death, but underperformed compared with HbA1c.
- It represents each patient as (token, age) pairs (mostly ICD-10 disease tokens).
- It can sample entire lifetime health trajectories, producing realistic multi-disease progressions up to 20 years ahead.
- Explainability via visualizations revealed disease clusters that mirror known medical groupings, which may be useful for genomic association studies. The ultimate aim is to support clinicians in identifying individuals at elevated risk and to enable earlier preventive interventions.

Shmatko, Gertsung et al., Nature, Sept. 2025

# Predicting the Next Medical Event Using 118M Patients from Epic Medical Records

Epic's Cosmos Medical Event Transformer (CoMET) is a generative, transformer-based medical-event foundation model trained on 118M patients / 115B events from Epic Cosmos. It forecasts the next clinical event along a patient timeline and is the largest medical foundation model by number of medical events used for training.

- CoMET autoregressively predicts what happens next in a patient's journey: beyond disease progression, it also predicts readmissions, length of stay, treatment response, and future diagnoses.
- It explicitly tokenizes many EHR elements into a vocabulary and even inserts time-interval tokens between events.
- On 78 real-world tasks, such as diagnosis prediction, length of stay, readmissions, and disease progression, CoMET outperformed or matched task-specific models without requiring fine-tuning or few-shot examples.
- A general medical event foundation model that is scalable, simulation-based predict
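The timeline tokenization used by models in this family, event tokens interleaved with time-interval tokens, can be sketched in a few lines. This is a toy illustration only: the token names, gap buckets, and example ICD-10 codes below are invented for the sketch, and real vocabularies (as in Delphi-2M or CoMET) are vastly larger.

```python
# Toy timeline tokenizer in the spirit of medical-event foundation models
# (all token names and bucket widths are invented for illustration).

def interval_token(days):
    """Map the gap between two events to a coarse time-interval token."""
    if days < 30:
        return "<GAP_1M"    # less than a month
    if days < 365:
        return "<GAP_1Y"    # less than a year
    return ">=GAP_1Y"       # a year or more

def tokenize_timeline(events):
    """events: list of (day_offset, code) pairs (e.g., ICD-10 codes), sorted by time.
    Emits event tokens interleaved with time-interval tokens, the sequence a
    next-event prediction model would consume autoregressively."""
    tokens = []
    prev_day = None
    for day, code in events:
        if prev_day is not None:
            tokens.append(interval_token(day - prev_day))
        tokens.append(code)
        prev_day = day
    return tokens

# Diabetes, then hypertension 10 days later, then CKD ~16 months after that.
timeline = [(0, "E11"), (10, "I10"), (500, "N18")]
print(tokenize_timeline(timeline))
# ['E11', '<GAP_1M', 'I10', '>=GAP_1Y', 'N18']
```

Encoding elapsed time as explicit tokens lets a standard next-token transformer learn not just which event comes next but roughly when, which is what allows these models to simulate multi-year trajectories.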