Sci-Tech

AI not yet capable of independent clinical diagnosis and treatment

2026-04-14   

Despite the increasing application of artificial intelligence (AI) in medicine, it still falls well short of thinking like a doctor. A recent study by the MESH incubator team at Massachusetts General Hospital in the United States found that generative AI still lacks the clinical reasoning needed to handle diagnosis and treatment independently. The results were published in the latest issue of JAMA Network Open.

The team selected 21 large language models, including ChatGPT, DeepSeek, Claude, Gemini, and Grok, and tested them on 29 published clinical cases, simulating the real diagnostic process by providing patient information step by step, from presenting symptoms through laboratory and imaging results.

When given complete information, all models delivered the correct final diagnosis in over 90% of cases. In the critical early diagnostic stage, however, the models generally performed poorly: in over 80% of cases they failed to produce a reasonable "differential diagnosis", the systematic analysis and narrowing down of multiple possible diseases. This skill is considered the core of clinical reasoning and an essential foundation for doctors' decision-making.

To evaluate the models comprehensively, the team proposed a new metric, PrIME-LLM, which scores a model across multiple dimensions, from proposing candidate diagnoses and selecting appropriate tests to delivering a final diagnosis and formulating a treatment plan. Overall scores ranged from 64% to 78%, with marked differences between models. The team noted that large language models are good at "providing answers" when information is complete, but weaker when information is incomplete and open-ended reasoning is required.
Model performance improved as laboratory data and imaging results were added, and the newer generation of models outperformed older versions overall, indicating that the underlying technology is still improving. The team stated that current large language models are not yet suitable for unsupervised clinical practice; their value lies in assisting doctors' decision-making rather than replacing it. (New Society)

Editor: Momo   Responsible editor: Chen Zhaozhao

Source: Science and Technology Daily


