Health

Nature partner journal publishes new standard for medical AI evaluation; Future Doctor's MedGPT ranks first globally

2026-01-08   

Recently, the digital medicine journal npj Digital Medicine (a Nature portfolio journal) published a new study from a Chinese research team: the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a global standard for evaluating the clinical applicability of medical AI. This is the first time a Chinese team has published a standards study in the field of large language models for healthcare in this journal. The establishment of CSEDB points the way for the iterative optimization of medical models and lays the groundwork for medical AI to enter serious diagnosis and treatment scenarios. In a systematic evaluation of mainstream AI models worldwide based on this standard, MedGPT, built by the Chinese team Future Doctor, ranked first overall.

The CSEDB evaluation criteria were jointly developed by the Future Doctor research team and 32 Chinese clinical experts, drawn from core specialties at 23 medical institutions, including Peking Union Medical College Hospital, the Cancer Hospital of the Chinese Academy of Medical Sciences, the Chinese People's Liberation Army General Hospital, and Huashan Hospital affiliated with Fudan University. The new standard breaks with the earlier practice of judging medical AI solely by question-answering accuracy, introducing for the first time a dual-track evaluation system of "safety" and "effectiveness" that closely matches real clinical decision-making. The evaluation covers 30 core indicators: 17 focus on safety, including key scenarios such as critical-symptom identification, fatal diagnostic errors, and absolute medication contraindications; 13 focus on effectiveness, including core requirements such as prioritization among multiple comorbidities and consistency of diagnosis and treatment plans with clinical guidelines.
At the same time, CSEDB assigns each indicator a weighted score of 1 to 5 points according to clinical risk. A weight of 5 corresponds to "potentially fatal consequences," such as high-risk situations where a dose is mismatched to organ dysfunction; a weight of 1 corresponds to "reversible injury," such as the accuracy of professional interpretation of cases and examination reports.

In its testing method, CSEDB also breaks with the static "standard question, standard answer" mode of the past. Built on the indicators above, the evaluation system comprises 2,069 open-ended Q&A items spanning 26 clinical specialties, comprehensively simulating complex clinical diagnosis and treatment scenarios. The establishment of CSEDB marks the first systematic evaluation standard in the AI era that genuinely reflects the clinical diagnostic and treatment capabilities of medical AI.

Mainstream global models participated in the testing, including DeepSeek-R1, OpenAI o3, Gemini-2.5, Qwen3-235B, and Claude-3.7. In this systematic evaluation, the MedGPT AI medical cognition system developed by Future Doctor ranked first globally on all three core metrics: overall score (0.985), safety score (0.912), and effectiveness score (0.861). MedGPT's performance stems from Future Doctor's founding intention: from the start of the project, the safety and effectiveness that clinical experts care about were built into the system's foundations, with the aim of making medical AI "think like a doctor" rather than merely "speak like a doctor." Its underlying technical architecture simulates the cognitive logic of the human brain rather than relying on intelligence emerging naturally from massive training data. In 2023, MedGPT entered clinical trials with real patients.
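The risk-weighted scoring scheme described above can be sketched in a few lines of code. Note that the indicator names, weights, and per-item scores below are invented placeholders for illustration only; they are not the actual CSEDB rubric or published results.

```python
# Illustrative sketch of risk-weighted benchmark aggregation in the style
# described for CSEDB. All indicator names, weights, and scores here are
# hypothetical examples, NOT real CSEDB data.

def weighted_score(results):
    """results: list of (weight, score) pairs, where weight is the
    clinical-risk weight (1-5, 5 = potentially fatal consequences) and
    score is the model's normalized performance on that indicator (0-1).
    Returns the weight-normalized aggregate in [0, 1]."""
    total_weight = sum(w for w, _ in results)
    return sum(w * s for w, s in results) / total_weight

# Hypothetical safety indicators: (risk weight, normalized score)
safety_results = [
    (5, 0.95),  # e.g. fatal diagnostic error avoided (highest risk)
    (5, 0.90),  # e.g. absolute medication contraindication respected
    (3, 0.85),  # e.g. critical-symptom identification
]
# Hypothetical effectiveness indicators
effectiveness_results = [
    (4, 0.88),  # e.g. treatment plan consistent with guidelines
    (1, 0.80),  # e.g. report-interpretation accuracy (reversible harm)
]

print(round(weighted_score(safety_results), 3))
print(round(weighted_score(effectiveness_results), 3))
```

The key design point is that a mistake on a weight-5 indicator drags the aggregate down far more than the same mistake on a weight-1 indicator, which is what lets the benchmark distinguish "safe" models from merely "accurate" ones.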
Today, this capability is still iterating: more than 10,000 doctors interact with patients through the Future Doctor platform, accumulating 20,000 items of real diagnosis-and-treatment feedback every week. Through this "feedback is iteration" flywheel, MedGPT's accuracy improves by 1.2%-1.5% per month, continuously pushing the clinical diagnostic and treatment capabilities of medical AI to a higher level. (New Society)

Editor: He Chuanning    Responsible editor: Su Suiyue

Source: Xinhua

