Performance of GPT-5 and Gemini 2.5 Pro on the Orthopaedic In-Training Examination.

Title:	Performance of GPT-5 and Gemini 2.5 Pro on the Orthopaedic In-Training Examination.
Authors:	Huy LD; College of Health Sciences VinUniversity.; Orthopaedics & Sports Medicine Center Vinmec International Hospital.; Anh LN; College of Health Sciences VinUniversity.; Orthopaedics & Sports Medicine Center Vinmec International Hospital.; Quang MH; College of Health Sciences VinUniversity.; Orthopaedics & Sports Medicine Center Vinmec International Hospital.; Phuc LH; College of Health Sciences VinUniversity.; Orthopaedics & Sports Medicine Center Vinmec International Hospital.; Trung ND; College of Health Sciences VinUniversity.; Orthopaedics & Sports Medicine Center Vinmec International Hospital.; Thang VD; College of Health Sciences VinUniversity.; Orthopaedics & Sports Medicine Center Vinmec International Hospital.; Nam LH; College of Health Sciences VinUniversity.; Department of General Orthopedics 108 Military Central Hospital.; Dung TT; College of Health Sciences VinUniversity.; Orthopaedics & Sports Medicine Center Vinmec International Hospital.
Source:	Orthopedic reviews [Orthop Rev (Pavia)] 2026 Apr 17; Vol. 18, pp. 160184. Date of Electronic Publication: 2026 Apr 17 (Print Publication: 2026).
Publication Type:	Journal Article
Language:	English
Journal Info:	Publisher: Open Medical Publishing Country of Publication: United States NLM ID: 101524779 Publication Model: eCollection Cited Medium: Internet ISSN: 2035-8164 (Electronic) Linking ISSN: 20358164 NLM ISO Abbreviation: Orthop Rev (Pavia) Subsets: PubMed not MEDLINE
Imprint Name(s):	Publication: 2021- : [Scottsdale, AZ] : Open Medical Publishing; Original Publication: Pavia, Italy : PagePress, 2009-
Abstract:	Background: Previous studies evaluating large language models (LLMs) on the Orthopaedic In-Training Examination (OITE) have primarily focused on earlier-generation models and single-pass accuracy. These investigations did not assess newer multimodal systems such as GPT-5 and Gemini 2.5 Pro, nor did they examine the reasoning quality underlying model responses or the consistency of outputs across repeated trials. As LLMs are increasingly used as educational tools, a more comprehensive evaluation framework is needed to assess not only correctness but also reliability and explanatory validity on specialty-specific, image-rich examinations.; Methods: We conducted a controlled, parallel evaluation of GPT-5 and Gemini 2.5 Pro using 412 OITE-style questions from the 2023-2024 examination cycle obtained via an institutional AAOS ResStudy subscription. Primary outcomes included overall and subspecialty-specific accuracy. Secondary analyses evaluated explanatory quality, error-pattern classification, response consistency across repeated trials, and performance stratified by imaging burden. Paired accuracy was compared using McNemar's exact test.; Results: Gemini 2.5 Pro demonstrated higher overall accuracy than GPT-5 on the 2023-2024 OITE question set (81.1% vs 76.0%), with both models exceeding published PGY-5 resident benchmarks. Accuracy declined significantly on questions containing images (74.2% vs 71.6%). Subspecialty performance varied widely, with accuracy ranging from 42.9% to 94.1% for GPT-5 and from 57.1% to 95.8% for Gemini, and both models performing poorest in Hand and Wrist questions. Among incorrect responses, faulty reasoning accounted for 52.5% of GPT-5 errors, whereas stem misinterpretation was the predominant error for Gemini (43.6%). Incorrect or partially correct explanations accompanied 45.4% of GPT-5 and 41.7% of Gemini responses. Consistency testing showed high reproducibility (fully consistent responses: 88% for GPT-5 and 84% for Gemini), with all inconsistent outputs occurring in image-containing questions.; Conclusions: GPT-5 and Gemini 2.5 Pro demonstrate strong performance on recent OITE content, exceeding prior LLM benchmarks; however, persistent limitations in multimodal reasoning, explanatory reliability, and response consistency indicate that high accuracy alone does not ensure dependable clinical reasoning, underscoring the need for cautious educational use.
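Note on the paired comparison: the Methods name McNemar's exact test, which uses only the discordant questions, that is, those answered correctly by one model and incorrectly by the other. The sketch below shows how such a test is commonly computed as a two-sided binomial test on the discordant pairs. It is an illustration, not the authors' code. The function name `mcnemar_exact` and the counts in the usage line are hypothetical, since the abstract does not report the discordant counts.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions model A answered correctly and model B incorrectly
    c: questions model B answered correctly and model A incorrectly
    Under the null hypothesis, each discordant pair is equally likely
    to favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Lower-tail probability P(X <= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test and cap at 1.
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only (not taken from the study).
p_value = mcnemar_exact(3, 9)
```

Questions that both models answer correctly, or both answer incorrectly, do not enter the calculation, which is why overall accuracies alone are not enough to reproduce the reported comparison.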
Competing Interests:	The authors declare no competing interests.
Contributed Indexing:	Keywords: Artificial intelligence; ChatGPT; Gemini; Large language models; OITE
Entry Date(s):	Date Created: 20260420 Date Completed: 20260421 Latest Revision: 20260421
Update Code:	20260422
PubMed Central ID:	PMC13091714
DOI:	10.52965/001c.160184
PMID:	42005526
Database:	MEDLINE