Evaluation of AI language models in answering pregnancy-related questions assessed by obstetrics specialists.
| Title: | Evaluation of AI language models in answering pregnancy-related questions assessed by obstetrics specialists. |
|---|---|
| Authors: | Keyif B; Department of Obstetrics and Gynecology, Faculty of Medicine, School of Medicine, Duzce University, 81620, Konuralp, Duzce, Turkey. betulkeyif@duzce.edu.tr.; Yurtçu E; Department of Obstetrics and Gynecology, Faculty of Medicine, School of Medicine, Duzce University, 81620, Konuralp, Duzce, Turkey.; Başbuğ A; Department of Obstetrics and Gynecology, Faculty of Medicine, School of Medicine, Duzce University, 81620, Konuralp, Duzce, Turkey.; Goynumer FG; Department of Obstetrics and Gynecology, Faculty of Medicine, School of Medicine, Duzce University, 81620, Konuralp, Duzce, Turkey. |
| Source: | Scientific reports [Sci Rep] 2026 Feb 16; Vol. 16 (1). Date of Electronic Publication: 2026 Feb 16. |
| Publication Type: | Journal Article |
| Language: | English |
| Journal Info: | Publisher: Nature Publishing Group Country of Publication: England NLM ID: 101563288 Publication Model: Electronic Cited Medium: Internet ISSN: 2045-2322 (Electronic) Linking ISSN: 20452322 NLM ISO Abbreviation: Sci Rep Subsets: MEDLINE |
| Imprint Name(s): | Original Publication: London : Nature Publishing Group, copyright 2011- |
| MeSH Terms: | Obstetrics*/methods ; Artificial Intelligence* ; Language*; Humans ; Pregnancy ; Female ; Reproducibility of Results ; Surveys and Questionnaires ; Adult |
| Abstract: | This study aimed to compare the performance of three large language models (ChatGPT-3.5, Gemini, and ChatGPT-4.0) in generating responses to ten frequently asked pregnancy-related questions, as evaluated by obstetrics and gynecology specialists. Seventy-five specialists independently rated 30 anonymized AI-generated responses using a 5-point Likert scale across four domains: accuracy, reliability, patient-friendliness, and comprehensibility. All questions were standardized and presented verbatim to each model using identical zero-shot prompts. Data were analyzed using the Kruskal-Wallis test with Bonferroni-adjusted Mann-Whitney U post-hoc comparisons. Inter-rater consistency was assessed using Cronbach's alpha. Spearman correlation was used to examine associations between clinical experience and evaluation patterns. ChatGPT-4.0 demonstrated the highest overall performance, particularly in accuracy (median 4.35; mean ± SD: 4.30 ± 0.48) and patient-friendliness (4.40; 4.35 ± 0.47). Gemini performed comparably to ChatGPT-4.0 in comprehensibility (3.70; 3.68 ± 0.54), while ChatGPT-3.5 consistently received the lowest scores. Significant differences were observed among the three models for accuracy, reliability, and patient-friendliness (all p |
| Competing Interests: | Declarations. Competing interests: The authors declare no competing interests. Ethics approval and consent to participate: The study was conducted in accordance with the Declaration of Helsinki, and ethical approval was obtained from the Non-Interventional Research Ethics Committee of Düzce University (Approval number: 2025/92). Written informed consent was obtained from all participants prior to participation. |
| Contributed Indexing: | Keywords: Artificial intelligence; ChatGPT; Gemini; Large language models; Obstetrics; Patient education; Pregnancy |
| Entry Date(s): | Date Created: 20260216 Date Completed: 20260319 Latest Revision: 20260321 |
| Update Code: | 20260321 |
| PubMed Central ID: | PMC13000243 |
| DOI: | 10.1038/s41598-026-40609-0 |
| PMID: | 41699404 |
| Database: | MEDLINE |
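
The abstract names Cronbach's alpha for inter-rater consistency and a Bonferroni adjustment for the pairwise post-hoc comparisons. The following is a minimal sketch of both calculations, not the authors' analysis code. It assumes raters are treated as the "items" (columns) and rated responses as the rows. The function names and the toy ratings are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of ratings.

    ratings: list of rows, one row per rated response and one column per
    rater (assumption: raters play the role of "items").
    """
    k = len(ratings[0])
    if k < 2:
        raise ValueError("at least two raters are required")
    columns = list(zip(*ratings))
    # Sum of the sample variances of each rater's scores.
    item_variance_sum = sum(variance(col) for col in columns)
    # Sample variance of the per-response total scores.
    total_variance = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni adjustment."""
    return alpha / n_comparisons


# Hypothetical toy data: 5 responses rated by 3 raters on a 1-5 Likert scale.
toy_ratings = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 3],
]

alpha_value = cronbach_alpha(toy_ratings)
# Three models give three pairwise comparisons.
pairwise_threshold = bonferroni_threshold(0.05, 3)

print(f"Cronbach's alpha: {alpha_value:.3f}")
print(f"Bonferroni-adjusted threshold: {pairwise_threshold:.4f}")
```

With three models there are three pairwise comparisons, so a nominal 0.05 level corresponds to a per-comparison threshold of about 0.0167 under this adjustment. The record does not state the nominal level the authors used, so 0.05 here is an assumption.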