How do ChatGPT and other generative artificial intelligence models perform on foot and ankle questions from the Brazilian Orthopedics and Traumatology Association’s TEOT and TARO exams? The implications of large language models for medical education

Daniel Soares Baumfeld; Roberto Zambelli de Almeida Pinto; Paula Costa Machado; Lucca Gontijo Giarola; Lacerda MCR

doi:10.30795/jfootankle.2026.v20.2051

Authors

Daniel Soares Baumfeld, Hospital Felício Rocho, Belo Horizonte, MG, Brazil
Roberto Zambelli de Almeida Pinto, Hospital Mater Dei, Belo Horizonte, MG, Brazil
Paula Costa Machado, Hospital Felício Rocho, Belo Horizonte, MG, Brazil
Lucca Gontijo Giarola, Hospital das Clínicas da UFMG, Belo Horizonte, MG, Brazil (ORCID: https://orcid.org/0000-0001-8999-1194)
Lacerda MCR, Faculdade de Ciências Médicas, Belo Horizonte, MG, Brazil

Keywords:

Medical education; Orthopedics; Foot

Abstract

Introduction: Generative artificial intelligence (AI) is increasingly used for study and rapid consultation. We assessed how leading large language models (LLMs) perform on foot and ankle questions from Brazilian Orthopedics and Traumatology Association (SBOT) exams.

Methods: We conducted a cross-sectional benchmarking study of 107 foot and ankle questions from the TEOT and TARO exams. Items were classified into six categories: adult trauma, pediatric trauma, anatomy/imaging, physical examination, congenital/pediatric disorders, and adult disorders. Four generative AI models were queried with standardized prompts, and their responses were scored against the official answer key. The outcome was overall accuracy.

Results: ChatGPT (GPT-5 Thinking) had the highest accuracy (86.91%), followed by Gemini (79.43%). Accuracy differed by domain, with lower performance in pediatric trauma and congenital disorders. No model achieved perfect agreement with the key.

Conclusions: Popular generative AI models performed well on SBOT foot and ankle exam questions, with ChatGPT (GPT-5 Thinking) scoring highest. LLMs may be helpful adjuncts in residency education when used with supervision and critical appraisal.
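
The scoring step described in the Methods (comparing each model's answers with the official key and reporting accuracy overall and by category) could be sketched roughly as follows. This is only an illustration and not the authors' actual procedure: the function name `score_responses`, the question IDs, the answers, and the category assignments are all hypothetical, made-up values, not data from the study.

```python
from collections import defaultdict


def score_responses(answer_key, responses, categories):
    """Return (overall accuracy, per-category accuracy) for one model.

    Hypothetical helper: all three arguments are dicts keyed by question ID.
    """
    correct_by_cat = defaultdict(int)
    total_by_cat = defaultdict(int)
    for qid, official in answer_key.items():
        cat = categories[qid]
        total_by_cat[cat] += 1
        # A missing answer or a different option counts as incorrect.
        if responses.get(qid) == official:
            correct_by_cat[cat] += 1
    overall = sum(correct_by_cat.values()) / len(answer_key)
    per_category = {c: correct_by_cat[c] / total_by_cat[c] for c in total_by_cat}
    return overall, per_category


# Toy, made-up data (NOT from the study): four questions, one model.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
categories = {
    "q1": "adult trauma",
    "q2": "adult trauma",
    "q3": "pediatric trauma",
    "q4": "anatomy/imaging",
}
model_responses = {"q1": "A", "q2": "C", "q3": "D", "q4": "D"}

overall, per_category = score_responses(answer_key, model_responses, categories)
print(f"overall accuracy: {overall:.2%}")
for category, accuracy in per_category.items():
    print(f"{category}: {accuracy:.2%}")
```

Reporting per-category accuracy alongside the overall figure is what allows domain-level differences, such as the lower performance noted for pediatric trauma and congenital disorders, to show up.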

Published

2026-04-23

How to Cite

Baumfeld, D. S., Pinto, R. Z. de A., Machado, P. C., Giarola, L. G., & Lacerda MCR. (2026). How do ChatGPT and other generative artificial intelligence models perform on foot and ankle questions from the Brazilian Orthopedics and Traumatology Association’s TEOT and TARO exams? The implications of large language models for medical education. Journal of the Foot & Ankle, 20(Suppl 1). https://doi.org/10.30795/jfootankle.2026.v20.2051