Artificial Intelligence as a Discriminator of Competence in Urological Training: Are we there?
The Journal of Urology (IF 5.9). Pub Date: 2024-12-09. DOI: 10.1097/ju.0000000000004357. Naji J Touma, Ruchit Patel, Thomas Skinner, Michael Leveridge
INTRODUCTION
Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence (AI) is playing a growing role in clinical care and medical education. The objective of this study is to evaluate the ability of the large language model ChatGPT to generate exam questions that discriminate among graduating urology residents.
METHODS
Graduating urology residents representing all Canadian training programs gather yearly for a mock exam that simulates their upcoming board certification exam. The exam consists of a written multiple-choice question (MCQ) component and an oral objective structured clinical examination (OSCE). In 2023, ChatGPT version 4 was used to generate 20 MCQs that were added to the written component; a scripted sketch of this step appears below. ChatGPT was asked to use Campbell-Walsh Urology and the AUA and CUA guidelines as resources. A psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also reviewed by 3 faculty members for face validity and to ascertain whether they came from a valid source.
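For illustration only, here is a minimal sketch of how such MCQ generation might be scripted against the GPT-4 API using the OpenAI Python SDK. The study itself used the ChatGPT interface; the model name, prompt wording, and output handling below are assumptions, not the authors' protocol.

```python
# Hypothetical sketch of board-style MCQ generation (not the authors' method).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt wording is an illustrative assumption based on the resources named above.
prompt = (
    "Using Campbell-Walsh Urology and the AUA and CUA guidelines as your "
    "resources, write one board-style multiple-choice question for a "
    "graduating urology resident. Provide a clinical stem, five answer "
    "options (A-E), the correct answer, and the source supporting it."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```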
RESULTS
The mean score of the 35 exam takers on the ChatGPT MCQs was 60.7%, versus 61.1% on the overall exam. Twenty-five percent of ChatGPT MCQs showed a discrimination index > 0.3, the threshold for questions that properly discriminate between high and low exam performers. Twenty-five percent of ChatGPT MCQs showed a point-biserial correlation > 0.2, which is considered a high correlation with overall performance on the exam. The faculty assessment found that ChatGPT MCQs often provided incomplete information in the stem, offered multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided wrong answers to their stems.
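As background, both item statistics reported here can be computed from a 0/1 item-response matrix. The sketch below is illustrative rather than the authors' analysis: the top/bottom group fraction (27%) and the item-excluded ("corrected") point-biserial are conventional choices that the abstract does not specify.

```python
# Minimal sketch (assumed conventions, not the study's code) of the two
# item statistics: discrimination index and point-biserial correlation.
import numpy as np

def discrimination_index(item: np.ndarray, total: np.ndarray, frac: float = 0.27) -> float:
    """Proportion correct among top scorers minus bottom scorers (27% groups assumed)."""
    n = max(1, int(round(frac * len(total))))
    order = np.argsort(total)                    # ascending by total score
    return item[order[-n:]].mean() - item[order[:n]].mean()

def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Correlation of a dichotomous item with the rest-of-exam score."""
    rest = total - item                          # exclude the item itself to avoid inflation
    return float(np.corrcoef(item, rest)[0, 1])

# Toy data matching the exam's shape: 35 examinees x 20 ChatGPT MCQs.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(35, 20))
totals = responses.sum(axis=1)
q0 = responses[:, 0]
print(discrimination_index(q0, totals))          # flag items <= 0.3
print(point_biserial(q0, totals))                # flag items <= 0.2
```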
CONCLUSIONS
Despite apparently similar performance on the ChatGPT MCQs and the overall exam, ChatGPT MCQs tend not to be highly discriminating. Poorly phrased questions and the potential for AI hallucinations are ever present. Careful vetting of ChatGPT questions for quality should be undertaken before their use in assessments on urology training exams.