ChatGPT in the gynaecologic oncology multidisciplinary team tumour board: a feasibility study
BJOG: An International Journal of Obstetrics & Gynaecology (IF 4.7). Pub Date: 2024-08-14. DOI: 10.1111/1471-0528.17929
Gabriel Levin, Walter Gotlieb, Pedro Ramirez, Raanan Meyer, Yoav Brezinov
The practical medical use of artificial intelligence is progressing rapidly. Specifically, the application of ChatGPT has been explored in medical education and even in the evaluation of clinical data [1, 2]. The tumour board is an integral and pivotal part of patient treatment and management in gynaecologic oncology [3]. It entails the processing of various pathological and clinical parameters, coupled with familiarity with the treatment guidelines that apply to those parameters. The participation of ChatGPT in breast cancer tumour boards was previously studied, with contrasting results [4, 5]. We aimed to study the feasibility of ChatGPT (versions 3.5 and 4) as a tumour board support tool for endometrial cancer (EC) and ovarian cancer (OC) according to the NCCN and ESGO guidelines.
Ten EC cases and ten OC cases were constructed from the authors' experience of the most complex scenarios discussed in real practice. For EC, the following data were specified: age, histology, stage, grade, lymphovascular space invasion, tumour size and molecular classification (MMR, p53 and POLE mutation status). For OC, the following data were specified: age, histology and stage.
We created a new account for ChatGPT 3.5 and purchased and created an account for ChatGPT 4. We used generic prompts for all the cases. The ChatGPT 3.5 and ChatGPT 4 prompts are described in Appendix S1.
For each tumour board case, we accessed the NCCN and ESGO guidelines separately and recorded their recommendations. All ChatGPT recommendations were judged as correct or incorrect by two independent reviewers (G.L. and Y.B.). Data analysis is described in detail in Appendix S1.
We used SPSS 29 for the statistical analysis. As no patient information was used, no ethical board review was needed for this study.
There were ten cases of EC, stages IA-IIIC, with four different histologies, and ten cases of OC, stages IA-IC3, with five different histologies. ChatGPT 3.5 was unable to give a concrete recommendation, whereas ChatGPT 4 gave a recommendation for all cases. No disagreements between reviewers were noted in any of the 40 evaluations.
The rate of correct recommendations was 70% (14/20) for the NCCN guidelines and 60% (12/20) for the ESGO guidelines (p = 0.512; Table 1). In 55% (11/20) of cases the recommendation was correct according to both guidelines, in 20% (4/20) it was correct according to only one guideline (Figure S1), and in 25% (5/20) it was incorrect according to both. Of the cases with an incorrect recommendation, 80% (4/5) were EC, stages IA-II, spanning all histologies, and one was OC, stage IA. The four cases that were correct under only one guideline were all EC; three of them were incorrect according to the ESGO guidelines, including the only two cases with a positive POLE mutation. OC had a higher rate of fully correct recommendations than EC (90% vs. 20%, p = 0.005). ChatGPT 4 suggestions for adjuvant treatment are presented in Tables S1 and S2.
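The OC versus EC comparison can be re-checked by hand. The sketch below is only an illustration: the text shown here does not name the test used (the analysis is described in Appendix S1), so the choice of a two-sided Fisher's exact test is an assumption, and the function name `fisher_exact_two_sided` is hypothetical. The counts 9/10 (OC) and 2/10 (EC) are the fully correct cases implied by the 90% and 20% figures.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumption: this is an illustrative re-check, not necessarily the
    test the authors ran in SPSS.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(x):
        # Hypergeometric probability of x "correct" cases in the first row
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table that is as likely as, or less likely than, the observed one
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-9)
    )


# Rows: OC (9 fully correct, 1 not), EC (2 fully correct, 8 not)
p_value = fisher_exact_two_sided(9, 1, 2, 8)
print(round(p_value, 3))  # 0.005
```

Under that assumption the result agrees with the reported p = 0.005 to three decimal places.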
# | Cancer site | Age | Histology | Stage | Grade | LVSI | Size (cm) | MMR | p53 | POLE | Accurate per NCCN | Accurate per ESGO | Accurate per both | Accurate per any |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Endometrial | 52 | Endometrioid | 1A | 3 | Negative | 3 | Proficient | Wildtype | Unknown | Yes | No | No | Yes |
2 | Endometrial | 62 | Endometrioid | 1B | 1 | Extensive | 2 | Deficient | Mutated | Positive | Yes | No | No | Yes |
3 | Endometrial | 85 | Serous | 1A | 3 | Negative | 5 | Proficient | Mutated | Negative | No | No | No | No |
4 | Endometrial | 49 | Clear cell | 1B | 3 | Negative | 3 | Proficient | Wildtype | Unknown | No | No | No | No |
5 | Endometrial | 61 | Endometrioid | 2 | 2 | Extensive | 6 | Deficient | Wildtype | Unknown | No | No | No | No |
6 | Endometrial | 70 | Serous | 1B | 3 | Extensive | 4 | Deficient | Mutated | Negative | Yes | Yes | Yes | Yes |
7 | Endometrial | 74 | Carcinosarcoma | 1A | 3 | Extensive | 2 | Proficient | Mutated | Negative | No | No | No | No |
8 | Endometrial | 72 | Endometrioid | 3C | 1 | Negative | 4 | Proficient | Wildtype | Negative | Yes | Yes | Yes | Yes |
9 | Endometrial | 76 | Endometrioid | 3C | 1 | Negative | 6 | Proficient | Wildtype | Positive | Yes | No | No | Yes |
10 | Endometrial | 61 | Endometrioid | 2 | 1 | Negative | 3 | Proficient | Wildtype | Unknown | No | Yes | No | Yes |
11 | Ovarian | 70 | Serous | 1A | HG | – | – | – | – | – | No | No | No | No |
12 | Ovarian | 55 | Mucinous | 1B | – | – | – | – | – | – | Yes | Yes | Yes | Yes |
13 | Ovarian | 62 | Endometrioid | 1C1 | HG | – | – | – | – | – | Yes | Yes | Yes | Yes |
14 | Ovarian | 80 | Clear cell | 1C2 | HG | – | – | – | – | – | Yes | Yes | Yes | Yes |
15 | Ovarian | 42 | Serous | 1C3 | LG | – | – | – | – | – | Yes | Yes | Yes | Yes |
16 | Ovarian | 70 | Serous | 1C1 | HG | – | – | – | – | – | Yes | Yes | Yes | Yes |
17 | Ovarian | 55 | Mucinous | 1C2 | – | – | – | – | – | – | Yes | Yes | Yes | Yes |
18 | Ovarian | 62 | Endometrioid | 1B | HG | – | – | – | – | – | Yes | Yes | Yes | Yes |
19 | Ovarian | 80 | Clear cell | 1A | HG | – | – | – | – | – | Yes | Yes | Yes | Yes |
20 | Ovarian | 42 | Serous | 1B | LG | – | – | – | – | – | Yes | Yes | Yes | Yes |
Abbreviations: HG, high grade; LG, low grade; LVSI, lymphovascular space invasion; MMR, mismatch repair.
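The headline rates can be recomputed directly from the table. The following is a minimal sketch for checking the arithmetic, not the authors' analysis code: the `CASES` list is transcribed from the NCCN and ESGO columns of the table, and the names `CASES` and `summarize` are hypothetical.

```python
# (case number, cancer site, correct per NCCN, correct per ESGO),
# transcribed from the table above
CASES = [
    (1, "EC", True, False),
    (2, "EC", True, False),
    (3, "EC", False, False),
    (4, "EC", False, False),
    (5, "EC", False, False),
    (6, "EC", True, True),
    (7, "EC", False, False),
    (8, "EC", True, True),
    (9, "EC", True, False),
    (10, "EC", False, True),
    (11, "OC", False, False),
] + [(n, "OC", True, True) for n in range(12, 21)]  # cases 12-20: correct under both


def summarize(cases):
    """Return {category: (count, total, percent)} for a list of cases."""
    n = len(cases)
    counts = {
        "nccn": sum(nccn for _, _, nccn, _ in cases),
        "esgo": sum(esgo for _, _, _, esgo in cases),
        "both": sum(nccn and esgo for _, _, nccn, esgo in cases),
        "one_only": sum(nccn != esgo for _, _, nccn, esgo in cases),
        "neither": sum(not (nccn or esgo) for _, _, nccn, esgo in cases),
    }
    return {key: (count, n, round(100 * count / n)) for key, count in counts.items()}


overall = summarize(CASES)
ec = summarize([c for c in CASES if c[1] == "EC"])
oc = summarize([c for c in CASES if c[1] == "OC"])

for name, (count, total, pct) in overall.items():
    print(f"{name}: {count}/{total} ({pct}%)")
print("EC fully correct:", ec["both"], "| OC fully correct:", oc["both"])
```

The per-site summaries give the 2/10 (EC) and 9/10 (OC) fully correct counts behind the 20% vs. 90% comparison.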
In this feasibility study, we showed that ChatGPT 4 provided correct recommendations in approximately two-thirds of the cases evaluated; however, in 25% of cases, mostly endometrial cancer, the recommendation was incorrect under both guidelines. Endometrial cancer had a lower rate of fully correct recommendations, likely because of the complexity of combining stage, histology and grade in early-stage disease and of integrating the molecular characterisation of endometrial cancer. More research is required to assess the credibility of this tool and to establish protocols for its potential use. However, in high-volume clinics, or in regions where expertise is limited, such tools may help physicians maintain evidence-based care. Further studies should focus on ChatGPT's familiarity with ongoing clinical trials, in order to assess possible patient eligibility.
Our limitations include the small number of cases studied and the restriction of our study to endometrial and ovarian cancer. Additionally, we used the generic ChatGPT tool without any specific training on our data. Moreover, we used only two AI platforms in this study, which may limit the generalisability of our results. Importantly, we did not compare the AI-generated recommendations with multidisciplinary tumour board recommendations, which are the ‘gold standard’ in real practice. Finally, all data are correct as of the time this manuscript was written. As ChatGPT is a large language model, it is continually trained on prompts, and its output may change and evolve over time. Future prospective real-life evaluation in gynaecologic oncology tumour boards is encouraged to better delineate the advantages and pitfalls of artificial intelligence tools and their impact on practice.