De Novo Natural Language Processing Algorithm Accurately Identifies Myxofibrosarcoma From Pathology Reports.
Clinical Orthopaedics and Related Research (IF 4.2). Pub Date: 2024-10-02. DOI: 10.1097/corr.0000000000003270
Authors: Sarah E Lindsay, Cecelia J Madison, Duncan C Ramsey, Yee-Cheen Doung, Kenneth R Gundle
BACKGROUND
The codes available in ICD-10 do not accurately reflect soft tissue sarcoma diagnoses, which can lead to underrepresentation of soft tissue sarcoma in databases. The National VA Database provides a unique opportunity for soft tissue sarcoma investigation because all clinical results and pathology reports are available. Natural language processing (NLP) could be applied to clinical documents such as pathology reports to identify soft tissue sarcoma independent of ICD codes, allowing sarcoma researchers to build more comprehensive databases capable of answering a wide range of research questions.
QUESTIONS/PURPOSES
(1) What proportion of patients with myxofibrosarcoma within the National VA Database would be missed by searching only by soft tissue sarcoma ICD codes? (2) Is a de novo NLP algorithm capable of analyzing pathology reports to accurately identify patients with myxofibrosarcoma?
METHODS
All pathology reports (10.7 million) in the national VA corporate data warehouse were identified from 2003 to 2022. Using the word-search functionality, reports from 403 veterans were found to contain the term "myxofibrosarcoma." The resulting pathology reports were manually reviewed to develop a gold-standard cohort that contained only those veterans with pathologist-confirmed myxofibrosarcoma diagnoses. The cohort had a mean ± SD age of 70 ± 12 years, and 96% (287 of 300) were men. Diagnosis codes were abstracted, and differences in appropriate ICD coding were compared. An NLP algorithm was iteratively refined and tested using confounders, negation, and emphasis terms for myxofibrosarcoma. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated for the NLP-generated cohorts through comparison with the manually reviewed gold-standard cohorts.
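The abstract does not describe how the algorithm is implemented, so the following is only a minimal sketch of how a rule-based classifier using negation, confounder, and emphasis terms might look, together with the five reported metrics. Every term list, the 40-character context window, the scoring scheme, and the function names (`mention_score`, `classify_report`, `diagnostic_metrics`) are assumptions made for illustration and are not taken from the published algorithm.

```python
import math
import re

# All term lists below are hypothetical examples, not the authors' lists.
TARGET = re.compile(r"\bmyxofibrosarcomas?\b")
NEGATION_TERMS = ("no evidence of", "negative for", "not consistent with", "ruled out")
CONFOUNDER_TERMS = ("differential includes", "cannot exclude", "versus ")
EMPHASIS_TERMS = ("diagnosis:", "consistent with", "confirmed")
WINDOW = 40  # assumed number of characters of preceding context to inspect


def mention_score(context: str) -> int:
    """Score one mention: 0 = negated or confounded, 1 = plain, 2 = emphasized."""
    # Negation and confounders are checked first, so "not consistent with"
    # is never mistaken for the emphasis phrase "consistent with".
    if any(term in context for term in NEGATION_TERMS + CONFOUNDER_TERMS):
        return 0
    if any(term in context for term in EMPHASIS_TERMS):
        return 2
    return 1


def classify_report(text: str, threshold: int = 1) -> bool:
    """Flag a report when its best-scoring mention reaches the threshold."""
    lowered = text.lower()
    scores = [
        mention_score(lowered[max(0, m.start() - WINDOW):m.start()])
        for m in TARGET.finditer(lowered)
    ]
    return max(scores, default=0) >= threshold


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(predicted, actual) -> dict:
    """Compare predictions with gold-standard labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, len(pairs)),
    }


# Invented example reports with invented gold-standard labels.
reports = [
    "Final diagnosis: high-grade myxofibrosarcoma, left thigh.",
    "No evidence of residual myxofibrosarcoma in the re-excision specimen.",
    "Spindle cell lesion; differential includes myxofibrosarcoma.",
    "The tumor is a myxofibrosarcoma.",
]
gold = [True, False, False, True]

sensitive = diagnostic_metrics([classify_report(r) for r in reports], gold)
strict = diagnostic_metrics([classify_report(r, threshold=2) for r in reports], gold)
print(sensitive)
print(strict)
```

In this sketch, raising `threshold` from 1 to 2 requires an emphasis phrase before the term, which trades sensitivity for specificity. That is one plausible way a family of stricter and looser models could arise, but the abstract does not say how the three final models differ.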
RESULTS
The records of 27% (81 of 300) of myxofibrosarcoma patients within the VA database were missing a sarcoma ICD code. A de novo NLP algorithm identified patients with myxofibrosarcoma more accurately (92% [276 of 300]) than ICD codes (73% [219 of 300]) or basic word searches (74% [300 of 403]) (p < 0.001). Three final algorithm models were generated with accuracies ranging from 92% to 100%.
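The reported percentages can be checked directly from the counts given in the abstract. The short sketch below recomputes them; the dictionary labels are paraphrases introduced here. Note that the word-search figure uses a different denominator (403 veterans flagged by the search) from the other three (300 gold-standard patients), so it reads more like a positive predictive value than a detection rate, which is an interpretation rather than something the abstract states.

```python
# Counts as reported in the abstract: (numerator, denominator).
reported = {
    "missing a sarcoma ICD code": (81, 300),
    "identified by ICD codes": (219, 300),
    "identified by the NLP algorithm": (276, 300),
    "confirmed among word-search hits": (300, 403),
}


def percent(numerator: int, denominator: int) -> int:
    """Return the proportion as a whole-number percentage."""
    return round(100 * numerator / denominator)


percentages = {label: percent(n, d) for label, (n, d) in reported.items()}

for label, value in percentages.items():
    n, d = reported[label]
    print(f"{label}: {value}% ({n} of {d})")
```

The first two rows are complementary: 81 patients without a sarcoma ICD code plus 219 with one account for all 300 in the gold-standard cohort.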
CONCLUSION
An NLP algorithm can identify patients with myxofibrosarcoma from pathology reports with high accuracy, an improvement over ICD-based cohort creation and simple word search. The algorithm is freely available on GitHub (https://github.com/sarcoma-shark/myxofibrosarcoma-shark) to facilitate external validation and improvement through testing in other cohorts.
LEVEL OF EVIDENCE
Level II, diagnostic study.
Updated: 2024-10-02