Automated real-world data integration improves cancer outcome prediction
Nature (IF 50.5), Pub Date: 2024-11-06, DOI: 10.1038/s41586-024-08167-5
Justin Jee, Christopher Fong, Karl Pichotta, Thinh Ngoc Tran, Anisha Luthra, Michele Waters, Chenlian Fu, Mirella Altoe, Si-Yang Liu, Steven B. Maron, Mehnaj Ahmed, Susie Kim, Mono Pirun, Walid K. Chatila, Ino de Bruijn, Arfath Pasha, Ritika Kundra, Benjamin Gross, Brooke Mastrogiacomo, Tyler J. Aprati, David Liu, JianJiong Gao, Marzia Capelletti, Kelly Pekala, Lisa Loudon, Maria Perry, Chaitanya Bandlamudi, Mark Donoghue, Baby Anusha Satravada, Axel Martin, Ronglai Shen, Yuan Chen, A. Rose Brannon, Jason Chang, Lior Braunstein, Anyi Li, Anton Safonov, Aaron Stonestrom, Pablo Sanchez-Vela, Clare Wilhelm, Mark Robson, Howard Scher, Marc Ladanyi, Jorge S. Reis-Filho, David B. Solit, David R. Jones, Daniel Gomez, Helena Yu, Debyani Chakravarty, Rona Yaeger, Wassim Abida, Wungki Park, Eileen M. O’Reilly, Julio Garcia-Aguilar, Nicholas Socci, Francisco Sanchez-Vega, Jian Carrot-Zhang, Peter D. Stetson, Ross Levine, Charles M. Rudin, Michael F. Berger, Sohrab P. Shah, Deborah Schrag, Pedram Razavi, Kenneth L. Kehl, Bob T. Li, Gregory J. Riely, Nikolaus Schultz

The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations [1,2] with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.




Updated: 2024-11-07