Nature Communications (IF 14.7), Pub Date: 2022-07-11, DOI: 10.1038/s41467-022-31666-w
Authors: Luan Nguyen, Arne Van Hoeck, Edwin Cuppen
Cancers of unknown primary (CUP) origin account for ∼3% of all cancer diagnoses, whereby the tumor tissue of origin (TOO) cannot be determined. Using a uniformly processed dataset encompassing 6756 whole-genome sequenced primary and metastatic tumors, we develop Cancer of Unknown Primary Location Resolver (CUPLR), a random forest TOO classifier that employs 511 features based on simple and complex somatic driver and passenger mutations. CUPLR distinguishes 35 cancer (sub)types with ∼90% recall and ∼90% precision based on cross-validation and test set predictions. We find that structural variant derived features increase the performance and utility for classifying specific cancer types. With CUPLR, we could determine the TOO for 82/141 (58%) of CUP patients. Although CUPLR is based on machine learning, it provides a human interpretable graphical report with detailed feature explanations. The comprehensive output of CUPLR complements existing histopathological procedures and can enable improved diagnostics for CUP patients.
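To make the reported metrics concrete, the following is a minimal sketch of how per-class precision and recall can be computed for a multi-class tissue-of-origin classifier. It is not CUPLR's code: the function names and the toy labels are invented for illustration, and the abstract does not state how the ∼90% figures are aggregated across the 35 cancer (sub)types, so the macro average shown here is an assumption. The last line only re-derives the 82/141 fraction quoted in the abstract.

```python
from collections import Counter


def per_class_precision_recall(true_labels, predicted_labels):
    """Return {class: (precision, recall)} for a multi-class prediction."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    metrics = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        n_predicted = tp[cls] + fp[cls]
        n_actual = tp[cls] + fn[cls]
        precision = tp[cls] / n_predicted if n_predicted else 0.0
        recall = tp[cls] / n_actual if n_actual else 0.0
        metrics[cls] = (precision, recall)
    return metrics


def macro_average(metrics):
    """Unweighted mean of per-class precision and recall (an assumed aggregation)."""
    n = len(metrics)
    mean_precision = sum(p for p, _ in metrics.values()) / n
    mean_recall = sum(r for _, r in metrics.values()) / n
    return mean_precision, mean_recall


# Toy labels, made up for illustration only (not data from the paper).
true_labels = ["Breast", "Breast", "Lung", "Lung", "Skin", "Skin"]
predicted_labels = ["Breast", "Lung", "Lung", "Lung", "Skin", "Skin"]

metrics = per_class_precision_recall(true_labels, predicted_labels)
macro_precision, macro_recall = macro_average(metrics)

# Fraction of CUP patients with a resolved TOO, as quoted in the abstract.
resolved_percent = round(100 * 82 / 141)
```

In this toy example one breast tumor is misclassified as lung, which lowers recall for the breast class and precision for the lung class while leaving the skin class unaffected.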
Title: Machine learning-based tissue of origin classification for cancer of unknown primary diagnostics using genome-wide mutation features