Scientific Reports (IF 3.8), Pub Date: 2023-06-28, DOI: 10.1038/s41598-023-37698-6
Moonjong Kang 1 , Seonhwa Kim 1 , Da-Bin Lee 2 , Changbum Hong 1 , Kyu-Baek Hwang 2
Machine learning-based pathogenicity prediction helps interpret rare missense variants of BRCA1 and BRCA2, which are associated with hereditary cancers. Recent studies have shown that classifiers trained using variants of a specific gene or a set of genes related to a particular disease perform better than those trained using all variants, due to their higher specificity, despite the smaller training dataset size. In this study, we further investigated the advantages of “gene-specific” machine learning compared to “disease-specific” machine learning. We used 1068 rare (gnomAD minor allele frequency (MAF) < 0.005) missense variants of 28 genes associated with hereditary cancers for our investigation. Popular machine learning classifiers were employed: regularized logistic regression, extreme gradient boosting, random forests, support vector machines, and deep neural networks. As features, we used MAFs from multiple populations, functional prediction and conservation scores, and positions of variants. The disease-specific training dataset included the gene-specific training dataset and was more than 7 times larger. However, we observed that gene-specific training variants were sufficient to produce the optimal pathogenicity predictor if a suitable machine learning classifier was employed. Therefore, we recommend gene-specific over disease-specific machine learning as an efficient and effective method for predicting the pathogenicity of rare BRCA1 and BRCA2 missense variants.
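The relationship between the two training sets described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `Variant` record, the `split_training_sets` helper, and every toy value below are hypothetical and made up for illustration; the non-BRCA gene names are placeholders, since the abstract does not list the 28 genes. Only the MAF cutoff (0.005) and the focus on BRCA1 and BRCA2 come from the abstract.

```python
from dataclasses import dataclass

MAF_CUTOFF = 0.005                 # rarity threshold from the abstract (gnomAD MAF < 0.005)
TARGET_GENES = {"BRCA1", "BRCA2"}  # genes whose variants are to be predicted


@dataclass(frozen=True)
class Variant:
    """Hypothetical record; the study's real feature set is richer."""
    gene: str
    position: int
    maf: float
    pathogenic: bool


def split_training_sets(variants, target_genes=TARGET_GENES):
    """Return (gene_specific, disease_specific) lists of rare variants.

    The disease-specific set holds rare variants from every gene in the
    panel; the gene-specific set is the subset from the target genes.
    """
    disease_specific = [v for v in variants if v.maf < MAF_CUTOFF]
    gene_specific = [v for v in disease_specific if v.gene in target_genes]
    return gene_specific, disease_specific


# Toy, invented records (NOT data from the study).
toy_variants = [
    Variant("BRCA1", 100, 0.0001, True),
    Variant("BRCA2", 200, 0.0, False),
    Variant("BRCA1", 300, 0.02, False),  # too common, filtered out
    Variant("TP53", 50, 0.001, True),    # placeholder panel gene
    Variant("ATM", 75, 0.004, False),    # placeholder panel gene
    Variant("ATM", 80, 0.005, False),    # not strictly below the cutoff
]

gene_set, disease_set = split_training_sets(toy_variants)
print(len(gene_set), len(disease_set))
```

By construction the gene-specific set is contained in the disease-specific set, which mirrors the abstract's statement that the larger dataset included the smaller one. Either list could then be passed to any of the classifiers named above for comparison.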
Title: Gene-specific machine learning for pathogenicity prediction of rare BRCA1 and BRCA2 missense variants