Machine learning random forest for predicting oncosomatic variant NGS analysis, Scientific Reports



Scientific Reports (IF 3.8), Pub Date: 2021-11-08, DOI: 10.1038/s41598-021-01253-y
Eric Pellegrino ₁, Coralie Jacques ₁, Nathalie Beaufils ₁, Isabelle Nanni ₁, Antoine Carlioz ₁, Philippe Metellus ₂, L'Houcine Ouafik ₁,₃



Machine learning random forest for predicting oncosomatic variant NGS analysis

Since 2017, we have used the IonTorrent NGS platform in our hospital to diagnose and treat cancer. Analyzing the variants from each run takes considerable time, and we still struggle with some variants that look correct on their metrics at first but prove to be negative on further investigation. Can a machine learning (ML) algorithm help us classify NGS variants? This question led us to investigate which ML algorithms fit our NGS data and to develop a tool that can be used routinely to help biologists. Currently, one of the greatest challenges in medicine is processing large quantities of data. This is particularly true in molecular biology, where next-generation sequencing (NGS) is used for molecular profiling of tumors and for identifying their treatment. In addition to bioinformatics pipelines, artificial intelligence (AI) can be valuable in helping to analyze mutation variants. Generating sequencing data from patient DNA samples has become easy to perform in clinical trials. However, analyzing massive quantities of genomic or transcriptomic data and extracting the key biomarkers associated with a clinical response to a specific therapy requires a formidable combination of scientific expertise, biomolecular skills and a panel of bioinformatic and biostatistical tools, an area in which AI is now contributing to the development of future routine diagnostics. Moreover, cancer genome complexity and technical artifacts make identifying real variants challenging. We present a machine learning method for classifying pathogenic single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), multiple nucleotide variants (MNVs), insertions, and deletions detected by NGS in different types of tumor specimens, such as colorectal cancer, melanoma, lung cancer and glioma.
We evaluated different machine learning algorithms on our NGS data using k-fold cross-validation, and also evaluated neural networks (deep learning), to measure their performance and determine which model is valid for confirming NGS variant calls in cancer diagnosis. We trained the models on 70% of the data samples extracted from our local database (the data structure had 7 parameters: chromosome, position, exon, variant allele frequency, minor allele frequency, coverage and protein description) and validated them on the remaining 30%. The model offering the best accuracy was chosen and implemented in the NGS analysis routine. The models were developed in the R scripting language, version 3.6.0. The full dataset comprised 102,011 variants, 70% of which were used for training. The best error rate (0.22%) was obtained with a random forest (ntree = 500 and mtry = 4), with an AUC of 0.99. Neural networks also scored well: the final trained neural network achieved an accuracy of 98% and an ROC-AUC of 0.99 on the validation data. We then used the RF model to interpret more than 2000 variants from our NGS database: 20 variants were misclassified (error rate < 1%). The errors were nomenclature problems and false positives. After we added the false positives to the training database and put the RF model into routine use, the error rate was always < 0.5%. The RF model shows excellent results for oncosomatic NGS interpretation and can easily be implemented in other molecular biology laboratories. AI is becoming increasingly important in molecular biomedical analysis and can be very helpful in processing medical data. Neural networks show good capacity for variant classification and may in future be useful for predicting more complex variants.
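The evaluation protocol described in the abstract (a 70/30 split, k-fold cross-validation, an error rate and a ROC-AUC) can be sketched in a few lines. This is a minimal illustration in standard-library Python, not the authors' R code. The function names, the random seed, the choice of k = 5 and the toy labels and scores are all assumptions made for the example, and the classifier itself (the random forest with ntree = 500 and mtry = 4) is not reproduced here.

```python
import random

def train_validation_split(n_items, train_fraction=0.7, seed=0):
    """Shuffle indices 0..n_items-1 and split them into train/validation lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    cut = round(n_items * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_indices(indices, k=5):
    """Yield (train, held_out) index lists for k-fold cross-validation.

    k = 5 is an assumption; the abstract does not state the value of k.
    """
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outranks a negative.

    Tied scores count as half a win (pairwise Mann-Whitney formulation).
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Split sizes for the 102,011 variants reported in the abstract.
train_idx, valid_idx = train_validation_split(102_011)
print(len(train_idx), len(valid_idx))  # 71408 30603

# Toy labels and scores, invented for illustration (not study data).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2]
toy_predictions = [int(s >= 0.5) for s in toy_scores]
print(error_rate(toy_labels, toy_predictions))
print(roc_auc(toy_labels, toy_scores))
```

In practice each `(train, held_out)` pair from `k_fold_indices` would be used to fit a candidate model on the training part and score it on the held-out part, with the per-fold metrics averaged before the chosen model is checked once on the 30% validation set. The pairwise AUC is quadratic in the number of samples, so a rank-based implementation would be preferable for a dataset of this size.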

Updated: 2021-11-08
