Genome-wide prediction of disease variant effects with a deep protein language model
Nature Genetics (IF 31.7), Pub Date: 2023-08-10, DOI: 10.1038/s41588-023-01465-0
Nadav Brandes, Grant Goldman, Charlotte H Wang, Chun Jimmie Ye, Vasilis Ntranos

Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
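The isoform-specific annotation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each variant has one numeric effect score per protein isoform, that lower scores mean more damaging, and that a single cutoff separates damaging from benign. The function name, the isoform labels, the scores, and the cutoff value are all hypothetical and are not taken from the abstract.

```python
from typing import Dict, List


def isoform_specific_damaging(
    scores_by_isoform: Dict[str, float],
    threshold: float = -7.5,  # assumed cutoff, for illustration only
) -> List[str]:
    """Return the isoforms in which a variant is damaging, but only when
    it is damaging in some isoforms and not in others.

    Assumes lower scores indicate a more damaging effect.
    """
    damaging = [iso for iso, score in scores_by_isoform.items() if score < threshold]
    # "Isoform-specific" here means: damaging in at least one isoform,
    # yet not damaging in every isoform that carries the variant.
    if damaging and len(damaging) < len(scores_by_isoform):
        return sorted(damaging)
    return []


# Hypothetical scores for one variant mapped onto three isoforms.
example_scores = {"isoform_A": -9.2, "isoform_B": -3.1, "isoform_C": -8.0}
print(isoform_specific_damaging(example_scores))
```

Scoring only one canonical isoform would miss a case like this whenever the canonical isoform happens to be the one with the mild score, which is the motivation the abstract gives for considering all isoforms.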




Updated: 2023-08-11