OpenFold provides insights into AlphaFold2’s learning behavior
Nature Biotechnology (IF 33.1), Pub Date: 2024-06-17, DOI: 10.1038/s41587-024-02290-4
Iris Marchal

Despite the immense utility of AlphaFold2 in predicting protein structure, the official implementation excludes code for its training procedure and the associated required data. This makes it difficult to study the model’s learning behavior and to create variants that can perform new tasks. Writing in Nature Methods, AlQuraishi and colleagues now report OpenFold, a trainable and open-source implementation of AlphaFold2 that provides insights into its learning mechanisms and capacity for generalization.

OpenFold was trained from scratch using OpenProteinSet, an open-source reproduction of AlphaFold2’s training dataset, and was shown to match AlphaFold2 in accuracy. To understand specific properties of the architecture (such as data efficiency), the authors trained OpenFold in a series of runs that used progressively less data, which showed that it can achieve high accuracy using datasets as small as 1,000 protein chains. OpenFold was then trained with out-of-distribution data to evaluate its capacity for generalization, which revealed that the model appears to learn from local patterns of multiple sequence alignments and/or sequence–structure correlations rather than from patterns at the global fold level. Analysis of intermediate structures further revealed that although the model ultimately predicts global structure almost as accurately as local structure, it learns local structure first.



Updated: 2024-06-17