Protein representations: Encoding biological information for machine learning in biocatalysis, Biotechnology Advances


Protein representations: Encoding biological information for machine learning in biocatalysis
Biotechnology Advances (IF 12.1), Pub Date: 2024-10-02, DOI: 10.1016/j.biotechadv.2024.108459
David Harding-Larsen, Jonathan Funk, Niklas Gesmar Madsen, Hani Gharabli, Carlos G. Acevedo-Rocha, Stanislav Mazurenko, Ditte Hededam Welner

Enzymes offer a more environmentally friendly, lower-impact alternative to conventional chemistry, but they often require additional engineering before they can be applied in industrial settings, an endeavour that is challenging and laborious. To address this issue, machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such models, however, require the complex biological information to be converted into a numerical input, also called a protein representation. These inputs demand special attention to ensure that accurate and precise models are trained. In this review, we therefore examine the critical step of encoding protein information into numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations (primary sequence, 3D structure, and dynamics) to explore their requirements for use and their inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose dividing representations into fixed representations, a collection of rule-based encoding strategies, and learned representations, which are extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second is the model objectives, such as considerations about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review aims to serve as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
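As a rough illustration of what a fixed, rule-based representation of a primary sequence can look like, the sketch below one-hot encodes a short sequence. The abstract does not name any specific encoding scheme, so the choice of one-hot encoding, the function name `one_hot_encode`, and the example tripeptide are all assumptions made here for illustration, not the authors' method.

```python
# Illustrative sketch only: one-hot encoding is assumed here as one example of a
# fixed (rule-based) representation; the abstract does not prescribe this scheme.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element row per residue, with a single 1 marking the residue."""
    encoded = []
    for residue in sequence.upper():
        if residue not in INDEX:
            # Non-standard or ambiguous residues (e.g. X, B, Z) are rejected
            # rather than silently mapped to a guess.
            raise ValueError(f"Unsupported residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    matrix = one_hot_encode("MKV")  # hypothetical tripeptide
    print(len(matrix), "residues x", len(matrix[0]), "features")
```

The result is a length-by-20 matrix that contains no information beyond residue identity and position, which is the kind of trade-off a reader might weigh against learned representations when the training dataset is small.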


Updated: 2024-10-02
