Stochastic machine learning via sigma profiles to build a digital chemical space,Proceedings of the National Academy of Sciences of the United States of America

Proceedings of the National Academy of Sciences of the United States of America (IF 9.4), Pub Date: 2024-07-23, DOI: 10.1073/pnas.2404676121
Dinis O Abranches, Edward J Maginn, Yamil J Colón

This work establishes a different paradigm on digital molecular spaces and their efficient navigation by exploiting sigma profiles. To do so, the remarkable capability of Gaussian processes (GPs), a type of stochastic machine learning model, to correlate and predict physicochemical properties from sigma profiles is demonstrated, outperforming state-of-the-art neural networks previously published. The amount of chemical information encoded in sigma profiles eases the learning burden of machine learning models, permitting the training of GPs on small datasets which, due to their negligible computational cost and ease of implementation, are ideal models to be combined with optimization tools such as gradient search or Bayesian optimization (BO). Gradient search is used to efficiently navigate the sigma profile digital space, quickly converging to local extrema of target physicochemical properties. While this requires the availability of pretrained GP models on existing datasets, such limitations are eliminated with the implementation of BO, which can find global extrema with a limited number of iterations. A remarkable example of this is that of BO toward boiling temperature optimization. Holding no knowledge of chemistry except for the sigma profile and boiling temperature of carbon monoxide (the worst possible initial guess), BO finds the global maximum of the available boiling temperature dataset (over 1,000 molecules encompassing more than 40 families of organic and inorganic compounds) in just 15 iterations (i.e., 15 property measurements), cementing sigma profiles as a powerful digital chemical space for molecular optimization and discovery, particularly when little to no experimental data is initially available.
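The Bayesian optimization loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the candidate pool is random 3-component vectors standing in for sigma profiles, the "property" is a made-up smooth function, and the kernel (squared exponential), the acquisition rule (upper confidence bound), and all names (`rbf`, `gp_fit`, `gp_predict`, `bayes_opt`, `measure`) and hyperparameters are assumptions chosen for illustration. As in the abstract's example, the search starts from the worst candidate and is allowed 15 property measurements.

```python
import math
import random

def rbf(a, b, length=0.5):
    """Squared-exponential kernel (assumed; the paper's kernel may differ)."""
    return math.exp(-0.5 * sum((x - y) ** 2 for x, y in zip(a, b)) / length ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_l(L, b):
    """Forward substitution: solve L x = b."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_lt(L, b):
    """Back substitution: solve L^T x = b."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(X, y, noise=1e-4):
    """Fit a GP to standardized targets; returns the Cholesky factor and weights."""
    mean = sum(y) / len(y)
    scale = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y)) or 1.0
    z = [(v - mean) / scale for v in y]
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    return L, solve_lt(L, solve_l(L, z))

def gp_predict(X, L, alpha, x_new):
    """Posterior mean and standard deviation (in standardized units) at x_new."""
    k = [rbf(a, x_new) for a in X]
    v = solve_l(L, k)
    var = max(1.0 - sum(vi * vi for vi in v), 0.0)
    return sum(ki * ai for ki, ai in zip(k, alpha)), math.sqrt(var)

def bayes_opt(pool, measure, start, n_iter=15, kappa=2.0):
    """Maximize measure(i) over pool indices using a GP and a UCB acquisition."""
    seen, vals = [start], [measure(start)]
    for _ in range(n_iter):
        X = [pool[i] for i in seen]
        L, alpha = gp_fit(X, vals)
        best_i, best_score = None, -math.inf
        for i in range(len(pool)):
            if i in seen:
                continue
            mu, sd = gp_predict(X, L, alpha, pool[i])
            score = mu + kappa * sd  # favor high predictions and high uncertainty
            if score > best_score:
                best_i, best_score = i, score
        seen.append(best_i)
        vals.append(measure(best_i))  # one new "property measurement"
    return seen, vals

# Toy data: hypothetical stand-ins for sigma profiles and a target property.
random.seed(0)
pool = [[random.random() for _ in range(3)] for _ in range(60)]
truth = [-sum((x - 0.7) ** 2 for x in p) for p in pool]

def measure(i):
    return truth[i]

start = min(range(len(pool)), key=lambda i: truth[i])  # worst possible initial guess
seen, vals = bayes_opt(pool, measure, start, n_iter=15)
print("best value found:", max(vals), "| true maximum:", max(truth))
```

Because the model is refit from scratch on only the measured points at each iteration, the loop needs no pretrained model. This is the property the abstract highlights for BO relative to gradient search.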

Updated: 2024-07-23
