Influence of Data Curation and Confidence Levels on Compound Predictions Using Machine Learning Models. Journal of Chemical Information and Modeling


Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2024-12-10, DOI: 10.1021/acs.jcim.4c01573
Elena Xerxa, Martin Vogt, Jürgen Bajorath

While data curation principles and practices are a major topic in data science, they are often not explicitly considered in machine learning (ML) applications in chemistry. We have been interested in evaluating the potential effects of data curation on the performance of molecular ML models. Therefore, a sequential curation scheme was developed for compounds and activity data, and different ML classification models were generated at increasing data confidence levels and evaluated. Sequential data curation was found to systematically increase classification performance in an incremental manner due to cumulative effects of individual data curation criteria. The analysis of chemical space distributions of compound subsets at different data confidence levels revealed that the separation of compounds with different class labels in chemical space generally increased during sequential activity data curation, which was mostly due to subsequent elimination of singletons rather than compounds from analogue series. These findings provided a rationale for increasing the classification performance of ML models as a consequence of increasingly stringent data curation. Taken together, the results reported herein suggest that further attention should be paid to varying data curation and confidence levels when deriving and assessing ML models for chemical applications.
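The idea of sequential curation with cumulative criteria can be illustrated with a minimal Python sketch. The abstract does not list the individual curation criteria, so the record fields (`potency_nm`, `exact_measurement`, `direct_target_assay`), the criteria themselves, and the function name `sequential_curation` are all hypothetical and serve only to show how each confidence level builds on the previous one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ActivityRecord:
    # All fields are hypothetical; the abstract does not specify them.
    compound_id: str
    potency_nm: float          # assumed potency value in nM
    exact_measurement: bool    # assumed flag: measurement reported with "="
    direct_target_assay: bool  # assumed flag: assay measures the target directly


Criterion = Callable[[ActivityRecord], bool]

# Illustrative criteria, applied in order; each one adds to the previous ones.
CRITERIA: list[tuple[str, Criterion]] = [
    ("has_potency", lambda r: r.potency_nm > 0),
    ("exact_measurement", lambda r: r.exact_measurement),
    ("direct_target_assay", lambda r: r.direct_target_assay),
]


def sequential_curation(
    records: Iterable[ActivityRecord],
    criteria: list[tuple[str, Criterion]] = CRITERIA,
) -> list[tuple[str, list[ActivityRecord]]]:
    """Return one subset per confidence level, filtering cumulatively."""
    current = list(records)
    levels = [("uncurated", current)]
    for name, keep in criteria:
        # Each level keeps only records that also passed all earlier criteria.
        current = [r for r in current if keep(r)]
        levels.append((name, current))
    return levels


records = [
    ActivityRecord("c1", 12.0, True, True),
    ActivityRecord("c2", 0.0, True, True),
    ActivityRecord("c3", 450.0, False, True),
    ActivityRecord("c4", 8.5, True, False),
    ActivityRecord("c5", 3000.0, True, True),
]

levels = sequential_curation(records)
sizes = [len(subset) for _, subset in levels]
for (name, _), size in zip(levels, sizes):
    print(f"{name}: {size} records")
```

In a setup like the one the abstract describes, a classification model would then be trained and evaluated separately on each subset, so that performance can be compared across confidence levels.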


Updated: 2024-12-10
