Data Quality in the Fitting of Approximate Models: A Computational Chemistry Perspective


Journal of Chemical Theory and Computation (IF 5.7), Pub Date: 2024-11-18, DOI: 10.1021/acs.jctc.4c01063
Bun Chan, William Dawson, Takahito Nakajima

Empirical parametrization underpins many scientific methodologies, including certain quantum-chemistry protocols [e.g., density functional theory (DFT) and machine-learning (ML) models]. In some cases, the fitting requires a large amount of data, which necessitates using data obtained by low-cost, and thus low-quality, means. Here we examine how low-quality data affect the resulting method in the context of DFT. We use multiple G2/97 data sets of different qualities to fit DFT-type methods. Encouragingly, this fitting can tolerate a relatively large proportion of low-quality fitting data, which may be attributed to the physical foundations of the DFT models and the use of a modest number of parameters. Further examination using "ML-quality" data shows that adding a large amount of low-quality data to a small amount of high-quality data may not offer tangible benefits. On the other hand, when the high-quality data are limited in scope, diversification with a modest amount of low-quality data improves performance. Quantitatively, for parametrizing DFT (and perhaps also quantum-chemistry ML models), caution should be taken when more than 50% of the fitting set consists of questionable data and the average error of the full set exceeds 20 kJ mol⁻¹. One may also follow the recently proposed transferability principles to ensure diversity in the fitting set.
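The quantitative rule of thumb in the abstract can be read as a simple screening check on a fitting set. The sketch below is an illustration only, not the authors' procedure: the function name `needs_caution`, the per-entry cutoff `questionable_above`, and the reading of "average error" as a mean absolute error are all assumptions, since the abstract does not define how an entry is judged questionable.

```python
from statistics import fmean

# Thresholds quoted in the abstract.
MAX_QUESTIONABLE_FRACTION = 0.50  # more than 50% questionable entries
MAX_MEAN_ERROR_KJ_MOL = 20.0      # average error above 20 kJ/mol


def needs_caution(errors_kj_mol, questionable_above=10.0):
    """Return True when a fitting set exceeds both thresholds.

    errors_kj_mol: estimated error of each reference value, in kJ/mol.
    questionable_above: hypothetical per-entry cutoff (kJ/mol) used here to
        label an entry "questionable"; the abstract does not specify one.
    """
    if not errors_kj_mol:
        raise ValueError("fitting set is empty")
    abs_errors = [abs(e) for e in errors_kj_mol]
    # Share of entries whose absolute error is above the assumed cutoff.
    fraction = sum(e > questionable_above for e in abs_errors) / len(abs_errors)
    # Assumption: "average error" is taken as the mean absolute error.
    mean_error = fmean(abs_errors)
    return (fraction > MAX_QUESTIONABLE_FRACTION
            and mean_error > MAX_MEAN_ERROR_KJ_MOL)
```

For example, `needs_caution([1, 2, 50, 60, 70])` flags the set because three of five entries exceed the assumed cutoff and the mean absolute error is above 20 kJ/mol, whereas a set with a single large outlier among mostly accurate values is not flagged.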


Updated: 2024-11-18
