Scientific Data (IF 5.8), Pub Date: 2023-11-08, DOI: 10.1038/s41597-023-02690-2. Surajit Nandi, Tejs Vegge, Arghya Bhowmik
Well-curated, extensive datasets have spurred intense development of molecular machine learning (ML) methods over the last few years and have encouraged non-chemists to join the effort. The QM9 dataset is one of the benchmark databases for small molecules, with molecular energies based on the B3LYP functional. G4MP2-based energies for these molecules were published later. To enable a wide variety of ML tasks with QM9 molecules, such as transfer learning, delta learning, and multitask learning, this article introduces a new dataset of QM9 molecule energies estimated with 76 different DFT functionals and three different basis sets (76 × 3 = 228 energy values per molecule). We additionally enumerated all possible A ↔ B monomolecular interconversions within the QM9 dataset and provide the reaction energies for these 76 functionals and three basis sets. Lastly, we provide the bond changes for all 162 million reactions in the dataset to enable structure- and bond-based ML tools for reaction energy prediction.
Title: MultiXC-QM9: Large dataset of molecular and reaction energies from multi-level quantum chemical methods
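The counting and enumeration described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the article: the record layout, the molecule identifiers, the level-of-theory labels, and the energy values are all hypothetical placeholders. The sketch also assumes that an A ↔ B monomolecular interconversion pairs two molecules with the same molecular formula, since a single reactant can only convert into a single product if the atoms are conserved. It is not the authors' workflow, and it lists each pair only once, which may differ from how the dataset counts reactions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (molecule id, molecular formula, energies).
# Energies are keyed by (functional, basis set); labels and values are
# made up for illustration and are NOT taken from the dataset.
molecules = [
    ("mol_a", "C2H6O", {("B3LYP", "basis_1"): -10.0, ("functional_2", "basis_1"): -9.5}),
    ("mol_b", "C2H6O", {("B3LYP", "basis_1"): -10.5, ("functional_2", "basis_1"): -9.75}),
    ("mol_c", "C3H8", {("B3LYP", "basis_1"): -12.0, ("functional_2", "basis_1"): -11.5}),
]


def energies_per_molecule(n_functionals, n_basis_sets):
    """Number of energy values per molecule: one per functional/basis pair."""
    return n_functionals * n_basis_sets


def enumerate_reactions(records):
    """Pair molecules that share a formula and compute E(B) - E(A) per level."""
    by_formula = defaultdict(list)
    for mol_id, formula, energies in records:
        by_formula[formula].append((mol_id, energies))

    reactions = []
    for group in by_formula.values():
        # Each unordered pair is listed once as A -> B; the reverse
        # reaction B -> A has the same energies with opposite sign.
        for (id_a, e_a), (id_b, e_b) in combinations(group, 2):
            delta = {level: e_b[level] - e_a[level] for level in e_a if level in e_b}
            reactions.append((id_a, id_b, delta))
    return reactions


reactions = enumerate_reactions(molecules)
print(energies_per_molecule(76, 3))  # 76 functionals x 3 basis sets
for id_a, id_b, delta in reactions:
    print(id_a, "->", id_b, delta)
```

In this toy input only `mol_a` and `mol_b` share a formula, so a single reaction is produced, with one reaction energy for each level of theory available for both molecules.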