MD-HIT: Machine learning for material property prediction with dataset redundancy control, npj Computational Materials

npj Computational Materials (IF 9.4), Pub Date: 2024-10-18, DOI: 10.1038/s41524-024-01426-z
Qin Li, Nihang Fu, Sadman Sadeed Omee, Jianjun Hu

Materials datasets usually contain many redundant (highly similar) materials because of the tinkering approach historically used in material design. Under random splitting, this redundancy skews the performance evaluation of machine learning (ML) models: reported predictive performance is overestimated, and models perform poorly on out-of-distribution samples. The issue is well known in bioinformatics for protein function prediction, where tools like CD-HIT reduce redundancy by ensuring that no pair of retained samples has a sequence similarity greater than a given threshold. In this paper, we survey overestimated ML performance in material property prediction and propose MD-HIT, a redundancy reduction algorithm for material datasets. Applying MD-HIT to composition- and structure-based formation energy and band gap prediction problems, we demonstrate that, with redundancy control, ML models tend to score lower on test sets than models evaluated with high dataset redundancy, but the scores better reflect the models' true prediction capability.
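
To make the CD-HIT-style idea concrete, the sketch below shows one way a greedy redundancy filter over compositions could look. It is not the MD-HIT implementation: the function names (`composition_distance`, `greedy_reduce`), the Euclidean distance on element-fraction vectors, the threshold value, and the toy compositions are all assumptions chosen for illustration. A sample is kept only if it is farther than the threshold from every sample already kept, so no two retained samples are closer than the threshold.

```python
from math import sqrt


def composition_distance(a, b):
    """Euclidean distance between two element-fraction dicts (assumed metric)."""
    elements = set(a) | set(b)
    return sqrt(sum((a.get(e, 0.0) - b.get(e, 0.0)) ** 2 for e in elements))


def greedy_reduce(samples, threshold):
    """Keep a sample only if it is farther than `threshold` from every kept sample.

    `samples` is a list of (name, composition) pairs, visited in the given order.
    """
    kept = []
    for name, comp in samples:
        if all(composition_distance(comp, kc) > threshold for _, kc in kept):
            kept.append((name, comp))
    return kept


# Toy data (hypothetical): atomic fractions per element.
samples = [
    ("SrTiO3", {"Sr": 0.2, "Ti": 0.2, "O": 0.6}),
    ("Sr0.9Ba0.1TiO3", {"Sr": 0.18, "Ba": 0.02, "Ti": 0.2, "O": 0.6}),
    ("BaTiO3", {"Ba": 0.2, "Ti": 0.2, "O": 0.6}),
    ("NaCl", {"Na": 0.5, "Cl": 0.5}),
]

reduced = greedy_reduce(samples, threshold=0.05)
kept_names = [name for name, _ in reduced]
print(kept_names)
```

With this toy threshold, the lightly doped `Sr0.9Ba0.1TiO3` entry lies within 0.05 of `SrTiO3` and is dropped, while the other three are retained. The result depends on visiting order, as with any greedy scheme of this kind; splitting the reduced set into train and test would then avoid near-duplicates straddling the split.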

Updated: 2024-10-18
