Data-Quality-Navigated Machine Learning Strategy with Chemical Intuition to Improve Generalization.
Journal of Chemical Theory and Computation (IF 5.7), Pub Date: 2024-11-26, DOI: 10.1021/acs.jctc.4c00969
Songran Yang, Ming Sun, Chaojie Shi, Yiran Liu, Yanzhi Guo, Yijing Liu, Zhiyun Lu, Yan Huang, Xuemei Pu

Generalizing to real-world data has been one of the most difficult challenges in applying machine learning (ML) in practice. Most ML work has focused on improving algorithms and feature representations. However, data quality, the foundation of ML, has been largely overlooked, which has left the field without established methods for data evaluation and processing. Motivated by this challenge and need, we selected the prediction of reorganization energy (RE), an important parameter for the charge mobility of organic semiconductors (OSCs), as an important but difficult test platform, and used it to propose a data-quality-navigated strategy with chemical intuition. We developed a data diversity evaluation based on the structural characteristics of OSC molecules, a reliability evaluation method based on prediction accuracy, a data filtering method based on the uncertainty of K-fold division, and a data split technique using clustering and stratified sampling based on four molecular descriptor-associated REs. Consequently, a representative RE data set (15,989 molecules) with high reliability and diversity can be obtained. For the feature representation, a complementary strategy is proposed that considers the chemical nature of REs, the structural characteristics of OSC molecules, and the model algorithm. In addition, an ensemble framework consisting of two deep learning models is constructed to reduce the risk of a single model settling into a local optimum. The robustness and generalization of our model are strongly validated against different OSC-like molecules with diverse structures and a wide range of REs, as well as against real OSC molecules, greatly outperforming eight adversarial controls. Collectively, our work not only provides a quick and reliable tool for screening efficient OSCs but also offers methodological guidelines for improving the generalization of ML.
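
The abstract mentions a data split based on clustering and stratified sampling but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' method: it assumes each molecule already has a cluster label (from any clustering step) and draws the same fraction from every cluster for the test set. The function name `stratified_split`, the synthetic RE values, and the equal-width binning used as a stand-in for clustering are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test, sampling each cluster proportionally.

    labels: sequence of cluster ids, one per sample (hypothetical input;
            in practice these would come from a clustering step).
    """
    rng = random.Random(seed)

    # Group sample indices by their cluster id.
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)

    train, test = [], []
    for lab in sorted(by_cluster):
        members = by_cluster[lab][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        if len(members) > 1:
            # Keep at least one sample on each side so every cluster
            # with 2+ members appears in both train and test.
            n_test = min(max(n_test, 1), len(members) - 1)
        else:
            n_test = 0  # a singleton cluster stays in the training set
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Synthetic stand-in data (assumption): fake "RE" values in [0.1, 0.9],
# binned into 4 equal-width groups that play the role of cluster labels.
data_rng = random.Random(42)
re_values = [data_rng.uniform(0.1, 0.9) for _ in range(200)]
cluster_ids = [min(int((v - 0.1) / 0.2), 3) for v in re_values]

train_idx, test_idx = stratified_split(cluster_ids, test_fraction=0.2, seed=1)
print(len(train_idx), len(test_idx))
```

Sampling within each cluster, rather than from the pooled data, keeps rare structural groups or RE ranges from being absent in the held-out set, which is the usual motivation for a stratified split.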

Updated: 2024-11-26