Be aware of overfitting by hyperparameter optimization!
Journal of Cheminformatics ( IF 7.1 ) Pub Date : 2024-12-09 , DOI: 10.1186/s13321-024-00934-w
Igor V. Tetko, Ruud van Deursen, Guillaume Godin

Hyperparameter optimization is frequently employed in machine learning. However, optimizing over a large parameter space can result in model overfitting. In recent studies on solubility prediction, the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data-cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting, when the same statistical measures were used. Similar results could be obtained using pre-set hyperparameters, reducing the computational effort by around 10,000-fold. We also extended the previous analysis by adding Transformer CNN, a representation-learning method based on Natural Language Processing of SMILES strings. We show that across all analyzed sets, using exactly the same protocol, Transformer CNN provided better results than the graph-based methods in 26 out of 28 pairwise comparisons, while requiring only a tiny fraction of the computation time of the other methods. Last but not least, we stressed the importance of comparing calculation results using exactly the same statistical measures. Scientific contribution: We showed that models with pre-optimized hyperparameters can suffer from overfitting, and that using pre-set hyperparameters yields similar performance about four orders of magnitude faster. Transformer CNN provided significantly higher accuracy than the other investigated methods.
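The overfitting mechanism the abstract warns about can be illustrated with a minimal sketch (this is not the paper's protocol; the data, ridge-regression model, and grid are hypothetical): when many hyperparameter configurations are scored on the same validation set used to pick the winner, the selected configuration's validation score is optimistically biased relative to an untouched test set.

```python
# Illustrative sketch: selection among many hyperparameter settings on one
# validation set overfits the selection; only a held-out test set gives an
# honest estimate. Synthetic data and ridge regression are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, n_test, n_feat = 100, 50, 50, 20
X = rng.normal(size=(n_train + n_val + n_test, n_feat))
# Target depends only weakly on the features, plus substantial noise.
w_true = rng.normal(size=n_feat) * 0.1
y = X @ w_true + rng.normal(size=len(X))
X_tr, X_va, X_te = np.split(X, [n_train, n_train + n_val])
y_tr, y_va, y_te = np.split(y, [n_train, n_train + n_val])

def fit_ridge(X, y, alpha):
    # Closed-form ridge regression: w = (X'X + alpha*I)^(-1) X'y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Scan a large hyperparameter grid; keep the best validation RMSE.
alphas = np.logspace(-6, 4, 200)
scores = [(rmse(y_va, X_va @ fit_ridge(X_tr, y_tr, a)), a) for a in alphas]
best_val, best_alpha = min(scores)

# Evaluate the selected model once on the untouched test set.
w = fit_ridge(X_tr, y_tr, best_alpha)
print(f"best validation RMSE:        {best_val:.3f}")
print(f"test RMSE of selected model: {rmse(y_te, X_te @ w):.3f}")
```

With a wide enough grid, the winning validation score tends to understate the test error, which is one reason the study compares all methods on identical held-out measures rather than on the scores produced during optimization.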

Updated: 2024-12-10