
Machine learning-assisted data filtering and QSAR models for prediction of chemical acute toxicity on rat and mouse
Journal of Hazardous Materials (IF 12.2), Pub Date: 2023-04-01, DOI: 10.1016/j.jhazmat.2023.131344
Tao Bo, Yaohui Lin, Jinglong Han, Zhineng Hao, Jingfu Liu


Machine learning (ML) methods offer a new opportunity to build quantitative structure-activity relationship (QSAR) models that predict chemical toxicity from large toxicity data sets, but model robustness is limited by poor data quality for chemicals with certain structures. To address this issue and improve model robustness, we built a large data set on rat oral acute toxicity covering thousands of chemicals, then used ML to filter out the chemicals favorable for regression models (CFRM). Compared with chemicals not favorable for regression models (CNRM), CFRM accounted for 67% of the chemicals in the original data set, had higher structural similarity, and showed a narrower toxicity distribution, within 2–4 log₁₀ (mg/kg). The regression models established for CFRM performed much better, with root-mean-square errors (RMSE) in the range of 0.45–0.48 log₁₀ (mg/kg). Classification models for CNRM were built using all chemicals in the original data set, and the area under the receiver operating characteristic curve (AUROC) reached 0.75–0.76. The proposed strategy was also applied successfully to a mouse oral acute toxicity data set, yielding RMSE in the range of 0.36–0.38 log₁₀ (mg/kg) and an AUROC of 0.79.
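The two metrics reported in the abstract, RMSE on log₁₀ (mg/kg) toxicity values and AUROC for the CNRM classifier, can be illustrated with a small Python sketch. The abstract does not say how chemicals are assigned to CFRM or CNRM, so the residual-threshold rule in `split_by_residual` is purely a hypothetical stand-in, and all numbers are toy values rather than data from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def split_by_residual(y_true, y_pred, threshold):
    """HYPOTHETICAL filter, not the paper's method.

    Returns 0 (CFRM) when the absolute prediction error is within the
    threshold and 1 (CNRM) otherwise.
    """
    return [0 if abs(t - p) <= threshold else 1
            for t, p in zip(y_true, y_pred)]


# Toy values in log10(mg/kg); not taken from the paper.
log_tox = [2.1, 3.0, 3.4, 1.2, 2.8, 0.5]
predicted = [2.3, 2.9, 3.0, 2.6, 2.7, 2.4]

# Assumed threshold of 0.5 log units for the illustrative filter.
cnrm_label = split_by_residual(log_tox, predicted, threshold=0.5)

# RMSE restricted to the chemicals flagged as CFRM (label 0).
cfrm_pairs = [(t, p) for t, p, l in zip(log_tox, predicted, cnrm_label) if l == 0]
cfrm_rmse = rmse([t for t, _ in cfrm_pairs], [p for _, p in cfrm_pairs])

# Toy classifier scores for "this chemical is CNRM".
cnrm_score = [0.1, 0.2, 0.4, 0.9, 0.3, 0.35]
cnrm_auroc = auroc(cnrm_label, cnrm_score)

print(f"CFRM RMSE: {cfrm_rmse:.3f} log10(mg/kg)")
print(f"CNRM AUROC: {cnrm_auroc:.3f}")
```

Because the toxicity values are on a log₁₀ scale, an RMSE of 0.45–0.48 corresponds to a typical prediction error of roughly a factor of three in mg/kg (10^0.45 ≈ 2.8, 10^0.48 ≈ 3.0).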

Updated: 2023-04-06
