Prediction of Pseudomonas aeruginosa abundance in drinking water distribution systems using machine learning


Process Safety and Environmental Protection (IF 6.9), Pub Date: 2024-11-28, DOI: 10.1016/j.psep.2024.11.099
Qiaomei Zhou, Yukang Li, Min Wang, Jingang Huang, Weishuai Li, Shanshan Qiu, Haibo Wang

Detecting Pseudomonas aeruginosa is a challenging but crucial task for ensuring the biosafety of drinking water. Current cultivation-based and molecular (qPCR) methods are costly, laborious, and time-consuming, which leads to inaccuracies and delayed monitoring. In this study, three machine learning (ML) models, eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Regression (SVR), were developed, interpreted, and validated for their ability to predict P. aeruginosa abundance in both urban and rural drinking water distribution systems (DWDS). To ensure the reliability and robustness of the ML models, data leakage management during data pre-processing, 5-fold cross-validation, and grid search for hyperparameter tuning were used in the training phase. To control overfitting, feature selection with an embedded method was applied to exclude three low-contributing input variables: oxidation-reduction potential (ORP), total chlorine, and heterotrophic plate counts (HPC). The XGBoost model outperformed the RF and SVR models in accuracy and generalizability, achieving training/testing R² of 0.92/0.85 in the urban system and 0.94/0.87 in the rural system. Feature importance analysis revealed that water temperature, dissolved oxygen (DO), residual chlorine, and NO₃⁻-N were the key variables for the prediction. Validation experiments, using random samples from both urban and rural DWDS, showed acceptable relative errors of 10.77% and 8.86%, respectively. Overall, this study provides an applicable ML modeling framework for accurate and fast prediction of P. aeruginosa abundance in DWDS, potentially reducing laborious experiments in the future.
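The training safeguards named in the abstract (leakage-aware pre-processing, 5-fold cross-validation, grid search) and its two reported metrics (R² and relative error) can be illustrated with a small sketch. This is not the authors' pipeline: the data are synthetic, the single "temperature" feature and all function names are hypothetical, and a one-feature ridge regression stands in for XGBoost, RF, and SVR so that the example runs on the Python standard library alone. The point it shows is that scaling statistics are computed from the training portion of each fold only, so held-out samples never influence pre-processing.

```python
import random
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_error_pct(measured, predicted):
    """Relative error of one prediction against a measured value, in percent."""
    return abs(predicted - measured) / abs(measured) * 100.0


def fit_ridge_1d(x, y, lam):
    """Fit y ~ w * z + b, where z is x standardized with TRAINING statistics only."""
    mu, sd = statistics.fmean(x), statistics.pstdev(x)
    z = [(v - mu) / sd for v in x]
    b = statistics.fmean(y)  # intercept is left unpenalized
    w = sum(zi * (yi - b) for zi, yi in zip(z, y)) / (sum(zi * zi for zi in z) + lam)
    # The returned predictor reuses mu and sd from training, avoiding leakage.
    return lambda v: w * (v - mu) / sd + b


def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_r2(x, y, lam, folds):
    """Mean held-out R² over the folds for one hyperparameter value."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        model = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        scores.append(r2_score([y[i] for i in held_out],
                               [model(x[i]) for i in held_out]))
    return statistics.fmean(scores)


def grid_search(x, y, grid, k=5, seed=0):
    """Pick the penalty with the best mean cross-validated R²."""
    folds = kfold_indices(len(x), k, random.Random(seed))
    results = {lam: cv_r2(x, y, lam, folds) for lam in grid}
    return max(results, key=results.get), results


# Synthetic stand-in data (assumption): abundance on a log scale rising with temperature.
rng = random.Random(42)
temp = [rng.uniform(5.0, 30.0) for _ in range(100)]
log_abund = [0.1 * t + 1.0 + rng.gauss(0.0, 0.2) for t in temp]

# Hold out a test set first; tuning only ever sees the training portion.
x_train, y_train = temp[:80], log_abund[:80]
x_test, y_test = temp[80:], log_abund[80:]

best_lam, cv_results = grid_search(x_train, y_train, [0.0, 1.0, 10.0, 100.0])
model = fit_ridge_1d(x_train, y_train, best_lam)
train_r2 = r2_score(y_train, [model(v) for v in x_train])
test_r2 = r2_score(y_test, [model(v) for v in x_test])
print(f"best penalty: {best_lam}, train R2: {train_r2:.2f}, test R2: {test_r2:.2f}")
print(f"relative error of first test sample: "
      f"{relative_error_pct(y_test[0], model(x_test[0])):.2f}%")
```

With a real dataset and tree-based or kernel models, the same structure is usually expressed with scikit-learn's `Pipeline` and `GridSearchCV`, which refit the pre-processing steps inside each fold in the same way.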


Updated: 2024-11-28
