Representative Sample Size for Estimating Saturated Hydraulic Conductivity via Machine Learning: A Proof-Of-Concept Study
Water Resources Research (IF 4.6), Pub Date: 2024-08-04, DOI: 10.1029/2023wr036783
Amin Ahmadisharaf, Reza Nematirad, Sadra Sabouri, Yakov Pachepsky, Behzad Ghanbarian
Machine learning (ML) has been applied extensively across disciplines. In hydrology, however, little attention has been paid to data heterogeneity in databases or to the number of samples used to train ML models. In this study, we addressed these issues and their impacts on the accuracy and reliability of ML models for estimating saturated hydraulic conductivity, Ks. We selected 17,990 soil samples from the USKSAT database and created random subsets of size N = 2,000, 4,000, 6,000, 8,000, 10,000, 12,000, 14,000, 16,000, and 17,990, 80% of which were used for training. The random subset selection was repeated 50 times. The extreme gradient boosting (XGBoost) algorithm was used to estimate Ks from other soil properties, such as bulk density, soil depth, texture, and organic content. For each subset, we conducted a learning curve analysis on the training and cross-validation data sets. Results showed that, for all training sample sizes, the number of samples was not enough for the training and cross-validation curves to reach a plateau. We also applied the concept of representative elementary volume by plotting the average coefficient of determination, R², and root mean square log-transformed error, RMSLE, against the training sample size. For the testing data set, as the training sample size increased from 1,600 to 14,392, the average R² value increased from 0.74 to 0.90, while the average RMSLE value decreased from 1.08 to 0.69. Either a learning curve analysis or a representative sample size analysis is required to determine whether the number of samples is sufficient.
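The representative sample size procedure described in the abstract (draw a random subset of size N, train on 80% of it, score on the remainder, repeat, then average the scores per training size) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not the authors' code: the data are synthetic rather than USKSAT, a one-feature least-squares line in ln(Ks) stands in for XGBoost, RMSLE is taken as the root mean square of ln(predicted) minus ln(measured), and R² is computed on ln(Ks). The helper names (`train_size`, `make_synthetic`, `sample_size_curve`) are hypothetical.

```python
import math
import random
from statistics import fmean


def train_size(n, train_frac=0.8):
    """Number of training samples in a subset of size n (80% by default)."""
    return round(train_frac * n)


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmsle(y_true, y_pred):
    """Root mean square log-transformed error (assumed form: natural log)."""
    sq = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(fmean(sq))


def make_synthetic(n, seed=0):
    """Hypothetical data: x mimics bulk density, Ks is log-linear in x plus noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(1.0, 1.8)
        ln_ks = 5.0 - 4.0 * x + rng.gauss(0.0, 1.0)
        data.append((x, math.exp(ln_ks)))
    return data


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for XGBoost)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def sample_size_curve(data, subset_sizes, n_repeats=50, seed=0):
    """Return (training size, mean test R2, mean test RMSLE) per subset size."""
    rng = random.Random(seed)
    curve = []
    for n in subset_sizes:
        n_train = train_size(n)
        r2s, errs = [], []
        for _ in range(n_repeats):
            subset = rng.sample(data, n)  # random order, so slicing is a random split
            train, test = subset[:n_train], subset[n_train:]
            a, b = fit_line([x for x, _ in train],
                            [math.log(ks) for _, ks in train])
            true = [ks for _, ks in test]
            pred = [math.exp(a + b * x) for x, _ in test]
            r2s.append(r2_score([math.log(t) for t in true],
                                [math.log(p) for p in pred]))
            errs.append(rmsle(true, pred))
        curve.append((n_train, fmean(r2s), fmean(errs)))
    return curve


if __name__ == "__main__":
    data = make_synthetic(2000, seed=1)
    for n_train, r2, err in sample_size_curve(data, [200, 1000, 2000], n_repeats=5):
        print(f"training size {n_train}: mean R2 {r2:.3f}, mean RMSLE {err:.3f}")
```

With the 80% split, `train_size(2000)` and `train_size(17990)` give 1,600 and 14,392, the endpoints of the training sample size range quoted in the abstract. Plotting the averaged scores against training size and checking whether they level off is the representative-sample-size check; the synthetic numbers printed here say nothing about the study's actual results.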
Updated: 2024-08-07