Confidence intervals for validation statistics with data truncation in genomic prediction, Genetics Selection Evolution



Confidence intervals for validation statistics with data truncation in genomic prediction
Genetics Selection Evolution (IF 3.6), Pub Date: 2024-03-08, DOI: 10.1186/s12711-024-00883-w
Matias Bermann ₁, Andres Legarra ₂, Alejandra Alvarez Munera ₁, Ignacy Misztal ₁, Daniela Lourenco ₁


Validation by data truncation is common practice in genetic evaluations because the aim is to predict the genetic merit of a set of young selection candidates. Two of the most widely used validation methods in genetic evaluations rely on a single data partition: predictivity, or predictive ability (the correlation between pre-adjusted phenotypes and estimated breeding values (EBV), divided by the square root of the heritability), and the linear regression (LR) method (a comparison of “early” and “late” EBV). Both methods compare predictions obtained with the whole dataset and with a partial dataset, which is created by removing the information related to a set of validation individuals. EBV obtained with the partial dataset are compared against adjusted phenotypes for predictivity, or against EBV obtained with the whole dataset for the LR method. Confidence intervals for predictivity and for the LR statistics can be obtained by replicating the validation over different samples (or folds), or by bootstrapping. Analytical confidence intervals would make it unnecessary to run several validations and would allow the quality of bootstrap intervals to be tested. However, analytical confidence intervals are not available for predictivity or for the LR method.

We derived standard errors and Wald confidence intervals for predictivity and for the statistics included in the LR method (bias, dispersion, ratio of accuracies, and reliability). The confidence intervals for bias, dispersion, and reliability depend on the relationships and on the prediction error variances and covariances across the individuals in the validation set. We developed approximations for large datasets that need only the reliabilities of the individuals in the validation set. The confidence intervals for the ratio of accuracies and for predictivity were obtained through the Fisher transformation. We show the adequacy of both the analytical and the approximated analytical confidence intervals and compare them with bootstrap confidence intervals using two simulated examples.

The analytical confidence intervals were closer to the simulated ones in both examples. Bootstrap confidence intervals tended to be narrower than the simulated ones. The approximated analytical confidence intervals were similar to those obtained by bootstrapping. With the formulas presented in this study, the sampling variation of predictivity and of the statistics in the LR method can be estimated for any dataset without replication or bootstrapping.
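
As an illustration of the quantities named in the abstract, the following is a minimal Python sketch and not the authors' implementation. It computes predictivity as defined above, three LR statistics using the commonly cited definitions (bias as the difference between mean partial and mean whole EBV, dispersion as the regression slope of whole on partial EBV, ratio of accuracies as their correlation), a textbook Fisher-transformation interval for a correlation, and a percentile bootstrap interval. All function names are hypothetical. The Fisher standard error 1/sqrt(n - 3) and the resampling of individuals both assume independent validation individuals, which the abstract indicates does not hold in general (the derived intervals depend on relationships and prediction error covariances), so the paper's formulas may differ. The reliability statistic is omitted because it requires the genetic variance and inbreeding information.

```python
import math
import random
from statistics import NormalDist, mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def predictivity(adj_pheno, ebv_partial, h2):
    """Correlation of adjusted phenotypes with partial EBV, divided by sqrt(h2)."""
    return pearson(adj_pheno, ebv_partial) / math.sqrt(h2)


def lr_statistics(ebv_partial, ebv_whole):
    """Bias, dispersion and ratio of accuracies (assumed standard LR definitions)."""
    mp, mw = mean(ebv_partial), mean(ebv_whole)
    cov = sum((p - mp) * (w - mw) for p, w in zip(ebv_partial, ebv_whole))
    var_p = sum((p - mp) ** 2 for p in ebv_partial)
    return {
        "bias": mp - mw,                 # expected value 0
        "dispersion": cov / var_p,       # slope of whole on partial, expected 1
        "ratio_of_accuracies": pearson(ebv_partial, ebv_whole),
    }


def fisher_ci(r, n, level=0.95):
    """Textbook Fisher z interval for a correlation r from n independent pairs.

    Requires -1 < r < 1 and n > 3.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)


def bootstrap_ci(x, y, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for pearson(x, y), resampling individuals."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Simulated toy data (hypothetical, for illustration only).
    rng = random.Random(42)
    ebv_whole = [rng.gauss(0, 1) for _ in range(300)]
    ebv_partial = [0.8 * w + rng.gauss(0, 0.5) for w in ebv_whole]
    stats = lr_statistics(ebv_partial, ebv_whole)
    print(stats)
    print(fisher_ci(stats["ratio_of_accuracies"], len(ebv_whole)))
    print(bootstrap_ci(ebv_partial, ebv_whole))
```

For predictivity, one possible reading (an assumption here) is to build the interval on the correlation scale with `fisher_ci` and then divide both limits by the square root of the heritability.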


Updated: 2024-03-08
