Empirical versus estimated accuracy of imputation: optimising filtering thresholds for sequence imputation
Genetics Selection Evolution (IF 3.6), Pub Date: 2024-11-15, DOI: 10.1186/s12711-024-00942-2
Tuan V. Nguyen, Sunduimijid Bolormaa, Coralie M. Reich, Amanda J. Chamberlain, Christy J. Vander Jagt, Hans D. Daetwyler, Iona M. MacLeod

Genotype imputation is a cost-effective method for obtaining sequence genotypes for downstream analyses such as genome-wide association studies (GWAS). However, low imputation accuracy can increase the risk of false positives, so it is important to pre-filter data or at least assess the potential limitations due to imputation accuracy. In this study, we benchmarked three different imputation programs (Beagle 5.2, Minimac4 and IMPUTE5) and compared the empirical accuracy of imputation with the software-estimated accuracy of imputation (Rsqsoft). We also tested imputation accuracy in cattle for the autosomes and the X chromosome, and for SNPs and INDELs, when imputing from either low-density or high-density genotypes.

The accuracy of imputing sequence variants from real high-density genotypes was higher than from low-density genotypes. In our software benchmark, all programs performed well with only minor differences in accuracy. While there was a close relationship between empirical imputation accuracy and Rsqsoft, this relationship differed considerably for Minimac4 compared with Beagle 5.2 and IMPUTE5. We found that the Rsqsoft threshold for removing poorly imputed variants must be customised according to the software, and this should be accounted for when merging data from multiple studies, such as in a meta-GWAS. We also found that imposing an Rsqsoft filter has a positive impact on genomic regions with poor imputation accuracy due to large segmental duplications that are susceptible to error-prone alignment. Overall, our results showed that on average the imputation accuracy for INDELs was approximately 6% lower than for SNPs for all software programs. Importantly, the imputation accuracy for the non-pseudo-autosomal region (non-PAR) of the X chromosome was comparable to autosomal imputation accuracy, while for the PAR it was substantially lower, particularly when starting from low-density genotypes.
This study provides an empirically derived approach to apply customised software-specific Rsqsoft thresholds for downstream analyses of imputed variants, such as those needed for a meta-GWAS. The very poor empirical imputation accuracy for variants on the PAR when starting from low-density genotypes demonstrates that this region should be imputed starting from a higher density of real genotypes.
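
To make the idea of a software-specific Rsqsoft threshold concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes that empirical accuracy is measured per variant as the squared Pearson correlation between true genotypes and imputed dosages in validation animals, and that a threshold is chosen as the lowest cutoff at which the retained variants reach a target mean empirical accuracy. The abstract does not specify either choice. The function names, the software labels, the target of 0.8 and all numbers are hypothetical.

```python
def pearson_r2(truth, dosage):
    """Squared Pearson correlation between true genotypes (0/1/2)
    and imputed dosages for one variant (assumed accuracy measure)."""
    n = len(truth)
    if n != len(dosage) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 samples")
    mean_t = sum(truth) / n
    mean_d = sum(dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, dosage))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in dosage)
    if var_t == 0 or var_d == 0:
        return 0.0  # no variation: correlation undefined, treated as uninformative
    return cov * cov / (var_t * var_d)


def calibrate_threshold(variants, target, candidates):
    """Return the lowest candidate Rsqsoft cutoff at which the variants kept
    (rsq_soft >= cutoff) have mean empirical r2 >= target, else None.

    variants: iterable of (rsq_soft, empirical_r2) pairs for ONE software.
    """
    variants = list(variants)
    for cutoff in sorted(candidates):
        kept = [emp for rsq_soft, emp in variants if rsq_soft >= cutoff]
        if kept and sum(kept) / len(kept) >= target:
            return cutoff
    return None


# Invented (rsq_soft, empirical_r2) pairs for two hypothetical programs whose
# estimated accuracy tracks empirical accuracy differently.
toy = {
    "software_A": [(0.2, 0.10), (0.5, 0.60), (0.8, 0.85), (0.95, 0.97)],
    "software_B": [(0.2, 0.05), (0.5, 0.30), (0.8, 0.70), (0.95, 0.95)],
}

# One threshold per software, calibrated against the same empirical target.
thresholds = {
    name: calibrate_threshold(pairs, target=0.8, candidates=[0.2, 0.4, 0.6, 0.8])
    for name, pairs in toy.items()
}
print(thresholds)
```

With these toy values the two programs end up with different cutoffs for the same empirical target, which is the situation the abstract warns about when data imputed with different software are merged under a single Rsqsoft filter.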


Updated: 2024-11-15
