Nature Reviews Genetics ( IF 39.1 ) Pub Date : 2024-06-14 , DOI: 10.1038/s41576-024-00738-6 William Hemstrom 1 , Jared A Grummer 2 , Gordon Luikart 2 , Mark R Christie 1, 3
Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering — removing sequencing bases, reads, genetic variants and/or individuals from a dataset — to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy–Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima’s D value, population differentiation (FST), nucleotide diversity (π) and effective population size (Ne).
中文翻译:
基因组学时代的下一代数据筛选
基因组数据在各个学科中无处不在,从农业到生物多样性、生态学、进化论和人类健康。然而,这些数据集通常包含噪声或错误,并且缺少可能影响后续计算分析和结论的准确性和可靠性的信息。基因组数据分析的一个关键步骤是过滤,即从数据集中删除测序碱基、读长、遗传变异和/或个体,以提高下游分析的数据质量。研究人员在筛选基因组数据时面临多种选择;他们必须选择要应用的筛选条件并选择适当的阈值。为了帮助引入下一代基因组数据过滤,我们审查并建议最佳实践,以改进通常用于基因组数据集的过滤器类型和阈值的实施、可重复性和报告标准。我们主要关注次要等位基因频率、每个个体或每个基因座的缺失数据、连锁不平衡和 Hardy-Weinberg 偏差的过滤器。使用模拟和实证数据集,我们说明了不同过滤阈值对常见群体遗传学统计的巨大影响,例如田岛的 D 值、群体分化 (FST)、核苷酸多样性 (π) 和有效群体大小 (Ne)。