Evaluating data requirements for high-quality haplotype-resolved genomes for creating robust pangenome references
Genome Biology (IF 10.1), Pub Date: 2024-12-18, DOI: 10.1186/s13059-024-03452-y
Prasad Sarashetti, Josipa Lipovac, Filip Tomas, Mile Šikić, Jianjun Liu

Long-read technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types like HiFi, Duplex, and ultra-long ONT. Despite recent strides in achieving haplotype-phased gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies face challenges related to data types, volumes, and cost considerations for each assembled genome, while striving to maintain sensitivity. The absence of comprehensive guidance on optimal data selection exacerbates these challenges. Our study evaluates recommended data types and volumes required to establish a robust de novo genome assembly pipeline for population-level pangenome projects, extensively examining performance between ONT’s Duplex and PacBio HiFi datasets in the context of achieving high-quality phased genomes with enhanced contiguity and completeness. The results show that achieving chromosome-level haplotype-resolved assembly requires 20× high-quality long reads such as PacBio HiFi or ONT Duplex, combined with 15–20× of ultra-long ONT per haplotype and 10× of long-range data such as Omni-C or Hi-C. High-quality long reads from both platforms yield assemblies with comparable contiguity, with HiFi excelling in phasing accuracies, while Duplex generates more T2T contigs. Our study provides insights into optimal data types and volumes for robust de novo genome assembly in population-level pangenome projects. Reassessing the recommended data types and volumes in this study and aligning them with practical economic limitations are vital to the pangenome research community, contributing to their efforts and pushing genomic studies with broader impacts.
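
To make the recommended depths concrete, the following is a minimal sketch that converts the coverage figures in the abstract into approximate sequencing yield. It is not part of the study. The function and constant names are hypothetical, the haploid genome size of 3.1 Gb is an assumed value for a human sample, and treating every depth as per haplotype of a diploid genome is an interpretation of the abstract rather than something it states for all three data types.

```python
# Hypothetical planning helper, not taken from the paper.
# ASSUMPTION: ~3.1 Gb haploid genome (approximate human size).
HAPLOID_GENOME_GB = 3.1

# Depths quoted in the abstract, stored as (low, high) fold coverage.
RECOMMENDED_DEPTH = {
    "HiFi or Duplex": (20, 20),
    "ultra-long ONT": (15, 20),
    "Omni-C or Hi-C": (10, 10),
}


def required_yield_gb(depth, genome_gb=HAPLOID_GENOME_GB, haplotypes=2):
    """Return gigabases needed: depth x haploid genome size x haplotypes.

    ASSUMPTION: depth is interpreted per haplotype; pass haplotypes=1
    to treat it as total depth against the haploid genome size instead.
    """
    return depth * genome_gb * haplotypes


def plan(genome_gb=HAPLOID_GENOME_GB, haplotypes=2):
    """Map each data type to its (low, high) required yield in Gb."""
    return {
        name: (
            required_yield_gb(low, genome_gb, haplotypes),
            required_yield_gb(high, genome_gb, haplotypes),
        )
        for name, (low, high) in RECOMMENDED_DEPTH.items()
    }


if __name__ == "__main__":
    for name, (low, high) in plan().items():
        if low == high:
            print(f"{name}: about {low:.0f} Gb")
        else:
            print(f"{name}: about {low:.0f} to {high:.0f} Gb")
```

Changing `genome_gb` or `haplotypes` adapts the estimate to another species or to a different reading of the depth figures. The resulting numbers are rough planning values only and say nothing about read length or quality requirements.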

Updated: 2024-12-19