Telomere-to-telomere assembly by preserving contained reads
Genome Research ( IF 6.2 ) Pub Date : 2024-11-01 , DOI: 10.1101/gr.279311.124
Sudhanva Shyam Kamath, Mehak Bindra, Debnath Pal, Chirag Jain

Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. The string graph is a commonly used assembly graph representation in assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the number of vertices and edges. One of these heuristics removes reads that are contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (1) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore Technologies (ONT) reads than in Pacific Biosciences high-fidelity (PacBio HiFi) reads due to differences in their read-length distributions, and (2) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the repeat-aware fragmenting tool (RAFT) assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. Using simulated data sets, we empirically demonstrate that RAFT significantly reduces the number of gaps. Using real ONT and PacBio HiFi data sets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to hifiasm.
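The gap mechanism described in the abstract can be illustrated with a small toy model. The sketch below is not the paper's derivation or RAFT itself; it assumes error-free reads given as intervals on a diploid genome, a hypothetical rule that a read counts as contained in a longer read when the longer read spans it and no heterozygous site lies inside the shorter read (so the two sequences are indistinguishable there), and a definition of a gap as a haplotype interval with no retained read sampled from that haplotype. All names (`Read`, `is_contained`, `drop_contained`, `gaps`) and coordinates are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    name: str
    hap: int    # haplotype of origin (known here, hidden from a real assembler)
    start: int  # inclusive genome coordinate
    end: int    # exclusive genome coordinate


def is_contained(r, s, het_sites):
    """True if read r looks contained in the strictly longer read s."""
    if not (s.start <= r.start and r.end <= s.end):
        return False
    if (s.end - s.start) <= (r.end - r.start):
        return False
    if r.hap == s.hap:
        return True
    # Reads from different haplotypes differ only at heterozygous sites,
    # so r matches s exactly when r spans none of those sites.
    return not any(r.start <= p < r.end for p in het_sites)


def drop_contained(reads, het_sites):
    """Apply the contained-read heuristic: keep reads with no container."""
    return [r for r in reads
            if not any(is_contained(r, s, het_sites) for s in reads)]


def gaps(reads, hap, lo, hi):
    """Intervals of [lo, hi) not spanned by any read sampled from hap."""
    out, pos = [], lo
    for start, end in sorted((r.start, r.end) for r in reads if r.hap == hap):
        if pos >= hi:
            break
        if start > pos:
            out.append((pos, min(start, hi)))
        pos = max(pos, end)
    if pos < hi:
        out.append((pos, hi))
    return out


# Toy diploid genome of length 22 with heterozygous sites at 3 and 18.
het_sites = [3, 18]
reads = [
    Read("r1", 1, 0, 10),
    Read("r2", 1, 8, 14),   # lies entirely in the homozygous stretch
    Read("r3", 1, 12, 22),
    Read("s0", 2, 0, 6),
    Read("s1", 2, 2, 16),   # longer read from the other haplotype
    Read("s2", 2, 14, 22),
]

kept = drop_contained(reads, het_sites)
print([r.name for r in kept])
print(gaps(reads, 1, 0, 22), gaps(kept, 1, 0, 22))
```

In this toy input, `r2` carries no heterozygous site, so it appears contained in `s1` from the other haplotype and is dropped. The two remaining haplotype-1 reads, `r1` and `r3`, no longer overlap, which leaves the interval between them uncovered on haplotype 1 even though the original read set covered it fully.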

Updated: 2024-11-01