Nature Communications (IF 14.7), Pub Date: 2024-03-19, DOI: 10.1038/s41467-024-46614-z
Yichen Henry Liu 1, Can Luo 2, Staunton G Golding 2, Jacob B Ioffe 1, Xin Maizie Zhou 1, 2, 3
Long-read sequencing offers long contiguous DNA fragments, facilitating diploid genome assembly and structural variant (SV) detection. Efficient and robust algorithms for SV identification are crucial with increasing data availability. Alignment-based methods, favored for their computational efficiency and lower coverage requirements, are prominent. Alternative approaches, relying solely on available reads for de novo genome assembly and employing assembly-based tools for SV detection via comparison to a reference genome, demand significantly more computational resources. However, the lack of comprehensive benchmarking constrains our comprehension and hampers further algorithm development. Here we systematically compare 14 read alignment-based SV calling methods (including 4 deep learning-based methods and 1 hybrid method), and 4 assembly-based SV calling methods, alongside 4 upstream aligners and 7 assemblers. Assembly-based tools excel in detecting large SVs, especially insertions, and exhibit robustness to evaluation parameter changes and coverage fluctuations. Conversely, alignment-based tools demonstrate superior genotyping accuracy at low sequencing coverage (5-10×) and excel in detecting complex SVs, like translocations, inversions, and duplications. Our evaluation provides performance insights, highlighting the absence of a universally superior tool. We furnish guidelines across 31 criteria combinations, aiding users in selecting the most suitable tools for diverse scenarios and offering directions for further method development.
Title: Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data
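The abstract notes that tool rankings can shift with the evaluation parameters. As a rough illustration of why, the sketch below scores a call set against a truth set using two tunable thresholds: a maximum breakpoint distance and a minimum size similarity. Everything in it is an assumption for illustration only: the `SV` record, the function names, the greedy one-to-one matching, and the default thresholds (500 bp, 0.7) are hypothetical and are not taken from the paper's benchmarking pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Minimal, hypothetical SV record (not a full VCF representation)."""
    chrom: str
    pos: int      # start breakpoint (bp)
    svtype: str   # e.g. "INS", "DEL"
    length: int   # absolute SV length (bp)


def is_match(call, truth, max_dist=500, min_size_ratio=0.7):
    """Return True if a call and a truth SV agree under the given thresholds."""
    if call.chrom != truth.chrom or call.svtype != truth.svtype:
        return False
    if abs(call.pos - truth.pos) > max_dist:
        return False
    small, large = sorted((call.length, truth.length))
    return large > 0 and small / large >= min_size_ratio


def evaluate(calls, truths, max_dist=500, min_size_ratio=0.7):
    """Greedy one-to-one matching, then precision, recall, and F1."""
    unmatched = list(truths)
    tp = 0
    for call in calls:
        for truth in unmatched:
            if is_match(call, truth, max_dist, min_size_ratio):
                unmatched.remove(truth)  # each truth SV is matched at most once
                tp += 1
                break
    fp = len(calls) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truths = [SV("chr1", 1000, "DEL", 300), SV("chr1", 5000, "INS", 1000)]
    calls = [SV("chr1", 1100, "DEL", 280),
             SV("chr1", 5400, "INS", 600),
             SV("chr2", 100, "DEL", 50)]
    # Relaxing the size-similarity threshold changes which calls count as true positives.
    for ratio in (0.7, 0.5):
        print(ratio, evaluate(calls, truths, min_size_ratio=ratio))
```

In this toy data the insertion call is undersized relative to the truth insertion, so it counts as a true positive only under the looser size threshold. That is the kind of parameter sensitivity a benchmark has to control for when comparing callers.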