Probably Correct: Rescuing Repeats with Short and Long Reads
Genes (IF 2.8), Pub Date: 2020-12-31, DOI: 10.3390/genes12010048, Monika Cechova
Ever since the introduction of high-throughput sequencing following the Human Genome Project, assembling short reads into a reference of sufficient quality has posed a significant problem, as a large portion of the human genome (an estimated 50–69%) is repetitive. As a result, a sizable proportion of sequencing reads are multi-mapping, i.e., without a unique placement in the genome. The two key parameters that determine whether a read is multi-mapping are the read length and genome complexity. Long reads are now able to span difficult, heterochromatic regions, including full centromeres, and characterize chromosomes from “telomere to telomere”. Moreover, identical reads or repeat arrays can be differentiated based on their epigenetic marks, such as methylation patterns, aiding in the assembly process. This is despite the fact that long reads still contain a modest percentage of sequencing errors, which disorient aligners and assemblers in terms of both accuracy and speed. Here, I review the proposed and implemented solutions to repeat resolution and the multi-mapping read problem, as well as the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes. I also consider the forthcoming challenges and solutions with regard to long reads, where we expect a shift from the problem of repeat localization within a single individual to the problem of repeat positioning within pangenomes.
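To make the dependence of multi-mapping on read length concrete, the following is a minimal sketch rather than anything taken from the review itself. It assumes error-free reads, forward-strand exact matching instead of a real aligner, and a made-up toy sequence; the function name `multimapping_fraction` and the variables `repeat` and `genome` are hypothetical.

```python
from collections import Counter


def multimapping_fraction(genome: str, read_length: int) -> float:
    """Fraction of error-free reads of the given length that match
    more than one position in the genome (forward strand only)."""
    n_reads = len(genome) - read_length + 1
    if read_length <= 0 or n_reads <= 0:
        raise ValueError("read_length must be between 1 and len(genome)")
    # Enumerate every possible read start position.
    reads = [genome[i:i + read_length] for i in range(n_reads)]
    counts = Counter(reads)
    # A read is multi-mapping if its sequence occurs at 2+ positions.
    multi = sum(1 for read in reads if counts[read] > 1)
    return multi / n_reads


# Toy genome (an assumption for illustration): two copies of an
# 8 bp repeat embedded in short non-repeat flanks.
repeat = "ACGTTGCA"
genome = "GGATCC" + repeat + "TTAGGC" + repeat + "CCTAGA"

for length in (4, 8, 12, len(genome)):
    fraction = multimapping_fraction(genome, length)
    print(f"read length {length:2d}: {fraction:.2f} multi-mapping")
```

By construction the 8 bp repeat occurs twice, so a read of that length drawn from it cannot be placed uniquely, whereas a read as long as the whole toy sequence is trivially unique. Real genomes add sequencing errors, both strands, and diverged repeat copies, none of which this sketch models.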
Updated: 2020-12-31