Nature Machine Intelligence (IF 18.8), Pub Date: 2024-10-07, DOI: 10.1038/s42256-024-00908-5
Bohao Zou, Jingjing Wang, Yi Ding, Zhenmiao Zhang, Yufen Huang, Xiaodong Fang, Ka Chun Cheung, Simon See, Lu Zhang
Metagenome-assembled genomes (MAGs) offer valuable insights into the exploration of microbial dark matter using metagenomic sequencing data. However, there is growing concern that contamination in MAGs may substantially affect the results of downstream analysis. Current MAG decontamination tools primarily rely on marker genes and do not fully use the contextual information of genomic sequences. To overcome this limitation, we introduce Deepurify for MAG decontamination. Deepurify uses a multi-modal deep language model with contrastive learning to match microbial genomic sequences with their taxonomic lineages. It allocates contigs within a MAG to a MAG-separated tree and applies a tree traversal algorithm to partition MAGs into sub-MAGs, with the goal of maximizing the number of high- and medium-quality sub-MAGs. Here we show that Deepurify outperformed MDMcleaner and MAGpurify on simulated data, CAMI datasets and real-world datasets with varying complexities. Deepurify increased the number of high-quality MAGs by 20.0% in soil, 45.1% in ocean, 45.5% in plants, 33.8% in freshwater and 28.5% in human faecal metagenomic sequencing datasets.
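The tree-based partitioning idea can be illustrated with a toy sketch. The abstract does not specify Deepurify's tree structure, scoring rule or quality evaluator, so everything below is an assumption made for illustration: the `TaxonNode` class, the `best_partition` recursion, the `toy_evaluate` function and the completeness/contamination thresholds (chosen to resemble commonly used MIMAG-style cut-offs) are hypothetical and are not the authors' implementation. At each node, the sketch compares keeping all contigs beneath it as one bin against splitting along the child lineages, and keeps whichever yields more high- or medium-quality bins.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Quality = Tuple[float, float]  # (completeness %, contamination %)


@dataclass
class TaxonNode:
    """Hypothetical tree node: contigs assigned to one predicted lineage."""
    name: str
    contigs: List[str] = field(default_factory=list)
    children: List["TaxonNode"] = field(default_factory=list)


def all_contigs(node: TaxonNode) -> List[str]:
    """Collect the contigs at this node and in every descendant."""
    out = list(node.contigs)
    for child in node.children:
        out.extend(all_contigs(child))
    return out


def grade(completeness: float, contamination: float) -> str:
    # Assumed thresholds; the abstract does not state which ones are used.
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


def best_partition(node: TaxonNode,
                   evaluate: Callable[[List[str]], Quality]):
    """Return (number of high/medium bins, list of bins) for this subtree."""
    merged = all_contigs(node)
    keep_score = 1 if grade(*evaluate(merged)) != "low" else 0
    if not node.children:
        return keep_score, [merged]

    split_score, split_bins = 0, []
    for child in node.children:
        score, bins = best_partition(child, evaluate)
        split_score += score
        split_bins.extend(bins)
    if node.contigs:
        # Contigs not resolved to any child lineage form their own bin.
        split_bins.append(list(node.contigs))
        split_score += 1 if grade(*evaluate(node.contigs)) != "low" else 0

    # Split only when it strictly increases the count of usable bins.
    if split_score > keep_score:
        return split_score, split_bins
    return keep_score, [merged]


# Toy data: contig -> (true genome, % of that genome's markers it carries).
TOY = {"a1": ("A", 60.0), "a2": ("A", 35.0), "b1": ("B", 55.0)}


def toy_evaluate(contigs: List[str]) -> Quality:
    """Stand-in for a marker-gene-based quality estimator."""
    totals = {}
    for contig in contigs:
        genome, pct = TOY[contig]
        totals[genome] = totals.get(genome, 0.0) + pct
    if not totals:
        return 0.0, 0.0
    dominant = max(totals.values())
    return dominant, sum(totals.values()) - dominant


mag = TaxonNode("root", children=[
    TaxonNode("lineage_A", contigs=["a1", "a2"]),
    TaxonNode("lineage_B", contigs=["b1"]),
])
score, bins = best_partition(mag, toy_evaluate)
print(score, bins)  # 2 [['a1', 'a2'], ['b1']]
```

In this toy case the unsplit MAG grades as low quality (95% completeness, 55% contamination), while splitting along the two lineages yields one high-quality and one medium-quality bin, so the split is preferred. A MAG whose contigs all come from one genome would be left intact, because splitting it would not increase the count.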
Title (translated from the Chinese):
A multi-modal deep language model for contaminant removal from metagenome-assembled genomes