4CAC: 4-class classifier of metagenome contigs using machine learning and assembly graphs
Nucleic Acids Research ( IF 16.6 ) Pub Date : 2024-09-17 , DOI: 10.1093/nar/gkae799
Lianrong Pu 1, 2 , Ron Shamir 1

Microbial communities usually harbor a mix of bacteria, archaea, plasmids, viruses and microeukaryotes. Within these communities, viruses, plasmids, and microeukaryotes coexist in relatively low abundance, yet they engage in intricate interactions with bacteria. Moreover, viruses and plasmids, as mobile genetic elements, play important roles in horizontal gene transfer and the development of antibiotic resistance within microbial populations. However, due to the difficulty of identifying viruses, plasmids, and microeukaryotes in microbial communities, our understanding of these minor classes lags behind that of bacteria and archaea. Recently, several classifiers have been developed to separate one or more minor classes from bacteria and archaea in metagenome assemblies. However, these classifiers often overlook the issue of class imbalance, leading to low precision in identifying the minor classes. Here, we developed a classifier called 4CAC that is able to identify viruses, plasmids, microeukaryotes, and prokaryotes simultaneously from metagenome assemblies. 4CAC generates an initial four-way classification using several sequence length-adjusted XGBoost models and further improves the classification using the assembly graph. Evaluation on simulated and real metagenome datasets demonstrates that 4CAC substantially outperforms existing classifiers and combinations thereof on short reads. On long reads, it also shows an advantage unless the abundance of the minor classes is very low. 4CAC runs 1–2 orders of magnitude faster than the other classifiers. The 4CAC software is available at https://github.com/Shamir-Lab/4CAC.
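The two-stage idea in the abstract (a length-dependent initial classification, then refinement over the assembly graph) can be illustrated with a minimal Python sketch. It is not the 4CAC implementation. The length bins, the function names `pick_length_bin` and `refine_with_graph`, and the rule "an unclassified contig adopts a class only when all of its classified neighbors agree" are assumptions made for illustration. The XGBoost models are left out, and the initial labels are simply given as input.

```python
from collections import Counter

# Hypothetical length bins in base pairs; the abstract does not state the real cut-offs.
LENGTH_BINS = [(0, 1000), (1000, 5000), (5000, None)]


def pick_length_bin(contig_length):
    """Return the index of the bin (and so of the model) that covers a contig length."""
    if contig_length < 0:
        raise ValueError("contig length must be non-negative")
    for index, (low, high) in enumerate(LENGTH_BINS):
        if contig_length >= low and (high is None or contig_length < high):
            return index
    raise ValueError("no bin covers this length")


def refine_with_graph(labels, edges):
    """Fill in unclassified contigs from their assembly-graph neighbors.

    labels: dict mapping contig id -> class name, or None if unclassified.
    edges:  iterable of (contig_a, contig_b) pairs from the assembly graph.
    Assumed rule: a contig adopts a class only if every classified neighbor
    has that same class. Rounds repeat until nothing changes.
    """
    neighbors = {contig: set() for contig in labels}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    refined = dict(labels)
    while True:
        updates = {}
        for contig, label in refined.items():
            if label is not None:
                continue
            votes = Counter(
                refined[n] for n in neighbors[contig] if refined[n] is not None
            )
            if len(votes) == 1:  # all classified neighbors agree
                updates[contig] = next(iter(votes))
        if not updates:
            return refined
        refined.update(updates)  # apply the whole round at once


# Toy input: c2 touches only a plasmid contig; c5 touches two conflicting classes.
initial = {"c1": "plasmid", "c2": None, "c3": "prokaryote", "c4": "virus", "c5": None}
graph_edges = [("c1", "c2"), ("c3", "c5"), ("c4", "c5")]
refined = refine_with_graph(initial, graph_edges)
```

In this toy graph, `c2` inherits the plasmid label from its only neighbor, while `c5` stays unclassified because its neighbors disagree. Leaving conflicting cases unlabeled is one simple way to favor precision on the minor classes, the concern the abstract raises about class imbalance.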

Updated: 2024-09-17