CASSL: A cell-type annotation method for single cell transcriptomics data using semi-supervised learning
Single-cell RNA sequencing (scRNA-seq) allows global transcriptomic profiling at cellular resolution, thereby enabling the identification of underlying cell types and their corresponding lineages. Such cell-type identification and annotation rely heavily on models trained on large numbers of individual cells with accurate annotated labels. At present, cell-type annotation is done by inspecting marker genes from each statistically significant group of cells, which is both challenging and time-consuming. In this article, we propose a semi-supervised cell-type annotation method, called CASSL, based on non-negative matrix factorization (NMF) coupled with a recursive k-means algorithm. A semi-supervised model can learn labels for a large amount of unlabelled data with the help of a limited amount of labelled data. The effectiveness of CASSL is demonstrated on eight publicly available human and mouse scRNA-seq datasets spanning varied organs and protocols. It correctly annotates the majority of the unlabelled cells with high accuracy. It is also evaluated on the correctness of its clustering solution, its robustness across varying percentages of missing labels, and its execution time. Compared with state-of-the-art unsupervised and semi-supervised cell-type annotation methods, CASSL consistently outperforms the others across all metrics on most of the datasets. It also shows competitive results against state-of-the-art supervised methods.
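The abstract names the two building blocks of the method (NMF and recursive k-means) but does not say how they are combined. The sketch below shows one plausible, heavily simplified arrangement and should not be read as the authors' implementation. It assumes that cells are embedded with a rank-2 NMF and that the embedding is then bisected recursively with 2-means until each cluster contains a single known label, which is propagated to the unlabelled cells in that cluster. The function names (`nmf`, `kmeans2`, `annotate`), the stopping parameters, and the synthetic Poisson count matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=200, seed=0):
    """Lee-Seung multiplicative updates for X ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 1e-3
    H = rng.random((rank, X.shape[1])) + 1e-3
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def kmeans2(Z, n_iter=50, seed=0):
    """Plain Lloyd iterations with k = 2; returns a 0/1 assignment per row."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=2, replace=False)].copy()
    assign = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = Z[assign == k].mean(axis=0)
    return assign

def annotate(Z, labels, max_depth=8, min_size=4):
    """Fill in labels marked -1 by recursive bisection (assumed scheme)."""
    out = labels.copy()

    def recurse(idx, depth, fallback):
        known = labels[idx][labels[idx] >= 0]
        if known.size == 0:
            # No labelled cell here: inherit the parent's majority label.
            out[idx] = fallback
            return
        values, counts = np.unique(known, return_counts=True)
        majority = values[counts.argmax()]
        unlabelled = idx[labels[idx] < 0]
        if values.size == 1 or depth == max_depth or idx.size < min_size:
            out[unlabelled] = majority
            return
        assign = kmeans2(Z[idx], seed=depth)
        if assign.min() == assign.max():  # degenerate split, stop here
            out[unlabelled] = majority
            return
        for k in (0, 1):
            recurse(idx[assign == k], depth + 1, majority)

    recurse(np.arange(len(Z)), 0, -1)
    return out

# Synthetic data: 60 cells x 20 genes, two cell types with distinct programs.
rng = np.random.default_rng(42)
rates = np.ones((2, 20))
rates[0, :10] = 20.0
rates[1, 10:] = 20.0
truth = np.repeat([0, 1], 30)
X = rng.poisson(rates[truth]).astype(float)

# Only six cells (three per type) carry a label; -1 marks unlabelled cells.
seed_cells = [0, 1, 2, 30, 31, 32]
labels = np.full(60, -1)
labels[seed_cells] = truth[seed_cells]

W, _ = nmf(X, rank=2)          # cells embedded in a 2-factor space
predicted = annotate(W, labels)
accuracy = (predicted == truth).mean()
print(f"accuracy on toy data: {accuracy:.2f}")
```

Running the clustering on the low-rank factor `W` rather than on the raw counts is the design choice this sketch is meant to illustrate: each cell is summarised by a few non-negative factor weights, and the handful of labelled cells only decides when a cluster is pure enough to stop splitting.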