Toward a semi-supervised learning approach to phylogenetic estimation
Systematic Biology (IF 6.1), Pub Date: 2024-06-25, DOI: 10.1093/sysbio/syae029
Daniele Silvestro, Thibault Latrille, Nicolas Salamin
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence, without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference, when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
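To make the idea of simulation-based training data concrete, the following is a minimal sketch and not the authors' pipeline. It assumes the simplest case mentioned in the abstract, per-site rates drawn from a gamma distribution with mean 1, and it assumes a Jukes-Cantor (JC69) substitution model between just two sequences. The function names, the shape parameter `alpha`, and the branch length are all hypothetical choices made for illustration.

```python
import math
import random

def simulate_site_rates(n_sites, alpha, rng):
    """Draw per-site relative rates from Gamma(shape=alpha, scale=1/alpha).

    With scale = 1/alpha the distribution has mean 1, the usual
    parameterization for among-site rate heterogeneity.
    """
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def jc69_p_diff(rate, branch_length):
    """Probability that two sequences differ at one site under JC69.

    rate * branch_length is the expected number of substitutions
    separating the two sequences at that site.
    """
    d = rate * branch_length
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def simulate_pair(rates, branch_length, rng):
    """Return 0/1 flags marking the sites at which two sequences differ."""
    return [1 if rng.random() < jc69_p_diff(r, branch_length) else 0
            for r in rates]

# Hypothetical settings: 10,000 sites, strong rate heterogeneity (alpha = 0.5).
rng = random.Random(42)
rates = simulate_site_rates(10_000, alpha=0.5, rng=rng)   # labels (true rates)
diffs = simulate_pair(rates, branch_length=0.2, rng=rng)  # observed data
mean_rate = sum(rates) / len(rates)
```

In a supervised setting, the simulated `rates` would play the role of training labels and the simulated site patterns the role of inputs. A real application would simulate full multiple sequence alignments on a tree and under richer models (for example codon models), which this two-sequence sketch does not attempt.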
Updated: 2024-06-25