Genome modeling and design across all domains of life with Evo 2
bioRxiv - Genomics. Pub Date: 2025-02-21, DOI: 10.1101/2025.02.18.638918
Garyk Brixi , Matthew G. Durrant , Jerome Ku , Michael Poli , Greg Brockman , Daniel Chang , Gabriel A. Gonzalez , Samuel H. King , David B. Li , Aditi T. Merchant , Mohsen Naghipourfar , Eric Nguyen , Chiara Ricci-Tam , David W. Romero , Gwanggyu Sun , Ali Taghibakshi , Anton Vorontsov , Brandon Yang , Myra Deng , Liv Gorton , Nam Nguyen , Nicholas K. Wang , Etowah Adams , Stephen A. Baccus , Steven Dillmann , Stefano Ermon , Daniel Guo , Rajesh Ilango , Ken Janik , Amy X. Lu , Reshma Mehta , Mohammad R.K. Mofrad , Madelena Y. Ng , Jaspreet Pannu , Christopher Ré , Jonathan C. Schmok , John St. John , Jeremy Sullivan , Kevin Zhu , Greg Zynda , Daniel Balsam , Patrick Collison , Anthony B. Costa , Tina Hernandez-Boussard , Eric Ho , Ming-Yu Liu , Thomas McGrath , Kimberly Powell , Dave P. Burke , Hani Goodarzi , Patrick D. Hsu , Brian L. Hie

All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation--from noncoding pathogenic mutations to clinically significant BRCA1 variants--without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon-intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
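The abstract does not spell out how variant effects are scored without finetuning. One common zero-shot approach for sequence language models, assumed here purely for illustration, is to compare the model's log-likelihood of the variant sequence against that of the reference. The sketch below uses a tiny Markov model as a stand-in; `ToyDnaModel`, `apply_snv`, and `variant_score` are hypothetical names and are not part of the Evo 2 codebase.

```python
import math
from collections import defaultdict


class ToyDnaModel:
    """Hypothetical stand-in for a DNA language model (NOT Evo 2):
    an order-k Markov model over A/C/G/T with add-one smoothing."""

    def __init__(self, k=3):
        self.k = k
        # counts[context][next_base] = number of times next_base followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq):
        """Sum of log P(base | previous k bases) over the sequence."""
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx = self.counts.get(seq[i - self.k:i], {})
            n = sum(ctx.values())
            # Add-one smoothing over the 4-letter DNA alphabet.
            total += math.log((ctx.get(seq[i], 0) + 1) / (n + 4))
        return total


def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def variant_score(model, ref_seq, pos, alt):
    """Assumed scoring rule: log-likelihood of variant minus reference.
    More negative means the model finds the variant less plausible."""
    variant_seq = apply_snv(ref_seq, pos, alt)
    return model.log_likelihood(variant_seq) - model.log_likelihood(ref_seq)


# Usage: train on a repetitive toy "genome", then score a substitution.
model = ToyDnaModel(k=3).fit(["ATGGCC" * 50])
ref = "ATGGCC" * 5
print(variant_score(model, ref, 10, "T"))
```

With a real model, the same comparison would use that model's own likelihood function over a window around the variant. The toy model only shows the shape of the calculation.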

Updated: 2025-02-22