Nature Biotechnology (IF 33.1), Pub Date: 2024-10-11, DOI: 10.1038/s41587-024-02414-w
Abdul Muntakim Rafi, Daria Nogina, Dmitry Penzar, Dohoon Lee, Danyeong Lee, Nayeon Kim, Sangyeup Kim, Dohyeon Kim, Yeojin Shin, Il-Youp Kwak, Georgy Meshcheryakov, Andrey Lando, Arsenii Zinkevich, Byeong-Chan Kim, Juhyun Lee, Taein Kang, Eeshit Dhaval Vaishnav, Payman Yadollahpour, Sun Kim, Jake Albrecht, Aviv Regev, Wuming Gong, Ivan V. Kulakovskiy, Pablo Meyer, Carl G. de Boer
A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.
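The idea of dividing models into modular building blocks and testing every combination can be illustrated with a minimal sketch. The slot names (`data_processor`, `first_block`, `core_block`, `final_block`) and the model labels (`model_a`, `model_b`, `model_c`) are assumptions made for illustration only; the abstract does not name the modules or the teams, and this is not the framework's actual API.

```python
from itertools import product

# Hypothetical slot names and model labels; the abstract does not list them.
MODULE_SLOTS = ("data_processor", "first_block", "core_block", "final_block")
TOP_MODELS = ("model_a", "model_b", "model_c")


def enumerate_combinations(slots=MODULE_SLOTS, sources=TOP_MODELS):
    """Return every assignment of a source model to each module slot."""
    return [
        dict(zip(slots, choice))
        for choice in product(sources, repeat=len(slots))
    ]


def is_original(combo):
    """True when all slots come from the same source model (an unmodified entry)."""
    return len(set(combo.values())) == 1


combos = enumerate_combinations()
hybrids = [c for c in combos if not is_original(c)]
print(len(combos), len(hybrids))  # prints "81 78" under these assumed slots
```

Under these assumptions each combination would then be trained and scored on the benchmark suite. Some pairings of blocks may be incompatible in practice and would have to be skipped or adapted.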
Title: A community effort to optimize sequence-based deep learning models of gene regulation