A systematic review of deep learning chemical language models in recent era
Journal of Cheminformatics (IF 7.1), Pub Date: 2024-11-18, DOI: 10.1186/s13321-024-00916-y
Hector Flores-Hernandez, Emmanuel Martinez-Ledesma

Discovering new chemical compounds with specific properties can benefit fields that rely on materials for their development, but the task is costly in both complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized molecular design by analyzing and learning from representations of molecular data, greatly reducing the time and resources involved. Numerous deep learning approaches, built on a variety of architectures and strategies, have been developed to explore the vast and discontinuous chemical space and to generate compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies used to generate molecules through deep learning, using the metrics proposed in Molecular Sets (MOSES) or GuacaMol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language model approaches to molecule generation and the other 10 correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured State Space Sequence (S4) models, and variational autoencoders (VAEs) are the main deep learning architectures used for molecule generation in the retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the most commonly employed techniques for biasing model generation and exploring specific regions of chemical space. Finally, the analysis focuses on the central themes of molecular representation, databases, training dataset size, the validity-novelty trade-off, and the performance of unbiased and biased chemical language models. These themes were selected for a statistical analysis using graphical representations and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.
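The abstract refers to MOSES and GuacaMol metrics and to a validity-novelty trade-off without spelling the quantities out. As a rough illustration only (not the review's own code), the sketch below computes validity, uniqueness, and novelty in the way these benchmarks commonly define them: the fraction of generated strings that parse, the fraction of valid molecules that are distinct, and the fraction of distinct valid molecules absent from the training set. The names `generation_metrics` and `toy_canonicalize` are hypothetical, and the toy validity check (balanced parentheses) is an assumption standing in for a real cheminformatics toolkit such as RDKit.

```python
from typing import Callable, Iterable, Optional


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> dict:
    """Validity, uniqueness and novelty in the spirit of MOSES/GuacaMol.

    `canonicalize` must return a canonical string for a valid molecule
    and None for a string that does not parse.
    """
    generated = list(generated)
    # Keep only strings that parse, in canonical form (duplicates retained).
    valid = [c for c in (canonicalize(s) for s in generated) if c is not None]
    unique = set(valid)
    train = {c for c in (canonicalize(s) for s in training) if c is not None}
    novel = unique - train
    return {
        # Fraction of all generated strings that are valid.
        "validity": len(valid) / len(generated) if generated else 0.0,
        # Fraction of valid molecules that are distinct.
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        # Fraction of distinct valid molecules not seen in training.
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Hypothetical stand-in for a real parser: accepts any non-empty
    string with balanced parentheses and returns it unchanged."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    return smiles if smiles and depth == 0 else None


if __name__ == "__main__":
    sampled = ["CCO", "CCO", "CC(C)O", "CC(C", "c1ccccc1"]
    print(generation_metrics(sampled, ["CCO"], toy_canonicalize))
```

Under these definitions a model that copies its training set scores high on validity and low on novelty, which is one way to read the trade-off named above. In practice the `canonicalize` argument would wrap a proper SMILES parser so that different spellings of the same molecule collapse to a single canonical form.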

Updated: 2024-11-18