EDITORIAL: Chemical Compound Space Exploration by Multiscale High-Throughput Screening and Machine Learning
Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2024-08-12, DOI: 10.1021/acs.jcim.4c01300
Ganna Gryn'ova, Tristan Bereau, Carolin Müller, Pascal Friederich, Rebecca C. Wade, Ariane Nunes-Alves, Thereza A. Soares, Kenneth Merz

According to various estimates, the number of possible chemical compounds lies in the range of 10¹⁸–10²⁰⁰. A popular estimate of 10⁶⁰ refers to molecules composed of C, H, N, O, and S atoms, containing no more than 4 rings and weighing less than 500 Da. (1) Several other “theoretical” subspaces populated by small organic (often drug-like) molecules have been enumerated, e.g., a more conservative estimate of 3.4 × 10⁹ for molecules with ≤100 carbon atoms (2) and the “Chemical Universe Database” GDB-17 with 166.4 billion molecules with up to 17 C, N, O, S, and halogen atoms. (3,4) Across the entire chemical space, ca. 219 million organic substances, alloys, coordination compounds, minerals, mixtures, polymers, and salts have been published and are recorded in the Chemical Abstracts Service (CAS) registry. (5) Although seemingly a huge number, 219 million is but a speck of dust in the practically infinite chemical universe. Such endless possibilities come with a burden of choice: which molecule or material is the best (most efficient, cheapest, most sustainable, ...) for a given practical use? Chemists have come up with diverse solutions to this problem over the centuries, but a profound shift away from building upon prior experience (for example, introducing new substituents into a known catalyst) and toward a less biased and broader exploration of chemical space has come about only comparatively recently. The immense growth in computing power and memory, the variety and availability of theoretical methods and their implementations, new chemical synthesis approaches and laboratory automation, and stellar advances in artificial intelligence are all factors contributing to this shift.
Today, combinatorial structure generation and property computation using automated multiscale workflows enable high-throughput screening (HTS) of millions of compounds at the cost that, only some 20 years ago, we used to pay for computing a single chemically accurate energy profile for a reaction between relatively small compounds. Moreover, generative and predictive machine learning (ML) models enable targeted inverse molecular design and allow estimation of chemical properties across even larger regions of the chemical universe. (6−11) The latest advances in AI-driven molecular science were central to two recent meetings in the historic city of Heidelberg in the south of Germany: the second SIMPLAIX Workshop on Machine Learning for Multiscale Molecular Modeling (https://simplaix-workshop2024.h-its.org/) and the Chemical Compound Space Conference 2024 (CCSC2024, https://ccsc2024.github.io/, Figure 1). A broad range of scientific themes was covered, from developing new machine learning architectures for studying molecular properties to ML applications in biomolecular simulations and materials discovery. Exploration of chemical space nevertheless served as a leitmotif for many of the lectures, posters, and discussions. Despite the meteoric pace of progress in the field of chemical big data and machine learning, many experts agree that efficient, comprehensive, and unbiased exploration of this immense space remains elusive. The very fast pace of methodological developments in this area was named among the key obstacles to chemical space exploration, with parallels being drawn between the “alphabet soup” of density functionals (a term coined by Kieron Burke in 2007) and the multitude of new ML potentials, architectures, and representations published today. Only time will tell whether the field will eventually converge on a handful of popular models.
The “garbage in, garbage out” problem is equally central to chemical space exploration, as the reliability of predictions for new systems is a direct outcome of the quality of the training data. Consequently, there is a pressing need for automated approaches to benchmarking, comparing, and quantifying the uncertainties of ML models in chemistry. Finally, even if and when promising new molecules and materials are discovered in silico, their stability and synthetic accessibility, which are challenging to predict for truly novel systems, become key to successful experimental validation. Mining the literature and training ML models on experimental data offer a potential solution to this issue, yet both approaches are hindered by barriers to the availability and accessibility of such data and by the lack of FAIR data standards for reporting and structuring it.
Figure 1. Illustration of chemical compound space created by the attendees during the CCSC2024 conference. Photo credit: John M. Lindner.
In light of these challenges, the Journal of Chemical Information and Modeling (JCIM) invites authors to submit contributions to a Virtual Special Issue (VSI) on the topic of “Chemical Compound Space Exploration by Multiscale High-Throughput Screening and Machine Learning”. This VSI recognizes the vertiginous developments in the field during the three years since the JCIM special issue on Reaction Informatics and Chemical Space. (12) All manuscript types published by JCIM, including articles, perspectives, viewpoints, reviews, letters, and application notes, are welcome. For more information on manuscript types and how to submit, please visit the journal’s website. Submissions will be received through January 31, 2025.
All articles submitted under this VSI will be peer-reviewed to ensure that they fit the scope of the Virtual Special Issue and meet the high scientific publishing standards of the Journal of Chemical Information and Modeling (more information can be found in previous editorials (13,14)). If accepted, publications will go online as soon as possible and be published in the next available issue. Publications on this topic will be gathered into a Virtual Special Issue and widely promoted thereafter.
G.G. gratefully acknowledges funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 101042290 PATTERNCHEM). T.B. acknowledges support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy EXC 2181/1-390900948 (the Heidelberg STRUCTURES Excellence Cluster). G.G., T.B., P.F., and R.C.W. gratefully acknowledge the financial support of the Klaus Tschira Stiftung gGmbH for SIMPLAIX. T.A.S. acknowledges funding from FAPESP (Grant 2022/07231-7), CNPq (Productivity fellowship), and the RCN (Grant 26269). A.N.A. acknowledges funding from DFG under Germany’s Excellence Strategy EXC 2008/1-390540038 - UniSysCat.

Updated: 2024-08-14