Human-in-the-loop active learning for goal-oriented molecule generation
Journal of Cheminformatics (IF 7.1), Pub Date: 2024-12-09, DOI: 10.1186/s13321-024-00924-y
Yasmine Nahal, Janosch Menke, Julien Martinelli, Markus Heinonen, Mikhail Kabeshov, Jon Paul Janet, Eva Nittinger, Ola Engkvist, Samuel Kaski
Machine learning (ML) systems have enabled the modelling of quantitative structure–property relationships (QSPR) and structure–activity relationships (QSAR) using existing experimental data to predict target properties for new molecules. These property predictors hold significant potential in accelerating drug discovery by guiding generative artificial intelligence (AI) agents to explore desired chemical spaces. However, they often struggle to generalize due to the limited scope of the training data. When generative agents optimize against such predictors, this limitation can result in the generation of molecules with artificially high predicted probabilities of satisfying target properties, which subsequently fail experimental validation. To address this challenge, we propose an adaptive approach that integrates active learning (AL) and iterative feedback to refine property predictors, thereby improving the outcomes of their optimization by generative AI agents. Our method leverages the Expected Predictive Information Gain (EPIG) criterion to select additional molecules for evaluation by an oracle. This selection aims to provide the greatest reduction in predictive uncertainty, enabling more accurate model evaluations of subsequently generated molecules. Recognizing the impracticality of immediate wet-lab or physics-based experiments due to time and logistical constraints, we propose leveraging human experts for their cost-effectiveness and domain knowledge to effectively augment property predictors, bridging gaps in the limited training data. Empirical evaluations through both simulated and real human-in-the-loop experiments demonstrate that our approach refines property predictors to better align with oracle assessments. Additionally, we observe improved accuracy of predicted properties as well as improved drug-likeness among the top-ranking generated molecules.
We present an adaptable framework that integrates AL and human expertise to refine property predictors for goal-oriented molecule generation. This approach is robust to noise in human feedback and ensures that navigating chemical space with human-refined predictors leverages human insights to identify molecules that not only satisfy predicted property profiles but also score highly on oracle models. Additionally, it prioritizes practical characteristics such as drug-likeness, synthetic accessibility, and a favorable balance between exploring diverse chemical space and exploiting similarity to existing training data.
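To make the EPIG selection step mentioned in the abstract more concrete, the following is a minimal sketch of how such a score could be estimated for a classification-style property predictor. It is not the authors' implementation. It assumes that K posterior samples of the predictor (for example, ensemble members) are available, that each outputs class probabilities, and that a set of "target" molecules stands in for the region of chemical space the generator is expected to visit. The function names `epig_scores` and `select_queries`, the array layout, and the toy numbers are all hypothetical.

```python
import numpy as np


def epig_scores(pool_probs, target_probs, eps=1e-12):
    """Estimate EPIG for each candidate from K posterior samples (assumed setup).

    pool_probs:   (K, N, C) class probabilities for N candidate molecules
    target_probs: (K, M, C) class probabilities for M target-distribution molecules
    Returns an array of shape (N,); larger means more informative.
    """
    pool_probs = np.asarray(pool_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    n_samples = pool_probs.shape[0]

    # Joint predictive p(y, y* | x, x*), averaged over posterior samples: (N, M, C, C)
    joint = np.einsum("knc,kmd->nmcd", pool_probs, target_probs) / n_samples

    # Product of the marginal predictives p(y | x) p(y* | x*): (N, M, C, C)
    p_pool = pool_probs.mean(axis=0)
    p_target = target_probs.mean(axis=0)
    indep = p_pool[:, None, :, None] * p_target[None, :, None, :]

    # Mutual information between y and y*, then averaged over the target inputs
    mi = np.sum(joint * (np.log(joint + eps) - np.log(indep + eps)), axis=(2, 3))
    return mi.mean(axis=1)


def select_queries(pool_probs, target_probs, batch_size):
    """Return indices of the top-`batch_size` candidates by EPIG (naive top-k)."""
    scores = epig_scores(pool_probs, target_probs)
    return np.argsort(-scores)[:batch_size]


# Toy example: K=2 posterior samples, N=2 candidates, M=1 target molecule, C=2 classes.
# Candidate 0: both samples agree on [0.5, 0.5] (uncertain, but not reducible).
# Candidate 1: the samples disagree, in the same way they disagree on the target.
pool = np.array([
    [[0.5, 0.5], [0.9, 0.1]],
    [[0.5, 0.5], [0.1, 0.9]],
])
target = np.array([
    [[0.9, 0.1]],
    [[0.1, 0.9]],
])
scores = epig_scores(pool, target)
chosen = select_queries(pool, target, batch_size=1)
```

In this toy case the first candidate has maximal predictive entropy but the posterior samples agree on it, so labelling it would not be expected to change predictions on the target molecule; the second candidate is the one selected. Top-k selection is used here only for brevity and ignores redundancy within a batch.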
Updated: 2024-12-10