Using a natural language processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters, Digital Discovery


Digital Discovery (IF 6.2), Pub Date: 2023-11-29, DOI: 10.1039/d3dd00119a
Jiayun Pang, Alexander W. R. Pine, Abdulai Sulemana


Hansen solubility parameters (HSPs) have three components, δ_d, δ_p and δ_h, which account for the dispersion forces, polar forces and hydrogen bonding of a molecule. They were designed to give a better understanding of how molecular structure affects miscibility and solubility. HSPs are widely used throughout the pharmaceutical research pipeline, yet they have not been studied computationally as thoroughly as aqueous solubility. In the current study, we predict HSPs using only the SMILES strings of molecules, utilising a molecular embedding approach inspired by natural language processing (NLP). Two pre-trained deep learning models, Mol2Vec and ChemBERTa, were used to derive the embeddings. A dataset of about 1200 organic molecules with experimentally determined HSPs was used as the labelled dataset. Upon finetuning, the ChemBERTa model "learned" relevant molecular features and shifted its attention to the functional groups that give rise to the relevant HSPs. The finetuned ChemBERTa model outperforms both the Mol2Vec model and the baseline Morgan fingerprint method, albeit not to a significant extent. Interestingly, the embedding models predict δ_d significantly better than δ_h and δ_p, and overall the accuracy of the predicted HSPs is lower than that achieved for the well-benchmarked ESOL aqueous solubility dataset. Our study indicates that the extent of transfer learning leveraged from the pre-trained models depends on the labelled molecular property. It also highlights that δ_p and δ_h may carry large intrinsic errors arising from the way they are defined, which places an inherent limit on how accurately machine learning models can predict them. Our work reveals several interesting findings that will help explore the potential of BERT-based models for molecular property prediction. It may also guide a possible refinement of the Hansen solubility framework, which would have a wide impact across the pharmaceutical industry and research.
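For readers unfamiliar with how the three components are used in practice, the following is a minimal sketch of the standard Hansen relations, not code from the paper. It assumes the usual definitions of the total parameter, δ_t² = δ_d² + δ_p² + δ_h², and of the Hansen distance between two substances, Ra² = 4(δ_d1 − δ_d2)² + (δ_p1 − δ_p2)² + (δ_h1 − δ_h2)². The names `HSP`, `total_parameter` and `hansen_distance` are hypothetical, and the ethanol and acetone values are approximate, commonly tabulated figures in MPa^0.5 included only for illustration.

```python
import math
from typing import NamedTuple


class HSP(NamedTuple):
    """Hansen solubility parameters of one substance, in MPa^0.5."""
    d: float  # dispersion component, delta_d
    p: float  # polar component, delta_p
    h: float  # hydrogen-bonding component, delta_h


def total_parameter(x: HSP) -> float:
    """Total solubility parameter: sqrt(d^2 + p^2 + h^2)."""
    return math.sqrt(x.d ** 2 + x.p ** 2 + x.h ** 2)


def hansen_distance(a: HSP, b: HSP) -> float:
    """Hansen distance Ra; the dispersion term carries the conventional factor of 4."""
    return math.sqrt(
        4.0 * (a.d - b.d) ** 2 + (a.p - b.p) ** 2 + (a.h - b.h) ** 2
    )


# Approximate, commonly tabulated values (an assumption for illustration;
# check a reference table before relying on them).
ETHANOL = HSP(d=15.8, p=8.8, h=19.4)
ACETONE = HSP(d=15.5, p=10.4, h=7.0)

if __name__ == "__main__":
    print(f"delta_t(ethanol) = {total_parameter(ETHANOL):.2f} MPa^0.5")
    print(f"delta_t(acetone) = {total_parameter(ACETONE):.2f} MPa^0.5")
    print(f"Ra(ethanol, acetone) = {hansen_distance(ETHANOL, ACETONE):.2f} MPa^0.5")
```

A smaller Ra suggests that two substances are more likely to be miscible. In this pair the hydrogen-bonding term dominates the distance, so any error in a predicted δ_h would carry almost directly into Ra, which is one reason the lower accuracy reported for δ_h and δ_p matters in practice.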

Updated: 2023-12-04
