A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence
Journal of Cheminformatics (IF 7.1), Pub Date: 2024-06-19, DOI: 10.1186/s13321-024-00848-7
Xiaofan Zheng, Yoichi Tomiura

Obtaining desired molecular properties through theory or experiment is a costly process. Using machine learning to analyze molecular structural features and predict molecular properties is a potentially efficient alternative for accelerating such predictions. In this study, we analyze molecular properties through molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network to extract molecular structural features and predict molecular properties. A SMILES sequence comprises symbols representing a molecular structure. To address the problem that a SMILES sequence differs from actual molecular structural data, we propose a pretraining model for SMILES sequences based on the BERT model, which is widely used in natural language processing, so that the model learns to extract the molecular structural information contained in a SMILES sequence. In our experiments, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets as well as the odor characteristics of molecules (98 types of odor descriptors). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction. The 2-encoder pretraining is proposed based on two observations: symbols in a SMILES sequence depend less on their context than words in a natural language sentence, and a single compound corresponds to multiple SMILES sequences. The model pretrained with the 2-encoder shows higher robustness in molecular property prediction tasks than BERT, which is suited to natural language.
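The "one compound corresponds to multiple SMILES sequences" observation can be reproduced with a short sketch. The following minimal illustration assumes RDKit is available; the molecule (aspirin) and the number of samples are arbitrary choices, not taken from the paper. It enumerates several equivalent SMILES strings for one structure by randomizing the atom ordering.

from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, an arbitrary example

# Canonical SMILES: one deterministic string per molecule.
print(Chem.MolToSmiles(mol))

# Randomizing the atom ordering yields distinct but equivalent SMILES,
# so a single compound maps to many valid sequences.
variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(10)}
print(len(variants), variants)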
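For the BERT-style pretraining itself, a minimal masked-symbol sketch in PyTorch follows. The character-level vocabulary, model sizes, and roughly 15% masking rate are illustrative assumptions; the paper's actual tokenization and 2-encoder architecture are not reproduced here.

import torch
import torch.nn as nn

VOCAB = ["<pad>", "<mask>"] + list("CNOPSFIclBr()[]=#123456789")
stoi = {ch: i for i, ch in enumerate(VOCAB)}

def tokenize(smiles: str) -> torch.Tensor:
    # Character-level tokenization: an assumption made for this sketch.
    return torch.tensor([stoi[ch] for ch in smiles if ch in stoi])

class SmilesEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        return self.lm_head(self.encoder(self.embed(ids)))

model = SmilesEncoder(len(VOCAB))
ids = tokenize("CC(=O)Oc1ccccc1C(=O)O").unsqueeze(0)  # batch of one sequence
labels = ids.clone()

mask = torch.rand(ids.shape) < 0.15   # mask roughly 15% of the symbols
mask[0, 0] = True                     # ensure at least one masked position
ids[mask] = stoi["<mask>"]
labels[~mask] = -100                  # compute loss only at masked positions

logits = model(ids)
loss = nn.functional.cross_entropy(
    logits.view(-1, len(VOCAB)), labels.view(-1), ignore_index=-100
)
loss.backward()                       # one masked-recovery pretraining step

The model is trained to recover the original symbols at the masked positions, which is the standard BERT masked-language-modeling objective applied to SMILES strings.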

Updated: 2024-06-19