Positional embeddings and zero-shot learning using BERT for molecular-property prediction, Journal of Cheminformatics


Positional embeddings and zero-shot learning using BERT for molecular-property prediction
Journal of Cheminformatics (IF 7.1), Pub Date: 2025-02-05, DOI: 10.1186/s13321-025-00959-9
Medard Edmund Mswahili₁, JunHa Hwang₁, Jagath C Rajapakse₂, Kyuri Jo₁, Young-Seob Jeong₁


Recent advances in cheminformatics, such as representation learning for chemical structures, deep learning (DL) for property prediction, data-driven discovery, and optimized handling of chemical data, have increased the demand for processing simplified molecular input line entry system (SMILES) data, particularly in text analysis tasks. These advances have driven the need to optimize components such as positional encodings and positional embeddings (PEs) in transformer models to better capture the sequential and contextual information embedded in molecular representations. SMILES strings represent complex relationships among atoms or elements, which makes them critical for various learning tasks in cheminformatics. This study addresses the challenge of encoding these relationships by exploring various PEs within a transformer-based framework, with the aim of increasing the accuracy and generalization of molecular-property predictions.

The success of transformer-based models such as bidirectional encoder representations from transformers (BERT) in natural language processing has sparked growing interest in cheminformatics. However, the performance of these models during pretraining and fine-tuning is significantly influenced by positional information such as PEs, which helps the model capture the intricate relationships within sequences. Integrating positional information into transformer architectures has emerged as a promising approach: it provides essential supervision for modeling dependencies among elements at different positions in a sequence.

In this study, we first conduct pretraining experiments with various PEs to explore different ways of incorporating positional information into the BERT model for chemical text analysis using SMILES strings. Next, for each PE, we fine-tune the best-performing BERT (masked language modeling) model on downstream molecular-property prediction tasks. We use two molecular representations, SMILES and DeepSMILES, to comprehensively assess the potential and limitations of the PEs in a zero-shot learning analysis, demonstrating the model's proficiency in predicting properties of unseen molecular representations on both newly proposed and existing datasets.

Scientific contribution
This study investigates the largely unexplored potential of PEs in the BERT model for molecular-property prediction. The BERT model was pretrained and fine-tuned on various datasets related to COVID-19, bioassay data, and other molecular and biological properties, using SMILES and DeepSMILES representations. The study details the pretraining architecture, the fine-tuning datasets, and the performance of the BERT model with different PEs. It also covers a zero-shot learning analysis and the model's performance on various classification and regression tasks. Newly proposed datasets from different domains were introduced during fine-tuning alongside existing, commonly used datasets. The study highlights the robustness of the BERT model in predicting chemical properties and its potential applications in cheminformatics and bioinformatics.
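To make the idea of positional information concrete, the following is a minimal sketch of one widely known scheme, the fixed sinusoidal positional encoding from the original transformer architecture. The abstract does not say which PEs were compared, so this should not be read as the authors' method. The function name, the character-level tokenization, and the choice of `d_model=8` are all assumptions made for illustration.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position vectors.

    Even dimensions use sin(pos / 10000^(i / d_model)) and the following odd
    dimension uses the cosine of the same angle, where i is the even index.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


# Assumption: naive character-level tokenization of ethanol's SMILES string.
# Real SMILES tokenizers treat multi-character atoms such as "Cl" or "Br"
# as single tokens; this split is only for illustration.
tokens = list("CCO")
pe = sinusoidal_positional_encoding(len(tokens), d_model=8)

# Each token would receive its position vector in addition to its token
# embedding, so the two carbon atoms get different inputs despite sharing
# the same symbol.
for token, vector in zip(tokens, pe):
    print(token, [round(v, 3) for v in vector])
```

In a learned-PE variant, the table would instead be a trainable parameter matrix indexed by position; the sketch above only illustrates how position can be injected into otherwise order-agnostic token inputs.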


Updated: 2025-02-05
