Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model
Journal of Cheminformatics (IF 7.1), Pub Date: 2024-05-22, DOI: 10.1186/s13321-024-00852-x
Hengwei Chen, Jürgen Bajorath

Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the principal versatility of language models for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated the generative design of active compounds with desired potency from target sequence embeddings, representing a rather provocative prediction task. Accordingly, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. To this end, the designated "biochemical" language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from the training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds, as well as more potent compounds, than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications. The approach introduced herein combines protein language model and chemical language model components, representing an advanced architecture, and is the first methodology for predicting compounds with desired potency from conditioned protein sequence data.
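To make the described architecture concrete, the sketch below illustrates the general idea of a potency-conditioned "biochemical" language model in PyTorch: a protein-sequence encoder supplies a target embedding, a continuous potency value is embedded separately, and both condition a transformer decoder that generates compound (SMILES) tokens. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the toy ProteinEncoder stands in for a pretrained protein language model, and all vocabulary sizes, layer dimensions, and the choice to prepend the potency embedding to the encoder memory are assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of a dual-component conditional
# language model: protein embeddings + potency embedding condition a
# transformer decoder that predicts SMILES tokens.
import torch
import torch.nn as nn

AMINO_VOCAB = 26    # assumed amino-acid token vocabulary size
SMILES_VOCAB = 64   # assumed SMILES token vocabulary size
D_MODEL = 128

class ProteinEncoder(nn.Module):
    """Stand-in for a pretrained protein language model; here a small
    embedding + transformer encoder used only for illustration."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(AMINO_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, seq_tokens):                      # (batch, seq_len)
        return self.encoder(self.embed(seq_tokens))      # (batch, seq_len, D_MODEL)

class PotencyConditionedDecoder(nn.Module):
    """Transformer decoder attending to the protein embedding prepended
    with a potency-value embedding; predicts SMILES token logits."""
    def __init__(self):
        super().__init__()
        self.potency_proj = nn.Linear(1, D_MODEL)        # embed a pKi/pIC50 value
        self.smiles_embed = nn.Embedding(SMILES_VOCAB, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, SMILES_VOCAB)

    def forward(self, smiles_tokens, protein_memory, potency):
        # potency: (batch, 1) continuous desired potency value
        cond = self.potency_proj(potency).unsqueeze(1)         # (batch, 1, D_MODEL)
        memory = torch.cat([cond, protein_memory], dim=1)      # prepend condition
        tgt = self.smiles_embed(smiles_tokens)                 # (batch, tgt_len, D_MODEL)
        causal = nn.Transformer.generate_square_subsequent_mask(smiles_tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                               # token logits

# Toy forward pass with random data
protein = torch.randint(0, AMINO_VOCAB, (2, 50))   # two protein sequences
smiles = torch.randint(0, SMILES_VOCAB, (2, 30))   # teacher-forced SMILES tokens
potency = torch.tensor([[7.5], [9.0]])             # desired potency values

encoder, decoder = ProteinEncoder(), PotencyConditionedDecoder()
logits = decoder(smiles, encoder(protein), potency)
print(logits.shape)   # torch.Size([2, 30, 64])
```

In a setup along these lines, generation would start from a begin-of-sequence token and decode autoregressively, with the supplied potency value steering sampling toward more or less potent candidates, mirroring the potency-conditioning effect reported in the abstract.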
