Glypred: Lysine Glycation Site Prediction via CCU–LightGBM–BiLSTM Framework with Multi-Head Attention Mechanism, Journal of Chemical Information and Modeling


Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2024-08-09, DOI: 10.1021/acs.jcim.4c01034
Yun Zuo₁, Bangyi Zhang₁, Yinkang Dong₁, Wenying He₂, Yue Bi₃, Xiangrong Liu₄, Xiangxiang Zeng₅, Zhaohong Deng₁


Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer’s, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types (AAC, KMER, DR, PWAA, and EBGW) were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model’s accuracy and robustness.
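The cluster-based undersampling idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: it assumes the usual reading of cluster-centroid undersampling, in which the majority (nonglycated) class is replaced by as many k-means centroids as there are minority samples. The function names, the toy vectors, and the fixed seed are hypothetical. The imbalanced-learn library ships a `ClusterCentroids` sampler of the same kind, but whether Glypred relies on it is not stated here.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_centroids(points: Sequence[Sequence[float]], k: int,
                     iters: int = 50, seed: int = 0) -> List[Vector]:
    """Plain Lloyd's k-means; returns k centroids (illustrative, not optimized)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(points), k)]
    for _ in range(iters):
        clusters: List[List[Sequence[float]]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [_mean(members) if members else centroids[i]
                   for i, members in enumerate(clusters)]
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return centroids


def cluster_centroid_undersample(
    majority: Sequence[Sequence[float]],
    minority: Sequence[Sequence[float]],
    seed: int = 0,
) -> Tuple[List[Vector], List[Vector]]:
    """Shrink the majority class to len(minority) synthetic centroid samples."""
    k = len(minority)
    if k == 0 or k >= len(majority):
        # Nothing to balance: return both classes unchanged.
        return [list(p) for p in majority], [list(p) for p in minority]
    return kmeans_centroids(majority, k, seed=seed), [list(p) for p in minority]


# Hypothetical toy data: six negative samples, two positive samples.
negatives = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
             [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
positives = [[5.0, 5.0], [6.0, 5.0]]
balanced_neg, balanced_pos = cluster_centroid_undersample(negatives, positives)
```

Unlike random undersampling, which discards majority samples outright, this variant summarizes them, so the retained negatives are synthetic points rather than original training rows.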
For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields. A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
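The importance-based feature selection step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the importance scores are hard-coded placeholder values rather than output from a trained model, and `select_top_k` is a hypothetical helper, not part of Glypred or LightGBM. In practice such scores would typically be read from a fitted LightGBM model (its scikit-learn wrapper exposes `feature_importances_`).

```python
from typing import List, Sequence, Tuple


def select_top_k(
    rows: Sequence[Sequence[float]],
    importances: Sequence[float],
    k: int,
) -> Tuple[List[int], List[List[float]]]:
    """Keep the k most important feature columns, preserving column order."""
    if not 0 < k <= len(importances):
        raise ValueError("k must be between 1 and the number of features")
    # Rank column indices from most to least important.
    ranked = sorted(range(len(importances)),
                    key=lambda j: importances[j], reverse=True)
    keep = sorted(ranked[:k])  # restore the original left-to-right order
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced


# Placeholder importance scores for five hypothetical encoded features.
scores = [0.05, 0.40, 0.00, 0.30, 0.25]
samples = [[1.0, 2.0, 3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0, 9.0, 10.0]]
kept_columns, reduced_samples = select_top_k(samples, scores, k=3)
```

Returning the kept column indices alongside the reduced matrix makes it possible to apply the same selection to new sequences at prediction time.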


Updated: 2024-08-09
