Performance comparison of TCR-pMHC prediction tools reveals a strong data dependency
bioRxiv - Bioinformatics, Pub Date: 2022-11-24, DOI: 10.1101/2022.11.24.517666
Lihua Deng, Cedric Ly, Sina Abdollahi, Yu Zhao, Immo Prinz, Stefan Bonn

The interaction of T-cell receptors (TCRs) with peptide-major histocompatibility complex (pMHC) molecules plays a crucial role in adaptive immune responses. Various models currently aim to predict TCR-pMHC binding, but a standard dataset and procedure for comparing their performance are still missing. In this work we provide a general method for data collection, preprocessing, splitting and generation of negative examples, as well as comprehensive datasets to compare TCR-pMHC prediction models. We collected, harmonized, and merged all the major publicly available TCR-pMHC binding data and compared the performance of five state-of-the-art deep learning models (TITAN, NetTCR, ERGO, DLpTCR and ImRex) using this data. Our performance evaluation focuses on two scenarios: 1) different splitting methods for generating training and testing data, to assess model generalization, and 2) different data versions that vary in size and peptide imbalance, to assess model robustness. Our results indicate that the five contemporary models do not generalize to peptides that are absent from the training set. We also show that model performance depends strongly on data balance and size, which indicates relatively low model robustness. These results suggest that TCR-pMHC binding prediction remains highly challenging and requires further high-quality data and novel algorithmic approaches.
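
To make the two ideas of "splitting" and "generation of negative examples" concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function names `split_by_peptide` and `shuffled_negatives`, the 20% test fraction, and the toy sequences are all assumptions chosen for illustration. It shows one way to hold out whole peptides, so that every test peptide is unseen during training, and one common way to build negatives by re-pairing TCRs with peptides they are not known to bind.

```python
import random


def split_by_peptide(pairs, test_fraction=0.2, seed=0):
    """Hold out whole peptides: no test peptide appears in the training set.

    `pairs` is a list of (tcr, peptide) tuples. Names and defaults are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    peptides = sorted({pep for _, pep in pairs})
    rng.shuffle(peptides)
    n_test = max(1, int(len(peptides) * test_fraction))
    test_peptides = set(peptides[:n_test])
    train = [p for p in pairs if p[1] not in test_peptides]
    test = [p for p in pairs if p[1] in test_peptides]
    return train, test


def shuffled_negatives(positives, seed=0):
    """Create one presumed non-binding pair per positive by re-pairing.

    Each TCR is matched with a randomly chosen peptide that it is not
    recorded as binding. Such pairs are only assumed to be negative.
    """
    rng = random.Random(seed)
    known = set(positives)
    peptides = sorted({pep for _, pep in positives})
    negatives = []
    for tcr, _ in positives:
        candidates = [p for p in peptides if (tcr, p) not in known]
        if candidates:
            negatives.append((tcr, rng.choice(candidates)))
    return negatives


# Toy, made-up identifiers; real data would use CDR3 and peptide sequences.
positives = [
    ("TCR_A", "PEP_1"), ("TCR_B", "PEP_1"), ("TCR_C", "PEP_2"),
    ("TCR_D", "PEP_3"), ("TCR_E", "PEP_4"), ("TCR_F", "PEP_5"),
]
train, test = split_by_peptide(positives)
negatives = shuffled_negatives(positives)
```

A random split over pairs would let the same peptide appear on both sides, which is the easier setting; the peptide-held-out split above corresponds to the harder generalization scenario that the abstract reports the models failing.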

Updated: 2022-11-25