Finding BERT errors by clustering activation vectors, Future Generation Computer Systems



Future Generation Computer Systems (IF 6.2), Pub Date: 2024-11-16, DOI: 10.1016/j.future.2024.107601
William B. Andreopoulos, Dominic Lopez, Carlos Rojas, Vedashree P. Bhandare

The non-linear nature of deep neural networks makes it difficult to interpret the reasons behind their outputs, which reduces the verifiability of systems that rely on these models. Understanding the patterns between activation vectors and predictions could give insight into erroneous classifications and how to identify them. This paper presents a systematic approach to identifying the clusters with the most misclassifications or false label annotations. We extracted the activation vectors from a deep learning model, DNABERT, and visualized them using t-SNE to help explain the results the model produces. We applied K-means in a hierarchical fashion to the activation vectors for a set of training instances, then analyzed the cluster mean activation vectors to find patterns in the errors across K-means clusters. The cluster analysis revealed that predictions were uniform, or nearly 100 percent the same, within clusters of similar activation vectors. Two clusters in which most objects belong to the same true class tend to be closer together than clusters of opposite classes. The means of objects with the same true label are closer if two clusters have the same predicted label than if they have opposite predicted labels, showing that the activation vectors reflect both predicted and true classes. We performed a similar analysis for all 26 organisms in the dataset, showing that Euclidean distance can be used to identify clusters with many errors. We propose a heuristic that uses vector analysis between clusters to find the clusters with a high number of misclassifications or incorrect label annotations. This can aid in identifying misclassifications of DNA sequences or problems with sequence tagging.
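
The general idea in the abstract (cluster activation vectors, then compare cluster means and per-cluster error rates) can be illustrated with a small sketch. This is not the authors' heuristic, which the abstract does not spell out. It assumes a single flat K-means pass written in plain NumPy rather than the hierarchical K-means the paper describes, and it runs on synthetic vectors instead of DNABERT activations. The names `kmeans`, `cluster_report`, and all the data below are hypothetical.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Index of the closest centroid (Euclidean distance) for each row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means (a stand-in for a library implementation)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        # Recompute each mean; keep the old centroid if a cluster is empty.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return nearest_centroid(X, centroids), centroids


def cluster_report(acts, y_true, y_pred, k):
    """Per-cluster size and error rate, plus distances between cluster means."""
    labels, centroids = kmeans(acts, k)
    report = []
    for j in range(k):
        mask = labels == j
        if not mask.any():
            continue
        report.append({
            "cluster": j,
            "size": int(mask.sum()),
            # Fraction of members whose prediction disagrees with the label.
            "error_rate": float((y_true[mask] != y_pred[mask]).mean()),
            "majority_pred": int(np.bincount(y_pred[mask]).argmax()),
        })
    # Pairwise Euclidean distance between cluster mean vectors.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    return report, dist


# Synthetic stand-in for activation vectors: two well-separated blobs.
rng = np.random.default_rng(42)
n = 200
acts = np.vstack([rng.normal(0.0, 1.0, (n, 8)), rng.normal(6.0, 1.0, (n, 8))])
y_pred = np.repeat([0, 1], n)   # hypothetical model predictions
y_true = y_pred.copy()
y_true[:20] = 1                 # inject 20 label/prediction disagreements

report, dist = cluster_report(acts, y_true, y_pred, k=2)
for row in sorted(report, key=lambda r: r["error_rate"], reverse=True):
    print(row)
print("distance between cluster means:", dist[0, 1])
```

Sorting clusters by error rate is only one possible way to surface candidates for review. On real data the activation vectors would come from the model under study, and the number of clusters would need to be chosen for that data.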


Updated: 2024-11-16
