TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
IEEE Transactions on Software Engineering (IF 6.5). Pub Date: 2024-04-25. DOI: 10.1109/tse.2024.3393419
Zixiang Xian, Rubing Huang, Dave Towey, Chunrong Fang, Zhenyu Chen

Artificial intelligence (AI) has revolutionized software engineering (SE) by enhancing software development efficiency. The advent of pre-trained models (PTMs) leveraging transfer learning has significantly advanced AI for SE. However, existing PTMs that operate on individual code tokens suffer from several limitations: they are costly to train and fine-tune, and they rely heavily on labeled data when fine-tuning on task-specific datasets. In this paper, we present TransformCode, a novel framework that learns code embeddings in a contrastive-learning manner. Our framework is encoder-agnostic and language-agnostic: it can leverage any encoder model and handle any programming language. We also propose a novel data-augmentation technique called abstract syntax tree (AST) transformation, which applies syntactic and semantic transformations to the original code snippets to generate more diverse and robust samples for contrastive learning. Our framework has several advantages over existing methods: (1) it is flexible and adaptable, because it can easily be extended to other downstream tasks that require code representation (such as code-clone detection and classification); (2) it is efficient and scalable, because it does not require a large model or a large amount of training data, and it can support any programming language; (3) it is not limited to unsupervised learning, but can also be applied to some supervised-learning tasks by incorporating task-specific labels or objectives; and (4) it can adjust the number of encoder parameters based on the available computing resources. We evaluate our framework on several code-related tasks, and demonstrate its effectiveness and superiority over state-of-the-art methods such as SourcererCC, Code2vec, and InferCode.
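This page reproduces only the abstract, so no implementation details are available here. As a purely illustrative sketch of the kind of semantics-preserving AST transformation the abstract describes, the snippet below consistently renames local variables using Python's built-in ast module; the names (RenameVariables, augment) are hypothetical and not from the paper, which targets arbitrary programming languages rather than Python specifically.

import ast


class RenameVariables(ast.NodeTransformer):
    """Naive sketch: consistently rename every name (e.g. total -> v2).

    Renaming preserves program semantics, so the transformed snippet can
    serve as a positive pair for the original in contrastive learning.
    A real implementation would restrict renaming to locally bound names
    (leaving builtins, imports, and globals untouched).
    """

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node


def augment(source: str) -> str:
    """Return a semantically equivalent variant of `source`."""
    tree = ast.parse(source)
    tree = RenameVariables().visit(tree)
    return ast.unparse(tree)  # requires Python 3.9+


original = "def add(a, b):\n    total = a + b\n    return total"
print(augment(original))
# -> def add(v0, v1):
#        v2 = v0 + v1
#        return v2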
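Similarly hedged, the following shows one common contrastive objective, the SimCLR-style NT-Xent loss, that an encoder-agnostic framework of this kind could be trained with: a snippet and its AST-transformed variant form a positive pair, and the other in-batch samples act as negatives. The paper's actual loss, temperature, and encoder may differ; this is a generic formulation, not code from the authors.

import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss over a batch of positive pairs.

    z1[i] and z2[i] are embeddings of a code snippet and its transformed
    variant; every other row in the batch serves as a negative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # mask self-similarity
    # The positive of row i is row i+N, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)


# Toy usage: random tensors stand in for any encoder's output embeddings.
z_orig = torch.randn(8, 128)
z_aug = torch.randn(8, 128)
print(nt_xent_loss(z_orig, z_aug).item())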
