TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
IEEE Transactions on Software Engineering (IF 6.5). Pub Date: 2024-04-25. DOI: 10.1109/tse.2024.3393419
Zixiang Xian, Rubing Huang, Dave Towey, Chunrong Fang, Zhenyu Chen

Artificial intelligence (AI) has revolutionized software engineering (SE) by enhancing software development efficiency. The advent of pre-trained models (PTMs) leveraging transfer learning has significantly advanced AI for SE. However, existing PTMs that operate on individual code tokens suffer from several limitations: they are costly to train and fine-tune, and they rely heavily on labeled data when fine-tuning on task-specific datasets. In this paper, we present TransformCode, a novel framework that learns code embeddings in a contrastive-learning manner. Our framework is encoder-agnostic and language-agnostic: it can leverage any encoder model and handle any programming language. We also propose a novel data-augmentation technique called abstract syntax tree (AST) transformation, which applies syntactic and semantic transformations to the original code snippets to generate more diverse and robust samples for contrastive learning. Our framework has several advantages over existing methods: (1) it is flexible and adaptable, because it can easily be extended to other downstream tasks that require code representation (such as code-clone detection and classification); (2) it is efficient and scalable, because it does not require a large model or a large amount of training data, and it can support any programming language; (3) it is not limited to unsupervised learning, but can also be applied to some supervised-learning tasks by incorporating task-specific labels or objectives; and (4) it can adjust the number of encoder parameters based on the available computing resources. We evaluate our framework on several code-related tasks, and demonstrate its effectiveness and superiority over state-of-the-art methods such as SourcererCC, Code2vec, and InferCode.
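This page reproduces only the abstract, so no implementation details are available here. As a purely illustrative sketch of the kind of semantics-preserving AST transformation the abstract describes, the snippet below consistently renames local variables using Python's built-in ast module; the names (RenameVariables, augment) are hypothetical and not from the paper, which targets arbitrary programming languages rather than Python specifically.

import ast


class RenameVariables(ast.NodeTransformer):
    """Naive sketch: consistently rename every name (e.g. total -> v2).

    Renaming preserves program semantics, so the transformed snippet can
    serve as a positive pair for the original in contrastive learning.
    A real implementation would restrict renaming to locally bound names
    (leaving builtins, imports, and globals untouched).
    """

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node


def augment(source: str) -> str:
    """Return a semantically equivalent variant of `source`."""
    tree = ast.parse(source)
    tree = RenameVariables().visit(tree)
    return ast.unparse(tree)  # requires Python 3.9+


original = "def add(a, b):\n    total = a + b\n    return total"
print(augment(original))
# -> def add(v0, v1):
#        v2 = v0 + v1
#        return v2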
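Similarly hedged, the following shows one common contrastive objective, the SimCLR-style NT-Xent loss, that an encoder-agnostic framework of this kind could be trained with: a snippet and its AST-transformed variant form a positive pair, and the other in-batch samples act as negatives. The paper's actual loss, temperature, and encoder may differ; this is a generic formulation, not code from the authors.

import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss over a batch of positive pairs.

    z1[i] and z2[i] are embeddings of a code snippet and its transformed
    variant; every other row in the batch serves as a negative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # mask self-similarity
    # The positive of row i is row i+N, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)


# Toy usage: random tensors stand in for any encoder's output embeddings.
z_orig = torch.randn(8, 128)
z_aug = torch.randn(8, 128)
print(nt_xent_loss(z_orig, z_aug).item())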
