VarGAN: Adversarial Learning of Variable Semantic Representations, IEEE Transactions on Software Engineering

IEEE Transactions on Software Engineering (IF 6.5), Pub Date: 2024-04-25, DOI: 10.1109/tse.2024.3391730
Yalan Lin, Chengcheng Wan, Shuwen Bai, Xiaodong Gu

Variable names are critically important in code representation learning. However, because naming conventions vary widely, variables often receive arbitrary names, which leads to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer addresses the surface-level recognition of low-frequency tokens, it does not address the fact that code representation models train low-frequency identifiers inadequately, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for learning variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator that produces vectors from source code, and we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables so that they overlap with their high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors that are more uniformly distributed across low- and high-frequency identifiers. On the IdBench benchmark, VarGAN improves similarity and relatedness scores by 8% over VarCLR. VarGAN is also validated on downstream tasks, where it shows an enhanced ability to capture token- and code-level semantics.
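The adversarial setup described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the regex tokenizer, the `min_count` threshold, and the negated-loss generator objective (a common minimax formulation) are all assumptions made for illustration. The sketch shows how rare/common labels for a discriminator might be derived from identifier frequencies, and how the two opposing losses relate for one discriminator prediction.

```python
import math
import re
from collections import Counter

# Crude identifier matcher (assumption): a real pipeline would use a parser
# to isolate variable names rather than a regular expression.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def identifier_counts(snippets):
    """Count how often each identifier-like token occurs in the corpus."""
    counts = Counter()
    for code in snippets:
        counts.update(IDENT.findall(code))
    return counts

def rare_label(code, counts, min_count=2):
    """Return 1 if the snippet contains an identifier seen fewer than
    min_count times in the corpus, else 0. min_count is hypothetical."""
    return int(any(counts[tok] < min_count for tok in IDENT.findall(code)))

def bce(p, y):
    """Binary cross-entropy of predicted probability p against label y."""
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def adversarial_losses(p_rare, y, weight=1.0):
    """Discriminator loss and an assumed generator-side adversarial term.

    The discriminator minimises its classification loss; the generator
    (the code representation model) is pushed in the opposite direction,
    so that rare and common inputs become hard to tell apart.
    """
    d_loss = bce(p_rare, y)
    g_loss = -weight * d_loss
    return d_loss, g_loss

corpus = [
    "total = count + value",
    "count = count + value",
    "total = total * value",
    "xq7 = total + count",
]
counts = identifier_counts(corpus)
labels = [rare_label(code, counts) for code in corpus]
print(labels)  # [0, 0, 0, 1]
```

In this toy corpus only `xq7` occurs once, so only the last snippet is labelled as containing a low-frequency variable. In a full system the probability `p_rare` would come from a discriminator network applied to the vectors produced by the code representation model, and the generator term would be added to that model's usual training loss.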


Updated: 2024-08-19
