VarGAN: Adversarial Learning of Variable Semantic Representations
IEEE Transactions on Software Engineering (IF 6.5), Pub Date: 2024-04-25, DOI: 10.1109/tse.2024.3391730
Yalan Lin 1 , Chengcheng Wan 2 , Shuwen Bai 3 , Xiaodong Gu 1

Variable names are of critical importance in code representation learning. However, because naming conventions vary widely, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer addresses the surface-level recognition of low-frequency tokens, it does not address the inadequate training that code representation models give to low-frequency identifiers, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for learning variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code, and we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors that are more uniformly distributed across both low- and high-frequency identifiers. On the IdBench benchmark, VarGAN improves similarity and relatedness scores by 8% over VarCLR. VarGAN is also validated on downstream tasks, where it shows an enhanced ability to capture token- and code-level semantics.
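
The discriminator described above needs a binary target: does a given code input contain a low-frequency variable? The abstract does not say how that label is derived, so the following is only a minimal sketch under stated assumptions. The regex tokenizer, the small keyword list, the frequency threshold of 2, and all function names are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Assumption: a crude regex tokenizer stands in for a real code tokenizer.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Assumption: a tiny keyword list, enough for the toy corpus below.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while", "import", "from"}


def identifiers(snippet):
    """Return identifier tokens in a snippet, skipping the keywords above."""
    return [tok for tok in IDENT.findall(snippet) if tok not in KEYWORDS]


def build_frequencies(corpus):
    """Count how often each identifier occurs across the whole corpus."""
    counts = Counter()
    for snippet in corpus:
        counts.update(identifiers(snippet))
    return counts


def contains_rare_variable(snippet, counts, threshold=2):
    """Return 1 if any identifier occurs fewer than `threshold` times, else 0.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    return int(any(counts[tok] < threshold for tok in identifiers(snippet)))


corpus = [
    "for i in items: total = total + i",
    "for i in items: count = count + 1",
    "total = count + i",
    "qzx_buf = items",
]
counts = build_frequencies(corpus)
labels = [contains_rare_variable(s, counts) for s in corpus]
print(labels)
```

In an adversarial setup of the kind the abstract describes, labels like these would serve as the discriminator's targets, while the encoder is trained so that its vectors do not reveal them. That training loop is outside the scope of this sketch.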

Updated: 2024-08-19