No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT, IEEE Transactions on Software Engineering



IEEE Transactions on Software Engineering (IF 6.5), Pub Date: 2024-04-23, DOI: 10.1109/tse.2024.3392499
Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, Liang Feng Zhang


Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks, such as machine translation, question answering, and summarization. LLMs are also highly valuable in supporting software engineering tasks, particularly code generation. Automatic code generation is the process of automatically producing source code or executable code from given specifications or requirements, improving developer productivity. In this study, we perform a systematic empirical assessment of the quality of code generation using ChatGPT, a recent state-of-the-art product LLM. We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate ChatGPT's ability to engage in a multi-round fixing process (i.e., ChatGPT's dialog ability: chatting between users and ChatGPT to fix generated buggy code) to facilitate code generation. By delving into the generated code and examining the experimental results, this work provides valuable insights into the performance of ChatGPT in tackling code generation tasks over the three critical aspects. The experimental results demonstrate that (1) ChatGPT is better at generating functionally correct code for problems before 2021 in different languages than for problems after 2021, with a 48.14% advantage in Accepted rate on the judgment platform, but ChatGPT's ability to directly fix erroneous code through the multi-round fixing process to achieve correct functionality is relatively weak; (2) the distribution of cyclomatic and cognitive complexity levels for code snippets varies across languages, and the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets; (3) in algorithm scenarios with the languages C, C++, and Java, and in CWE scenarios with the languages C and Python3, the code generated by ChatGPT has relevant vulnerabilities. However, the multi-round fixing process for vulnerable code snippets demonstrates promising results, with more than 89% of vulnerabilities successfully addressed; and (4) code generation may be affected by ChatGPT's non-determinism, resulting in variations of code snippets in functional correctness, complexity, and security. Overall, our findings uncover potential issues and limitations that arise in ChatGPT-based code generation and lay the groundwork for improving AI and LLM-based code generation techniques.
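To make the complexity aspect concrete, the sketch below approximates cyclomatic complexity for a Python snippet by counting decision points in its syntax tree. It is only an illustration of the metric named in the abstract: the function name `cyclomatic_complexity`, the choice of counted node types, and the sample snippet are all assumptions made here, not the tooling or data used in the paper.

```python
import ast

# Node types treated as one decision point each (a simplification chosen
# for this sketch; real tools may count additional constructs).
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.AsyncFor,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
    ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' contributes two extra short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical generated snippet, used only to exercise the function.
SNIPPET = """
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
"""

if __name__ == "__main__":
    print(cyclomatic_complexity(SNIPPET))  # 5
```

Running the same function on a snippet before and after a fixing round would show whether the round preserved or increased this score, which is the kind of comparison finding (2) describes.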


Updated: 2024-04-23
