LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation
IEEE Transactions on Software Engineering (IF 6.5), Pub Date: 2024-07-22, DOI: 10.1109/tse.2024.3428972
Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Shuvendu K. Lahiri

Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it is difficult to check whether the generated code correctly satisfies the user's intent. In this paper, we propose TiCoder, a novel interactive workflow for guided intent clarification (i.e., partial formalization) through tests, to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the workflow's effectiveness at improving code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy across both datasets and all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
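
To make the described workflow concrete, below is a minimal sketch of a TiCoder-style interaction loop: the LLM proposes tests that discriminate between candidate code suggestions, the user accepts or rejects each test as a statement of intent, and the candidate set is pruned accordingly. All names and signatures here (propose_test, user_approves, passes) are illustrative assumptions, not the paper's actual API.

```python
from typing import Callable, List, Optional, Tuple

def ticoder_loop(
    candidates: List[str],                               # candidate code suggestions from the LLM
    propose_test: Callable[[List[str]], Optional[str]],  # LLM: propose a test discriminating the candidates
    user_approves: Callable[[str], bool],                # user: does this test reflect my intent?
    passes: Callable[[str, str], bool],                  # runner: does this candidate pass this test?
    max_interactions: int = 5,
) -> Tuple[List[str], List[str]]:
    """Prune candidate code suggestions via user-approved tests.

    Each approved test both clarifies the user's intent (partial
    formalization) and becomes an accompanying unit test for the
    final suggestion.
    """
    approved_tests: List[str] = []
    for _ in range(max_interactions):
        if len(candidates) <= 1:
            break  # intent resolved down to a single suggestion
        test = propose_test(candidates)
        if test is None:
            break  # no further discriminating test can be generated
        if user_approves(test):
            approved_tests.append(test)
            # Keep only candidates consistent with the approved behavior.
            candidates = [c for c in candidates if passes(c, test)]
        else:
            # Discard candidates that exhibit the rejected behavior.
            candidates = [c for c in candidates if not passes(c, test)]
    return candidates, approved_tests
```

In the scaled evaluation reported above, the user's accept/reject answers come from an idealized proxy rather than a human, which is how the pass@1 improvement within 5 interactions is measured.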

Last updated: 2024-07-22