Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction
IEEE Transactions on Software Engineering (IF 6.5), Pub Date: 2024-09-04, DOI: 10.1109/tse.2024.3450837
Sungmin Kang, Juyeon Yoon, Nargiz Askarbekkyzy, Shin Yoo

Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often written in natural language and thus can be difficult to transform into test cases consistently. As a result, existing techniques have mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests and applying a post-processing pipeline that automatically identifies promising generated tests, our proposed technique, Libro, successfully reproduces about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation of 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential: the StarCoder LLM achieves 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of its performance on a held-out bug dataset likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using Libro improves as LLM size increases, indicating which LLMs can be used with the Libro pipeline.
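The pipeline the abstract describes — sample candidate tests from an LLM, then post-process to surface likely bug-reproducing ones — can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `select_and_rank` function, the stubbed `failure_of` runner, and the candidate test strings are all hypothetical, and Libro's actual selection additionally compares failure output against the bug report text.

```python
from collections import Counter

def select_and_rank(candidates, run_test):
    """Libro-style post-processing (sketch): keep candidate tests that fail
    on the buggy program version, then rank them so that tests whose failure
    output is shared by many candidates come first -- agreement among
    independently sampled tests suggests they capture the reported bug."""
    executed = [(t, run_test(t)) for t in candidates]
    failing = [(t, out) for t, out in executed if out is not None]  # None => test passed
    agreement = Counter(out for _, out in failing)
    failing.sort(key=lambda pair: -agreement[pair[1]])  # stable: ties keep sample order
    return [t for t, _ in failing]

# Hypothetical candidates "sampled from an LLM", plus a stubbed test runner
# mapping each test to its failure output on the buggy version (None = passes).
candidates = [
    "assertEquals(2, add(1, 1));",           # reproduces the bug
    "assertTrue(isValid(input));",           # passes: not bug-reproducing
    "assertEquals(2, add(1, 1)); // retry",  # same failure as the first
]
failure_of = {
    candidates[0]: "AssertionError: expected 2 but was 3",
    candidates[1]: None,
    candidates[2]: "AssertionError: expected 2 but was 3",
}
ranked = select_and_rank(candidates, failure_of.get)
# ranked keeps only the two failing tests, most-agreed failure first
```

In a real pipeline the stubbed dictionary would be replaced by actually compiling and running each generated test against the buggy project checkout, and the top-ranked tests would be shown to the developer.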

Updated: 2024-09-04