Systematic Review of Generative Modelling Tools and Utility Metrics for Fully Synthetic Tabular Data
ACM Computing Surveys (IF 23.8), Pub Date: 2024-11-14, DOI: 10.1145/3704437
Anton Danholt Lautrup, Tobias Hyrup, Arthur Zimek, Peter Schneider-Kamp
Sharing data with third parties is essential for advancing science, but it is becoming more and more difficult with the rise of data protection regulations, ethical restrictions, and growing fear of misuse. Fully synthetic data, which transcends anonymisation, may be the key to unlocking valuable untapped insights stored away in secured data vaults. This review examines current synthetic data generation methods and their utility measurement. We found that more traditional generative models such as Classification and Regression Tree models alongside Bayesian Networks remain highly relevant and are still capable of surpassing deep learning alternatives like Generative Adversarial Networks. However, our findings also display the same lack of agreement on metrics for evaluation, uncovered in earlier reviews, posing a persistent obstacle to advancing the field. We propose a tool for evaluating the utility of synthetic data and illustrate how it can be applied to three synthetic data generation models. By streamlining evaluation and promoting agreement on metrics, researchers can explore novel methods and generate compelling results that will convince data curators and lawmakers to embrace synthetic data. Our review emphasises the potential of synthetic data and highlights the need for greater collaboration and standardisation to unlock its full potential.
Updated: 2024-11-14