Data Ingestion Validation Through Stable Conditional Metrics with Ranking and Filtering
Information Systems Frontiers (IF 6.9), Pub Date: 2024-07-05, DOI: 10.1007/s10796-024-10504-y
Niels Bylois, Frank Neven, Stijn Vansummeren

We introduce an advanced method for validating data quality, which is crucial for ensuring reliable analytics insights. Traditional data quality validation relies on data unit tests, which use global metrics to determine if data quality falls within expected ranges. Unfortunately, these existing approaches suffer from two limitations. Firstly, they offer only coarse-grained assessments, missing fine-grained errors. Secondly, they fail to pinpoint the specific data causing test failures. To address these issues, we propose a novel approach using conditional metrics, enabling more detailed analysis than global metrics. Our method involves two stages: unit test discovery and monitoring/error identification. In the discovery phase, we derive conditional metric-based unit tests from historical data, focusing on stability to select appropriate metrics. The monitoring phase involves using these tests for new data batches, with conditional metrics helping us identify potential errors. We validate the effectiveness of this approach using two datasets and seven synthetic error scenarios, showing significant improvements over global metrics and promising results in fine-grained error detection for data ingestion validation.
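
To make the contrast between global and conditional metrics concrete, the following is a minimal sketch and not the authors' implementation. It assumes completeness (the fraction of non-null values) as the metric, a single equality condition on one column, a standard-deviation cutoff as the stability criterion, and a fixed tolerance around the historical mean as the expected range. All function names, parameters (`max_std`, `tol`), and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def completeness(rows, column):
    """Fraction of rows whose `column` is not None (1.0 for an empty input)."""
    if not rows:
        return 1.0
    return sum(r[column] is not None for r in rows) / len(rows)


def conditional_completeness(rows, column, cond_col, cond_val):
    """Completeness of `column` restricted to rows where cond_col == cond_val."""
    subset = [r for r in rows if r[cond_col] == cond_val]
    return completeness(subset, column)


def discover_tests(history, column, cond_col, max_std=0.05, tol=0.1):
    """Discovery phase (assumed form): keep one range test per condition value
    whose metric varied by at most `max_std` across the historical batches."""
    cond_values = sorted({r[cond_col] for batch in history for r in batch})
    tests = {}
    for value in cond_values:
        series = [conditional_completeness(b, column, cond_col, value)
                  for b in history]
        if pstdev(series) <= max_std:  # stable enough to serve as a test
            centre = mean(series)
            tests[value] = (centre - tol, centre + tol)
    return tests


def monitor(batch, tests, column, cond_col):
    """Monitoring phase: return (condition value, observed metric) for every
    test whose observed value falls outside its expected range."""
    failures = []
    for value, (low, high) in tests.items():
        observed = conditional_completeness(batch, column, cond_col, value)
        if not low <= observed <= high:
            failures.append((value, observed))
    return failures


def make_batch(be_present, nl_present, size=4):
    """Toy batch: `size` rows per country, the first n of them with an email."""
    rows = []
    for country, present in (("BE", be_present), ("NL", nl_present)):
        for i in range(size):
            email = "x@example.org" if i < present else None
            rows.append({"country": country, "email": email})
    return rows


# Historical batches: BE emails always present, NL emails half present.
history = [make_batch(4, 2) for _ in range(3)]
# New batch: the pattern is swapped, so the global completeness is unchanged.
new_batch = make_batch(2, 4)

tests = discover_tests(history, "email", "country")
failures = monitor(new_batch, tests, "email", "country")
```

In this toy data the global completeness of `email` is 0.75 in every historical batch and in the new batch, so a global range test would pass. The conditional tests fail for both `BE` and `NL`, and the failing condition value indicates which rows to inspect.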




Updated: 2024-07-05