Data repair of density-based data cleaning approach using conditional functional dependencies
Data Technologies and Applications (IF 1.7), Pub Date: 2021-11-19, DOI: 10.1108/dta-05-2021-0108
Samir Al-Janabi, Ryszard Janicki

Purpose

Data quality is a major challenge in data management. For organizations, dirty data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules. Because of the huge volume of data, manual cleaning alone is infeasible, so methods are required that detect errors automatically and then repair the dirty data. The purpose of this work is to extend the density-based data cleaning approach with conditional functional dependencies to achieve better data repair.

Design/methodology/approach

A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
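As an illustration of what such an input rule expresses, the sketch below detects violations of a single conditional functional dependency (CFD), that is, a functional dependency paired with a pattern of constants and wildcards that restricts the tuples it applies to. It is not the density-based algorithm of the paper, which the abstract does not describe; the function names, the `_` wildcard, and the sample attributes (`country`, `zip`, `city`) are all assumptions made for the example.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed marker for "any value" in a pattern


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows that violate the CFD (lhs -> rhs, pattern).

    rows    : list of dicts, one per tuple
    lhs     : list of left-hand-side attribute names
    rhs     : single right-hand-side attribute name
    pattern : dict giving a constant or WILDCARD for each attribute in lhs + [rhs]
    """
    lhs_pattern = {a: pattern[a] for a in lhs}
    violations = set()
    groups = defaultdict(list)

    for i, row in enumerate(rows):
        if not matches(row, lhs_pattern):
            continue  # the CFD does not apply to this tuple
        if pattern[rhs] != WILDCARD:
            # Constant CFD: a single tuple can violate the rule on its own.
            if row[rhs] != pattern[rhs]:
                violations.add(i)
        else:
            # Variable CFD: collect tuples that agree on the left-hand side.
            groups[tuple(row[a] for a in lhs)].append(i)

    # Tuples that agree on lhs but disagree on rhs are jointly inconsistent.
    for idxs in groups.values():
        if len({rows[i][rhs] for i in idxs}) > 1:
            violations.update(idxs)

    return sorted(violations)


rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "UK", "zip": "W1", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Austin"},
]

# Within the UK, zip determines city.
variable_cfd = {"country": "UK", "zip": WILDCARD, "city": WILDCARD}
print(cfd_violations(rows, ["country", "zip"], "city", variable_cfd))

# UK tuples with zip EH4 must have city Edinburgh.
constant_cfd = {"country": "UK", "zip": "EH4", "city": "Edinburgh"}
print(cfd_violations(rows, ["country", "zip"], "city", constant_cfd))
```

Detection only flags the inconsistent tuples; choosing which value to change, which is the repair step the paper addresses, is a separate decision that this sketch does not attempt.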

Findings

This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
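The F-measure is the harmonic mean of precision and recall. The abstract does not state how the two are computed for repairs, so the sketch below assumes a common cell-level convention: precision is the fraction of changed cells that now match the ground truth, and recall is the fraction of erroneous cells that were correctly fixed. The function name and the `{(row, attribute): value}` cell representation are hypothetical.

```python
def repair_f_measure(dirty, repaired, clean):
    """Cell-level precision, recall and F-measure of a repair.

    Each argument maps (row_id, attribute) -> value over the same set of cells.
    Assumed convention, not taken from the paper.
    """
    errors = {c for c in clean if dirty[c] != clean[c]}         # cells that were wrong
    changed = {c for c in clean if repaired[c] != dirty[c]}     # cells the repair touched
    correct = {c for c in changed if repaired[c] == clean[c]}   # touched and now right

    precision = len(correct) / len(changed) if changed else 0.0
    recall = len(correct) / len(errors) if errors else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


clean = {(0, "city"): "Edinburgh", (1, "city"): "Edinburgh", (2, "city"): "London"}
dirty = {(0, "city"): "Edinburgh", (1, "city"): "London", (2, "city"): "Lodnon"}
repaired = {(0, "city"): "Edinburgh", (1, "city"): "Edinburgh", (2, "city"): "Leeds"}

# Two cells are erroneous, two were changed, one change is correct.
print(repair_f_measure(dirty, repaired, clean))
```

Under this convention a repair that changes a wrong value to another wrong value lowers precision without improving recall, so both over-eager and overly cautious repairs are penalized.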

Originality/value

Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.




Updated: 2021-11-19