


Data repair of density-based data cleaning approach using conditional functional dependencies
Data Technologies and Applications (IF 1.7), Pub Date: 2021-11-19, DOI: 10.1108/dta-05-2021-0108
Samir Al-Janabi, Ryszard Janicki

Purpose

Data quality is a major challenge in data management. For organizations, data cleanliness is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules. Because of the huge volume of data, manual cleaning alone is infeasible, so methods are required that detect dirty data automatically and then repair it. The purpose of this work is to extend the density-based data cleaning approach with conditional functional dependencies to achieve better data repair.

Design/methodology/approach

A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
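
To make the role of the input rules concrete, here is a minimal sketch of how a single conditional functional dependency could be represented and checked against a table. It is not the authors' algorithm and does not include the density-based repair step; it only illustrates violation detection. The function name `find_violations`, the wildcard symbol `_`, and the example columns (`country`, `zip`, `city`) are all assumptions made for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed symbol for "any value" in a CFD pattern


def matches(value, pattern):
    """A cell matches a pattern entry if the entry is a wildcard or equal."""
    return pattern == WILDCARD or value == pattern


def find_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the CFD (lhs -> rhs, pattern).

    Hypothetical helper: rows is a list of dicts, lhs a list of column
    names, rhs a single column name, pattern a dict over lhs + [rhs].
    """
    violations = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The rule only applies to rows whose LHS matches the pattern.
        if not all(matches(row[a], pattern[a]) for a in lhs):
            continue
        if pattern[rhs] != WILDCARD:
            # Constant RHS: a single row can violate the rule by itself.
            if row[rhs] != pattern[rhs]:
                violations.add(i)
        else:
            # Variable RHS: rows agreeing on LHS must agree on RHS.
            groups[tuple(row[a] for a in lhs)].append(i)
    for members in groups.values():
        if len({rows[i][rhs] for i in members}) > 1:
            violations.update(members)
    return sorted(violations)


rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Springfield"},
    {"country": "UK", "zip": "W1", "city": "London"},
]
# Assumed rule: within the UK, zip determines city.
cfd_pattern = {"country": "UK", "zip": "_", "city": "_"}
violating = find_violations(rows, ["country", "zip"], "city", cfd_pattern)
```

In this toy table the first two rows conflict with each other, while the US row is outside the rule's condition and is never flagged. A repair algorithm would then have to decide which of the conflicting values to change; that decision is where the approach described in the paper comes in.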

Findings

This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
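
As an illustration of the evaluation metric, the sketch below computes an F-measure for a set of repairs. The abstract does not state how precision and recall were defined, so the definitions here are assumptions: precision is the fraction of applied repairs that match the ground truth, and recall is the fraction of known errors that were correctly repaired. The function name `repair_f_measure` and the cell-keyed dictionaries are hypothetical.

```python
def repair_f_measure(repairs, truth):
    """F-measure of a repair under the assumed definitions above.

    repairs: dict mapping (row_id, column) -> value written by the repair.
    truth:   dict mapping (row_id, column) -> correct value, one entry
             per cell that was actually erroneous.
    """
    correct = sum(1 for cell, value in repairs.items() if truth.get(cell) == value)
    precision = correct / len(repairs) if repairs else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up example: two real errors, two repairs, one of them correct.
truth = {(1, "city"): "Edinburgh", (5, "zip"): "W1"}
repairs = {(1, "city"): "Edinburgh", (7, "city"): "Leeds"}
score = repair_f_measure(repairs, truth)
```

With one correct repair out of two applied and two needed, precision and recall are both 0.5 under these definitions, so the F-measure is also 0.5.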

Originality/value

Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.
