An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge, BMC Medical Informatics and Decision Making



An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge
BMC Medical Informatics and Decision Making ( IF 3.3 ) Pub Date : 2021-09-17 , DOI: 10.1186/s12911-021-01630-7
Xi Shi ₁ , Charlotte Prins ₂ , Gijs Van Pottelbergh ₃ , Pavlos Mamouris ₃ , Bert Vaes ₃ , Bart De Moor ₁


The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating data cleaning can save time. However, automated data cleaning tools built for other domains often process all variables uniformly, so they serve clinical data poorly, where variable-specific information must be taken into account. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into consideration. We used EHR data collected from primary care in Flanders, Belgium, during 1994–2015. We constructed a Clinical Knowledge Database to store all the variable-specific information necessary for data cleaning. We applied fuzzy search to automatically detect and replace misspelled units, and performed unit conversion following variable-specific conversion formulas. The numeric values were then corrected and outliers were detected in light of the clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of non-missing values (completeness) and the percentage of values within the normal range (correctness) before and after cleaning were compared. All variables were 100% complete before data cleaning. After cleaning, completeness dropped by less than 1% for 42 variables and by 1–10% for 9 variables; only 1 variable experienced a large decline in completeness (13.36%). All variables had more than 50% of values within the normal range after cleaning, and 43 of them had more than 70%. We propose a general method for clinical variables that achieves a high degree of automation and is capable of handling large-scale data. The method substantially improved the efficiency of data cleaning and removed the technical barriers for non-technical users.
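The pipeline described in the abstract (fuzzy matching of unit spellings, variable-specific unit conversion, range-based outlier removal, and a completeness measure) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `KNOWLEDGE` dictionary, the function names, the plausible range, and the use of `difflib.get_close_matches` as the fuzzy search are all assumptions made for illustration, and the glucose factor of 18 is only the usual approximation.

```python
from difflib import get_close_matches

# Hypothetical stand-in for one Clinical Knowledge Database entry.
# Units, conversion factors and the plausible range are illustrative
# assumptions, not values taken from the paper.
KNOWLEDGE = {
    "glucose": {
        "standard_unit": "mg/dL",
        # Multiply by this factor to reach the standard unit (18 is approximate).
        "to_standard": {"mg/dL": 1.0, "mmol/L": 18.0},
        "plausible_range": (10.0, 1000.0),
    }
}


def normalize_unit(raw_unit, known_units, cutoff=0.6):
    """Return the closest known unit spelling, or None if nothing is close enough."""
    by_lower = {u.lower(): u for u in known_units}
    hits = get_close_matches(raw_unit.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


def clean_value(variable, value, raw_unit):
    """Convert one measurement to the standard unit; None marks it as missing."""
    entry = KNOWLEDGE[variable]
    unit = normalize_unit(raw_unit, entry["to_standard"])
    if unit is None:
        return None  # unit could not be recognized
    converted = value * entry["to_standard"][unit]
    low, high = entry["plausible_range"]
    return converted if low <= converted <= high else None  # drop outliers


def completeness(values):
    """Percentage of non-missing values."""
    return 100.0 * sum(v is not None for v in values) / len(values)


# Toy records: (value, unit as typed). "mmoll" is a misspelling of "mmol/L".
records = [(5.5, "mmoll"), (99.0, "mg/dl"), (100.0, "mg/dL "), (5000.0, "mg/dL"), (90.0, "bpm")]
cleaned = [clean_value("glucose", v, u) for v, u in records]
print(cleaned, completeness(cleaned))
```

In this toy example, values whose unit cannot be matched or whose converted value falls outside the plausible range become missing, which is why completeness can only stay the same or drop after cleaning, consistent with the pattern reported in the abstract.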


Updated: 2021-09-19
