From raw to refined: Data preprocessing for construction machine learning (ML), deep learning (DL), and reinforcement learning (RL) models


Automation in Construction (IF 9.6), Pub Date: 2024-10-24, DOI: 10.1016/j.autcon.2024.105844
SeyedeZahra Golazad, Abbas Mohammadi, Abbas Rashidi, Mohammad Ilbeigi

As the use of predictive models in construction rapidly increases, the need for preprocessing raw construction data has become more critical. This systematic review investigates data preprocessing techniques for machine learning (ML), deep learning (DL), and reinforcement learning (RL) models in the construction domain. Through a comprehensive analysis of 457 studies, the prevalence of six data types (i.e., tabular, image, video frame, time series, text, and point cloud) and their respective preprocessing methods are examined. Key findings reveal data transformation, cleaning, reduction, augmentation, and scaling as fundamental preprocessing categories, with applications varying across data types. The paper highlights knowledge gaps, including limited synthetic data adoption, lack of standardized annotation practices, absence of comprehensive preprocessing frameworks, and need for automated labeling. Furthermore, critical considerations regarding data privacy, security, sharing, and management practices are discussed. The review underscores the pivotal role of robust data preprocessing in enabling reliable predictive models.
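To make three of the categories named in the abstract concrete (cleaning, transformation, and scaling), the following is a minimal sketch on tabular data. It is not taken from the reviewed paper: the toy records, the field names (crew_size, material, duration_days), and the specific choices of mean imputation, one-hot encoding, and min-max scaling are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical toy records; field names and values are illustrative only.
RAW = [
    {"crew_size": 4, "material": "steel", "duration_days": 10.0},
    {"crew_size": 6, "material": "concrete", "duration_days": None},
    {"crew_size": 8, "material": "steel", "duration_days": 30.0},
    {"crew_size": 6, "material": "timber", "duration_days": 20.0},
]


def clean(rows, field):
    """Cleaning: fill missing numeric values with the mean of observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def one_hot(rows, field):
    """Transformation: replace a categorical field with 0/1 indicator columns."""
    categories = sorted({r[field] for r in rows})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != field}
        for c in categories:
            new[f"{field}_{c}"] = 1 if r[field] == c else 0
        out.append(new)
    return out


def min_max_scale(rows, field):
    """Scaling: map a numeric field linearly onto [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**r, field: (r[field] - lo) / span} for r in rows]


def preprocess(rows):
    """Chain the three steps; each step returns new rows and leaves input intact."""
    rows = clean(rows, "duration_days")
    rows = one_hot(rows, "material")
    for f in ("crew_size", "duration_days"):
        rows = min_max_scale(rows, f)
    return rows


if __name__ == "__main__":
    for row in preprocess(RAW):
        print(row)
```

In practice the scaling bounds and the imputation value would be computed on the training split only and then reused on validation and test data, to avoid leaking information between splits.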


Updated: 2024-10-24
