Recognizing temporary construction site objects using CLIP-based few-shot learning and multi-modal prototypes
Automation in Construction (IF 9.6), Pub Date: 2024-06-21, DOI: 10.1016/j.autcon.2024.105542
Yuanchang Liang, Prahlad Vadakkepat, David Kim Huat Chua, Shuyi Wang, Zhigang Li, Shuxiang Zhang

Visual understanding of temporary on-site objects is essential for robots and project management in construction. Deploying deep learning algorithms on construction sites is challenging due to high data annotation costs, demanding computational requirements, and the lack of large-scale training datasets. Recognizing temporary on-site objects therefore requires algorithms that learn in a data-efficient way. To fill this gap, a Contrastive Language–Image Pre-training (CLIP)-based few-shot learning algorithm is proposed to recognize temporary objects from limited image samples. The study builds an ImageNet-based similarity cache with an inter-class similarity distribution, and combines it with multi-modal (text and image) prototypes. The proposed algorithm is evaluated on a newly created TOCS dataset and on the public SODA dataset. Compared with the CLIP zero-shot algorithm, classification accuracy improves from 23.17% to 73.09% with 16-shot learning on SODA, and from 58.33% to 83.33% with 1-shot learning on TOCS. The study indicates that few-shot learning with vision-language models (VLMs) is a promising way to improve visual intelligence on construction sites.
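The core idea behind CLIP-based few-shot classification can be illustrated without the paper's specific algorithm: encode a handful of labeled support images per class, average their embeddings into class prototypes, and assign each query image to the most similar prototype by cosine similarity. The sketch below is a minimal, hypothetical illustration of this general prototype-matching scheme (not the authors' exact method, which additionally uses a similarity cache and text prototypes); random vectors stand in for CLIP image embeddings so the example runs without a model download.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_prototypes(support_embeddings, labels, num_classes):
    """Average the support embeddings per class, then renormalize,
    giving one prototype vector per class (N-shot -> one prototype)."""
    dim = support_embeddings.shape[1]
    protos = np.zeros((num_classes, dim))
    for c in range(num_classes):
        protos[c] = support_embeddings[labels == c].mean(axis=0)
    return l2_normalize(protos)

def classify(query_embeddings, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    sims = l2_normalize(query_embeddings) @ prototypes.T
    return sims.argmax(axis=1)

# Toy example: orthogonal class centers stand in for CLIP image features
# of three hypothetical site-object classes (e.g. scaffold, formwork, rebar).
rng = np.random.default_rng(0)
num_classes, dim, shots = 3, 8, 4
class_centers = np.eye(num_classes, dim)               # idealized embeddings
labels = np.repeat(np.arange(num_classes), shots)      # 4-shot support set
support = l2_normalize(class_centers[labels]
                       + 0.1 * rng.normal(size=(num_classes * shots, dim)))
queries = l2_normalize(class_centers
                       + 0.05 * rng.normal(size=(num_classes, dim)))

prototypes = build_prototypes(support, labels, num_classes)
pred = classify(queries, prototypes)
print(pred.tolist())
```

In a real pipeline, `support` and `queries` would come from a frozen CLIP image encoder, and the gap between zero-shot and few-shot accuracy reported in the abstract reflects replacing text-only prompts with prototypes grounded in a few labeled site images.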

Updated: 2024-06-21