CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection
International Journal of Computer Vision (IF 11.6), Pub Date: 2024-12-11, DOI: 10.1007/s11263-024-02313-2
Yanan Zhang, Jiaxin Chen, Di Huang
LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate object recognition and localization capabilities in 3D real-world space. However, existing methods rely heavily on time-consuming and laborious large-scale labeled LiDAR data, which poses a bottleneck for both performance improvement and practical applications. In this paper, we propose Contrastive Masked AutoEncoders for self-supervised 3D object detection, dubbed CMAE-3D, a promising solution that effectively alleviates label dependency in 3D perception. Specifically, we integrate Contrastive Learning (CL) and Masked AutoEncoders (MAE) into one unified framework to fully exploit the complementary characteristics of global semantic representation and local spatial perception. Furthermore, from the perspective of MAE, we develop Geometric-Semantic Hybrid Masking (GSHM) to selectively mask representative regions in point clouds with imbalanced foreground-background distributions and uneven point density, and design Multi-scale Latent Feature Reconstruction (MLFR) to capture high-level semantic features while mitigating the redundant reconstruction of low-level details. From the perspective of CL, we present Hierarchical Relational Contrastive Learning (HRCL) to mine rich semantic similarity information while alleviating negative-sample mismatch at both the voxel and frame levels. Extensive experiments demonstrate the effectiveness of our pre-training method when applied to multiple mainstream 3D object detectors (SECOND, CenterPoint and PV-RCNN) on three popular datasets (KITTI, Waymo and nuScenes).
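To make the two-branch idea described in the abstract concrete, the following is a minimal PyTorch sketch of one joint MAE + contrastive pre-training step under simplifying assumptions: a toy voxel encoder, uniform random masking in place of GSHM, a single latent-feature reconstruction head in place of MLFR, and plain voxel-level plus frame-level InfoNCE in place of HRCL. All module names, shapes and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal, illustrative sketch of joint MAE + contrastive pre-training
# in the spirit of CMAE-3D (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVoxelEncoder(nn.Module):
    """Stand-in encoder mapping per-voxel features to latent embeddings."""
    def __init__(self, in_dim=4, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):              # x: (B, N, in_dim)
        return self.mlp(x)             # (B, N, dim)

def split_mask(n_voxels, batch, ratio=0.7):
    """Uniform random split into visible/masked indices (GSHM is content-aware)."""
    keep = max(1, int(n_voxels * (1 - ratio)))
    order = torch.rand(batch, n_voxels).argsort(dim=1)
    return order[:, :keep], order[:, keep:]          # visible idx, masked idx

def gather(x, idx):
    """Gather voxels by index along the voxel dimension."""
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

def info_nce(q, k, temperature=0.1):
    """InfoNCE where the i-th query matches the i-th key within each group."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.transpose(-2, -1) / temperature                    # (G, M, M)
    targets = torch.arange(q.size(1), device=q.device).expand(q.size(0), -1)
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# --- one toy pre-training step on synthetic voxel features -------------------
B, N, C, D = 4, 256, 4, 64
voxels = torch.randn(B, N, C)                        # (x, y, z, intensity) per voxel
student, teacher = TinyVoxelEncoder(C, D), TinyVoxelEncoder(C, D)
decoder = nn.Sequential(nn.Linear(3 + D, D), nn.ReLU(), nn.Linear(D, D))
proj = nn.Linear(D, D)

vis_idx, mask_idx = split_mask(N, B, ratio=0.7)
z_vis = student(gather(voxels, vis_idx))             # encode visible voxels only
with torch.no_grad():
    z_teacher = teacher(voxels)                      # latent targets, no gradient

# MAE branch: predict latent features of masked voxels from the visible context,
# queried by the masked voxels' coordinates (a stand-in for MLFR).
ctx = z_vis.mean(dim=1, keepdim=True)                                 # (B, 1, D)
queries = gather(voxels, mask_idx)[..., :3]                           # masked centers
pred = decoder(torch.cat([queries, ctx.expand(-1, queries.size(1), -1)], dim=-1))
loss_recon = F.smooth_l1_loss(pred, gather(z_teacher, mask_idx))

# CL branch: voxel-level contrast (matching voxels across student/teacher views)
# plus a frame-level contrast over pooled embeddings across the batch.
loss_voxel = info_nce(proj(z_vis), gather(z_teacher, vis_idx))
loss_frame = info_nce(proj(z_vis).mean(1).unsqueeze(0), z_teacher.mean(1).unsqueeze(0))

loss = loss_recon + loss_voxel + loss_frame
loss.backward()
print(f"recon={loss_recon.item():.3f} voxel={loss_voxel.item():.3f} frame={loss_frame.item():.3f}")
```

The reconstruction target here is a latent feature rather than raw points, mirroring the abstract's motivation for MLFR (avoiding redundant reconstruction of low-level details); the two contrastive terms stand in for the voxel-level and frame-level views named in HRCL.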