Complex & Intelligent Systems (IF 5.0), Pub Date: 2024-12-19, DOI: 10.1007/s40747-024-01695-7
Yunfan Zhang, Rong Zou, Yiqun Zhang, Yue Zhang, Yiu-ming Cheung, Kangshun Li
Heterogeneous attribute data (also called mixed data), in which objects are described by both numerical and categorical attributes, occur frequently across various scenarios. Because annotation is costly, clustering has emerged as a favorable technique for analyzing unlabeled mixed data. To address this complex real-world clustering task, this paper proposes a new clustering method called Adaptive Micro Partition and Hierarchical Merging (AMPHM), based on neighborhood rough set theory and a novel hierarchical merging mechanism. Specifically, we present a distance metric unified across numerical and categorical attributes, which lets neighborhood rough sets partition the data objects into fine-grained, compact clusters. We then gradually merge the currently most similar clusters, which avoids placing dissimilar objects in the same cluster. The proposed approach overcomes the clustering performance bottleneck caused by a preset number of sought clusters k and by cluster distribution bias, and is thus capable of clustering datasets comprising various combinations of numerical and categorical attributes. Extensive experimental evaluations comparing the proposed AMPHM with state-of-the-art counterparts on various datasets demonstrate its superiority.
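The two-stage idea described in the abstract (fine-grained partition under a unified distance, then merging of the most similar clusters) can be illustrated with a minimal sketch. This is not the authors' AMPHM algorithm: the abstract does not give the metric, the neighborhood construction, or the merging criterion, so everything below is an assumption made for illustration. The distance is a Gower-style stand-in, the partition is a greedy eps-neighborhood grouping, the merge uses average linkage, and the names `mixed_distance`, `micro_partition`, `hierarchical_merge`, `eps`, and `stop` are hypothetical.

```python
from itertools import combinations


def mixed_distance(a, b, num_idx, cat_idx, ranges):
    """Assumed Gower-style distance in [0, 1]: range-normalised absolute
    difference on numerical attributes, simple mismatch on categorical ones."""
    total = 0.0
    for j in num_idx:
        total += abs(a[j] - b[j]) / ranges[j] if ranges[j] > 0 else 0.0
    for j in cat_idx:
        total += 0.0 if a[j] == b[j] else 1.0
    return total / (len(num_idx) + len(cat_idx))


def micro_partition(data, dist, eps):
    """Greedy stand-in for the micro partition: each seed collects every
    still-unassigned object lying within its eps-neighborhood."""
    unassigned = list(range(len(data)))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        members = [i for i in unassigned if dist(data[seed], data[i]) <= eps]
        clusters.append(members)
        unassigned = [i for i in unassigned if i not in members]
    return clusters


def average_linkage(c1, c2, data, dist):
    """Mean pairwise distance between the members of two clusters."""
    pair_sum = sum(dist(data[i], data[j]) for i in c1 for j in c2)
    return pair_sum / (len(c1) * len(c2))


def hierarchical_merge(clusters, data, dist, stop):
    """Repeatedly merge the two closest clusters; halt once the closest
    pair is farther apart than `stop`, so no target k is supplied."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        (p, q), d = min(
            (((p, q), average_linkage(clusters[p], clusters[q], data, dist))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > stop:
            break
        clusters[p].extend(clusters[q])
        del clusters[q]  # p < q, so index p is still valid
    return clusters


# Toy mixed data: attribute 0 is numerical, attribute 1 is categorical.
data = [(1.0, "a"), (1.1, "a"), (1.2, "a"), (9.0, "b"), (9.1, "b"), (8.9, "b")]
NUM_IDX, CAT_IDX = [0], [1]
ranges = {0: max(row[0] for row in data) - min(row[0] for row in data)}


def dist(a, b):
    return mixed_distance(a, b, NUM_IDX, CAT_IDX, ranges)


micro = micro_partition(data, dist, eps=0.01)
final = hierarchical_merge(micro, data, dist, stop=0.1)
print(len(micro), "micro clusters merged into", len(final), "clusters")
```

In this toy run the small `eps` deliberately over-segments the data, and the merging stage then recovers the two well-separated groups without being told how many clusters to find. The stopping threshold `stop` is itself an assumed parameter; how AMPHM decides when to stop merging is not stated in the abstract.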
Title: Adaptive Micro Partition and Hierarchical Merging for Accurate Mixed Data Clustering