Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation


Bibliographic Details
Main Authors:	Min Wei, Tommy W. S. Chow, Rosa H. M. Chan
Format:	Article
Language:	English
Published:	MDPI AG 2015-03-01
Series:	Entropy
Subjects:	feature transformation; k-means clustering; heterogeneous data; numerical features; non-numerical features
Online Access:	http://www.mdpi.com/1099-4300/17/3/1535

Description
Summary:	Traditional centroid-based clustering algorithms often cluster heterogeneous data, which mix numerical and non-numerical features, inaccurately to varying degrees. This is because the Hamming distance used to measure dissimilarity between non-numerical values does not provide optimal distances between different values, and further problems arise from attempts to combine the Euclidean distance with the Hamming distance. In this study, mutual information (MI)-based unsupervised feature transformation (UFT), which can transform non-numerical features into numerical features without information loss, was used with the conventional k-means algorithm to cluster heterogeneous data. For the original non-numerical features, UFT provides numerical values that preserve the structure of those features while also having the properties of continuous values. Experiments and analysis on real-world datasets showed that the integrated UFT-k-means clustering algorithm outperformed the other algorithms compared on heterogeneous data with both numerical and non-numerical features.
ISSN:	1099-4300
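
The distance problem described in the summary can be illustrated with a minimal sketch. This is not the paper's UFT method. It only shows the baseline the summary criticizes, a Euclidean term added to a Hamming term. The function names, the weight `gamma`, and the example records are all hypothetical and chosen for illustration.

```python
import math


def hamming(a, b):
    """Count the positions at which two tuples of categorical values differ."""
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two tuples of numerical values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mixed_distance(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Euclidean distance on the numerical part plus a weighted Hamming
    distance on the non-numerical part. gamma is a hypothetical weight
    introduced for this sketch, not a parameter from the paper."""
    return euclidean(num_a, num_b) + gamma * hamming(cat_a, cat_b)


# Hamming distance treats every pair of distinct values as equally far apart,
# so it cannot express that some values are closer to each other than others.
d_red_orange = hamming(("red",), ("orange",))
d_red_blue = hamming(("red",), ("blue",))

# The combined distance depends on the unit of the numerical feature:
# the same two records give very different balances between the two terms.
d_cm = mixed_distance((170.0,), ("red",), (180.0,), ("blue",))
d_m = mixed_distance((1.70,), ("red",), (1.80,), ("blue",))

print(d_red_orange, d_red_blue)
print(d_cm, d_m)
```

In this sketch the categorical term contributes the same amount whatever the two distinct values are, and its relative influence changes when the numerical feature is rescaled. Mapping non-numerical values to numbers before running k-means, as the summary describes, avoids mixing the two distance measures.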

Similar Items