Learning-Based Dissimilarity for Clustering Categorical Data

Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity...

Bibliographic Details
Main Authors:	Edgar Jacob Rivera Rios, Miguel Angel Medina-Pérez, Manuel S. Lazo-Cortés, Raúl Monroy
Format:	Article
Language:	English
Published:	MDPI AG 2021-04-01
Series:	Applied Sciences
Subjects:	dissimilarity, categorical data, clustering
Online Access:	https://www.mdpi.com/2076-3417/11/8/3509

id	doaj-403359bbf997416aaf43a79a0c76eda3
record_format	Article
spelling	doaj-403359bbf997416aaf43a79a0c76eda3; 2021-04-14T23:02:29Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2021-04-01; 11; 3509; 3509; 10.3390/app11083509; Learning-Based Dissimilarity for Clustering Categorical Data; Edgar Jacob Rivera Rios (0); Miguel Angel Medina-Pérez (1); Manuel S. Lazo-Cortés (2); Raúl Monroy (3); (0) School of Engineering and Science, Tecnologico de Monterrey, Estado de Mexico 52926, Mexico; (1) School of Engineering and Science, Tecnologico de Monterrey, Estado de Mexico 52926, Mexico; (2) TecNM/Instituto Tecnológico de Tlalnepantla, Tlalnepantla de Baz 54070, Mexico; (3) School of Engineering and Science, Tecnologico de Monterrey, Estado de Mexico 52926, Mexico; https://www.mdpi.com/2076-3417/11/8/3509; dissimilarity; categorical data; clustering
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Edgar Jacob Rivera Rios, Miguel Angel Medina-Pérez, Manuel S. Lazo-Cortés, Raúl Monroy
author_sort	Edgar Jacob Rivera Rios
title	Learning-Based Dissimilarity for Clustering Categorical Data
publisher	MDPI AG
series	Applied Sciences
issn	2076-3417
publishDate	2021-04-01
description	Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
topic	dissimilarity, categorical data, clustering
url	https://www.mdpi.com/2076-3417/11/8/3509
_version_	1721526896904110080
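
The description field in this record outlines a three-step idea: pick a target attribute, predict it from the remaining attributes, and turn the resulting confusion matrix into a dissimilarity between the attribute's values. The following is a minimal Python sketch of that idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the naive Bayes stand-in classifier, the leave-one-out predictions, the symmetrised confusion-rate mapping, the function names, and the toy rows. The record does not state which classifier or which transformation the paper uses.

```python
from collections import Counter, defaultdict
from math import log


def predict_nb(train_rows, target, query):
    """Stand-in classifier (assumption): categorical naive Bayes, Laplace-smoothed."""
    classes = Counter(r[target] for r in train_rows)
    features = [i for i in range(len(query)) if i != target]
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    domain = defaultdict(set)      # feature index -> values seen in training
    for r in train_rows:
        for i in features:
            counts[(r[target], i)][r[i]] += 1
            domain[i].add(r[i])
    best, best_score = None, float("-inf")
    for c, n_c in sorted(classes.items()):  # sorted for deterministic ties
        score = log(n_c / len(train_rows))
        for i in features:
            k = len(domain[i] | {query[i]})  # smoothing over known values
            score += log((counts[(c, i)][query[i]] + 1) / (n_c + k))
        if score > best_score:
            best, best_score = c, score
    return best


def confusion_matrix(rows, target):
    """Leave-one-out confusion counts, indexed as cm[actual][predicted]."""
    cm = defaultdict(Counter)
    for j, row in enumerate(rows):
        train = rows[:j] + rows[j + 1:]
        cm[row[target]][predict_nb(train, target, row)] += 1
    return cm


def dissimilarity(cm):
    """Assumed mapping: values that are often confused are treated as close."""
    values = sorted(cm)

    def rate(a, b):
        # Fraction of objects with actual value a that were predicted as b.
        return cm[a][b] / sum(cm[a].values())

    return {
        (a, b): 0.0 if a == b else 1.0 - (rate(a, b) + rate(b, a)) / 2
        for a in values
        for b in values
    }


# Toy dataset (invented): columns are colour, size, shape; colour is the target.
rows = [
    ("red", "small", "round"), ("red", "small", "round"),
    ("red", "large", "round"), ("orange", "small", "round"),
    ("orange", "large", "round"), ("green", "large", "long"),
    ("green", "large", "long"), ("green", "small", "long"),
]
d = dissimilarity(confusion_matrix(rows, 0))
for (a, b), v in sorted(d.items()):
    if a < b:
        print(a, b, round(v, 3))
```

With this mapping the result is symmetric, zero on the diagonal, and bounded in [0, 1]. Repeating the procedure for each attribute and summing the per-attribute values would give one possible object-level dissimilarity, though that aggregation is also an assumption here.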

Similar Items