Learning-Based Dissimilarity for Clustering Categorical Data

Bibliographic Details
Main Authors: Edgar Jacob Rivera Rios, Miguel Angel Medina-Pérez, Manuel S. Lazo-Cortés, Raúl Monroy
Format: Article
Language: English
Published: MDPI AG 2021-04-01
Series: Applied Sciences
Subjects: dissimilarity, categorical data, clustering
Online Access: https://www.mdpi.com/2076-3417/11/8/3509
id doaj-403359bbf997416aaf43a79a0c76eda3
record_format Article
spelling doaj-403359bbf997416aaf43a79a0c76eda3
timestamp 2021-04-14T23:02:29Z
volume 11
article 3509
doi 10.3390/app11083509
affiliations Edgar Jacob Rivera Rios: School of Engineering and Science, Tecnologico de Monterrey, Estado de Mexico 52926, Mexico
Miguel Angel Medina-Pérez: School of Engineering and Science, Tecnologico de Monterrey, Estado de Mexico 52926, Mexico
Manuel S. Lazo-Cortés: TecNM/Instituto Tecnológico de Tlalnepantla, Tlalnepantla de Baz 54070, Mexico
Raúl Monroy: School of Engineering and Science, Tecnologico de Monterrey, Estado de Mexico 52926, Mexico
collection DOAJ
sources DOAJ
issn 2076-3417
description Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure, which we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
topic dissimilarity
categorical data
clustering
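
The two steps named in the description (learn a classifier for a target attribute, then turn its confusion matrix into a dissimilarity) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the record does not state which classifier or which transformation the paper uses. The naive Bayes stand-in, the evaluation on the training rows themselves, the formula d(a, b) = 1 - (rate(a, b) + rate(b, a)) / 2, and the names `confusion_matrix`, `dissimilarity`, and `rows` are all assumptions made here for illustration.

```python
from collections import Counter, defaultdict
from math import log


def confusion_matrix(rows, target):
    """Predict column `target` from the remaining columns and count how
    often true value a is predicted as value b.

    Assumption: a tiny categorical naive Bayes stands in for whatever
    classifier the paper uses, and it is evaluated on its training rows.
    """
    others = [j for j in range(len(rows[0])) if j != target]
    prior = Counter(r[target] for r in rows)
    # cond[(column, column value, target value)] = co-occurrence count
    cond = Counter((j, r[j], r[target]) for r in rows for j in others)
    n_values = {j: len({r[j] for r in rows}) for j in others}

    def predict(row):
        def score(c):
            s = log(prior[c] / len(rows))
            for j in others:
                # Laplace smoothing avoids log(0) for unseen pairs
                s += log((cond[(j, row[j], c)] + 1) / (prior[c] + n_values[j]))
            return s
        return max(sorted(prior), key=score)

    counts = defaultdict(Counter)
    for r in rows:
        counts[r[target]][predict(r)] += 1
    return counts


def dissimilarity(counts):
    """Assumed transformation: values that are often mistaken for each
    other are close; values that are never confused are at distance 1."""
    values = sorted(counts)

    def rate(a, b):
        # fraction of objects with true value a that were predicted as b
        return counts[a][b] / sum(counts[a].values())

    return {(a, b): 0.0 if a == b else 1.0 - (rate(a, b) + rate(b, a)) / 2
            for a in values for b in values}


# Hypothetical toy dataset: (color, shape, fruit)
rows = [
    ("red", "round", "apple"),
    ("red", "round", "apple"),
    ("green", "round", "apple"),
    ("yellow", "long", "banana"),
    ("yellow", "long", "banana"),
    ("green", "long", "banana"),
]

d_color = dissimilarity(confusion_matrix(rows, target=0))
for pair, value in sorted(d_color.items()):
    print(pair, round(value, 3))
```

In this toy data the other attributes cannot distinguish "green" from its neighbours, so under the assumed formula "green" ends up closer to "red" and "yellow" than those two are to each other. Repeating the computation for every attribute would give one dissimilarity table per attribute, matching the per-attribute framing in the description.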