Graph Clustering for Categorical Data

Bibliographic Details
Main Authors: Chen, Wei-Shiang, 陳威翔
Other Authors: Lin, Ja-Chen
Format: Others
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/69g376
Description
Summary: Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 107 === Clustering is a common task in many fields, especially machine learning and data mining. Many existing clustering algorithms are designed for data with numerical attributes. With the rise of big data, however, many collected datasets originally have categorical or nominal attributes. Transforming categorical data into numerical data with specific techniques is one possible solution, but it can lose some of the essence of the original data. In this study, we use graph clustering for categorical data to address this problem. Our first method uses a context-based similarity measure to estimate the similarity between data objects, and so transforms a categorical dataset into a similarity matrix for a graph. We then feed the graph transition matrix into a neural network model to obtain a graph embedding matrix. Finally, a simple clustering algorithm is applied to the embedding matrix. Our second method extends the idea of the graph transition matrix used in the first method. With an additional input to the neural network model, we change the structure of the model and obtain better node representations and better clustering results. We test four categorical datasets in our experiments: Congress vote, Heart, Mushroom, and HIV. The results show that both of our methods cluster the categorical data better than other categorical clustering methods.
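
The first two steps of the pipeline described in the summary (categorical data to similarity matrix, similarity matrix to graph transition matrix) can be illustrated with a minimal sketch. This is not the thesis's implementation: the record does not specify the context-based similarity measure, so the sketch substitutes a plain simple-matching (overlap) similarity, and the function names `overlap_similarity` and `transition_matrix` as well as the toy data are hypothetical. The neural network embedding and the final clustering step are omitted.

```python
def overlap_similarity(rows):
    """Simple-matching similarity: fraction of attributes on which two objects agree.

    Assumption: a stand-in for the context-based measure named in the summary.
    """
    n_attrs = len(rows[0])
    return [
        [sum(a == b for a, b in zip(r1, r2)) / n_attrs for r2 in rows]
        for r1 in rows
    ]


def transition_matrix(sim):
    """Row-normalize a similarity matrix into random-walk transition probabilities."""
    probs = []
    for i, row in enumerate(sim):
        # Drop self-loops so a walk always moves to a different node.
        weights = [0.0 if j == i else s for j, s in enumerate(row)]
        total = sum(weights)
        # An isolated node (no similar neighbors) keeps an all-zero row.
        probs.append([w / total if total > 0 else 0.0 for w in weights])
    return probs


# Hypothetical toy dataset: four objects, three categorical attributes each.
votes = [
    ["y", "n", "y"],
    ["y", "n", "n"],
    ["n", "y", "n"],
    ["n", "y", "y"],
]

S = overlap_similarity(votes)
P = transition_matrix(S)
```

Each row of `P` is a probability distribution over the other objects, weighted by similarity; a matrix of this kind is what the summary describes as the input to the neural network model.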