On the Clustering Techniques of Categorical Databases


Bibliographic Details
Main Authors: Hung-Leng Chen, 陳泓稜
Other Authors: Ming-Syan Chen
Format: Others
Language: en_US
Published: 2008
Online Access: http://ndltd.ncl.edu.tw/handle/45374529258975985453
Description
Summary: Doctoral dissertation === National Taiwan University === Graduate Institute of Electrical Engineering === 96 === In recent years, Knowledge Discovery in Databases (KDD) has attracted considerable attention because of the need to mine useful information and knowledge from large volumes of data. Data clustering is a frequently used data mining technique in KDD for exploratory data analysis. Given a set of data objects, the clustering problem is to partition the objects into groups such that objects in the same group are similar to each other while objects in different groups are dissimilar, according to a predefined similarity measure.

In this dissertation, we focus on the problem of clustering categorical databases. Categorical attributes are prevalent in real data; for example, buying records, web logs, and web documents are all categorical data. Previous work on clustering categorical data focused on clustering the entire data set and did not fully explore the issues of execution efficiency, incremental updates, and drifting concepts. The problem of clustering categorical databases therefore remains a challenging one.

On the problem of execution efficiency, sampling has been recognized as an important technique for improving the efficiency of clustering. With sampling applied, however, the points that are not sampled have no cluster labels after the normal clustering process. We therefore propose a framework named Maximal Resemblance Data Labeling, which allocates each unlabeled data point to the most appropriate cluster based on a novel categorical cluster representative, the N-Nodeset Importance Representative, which represents clusters by the importance of combinations of attribute values.
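The labeling step described above can be illustrated with a minimal Python sketch. It is not the dissertation's definition of the N-Nodeset Importance Representative: it only handles single attribute values (nodes of size one), and the importance formula, the function names `node_importance` and `label_point`, and the data layout are all assumptions made for illustration. Here a node is weighted by how frequent it is inside a cluster and by how exclusive it is to that cluster, and an unlabeled point goes to the cluster with the largest summed weight.

```python
from collections import Counter


def node_importance(clusters):
    """Build a per-cluster weight for each (attribute index, value) node.

    clusters: dict mapping cluster id -> list of tuples of categorical values.
    Assumed weighting (illustrative only): in-cluster frequency multiplied by
    the share of the node's occurrences that fall in this cluster.
    """
    counts = {cid: Counter() for cid in clusters}
    totals = Counter()
    for cid, points in clusters.items():
        for point in points:
            for attr, value in enumerate(point):
                counts[cid][(attr, value)] += 1
                totals[(attr, value)] += 1

    importance = {}
    for cid, points in clusters.items():
        size = len(points)
        importance[cid] = {
            node: (n / size) * (n / totals[node])
            for node, n in counts[cid].items()
        }
    return importance


def label_point(point, importance):
    """Assign an unlabeled point to the cluster it resembles most."""
    def resemblance(cid):
        weights = importance[cid]
        return sum(weights.get((attr, value), 0.0)
                   for attr, value in enumerate(point))

    return max(importance, key=resemblance)


# Clusters obtained from a sample; the remaining points are labeled afterwards.
sampled = {
    "c1": [("a", "x"), ("a", "y")],
    "c2": [("b", "z"), ("b", "z")],
}
weights = node_importance(sampled)
label = label_point(("a", "x"), weights)
```

Because the weights are computed once from the sampled clusters, each unsampled point can be labeled in a single pass without re-running the clustering.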
In addition to the difficulty of processing tremendous data volumes, another challenge in the design of modern clustering algorithms is that various dynamic updates, including deletions and insertions, are applied to the huge database. It is impractical to re-cluster the entire database whenever new data points are added, so clustering algorithms should operate incrementally. We devise another framework that performs incremental clustering on categorical data. In this work, the practical cluster representative Node Importance Representative, which is the simplest version of the N-Nodeset Importance Representative, is used to capture the characteristics of clusters. Moreover, the concept of the period of interest, as used in progressive mining, is adopted to continuously track the latest clustering result.

Finally, we focus on the problem of drifting concepts. The concepts that we try to learn from the data typically drift over time, and the underlying clusters may also change considerably. Clustering the entire time-evolving data set not only decreases the quality of the clusters but also fails to meet the expectations of users, who usually require recent clustering results. We therefore propose a framework that detects drifting concepts at different windows, generates the clustering result based on the current concept, and visualizes the relationship between clustering results. The framework is composed of two algorithms: the Drifting Concept Detection algorithm, which detects changes in cluster distributions between the current window and the last clustering result, and the Cluster Relationship Analysis algorithm, which analyzes the relationship between clustering results at different times. With this framework, we can obtain clustering results of better quality and capture the time-evolving trend in the data set by analyzing the evolving clustering results.
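The idea of detecting a drifting concept by comparing cluster distributions between the current window and the last clustering result can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Drifting Concept Detection algorithm itself: the function names, the use of `None` to mark points that fit no existing cluster, the total variation distance as the measure of change, and both threshold values are hypothetical choices.

```python
from collections import Counter


def cluster_distribution(labels, cluster_ids):
    """Fraction of points assigned to each cluster id."""
    counts = Counter(labels)
    total = len(labels)
    return {cid: counts[cid] / total for cid in cluster_ids}


def concept_drifted(previous_labels, window_labels,
                    outlier_limit=0.2, shift_limit=0.3):
    """Report drift when the current window no longer matches the last result.

    previous_labels: cluster ids from the last clustering result.
    window_labels: cluster ids given to the current window's points by the
        previous clusters, with None for points that fit no cluster.
    Both limits are assumed values chosen for illustration.
    """
    if not previous_labels or not window_labels:
        raise ValueError("both label lists must be non-empty")

    # Share of window points that the previous clusters cannot explain.
    outlier_ratio = list(window_labels).count(None) / len(window_labels)

    cluster_ids = {label
                   for label in list(previous_labels) + list(window_labels)
                   if label is not None}
    before = cluster_distribution(previous_labels, cluster_ids)
    now = cluster_distribution(window_labels, cluster_ids)

    # Total variation distance between the two cluster-size distributions.
    shift = 0.5 * sum(abs(before[cid] - now[cid]) for cid in cluster_ids)

    return outlier_ratio > outlier_limit or shift > shift_limit


last_result = ["a"] * 5 + ["b"] * 5
current_window = ["a"] * 9 + ["b"]
drift = concept_drifted(last_result, current_window)
```

In a sketch like this, a reported drift would trigger re-clustering on the current window, while a stable window would simply be merged into the existing clusters.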