Incremental Hierarchical Clustering Algorithms Based on Statistical Models

Bibliographic Details
Main Authors: Chien-Yu Chen, 陳倩瑜
Other Authors: Yen-Jen Oyang
Format: Others
Language: en_US
Published: 2003
Online Access: http://ndltd.ncl.edu.tw/handle/54407472577750540919
Description
Summary: Doctoral === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 91 === This thesis studies incremental hierarchical clustering for large databases. Data clustering concerns how to group similar objects together while separating dissimilar objects. An incremental clustering algorithm examines incoming objects one by one and determines how each incoming object should be clustered with the existing objects, without reprocessing all the existing objects. Many incremental clustering algorithms employ representatives of clusters to record the distribution of the existing objects. However, this practice may lead to some information loss if insufficient or improper representatives are selected. Furthermore, many existing incremental clustering algorithms suffer from order dependence; that is, the clustering results may vary dramatically if the objects are input in different orders. This thesis proposes a test scheme based on statistical models for incremental clustering. The main distinction between the proposed algorithm and previous model-based approaches is that abstractions referred to as spherical clusters and homogeneous clusters are verified with statistical tests. The objective of employing these statistical tests is to reduce the information loss incurred when representatives of clusters are used to record the distribution of the existing objects. Experimental results reveal that, with these abstractions as well as the split and merge operations, the proposed incremental clustering algorithm does not suffer from order dependence. Furthermore, the parameters associated with the algorithm can be set on statistical grounds and do not need to be changed to accommodate datasets with different distributions.
Experimental results also reveal that the proposed incremental clustering algorithm delivers higher clustering quality than existing incremental hierarchical clustering algorithms in many cases, and is more robust in the presence of outliers than conventional hierarchical clustering algorithms.
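
The order dependence described in the summary can be illustrated with a minimal Python sketch. This is not the algorithm proposed in the thesis: it is a generic leader-style incremental clustering that keeps one centroid per cluster as its representative. The function name `incremental_cluster`, the `threshold` parameter, and the toy one-dimensional data are assumptions made purely for illustration.

```python
import math


def incremental_cluster(points, threshold):
    """Generic leader-style incremental clustering (illustrative only).

    Each incoming point joins the nearest existing cluster if its centroid
    lies within `threshold`; otherwise the point opens a new cluster.
    Only centroids and counts are stored, so earlier points are never
    reprocessed.
    """
    centroids = []  # one representative (centroid) per cluster
    counts = []     # number of points absorbed by each cluster
    labels = []     # cluster index assigned to each point, in input order
    for p in points:
        best, best_dist = None, math.inf
        for i, c in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            counts[best] += 1
            n = counts[best]
            # Running-mean update of the representative.
            centroids[best] = tuple(
                m + (x - m) / n for m, x in zip(centroids[best], p)
            )
            labels.append(best)
        else:
            centroids.append(tuple(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return centroids, counts, labels


# The same four objects presented in two different orders.
order_a = [(0.0,), (1.0,), (2.0,), (3.0,)]
order_b = [(0.0,), (3.0,), (1.0,), (2.0,)]

_, counts_a, labels_a = incremental_cluster(order_a, threshold=1.6)
_, counts_b, labels_b = incremental_cluster(order_b, threshold=1.6)

print(counts_a)  # [3, 1]: clusters {0, 1, 2} and {3}
print(counts_b)  # [2, 2]: clusters {0, 1} and {2, 3}
```

With the same data and the same threshold, the two input orders produce different partitions. This is the kind of behavior that the statistical tests and the split and merge operations described in the summary are intended to avoid.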