A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters

碩士 === 國立中央大學 === 資訊工程研究所 === 100 === Recently, efficiency of clustering algorithms for large data analysis has become a major issue in many application domains. First, the computation time and storage space increases dramatically while processing a large amount of data. Second, it is hard to determ...

Full description

Bibliographic Details
Main Authors: Shi-Shan Chen, 陳詩姍
Other Authors: Wei-Jen Wang
Format: Others
Language:zh-TW
Published: 2012
Online Access:http://ndltd.ncl.edu.tw/handle/36517890026743833619
id ndltd-TW-100NCU05392053
record_format oai_dc
spelling ndltd-TW-100NCU053920532015-10-13T21:22:38Z http://ndltd.ncl.edu.tw/handle/36517890026743833619 A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters 基於圖形的平行化最小生成樹分群演算法 Shi-Shan Chen 陳詩姍 碩士 國立中央大學 資訊工程研究所 100 Recently, efficiency of clustering algorithms for large data analysis has become a major issue in many application domains. First, the computation time and storage space increases dramatically while processing a large amount of data. Second, it is hard to determine the number of clusters automatically. In order to cluster datasets within rational computation time and storage space, we propose a parallel computing strategy using the concept of graph-based clustering and granular computing. The proposed strategy automatically determines the best number of clusters of a large datasets, and effectively reduces the computation time and storage space requirements, given a large amount of data. Based on the proposed strategy, we devise two clustering algorithms, and implement them in Message Passing Interface (MPI). Both the algorithms utilize the minimum spanning tree structure to determine the best number of clusters automatically. The first algorithm is called Para-CPLM (Parallel Clustering based on Partitions of Local Minimal-spanning-trees), while the second algorithm is called Para-CGM (Parallel Clustering based on a Global Minimal-spanning-tree). The Para-CPLM partitions the data domain into several blocks (hyper-rectangles) according to the dimensions of the datasets, utilizes a parallel method to uniformly distribute all datasets to the blocks, and then establishes a local minimal-spanning-tree in each block. After each local minimal-spanning-tree is established, it combines those local minimal-spanning-trees according to the closest Euclidean Distance and then applies the GBC method to cluster the local minimal-spanning trees. After the first clustering, it checks the distances between each cluster, and finds the closest Euclidean Distance that conforms to the rules of the minimal-spanning-tree among them. The Para-CGM constructs a global minimal-spanning-tree in parallel, and applies the GBC method to find the best number of clusters. It requires a given threshold number to process the data efficiently. From our experimental results, the Para-CPLM has significantly shorter execution time and almost the same clustering results when compared with the GBC method. On the contrary, the Para-CGM has less improvement on execution time than the Para-CPLM, given that there are enough computing resources for the Para-CPLM. However, it still outperforms the GBC method and produces the same clustering results. Wei-Jen Wang 王尉任 2012 學位論文 ; thesis 73 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description 碩士 === 國立中央大學 === 資訊工程研究所 === 100 === Recently, efficiency of clustering algorithms for large data analysis has become a major issue in many application domains. First, the computation time and storage space increases dramatically while processing a large amount of data. Second, it is hard to determine the number of clusters automatically. In order to cluster datasets within rational computation time and storage space, we propose a parallel computing strategy using the concept of graph-based clustering and granular computing. The proposed strategy automatically determines the best number of clusters of a large datasets, and effectively reduces the computation time and storage space requirements, given a large amount of data. Based on the proposed strategy, we devise two clustering algorithms, and implement them in Message Passing Interface (MPI). Both the algorithms utilize the minimum spanning tree structure to determine the best number of clusters automatically. The first algorithm is called Para-CPLM (Parallel Clustering based on Partitions of Local Minimal-spanning-trees), while the second algorithm is called Para-CGM (Parallel Clustering based on a Global Minimal-spanning-tree). The Para-CPLM partitions the data domain into several blocks (hyper-rectangles) according to the dimensions of the datasets, utilizes a parallel method to uniformly distribute all datasets to the blocks, and then establishes a local minimal-spanning-tree in each block. After each local minimal-spanning-tree is established, it combines those local minimal-spanning-trees according to the closest Euclidean Distance and then applies the GBC method to cluster the local minimal-spanning trees. After the first clustering, it checks the distances between each cluster, and finds the closest Euclidean Distance that conforms to the rules of the minimal-spanning-tree among them. The Para-CGM constructs a global minimal-spanning-tree in parallel, and applies the GBC method to find the best number of clusters. It requires a given threshold number to process the data efficiently. From our experimental results, the Para-CPLM has significantly shorter execution time and almost the same clustering results when compared with the GBC method. On the contrary, the Para-CGM has less improvement on execution time than the Para-CPLM, given that there are enough computing resources for the Para-CPLM. However, it still outperforms the GBC method and produces the same clustering results.
author2 Wei-Jen Wang
author_facet Wei-Jen Wang
Shi-Shan Chen
陳詩姍
author Shi-Shan Chen
陳詩姍
spellingShingle Shi-Shan Chen
陳詩姍
A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters
author_sort Shi-Shan Chen
title A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters
title_short A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters
title_full A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters
title_fullStr A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters
title_full_unstemmed A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters
title_sort parallel minimal-spanning-tree-based clustering algorithm with self-detectability of the best number of clusters
publishDate 2012
url http://ndltd.ncl.edu.tw/handle/36517890026743833619
work_keys_str_mv AT shishanchen aparallelminimalspanningtreebasedclusteringalgorithmwithselfdetectabilityofthebestnumberofclusters
AT chénshīshān aparallelminimalspanningtreebasedclusteringalgorithmwithselfdetectabilityofthebestnumberofclusters
AT shishanchen jīyútúxíngdepíngxínghuàzuìxiǎoshēngchéngshùfēnqúnyǎnsuànfǎ
AT chénshīshān jīyútúxíngdepíngxínghuàzuìxiǎoshēngchéngshùfēnqúnyǎnsuànfǎ
AT shishanchen parallelminimalspanningtreebasedclusteringalgorithmwithselfdetectabilityofthebestnumberofclusters
AT chénshīshān parallelminimalspanningtreebasedclusteringalgorithmwithselfdetectabilityofthebestnumberofclusters
_version_ 1718061574198919168