Distributed Clustering for Homogeneous and Heterogeneous Databases

Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 105 === Clustering is an important technique widely used in many fields. Traditional centralized clustering approaches require transferring all distributed data to a single site and then clustering it there. However, due to bandwidth limitations, communication cost, and pri...

Bibliographic Details
Main Authors: Huang, Chih-Yuan, 黃致遠
Other Authors: Lin, Ja-Chen
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/84u65s
id ndltd-TW-105NCTU5394110
record_format oai_dc
spelling ndltd-TW-105NCTU5394110 2019-05-16T00:08:09Z http://ndltd.ncl.edu.tw/handle/84u65s Distributed Clustering for Homogeneous and Heterogeneous Databases 同質性與異質性資料庫之分散式分群 Huang, Chih-Yuan 黃致遠 Master's National Chiao Tung University Institute of Computer Science and Engineering 105 Lin, Ja-Chen 林志青 2017 Degree thesis ; thesis 50 en_US
collection NDLTD
sources NDLTD
description Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 105 === Clustering is an important technique widely used in many fields. Traditional centralized clustering approaches require transferring all distributed data to a single site and then clustering it there. However, because of bandwidth limitations, communication cost, and privacy concerns, it is often infeasible to transfer all data to a central site. Distributed clustering algorithms are therefore a good alternative that overcomes these shortcomings. In this study, we develop distributed clustering approaches for both homogeneous and heterogeneous database structures, and we propose two methods. The first, the vector-quantization (VQ) method, finds representatives at each local site and then uses all the local representatives to determine the final clustering result for the whole data. The second, the affinity-aggregation (AA) method, saves more transmission cost. Instead of transferring all the local representatives, we transfer to the central site only the representative information or some important characteristics of the representatives. We then use this reduced information to determine the pairwise similarity of the whole data and, finally, obtain the clustering result. Three datasets, Iris, Wine, and Breast Cancer, are tested in our experiments. To simulate the homogeneous and heterogeneous database structures, we split each entire dataset into several subsets. Both distributed clustering approaches efficiently reduce the transmission cost. Furthermore, their clustering results are very close to those of centralized clustering approaches, even though the latter use the complete information of the whole data.
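The representative-based idea behind the VQ method can be illustrated with a minimal sketch. It is not the thesis's algorithm: this record does not specify the quantizer, so plain k-means stands in for it, and the sketch assumes a homogeneous setting in which every site stores the same attributes for different records. The function names, the toy data, and the number of representatives per site are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as its group mean; keep the old one if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def local_representatives(site_points, n_reps):
    """Runs at one site: compress the local records into a few representatives."""
    return kmeans(site_points, n_reps)

def central_clustering(all_reps, k):
    """Runs at the central site: cluster only the received representatives."""
    return kmeans(all_reps, k)

# Hypothetical toy data: two sites, same two attributes, different records.
site_a = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
site_b = [(0.1, 0.0), (0.0, 0.2), (5.1, 5.0), (5.0, 4.8), (4.8, 5.2), (0.3, 0.1)]

# Only the representatives leave each site, not the raw records.
reps = local_representatives(site_a, 2) + local_representatives(site_b, 2)
global_centers = central_clustering(reps, 2)

# Each site labels its own records against the global centers.
labels_a = [nearest(p, global_centers) for p in site_a]
labels_b = [nearest(p, global_centers) for p in site_b]
print(f"points sent: {len(reps)} of {len(site_a) + len(site_b)}")
```

In this toy setup each site transmits two points instead of six, which is the kind of transmission saving the abstract refers to. How many representatives to keep per site is a tuning choice that this sketch does not address.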