Distributed Clustering for Homogeneous and Heterogeneous Databases

Bibliographic Details
Main Authors: Huang, Chih-Yuan, 黃致遠
Other Authors: Lin, Ja-Chen
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/84u65s
Description
Summary: Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 105 === Clustering is an important technique widely used in many fields. Traditional centralized clustering approaches require transferring all distributed data to a single site and then clustering it there. However, because of bandwidth limitations, communication cost, and privacy concerns, it is often infeasible to transfer all data to a central site. Distributed clustering algorithms are a good alternative that overcomes these shortcomings. In this study, our distributed clustering approaches work on homogeneous and heterogeneous database structures, respectively. We propose two methods to solve these distributed clustering problems. The first, the vector-quantization (VQ) method, finds representatives at each local site and then uses all the local representatives to determine the final clustering result for the whole data set. The second, the affinity-aggregation (AA) method, saves more transmission cost. Instead of transferring all the local representatives, it transfers to the central site only representative information or some important characteristics of the representatives. This reduced information is then used to determine the pairwise similarity of the whole data set, from which the clustering result is obtained. Three datasets, Iris, Wine, and Breast Cancer, are tested in our experiments. To simulate the homogeneous and heterogeneous database structures, we split each dataset into several subsets. Both distributed clustering approaches efficiently reduce the transmission cost. Furthermore, their clustering results are very close to those of centralized clustering approaches, even though the latter use the complete information of the whole data set.
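
The general idea behind the VQ method described in the summary (each site sends a few representatives instead of its raw data, and the central site clusters those representatives) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the algorithm's details, so the sketch assumes the homogeneous case means every site holds different records with the same attributes, uses plain k-means as the quantizer, and weights each representative by the number of local points it stands for. All function names and parameters (`weighted_kmeans`, `local_codebook`, `distributed_vq_clustering`, `n_reps`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Lloyd's algorithm with per-point weights; returns k centers.
    Assumes k <= len(points)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = nearest(p, centers)
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # An empty cluster keeps its previous center.
        new_centers = [
            [s / totals[j] for s in sums[j]] if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def local_codebook(site_data, n_reps, seed=0):
    """Runs at a local site: representatives plus how many points each covers."""
    centers = weighted_kmeans(site_data, [1.0] * len(site_data), n_reps, seed=seed)
    counts = [0] * n_reps
    for p in site_data:
        counts[nearest(p, centers)] += 1
    return centers, counts

def distributed_vq_clustering(sites, n_reps, k, seed=0):
    """Runs at the central site: clusters only the transmitted representatives."""
    reps, weights = [], []
    for i, data in enumerate(sites):
        centers, counts = local_codebook(data, n_reps, seed=seed + i)
        for c, n in zip(centers, counts):
            if n > 0:  # skip representatives that cover no points
                reps.append(c)
                weights.append(float(n))
    global_centers = weighted_kmeans(reps, weights, k, seed=seed)
    return global_centers, len(reps)

# Toy data (made up for illustration): two sites, each holding part of two groups.
sites = [
    [(0, 0), (0, 1), (10, 10), (10, 11)],
    [(1, 0), (1, 1), (11, 10), (11, 11)],
]
centers, n_sent = distributed_vq_clustering(sites, n_reps=2, k=2)
print(sorted(centers), "vectors transmitted:", n_sent)
```

In this toy run the central site receives one weighted vector per local representative rather than every record, which is where the saving in transmission cost comes from. The heterogeneous case and the AA method are not covered by this sketch.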