Multi-Domain Clustering on Real-Valued Datasets

Bibliographic Details
Main Author:	Hu, Zhen
Language:	English
Published:	University of Cincinnati / OhioLINK 2011
Subjects:	Computer Science; Clustering; Subspace Clustering
Online Access:	http://rave.ohiolink.edu/etdc/view?acc_num=ucin1311692725

id	ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1311692725
record_format	oai_dc
spelling	ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1311692725 2021-08-03T06:14:49Z Multi-Domain Clustering on Real-Valued Datasets Hu, Zhen Computer Science Clustering Subspace Clustering Clustering is an important research problem in knowledge discovery from databases. It focuses on finding hidden structures embedded in datasets. It is non-trivial to arrive at a clustering of a dataset such that each pair of data points within the same cluster is similar, and each pair in different clusters is distinct. This is due to the multiplicity of meanings of similarity between data points and also to the criteria determining the number, shape, and boundaries of clusters. Despite a large body of published research, new clustering problems that require novel solutions keep arising. Such a situation is evolving in the field of biomedical research, which is generating a large number of interrelated and interdependent datasets, and also in many other domains of science and business. We have developed three novel methodologies for clustering to meet these newly emerging needs. The first problem we have solved relates to the grouping of data points with “similar” density in the data space into distinct clusters, using full-dimensional clustering. Based on the pair-wise similarity matrix among data points, we define a new type of relationship among them, that of point pairs being Mutual K-Nearest Neighbors (MKNN) of each other, and design clustering algorithms based on this new notion to capture the data density. Compared with traditional Euclidean-distance-based clustering algorithms for datasets having different densities, our MKNN-based clustering algorithm allows users to form density-based clusters with significantly lower sensitivity to parameters. We have analytically and empirically demonstrated, using both synthetic and real-world datasets, the increased capability, precision, efficiency, and robustness of our algorithm.
The second clustering algorithm we have developed incorporates prior domain knowledge, provided as a pair-wise similarity matrix in one dataset, into the clustering performed for data in another dataset. The data objects in the “prior knowledge” data source and the second data source are the same. By adopting a semi-supervised clustering procedure, our algorithm, called Semi-supervised Gaussian Infinite Mixture Model (SGIMM), balances information from the two data sources and generates clusters enforcing precise pair-wise relationships. SGIMM accommodates many types of prior knowledge, and in empirical studies with both synthetic and real-world data it generates high-quality clusters regardless of the quality of the prior knowledge. The third type of problem we have solved relates to the discovery of subspace clusters. Numerous real-world applications focus on selecting subsets of data points and feature subspaces having desirable characteristics specified in terms of properties such as low variance, high distinction, low residue value, etc. We use lattice-structured search spaces to identify low-variance subspace clusters from one dataset (bicluster) and from two datasets (3-Cluster), and high-discrepancy subspace clusters from a single dataset (polarized bicluster). Results on both synthetic and genomic datasets are shown for all of these clustering tasks, and they show better performance than most of the existing algorithms. 2011-09-23 English text University of Cincinnati / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=ucin1311692725 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
collection	NDLTD
language	English
sources	NDLTD
topic	Computer Science; Clustering; Subspace Clustering
author	Hu, Zhen
title	Multi-Domain Clustering on Real-Valued Datasets
publisher	University of Cincinnati / OhioLINK
publishDate	2011
url	http://rave.ohiolink.edu/etdc/view?acc_num=ucin1311692725
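
Illustrative note on the Mutual K-Nearest Neighbors (MKNN) relationship named in the abstract: the sketch below uses the common textbook reading, in which two points are mutual k-nearest neighbors when each appears in the other's k-nearest-neighbor list. This is an assumption for illustration only; the thesis may define MKNN differently, and it works from a pair-wise similarity matrix rather than the Euclidean distances used here. The function names and the sample points are hypothetical.

```python
import math


def knn_lists(points, k):
    """Return, for each point index, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p (ties broken by index).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn_pairs(points, k):
    """Return index pairs (i, j) with i < j that appear in each other's k-NN lists."""
    nn = knn_lists(points, k)
    return {
        (i, j)
        for i in range(len(points))
        for j in nn[i]
        if i < j and i in nn[j]
    }


# Hypothetical sample: two small groups and one isolated point (index 5).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
pairs_k1 = mutual_knn_pairs(points, k=1)
print(sorted(pairs_k1))
```

With k=1, the isolated point at index 5 has a nearest neighbor (index 4), but that neighbor does not name it back, so the pair is not mutual. This asymmetry is what separates the mutual relation from a plain k-nearest-neighbor relation.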

Multi-Domain Clustering on Real-Valued Datasets

Similar Items