Multi-Domain Clustering on Real-Valued Datasets

Bibliographic Details
Main Author: Hu, Zhen
Language: English
Published: University of Cincinnati / OhioLINK 2011
Subjects: Computer Science; Clustering; Subspace Clustering
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1311692725
id ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1311692725
record_format oai_dc
description Clustering is an important research problem in knowledge discovery from databases. It focuses on finding hidden structures embedded in datasets. It is non-trivial to arrive at a clustering in which each pair of data points within the same cluster is similar and each pair in different clusters is distinct. The difficulty stems from the multiplicity of meanings of similarity between data points and from the criteria that determine the number, shape, and boundaries of clusters. Despite a large body of published research, new clustering problems keep arising that require novel solutions. Such a situation is evolving in biomedical research, which is generating a large number of interrelated and interdependent datasets, and in many other domains of science and business. We have developed three novel clustering methodologies to meet these newly emerging needs.

The first problem we have solved relates to grouping data points with “similar” density in the data space into distinct clusters, using full-dimensional clustering. Based on the pairwise similarity matrix among data points, we define a new type of relationship among them, that of a pair of points being Mutual K-Nearest Neighbors (MKNN) of each other, and design clustering algorithms based on this notion to capture the data density. Compared with traditional Euclidean-distance-based clustering algorithms for datasets having different densities, our MKNN-based clustering algorithm allows users to form density-based clusters with significantly lower sensitivity to parameters. We have analytically and empirically demonstrated, using both synthetic and real-world datasets, the increased capability, precision, efficiency, and robustness of our algorithm.
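The mutual K-nearest-neighbor relationship named in the abstract can be illustrated with a minimal sketch. It assumes the common reading of the term, in which two points are mutual neighbors when each appears in the other's list of k nearest points, and it assumes Euclidean distance as the underlying dissimilarity. The function names `knn_indices` and `mutual_knn_pairs` and the sample points are hypothetical; this is not the thesis's clustering algorithm, only the pairwise relation it is described as building on.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn_pairs(points, k):
    """Return index pairs (i, j), i < j, where each point is among the other's k nearest."""
    nn = knn_indices(points, k)
    return {
        (i, j)
        for i in range(len(points))
        for j in nn[i]
        if i < j and i in nn[j]
    }


# Illustrative data (an assumption): a tight group of three points,
# a tight pair far away, and one isolated point in between.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (5.0, 5.0)]
pairs = mutual_knn_pairs(points, k=2)
print(sorted(pairs))
```

In this toy data the isolated point at (5, 5) lists two points of the first group among its nearest neighbors, but none of them lists it in return, so it takes part in no mutual pair. That asymmetry is what lets a mutual-neighbor relation reflect local density rather than raw distance alone.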
The second clustering algorithm we have developed incorporates prior domain knowledge, provided as a pairwise similarity matrix from one dataset, into the clustering performed on data in another dataset. The data objects in the “prior knowledge” data source and the second data source are the same. By adopting a semi-supervised clustering procedure, our algorithm, called Semi-supervised Gaussian Infinite Mixture Model (SGIMM), balances information from the two data sources and generates clusters enforcing precise pairwise relationships. SGIMM accommodates many types of prior knowledge, and in empirical studies on both synthetic and real-world data it generates high-quality clusters regardless of the quality of the prior knowledge.

The third type of problem we have solved relates to the discovery of subspace clusters. Numerous real-world applications focus on selecting subsets of data points and feature subspaces having desirable characteristics, specified in terms of properties such as low variance, high distinction, and low residue value. We use lattice-structured search spaces to identify low-variance subspace clusters from one dataset (bicluster) and from two datasets (3-Cluster), and high-discrepancy subspace clusters from a single dataset (polarized bicluster). Results on both synthetic and genomic datasets are reported for all of these clustering tasks, and they show better performance than most existing algorithms.
date 2011-09-23
rights unrestricted. This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
collection NDLTD
sources NDLTD
topic Computer Science
Clustering
Subspace Clustering