Clustering Consistently
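Main Author: | Eldridge, Justin |
---|---|
Language: | English |
Published: | The Ohio State University / OhioLINK, 2017 |
Subjects: | Computer Science Statistics Artificial Intelligence machine learning unsupervised learning statistical learning clustering graphon mergeon density cluster tree hierarchical clustering |
Online Access: | http://rave.ohiolink.edu/etdc/view?acc_num=osu1512070374903249 |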
id |
ndltd-OhioLink-oai-etd.ohiolink.edu-osu1512070374903249 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-OhioLink-oai-etd.ohiolink.edu-osu1512070374903249
2021-08-03T07:04:56Z
Clustering Consistently
Eldridge, Justin
Computer Science Statistics Artificial Intelligence machine learning unsupervised learning statistical learning clustering graphon mergeon density cluster tree hierarchical clustering
Clustering is the task of organizing data into natural groups, or clusters. A central goal in developing a theory of clustering is the derivation of correctness guarantees which ensure that clustering methods produce the right results. In this dissertation, we analyze the setting in which the data are sampled from some underlying probability distribution. In this case, an algorithm is "correct" (or consistent) if, given larger and larger data sets, its output converges in some sense to the ideal cluster structure of the distribution.
In the first part, we study the setting in which data are drawn from a probability density supported on a subset of a Euclidean space. The natural cluster structure of the density is captured by the so-called high density cluster tree, which is due to Hartigan (1981). Hartigan introduced a notion of convergence to the density cluster tree, and recent work by Chaudhuri and Dasgupta (2010) and Kpotufe and von Luxburg (2011) has constructed algorithms which are consistent in this sense.
We will show that Hartigan's notion of consistency is in fact not strong enough to ensure that an algorithm recovers the density cluster tree as we would intuitively expect. We identify the precise deficiency which allows this, and introduce a new, stronger notion of convergence which we call consistency in merge distortion. Consistency in merge distortion implies Hartigan's consistency, and we prove that the algorithm of Chaudhuri and Dasgupta (2010) satisfies our new notion.
In the sequel, we consider the clustering of graphs sampled from a very general, non-parametric random graph model called a graphon. Unlike in the density setting, clustering in the graphon model is not well-studied. We therefore rigorously analyze the cluster structure of a graphon and formally define the graphon cluster tree. We adapt our notion of consistency in merge distortion to the graphon setting and identify efficient, consistent algorithms.
2017
English
text
The Ohio State University / OhioLINK
http://rave.ohiolink.edu/etdc/view?acc_num=osu1512070374903249
unrestricted
This thesis or dissertation is protected by copyright: some rights reserved. It is licensed for use under a Creative Commons license. Specific terms and permissions are available from this document's record in the OhioLINK ETD Center. |
collection |
NDLTD |
language |
English |
sources |
NDLTD |
topic |
Computer Science Statistics Artificial Intelligence machine learning unsupervised learning statistical learning clustering graphon mergeon density cluster tree hierarchical clustering |
author |
Eldridge, Justin |
title |
Clustering Consistently |
publisher |
The Ohio State University / OhioLINK |
publishDate |
2017 |
url |
http://rave.ohiolink.edu/etdc/view?acc_num=osu1512070374903249 |