Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities

Finding meaningful clustering patterns in data can be very challenging when the clusters are of arbitrary shapes or of different sizes or densities, and especially when the data set contains a high percentage (e.g., 80%) of noise. Unfortunately, most existing clustering techniques cannot properly handle t...

Bibliographic Details
Main Authors: Tian-Tian Zhang, Bo Yuan
Format: Article
Language: English
Published: IEEE 2018-01-01
Series: IEEE Access
Subjects: Multiscale analysis; density-based clustering; heterogeneous clusters; strong noise
Online Access: https://ieeexplore.ieee.org/document/8359265/
id doaj-dd19b433396c4d80a45c6062582216a4
record_format Article
spelling doaj-dd19b433396c4d80a45c6062582216a4
2021-03-29T21:10:27Z
eng
IEEE
IEEE Access
2169-3536
2018-01-01
6
25861
25873
10.1109/ACCESS.2018.2836389
8359265
Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities
Tian-Tian Zhang 0 https://orcid.org/0000-0002-0204-4758
Bo Yuan 1
Intelligent Computing Lab, Division of Informatics, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
Intelligent Computing Lab, Division of Informatics, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
Finding meaningful clustering patterns in data can be very challenging when the clusters are of arbitrary shapes or of different sizes or densities, and especially when the data set contains a high percentage (e.g., 80%) of noise. Unfortunately, most existing clustering techniques cannot properly handle this tough situation and often suffer dramatically degraded performance. In this paper, a purposefully designed clustering algorithm called Density-Based Multiscale Analysis for Clustering (DBMAC)-II is proposed, which is an improved version of the latest strong-noise clustering algorithm DBMAC. DBMAC was designed under the assumption that all clusters are homogeneous, so it does not work well when the data set contains clusters of varying densities. DBMAC-II overcomes this limitation by executing the multiscale analysis iteratively and can conduct strong noise-robust clustering without any strict assumption on the shapes and densities of clusters. In DBMAC-II, as in DBMAC, each data point or object is mapped into a feature space using its r-neighborhood statistics with different r (radius) values. In general, the higher the value of the r-neighborhood statistics, the more likely the object is to be considered a “clustered” object. Instead of trying to find a single optimal r value, a set of radius values appropriate for separating “clustered” objects and “noisy” objects is identified using a formal statistical test for multimodality, a procedure referred to as multiscale analysis. For clusters with varying densities, multiscale analysis is applied iteratively to extract the clusters with the highest density from the current data set. Moreover, a statistical uniformity test for measuring clustering tendency is used as the self-adaptive stopping criterion of the iteration. Comprehensive experimental studies on a series of challenging benchmark data sets demonstrate that DBMAC-II is not only superior to classical density-based clustering approaches, including DBSCAN, OPTICS, and HDBSCAN, but also consistently outperforms the latest strong-noise robust clustering techniques, such as Skinny-dip.
https://ieeexplore.ieee.org/document/8359265/
Multiscale analysis
density-based clustering
heterogeneous clusters
strong noise
collection DOAJ
language English
format Article
sources DOAJ
author Tian-Tian Zhang
Bo Yuan
spellingShingle Tian-Tian Zhang
Bo Yuan
Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities
IEEE Access
Multiscale analysis
density-based clustering
heterogeneous clusters
strong noise
author_facet Tian-Tian Zhang
Bo Yuan
author_sort Tian-Tian Zhang
title Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities
title_short Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities
title_full Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities
title_fullStr Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities
title_full_unstemmed Density-Based Multiscale Analysis for Clustering in Strong Noise Settings With Varying Densities
title_sort density-based multiscale analysis for clustering in strong noise settings with varying densities
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2018-01-01
description Finding meaningful clustering patterns in data can be very challenging when the clusters are of arbitrary shapes or of different sizes or densities, and especially when the data set contains a high percentage (e.g., 80%) of noise. Unfortunately, most existing clustering techniques cannot properly handle this tough situation and often suffer dramatically degraded performance. In this paper, a purposefully designed clustering algorithm called Density-Based Multiscale Analysis for Clustering (DBMAC)-II is proposed, which is an improved version of the latest strong-noise clustering algorithm DBMAC. DBMAC was designed under the assumption that all clusters are homogeneous, so it does not work well when the data set contains clusters of varying densities. DBMAC-II overcomes this limitation by executing the multiscale analysis iteratively and can conduct strong noise-robust clustering without any strict assumption on the shapes and densities of clusters. In DBMAC-II, as in DBMAC, each data point or object is mapped into a feature space using its r-neighborhood statistics with different r (radius) values. In general, the higher the value of the r-neighborhood statistics, the more likely the object is to be considered a “clustered” object. Instead of trying to find a single optimal r value, a set of radius values appropriate for separating “clustered” objects and “noisy” objects is identified using a formal statistical test for multimodality, a procedure referred to as multiscale analysis. For clusters with varying densities, multiscale analysis is applied iteratively to extract the clusters with the highest density from the current data set. Moreover, a statistical uniformity test for measuring clustering tendency is used as the self-adaptive stopping criterion of the iteration. Comprehensive experimental studies on a series of challenging benchmark data sets demonstrate that DBMAC-II is not only superior to classical density-based clustering approaches, including DBSCAN, OPTICS, and HDBSCAN, but also consistently outperforms the latest strong-noise robust clustering techniques, such as Skinny-dip.
topic Multiscale analysis
density-based clustering
heterogeneous clusters
strong noise
url https://ieeexplore.ieee.org/document/8359265/
work_keys_str_mv AT tiantianzhang densitybasedmultiscaleanalysisforclusteringinstrongnoisesettingswithvaryingdensities
AT boyuan densitybasedmultiscaleanalysisforclusteringinstrongnoisesettingswithvaryingdensities
_version_ 1724193460557185024
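
Note on the abstract: the mapping of each object into a feature space of r-neighborhood statistics can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say which statistic DBMAC-II uses, so the sketch assumes the simplest candidate, the number of other points within distance r. The function name, the toy data, and the radius values are all hypothetical, and the multimodality and uniformity tests are not shown.

```python
import math
import random


def r_neighborhood_counts(points, radii):
    """Map each point to a feature vector: one neighbor count per radius.

    Assumption: the "r-neighborhood statistic" is taken to be the number of
    OTHER points within Euclidean distance r. The paper may define it differently.
    """
    features = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        features.append([sum(d <= r for d in dists) for r in radii])
    return features


random.seed(0)
# Hypothetical toy data in 2-D: one tight blob of 30 points
# plus 120 uniformly scattered noise points (80% noise).
cluster = [(random.gauss(0.5, 0.02), random.gauss(0.5, 0.02)) for _ in range(30)]
noise = [(random.random(), random.random()) for _ in range(120)]
points = cluster + noise

radii = [0.02, 0.05, 0.1]  # assumed radius values, chosen only for illustration
features = r_neighborhood_counts(points, radii)

# Average count per radius, separately for blob points and noise points.
mean_cluster = [sum(f[k] for f in features[:30]) / 30 for k in range(len(radii))]
mean_noise = [sum(f[k] for f in features[30:]) / 120 for k in range(len(radii))]
print("mean counts, clustered points:", mean_cluster)
print("mean counts, noise points:    ", mean_noise)
```

On data like this, the blob points tend to have larger counts than the noise points at every radius, which matches the abstract's statement that higher r-neighborhood statistics make an object more likely to be "clustered". Deciding which radii separate the two groups well is the job of the multimodality test that the abstract calls multiscale analysis.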