Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer

Clustering is used widely in ‘omics’ studies and is often tackled with standard methods such as hierarchical clustering or k-means, which are limited to a single data type. These methods are further limited by the need to select a cut-off point at a specific level of the dendrogram (a tree diagr...

Bibliographic Details
Main Author: Binti Zainul Abidin, Fatin Nurzahirah
Other Authors: Westhead, David Robert ; Boyes, Joan
Published: University of Leeds 2017
Subjects:
570
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.729461
id ndltd-bl.uk-oai-ethos.bl.uk-729461
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-729461
2019-03-05T15:48:14Z
Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer
Binti Zainul Abidin, Fatin Nurzahirah
Westhead, David Robert ; Boyes, Joan
2017
570
University of Leeds
https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.729461
http://etheses.whiterose.ac.uk/18883/
Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 570
spellingShingle 570
Binti Zainul Abidin, Fatin Nurzahirah
Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer
description Clustering is used widely in ‘omics’ studies and is often tackled with standard methods such as hierarchical clustering or k-means, which are limited to a single data type. These methods are further limited by the need either to select a cut-off point at a specific level of the dendrogram (a tree diagram) or to pre-define the number of clusters, respectively. The increasing need to integrate multiple data sets creates a requirement for clustering methods applicable to mixed data types, where the straightforward application of standard methods is not necessarily the best approach. A particularly common problem involves clustering entities characterized by a mixture of binary data (for example, presence or absence of mutations, binding, motifs, and/or epigenetic marks) and continuous data (for example, gene expression, protein abundance and/or metabolite levels). In this work, we present a generic method based on a probabilistic model for clustering this mixture of data types, and illustrate its application to genetic regulation and the clustering of cancer samples. The method uses penalized maximum likelihood (ML) estimation of mixture model parameters, with information criteria as the model selection objective function, and meta-heuristic searches for the optimum clusters. The compatibility of several information criteria with our model-based joint clustering was tested, including the well-known Akaike Information Criterion (AIC) and its empirically determined derivatives (AICλ), the Bayesian Information Criterion (BIC) and its derivative (CAIC), and the Hannan-Quinn Criterion (HQC). Experiments with simulated data show that AIC and AICλ (λ=2.5) work well with our method. We show that the resulting clusters lead to useful hypotheses: in the case of genetic regulation, these concern the regulation of groups of genes by specific sets of transcription factors; in the case of cancer samples, combinations of gene mutations are related to patterns of gene expression. The clusters have potential mechanistic significance and, in the latter case, are significantly linked to survival.
author2 Westhead, David Robert ; Boyes, Joan
author_facet Westhead, David Robert ; Boyes, Joan
Binti Zainul Abidin, Fatin Nurzahirah
author Binti Zainul Abidin, Fatin Nurzahirah
author_sort Binti Zainul Abidin, Fatin Nurzahirah
title Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer
publisher University of Leeds
publishDate 2017
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.729461