The shape of gene expression distributions matter: how incorporating distribution shape improves the interpretation of cancer transcriptomic data

Abstract. Background: In genomics, we often assume that continuous data, such as gene expression, follow a specific kind of distribution. However, we rarely stop to question the validity of this assumption, or consider how broadly applicable it may be to all genes that are in the transcriptome. Our stu...

Bibliographic Details
Main Authors: Laurence de Torrenté, Samuel Zimmerman, Masako Suzuki, Maximilian Christopeit, John M. Greally, Jessica C. Mar
Format: Article
Language: English
Published: BMC 2020-12-01
Series: BMC Bioinformatics
Subjects: Gene expression; Multi-modality; Non-normal distribution; Survival analysis; Cancer genomics
Online Access: https://doi.org/10.1186/s12859-020-03892-w
id doaj-2ae0bac7450f46ae973414c89a644fd8
record_format Article
spelling doaj-2ae0bac7450f46ae973414c89a644fd8, 2021-01-03T12:21:21Z, eng, BMC, BMC Bioinformatics, ISSN 1471-2105, 2020-12-01, volume 21, issue S21, pages 1-18, doi 10.1186/s12859-020-03892-w
The shape of gene expression distributions matter: how incorporating distribution shape improves the interpretation of cancer transcriptomic data
Laurence de Torrenté (Department of Systems and Computational Biology, Albert Einstein College of Medicine)
Samuel Zimmerman (Department of Systems and Computational Biology, Albert Einstein College of Medicine)
Masako Suzuki (Center for Epigenomics and Department of Genetics, Albert Einstein College of Medicine)
Maximilian Christopeit (Internal Medicine II, Hematology, Oncology, Clinical Immunology and Rheumatology, University Hospital Tuebingen)
John M. Greally (Center for Epigenomics and Department of Genetics, Albert Einstein College of Medicine)
Jessica C. Mar (Department of Systems and Computational Biology, Albert Einstein College of Medicine)
collection DOAJ
language English
format Article
sources DOAJ
issn 1471-2105
description Abstract
Background: In genomics, we often assume that continuous data, such as gene expression, follow a specific kind of distribution. However, we rarely stop to question the validity of this assumption, or consider how broadly applicable it may be to all genes that are in the transcriptome. Our study investigated the prevalence of a range of gene expression distributions in three different tumor types from The Cancer Genome Atlas (TCGA).
Results: Surprisingly, the expression of less than 50% of all genes was Normally-distributed, with other distributions including Gamma, Bimodal, Cauchy, and Lognormal also represented. Most of the distribution categories contained genes that were significantly enriched for unique biological processes. Different assumptions based on the shape of the expression profile were used to identify genes that could discriminate between patients with good versus poor survival. The prognostic marker genes that were identified when the shape of the distribution was accounted for reflected functional insights into cancer biology that were not observed when standard assumptions were applied. We showed that when multiple types of distributions were permitted, i.e. the shape of the expression profile was used, the statistical classifiers had greater predictive accuracy for determining the prognosis of a patient versus those that assumed only one type of gene expression distribution.
Conclusions: Our results highlight the value of studying a gene’s distribution shape to model heterogeneity of transcriptomic data and the impact of using analyses that permit more than one type of gene expression distribution. These insights would have been overlooked when using standard approaches that assume all genes follow the same type of distribution in a patient cohort.
topic Gene expression
Multi-modality
Non-normal distribution
Survival analysis
Cancer genomics