Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models.

Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the...

Bibliographic Details
Main Authors: Dmitry Svetlichnyy, Hana Imrichova, Mark Fiers, Zeynep Kalender Atak, Stein Aerts
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2015-11-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1004590
id doaj-11086cce95c6426f858f1a455260a575
record_format Article
spelling doaj-11086cce95c6426f858f1a455260a575 | 2021-04-21T14:59:12Z | eng | Public Library of Science (PLoS) | PLoS Computational Biology | 1553-734X | 1553-7358 | 2015-11-01 | 11 | 11 | e1004590 | 10.1371/journal.pcbi.1004590 | Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models. | Dmitry Svetlichnyy; Hana Imrichova; Mark Fiers; Zeynep Kalender Atak; Stein Aerts | https://doi.org/10.1371/journal.pcbi.1004590
collection DOAJ
language English
format Article
sources DOAJ
author Dmitry Svetlichnyy
Hana Imrichova
Mark Fiers
Zeynep Kalender Atak
Stein Aerts
title Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2015-11-01
description Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a "gain-of-target" for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes.
url https://doi.org/10.1371/journal.pcbi.1004590
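
The description above defines PRIME as a comparison between the predicted activity of the mutated region and that of the reference sequence. The following is a minimal sketch of that comparison only, under stated assumptions: the article trains a Random Forest model per transcription factor, whereas this example swaps in a toy log-odds motif scorer so that it runs with the standard library alone. The matrix `TOY_PWM` and the functions `region_score`, `apply_snv`, and `delta_score` are hypothetical names introduced for illustration and are not taken from the article.

```python
import math

# Hypothetical toy position weight matrix (one dict of base probabilities per
# motif position). It stands in for a trained per-factor model and is NOT the
# Random Forest described in the article.
TOY_PWM = [
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def window_score(window):
    """Log-odds score of one motif-length window against the toy matrix."""
    return sum(math.log2(col[base] / BACKGROUND)
               for col, base in zip(TOY_PWM, window))


def region_score(seq):
    """Best window score in the region, used here as a stand-in for the
    predicted regulatory activity of the whole sequence."""
    k = len(TOY_PWM)
    if len(seq) < k:
        raise ValueError("sequence is shorter than the motif")
    return max(window_score(seq[i:i + k]) for i in range(len(seq) - k + 1))


def apply_snv(seq, pos, alt):
    """Return a copy of seq with the base at 0-based index pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def delta_score(ref_seq, pos, alt):
    """Mutant-minus-reference score: positive values suggest a predicted gain
    of activity, negative values a predicted loss."""
    return region_score(apply_snv(ref_seq, pos, alt)) - region_score(ref_seq)


if __name__ == "__main__":
    reference = "GATTCAATGGTTAG"
    # A T-to-C substitution at index 7 completes a match to the toy motif.
    print(delta_score(reference, 7, "C"))
```

In this toy setting a substitution that completes a motif match yields a positive difference, which loosely mirrors the "gain-of-target" case mentioned in the description. A real analysis would replace `region_score` with the output of a trained classifier.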