Discovering motifs in ranked lists of DNA sequences.

Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP-chip (chromatin immuno-precipitatio...
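The idea of a sequence element being "enriched in a target set compared with a background set" can be illustrated on a ranked list. The sketch below is not DRIM's implementation, and the abstract does not say which statistic DRIM uses. It assumes a hypergeometric tail probability as the enrichment score and scans every possible cutoff between "target" (top of the list) and "background" (the rest). The names `hypergeom_tail` and `best_cutoff` and the 0/1 occurrence vector are hypothetical, chosen only for illustration.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n items are drawn without replacement
    from N items, of which B contain the motif."""
    upper = min(n, B)
    total = comb(N, n)
    hits = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return hits / total


def best_cutoff(occurrence):
    """occurrence: 0/1 flags, one per sequence, ordered by rank (best first).
    Returns (smallest tail probability, cutoff n) over all list prefixes."""
    N = len(occurrence)
    B = sum(occurrence)
    best = (1.0, 0)
    b = 0  # motif-containing sequences seen in the top n so far
    for n in range(1, N + 1):
        b += occurrence[n - 1]
        p = hypergeom_tail(N, B, n, b)
        if p < best[0]:
            best = (p, n)
    return best


# Hypothetical ranked list: the motif occurs only in the top three sequences.
score, cutoff = best_cutoff([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
```

Because the cutoff is chosen after looking at the data, the minimum found by such a scan is a score rather than a p-value; it would still need a correction for having tried every cutoff.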

Bibliographic Details
Main Authors:	Eran Eden, Doron Lipson, Sivan Yogev, Zohar Yakhini
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2007-03-01
Series:	PLoS Computational Biology
Online Access:	https://doi.org/10.1371/journal.pcbi.0030039

id	doaj-9a36ff1193a74f45993d98fdceff708b
record_format	Article
spelling	doaj-9a36ff1193a74f45993d98fdceff708b; 2021-04-21T15:08:58Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; 1553-734X; 1553-7358; 2007-03-01; 3; 3; e39; 10.1371/journal.pcbi.0030039; Discovering motifs in ranked lists of DNA sequences.; Eran Eden; Doron Lipson; Sivan Yogev; Zohar Yakhini; https://doi.org/10.1371/journal.pcbi.0030039
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Eran Eden Doron Lipson Sivan Yogev Zohar Yakhini
author_sort	Eran Eden
title	Discovering motifs in ranked lists of DNA sequences.
publisher	Public Library of Science (PLoS)
series	PLoS Computational Biology
issn	1553-734X 1553-7358
publishDate	2007-03-01
description	Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP-chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP-chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP-chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP-chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.
url	https://doi.org/10.1371/journal.pcbi.0030039

Similar Items