Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models

Abstract. Background: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the...

Bibliographic Details
Main Authors: Arnaud Liehrmann, Guillem Rigaill, Toby Dylan Hocking
Format: Article
Language: English
Published: BMC 2021-06-01
Series: BMC Bioinformatics
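Subjects: ChIP-seq; Histone modifications; Over-dispersion; Peak calling; Multiple changepoint detection; Likelihood inference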
Online Access: https://doi.org/10.1186/s12859-021-04221-5
id doaj-db1bccdf7a7840928ccebf2e210059ca
record_format Article
spelling doaj-db1bccdf7a7840928ccebf2e210059ca
2021-06-20T11:50:56Z
eng
BMC
BMC Bioinformatics
1471-2105
2021-06-01
22 1 1 18
10.1186/s12859-021-04221-5
Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
Arnaud Liehrmann: Institut des Sciences des Plantes de Paris-Saclay (IPS2), Université Paris-Saclay, Université Evry, CNRS, INRAE
Guillem Rigaill: Institut des Sciences des Plantes de Paris-Saclay (IPS2), Université Paris-Saclay, Université Evry, CNRS, INRAE
Toby Dylan Hocking: School of Informatics, Computing, and Cyber Systems (SICCS), Northern Arizona University
collection DOAJ
language English
format Article
sources DOAJ
author Arnaud Liehrmann
Guillem Rigaill
Toby Dylan Hocking
title Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2021-06-01
description Abstract. Background: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions, such as a Poisson distribution for the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. Results: Our comparisons on seven reference datasets of histone modifications (H3K36me3 and H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model with alternative noise assumptions and supervised learning of the penalty parameter reduces the over-dispersion exhibited by count data. These models, implemented in the R package CROCS (https://github.com/aLiehrmann/CROCS), detect the peaks more accurately than algorithms that rely on natural assumptions. Conclusion: The segmentation models we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications.
topic ChIP-seq
Histone modifications
Over-dispersion
Peak calling
Multiple changepoint detection
Likelihood inference
url https://doi.org/10.1186/s12859-021-04221-5