Sequence count data are poorly fit by the negative binomial distribution.

Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a...

Bibliographic Details
Main Authors: Stijn Hawinkel, J C W Rayner, Luc Bijnens, Olivier Thas
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0224909
id doaj-4c2341a075444a0d937155230ec257ff
record_format Article
spelling doaj-4c2341a075444a0d937155230ec257ff | 2021-03-03T21:41:30Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2020-01-01 | 15 | 4 | e0224909 | 10.1371/journal.pone.0224909 | Sequence count data are poorly fit by the negative binomial distribution. | Stijn Hawinkel | J C W Rayner | Luc Bijnens | Olivier Thas | Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness of fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods. | https://doi.org/10.1371/journal.pone.0224909
collection DOAJ
language English
format Article
sources DOAJ
author Stijn Hawinkel
J C W Rayner
Luc Bijnens
Olivier Thas
title Sequence count data are poorly fit by the negative binomial distribution.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2020-01-01
description Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness of fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.
url https://doi.org/10.1371/journal.pone.0224909