Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification

Using a generalized-clique hidden Markov model (HMM) as the starting point for a eukaryotic gene finder, the objective here is to strengthen the signal information at the transitions between coding and non-coding (c/nc) regions. This is done by enlarging the primitive hidden states associated with i...

Bibliographic Details
Main Author: Baribault, Carl
Format: Others
Published: ScholarWorks@UNO 2009
Subjects:
HMM
Online Access: http://scholarworks.uno.edu/td/1098
http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2079&context=td
id ndltd-uno.edu-oai-scholarworks.uno.edu-td-2079
record_format oai_dc
spelling ndltd-uno.edu-oai-scholarworks.uno.edu-td-2079 2016-10-21T17:05:12Z Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification Baribault, Carl 2009-12-20T08:00:00Z text application/pdf http://scholarworks.uno.edu/td/1098 http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2079&context=td University of New Orleans Theses and Dissertations ScholarWorks@UNO hidden Markov model HMM GHMM gene finding gene prediction
collection NDLTD
format Others
sources NDLTD
topic hidden Markov model
HMM
GHMM
gene finding
gene prediction
description Starting from a generalized-clique hidden Markov model (HMM) for eukaryotic gene finding, this work aims to strengthen the signal information at the transitions between coding and non-coding (c/nc) regions. This is done by enlarging the primitive hidden states, which label individual bases (as exon, intron, or junk), to substrings of primitive hidden states, called footprint states. Moreover, the allowed footprint transitions are restricted to those that include either one c/nc transition or none at all. (This effectively imposes a minimum length on exons and the other regions.) These footprint states allow the c/nc transitions to be seen sooner and their contributions to the gene-structure identification to be weighted more heavily. The weighting arises naturally from the HMM itself, as determined by the training data, rather than from an artificial gain parameter tuned on the major transitions. The generalized HMM is selected by interpolating to the highest Markov order on the emission probabilities and to the highest Markov order (subsequence length) on the footprint states. The former is accomplished via simple count cutoff rules, the latter by identifying anomalous base statistics near the major transitions using Shannon entropy. Preliminary indications, from applications to the C. elegans genome, are that the sensitivity/specificity (SN/SP) results for both individual-state and full-exon predictions are greatly enhanced by the generalized-clique HMM compared with the standard HMM. Here the standard HMM is represented by the smallest footprint-state size in the generalized-clique HMM. Even with these improvements, we observe that both extremely long and extremely short exon and intron segments would go undetected without an explicit model of state duration.
The key contributions of this effort are the full derivation and experimental confirmation of a rudimentary yet powerful and competitive gene-finding method based on a higher-order hidden Markov model. With suitable extensions, this method is expected to provide superior gene-finding capability, not only on preconditioned data sets such as those in the evaluations cited but also in the wider context of less preconditioned or raw genomic data.
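
The footprint-state construction described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the single-letter labels, the set of permitted label changes (intron/junk boundaries are assumed to be disallowed because they are not c/nc transitions), and all function names are hypothetical choices made for illustration. It enumerates footprint states of a given size and keeps only footprint-to-footprint transitions whose combined window contains at most one c/nc transition.

```python
from itertools import product

# Hypothetical single-letter codes for the primitive labels.
LABELS = ("e", "i", "j")  # exon, intron, junk

# Assumption: only coding/non-coding label changes are permitted;
# a direct intron<->junk change is excluded.
ALLOWED_CHANGES = {("j", "e"), ("e", "j"), ("e", "i"), ("i", "e")}


def label_changes(window):
    """Return the adjacent label pairs in `window` that differ."""
    return [(a, b) for a, b in zip(window, window[1:]) if a != b]


def is_valid_window(window):
    """A window is valid if it holds at most one change, and that change is c/nc."""
    changes = label_changes(window)
    return len(changes) <= 1 and all(c in ALLOWED_CHANGES for c in changes)


def footprint_states(size):
    """Enumerate footprint states: label substrings of length `size`."""
    return [w for w in product(LABELS, repeat=size) if is_valid_window(w)]


def footprint_transitions(size):
    """Enumerate allowed one-base shifts between footprint states.

    The shift f -> g is kept only if the (size + 1)-long window spanned
    by both footprints contains one c/nc transition or none.
    """
    transitions = []
    for f in footprint_states(size):
        for label in LABELS:
            if is_valid_window(f + (label,)):
                transitions.append((f, f[1:] + (label,)))
    return transitions


if __name__ == "__main__":
    for size in (1, 2, 3):
        print(size, len(footprint_states(size)), len(footprint_transitions(size)))
```

In this sketch, a footprint such as `("e", "i", "e")` is rejected because it contains two c/nc transitions, which is how a restriction of this kind rules out very short regions. The smallest footprint size (1) reduces to the three primitive labels, corresponding to the standard HMM baseline mentioned in the abstract.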
author Baribault, Carl
title Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification
publisher ScholarWorks@UNO
publishDate 2009
url http://scholarworks.uno.edu/td/1098
http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2079&context=td