Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification

Using a generalized-clique hidden Markov model (HMM) as the starting point for a eukaryotic gene finder, the objective here is to strengthen the signal information at the transitions between coding and non-coding (c/nc) regions. This is done by enlarging the primitive hidden states associated with i...


Bibliographic Details
Main Author:	Baribault, Carl
Format:	Others
Published:	ScholarWorks@UNO 2009
Subjects:	hidden Markov model HMM GHMM gene finding gene prediction
Online Access:	http://scholarworks.uno.edu/td/1098 http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2079&context=td

id	ndltd-uno.edu-oai-scholarworks.uno.edu-td-2079
record_format	oai_dc
spelling	ndltd-uno.edu-oai-scholarworks.uno.edu-td-2079 2016-10-21T17:05:12Z Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification Baribault, Carl 2009-12-20T08:00:00Z text application/pdf http://scholarworks.uno.edu/td/1098 http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2079&context=td University of New Orleans Theses and Dissertations ScholarWorks@UNO hidden Markov model HMM GHMM gene finding gene prediction
collection	NDLTD
format	Others
sources	NDLTD
topic	hidden Markov model HMM GHMM gene finding gene prediction
spellingShingle	hidden Markov model HMM GHMM gene finding gene prediction Baribault, Carl Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification
description	Using a generalized-clique hidden Markov model (HMM) as the starting point for a eukaryotic gene finder, the objective here is to strengthen the signal information at the transitions between coding and non-coding (c/nc) regions. This is done by enlarging the primitive hidden states associated with individual base labeling (as exon, intron, or junk) to substrings of primitive hidden states, called footprint states. Moreover, the allowed footprint transitions are restricted to those that include either one c/nc transition or none at all. (This effectively imposes a minimum length on exons and the other regions.) These footprint states allow the c/nc transitions to be seen sooner and to have their contributions to the gene-structure identification weighted more heavily, with a natural weighting determined by the HMM itself from the training data rather than by an artificial gain parameter tuned on the major transitions. The generalized HMM is selected by interpolating to the highest Markov order on emission probabilities and to the highest Markov order (subsequence length) on the footprint states. The former is accomplished via simple count cutoff rules, the latter via an identification of anomalous base statistics near the major transitions using Shannon entropy. Preliminary indications, from applications to the C. elegans genome, are that the sensitivity/specificity (SN/SP) results for both individual-state and full-exon predictions are greatly enhanced using the generalized-clique HMM when compared to the standard HMM. Here the standard HMM is represented by the choice of the smallest size of footprint state in the generalized-clique HMM. Even with these improvements, we observe that both extremely long and extremely short exon and intron segments would go undetected without an explicit model of state duration.
The key contributions of this effort are the full derivation and experimental confirmation of a rudimentary, yet powerful and competitive, gene-finding method based on a higher-order hidden Markov model. With suitable extensions, this method is expected to provide superior gene-finding capability, not only in the context of preconditioned data sets as in the evaluations cited but also in the wider context of less preconditioned and/or raw genomic data.
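The footprint-state idea in the abstract can be illustrated with a small sketch. It is not the thesis implementation: the label letters (e, i, j), the function names, and the choice to treat only exon/intron and exon/junk changes as c/nc transitions are assumptions made for illustration. The sketch enumerates label substrings of length k with at most one c/nc change and checks whether one footprint may follow another after a one-base shift.

```python
from itertools import product

# Illustrative primitive labels: e = exon, i = intron, j = junk.
LABELS = "eij"
# Assumption: only exon<->intron and exon<->junk count as c/nc changes;
# a direct intron<->junk change is excluded in this sketch.
CNC_CHANGES = {("e", "i"), ("i", "e"), ("e", "j"), ("j", "e")}

def label_changes(s):
    """List the adjacent label pairs in s where the label changes."""
    return [(a, b) for a, b in zip(s, s[1:]) if a != b]

def is_valid(s):
    """True if s contains at most one label change, and it is a c/nc change."""
    changes = label_changes(s)
    return len(changes) <= 1 and all(c in CNC_CHANGES for c in changes)

def footprints(k):
    """Enumerate all length-k footprint states under the rule above."""
    candidates = ("".join(p) for p in product(LABELS, repeat=k))
    return [s for s in candidates if is_valid(s)]

def allowed_transition(f, g):
    """True if g overlaps f shifted by one base and the combined
    (k+1)-base window still holds at most one c/nc change."""
    return f[1:] == g[:-1] and is_valid(f + g[-1])

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, len(footprints(k)), footprints(k))
    print(allowed_transition("eei", "eii"))
    print(allowed_transition("eei", "eie"))
```

Under these assumptions there are 3 + 4(k - 1) footprints of length k, so k = 1 recovers the three primitive states of the standard HMM. Rejecting windows with two changes is what forces a minimum region length: a region shorter than the window cannot fit between two transitions.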
author	Baribault, Carl
title	Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification
title_sort	meta state generalized hidden markov model for eukaryotic gene structure identification
publisher	ScholarWorks@UNO
publishDate	2009
url	http://scholarworks.uno.edu/td/1098 http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2079&context=td
work_keys_str_mv	AT baribaultcarl metastategeneralizedhiddenmarkovmodelforeukaryoticgenestructureidentification
_version_	1718388092334768128


Similar Items