Gene prediction using a configurable system for the integration of data by dynamic programming

Bibliographic Details
Main Author: Howe, K.
Published: University of Cambridge 2003
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.604280
Description
Summary: A new approach to the computational identification of protein-coding gene structures in genomic DNA sequence is described. It overcomes rigidities inherent in most existing gene prediction methods, for example those based on Hidden Markov Models (HMMs), by supporting a flexible computational model of how sequence signals fit together into complete gene structures. The primary result of the work is a gene prediction tool for the assembly of evidence for individual gene components (features) into predictions of complete gene structures. The system is completely configurable in that both the features themselves and the model of gene structure against which candidate assemblies are validated and scored are external to the system and supplied by the user. The gene prediction process is therefore tied neither to any specific techniques for the recognition of gene prediction signals, nor to any specific underlying model of gene structure. The methodology is implemented in a piece of software called “GAZE”, which uses a dynamic programming algorithm to obtain the highest-scoring gene structure consistent with the user-supplied features and gene-structure model, and also the posterior probability that each feature is part of a gene. The algorithm employs a novel pruning strategy, ensuring that it has a runtime effectively linear in the length of the sequence without compromising accuracy. The effectiveness of the strategy is explored by applying it to the prediction of gene structures in sequences of the nematode worm C. elegans. GAZE allows the integration of gene prediction data from multiple, arbitrary sources. It is important for the accuracy of the system that the various pieces of evidence are weighted appropriately with respect to each other. A novel strategy for the automatic determination of optimal values for these weights is described.
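
The general idea of assembling scored features into a highest-scoring structure by dynamic programming can be illustrated with a minimal Python sketch. It is not GAZE's implementation: the names `Feature`, `best_path`, `MODEL` and `FEATURES`, the feature kinds, the scores and the distance limits are all hypothetical, and the sketch omits the pruning strategy, posterior probabilities and evidence weighting mentioned in the summary (it compares every pair of features, so it is quadratic rather than linear in the number of features).

```python
from dataclasses import dataclass

NEG_INF = float("-inf")


@dataclass(frozen=True)
class Feature:
    kind: str     # hypothetical feature type, e.g. "start", "donor"
    pos: int      # position in the sequence
    score: float  # evidence score supplied by an external predictor


def best_path(features, model):
    """Return (score, path) for the highest-scoring chain of features.

    `model` maps (source_kind, target_kind) to (min_dist, max_dist);
    any pair of kinds not listed is not allowed to be adjacent.
    """
    feats = sorted(features, key=lambda f: f.pos)
    best = [NEG_INF] * len(feats)   # best score of a chain ending at i
    back = [None] * len(feats)      # predecessor index for traceback
    for i, target in enumerate(feats):
        if target.kind == "BEGIN":
            best[i] = target.score
            continue
        for j in range(i):
            source = feats[j]
            rule = model.get((source.kind, target.kind))
            if rule is None or best[j] == NEG_INF:
                continue
            min_dist, max_dist = rule
            if not min_dist <= target.pos - source.pos <= max_dist:
                continue
            candidate = best[j] + target.score
            if candidate > best[i]:
                best[i], back[i] = candidate, j
    ends = [i for i, f in enumerate(feats)
            if f.kind == "END" and best[i] > NEG_INF]
    if not ends:
        return None, []
    i = max(ends, key=lambda k: best[k])
    total, path = best[i], []
    while i is not None:
        path.append(feats[i])
        i = back[i]
    return total, path[::-1]


FAR = 10**9  # stand-in for "no upper limit on distance"

# Hypothetical gene-structure model: which feature kinds may follow which.
MODEL = {
    ("BEGIN", "start"): (0, FAR),
    ("start", "donor"): (3, FAR),
    ("donor", "acceptor"): (30, FAR),   # assumed minimum intron length
    ("acceptor", "stop"): (3, FAR),
    ("start", "stop"): (3, FAR),        # single-exon gene
    ("stop", "END"): (0, FAR),
}

# Hypothetical features, as if produced by external signal predictors.
FEATURES = [
    Feature("BEGIN", 0, 0.0),
    Feature("start", 10, 2.0),
    Feature("donor", 50, 1.0),
    Feature("donor", 60, 3.0),
    Feature("acceptor", 120, 1.5),
    Feature("stop", 200, 2.0),
    Feature("END", 300, 0.0),
]

total, path = best_path(FEATURES, MODEL)
print(total, [(f.kind, f.pos) for f in path])
```

Because the features and the model are passed in as plain data, either can be changed without touching `best_path`, which mirrors in a very reduced form the configurability described in the summary.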