Missing genes in the annotation of prokaryotic genomes

Abstract Background Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small ge...


Bibliographic Details
Main Authors:	Feng Wu-chun, Archuleta Jeremy, Warren Andrew S, Setubal João
Format:	Article
Language:	English
Published:	BMC 2010-03-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/11/131

id	doaj-3e50c3f519e94a98bc168ed7ae44ab0e
record_format	Article
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Feng Wu-chun Archuleta Jeremy Warren Andrew S Setubal João
title	Missing genes in the annotation of prokaryotic genomes
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2010-03-01
description	Abstract. Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question arises as to whether current genome annotations have systematically missing, small genes. Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
url	http://www.biomedcentral.com/1471-2105/11/131


Similar Items