Classifying Coding DNA with Nucleotide Statistics

In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate...
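The two ingredients named in the summary, purine bias by codon position and stop codon frequency, can be illustrated with a short sketch. This is not the authors' UFM score. It only computes the raw per-frame statistics such a score could be built from, assuming the standard genetic code (stop codons TAA, TAG, TGA) and forward-strand frames only. All function names are hypothetical.

```python
# Illustrative sketch only: per-frame nucleotide statistics, NOT the published UFM score.
# Assumptions: standard genetic code stop codons; input is a plain DNA string (A, C, G, T).

PURINES = frozenset("AG")
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def codons(seq, frame=0):
    """Split a DNA string into complete triplets starting at offset `frame`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def purine_probabilities(triplets):
    """Fraction of purines (A or G) at codon positions 1, 2 and 3."""
    if not triplets:
        return (0.0, 0.0, 0.0)
    n = len(triplets)
    return tuple(sum(c[pos] in PURINES for c in triplets) / n for pos in range(3))


def stop_codon_frequency(triplets):
    """Fraction of triplets that are stop codons."""
    if not triplets:
        return 0.0
    return sum(c in STOP_CODONS for c in triplets) / len(triplets)


def frame_features(seq):
    """Collect both statistics for the three forward reading frames."""
    features = {}
    for frame in range(3):
        triplets = codons(seq, frame)
        features[frame] = {
            "purine_probs": purine_probabilities(triplets),
            "stop_freq": stop_codon_frequency(triplets),
        }
    return features


if __name__ == "__main__":
    for frame, stats in frame_features("ATGGCAGAATAA").items():
        print(frame, stats)
```

In a real coding sequence, the coding frame would be expected to show few in-frame stop codons and a purine excess at the first codon position. How these statistics are combined into a score and thresholded is defined in the article itself, not here.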


Bibliographic Details
Main Authors:	Nicolas Carels, Diego Frías
Format:	Article
Language:	English
Published:	SAGE Publishing 2009-10-01
Series:	Bioinformatics and Biology Insights
Online Access:	http://la-press.com/classifying-coding-dna-with-nucleotide-statistics-a1718

id	doaj-3178be3458be450dafdad47de8c7cdd4
record_format	Article
spelling	doaj-3178be3458be450dafdad47de8c7cdd4; 2020-11-25T03:32:20Z; eng; SAGE Publishing; Bioinformatics and Biology Insights; 1177-9322; 2009-10-01; 2009; 3; 141; 154; Classifying Coding DNA with Nucleotide Statistics; Nicolas Carels; Diego Frías
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Nicolas Carels Diego Frías
title	Classifying Coding DNA with Nucleotide Statistics
publisher	SAGE Publishing
series	Bioinformatics and Biology Insights
issn	1177-9322
publishDate	2009-10-01
description	In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called the Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets, and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method releases coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
url	http://la-press.com/classifying-coding-dna-with-nucleotide-statistics-a1718

Similar Items