Using random forests for assistance in the curation of G-protein coupled receptor databases

Abstract. Background: Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-cou...

Bibliographic Details
Main Authors:	Aleksei Shkurin, Alfredo Vellido
Format:	Article
Language:	English
Published:	BMC 2017-08-01
Series:	BioMedical Engineering OnLine
Subjects:	G-Protein coupled receptors Machine learning Random forests Database curation
Online Access:	http://link.springer.com/article/10.1186/s12938-017-0357-4

id	doaj-b3e28282e46a4ed0a5f33533baa11a5f
record_format	Article
spelling	doaj-b3e28282e46a4ed0a5f33533baa11a5f; 2020-11-24T21:18:33Z; eng; BMC; BioMedical Engineering OnLine; 1475-925X; 2017-08-01; 16S1121; 10.1186/s12938-017-0357-4; Using random forests for assistance in the curation of G-protein coupled receptor databases; Aleksei Shkurin (0), Alfredo Vellido (1); (0) Department of Computer Science, Universitat Politècnica de Catalunya; (1) Department of Computer Science, Universitat Politècnica de Catalunya; http://link.springer.com/article/10.1186/s12938-017-0357-4; G-Protein coupled receptors; Machine learning; Random forests; Database curation
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Aleksei Shkurin Alfredo Vellido
title	Using random forests for assistance in the curation of G-protein coupled receptor databases
publisher	BMC
series	BioMedical Engineering OnLine
issn	1475-925X
publishDate	2017-08-01
description	Abstract. Background: Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of a known crystal structure, such discrimination must rely on their primary amino acid sequences. Methods: We are interested not so much in achieving maximum sub-family discrimination accuracy using quantitative methods as in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers. Results: Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each sub-family and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task. Conclusion: The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide GPCR database curators with a tool for quality control.
topic	G-Protein coupled receptors Machine learning Random forests Database curation
url	http://link.springer.com/article/10.1186/s12938-017-0357-4
_version_	1726008501224865792
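
The abstract describes misclassification consistency as something read off the votes of the individual trees in a random forest: a sequence is of interest when most trees get it wrong and nearly all of the wrong votes point to the same sub-family. The record does not give the authors' exact definition, so the following is only a minimal sketch of that idea. The function names, the two thresholds (`min_error_rate`, `min_share`), and the placeholder labels "A", "B", "C" are all assumptions made for illustration; per-tree votes are taken as plain lists rather than coming from any particular library.

```python
from collections import Counter


def misclassification_consistency(tree_votes, true_label):
    """Summarise how the base trees voted for one sequence.

    tree_votes: list of labels, one per base tree (hypothetical input format).
    Returns (error_rate, top_wrong_label, top_wrong_share), where
    top_wrong_share is the fraction of wrong votes that went to the
    single most frequent wrong label.
    """
    wrong = [vote for vote in tree_votes if vote != true_label]
    if not wrong:
        return 0.0, None, 0.0
    label, count = Counter(wrong).most_common(1)[0]
    return len(wrong) / len(tree_votes), label, count / len(wrong)


def shortlist(votes_by_sequence, true_labels, min_error_rate=0.8, min_share=0.9):
    """List sequences that are often misclassified, mostly to one wrong label.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    selected = []
    for seq_id, votes in votes_by_sequence.items():
        rate, label, share = misclassification_consistency(votes, true_labels[seq_id])
        if rate >= min_error_rate and share >= min_share:
            selected.append((seq_id, label, rate, share))
    return selected


# Toy votes from a 10-tree ensemble; all three sequences are labelled "A".
votes = {
    "seq1": ["B"] * 9 + ["A"],               # wrong often, always towards "B"
    "seq2": ["A"] * 10,                      # always correct
    "seq3": ["B"] * 4 + ["C"] * 4 + ["A"] * 2,  # wrong often, but inconsistently
}
labels = {"seq1": "A", "seq2": "A", "seq3": "A"}
result = shortlist(votes, labels)
print(result)
```

In this toy data only `seq1` would be referred for expert review: `seq3` is misclassified just as often by the threshold, but its wrong votes are split between two labels, which is the inconsistent case the abstract sets aside.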
