Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization

The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, to improve prediction results in terms of both accuracy an...

Bibliographic Details
Main Authors: Warin Wattanapornprom, Chinae Thammarongtham, Apiradee Hongsthong, Supatcha Lertampaiporn
Format: Article
Language: English
Published: MDPI AG 2021-03-01
Series: Life
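Subjects: ensemble machine learning; plant protein; feature extraction; feature selection; go term; consensus voting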
Online Access: https://www.mdpi.com/2075-1729/11/4/293
id doaj-03b58a2d5b7942be957f93c614202135
record_format Article
spelling doaj-03b58a2d5b7942be957f93c614202135
2021-03-30T23:02:47Z
eng
MDPI AG
Life
2075-1729
2021-03-01
11
293
293
10.3390/life11040293
Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization
Warin Wattanapornprom (0)
Chinae Thammarongtham (1)
Apiradee Hongsthong (2)
Supatcha Lertampaiporn (3)
(0) Applied Computer Science Program, Department of Mathematics, Faculty of Science, King Mongkut’s University of Technology Thonburi, Bangkok 10140, Thailand
(1) Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
(2) Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
(3) Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, to improve prediction results in terms of both accuracy and reliability. The prediction of plant protein subcellular localization is challenging because the underlying problem is not only a multiclass problem but also a multilabel one. Generally, plant proteins can be found in 10–14 locations/compartments. The number of proteins in some compartments (nucleus, cytoplasm, and mitochondria) is generally much greater than that in other compartments (vacuole, peroxisome, Golgi, and cell wall), so the problem of imbalanced data usually arises. We therefore propose an ensemble machine learning method based on average voting among heterogeneous classifiers. We first extracted various types of features suitable for each type of protein localization to form a total of 479 feature spaces. Then, feature selection methods were used to reduce the dimensions of the features into smaller informative feature subsets. This reduced feature subset was then used to train three different individual models. We combined the results of these three classifiers by average voting to return the final probability prediction. The method could predict both single- and multilabel subcellular localizations, based on the voting probability. Experimental results indicated that the proposed ensemble method achieved an overall accuracy of 84.58% for 11 compartments on the testing dataset.
https://www.mdpi.com/2075-1729/11/4/293
ensemble machine learning
plant protein
feature extraction
feature selection
go term
consensus voting
collection DOAJ
language English
format Article
sources DOAJ
author Warin Wattanapornprom
Chinae Thammarongtham
Apiradee Hongsthong
Supatcha Lertampaiporn
spellingShingle Warin Wattanapornprom
Chinae Thammarongtham
Apiradee Hongsthong
Supatcha Lertampaiporn
Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization
Life
ensemble machine learning
plant protein
feature extraction
feature selection
go term
consensus voting
author_facet Warin Wattanapornprom
Chinae Thammarongtham
Apiradee Hongsthong
Supatcha Lertampaiporn
author_sort Warin Wattanapornprom
title Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization
publisher MDPI AG
series Life
issn 2075-1729
publishDate 2021-03-01
description The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, to improve prediction results in terms of both accuracy and reliability. The prediction of plant protein subcellular localization is challenging because the underlying problem is not only a multiclass problem but also a multilabel one. Generally, plant proteins can be found in 10–14 locations/compartments. The number of proteins in some compartments (nucleus, cytoplasm, and mitochondria) is generally much greater than that in other compartments (vacuole, peroxisome, Golgi, and cell wall), so the problem of imbalanced data usually arises. We therefore propose an ensemble machine learning method based on average voting among heterogeneous classifiers. We first extracted various types of features suitable for each type of protein localization to form a total of 479 feature spaces. Then, feature selection methods were used to reduce the dimensions of the features into smaller informative feature subsets. This reduced feature subset was then used to train three different individual models. We combined the results of these three classifiers by average voting to return the final probability prediction. The method could predict both single- and multilabel subcellular localizations, based on the voting probability. Experimental results indicated that the proposed ensemble method achieved an overall accuracy of 84.58% for 11 compartments on the testing dataset.
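The average-voting step in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three compartment names stand in for the 11 reported, the per-model probabilities are made-up values, and the 0.5 decision threshold and the fall-back to the top-scoring label are assumptions, since the abstract does not state the decision rule.

```python
from typing import Dict, List

# Hypothetical subset of compartments; the record reports 11 in total.
LABELS = ["nucleus", "cytoplasm", "mitochondria"]


def average_vote(prob_sets: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the per-label probabilities of several classifiers."""
    n = len(prob_sets)
    return {label: sum(p[label] for p in prob_sets) / n for label in LABELS}


def assign_labels(avg: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every label whose averaged probability reaches the threshold.

    The threshold value and the fall-back to the single best label are
    assumptions made for this sketch, not rules taken from the paper.
    """
    chosen = [label for label in LABELS if avg[label] >= threshold]
    if not chosen:
        chosen = [max(LABELS, key=lambda label: avg[label])]
    return chosen


# Made-up per-label probabilities from three heterogeneous classifiers.
model_a = {"nucleus": 0.75, "cytoplasm": 0.5, "mitochondria": 0.0}
model_b = {"nucleus": 0.5, "cytoplasm": 0.75, "mitochondria": 0.25}
model_c = {"nucleus": 1.0, "cytoplasm": 0.25, "mitochondria": 0.125}

avg = average_vote([model_a, model_b, model_c])
predicted = assign_labels(avg)
print(avg)        # averaged probability per compartment
print(predicted)  # one or more compartments, depending on the threshold
```

Because each label is scored independently, a protein whose averaged probability clears the threshold for two compartments receives both labels, which is how a single voting step can cover single- and multilabel cases.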
topic ensemble machine learning
plant protein
feature extraction
feature selection
go term
consensus voting
url https://www.mdpi.com/2075-1729/11/4/293