Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization
The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, to improve prediction results in terms of both accuracy an...
Main Authors: | Warin Wattanapornprom, Chinae Thammarongtham, Apiradee Hongsthong, Supatcha Lertampaiporn |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-03-01 |
Series: | Life |
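Subjects: | ensemble machine learning; plant protein; feature extraction; feature selection; GO term; consensus voting |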
Online Access: | https://www.mdpi.com/2075-1729/11/4/293 |
id |
doaj-03b58a2d5b7942be957f93c614202135 |
---|---|
record_format |
Article |
spelling |
doaj-03b58a2d5b7942be957f93c614202135; 2021-03-30T23:02:47Z; eng; MDPI AG; Life; 2075-1729; 2021-03-01; 11; 293; 293; 10.3390/life11040293; Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization; Warin Wattanapornprom (0), Chinae Thammarongtham (1), Apiradee Hongsthong (2), Supatcha Lertampaiporn (3); affiliations in record order: (0) Applied Computer Science Program, Department of Mathematics, Faculty of Science, King Mongkut’s University of Technology Thonburi, Bangkok 10140, Thailand; (1) Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand; (2) Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand; (3) Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Warin Wattanapornprom, Chinae Thammarongtham, Apiradee Hongsthong, Supatcha Lertampaiporn |
author_sort |
Warin Wattanapornprom |
title |
Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization |
title_sort |
ensemble of multiple classifiers for multilabel classification of plant protein subcellular localization |
publisher |
MDPI AG |
series |
Life |
issn |
2075-1729 |
publishDate |
2021-03-01 |
description |
The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for predicting protein subcellular localization in plants based on multiple classifiers, with the aim of improving both the accuracy and the reliability of the predictions. Predicting plant protein subcellular localization is challenging because the underlying problem is not only multiclass but also multilabel. Generally, plant proteins can be found in 10–14 locations/compartments. The number of proteins in some compartments (nucleus, cytoplasm, and mitochondria) is generally much greater than that in others (vacuole, peroxisome, Golgi, and cell wall), so the data are usually imbalanced. To address this, we propose an ensemble machine learning method based on average voting among heterogeneous classifiers. We first extracted various types of features suitable for each type of protein localization, forming a total of 479 feature spaces. Feature selection methods were then used to reduce these to smaller, informative feature subsets. The reduced feature subset was used to train three different individual models. The outputs of the three classifiers were combined by average voting to return the final probability prediction. Based on the voting probability, the method can predict both single- and multilabel localizations. Experimental results indicated that the proposed ensemble method achieved an overall accuracy of 84.58% for 11 compartments on the testing dataset. |
topic |
ensemble machine learning; plant protein; feature extraction; feature selection; go term; consensus voting |
url |
https://www.mdpi.com/2075-1729/11/4/293 |
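
The abstract describes combining three classifiers by average voting and reading single- or multilabel localizations off the voted probabilities. The following is a minimal sketch of that idea only, not the authors' implementation. The record does not name the classifiers, the decision threshold, or the full list of 11 compartments, so the compartment list, the 0.5 threshold, and the function names `average_vote` and `predict_labels` are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical subset of compartments; the record reports 11 but does not list them all.
COMPARTMENTS = ["nucleus", "cytoplasm", "mitochondria", "vacuole"]


def average_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-compartment probabilities produced by several classifiers."""
    if not prob_sets:
        raise ValueError("need at least one classifier output")
    n_labels = len(prob_sets[0])
    if any(len(p) != n_labels for p in prob_sets):
        raise ValueError("all classifiers must score the same set of labels")
    return [sum(p[i] for p in prob_sets) / len(prob_sets) for i in range(n_labels)]


def predict_labels(
    avg: Sequence[float], labels: Sequence[str], threshold: float = 0.5
) -> List[str]:
    """Return every label whose averaged probability reaches the threshold.

    The threshold value is an assumption. If no label reaches it, fall back to
    the single most probable label so that each protein gets a prediction.
    """
    chosen = [lab for lab, p in zip(labels, avg) if p >= threshold]
    if not chosen:
        best = max(range(len(avg)), key=lambda i: avg[i])
        chosen = [labels[best]]
    return chosen


# Made-up probability outputs of three heterogeneous classifiers for one protein.
clf_a = [0.9, 0.6, 0.1, 0.0]
clf_b = [0.8, 0.5, 0.2, 0.1]
clf_c = [0.7, 0.7, 0.0, 0.2]

voted = average_vote([clf_a, clf_b, clf_c])
print(predict_labels(voted, COMPARTMENTS))
```

With these made-up inputs, two compartments clear the threshold, so the protein is treated as multilabel; a protein with only one compartment above the threshold would come out as single-label.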