Identification of Phage Viral Proteins With Hybrid Sequence Features

Bibliographic Details
Main Authors: Xiaoqing Ru, Lihong Li, Chunyu Wang
Format: Article
Language: English
Published: Frontiers Media S.A. 2019-03-01
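Series: Frontiers in Microbiology
Subjects: phage virion proteins; machine learning; feature extraction; feature selection; hybrid sequence features
Online Access: https://www.frontiersin.org/article/10.3389/fmicb.2019.00507/full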
id doaj-4786b700c0334e189eccc38cf1ab0d1c
record_format Article
spelling Frontiers in Microbiology, vol. 10 (2019-03-01), ISSN 1664-302X, doi:10.3389/fmicb.2019.00507. Xiaoqing Ru (School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China); Lihong Li (School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China); Chunyu Wang (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China).
collection DOAJ
language English
format Article
sources DOAJ
author Xiaoqing Ru
Lihong Li
Chunyu Wang
title Identification of Phage Viral Proteins With Hybrid Sequence Features
publisher Frontiers Media S.A.
series Frontiers in Microbiology
issn 1664-302X
publishDate 2019-03-01
description The unique properties of bacteriophages make them important in bioinformatics research. In practical applications, the function of bacteriophage virion proteins is the main area of interest, so it is important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select an optimal feature subset that correlates most strongly with the class labels and has low redundancy among features. Because the random forest algorithm randomly selects both the training samples and the candidate features at each node, a random forest classifier is used and evaluated with 10-fold cross-validation on the bacteriophage protein classification task. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. The study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
topic phage virion proteins
machine learning
feature extraction
feature selection
hybrid sequence features
url https://www.frontiersin.org/article/10.3389/fmicb.2019.00507/full
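
The description in this record mentions two steps that a short example may make more concrete: MRMD-style feature selection and 10-fold cross-validation. The following is a minimal pure-Python sketch, not the authors' implementation. It assumes that relevance is the absolute Pearson correlation between a feature and the class label, and that distance is the mean Euclidean distance from a feature to all other features, rescaled to [0, 1]; the published MRMD tool may combine or weight these terms differently. The function names (`mrmd_style_ranking`, `ten_fold_indices`) and the toy data are hypothetical, and the random forest classifier itself is not implemented here.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def euclidean(xs, ys):
    """Euclidean distance between two feature columns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def mrmd_style_ranking(columns, labels):
    """Return feature indices sorted by relevance + rescaled mean distance.

    Assumption (not taken from the paper): score = |Pearson r with label|
    + mean Euclidean distance to the other features divided by the largest
    such mean distance.
    """
    relevance = [abs(pearson(col, labels)) for col in columns]
    distance = []
    for i, col in enumerate(columns):
        others = [euclidean(col, c) for j, c in enumerate(columns) if j != i]
        distance.append(sum(others) / len(others) if others else 0.0)
    max_d = max(distance) or 1.0  # avoid dividing by zero
    scores = [r + d / max_d for r, d in zip(relevance, distance)]
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    for k in range(n_folds):
        test = order[k::n_folds]  # every n_folds-th shuffled index
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield train, test


# Hypothetical toy data: one list per feature, one label per sample.
labels = [1, 1, 1, 0, 0, 0]
columns = [
    [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],  # tracks the label closely
    [0.1, 0.3, 0.2, 0.8, 0.9, 0.7],  # tracks the label inversely
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # constant, carries no label signal
]
ranking = mrmd_style_ranking(columns, labels)
folds = list(ten_fold_indices(n_samples=25))
print("feature ranking:", ranking)
print("test fold sizes:", [len(test) for _, test in folds])
```

In a fuller pipeline, the top-ranked features would be passed to a classifier trained on each `train` split and scored on the matching `test` split, with the ten accuracies averaged.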