Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks

Abstract. Background: Protein post-translational modification (PTM) is central to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study...

Bibliographic Details
Main Authors:	Yingxi Yang, Hui Wang, Wen Li, Xiaobo Wang, Shizhao Wei, Yulong Liu, Yan Xu
Format:	Article
Language:	English
Published:	BMC 2021-03-01
Series:	BMC Bioinformatics
Subjects:	Post-translational modification, Deep learning, Generative adversarial networks, Random forest
Online Access:	https://doi.org/10.1186/s12859-021-04101-y

id	doaj-29c0569ff6fb4a81b696a9ad78d166bf
record_format	Article
spelling	doaj-29c0569ff6fb4a81b696a9ad78d166bf | 2021-04-04T11:45:28Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2021-03-01 | 22 | 1 | 1 | 17 | 10.1186/s12859-021-04101-y | Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks | Yingxi Yang (0), Hui Wang (1), Wen Li (2), Xiaobo Wang (3), Shizhao Wei (4), Yulong Liu (5), Yan Xu (6) | (0) Department of Information and Computer Science, University of Science and Technology Beijing; (1) Institute of Computing Technology, Chinese Academy of Sciences; (2) Department of Information and Computer Science, University of Science and Technology Beijing; (3) Department of Information and Computer Science, University of Science and Technology Beijing; (4) No. 15 Research Institute, China Electronics Technology Group Corporation; (5) No. 15 Research Institute, China Electronics Technology Group Corporation; (6) Department of Information and Computer Science, University of Science and Technology Beijing | Abstract. Background: Protein post-translational modification (PTM) is central to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of in-depth study and analysis of PTMs in proteins. Method: We proposed MultiLyGAN, a new multi-class machine learning pipeline, to identify seven types of lysine modified sites. Features were constructed with eight sequence-based and five structure-based methods, and 1497 valid features remained after filtering by Pearson correlation coefficient. To address the data imbalance problem, two influential deep generative methods, the Conditional Generative Adversarial Network (CGAN) and the Conditional Wasserstein Generative Adversarial Network (CWGAN), were used and compared to generate new samples for the types with fewer samples. Finally, a random forest algorithm was used to predict the seven categories. Results: In tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. In addition, an accumulated feature importance analysis showed that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. Conclusions: CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structural schemes were the most informative and contributed most to the prediction of PTMs. | https://doi.org/10.1186/s12859-021-04101-y | Post-translational modification, Deep learning, Generative adversarial networks, Random forest
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Yingxi Yang, Hui Wang, Wen Li, Xiaobo Wang, Shizhao Wei, Yulong Liu, Yan Xu
author_sort	Yingxi Yang
title	Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2021-03-01
description	Abstract. Background: Protein post-translational modification (PTM) is central to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of in-depth study and analysis of PTMs in proteins. Method: We proposed MultiLyGAN, a new multi-class machine learning pipeline, to identify seven types of lysine modified sites. Features were constructed with eight sequence-based and five structure-based methods, and 1497 valid features remained after filtering by Pearson correlation coefficient. To address the data imbalance problem, two influential deep generative methods, the Conditional Generative Adversarial Network (CGAN) and the Conditional Wasserstein Generative Adversarial Network (CWGAN), were used and compared to generate new samples for the types with fewer samples. Finally, a random forest algorithm was used to predict the seven categories. Results: In tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. In addition, an accumulated feature importance analysis showed that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. Conclusions: CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structural schemes were the most informative and contributed most to the prediction of PTMs.
topic	Post-translational modification, Deep learning, Generative adversarial networks, Random forest
url	https://doi.org/10.1186/s12859-021-04101-y
_version_	1721542354489311232
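
The abstract reports accuracy (Acc) and Matthews correlation coefficient (MCC) for a seven-class prediction task. As a rough illustration of how these two figures can be computed for a multi-class classifier, here is a minimal Python sketch. It assumes the commonly used multi-class generalization of MCC; the record does not state which formula the authors used, and the function names and the toy labels "A", "B", "C" are hypothetical, not taken from MultiLyGAN.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multi-class MCC computed from per-class counts.

    Assumption: the standard multi-class generalization is used,
        (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2)),
    where s is the number of samples, c the number of correct predictions,
    t_k how often class k truly occurs and p_k how often it is predicted.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # t_k
    pred_counts = Counter(y_pred)   # p_k
    labels = set(true_counts) | set(pred_counts)
    sum_pt = sum(pred_counts[k] * true_counts[k] for k in labels)
    sum_pp = sum(n * n for n in pred_counts.values())
    sum_tt = sum(n * n for n in true_counts.values())
    denom = sqrt((s * s - sum_pp) * (s * s - sum_tt))
    # By convention, return 0.0 when the denominator vanishes
    # (for example, when every sample is assigned to one class).
    return (c * s - sum_pt) / denom if denom else 0.0


# Toy data with three hypothetical class labels (not the paper's data).
toy_true = ["A", "A", "B", "B", "C", "C"]
toy_pred = ["A", "B", "B", "B", "C", "C"]
print(accuracy(toy_true, toy_pred), multiclass_mcc(toy_true, toy_pred))
```

Unlike accuracy, MCC takes the per-class prediction and occurrence counts into account, which is one reason it is often reported alongside accuracy when classes are imbalanced.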

Similar Items