An Evaluation of Machine Learning Approaches for the Prediction of Essential Genes in Eukaryotes Using Protein Sequence-Derived Features

The availability of whole-genome sequences and associated multi-omics data sets, combined with advances in gene knockout and knockdown methods, has enabled large-scale annotation and exploration of gene and protein functions in eukaryotes. Knowing which genes are essential for the survival of eukary...


Bibliographic Details
Main Authors: Tulio L. Campos, Pasi K. Korhonen, Robin B. Gasser, Neil D. Young
Format: Article
Language: English
Published: Elsevier 2019-01-01
Series: Computational and Structural Biotechnology Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037019301357
id doaj-1a8f80f3cb1b4669a66892b7b0fe5b73
record_format Article
spelling doaj-1a8f80f3cb1b4669a66892b7b0fe5b73
2020-11-25T01:41:11Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2019-01-01
Volume 17, pages 785-796
An Evaluation of Machine Learning Approaches for the Prediction of Essential Genes in Eukaryotes Using Protein Sequence-Derived Features
Tulio L. Campos (0)
Pasi K. Korhonen (1)
Robin B. Gasser (2)
Neil D. Young (3)
(0) Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia; Bioinformatics Core Facility, Instituto Aggeu Magalhães, Fundação Oswaldo Cruz (IAM-Fiocruz), Recife, Pernambuco, Brazil
(1) Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
(2) Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia; Corresponding authors.
(3) Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia; Corresponding authors.
collection DOAJ
language English
format Article
sources DOAJ
author Tulio L. Campos
Pasi K. Korhonen
Robin B. Gasser
Neil D. Young
publisher Elsevier
series Computational and Structural Biotechnology Journal
issn 2001-0370
publishDate 2019-01-01
description The availability of whole-genome sequences and associated multi-omics data sets, combined with advances in gene knockout and knockdown methods, has enabled large-scale annotation and exploration of gene and protein functions in eukaryotes. Knowing which genes are essential for the survival of eukaryotic organisms is paramount for an understanding of the basic mechanisms of life, and could assist in identifying intervention targets in eukaryotic pathogens and cancer. Here, we studied essential gene orthologs among selected species of eukaryotes, and then employed a systematic machine-learning approach, using protein sequence-derived features and selection procedures, to investigate essential gene predictions within and among species. We showed that the numbers of essential gene orthologs comprise small fractions when compared with the total number of orthologs among the eukaryotic species studied. In addition, we demonstrated that machine-learning models trained with subsets of essentiality-related data performed better than random guessing of gene essentiality for a particular species. Consistent with our gene ortholog analysis, the predictions of essential genes among multiple (including distantly-related) species is possible, yet challenging, suggesting that most essential genes are unique to a species. The present work provides a foundation for the expansion of genome-wide essentiality investigations in eukaryotes using machine learning approaches.
Keywords: Machine-learning, Essential genes, Essentiality prediction, Eukaryotes
url http://www.sciencedirect.com/science/article/pii/S2001037019301357