geneRFinder: gene finding in distinct metagenomic data complexities

Bibliographic Details
Main Authors: Alves, R. (Author), Góes, F. (Author), Padovani, K. (Author), Silva, R. (Author)
Format: Article
Language: English
Published: BioMed Central Ltd 2021
LEADER 03810nam a2200613Ia 4500
001 10.1186-s12859-021-03997-w
008 220427s2021 CNT 000 0 und d
020 |a 14712105 (ISSN) 
245 1 0 |a geneRFinder: gene finding in distinct metagenomic data complexities 
260 0 |b BioMed Central Ltd  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1186/s12859-021-03997-w 
520 3 |a Background: Microbes perform a fundamental economic, social, and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (the complex communities) and their interactions. The way they act is usually estimated by looking at the functions they play in those environments, and their responsibility is measured by their genes. Advances in next-generation sequencing technology have facilitated metagenomics research; however, they also create a heavy computational burden. Large and complex biological datasets are available as never before. Many gene predictors are available that can aid the gene annotation process, though they do not appropriately handle the complexities of metagenomic data. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates. Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and 66 more than Prodigal. According to McNemar’s test, all percentage differences between the predictors' performances are statistically significant for all datasets with a 99% confidence interval. 
Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/. We also provide a novel, comprehensive benchmark dataset for gene prediction, which is based on the Critical Assessment of Metagenome Interpretation (CAMI) challenge and contains labeled data from gene regions, available at https://sourceforge.net/p/generfinder-benchmark. © 2021, The Author(s). 
650 0 4 |a algorithm 
650 0 4 |a Algorithms 
650 0 4 |a article 
650 0 4 |a Bacteria 
650 0 4 |a benchmarking 
650 0 4 |a Benchmarking 
650 0 4 |a Computational burden 
650 0 4 |a Confidence interval 
650 0 4 |a Critical assessment 
650 0 4 |a Data handling 
650 0 4 |a Decision trees 
650 0 4 |a false discovery rate 
650 0 4 |a False discovery rate 
650 0 4 |a Forecasting 
650 0 4 |a Gene prediction 
650 0 4 |a Genes 
650 0 4 |a high throughput sequencing 
650 0 4 |a High-Throughput Nucleotide Sequencing 
650 0 4 |a HTTP 
650 0 4 |a human 
650 0 4 |a Large dataset 
650 0 4 |a Machine learning 
650 0 4 |a metagenome 
650 0 4 |a Metagenome 
650 0 4 |a metagenomics 
650 0 4 |a Metagenomics 
650 0 4 |a molecular genetics 
650 0 4 |a Molecular Sequence Annotation 
650 0 4 |a Natural environments 
650 0 4 |a Next-generation sequencing 
650 0 4 |a Percentage points 
650 0 4 |a prediction 
650 0 4 |a random forest 
650 0 4 |a Random forest modeling 
700 1 |a Alves, R.  |e author 
700 1 |a Góes, F.  |e author 
700 1 |a Padovani, K.  |e author 
700 1 |a Silva, R.  |e author 
773 |t BMC Bioinformatics