Fecal source identification using random forest

Abstract Background Clostridiales and Bacteroidales are uniquely adapted to the gut environment and have co-evolved with their hosts resulting in convergent microbiome patterns within mammalian species. As a result, members of Clostridiales and Bacteroidales are particularly suitable for identifying...


Bibliographic Details
Main Authors: Adélaïde Roguet, A. Murat Eren, Ryan J Newton, Sandra L McLellan
Format: Article
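Language: English
Published: BMC 2018-10-01
Series: Microbiome
Subjects: Microbial source tracking; 16S rRNA gene; High-throughput sequencing; Clostridiales; Bacteroidales; Random forest classification
Online Access: http://link.springer.com/article/10.1186/s40168-018-0568-3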
id doaj-44faede5b7b04bd4820e725d6548aa9f
record_format Article
spelling doaj-44faede5b7b04bd4820e725d6548aa9f
2020-11-25T01:18:14Z
eng
BMC
Microbiome
2049-2618
2018-10-01
Volume 6, Issue 1, Pages 1-15
10.1186/s40168-018-0568-3
Fecal source identification using random forest
Adélaïde Roguet (School of Freshwater Sciences, University of Wisconsin-Milwaukee)
A. Murat Eren (Department of Medicine, University of Chicago)
Ryan J Newton (School of Freshwater Sciences, University of Wisconsin-Milwaukee)
Sandra L McLellan (School of Freshwater Sciences, University of Wisconsin-Milwaukee)
collection DOAJ
language English
format Article
sources DOAJ
author Adélaïde Roguet
A. Murat Eren
Ryan J Newton
Sandra L McLellan
title Fecal source identification using random forest
publisher BMC
series Microbiome
issn 2049-2618
publishDate 2018-10-01
description Abstract
Background: Clostridiales and Bacteroidales are uniquely adapted to the gut environment and have co-evolved with their hosts, resulting in convergent microbiome patterns within mammalian species. As a result, members of Clostridiales and Bacteroidales are particularly suitable for identifying sources of fecal contamination in environmental samples. However, a comprehensive evaluation of their predictive power and development of computational approaches is lacking. Given the global public health concern for waterborne disease, accurate identification of fecal pollution sources is essential for effective risk assessment and management. Here, we use the random forest algorithm and 16S rRNA gene amplicon sequences assigned to Clostridiales and Bacteroidales to identify common fecal pollution sources. We benchmarked the accuracy, consistency, and sensitivity of our classification approach using fecal, environmental, and artificial in silico generated samples.
Results: Clostridiales and Bacteroidales classifiers were composed mainly of sequences that displayed differential distributions (host-preferred) among sewage, cow, deer, pig, cat, and dog sources. Each classifier correctly identified human and individual animal sources in approximately 90% of the fecal and environmental samples tested. Misclassifications resulted mostly from false-positive identification of cat and dog fecal signatures in host animals not used to build the classifiers, suggesting that characterization of additional animals would improve accuracy. Random forest predictions were highly reproducible, reflecting the consistency of the bacterial signatures within each of the animal and sewage sources. Using in silico generated samples, we could detect fecal bacterial signatures when the source dataset accounted for as little as ~0.5% of the assemblage, with ~0.04% of the sequences matching the classifiers. Finally, we developed a proxy to estimate proportions among sources, which allowed us to determine which sources contribute the most to observed fecal pollution.
Conclusion: Random forest classification with 16S rRNA gene amplicons offers a rapid, sensitive, and accurate solution for identifying host microbial signatures to detect human and animal fecal contamination in environmental samples.
topic Microbial source tracking
16S rRNA gene
High-throughput sequencing
Clostridiales
Bacteroidales
Random forest classification
url http://link.springer.com/article/10.1186/s40168-018-0568-3
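
Illustrative note on the in silico sensitivity test described in the abstract. The sketch below is not the authors' pipeline; it only shows, under stated assumptions, how an artificial sample can be built by mixing a fecal source profile into an environmental background at a chosen proportion, and how the share of sequences matching a classifier can then be computed. All names (`mix_profiles`, `classifier_match_fraction`, the sequence IDs) and all abundance values are hypothetical. In particular, the assumption that 8% of the source's sequences belong to the classifier is a toy value chosen so that a 0.5% source contribution yields about 0.04% matching sequences; the record does not report that figure.

```python
def mix_profiles(source, background, source_fraction):
    """Linearly mix two relative-abundance profiles.

    Each profile is a dict mapping a sequence ID to its proportion
    (values in each profile are assumed to sum to 1).
    """
    if not 0.0 <= source_fraction <= 1.0:
        raise ValueError("source_fraction must be between 0 and 1")
    mixed = {}
    for seq in set(source) | set(background):
        mixed[seq] = (
            source_fraction * source.get(seq, 0.0)
            + (1.0 - source_fraction) * background.get(seq, 0.0)
        )
    return mixed


def classifier_match_fraction(profile, classifier_sequences):
    """Return the share of the profile made up of classifier sequences."""
    return sum(p for seq, p in profile.items() if seq in classifier_sequences)


# Hypothetical toy data: 8% of the source profile consists of
# host-preferred sequences that are part of the classifier (assumption).
cow_source = {"cow_seq_A": 0.05, "cow_seq_B": 0.03, "shared_seq": 0.92}
background = {"env_seq_1": 0.60, "env_seq_2": 0.40}
classifier_sequences = {"cow_seq_A", "cow_seq_B"}

# Source contributes 0.5% of the artificial assemblage.
sample = mix_profiles(cow_source, background, source_fraction=0.005)
matched = classifier_match_fraction(sample, classifier_sequences)
print(f"Share of sequences matching the classifier: {matched:.4%}")
```

With these toy values the matching share is 0.005 × 0.08 = 0.0004, that is, about 0.04% of the assemblage. A real evaluation would pass such mixed samples to a trained random forest classifier rather than stop at the matching share.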