Identifying optimal bioinformatics protocols for aerosol microbial community data

Microbes are fundamental to Earth’s ecosystems, so understanding ecosystem connectivity through microbial dispersal is key to predicting future ecosystem changes in a warming world. However, aerial microbial dispersal remains poorly understood. Few studies have been performed on bioaerosols (micro...


Bibliographic Details
Main Authors: Katie Miaow, Donnabella Lacap-Bugler, Hannah L. Buckley
Format: Article
Language: English
Published: PeerJ Inc. 2021-09-01
Series: PeerJ
Subjects: NGS
Online Access: https://peerj.com/articles/12065.pdf
id doaj-aba162a0fe0b4bbca3d02d230e7e6d15
record_format Article
spelling doaj-aba162a0fe0b4bbca3d02d230e7e6d15 2021-10-02T15:05:05Z eng PeerJ Inc. PeerJ 2167-8359 2021-09-01 9 e12065 10.7717/peerj.12065
collection DOAJ
language English
format Article
sources DOAJ
author Katie Miaow
Donnabella Lacap-Bugler
Hannah L. Buckley
title Identifying optimal bioinformatics protocols for aerosol microbial community data
publisher PeerJ Inc.
series PeerJ
issn 2167-8359
publishDate 2021-09-01
description Microbes are fundamental to Earth’s ecosystems, so understanding ecosystem connectivity through microbial dispersal is key to predicting future ecosystem changes in a warming world. However, aerial microbial dispersal remains poorly understood. Few studies have been performed on bioaerosols (microorganisms and biological fragments suspended in the atmosphere), even though they harbor pathogens and allergens. Most environmental microbes grow poorly in culture; therefore, molecular approaches are required to characterize aerial diversity. Bioinformatic tools are needed to process the next generation sequencing (NGS) data generated by these molecular approaches; however, the process involves numerous options and choices. These choices can markedly affect key aspects of the data output, including relative abundances, diversity, and taxonomy. Bioaerosol samples have relatively little DNA and often contain novel organisms and proportionally high levels of contaminant organisms that are difficult to identify. Therefore, bioinformatics choices are of crucial importance. A bioaerosol dataset for bacteria and fungi, based on 16S rRNA gene (16S) and internal transcribed spacer (ITS) DNA sequencing from parks in the metropolitan area of Auckland, Aotearoa New Zealand, was used to develop a process for determining the bioinformatics pipeline that would maximize the amount and quality of data generated. Two popular tools (DADA2 and USEARCH) were compared for amplicon sequence variant (ASV) inference and generation of an ASV table. A scorecard was created and used to assess multiple outputs and make systematic choices about the most suitable option. The criteria considered were read number, number of ASVs, alpha diversity (Hill numbers), beta diversity (Bray–Curtis distances), differential abundance by site, and consistency of ASVs. USEARCH was selected because of its higher consistency in the ASVs identified and its greater read counts.
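The two diversity measures named above can be illustrated with a minimal sketch. This is not the authors' code (the paper's example code is in R); it is an independent Python illustration using the standard definitions of Hill numbers and Bray–Curtis dissimilarity. The function names and the sample counts are hypothetical and are not data from the study.

```python
import math


def hill_number(counts, q):
    """Hill number (effective number of ASVs) of order q for one sample."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    # Relative abundances of the ASVs that are present (zero counts are skipped).
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # Order 1 is defined as the limit q -> 1: the exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples' ASV count vectors."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same ASVs in the same order")
    denominator = sum(a) + sum(b)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


# Hypothetical ASV counts for two samples (illustrative only).
sample_a = [10, 10, 10, 10]
sample_b = [30, 5, 5, 0]

richness_a = hill_number(sample_a, 0)    # order 0: number of ASVs present
shannon_a = hill_number(sample_a, 1)     # order 1: exp(Shannon entropy)
simpson_a = hill_number(sample_a, 2)     # order 2: inverse Simpson index
distance_ab = bray_curtis(sample_a, sample_b)
```

For a perfectly even sample such as `sample_a`, every order of Hill number equals the number of ASVs present, which makes it a convenient sanity check when comparing the output of two pipelines.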
Taxonomic assignment is highly dependent on the taxonomic database used. Two popular taxonomy databases were compared in terms of the number and confidence of assignments, and a combined approach was developed that uses information from both databases to maximize the number and confidence of taxonomic assignments. This approach increased the assignment rate by 12–15%, depending on the amplicon, and the overall assignment rate was 77% for bacteria and 47% for fungi. Decontamination using “decontam” and “microDecon” was assessed by reviewing the ASVs that each identified as contaminants and considering the probability that they were legitimate members of the bioaerosol community. For this example, microDecon’s subtraction approach for removing background contamination was selected. This study demonstrates a systematic approach to determining the optimal bioinformatics pipeline for microbial bioaerosol data using a multi-criteria scorecard. Example code in the R environment for this data processing pipeline is provided.
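The idea of combining assignments from two taxonomy databases can be sketched as follows. The abstract does not state the combination rule, so the rule used here is an assumption made purely for illustration: for each ASV, prefer the lineage resolved to more ranks and break ties with the higher confidence score. The names `db_a`, `db_b`, `combine_assignments`, and `assignment_rate`, and all values, are hypothetical. The sketch is in Python, not the R code the paper provides.

```python
def combine_assignments(first, second):
    """Pick, per ASV, one of two candidate (lineage, confidence) assignments.

    Assumed rule (not taken from the paper): prefer the lineage resolved to
    more ranks; break ties with the higher confidence score.
    """
    combined = {}
    for asv in sorted(set(first) | set(second)):
        candidates = [db[asv] for db in (first, second) if asv in db]
        combined[asv] = max(candidates, key=lambda c: (len(c[0]), c[1]))
    return combined


def assignment_rate(assignments, all_asvs):
    """Fraction of ASVs that received any taxonomic assignment."""
    assigned = sum(1 for asv in all_asvs if asv in assignments)
    return assigned / len(all_asvs)


# Hypothetical assignments: ASV id -> (lineage tuple, confidence in [0, 1]).
db_a = {
    "ASV1": (("Bacteria", "Proteobacteria"), 0.9),
    "ASV2": (("Bacteria",), 0.6),
}
db_b = {
    "ASV2": (("Bacteria", "Firmicutes", "Bacilli"), 0.8),
    "ASV3": (("Bacteria",), 0.7),
}
all_asvs = ["ASV1", "ASV2", "ASV3", "ASV4"]

combined = combine_assignments(db_a, db_b)
rate_single = assignment_rate(db_a, all_asvs)
rate_combined = assignment_rate(combined, all_asvs)
```

In this toy input the combined table covers three of the four ASVs where either database alone covers two, which is the kind of gain in assignment rate the abstract describes; the actual figures reported there come from the study's own data and method.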
topic Bioaerosol
Bioinformatics
Microbial ecology
NGS
Bacteria
Fungi
url https://peerj.com/articles/12065.pdf