A big data approach to metagenomics for all-food-sequencing

Abstract Background All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires...

Bibliographic Details
Main Authors: Robin Kobus, José M. Abuín, André Müller, Sören Lukas Hellmann, Juan C. Pichel, Tomás F. Pena, Andreas Hildebrandt, Thomas Hankeln, Bertil Schmidt
Format: Article
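Language: English
Published: BMC 2020-03-01
Series: BMC Bioinformatics
Subjects: Next-generation sequencing; Metagenomics; Species identification; Eukaryotic genomes; Locality sensitive hashing; Big data
Online Access: http://link.springer.com/article/10.1186/s12859-020-3429-6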
id doaj-52b8dda9f3684cd6a38d2bbc7c876b0f
record_format Article
spelling doaj-52b8dda9f3684cd6a38d2bbc7c876b0f 2020-11-25T02:25:01Z eng BMC BMC Bioinformatics 1471-2105 2020-03-01 vol. 21, no. 1, pp. 1-15 10.1186/s12859-020-3429-6
Robin Kobus (Department of Computer Science, Johannes Gutenberg University)
José M. Abuín (IPCA, Polytechnic Institute of Cávado and Ave)
André Müller (Department of Computer Science, Johannes Gutenberg University)
Sören Lukas Hellmann (Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University)
Juan C. Pichel (CiTIUS, Universidade de Santiago de Compostela)
Tomás F. Pena (CiTIUS, Universidade de Santiago de Compostela)
Andreas Hildebrandt (Department of Computer Science, Johannes Gutenberg University)
Thomas Hankeln (Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University)
Bertil Schmidt (Department of Computer Science, Johannes Gutenberg University)
collection DOAJ
language English
format Article
sources DOAJ
author Robin Kobus
José M. Abuín
André Müller
Sören Lukas Hellmann
Juan C. Pichel
Tomás F. Pena
Andreas Hildebrandt
Thomas Hankeln
Bertil Schmidt
title A big data approach to metagenomics for all-food-sequencing
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2020-03-01
description Abstract
Background: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches.
Results: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark).
Conclusions: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
topic Next-generation sequencing
Metagenomics
Species identification
Eukaryotic genomes
Locality sensitive hashing
Big data
url http://link.springer.com/article/10.1186/s12859-020-3429-6
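
The description above mentions an alignment-free k-mer based method, and the subject list names locality sensitive hashing. As a rough illustration of that general idea, the Python sketch below builds a min-hash index over reference windows and assigns a read to the reference that shares the most sketch features. It is not taken from AFS-MetaCache or MetaCacheSpark: the parameter values (K, WINDOW, SKETCH_SIZE), the function names, the voting rule, and the synthetic reference sequences are all assumptions made for this example.

```python
import hashlib
import random
from collections import Counter, defaultdict

# Illustrative parameters only (assumptions, not the tools' defaults).
K = 16           # k-mer length
WINDOW = 32      # window length in bases
SKETCH_SIZE = 4  # number of smallest k-mer hashes kept per window


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer with a stable hash function."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(window: str) -> list:
    """Min-hash sketch: the SKETCH_SIZE smallest distinct k-mer hashes."""
    hashes = {kmer_hash(window[i:i + K]) for i in range(len(window) - K + 1)}
    return sorted(hashes)[:SKETCH_SIZE]


def windows(seq: str):
    """Yield windows that overlap by K - 1 bases, so each k-mer is covered once."""
    step = WINDOW - K + 1
    for start in range(0, max(len(seq) - K + 1, 1), step):
        yield seq[start:start + WINDOW]


def build_index(references: dict) -> dict:
    """Map every sketch feature to the set of reference names containing it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for window in windows(seq):
            for feature in sketch(window):
                index[feature].add(name)
    return index


def classify(read: str, index: dict):
    """Return the reference with the most feature hits, or None if there are none."""
    votes = Counter()
    for window in windows(read):
        for feature in sketch(window):
            for name in index.get(feature, ()):
                votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    # Synthetic random "genomes" (hypothetical data, for demonstration only).
    rng = random.Random(0)
    refs = {
        name: "".join(rng.choice("ACGT") for _ in range(200))
        for name in ("species_a", "species_b")
    }
    index = build_index(refs)
    read = refs["species_a"][:WINDOW]  # a read that matches a reference window
    print(classify(read, index))
```

Counting the classified reads per reference would give a crude composition estimate. Real reads rarely start exactly at a window boundary, so a practical tool has to tolerate partially matching sketches and sequencing errors; this sketch makes no attempt to do so.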