A big data approach to metagenomics for all-food-sequencing

Abstract Background: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires...

Bibliographic Details
Main Authors:	Robin Kobus, José M. Abuín, André Müller, Sören Lukas Hellmann, Juan C. Pichel, Tomás F. Pena, Andreas Hildebrandt, Thomas Hankeln, Bertil Schmidt
Format:	Article
Language:	English
Published:	BMC 2020-03-01
Series:	BMC Bioinformatics
Subjects:	Next-generation sequencing, Metagenomics, Species identification, Eukaryotic genomes, Locality sensitive hashing, Big data
Online Access:	http://link.springer.com/article/10.1186/s12859-020-3429-6

id	doaj-52b8dda9f3684cd6a38d2bbc7c876b0f
record_format	Article
spelling	doaj-52b8dda9f3684cd6a38d2bbc7c876b0f 2020-11-25T02:25:01Z eng BMC BMC Bioinformatics 1471-2105 2020-03-01 21 1 1 15 10.1186/s12859-020-3429-6 A big data approach to metagenomics for all-food-sequencing Robin Kobus [0], José M. Abuín [1], André Müller [2], Sören Lukas Hellmann [3], Juan C. Pichel [4], Tomás F. Pena [5], Andreas Hildebrandt [6], Thomas Hankeln [7], Bertil Schmidt [8]; [0] Department of Computer Science, Johannes Gutenberg University; [1] IPCA, Polytechnic Institute of Cávado and Ave; [2] Department of Computer Science, Johannes Gutenberg University; [3] Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University; [4] CiTIUS, Universidade de Santiago de Compostela; [5] CiTIUS, Universidade de Santiago de Compostela; [6] Department of Computer Science, Johannes Gutenberg University; [7] Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University; [8] Department of Computer Science, Johannes Gutenberg University. Abstract Background: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. Results: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). Conclusions: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters). http://link.springer.com/article/10.1186/s12859-020-3429-6 Next-generation sequencing, Metagenomics, Species identification, Eukaryotic genomes, Locality sensitive hashing, Big data
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Robin Kobus, José M. Abuín, André Müller, Sören Lukas Hellmann, Juan C. Pichel, Tomás F. Pena, Andreas Hildebrandt, Thomas Hankeln, Bertil Schmidt
spellingShingle	Robin Kobus José M. Abuín André Müller Sören Lukas Hellmann Juan C. Pichel Tomás F. Pena Andreas Hildebrandt Thomas Hankeln Bertil Schmidt A big data approach to metagenomics for all-food-sequencing BMC Bioinformatics Next-generation sequencing Metagenomics Species identification Eukaryotic genomes Locality sensitive hashing Big data
author_facet	Robin Kobus, José M. Abuín, André Müller, Sören Lukas Hellmann, Juan C. Pichel, Tomás F. Pena, Andreas Hildebrandt, Thomas Hankeln, Bertil Schmidt
author_sort	Robin Kobus
title	A big data approach to metagenomics for all-food-sequencing
title_short	A big data approach to metagenomics for all-food-sequencing
title_full	A big data approach to metagenomics for all-food-sequencing
title_fullStr	A big data approach to metagenomics for all-food-sequencing
title_full_unstemmed	A big data approach to metagenomics for all-food-sequencing
title_sort	big data approach to metagenomics for all-food-sequencing
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2020-03-01
description	Abstract Background: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. Results: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). Conclusions: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
topic	Next-generation sequencing, Metagenomics, Species identification, Eukaryotic genomes, Locality sensitive hashing, Big data
url	http://link.springer.com/article/10.1186/s12859-020-3429-6
work_keys_str_mv	AT robinkobus abigdataapproachtometagenomicsforallfoodsequencing AT josemabuin abigdataapproachtometagenomicsforallfoodsequencing AT andremuller abigdataapproachtometagenomicsforallfoodsequencing AT sorenlukashellmann abigdataapproachtometagenomicsforallfoodsequencing AT juancpichel abigdataapproachtometagenomicsforallfoodsequencing AT tomasfpena abigdataapproachtometagenomicsforallfoodsequencing AT andreashildebrandt abigdataapproachtometagenomicsforallfoodsequencing AT thomashankeln abigdataapproachtometagenomicsforallfoodsequencing AT bertilschmidt abigdataapproachtometagenomicsforallfoodsequencing AT robinkobus bigdataapproachtometagenomicsforallfoodsequencing AT josemabuin bigdataapproachtometagenomicsforallfoodsequencing AT andremuller bigdataapproachtometagenomicsforallfoodsequencing AT sorenlukashellmann bigdataapproachtometagenomicsforallfoodsequencing AT juancpichel bigdataapproachtometagenomicsforallfoodsequencing AT tomasfpena bigdataapproachtometagenomicsforallfoodsequencing AT andreashildebrandt bigdataapproachtometagenomicsforallfoodsequencing AT thomashankeln bigdataapproachtometagenomicsforallfoodsequencing AT bertilschmidt bigdataapproachtometagenomicsforallfoodsequencing
_version_	1724853203178094592
