PyBDA: a command line tool for automated analysis of big biological data sets

Abstract. Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. Results: We developed a novel machine learning command line tool called P...

Bibliographic Details
Main Authors: Simon Dirmeier, Mario Emmenlauer, Christoph Dehio, Niko Beerenwinkel
Format: Article
Language: English
Published: BMC 2019-11-01
Series: BMC Bioinformatics
Subjects: Big data; Data analysis; Command line; Pipeline; Computing cluster; Grid engine
Online Access: http://link.springer.com/article/10.1186/s12859-019-3087-8
id doaj-fef69666e5764a66988d747702df7910
record_format Article
spelling doaj-fef69666e5764a66988d747702df7910; 2020-11-25T04:10:01Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2019-11-01; 20; 1; 1; 16; 10.1186/s12859-019-3087-8; PyBDA: a command line tool for automated analysis of big biological data sets; Simon Dirmeier (Department of Biosystems Science and Engineering, ETH Zurich); Mario Emmenlauer (Biozentrum, University of Basel); Christoph Dehio (Biozentrum, University of Basel); Niko Beerenwinkel (Department of Biosystems Science and Engineering, ETH Zurich); http://link.springer.com/article/10.1186/s12859-019-3087-8; Big data; Data analysis; Command line; Pipeline; Computing cluster; Grid engine
collection DOAJ
language English
format Article
sources DOAJ
author Simon Dirmeier
Mario Emmenlauer
Christoph Dehio
Niko Beerenwinkel
title PyBDA: a command line tool for automated analysis of big biological data sets
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2019-11-01
description Abstract. Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. Results: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analysing image-based RNA interference data of 150 million single cells. Conclusion: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
topic Big data
Data analysis
Command line
Pipeline
Computing cluster
Grid engine
url http://link.springer.com/article/10.1186/s12859-019-3087-8