PyBDA: a command line tool for automated analysis of big biological data sets
Abstract. Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. Results: We developed a novel machine learning command line tool called P...
Main Authors: | Simon Dirmeier, Mario Emmenlauer, Christoph Dehio, Niko Beerenwinkel |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2019-11-01 |
Series: | BMC Bioinformatics |
Subjects: | Big data, Data analysis, Command line, Pipeline, Computing cluster, Grid engine |
Online Access: | http://link.springer.com/article/10.1186/s12859-019-3087-8 |
id |
doaj-fef69666e5764a66988d747702df7910 |
---|---|
record_format |
Article |
spelling |
doaj-fef69666e5764a66988d747702df7910; 2020-11-25T04:10:01Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2019-11-01; 20116; 10.1186/s12859-019-3087-8; PyBDA: a command line tool for automated analysis of big biological data sets; Simon Dirmeier (0), Mario Emmenlauer (1), Christoph Dehio (2), Niko Beerenwinkel (3); Department of Biosystems Science and Engineering, ETH Zurich; Biozentrum, University of Basel; Biozentrum, University of Basel; Department of Biosystems Science and Engineering, ETH Zurich; Abstract. Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. Results: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analysing image-based RNA interference data of 150 million single cells. Conclusion: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.; http://link.springer.com/article/10.1186/s12859-019-3087-8; Big data; Data analysis; Command line; Pipeline; Computing cluster; Grid engine |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Simon Dirmeier, Mario Emmenlauer, Christoph Dehio, Niko Beerenwinkel |
author_sort |
Simon Dirmeier |
title |
PyBDA: a command line tool for automated analysis of big biological data sets |
publisher |
BMC |
series |
BMC Bioinformatics |
issn |
1471-2105 |
publishDate |
2019-11-01 |
description |
Abstract. Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. Results: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analysing image-based RNA interference data of 150 million single cells. Conclusion: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io. |
topic |
Big data; Data analysis; Command line; Pipeline; Computing cluster; Grid engine |
url |
http://link.springer.com/article/10.1186/s12859-019-3087-8 |