doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows

Abstract

Background: Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only does each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters, and both kinds of effect can be difficult to predict. The situation is further complicated when multiple tools are run in a sequential pipeline, where the final output depends on the parameter configuration of each tool. Because of this complexity and the difficulty of predicting outcomes, parameters are in practice often left at their default settings or set through trial and error based on personal or peer experience. A systematic approach is needed for reliable and efficient selection of parameters for bioinformatic pipelines.

Results: We present doepipeline, a novel approach to optimizing bioinformatic software parameters based on core concepts of Design of Experiments (DoE) methodology and recent advances in subset designs. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, and then refined in a subsequent optimization phase using response surface designs and ordinary least squares (OLS) modeling. doepipeline was used to optimize parameters in four use cases: 1) de novo assembly, 2) scaffolding of a fragmented genome assembly, 3) k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and 4) genetic variant calling. In all four cases, doepipeline found parameter settings that produced a better outcome, with respect to the characteristic measured, than the default values. Our approach is implemented and available in the Python package doepipeline.

Conclusions: Our proposed methodology provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking. Its implementation in doepipeline makes the methodology accessible and user-friendly, and allows automatic optimization of tools in a wide range of cases. The source code of doepipeline is available at https://github.com/clicumu/doepipeline, and the package can be installed through conda-forge.
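The Results section above describes a two-phase strategy: a screening design that spans the full parameter space, followed by a local response-surface fit by ordinary least squares. The following is a minimal Python sketch of that general idea only; it does not use doepipeline's actual interface, and the scoring function, parameter ranges, and design points are hypothetical stand-ins for running and scoring a real pipeline.

"""
Minimal sketch (not doepipeline's API): screen a parameter space with a
coarse design, then fit a quadratic response surface by OLS near the best
screening point. All names, ranges, and the scoring function are hypothetical.
"""
import itertools
import numpy as np


def pipeline_score(x1, x2):
    # Hypothetical stand-in for "run the pipeline with these settings and
    # measure a quality characteristic" (e.g. an assembly metric); higher is better.
    return -(x1 - 3.2) ** 2 - 0.5 * (x2 - 7.5) ** 2 + 0.2 * (x1 - 3.2) * (x2 - 7.5)


# Phase 1 (screening): a small three-level grid stands in for the
# space-spanning subset design; it locates the promising region cheaply.
x1_levels = np.linspace(0.0, 10.0, 3)    # assumed search range for parameter 1
x2_levels = np.linspace(0.0, 20.0, 3)    # assumed search range for parameter 2
screen = [(a, b, pipeline_score(a, b))
          for a, b in itertools.product(x1_levels, x2_levels)]
best_x1, best_x2, _ = max(screen, key=lambda run: run[2])

# Phase 2 (optimization): a 3x3 design centred on the best screening point,
# then an ordinary-least-squares fit of a full quadratic model.
d1, d2 = 2.5, 5.0                        # half-widths of the local design box
pts = np.array([(best_x1 + i * d1, best_x2 + j * d2)
                for i, j in itertools.product((-1, 0, 1), repeat=2)])
y = np.array([pipeline_score(a, b) for a, b in pts])

# Model: y ~ 1 + x1 + x2 + x1*x2 + x1**2 + x2**2
X = np.column_stack([np.ones(len(pts)), pts[:, 0], pts[:, 1],
                     pts[:, 0] * pts[:, 1], pts[:, 0] ** 2, pts[:, 1] ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the fitted surface on a fine grid inside the local box and
# report its optimum as the suggested parameter setting.
g1 = np.linspace(best_x1 - d1, best_x1 + d1, 51)
g2 = np.linspace(best_x2 - d2, best_x2 + d2, 51)
G1, G2 = np.meshgrid(g1, g2)
pred = (beta[0] + beta[1] * G1 + beta[2] * G2 + beta[3] * G1 * G2
        + beta[4] * G1 ** 2 + beta[5] * G2 ** 2)
i, j = np.unravel_index(np.argmax(pred), pred.shape)
print(f"suggested settings: x1 = {G1[i, j]:.2f}, x2 = {G2[i, j]:.2f}")

In the actual package the score comes from executing the pipeline and parsing its output, and the screening phase uses the subset designs referred to in the abstract rather than the plain grid used here; the sketch only illustrates the screen-then-model-then-refine loop. The package itself is installed from conda-forge as stated above (typically: conda install -c conda-forge doepipeline).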

Bibliographic Details
Main Authors: Daniel Svensson, Rickard Sjögren, David Sundell, Andreas Sjödin, Johan Trygg
Author Affiliations: Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University (Svensson, Sjögren, Trygg); Division of CBRN Security and Defence, FOI - Swedish Defence Research Agency (Sundell, Sjödin)
Format: Article
Language: English
Published: BMC, 2019-10-01
Series: BMC Bioinformatics
ISSN: 1471-2105
DOI: 10.1186/s12859-019-3091-z
Subjects: Design of Experiments; Optimization; Sequencing; Nanopore; MinION; Assembly
Online Access: http://link.springer.com/article/10.1186/s12859-019-3091-z
DOAJ Record ID: doaj-cf8fcb2f6f8d4bdea6c88d2f2d16fcad