Designing scientific workflow following a structure and provenance-aware strategy

Bioinformatics experiments are usually performed using scientific workflows in which tasks are chained together, forming very intricate and nested graph structures. Scientific workflow systems have then been developed ...


Bibliographic Details
Main Author:	Chen, Jiuqiang
Other Authors:	Paris 11
Language:	en
Published:	2013
Subjects:	Workflow scientifique Provenance Provenance-équivalence Graphes séries-parallèles Taverna Anti-modèles Scientific workflows Graph rewriting Series-parallel graphs Anti-patterns
Online Access:	http://www.theses.fr/2013PA112221/document

id	ndltd-theses.fr-2013PA112221
record_format	oai_dc
collection	NDLTD
language	en
sources	NDLTD
topic	Workflow scientifique Provenance Provenance-équivalence Graphes séries-parallèles Taverna Anti-modèles Scientific workflows Provenance Graph rewriting Series-parallel graphs Taverna Anti-patterns
spellingShingle	Workflow scientifique Provenance Provenance-équivalence Graphes séries-parallèles Taverna Anti-modèles Scientific workflows Provenance Graph rewriting Series-parallel graphs Taverna Anti-patterns Chen, Jiuqiang Designing scientific workflow following a structure and provenance-aware strategy
description	Bioinformatics experiments are usually performed using scientific workflows in which tasks are chained together, forming very intricate and nested graph structures. Scientific workflow systems have been developed to guide users in the design and execution of workflows. An advantage of these systems over traditional approaches is their ability to automatically record the provenance (or lineage) of intermediate and final data products generated during workflow execution. The provenance of a data product contains information about how the product was derived, and it is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. For several reasons, the complexity of workflows and of workflow execution structures is increasing over time, which has a clear impact on the reuse of scientific workflows. The global aim of this thesis is to enhance workflow reuse by providing strategies that reduce the complexity of workflow structures while preserving provenance. Two strategies are introduced. First, we propose an approach to rewrite the graph structure of any scientific workflow (classically represented as a directed acyclic graph (DAG)) into a simpler structure, namely a series-parallel (SP) structure, while preserving provenance. SP-graphs are simple and layered, making the main phases of a workflow easier to distinguish. Additionally, from a more formal point of view, polynomial-time algorithms for performing complex graph-based operations (e.g., comparing workflows, which is directly related to the problem of subgraph homomorphism) can be designed when workflows have SP-structures, whereas such operations are related to NP-hard problems for DAGs with no restriction on their structure. We introduce the notion of provenance preservation, the SPFlow rewriting algorithm, and its associated tool. Second, we provide a methodology together with a technique able to reduce the redundancy present in workflows (by removing unnecessary occurrences of tasks). More precisely, we detect "anti-patterns", a term broadly used in program design to indicate the use of idiomatic forms that lead to over-complicated design and should therefore be avoided. We thus provide the DistillFlow algorithm, which transforms a given workflow into a semantically equivalent "distilled" workflow, that is, one that is free or partly free of anti-patterns and has a more concise and simpler structure. The two main approaches of this thesis (namely, SPFlow and DistillFlow) are based on a provenance model that we have introduced to represent the provenance structure of workflow executions. The notion of provenance-equivalence, which determines whether two workflows have the same meaning, is also at the center of our work. Our solutions have been systematically tested on large collections of real workflows, especially from the Taverna system. Our tools are available at https://www.lri.fr/~chenj/.
author2	Paris 11
author_facet	Paris 11 Chen, Jiuqiang
author	Chen, Jiuqiang
author_sort	Chen, Jiuqiang
title	Designing scientific workflow following a structure and provenance-aware strategy
title_short	Designing scientific workflow following a structure and provenance-aware strategy
title_full	Designing scientific workflow following a structure and provenance-aware strategy
title_fullStr	Designing scientific workflow following a structure and provenance-aware strategy
title_full_unstemmed	Designing scientific workflow following a structure and provenance-aware strategy
title_sort	designing scientific workflow following a structure and provenance-aware strategy
publishDate	2013
url	http://www.theses.fr/2013PA112221/document
work_keys_str_mv	AT chenjiuqiang designingscientificworkflowfollowingastructureandprovenanceawarestrategy AT chenjiuqiang conceptiondeworkflowsscientifiquesfondeesurlastructureetlaprovenance
_version_	1718477946438549504
spelling	ndltd-theses.fr-2013PA112221 2017-06-28T04:36:12Z Designing scientific workflow following a structure and provenance-aware strategy Conception de workflows scientifiques fondée sur la structure et la provenance Workflow scientifique Provenance Provenance-équivalence Graphes séries-parallèles Taverna Anti-modèles Scientific workflows Provenance Graph rewriting Series-parallel graphs Taverna Anti-patterns Electronic Thesis or Dissertation Text Image en http://www.theses.fr/2013PA112221/document Chen, Jiuqiang 2013-10-11 Paris 11 Froidevaux, Christine


Similar Items