Combining replication and checkpointing redundancies for reducing resiliency overhead
Abstract: We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault‐tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in tw...
Main Author: | Hassan Motallebi |
---|---|
Format: | Article |
Language: | English |
Published: | Electronics and Telecommunications Research Institute (ETRI), 2020-04-01 |
Series: | ETRI Journal |
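Subjects: | concurrency graph; extended upward rank; fault‐tolerant scheduling; hybrid redundancy; resiliency overhead |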
Online Access: | https://doi.org/10.4218/etrij.2018-0684 |
id | doaj-202f08b2ee2d4aefbd74dd57b1b3e7f2 |
---|---|
record_format | Article |
spelling | doaj-202f08b2ee2d4aefbd74dd57b1b3e7f2; 2020-11-25T03:18:06Z; eng; Electronics and Telecommunications Research Institute (ETRI); ETRI Journal; 1225-6463; 2020-04-01; 423388398; 10.4218/etrij.2018-0684; 10.4218/etrij.2018-0684; Combining replication and checkpointing redundancies for reducing resiliency overhead; Hassan Motallebi; Abstract: We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault‐tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and characteristics of tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real‐world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead.; https://doi.org/10.4218/etrij.2018-0684; concurrency graph; extended upward rank; fault‐tolerant scheduling; hybrid redundancy; resiliency overhead |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Hassan Motallebi |
spellingShingle | Hassan Motallebi; Combining replication and checkpointing redundancies for reducing resiliency overhead; ETRI Journal; concurrency graph; extended upward rank; fault‐tolerant scheduling; hybrid redundancy; resiliency overhead |
author_facet | Hassan Motallebi |
author_sort | Hassan Motallebi |
title | Combining replication and checkpointing redundancies for reducing resiliency overhead |
title_short | Combining replication and checkpointing redundancies for reducing resiliency overhead |
title_full | Combining replication and checkpointing redundancies for reducing resiliency overhead |
title_fullStr | Combining replication and checkpointing redundancies for reducing resiliency overhead |
title_full_unstemmed | Combining replication and checkpointing redundancies for reducing resiliency overhead |
title_sort | combining replication and checkpointing redundancies for reducing resiliency overhead |
publisher | Electronics and Telecommunications Research Institute (ETRI) |
series | ETRI Journal |
issn | 1225-6463 |
publishDate | 2020-04-01 |
description | Abstract: We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault‐tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and characteristics of tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real‐world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead. |
topic | concurrency graph; extended upward rank; fault‐tolerant scheduling; hybrid redundancy; resiliency overhead |
url | https://doi.org/10.4218/etrij.2018-0684 |
work_keys_str_mv | AT hassanmotallebi combiningreplicationandcheckpointingredundanciesforreducingresiliencyoverhead |
_version_ | 1724628844153929728 |