Combining replication and checkpointing redundancies for reducing resiliency overhead

Abstract: We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault‐tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in tw...


Bibliographic Details
Main Author: Hassan Motallebi
Format: Article
Language: English
Published: Electronics and Telecommunications Research Institute (ETRI) 2020-04-01
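Series: ETRI Journal
Subjects: concurrency graph; extended upward rank; fault‐tolerant scheduling; hybrid redundancy; resiliency overhead
Online Access: https://doi.org/10.4218/etrij.2018-0684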
id doaj-202f08b2ee2d4aefbd74dd57b1b3e7f2
record_format Article
spelling doaj-202f08b2ee2d4aefbd74dd57b1b3e7f2; 2020-11-25T03:18:06Z; eng; Electronics and Telecommunications Research Institute (ETRI); ETRI Journal; 1225-6463; 2020-04-01; 42; 3; 388; 398; 10.4218/etrij.2018-0684; 10.4218/etrij.2018-0684; Combining replication and checkpointing redundancies for reducing resiliency overhead; Hassan Motallebi; Abstract: We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault‐tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and characteristics of tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real‐world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead.; https://doi.org/10.4218/etrij.2018-0684; concurrency graph; extended upward rank; fault‐tolerant scheduling; hybrid redundancy; resiliency overhead
collection DOAJ
language English
format Article
sources DOAJ
author Hassan Motallebi
spellingShingle Hassan Motallebi
Combining replication and checkpointing redundancies for reducing resiliency overhead
ETRI Journal
concurrency graph
extended upward rank
fault‐tolerant scheduling
hybrid redundancy
resiliency overhead
author_facet Hassan Motallebi
author_sort Hassan Motallebi
title Combining replication and checkpointing redundancies for reducing resiliency overhead
title_short Combining replication and checkpointing redundancies for reducing resiliency overhead
title_full Combining replication and checkpointing redundancies for reducing resiliency overhead
title_fullStr Combining replication and checkpointing redundancies for reducing resiliency overhead
title_full_unstemmed Combining replication and checkpointing redundancies for reducing resiliency overhead
title_sort combining replication and checkpointing redundancies for reducing resiliency overhead
publisher Electronics and Telecommunications Research Institute (ETRI)
series ETRI Journal
issn 1225-6463
publishDate 2020-04-01
description Abstract: We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault‐tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and characteristics of tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real‐world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead.
topic concurrency graph
extended upward rank
fault‐tolerant scheduling
hybrid redundancy
resiliency overhead
url https://doi.org/10.4218/etrij.2018-0684
work_keys_str_mv AT hassanmotallebi combiningreplicationandcheckpointingredundanciesforreducingresiliencyoverhead
_version_ 1724628844153929728