McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads...

Bibliographic Details
Main Authors: Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, Rudolf Eigenmann
Format: Article
Language: English
Published: Hindawi Limited 2013-01-01
Series: Scientific Programming
Online Access: http://dx.doi.org/10.3233/SPR-130371