McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads...

Bibliographic Details
Main Authors:	Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, Rudolf Eigenmann
Format:	Article
Language:	English
Published:	Hindawi Limited 2013-01-01
Series:	Scientific Programming
Online Access:	http://dx.doi.org/10.3233/SPR-130371

Similar Items