A Lightweight and Flexible Tool for Distinguishing Between Hardware Malfunctions and Program Bugs in Debugging Large-Scale Programs

Bibliographic Details
Main Authors: Guozhen Zhang, Yi Liu, Hailong Yang, Depei Qian
Format: Article
Language: English
Published: IEEE 2018-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/8540813/
Description
Summary: In this paper, we propose a new technique for determining whether a program failure is caused by a hardware malfunction or a program bug. The technique mitigates the impact that a shorter mean time between failures will have on the debugging process on future exascale supercomputers, and improves the productivity of debugging large-scale parallel programs. It detects program failures by observing abnormal message-passing behavior with distributed monitors, and leverages an event-driven mechanism to trigger global status checking among different node groups concurrently. In addition, it provides both coarse-grained execution snapshots and fine-grained failure events for further failure diagnosis and bug analysis. We implement this technique as a user-space library named Failure Cause Resolver (FCR). Experimental results on the Tianhe-2 supercomputer demonstrate that the failure-detection latency of FCR is acceptable and that its overhead is negligible. Furthermore, FCR does not require administrative privileges and can be easily integrated into existing large-scale parallel programs.
ISSN: 2169-3536
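
Illustrative sketch (not part of the record): the summary describes monitors that watch message-passing behavior and trigger a status check when something looks abnormal. The Python below is a minimal, single-process sketch of that general idea only. It is not FCR's interface or algorithm. The `Monitor` class, the `node_alive` probe, the timeout rule, and the classification rule (a silent peer on a responsive node suggests a program bug, a silent peer on an unresponsive node suggests a hardware malfunction) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Cause(Enum):
    HARDWARE_MALFUNCTION = "hardware malfunction"
    PROGRAM_BUG = "program bug"


@dataclass
class Monitor:
    """Hypothetical monitor for one node group (assumed design, not FCR's)."""
    timeout: float                     # seconds of silence treated as abnormal
    node_alive: Callable[[str], bool]  # assumed health probe for a peer's node
    last_seen: Dict[str, float] = field(default_factory=dict)
    events: List[Tuple[float, str, Cause]] = field(default_factory=list)

    def on_message(self, peer: str, now: float) -> None:
        # Record the time of the most recent message from this peer.
        self.last_seen[peer] = now

    def check(self, now: float) -> Dict[str, Cause]:
        # Classify every peer that has been silent for longer than the timeout.
        verdicts: Dict[str, Cause] = {}
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout:
                if self.node_alive(peer):
                    cause = Cause.PROGRAM_BUG           # node responds, process is stuck
                else:
                    cause = Cause.HARDWARE_MALFUNCTION  # node itself is unreachable
                self.events.append((now, peer, cause))  # fine-grained failure event
                verdicts[peer] = cause
        return verdicts
```

In this sketch a caller would feed `on_message` from its communication layer and call `check` periodically or when a send or receive stalls. A real deployment would run one such monitor per node group and exchange verdicts between groups, which is omitted here.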