Reactive and Proactive Fault-Tolerant Network-on-Chip Architectures using Machine Learning

Bibliographic Details
Main Author: DiTomaso, Dominic F.
Language: English
Published: Ohio University / OhioLINK 2015
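Subjects: Computer Engineering; Electrical Engineering; Network-on-Chip; Fault-tolerance; Machine Learning; Proactive Fault-tolerant technique; Chip Multiprocessor; Error Prediction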
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1439478822
id ndltd-OhioLink-oai-etd.ohiolink.edu-ohiou1439478822
record_format oai_dc
Abstract:

Chip multiprocessors (CMPs) have emerged as the standard computer design to overcome the high power limitations and high performance demands of modern processing. Tens to thousands of cores at low frequencies (1-2 GHz) operate together to outperform single-core processors. For the cores to communicate efficiently, a communication infrastructure called the network-on-chip (NoC) is required. The NoC uses modular router and link components to route data across the chip. However, as transistor technology scales down, more and more cores are being integrated into the NoC, which leads to power and performance concerns due to high buffering power and under-utilized links. Moreover, the smaller transistors, along with effects such as wear-out and device aging, lead to serious reliability concerns in the NoC.

Commonly used reactive fault-tolerant techniques, which are employed after the error has affected the system, can be most effective against hard, or permanent, errors. Proactive fault-tolerant techniques, on the other hand, can be used to prevent or avoid errors before they occur, which can be most effective against soft, or transient, errors. In this dissertation, two separate but related fault-tolerant architectures are presented: 1) QORE, a reactive, power-efficient, high-performance fault-tolerant architecture for hard errors, and 2) a proactive prediction/mitigation fault-tolerant architecture for soft errors. Both architectures provide fault tolerance and both benefit from machine learning (ML) techniques, but in different ways.

QORE uses Multi-Function Channel (MFC) buffers and their associated control (link and fault controllers) to provide fault tolerance by allowing the NoC to dynamically adapt to faults at the link level and reverse propagation direction to avoid faulty links. Additionally, MFC buffers reduce router power and improve performance by eliminating in-router buffering. An ML technique is used in the link controllers to predict the direction of traffic flow in order to reverse links more efficiently. Simulation results using real benchmarks and synthetic traffic mixes show that QORE improves speedup by 1.3X and throughput by 2.3X when compared to state-of-the-art fault-tolerant NoC designs such as Ariadne and Vicis. Moreover, results from the Synopsys Design Compiler show that network power in QORE is reduced by 21% with minimal control overhead.

In the prediction/mitigation design, several effects such as process-voltage-temperature variations and device wear-out are combined to create data sets which can be used in a prediction model. ML techniques are used on the data sets to train a decision tree which can be used to predict faults efficiently in the network. Based on the prediction model, the predicted faults are dynamically mitigated through error correction codes (ECC) and relaxed timing transmission. Results indicate that, on average, timing errors can be predicted 32.4% more accurately than with other labeling techniques, resulting in a 23.3% reduction in retransmitted packets, a net speedup of 3.47X, and an energy savings of 41.9% over other designs for real traffic patterns.

Rights: unrestricted. This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
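
The prediction step described in the abstract (train a decision tree on fault-related data, then mitigate predicted faults) can be illustrated with a minimal sketch. This is not the dissertation's model: the feature set (temperature, supply voltage, link utilization), the six training samples, the Gini-impurity splitting rule, and the names `train`, `predict`, and `choose_mode` are all assumptions made here for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def train(rows, labels, depth=2):
    """Build a small tree; leaves are labels, inner nodes are 4-tuples."""
    split = best_split(rows, labels) if depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            train([r for r, _ in left], [y for _, y in left], depth - 1),
            train([r for r, _ in right], [y for _, y in right], depth - 1))

def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, low, high = tree
        tree = low if row[f] <= t else high
    return tree

def choose_mode(predicted_fault):
    """Hypothetical mitigation policy: protect the link only when a fault is predicted."""
    return "ecc_relaxed_timing" if predicted_fault == 1 else "normal"

# Made-up samples: (temperature in C, supply voltage in V, link utilization).
rows = [(55.0, 1.00, 0.20), (60.0, 1.00, 0.35), (58.0, 0.98, 0.30),
        (85.0, 0.90, 0.80), (90.0, 0.88, 0.75), (88.0, 0.92, 0.90)]
labels = [0, 0, 0, 1, 1, 1]  # 0 = no timing fault, 1 = timing fault

tree = train(rows, labels)
mode = choose_mode(predict(tree, (92.0, 0.87, 0.85)))
```

On this toy data the tree reduces to a single temperature threshold, so a hot, low-voltage sample is routed to the protected transmission mode. A real predictor would be trained on the process-voltage-temperature and wear-out data sets the abstract mentions.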
collection NDLTD
language English
sources NDLTD
topic Computer Engineering
Electrical Engineering
Network-on-Chip
Fault-tolerance
Machine Learning
Proactive Fault-tolerant technique
Chip Multiprocessor
Error Prediction