Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware developmen...

Bibliographic Details
Main Author: Zhang, Zheng
Format: Others
Language: English
Published: W&M ScholarWorks 2012
Subjects: Computer Sciences
Online Access: https://scholarworks.wm.edu/etd/1539623602
https://scholarworks.wm.edu/cgi/viewcontent.cgi?article=3393&context=etd
id ndltd-wm.edu-oai-scholarworks.wm.edu-etd-3393
record_format oai_dc
spelling ndltd-wm.edu-oai-scholarworks.wm.edu-etd-3393 2019-05-16T03:23:14Z Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU Zhang, Zheng 2012-01-01T08:00:00Z text application/pdf https://scholarworks.wm.edu/etd/1539623602 https://scholarworks.wm.edu/cgi/viewcontent.cgi?article=3393&context=etd © The Author Dissertations, Theses, and Masters Projects English W&M ScholarWorks Computer Sciences
collection NDLTD
language English
format Others
sources NDLTD
topic Computer Sciences
description Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphics Processing Units (GPUs), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today's general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.

Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program-level transformations based upon two fundamental techniques: thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.

In memory performance, this dissertation concentrates on two aspects: the influence of non-uniform cache sharing on multithreaded applications, and the optimization of irregular memory references on GPUs. In shared-cache multicore chips, interactions among threads are complicated by the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between software development and the underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, that is, memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to those of the control flow problem, but take a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations.
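
The two kinds of irregularity named in the abstract, divergent branches and indirect references such as A[P[i]], can be illustrated with a small CPU-side model. This is only a sketch and not the G-Streamline implementation: the SIMD group width of 32, the segment size of 32 elements, and every function name below are assumptions introduced for illustration. It counts how many thread groups diverge and how many memory segments each group touches, before and after jobs or data are reordered.

```python
WARP_SIZE = 32       # assumed SIMD thread-group width
SEGMENT_ELEMS = 32   # assumed number of elements per memory segment


def divergent_warps(flags, warp_size=WARP_SIZE):
    """Count thread groups whose threads do not all take the same branch."""
    count = 0
    for start in range(0, len(flags), warp_size):
        group = flags[start:start + warp_size]
        if len(set(group)) > 1:
            count += 1
    return count


def relocate_jobs(flags):
    """Return a job order that places equal branch outcomes next to each other.

    Thread t would process job order[t]; a stable sort keeps the original
    relative order inside each branch class.
    """
    return sorted(range(len(flags)), key=lambda i: flags[i])


def segments_touched(indices, warp_size=WARP_SIZE, seg=SEGMENT_ELEMS):
    """Sum, over thread groups, the number of distinct segments accessed."""
    total = 0
    for start in range(0, len(indices), warp_size):
        group = indices[start:start + warp_size]
        total += len({idx // seg for idx in group})
    return total


def relocate_data(A, P):
    """Build a copy A2 with A2[i] == A[P[i]], so thread i reads position i."""
    return [A[p] for p in P]


if __name__ == "__main__":
    n = 128
    flags = [i % 2 for i in range(n)]        # neighbouring threads disagree
    order = relocate_jobs(flags)
    print("divergent groups before:", divergent_warps(flags))
    print("divergent groups after: ", divergent_warps([flags[j] for j in order]))

    A = [x * 10 for x in range(n)]
    P = [(i * 37) % n for i in range(n)]     # a scattered permutation
    A2 = relocate_data(A, P)
    print("segments touched via A[P[i]]:", segments_touched(P))
    print("segments touched via A2[i]:  ", segments_touched(list(range(n))))
```

The model ignores the cost of performing the reordering itself, which a real runtime optimization would have to weigh against the savings.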
author Zhang, Zheng
author_facet Zhang, Zheng
author_sort Zhang, Zheng
title Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
title_short Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
title_full Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
title_fullStr Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
title_full_unstemmed Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
title_sort locality enhancement and dynamic optimizations on multi-core and gpu
publisher W&M ScholarWorks
publishDate 2012
url https://scholarworks.wm.edu/etd/1539623602
https://scholarworks.wm.edu/cgi/viewcontent.cgi?article=3393&context=etd