Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware developmen...

Bibliographic Details
Main Author:	Zhang, Zheng
Format:	Others
Language:	English
Published:	W&M ScholarWorks 2012
Subjects:	Computer Sciences
Online Access:	https://scholarworks.wm.edu/etd/1539623602 https://scholarworks.wm.edu/cgi/viewcontent.cgi?article=3393&context=etd

id	ndltd-wm.edu-oai-scholarworks.wm.edu-etd-3393
record_format	oai_dc
spelling	ndltd-wm.edu-oai-scholarworks.wm.edu-etd-3393 2019-05-16T03:23:14Z Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU Zhang, Zheng 2012-01-01T08:00:00Z text application/pdf https://scholarworks.wm.edu/etd/1539623602 https://scholarworks.wm.edu/cgi/viewcontent.cgi?article=3393&context=etd © The Author Dissertations, Theses, and Masters Projects English W&M ScholarWorks Computer Sciences
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Computer Sciences
spellingShingle	Computer Sciences Zhang, Zheng Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
description	Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphics Processing Units (GPUs), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today's general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features. Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of solutions to this problem and presents program-level transformations based upon two fundamental techniques: thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group. In memory performance, this dissertation concentrates on two aspects: the influence of non-uniform cache sharing on multithreaded applications, and the optimization of irregular memory references on GPUs. In shared-cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform cache sharing on contemporary parallel programs, reveals the mismatch between software development and the underlying cache-sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvements. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, that is, memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to those of the control flow problem, but take a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations.
author	Zhang, Zheng
title	Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
publisher	W&M ScholarWorks
publishDate	2012
url	https://scholarworks.wm.edu/etd/1539623602 https://scholarworks.wm.edu/cgi/viewcontent.cgi?article=3393&context=etd
work_keys_str_mv	AT zhangzheng localityenhancementanddynamicoptimizationsonmulticoreandgpu
_version_	1719185826354561024

Similar Items