On the Programmability and Performance of OpenCL Designs for FPGA

Bibliographic Details
Main Author:	Verma, Anshuman
Other Authors:	Electrical and Computer Engineering
Format:	Others
Published:	Virginia Tech 2019
Subjects:	FPGA OpenCL HDL HLS AOCL Verilog Accelerator GEM Stencil
Online Access:	http://hdl.handle.net/10919/92699

Contributors:	Feng, Wu-Chun; Zhou, Huiyang; Athanas, Peter M.
Degree:	MS
Type:	Thesis (ETD, application/pdf)
Date:	2018-02-09
Date Available:	2019-08-04
Identifier:	vt_gsexam:12265
Rights:	In Copyright (http://rightsstatements.org/vocab/InC/1.0/)
Collection:	NDLTD

Full description

Field-programmable gate arrays (FPGAs) are emerging as a promising foundation for accelerators spanning domains such as finance, web search, and data center networking. Research interest in developing accelerators on FPGAs is growing significantly, particularly because of their effectiveness across a variety of applications, their flexibility, and their high performance per watt. However, several key challenges still hinder their large-scale deployment; overcoming them would let FPGAs match the pervasiveness of graphics processing units (GPUs), their principal competitors in this arena. One of the primary reasons for their slow adoption by programmers has been the programming model, which relies on a low-level hardware description language (HDL). Using an HDL requires a detailed understanding of logic design and significant effort to implement and verify behavioral models, with the verification effort growing as the models become more complex. Recent advances in high-level synthesis (HLS) tools have addressed this challenge to a considerable extent by allowing programmers to write their applications in a high-level language, OpenCL. These applications are then compiled and synthesized into a bitstream that configures the FPGA. This thesis characterizes the efficacy of HLS compiler optimizations that can be employed to improve the performance of these applications.

The hardware synthesized from OpenCL kernels is fundamentally different from traditional hardware such as CPUs and GPUs, which exploit instruction-level parallelism (ILP), thread-level parallelism (TLP), or data-level parallelism (DLP) for performance gains. FPGA designs typically rely on deep pipelining (i.e., ILP) for performance, so a stall in this pipeline can severely undermine an application's performance. It is therefore imperative to identify and remove such bottlenecks.

To this end, this thesis presents a software-centric framework to debug and profile the designs synthesized by HLS tools. It proposes basic code patterns, including a timestamp and a scalable framework, that can be plugged easily into OpenCL kernels to collect and process run-time information dynamically. The scalable framework has a small overhead in area utilization and frequency but provides fine-grained information about the bottlenecks and latencies in the design.

Although HLS tools have improved programmability, this may come at the cost of performance or area utilization. This thesis examines that design trade-off through a comparative study of a hand-coded HDL design and an architecturally similar, tool-generated design produced by an OpenCL compiler, in the application area of 3D stencil (i.e., structured grid) computation. Experiments in this thesis show that the OpenCL approach can achieve 95% of the peak attainable performance of a microkernel across multiple problem sizes. Compared with the OpenCL approach, the HDL approach uses approximately 50% less memory and delivers only 2% better performance on average.
