Memory Partitioning and Optimization of On-Chip Accelerators with High-Level Synthesis


Bibliographic Details
Main Authors: Peng, Te-Hsin, 彭德欣
Other Authors: Huang, Chih-Tsun
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/24zvgm
Description
Summary: Master's thesis === National Tsing Hua University === Department of Computer Science === 106 === Current research on design space exploration for accelerators relies mainly on either an RTL-based flow or a High-Level Synthesis (HLS) flow, and both are very time-consuming. Pre-RTL tools such as Aladdin can analyze designs written in high-level languages directly and take less time to estimate the timing, area, and power of different micro-architectures. Our previous work proposed a design-assisted flow that combines the HLS flow with Aladdin to explore the design space. That flow uses Vivado HLS, which targets the FPGA design flow, so its results may be inaccurate for users who adopt an ASIC design flow. In this thesis, we therefore extend the exploration flow to support an ASIC HLS tool such as Stratus HLS, resulting in more accurate design space exploration. In addition, conventional partitioning approaches, such as the block, cyclic, and block-cyclic techniques, cannot evenly distribute data elements across the memory banks. The uneven distribution causes memory conflicts, which become the performance bottleneck. Our previous work proposed a novel remapping algorithm to solve this problem. However, the original remapping scheme introduces irregular data padding or unnecessary data swapping, leading to extra area or latency overhead. In this thesis, we improve the remapping algorithm by proposing a more general and efficient approach to finding the regularity. We compare the optimized remapping algorithm with the conventional cyclic approach on six benchmark applications with different access patterns, applying different combinations of loop unrolling and memory partitioning to explore the design space. We then classify the six benchmarks by their access patterns and analyze their performance and area. The experimental results show that our optimized remapping approach effectively improves performance with a smaller area overhead than the cyclic approach.
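
The bank conflicts that the summary attributes to block, cyclic, and block-cyclic partitioning can be illustrated with a minimal sketch. It is not the thesis's remapping algorithm; the function names, the array size of 16, the 4 banks, and the block size of 2 are all assumptions chosen for illustration. Each function maps an array index to a bank, and max_conflict counts how many accesses issued in the same cycle (for example, by an unrolled loop) land in one bank.

```python
from collections import Counter

# Hypothetical helpers: each maps array index i (array length n) to a bank id.

def block_bank(i, n, banks):
    """Block partitioning: contiguous chunks of ceil(n / banks) elements per bank."""
    chunk = -(-n // banks)  # ceiling division
    return i // chunk

def cyclic_bank(i, n, banks):
    """Cyclic partitioning: consecutive elements go to consecutive banks."""
    return i % banks

def block_cyclic_bank(i, n, banks, block=2):
    """Block-cyclic partitioning: groups of `block` elements are dealt out cyclically."""
    return (i // block) % banks

def max_conflict(indices, bank_of, n, banks):
    """Largest number of same-cycle accesses that hit one bank (1 means conflict-free)."""
    counts = Counter(bank_of(i, n, banks) for i in indices)
    return max(counts.values())

N, BANKS = 16, 4            # assumed array length and bank count
unit_stride = [0, 1, 2, 3]  # four parallel accesses to neighbouring elements
stride_4 = [0, 4, 8, 12]    # four parallel accesses with a stride of 4

for name, fn in [("block", block_bank),
                 ("cyclic", cyclic_bank),
                 ("block-cyclic", block_cyclic_bank)]:
    print(name,
          max_conflict(unit_stride, fn, N, BANKS),
          max_conflict(stride_4, fn, N, BANKS))
```

Under these assumptions, no single scheme is conflict-free for both access patterns: block partitioning serializes the unit-stride accesses, cyclic partitioning serializes the stride-4 accesses, and block-cyclic partitioning leaves a two-way conflict in each case.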