Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization

Bibliographic Details
Main Authors: Ting-Hong Lin, 林庭宏
Other Authors: Chao-Chin Wu
Format: Others
Language: zh-TW
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/72744617351629968168
Description
Summary: Master's thesis === National Changhua University of Education === Department of Computer Science and Information Engineering === 100 === Our study focuses on improving an important category of dynamic programming (DP) problems, nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). Because NPDP exhibits a different degree of parallelism at each stage of the computation, it is difficult to fully utilize the GPU's computational capability. In a previous study, we proposed an algorithm that adaptively adjusts the thread-level parallelism to address this problem and improve performance on such NPDP problems. In this research, we focus on optimizing the GPU's memory usage. We apply the tiling technique to partition the subproblems and their data: subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused by each tile of subproblems, thereby reducing the amount of global memory access. However, enforcing data consistency across different stages requires invoking the same kernel many times, which makes it impossible to reuse the tiled data in shared memory after the kernel is re-invoked. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, with the restriction that the maximum number of blocks equals the total number of streaming multiprocessors. In addition to enabling data reuse, invoking the kernel only once also allows us to prefetch data into shared memory across inter-block synchronization points, which improves performance more than data reuse does. Experimental results demonstrate that our method achieves a speedup of 3.2 over the previously published GPU algorithm.
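
To make the two techniques in the abstract concrete, the following is a minimal CUDA sketch, not the thesis code: it combines a software inter-block barrier (in the style of Xiao and Feng's lock-free GPU synchronization) with double-buffered shared-memory tiles, so that a single persistent kernel launch covers every DP stage and an operand tile for the next stage is prefetched while the current stage computes. The tile width TILE, the ping-pong tables t0/t1, the read-only weight array w, and the stand-in update rule are all illustrative assumptions; real NPDP indexing along the anti-diagonals of the DP table is more involved.

// npdp_barrier_sketch.cu -- an illustrative sketch, not the thesis implementation.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 128   // illustrative tile width; one block per SM, one slice per block

// Lock-free inter-block barrier. Valid only when every block is resident,
// i.e. gridDim.x <= number of SMs, and blockDim.x >= gridDim.x so that
// block 0 can poll one arrival flag per thread.
__device__ void gpu_sync(int goal, volatile int *in, volatile int *out)
{
    if (threadIdx.x == 0) in[blockIdx.x] = goal;               // announce arrival
    if (blockIdx.x == 0) {
        if (threadIdx.x < gridDim.x)
            while (in[threadIdx.x] != goal) { }                // wait for all blocks
        __syncthreads();
        if (threadIdx.x < gridDim.x) out[threadIdx.x] = goal;  // release all blocks
    }
    if (threadIdx.x == 0)
        while (out[blockIdx.x] != goal) { }                    // wait for release
    __syncthreads();
}

// Persistent kernel: launched once, it iterates over the stages itself, so the
// shared-memory tile survives from one stage to the next instead of being lost
// at every kernel relaunch.
__global__ void dp_stages(const float *w, float *t0, float *t1, int n,
                          int stages, volatile int *in, volatile int *out)
{
    __shared__ float wbuf[2][TILE];      // double-buffered tile of read-only weights
    int tid = threadIdx.x, i = blockIdx.x * TILE + tid, cur = 0;
    float *src = t0, *dst = t1;          // ping-pong DP tables

    if (i < n) wbuf[cur][tid] = w[i];    // tile for stage 0
    __syncthreads();

    for (int s = 0; s < stages; ++s) {
        // Prefetch: issue the next stage's weight load now, so its latency is
        // hidden behind the update and the barrier. Safe because w is read-only
        // and cannot race with any stage's writes. Placeholder addressing: a
        // real NPDP kernel would fetch the next diagonal's tile here.
        int ni = ((blockIdx.x + s + 1) % gridDim.x) * TILE + tid;
        float pre = (s + 1 < stages && ni < n) ? w[ni] : 0.0f;

        // Stand-in DP update: reads only src (stage s-1 results, final after the
        // previous barrier) and writes only dst, so there is no intra-stage race.
        int j = ((blockIdx.x + 1) % gridDim.x) * TILE + tid;   // neighbor entry
        if (i < n) dst[i] = src[i] + wbuf[cur][tid] * (j < n ? src[j] : 0.0f);

        wbuf[cur ^ 1][tid] = pre;        // commit the prefetched tile
        __threadfence();                 // make dst visible to other blocks
        gpu_sync(s + 1, in, out);        // one barrier replaces one kernel relaunch
        float *tmp = src; src = dst; dst = tmp;
        cur ^= 1;
    }
}

int main()
{
    int sms;  cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    int n = sms * TILE, stages = 8;
    float *w, *t0, *t1; int *in, *out;
    cudaMalloc(&w,  n * sizeof(float));   cudaMemset(w,  0, n * sizeof(float));
    cudaMalloc(&t0, n * sizeof(float));   cudaMemset(t0, 0, n * sizeof(float));
    cudaMalloc(&t1, n * sizeof(float));
    cudaMalloc(&in,  sms * sizeof(int));  cudaMemset(in,  0, sms * sizeof(int));
    cudaMalloc(&out, sms * sizeof(int));  cudaMemset(out, 0, sms * sizeof(int));
    dp_stages<<<sms, TILE>>>(w, t0, t1, n, stages, in, out);  // exactly #SM blocks
    printf("kernel done: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    return 0;
}

As in the abstract, the barrier is only valid when the grid has at most one block per streaming multiprocessor, so that every block stays resident; this is why the host launches exactly multiProcessorCount blocks. On current hardware the same effect can be obtained with cooperative groups' grid-wide synchronization (cudaLaunchCooperativeKernel), which did not exist when this thesis was written.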