Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization

Bibliographic Details
Main Authors: Ting-Hong Lin, 林庭宏
Other Authors: Chao-Chin Wu
Format: Others
Language: zh-TW
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/72744617351629968168
id ndltd-TW-100NCUE5392021
record_format oai_dc
spelling ndltd-TW-100NCUE5392021 2015-10-13T21:28:01Z http://ndltd.ncl.edu.tw/handle/72744617351629968168 Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization 利用區塊間同步機制提升動態規則問題在圖形加速器中之執行效能 Ting-Hong Lin 林庭宏 碩士 國立彰化師範大學 資訊工程學系 100 Our study focuses on improving an important category of dynamic programming (DP) problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). Because NPDP exhibits a different degree of parallelism at each stage of the computation, it is difficult to fully utilize the GPU's computing capability. In previous work, we proposed an algorithm that adaptively adjusts the thread-level parallelism to address this problem and improve the efficiency of solving such NPDP problems. In this research, we focus on optimizing GPU memory usage. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, thus reducing the amount of global memory access. However, because data consistency must be enforced across different stages, the same kernel has to be invoked many times, which makes it impossible to reuse the tiled data in shared memory after the kernel is re-invoked. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, under the restriction that the number of blocks must not exceed the total number of streaming multiprocessors. Besides enabling data reuse, invoking the kernel only once also lets us prefetch data into shared memory across an inter-block synchronization point, which improves performance even more than data reuse does. Experimental results demonstrate that our method achieves a speedup of 3.2 over the previously published GPU algorithm. Chao-Chin Wu 伍朝欽 2012 學位論文 ; thesis 40 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Changhua University of Education === Department of Computer Science and Information Engineering === 100 === Our study focuses on improving an important category of dynamic programming (DP) problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). Because NPDP exhibits a different degree of parallelism at each stage of the computation, it is difficult to fully utilize the GPU's computing capability. In previous work, we proposed an algorithm that adaptively adjusts the thread-level parallelism to address this problem and improve the efficiency of solving such NPDP problems. In this research, we focus on optimizing GPU memory usage. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, thus reducing the amount of global memory access. However, because data consistency must be enforced across different stages, the same kernel has to be invoked many times, which makes it impossible to reuse the tiled data in shared memory after the kernel is re-invoked. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, under the restriction that the number of blocks must not exceed the total number of streaming multiprocessors. Besides enabling data reuse, invoking the kernel only once also lets us prefetch data into shared memory across an inter-block synchronization point, which improves performance even more than data reuse does. Experimental results demonstrate that our method achieves a speedup of 3.2 over the previously published GPU algorithm.
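The abstract describes three techniques: tiling data into shared memory for reuse, a software inter-block barrier so the kernel is launched only once, and prefetching the next stage's data across the barrier. The CUDA sketch below illustrates how they fit together; it is not the thesis's implementation. The names (interBlockSync, dpAllStages, TILE), the linear stage layout, and the placeholder computation are all hypothetical, and the barrier follows the well-known lock-free pattern with one arrival flag per block and block 0 as coordinator, which is safe only when every block is resident at once, i.e. under the abstract's restriction that the number of blocks not exceed the number of streaming multiprocessors.

#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // tile width, chosen only for this sketch

// Lock-free inter-block barrier (hypothetical sketch). Safe only when all
// blocks are co-resident (gridDim.x <= SM count) and blockDim.x >= gridDim.x
// so one thread can poll each block's flag.
__device__ void interBlockSync(volatile int *arrive, volatile int *go, int goal)
{
    __syncthreads();                    // whole block has finished the stage
    if (threadIdx.x == 0) {
        __threadfence();                // make this block's global writes visible
        arrive[blockIdx.x] = goal;      // announce arrival
    }
    if (blockIdx.x == 0) {              // block 0 coordinates
        if (threadIdx.x < gridDim.x)
            while (arrive[threadIdx.x] != goal) { }  // wait for every block
        __syncthreads();
        if (threadIdx.x < gridDim.x)
            go[threadIdx.x] = goal;     // release every block
    }
    if (threadIdx.x == 0)
        while (go[blockIdx.x] != goal) { }           // wait for release
    __syncthreads();
}

// One launch covers all stages, so the shared-memory tiles survive between
// stages. The double buffer lets each block prefetch stage s+1 while
// computing stage s. The linear stage layout and the "*2" computation are
// placeholders; real NPDP stages are triangular, and only data not produced
// in the current stage may be prefetched across the barrier.
__global__ void dpAllStages(const float *stageIn, float *stageOut, int nStages,
                            volatile int *arrive, volatile int *go)
{
    __shared__ float tile[2][TILE];
    int cur = 0;
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)   // load stage 0
        tile[cur][i] = stageIn[blockIdx.x * TILE + i];
    __syncthreads();

    for (int s = 0; s < nStages; ++s) {
        if (s + 1 < nStages)                               // prefetch stage s+1
            for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                tile[cur ^ 1][i] = stageIn[((s + 1) * gridDim.x + blockIdx.x) * TILE + i];
        for (int i = threadIdx.x; i < TILE; i += blockDim.x) // placeholder work
            stageOut[(s * gridDim.x + blockIdx.x) * TILE + i] = tile[cur][i] * 2.0f;
        interBlockSync(arrive, go, s + 1);                 // replaces a re-launch
        cur ^= 1;                        // swap buffers: prefetched data is ready
    }
}

int main()
{
    int sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    int blocks = sms;                    // the restriction from the abstract
    int nStages = 4;
    size_t n = (size_t)nStages * blocks * TILE;

    float *in, *out;
    int *arrive, *go;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&arrive, blocks * sizeof(int));
    cudaMalloc(&go, blocks * sizeof(int));
    cudaMemset(arrive, 0, blocks * sizeof(int));
    cudaMemset(go, 0, blocks * sizeof(int));

    dpAllStages<<<blocks, TILE>>>(in, out, nStages, arrive, go);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

The point of the double buffer is that the global-memory loads for stage s+1 overlap both the computation of stage s and the spin-wait at the barrier. With one kernel launch per stage that overlap is impossible, because shared memory does not persist across launches.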
author2 Chao-Chin Wu
author_facet Chao-Chin Wu
Ting-Hong Lin
林庭宏
author Ting-Hong Lin
林庭宏
spellingShingle Ting-Hong Lin
林庭宏
Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization
author_sort Ting-Hong Lin
title Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization
title_short Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization
title_full Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization
title_fullStr Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization
title_full_unstemmed Optimizing Dynamic Programming on Graphics Processing Units via Data Reuse and Data Prefetch with Inter-Block Synchronization
title_sort optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block synchronization
publishDate 2012
url http://ndltd.ncl.edu.tw/handle/72744617351629968168
work_keys_str_mv AT tinghonglin optimizingdynamicprogrammingongraphicsprocessingunitsviadatareuseanddataprefetchwithinterblocksynchronization
AT líntínghóng optimizingdynamicprogrammingongraphicsprocessingunitsviadatareuseanddataprefetchwithinterblocksynchronization
AT tinghonglin lìyòngqūkuàijiāntóngbùjīzhìtíshēngdòngtàiguīzéwèntízàitúxíngjiāsùqìzhōngzhīzhíxíngxiàonéng
AT líntínghóng lìyòngqūkuàijiāntóngbùjīzhìtíshēngdòngtàiguīzéwèntízàitúxíngjiāsùqìzhōngzhīzhíxíngxiàonéng
_version_ 1718064950957572096