Summary: | Master's === National Chiao Tung University === Institute of Electronics === 105 === Deep convolutional neural networks (CNNs) have achieved state-of-the-art accuracy in recognition, detection, and other computer vision tasks. However, their hardware design faces the challenges of high computational complexity and high data bandwidth, as well as large divergence across CNN network layers: the throughput of the convolutional layers is bounded by the available hardware resources, while the throughput of the fully connected layers is bounded by the available data bandwidth. Thus, a highly flexible design is desired to meet these needs.
This thesis presents our end-to-end CNN accelerator, which maximizes hardware utilization to 100% with run-time configuration for multiple kernel sizes and minimizes data bandwidth with an output-first strategy that improves data reuse in the convolutional layers by up to 300X~600X compared with the non-reused case. The whole CNN implementation of the target network is generated to be optimal in both hardware and data efficiency under the design resource constraints, and this implementation is reconfigured at run time with layer-optimized parameters to achieve real-time, end-to-end CNN acceleration. An implementation example for AlexNet consumes a 1.783M gate count for 216 MACs and a 142.64 KB internal buffer in a TSMC 40nm process, and achieves 99.7 fps and 61.6 fps at a 454 MHz clock frequency for the convolutional layers and for all layers of AlexNet, respectively.
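The compute-bound versus bandwidth-bound contrast above can be made concrete with a back-of-the-envelope arithmetic-intensity calculation. The sketch below is an illustration, not code from the thesis: it uses the standard published AlexNet layer shapes (conv1 and fc6) to count MACs and weights per layer, showing that a convolutional layer reuses each weight thousands of times while a fully connected layer uses each weight exactly once.

```python
# Hedged illustration (not from the thesis): MAC and weight counts for two
# AlexNet layers, showing why conv layers are compute-bound while fully
# connected (FC) layers are bandwidth-bound.

def conv_stats(out_h, out_w, out_c, in_c, k):
    """MACs and weight count for one convolutional layer."""
    macs = out_h * out_w * out_c * in_c * k * k
    weights = out_c * in_c * k * k
    return macs, weights

def fc_stats(in_n, out_n):
    """MACs and weight count for one fully connected layer."""
    return in_n * out_n, in_n * out_n

# AlexNet conv1: 11x11 kernels, 3 -> 96 channels, 55x55 output map.
conv_macs, conv_weights = conv_stats(55, 55, 96, 3, 11)
# AlexNet fc6: 9216 -> 4096 neurons.
fc_macs, fc_weights = fc_stats(9216, 4096)

# Arithmetic intensity: MACs performed per weight fetched. Each conv weight
# is reused once per output pixel (55 * 55 = 3025 times here); each FC
# weight is fetched from memory, used once, and discarded.
print(conv_macs // conv_weights)  # 3025 MACs per weight
print(fc_macs // fc_weights)      # 1 MAC per weight
```

This per-weight reuse gap is exactly what an output-first dataflow exploits: by holding partial output sums on chip, the accelerator amortizes each weight and input fetch over many MACs, which is why conv-layer bandwidth can be cut by orders of magnitude while FC layers remain limited by memory traffic.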