Sparse Ternary Convolutional Neural Network Model and its Hardware Design



Bibliographic Details
Main Authors: Chiu, Kuan-Lin, 邱冠霖
Other Authors: 張添烜
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/vkwg7p
Description
Summary: Master's === National Chiao Tung University === Institute of Electronics === 106 === Convolutional neural networks (CNNs) have taken off in the last few years, with especially impressive performance in computer vision. However, the computational complexity of state-of-the-art models is very high; as a result, powerful GPUs are needed to run CNN models. Several works reduce computation by quantizing weights and activations, but quantizing models directly can hurt accuracy. This thesis therefore proposes a systematic method, named progressive quantization, that simplifies models during training: weights are reduced from floating point to ternary values, activations are quantized from floating point to fixed point, and batch normalization is simplified at the proper stage of training. Training models with this method, the accuracy drops for ResNet-56 and DenseNet-40 are 1.61% and 3.9% respectively in our experiments. This thesis also proposes compatible hardware. Only non-zero values are loaded into the accelerator, via sparse matrix loading and a group-sort-and-merge method. In addition, the ternary weights allow multipliers to be replaced with multiplexers and shift operators. For data reuse, the thesis proposes input-view convolution to reduce the dependency between convolution inputs, and PE cooperation to compute different output feature maps with high parallelism from few inputs. Finally, an implementation synthesized in a TSMC 40nm process consumes 3.28M gates. With this implementation at a 500MHz clock, ResNet-56 on CIFAR-10 reaches 1684 FPS and ResNet-34 on ImageNet reaches 80 FPS.
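The core ideas above, ternarizing weights, replacing multiplication by a ternary weight with select/add/subtract, and processing only the non-zero weights, can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's exact algorithm: the threshold heuristic (0.7 times the mean absolute weight, as in Ternary Weight Networks) and all function names here are assumptions.

```python
def ternarize(weights, ratio=0.7):
    """Threshold-based ternarization: map each float weight to {-1, 0, +1}.

    The threshold is a fraction of the mean absolute weight, a common
    heuristic; the thesis's progressive schedule may choose it differently.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = ratio * mean_abs
    return [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]

def ternary_mac(inputs, tweights):
    """Multiply-accumulate with ternary weights, no multiplier needed.

    A weight of +1 adds the input, -1 subtracts it, and 0 skips it; in
    hardware this reduces to a multiplexer feeding an adder/subtractor.
    """
    acc = 0
    for x, w in zip(inputs, tweights):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
    return acc

def to_sparse(tweights):
    """Keep only non-zero weights as (index, value) pairs, mimicking the
    accelerator's loading of non-zero values only."""
    return [(i, w) for i, w in enumerate(tweights) if w != 0]

def sparse_ternary_mac(inputs, sparse_w):
    """Accumulate over the non-zero weights only: the zero entries are
    never fetched, which is the point of sparse loading."""
    acc = 0
    for i, w in sparse_w:
        acc += inputs[i] if w == 1 else -inputs[i]
    return acc

weights = [0.8, -0.05, 0.3, -0.9, 0.02]
tw = ternarize(weights)
print(tw)                               # -> [1, 0, 1, -1, 0]
print(ternary_mac([1, 2, 3, 4, 5], tw))  # same result as the sparse path
print(sparse_ternary_mac([1, 2, 3, 4, 5], to_sparse(tw)))
```

With 60% of the weights zeroed in this toy example, the sparse path touches only two fifths of the inputs while producing the same accumulation, which is the kind of saving the sparse-loading hardware exploits.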