Summary: | Master's === National Chiao Tung University === Institute of Electrical and Control Engineering === 105 === Automatically describing the content of images connects computer vision and natural language processing. This thesis combines object detection with image captioning to obtain better feature representations. For feature fusion, it proposes a simple weighting scheme that relies only on bounding-box attributes to sum local and global features. In addition, object coordinates are used to predict spatial relations between objects, providing a basis for describing human interactions.
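The fusion step above can be sketched as follows. This is a minimal illustration, not the thesis's exact method: the abstract only states that the weights are determined from bounding-box attributes, so using the normalized box area as the weight is an assumption, as are all function and variable names.

```python
import numpy as np

def fuse_features(global_feat, local_feats, boxes, img_w, img_h):
    """Weighted sum of per-box local features, added to the global feature.

    Weights are derived only from bounding-box attributes; here the
    normalized box area serves as the weight (an assumption -- the
    thesis only says weights come from box attributes).
    boxes: array of (x, y, w, h) in pixels, one row per detection.
    """
    areas = boxes[:, 2] * boxes[:, 3] / float(img_w * img_h)
    weights = areas / areas.sum()                    # normalize to sum to 1
    local = (weights[:, None] * local_feats).sum(axis=0)
    return global_feat + local                       # fused representation

# Toy usage: two detected boxes, 4-dimensional features.
g = np.ones(4)
l = np.arange(8, dtype=float).reshape(2, 4)
b = np.array([[0, 0, 10, 10], [0, 0, 30, 10]], dtype=float)
fused = fuse_features(g, l, b, 100, 100)
```

Because the weights are normalized, the fused local term stays on the same scale as any single region feature regardless of how many boxes the detector returns.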
The model is based on a novel combination of Convolutional Neural Networks and Recurrent Neural Networks, operating on regions of interest and sentences respectively, with an object embedding inserted at a middle network layer to reduce internal covariate shift and infer compressed features. The system is evaluated on the MS COCO dataset, which comprises 123,287 images and 616,435 descriptions. Experiments show improvements in BLEU-4, METEOR, ROUGE-L, and CIDEr scores while sustaining real-time performance at 26 FPS.
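One decoder step with the object embedding injected mid-network can be sketched as below. This is a hypothetical simplification under stated assumptions: the abstract does not specify the recurrent cell or the exact injection point, so a plain tanh RNN update that adds the object embedding alongside the word embedding is an illustrative choice, and all names (`rnn_caption_step`, the weight matrices) are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_caption_step(h, word_emb, obj_emb, Wh, Wx, Wo):
    """One recurrent decoder step.

    The object embedding (from detected regions) is mixed into the
    hidden-state update together with the current word embedding --
    an illustrative stand-in for the thesis's mid-layer insertion.
    """
    return np.tanh(Wh @ h + Wx @ word_emb + Wo @ obj_emb)

d = 8                                   # toy hidden/embedding size
h = np.zeros(d)                         # initial hidden state
Wh, Wx, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
word = rng.standard_normal(d)           # current word embedding
obj = rng.standard_normal(d)            # fused object embedding
h = rnn_caption_step(h, word, obj, Wh, Wx, Wo)
```

At each time step the decoder thus conditions on both the sentence generated so far and the detected objects, rather than on a single global image vector.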
|