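Evaluating and Improving Tesseract for Optical Character Recognition of Color Web Pages (評估與改進Tesseract運用於彩色網頁的光學字元辨識)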

Master's === National Central University === Department of Computer Science and Information Engineering === 107 === In the past, the success or failure of optical character recognition (OCR) was often inextricably linked to feature extraction. Without an effective feature, the results fall short of expectations. However, the improvement of hardware d...


Bibliographic Details
Main Authors: Yi-Cheng Chen, 陳奕誠
Other Authors: Yong-Bin Zheng
Format: Others
Language: zh-TW
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/887z76
id ndltd-TW-107NCU05392054
record_format oai_dc
spelling ndltd-TW-107NCU05392054 2019-10-22T05:28:10Z http://ndltd.ncl.edu.tw/handle/887z76 none 評估與改進Tesseract運用於彩色網頁的光學字元辨識 [Evaluating and Improving Tesseract for Optical Character Recognition of Color Web Pages] Yi-Cheng Chen 陳奕誠 Master's National Central University Department of Computer Science and Information Engineering 107 Yong-Bin Zheng 鄭永斌 2019 thesis 55 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Central University === Department of Computer Science and Information Engineering === 107 === In the past, the success or failure of optical character recognition (OCR) was often inextricably linked to feature extraction. Without an effective feature, the results fall short of expectations. However, improvements in hardware and computing power have made deep learning a popular field in recent years, because it can extract features automatically and find good features that strengthen OCR. According to IBM estimates, about $2.5 trillion a year is spent on digitizing non-digital documents by manual typing. If the recognition rate of OCR can be raised to an acceptable standard, it can save time and reduce costs. No tool today achieves a 100% recognition rate, because input images come from many different sources, and factors such as scanned files, camera noise, complex typography, text and background colors, icons of various sizes, and different languages and fonts greatly affect the results. The purpose of this thesis is to find a way to effectively improve the recognition rate of OCR software. We used screenshots of web pages, which give well-rectified images free of noise. Because the fonts are TrueType fonts, screenshots of the same page may differ from one screen to another. Our tests indicate that Google Vision, a cloud service, has a better recognition rate than the other software tested. However, many factories that need OCR are not connected to the Internet, so we chose Tesseract 4.0, which is open source. Our findings show that, given Tesseract 4.0's low baseline recognition rate, pre-processing improves its recognition rate more than training does. The poor training results are mainly caused by complex typography and varying text sizes.
author2 Yong-Bin Zheng
author_facet Yong-Bin Zheng
Yi-Cheng Chen
陳奕誠
author Yi-Cheng Chen
陳奕誠
spellingShingle Yi-Cheng Chen
陳奕誠
none
author_sort Yi-Cheng Chen
title none
title_short none
title_full none
title_fullStr none
title_full_unstemmed none
title_sort none
publishDate 2019
url http://ndltd.ncl.edu.tw/handle/887z76
work_keys_str_mv AT yichengchen none
AT chényìchéng none
AT yichengchen pínggūyǔgǎijìntesseractyùnyòngyúcǎisèwǎngyèdeguāngxuézìyuánbiànshí
AT chényìchéng pínggūyǔgǎijìntesseractyùnyòngyúcǎisèwǎngyèdeguāngxuézìyuánbiànshí
_version_ 1719273877985558528