SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper

Most existing studies and public datasets for handwritten Chinese text recognition are based on regular documents with clean, blank backgrounds, and little research has been reported on handwritten text recognition in challenging areas such as educational documents and financial bills. In this paper, we f...

Bibliographic Details
Main Authors:	Yuanzhi Zhu, Zecheng Xie, Lianwen Jin, Xiaoxue Chen, Yaoxiong Huang, Ming Zhang
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Offline handwritten Chinese text recognition (HCTR); educational documents; sequence transcription
Online Access:	https://ieeexplore.ieee.org/document/8565866/

id	doaj-86d6fb5e716645c5bf493b63cc784131
record_format	Article
spelling	doaj-86d6fb5e716645c5bf493b63cc784131 | 2021-03-29T22:06:21Z | eng | IEEE | IEEE Access | 2169-3536 | 2019-01-01 | 7 | 370 | 382 | 10.1109/ACCESS.2018.2885398 | 8565866 | SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper | Yuanzhi Zhu (0) | Zecheng Xie (1) | Lianwen Jin (2) https://orcid.org/0000-0002-5456-0957 | Xiaoxue Chen (3) | Yaoxiong Huang (4) | Ming Zhang (5) | (0) College of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | (1) College of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | (2) College of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | (3) College of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | (4) College of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | (5) AbcPen Inc., Hangzhou, China | Most existing studies and public datasets for handwritten Chinese text recognition are based on regular documents with clean, blank backgrounds, and little research has been reported on handwritten text recognition in challenging areas such as educational documents and financial bills. In this paper, we focus on examination paper text recognition and construct a challenging dataset named the examination paper text (SCUT-EPT) dataset, which contains 50 000 text line images (40 000 for training and 10 000 for testing) selected from the examination papers of 2 986 volunteers. The proposed SCUT-EPT dataset presents numerous novel challenges, including character erasure, text line supplement, character/phrase switching, noised background, nonuniform word size, and unbalanced text length. In our experiments, current advanced text recognition methods, such as the convolutional recurrent neural network (CRNN), exhibit poor performance on the proposed SCUT-EPT dataset, demonstrating the challenge and significance of the dataset. Nevertheless, through visualization and error analysis, we observe that humans can avoid the vast majority of the erroneous predictions, which reveals the limitations and drawbacks of current methods for handwritten Chinese text recognition (HCTR). Finally, three popular sequence transcription methods, connectionist temporal classification (CTC), the attention mechanism, and cascaded attention-CTC, are investigated for the HCTR problem. It is interesting to observe that although the attention mechanism has proved very effective in English scene text recognition, its performance is far inferior to that of the CTC method in the case of HCTR with a large-scale character set. | https://ieeexplore.ieee.org/document/8565866/ | Offline handwritten Chinese text recognition (HCTR) | educational documents | sequence transcription
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Yuanzhi Zhu, Zecheng Xie, Lianwen Jin, Xiaoxue Chen, Yaoxiong Huang, Ming Zhang
spellingShingle	Yuanzhi Zhu Zecheng Xie Lianwen Jin Xiaoxue Chen Yaoxiong Huang Ming Zhang SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper IEEE Access Offline handwritten Chinese text recognition (HCTR) educational documents sequence transcription
author_facet	Yuanzhi Zhu, Zecheng Xie, Lianwen Jin, Xiaoxue Chen, Yaoxiong Huang, Ming Zhang
author_sort	Yuanzhi Zhu
title	SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper
title_short	SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper
title_full	SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper
title_fullStr	SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper
title_full_unstemmed	SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper
title_sort	scut-ept: new dataset and benchmark for offline chinese text recognition in examination paper
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	Most existing studies and public datasets for handwritten Chinese text recognition are based on regular documents with clean, blank backgrounds, and little research has been reported on handwritten text recognition in challenging areas such as educational documents and financial bills. In this paper, we focus on examination paper text recognition and construct a challenging dataset named the examination paper text (SCUT-EPT) dataset, which contains 50 000 text line images (40 000 for training and 10 000 for testing) selected from the examination papers of 2 986 volunteers. The proposed SCUT-EPT dataset presents numerous novel challenges, including character erasure, text line supplement, character/phrase switching, noised background, nonuniform word size, and unbalanced text length. In our experiments, current advanced text recognition methods, such as the convolutional recurrent neural network (CRNN), exhibit poor performance on the proposed SCUT-EPT dataset, demonstrating the challenge and significance of the dataset. Nevertheless, through visualization and error analysis, we observe that humans can avoid the vast majority of the erroneous predictions, which reveals the limitations and drawbacks of current methods for handwritten Chinese text recognition (HCTR). Finally, three popular sequence transcription methods, connectionist temporal classification (CTC), the attention mechanism, and cascaded attention-CTC, are investigated for the HCTR problem. It is interesting to observe that although the attention mechanism has proved very effective in English scene text recognition, its performance is far inferior to that of the CTC method in the case of HCTR with a large-scale character set.
topic	Offline handwritten Chinese text recognition (HCTR); educational documents; sequence transcription
url	https://ieeexplore.ieee.org/document/8565866/
work_keys_str_mv	AT yuanzhizhu scuteptnewdatasetandbenchmarkforofflinechinesetextrecognitioninexaminationpaper AT zechengxie scuteptnewdatasetandbenchmarkforofflinechinesetextrecognitioninexaminationpaper AT lianwenjin scuteptnewdatasetandbenchmarkforofflinechinesetextrecognitioninexaminationpaper AT xiaoxuechen scuteptnewdatasetandbenchmarkforofflinechinesetextrecognitioninexaminationpaper AT yaoxionghuang scuteptnewdatasetandbenchmarkforofflinechinesetextrecognitioninexaminationpaper AT mingzhang scuteptnewdatasetandbenchmarkforofflinechinesetextrecognitioninexaminationpaper
_version_	1724192183173513216
