Transfer Learning for OCRopus Model Training on Early Printed Books

A method is presented that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building from already existing models during training instead of starting from scratch. To overcome the discrepancies between the character set of the pretrained model and the additional ground truth, the OCRopus code is adapted to allow for alphabet expansion or reduction. The character set can now flexibly add and delete characters from the pretrained alphabet when an existing model is loaded. For our experiments we use a self-trained mixed model on early Latin prints and the two standard OCRopus models on modern English and German Fraktur texts. The evaluation on seven early printed books showed that training from the Latin mixed model reduces the average number of errors by 43% and 26%, compared to training from scratch with 60 and 150 lines of ground truth, respectively. Furthermore, it is shown that even building from mixed models trained on standard data unrelated to the newly added training and test data can lead to significantly improved recognition results.

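The core of the method is to load a pretrained OCRopus line-recognition model and continue training on the new ground truth, adapting the model's character set (the codec) to the target material: characters that occur only in the new data are added, characters that no longer occur are dropped. The following plain-NumPy sketch illustrates that codec-adaptation idea under simple assumptions; the function adapt_output_layer, its parameter layout, and the example alphabets are hypothetical illustrations and not the authors' actual modification of the OCRopus code.

import numpy as np

def adapt_output_layer(old_alphabet, new_alphabet, W_out, b_out, rng=None):
    # Reuse trained output weights for characters kept from the old alphabet,
    # drop the rows of removed characters, and give newly added characters a
    # small random initialization so training can continue from the pretrained model.
    rng = rng if rng is not None else np.random.default_rng(0)
    hidden_dim = W_out.shape[1]
    W_new = np.empty((len(new_alphabet), hidden_dim))
    b_new = np.zeros(len(new_alphabet))
    old_index = {ch: i for i, ch in enumerate(old_alphabet)}
    for j, ch in enumerate(new_alphabet):
        if ch in old_index:                       # character kept: copy trained weights
            W_new[j] = W_out[old_index[ch]]
            b_new[j] = b_out[old_index[ch]]
        else:                                     # character added: fresh initialization
            W_new[j] = rng.normal(0.0, 0.01, size=hidden_dim)
    return W_new, b_new

# Hypothetical example: extend a modern Latin alphabet with the long s and an
# r rotunda as found in early prints, and drop the ampersand.
old_alphabet = list(" abcdefghijklmnopqrstuvwxyz&")
new_alphabet = [ch for ch in old_alphabet if ch != "&"] + ["ſ", "ꝛ"]
W_out = np.random.randn(len(old_alphabet), 100)   # stand-ins for pretrained weights
b_out = np.zeros(len(old_alphabet))
W_new, b_new = adapt_output_layer(old_alphabet, new_alphabet, W_out, b_out)
print(W_new.shape)  # (29, 100): 27 kept characters plus 2 new ones

Rows belonging to characters present in both alphabets keep their trained weights, which is what lets the adapted model benefit from the pretrained one even when the alphabets differ.
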
Bibliographic Details
Main Authors: Christian Reul, Christoph Wick, Uwe Springmann, Frank Puppe
Format: Article
Language: German
Published: Self-published via PubPub, 2017-12-01
Series: 027.7 : Zeitschrift für Bibliothekskultur, Vol. 5, No. 1 (2017), pp. 38-51
ISSN: 2296-0597
DOI: 10.12685/027.7-5-1-169
Affiliations: Chair for Artificial Intelligence and Applied Informatics, University of Würzburg (Christian Reul, Christoph Wick, Frank Puppe); Kallimachos Center for Digital Humanities, University of Würzburg (Uwe Springmann)