Aligning document layouts extracted with different OCR engines with clustering approach

Layout analysis is an essential step in information extraction from scanned document images. In this paper we propose an algorithm for aligning layouts generated with different OCR engines. The main requirement is to always generate the same layout for a given document image regardless of the OCR engine...

Bibliographic Details
Main Authors:	S. Tomovic, K. Pavlovic, M. Bajceta
Format:	Article
Language:	English
Published:	Elsevier 2021-09-01
Series:	Egyptian Informatics Journal
Subjects:	Document layout; Document indexing; Information retrieval; Clustering; OCR
Online Access:	http://www.sciencedirect.com/science/article/pii/S1110866520301638

id	doaj-144193e2d0cf457293b78a8dadd96312
record_format	Article
spelling	doaj-144193e2d0cf457293b78a8dadd96312 | 2021-09-19T04:55:13Z | eng | Elsevier | Egyptian Informatics Journal | ISSN 1110-8665 | 2021-09-01 | vol. 22, no. 3, pp. 329-338 | Aligning document layouts extracted with different OCR engines with clustering approach | S. Tomovic (Faculty of Mathematics and Natural Sciences, University of Montenegro, Cetinjska bb, 81000 Podgorica, Montenegro; corresponding author) | K. Pavlovic (Faculty of Mathematics and Natural Sciences, University of Montenegro, Cetinjska bb, 81000 Podgorica, Montenegro) | M. Bajceta (Datum Solutions, 81000 Podgorica, Montenegro)
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	S. Tomovic; K. Pavlovic; M. Bajceta
title	Aligning document layouts extracted with different OCR engines with clustering approach
publisher	Elsevier
series	Egyptian Informatics Journal
issn	1110-8665
publishDate	2021-09-01
description	Layout analysis is an essential step in information extraction from scanned document images. In this paper we propose an algorithm for aligning layouts generated with different OCR engines. The main requirement is to always generate the same layout for a given document image regardless of the OCR engine used for image processing. In that way, information extraction from scanned documents, which depends heavily on field positions in the document, does not depend on a specific OCR engine. In other words, it is sufficient to maintain universal extractor knowledge, and it is not necessary to train the extractor explicitly with samples processed by a specific OCR engine. The proposed algorithm can handle administrative documents with complex layouts.
topic	Document layout; Document indexing; Information retrieval; Clustering; OCR
url	http://www.sciencedirect.com/science/article/pii/S1110866520301638

Similar Items