Aligning document layouts extracted with different OCR engines with clustering approach

Layout analysis is an essential step in information extraction from scanned document images. In this paper we propose an algorithm for aligning layouts generated with different OCR engines. The main requirement is to always generate the same layout for a given document image regardless of the OCR engine...

Bibliographic Details
Main Authors: S. Tomovic, K. Pavlovic, M. Bajceta
Format: Article
Language: English
Published: Elsevier 2021-09-01
Series: Egyptian Informatics Journal
Subjects: OCR
Online Access: http://www.sciencedirect.com/science/article/pii/S1110866520301638
id doaj-144193e2d0cf457293b78a8dadd96312
record_format Article
spelling doaj-144193e2d0cf457293b78a8dadd96312
2021-09-19T04:55:13Z
eng
Elsevier
Egyptian Informatics Journal, ISSN 1110-8665, 2021-09-01, vol. 22, no. 3, pp. 329-338
Aligning document layouts extracted with different OCR engines with clustering approach
S. Tomovic (0), K. Pavlovic (1), M. Bajceta (2)
(0) Faculty of Mathematics and Natural Sciences, University of Montenegro, Cetinjska bb, 81000 Podgorica, Montenegro; corresponding author
(1) Faculty of Mathematics and Natural Sciences, University of Montenegro, Cetinjska bb, 81000 Podgorica, Montenegro
(2) Datum Solutions, 81000 Podgorica, Montenegro
Layout analysis is an essential step in information extraction from scanned document images. In this paper we propose an algorithm for aligning layouts generated with different OCR engines. The main requirement is to always generate the same layout for a given document image regardless of the OCR engine used for image processing. In that way, information extraction from scanned documents, which is heavily dependent on field positions in the document, does not depend on a specific OCR engine. In other words, it is sufficient to maintain universal extractor knowledge, and it is not necessary to train the extractor explicitly with samples processed by a specific OCR engine. The proposed algorithm can handle administrative documents with complex layouts.
http://www.sciencedirect.com/science/article/pii/S1110866520301638
Document layout; Document indexing; Information retrieval; Clustering; OCR
collection DOAJ
language English
format Article
sources DOAJ
author S. Tomovic
K. Pavlovic
M. Bajceta
title Aligning document layouts extracted with different OCR engines with clustering approach
publisher Elsevier
series Egyptian Informatics Journal
issn 1110-8665
publishDate 2021-09-01
description Layout analysis is an essential step in information extraction from scanned document images. In this paper we propose an algorithm for aligning layouts generated with different OCR engines. The main requirement is to always generate the same layout for a given document image regardless of the OCR engine used for image processing. In that way, information extraction from scanned documents, which is heavily dependent on field positions in the document, does not depend on a specific OCR engine. In other words, it is sufficient to maintain universal extractor knowledge, and it is not necessary to train the extractor explicitly with samples processed by a specific OCR engine. The proposed algorithm can handle administrative documents with complex layouts.
topic Document layout
Document indexing
Information retrieval
Clustering
OCR
url http://www.sciencedirect.com/science/article/pii/S1110866520301638