Retrieval of handwritten historical document images

Bibliographic Details
Main Author: Rath, Toni Maximilian
Language: English
Published: ScholarWorks@UMass Amherst 2005
Online Access: https://scholarworks.umass.edu/dissertations/AAI3193936
Description
Summary: Historical library collections across the world hold huge numbers of handwritten documents. By digitizing these manuscripts, their content can be preserved and made available to a large community via the Internet or other electronic media. Such corpora can nowadays be shared relatively easily, but they are often large, unstructured, and only available in image formats, which makes them difficult to access. In particular, finding specific locations of interest in a handwritten image collection is generally very tedious without some sort of index or other access tool. The current solution for this problem is to manually annotate a historical collection, which is very costly in terms of time and money. In this work we explore several automatic techniques that allow the retrieval of handwritten document images with text queries. These are (i) word spotting, an approach that clusters word images to identify and annotate content-bearing words in a collection, (ii) handwriting recognition followed by text retrieval, and (iii) cross-modal retrieval models, which capture the joint occurrence of annotations and word image features in a probabilistic model. We compare the performance of these approaches empirically on several test collections. The main contributions of this work are a detailed examination of retrieval approaches for historical manuscripts, and the development of the first image retrieval system for historical manuscripts that allows text queries. This system extends the field of digital libraries beyond machine-printed text into historical handwritten documents. Building such a system involves challenges on numerous levels: the noisy historical manuscript domain requires adequate image filtering, normalization and representation techniques, as well as a robust and scalable retrieval framework. We describe the construction of a prototype system, which demonstrates the feasibility of the proposed techniques for a large collection of handwritten historical documents.