Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections

We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the fo...

Full description

Bibliographic Details
Main Authors:	David Owen, Quentin Groom, Alex Hardisty, Thijs Leegwater, Laurence Livermore, Myriam van Walsum, Noortje Wijkamp, Irena Spasić
Format:	Article
Language:	English
Published:	Pensoft Publishers 2020-08-01
Series:	Research Ideas and Outcomes
Subjects:	automated text digitisation natural language proc
Online Access:	https://riojournal.com/article/58030/download/pdf/

id	doaj-a30624f2643248ad8c1aab6b1a560c6d
record_format	Article
spelling	doaj-a30624f2643248ad8c1aab6b1a560c6d; 2020-11-25T03:51:34Z; eng; Pensoft Publishers; Research Ideas and Outcomes; 2367-7163; 2020-08-01; 6; 1; 29; 10.3897/rio.6.e58030; 58030; Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections; David Owen [0], Quentin Groom [1], Alex Hardisty [2], Thijs Leegwater [3], Laurence Livermore [4], Myriam van Walsum [5], Noortje Wijkamp [6], Irena Spasić [7]; [0] Cardiff University, [1] Meise Botanic Garden, [2] Cardiff University, [3] Picturae, [4] The Natural History Museum, [5] Naturalis Biodiversity Centre, [6] Picturae, [7] Cardiff University; https://riojournal.com/article/58030/download/pdf/; automated text digitisation natural language proc
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	David Owen Quentin Groom Alex Hardisty Thijs Leegwater Laurence Livermore Myriam van Walsum Noortje Wijkamp Irena Spasić
author_sort	David Owen
title	Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections
publisher	Pensoft Publishers
series	Research Ideas and Outcomes
issn	2367-7163
publishDate	2020-08-01
description	We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on state-of-the-art technologies. Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images. Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text. Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. We have highlighted the main recommendations for potential pipeline components. The paper also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
topic	automated text digitisation natural language proc
url	https://riojournal.com/article/58030/download/pdf/


Similar Items