The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

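In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/). Funding statement: Funding was provided by grants from NSF SES 1656284, ASU Presidential Strategic Initiative Fund and the Smart Family Foundation.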

Bibliographic Details
Main Authors: Julia Damerow, B. R. Erick Peirson, Manfred D. Laubichler
Format: Article
Language: English
Published: Ubiquity Press 2017-09-01
Series: Journal of Open Research Software
Subjects: OCR
Online Access: https://openresearchsoftware.metajnl.com/articles/164
id doaj-81b79b76877c440b84393650a78a9d65
record_format Article
spelling doaj-81b79b76877c440b84393650a78a9d65 | 2020-11-24T21:40:23Z | eng | Ubiquity Press | Journal of Open Research Software | 2049-9647 | 2017-09-01 | 5 | 1 | 10.5334/jors.164 | 133 | The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents | Julia Damerow (0) | B. R. Erick Peirson (1) | Manfred D. Laubichler (2) | Arizona State University | Arizona State University | Arizona State University | In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/). Funding statement: Funding was provided by grants from NSF SES 1656284, ASU Presidential Strategic Initiative Fund and the Smart Family Foundation. | https://openresearchsoftware.metajnl.com/articles/164 | Text extraction | OCR | Document storage | Apache Kafka | Java | Spring Framework
collection DOAJ
language English
format Article
sources DOAJ
author Julia Damerow
B. R. Erick Peirson
Manfred D. Laubichler
title The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents
publisher Ubiquity Press
series Journal of Open Research Software
issn 2049-9647
publishDate 2017-09-01
description In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/). Funding statement: Funding was provided by grants from NSF SES 1656284, ASU Presidential Strategic Initiative Fund and the Smart Family Foundation.
topic Text extraction
OCR
Document storage
Apache Kafka
Java
Spring Framework
url https://openresearchsoftware.metajnl.com/articles/164
_version_ 1725926156223381504