The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, th...

Bibliographic Details
Main Authors:	Julia Damerow, B. R. Erick Peirson, Manfred D. Laubichler
Format:	Article
Language:	English
Published:	Ubiquity Press 2017-09-01
Series:	Journal of Open Research Software
Subjects:	Text extraction; OCR; Document storage; Apache Kafka; Java; Spring Framework
Online Access:	https://openresearchsoftware.metajnl.com/articles/164

id	doaj-81b79b76877c440b84393650a78a9d65
record_format	Article
spelling	doaj-81b79b76877c440b84393650a78a9d65 | 2020-11-24T21:40:23Z | eng | Ubiquity Press | Journal of Open Research Software | 2049-9647 | 2017-09-01 | 5 | 1 | 10.5334/jors.164 | 133 | The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents | Julia Damerow (Arizona State University); B. R. Erick Peirson (Arizona State University); Manfred D. Laubichler (Arizona State University)
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Julia Damerow; B. R. Erick Peirson; Manfred D. Laubichler
author_facet	Julia Damerow; B. R. Erick Peirson; Manfred D. Laubichler
author_sort	Julia Damerow
title	The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents
publisher	Ubiquity Press
series	Journal of Open Research Software
issn	2049-9647
publishDate	2017-09-01
description	In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/). Funding statement: Funding was provided by grants from NSF SES 1656284, ASU Presidential Strategic Initiative Fund and the Smart Family Foundation.
topic	Text extraction; OCR; Document storage; Apache Kafka; Java; Spring Framework
url	https://openresearchsoftware.metajnl.com/articles/164

Similar Items