The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents
In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, th...
Main Authors: | Julia Damerow, B. R. Erick Peirson, Manfred D. Laubichler |
---|---|
Format: | Article |
Language: | English |
Published: | Ubiquity Press, 2017-09-01 |
Series: | Journal of Open Research Software |
Online Access: | https://openresearchsoftware.metajnl.com/articles/164 |
Similar Items
- Aligning document layouts extracted with different OCR engines with clustering approach
  by: S. Tomovic, et al.
  Published: (2021-09-01)
- Utilization of OCR and text feature extraction to create a database of labour complaints
  by: Yan Puspitarani, et al.
  Published: (2020-08-01)
- Influencing Factors in the Scalability of Distributed Stream Processing Jobs
  by: Giselle van Dongen, et al.
  Published: (2021-01-01)
- A Performance Analysis of Fault Recovery in Stream Processing Frameworks
  by: Giselle van Dongen, et al.
  Published: (2021-01-01)
- Basic Test Framework for the Evaluation of Text Line Segmentation and Text Parameter Extraction
  by: Darko Brodić, et al.
  Published: (2010-05-01)