The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

In the digital humanities, there is a constant need to turn images and PDF files into plain text so that techniques such as topic modelling and named entity recognition can be applied. However, although different solutions exist to extract text embedded in PDF files or run OCR on images, th...

Bibliographic Details
Main Authors:	Julia Damerow, B. R. Erick Peirson, Manfred D. Laubichler
Format:	Article
Language:	English
Published:	Ubiquity Press 2017-09-01
Series:	Journal of Open Research Software
Subjects:	Text extraction; OCR; Document storage; Apache Kafka; Java; Spring Framework
Online Access:	https://openresearchsoftware.metajnl.com/articles/164

Similar Items

Aligning document layouts extracted with different OCR engines with clustering approach
by: S. Tomovic, et al.
Published: (2021-09-01)

Utilization of OCR and text feature extraction to create a database of labour complaints
by: Yan Puspitarani, et al.
Published: (2020-08-01)

Influencing Factors in the Scalability of Distributed Stream Processing Jobs
by: Giselle van Dongen, et al.
Published: (2021-01-01)

A Performance Analysis of Fault Recovery in Stream Processing Frameworks
by: Giselle van Dongen, et al.
Published: (2021-01-01)

Basic Test Framework for the Evaluation of Text Line Segmentation and Text Parameter Extraction
by: Darko Brodić, et al.
Published: (2010-05-01)

OCR-D - Koordinierte Förderinitiative zur Weiterentwicklung von OCR-Verfahren
by: Elisa Herrmann, et al.
Published: (2017-12-01)

LSTM Network and OCR Performance for Classification of Decimal Dewey Classification Code
by: Yesy Diah Rosita, et al.
Published: (2020-04-01)

Scalable Detection and Extraction of Data in Lists in OCRed Text for Ontology Population Using Semi-Supervised and Unsupervised Active Wrapper Induction
by: Packer, Thomas L
Published: (2014)

Experimental evaluation of Arabic OCR systems
by: Mansoor Alghamdi, et al.
Published: (2017-11-01)

A Holistic Technique for an Arabic OCR System
by: Farhan M. A. Nashwan, et al.
Published: (2017-12-01)

Text Segmentation of Historical Degraded Handwritten Documents
by: Nina, Oliver
Published: (2010)

Extraction hybride et description structurelle de caractères pour une reconnaissance efficace de texte dans les documents hétérogènes scannés : Méthodes et Algorithmes parallèles
by: Soua, Mahmoud
Published: (2016)

Generating an Ordered Data Set from an OCR Text File
by: Jon Crump
Published: (2014-11-01)

Test av OCR-verktyg för Linux
by: Nilsson, Elin
Published: (2010)

Issues & Challenges in Urdu OCR
by: Urooba Zaki, et al.
Published: (2019-02-01)

Pengenalan Pola Berbasis OCR untuk Pengambilan Data Bursa Saham
by: M. Dyovan Uidy Okta, et al.
Published: (2021-06-01)

A Regularization-Based Big Data Framework for Winter Precipitation Forecasting on Streaming Data
by: Andreas Kanavos, et al.
Published: (2021-08-01)

Comparison of Sakhr and Google Optical Characters for Persian Language
by: Mehrnaz Khorasanchi, et al.
Published: (2021-07-01)

MMU-OCR-21: Towards End-to-End Urdu Text Recognition Using Deep Learning
by: Tayyab Nasir, et al.
Published: (2021-01-01)

Fast Binarization of Unevenly Illuminated Document Images Based on Background Estimation for Optical Character Recognition Purposes
by: Hubert Michalak, et al.
Published: (2019-06-01)

Amharic OCR: An End-to-End Learning
by: Birhanu Belay, et al.
Published: (2020-02-01)

Document Image Processing
by: Laurence Likforman-Sulem, et al.
Published: (2018-06-01)

Implementering av testplattform för end-to-end streaming telemetry i nätverk
by: Erlandsson, Niklas
Published: (2020)

Optimization of the Gaussian Kernel Extended by Binary Morphology for Text Line Segmentation
by: Z. Milivojevic, et al.
Published: (2010-12-01)

Document Nature as a Text Feature (Exemplified by PR Texts)
by: Ekaterina Sergeevna Buslaeva
Published: (2016-04-01)

DoCA: A Content-Based Automatic Classification System Over Digital Documents
by: Suleyman Eken, et al.
Published: (2019-01-01)

OCR Application on Smartphone for Visually Impaired People
by: TEPELEA Laviniu, et al.
Published: (2014-05-01)

Thinning: A Preprocessing Technique for an OCR System for the Brahmi Script
by: H. K. Anasuya Devi
Published: (2006-12-01)

A Review of Arabic Text Recognition Dataset
by: Idris Saleh Al-Sheikh, et al.
Published: (2020-06-01)

Ensemble Methods for Historical Machine-Printed Document Recognition
by: Lund, William B.
Published: (2014)

Hur ser framtiden ut för OCR?
by: Lund, Mikael
Published: (2007)

Objective and Subjective Complexity of Document Texts
by: Anastasiya Alekseevna Dyakova
Published: (2016-04-01)

Automated invoice handling with machine learning and OCR
by: Larsson, Andreas, et al.
Published: (2016)

Semantic Text Segmentation from Synthetic Images of Full-Text Documents
by: Lukáš Bureš, et al.
Published: (2019-12-01)

Short Text Document Clustering using Distributed Word Representation and Document Distance
by: Supavit KONGWUDHIKUNAKORN, et al.
Published: (2018-03-01)

Prototypische Entwicklung eines mandantenfähigen dezentralen Austauschsystems für hochsensible Daten
by: Stockhaus, Christian
Published: (2017)

Cloud Based System Integration : System Integration between Salesforce.com and Web-based ERP System using Apache Camel
by: Söder, Mikael, et al.
Published: (2017)

Thresholding: A Pixel-Level Image Processing Methodology Preprocessing Technique for an OCR System for the Brahmi Script
by: H. K. Anasuya Devi
Published: (2006-12-01)

Bayesian Text Analytics for Document Collections
by: Walker, Daniel David
Published: (2012)

Features of Document and Literary Texts
by: Grigoriy Valeryevich Tokarev
Published: (2016-04-01)