DoCA: A Content-Based Automatic Classification System Over Digital Documents

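Regardless of industry, the overload of information facing most organizations today is a drain on both individuals and the enterprise itself. The increasing volume of this information, which is stored in different electronic formats, requires new, sophisticated systems to analyse and classify it. In this paper, we attempt to implement a framework, Document Classification and Analysis (DoCA), that can simplify and automate such tasks for different file types, namely office documents (text, spreadsheets, and presentations), scanned documents (images and PDFs), and multimedia files (video and audio). Each file type requires different methods for pre-processing, analysis, and classification. The efficiency and feasibility of DoCA are examined on the HAVELSAN dataset, and the accuracy obtained on the different tasks shows that DoCA is a promising tool for document analysis and classification.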

Bibliographic Details
Main Authors: Suleyman Eken, Houssem Menhour, Kubra Koksal
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects:
OCR
Online Access: https://ieeexplore.ieee.org/document/8768370/
id doaj-a7eb84a1c3084bfc8d01094e699e617d
record_format Article
spelling doaj-a7eb84a1c3084bfc8d01094e699e617d
2021-03-29T23:39:01Z
eng
IEEE
IEEE Access, ISSN 2169-3536
2019-01-01, vol. 7, pp. 97996-98004
DOI 10.1109/ACCESS.2019.2930339
IEEE document 8768370
DoCA: A Content-Based Automatic Classification System Over Digital Documents
Suleyman Eken, https://orcid.org/0000-0001-9488-908X, Department of Computer Engineering, Umuttepe Campus, Kocaeli University, Kocaeli, Turkey
Houssem Menhour, https://orcid.org/0000-0001-8920-7830, Department of Computer Engineering, Umuttepe Campus, Kocaeli University, Kocaeli, Turkey
Kubra Koksal, Department of Computer Engineering, Umuttepe Campus, Kocaeli University, Kocaeli, Turkey
collection DOAJ
language English
format Article
sources DOAJ
author Suleyman Eken
Houssem Menhour
Kubra Koksal
title DoCA: A Content-Based Automatic Classification System Over Digital Documents
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Regardless of industry, the overload of information facing most organizations today is a drain on both individuals and the enterprise itself. The increasing volume of this information, which is stored in different electronic formats, requires new, sophisticated systems to analyse and classify it. In this paper, we attempt to implement a framework, Document Classification and Analysis (DoCA), that can simplify and automate such tasks for different file types, namely office documents (text, spreadsheets, and presentations), scanned documents (images and PDFs), and multimedia files (video and audio). Each file type requires different methods for pre-processing, analysis, and classification. The efficiency and feasibility of DoCA are examined on the HAVELSAN dataset, and the accuracy obtained on the different tasks shows that DoCA is a promising tool for document analysis and classification.
topic Document analysis
document classification
OCR
video-audio analysis
url https://ieeexplore.ieee.org/document/8768370/