Text Categorization with Latent Dirichlet Allocation

This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents. We want to use these better organized text subcorpora to build more robust language models that can be used in the area of...


Bibliographic Details
Main Authors: ZLACKÝ Daniel, STAŠ Ján, JUHÁR Jozef, CIŽMÁR Anton
Format: Article
Language: English
Published: Editura Universităţii din Oradea 2014-05-01
Series: Journal of Electrical and Electronics Engineering
Subjects: language modeling, latent Dirichlet allocation, speech recognition, text categorization
Online Access: http://electroinf.uoradea.ro/images/articles/CERCETARE/Reviste/JEEE/JEEE_V7_N1_MAY_2014/Zlacky_may2014.pdf
id doaj-19fdad600d194f1cbf1542e9c3563675
record_format Article
spelling doaj-19fdad600d194f1cbf1542e9c3563675
2020-11-25T02:44:08Z
eng
Editura Universităţii din Oradea
Journal of Electrical and Electronics Engineering
1844-6035
2067-2128
2014-05-01
7 1 161 164
Text Categorization with Latent Dirichlet Allocation
ZLACKÝ Daniel 0
STAŠ Ján 1
JUHÁR Jozef 2
CIŽMÁR Anton 3
Technical University of Košice, Slovak Republic, Dep. of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics
This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents. We want to use these better organized text subcorpora to build more robust language models that can be used in the area of speech recognition systems. Our previous research in the area of text categorization showed that we can achieve better results with categorized text corpora. In this paper we used latent Dirichlet allocation for text categorization. We divided the initial text corpus into 2, 5, 10, 20, or 100 subcorpora, using various numbers of iterations and save steps. Language models were built on these subcorpora and adapted to the judicial domain with linear interpolation. The experimental results showed that text categorization using latent Dirichlet allocation can improve an automatic speech recognition system by building its language models from organized text corpora.
http://electroinf.uoradea.ro/images/articles/CERCETARE/Reviste/JEEE/JEEE_V7_N1_MAY_2014/Zlacky_may2014.pdf
language modeling
latent Dirichlet allocation
speech recognition
text categorization
collection DOAJ
language English
format Article
sources DOAJ
author ZLACKÝ Daniel
STAŠ Ján
JUHÁR Jozef
CIŽMÁR Anton
title Text Categorization with Latent Dirichlet Allocation
publisher Editura Universităţii din Oradea
series Journal of Electrical and Electronics Engineering
issn 1844-6035
2067-2128
publishDate 2014-05-01
description This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents. We want to use these better organized text subcorpora to build more robust language models that can be used in the area of speech recognition systems. Our previous research in the area of text categorization showed that we can achieve better results with categorized text corpora. In this paper we used latent Dirichlet allocation for text categorization. We divided the initial text corpus into 2, 5, 10, 20, or 100 subcorpora, using various numbers of iterations and save steps. Language models were built on these subcorpora and adapted to the judicial domain with linear interpolation. The experimental results showed that text categorization using latent Dirichlet allocation can improve an automatic speech recognition system by building its language models from organized text corpora.
topic language modeling
latent Dirichlet allocation
speech recognition
text categorization
url http://electroinf.uoradea.ro/images/articles/CERCETARE/Reviste/JEEE/JEEE_V7_N1_MAY_2014/Zlacky_may2014.pdf
_version_ 1724767232096993280
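
Illustrative note: the description above says that language models built on the subcorpora were adapted with linear interpolation. The record does not give the toolkit, the n-gram order, or the interpolation weights, so the following is only a minimal sketch under stated assumptions: unigram models stand in for the paper's language models, and the token lists, the function names `unigram_model` and `interpolate`, and the weights 0.75 and 0.25 are all hypothetical.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities for one (sub)corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(models, weights):
    """Linear interpolation: P(w) = sum_i lambda_i * P_i(w).

    The weights (lambdas) must sum to 1 so that the mixture is still
    a probability distribution.
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocabulary = set().union(*models)
    return {
        word: sum(lam * model.get(word, 0.0)
                  for lam, model in zip(weights, models))
        for word in vocabulary
    }


# Hypothetical toy subcorpora, not data from the paper.
subcorpus_a = ["court", "judge", "court", "law"]
subcorpus_b = ["goal", "match", "goal", "team"]

model_a = unigram_model(subcorpus_a)
model_b = unigram_model(subcorpus_b)

# Hypothetical weights that favour the subcorpus closer to the target domain.
mixed = interpolate([model_a, model_b], [0.75, 0.25])
```

In practice the weights would typically be tuned on held-out text from the target domain, for example by minimizing perplexity, rather than fixed by hand as in this sketch.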