Text Categorization with Latent Dirichlet Allocation
This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents. We want to use these better organized text subcorpora to build more robust language models that can be used in the area of...
Main Authors: | ZLACKÝ Daniel, STAŠ Ján, JUHÁR Jozef, CIŽMÁR Anton |
---|---|
Format: | Article |
Language: | English |
Published: | Editura Universităţii din Oradea, 2014-05-01 |
Series: | Journal of Electrical and Electronics Engineering |
Subjects: | language modeling, latent Dirichlet allocation, speech recognition, text categorization |
Online Access: | http://electroinf.uoradea.ro/images/articles/CERCETARE/Reviste/JEEE/JEEE_V7_N1_MAY_2014/Zlacky_may2014.pdf |
id |
doaj-19fdad600d194f1cbf1542e9c3563675 |
---|---|
record_format |
Article |
spelling |
doaj-19fdad600d194f1cbf1542e9c3563675; 2020-11-25T02:44:08Z; eng; Editura Universităţii din Oradea; Journal of Electrical and Electronics Engineering; ISSN 1844-6035, 2067-2128; 2014-05-01; vol. 7, no. 1, pp. 161-164; Text Categorization with Latent Dirichlet Allocation; ZLACKÝ Daniel, STAŠ Ján, JUHÁR Jozef, CIŽMÁR Anton (all: Technical University of Košice, Slovak Republic, Dep. of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics); This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents. We want to use these better organized text subcorpora to build more robust language models that can be used in speech recognition systems. Our previous research in the area of text categorization showed that we can achieve better results with categorized text corpora. In this paper we used latent Dirichlet allocation for text categorization. We divided the initial text corpus into 2, 5, 10, 20 or 100 subcorpora, using various numbers of iterations and save steps. Language models were built on these subcorpora and adapted to the judicial domain with linear interpolation. The experimental results showed that text categorization using latent Dirichlet allocation can improve an automatic speech recognition system when its language models are built from organized text corpora.; http://electroinf.uoradea.ro/images/articles/CERCETARE/Reviste/JEEE/JEEE_V7_N1_MAY_2014/Zlacky_may2014.pdf; language modeling, latent Dirichlet allocation, speech recognition, text categorization |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
ZLACKÝ Daniel STAŠ Ján JUHÁR Jozef CIŽMÁR Anton |
author_sort |
ZLACKÝ Daniel |
title |
Text Categorization with Latent Dirichlet Allocation |
title_sort |
text categorization with latent dirichlet allocation |
publisher |
Editura Universităţii din Oradea |
series |
Journal of Electrical and Electronics Engineering |
issn |
1844-6035 2067-2128 |
publishDate |
2014-05-01 |
description |
This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents. We want to use these better organized text subcorpora to build more robust language models that can be used in speech recognition systems. Our previous research in the area of text categorization showed that we can achieve better results with categorized text corpora. In this paper we used latent Dirichlet allocation for text categorization. We divided the initial text corpus into 2, 5, 10, 20 or 100 subcorpora, using various numbers of iterations and save steps. Language models were built on these subcorpora and adapted to the judicial domain with linear interpolation. The experimental results showed that text categorization using latent Dirichlet allocation can improve an automatic speech recognition system when its language models are built from organized text corpora. |
topic |
language modeling latent Dirichlet allocation speech recognition text categorization |
url |
http://electroinf.uoradea.ro/images/articles/CERCETARE/Reviste/JEEE/JEEE_V7_N1_MAY_2014/Zlacky_may2014.pdf |
_version_ |
1724767232096993280 |
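
The description above mentions two steps: splitting a corpus into subcorpora by topic, and adapting language models to a target domain with linear interpolation. The following is a minimal sketch of both ideas, not the authors' implementation. The record does not state the toolkit, the rule for assigning documents to subcorpora, the model order, or the interpolation weights, so everything here is an assumption: the function names `assign_to_subcorpora` and `interpolate` are hypothetical, documents are assigned to their single most probable topic, and the language models are plain unigram dictionaries for brevity.

```python
from collections import defaultdict


def assign_to_subcorpora(doc_topic):
    """Group document indices by their most probable topic.

    doc_topic: one list of topic probabilities per document, e.g. the
    per-document topic distribution produced by an LDA model.
    Assumption: hard assignment by argmax; the source does not say
    which assignment rule was used.
    """
    subcorpora = defaultdict(list)
    for doc_id, weights in enumerate(doc_topic):
        best_topic = max(range(len(weights)), key=weights.__getitem__)
        subcorpora[best_topic].append(doc_id)
    return dict(subcorpora)


def interpolate(models, lambdas):
    """Linear interpolation: P(w) = sum_i lambda_i * P_i(w).

    models:  list of dicts mapping word -> probability (unigram models,
             an illustrative simplification).
    lambdas: one non-negative weight per model; weights must sum to 1.
    """
    if len(models) != len(lambdas):
        raise ValueError("need exactly one weight per model")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocab = set().union(*models)  # union of all model vocabularies
    return {
        w: sum(lam * m.get(w, 0.0) for lam, m in zip(lambdas, models))
        for w in vocab
    }


# Illustrative values only; these are not data from the paper.
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
groups = assign_to_subcorpora(doc_topic)

in_domain = {"court": 0.5, "law": 0.5}
subcorpus_model = {"court": 0.1, "news": 0.9}
mixed = interpolate([in_domain, subcorpus_model], [0.7, 0.3])
```

In practice the weights are usually tuned on held-out in-domain text, for example by minimizing perplexity; whether the authors did so is not stated in this record.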