Text classification using a hidden Markov model
Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals a...
Main Author: | Yi, Kwan, 1963- |
---|---|
Format: | Others |
Language: | en |
Published: | McGill University, 2005 |
Subjects: | Markov processes; Classification -- Automation |
Online Access: | http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214 |
id | ndltd-LACETR-oai-collectionscanada.gc.ca-QMM.85214 |
---|---|
record_format | oai_dc |
spelling | ndltd-LACETR-oai-collectionscanada.gc.ca-QMM.85214; 2014-02-13T04:05:22Z; Text classification using a hidden Markov model; Yi, Kwan, 1963-; Markov processes; Classification -- Automation; McGill University; 2005; Electronic Thesis or Dissertation; application/pdf; en; alephsysno: 002211451; proquestno: AAINR12966; Theses scanned by UMI/ProQuest. All items in eScholarship@McGill are protected by copyright with all rights reserved unless otherwise indicated. Doctor of Philosophy (Graduate School of Library and Information Studies); http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214 |
collection | NDLTD |
language | en |
format | Others |
sources | NDLTD |
topic | Markov processes; Classification -- Automation |
description | Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. This study has two primary goals. First, a Hidden Markov Model (HMM) is proposed as a relatively new method for text categorization. HMMs have been applied to a wide range of text-processing tasks, such as text segmentation and event tracking, information retrieval, and information extraction, but few studies have applied them to TC. Second, the Library of Congress Classification (LCC) is adopted as the classification scheme for the HMM-based TC model for categorizing digital documents. LCC has been used in only a handful of experiments on automatic classification. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented to categorize digitized documents into LCC classes. A sample of abstracts from the ProQuest Digital Dissertations database is used as the test base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test base for evaluating the proposed model of automatic TC. For comparative purposes, a Naive Bayesian model, which has been extensively used in TC applications, is also implemented. Our experimental results show that the performance of our model surpasses that of the Naive Bayesian model, as measured by comparing the automatic classification of abstracts to the manual classification performed by professionals. |
author | Yi, Kwan, 1963- |
title | Text classification using a hidden Markov model |
publisher | McGill University |
publishDate | 2005 |
url | http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214 |
work_keys_str_mv | AT yikwan1963 textclassificationusingahiddenmarkovmodel |
_version_ | 1716645087952240640 |