Text classification using a hidden Markov model

Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals a...


Bibliographic Details
Main Author: Yi, Kwan, 1963-
Format: Others
Language: en
Published: McGill University 2005
Subjects: Markov processes; Classification -- Automation
Online Access: http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214
id ndltd-LACETR-oai-collectionscanada.gc.ca-QMM.85214
record_format oai_dc
spelling ndltd-LACETR-oai-collectionscanada.gc.ca-QMM.85214
2014-02-13T04:05:22Z
Text classification using a hidden Markov model
Yi, Kwan, 1963-
Markov processes.
Classification -- Automation
Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals are intended. First, a Hidden Markov Model (HMM) is proposed as a relatively new method for text categorization. HMM has been applied to a wide range of applications in text processing such as text segmentation and event tracking, information retrieval, and information extraction. Few, however, have applied HMM to TC. Second, the Library of Congress Classification (LCC) is adopted as a classification scheme for the HMM-based TC model for categorizing digital documents. LCC has been used only in a handful of experiments for the purpose of automatic classification. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented so as to categorize digitalized documents into LCC. A sample of abstracts from the ProQuest Digital Dissertations database is used for the test-base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test-base for evaluating the proposed model of automatic TC. For comparative purposes, a Naive Bayesian model, which has been extensively used in TC applications, is also implemented. Our experimental results show that the performance of our model surpasses that of the Naive Bayesian model as measured by comparing the automatic classification of abstracts to the manual classification performed by professionals.
McGill University
2005
Electronic Thesis or Dissertation
application/pdf
en
alephsysno: 002211451
proquestno: AAINR12966
Theses scanned by UMI/ProQuest.
All items in eScholarship@McGill are protected by copyright with all rights reserved unless otherwise indicated.
Doctor of Philosophy (Graduate School of Library and Information Studies.)
http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214
collection NDLTD
language en
format Others
sources NDLTD
topic Markov processes.
Classification -- Automation
spellingShingle Markov processes.
Classification -- Automation
Yi, Kwan, 1963-
Text classification using a hidden Markov model
description Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals are intended. First, a Hidden Markov Model (HMM) is proposed as a relatively new method for text categorization. HMM has been applied to a wide range of applications in text processing such as text segmentation and event tracking, information retrieval, and information extraction. Few, however, have applied HMM to TC. Second, the Library of Congress Classification (LCC) is adopted as a classification scheme for the HMM-based TC model for categorizing digital documents. LCC has been used only in a handful of experiments for the purpose of automatic classification. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented so as to categorize digitalized documents into LCC. A sample of abstracts from the ProQuest Digital Dissertations database is used for the test-base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test-base for evaluating the proposed model of automatic TC. For comparative purposes, a Naive Bayesian model, which has been extensively used in TC applications, is also implemented. Our experimental results show that the performance of our model surpasses that of the Naive Bayesian model as measured by comparing the automatic classification of abstracts to the manual classification performed by professionals.
author Yi, Kwan, 1963-
author_facet Yi, Kwan, 1963-
author_sort Yi, Kwan, 1963-
title Text classification using a hidden Markov model
title_short Text classification using a hidden Markov model
title_full Text classification using a hidden Markov model
title_fullStr Text classification using a hidden Markov model
title_full_unstemmed Text classification using a hidden Markov model
title_sort text classification using a hidden markov model
publisher McGill University
publishDate 2005
url http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214
work_keys_str_mv AT yikwan1963 textclassificationusingahiddenmarkovmodel
_version_ 1716645087952240640