Text classification using a hidden Markov model

Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals a...

Bibliographic Details
Main Author:	Yi, Kwan, 1963-
Format:	Others
Language:	en
Published:	McGill University 2005
Subjects:	Markov processes. Classification > Automation
Online Access:	http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214

id	ndltd-LACETR-oai-collectionscanada.gc.ca-QMM.85214
record_format	oai_dc
spelling	ndltd-LACETR-oai-collectionscanada.gc.ca-QMM.85214 2014-02-13T04:05:22Z Text classification using a hidden Markov model Yi, Kwan, 1963- Markov processes. Classification -- Automation Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals are intended. First, a Hidden Markov Model (HMM) is proposed as a relatively new method for text categorization. HMM has been applied to a wide range of applications in text processing such as text segmentation and event tracking, information retrieval, and information extraction. Few, however, have applied HMM to TC. Second, the Library of Congress Classification (LCC) is adopted as a classification scheme for the HMM-based TC model for categorizing digital documents. LCC has been used only in a handful of experiments for the purpose of automatic classification. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented so as to categorize digitalized documents into LCC. A sample of abstracts from the ProQuest Digital Dissertations database is used for the test-base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test-base for evaluating the proposed model of automatic TC. For comparative purposes, a Naive Bayesian model, which has been extensively used in TC applications, is also implemented. Our experimental results show that the performance of our model surpasses that of the Naive Bayesian model as measured by comparing the automatic classification of abstracts to the manual classification performed by professionals. McGill University 2005 Electronic Thesis or Dissertation application/pdf en alephsysno: 002211451 proquestno: AAINR12966 Theses scanned by UMI/ProQuest. All items in eScholarship@McGill are protected by copyright with all rights reserved unless otherwise indicated. Doctor of Philosophy (Graduate School of Library and Information Studies.) http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214
collection	NDLTD
language	en
format	Others
sources	NDLTD
topic	Markov processes. Classification -- Automation
spellingShingle	Markov processes. Classification -- Automation Yi, Kwan, 1963- Text classification using a hidden Markov model
description	Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals are intended. First, a Hidden Markov Model (HMM) is proposed as a relatively new method for text categorization. HMM has been applied to a wide range of applications in text processing such as text segmentation and event tracking, information retrieval, and information extraction. Few, however, have applied HMM to TC. Second, the Library of Congress Classification (LCC) is adopted as a classification scheme for the HMM-based TC model for categorizing digital documents. LCC has been used only in a handful of experiments for the purpose of automatic classification. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented so as to categorize digitalized documents into LCC. A sample of abstracts from the ProQuest Digital Dissertations database is used for the test-base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test-base for evaluating the proposed model of automatic TC. For comparative purposes, a Naive Bayesian model, which has been extensively used in TC applications, is also implemented. Our experimental results show that the performance of our model surpasses that of the Naive Bayesian model as measured by comparing the automatic classification of abstracts to the manual classification performed by professionals.
author	Yi, Kwan, 1963-
author_facet	Yi, Kwan, 1963-
author_sort	Yi, Kwan, 1963-
title	Text classification using a hidden Markov model
title_short	Text classification using a hidden Markov model
title_full	Text classification using a hidden Markov model
title_fullStr	Text classification using a hidden Markov model
title_full_unstemmed	Text classification using a hidden Markov model
title_sort	text classification using a hidden markov model
publisher	McGill University
publishDate	2005
url	http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214
work_keys_str_mv	AT yikwan1963 textclassificationusingahiddenmarkovmodel
_version_	1716645087952240640

Similar Items