Automatic Categorized Document Collection for Adaptive Text Classification


Bibliographic Details
Main Authors: Tzu-Yang Huang, 黃子洋
Other Authors: Yuh-Jye Lee
Format: Others
Language: en_US
Published: 2008
Online Access: http://ndltd.ncl.edu.tw/handle/83016398430671831643
Description
Summary: Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 96 === In the real world, labeled data are often scarce and expensive to obtain. Text classification faces exactly this problem: reading an article and labeling it correctly demands considerable human effort. Several techniques have been developed to address this problem. We instead propose an approach that collects large amounts of labeled data automatically, continuously, and quickly. Really Simple Syndication (RSS) is a structured document format created to store and transport new articles. An RSS feed usually sticks to a single topical subject. Because of this characteristic, we can collect articles from an RSS feed and assign the feed's subject to them as a class label. In our work, we chose a number of RSS feeds on the Internet covering various topics and built a web crawler that crawls these feeds continuously. We stored the collected articles in a database and recorded their subjects. In our setup, the system can effortlessly collect thousands of labeled articles in one day. Furthermore, we attempt to show that data collected by this method are reliable. To that end, we use a concept extraction method to extract concept tokens and smooth support vector machines as the classification method to test our dataset. Moreover, we use the classifier to predict articles from two additional websites that we collected separately. These experiments yield satisfactory results. Finally, we expect that our system can be used to solve the problem of insufficient labeled data in text classification.
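
The labeling idea in the summary (each article inherits the subject of the RSS feed it came from) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the sample feed, the function names `label_feed_items` and `merge_new`, and the in-memory dictionary standing in for the database are all assumptions made for illustration, and no network crawling is performed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Sports News</title>
    <item>
      <title>Team wins final</title>
      <description>The home team won the championship game.</description>
      <link>http://example.com/a1</link>
    </item>
    <item>
      <title>Player transfers</title>
      <description>A star player moved to a new club.</description>
      <link>http://example.com/a2</link>
    </item>
  </channel>
</rss>"""


def label_feed_items(feed_xml, topic):
    """Parse one RSS document and tag every item with the feed's topic."""
    root = ET.fromstring(feed_xml)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "label": topic,  # the feed's subject becomes the class label
        })
    return articles


def merge_new(store, articles):
    """Add unseen articles to the store (a dict keyed by link).

    Returns the number of articles added, so repeated crawls of the
    same feed do not create duplicate records.
    """
    added = 0
    for article in articles:
        if article["link"] not in store:
            store[article["link"]] = article
            added += 1
    return added


if __name__ == "__main__":
    store = {}  # stand-in for the database mentioned in the summary
    labeled = label_feed_items(SAMPLE_FEED, "sports")
    print(merge_new(store, labeled), "new labeled articles")
```

Keying the store by article link is one simple way to make repeated crawls idempotent; a real crawler would also need scheduling, error handling, and persistent storage.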