Automatic Extraction of Blog Post from Diverse Blog Pages

Bibliographic Details
Main Authors: Jhih-ming Chen, 陳志銘
Other Authors: Chia-hui Chang
Format: Others
Language: en_US
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/31678124319745134405
Description
Summary: Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 99 === With the rapid development of the blogosphere, blog post extraction has become an essential task for research on the blogosphere. However, very little attention has been given specifically to blog post extraction. In this paper, we address the problem of extracting blog posts from diverse blog pages, with the aim of automatically and precisely locating each blog post. Most previous research has focused on extracting the main content from news pages, but the problem becomes more complex for blog pages, since a blog post may use several content formats at once and miscellaneous information on the page can reduce extraction accuracy. Our research combines MSS [24] and CETR [34] to develop algorithms suited to blog pages. The first method we propose is PTR Scoring, which combines the Post-to-Tag Ratio with the maximum scoring subsequence. The second method is CRF Scoring, which trains Conditional Random Field models and uses the maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves an F-measure of 91.9%, the best among existing methods.
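
The maximum scoring subsequence step named in the summary can be illustrated with a minimal Python sketch. It assumes that every line of a page has already been given a numeric score, positive for lines that look like post content and negative otherwise. The function name, the example scores, and this scoring convention are illustrative assumptions and are not taken from the thesis, which derives its scores from the Post-to-Tag Ratio or from a CRF model.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive. If no run has a positive sum, (0, 0, 0.0) is returned,
    meaning that no line is selected.
    """
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running sum cannot help; restart the run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical per-line scores: navigation and sidebar lines score negative,
# post lines score positive, with one weak line inside the post.
line_scores = [-1.0, -0.5, 2.0, 3.0, -0.5, 1.5, -2.0, -1.0]
start, end, total = max_scoring_subsequence(line_scores)
print(start, end, total)
```

With these scores, lines 2 through 5 are selected as the post. The weak line at index 4 stays inside the span because its neighbours outweigh it, which is the reason for preferring a subsequence search over a per-line threshold.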