Automatic Extraction of Blog Post from Diverse Blog Pages

Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 99 === With the rapid development of the blogosphere, blog post extraction has become an essential task for research on the blogosphere. However, very little attention has been given specifically to blog post extraction. In this paper, we address the issue of extracting blog post...

Bibliographic Details
Main Authors: Jhih-ming Chen, 陳志銘
Other Authors: Chia-hui Chang
Format: Others
Language: en_US
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/31678124319745134405
id ndltd-TW-099NCU05392056
record_format oai_dc
spelling ndltd-TW-099NCU05392056 2015-10-19T04:03:06Z http://ndltd.ncl.edu.tw/handle/31678124319745134405 Automatic Extraction of Blog Post from Diverse Blog Pages 基於多元化部落格網頁之自動化擷取部落格主要文章 Jhih-ming Chen 陳志銘 Master's National Central University Graduate Institute of Computer Science and Information Engineering 99 With the rapid development of the blogosphere, blog post extraction has become an essential task for research on the blogosphere. However, very little attention has been given specifically to blog post extraction. In this paper, we address the issue of extracting blog posts from diverse blog pages, which aims at automatically and precisely finding the location of each blog post. Most previous research focused on extracting main content from news pages, but the problem becomes more complex for blog pages, since a blog post may employ a variety of content formats concurrently, and miscellaneous information can reduce extraction accuracy. Our research combines MSS [24] and CETR [34] to develop algorithms suitable for blog pages. The first method we propose is PTR Scoring, which combines the Post-to-Tag Ratio with the maximum scoring subsequence. The second method is CRF Scoring, which applies Conditional Random Fields to train models and uses the maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves the best F-measure, 91.9%, among existing methods. Chia-hui Chang 張嘉惠 2011 學位論文 ; thesis 41 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 99 === With the rapid development of the blogosphere, blog post extraction has become an essential task for research on the blogosphere. However, very little attention has been given specifically to blog post extraction. In this paper, we address the issue of extracting blog posts from diverse blog pages, which aims at automatically and precisely finding the location of each blog post. Most previous research focused on extracting main content from news pages, but the problem becomes more complex for blog pages, since a blog post may employ a variety of content formats concurrently, and miscellaneous information can reduce extraction accuracy. Our research combines MSS [24] and CETR [34] to develop algorithms suitable for blog pages. The first method we propose is PTR Scoring, which combines the Post-to-Tag Ratio with the maximum scoring subsequence. The second method is CRF Scoring, which applies Conditional Random Fields to train models and uses the maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves the best F-measure, 91.9%, among existing methods.
author2 Chia-hui Chang
author_facet Chia-hui Chang
Jhih-ming Chen
陳志銘
author Jhih-ming Chen
陳志銘
spellingShingle Jhih-ming Chen
陳志銘
Automatic Extraction of Blog Post from Diverse Blog Pages
author_sort Jhih-ming Chen
title Automatic Extraction of Blog Post from Diverse Blog Pages
title_short Automatic Extraction of Blog Post from Diverse Blog Pages
title_full Automatic Extraction of Blog Post from Diverse Blog Pages
title_fullStr Automatic Extraction of Blog Post from Diverse Blog Pages
title_full_unstemmed Automatic Extraction of Blog Post from Diverse Blog Pages
title_sort automatic extraction of blog post from diverse blog pages
publishDate 2011
url http://ndltd.ncl.edu.tw/handle/31678124319745134405
work_keys_str_mv AT jhihmingchen automaticextractionofblogpostfromdiverseblogpages
AT chénzhìmíng automaticextractionofblogpostfromdiverseblogpages
AT jhihmingchen jīyúduōyuánhuàbùluògéwǎngyèzhīzìdònghuàxiéqǔbùluògézhǔyàowénzhāng
AT chénzhìmíng jīyúduōyuánhuàbùluògéwǎngyèzhīzìdònghuàxiéqǔbùluògézhǔyàowénzhāng
_version_ 1718093138512314368