Automatic Extraction of Blog Post from Diverse Blog Pages
Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 99 === With the rapid development of the blogosphere, blog post extraction has become an essential task for research on the blogosphere. However, very little attention has been given specifically to blog post extraction. In this paper, we address the issue of extracting blog post...
Main Authors: | Jhih-ming Chen, 陳志銘 |
---|---|
Other Authors: | Chia-hui Chang |
Format: | Others |
Language: | en_US |
Published: | 2011 |
Online Access: | http://ndltd.ncl.edu.tw/handle/31678124319745134405 |
id |
ndltd-TW-099NCU05392056 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-099NCU05392056 2015-10-19T04:03:06Z http://ndltd.ncl.edu.tw/handle/31678124319745134405 Automatic Extraction of Blog Post from Diverse Blog Pages 基於多元化部落格網頁之自動化擷取部落格主要文章 Jhih-ming Chen 陳志銘 Master's National Central University Graduate Institute of Computer Science and Information Engineering 99 Chia-hui Chang 張嘉惠 2011 thesis 41 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 99 === With the rapid development of the blogosphere, blog post extraction has become an essential task for research on the blogosphere. However, very little attention has been given specifically to blog post extraction. In this paper, we address the problem of extracting blog posts from diverse blog pages, which aims at automatically and precisely finding the location of each blog post. Most previous research has focused on extracting the main content from news pages, but the problem becomes more complex for blog pages, since a blog post may employ a variety of content formats concurrently and miscellaneous information can reduce extraction accuracy. Our research builds on the combination of MSS [24] and CETR [34] to develop algorithms that are suitable for blog pages. The first method that we propose is PTR Scoring, which combines the Post-to-Tag Ratio with the maximum scoring subsequence. The second method is CRF Scoring, which trains Conditional Random Field models and uses the maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves the best F-measure, 91.9%, among existing methods.
|
author2 |
Chia-hui Chang |
author_facet |
Chia-hui Chang Jhih-ming Chen 陳志銘 |
author |
Jhih-ming Chen 陳志銘 |
title |
Automatic Extraction of Blog Post from Diverse Blog Pages |
publishDate |
2011 |
url |
http://ndltd.ncl.edu.tw/handle/31678124319745134405 |
_version_ |
1718093138512314368 |
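
The abstract names the maximum scoring subsequence (MSS) as the step shared by both proposed methods. As a rough illustration only, the sketch below finds the contiguous run of lines with the highest total score. The function name `max_scoring_subsequence` and the per-line scores are hypothetical: the scores stand in for whatever PTR Scoring or CRF Scoring would assign, and are assumed to be positive for lines likely to belong to the post and negative otherwise. This is a generic linear-time formulation, not necessarily the one used in the thesis.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the highest sum.

    `end` is exclusive, so the selected lines are scores[start:end].
    """
    if not scores:
        raise ValueError("scores must not be empty")

    best_start, best_end, best_total = 0, 1, scores[0]
    cur_start, cur_total = 0, 0.0

    for i, score in enumerate(scores):
        if cur_total <= 0:
            # A non-positive running total cannot help; restart the run here.
            cur_start, cur_total = i, score
        else:
            cur_total += score
        if cur_total > best_total:
            best_start, best_end, best_total = cur_start, i + 1, cur_total

    return best_start, best_end, best_total


# Hypothetical per-line scores for a page with eight lines of HTML.
line_scores = [-1.0, -0.5, 2.0, 3.0, -0.5, 1.5, -2.0, -1.0]
start, end, total = max_scoring_subsequence(line_scores)
print(f"post spans lines {start}..{end - 1}, total score {total}")
```

Selecting one contiguous run means a weakly scored line inside the post (for example, a short caption) is still kept as long as the lines around it score well, which is one reason this step can help with mixed content formats.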