Mining Publication Records on Publication Pages based on Conditional Random Fields
Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 100 === A publication record is a list of semi-structured citation strings for publications of a research institute or an individual researcher. Publication records are integrated into a digital library which becomes an important knowledge base and thereby enables a var...
Main Authors: | LIN, YA-HUEI, 林雅惠 |
---|---|
Other Authors: | Hahn-Ming Lee |
Format: | Others |
Language: | en_US |
Published: | 2012 |
Online Access: | http://ndltd.ncl.edu.tw/handle/bh8m2e |
id |
ndltd-TW-100NTUS5392042 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-100NTUS5392042 2019-05-15T20:43:22Z http://ndltd.ncl.edu.tw/handle/bh8m2e Mining Publication Records on Publication Pages based on Conditional Random Fields 基於條件機率域萃取引用文獻資訊於個人著述網頁 LIN, YA-HUEI 林雅惠 Master's National Taiwan University of Science and Technology Department of Computer Science and Information Engineering 100 A publication record is a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Publication records are integrated into a digital library, which becomes an important knowledge base and thereby enables a variety of applications. A publication record is usually found among other information on a publication Web page (or “publication page” for short). It is thus an interesting problem to extract publication records from such Web pages. The problem is difficult for several reasons, e.g., flexibility in formatting the metadata of a publication as a semi-structured citation string and flexibility in the visual presentation of the citation string in HTML. Furthermore, two citation strings with a similar visual presentation on the same Web page may have different HTML constructs. In this paper, we present a content analysis approach, based on Conditional Random Fields and data region boundary analysis, to the problem of automatically extracting publication records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication pages. The precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively. This is a significant improvement over previous research. Hahn-Ming Lee Jan-Ming Ho 李漢銘 何建明 2012 degree thesis ; thesis 47 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 100 === A publication record is a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Publication records are integrated into a digital library, which becomes an important knowledge base and thereby enables a variety of applications. A publication record is usually found among other information on a publication Web page (or “publication page” for short). It is thus an interesting problem to extract publication records from such Web pages. The problem is difficult for several reasons, e.g., flexibility in formatting the metadata of a publication as a semi-structured citation string and flexibility in the visual presentation of the citation string in HTML. Furthermore, two citation strings with a similar visual presentation on the same Web page may have different HTML constructs. In this paper, we present a content analysis approach, based on Conditional Random Fields and data region boundary analysis, to the problem of automatically extracting publication records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication pages. The precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively. This is a significant improvement over previous research. |
author2 |
Hahn-Ming Lee |
author_facet |
Hahn-Ming Lee LIN, YA-HUEI 林雅惠 |
author |
LIN, YA-HUEI 林雅惠 |
spellingShingle |
LIN, YA-HUEI 林雅惠 Mining Publication Records on Publication Pages based on Conditional Random Fields |
author_sort |
LIN, YA-HUEI |
title |
Mining Publication Records on Publication Pages based on Conditional Random Fields |
publishDate |
2012 |
url |
http://ndltd.ncl.edu.tw/handle/bh8m2e |
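As a quick consistency check on the figures quoted in the abstract, the sketch below recomputes the F-measure from the reported precision and recall. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the record does not state explicitly; the function name `f_measure` is illustrative and not taken from the thesis.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the thesis uses the balanced form; the record does not say.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, written as fractions.
precision, recall = 0.825, 0.876
f1 = f_measure(precision, recall)
print(f"F-measure: {f1:.1%}")  # prints "F-measure: 85.0%"
```

Under that assumption the recomputed value rounds to the 85.0% given in the abstract.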