Mining Publication Records on Publication Pages based on Conditional Random Fields

Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 100 === A publication record is a list of semi-structured citation strings for publications of a research institute or an individual researcher. Publication records are integrated into a digital library which becomes an important knowledge base and thereby enables a var...


Bibliographic Details
Main Authors: LIN, YA-HUEI, 林雅惠
Other Authors: Hahn-Ming Lee
Format: Others
Language: en_US
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/bh8m2e
id ndltd-TW-100NTUS5392042
record_format oai_dc
spelling ndltd-TW-100NTUS5392042 2019-05-15T20:43:22Z http://ndltd.ncl.edu.tw/handle/bh8m2e Mining Publication Records on Publication Pages based on Conditional Random Fields 基於條件機率域萃取引用文獻資訊於個人著述網頁 LIN, YA-HUEI 林雅惠 Master's, National Taiwan University of Science and Technology, Department of Computer Science and Information Engineering, 100 Hahn-Ming Lee (李漢銘) Jan-Ming Ho (何建明) 2012 thesis 47 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 100 === A publication record is a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Publication records are integrated into a digital library, which becomes an important knowledge base and thereby enables a variety of applications. A publication record is usually found among other information on a publication Web page (or "publication page" for short). Extracting publication records from such Web pages is thus an interesting problem. The problem is difficult for several reasons, e.g., flexibility in how the metadata of a publication is formatted as a semi-structured citation string, and flexibility in how the citation string is presented visually in HTML. Furthermore, two citation strings with a similar visual presentation on the same Web page may have different HTML constructs. In this paper, we present a content analysis approach, based on Conditional Random Fields and data region boundary analysis, to the problem of automatically extracting publication records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication pages. The precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively. This is a significant improvement over previous research.
author2 Hahn-Ming Lee
author LIN, YA-HUEI
林雅惠
title Mining Publication Records on Publication Pages based on Conditional Random Fields
publishDate 2012
url http://ndltd.ncl.edu.tw/handle/bh8m2e
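
The description reports precision, recall, and F-measure together. As a minimal sketch, assuming the F-measure is the balanced F1 score (the harmonic mean of precision and recall; the record does not say which variant the thesis uses), the three figures can be checked for mutual consistency. The function name `f_measure` is illustrative and does not come from the thesis.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the record does not state which F-measure variant was used.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the description, expressed as fractions.
precision, recall = 0.825, 0.876
f1 = f_measure(precision, recall)
print(f"F-measure: {f1:.1%}")  # prints "F-measure: 85.0%"
```

Under that assumption the computed value rounds to the 85.0% stated in the description.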