Statistical principle-based approach for gene and protein related object recognition

Abstract The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protei...


Bibliographic Details
Main Authors: Po-Ting Lai, Ming-Siang Huang, Ting-Hao Yang, Wen-Lian Hsu, Richard Tzong-Han Tsai
Format: Article
Language: English
Published: BMC 2018-12-01
Series: Journal of Cheminformatics
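Subjects: Named entity recognition; Information extraction; Natural language processing; Biomedical text mining; Machine learning; Medical chemical patent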
Online Access: http://link.springer.com/article/10.1186/s13321-018-0314-7
id doaj-ecf74773132444f29eba80067d36ac0a
record_format Article
spelling doaj-ecf74773132444f29eba80067d36ac0a
2020-11-25T00:41:49Z
eng
BMC
Journal of Cheminformatics
1758-2946
2018-12-01
10(1), 1-19
10.1186/s13321-018-0314-7
Statistical principle-based approach for gene and protein related object recognition
Po-Ting Lai (0: Department of Computer Science, National Tsing-Hua University)
Ming-Siang Huang (1: Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica)
Ting-Hao Yang (2: Department of Computer Science, National Tsing-Hua University)
Wen-Lian Hsu (3: Department of Computer Science, National Tsing-Hua University)
Richard Tzong-Han Tsai (4: Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University)
Abstract: The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were assigned to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the system constructed for this task. Our system is based on two different NER approaches: the statistical-principle-based approach (SPBA) and conditional random fields (CRF). Therefore, we call our system SPBA-CRF. SPBA is an interpretable machine-learning framework for gene mention recognition. The predictions of SPBA are used as features for our CRF-based GPRO recognizer. The recognizer was developed for identifying chemical mentions in patents, and we adapted it for GPRO recognition. In the BioCreative V.5 GPRO recognition task, SPBA-CRF obtained an F-score of 73.73% on the evaluation metric of GPRO type 1 and an F-score of 78.66% on the evaluation metric of combining GPRO types 1 and 2. Our results show that SPBA trained on an external NER dataset can perform reasonably well on the partial match evaluation metric. Furthermore, SPBA can significantly improve performance of the CRF-based recognizer trained on the GPRO dataset.
http://link.springer.com/article/10.1186/s13321-018-0314-7
Named entity recognition
Information extraction
Natural language processing
Biomedical text mining
Machine learning
Medical chemical patent
collection DOAJ
language English
format Article
sources DOAJ
author Po-Ting Lai
Ming-Siang Huang
Ting-Hao Yang
Wen-Lian Hsu
Richard Tzong-Han Tsai
title Statistical principle-based approach for gene and protein related object recognition
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2018-12-01
description Abstract The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were assigned to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the system constructed for this task. Our system is based on two different NER approaches: the statistical-principle-based approach (SPBA) and conditional random fields (CRF). Therefore, we call our system SPBA-CRF. SPBA is an interpretable machine-learning framework for gene mention recognition. The predictions of SPBA are used as features for our CRF-based GPRO recognizer. The recognizer was developed for identifying chemical mentions in patents, and we adapted it for GPRO recognition. In the BioCreative V.5 GPRO recognition task, SPBA-CRF obtained an F-score of 73.73% on the evaluation metric of GPRO type 1 and an F-score of 78.66% on the evaluation metric of combining GPRO types 1 and 2. Our results show that SPBA trained on an external NER dataset can perform reasonably well on the partial match evaluation metric. Furthermore, SPBA can significantly improve performance of the CRF-based recognizer trained on the GPRO dataset.
topic Named entity recognition
Information extraction
Natural language processing
Biomedical text mining
Machine learning
Medical chemical patent
url http://link.springer.com/article/10.1186/s13321-018-0314-7
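
The abstract states that the predictions of SPBA are used as features for the CRF-based GPRO recognizer. The following is a minimal sketch of that general idea (one recognizer's output encoded as a token-level feature for a second sequence labeler), not the authors' implementation. The function names, the feature names, the BIO encoding of the upstream predictions, and the example sentence are all assumptions made for illustration; the record does not describe the actual feature set.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive).
Span = Tuple[int, int]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert predicted token spans into one B/I/O tag per token."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span {(start, end)} is outside the sentence")
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def token_features(tokens: List[str], upstream_spans: List[Span]) -> List[Dict[str, object]]:
    """Build one feature dict per token, including the upstream recognizer's tag.

    The surface features here are generic placeholders (assumed, not from
    the paper); the point is the extra 'upstream_tag' entries.
    """
    upstream = spans_to_bio(len(tokens), upstream_spans)
    features = []
    for i, tok in enumerate(tokens):
        features.append({
            "lower": tok.lower(),
            "has_digit": any(c.isdigit() for c in tok),
            "is_upper": tok.isupper(),
            # Prediction of the first-stage recognizer for this token.
            "upstream_tag": upstream[i],
            # Same prediction for the previous token, or a sentence-start marker.
            "prev_upstream_tag": upstream[i - 1] if i > 0 else "BOS",
        })
    return features


# Invented example sentence; the span (2, 5) stands in for a first-stage
# prediction covering "protein kinase C".
tokens = ["Inhibitors", "of", "protein", "kinase", "C", "are", "described"]
feats = token_features(tokens, [(2, 5)])
```

Feature dictionaries of this shape are a common input format for CRF toolkits, so a second-stage model could in principle learn how far to trust the first-stage predictions; which toolkit and which features the authors used is not stated in this record.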