Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing.

Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our no...


Bibliographic Details
Main Authors: Fei Zhu, Bairong Shen
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2012-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC3383748?pdf=render
Record ID: doaj-31a75bbc5d9e455c83997d390aa0407b
Record Format: Article
Collection: DOAJ
Record Updated: 2020-11-25T02:42:34Z
ISSN: 1932-6203
Volume/Issue: 7(6), e39230
DOI: 10.1371/journal.pone.0039230
Description: Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we use SVM to separate biological terms from non-biological terms, before we use CRFs to determine the types of biological terms, which makes full use of the power of SVM as a binary-class classifier and the data-labeling capacity of CRFs. We then merge the results of SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop a useful algorithm and apply two rules. To ensure biological terms with a maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach will also gradually extend the context so more contextual information can be included. We examined the performance of four approaches with the GENIA corpus and JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% with the GENIA corpus and 84.04% with the JNLPBA04 data.
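
The abstract reports macro-precision, macro-recall, and macro-F1 but this record does not say how they were computed. As a minimal sketch only, the following shows one common definition: per-class precision, recall, and F1 are computed first and then averaged with equal weight per class. The function name macro_scores, the toy label sequences, and the choice to average per-class F1 (rather than taking the F1 of the averaged precision and recall) are assumptions made for illustration, not details taken from the article.

```python
from collections import Counter


def macro_scores(gold, pred, labels):
    """Return (macro-precision, macro-recall, macro-F1) over `labels`.

    Assumption: macro-F1 is the unweighted mean of per-class F1 scores.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # predicted class p, but it was wrong
            fn[g] += 1      # missed an instance of class g

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy data: "O" marks a non-biological token and is not scored.
gold = ["protein", "DNA", "protein", "O", "DNA", "O"]
pred = ["protein", "protein", "protein", "O", "DNA", "DNA"]
macro_p, macro_r, macro_f1 = macro_scores(gold, pred, ["protein", "DNA"])
print(f"macro-P={macro_p:.4f} macro-R={macro_r:.4f} macro-F1={macro_f1:.4f}")
```

Because every class contributes equally to the average, a rare entity type influences the score as much as a frequent one, which is one reason macro-averaged figures are often reported for corpora with skewed class distributions.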