Identification of transcription factor contexts in literature using machine learning approaches

Abstract. Background: Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the dev...

Bibliographic Details
Main Authors: Nenadic Goran, Yang Hui, Keane John A
Format: Article
Language: English
Published: BMC 2008-04-01
Series: BMC Bioinformatics
id doaj-c264ba7a3a6344878a1c65394c2d2df0
record_format Article
spelling doaj-c264ba7a3a6344878a1c65394c2d2df0; 2020-11-25T01:19:31Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2008-04-01; 9; Suppl 3; S11; 10.1186/1471-2105-9-S3-S11; Identification of transcription factor contexts in literature using machine learning approaches; Nenadic Goran; Yang Hui; Keane John A
collection DOAJ
language English
format Article
sources DOAJ
author Nenadic Goran
Yang Hui
Keane John A
title Identification of transcription factor contexts in literature using machine learning approaches
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2008-04-01
description Abstract

Background: Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hindered by the unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and the GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for identification of TF-contexts in literature.

Results: In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. The learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer such features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, PPI data and data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with a vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%.

Conclusions: The experimental results have shown that the proposed model can be used for identification of TF-related contexts (i.e. sentences) with high accuracy, with a significantly reduced set of features compared to a traditional bag-of-words approach. The results of considering existing PPI data suggest that TF and PPI contexts are less similar than we had expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.