Identify disease associated gene by literature mining

Master's thesis === National Yang-Ming University (國立陽明大學) === Institute of Bioinformatics (生物資訊研究所) === 94 === In the post-genomic era, a lot of efforts have been put into converting the raw data into usable information and extracting knowledge from the information. However, a direct application of such results to solve real problems is still difficult. A long being forgo...

Bibliographic Details
Main Authors: Hsin-Ta Wu, 吳欣達
Other Authors: Ueng-Cheng Yang
Format: Others
Language: en_US
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/19742165301877949659
id ndltd-TW-094YM005112010
record_format oai_dc
spelling ndltd-TW-094YM005112010 2015-10-13T16:31:17Z http://ndltd.ncl.edu.tw/handle/19742165301877949659 Identify disease associated gene by literature mining 利用文件探勘技術找尋疾病關連基因 Hsin-Ta Wu 吳欣達 Master's (碩士) National Yang-Ming University (國立陽明大學) Institute of Bioinformatics (生物資訊研究所) 94 Ueng-Cheng Yang Jung-Hsien Chiang 楊永正 蔣榮先 2006 學位論文 ; thesis 78 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's thesis === National Yang-Ming University (國立陽明大學) === Institute of Bioinformatics (生物資訊研究所) === 94 === In the post-genomic era, much effort has gone into converting raw data into usable information and extracting knowledge from that information. However, applying such results directly to real problems remains difficult. A long-forgotten treasure is the literature, which holds a great deal of information and even knowledge. The goal of this thesis is therefore to use a literature mining approach to find possible associations between genes and diseases. The approach collects abstracts only for the 1,145 diseases that have a phenotype description in Online Mendelian Inheritance in Man (OMIM). The disease names and their aliases were used to collect all disease-related abstracts from Medline (Medical Literature Analysis and Retrieval System Online). These abstracts were then scored and ranked by a statistically based algorithm. Gene names in the abstracts were tagged using the program “GeneTaggerCRF” and a gene name dictionary. The co-occurrence of gene and disease names can be scored by either a sentence-based or a document-based method. The abstracts and the co-occurrence scores were stored in a Disease Associated Gene database (DAG db). The recall rates of the sentence-based and document-based approaches are 76% and 92%, respectively. When one-fourth of the highest score is used as a cutoff, the recall rates drop to 67% and 88%, respectively. Although the sentence-based approach has a lower recall rate, its co-occurrence relation is stricter than that of the document-based approach. Compared with a manual search of Medline, this value-added database approach is faster and more comprehensive. For example, according to the search results in DAG db, a protein variant of DRD4 derived from an alternative splicing event may be linked to Tourette syndrome. The database approach is therefore better suited to generating research ideas than the manual approach. Moreover, the relations among these disease-associated genes may be explored further to link genotype to phenotype. The method in this research helps surface important information from the biomedical literature and gives users quick and easy access to it.
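The difference between sentence-based and document-based co-occurrence, and the effect of a one-fourth-of-top-score cutoff on recall, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not specify the statistical scoring, so raw co-occurrence counts stand in for it; a plain dictionary lookup stands in for GeneTaggerCRF; the sentence splitter is naive; and the function names, the gene symbol GENEX, and the toy abstracts are all hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(text, names):
    # Case-insensitive whole-word match for any of the given names.
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(name.lower()) + r"\b", lowered)
        for name in names
    )


def cooccurrence_counts(abstracts, disease_names, gene_dict):
    """Count gene/disease co-occurrences per gene at two granularities.

    gene_dict maps a gene symbol to its list of names (dictionary lookup
    used here in place of a trained gene tagger).
    """
    sentence_counts = Counter()
    document_counts = Counter()
    for abstract in abstracts:
        if not mentions(abstract, disease_names):
            continue  # only disease-related abstracts are considered
        sentences = split_sentences(abstract)
        for gene, names in gene_dict.items():
            if mentions(abstract, names):
                document_counts[gene] += 1
            hits = sum(
                1 for s in sentences
                if mentions(s, disease_names) and mentions(s, names)
            )
            if hits:
                sentence_counts[gene] += hits
    return sentence_counts, document_counts


def recall(scores, known_genes, cutoff_fraction=0.0):
    # Fraction of known genes retrieved with score >= cutoff_fraction * top score.
    if not known_genes:
        raise ValueError("known_genes must not be empty")
    top = max(scores.values(), default=0)
    threshold = top * cutoff_fraction
    retrieved = {g for g, s in scores.items() if s > 0 and s >= threshold}
    return len(retrieved & set(known_genes)) / len(known_genes)


# Invented toy abstracts, for illustration only.
toy_abstracts = [
    "DRD4 variants were examined in Tourette syndrome patients. No other loci were tested.",
    "Tourette syndrome is a neurological disorder. GENEX was sequenced in a separate cohort.",
    "DRD4 expression was measured in mice.",
]
toy_genes = {"DRD4": ["DRD4"], "GENEX": ["GENEX"]}
sent, doc = cooccurrence_counts(toy_abstracts, ["Tourette syndrome"], toy_genes)
print("sentence-based recall:", recall(sent, {"DRD4", "GENEX"}))
print("document-based recall:", recall(doc, {"DRD4", "GENEX"}))
```

In the toy data the hypothetical GENEX appears in the same abstract as the disease name but never in the same sentence, so only the document-based count retrieves it. That mirrors the trade-off described above: the sentence-based relation is stricter and recalls less. Passing cutoff_fraction=0.25 to recall applies the one-fourth-of-highest-score cutoff.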
author2 Ueng-Cheng Yang
author_facet Ueng-Cheng Yang
Hsin-Ta Wu
吳欣達
author Hsin-Ta Wu
吳欣達
title Identify disease associated gene by literature mining
publishDate 2006
url http://ndltd.ncl.edu.tw/handle/19742165301877949659