An Unsupervised Pattern Recognition Method for Identifying TFBS Based on DNA Short Sequence Features


Bibliographic Details
Main Authors: Pai-Ling Lo, 羅百玲
Other Authors: Jan-Ming Ho
Format: Others
Language: en_US
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/enb329
Description
Summary: Master's thesis === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 94 === Identifying the binding sites of a transcription factor in the upstream sequences of the genes it binds is the first step toward understanding the gene regulatory mechanism. A recent assessment of computational tools for identifying these binding sites indicates that finding such regulatory elements remains a challenging task in higher organisms, such as humans. The task is hampered by the intrinsic subtlety of binding sites and by the large amount of background noise: only a small portion of the genome is bound by transcription factors, and the sequence-specific recognition that governs binding is subtle. In this thesis, we propose an unsupervised pattern recognition method to handle incomplete and unbalanced biological data. To model binding activity, a vector of short sequence features is proposed. To identify candidate patterns for binding sites, both the overall over-representation of each pattern and its sequence popularity are taken into account in the ranking. To evaluate performance and to compare with related work, a benchmark that has been used to assess existing tools was adopted. The experimental results show that the proposed methodology outperforms the related work in nucleotide-level performance when only the top 3 nodes in the ranking are considered. The proposed representation of binding sites and the ranking mechanism enable efficient prediction of binding sites.
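The summary does not give the exact feature definition or scoring formula, so the following is only a minimal sketch of the two ideas it names: a vector of short sequence features, and a ranking that weighs over-representation together with sequence popularity. It assumes that the features are overlapping k-mer counts, that over-representation is the ratio of a pattern's frequency in the input sequences to its smoothed frequency in a background set, and that popularity is the fraction of input sequences containing the pattern. The function names `kmer_vector` and `rank_patterns`, the parameter `k`, and the product used to combine the two scores are hypothetical choices, not the thesis's method.

```python
from collections import Counter
from itertools import product


def kmer_vector(seq, k=2):
    """Count vector over all 4**k DNA k-mers, using overlapping windows."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic order (A, C, G, T) so vectors are comparable.
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def rank_patterns(sequences, background, k=4):
    """Rank k-mers by (over-representation) * (sequence popularity).

    Assumed scoring, for illustration only.
    """
    def kmer_counts(seqs):
        counts = Counter()
        for s in seqs:
            counts.update(s[i:i + k] for i in range(len(s) - k + 1))
        return counts, sum(counts.values())

    fg, fg_total = kmer_counts(sequences)
    bg, bg_total = kmer_counts(background)

    scores = {}
    for pattern, n in fg.items():
        fg_rate = n / fg_total
        # Add-one smoothing keeps unseen background patterns finite.
        bg_rate = (bg[pattern] + 1) / (bg_total + 4 ** k)
        # Popularity: share of input sequences that contain the pattern.
        popularity = sum(pattern in s for s in sequences) / len(sequences)
        scores[pattern] = (fg_rate / bg_rate) * popularity
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    upstream = ["TTGACGTT", "CCGACGAA", "AGACGTCA"]    # toy input sequences
    background = ["ATATATAT", "CCCCGGGG", "TTTTAAAA"]  # toy background
    print(kmer_vector("AAAC", k=2)[:2])
    print(rank_patterns(upstream, background, k=4)[:3])
```

In this toy input, a pattern shared by every sequence is favored over one that is frequent in only a single sequence, which is the effect the popularity term is meant to capture.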