Summary: | Master's === National Cheng Kung University === Institute of Information Management === 95 === With the rapid development of bioinformatics, biologists often analyze protein sequences to predict and annotate unknown proteins. A necessary first step in such research is clustering. Once the sequences are clustered, sequences with similar functions fall into the same cluster, and follow-up research can build on the characteristics of each cluster. Currently, the most common approach to clustering relies on pairwise sequence similarity, but some researchers have shown that this simple approach is insufficient and causes many errors. In addition, some biologists want to understand not only the members and the number of clusters, but also the relationships within and between clusters.
This work focuses on the protein superfamily, an area that researchers have seldom addressed. Clustering a protein superfamily differs from general sequence clustering because a superfamily still contains distinct subfamilies. These subfamilies may have different characteristics and relationships, so they need to be clustered on the basis of evolution. A simple tool or method therefore cannot cope with this kind of problem. For these reasons, we use a phylogenetic tree to cluster the sequences in a protein superfamily. First, the standard package Phylip is used to reconstruct the phylogenetic tree, which is then analyzed through a succession of procedures such as distance parsing, threshold choosing, splitting, merging, and re-merging. Finally, the phylogenetic tree is transformed into several sub-trees, each representing one cluster. Because the method is based on a phylogenetic tree, its clusters carry evolutionary meaning, and it gives better clustering results than methods based on sequence similarity.
|
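The splitting step mentioned in the summary can be illustrated with a minimal sketch. The summary does not specify how distances are parsed, how the threshold is chosen, or how merging and re-merging work, so none of that is reproduced here. The sketch assumes only that the tree is available as a Newick string (the format Phylip writes its trees in) and that a fixed branch-length threshold is given; the names `Node`, `parse_newick`, and `split_by_threshold` are hypothetical and not taken from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""
    length: float = 0.0  # length of the branch connecting this node to its parent
    children: list = field(default_factory=list)


def parse_newick(text):
    """Parse a simple Newick string (names and branch lengths only; no quotes or comments)."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        node = Node()
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node.children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # consume ")"
                break
        # Optional node name
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node.name = text[start:pos]
        # Optional branch length
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node.length = float(text[start:pos])
        return node

    return parse_node()


def split_by_threshold(root, threshold):
    """Cut every branch longer than `threshold` (an assumed, fixed cutoff).

    Returns one sorted list of leaf names per resulting sub-tree.
    """
    clusters = []

    def collect(node, current):
        # `current` holds the leaves of the sub-tree that `node` belongs to.
        if not node.children:
            current.append(node.name)
        for child in node.children:
            if child.length > threshold:
                piece = []  # cutting this branch starts a new sub-tree
                collect(child, piece)
                if piece:
                    clusters.append(piece)
            else:
                collect(child, current)

    top = []
    collect(root, top)
    if top:
        clusters.append(top)
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Toy tree with made-up branch lengths, for illustration only.
    tree = parse_newick("((A:0.1,B:0.2):0.9,(C:0.1,D:0.1):0.8,E:1.5);")
    for cluster in split_by_threshold(tree, 0.5):
        print(cluster)
```

Each returned list corresponds to one sub-tree, that is, one candidate cluster. A fuller pipeline along the lines of the summary would presumably choose the threshold from the parsed distances and then merge or re-merge small sub-trees, but those rules are not given here.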