Summary: | Doctoral === National Yang-Ming University === Institute of Biomedical Informatics === 99 === Domains are the functional units of proteins. About 60% of prokaryotic proteins and 80% of eukaryotic proteins contain at least one domain. Proteins with the same domain architecture have been shown to be more likely to derive from a common ancestor than to arise through convergent evolution. Compared with primary sequences, the domains or domain architectures of proteins are more directly related to protein function and are therefore more likely to be conserved through evolution. These properties make domain architecture informative for investigating the evolutionary history of proteins. Here, we exploit domain information to shed light on ortholog detection and protein function prediction.
We implemented an efficient pipeline named DODO (DOmain based Detection of Ortholog), which uses domain architecture information to cluster proteins into homolog groups and then identifies orthologs within those groups. DODO has been shown to perform well when tested against several well-known ortholog databases such as InParanoid and HomoloGene. Aided by domain information, DODO can detect distantly related orthologs even when their sequences have diverged and share low sequence similarity. In addition to ortholog detection, we investigated the distribution of domain architectures and domain usage in other eukaryotes and constructed a protein domain architecture database (proDAD), in which homologous proteins are clustered according to their domain architecture. In the database, these homologous proteins can be aligned with one another, and the resulting alignments are shown to be useful for correcting the start site annotation of proteins. Finally, we constructed a VIrus Protein domain DataBase (VIP DB), in which all domains on virus proteins are identified. VIP DB aims to provide clues to protein function from protein domains and integrates domain GO annotation, domain-domain interaction, and KEGG pathway information based on those domains.
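The clustering idea described above, grouping proteins that share a domain architecture, can be illustrated with a minimal sketch. This is not the DODO implementation: the function name `cluster_by_architecture`, the protein identifiers, and the domain assignments are all hypothetical, and the sketch assumes an architecture is simply the ordered list of domain names on a protein.

```python
from collections import defaultdict


def cluster_by_architecture(protein_domains):
    """Group protein IDs by their ordered domain architecture.

    protein_domains maps a protein ID to the list of domain names found
    on that protein, in N- to C-terminal order (hypothetical input format).
    Proteins with no annotated domain are skipped.
    """
    clusters = defaultdict(list)
    for protein_id, domains in protein_domains.items():
        if not domains:
            continue  # no domain information, so no architecture to compare
        clusters[tuple(domains)].append(protein_id)
    return dict(clusters)


# Hypothetical example data, for illustration only.
example = {
    "human_P1": ["SH3", "SH2", "Pkinase"],
    "mouse_P1": ["SH3", "SH2", "Pkinase"],
    "yeast_P2": ["Pkinase"],
    "fly_P3": [],
}

groups = cluster_by_architecture(example)
for architecture, members in groups.items():
    print(" + ".join(architecture), "->", members)
```

Each resulting group is only a candidate homolog group under this simplified definition; separating orthologs from paralogs within a group would need a further step that this sketch does not attempt.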
With advances in high-throughput sequencing technologies, more and more genomes are being sequenced. Efficient methods for identifying the orthologs of newly sequenced proteins and for predicting the functions of those proteins would therefore be beneficial. Our work on using domain information to identify proteins that share a common ancestor and to infer protein function makes an important contribution to post-sequencing analysis.
|