An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases
Master's === National Sun Yat-sen University === Department of Computer Science and Engineering === 97 === Proteins are the structural components of living cells and tissues, and thus an important building block in all living organisms. Patterns in protein sequences are subsequences that appear frequently. Patterns often denote important functional regions in...
Main Authors: | Yin-han Jeng, 鄭尹涵 |
---|---|
Other Authors: | Ye-In Chang |
Format: | Others |
Language: | en_US |
Published: | 2009 |
Online Access: | http://ndltd.ncl.edu.tw/handle/n3ew48 |
id |
ndltd-TW-097NSYS5392016 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-097NSYS5392016 2019-05-29T03:42:53Z http://ndltd.ncl.edu.tw/handle/n3ew48 An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases 一個於蛋白質資料庫中有效率地基於位元模式來探勘順序項目之方法 Yin-han Jeng 鄭尹涵 碩士 國立中山大學 資訊工程學系研究所 97 Ye-In Chang 張玉盈 2009 學位論文 ; thesis 86 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Sun Yat-sen University === Department of Computer Science and Engineering === 97 === Proteins are the structural components of living cells and tissues, and thus an important building block in all living organisms. Patterns in protein sequences are subsequences that appear frequently. Patterns often denote important functional regions in proteins and can be used to characterize a protein family or to discover the function of proteins. Moreover, they provide valuable information about the evolution of species. Patterns may contain gaps of arbitrary size.
For the no-gap-limit sequential pattern problem in a protein database, sequential pattern mining algorithms are a natural starting point. However, in a protein database, the order in which segments appear in a protein sequence matters, and a segment may appear many times in the same sequence. Therefore, traditional sequential pattern mining algorithms cannot be applied directly. Many algorithms have been proposed to mine sequential patterns in protein databases, for example, the SP-index algorithm. They enumerate patterns of limited size (segments) in the solution space and find all patterns. The SP-index algorithm is based on the traditional sequential pattern mining algorithms and handles the problem of multiple appearances of a segment in a protein sequence. Although the SP-index algorithm takes the characteristics of bioinformatics data into account, it still contains a time-consuming step that constructs the SP-tree to find the frequent patterns. In this step, it has to trace many nodes to get the result.
Therefore, in this thesis, we propose a Bit-Pattern-based (BP) algorithm to address the drawbacks of the SP-index algorithm. First, we transform the protein sequences into bit sequences. Second, we construct the frequent segments by using the AND operator. Because this is a bitwise operation, the frequent segments can be obtained efficiently. We then prune unnecessary frequent segments, so that fewer frequent segments need to be tested in the next step. Third, we use the OR operator to get the longest pattern. In this step, we test whether two segments can be linked together to construct a longer segment, and a single test gives the result. Because we focus on the positions at which a segment appears, we can apply the OR operator and then inspect the resulting bit sequences to get the result. Thus, we can avoid many testing processes.
From our performance study based on biological data, we show that we can improve on the efficiency of the SP-index algorithm. Moreover, our simulation results show that the proposed algorithm can reduce the processing time by up to 50% compared with the SP-index algorithm, since the SP-index algorithm has to trace many nodes to construct the longest pattern.
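The first two steps described above (bit encoding, then AND-based segment detection) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual bit layout or support definition, so everything here is an assumption made for illustration: one integer bitmask per residue and per sequence, contiguous (gap-free) segments, support counted as the number of sequences containing the segment at least once, and the hypothetical helper names `encode`, `segment_positions`, and `support`. The pruning step and the OR-based linking step are not sketched, because the abstract does not specify them in enough detail.

```python
from collections import defaultdict


def encode(sequence):
    """Assumed encoding: one bitmask per residue; bit i is set when sequence[i] is that residue."""
    masks = defaultdict(int)
    for i, residue in enumerate(sequence):
        masks[residue] |= 1 << i
    return masks


def segment_positions(masks, segment):
    """Return a bitmask whose set bits are the start positions of the contiguous segment."""
    if not segment:
        return 0
    hits = masks.get(segment[0], 0)
    for offset, residue in enumerate(segment[1:], start=1):
        # After the shift, bit p means "residue occurs at position p + offset",
        # so the AND keeps only starts where every residue lines up.
        hits &= masks.get(residue, 0) >> offset
    return hits


def support(database, segment):
    """Count the sequences in which the segment occurs at least once (assumed support measure)."""
    return sum(1 for masks in database if segment_positions(masks, segment))


# Toy sequences invented for this example; they are not data from the thesis.
sequences = ["MKVLAAGMKV", "GGMKVT", "AAGT"]
database = [encode(s) for s in sequences]

print(bin(segment_positions(database[0], "MKV")))  # start positions 0 and 7
print(support(database, "MKV"))
```

Because every occurrence of a segment inside one sequence is recorded in a single integer, repeated appearances are kept without extra bookkeeping, which is the property the abstract singles out as important for protein data.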
|
author2 |
Ye-In Chang |
author_facet |
Ye-In Chang Yin-han Jeng 鄭尹涵 |
author |
Yin-han Jeng 鄭尹涵 |
spellingShingle |
Yin-han Jeng 鄭尹涵 An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases |
author_sort |
Yin-han Jeng |
title |
An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases |
title_short |
An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases |
title_full |
An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases |
title_fullStr |
An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases |
title_full_unstemmed |
An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases |
title_sort |
efficient bit-pattern-based algorithm for mining sequential patterns in protein databases |
publishDate |
2009 |
url |
http://ndltd.ncl.edu.tw/handle/n3ew48 |
work_keys_str_mv |
AT yinhanjeng anefficientbitpatternbasedalgorithmforminingsequentialpatternsinproteindatabases AT zhèngyǐnhán anefficientbitpatternbasedalgorithmforminingsequentialpatternsinproteindatabases AT yinhanjeng yīgèyúdànbáizhìzīliàokùzhōngyǒuxiàolǜdejīyúwèiyuánmóshìláitànkānshùnxùxiàngmùzhīfāngfǎ AT zhèngyǐnhán yīgèyúdànbáizhìzīliàokùzhōngyǒuxiàolǜdejīyúwèiyuánmóshìláitànkānshùnxùxiàngmùzhīfāngfǎ AT yinhanjeng efficientbitpatternbasedalgorithmforminingsequentialpatternsinproteindatabases AT zhèngyǐnhán efficientbitpatternbasedalgorithmforminingsequentialpatternsinproteindatabases |
_version_ |
1719192992240107520 |