Summary: | Master's === National Chiayi University === Graduate Institute of Computer Science and Information Engineering === 91 === Owing to the Human Genome Project and similar projects, the number of sequences submitted to biological sequence databases is increasing rapidly. These databases can assist researchers in biology, medicine, pharmaceutics, and related fields in their experiments. On the assumption that similar sequences may have similar functions, researchers search the databases for sequences similar to their query sequences, so that they can predict the function of the query sequences and design appropriate experiments.
In this study, we represent a biological sequence as a multi-dimensional data point. For a DNA/RNA sequence, the dimensions are (1) the counts of the nucleotides, (2) the distribution of the nucleotides in the sequence, and (3) the chain code of the sequence. For a protein sequence, the dimensions are (1) the counts of the amino acids and (2) the distribution of the amino acids in the sequence.
Our experiments show that if the multi-dimensional data points representing two sequences are close, the two sequences have a high similarity score. We can therefore use the multi-dimensional data points to index biological sequences and store the data points in a spatial data structure. When a query sequence is issued, its multi-dimensional data point is used to retrieve the sequences close to it. We experimented with three different species, and the results were as we expected.
|
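The idea summarized above, mapping each sequence to a point and retrieving the points nearest to a query, can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several details are assumptions made for illustration: counts are normalized by sequence length, "distribution" is taken to be the mean relative position of each nucleotide, the chain code dimension is omitted because the summary does not define it, and a linear scan stands in for the spatial data structure. The names `feature_vector` and `nearest` are hypothetical.

```python
from collections import Counter
from math import dist

NUCLEOTIDES = "ACGT"


def feature_vector(seq: str) -> list[float]:
    """Map a DNA sequence to an 8-dimensional point.

    For each nucleotide the point holds two values (both are assumptions,
    not definitions taken from the thesis):
      1. its count divided by the sequence length
      2. its mean position divided by the sequence length
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")

    counts = Counter(seq)
    position_sums = {base: 0 for base in NUCLEOTIDES}
    for index, base in enumerate(seq):
        if base in position_sums:
            position_sums[base] += index

    point = []
    for base in NUCLEOTIDES:
        count = counts.get(base, 0)
        point.append(count / n)
        # Use 0.5 as a neutral value when the nucleotide is absent (assumption).
        point.append(position_sums[base] / (count * n) if count else 0.5)
    return point


def nearest(query: str, database: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k database sequences whose points are
    closest (Euclidean distance) to the query's point.

    A linear scan is used here; a spatial index would replace the sort.
    """
    query_point = feature_vector(query)
    ranked = sorted(
        database,
        key=lambda name: dist(query_point, feature_vector(database[name])),
    )
    return ranked[:k]


if __name__ == "__main__":
    database = {
        "seq_a": "AAAACCCC",
        "seq_b": "GGGGTTTT",
        "seq_c": "ACGTACGT",
    }
    print(nearest("AAAACCCA", database, k=2))
```

In practice, the candidates returned by such a lookup would still be checked with a conventional alignment score, since closeness of the points only suggests similarity.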