Enhanced Suffix Trees for Very Large DNA Sequences

Bibliographic Details
Main Author: Fan, Si Ai
Format: Others
Published: 2011
Online Access: http://spectrum.library.concordia.ca/35799/1/Fan_MCompSc_F2011.pdf
Fan, Si Ai <http://spectrum.library.concordia.ca/view/creators/Fan=3ASi_Ai=3A=3A.html> (2011) Enhanced Suffix Trees for Very Large DNA Sequences. Masters thesis, Concordia University.
Description
Summary: Recent advances in biotechnology have led to a rapid accumulation of biological DNA sequence data. New techniques are required for fast, scalable, and versatile processing of such data. The suffix tree (ST) is a data structure used for indexing genome data. This, however, comes at a price: an ST occupies about 10 times the space of the input. Existing disk-based ST index techniques either suffer from the data skew problem, like TDD and HST, or are not space efficient for very large sequences, like TRELLIS and B2ST. We propose a new disk-based ST index, called Compact Binary Suffix Tree (CBST), together with a construction algorithm, which can support DNA sequences of up to 256 terabytes. The results of our numerous experiments indicate that, compared to existing ST and suffix array techniques, CBST is superior in speed, space requirement, and scalability. It is the fastest among the disk-based techniques for very large sequences.