Scaffolding Pre-Assembled Contigs Using Long-Read Sequencing



Bibliographic Details
Main Authors: Tsai, Cheng-Wei, 蔡承洧
Other Authors: Huang, Yao-Ting
Format: Others
Language: en_US
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/97232567550873046854
Description
Summary: Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 101 === In recent years, third-generation sequencing platforms, which sequence a single DNA molecule in real time and generate longer reads, have been applied to improve genome assembly. Unfortunately, these long reads have higher error rates than reads from earlier sequencing technologies, and most of the errors are indels. The high error rates greatly reduce the usefulness of long reads for improving genome assembly. In this thesis, we design and implement a program, called SACLR, for scaffolding pre-assembled contigs using long reads generated by the Pacific Biosciences platform. Given a set of pre-assembled contigs and long reads, SACLR determines the mapped boundaries of the contigs on the reads using a novel clustering alignment approach that tolerates the various errors of the platform. The linkage between contigs is established across multiple long reads and integrated to further increase scaffold length. Notably, the gaps within our scaffolds can be filled directly, and the two ends of each scaffold may be further extended, using the long reads. SACLR has been tested on a variety of real data sets. The experimental results showed that SACLR produced more contiguous and accurate sequences.
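The linkage-integration step described in the summary can be illustrated with a minimal Python sketch. This is not SACLR's implementation: the record does not describe the clustering alignment or the integration rule in detail, so the sketch assumes the contig boundaries on each long read are already known, ignores strand orientation, and uses hypothetical names (`links_from_read`, `integrate_links`, `build_scaffolds`) and a hypothetical `min_support` threshold. It only shows the general idea of pooling contig-to-contig links over several reads and chaining the supported ones.

```python
from collections import defaultdict
from statistics import median


def links_from_read(hits):
    """Derive links between neighbouring contigs on one long read.

    hits: list of (contig_id, read_start, read_end) tuples giving the
    assumed mapped boundaries of each contig on the read.
    Returns a list of (left_contig, right_contig, gap_length).
    """
    ordered = sorted(hits, key=lambda h: h[1])
    links = []
    for (a, _, a_end), (b, b_start, _) in zip(ordered, ordered[1:]):
        links.append((a, b, b_start - a_end))
    return links


def integrate_links(reads, min_support=2):
    """Pool links over all reads and keep the well-supported ones.

    reads: dict mapping read_id -> list of hits (see links_from_read).
    Returns a dict mapping (left, right) -> median gap estimate for
    every contig pair seen on at least min_support reads.
    """
    gaps = defaultdict(list)
    for hits in reads.values():
        for a, b, gap in links_from_read(hits):
            gaps[(a, b)].append(gap)
    return {pair: median(g) for pair, g in gaps.items() if len(g) >= min_support}


def build_scaffolds(links):
    """Greedily chain contigs that have a single successor and predecessor."""
    succ, pred = {}, {}
    for a, b in sorted(links):
        if a in succ or b in pred:
            continue  # skip ambiguous links in this simplified sketch
        succ[a] = b
        pred[b] = a
    scaffolds = []
    for start in sorted(c for c in succ if c not in pred):
        chain = [start]
        while chain[-1] in succ:
            chain.append(succ[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


# Toy input: three long reads, each spanning two or three contigs.
reads = {
    "r1": [("c1", 0, 100), ("c2", 150, 300)],
    "r2": [("c1", 10, 120), ("c2", 160, 280), ("c3", 300, 400)],
    "r3": [("c2", 0, 90), ("c3", 100, 200)],
}
links = integrate_links(reads)
scaffolds = build_scaffolds(links)
```

In this toy input no single read covers all three contigs with the required support, yet pooling the links across reads yields one chain of `c1`, `c2`, `c3`, which is the effect the summary attributes to integrating linkage across multiple long reads. The gap estimates kept in `links` correspond to the regions that long reads could be used to fill.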