Inferring Genomic Sequences

Recent advances in next-generation sequencing have provided unprecedented opportunities for high-throughput genomic research, inexpensively producing millions of genomic sequences in a single run. Analysis of massive volumes of data results in a more accurate picture of genome complexity and requires adequate bioinformatics support.

Bibliographic Details
Main Author: Astrovskaya, Irina A
Format: Others
Published: Digital Archive @ GSU 2011
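Subjects: Haplotype assembling; Viral quasispecies; Next generation sequencing; VISPA; Genotype tagging; Computer Sciences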
Online Access: http://digitalarchive.gsu.edu/cs_diss/59
http://digitalarchive.gsu.edu/cgi/viewcontent.cgi?article=1059&context=cs_diss
id ndltd-GEORGIA-oai-digitalarchive.gsu.edu-cs_diss-1059
record_format oai_dc
spelling ndltd-GEORGIA-oai-digitalarchive.gsu.edu-cs_diss-1059 2013-04-23T03:18:55Z Inferring Genomic Sequences Astrovskaya, Irina A 2011-05-07 text application/pdf http://digitalarchive.gsu.edu/cs_diss/59 http://digitalarchive.gsu.edu/cgi/viewcontent.cgi?article=1059&context=cs_diss Computer Science Dissertations Digital Archive @ GSU Haplotype assembling Viral quasispecies Next generation sequencing VISPA Genotype tagging Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic Haplotype assembling
Viral quasispecies
Next generation sequencing
VISPA
Genotype tagging
Computer Sciences
description Recent advances in next-generation sequencing have provided unprecedented opportunities for high-throughput genomic research, inexpensively producing millions of genomic sequences in a single run. Analysis of massive volumes of data results in a more accurate picture of genome complexity and requires adequate bioinformatics support. We explore computational challenges of applying next-generation sequencing to particular applications, focusing on the problem of reconstructing the viral quasispecies spectrum from pyrosequencing shotgun reads and the problem of inferring informative single nucleotide polymorphisms (SNPs) that statistically cover the genetic variation of a genome region in genome-wide association studies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but standard assembly software cannot be used to simultaneously assemble multiple closely related (but non-identical) quasispecies sequences and estimate their abundance. Here, we introduce a new Viral Spectrum Assembler (ViSpA) for inferring the quasispecies spectrum and compare it with the state-of-the-art ShoRAH tool on both synthetic and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. While ShoRAH has an advanced error correction algorithm, ViSpA is better at quasispecies assembly, producing a more accurate reconstruction of a viral population. We also foresee applying ViSpA to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryotic populations.
Due to the large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs (tags) that covers the genetic variation of the entire set. We explore the trade-off between the number of tags used per non-tagged SNP and possible overfitting, and propose an efficient 2LR-Tagging heuristic.
author Astrovskaya, Irina A
title Inferring Genomic Sequences
publisher Digital Archive @ GSU
publishDate 2011
url http://digitalarchive.gsu.edu/cs_diss/59
http://digitalarchive.gsu.edu/cgi/viewcontent.cgi?article=1059&context=cs_diss