Computational approaches for metagenomic analysis of high-throughput sequencing data

High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented in...

Bibliographic Details
Main Author: Ainsworth, David
Other Authors: Sternberg, Michael ; Butcher, Sarah ; Knottenbelt, William
Published: Imperial College London 2016
Subjects: 572.8
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.702824
id ndltd-bl.uk-oai-ethos.bl.uk-702824
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-702824 2018-06-12T03:38:38Z http://hdl.handle.net/10044/1/44070 Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 572.8
description High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies, however, are not without their limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This 'data deluge' has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample, as they allow both taxonomic and functional classification, but these algorithms are prohibitively slow. This inefficiency has led to a reliance on faster algorithms which produce only simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments against annotated genomes. This thesis describes k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment-based approaches, making full taxonomic classification and gene identification tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications for species which have high sequence identity within their genus, providing a significant (up to 40%) increase in accuracy on these species. Also described is a re-analysis of a Shiga-toxin-producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin-producing genes. k-SLAM has been used by a range of research projects, including FLORINASH, and is currently being used by a number of groups.
author2 Sternberg, Michael ; Butcher, Sarah ; Knottenbelt, William
author Ainsworth, David
title Computational approaches for metagenomic analysis of high-throughput sequencing data
publisher Imperial College London
publishDate 2016
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.702824