Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.

Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordabl...


Bibliographic Details
Main Authors:	Ariel W Chan, Martha T Hamblin, Jean-Luc Jannink
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2016-01-01
Series:	PLoS ONE
Online Access:	http://europepmc.org/articles/PMC4990193?pdf=render

id	doaj-7126694994df4deba3bb14d6c632f5cc
record_format	Article
citation	PLoS ONE 11(8): e0160733 (2016-01-01)
doi	10.1371/journal.pone.0160733
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Ariel W Chan Martha T Hamblin Jean-Luc Jannink
title	Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.
publisher	Public Library of Science (PLoS)
series	PLoS ONE
issn	1932-6203
publishDate	2016-01-01
description	Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
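The accuracy measure named in the description (mask observed genotypes, impute, then correlate observed and imputed dosages per site and per individual) can be illustrated with a minimal sketch. This is not the authors' code: the function names `masked_accuracy` and `mean_impute`, the 10% default mask fraction, and the site-mean imputer are all assumptions made for illustration, and the stand-in imputer is neither Beagle nor glmnet.

```python
import numpy as np

def masked_accuracy(observed, impute_fn, mask_fraction=0.1, seed=0):
    """Hypothetical helper: mask-and-impute accuracy estimate.

    observed: (individuals x sites) array of dosages (0, 1, 2), np.nan if missing.
    impute_fn: any callable that takes the masked matrix and returns a filled one.
    Returns Pearson correlations per site and per individual, computed only
    over the entries that were deliberately masked.
    """
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    known = ~np.isnan(observed)
    # Hide a random subset of the genotypes that were actually observed.
    mask = known & (rng.random(observed.shape) < mask_fraction)
    masked = observed.copy()
    masked[mask] = np.nan
    imputed = np.asarray(impute_fn(masked), dtype=float)

    def corr(x, y):
        # Correlation is undefined for fewer than two points or a constant vector.
        if x.size < 2 or x.max() == x.min() or y.max() == y.min():
            return np.nan
        return float(np.corrcoef(x, y)[0, 1])

    n_ind, n_site = observed.shape
    per_site = np.array(
        [corr(observed[mask[:, j], j], imputed[mask[:, j], j]) for j in range(n_site)]
    )
    per_individual = np.array(
        [corr(observed[i, mask[i]], imputed[i, mask[i]]) for i in range(n_ind)]
    )
    return per_site, per_individual

def mean_impute(g):
    """Placeholder imputer (assumption): fill each missing entry with the site mean."""
    site_means = np.nanmean(g, axis=0)
    return np.where(np.isnan(g), site_means, g)
```

A site-mean imputer assigns the same value to every masked entry at a site, so its per-site correlation is undefined (returned as NaN here); only an imputer that varies across individuals yields a meaningful site-level correlation.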
url	http://europepmc.org/articles/PMC4990193?pdf=render


Similar Items