Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.

Bibliographic Details
Main Authors: Ariel W Chan, Martha T Hamblin, Jean-Luc Jannink
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2016-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4990193?pdf=render
id doaj-7126694994df4deba3bb14d6c632f5cc
record_format Article
spelling doaj-7126694994df4deba3bb14d6c632f5cc; 2020-11-24T22:11:46Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2016-01-01; 11(8): e0160733; 10.1371/journal.pone.0160733; Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.; Ariel W Chan; Martha T Hamblin; Jean-Luc Jannink
collection DOAJ
language English
format Article
sources DOAJ
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2016-01-01
description Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing errors, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in how one can accurately infer, or impute, missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable to imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in its datasets using a LASSO-penalized linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
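The masking-based accuracy estimate described in the abstract can be illustrated with a small Python sketch. This is not the authors' pipeline: the function names (pearson, mask_genotypes, impute_site_mean, accuracy) are hypothetical, the dosage matrix is assumed to be a list of rows (individuals) by columns (sites) with None for missing calls, and the site-mean imputer is only a placeholder standing in for Beagle or glmnet.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation; None if undefined (n < 2 or zero variance)."""
    n = len(xs)
    if n < 2:
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def mask_genotypes(dosages, fraction, rng):
    """Hide a random fraction of the observed cells.

    Returns a masked copy of the matrix and the list of hidden (row, col) cells.
    """
    observed = [(i, j) for i, row in enumerate(dosages)
                for j, d in enumerate(row) if d is not None]
    hidden = rng.sample(observed, int(len(observed) * fraction))
    masked = [list(row) for row in dosages]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_site_mean(masked):
    """Placeholder imputer (assumption, NOT Beagle or glmnet): site mean dosage."""
    filled = [list(row) for row in masked]
    for j in range(len(masked[0])):
        seen = [row[j] for row in masked if row[j] is not None]
        mean = sum(seen) / len(seen) if seen else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled


def accuracy(truth, imputed, hidden, axis):
    """Pearson r over masked cells, grouped by individual (axis=0) or site (axis=1)."""
    groups = {}
    for i, j in hidden:
        key = (i, j)[axis]
        obs, imp = groups.setdefault(key, ([], []))
        obs.append(truth[i][j])
        imp.append(imputed[i][j])
    return {k: pearson(obs, imp) for k, (obs, imp) in groups.items()}
```

A typical call would be masked, hidden = mask_genotypes(dosages, 0.1, random.Random(1)), followed by accuracy(dosages, impute_site_mean(masked), hidden, axis=0) for the individual-level correlations. Note that the placeholder imputer assigns one value per site, so its site-level correlations (axis=1) are undefined and come back as None; a real imputer that produces varying dosages within a site is needed for that comparison.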