Cleaning Genotype Data from Diversity Outbred Mice

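Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.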

Bibliographic Details
Main Authors: Karl W. Broman, Daniel M. Gatti, Karen L. Svenson, Śaunak Sen, Gary A. Churchill
Format: Article
Language: English
Published: Oxford University Press 2019-05-01
Series: G3: Genes, Genomes, Genetics
Subjects: QTL, MPP
Online Access: http://g3journal.org/lookup/doi/10.1534/g3.119.400165
id doaj-a79dbaf6d50543a89f350f49d2c2e59e
record_format Article
spelling doaj-a79dbaf6d50543a89f350f49d2c2e59e; 2021-07-02T14:12:13Z; eng; Oxford University Press; G3: Genes, Genomes, Genetics; 2160-1836; 2019-05-01; 9; 5; 1571; 1579; 10.1534/g3.119.400165; 27; Cleaning Genotype Data from Diversity Outbred Mice; Karl W. Broman; Daniel M. Gatti; Karen L. Svenson; Śaunak Sen; Gary A. Churchill; http://g3journal.org/lookup/doi/10.1534/g3.119.400165; data cleaning; QTL; quantitative trait loci; data diagnostics; Multiparent Advanced Generation Inter-Cross (MAGIC); multiparental populations; MPP
collection DOAJ
author Karl W. Broman
Daniel M. Gatti
Karen L. Svenson
Śaunak Sen
Gary A. Churchill
title Cleaning Genotype Data from Diversity Outbred Mice
publisher Oxford University Press
series G3: Genes, Genomes, Genetics
issn 2160-1836
publishDate 2019-05-01
description Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.
topic data cleaning
QTL
quantitative trait loci
data diagnostics
Multiparent Advanced Generation Inter-Cross (MAGIC)
multiparental populations
MPP
url http://g3journal.org/lookup/doi/10.1534/g3.119.400165