Recurrent miscalling of missense variation from short-read genome sequence data

Abstract. Background: Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproport...


Bibliographic Details
Main Authors: Matthew A. Field, Gaetan Burgio, Aaron Chuah, Jalila Al Shekaili, Batool Hassan, Nashat Al Sukaiti, Simon J. Foote, Matthew C. Cook, T. Daniel Andrews
Format: Article
Language: English
Published: BMC 2019-07-01
Series: BMC Genomics
Subjects: Single nucleotide variant; Miscall; Resampling; Exome; Alignment
Online Access: http://link.springer.com/article/10.1186/s12864-019-5863-2
id doaj-7149218fe3724c40ba59eaeaf72d6345
record_format Article
spelling doaj-7149218fe3724c40ba59eaeaf72d6345 2020-11-25T03:20:52Z eng BMC BMC Genomics 1471-2164 2019-07-01 20S819 10.1186/s12864-019-5863-2
Recurrent miscalling of missense variation from short-read genome sequence data
Matthew A. Field [0]; Gaetan Burgio [1]; Aaron Chuah [2]; Jalila Al Shekaili [3]; Batool Hassan [4]; Nashat Al Sukaiti [5]; Simon J. Foote [6]; Matthew C. Cook [7]; T. Daniel Andrews [8]
[0] Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University
[1] Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University
[2] Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University
[3] Department of Microbiology and Immunology, Sultan Qaboos University Hospital
[4] Department of Medicine, Sultan Qaboos University Hospital
[5] Department of Paediatrics, Allergy, and Clinical Immunology Unit, Royal Hospital
[6] Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University
[7] Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University
[8] Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University
Abstract
Background: Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation.
Results: We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and they are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2–300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3–5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation.
Conclusion: Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Variant miscalls are highly sequence intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.
http://link.springer.com/article/10.1186/s12864-019-5863-2
Single nucleotide variant; Miscall; Resampling; Exome; Alignment
collection DOAJ
language English
format Article
sources DOAJ
author Matthew A. Field
Gaetan Burgio
Aaron Chuah
Jalila Al Shekaili
Batool Hassan
Nashat Al Sukaiti
Simon J. Foote
Matthew C. Cook
T. Daniel Andrews
author_sort Matthew A. Field
title Recurrent miscalling of missense variation from short-read genome sequence data
title_sort recurrent miscalling of missense variation from short-read genome sequence data
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2019-07-01
description Abstract
Background: Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation.
Results: We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and they are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2–300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3–5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation.
Conclusion: Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Variant miscalls are highly sequence intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.
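The abstract's idea of finding recurrent miscalls through repeated rounds of read resampling, realignment and recalling can be illustrated with a small tally. The sketch below is not the authors' pipeline: it assumes each simulation round has already been reduced to a collection of called variants, and that the variants genuinely present in the simulated genome are known (the `truth` argument). The names `Variant`, `recurrent_false_positives` and `min_rounds` are hypothetical and introduced only for this illustration.

```python
from collections import Counter
from typing import NamedTuple


class Variant(NamedTuple):
    """A called variant, identified by position and alleles (hypothetical record type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def recurrent_false_positives(rounds, truth, min_rounds=2):
    """Count calls that are absent from `truth` and keep those seen in >= min_rounds rounds.

    rounds: iterable of per-round call collections (one per simulation round)
    truth:  variants known to be present in the simulated genome (assumed available)
    """
    truth_set = set(truth)
    counts = Counter()
    for calls in rounds:
        # set() ensures a variant is counted at most once per round
        counts.update(set(calls) - truth_set)
    return {variant: n for variant, n in counts.items() if n >= min_rounds}


# Toy data: one real variant, one miscall seen in two rounds, one seen once.
real = Variant("chr1", 1000, "A", "G")
miscall = Variant("chr7", 5500, "C", "T")
one_off = Variant("chr2", 42, "G", "A")
simulated_rounds = [[real, miscall], [real, miscall, one_off], [real]]

print(recurrent_false_positives(simulated_rounds, [real]))
```

In this toy run only the `chr7` call is reported, because it is absent from the truth set and recurs in two of the three rounds. A candidate list of this kind could then be used to filter an individual's real variant calls, under the assumption that the simulated reads match the read length of the real data.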
topic Single nucleotide variant
Miscall
Resampling
Exome
Alignment
url http://link.springer.com/article/10.1186/s12864-019-5863-2
work_keys_str_mv AT matthewafield recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT gaetanburgio recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT aaronchuah recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT jalilaalshekaili recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT batoolhassan recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT nashatalsukaiti recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT simonjfoote recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT matthewccook recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
AT tdanielandrews recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata
_version_ 1724616088044437504