De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data

The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes tha...

Bibliographic Details
Main Authors: Adam Ameur, Huiwen Che, Marcel Martin, Ignas Bunikis, Johan Dahlberg, Ida Höijer, Susana Häggqvist, Francesco Vezzi, Jessica Nordlund, Pall Olason, Lars Feuk, Ulf Gyllensten
Format: Article
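Language: English
Published: MDPI AG 2018-10-01
Series: Genes
Subjects: de novo assembly; SMRT sequencing; GRCh38; human reference genome; human whole-genome sequencing; population sequencing; Swedish population
Online Access: http://www.mdpi.com/2073-4425/9/10/486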
id doaj-d5295c0ad00449b0b8f2592137db29a3
record_format Article
spelling
Record ID: doaj-d5295c0ad00449b0b8f2592137db29a3
Record timestamp: 2020-11-25T00:16:49Z
Language: eng
Publisher: MDPI AG
Journal: Genes (ISSN 2073-4425)
Published: 2018-10-01, volume 9, issue 10, article 486
DOI: 10.3390/genes9100486
Title: De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data
Authors and affiliations:
Adam Ameur: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 752 36 Uppsala, Sweden
Huiwen Che: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 752 36 Uppsala, Sweden
Marcel Martin: Science for Life Laboratory, Department of Biochemistry and Biophysics (DBB), Stockholm University, 114 19 Stockholm, Sweden
Ignas Bunikis: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 752 36 Uppsala, Sweden
Johan Dahlberg: Science for Life Laboratory, Department of Medical Sciences, Molecular Medicine, Uppsala University, 752 36 Uppsala, Sweden
Ida Höijer: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 752 36 Uppsala, Sweden
Susana Häggqvist: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 752 36 Uppsala, Sweden
Francesco Vezzi: Science for Life Laboratory, Department of Biochemistry and Biophysics (DBB), Stockholm University, 114 19 Stockholm, Sweden
Jessica Nordlund: Science for Life Laboratory, Department of Medical Sciences, Molecular Medicine, Uppsala University, 752 36 Uppsala, Sweden
Pall Olason: Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, 752 36 Uppsala, Sweden
Lars Feuk: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 752 36 Uppsala, Sweden
Ulf Gyllensten: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 752 36 Uppsala, Sweden
collection DOAJ
language English
format Article
sources DOAJ
author Adam Ameur
Huiwen Che
Marcel Martin
Ignas Bunikis
Johan Dahlberg
Ida Höijer
Susana Häggqvist
Francesco Vezzi
Jessica Nordlund
Pall Olason
Lars Feuk
Ulf Gyllensten
title De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data
publisher MDPI AG
series Genes
issn 2073-4425
publishDate 2018-10-01
description The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have an elevated GC-content, and are primarily located in centromeric or telomeric regions. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 at chromosomes 14, 17, and 21. Inclusion of NS into the GRCh38 reference radically improves the alignment and variant calling from short-read whole-genome sequencing data at several genomic loci. A re-analysis of a Swedish population-scale sequencing project yields > 75,000 putative novel single nucleotide variants (SNVs) and removes > 10,000 false positive SNV calls per individual, some of which are located in protein coding regions. Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data.
topic de novo assembly
SMRT sequencing
GRCh38
human reference genome
human whole-genome sequencing
population sequencing
Swedish population
url http://www.mdpi.com/2073-4425/9/10/486