Fast Ordered Sampling of DNA Sequence Variants

Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-...

Bibliographic Details
Main Author: Anthony J. Greenberg
Format: Article
Language: English
Published: Oxford University Press 2018-05-01
Series: G3: Genes, Genomes, Genetics
Subjects: C++
Online Access: http://g3journal.org/lookup/doi/10.1534/g3.117.300465
id doaj-21677c1c0707499096b2e6c7594687fd
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Anthony J. Greenberg
title Fast Ordered Sampling of DNA Sequence Variants
publisher Oxford University Press
series G3: Genes, Genomes, Genetics
issn 2160-1836
publishDate 2018-05-01
description Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects.
topic nucleotide polymorphism
random sampling
statistical genetics
genomics
C++
url http://g3journal.org/lookup/doi/10.1534/g3.117.300465