Fast Ordered Sampling of DNA Sequence Variants
Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-...
Main Author: | Anthony J. Greenberg |
---|---|
Format: | Article |
Language: | English |
Published: | Oxford University Press, 2018-05-01 |
Series: | G3: Genes, Genomes, Genetics |
Subjects: | nucleotide polymorphism; random sampling; statistical genetics; genomics; C++ |
Online Access: | http://g3journal.org/lookup/doi/10.1534/g3.117.300465 |
id | doaj-21677c1c0707499096b2e6c7594687fd |
---|---|
record_format | Article |
spelling | doaj-21677c1c0707499096b2e6c7594687fd; 2021-07-02T18:14:02Z; eng; Oxford University Press; G3: Genes, Genomes, Genetics; 2160-1836; 2018-05-01; 8; 5; 1455; 1460; 10.1534/g3.117.300465; 8; Fast Ordered Sampling of DNA Sequence Variants; Anthony J. Greenberg; Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects.; http://g3journal.org/lookup/doi/10.1534/g3.117.300465; nucleotide polymorphism; random sampling; statistical genetics; genomics; C++ |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Anthony J. Greenberg |
spellingShingle | Anthony J. Greenberg; Fast Ordered Sampling of DNA Sequence Variants; G3: Genes, Genomes, Genetics; nucleotide polymorphism; random sampling; statistical genetics; genomics; C++ |
author_facet | Anthony J. Greenberg |
author_sort | Anthony J. Greenberg |
title | Fast Ordered Sampling of DNA Sequence Variants |
title_short | Fast Ordered Sampling of DNA Sequence Variants |
title_full | Fast Ordered Sampling of DNA Sequence Variants |
title_fullStr | Fast Ordered Sampling of DNA Sequence Variants |
title_full_unstemmed | Fast Ordered Sampling of DNA Sequence Variants |
title_sort | fast ordered sampling of dna sequence variants |
publisher | Oxford University Press |
series | G3: Genes, Genomes, Genetics |
issn | 2160-1836 |
publishDate | 2018-05-01 |
description | Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects. |
topic | nucleotide polymorphism; random sampling; statistical genetics; genomics; C++ |
url | http://g3journal.org/lookup/doi/10.1534/g3.117.300465 |
work_keys_str_mv | AT anthonyjgreenberg fastorderedsamplingofdnasequencevariants |
_version_ | 1721324692592132096 |
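
The description above mentions drawing ordered samples of loci from large variant files, but this record does not reproduce the algorithm. As a minimal illustration only, the sketch below uses selection sampling (Knuth's Algorithm S), a simple stand-in that is not claimed to be the article's method. The function name `ordered_sample` and its parameters are hypothetical.

```python
import random


def ordered_sample(n_total, n_sample, rng=None):
    """Return n_sample distinct indices from range(n_total), in increasing order.

    Selection sampling: record i is kept with probability
    (records still needed) / (records not yet examined).
    """
    if not 0 <= n_sample <= n_total:
        raise ValueError("n_sample must be between 0 and n_total")
    rng = rng or random.Random()
    picked = []
    remaining = n_sample
    for i in range(n_total):
        if remaining == 0:
            break  # sample is complete, no need to look further
        # When the records left equal the records needed, this is always true.
        if rng.random() * (n_total - i) < remaining:
            picked.append(i)
            remaining -= 1
    return picked


if __name__ == "__main__":
    # Hypothetical use: choose 5 of 100 locus positions to read from a file.
    print(ordered_sample(100, 5, random.Random(2018)))
```

Because the indices come out already sorted, a genotype file could be read front to back in a single pass, keeping only the sampled records. This simple scheme still tests every record, so it is best read as a baseline rather than as a fast method.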