beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.

Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data...


Bibliographic Details
Main Authors: Aaron T L Lun, Hervé Pagès, Mike L Smith
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2018-05-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1006135
id doaj-23478ab922484ce691f778aeb3f1810a
record_format Article
spelling doaj-23478ab922484ce691f778aeb3f1810a 2021-04-21T15:09:59Z eng Public Library of Science (PLoS) PLoS Computational Biology 1553-734X 1553-7358 2018-05-01 14 5 e1006135 10.1371/journal.pcbi.1006135 beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types. Aaron T L Lun Hervé Pagès Mike L Smith Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set. https://doi.org/10.1371/journal.pcbi.1006135
collection DOAJ
language English
format Article
sources DOAJ
author Aaron T L Lun
Hervé Pagès
Mike L Smith
spellingShingle Aaron T L Lun
Hervé Pagès
Mike L Smith
beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.
PLoS Computational Biology
author_facet Aaron T L Lun
Hervé Pagès
Mike L Smith
author_sort Aaron T L Lun
title beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.
title_sort beachmat: a bioconductor c++ api for accessing high-throughput biological data from a variety of r matrix types.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2018-05-01
description Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set.
url https://doi.org/10.1371/journal.pcbi.1006135
work_keys_str_mv AT aarontllun beachmatabioconductorcapiforaccessinghighthroughputbiologicaldatafromavarietyofrmatrixtypes
AT hervepages beachmatabioconductorcapiforaccessinghighthroughputbiologicaldatafromavarietyofrmatrixtypes
AT mikelsmith beachmatabioconductorcapiforaccessinghighthroughputbiologicaldatafromavarietyofrmatrixtypes
_version_ 1714667875839508480