beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.
Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data...
Main Authors: | Aaron T L Lun, Hervé Pagès, Mike L Smith |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2018-05-01 |
Series: | PLoS Computational Biology |
Online Access: | https://doi.org/10.1371/journal.pcbi.1006135 |
id |
doaj-23478ab922484ce691f778aeb3f1810a |
---|---|
record_format |
Article |
spelling |
doaj-23478ab922484ce691f778aeb3f1810a 2021-04-21T15:09:59Z eng Public Library of Science (PLoS) PLoS Computational Biology 1553-734X 1553-7358 2018-05-01 14 5 e1006135 10.1371/journal.pcbi.1006135 https://doi.org/10.1371/journal.pcbi.1006135 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Aaron T L Lun, Hervé Pagès, Mike L Smith |
author_sort |
Aaron T L Lun |
title |
beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types. |
publisher |
Public Library of Science (PLoS) |
series |
PLoS Computational Biology |
issn |
1553-734X 1553-7358 |
publishDate |
2018-05-01 |
description |
Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set. |
url |
https://doi.org/10.1371/journal.pcbi.1006135 |