Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset
Abstract. Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing if the re...
Main Authors: | Judith Somekh, Shai S Shen-Orr, Isaac S Kohane |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2019-05-01 |
Series: | BMC Bioinformatics |
Subjects: | Batch correction, Batch effect, Gene expression, ComBat, Principal component analysis, GTEx |
Online Access: | http://link.springer.com/article/10.1186/s12859-019-2855-9 |
id |
doaj-190774c2781d4b58823ebda8e3929265 |
---|---|
record_format |
Article |
spelling |
doaj-190774c2781d4b58823ebda8e3929265; 2020-11-25T03:18:09Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2019-05-01; 20(1):1-10; 10.1186/s12859-019-2855-9; Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset; Judith Somekh (Department of Biomedical Informatics, Harvard Medical School), Shai S Shen-Orr (Faculty of Medicine, Technion – Israel Institute of Technology), Isaac S Kohane (Department of Biomedical Informatics, Harvard Medical School); http://link.springer.com/article/10.1186/s12859-019-2855-9; Batch correction, Batch effect, Gene expression, ComBat, Principal component analysis, GTEx |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Judith Somekh, Shai S Shen-Orr, Isaac S Kohane |
title |
Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset |
publisher |
BMC |
series |
BMC Bioinformatics |
issn |
1471-2105 |
publishDate |
2019-05-01 |
description |
Abstract. Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher. Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations based on thousands of unrelated experiments derived from an external reference. Our framework includes three steps: (1) data adjustment with the desired methods; (2) calculating gene-gene co-expression measurements for adjusted datasets; (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project. Conclusions: Our framework enables the evaluation of batch correction methods to better preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package. |
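The three steps named in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' R package: the function names, the toy data, the use of absolute Pearson correlation as the co-expression measure, and the use of ROC AUC as the performance score are all assumptions made for illustration. Step 1 is reduced here to a simple linear regression on a single known confounder.

```python
from itertools import combinations
from math import sqrt


def residualize(values, confounder):
    """Step 1 (sketch): remove the linear effect of one known confounder."""
    n = len(values)
    mx = sum(confounder) / n
    my = sum(values) / n
    sxx = sum((x - mx) ** 2 for x in confounder)
    sxy = sum((x - mx) * (y - my) for x, y in zip(confounder, values))
    slope = sxy / sxx
    return [y - my - slope * (x - mx) for x, y in zip(confounder, values)]


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def coexpression(expr):
    """Step 2 (sketch): absolute correlation for every gene pair."""
    return {
        frozenset((g1, g2)): abs(pearson(expr[g1], expr[g2]))
        for g1, g2 in combinations(sorted(expr), 2)
    }


def auc(scores, gold_pairs):
    """Step 3 (sketch): chance that a gold pair outranks a non-gold pair."""
    pos = [s for pair, s in scores.items() if pair in gold_pairs]
    neg = [s for pair, s in scores.items() if pair not in gold_pairs]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six samples, three genes, one binary batch label.
batch = [0, 0, 0, 1, 1, 1]
raw = {
    "A": [1, 3, 2, 2, 1, 3],  # shares a signal with B, no batch shift
    "B": [1, 3, 2, 6, 5, 7],  # same signal as A plus a batch shift
    "C": [3, 3, 0, 4, 7, 7],  # unrelated signal plus the same batch shift
}
gold = {frozenset(("A", "B"))}  # assumed a-priori association

adjusted = {gene: residualize(vals, batch) for gene, vals in raw.items()}

raw_auc = auc(coexpression(raw), gold)
adj_auc = auc(coexpression(adjusted), gold)
print(f"AUC before adjustment: {raw_auc:.2f}")
print(f"AUC after adjustment:  {adj_auc:.2f}")
```

In this toy case the shared batch shift makes the unrelated pair B-C look more co-expressed than the gold pair A-B, so the score improves once the batch effect is regressed out. Comparing such scores across correction methods is the kind of evaluation the abstract describes; a score that drops after adjustment would suggest over-correction.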
topic |
Batch correction, Batch effect, Gene expression, ComBat, Principal component analysis, GTEx |
url |
http://link.springer.com/article/10.1186/s12859-019-2855-9 |