Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset

Abstract. Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing if the re...

Bibliographic Details
Main Authors: Judith Somekh, Shai S Shen-Orr, Isaac S Kohane
Format: Article
Language: English
Published: BMC 2019-05-01
Series: BMC Bioinformatics
Subjects: Batch correction; Batch effect; Gene expression; ComBat; Principal component analysis; GTEx
Online Access: http://link.springer.com/article/10.1186/s12859-019-2855-9
id doaj-190774c2781d4b58823ebda8e3929265
record_format Article
spelling doaj-190774c2781d4b58823ebda8e3929265 2020-11-25T03:18:09Z eng BMC BMC Bioinformatics 1471-2105 2019-05-01 201110 10.1186/s12859-019-2855-9
Judith Somekh (0), Shai S Shen-Orr (1), Isaac S Kohane (2)
(0) Department of Biomedical Informatics, Harvard Medical School
(1) Faculty of Medicine, Technion – Israel Institute of Technology
(2) Department of Biomedical Informatics, Harvard Medical School
collection DOAJ
language English
format Article
sources DOAJ
author Judith Somekh
Shai S Shen-Orr
Isaac S Kohane
title Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2019-05-01
description Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher.
Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach compares the co-expression of gene-gene pairs in the adjusted data with a-priori knowledge of high-confidence gene-gene associations, drawn from an external reference built on thousands of unrelated experiments. The framework has three steps: (1) adjusting the data with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; and (3) evaluating those co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data from six representative tissue datasets derived from the GTEx project.
Conclusions: Our framework enables batch correction methods to be evaluated by how well they preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
topic Batch correction
Batch effect
Gene expression
ComBat
Principal component analysis
GTEx
url http://link.springer.com/article/10.1186/s12859-019-2855-9
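
The three steps named in the description (adjust the data, compute gene-gene co-expression, score against a gold standard) can be illustrated with a small sketch. This is not the authors' B-CeF R package: every function name here is hypothetical, the data are synthetic, and the choices of absolute Pearson correlation as the co-expression measure and AUC as the score are assumptions made only for illustration.

```python
import numpy as np


def regress_out(expr, covariates):
    """Step 1 (assumed form): remove known confounders by multiple linear regression.

    expr: (samples x genes) expression matrix.
    covariates: (samples x k) known confounders, e.g. a batch indicator.
    Returns per-gene residuals after fitting an intercept plus the covariates.
    """
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef


def coexpression(expr):
    """Step 2 (assumed measure): absolute Pearson correlation for each gene pair i < j."""
    corr = np.corrcoef(expr, rowvar=False)
    i, j = np.triu_indices(corr.shape[0], k=1)
    return i, j, np.abs(corr[i, j])


def auc(scores, labels):
    """Step 3 (assumed metric): rank-based AUC; tied scores are not averaged."""
    labels = np.asarray(labels, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def evaluate(expr, gold_pairs):
    """Score how well co-expression ranks gold-standard pairs above all other pairs."""
    i, j, scores = coexpression(expr)
    labels = [(a, b) in gold_pairs for a, b in zip(i.tolist(), j.tolist())]
    return auc(scores, labels)


# Synthetic data: genes 0-1 and genes 2-3 share a biological factor;
# genes 4-5 share nothing except a strong batch shift.
rng = np.random.default_rng(0)
n = 200
batch = rng.integers(0, 2, size=n).astype(float)
f1, f2 = rng.normal(size=n), rng.normal(size=n)
biology = np.column_stack([f1, f1, f2, f2, np.zeros(n), np.zeros(n)])
batch_shift = np.outer(batch, [1.0, 1.0, 1.0, 1.0, 5.0, 5.0])
expr = biology + batch_shift + rng.normal(scale=0.5, size=(n, 6))

gold_pairs = {(0, 1), (2, 3)}  # stand-in for a-priori gene-gene associations

auc_raw = evaluate(expr, gold_pairs)
auc_adjusted = evaluate(regress_out(expr, batch[:, None]), gold_pairs)
print(f"AUC before adjustment: {auc_raw:.3f}")
print(f"AUC after adjustment:  {auc_adjusted:.3f}")
```

In this toy setting the batch shift makes genes 4 and 5 look co-expressed before adjustment, so the unadjusted data rank a spurious pair above the gold-standard pairs. Comparing the score before and after each candidate correction is the kind of comparison the description outlines; an over-correcting method would instead lower the score by removing the shared biological factors.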