SMIXnorm: Fast and Accurate RNA-Seq Data Normalization for Formalin-Fixed Paraffin-Embedded Samples

RNA-sequencing (RNA-seq) provides a comprehensive quantification of transcriptomic activities in biological samples. Formalin-Fixed Paraffin-Embedded (FFPE) samples are collected as part of routine clinical procedure, and are the most widely available biological sample format in medical research and...

Bibliographic Details
Main Authors: Shen Yin, Xiaowei Zhan, Bo Yao, Guanghua Xiao, Xinlei Wang, Yang Xie
Format: Article
Language: English
Published: Frontiers Media S.A. 2021-03-01
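Series: Frontiers in Genetics
Subjects: RNA-sequencing; normalization; FFPE; formalin-fixed paraffin-embedded samples; archived samples; statistical methods
Online Access: https://www.frontiersin.org/articles/10.3389/fgene.2021.650795/full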
id doaj-fe949209c7fa40568d24bb9005054955
record_format Article
spelling doaj-fe949209c7fa40568d24bb9005054955; 2021-03-24T06:23:09Z; eng; Frontiers Media S.A.; Frontiers in Genetics; ISSN 1664-8021; 2021-03-01; volume 12; DOI 10.3389/fgene.2021.650795; article 650795
SMIXnorm: Fast and Accurate RNA-Seq Data Normalization for Formalin-Fixed Paraffin-Embedded Samples
0 Shen Yin: Department of Population and Data Sciences, Quantitative Biomedical Research Center, The University of Texas Southwestern Medical Center, Dallas, TX, United States
1 Shen Yin: Department of Statistical Science, Southern Methodist University, Dallas, TX, United States
2 Xiaowei Zhan: Department of Population and Data Sciences, Quantitative Biomedical Research Center, The University of Texas Southwestern Medical Center, Dallas, TX, United States
3 Bo Yao: Department of Population and Data Sciences, Quantitative Biomedical Research Center, The University of Texas Southwestern Medical Center, Dallas, TX, United States
4 Guanghua Xiao: Department of Population and Data Sciences, Quantitative Biomedical Research Center, The University of Texas Southwestern Medical Center, Dallas, TX, United States
5 Xinlei Wang: Department of Statistical Science, Southern Methodist University, Dallas, TX, United States
6 Yang Xie: Department of Population and Data Sciences, Quantitative Biomedical Research Center, The University of Texas Southwestern Medical Center, Dallas, TX, United States
collection DOAJ
language English
format Article
sources DOAJ
author Shen Yin
Shen Yin
Xiaowei Zhan
Bo Yao
Guanghua Xiao
Xinlei Wang
Yang Xie
author_sort Shen Yin
title SMIXnorm: Fast and Accurate RNA-Seq Data Normalization for Formalin-Fixed Paraffin-Embedded Samples
publisher Frontiers Media S.A.
series Frontiers in Genetics
issn 1664-8021
publishDate 2021-03-01
description RNA-sequencing (RNA-seq) provides a comprehensive quantification of transcriptomic activities in biological samples. Formalin-Fixed Paraffin-Embedded (FFPE) samples are collected as part of routine clinical procedure, and are the most widely available biological sample format in medical research and patient care. Normalization is an essential step in RNA-seq data analysis. A number of normalization methods, though developed for RNA-seq data from fresh frozen (FF) samples, can be used with FFPE samples as well. MIXnorm, the only extant normalization method specifically designed for FFPE RNA-seq data, has been shown to outperform these methods, but at the cost of a complex mixture model and a high computational burden. It is therefore important to adapt MIXnorm for simplicity and computational efficiency while maintaining superior performance. Furthermore, it is critical to develop an integrated tool that performs commonly used normalization methods for both FF and FFPE RNA-seq data. We developed a new normalization method for FFPE RNA-seq data, named SMIXnorm, based on a two-component mixture model that is simplified relative to MIXnorm to facilitate computation. The expression levels of expressed genes are modeled by normal distributions without truncation, and those of non-expressed genes are modeled by zero-inflated Poisson distributions. The maximum likelihood estimates of the model parameters are obtained by a nested Expectation-Maximization algorithm with a less complicated latent variable structure, and closed-form updates are available within each iteration. Real data applications and simulation studies show that SMIXnorm greatly reduces computing time compared to MIXnorm without sacrificing performance. More importantly, we developed a web-based tool, RNA-seq Normalization (RSeqNorm), that offers a simple workflow to compute normalized RNA-seq data for both FFPE and FF samples. It includes SMIXnorm and MIXnorm for FFPE RNA-seq data, together with five commonly used normalization methods for FF RNA-seq data. Users can easily upload a raw RNA-seq count matrix and select one of the seven normalization methods to produce a downloadable normalized expression matrix for any downstream analysis. The R package is available at https://github.com/S-YIN/RSEQNORM. The web-based tool, RSeqNorm, is available at http://lce.biohpc.swmed.edu/rseqnorm with no restrictions on use or redistribution.
topic RNA-sequencing
normalization
FFPE
formalin-fixed paraffin-embedded samples
archived samples
statistical methods
url https://www.frontiersin.org/articles/10.3389/fgene.2021.650795/full
_version_ 1724205210226655232
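
The description field above outlines a two-component mixture: a normal distribution for expressed genes and a zero-inflated Poisson (ZIP) distribution for non-expressed genes, fitted by Expectation-Maximization. As a rough illustration only, the following Python sketch computes the kind of posterior probability an E-step would evaluate for a single observation. It is not the authors' implementation or the RSEQNORM package API: the function names, the parameter names (`w`, `mu`, `sigma`, `pi0`, `lam`), the example values, and the choice to evaluate both components on the same non-negative integer value are all assumptions made for this example.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of a normal distribution without truncation."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def zip_pmf(k, pi0, lam):
    """Zero-inflated Poisson: extra mass pi0 at zero, Poisson(lam) otherwise."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    point_mass = pi0 if k == 0 else 0.0
    return point_mass + (1.0 - pi0) * poisson


def posterior_expressed(y, w, mu, sigma, pi0, lam):
    """E-step style responsibility: P(expressed | y) under the mixture.

    y   : non-negative integer observation (hypothetical scale)
    w   : prior probability that a gene is expressed (assumed name)
    """
    expressed = w * normal_pdf(y, mu, sigma)
    non_expressed = (1.0 - w) * zip_pmf(y, pi0, lam)
    return expressed / (expressed + non_expressed)


# Hypothetical parameter values, chosen only to show the behaviour.
params = dict(w=0.5, mu=8.0, sigma=1.5, pi0=0.7, lam=0.5)
p_zero = posterior_expressed(0, **params)   # a zero looks non-expressed
p_high = posterior_expressed(8, **params)   # a high value looks expressed
```

In a full EM fit, responsibilities like these would be recomputed at every iteration and then used as weights when updating the parameters; the sketch stops at the E-step and does not attempt the nested structure mentioned in the description.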