Effect of method of deduplication on estimation of differential gene expression using RNA-seq

Background: RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. Results: To infer the influence of different methods of removal of duplicated reads on estimation of gene express...


Bibliographic Details
Main Authors: Anna V. Klepikova, Artem S. Kasianov, Mikhail S. Chesnokov, Natalia L. Lazarevich, Aleksey A. Penin, Maria Logacheva
Format: Article
Language: English
Published: PeerJ Inc. 2017-03-01
Series: PeerJ
Subjects: RNA-seq; Differential expression; Deduplication; Cancer genomics; Hepatocarcinoma
Online Access: https://peerj.com/articles/3091.pdf
id doaj-1a74685030fe4028ab7b6dcfc9735bb3
record_format Article
spelling doaj-1a74685030fe4028ab7b6dcfc9735bb3 2020-11-25T00:58:53Z eng
Publisher: PeerJ Inc.
Journal: PeerJ
ISSN: 2167-8359
Published: 2017-03-01
Volume: 5
Article: e3091
DOI: 10.7717/peerj.3091
Title: Effect of method of deduplication on estimation of differential gene expression using RNA-seq
Authors and affiliations (in record order):
0 Anna V. Klepikova, Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia
1 Artem S. Kasianov, A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia
2 Mikhail S. Chesnokov, N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia
3 Natalia L. Lazarevich, N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia
4 Aleksey A. Penin, Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia
5 Maria Logacheva, Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia
collection DOAJ
language English
format Article
sources DOAJ
author Anna V. Klepikova
Artem S. Kasianov
Mikhail S. Chesnokov
Natalia L. Lazarevich
Aleksey A. Penin
Maria Logacheva
title Effect of method of deduplication on estimation of differential gene expression using RNA-seq
publisher PeerJ Inc.
series PeerJ
issn 2167-8359
publishDate 2017-03-01
description Background: RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. Results: To infer the influence of different methods of removal of duplicated reads on estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is the most pronounced for highly expressed genes. Conclusion: The use of unique molecular identifiers greatly improves accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects results of differential gene expression analysis, producing a high proportion of false negative results. The absence of duplicate read removal is biased towards false positives. In those cases where using MI is not possible, we recommend using paired-end sequencing layout.
topic RNA-seq
Differential expression
Deduplication
Cancer genomics
Hepatocarcinoma
url https://peerj.com/articles/3091.pdf
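
The description above contrasts deduplication without molecular indices (MI) and deduplication that uses them. The following is a minimal illustrative sketch of that difference, not the authors' scripts and not the algorithm implemented in SAMtools. The `Read` record, its fields, and both function names are hypothetical, and the sketch assumes single-end reads that have already been aligned and tagged with one MI each.

```python
from collections import namedtuple

# Hypothetical minimal read record (an assumption for illustration);
# a real pipeline would take these values from aligned BAM records.
Read = namedtuple("Read", ["chrom", "pos", "strand", "mi"])


def dedup_by_position(reads):
    """Keep the first read seen at each (chrom, pos, strand); MI is ignored."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def dedup_by_mi(reads):
    """Keep the first read seen for each (chrom, pos, strand, mi) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.mi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),  # same position and same MI: PCR duplicate
    Read("chr1", 100, "+", "TTAG"),  # same position, different MI: distinct molecule
    Read("chr1", 250, "+", "ACGT"),
]

print(len(reads))                     # 4 reads without deduplication
print(len(dedup_by_position(reads)))  # 2 reads kept when MI is ignored
print(len(dedup_by_mi(reads)))        # 3 reads kept when MI is used
```

In this toy input, position-only deduplication discards a read that came from a distinct molecule. Highly expressed genes produce many independent fragments that start at the same coordinate, which is consistent with the description's statement that the effect of deduplication without MI is most pronounced for such genes.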