A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations

Abstract Background: Understanding the functional effects of non-coding variants is important as they are often associated with gene-expression alteration and disease development. Over the past few years, many computational tools have been developed to predict their functional impact. However, the in...

Bibliographic Details
Main Authors: Hao Jia, Sung-Joon Park, Kenta Nakai
Format: Article
Language: English
Published: BMC 2021-06-01
Series: BMC Bioinformatics
Subjects: Non-coding variants; Epigenome; Semi-supervised learning; Deep learning; Pseudo label
Online Access: https://doi.org/10.1186/s12859-021-03999-8
id doaj-1ce41be56a654eaf8af4bf957c51e85c
record_format Article
spelling doaj-1ce41be56a654eaf8af4bf957c51e85c
2021-06-06T11:54:49Z
eng
BMC
BMC Bioinformatics
1471-2105
2021-06-01
22
S6
1
12
10.1186/s12859-021-03999-8
A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
Hao Jia 0
Sung-Joon Park 1
Kenta Nakai 2
Department of Computer Science, The University of Tokyo
Department of Computer Science, The University of Tokyo
Department of Computer Science, The University of Tokyo
Abstract Background: Understanding the functional effects of non-coding variants is important as they are often associated with gene-expression alteration and disease development. Over the past few years, many computational tools have been developed to predict their functional impact. However, the intrinsic difficulty in dealing with the scarcity of data leads to the necessity to further improve the algorithms. In this work, we propose a novel method, employing a semi-supervised deep-learning model with pseudo labels, which takes advantage of learning from both experimentally annotated and unannotated data. Results: We prepared known functional non-coding variants with histone marks, DNA accessibility, and sequence context in GM12878, HepG2, and K562 cell lines. Applying our method to the dataset demonstrated its outstanding performance, compared with that of existing tools. Our results also indicated that the semi-supervised model with pseudo labels achieves higher predictive performance than the supervised model without pseudo labels. Interestingly, a model trained with the data in a certain cell line is unlikely to succeed in other cell lines, which implies the cell-type-specific nature of the non-coding variants. Remarkably, we found that DNA accessibility significantly contributes to the functional consequence of variants, which suggests the importance of open chromatin conformation prior to establishing the interaction of non-coding variants with gene regulation. Conclusions: The semi-supervised deep learning model coupled with pseudo labeling has advantages in studying with limited datasets, which is not unusual in biology. Our study provides an effective approach in finding non-coding mutations potentially associated with various biological phenomena, including human diseases.
https://doi.org/10.1186/s12859-021-03999-8
Non-coding variants
Epigenome
Semi-supervised learning
Deep learning
Pseudo label
collection DOAJ
language English
format Article
sources DOAJ
author Hao Jia
Sung-Joon Park
Kenta Nakai
spellingShingle Hao Jia
Sung-Joon Park
Kenta Nakai
A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
BMC Bioinformatics
Non-coding variants
Epigenome
Semi-supervised learning
Deep learning
Pseudo label
author_facet Hao Jia
Sung-Joon Park
Kenta Nakai
author_sort Hao Jia
title A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
title_short A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
title_full A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
title_fullStr A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
title_full_unstemmed A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
title_sort semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2021-06-01
description Abstract Background: Understanding the functional effects of non-coding variants is important as they are often associated with gene-expression alteration and disease development. Over the past few years, many computational tools have been developed to predict their functional impact. However, the intrinsic difficulty in dealing with the scarcity of data leads to the necessity to further improve the algorithms. In this work, we propose a novel method, employing a semi-supervised deep-learning model with pseudo labels, which takes advantage of learning from both experimentally annotated and unannotated data. Results: We prepared known functional non-coding variants with histone marks, DNA accessibility, and sequence context in GM12878, HepG2, and K562 cell lines. Applying our method to the dataset demonstrated its outstanding performance, compared with that of existing tools. Our results also indicated that the semi-supervised model with pseudo labels achieves higher predictive performance than the supervised model without pseudo labels. Interestingly, a model trained with the data in a certain cell line is unlikely to succeed in other cell lines, which implies the cell-type-specific nature of the non-coding variants. Remarkably, we found that DNA accessibility significantly contributes to the functional consequence of variants, which suggests the importance of open chromatin conformation prior to establishing the interaction of non-coding variants with gene regulation. Conclusions: The semi-supervised deep learning model coupled with pseudo labeling has advantages in studying with limited datasets, which is not unusual in biology. Our study provides an effective approach in finding non-coding mutations potentially associated with various biological phenomena, including human diseases.
topic Non-coding variants
Epigenome
Semi-supervised learning
Deep learning
Pseudo label
url https://doi.org/10.1186/s12859-021-03999-8