Local Epigenomic Data are more Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks

Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions is useful for understanding disease regulation and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational...

Bibliographic Details
Main Authors: Mengli Xiao, Zhong Zhuang, Wei Pan
Format: Article
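Language: English
Published: MDPI AG 2019-12-01
Series: Genes
Subjects: boosting; convolutional neural networks; deep learning; feed-forward neural networks; machine learning
Online Access: https://www.mdpi.com/2073-4425/11/1/41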
id doaj-ae5e955dc8ac4c1fafc17f0f7792f611
record_format Article
spelling doaj-ae5e955dc8ac4c1fafc17f0f7792f611
2020-11-25T02:40:33Z
eng
MDPI AG
Genes
2073-4425
2019-12-01
volume 11, issue 1, article 41
10.3390/genes11010041
genes11010041
Local Epigenomic Data are more Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks
Mengli Xiao (Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA)
Zhong Zhuang (Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA)
Wei Pan (Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA)
collection DOAJ
language English
format Article
sources DOAJ
author Mengli Xiao
Zhong Zhuang
Wei Pan
title Local Epigenomic Data are more Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks
publisher MDPI AG
series Genes
issn 2073-4425
publishDate 2019-12-01
description Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions is useful for understanding disease regulation and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were shown to outperform traditional machine learning methods, and using DNA sequence data alone performed either better than or almost as well as using only epigenomic data. However, most, if not all, of these previous studies randomly split enhancer-promoter pairs into training, tuning, and test data, which has recently been pointed out to be problematic: because EPI data contain multiple, duplicated, or overlapping enhancers (and promoters) across enhancer-promoter pairs, such random splitting does not yield independent training, tuning, and test data, resulting in model over-fitting and over-estimated predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
topic boosting
convolutional neural networks
deep learning
feed-forward neural networks
machine learning
url https://www.mdpi.com/2073-4425/11/1/41
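
The design issue named in the description (random splitting of enhancer-promoter pairs that share enhancers or promoters) can be illustrated with a small sketch. This is not the authors' code or data: the toy records, the function names, and the choice to hold out whole chromosomes are all assumptions made for illustration, and the record does not state how the authors corrected the split.

```python
import random

# Hypothetical toy records: (enhancer_id, promoter_id, chromosome, label).
# Several pairs deliberately share an enhancer or a promoter.
pairs = [
    ("enh1", "prom1", "chr1", 1),
    ("enh1", "prom2", "chr1", 0),  # shares enh1 with the pair above
    ("enh2", "prom2", "chr1", 1),  # shares prom2 with the pair above
    ("enh3", "prom3", "chr2", 1),
    ("enh4", "prom3", "chr2", 0),
    ("enh5", "prom4", "chr3", 1),
    ("enh5", "prom5", "chr3", 0),
    ("enh6", "prom6", "chr4", 0),
]


def random_split(records, test_fraction, seed):
    """Shuffle individual pairs, ignoring shared enhancers and promoters."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def chromosome_split(records, test_chromosomes):
    """Hold out every pair located on the given chromosomes (assumed strategy)."""
    train = [r for r in records if r[2] not in test_chromosomes]
    test = [r for r in records if r[2] in test_chromosomes]
    return train, test


def shared_elements(train, test):
    """Return enhancer/promoter IDs that occur in both the train and test sets."""
    train_ids = {r[0] for r in train} | {r[1] for r in train}
    test_ids = {r[0] for r in test} | {r[1] for r in test}
    return train_ids & test_ids


train_r, test_r = random_split(pairs, test_fraction=0.25, seed=0)
train_c, test_c = chromosome_split(pairs, test_chromosomes={"chr3", "chr4"})

# Any ID printed for the random split is an element seen in both sets.
print("random split, shared elements:", sorted(shared_elements(train_r, test_r)))
print("chromosome split, shared elements:", sorted(shared_elements(train_c, test_c)))
```

In this toy data every enhancer and promoter lies on one chromosome, so holding out whole chromosomes cannot place the same element in both sets, whereas a pair-level shuffle may. Real EPI data would also need a check for overlapping (not only identical) genomic intervals.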