Deep learning for HGT insertion sites recognition

Abstract. Background: Horizontal gene transfer (HGT) is the sharing of genetic material between distant species that are not in a parent-offspring relationship. HGT insertion sites are important for understanding HGT mechanisms. Recent studies of the main agents of HGT, such as transposons and...


Bibliographic Details
Main Authors: Chen Li, Jiaxing Chen, Shuai Cheng Li
Format: Article
Language: English
Published: BMC 2020-12-01
Series: BMC Genomics
Subjects: Deep residual model; HGT insertion site; DNA sequence feature
Online Access: https://doi.org/10.1186/s12864-020-07296-1
id doaj-d1275894d4f34567b52576c238a4358e
record_format Article
spelling doaj-d1275894d4f34567b52576c238a4358e
2021-01-03T12:10:21Z
eng
BMC
BMC Genomics
1471-2164
2020-12-01
21S1118
10.1186/s12864-020-07296-1
Deep learning for HGT insertion sites recognition
Chen Li (Department of Computer Science, City University of Hong Kong)
Jiaxing Chen (Department of Computer Science, City University of Hong Kong)
Shuai Cheng Li (Department of Computer Science, City University of Hong Kong)
collection DOAJ
language English
format Article
sources DOAJ
author Chen Li
Jiaxing Chen
Shuai Cheng Li
title Deep learning for HGT insertion sites recognition
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2020-12-01
description Abstract. Background: Horizontal gene transfer (HGT) is the sharing of genetic material between distant species that are not in a parent-offspring relationship. HGT insertion sites are important for understanding HGT mechanisms. Recent studies of the main agents of HGT, such as transposons and plasmids, demonstrate that insertion sites usually have specific sequence features. This motivates a method that infers HGT insertion sites from sequence features. Results: We propose a deep residual network, DeepHGT, to recognize HGT insertion sites. To train DeepHGT, we extracted about 1.55 million sequence segments as instances from 262 metagenomic samples, with a ratio of positive to negative instances of about 1:1. These segments are randomly partitioned into three subsets: 80% as the training set, 10% as the validation set, and the remaining 10% as the test set. The training loss of DeepHGT is 0.4163 and the validation loss is 0.423. On the test set, DeepHGT achieves an area under the curve (AUC) of 0.8782. To further evaluate the generalization of DeepHGT, we constructed an independent test set containing 689,312 sequence segments from another 147 gut metagenomic samples. On this set DeepHGT achieves an AUC of 0.8428, close to the AUC on the first test set. By comparison, the gradient boosting classifier implemented in PyFeat achieves AUC values of 0.694 and 0.686 on these two test sets, respectively. DeepHGT also learns discriminative sequence features; for example, it has learned a pattern of palindromic subsequences as a significant (P-value = 0.0182) local feature. Hence, DeepHGT is a reliable model for recognizing HGT insertion sites. Conclusion: DeepHGT is the first deep learning model that can accurately recognize HGT insertion sites on genomes according to sequence patterns.
topic Deep residual model
HGT insertion site
DNA sequence feature
url https://doi.org/10.1186/s12864-020-07296-1
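
The abstract describes a random 80%/10%/10% partition of the sequence segments and reports performance as AUC. The following is a minimal Python sketch of those two steps, not the authors' code: the function names `split_80_10_10` and `auc`, the `seed` parameter, and the toy labels and scores are all hypothetical and chosen only for illustration. AUC is computed here by its pairwise definition (the fraction of positive/negative pairs in which the positive instance scores higher, with ties counted as one half), which is only practical for small inputs.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test parts.

    Hypothetical helper: integer arithmetic gives 80% and 10% of the items to
    the first two parts, and whatever remains goes to the test part.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def auc(labels, scores):
    """Pairwise AUC: chance that a positive outscores a negative (ties = 0.5).

    `labels` holds 1 for positive and 0 for negative instances; `scores` holds
    the model output for each instance. Quadratic in the number of instances.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only; these are not values from the paper.
    train, val, test = split_80_10_10(range(100), seed=1)
    print(len(train), len(val), len(test))
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

For data at the scale reported in the abstract, a rank-based AUC routine from an established library would be a more practical choice than the quadratic loop above.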