i4mC-Mouse: Improved identification of DNA N4-methylcytosine sites in the mouse genome using multiple encoding schemes

N4-methylcytosine (4mC) is one of the most important DNA modifications and is involved in regulating cell differentiation and gene expression. Accurate identification of 4mC sites is necessary to understand various biological functions. In this work, we developed a new computational predictor cal...

Bibliographic Details
Main Authors: Md. Mehedi Hasan, Balachandran Manavalan, Watshara Shoombuatong, Mst. Shamima Khatun, Hiroyuki Kurata
Format: Article
Language: English
Published: Elsevier 2020-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Mouse genome; Sequence analysis; Sequence encoding; Machine learning
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037020300064
id doaj-dbc1aa37f282420998381234e252e1b2
record_format Article
spelling doaj-dbc1aa37f282420998381234e252e1b2
2021-01-02T05:08:31Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2020-01-01
volume 18, pages 906-912
Md. Mehedi Hasan: Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Japan Society for the Promotion of Science, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo 102-0083, Japan
Balachandran Manavalan: Department of Physiology, Ajou University School of Medicine, Suwon 443380, Republic of Korea
Watshara Shoombuatong: Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
Mst. Shamima Khatun: Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
Hiroyuki Kurata (corresponding author): Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Biomedical Informatics R&D Center, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
collection DOAJ
language English
format Article
sources DOAJ
author Md. Mehedi Hasan
Balachandran Manavalan
Watshara Shoombuatong
Mst. Shamima Khatun
Hiroyuki Kurata
author_sort Md. Mehedi Hasan
title i4mC-Mouse: Improved identification of DNA N4-methylcytosine sites in the mouse genome using multiple encoding schemes
publisher Elsevier
series Computational and Structural Biotechnology Journal
issn 2001-0370
publishDate 2020-01-01
description N4-methylcytosine (4mC) is one of the most important DNA modifications and is involved in regulating cell differentiation and gene expression. Accurate identification of 4mC sites is necessary to understand various biological functions. In this work, we developed a new computational predictor called i4mC-Mouse to identify 4mC sites in the mouse genome. Six encoding schemes that cover different characteristics of DNA sequence information were explored: k-space nucleotide composition (KSNC), k-mer nucleotide composition (Kmer), mono nucleotide binary encoding (MBE), dinucleotide binary encoding, electron–ion interaction pseudo potentials (EIIP), and dinucleotide physicochemical composition. Subsequently, we built six random forest (RF)-based encoding models and then linearly combined their probability scores to construct the final predictor. Among the six RF-based models, the Kmer, KSNC, MBE, and EIIP encodings were sufficient, contributing 10%, 45%, 25%, and 20% of the prediction performance, respectively. On the independent test, i4mC-Mouse predicted 4mC sites with an accuracy and MCC of 0.816 and 0.633, respectively, which were approximately 2.5% and 5% higher than those of the existing method (4mCpred-EL). For experimental biologists, a freely available web application was implemented at http://kurata14.bio.kyutech.ac.jp/i4mC-Mouse/.
topic Mouse genome
Sequence analysis
Sequence encoding
Machine learning
url http://www.sciencedirect.com/science/article/pii/S2001037020300064
_version_ 1724359711105482752
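
The description above names four sequence encodings (Kmer, KSNC, MBE, EIIP) whose RF probability scores are linearly combined. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes that the reported 10%, 45%, 25%, and 20% contributions can be read as linear weights, that the per-encoding probabilities come from already trained random forest models (not shown), and that the default k-mer length and gap size are arbitrary illustrative choices. All function names are hypothetical; the EIIP values are the ones commonly cited for DNA nucleotides.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Commonly cited electron-ion interaction pseudopotential values per nucleotide.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

# Assumption: the contributions reported in the abstract are used as weights.
WEIGHTS = {"Kmer": 0.10, "KSNC": 0.45, "MBE": 0.25, "EIIP": 0.20}


def kmer_composition(seq, k=2):
    """Kmer: normalised frequency of every length-k word over ACGT."""
    words = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    total = len(seq) - k + 1
    counts = {w: 0 for w in words}
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # skip words containing ambiguous bases
            counts[word] += 1
    return [counts[w] / total for w in words]


def kspace_composition(seq, gap=1):
    """KSNC: frequency of nucleotide pairs separated by `gap` positions."""
    pairs = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
    total = len(seq) - gap - 1
    counts = {p: 0 for p in pairs}
    for i in range(total):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in pairs]


def mono_binary(seq):
    """MBE: four-element one-hot vector per position, concatenated."""
    return [1.0 if base == n else 0.0 for base in seq for n in NUCLEOTIDES]


def eiip_encoding(seq):
    """EIIP: replace each nucleotide with its pseudopotential value."""
    return [EIIP[base] for base in seq]


def combine_scores(probabilities, weights=WEIGHTS):
    """Weighted linear combination of per-encoding probability scores."""
    return sum(weights[name] * probabilities[name] for name in weights)


if __name__ == "__main__":
    seq = "ACGTCGATCGGCTAACGTTAG"  # made-up example sequence
    print(len(kmer_composition(seq)), len(kspace_composition(seq)))
    print(len(mono_binary(seq)), len(eiip_encoding(seq)))
    # Hypothetical probabilities from four separately trained RF models.
    scores = {"Kmer": 0.6, "KSNC": 0.8, "MBE": 0.7, "EIIP": 0.5}
    print(combine_scores(scores))
```

In this sketch each encoding would feed its own classifier, and only the resulting probabilities are merged. A sequence would then be called a 4mC site when the combined score exceeds a chosen threshold, which the record does not specify.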