Identification of Cancer-Related Long Non-Coding RNAs Using XGBoost With High Accuracy

In the past decade, hundreds of long noncoding RNAs (lncRNAs) have been identified as significant players in diverse types of cancer; however, the functions and mechanisms of most lncRNAs in cancer remain unclear. Several computational methods have been developed to detect associations between cance...

Bibliographic Details
Main Authors: Xuan Zhang, Tianjun Li, Jun Wang, Jing Li, Long Chen, Changning Liu
Format: Article
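Language: English
Published: Frontiers Media S.A. 2019-08-01
Series: Frontiers in Genetics
Subjects: cancer; long noncoding RNA; machine learning; Synthetic Minority Over-sampling Technique; XGBoost
Online Access: https://www.frontiersin.org/article/10.3389/fgene.2019.00735/full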
id doaj-e0b11b39233c46cea30da4bcc751c4e8
record_format Article
spelling doaj-e0b11b39233c46cea30da4bcc751c4e8 2020-11-25T01:36:05Z eng
Frontiers in Genetics, vol. 10, doi:10.3389/fgene.2019.00735, 456130
Xuan Zhang [0], Xuan Zhang [1], Tianjun Li [2], Jun Wang [3], Jing Li [4], Long Chen [5], Changning Liu
[0] CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming, China
[1] University of Chinese Academy of Sciences, Beijing, China
[2] Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China
[3] Institute of Medical Sciences, Xiangya Hospital, Central South University, Changsha, China
[4] CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming, China
[5] Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China
collection DOAJ
language English
format Article
sources DOAJ
author Xuan Zhang
Xuan Zhang
Tianjun Li
Jun Wang
Jing Li
Long Chen
Changning Liu
title Identification of Cancer-Related Long Non-Coding RNAs Using XGBoost With High Accuracy
publisher Frontiers Media S.A.
series Frontiers in Genetics
issn 1664-8021
publishDate 2019-08-01
description In the past decade, hundreds of long noncoding RNAs (lncRNAs) have been identified as significant players in diverse types of cancer; however, the functions and mechanisms of most lncRNAs in cancer remain unclear. Several computational methods have been developed to detect associations between cancer and lncRNAs, yet those approaches have limitations in both sensitivity and specificity. With the goal of improving the prediction accuracy for associations of lncRNA with cancer, we upgraded our previously developed cancer-related lncRNA classifier, CRlncRC, to generate CRlncRC2. CRlncRC2 is an eXtreme Gradient Boosting (XGBoost) machine learning framework, including Synthetic Minority Over-sampling Technique (SMOTE)-based over-sampling, along with Laplacian Score-based feature selection. Ten-fold cross-validation showed that the AUC value of CRlncRC2 for identification of cancer-related lncRNAs is much higher than previously reported by CRlncRC and others. Compared with CRlncRC, the number of features used by CRlncRC2 dropped from 85 to 51. Finally, we identified 439 cancer-related lncRNA candidates using CRlncRC2. To evaluate the accuracy of the predictions, we first consulted the cancer-related long non-coding RNA database Lnc2Cancer v2.0 and relevant literature for supporting information, then conducted statistical analysis of somatic mutations, distance from cancer genes, and differential expression in tumor tissues, using various data sets. The results showed that our approach was highly reliable for identifying cancer-related lncRNA candidates. Notably, the highest ranked candidate, lncRNA AC074117.1, has not been reported previously; however, integrated multi-omics analyses demonstrate that it is the target of multiple cancer-related miRNAs and interacts with adjacent protein-coding genes, suggesting that it may act as a cancer-related competing endogenous RNA, which warrants further investigation. 
In conclusion, CRlncRC2 is an effective and accurate method for identification of cancer-related lncRNAs, and has potential to contribute to the functional annotation of lncRNAs and guide cancer therapy.
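The description above names SMOTE-based over-sampling as one component of CRlncRC2. As a rough illustration of the idea only, and not the authors' implementation, the following sketch generates synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote`, the parameters `k` and `seed`, and the toy feature vectors are all assumptions made for this example; a real pipeline would more likely use an established library and then train a gradient-boosted classifier on the balanced data.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority-class neighbours.

    Hypothetical helper for illustration; not taken from CRlncRC2.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # pick a random point on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up): 3 minority-class vs 8 majority-class feature vectors.
minority = [[0.9, 0.8], [1.0, 1.1], [1.2, 0.9]]
majority = [[0.1 * i, 0.05 * i] for i in range(8)]

# Generate enough synthetic minority samples to balance the two classes.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), len(majority))
```

Because every synthetic point lies on a line segment between two real minority samples, the over-sampled class stays inside the region already occupied by the minority data instead of simply duplicating existing rows.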
topic cancer
long noncoding RNA
machine learning
Synthetic Minority Over-sampling Technique
XGBoost
url https://www.frontiersin.org/article/10.3389/fgene.2019.00735/full