A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations

Identifying the tissue of origin of cancer of unknown primary (CUP) is of great significance for designing more effective treatments and improving diagnostic efficiency in cancer patients. In this study, we develop an appropriate machine learning model for tracing the tissue of origin of CUP with high accuracy after...

Bibliographic Details
Main Authors: Yulin Zhang, Tong Feng, Shudong Wang, Ruyi Dong, Jialiang Yang, Jionglong Su, Bo Wang
Format: Article
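Language: English
Published: Frontiers Media S.A. 2020-11-01
Series: Frontiers in Genetics
Subjects: tissue-of-origin; copy number variations; multiclass; XGBoost; extremely randomized tree; principal component analysis
Online Access: https://www.frontiersin.org/articles/10.3389/fgene.2020.585029/full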
id doaj-784d04f595404629ad6a4de441ec8bc9
record_format Article
spelling doaj-784d04f595404629ad6a4de441ec8bc9 2020-11-25T04:11:19Z eng
Frontiers Media S.A.
Frontiers in Genetics, ISSN 1664-8021, 2020-11-01, vol. 11, article 585029
doi:10.3389/fgene.2020.585029
A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations
Yulin Zhang (College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, China)
Tong Feng (College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, China)
Shudong Wang (College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao, China)
Ruyi Dong (Geneis (Beijing) Co., Ltd., Beijing, China)
Jialiang Yang (Geneis (Beijing) Co., Ltd., Beijing, China)
Jionglong Su (School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University, Suzhou, China)
Bo Wang (Geneis (Beijing) Co., Ltd., Beijing, China)
collection DOAJ
language English
format Article
sources DOAJ
author Yulin Zhang
Tong Feng
Shudong Wang
Ruyi Dong
Jialiang Yang
Jionglong Su
Bo Wang
author_sort Yulin Zhang
title A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations
publisher Frontiers Media S.A.
series Frontiers in Genetics
issn 1664-8021
publishDate 2020-11-01
description Identifying the tissue of origin of cancer of unknown primary (CUP) is of great significance for designing more effective treatments and improving diagnostic efficiency in cancer patients. In this study, we develop an appropriate machine learning model for tracing the tissue of origin of CUP with high accuracy after feature engineering and model evaluation. Based on a copy number variation dataset consisting of 4,566 training cases and 1,262 independent validation cases, an XGBoost classifier is applied to 10 types of cancer. An extremely randomized trees (Extra Trees) model is used for dimension reduction so that fewer variables replace the original high-dimensional variables. The features with the top 300 weights are selected, and principal component analysis is applied to eliminate noise. We find that the XGBoost classifier achieves the highest overall accuracy for predicting tumor tissue of origin: 0.8913 in 10-fold cross-validation on the training samples and 0.7421 on the independent validation dataset. Furthermore, comparing performance indices such as precision and recall, the experimental results show that the XGBoost classifier significantly improves classification performance for various tumors, with lower prediction error than other classifiers such as K-nearest neighbors (KNN), Bayes, support vector machine (SVM), and AdaBoost. Our method can infer the tissue of origin for the 10 cancer types with acceptable accuracy in both cross-validation and independent validation data. It may be used as an auxiliary diagnostic method to determine the actual clinicopathological status of a specific cancer.
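The workflow summarized in the description (Extra Trees feature weights, top-ranked feature selection, principal component analysis, a boosted classifier, 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn is installed, uses synthetic data in place of the copy number variation matrix, and falls back to scikit-learn's `GradientBoostingClassifier` when the `xgboost` package is unavailable. The function names `make_toy_data`, `build_pipeline`, and `evaluate`, and all hyperparameters other than the top-300 cutoff and the 10 folds, are hypothetical choices that do not come from the record.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

try:
    from xgboost import XGBClassifier

    def make_booster():
        # Hyperparameters are illustrative, not taken from the paper.
        return XGBClassifier(n_estimators=50, max_depth=3, random_state=0)
except Exception:  # xgboost missing or not loadable: swap in a stand-in
    def make_booster():
        # Stand-in gradient boosting model; NOT XGBoost itself.
        return GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                          random_state=0)


def make_toy_data():
    """Synthetic stand-in for a samples-by-genes CNV matrix (3 classes)."""
    return make_classification(n_samples=300, n_features=60, n_informative=8,
                               n_redundant=0, n_classes=3,
                               n_clusters_per_class=1, class_sep=2.0,
                               random_state=0)


def build_pipeline(top_k=300, n_components=50):
    """Extra Trees ranking -> keep top_k features -> PCA -> boosted trees.

    top_k=300 follows the description; n_components is an assumption.
    """
    selector = SelectFromModel(
        ExtraTreesClassifier(n_estimators=100, random_state=0),
        max_features=top_k,
        threshold=-np.inf,  # select purely by rank, exactly top_k features
    )
    return Pipeline([
        ("select", selector),
        ("pca", PCA(n_components=n_components, random_state=0)),
        ("boost", make_booster()),
    ])


def evaluate(X, y, top_k=20, n_components=10, n_splits=10):
    """Return per-fold accuracy from stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    pipe = build_pipeline(top_k=top_k, n_components=n_components)
    return cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")


if __name__ == "__main__":
    X, y = make_toy_data()
    scores = evaluate(X, y)
    print("mean 10-fold accuracy on toy data:", scores.mean())
```

Placing feature selection and PCA inside the pipeline means both are refit on each training fold, which keeps held-out samples from influencing the selected features. The record does not say whether the authors handled this step the same way.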
topic tissue-of-origin
copy number variations
multiclass
XGBoost
extremely randomized tree
principal component analysis
url https://www.frontiersin.org/articles/10.3389/fgene.2020.585029/full