Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data

Over the last decade, high-dimensional data have attracted considerable attention in bioinformatics. These data increase the likelihood of detecting the most promising novel information. However, their analysis is constrained by the limits of high-performance computing and by overfitting. To overcome these issues, alte...

Bibliographic Details
Main Authors: Sangjin Kim, Jong-Min Kim
Format: Article
Language: English
Published: MDPI AG 2019-05-01
Series: Mathematics
Subjects:
MCP
SIS
Online Access: https://www.mdpi.com/2227-7390/7/6/493
id doaj-7d6ce6e70a7647da8c20b7e50627fad8
record_format Article
spelling doaj-7d6ce6e70a7647da8c20b7e50627fad8; 2020-11-24T21:21:13Z; eng; MDPI AG; Mathematics; ISSN 2227-7390; 2019-05-01; vol. 7, no. 6, art. 493; DOI 10.3390/math7060493; math7060493; Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data; Sangjin Kim (Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX 79968, USA); Jong-Min Kim (Division of Sciences and Mathematics, University of Minnesota at Morris, Morris, MN 56267, USA); https://www.mdpi.com/2227-7390/7/6/493; LASSO; SCAD; MCP; SIS; elastic net; accuracy; AUROC; geometric mean
collection DOAJ
language English
format Article
sources DOAJ
author Sangjin Kim
Jong-Min Kim
title Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data
publisher MDPI AG
series Mathematics
issn 2227-7390
publishDate 2019-05-01
description Over the last decade, high-dimensional data have attracted considerable attention in bioinformatics. These data increase the likelihood of detecting the most promising novel information. However, their analysis is constrained by the limits of high-performance computing and by overfitting. To overcome these issues, alternative strategies need to be explored for detecting the truly important features. A two-stage approach, consisting of a filtering step and a variable selection step, has been receiving attention. Filtering methods fall into two categories: individual ranking methods and feature subset selection methods. Both suffer from a lack of consideration for joint correlation among features and from the computing time of an NP-hard problem. Therefore, we proposed a new filter ranking method (PF) that uses the elastic net penalty with sure independence screening (SIS), based on a resampling technique, to overcome these issues. Through extensive simulation studies, we demonstrated that SIS-LASSO, SIS-MCP, and SIS-SCAD with the proposed filtering method achieved superior performance, not only in accuracy, AUROC, and geometric mean but also in true positive detection, compared with the same methods using the marginal maximum likelihood ranking method (MMLR). In addition, we applied the method to colon and lung cancer gene expression data to investigate its classification performance and its power to detect true genes associated with colon and lung cancer.
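The three classification metrics named in the description (accuracy, AUROC, and geometric mean) can be illustrated with a minimal sketch. It is not taken from the article: the function names, the 0/1 label coding, the 0.5 threshold, and the toy data are assumptions made here for illustration, and the geometric mean is taken in its common sense as the square root of sensitivity times specificity. The sketch also assumes both classes are present in the labels.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def geometric_mean(y_true, y_pred):
    """sqrt(sensitivity * specificity); assumes labels are coded 0/1
    and that both classes occur in y_true (an assumption of this sketch)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    sensitivity = tp / n_pos
    specificity = tn / n_neg
    return sqrt(sensitivity * specificity)


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison: the share of
    (positive, negative) pairs in which the positive sample has the higher
    score, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 positive and 5 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
# Hard labels obtained with an assumed 0.5 decision threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("geometric mean:", geometric_mean(y_true, y_pred))
print("AUROC:", auroc(y_true, scores))
```

With imbalanced classes, as in this toy data, accuracy can look favorable even when one class is poorly predicted, which is one reason a geometric mean is often reported alongside it.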
topic LASSO
SCAD
MCP
SIS
elastic net
accuracy
AUROC
geometric mean
url https://www.mdpi.com/2227-7390/7/6/493