Improved Feature Selection Model for Big Data Analytics

Although there have been many attempts to build an optimal model for feature selection in Big Data applications, the complex nature of processing such data keeps it a major challenge. Accordingly, the data mining process may be obstructed by the high dimensionality and complexity of huge data sets. To retain the most informative features and optimize classification accuracy, feature selection constitutes a mandatory pre-processing phase that reduces dataset dimensionality. An exhaustive search for the relevant features is time-consuming. In this paper, a new binary variant of a hybrid wrapper feature selection method combining grey wolf optimization and particle swarm optimization is proposed. The K-nearest neighbor classifier with the Euclidean distance metric is used to evaluate candidate solutions. A tent chaotic map helps keep the algorithm from becoming trapped in local optima. A sigmoid function converts the continuous search space into a binary one, as the feature selection problem requires. K-fold cross-validation is used to overcome overfitting. A variety of comparisons have been made with well-known algorithms, namely the particle swarm optimization algorithm and the grey wolf optimization algorithm. Twenty datasets are used for the experiments, and statistical analyses are conducted to confirm the performance and effectiveness of the proposed model on measures such as selected-features ratio, classification accuracy, and computation time. The cumulative number of features selected across the twenty datasets was 196 out of 773, as opposed to 393 and 336 for GWO and PSO, respectively. The overall accuracy is 90%, compared with the other algorithms' 81.6% and 86.8%. The total processing time for all datasets is 184.3 seconds, whereas GWO and PSO take 272 and 245.6 seconds, respectively.


Bibliographic Details
Main Authors: Ibrahim M. El-Hasnony, Sherif I. Barakat, Mohamed Elhoseny, Reham R. Mostafa
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/9058715/
id doaj-7f1fecfba9c54fcaabac3d800b698715
record_format Article
spelling doaj-7f1fecfba9c54fcaabac3d800b698715 2021-03-30T03:12:22Z
Language: eng
Publisher: IEEE
Journal: IEEE Access, ISSN 2169-3536, 2020-01-01, vol. 8, pp. 66989-67004
DOI: 10.1109/ACCESS.2020.2986232 (IEEE article 9058715)
Title: Improved Feature Selection Model for Big Data Analytics
Authors:
Ibrahim M. El-Hasnony (https://orcid.org/0000-0002-9489-3449), Information Systems Department, Faculty of Computers and Information Sciences, Mansoura University, Mansoura, Egypt
Sherif I. Barakat, Information Systems Department, Faculty of Computers and Information Sciences, Mansoura University, Mansoura, Egypt
Mohamed Elhoseny (https://orcid.org/0000-0001-6347-8368), Information Systems Department, Faculty of Computers and Information Sciences, Mansoura University, Mansoura, Egypt
Reham R. Mostafa, Information Systems Department, Faculty of Computers and Information Sciences, Mansoura University, Mansoura, Egypt
Abstract: Although there have been many attempts to build an optimal model for feature selection in Big Data applications, the complex nature of processing such data keeps it a major challenge. Accordingly, the data mining process may be obstructed by the high dimensionality and complexity of huge data sets. To retain the most informative features and optimize classification accuracy, feature selection constitutes a mandatory pre-processing phase that reduces dataset dimensionality. An exhaustive search for the relevant features is time-consuming. In this paper, a new binary variant of a hybrid wrapper feature selection method combining grey wolf optimization and particle swarm optimization is proposed. The K-nearest neighbor classifier with the Euclidean distance metric is used to evaluate candidate solutions. A tent chaotic map helps keep the algorithm from becoming trapped in local optima. A sigmoid function converts the continuous search space into a binary one, as the feature selection problem requires. K-fold cross-validation is used to overcome overfitting. A variety of comparisons have been made with well-known algorithms, namely the particle swarm optimization algorithm and the grey wolf optimization algorithm. Twenty datasets are used for the experiments, and statistical analyses are conducted to confirm the performance and effectiveness of the proposed model on measures such as selected-features ratio, classification accuracy, and computation time. The cumulative number of features selected across the twenty datasets was 196 out of 773, as opposed to 393 and 336 for GWO and PSO, respectively. The overall accuracy is 90%, compared with the other algorithms' 81.6% and 86.8%. The total processing time for all datasets is 184.3 seconds, whereas GWO and PSO take 272 and 245.6 seconds, respectively.
Online Access: https://ieeexplore.ieee.org/document/9058715/
Topics: Particle swarm optimization (PSO); grey wolf optimization (GWO); data mining; big data analytics; feature selection
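The abstract's binarization step maps the optimizer's continuous positions onto 0/1 feature masks through a sigmoid transfer function. A minimal sketch of that idea follows; the function names and the fixed demo threshold are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def sigmoid(x):
    # Logistic function: squashes any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, rng=random.random):
    # A feature is selected (1) when the sigmoid of its continuous
    # coordinate exceeds a random threshold, otherwise dropped (0).
    return [1 if sigmoid(x) > rng() else 0 for x in position]

# Deterministic demo with a fixed 0.5 threshold:
print(binarize([10.0, -10.0, 0.5], rng=lambda: 0.5))  # [1, 0, 1]
```

In a full wrapper method, the resulting mask would index the dataset's columns before the classifier evaluates that feature subset.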
collection DOAJ
language English
format Article
sources DOAJ
author Ibrahim M. El-Hasnony
Sherif I. Barakat
Mohamed Elhoseny
Reham R. Mostafa
spellingShingle Ibrahim M. El-Hasnony
Sherif I. Barakat
Mohamed Elhoseny
Reham R. Mostafa
Improved Feature Selection Model for Big Data Analytics
IEEE Access
Particle swarm optimization (PSO)
grey wolf optimization (GWO)
data mining
big data analytics
feature selection
author_facet Ibrahim M. El-Hasnony
Sherif I. Barakat
Mohamed Elhoseny
Reham R. Mostafa
author_sort Ibrahim M. El-Hasnony
title Improved Feature Selection Model for Big Data Analytics
title_short Improved Feature Selection Model for Big Data Analytics
title_full Improved Feature Selection Model for Big Data Analytics
title_fullStr Improved Feature Selection Model for Big Data Analytics
title_full_unstemmed Improved Feature Selection Model for Big Data Analytics
title_sort improved feature selection model for big data analytics
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2020-01-01
description Although there have been many attempts to build an optimal model for feature selection in Big Data applications, the complex nature of processing such data keeps it a major challenge. Accordingly, the data mining process may be obstructed by the high dimensionality and complexity of huge data sets. To retain the most informative features and optimize classification accuracy, feature selection constitutes a mandatory pre-processing phase that reduces dataset dimensionality. An exhaustive search for the relevant features is time-consuming. In this paper, a new binary variant of a hybrid wrapper feature selection method combining grey wolf optimization and particle swarm optimization is proposed. The K-nearest neighbor classifier with the Euclidean distance metric is used to evaluate candidate solutions. A tent chaotic map helps keep the algorithm from becoming trapped in local optima. A sigmoid function converts the continuous search space into a binary one, as the feature selection problem requires. K-fold cross-validation is used to overcome overfitting. A variety of comparisons have been made with well-known algorithms, namely the particle swarm optimization algorithm and the grey wolf optimization algorithm. Twenty datasets are used for the experiments, and statistical analyses are conducted to confirm the performance and effectiveness of the proposed model on measures such as selected-features ratio, classification accuracy, and computation time. The cumulative number of features selected across the twenty datasets was 196 out of 773, as opposed to 393 and 336 for GWO and PSO, respectively. The overall accuracy is 90%, compared with the other algorithms' 81.6% and 86.8%. The total processing time for all datasets is 184.3 seconds, whereas GWO and PSO take 272 and 245.6 seconds, respectively.
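The tent chaotic map mentioned in the description is commonly used in such metaheuristic hybrids to diversify the search and escape local optima. Below is a sketch of the classic tent map; where exactly the map plugs into the GWO-PSO update is an assumption here, and the paper should be consulted for the authors' formulation.

```python
def tent_map(x, mu=2.0):
    # Classic tent map on (0, 1); fully chaotic at mu = 2.
    return mu * x if x < 0.5 else mu * (1.0 - x)

def chaotic_sequence(seed, n):
    # Iterate the map to get n values in (0, 1), usable in place of
    # uniform random numbers when initializing or perturbing agents.
    values, x = [], seed
    for _ in range(n):
        x = tent_map(x)
        values.append(x)
    return values

print(chaotic_sequence(0.3, 4))
```

Because successive values are deterministic yet non-repeating, seeding agent positions from such a sequence tends to cover the search space more evenly than independent uniform draws.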
topic Particle swarm optimization (PSO)
grey wolf optimization (GWO)
data mining
big data analytics
feature selection
url https://ieeexplore.ieee.org/document/9058715/
work_keys_str_mv AT ibrahimmelhasnony improvedfeatureselectionmodelforbigdataanalytics
AT sherifibarakat improvedfeatureselectionmodelforbigdataanalytics
AT mohamedelhoseny improvedfeatureselectionmodelforbigdataanalytics
AT rehamrmostafa improvedfeatureselectionmodelforbigdataanalytics
_version_ 1724183878230343680