Improved Feature Selection Model for Big Data Analytics
Although many attempts have been made to build an optimal model for feature selection in Big Data applications, the complex nature of processing such data keeps this a major challenge. Accordingly, the data mining process may be obstructed by the high dimensionality and complexity of huge d...
Main Authors: | El-Hasnony, Ibrahim M., Barakat, Sherif I., Elhoseny, Mohamed, Mostafa, Reham R. |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2020-01-01 |
Series: | IEEE Access |
Subjects: | Particle swarm optimization (PSO); grey wolf optimization (GWO); data mining; big data analytics; feature selection |
Online Access: | https://ieeexplore.ieee.org/document/9058715/ |
id |
doaj-7f1fecfba9c54fcaabac3d800b698715 |
---|---|
record_format |
Article |
spelling |
doaj-7f1fecfba9c54fcaabac3d800b698715 2021-03-30T03:12:22Z eng IEEE, IEEE Access, ISSN 2169-3536, 2020-01-01, Vol. 8, pp. 66989-67004, DOI 10.1109/ACCESS.2020.2986232, article 9058715. Improved Feature Selection Model for Big Data Analytics.
Ibrahim M. El-Hasnony (https://orcid.org/0000-0002-9489-3449), Sherif I. Barakat, Mohamed Elhoseny (https://orcid.org/0000-0001-6347-8368), Reham R. Mostafa; all: Information Systems Department, Faculty of Computers and Information Sciences, Mansoura University, Mansoura, Egypt.
Although many attempts have been made to build an optimal feature selection model for Big Data applications, the complex nature of processing such data keeps this a major challenge. The data mining process may be obstructed by the high dimensionality and complexity of huge data sets. To retain the most informative features and optimize classification accuracy, feature selection constitutes a mandatory pre-processing phase that reduces dataset dimensionality, since an exhaustive search for the relevant features is time-consuming. In this paper, a new binary wrapper feature selection variant that hybridizes grey wolf optimization and particle swarm optimization is proposed. A K-nearest neighbor classifier with a Euclidean distance metric is used to evaluate candidate solutions. A tent chaotic map helps prevent the algorithm from becoming trapped in local optima, and a sigmoid function converts the continuous search space into a binary one, as the feature selection problem requires. K-fold cross-validation is used to mitigate overfitting. The proposed model is compared with well-known algorithms, namely the particle swarm optimization and grey wolf optimization algorithms. Twenty datasets are used for the experiments, and statistical analyses are conducted to confirm the performance and effectiveness of the proposed model on measures such as selected-features ratio, classification accuracy, and computation time. The cumulative number of features selected across the twenty datasets was 196 out of 773, compared with 393 for GWO and 336 for PSO. The overall accuracy is 90%, versus 81.6% for GWO and 86.8% for PSO. The total processing time for all datasets is 184.3 seconds, versus 272 and 245.6 seconds for GWO and PSO, respectively.
https://ieeexplore.ieee.org/document/9058715/ Particle swarm optimization (PSO); grey wolf optimization (GWO); data mining; big data analytics; feature selection |
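The sigmoid binarization step the abstract describes, mapping a continuous swarm position to a binary feature mask, can be sketched as follows. The function names and the random-threshold rule are illustrative assumptions, not the authors' exact implementation:

```python
import math
import random

def sigmoid(x):
    # Transfer function mapping a continuous position/velocity component to (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, rng=random.random):
    # Each dimension becomes 1 (feature selected) with probability sigmoid(x_d),
    # converting the continuous search space to a binary feature mask.
    return [1 if rng() < sigmoid(x) else 0 for x in position]
```

A very positive component almost always yields 1, a very negative one almost always yields 0, so the continuous optimizer's dynamics carry over to the binary feature-selection space.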
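The tent chaotic map the abstract mentions is commonly used to generate initialization or perturbation sequences that keep the search from stagnating in local optima. A minimal sketch, with `mu = 2.0` as an assumed (standard) parameter:

```python
def tent_map(x, mu=2.0):
    # One iteration of the tent map on [0, 1]; with mu = 2 the orbit is chaotic.
    return mu * x if x < 0.5 else mu * (1.0 - x)

def chaotic_sequence(x0, n):
    # Iterate the map n times from seed x0, collecting the orbit.
    seq = []
    x = x0
    for _ in range(n):
        x = tent_map(x)
        seq.append(x)
    return seq
```

The resulting values stay in [0, 1] but are far less regular than a fixed step schedule, which is what makes them useful for escaping local optima.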
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Ibrahim M. El-Hasnony; Sherif I. Barakat; Mohamed Elhoseny; Reham R. Mostafa |
spellingShingle |
Ibrahim M. El-Hasnony; Sherif I. Barakat; Mohamed Elhoseny; Reham R. Mostafa. Improved Feature Selection Model for Big Data Analytics. IEEE Access. Particle swarm optimization (PSO); grey wolf optimization (GWO); data mining; big data analytics; feature selection |
author_facet |
Ibrahim M. El-Hasnony; Sherif I. Barakat; Mohamed Elhoseny; Reham R. Mostafa |
author_sort |
Ibrahim M. El-Hasnony |
title |
Improved Feature Selection Model for Big Data Analytics |
title_short |
Improved Feature Selection Model for Big Data Analytics |
title_full |
Improved Feature Selection Model for Big Data Analytics |
title_fullStr |
Improved Feature Selection Model for Big Data Analytics |
title_full_unstemmed |
Improved Feature Selection Model for Big Data Analytics |
title_sort |
improved feature selection model for big data analytics |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2020-01-01 |
description |
Although many attempts have been made to build an optimal feature selection model for Big Data applications, the complex nature of processing such data keeps this a major challenge. The data mining process may be obstructed by the high dimensionality and complexity of huge data sets. To retain the most informative features and optimize classification accuracy, feature selection constitutes a mandatory pre-processing phase that reduces dataset dimensionality, since an exhaustive search for the relevant features is time-consuming. In this paper, a new binary wrapper feature selection variant that hybridizes grey wolf optimization and particle swarm optimization is proposed. A K-nearest neighbor classifier with a Euclidean distance metric is used to evaluate candidate solutions. A tent chaotic map helps prevent the algorithm from becoming trapped in local optima, and a sigmoid function converts the continuous search space into a binary one, as the feature selection problem requires. K-fold cross-validation is used to mitigate overfitting. The proposed model is compared with well-known algorithms, namely the particle swarm optimization and grey wolf optimization algorithms. Twenty datasets are used for the experiments, and statistical analyses are conducted to confirm the performance and effectiveness of the proposed model on measures such as selected-features ratio, classification accuracy, and computation time. The cumulative number of features selected across the twenty datasets was 196 out of 773, compared with 393 for GWO and 336 for PSO. The overall accuracy is 90%, versus 81.6% for GWO and 86.8% for PSO. The total processing time for all datasets is 184.3 seconds, versus 272 and 245.6 seconds for GWO and PSO, respectively. |
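The wrapper evaluation the description refers to (a K-nearest neighbor classifier with Euclidean distance, scored by K-fold cross-validation over a binary feature mask) might look roughly like this sketch. The feature-count penalty weight `alpha` and the interleaved fold split are illustrative assumptions, not the paper's exact fitness function:

```python
import math
from collections import Counter

def euclidean(a, b, mask):
    # Euclidean distance restricted to the features enabled in the binary mask.
    return math.sqrt(sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m))

def knn_predict(train, labels, query, mask, k=3):
    # Majority vote among the k nearest training points under the masked distance.
    order = sorted(range(len(train)), key=lambda i: euclidean(train[i], query, mask))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_fitness(X, y, mask, k=3, folds=5, alpha=0.01):
    # Wrapper fitness: K-fold cross-validated KNN error rate plus a small
    # penalty on the fraction of selected features (alpha is an assumed weight).
    n = len(X)
    errors = 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))  # simple interleaved fold split
        train = [X[i] for i in range(n) if i not in test_idx]
        tlab = [y[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            if knn_predict(train, tlab, X[i], mask, k) != y[i]:
                errors += 1
    return errors / n + alpha * sum(mask) / len(mask)
```

Lower fitness is better: the optimizer is rewarded both for classifying accurately and for keeping few features, which matches the paper's reported trade-off between accuracy and selected-features ratio.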
topic |
Particle swarm optimization (PSO); grey wolf optimization (GWO); data mining; big data analytics; feature selection |
url |
https://ieeexplore.ieee.org/document/9058715/ |
work_keys_str_mv |
AT ibrahimmelhasnony improvedfeatureselectionmodelforbigdataanalytics AT sherifibarakat improvedfeatureselectionmodelforbigdataanalytics AT mohamedelhoseny improvedfeatureselectionmodelforbigdataanalytics AT rehamrmostafa improvedfeatureselectionmodelforbigdataanalytics |
_version_ |
1724183878230343680 |