A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design ai...

Bibliographic Details
Main Authors: Terzi Duygu Sinanc, Sagiroglu Seref
Format: Article
Language: English
Published: Sciendo 2019-12-01
Series: Applied Computer Systems
Subjects: big data; cluster-based resampling; imbalanced big data classification; imbalanced data
Online Access: https://doi.org/10.2478/acss-2019-0013
id doaj-ae20c2d8188045fd89404a54d6e09f61
record_format Article
spelling doaj-ae20c2d8188045fd89404a54d6e09f61; 2021-09-06T19:41:00Z; eng; Sciendo; Applied Computer Systems; 2255-8691; 2019-12-01; 24; 2; 104; 110; 10.2478/acss-2019-0013; acss-2019-0013; A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem; Terzi Duygu Sinanc (0); Sagiroglu Seref (1); (0) Department of Computer Engineering, Gazi University, Ankara, Turkey; (1) Department of Computer Engineering, Gazi University, Ankara, Turkey; The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase the classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy has been designed to present the success of the model on data sets with different imbalanced ratios. The second strategy has been designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values between 10 % and 24 % through the case study.; https://doi.org/10.2478/acss-2019-0013; big data; cluster-based resampling; imbalanced big data classification; imbalanced data
collection DOAJ
language English
format Article
sources DOAJ
author Terzi Duygu Sinanc
Sagiroglu Seref
spellingShingle Terzi Duygu Sinanc
Sagiroglu Seref
A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem
Applied Computer Systems
big data
cluster-based resampling
imbalanced big data classification
imbalanced data
author_facet Terzi Duygu Sinanc
Sagiroglu Seref
author_sort Terzi Duygu Sinanc
title A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem
title_short A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem
title_full A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem
title_fullStr A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem
title_full_unstemmed A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem
title_sort new big data model using distributed cluster-based resampling for class-imbalance problem
publisher Sciendo
series Applied Computer Systems
issn 2255-8691
publishDate 2019-12-01
description The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase the classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy has been designed to present the success of the model on data sets with different imbalanced ratios. The second strategy has been designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values between 10 % and 24 % through the case study.
topic big data
cluster-based resampling
imbalanced big data classification
imbalanced data
url https://doi.org/10.2478/acss-2019-0013