Classification of Imbalanced Data Represented as Binary Features

Typically, classification is conducted on a dataset that consists of numerical features and target classes. For instance, a grayscale image, which is usually represented as a matrix of integers varying from 0 to 255, enables one to apply various classification algorithms to image classification task...


Bibliographic Details
Main Authors: Kunti Robiatul Mahmudah, Fatma Indriani, Yukiko Takemori-Sakai, Yasunori Iwata, Takashi Wada, Kenji Satou
Format: Article
Language: English
Published: MDPI AG 2021-08-01
Series: Applied Sciences
Subjects: binary feature classification; mutation; feature extraction; oversampling
Online Access: https://www.mdpi.com/2076-3417/11/17/7825
id doaj-abb4969dffed46fe8b2c6002e0d8a848
record_format Article
spelling doaj-abb4969dffed46fe8b2c6002e0d8a848
2021-09-09T13:38:17Z
eng
MDPI AG
Applied Sciences
2076-3417
2021-08-01
11
7825
7825
10.3390/app11177825
Classification of Imbalanced Data Represented as Binary Features
Kunti Robiatul Mahmudah [0]
Fatma Indriani [1]
Yukiko Takemori-Sakai [2]
Yasunori Iwata [3]
Takashi Wada [4]
Kenji Satou [5]
[0] Graduate School of Natural Science and Technology, Kanazawa University, Kanazawa 9201192, Japan
[1] Graduate School of Natural Science and Technology, Kanazawa University, Kanazawa 9201192, Japan
[2] Division of Clinical Laboratory Medicine, Kanazawa University, Kanazawa 9201192, Japan
[3] Department of Nephrology and Laboratory Medicine, Kanazawa University, Kanazawa 9201192, Japan
[4] Department of Nephrology and Laboratory Medicine, Kanazawa University, Kanazawa 9201192, Japan
[5] Institute of Science and Engineering, Kanazawa University, Kanazawa 9201192, Japan
collection DOAJ
language English
format Article
sources DOAJ
author Kunti Robiatul Mahmudah
Fatma Indriani
Yukiko Takemori-Sakai
Yasunori Iwata
Takashi Wada
Kenji Satou
title Classification of Imbalanced Data Represented as Binary Features
publisher MDPI AG
series Applied Sciences
issn 2076-3417
publishDate 2021-08-01
description Typically, classification is conducted on a dataset that consists of numerical features and target classes. For instance, a grayscale image, usually represented as a matrix of integers ranging from 0 to 255, can be fed to various classification algorithms for image classification tasks. However, many standard machine learning algorithms do not perform optimally on datasets represented as binary features, even though such datasets are not rare. Separately, oversampling algorithms such as the synthetic minority oversampling technique (SMOTE) and its variants are often used when the dataset for classification is imbalanced. Because SMOTE and its variants synthesize new minority samples from the original samples, the diversity of samples synthesized from binary features is highly limited by the poor representation of the original features. To address this problem, a preprocessing approach is studied: binary features are converted into numerical ones using feature extraction methods, so that subsequent oversampling methods can realize their full potential in improving classifier performance. In comprehensive experiments on benchmark datasets and real medical datasets, a converted dataset consisting of numerical features proved better suited to oversampling methods (maximum improvements in accuracy and F1-score were 35.11% and 42.17%, respectively). In addition, feature extraction and oversampling were confirmed to contribute synergistically to the improvement in classification performance.
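The preprocessing idea in the description (convert binary features to numerical ones, then oversample) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not name the feature extraction methods used, so PCA appears here only as an assumed stand-in, and `smote_like`, `pca_transform`, and the random toy data are hypothetical names and values chosen for illustration. The sketch reproduces only the core SMOTE step, interpolation between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, seed=None):
    """Synthesize n_new samples by interpolating between a minority sample
    and one of its k nearest minority neighbours (the core SMOTE step)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))         # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])


def pca_transform(X, n_components):
    """Project X onto its top principal components.
    Assumption: PCA stands in for the unspecified feature extraction step."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # centre each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


# Hypothetical toy data: 20 minority samples with 12 binary features.
X_bin = np.random.default_rng(0).integers(0, 2, size=(20, 12))

# Oversampling directly on binary features: each synthetic value lies between
# the two parent values, so features on which the parents agree stay at 0 or 1.
synthetic_bin = smote_like(X_bin, n_new=50, seed=1)

# Oversampling after conversion to numerical features.
X_num = pca_transform(X_bin, n_components=4)
synthetic_num = smote_like(X_num, n_new=50, seed=1)
```

In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package would normally replace the hand-written interpolation; the sketch only makes the order of the two steps explicit.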
topic binary feature classification
mutation
feature extraction
oversampling
url https://www.mdpi.com/2076-3417/11/17/7825