Classification of Imbalanced Data Represented as Binary Features
Typically, classification is conducted on a dataset that consists of numerical features and target classes. For instance, a grayscale image, which is usually represented as a matrix of integers varying from 0 to 255, enables one to apply various classification algorithms to image classification task...
Main Authors: | Kunti Robiatul Mahmudah, Fatma Indriani, Yukiko Takemori-Sakai, Yasunori Iwata, Takashi Wada, Kenji Satou |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-08-01 |
Series: | Applied Sciences |
Subjects: | binary feature classification; mutation; feature extraction; oversampling |
Online Access: | https://www.mdpi.com/2076-3417/11/17/7825 |
id |
doaj-abb4969dffed46fe8b2c6002e0d8a848 |
---|---|
record_format |
Article |
spelling |
doaj-abb4969dffed46fe8b2c6002e0d8a848; 2021-09-09T13:38:17Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2021-08-01; vol. 11, 7825; DOI 10.3390/app11177825; Classification of Imbalanced Data Represented as Binary Features
Kunti Robiatul Mahmudah (Graduate School of Natural Science and Technology, Kanazawa University, Kanazawa 9201192, Japan); Fatma Indriani (Graduate School of Natural Science and Technology, Kanazawa University, Kanazawa 9201192, Japan); Yukiko Takemori-Sakai (Division of Clinical Laboratory Medicine, Kanazawa University, Kanazawa 9201192, Japan); Yasunori Iwata (Department of Nephrology and Laboratory Medicine, Kanazawa University, Kanazawa 9201192, Japan); Takashi Wada (Department of Nephrology and Laboratory Medicine, Kanazawa University, Kanazawa 9201192, Japan); Kenji Satou (Institute of Science and Engineering, Kanazawa University, Kanazawa 9201192, Japan)
https://www.mdpi.com/2076-3417/11/17/7825; binary feature classification; mutation; feature extraction; oversampling |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Kunti Robiatul Mahmudah; Fatma Indriani; Yukiko Takemori-Sakai; Yasunori Iwata; Takashi Wada; Kenji Satou |
author_sort |
Kunti Robiatul Mahmudah |
title |
Classification of Imbalanced Data Represented as Binary Features |
publisher |
MDPI AG |
series |
Applied Sciences |
issn |
2076-3417 |
publishDate |
2021-08-01 |
description |
Classification is typically conducted on a dataset that consists of numerical features and target classes. For instance, a grayscale image, usually represented as a matrix of integers ranging from 0 to 255, can be passed directly to various classification algorithms. Datasets represented as binary features, however, are not handled optimally by many standard machine learning algorithms, even though such datasets are not rare. Separately, oversampling algorithms such as the synthetic minority oversampling technique (SMOTE) and its variants are often used when the dataset for classification is imbalanced. Because SMOTE and its variants synthesize new minority samples from the original samples, the diversity of samples synthesized from binary features is highly limited by the poor representation of the original features. To solve this problem, a preprocessing approach is studied: by converting binary features into numerical ones using feature extraction methods, subsequent oversampling methods can realize their full potential in improving classifier performance. Through comprehensive experiments on benchmark datasets and real medical datasets, it was observed that a converted dataset consisting of numerical features is better suited to oversampling methods (maximum improvements in accuracy and F1-score were 35.11% and 42.17%, respectively). In addition, it was confirmed that feature extraction and oversampling contribute synergistically to improved classification performance. |
topic |
binary feature classification; mutation; feature extraction; oversampling |
url |
https://www.mdpi.com/2076-3417/11/17/7825 |
_version_ |
1717760876682936320 |
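
The preprocessing order described in the abstract (convert binary features to numerical ones first, then oversample) can be illustrated with a minimal sketch. It is not the authors' implementation. The record names SMOTE but no specific feature extraction method, so a Gaussian random projection is used here purely as a stand-in, and `random_projection`, `smote`, the toy data, and every parameter value are hypothetical. The `smote` function is a simplified SMOTE-style interpolation written with the Python standard library only.

```python
import math
import random


def random_projection(rows, n_components, seed=0):
    """Map binary rows to dense numerical rows with a Gaussian random matrix.

    Assumption: a stand-in for the feature extraction step, which the
    record does not specify.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    # One Gaussian weight vector per output component.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
               for _ in range(n_components)]
    scale = 1.0 / math.sqrt(n_components)
    return [[scale * sum(w * x for w, x in zip(wrow, row)) for wrow in weights]
            for row in rows]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base row (excluding itself).
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy imbalanced data: 4 minority and 12 majority rows of 8 binary features.
data_rng = random.Random(42)
X_minority = [[data_rng.randint(0, 1) for _ in range(8)] for _ in range(4)]
X_majority = [[data_rng.randint(0, 1) for _ in range(8)] for _ in range(12)]

# Step 1: convert binary features into numerical ones.
# All rows are projected together so they share one projection matrix.
Z = random_projection(X_minority + X_majority, n_components=3)
Z_minority, Z_majority = Z[:len(X_minority)], Z[len(X_minority):]

# Step 2: oversample the minority class until both classes have the same size.
Z_synthetic = smote(Z_minority, n_new=len(Z_majority) - len(Z_minority))
Z_minority_balanced = Z_minority + Z_synthetic
print(len(Z_majority), len(Z_minority_balanced))  # prints "12 12"
```

In practice a maintained SMOTE implementation and a learned feature extractor would normally replace both helper functions; the sketch only shows the order of the two steps.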