A-SMOTE: A New Preprocessing Approach for Highly Imbalanced Datasets by Improving SMOTE

Imbalance learning is a challenging task for most standard machine learning algorithms. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known preprocessing approach for handling imbalanced datasets, in which the minority class is oversampled by producing synthetic examples in feature space rather than data space. However, many recent works have shown that the imbalance ratio in itself is not the problem; deterioration of model performance is caused by other factors linked to the distribution of the minority class samples. Blind oversampling by SMOTE leads to two major problems: noisy and borderline examples. Noisy examples are those from one class located in the safe zone of the other; borderline examples are those located in the neighborhood of the class boundary. Both are associated with degraded performance of the resulting models. It is therefore critical to take the structure of the minority class data into account and to regulate the positioning of the newly introduced minority class samples. Hence, this paper proposes an advanced SMOTE, denoted A-SMOTE, which adjusts the newly introduced minority class examples based on their distance to the original minority class samples. To achieve this, we first employ the SMOTE algorithm to introduce new samples into the minority class and then eliminate those synthetic examples that are closer to the majority class than to the minority class. We apply the proposed method to 44 datasets with various imbalance ratios and compare it against ten widely used data sampling methods from the literature, using the C4.5 and Naive Bayes classifiers for experimental validation. The results confirm the advantage of the proposed method over the other methods on almost all the datasets and illustrate its suitability for data preprocessing in classification tasks.

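The abstract describes a two-stage procedure: generate synthetic minority samples by SMOTE-style interpolation in feature space, then discard the synthetic points that lie closer to the majority class than to the minority class. The sketch below illustrates that idea in Python, assuming Euclidean distances and simple nearest-neighbour interpolation; the function names (smote_candidates, a_smote_filter), the choice of k, and the nearest-sample distance test are illustrative assumptions, not the authors' reference implementation of A-SMOTE.

import numpy as np

def smote_candidates(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic points by interpolating each chosen minority
    sample toward one of its k nearest minority-class neighbours (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances within the minority class
        neighbours = np.argsort(d)[1:k + 1]            # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

def a_smote_filter(synthetic, X_min, X_maj):
    """Keep only synthetic points whose nearest original minority sample is
    closer than their nearest majority sample (the filtering step sketched
    in the abstract)."""
    keep = []
    for s in synthetic:
        d_min = np.linalg.norm(X_min - s, axis=1).min()
        d_maj = np.linalg.norm(X_maj - s, axis=1).min()
        if d_min < d_maj:
            keep.append(s)
    return np.asarray(keep)

# Toy usage: oversample a small two-dimensional minority class, then filter.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(3.0, 1.0, size=(20, 2))
candidates = smote_candidates(X_min, n_new=180, k=5, seed=0)
X_min_aug = np.vstack([X_min, a_smote_filter(candidates, X_min, X_maj)])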

Bibliographic Details
Main Authors: Ahmed Saad Hussein, Tianrui Li, Chubato Wondaferaw Yohannese, Kamal Bashir
Format: Article
Language: English
Published: Atlantis Press 2019-11-01
Series: International Journal of Computational Intelligence Systems
ISSN: 1875-6883
DOI: 10.2991/ijcis.d.191114.002
Subjects: Imbalanced datasets, SMOTE, Machine learning, Oversampling, Undersampling
Online Access: https://www.atlantis-press.com/article/125924019/view