A Cluster-Based Boosting Algorithm for Bankruptcy Prediction in a Highly Imbalanced Dataset

Bankruptcy prediction has been a popular and challenging research topic in both computer science and economics due to its importance to financial institutions, fund managers, lenders, governments, as well as economic stakeholders in recent years. In a bankruptcy dataset, the problem of class imbalan...


Bibliographic Details
Main Authors:	Tuong Le, Le Hoang Son, Minh Thanh Vo, Mi Young Lee, Sung Wook Baik
Format:	Article
Language:	English
Published:	MDPI AG 2018-07-01
Series:	Symmetry
Subjects:	bankruptcy prediction undersampling technique cluster-based boosting machine learning
Online Access:	http://www.mdpi.com/2073-8994/10/7/250

id	doaj-164ae8b892be4c1bb2f86d7314e1a3b5
record_format	Article
spelling	doaj-164ae8b892be4c1bb2f86d7314e1a3b5; 2020-11-25T00:55:58Z; eng; MDPI AG; Symmetry; ISSN 2073-8994; 2018-07-01; vol. 10, no. 7, art. 250; DOI 10.3390/sym10070250 (sym10070250)
	Tuong Le [0]: Digital Contents Research Institute, Sejong University, Seoul 143-747, Korea
	Le Hoang Son [1]: VNU University of Science, Vietnam National University, Hanoi, Vietnam
	Minh Thanh Vo [2]: Digital Contents Research Institute, Sejong University, Seoul 143-747, Korea
	Mi Young Lee [3]: Digital Contents Research Institute, Sejong University, Seoul 143-747, Korea
	Sung Wook Baik [4]: Digital Contents Research Institute, Sejong University, Seoul 143-747, Korea
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Tuong Le Le Hoang Son Minh Thanh Vo Mi Young Lee Sung Wook Baik
title	A Cluster-Based Boosting Algorithm for Bankruptcy Prediction in a Highly Imbalanced Dataset
publisher	MDPI AG
series	Symmetry
issn	2073-8994
publishDate	2018-07-01
description	Bankruptcy prediction has been a popular and challenging research topic in recent years in both computer science and economics because of its importance to financial institutions, fund managers, lenders, governments, and other economic stakeholders. In a bankruptcy dataset, class imbalance, in which bankrupt companies are fewer than normal companies, causes standard classification algorithms to perform poorly. This study therefore proposes a cluster-based boosting algorithm, CBoost, and a robust framework using the CBoost algorithm and Instance Hardness Threshold (RFCI) for effective bankruptcy prediction on a financial dataset. The framework first resamples the imbalanced dataset by undersampling with the Instance Hardness Threshold (IHT), which removes noisy majority-class instances that have a large IHT value. CBoost then deals with the class imbalance: the majority class is grouped into a number of clusters, and the distance from each sample to its closest centroid is used to initialize that sample's weight. The algorithm then runs several iterations, finding weak classifiers and combining them into a strong classifier. The resampled set produced by the undersampling module is used to train CBoost, which then predicts bankruptcy for the validation set. The framework is evaluated on the Korean bankruptcy dataset (KBD), which has a very small balancing ratio in both the training and testing phases. The experimental results show that the proposed framework achieves an AUC (area under the ROC curve) of 86.8% and outperforms several methods for handling imbalanced data in bankruptcy prediction on this dataset, including the GMBoost algorithm, an oversampling-based method using SMOTEENN, and a clustering-based undersampling method.
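The weight-initialization step mentioned in the description (cluster the majority class, then derive each sample's starting weight from its distance to the closest centroid) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the record does not give the formula that maps distance to weight, so the `1 / (1 + distance)` mapping, the unit starting weight for minority samples, the deterministic k-means seeding, and the names `kmeans` and `initial_weights` are all hypothetical. The IHT undersampling step and the boosting iterations are not shown.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, seeded with the first k points (assumed choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids


def initial_weights(majority, minority, k=2):
    """Return normalized starting weights for majority and minority samples.

    Assumption: a majority sample close to its nearest centroid gets a
    larger weight via 1 / (1 + distance); minority samples start at 1.
    """
    centroids = kmeans(majority, k)
    raw_major = [
        1.0 / (1.0 + min(math.dist(p, c) for c in centroids)) for p in majority
    ]
    raw_minor = [1.0] * len(minority)
    total = sum(raw_major) + sum(raw_minor)
    return [w / total for w in raw_major], [w / total for w in raw_minor]


# Toy data: two tight majority groups, one majority point far from both,
# and a single minority sample.
majority = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
minority = [(3, 3)]
w_major, w_minor = initial_weights(majority, minority, k=2)
```

In this toy run the point `(5, 5)` lies farthest from either centroid, so it receives the smallest majority weight. A boosting loop would then start from these weights in place of the uniform weights used by standard AdaBoost.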
topic	bankruptcy prediction undersampling technique cluster-based boosting machine learning
url	http://www.mdpi.com/2073-8994/10/7/250

Similar Items