Summary: | Master's === National Tsing Hua University === Institute of Service Science === 107 === Imbalanced data are common in real-world applications, and methods for handling class imbalance have been developed since around 2000. Sampling methods and ensemble algorithms are popular techniques for classifying imbalanced data. Previous work has compared the effects of different sampling methods under different algorithms. However, previous studies have mostly used small real-world datasets, without separating the effects of the algorithms used from those of the data characteristics. In this dissertation, we conduct a suite of simulations on two-class imbalanced datasets to compare the effect of different sampling methods (no sampling, oversampling, undersampling, and SMOTE) on the performance of classification trees (C5.0 Single Tree, CART Single Tree, Boosted Tree, and Random Forest) under different variations and different data characteristics (specifically, data size and imbalance ratio). Another novelty of our work is that we use the bootstrap to evaluate the sampling variation of different performance measures. With this setup, we answer the following questions: (1a) Should we use sampling methods and tree ensemble algorithms together to solve the imbalance problem? (1b) What is the benefit of doing so? We find that we can benefit from using sampling methods and tree ensemble algorithms together, but CART, C5.0, or logistic regression (LR) with sampling methods performs better. We then ask: (2a) Is the effectiveness of sampling related to the type of algorithm? (2b) Is there any synergy or interaction between oversampling methods and ensemble algorithms? (2c) Will the sampling-algorithm interaction lead to overfitting? We find that the answer to questions (2a) and (2b) is yes. As for (2c), the interaction might lead to overfitting, but it can also alleviate the problem.
Finally, we ask: (3) Is the effectiveness of sampling related to the type of data, in terms of data size and imbalance ratio? We show that the effectiveness of sampling is related to data size and to the imbalance ratio after sampling. The original imbalance ratio makes a difference to the effectiveness of the C5.0 algorithm.
|
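The sampling and bootstrap steps mentioned in the summary can be illustrated with a minimal sketch. It is not the dissertation's implementation: the language (Python), the function names, the toy data layout of `(features, label)` pairs, and the choice of recall as the performance measure are all assumptions made for illustration. SMOTE and the tree algorithms themselves are omitted.

```python
import random
from statistics import mean, stdev


def split_by_class(data):
    """Split (features, label) pairs into (minority, majority) lists for labels 0/1."""
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def oversample(data, rng):
    """Random oversampling: draw minority rows with replacement until classes balance."""
    minority, majority = split_by_class(data)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


def undersample(data, rng):
    """Random undersampling: keep a random majority subset the size of the minority."""
    minority, majority = split_by_class(data)
    return minority + rng.sample(majority, len(minority))


def recall(y_true, y_pred):
    """Fraction of true positives (label 1) predicted as positive; 0.0 if none exist."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else 0.0


def bootstrap_metric(y_true, y_pred, metric, n_boot, rng):
    """Resample test cases with replacement; return mean and std. dev. of the metric."""
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return mean(scores), stdev(scores)


# Hypothetical toy data: 100 rows, 10 of them in the minority class (label 1).
rng = random.Random(0)
data = [((i,), 1 if i < 10 else 0) for i in range(100)]
balanced_over = oversample(data, rng)    # 90 majority + 90 minority rows
balanced_under = undersample(data, rng)  # 10 majority + 10 minority rows
```

In an actual study the resampled training set would be passed to a classifier, and `bootstrap_metric` would be applied to that classifier's test-set predictions to estimate how much each performance measure varies from sample to sample.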