Summary: In this thesis, several sampling methods for statistical learning with imbalanced data have been implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms have been proposed for intelligent sampling: Border Links, Clean Border Undersampling, One-Sided Undersampling Modified, DBSCAN Undersampling, Class Adjusted Jittering, Hierarchical Cluster Based Oversampling, DBSCAN Oversampling, Fitted Distribution Oversampling, Random Linear Combinations Oversampling, and Center Repulsion Oversampling.

A set of requirements for a satisfactory performance metric in imbalanced learning has been formulated, and a new metric for evaluating classification performance has been developed accordingly. The new metric is based on a combination of the worst class accuracy and the geometric mean of the per-class accuracies.

In the testing framework, the nonparametric Friedman test and the post hoc Nemenyi test have been used to assess the performance of classifiers, sampling algorithms, and combinations of the two on several data sets. A new approach to detecting algorithms with dominating or dominated performance has been proposed, together with a new way of visualizing the results as a network.

From experiments on simulated and several real data sets we conclude that: i) different classifiers are not equally sensitive to sampling algorithms; ii) sampling algorithms perform differently depending on the classifier; iii) oversampling algorithms perform better than undersampling algorithms; iv) Random Oversampling and Random Undersampling outperform many well-known sampling algorithms; v) our proposed algorithms Hierarchical Cluster Based Oversampling, DBSCAN Oversampling with FDO, and Class Adjusted Jittering perform much better than the other algorithms; vi) a few good combinations of classifier and sampling algorithm may boost classification performance and a few bad combinations may degrade it, while the majority of combinations do not differ significantly in performance.
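To make the shape of such a metric concrete, the sketch below computes per-class accuracies from a confusion matrix and blends their minimum (the worst class accuracy) with their geometric mean. The thesis does not spell out the exact weighting here, so the function name imbalanced_accuracy and the equal-weight combination alpha=0.5 are illustrative assumptions, not the thesis's definition.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def imbalanced_accuracy(y_true, y_pred, alpha=0.5):
        # Per-class accuracy (recall): diagonal of the confusion matrix
        # divided by the number of true instances of each class.
        cm = confusion_matrix(y_true, y_pred)
        per_class = cm.diagonal() / cm.sum(axis=1)
        worst = per_class.min()                             # worst class accuracy
        gmean = per_class.prod() ** (1.0 / per_class.size)  # geometric mean
        # alpha = 0.5 is an assumed equal weighting of the two terms;
        # the actual combination used in the thesis may differ.
        return alpha * worst + (1.0 - alpha) * gmean

With per-class accuracies of 0.9 and 0.5, for example, this gives 0.5 · 0.5 + 0.5 · √0.45 ≈ 0.59, penalizing the weak class far more than the macro-average accuracy of 0.7 would.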
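The ranking comparison described above can be reproduced with standard tools. A minimal sketch, assuming performance scores arranged as data sets (rows) by algorithms (columns) and using SciPy together with the third-party scikit-posthocs package; the scores below are placeholders, not results from the thesis:

    import numpy as np
    from scipy.stats import friedmanchisquare
    import scikit_posthocs as sp

    # Rows: data sets (blocks); columns: algorithms being compared.
    scores = np.array([
        [0.71, 0.78, 0.75],
        [0.62, 0.70, 0.66],
        [0.80, 0.85, 0.81],
        [0.55, 0.63, 0.60],
    ])

    # Friedman test: do the algorithms' rankings differ overall?
    stat, p = friedmanchisquare(*scores.T)
    print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

    # Post hoc Nemenyi test: pairwise p-values between algorithms.
    pairwise_p = sp.posthoc_nemenyi_friedman(scores)
    print(pairwise_p)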
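For reference, the two random baselines in conclusion iv) are simple to state. A minimal sketch for a binary problem, with all function names chosen here for illustration:

    import numpy as np

    def random_oversample(X, y, rng=None):
        # Duplicate randomly chosen minority samples until the classes balance.
        rng = np.random.default_rng(rng)
        classes, counts = np.unique(y, return_counts=True)
        idx = np.flatnonzero(y == classes[counts.argmin()])
        extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
        keep = np.concatenate([np.arange(len(y)), extra])
        return X[keep], y[keep]

    def random_undersample(X, y, rng=None):
        # Discard randomly chosen majority samples until the classes balance.
        rng = np.random.default_rng(rng)
        classes, counts = np.unique(y, return_counts=True)
        idx = np.flatnonzero(y == classes[counts.argmax()])
        drop = rng.choice(idx, size=counts.max() - counts.min(), replace=False)
        keep = np.setdiff1d(np.arange(len(y)), drop)
        return X[keep], y[keep]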