Statistical Learning with Imbalanced Data

In this thesis several sampling methods for Statistical Learning with imbalanced data have been implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms have been proposed for intelligent sampling: Border links, Clean Border Undersampling, One-Sided...


Bibliographic Details
Main Author: Shipitsyn, Aleksey
Format: Others
Language: English
Published: Linköpings universitet, Filosofiska fakulteten 2017
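Subjects: imbalanced learning; sampling algorithms; intelligent sampling; Probability Theory and Statistics; Sannolikhetsteori och statistik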
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139168
id ndltd-UPSALLA1-oai-DiVA.org-liu-139168
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-liu-139168
2017-07-05T05:39:54Z
Statistical Learning with Imbalanced Data
eng
Shipitsyn, Aleksey
Linköpings universitet, Filosofiska fakulteten
Linköpings universitet, Statistik och maskininlärning
2017
imbalanced learning
sampling algorithms
intelligent sampling
Probability Theory and Statistics
Sannolikhetsteori och statistik
Student thesis
info:eu-repo/semantics/bachelorThesis
text
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139168
application/pdf
info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic imbalanced learning
sampling algorithms
intelligent sampling
Probability Theory and Statistics
Sannolikhetsteori och statistik
description In this thesis, several sampling methods for statistical learning with imbalanced data have been implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms have been proposed for intelligent sampling: Border links, Clean Border Undersampling, One-Sided Undersampling Modified, DBSCAN Undersampling, Class Adjusted Jittering, Hierarchical Cluster Based Oversampling, DBSCAN Oversampling, Fitted Distribution Oversampling, Random Linear Combinations Oversampling, and Center Repulsion Oversampling. A set of requirements for a satisfactory performance metric for imbalanced learning has been formulated, and a new metric for evaluating classification performance has been developed accordingly. The new metric is based on a combination of the worst class accuracy and the geometric mean. In the testing framework, the nonparametric Friedman test and the post hoc Nemenyi test have been used to assess the performance of classifiers, sampling algorithms, and combinations of classifiers and sampling algorithms on several data sets. A new approach to detecting algorithms with dominating and dominated performance has been proposed, together with a new way of visualizing the results in a network.
From experiments on simulated data and several real data sets we conclude that: i) different classifiers are not equally sensitive to sampling algorithms, ii) sampling algorithms perform differently within specific classifiers, iii) oversampling algorithms perform better than undersampling algorithms, iv) Random Oversampling and Random Undersampling outperform many well-known sampling algorithms, v) our proposed algorithms Hierarchical Cluster Based Oversampling, DBSCAN Oversampling with FDO, and Class Adjusted Jittering perform much better than other algorithms, vi) a few good combinations of a classifier and a sampling algorithm may boost classification performance, while a few bad combinations may spoil it, but the majority of combinations do not differ significantly in performance.
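The description names two ingredients of the new metric, the worst class accuracy and the geometric mean, but this record does not state how the thesis combines them. The following is only a minimal sketch of the two ingredients, assuming the geometric mean is taken over per-class accuracies; the function names and the toy labels are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import prod


def per_class_accuracy(y_true, y_pred):
    """Map each true label to the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def worst_class_accuracy(y_true, y_pred):
    """Accuracy of the class that is classified worst."""
    return min(per_class_accuracy(y_true, y_pred).values())


def geometric_mean_accuracy(y_true, y_pred):
    """Geometric mean of the per-class accuracies (an assumption, see lead-in)."""
    accs = list(per_class_accuracy(y_true, y_pred).values())
    return prod(accs) ** (1.0 / len(accs))


# Hypothetical toy data: 8 majority samples (label 0), 2 minority samples (label 1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one of the two minority samples is misclassified

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
worst = worst_class_accuracy(y_true, y_pred)
gmean = geometric_mean_accuracy(y_true, y_pred)
print(overall, worst, gmean)
```

On this toy data the overall accuracy is 0.9 while the worst class accuracy is 0.5, which illustrates why plain accuracy can hide poor performance on a minority class.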
author Shipitsyn, Aleksey
title Statistical Learning with Imbalanced Data
publisher Linköpings universitet, Filosofiska fakulteten
publishDate 2017
url http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139168