Statistical Learning with Imbalanced Data

In this thesis, several sampling methods for statistical learning with imbalanced data are implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms are proposed for intelligent sampling: Border links, Clean Border Undersampling, One-Sided...

Bibliographic Details
Main Author:	Shipitsyn, Aleksey
Format:	Others
Language:	English
Published:	Linköpings universitet, Filosofiska fakulteten 2017
Subjects:	imbalanced learning; sampling algorithms; intelligent sampling; Probability Theory and Statistics; Sannolikhetsteori och statistik
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139168

id	ndltd-UPSALLA1-oai-DiVA.org-liu-139168
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-liu-139168; 2017-07-05T05:39:54Z; Linköpings universitet, Filosofiska fakulteten; Linköpings universitet, Statistik och maskininlärning; Student thesis; info:eu-repo/semantics/bachelorThesis; text; application/pdf; info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	imbalanced learning; sampling algorithms; intelligent sampling; Probability Theory and Statistics; Sannolikhetsteori och statistik
description	In this thesis, several sampling methods for statistical learning with imbalanced data are implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms are proposed for intelligent sampling: Border links, Clean Border Undersampling, One-Sided Undersampling Modified, DBSCAN Undersampling, Class Adjusted Jittering, Hierarchical Cluster Based Oversampling, DBSCAN Oversampling, Fitted Distribution Oversampling, Random Linear Combinations Oversampling, and Center Repulsion Oversampling. A set of requirements for a satisfactory performance metric for imbalanced learning is formulated, and a new metric for evaluating classification performance is developed accordingly. The new metric is based on a combination of the worst-class accuracy and the geometric mean. In the testing framework, the nonparametric Friedman test and the post hoc Nemenyi test are used to assess the performance of classifiers, sampling algorithms, and combinations of classifiers and sampling algorithms on several data sets. A new approach to detecting algorithms with dominating and dominated performance is proposed, together with a new way of visualizing the results in a network.
From experiments on simulated and several real data sets, we conclude that: (i) different classifiers are not equally sensitive to sampling algorithms; (ii) the performance of sampling algorithms differs within specific classifiers; (iii) oversampling algorithms perform better than undersampling algorithms; (iv) Random Oversampling and Random Undersampling outperform many well-known sampling algorithms; (v) our proposed algorithms Hierarchical Cluster Based Oversampling, DBSCAN Oversampling with FDO, and Class Adjusted Jittering perform much better than other algorithms; (vi) a few good combinations of a classifier and a sampling algorithm may boost classification performance and a few bad combinations may spoil it, but the majority of combinations do not differ significantly in performance.
author	Shipitsyn, Aleksey
title	Statistical Learning with Imbalanced Data
publisher	Linköpings universitet, Filosofiska fakulteten
publishDate	2017
url	http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139168
