Performance of tree algorithms on imbalanced data under different sampling strategies
Master's === National Tsing Hua University === Institute of Service Science === 107 === Imbalanced data are common in real-world applications, and methods for tackling class imbalance have been developed since around 2000. Sampling methods and ensemble algorithms are popular techniques for classifying imbalanced data. Previous work has compared the e...
Main Authors: | Chen, Hsing-Chun 陳幸君 |
---|---|
Other Authors: | Shmueli, Galit 徐茉莉 |
Format: | Others |
Language: | en_US |
Published: | 2018 |
Online Access: | http://ndltd.ncl.edu.tw/handle/3aqbms |
id |
ndltd-TW-107NTHU5836003 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-107NTHU5836003 2019-07-16T03:45:02Z http://ndltd.ncl.edu.tw/handle/3aqbms Performance of tree algorithms on imbalanced data under different sampling strategies 決策樹演算法搭配不同抽樣法於不均衡資料集的表現 Chen, Hsing-Chun 陳幸君 Master's National Tsing Hua University Institute of Service Science 107 Shmueli, Galit 徐茉莉 2018 thesis 94 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Tsing Hua University === Institute of Service Science === 107 === Imbalanced data are common in real-world applications, and methods for tackling class imbalance have been developed since around 2000. Sampling methods and ensemble algorithms are popular techniques for classifying imbalanced data. Previous work has compared the effects of different sampling methods under different algorithms. However, previous research has mostly used small, real-world datasets, without separating the effects of the algorithms used from those of the data characteristics. In this dissertation, we conduct a suite of simulations for classifying 2-class imbalanced datasets in order to compare the effect of different sampling methods (no sampling, oversampling, undersampling, and SMOTE) on the performance of classification trees (C5.0 Single Tree, CART Single Tree, Boosted Tree, and Random Forest) with different variations and different data characteristics (specifically, data size and imbalance ratio). Another novelty of our work is that we use the bootstrap to evaluate the sampling variation of different performance measures. With this setup, we answer the following questions: (1a) Should we use sampling methods and tree ensemble algorithms together to address the imbalance problem? (1b) What is the benefit of doing so? We find that we can benefit from using sampling methods and tree ensemble algorithms together, but CART, C5.0, or logistic regression (LR) with sampling methods perform better. We then ask: (2a) Is the effectiveness of sampling related to the type of algorithm? (2b) Is there any synergy or interaction between oversampling methods and ensemble algorithms? (2c) Will the sampling-algorithm interaction lead to overfitting? We find that the answer to questions (2a) and (2b) is yes. As for (2c), the interaction might lead to overfitting, but it can alleviate the problem as well. Finally, we ask: (3) Is the effectiveness of sampling related to different types of data, in terms of data size and imbalance ratio? We show that the effectiveness of sampling is related to the type of data, in terms of data size and the imbalance ratio after sampling. The original imbalance ratio makes a difference to the effectiveness of the C5.0 algorithm.
|
author2 |
Shmueli, Galit |
author_facet |
Shmueli, Galit Chen, Hsing-Chun 陳幸君 |
author |
Chen, Hsing-Chun 陳幸君 |
author_sort |
Chen, Hsing-Chun |
title |
Performance of tree algorithms on imbalanced data under different sampling strategies |
publishDate |
2018 |
url |
http://ndltd.ncl.edu.tw/handle/3aqbms |
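
The abstract names random oversampling, random undersampling, and a bootstrap assessment of the sampling variation of performance measures, but this record does not include the thesis's code. The following is only a minimal sketch of those ideas under stated assumptions: the function names (`random_oversample`, `random_undersample`, `recall`, `bootstrap_interval`) are hypothetical, labels are assumed to be 0/1 with both classes present, the interval is a simple percentile bootstrap, and SMOTE and the tree models themselves are omitted.

```python
import random


def _split_by_size(rows, label_of):
    """Return (minority, majority) row lists for 0/1 labels."""
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    return (pos, neg) if len(pos) < len(neg) else (neg, pos)


def random_oversample(rows, label_of, rng):
    """Duplicate random minority rows until both classes have equal counts."""
    minority, majority = _split_by_size(rows, label_of)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_undersample(rows, label_of, rng):
    """Keep a random subset of majority rows equal in size to the minority."""
    minority, majority = _split_by_size(rows, label_of)
    return minority + rng.sample(majority, len(minority))


def recall(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    positives = sum(1 for t in y_true if t == 1)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else float("nan")


def bootstrap_interval(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric on held-out predictions."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test cases with replacement and recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # drop NaN from resamples with no positives
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper
```

In a setup like the one the abstract describes, resampling would be applied to the training data only, and the interval would be computed on predictions for an untouched test set, so that the reported variation reflects the original class distribution.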