Performance of tree algorithms on imbalanced data under different sampling strategies

Master's === National Tsing Hua University === Institute of Service Science === 107 === Imbalanced data are common in real-world applications, and methods to tackle imbalance issues have been developed since around 2000. Sampling methods and ensemble algorithms are popular techniques for classifying imbalanced data. Previous work has compared the e...

Bibliographic Details
Main Authors: Chen, Hsing-Chun, 陳幸君
Other Authors: Shmueli, Galit
Format: Others
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/3aqbms
id ndltd-TW-107NTHU5836003
record_format oai_dc
spelling ndltd-TW-107NTHU5836003 2019-07-16T03:45:02Z http://ndltd.ncl.edu.tw/handle/3aqbms Performance of tree algorithms on imbalanced data under different sampling strategies 決策樹演算法搭配不同抽樣法於不均衡資料集的表現 Chen, Hsing-Chun 陳幸君 Master's National Tsing Hua University Institute of Service Science 107 Shmueli, Galit 徐茉莉 2018 thesis 94 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Tsing Hua University === Institute of Service Science === 107 === Imbalanced data are common in real-world applications, and methods to tackle imbalance issues have been developed since around 2000. Sampling methods and ensemble algorithms are popular techniques for classifying imbalanced data. Previous work has compared the effects of different sampling methods under different algorithms. However, previous research has mostly used small, real-world datasets, without separating the effects of the algorithms used from those of the data characteristics. In this dissertation, we conduct a suite of simulations for classifying 2-class imbalanced datasets in order to compare the effect of different sampling methods (non-sampling, oversampling, undersampling, and SMOTE) on the performance of classification trees (C5.0 Single Tree, CART Single Tree, Boosted Tree, and Random Forest) with different variations and different data characteristics (specifically, data size and imbalance ratio). Another novelty of our work is that we use the bootstrap to evaluate the sampling variation of different performance measures. With this setup, we answer the following questions: (1a) Should we use sampling methods and tree ensemble algorithms together to address the imbalance problem? (1b) What is the benefit of doing so? We find that we can benefit from using sampling methods and tree ensemble algorithms together, but CART, C5.0, or logistic regression (LR) with sampling methods performs better. We then ask: (2a) Is the effectiveness of sampling related to the type of algorithm? (2b) Is there any synergy or interaction between oversampling methods and ensemble algorithms? (2c) Will the sampling-algorithm interaction lead to overfitting? We find that the answer to questions (2a) and (2b) is yes. As for (2c), the interaction might lead to overfitting, but it can also alleviate the problem. Finally, we ask: (3) Is the effectiveness of sampling related to different types of data, in terms of data size and imbalance ratio? We show that the effectiveness of sampling is related to different types of data, in terms of data size and imbalance ratio after sampling. The original imbalance ratio makes a difference to the effectiveness of the C5.0 algorithm.
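The sampling and bootstrap steps named in the description can be illustrated with a minimal Python sketch. This is not the thesis's code: the function names, the choice of recall as the performance measure, and the percentile interval are all assumptions made for illustration, and SMOTE and the tree models themselves are omitted. It assumes at least one minority-class row exists.

```python
import random


def _split_indices(labels, minority):
    """Return (minority indices, majority indices) for a 2-class label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority, rng):
    """Duplicate random minority rows until both classes have equal counts."""
    minority_idx, majority_idx = _split_indices(labels, minority)
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority, rng):
    """Drop random majority rows until both classes have equal counts."""
    minority_idx, majority_idx = _split_indices(labels, minority)
    n_keep = min(len(minority_idx), len(majority_idx))
    keep = rng.sample(majority_idx, n_keep) + minority_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


def recall(y_true, y_pred, positive=1):
    """Share of true positives that were predicted positive (None if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else None


def bootstrap_interval(y_true, y_pred, metric, rng, n_boot=1000, alpha=0.05):
    """Percentile bootstrap interval for a metric computed on test predictions."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test cases with replacement and recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value is not None:  # skip resamples that contain no positive cases
            stats.append(value)
    if not stats:
        raise ValueError("metric was undefined on every bootstrap resample")
    stats.sort()
    m = len(stats)
    lower = stats[int((alpha / 2) * m)]
    upper = stats[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(42)
    labels = [1] * 5 + [0] * 95          # hypothetical 5:95 imbalance
    rows = list(range(len(labels)))      # stand-in feature rows
    _, y_over = random_oversample(rows, labels, minority=1, rng=rng)
    _, y_under = random_undersample(rows, labels, minority=1, rng=rng)
    print(len(y_over), len(y_under))
```

In a comparison like the one described, the resampling would be applied to the training data only, and the bootstrap interval would be computed from predictions on untouched test data, so that the reported variation reflects the original class distribution.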
author2 Shmueli, Galit
author_facet Shmueli, Galit
Chen, Hsing-Chun
陳幸君
author Chen, Hsing-Chun
陳幸君
author_sort Chen, Hsing-Chun
title Performance of tree algorithms on imbalanced data under different sampling strategies
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/3aqbms