Learning Classification Models From Datasets with Block Missing

Master's === National Tsing Hua University === Institute of Service Science === 100 === The effectiveness of a classifier depends heavily on the quality of its training instances, owing to the nature of machine learning algorithms and data mining techniques. In the past, some values might be randomly missing in a training dataset due t...

Bibliographic Details
Main Authors: Tsai, Ya Hsun, 蔡亞勳
Other Authors: Wei, Chih-Ping
Format: Others
Language: en_US
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/22417161768041454901
id ndltd-TW-100NTHU5836009
record_format oai_dc
spelling ndltd-TW-100NTHU5836009 2015-10-13T21:22:42Z http://ndltd.ncl.edu.tw/handle/22417161768041454901 Learning Classification Models From Datasets with Block Missing Tsai, Ya Hsun 蔡亞勳 Master's, National Tsing Hua University, Institute of Service Science, 100 Wei, Chih-Ping Lin, Fu-Ren 魏志平 林福仁 2012 thesis 63 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Tsing Hua University === Institute of Service Science === 100 === The effectiveness of a classifier depends heavily on the quality of its training instances, owing to the nature of machine learning algorithms and data mining techniques. Traditionally, values in a training dataset may be missing at random because of personal privacy, confidentiality, or operational mistakes. Such missing values can be handled by existing methods, such as simple imputation, so that a classifier trained on the dataset still performs acceptably. However, owing to data sharing in today's evolving environment, another form of data incompleteness, block missing, has become a new challenge that cannot be solved in the same way as random missing. Block missing typically arises in a training dataset that grows over time or in a dataset integrated from different sources. In a dataset with block missing, some instances lack values for certain specific attributes. These may be new attributes that were never used before or attributes that are exclusive to one source. Preliminary experiments demonstrate that the common imputation methods for random missing are not applicable to block missing. To address this challenge, we propose two novel methods that consider the uncertainty of the imputed values and build the corresponding classifier models accordingly. Specifically, we first extract distortion information for each missing value, which describes the statistics of all its possible values. Following the concept of bagging, we then apply the proposed distortion-based bagging technique: multiple classifiers are built for the same prediction task, each from a different distorted training dataset whose missing values are filled in according to the corresponding distortion information. Finally, the result for a testing instance is obtained by majority vote over all the classifiers for this prediction task. A series of experiments is then performed on three different kinds of training datasets. The experimental results show that the two proposed methods are superior to the benchmark methods and achieve acceptable effectiveness, especially when block missing occurs in attributes with higher discriminative power for prediction.
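The distortion-based bagging idea summarized in the description can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not define "distortion information" precisely, so the sketch assumes it is simply the pool of observed values of each attribute, from which missing cells are sampled. The nearest-centroid base classifier, the toy data, and all function names are hypothetical choices made for illustration only.

```python
import random
from collections import Counter

def observed_values(rows, col):
    """Assumed stand-in for 'distortion information': the observed values of one attribute."""
    return [r[col] for r in rows if r[col] is not None]

def distorted_copy(rows, pools, rng):
    """Return a copy of the data where each missing cell is sampled from its attribute's pool."""
    return [
        [v if v is not None else rng.choice(pools[c]) for c, v in enumerate(r)]
        for r in rows
    ]

def nearest_centroid_fit(rows, labels):
    """Toy base classifier (an assumption): one mean vector per class."""
    counts = Counter(labels)
    sums = {}
    for r, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(r))
        for i, v in enumerate(r):
            acc[i] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def nearest_centroid_predict(model, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: dist(model[y]))

def fit_ensemble(rows, labels, n_models=15, seed=0):
    """Train one classifier per independently distorted (imputed) copy of the training data."""
    rng = random.Random(seed)
    pools = [observed_values(rows, c) for c in range(len(rows[0]))]
    return [
        nearest_centroid_fit(distorted_copy(rows, pools, rng), labels)
        for _ in range(n_models)
    ]

def predict(ensemble, x):
    """Combine the classifiers by majority vote."""
    votes = Counter(nearest_centroid_predict(m, x) for m in ensemble)
    return votes.most_common(1)[0][0]

# Toy data: the second attribute is missing for a whole block of instances,
# as if those rows came from a source that never recorded it.
rows = [
    [1.0, 1.1], [1.2, 0.9], [0.9, None], [1.1, None],
    [5.0, 5.2], [5.1, 4.8], [4.9, None], [5.2, None],
]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

ensemble = fit_ensemble(rows, labels)
print(predict(ensemble, [1.0, 1.0]), predict(ensemble, [5.0, 5.0]))
```

Unlike classical bagging, this sketch does not bootstrap-resample the instances; the diversity among classifiers comes only from the different sampled imputations, which is one possible reading of the description.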
author2 Wei, Chih-Ping
author_facet Wei, Chih-Ping
Tsai, Ya Hsun
蔡亞勳
author Tsai, Ya Hsun
蔡亞勳
title Learning Classification Models From Datasets with Block Missing
publishDate 2012
url http://ndltd.ncl.edu.tw/handle/22417161768041454901