Learning Classification Models From Datasets with Block Missing

Master's === National Tsing Hua University === Institute of Service Science === 100 === The effectiveness of a classifier depends heavily on the quality of its training instances, owing to the nature of machine learning algorithms and data mining techniques. In the past, some values in a training dataset might be randomly missing due t...

Bibliographic Details
Main Authors:	Tsai, Ya Hsun, 蔡亞勳
Other Authors:	Wei, Chih-Ping
Format:	Others
Language:	en_US
Published:	2012
Online Access:	http://ndltd.ncl.edu.tw/handle/22417161768041454901

id	ndltd-TW-100NTHU5836009
record_format	oai_dc
spelling	ndltd-TW-100NTHU58360092015-10-13T21:22:42Z http://ndltd.ncl.edu.tw/handle/22417161768041454901 Learning Classification Models From Datasets with Block Missing Tsai, Ya Hsun 蔡亞勳 Master's, National Tsing Hua University, Institute of Service Science, 100 Wei, Chih-Ping Lin, Fu-Ren 魏志平 林福仁 2012 thesis 63 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's === National Tsing Hua University === Institute of Service Science === 100 === The effectiveness of a classifier depends heavily on the quality of its training instances, owing to the nature of machine learning algorithms and data mining techniques. Traditionally, some values in a training dataset may be randomly missing because of personal privacy, confidentiality, or operational mistakes. Such missing values can be handled by existing methods, such as simple imputation, so the performance of a classifier trained on a dataset with randomly missing values remains acceptable. However, as data sharing becomes more common, another form of data incompleteness, block missing, has emerged as a new challenge that cannot be solved by the methods used for random missing. Block missing usually arises in a training dataset that grows over time or in a dataset integrated from different sources. In a dataset with block missing, some instances lack values for certain specific attributes. These attributes may be new attributes that were never used before or attributes that are exclusive to one source. Preliminary experiments demonstrate that common imputation methods for handling random missing are not applicable to block missing. To address this challenge, we propose two novel methods that consider the uncertainty of the imputed values and build the corresponding classifier models accordingly. Specifically, we first extract distortion information for each missing value, which captures the statistics of all its possible values. Following the concept of bagging, we then adopt the proposed distortion-based bagging technique to build different classifiers for the same prediction task from different distorted training datasets, each of which fills in missing values according to the corresponding distortion information. Finally, the result for a testing instance is obtained by majority vote among all the classifiers for this prediction task. A series of experiments is then performed on three different kinds of training datasets. The experimental results show that our two proposed methods are superior to the benchmark methods and achieve acceptable effectiveness, especially when block missing occurs in attributes with higher discriminative power for prediction.
author2	Wei, Chih-Ping
author_facet	Wei, Chih-Ping Tsai, Ya Hsun 蔡亞勳
author	Tsai, Ya Hsun 蔡亞勳
spellingShingle	Tsai, Ya Hsun 蔡亞勳 Learning Classification Models From Datasets with Block Missing
author_sort	Tsai, Ya Hsun
title	Learning Classification Models From Datasets with Block Missing
title_short	Learning Classification Models From Datasets with Block Missing
title_full	Learning Classification Models From Datasets with Block Missing
title_fullStr	Learning Classification Models From Datasets with Block Missing
title_full_unstemmed	Learning Classification Models From Datasets with Block Missing
title_sort	learning classification models from datasets with block missing
publishDate	2012
url	http://ndltd.ncl.edu.tw/handle/22417161768041454901
work_keys_str_mv	AT tsaiyahsun learningclassificationmodelsfromdatasetswithblockmissing AT càiyàxūn learningclassificationmodelsfromdatasetswithblockmissing
_version_	1718062862618853376
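
Illustrative sketch (not part of the catalog record): the description outlines distortion-based bagging only at a high level, so the Python below is a minimal sketch under stated assumptions, not the thesis's implementation. It assumes categorical attributes, takes the "distortion information" of an attribute to be the empirical frequency of its observed values, uses a small naive Bayes base learner, and relies on every attribute having at least one observed value. All function names (value_distributions, fill_missing, train_nb, predict_nb, fit_ensemble, predict_ensemble) and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def value_distributions(rows):
    """Count observed values per attribute (assumed stand-in for distortion information)."""
    dist = defaultdict(Counter)
    for row in rows:
        for j, v in enumerate(row):
            if v is not None:
                dist[j][v] += 1
    return dist


def fill_missing(rows, dist, rng):
    """Copy rows, replacing each None with a value sampled from that attribute's counts."""
    filled = []
    for row in rows:
        new_row = list(row)
        for j, v in enumerate(row):
            if v is None:
                values = list(dist[j])
                weights = [dist[j][val] for val in values]
                new_row[j] = rng.choices(values, weights=weights)[0]
        filled.append(new_row)
    return filled


def train_nb(rows, labels):
    """Train a categorical naive Bayes model (assumed base learner)."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    domains = defaultdict(set)          # attribute index -> values seen in training
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feat_counts[(y, j)][v] += 1
            domains[j].add(v)
    return class_counts, feat_counts, domains


def predict_nb(model, row):
    """Return the label with the highest Laplace-smoothed log posterior."""
    class_counts, feat_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for y, n_y in class_counts.items():
        score = math.log(n_y / total)
        for j, v in enumerate(row):
            score += math.log((feat_counts[(y, j)][v] + 1) / (n_y + len(domains[j])))
        if score > best_score:
            best_label, best_score = y, score
    return best_label


def fit_ensemble(rows, labels, n_models=15, seed=0):
    """Train one classifier per independently filled ("distorted") copy of the data."""
    rng = random.Random(seed)
    dist = value_distributions(rows)
    return [train_nb(fill_missing(rows, dist, rng), labels) for _ in range(n_models)]


def predict_ensemble(models, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_nb(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: the first four instances come from a source that lacks attribute 1.
rows = [
    ["a", None], ["a", None], ["b", None], ["b", None],
    ["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"],
]
labels = ["pos", "pos", "neg", "neg", "pos", "pos", "neg", "neg"]

models = fit_ensemble(rows, labels)
print(predict_ensemble(models, ["a", "x"]))
```

In this sketch the diversity among classifiers comes only from the sampled fill-in values; no bootstrap resampling of instances is performed, which may differ from the procedure used in the thesis.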

Similar Items