Data Mining with Uncertain Data

碩士 === 國立高雄大學 === 電機工程學系碩士班 === 97 === Machine learning and data mining are two kinds of important techniques for extracting valuable information from datasets. Although current mining and learning technologies can handle large amounts of data, the rapid growth of datasets may cause some attribute v...

Full description

Bibliographic Details
Main Authors: Chih-Wei Wu, 吳志偉
Other Authors: Tzung-Pei Hong
Format: Others
Language:en_US
Published: 2009
Online Access:http://ndltd.ncl.edu.tw/handle/67289209055615848501
Description
Summary:碩士 === 國立高雄大學 === 電機工程學系碩士班 === 97 === Machine learning and data mining are two kinds of important techniques for extracting valuable information from datasets. Although current mining and learning technologies can handle large amounts of data, the rapid growth of datasets may cause some attribute values to be missed in the data-gathering process. Incomplete data are usually appropriately handled to improve the quality of the discovered information. Therefore, the problem of recovering missing values from a data set has become an important research issue in the field of data mining and machine learning. In this thesis, we first introduce an iterative missing-value completion method based on the RAR support values to extract useful association rules for inferring missing values in an iterative way. The proposed method can fully infer the missing attribute values by combining an iterative mechanism and data mining techniques. It consists of three phases. The first phase uses the association rules to roughly complete the missing values. The second phase iteratively reduces the minimum support to gather more association rules to complete the rest of missing values. The third phase uses the association rules from the completed dataset to correct the missing values that have been filled in. The proposed approach is then a little modified to consider the partial support values in deriving missing values. The second approach is a little better than the first one because the former uses more information (incomplete tuples) in guessing. Experimental results show both the proposed approaches have good accuracy and data recovery even when the missing-value rate is high.