Utility of Considering Multiple Alternative Rectifications in Data Cleaning

abstract: Most data cleaning systems aim to go from a given deterministic dirty database to another deterministic but clean database. Such an enterprise presupposes that it is in fact possible for the cleaning process to uniquely recover the clean versions of each dirty data tuple. This i...

Bibliographic Details
Other Authors: Rihan, Preet Inder Singh (Author)
Format: Dissertation
Language: English
Published: 2013
Subjects:
Online Access: http://hdl.handle.net/2286/R.I.18825
id ndltd-asu.edu-item-18825
record_format oai_dc
spelling ndltd-asu.edu-item-18825 2018-06-22T03:04:28Z Utility of Considering Multiple Alternative Rectifications in Data Cleaning
Dissertation/Thesis
Rihan, Preet Inder Singh (Author)
Kambhampati, Subbarao (Advisor)
Liu, Huan (Committee member)
Davulcu, Hasan (Committee member)
Arizona State University (Publisher)
Computer science
False Positive
Precision
Probabilistic Database
Probabilistic data cleaning
Recall
True Positive
eng
39 pages
M.S. Computer Science 2013
Masters Thesis
http://hdl.handle.net/2286/R.I.18825
http://rightsstatements.org/vocab/InC/1.0/
All Rights Reserved
2013
collection NDLTD
language English
format Dissertation
sources NDLTD
topic Computer science
False Positive
Precision
Probabilistic Database
Probabilistic data cleaning
Recall
True Positive
description abstract: Most data cleaning systems aim to go from a given deterministic dirty database to another deterministic but clean database. Such an enterprise presupposes that the cleaning process can uniquely recover the clean version of each dirty tuple. This is not possible in many cases, where the most a cleaning system can do is generate a (hopefully small) set of clean candidates for each dirty tuple. When the cleaning system is required to output a deterministic database, it is forced to pick one clean candidate (say, the "most likely" candidate) per tuple. Such an approach can lead to loss of information. For example, consider a situation where there are three equally likely clean candidates for a dirty tuple: picking any one of them discards two alternatives that are just as likely to be correct. An appealing alternative that avoids such information loss is to abandon the requirement that the output database be deterministic. In other words, even though the input (dirty) database is deterministic, I allow the reconstructed database to be probabilistic. Although such an approach does avoid the information loss, it also brings several challenges. First, how many alternatives should be kept per tuple in the reconstructed database? Maintaining too many alternatives increases the size of the reconstructed database, and hence the query processing time. Second, while processing queries on the probabilistic database may well increase recall, how does it affect the precision of query processing? In this thesis, I investigate these questions. My investigation is done in the context of a data cleaning system called BayesWipe, which can produce multiple clean candidates for each dirty tuple, along with the probability that each is the correct cleaned version. I represent these alternatives as tuples in a tuple-disjoint probabilistic database, and use the Mystiq system to process queries on it. This probabilistic reconstruction (called BayesWipe–PDB) is compared to a deterministic reconstruction (called BayesWipe–DET), in which the most likely clean candidate for each tuple is chosen and the rest of the alternatives are discarded.
=== Dissertation/Thesis === M.S. Computer Science 2013
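The trade-off described in the abstract can be illustrated with a small sketch. It does not use BayesWipe or Mystiq and is not the thesis's implementation: the candidate tuples, probabilities, ground truth, query, and all function names (`deterministic`, `top_k`, `select`, `precision_recall`) are hypothetical, chosen only to show how a deterministic reconstruction can miss an answer that a tuple-disjoint probabilistic reconstruction retains, and how a probability threshold might trade recall against precision.

```python
from fractions import Fraction

# Hypothetical cleaner output: dirty tuple id -> list of
# (candidate clean tuple, probability). All values are made up.
CANDIDATES = {
    1: [({"make": "Honda", "model": "Civic"}, Fraction(1, 3)),
        ({"make": "Honda", "model": "Accord"}, Fraction(1, 3)),
        ({"make": "Honda", "model": "Fit"}, Fraction(1, 3))],
    2: [({"make": "Toyota", "model": "Camry"}, Fraction(9, 10)),
        ({"make": "Toyota", "model": "Corolla"}, Fraction(1, 10))],
}
# Hypothetical ground truth, used only to score the answers.
TRUTH = {1: {"make": "Honda", "model": "Accord"},
         2: {"make": "Toyota", "model": "Camry"}}

def deterministic(candidates):
    """Keep only the most likely alternative, treated as certain.
    Ties are broken by list order (max returns the first maximum)."""
    return {tid: [(max(alts, key=lambda a: a[1])[0], Fraction(1))]
            for tid, alts in candidates.items()}

def top_k(candidates, k):
    """Keep the k most likely alternatives per tuple.
    Assumption: probabilities are kept as-is, not renormalized."""
    return {tid: sorted(alts, key=lambda a: a[1], reverse=True)[:k]
            for tid, alts in candidates.items()}

def select(db, predicate, threshold=Fraction(0)):
    """Selection query. Alternatives of one tuple are mutually exclusive,
    so the probability that the tuple matches is the sum over its
    matching alternatives. Only tuples above the threshold are returned."""
    answers = {}
    for tid, alts in db.items():
        p = sum((prob for row, prob in alts if predicate(row)), Fraction(0))
        if p > threshold:
            answers[tid] = p
    return answers

def precision_recall(answer_ids, relevant_ids):
    """Return (precision, recall); None where the ratio is undefined."""
    hits = len(set(answer_ids) & set(relevant_ids))
    precision = Fraction(hits, len(answer_ids)) if answer_ids else None
    recall = Fraction(hits, len(relevant_ids)) if relevant_ids else None
    return precision, recall

def wanted(row):
    # Hypothetical query: cars whose model is Accord or Corolla.
    return row["model"] in {"Accord", "Corolla"}

relevant = {tid for tid, row in TRUTH.items() if wanted(row)}
det_answers = select(deterministic(CANDIDATES), wanted)
pdb_answers = select(top_k(CANDIDATES, 3), wanted)
pdb_filtered = select(top_k(CANDIDATES, 3), wanted, threshold=Fraction(1, 5))

for name, ans in [("DET", det_answers), ("PDB", pdb_answers),
                  ("PDB, p > 1/5", pdb_filtered)]:
    print(name, dict(ans), precision_recall(ans, relevant))
```

In this toy data the deterministic reconstruction returns no answers, the unfiltered probabilistic reconstruction recovers the relevant tuple but also admits one false positive, and the thresholded variant keeps only the relevant tuple. The numbers are illustrative only and say nothing about the thesis's actual results.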
author2 Rihan, Preet Inder Singh (Author)
title Utility of Considering Multiple Alternative Rectifications in Data Cleaning
publishDate 2013
url http://hdl.handle.net/2286/R.I.18825
_version_ 1718700222750654464