Utility of Considering Multiple Alternative Rectifications in Data Cleaning

abstract: Most data cleaning systems aim to go from a given deterministic dirty database to another deterministic but clean database. Such an enterprise presupposes that it is in fact possible for the cleaning process to uniquely recover the clean versions of each dirty data tuple. This i...

Bibliographic Details
Other Authors:	Rihan, Preet Inder Singh (Author)
Format:	Dissertation
Language:	English
Published:	2013
Subjects:	Computer science; False Positive; Precision; Probabilistic Database; Probabilistic data cleaning; Recall; True Positive
Online Access:	http://hdl.handle.net/2286/R.I.18825

id	ndltd-asu.edu-item-18825
record_format	oai_dc
spelling	ndltd-asu.edu-item-18825 2018-06-22T03:04:28Z Utility of Considering Multiple Alternative Rectifications in Data Cleaning Dissertation/Thesis Rihan, Preet Inder Singh (Author) Kambhampati, Subbarao (Advisor) Liu, Huan (Committee member) Davulcu, Hasan (Committee member) Arizona State University (Publisher) Computer science False Positive Precision Probabilistic Database Probabilistic data cleaning Recall True Positive eng 39 pages M.S. Computer Science 2013 Masters Thesis http://hdl.handle.net/2286/R.I.18825 http://rightsstatements.org/vocab/InC/1.0/ All Rights Reserved 2013
collection	NDLTD
language	English
format	Dissertation
sources	NDLTD
topic	Computer science; False Positive; Precision; Probabilistic Database; Probabilistic data cleaning; Recall; True Positive
spellingShingle	Computer science False Positive Precision Probabilistic Database Probabilistic data cleaning Recall True Positive Utility of Considering Multiple Alternative Rectifications in Data Cleaning
description	abstract: Most data cleaning systems aim to go from a given deterministic dirty database to another deterministic but clean database. Such an enterprise presupposes that the cleaning process can uniquely recover the clean version of each dirty tuple. This is not possible in many cases, where the most a cleaning system can do is to generate a (hopefully small) set of clean candidates for each dirty tuple. When the cleaning system is required to output a deterministic database, it is forced to pick one clean candidate (say, the "most likely" candidate) per tuple. Such an approach can lead to loss of information. For example, consider a situation where there are three equally likely clean candidates of a dirty tuple: whichever one is kept, the other two, each just as likely to be correct, are discarded. An appealing alternative that avoids such information loss is to abandon the requirement that the output database be deterministic. In other words, even though the input (dirty) database is deterministic, I allow the reconstructed database to be probabilistic. Although such an approach does avoid the information loss, it also brings forth several challenges. First, how many alternatives should be kept per tuple in the reconstructed database? Maintaining too many alternatives increases the size of the reconstructed database, and hence the query processing time. Second, while processing queries on the probabilistic database may well increase recall, how does it affect the precision of the query results? In this thesis, I investigate these questions. My investigation is done in the context of a data cleaning system called BayesWipe, which can produce multiple clean candidates for each dirty tuple, along with the probability that each candidate is the correct cleaned version. I represent these alternatives as tuples in a tuple-disjoint probabilistic database, and use the Mystiq system to process queries on it. This probabilistic reconstruction (called BayesWipe–PDB) is compared to a deterministic reconstruction (called BayesWipe–DET), in which the most likely clean candidate for each tuple is chosen and the rest of the alternatives are discarded. === Dissertation/Thesis === M.S. Computer Science 2013
author2	Rihan, Preet Inder Singh (Author)
author_facet	Rihan, Preet Inder Singh (Author)
title	Utility of Considering Multiple Alternative Rectifications in Data Cleaning
title_short	Utility of Considering Multiple Alternative Rectifications in Data Cleaning
title_full	Utility of Considering Multiple Alternative Rectifications in Data Cleaning
title_fullStr	Utility of Considering Multiple Alternative Rectifications in Data Cleaning
title_full_unstemmed	Utility of Considering Multiple Alternative Rectifications in Data Cleaning
title_sort	utility of considering multiple alternative rectifications in data cleaning
publishDate	2013
url	http://hdl.handle.net/2286/R.I.18825
_version_	1718700222750654464
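
The contrast the abstract draws between a deterministic and a tuple-disjoint probabilistic reconstruction can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the car data, the candidate probabilities, and the function names (`keep_top_k`, `most_likely`, `query_det`, `query_pdb`) are hypothetical and are not the BayesWipe or Mystiq interfaces.

```python
# Hypothetical cleaner output: for each dirty tuple id, a list of
# (clean candidate, probability) pairs. The alternatives of one dirty tuple
# are mutually exclusive, which is the "tuple-disjoint" assumption.
CANDIDATES = {
    1: [({"make": "Honda", "model": "Civic"}, 1 / 3),
        ({"make": "Honda", "model": "Accord"}, 1 / 3),
        ({"make": "Hyundai", "model": "Accent"}, 1 / 3)],
    2: [({"make": "Toyota", "model": "Camry"}, 0.9),
        ({"make": "Toyota", "model": "Corolla"}, 0.1)],
}


def keep_top_k(cands, k):
    """Keep at most k alternatives per dirty tuple, most probable first."""
    return {
        tid: sorted(alts, key=lambda a: a[1], reverse=True)[:k]
        for tid, alts in cands.items()
    }


def most_likely(cands):
    """Deterministic reconstruction: one candidate per tuple (ties: first listed)."""
    return {tid: alts[0][0] for tid, alts in keep_top_k(cands, 1).items()}


def query_det(db, pred):
    """Selection over the deterministic database: a tuple is in or out."""
    return {tid for tid, row in db.items() if pred(row)}


def query_pdb(cands, pred):
    """Selection over the probabilistic database.

    Because alternatives of one tuple are disjoint, the probability that the
    tuple satisfies the predicate is the sum over its matching alternatives.
    """
    result = {}
    for tid, alts in cands.items():
        p = sum(prob for row, prob in alts if pred(row))
        if p > 0:
            result[tid] = p
    return result


def is_accord(row):
    return row["model"] == "Accord"


det_answer = query_det(most_likely(CANDIDATES), is_accord)
pdb_answer = query_pdb(CANDIDATES, is_accord)
```

With this toy data, the deterministic reconstruction keeps "Civic" for tuple 1, so the query for Accords returns nothing, while the probabilistic reconstruction returns tuple 1 with probability 1/3. That answer may raise recall, but it is more likely wrong than right, which is the precision question the abstract raises. The `k` in `keep_top_k` corresponds to the abstract's question of how many alternatives to keep per tuple.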

Similar Items