SICE: an improved missing data imputation technique

Abstract In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and the utilization of these data is a major concern to the stakeh...

Bibliographic Details
Main Authors: Shahidul Islam Khan, Abu Sayed Md Latiful Hoque
Format: Article
Language: English
Published: SpringerOpen 2020-06-01
Series: Journal of Big Data
Subjects:
Online Access: http://link.springer.com/article/10.1186/s40537-020-00313-w
id doaj-6422474a094445f98f45cec877277047
record_format Article
spelling doaj-6422474a094445f98f45cec877277047; 2020-11-25T03:20:48Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2020-06-01; 7; 1; 1; 21; 10.1186/s40537-020-00313-w; SICE: an improved missing data imputation technique; Shahidul Islam Khan (Department of CSE, Bangladesh University of Engineering and Technology); Abu Sayed Md Latiful Hoque (Department of CSE, Bangladesh University of Engineering and Technology)
collection DOAJ
language English
format Article
sources DOAJ
author Shahidul Islam Khan
Abu Sayed Md Latiful Hoque
issn 2196-1115
description Abstract In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and the utilization of these data is a major concern to stakeholders, handling missing values efficiently becomes more important. In this paper, we propose a new technique for missing data imputation, which is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equation (MICE) algorithm in two variations, to impute categorical and numeric data. We also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh, maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with the existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.
topic Missing Data Imputation
Single Imputation
Multiple Imputation
MICE
Data Analytics