Building Data Civilizer Pipelines with an Advanced Workflow Engine

Bibliographic Details
Main Authors: Mansour, Essam (Author), Deng, Dong (Author), Castro Fernandez, Raul (Author), Qahtan, Abdulhakim A. (Author), Tao, Wenbo (Author), Abedjan, Ziawasch (Author), Elmagarmid, Ahmed (Author), Ilyas, Ihab F. (Author), Madden, Samuel R (Author), Ouzzani, Mourad (Author), Stonebraker, Michael (Author), Tang, Nan (Author)
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory (Contributor)
Format: Article
Language: English
Published: IEEE, 2022-01-07T16:09:25Z.
Description
Summary: © 2018 IEEE. In order for an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-to-end data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute, and retrofit data preparation pipelines of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.