Metadata Extraction and Management in Data Lakes With GEMMS

In addition to volume and velocity, Big data is also characterized by its variety. Variety in structure and semantics requires new integration approaches which can resolve the integration challenges also for large volumes of data. Data lakes should reduce the upfront integration costs and provide a...

Bibliographic Details
Main Authors:	Christoph Quix, Rihan Hai, Ivan Vatov
Format:	Article
Language:	English
Published:	Riga Technical University 2016-12-01
Series:	Complex Systems Informatics and Modeling Quarterly
Subjects:	Metadata management; data integration; scientific data; metadata extraction; data lakes
Online Access:	https://csimq-journals.rtu.lv/article/view/1548

id	doaj-dbcd957f697246e3a8ee86a4882e8200
record_format	Article
spelling	doaj-dbcd957f697246e3a8ee86a4882e8200 | 2020-11-25T00:10:48Z | eng | Riga Technical University | Complex Systems Informatics and Modeling Quarterly | 2255-9922 | 2016-12-01 | 09678310.7250/csimq.2016-9.04974 | Metadata Extraction and Management in Data Lakes With GEMMS | Christoph Quix (0), Rihan Hai (1), Ivan Vatov (2) | (0) Fraunhofer-Institute for Applied Information Technology FIT, Schloss Birlinghoven 53754 Sankt Augustin; Databases and Information Systems, RWTH Aachen University, Templergraben 55, 52062 Aachen | (1) Databases and Information Systems, RWTH Aachen University, Templergraben 55, 52062 Aachen | (2) Databases and Information Systems, RWTH Aachen University, Templergraben 55, 52062 Aachen | In addition to volume and velocity, Big data is also characterized by its variety. Variety in structure and semantics requires new integration approaches which can resolve the integration challenges also for large volumes of data. Data lakes should reduce the upfront integration costs and provide a more flexible way for data integration and analysis, as source data is loaded in its original structure to the data lake repository. Some syntactic transformation might be applied to enable access to the data in one common repository; however, a deep semantic integration is done only after the initial loading of the data into the data lake. Thereby, data is easily made available and can be restructured, aggregated, and transformed as required by later applications. Metadata management is a crucial component in a data lake, as the source data needs to be described by metadata to capture its semantics. We developed a Generic and Extensible Metadata Management System for data lakes (called GEMMS) that aims at the automatic extraction of metadata from a wide variety of data sources. Furthermore, the metadata is managed in an extensible metamodel that distinguishes structural and semantical metadata. The use case applied for evaluation is from the life science domain where the data is often stored only in files which hinders data access and efficient querying. The GEMMS framework has been proven to be useful in this domain. Especially, the extensibility and flexibility of the framework are important, as data and metadata structures in scientific experiments cannot be defined a priori. | https://csimq-journals.rtu.lv/article/view/1548 | Metadata management; data integration; scientific data; metadata extraction; data lakes
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Christoph Quix Rihan Hai Ivan Vatov
spellingShingle	Christoph Quix Rihan Hai Ivan Vatov Metadata Extraction and Management in Data Lakes With GEMMS Complex Systems Informatics and Modeling Quarterly Metadata management data integration scientific data metadata extraction data lakes
author_facet	Christoph Quix Rihan Hai Ivan Vatov
author_sort	Christoph Quix
title	Metadata Extraction and Management in Data Lakes With GEMMS
title_short	Metadata Extraction and Management in Data Lakes With GEMMS
title_full	Metadata Extraction and Management in Data Lakes With GEMMS
title_fullStr	Metadata Extraction and Management in Data Lakes With GEMMS
title_full_unstemmed	Metadata Extraction and Management in Data Lakes With GEMMS
title_sort	metadata extraction and management in data lakes with gemms
publisher	Riga Technical University
series	Complex Systems Informatics and Modeling Quarterly
issn	2255-9922
publishDate	2016-12-01
description	In addition to volume and velocity, Big data is also characterized by its variety. Variety in structure and semantics requires new integration approaches which can resolve the integration challenges also for large volumes of data. Data lakes should reduce the upfront integration costs and provide a more flexible way for data integration and analysis, as source data is loaded in its original structure to the data lake repository. Some syntactic transformation might be applied to enable access to the data in one common repository; however, a deep semantic integration is done only after the initial loading of the data into the data lake. Thereby, data is easily made available and can be restructured, aggregated, and transformed as required by later applications. Metadata management is a crucial component in a data lake, as the source data needs to be described by metadata to capture its semantics. We developed a Generic and Extensible Metadata Management System for data lakes (called GEMMS) that aims at the automatic extraction of metadata from a wide variety of data sources. Furthermore, the metadata is managed in an extensible metamodel that distinguishes structural and semantical metadata. The use case applied for evaluation is from the life science domain where the data is often stored only in files which hinders data access and efficient querying. The GEMMS framework has been proven to be useful in this domain. Especially, the extensibility and flexibility of the framework are important, as data and metadata structures in scientific experiments cannot be defined a priori.
topic	Metadata management; data integration; scientific data; metadata extraction; data lakes
url	https://csimq-journals.rtu.lv/article/view/1548
work_keys_str_mv	AT christophquix metadataextractionandmanagementindatalakeswithgemms AT rihanhai metadataextractionandmanagementindatalakeswithgemms AT ivanvatov metadataextractionandmanagementindatalakeswithgemms
_version_	1725406940399403008
