Evaluating the Open Source Data Containers for Handling Big Geospatial Raster Data

Big geospatial raster data pose a grand challenge to data management technologies for effective big data query and processing. To address these challenges, various big data container solutions have been developed or enhanced to facilitate data storage, retrieval, and analysis. Data containers were a...

Bibliographic Details
Main Authors: Fei Hu, Mengchao Xu, Jingchao Yang, Yanshou Liang, Kejin Cui, Michael M. Little, Christopher S. Lynnes, Daniel Q. Duffy, Chaowei Yang
Format: Article
Language: English
Published: MDPI AG 2018-04-01
Series: ISPRS International Journal of Geo-Information
Subjects:
GIS
Online Access: http://www.mdpi.com/2220-9964/7/4/144
id doaj-9aae74b79cf54feca0d3cb0663e325e5
record_format Article
spelling doaj-9aae74b79cf54feca0d3cb0663e325e5
Record timestamp: 2020-11-24T21:01:42Z
Language: eng
Publisher: MDPI AG
Journal: ISPRS International Journal of Geo-Information (ISSN 2220-9964)
Published: 2018-04-01, Volume 7, Issue 4, Article 144
DOI: 10.3390/ijgi7040144
Title: Evaluating the Open Source Data Containers for Handling Big Geospatial Raster Data
Authors and affiliations:
Fei Hu, Mengchao Xu, Jingchao Yang, Yanshou Liang, Kejin Cui, Chaowei Yang: NSF Spatiotemporal Innovation Center and Department of Geography and GeoInformation Science, George Mason University, Fairfax, VA 22030, USA
Michael M. Little, Christopher S. Lynnes, Daniel Q. Duffy: NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA
Keywords: big data; data container; geospatial raster data management; GIS
URL: http://www.mdpi.com/2220-9964/7/4/144
collection DOAJ
language English
format Article
sources DOAJ
author Fei Hu
Mengchao Xu
Jingchao Yang
Yanshou Liang
Kejin Cui
Michael M. Little
Christopher S. Lynnes
Daniel Q. Duffy
Chaowei Yang
title Evaluating the Open Source Data Containers for Handling Big Geospatial Raster Data
publisher MDPI AG
series ISPRS International Journal of Geo-Information
issn 2220-9964
publishDate 2018-04-01
description Big geospatial raster data pose a grand challenge to data management technologies for effective big data query and processing. To address these challenges, various big data container solutions have been developed or enhanced to facilitate data storage, retrieval, and analysis. Data containers were also developed or enhanced to handle geospatial data. For example, Rasdaman was developed to handle raster data, and GeoSpark/SpatialHadoop were enhanced from Spark/Hadoop to handle vector data. However, few studies have systematically compared and evaluated the features and performance of these popular data containers. This paper provides a comprehensive evaluation of six popular data containers (i.e., Rasdaman, SciDB, Spark, ClimateSpark, Hive, and MongoDB) for handling multi-dimensional, array-based geospatial raster datasets. Their architectures, technologies, capabilities, and performance are compared and evaluated from two perspectives: (a) system design and architecture (distributed architecture, logical data model, physical data model, and data operations); and (b) practical use experience and performance (data preprocessing, data uploading, query speed, and resource consumption). Four major conclusions are offered: (1) no data container except ClimateSpark has good support for the HDF data format used in this paper, so loading data requires time- and resource-consuming preprocessing; (2) SciDB, Rasdaman, and MongoDB handle queries over small to medium data volumes well, whereas Spark and ClimateSpark can handle large volumes of data with stable resource consumption; (3) SciDB and Rasdaman provide mature array-based data operations and analytical functions, while the others lack these functions; and (4) SciDB, Spark, and Hive have better support for user-defined functions (UDFs) to extend system capability.
topic big data
data container
geospatial raster data management
GIS
url http://www.mdpi.com/2220-9964/7/4/144