Influencing Factors in the Scalability of Distributed Stream Processing Jobs

More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in fou...

Bibliographic Details
Main Authors:	Giselle Van Dongen, Dirk Van Den Poel
Format:	Article
Language:	English
Published:	IEEE 2021-01-01
Series:	IEEE Access
Subjects:	Apache Spark; Structured Streaming; Apache Flink; Apache Kafka; Kafka Streams; distributed computing
Online Access:	https://ieeexplore.ieee.org/document/9507502/

id	doaj-f3b357c673c548f88e6bf6c226308a5f
record_format	Article
spelling	doaj-f3b357c673c548f88e6bf6c226308a5f | 2021-08-10T23:00:31Z | eng | IEEE | IEEE Access | 2169-3536 | 2021-01-01 | 9 | 109413 | 109431 | 10.1109/ACCESS.2021.3102645 | 9507502 | Influencing Factors in the Scalability of Distributed Stream Processing Jobs | Giselle Van Dongen (0) https://orcid.org/0000-0003-1605-724X | Dirk Van Den Poel (1) https://orcid.org/0000-0002-8676-8103 | Department of MIO/Data Analytics, Ghent University, Ghent, Belgium | Department of MIO/Data Analytics, Ghent University, Ghent, Belgium | More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in four popular frameworks: Flink, Kafka Streams, Spark Streaming, and Structured Streaming. Besides that, we determine the factors that influence the performance and efficiency of scaling processing jobs with distinct characteristics. We evaluate horizontal, as well as vertical scalability. Our results show how the scaling efficiency is impacted by many factors including the initial cluster layout and direction of scaling, the pipeline design, the framework design, resource allocation, and data characteristics. Finally, we give some recommendations on how practitioners should undertake to scale their clusters. | https://ieeexplore.ieee.org/document/9507502/ | Apache Spark; Structured Streaming; Apache Flink; Apache Kafka; Kafka Streams; distributed computing
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Giselle Van Dongen; Dirk Van Den Poel
spellingShingle	Giselle Van Dongen; Dirk Van Den Poel | Influencing Factors in the Scalability of Distributed Stream Processing Jobs | IEEE Access | Apache Spark; Structured Streaming; Apache Flink; Apache Kafka; Kafka Streams; distributed computing
author_facet	Giselle Van Dongen; Dirk Van Den Poel
author_sort	Giselle Van Dongen
title	Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_short	Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_full	Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_fullStr	Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_full_unstemmed	Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_sort	influencing factors in the scalability of distributed stream processing jobs
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2021-01-01
description	More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in four popular frameworks: Flink, Kafka Streams, Spark Streaming, and Structured Streaming. Besides that, we determine the factors that influence the performance and efficiency of scaling processing jobs with distinct characteristics. We evaluate horizontal, as well as vertical scalability. Our results show how the scaling efficiency is impacted by many factors including the initial cluster layout and direction of scaling, the pipeline design, the framework design, resource allocation, and data characteristics. Finally, we give some recommendations on how practitioners should undertake to scale their clusters.
topic	Apache Spark; Structured Streaming; Apache Flink; Apache Kafka; Kafka Streams; distributed computing
url	https://ieeexplore.ieee.org/document/9507502/
work_keys_str_mv	AT gisellevandongen influencingfactorsinthescalabilityofdistributedstreamprocessingjobs AT dirkvandenpoel influencingfactorsinthescalabilityofdistributedstreamprocessingjobs
_version_	1721211793892704256
