Influencing Factors in the Scalability of Distributed Stream Processing Jobs

More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in fou...


Bibliographic Details
Main Authors: Giselle Van Dongen, Dirk Van Den Poel
Format: Article
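Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Apache Spark; Structured Streaming; Apache Flink; Apache Kafka; Kafka Streams; distributed computing
Online Access: https://ieeexplore.ieee.org/document/9507502/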
id doaj-f3b357c673c548f88e6bf6c226308a5f
record_format Article
spelling doaj-f3b357c673c548f88e6bf6c226308a5f
2021-08-10T23:00:31Z
eng
IEEE
IEEE Access
2169-3536
2021-01-01
9
109413
109431
10.1109/ACCESS.2021.3102645
9507502
Influencing Factors in the Scalability of Distributed Stream Processing Jobs
Giselle Van Dongen 0 https://orcid.org/0000-0003-1605-724X
Dirk Van Den Poel 1 https://orcid.org/0000-0002-8676-8103
Department of MIO/Data Analytics, Ghent University, Ghent, Belgium
Department of MIO/Data Analytics, Ghent University, Ghent, Belgium
More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in four popular frameworks: Flink, Kafka Streams, Spark Streaming, and Structured Streaming. Besides that, we determine the factors that influence the performance and efficiency of scaling processing jobs with distinct characteristics. We evaluate horizontal, as well as vertical scalability. Our results show how the scaling efficiency is impacted by many factors including the initial cluster layout and direction of scaling, the pipeline design, the framework design, resource allocation, and data characteristics. Finally, we give some recommendations on how practitioners should go about scaling their clusters.
https://ieeexplore.ieee.org/document/9507502/
Apache Spark
Structured Streaming
Apache Flink
Apache Kafka
Kafka Streams
distributed computing
collection DOAJ
language English
format Article
sources DOAJ
author Giselle Van Dongen
Dirk Van Den Poel
spellingShingle Giselle Van Dongen
Dirk Van Den Poel
Influencing Factors in the Scalability of Distributed Stream Processing Jobs
IEEE Access
Apache spark
structured streaming
apache flink
apache kafka
kafka streams
distributed computing
author_facet Giselle Van Dongen
Dirk Van Den Poel
author_sort Giselle Van Dongen
title Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_short Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_full Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_fullStr Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_full_unstemmed Influencing Factors in the Scalability of Distributed Stream Processing Jobs
title_sort influencing factors in the scalability of distributed stream processing jobs
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
description More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in four popular frameworks: Flink, Kafka Streams, Spark Streaming, and Structured Streaming. Besides that, we determine the factors that influence the performance and efficiency of scaling processing jobs with distinct characteristics. We evaluate horizontal, as well as vertical scalability. Our results show how the scaling efficiency is impacted by many factors including the initial cluster layout and direction of scaling, the pipeline design, the framework design, resource allocation, and data characteristics. Finally, we give some recommendations on how practitioners should go about scaling their clusters.
topic Apache Spark
Structured Streaming
Apache Flink
Apache Kafka
Kafka Streams
distributed computing
url https://ieeexplore.ieee.org/document/9507502/
work_keys_str_mv AT gisellevandongen influencingfactorsinthescalabilityofdistributedstreamprocessingjobs
AT dirkvandenpoel influencingfactorsinthescalabilityofdistributedstreamprocessingjobs
_version_ 1721211793892704256