Apache Spark usage and deployment models for scientific computing

This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis wit...

Bibliographic Details
Main Authors:	Castro Diogo, Kothuri Prasanth, Mrowczynski Piotr, Piparo Danilo, Tejedor Enric
Format:	Article
Language:	English
Published:	EDP Sciences 2019-01-01
Series:	EPJ Web of Conferences
Online Access:	https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_07020.pdf

id	doaj-bfd442ee723444539fe08cb8390edb7a
record_format	Article
spelling	doaj-bfd442ee723444539fe08cb8390edb7a; 2021-08-02T09:59:50Z; eng; EDP Sciences; EPJ Web of Conferences; 2100-014X; 2019-01-01; 214; 07020; 10.1051/epjconf/201921407020; epjconf_chep2018_07020; Apache Spark usage and deployment models for scientific computing; Castro Diogo; Kothuri Prasanth; Mrowczynski Piotr; Piparo Danilo; Tejedor Enric; https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_07020.pdf
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Castro Diogo Kothuri Prasanth Mrowczynski Piotr Piparo Danilo Tejedor Enric
author_sort	Castro Diogo
title	Apache Spark usage and deployment models for scientific computing
publisher	EDP Sciences
series	EPJ Web of Conferences
issn	2100-014X
publishDate	2019-01-01
description	This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis with big data technologies. Among many frameworks, Apache Spark is currently getting the most traction from various user communities, and new ways to deploy Spark, such as Apache Mesos or Spark on Kubernetes, have started to evolve rapidly. Meanwhile, notebook web applications such as Jupyter offer the ability to perform interactive data analytics and visualizations without the need to install additional software. CERN already provides a web platform, called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of notebooks, seamlessly accessing the data and software they need. The first part of the presentation covers several recent integrations and optimizations to the Apache Spark computing platform that enable HEP data processing and CERN accelerator logging system analytics. These include, but are not limited to, access to Kerberized resources, an XRootD connector enabling remote access to EOS storage, and integration with SWAN for interactive data analysis, thus forming a truly Unified Analytics Platform. The second part of the talk covers the evolution of the Apache Spark data analytics platform, in particular the recent work done to run Spark on Kubernetes on the virtualized and container-based infrastructure in OpenStack. This deployment model allows for elastic scaling of data analytics workloads, enabling efficient, on-demand utilization of resources in private or public clouds.
url	https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_07020.pdf

Similar Items