Code Cloning Habits Of The Jupyter Notebook Community

Code reuse has the benefits of saving time and resources but poses a risk when attempting to tailor copied code for a new purpose or in cases when such copies are buggy or otherwise faulty. In the field of data science, the web application Jupyter Notebook is a popular tool for creating computational n...


Bibliographic Details
Main Author: Sigvardsson, Ulf
Format: Others
Language: English
Published: Uppsala universitet, Institutionen för informationsteknologi 2019
Subjects:
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-396822
id ndltd-UPSALLA1-oai-DiVA.org-uu-396822
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-uu-396822 | 2019-11-11T22:06:33Z | Code Cloning Habits Of The Jupyter Notebook Community | eng | Sigvardsson, Ulf | Uppsala universitet, Institutionen för informationsteknologi | 2019 | Engineering and Technology | Teknik och teknologier | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-396822 | IT ; 19032 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Engineering and Technology
Teknik och teknologier
spellingShingle Engineering and Technology
Teknik och teknologier
Sigvardsson, Ulf
Code Cloning Habits Of The Jupyter Notebook Community
description Code reuse has the benefits of saving time and resources but poses a risk when attempting to tailor copied code for a new purpose or in cases when such copies are buggy or otherwise faulty. In the field of data science, the web application Jupyter Notebook is a popular tool for creating computational notebooks, documents containing both plain text and code snippets, many of which are publicly available on code hosting sites such as GitHub. This thesis describes the acquisition of approximately 2.6 million computational notebooks and analysis of this data set. By hashing the contents of every code snippet, using the MD5 hashing algorithm, cloned snippets were found through snippets producing identical hashes. By subsequently mapping the snippets to their corresponding notebooks, the relative originality of a notebook could be determined. This analysis shows that nearly 95% of notebooks are written in some version of Python. Furthermore, nearly 54% of notebooks in the data set are comprised of code blocks also found in other notebooks and, on average, approximately 70% of the code in any given notebook is copied from elsewhere.
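The clone-detection idea in the description (hash each snippet with MD5, treat identical hashes as clones, then map snippets back to notebooks) can be illustrated with a minimal sketch. This is not the thesis's actual implementation: the function names `snippet_hash` and `clone_fractions`, the toy notebook data, the choice to hash the raw snippet text without normalization, and the choice to measure originality as a per-snippet fraction are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def snippet_hash(source: str) -> str:
    """Return the MD5 hex digest of a snippet's raw text (no normalization, an assumption)."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def clone_fractions(notebooks: dict[str, list[str]]) -> dict[str, float]:
    """Map each notebook name to the fraction of its snippets that also occur in another notebook."""
    # Record which notebooks contain each snippet hash.
    owners = defaultdict(set)
    for name, snippets in notebooks.items():
        for source in snippets:
            owners[snippet_hash(source)].add(name)

    fractions = {}
    for name, snippets in notebooks.items():
        if not snippets:
            continue  # a notebook without code snippets has no defined fraction
        # A snippet counts as cloned when its hash is owned by more than one notebook.
        cloned = sum(1 for s in snippets if len(owners[snippet_hash(s)]) > 1)
        fractions[name] = cloned / len(snippets)
    return fractions


# Hypothetical toy data: notebook name -> list of code snippet sources.
notebooks = {
    "a.ipynb": ["import numpy as np", "x = np.arange(10)"],
    "b.ipynb": ["import numpy as np", "print('hello')"],
    "c.ipynb": ["y = 42"],
}
fractions = clone_fractions(notebooks)
```

In this toy data, only the shared import line is detected as a clone. Exact-hash matching of this kind finds only byte-identical snippets, so a copy that differs by a single space or renamed variable would not be counted.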
author Sigvardsson, Ulf
author_facet Sigvardsson, Ulf
author_sort Sigvardsson, Ulf
title Code Cloning Habits Of The Jupyter Notebook Community
title_short Code Cloning Habits Of The Jupyter Notebook Community
title_full Code Cloning Habits Of The Jupyter Notebook Community
title_fullStr Code Cloning Habits Of The Jupyter Notebook Community
title_full_unstemmed Code Cloning Habits Of The Jupyter Notebook Community
title_sort code cloning habits of the jupyter notebook community
publisher Uppsala universitet, Institutionen för informationsteknologi
publishDate 2019
url http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-396822
work_keys_str_mv AT sigvardssonulf codecloninghabitsofthejupyternotebookcommunity
_version_ 1719290213889474560