Supervised Authorship Segmentation of Open Source Code Projects

Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work,...

Bibliographic Details
Main Authors: Dauber Edwin, Erbacher Robert, Shearer Gregory, Weisman Michael, Nelson Frederica, Greenstadt Rachel
Format: Article
Language: English
Published: Sciendo 2021-10-01
Series: Proceedings on Privacy Enhancing Technologies
Subjects: stylometry; code authorship attribution; segmentation
Online Access: https://doi.org/10.2478/popets-2021-0080
id doaj-3c0bba85faaa41fabef628feba717002
record_format Article
spelling doaj-3c0bba85faaa41fabef628feba717002 2021-09-05T14:01:12Z eng
Sciendo
Proceedings on Privacy Enhancing Technologies, ISSN 2299-0984, 2021-10-01, vol. 2021, no. 4, pp. 464-479
10.2478/popets-2021-0080
Supervised Authorship Segmentation of Open Source Code Projects
Dauber Edwin (Drexel University)
Erbacher Robert (United States Army Research Laboratory)
Shearer Gregory (ICF International)
Weisman Michael (United States Army Research Laboratory)
Nelson Frederica (United States Army Research Laboratory)
Greenstadt Rachel (New York University)
https://doi.org/10.2478/popets-2021-0080
stylometry; code authorship attribution; segmentation
collection DOAJ
language English
format Article
sources DOAJ
author Dauber Edwin
Erbacher Robert
Shearer Gregory
Weisman Michael
Nelson Frederica
Greenstadt Rachel
author_sort Dauber Edwin
title Supervised Authorship Segmentation of Open Source Code Projects
publisher Sciendo
series Proceedings on Privacy Enhancing Technologies
issn 2299-0984
publishDate 2021-10-01
description Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree and identifying on which edges of the AST primary authorship changes. We demonstrate that the former is a difficult problem but the latter is much easier. We also demonstrate methods by which we can leverage the easier problem to improve accuracy for the harder problem. We show that while identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. While our baseline accuracy for single AST nodes is 20.21% for the validation set and 35.66% for the test set, we present techniques by which we can increase this accuracy to 42.01% and 49.21%, respectively. We further present observations about collaborative code found on GitHub that may drive further research.
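The two problems named in the description can be illustrated with a minimal Python sketch. This is not the authors' method: it only restates the problem definitions on a toy tree whose author labels are already known. The `Node` class, the functions `annotate` and `change_edges`, the author names, and the choice to define "primary author" as the majority author of a subtree's nodes (ties broken alphabetically) are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node carrying a known (ground-truth) author label."""
    label: str
    author: str
    children: list = field(default_factory=list)


def annotate(node, primary, sizes):
    """Post-order walk recording each subtree's size and primary author.

    Assumption: the primary author is the author of most nodes in the
    subtree, with ties broken alphabetically.
    """
    counts = Counter({node.author: 1})
    for child in node.children:
        counts += annotate(child, primary, sizes)
    top = max(counts.values())
    primary[id(node)] = min(a for a, c in counts.items() if c == top)
    sizes[id(node)] = sum(counts.values())
    return counts


def change_edges(node, primary):
    """Yield (parent, child) labels for edges where primary authorship changes."""
    for child in node.children:
        if primary[id(child)] != primary[id(node)]:
            yield (node.label, child.label)
        yield from change_edges(child, primary)


# Toy tree: "alice" wrote the module and func_a, "bob" wrote func_b.
root = Node("module", "alice", [
    Node("func_a", "alice", [Node("call", "alice"), Node("assign", "alice")]),
    Node("func_b", "bob", [Node("loop", "bob"), Node("return", "bob")]),
])

primary, sizes = {}, {}
annotate(root, primary, sizes)
print(primary[id(root)], sizes[id(root)])
print(list(change_edges(root, primary)))
```

In this toy tree the only edge where primary authorship changes is the one from `module` to `func_b`. The subtree sizes collected in `sizes` correspond to the node-count thresholds (25 and 33 nodes) that the description reports accuracy against; a real system would have to predict the labels that are simply given here.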
topic stylometry
code authorship attribution
segmentation
url https://doi.org/10.2478/popets-2021-0080