Supervised Authorship Segmentation of Open Source Code Projects
Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work,...
Main Authors: | Dauber Edwin, Erbacher Robert, Shearer Gregory, Weisman Michael, Nelson Frederica, Greenstadt Rachel |
---|---|
Format: | Article |
Language: | English |
Published: | Sciendo 2021-10-01 |
Series: | Proceedings on Privacy Enhancing Technologies |
Subjects: | stylometry; code authorship attribution; segmentation |
Online Access: | https://doi.org/10.2478/popets-2021-0080 |
id |
doaj-3c0bba85faaa41fabef628feba717002 |
---|---|
record_format |
Article |
spelling |
doaj-3c0bba85faaa41fabef628feba717002; 2021-09-05T14:01:12Z; eng; Sciendo; Proceedings on Privacy Enhancing Technologies; ISSN 2299-0984; 2021-10-01; volume 2021, issue 4, pages 464-479; 10.2478/popets-2021-0080; Supervised Authorship Segmentation of Open Source Code Projects; Dauber Edwin (0), Erbacher Robert (1), Shearer Gregory (2), Weisman Michael (3), Nelson Frederica (4), Greenstadt Rachel (5); (0) Drexel University, (1) United States Army Research Laboratory, (2) ICF International, (3) United States Army Research Laboratory, (4) United States Army Research Laboratory, (5) New York University; Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree and identifying on which edges of the AST primary authorship changes. We demonstrate that the former is a difficult problem but the latter is much easier. We also demonstrate methods by which we can leverage the easier problem to improve accuracy for the harder problem. We show that while identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. While our baseline accuracy for single AST nodes is 20.21% for the validation set and 35.66% for the test set, we present techniques by which we can increase this accuracy to 42.01% and 49.21% respectively. We further present observations about collaborative code found on GitHub that may drive further research.; https://doi.org/10.2478/popets-2021-0080; stylometry; code authorship attribution; segmentation |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Dauber Edwin; Erbacher Robert; Shearer Gregory; Weisman Michael; Nelson Frederica; Greenstadt Rachel |
spellingShingle |
Dauber Edwin Erbacher Robert Shearer Gregory Weisman Michael Nelson Frederica Greenstadt Rachel Supervised Authorship Segmentation of Open Source Code Projects Proceedings on Privacy Enhancing Technologies stylometry code authorship attribution segmentation |
author_facet |
Dauber Edwin; Erbacher Robert; Shearer Gregory; Weisman Michael; Nelson Frederica; Greenstadt Rachel |
author_sort |
Dauber Edwin |
title |
Supervised Authorship Segmentation of Open Source Code Projects |
publisher |
Sciendo |
series |
Proceedings on Privacy Enhancing Technologies |
issn |
2299-0984 |
publishDate |
2021-10-01 |
description |
Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree and identifying on which edges of the AST primary authorship changes. We demonstrate that the former is a difficult problem but the latter is much easier. We also demonstrate methods by which we can leverage the easier problem to improve accuracy for the harder problem. We show that while identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. While our baseline accuracy for single AST nodes is 20.21% for the validation set and 35.66% for the test set, we present techniques by which we can increase this accuracy to 42.01% and 49.21% respectively. We further present observations about collaborative code found on GitHub that may drive further research. |
topic |
stylometry; code authorship attribution; segmentation |
url |
https://doi.org/10.2478/popets-2021-0080 |
_version_ |
1717810590516248576 |
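
The two problems named in the abstract (the primary author of an AST subtree, and the AST edges on which primary authorship changes) can be made concrete with a small sketch. This is not the article's classifier or feature set; it only builds the labels the two problems are defined over, assuming a hypothetical per-line author map (for example, derived from blame data) and Python's standard `ast` module as the parser. The names `label_subtrees`, `SOURCE`, and `LINE_AUTHORS` are illustrative and do not come from the article.

```python
import ast
from collections import Counter


def label_subtrees(source, line_authors):
    """Label every AST subtree with its primary author and node count.

    `line_authors` is a hypothetical {line_number: author} map (an assumption
    of this sketch). A subtree's primary author is taken to be the author who
    wrote the most nodes in it; nodes without a line number count for nobody.
    """
    tree = ast.parse(source)
    labels = {}  # node -> (primary_author or None, subtree_size)

    def visit(node):
        counts = Counter()
        lineno = getattr(node, "lineno", None)
        if lineno in line_authors:
            counts[line_authors[lineno]] += 1
        size = 1
        for child in ast.iter_child_nodes(node):
            child_counts, child_size = visit(child)
            counts += child_counts
            size += child_size
        primary = counts.most_common(1)[0][0] if counts else None
        labels[node] = (primary, size)
        return counts, size

    visit(tree)

    # An edge is a "change edge" when parent and child subtrees both have a
    # primary author and those authors differ.
    change_edges = [
        (parent, child)
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
        if labels[parent][0] is not None
        and labels[child][0] is not None
        and labels[parent][0] != labels[child][0]
    ]
    return tree, labels, change_edges


# Toy input: one function per (made-up) author.
SOURCE = (
    "def f(x):\n"
    "    y = x + 1\n"
    "    return y\n"
    "\n"
    "def g(z):\n"
    "    return z * 2\n"
)
LINE_AUTHORS = {1: "alice", 2: "alice", 3: "alice", 5: "bob", 6: "bob"}

if __name__ == "__main__":
    tree, labels, change_edges = label_subtrees(SOURCE, LINE_AUTHORS)
    for parent, child in change_edges:
        print(type(parent).__name__, "->", type(child).__name__,
              labels[parent][0], "->", labels[child][0])
```

Because each label carries the subtree size, the labels can be filtered by node count (for example, `size >= 25`), which is the dimension along which the abstract reports accuracy.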