Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model

Source Code Authorship Attribution (SCAA) aims to identify the real author of source code in a corpus. Although it poses a privacy threat to open-source programmers, it can be significantly helpful in developing forensic applications such as ghostwriting detection, copyright dispute settlement, and...

Bibliographic Details
Main Authors:	Farhan Ullah, Junfeng Wang, Sohail Jabbar, Fadi Al-Turjman, Mamoun Alazab
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Code authorship attribution; program dependence graph; deep learning; software forensics and security; software plagiarism
Online Access:	https://ieeexplore.ieee.org/document/8848478/

id	doaj-b161a55b2728426fa3f640a85bee4a7a
record_format	Article
spelling	doaj-b161a55b2728426fa3f640a85bee4a7a | 2021-03-29T23:54:45Z | eng | IEEE | IEEE Access | 2169-3536 | 2019-01-01 | 7 | 141987 | 141999 | 10.1109/ACCESS.2019.2943639 | 8848478 | Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model | Farhan Ullah (0) | Junfeng Wang (1), https://orcid.org/0000-0003-1699-2270 | Sohail Jabbar (2), https://orcid.org/0000-0002-2127-1235 | Fadi Al-Turjman (3), https://orcid.org/0000-0001-5418-873X | Mamoun Alazab (4), https://orcid.org/0000-0002-1928-3704 | (0) College of Computer Science, Sichuan University, Chengdu, China | (1) School of Aeronautics and Astronautics, College of Computer Science, Sichuan University, Chengdu, China | (2) Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, U.K. | (3) Artificial Intelligence Department, Near East University, Nicosia, Turkey | (4) College of Engineering, IT & Environment, Charles Darwin University, Casuarina, NT, Australia | https://ieeexplore.ieee.org/document/8848478/ | Code authorship attribution; program dependence graph; deep learning; software forensics and security; software plagiarism
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Farhan Ullah, Junfeng Wang, Sohail Jabbar, Fadi Al-Turjman, Mamoun Alazab
author_sort	Farhan Ullah
title	Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	Source Code Authorship Attribution (SCAA) aims to identify the real author of source code in a corpus. Although it poses a privacy threat to open-source programmers, it can be significantly helpful in developing forensic applications such as ghostwriting detection, copyright dispute settlement, and other code analysis applications. Efficient feature extraction is the key challenge in classifying the real authors of specific source code. In this paper, the Program Dependence Graph with Deep Learning (PDGDL) methodology is proposed to identify authors from source code written in different programming languages. First, the PDG is used to extract control and data dependencies from source code. Second, a preprocessing technique converts the PDG features into small instances with frequency details. Third, the Term Frequency-Inverse Document Frequency (TFIDF) technique is used to weight the importance of each PDG feature in the source code. Fourth, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to tackle the class imbalance problem. Finally, a deep learning algorithm is applied to extract coding-style features for each programmer and to attribute the real authors. The deep learning algorithm is further fine-tuned with a dropout layer, learning error rate, loss and activation functions, and dense layers for better accuracy. The proposed work is evaluated on data from 1000 programmers, collected from Google Code Jam (GCJ). The dataset covers three programming languages: C++, Java, and C#. The results outperform existing techniques in terms of classification accuracy, precision, recall, and F-measure.
topic	Code authorship attribution; program dependence graph; deep learning; software forensics and security; software plagiarism
url	https://ieeexplore.ieee.org/document/8848478/
_version_	1724188908373147648
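
The description names TFIDF as the step that weights PDG features per source file. As a rough illustration only, the sketch below computes a plain, unsmoothed TF-IDF over tokenised feature lists. The record does not give the paper's exact formula or feature encoding, so the function name `tfidf`, the feature strings such as `ctrl:if->assign`, and the choice of the unsmoothed variant are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {feature: weight} dict per document.

    docs is a list of feature lists (one list per source file).
    Uses the plain variant: tf = count / len(doc), idf = ln(N / df).
    This is an illustrative choice, not the formula from the paper.
    """
    n_docs = len(docs)

    # Document frequency: in how many files does each feature occur?
    df = Counter()
    for features in docs:
        df.update(set(features))

    weights = []
    for features in docs:
        counts = Counter(features)
        total = len(features)
        weights.append({
            feat: (count / total) * math.log(n_docs / df[feat])
            for feat, count in counts.items()
        })
    return weights


# Hypothetical PDG features: control ("ctrl") and data ("data") dependence edges.
programs = [
    ["ctrl:if->assign", "data:x->y", "data:x->y"],
    ["ctrl:if->assign", "ctrl:for->call"],
    ["ctrl:if->assign", "data:a->b"],
]
weights = tfidf(programs)
```

A feature that occurs in every file, like `ctrl:if->assign` here, gets weight zero, while a feature repeated within a single file is weighted up. That is the sense in which such a weighting could highlight author-specific dependence patterns.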

Similar Items