A Methodology for Reliable Code Plagiarism Detection Using Complete and Language Agnostic Code Clone Classification

Bibliographic Details
Main Authors: Ankali, S.B. (Author), Parthiban, L. (Author)
Format: Article
Language: English
Published: Modern Education and Computer Science Press 2021
Subjects: Clone types; Code plagiarism; cosine similarity; functional tree; TF-IDF
Online Access: View Fulltext in Publisher (https://doi.org/10.5815/ijmecs.2021.03.04)
LEADER 02372nam a2200205Ia 4500
001 10.5815-ijmecs.2021.03.04
008 220510s2021 CNT 000 0 und d
020 |a 20750161 (ISSN) 
245 1 0 |a A Methodology for Reliable Code Plagiarism Detection Using Complete and Language Agnostic Code Clone Classification 
260 0 |b Modern Education and Computer Science Press  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.5815/ijmecs.2021.03.04 
520 3 |a Code clone detection plays a vital role in both industry and academia. The last three decades have produced more than 250 clone detection techniques, yet no single framework detects and classifies all 4 basic types of code clones with high precision. This gap in clone classification strongly affects universities and online learning platforms, which fail to validate the projects or coding assignments submitted online. In this paper, we propose a complete and language-agnostic technique to detect and classify all 4 clone types in C, C++, and Java programs. The method first generates the parse tree and then extracts the functional tree, eliminating the need for the preprocessing stage employed by previous clone detection techniques. The generated parse tree contains all the information necessary for detecting code clones. We employ TF-IDF cosine similarity to classify clone types. The proposed technique achieves a precision of 100% in detecting the first two types of clones and 98% precision in detecting type-3 and type-4 clones for small C, C++, and Java programs with an average line count of 5. It outperforms existing tree-based clone detection tools, with an average precision of 98.07% on C, C++, and Java programs crawled from GitHub with an average line count of 15. This indicates that the cosine similarity measure on the ANTLR functional tree accurately detects all 4 types of small clones and can act as a validation tool for identifying the learning level in submitted programming assignments. © 2021 MECS. 
650 0 4 |a Clone types 
650 0 4 |a Code plagiarism 
650 0 4 |a cosine similarity 
650 0 4 |a functional tree 
650 0 4 |a TF-IDF 
700 1 |a Ankali, S.B.  |e author 
700 1 |a Parthiban, L.  |e author 
773 |t International Journal of Modern Education and Computer Science
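
Illustrative note: the abstract describes comparing programs by TF-IDF cosine similarity over a tree representation. The following is a minimal sketch of that similarity measure only, not the authors' implementation. It assumes each program has already been reduced to a list of tree node labels; the labels shown, the function names tfidf_vectors and cosine, and the smoothed IDF formula are all hypothetical choices made for illustration, and no classification thresholds are implied.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per token list in docs."""
    n = len(docs)
    # Document frequency: in how many token lists each label occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Smoothed IDF (an assumption of this sketch) so that labels shared by
    # every document still carry a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (tf[t] / total) * idf[t] for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical node-label sequences standing in for functional trees.
tree_a = ["functionDefinition", "declaration", "assignment", "returnStatement"]
tree_b = ["functionDefinition", "declaration", "assignment", "returnStatement"]
tree_c = ["functionDefinition", "forStatement", "assignment", "returnStatement"]

vec_a, vec_b, vec_c = tfidf_vectors([tree_a, tree_b, tree_c])
sim_ab = cosine(vec_a, vec_b)  # identical label sequences
sim_ac = cosine(vec_a, vec_c)  # one structural difference
```

In this sketch, structurally identical label sequences score at the top of the range and a structural change lowers the score; how scores map to the four clone types is specified in the paper itself, not here.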