A Source Code Similarity Based on Siamese Neural Network

Finding similar code snippets is a fundamental task in software engineering. Several approaches to this task use statistical language models, which focus on the syntax and structure of code rather than the deep semantic information underlying it. In this paper, a Si...


Bibliographic Details
Main Authors: Chunli Xie, Xia Wang, Cheng Qian, Mengqi Wang
Format: Article
Language: English
Published: MDPI AG 2020-10-01
Series: Applied Sciences
Subjects: code similarity, word embedding, siamese neural networks
Online Access: https://www.mdpi.com/2076-3417/10/21/7519
id doaj-8ff1fb48f11640bd84b0b7a667e15745
record_format Article
spelling doaj-8ff1fb48f11640bd84b0b7a667e15745
2020-11-25T03:39:25Z
eng
MDPI AG
Applied Sciences
2076-3417
2020-10-01
10
7519
7519
10.3390/app10217519
A Source Code Similarity Based on Siamese Neural Network
Chunli Xie 0
Xia Wang 1
Cheng Qian 2
Mengqi Wang 3
Department of Computer Science & Technology, Jiangsu Normal University, Xuzhou 221116, China
Department of Computer Science & Technology, Jiangsu Normal University, Xuzhou 221116, China
Department of Computer Science & Technology, Jiangsu Normal University, Xuzhou 221116, China
Department of Computer Science & Technology, Jiangsu Normal University, Xuzhou 221116, China
Finding similar code snippets is a fundamental task in software engineering. Several approaches to this task use statistical language models, which focus on the syntax and structure of code rather than the deep semantic information underlying it. In this paper, a Siamese Neural Network is proposed that maps code into continuous-space vectors and tries to capture its semantic meaning. First, an unsupervised pre-training method models code snippets as a weighted series of word vectors, with the weights fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network model is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to measure the similarity score between pairs of code snippets. We have implemented our approach on a dataset of functionally similar code. The experimental results show that our method yields some improvement in performance over a single word embedding method.
https://www.mdpi.com/2076-3417/10/21/7519
code similarity
word embedding
siamese neural networks
collection DOAJ
language English
format Article
sources DOAJ
author Chunli Xie
Xia Wang
Cheng Qian
Mengqi Wang
spellingShingle Chunli Xie
Xia Wang
Cheng Qian
Mengqi Wang
A Source Code Similarity Based on Siamese Neural Network
Applied Sciences
code similarity
word embedding
siamese neural networks
author_facet Chunli Xie
Xia Wang
Cheng Qian
Mengqi Wang
author_sort Chunli Xie
title A Source Code Similarity Based on Siamese Neural Network
title_short A Source Code Similarity Based on Siamese Neural Network
title_full A Source Code Similarity Based on Siamese Neural Network
title_fullStr A Source Code Similarity Based on Siamese Neural Network
title_full_unstemmed A Source Code Similarity Based on Siamese Neural Network
title_sort source code similarity based on siamese neural network
publisher MDPI AG
series Applied Sciences
issn 2076-3417
publishDate 2020-10-01
description Finding similar code snippets is a fundamental task in software engineering. Several approaches to this task use statistical language models, which focus on the syntax and structure of code rather than the deep semantic information underlying it. In this paper, a Siamese Neural Network is proposed that maps code into continuous-space vectors and tries to capture its semantic meaning. First, an unsupervised pre-training method models code snippets as a weighted series of word vectors, with the weights fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network model is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to measure the similarity score between pairs of code snippets. We have implemented our approach on a dataset of functionally similar code. The experimental results show that our method yields some improvement in performance over a single word embedding method.
topic code similarity
word embedding
siamese neural networks
url https://www.mdpi.com/2076-3417/10/21/7519
work_keys_str_mv AT chunlixie asourcecodesimilaritybasedonsiameseneuralnetwork
AT xiawang asourcecodesimilaritybasedonsiameseneuralnetwork
AT chengqian asourcecodesimilaritybasedonsiameseneuralnetwork
AT mengqiwang asourcecodesimilaritybasedonsiameseneuralnetwork
AT chunlixie sourcecodesimilaritybasedonsiameseneuralnetwork
AT xiawang sourcecodesimilaritybasedonsiameseneuralnetwork
AT chengqian sourcecodesimilaritybasedonsiameseneuralnetwork
AT mengqiwang sourcecodesimilaritybasedonsiameseneuralnetwork
_version_ 1724538973212114944
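
The pipeline summarized in the description field (TF-IDF-weighted word vectors, a shared encoder applied to both snippets, then cosine similarity) can be illustrated with a small sketch. This is not the authors' implementation: the word vectors and encoder weights are made-up toy values rather than trained ones, the smoothed IDF formula is one common variant chosen for illustration, and every function name here is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: TF-IDF weight} dict per tokenized snippet.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return weights


def snippet_vector(token_weights, word_vectors):
    """TF-IDF-weighted average of the word vectors of one snippet."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, w in token_weights.items():
        vec = word_vectors.get(tok)
        if vec is None:  # skip tokens without a vector
            continue
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def shared_encoder(vec, weight_matrix):
    """One tanh layer. Using the SAME weights for both inputs is what
    makes the two branches 'Siamese'. The weights here are untrained."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec)))
            for row in weight_matrix]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data: three tokenized snippets and hypothetical 3-d word vectors.
SNIPPETS = [
    ["for", "i", "in", "range", "sum"],
    ["while", "i", "sum"],
    ["print", "hello"],
]
WORD_VECTORS = {
    "for": [0.9, 0.1, 0.0], "while": [0.8, 0.2, 0.1],
    "i": [0.1, 0.9, 0.0], "in": [0.2, 0.1, 0.1],
    "range": [0.3, 0.2, 0.7], "sum": [0.1, 0.3, 0.9],
    "print": [0.0, 0.1, 0.2], "hello": [0.5, 0.5, 0.5],
}
ENCODER_WEIGHTS = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # untrained, 2x3

WEIGHTS = tfidf_weights(SNIPPETS)


def similarity(i, j):
    """Similarity score between snippets i and j through the shared encoder."""
    enc_i = shared_encoder(snippet_vector(WEIGHTS[i], WORD_VECTORS), ENCODER_WEIGHTS)
    enc_j = shared_encoder(snippet_vector(WEIGHTS[j], WORD_VECTORS), ENCODER_WEIGHTS)
    return cosine_similarity(enc_i, enc_j)


print(similarity(0, 1), similarity(0, 2))
```

In a trained setting the encoder weights would be learned from labeled pairs of similar and dissimilar snippets, so that functionally similar code scores close to 1. The sketch only shows how the three stages connect.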