Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding
Self-admitted technical debt (SATD) is annotated in source code comments by developers and has been recognized as a great source of discovering flawed software. To reduce manual effort, some recent studies have focused on automated detection of SATD using text classification methods. To train their...
Main Authors: | Jernej Flisar, Vili Podgorelec |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2019-01-01 |
Series: | IEEE Access |
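Subjects: | Feature enhancement; feature selection; self-admitted technical debt; text classification; word embedding |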
Online Access: | https://ieeexplore.ieee.org/document/8790690/ |
id |
doaj-a82d78b33c0045a9ae4caae75381c3e0 |
---|---|
record_format |
Article |
spelling |
doaj-a82d78b33c0045a9ae4caae75381c3e0
2021-04-05T17:07:12Z
eng
IEEE
IEEE Access
2169-3536
2019-01-01
7
106475
106494
10.1109/ACCESS.2019.2933318
8790690
Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding
Jernej Flisar (0)
Vili Podgorelec (1), https://orcid.org/0000-0001-6955-7868
(0) Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
(1) Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
Self-admitted technical debt (SATD) is annotated in source code comments by developers and has been recognized as a great source of discovering flawed software. To reduce manual effort, some recent studies have focused on automated detection of SATD using text classification methods. To train their classifier, these methods need labeled samples, which also require a lot of effort to obtain. We developed a new SATD identification method, which takes advantage of a large corpus of unlabeled code comments, extracted from open source projects, to train a word embedding model. After applying feature selection, the pre-trained word embedding is used for discovering semantically similar features in source code comments to enhance the original feature set. By using such enhanced feature set for classification, our goal was to improve the identification of SATD when compared to existing methods. The proposed feature enhancement method was used with the three most common feature selection methods (CHI, IG, and MI), and three well-known text classification algorithms (NB, SVM, and ME) and was tested on ten open source projects. The experimental results show a significant improvement in SATD identification over the compared methods. With an achieved 82% of correct predictions of SATD, the proposed method seems to be a good candidate to be adopted in practice.
https://ieeexplore.ieee.org/document/8790690/
Feature enhancement; feature selection; self-admitted technical debt; text classification; word embedding |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jernej Flisar; Vili Podgorelec |
title |
Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2019-01-01 |
description |
Self-admitted technical debt (SATD) is annotated in source code comments by developers and has been recognized as a great source of discovering flawed software. To reduce manual effort, some recent studies have focused on automated detection of SATD using text classification methods. To train their classifier, these methods need labeled samples, which also require a lot of effort to obtain. We developed a new SATD identification method, which takes advantage of a large corpus of unlabeled code comments, extracted from open source projects, to train a word embedding model. After applying feature selection, the pre-trained word embedding is used for discovering semantically similar features in source code comments to enhance the original feature set. By using such enhanced feature set for classification, our goal was to improve the identification of SATD when compared to existing methods. The proposed feature enhancement method was used with the three most common feature selection methods (CHI, IG, and MI), and three well-known text classification algorithms (NB, SVM, and ME) and was tested on ten open source projects. The experimental results show a significant improvement in SATD identification over the compared methods. With an achieved 82% of correct predictions of SATD, the proposed method seems to be a good candidate to be adopted in practice. |
topic |
Feature enhancement; feature selection; self-admitted technical debt; text classification; word embedding |
url |
https://ieeexplore.ieee.org/document/8790690/ |
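The abstract in this record describes selecting features from labeled comments and then enlarging that feature set with semantically similar words found through a pre-trained word embedding. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes chi-square (CHI) as the selection method and cosine similarity for finding neighbours. The function names (`chi2_scores`, `select_top_k`, `enhance`), the `top_n` and `min_sim` parameters, the four toy comments, and the hand-written two-dimensional vectors are all hypothetical stand-ins. In particular, the vectors stand in for an embedding that would be trained on a large corpus of unlabeled code comments.

```python
import math
from collections import Counter


def chi2_scores(docs, labels):
    """Chi-square score of every term against a binary label (1 = SATD)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        target = df_pos if y == 1 else df_neg
        target.update(set(tokens))  # document frequency, not term frequency
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]  # SATD comments containing the term
        b = df_neg[term]  # non-SATD comments containing the term
        c = n_pos - a     # SATD comments without the term
        d = n_neg - b     # non-SATD comments without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def enhance(selected, embedding, top_n=2, min_sim=0.9):
    """Add up to top_n sufficiently similar embedding neighbours per term."""
    enhanced = set(selected)
    for term in selected:
        if term not in embedding:
            continue
        neighbours = sorted(
            ((cosine(embedding[term], vec), word)
             for word, vec in embedding.items() if word != term),
            reverse=True,
        )
        enhanced.update(w for sim, w in neighbours[:top_n] if sim >= min_sim)
    return enhanced


# Toy labeled comments (hypothetical): 1 = SATD, 0 = not SATD.
docs = [
    ["todo", "fix", "the", "hack"],
    ["hack", "temporary", "workaround"],
    ["returns", "the", "user", "name"],
    ["initialize", "the", "parser"],
]
labels = [1, 1, 0, 0]

# Hand-made vectors (hypothetical) standing in for a pre-trained embedding.
embedding = {
    "hack": [0.9, 0.1],
    "kludge": [0.85, 0.15],
    "workaround": [0.8, 0.3],
    "parser": [0.1, 0.9],
}

selected = select_top_k(chi2_scores(docs, labels), k=1)
enhanced = enhance(selected, embedding)
print(selected)          # ['hack']
print(sorted(enhanced))  # ['hack', 'kludge', 'workaround']
```

In this toy run, "kludge" never occurs in the labeled comments, yet it enters the feature set because its vector lies close to that of "hack". That is the kind of gain the abstract attributes to using unlabeled comments. The enhanced set would then be passed to a text classifier such as the NB, SVM, or ME models named in the abstract.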