Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding

Self-admitted technical debt (SATD) is annotated in source code comments by developers and has been recognized as a valuable source for discovering flawed software. To reduce manual effort, some recent studies have focused on automated detection of SATD using text classification methods. To train their...

Bibliographic Details
Main Authors: Jernej Flisar, Vili Podgorelec
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
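Subjects: Feature enhancement; feature selection; self-admitted technical debt; text classification; word embedding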
Online Access: https://ieeexplore.ieee.org/document/8790690/
id doaj-a82d78b33c0045a9ae4caae75381c3e0
record_format Article
spelling doaj-a82d78b33c0045a9ae4caae75381c3e0; 2021-04-05T17:07:12Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2019-01-01; vol. 7, pp. 106475-106494; DOI 10.1109/ACCESS.2019.2933318; 8790690; Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding; Jernej Flisar (0), Vili Podgorelec (1, https://orcid.org/0000-0001-6955-7868); Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia (both authors); https://ieeexplore.ieee.org/document/8790690/; Feature enhancement, feature selection, self-admitted technical debt, text classification, word embedding
collection DOAJ
language English
format Article
sources DOAJ
author Jernej Flisar
Vili Podgorelec
author_facet Jernej Flisar
Vili Podgorelec
author_sort Jernej Flisar
title Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding
title_sort identification of self-admitted technical debt using enhanced feature selection based on word embedding
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Self-admitted technical debt (SATD) is annotated in source code comments by developers and has been recognized as a valuable source for discovering flawed software. To reduce manual effort, some recent studies have focused on automated detection of SATD using text classification methods. To train their classifiers, these methods need labeled samples, which also require a lot of effort to obtain. We developed a new SATD identification method that takes advantage of a large corpus of unlabeled code comments, extracted from open source projects, to train a word embedding model. After feature selection is applied, the pre-trained word embedding is used to discover semantically similar features in source code comments and so enhance the original feature set. By using such an enhanced feature set for classification, our goal was to improve the identification of SATD compared to existing methods. The proposed feature enhancement method was used with the three most common feature selection methods (CHI, IG, and MI) and three well-known text classification algorithms (NB, SVM, and ME), and was tested on ten open source projects. The experimental results show a significant improvement in SATD identification over the compared methods. With 82% of SATD predictions correct, the proposed method appears to be a good candidate for adoption in practice.
topic Feature enhancement
feature selection
self-admitted technical debt
text classification
word embedding
url https://ieeexplore.ieee.org/document/8790690/
_version_ 1721540283359821824
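
The description field in this record outlines a two-step idea: select features from labeled comments (for example with CHI), then add terms that a word embedding places close to the selected ones. The following is only a minimal sketch of that idea under stated assumptions, not the authors' implementation. The toy comments, the hand-written two-dimensional vectors, the term "kludge", the function names, and the 0.9 similarity threshold are all hypothetical; the article itself trains the embedding on a large corpus of unlabeled code comments.

```python
import math
from collections import Counter


def chi_square_scores(docs, labels):
    """Chi-square score of each term against the SATD label (binary presence)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            (pos_df if y else neg_df)[term] += 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # SATD comments containing the term
        b = neg_df[term]   # non-SATD comments containing the term
        c = n_pos - a      # SATD comments without the term
        d = n_neg - b      # non-SATD comments without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def enhance(selected, embedding, threshold=0.9):
    """Add every embedded term whose vector is close to a selected feature."""
    enhanced = set(selected)
    for term in selected:
        if term not in embedding:
            continue
        for candidate, vector in embedding.items():
            if cosine(embedding[term], vector) >= threshold:
                enhanced.add(candidate)
    return enhanced


# Hypothetical toy data: tokenized comments, 1 = SATD, 0 = not SATD.
docs = [
    ["todo", "fix", "the", "hack"],
    ["hack", "workaround", "for", "bug"],
    ["returns", "the", "user", "name"],
    ["loop", "over", "the", "items"],
]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
selected = sorted(scores, key=scores.get, reverse=True)[:1]

# Hypothetical stand-in for a pre-trained word embedding.
embedding = {"hack": [1.0, 0.0], "kludge": [0.9, 0.1], "items": [0.0, 1.0]}
enhanced = enhance(selected, embedding)
```

In this toy setting the single selected feature is "hack", and the enhancement step adds "kludge" because its vector lies close to that of "hack". A classifier such as NB, SVM, or ME would then be trained on the enhanced feature set.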