Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding

Self-admitted technical debt (SATD) is annotated in source code comments by developers and has been recognized as a great source of discovering flawed software. To reduce manual effort, some recent studies have focused on automated detection of SATD using text classification methods. To train their...

Bibliographic Details
Main Authors:	Jernej Flisar, Vili Podgorelec
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Feature enhancement; feature selection; self-admitted technical debt; text classification; word embedding
Online Access:	https://ieeexplore.ieee.org/document/8790690/

id	doaj-a82d78b33c0045a9ae4caae75381c3e0
record_format	Article
spelling	doaj-a82d78b33c0045a9ae4caae75381c3e0; 2021-04-05T17:07:12Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 106475; 106494; 10.1109/ACCESS.2019.2933318; 8790690; Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding; Jernej Flisar [0]; Vili Podgorelec [1]; https://orcid.org/0000-0001-6955-7868; Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia; Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia; https://ieeexplore.ieee.org/document/8790690/; Feature enhancement; feature selection; self-admitted technical debt; text classification; word embedding
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Jernej Flisar; Vili Podgorelec
author_sort	Jernej Flisar
title	Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	Self-admitted technical debt (SATD) is annotated in source code comments by developers and has been recognized as a great source of discovering flawed software. To reduce manual effort, some recent studies have focused on automated detection of SATD using text classification methods. To train their classifier, these methods need labeled samples, which also require a lot of effort to obtain. We developed a new SATD identification method, which takes advantage of a large corpus of unlabeled code comments, extracted from open source projects, to train a word embedding model. After applying feature selection, the pre-trained word embedding is used for discovering semantically similar features in source code comments to enhance the original feature set. By using such enhanced feature set for classification, our goal was to improve the identification of SATD when compared to existing methods. The proposed feature enhancement method was used with the three most common feature selection methods (CHI, IG, and MI), and three well-known text classification algorithms (NB, SVM, and ME) and was tested on ten open source projects. The experimental results show a significant improvement in SATD identification over the compared methods. With an achieved 82% of correct predictions of SATD, the proposed method seems to be a good candidate to be adopted in practice.
topic	Feature enhancement; feature selection; self-admitted technical debt; text classification; word embedding
url	https://ieeexplore.ieee.org/document/8790690/
work_keys_str_mv	AT jernejflisar identificationofselfadmittedtechnicaldebtusingenhancedfeatureselectionbasedonwordembedding AT vilipodgorelec identificationofselfadmittedtechnicaldebtusingenhancedfeatureselectionbasedonwordembedding
_version_	1721540283359821824
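
The feature enhancement step summarized in the description field can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the labeled comments, the two-dimensional vectors in EMBEDDING, the function names, and the similarity threshold are all assumptions made up for this example. The hand-written vectors stand in for a word embedding that the record says is trained on a large corpus of unlabeled code comments. Chi-square (CHI) is used here because it is one of the three feature selection methods the record names.

```python
import math

# Toy labeled comments (1 = SATD, 0 = not SATD), invented for illustration.
LABELED = [
    ("todo fix this hack later", 1),
    ("fixme temporary workaround", 1),
    ("hack until api is stable", 1),
    ("returns user name", 0),
    ("compute checksum of file", 0),
    ("open file and read it", 0),
]

# Hand-written 2-D vectors standing in for a word embedding that would be
# trained on unlabeled code comments (hypothetical values, not real data).
EMBEDDING = {
    "hack": (0.9, 0.1),
    "kludge": (0.88, 0.15),
    "workaround": (0.8, 0.3),
    "todo": (0.7, 0.6),
    "fixme": (0.72, 0.58),
    "file": (0.1, 0.9),
    "checksum": (0.05, 0.95),
}


def chi_square(term, docs):
    """Chi-square statistic of a term against the binary SATD label."""
    a = b = c = d = 0
    for text, label in docs:
        present = term in text.split()
        if present and label == 1:
            a += 1
        elif present:
            b += 1
        elif label == 1:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    vocab = {tok for text, _ in docs for tok in text.split()}
    ranked = sorted(vocab, key=lambda t: (-chi_square(t, docs), t))
    return ranked[:k]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def enhance(features, embedding, threshold):
    """Add every embedded word whose similarity to a selected feature
    reaches the threshold; features missing from the embedding are kept
    but contribute no neighbors."""
    enhanced = set(features)
    for feat in features:
        if feat not in embedding:
            continue
        for word, vec in embedding.items():
            if cosine(embedding[feat], vec) >= threshold:
                enhanced.add(word)
    return enhanced


selected = select_features(LABELED, k=2)
print(selected)
print(sorted(enhance(selected, EMBEDDING, threshold=0.95)))
```

In this toy run, "kludge" never occurs in the labeled comments but still enters the enhanced feature set because its vector lies close to that of the selected feature "hack". According to the description, the enhanced set would then be passed to a text classifier (NB, SVM, or ME), a step this sketch leaves out.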
