Empirical evaluation of feature selection and machine learning techniques to recommend clones for software refactoring
| Container / Database: | Радіоелектронні і комп'ютерні системи |
|---|---|
| Main Authors: | , , |
| Format: | Article |
| Language: | English |
| Published: | National Aerospace University «Kharkiv Aviation Institute», 2025-09-01 |
| Subjects: | |
| Online Access: | https://nti.khai.edu/ojs/index.php/reks/article/view/3145 |

| Summary: | This article addresses the management of software clones. Software clones are duplicate code fragments that can exist in the same or different software files. Software clone detection and management has become a well-established research area. Clones should be managed to minimize their ill effects, as their presence can increase a software system’s maintenance cost and resource requirements. Refactoring is a commonly used technique for managing clones. A clone detection tool can detect many clones in a software system, but not all detected clones are suitable for refactoring; a developer needs the subset of detected clones that can be easily refactored. This study aims to suggest software clones for refactoring using machine learning techniques. It evaluates the performance of fourteen machine learning algorithms and investigates the influence of three feature selection methods on clone recommendation accuracy. The tasks to be solved are as follows: selecting appropriate features from datasets, developing machine learning-based models that can suggest suitable clones for refactoring, and selecting an effective machine learning algorithm and feature selection method for recommending clones for refactoring. The methods used for feature selection are correlation, InfoGain, and ReliefF. The study is conducted on datasets from six open-source software systems written in Java. The experimental results show that the Decision Tree and LogitBoost classifiers achieve the highest accuracy of 94.44 % on the Lucene dataset. ReliefF yields the best performance among the feature selection methods, particularly when used with the Decision Tree algorithm. This study concludes that Random Committee, Random Forest, and Decision Tree perform best when paired with correlation, InfoGain, and ReliefF, respectively. Overall, the Decision Tree classifier, combined with the ReliefF feature selection method, delivers the highest average precision, recall, and F-score across datasets. |
|---|---|
| ISSN: | 1814-4225 2663-2012 |
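
The summary names information gain (InfoGain) as one of the feature selection methods and reports precision, recall, and F-score. As a rough illustration only, the sketch below ranks discrete features by information gain and computes those three metrics in plain Python. The feature names (`clone_size`, `same_file`), the toy rows, and the labels are hypothetical and are not taken from the study; the record does not state which tooling or clone features the authors used, so this is not a reproduction of their method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(feature_values, labels):
    """Reduction in label entropy after splitting on one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return feature names sorted by decreasing information gain."""
    scores = {
        name: info_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores, key=scores.get, reverse=True)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F-score for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data: label 1 means "suitable for refactoring".
names = ["clone_size", "same_file"]
rows = [("small", "yes"), ("small", "no"), ("large", "yes"), ("large", "no")]
labels = [1, 1, 0, 0]

ranking = rank_features(rows, labels, names)
metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy data `clone_size` separates the two classes completely while `same_file` carries no information, so the ranking places `clone_size` first. Real clone features are often numeric and would need discretization before this kind of scoring applies.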
