Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems
Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This...
Main Authors: | Seoungyul Euh, Hyunjong Lee, Donghoon Kim, Doosung Hwang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2020-01-01 |
Series: | IEEE Access |
Subjects: | Malware detection, feature extraction, tree-based ensemble, AUC-PRC |
Online Access: | https://ieeexplore.ieee.org/document/9057637/ |
id |
doaj-084dd718738447e68eb1cf019a2c2798 |
---|---|
record_format |
Article |
spelling |
doaj-084dd718738447e68eb1cf019a2c2798; 2021-03-30T01:35:06Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2020-01-01; vol. 8, pp. 76796-76808; DOI 10.1109/ACCESS.2020.2986014; 9057637.
Title: Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems.
Authors: [0] Seoungyul Euh; [1] Hyunjong Lee (https://orcid.org/0000-0002-2990-1545); [2] Donghoon Kim; [3] Doosung Hwang (https://orcid.org/0000-0003-1840-9296).
Affiliations: [0] Security Technology Institute, KSign, Seoul, South Korea; [1] Security Technology Institute, KSign, Seoul, South Korea; [2] Department of Computer Science, Arkansas State University, Jonesboro, AR, USA; [3] Department of Software Science, Dankook University, Yongin, South Korea.
Abstract: Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are applied to select relevant features from the collected data set, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware of variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation trees. The proposed features reduce the original feature dimensionality by a factor of tens to hundreds and decrease the training time of ensemble models without degrading the malware detection rate relative to the whole set of malware features. In accuracy and AUC-PRC evaluation, XGBoost ranks highest.
URL: https://ieeexplore.ieee.org/document/9057637/
Keywords: Malware detection; feature extraction; tree-based ensemble; AUC-PRC |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Seoungyul Euh, Hyunjong Lee, Donghoon Kim, Doosung Hwang |
author_sort |
Seoungyul Euh |
title |
Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2020-01-01 |
description |
Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are applied to select relevant features from the collected data set, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware of variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation trees. The proposed features reduce the original feature dimensionality by a factor of tens to hundreds and decrease the training time of ensemble models without degrading the malware detection rate relative to the whole set of malware features. In accuracy and AUC-PRC evaluation, XGBoost ranks highest. |
topic |
Malware detection; feature extraction; tree-based ensemble; AUC-PRC |
url |
https://ieeexplore.ieee.org/document/9057637/ |
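
The description mentions a WEM (Window Entropy Map) that represents malware of variable length, but this record does not give its construction. As a rough illustration of the general idea only, the sketch below computes Shannon entropy over evenly spaced byte windows so that a file of any length maps to a fixed-length vector. The function names, the window size of 256 bytes, and the count of 64 windows are assumptions made for this example, not the authors' method or parameters.

```python
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (range 0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    # Sum -p * log2(p) over the observed byte values.
    return sum(-(n / total) * math.log2(n / total) for n in Counter(chunk).values())


def window_entropies(data: bytes, window: int = 256, n_windows: int = 64) -> list[float]:
    """Entropy of n_windows evenly spaced windows (hypothetical parameters).

    The number of windows is fixed, so inputs of different lengths
    always produce a vector of the same length.
    """
    if not data:
        return [0.0] * n_windows
    # Last valid start offset; 0 when the input is shorter than one window.
    last_start = max(len(data) - window, 0)
    starts = [round(i * last_start / max(n_windows - 1, 1)) for i in range(n_windows)]
    return [shannon_entropy(data[s:s + window]) for s in starts]
```

Calling `window_entropies(open(path, "rb").read())` on any binary would yield 64 values between 0 and 8. Such a vector could then be reshaped into a small image or passed directly to a tree-based classifier, which is the kind of low-dimensional input the description refers to.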