Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems

Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This...

Bibliographic Details
Main Authors: Seoungyul Euh, Hyunjong Lee, Donghoon Kim, Doosung Hwang
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Malware detection, feature extraction, tree-based ensemble, AUC-PRC
Online Access: https://ieeexplore.ieee.org/document/9057637/
id doaj-084dd718738447e68eb1cf019a2c2798
record_format Article
spelling doaj-084dd718738447e68eb1cf019a2c2798
Timestamp: 2021-03-30T01:35:06Z
Language: eng
Publisher: IEEE
Journal: IEEE Access
ISSN: 2169-3536
Published: 2020-01-01
Volume: 8
Pages: 76796-76808
DOI: 10.1109/ACCESS.2020.2986014
IEEE document ID: 9057637
Title: Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems
Authors:
Seoungyul Euh (Security Technology Institute, KSign, Seoul, South Korea)
Hyunjong Lee (Security Technology Institute, KSign, Seoul, South Korea), https://orcid.org/0000-0002-2990-1545
Donghoon Kim (Department of Computer Science, Arkansas State University, Jonesboro, AR, USA)
Doosung Hwang (Department of Software Science, Dankook University, Yongin, South Korea), https://orcid.org/0000-0003-1840-9296
Abstract: Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are adapted for relevant feature selection from the collected data set, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware with variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation trees. The proposed features can reduce the original feature dimensionality by several tens to hundreds of times and decrease the training time of ensemble models without degrading the malware detection rate when compared to the performance of the whole set of malware features. In accuracy and AUC-PRC evaluation, XGBoost ranks highest.
Online Access: https://ieeexplore.ieee.org/document/9057637/
Keywords: Malware detection, feature extraction, tree-based ensemble, AUC-PRC
collection DOAJ
language English
format Article
sources DOAJ
author Seoungyul Euh
Hyunjong Lee
Donghoon Kim
Doosung Hwang
title Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2020-01-01
description Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are adapted for relevant feature selection from the collected data set, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware with variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation trees. The proposed features can reduce the original feature dimensionality by several tens to hundreds of times and decrease the training time of ensemble models without degrading the malware detection rate when compared to the performance of the whole set of malware features. In accuracy and AUC-PRC evaluation, XGBoost ranks highest.
topic Malware detection
feature extraction
tree-based ensemble
AUC-PRC
url https://ieeexplore.ieee.org/document/9057637/