Dynamic analyses of malware

This thesis examines machine learning techniques for detecting malware using dynamic runtime opcodes. Previous work in the field has faltered on inadequately sized and poorly sampled datasets. A novel run-trace dataset is presented, the largest in the literature to date. Using this dataset, malware...

Bibliographic Details
Main Author: Carlin, Domhnall
Other Authors: Sezer, Sakir ; O'Kane, Philip
Published: Queen's University Belfast 2018
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.766287
id ndltd-bl.uk-oai-ethos.bl.uk-766287
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-766287 | 2019-02-27T03:27:32Z | Dynamic analyses of malware | Carlin, Domhnall | Sezer, Sakir ; O'Kane, Philip | 2018 | Queen's University Belfast | https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.766287 | Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
description This thesis examines machine learning techniques for detecting malware using dynamic runtime opcodes. Previous work in the field has faltered on inadequately sized and poorly sampled datasets. First, a novel run-trace dataset is presented, the largest in the literature to date. Using this dataset, malware detection using opcode analysis is shown to be not only feasible, but highly accurate at short run-lengths and without computationally expensive sequencing analysis. Second, unsupervised learning is used to investigate the effects of anti-virus (AV) labels on detection rates. AV labels offer an English-language description of the malware type, whereas it is found that using an assembly language description is more beneficial in malware triaging. Third, the machine learning techniques are applied to ransomware run-traces, which have not been explored in the literature to date. This offers four further novel contributions: an examination of dynamic API calls vs opcode traces in ransomware; the run-lengths necessary to detect ransomware accurately; the creation of a logical feature reduction algorithm to minimise computational expense in machine learning; and the first model in the literature which can differentiate between benign encryption (zipping) and malicious encryption. Fourth, the computational costs of 23 machine learning algorithms are investigated with respect to the run-trace dataset. In the literature, researchers discuss the explosion of malware, yet opcode analyses have used fixed-size datasets, with no regard for how these models will cope with retraining on escalating datasets. The cost of retraining and testing updatable and non-updatable classifiers, both parallelised and non-parallelised, is examined with simulated escalating datasets. Lastly, a model is proposed and examined to mitigate the disadvantages of the most successful classifiers for future work.
author2 Sezer, Sakir ; O'Kane, Philip
title Dynamic analyses of malware
publisher Queen's University Belfast
publishDate 2018
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.766287