Using Machine Learning to Manage Resources in Datacenters with Diverse Computing Requirements

Bibliographic Details
Main Authors: Lee, Chin-Feng, 李青峰
Other Authors: Chou, Jerry
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/jrgj47
Description
Summary: Master's === National Tsing Hua University === Department of Computer Science === 106 === Apache Mesos has become a popular cluster resource management tool with the emergence of various new cluster computing applications, such as Big Data analytics and deep learning. The resource offer mechanism of Mesos gives framework schedulers the ability to choose the best resources based on their own constraints and preferences. The default hierarchical DRF allocator gives near-optimal results for simple task placement preferences and resource requirements when a large resource pool runs mostly short-lived jobs. However, if these properties do not hold, a higher offer rejection rate is expected, which degrades overall performance. Moreover, in scenarios where overall system throughput is the main concern, improving the allocator offers more room for optimization than passively waiting for desirable resource offers to reach the frameworks. Therefore, we propose to use machine learning techniques to improve offer quality. We consider the problem of actively improving the quality of resource offers with limited information from, and limited interaction with, users. In this work, we propose a quality-aware allocator with a pre-defined quality function for optimizing job execution time. In addition, we implemented an emulation environment to evaluate the performance of the proposed allocator under various synthetic batch-processing workloads. Our evaluation shows up to a 2x improvement in total completion time, 33% higher residual capacity, a 46% lower rejection rate, and 70% better allocation placement with data locality.
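
For readers unfamiliar with the DRF (Dominant Resource Fairness) policy mentioned in the summary, the following is a minimal sketch of its core rule: the next offer goes to the framework whose dominant share is smallest. It is not the thesis's quality-aware allocator and not the Mesos implementation; it ignores roles, weights, and the hierarchical part of the default allocator. The function names, framework names, and cluster sizes are all hypothetical, chosen only for illustration.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(allocated: Resources, total: Resources) -> float:
    """Largest fraction of any single cluster resource held by one framework."""
    return max(
        allocated.get(name, 0.0) / capacity
        for name, capacity in total.items()
        if capacity > 0
    )


def next_framework(allocations: Dict[str, Resources], total: Resources) -> str:
    """Plain DRF: pick the framework with the smallest dominant share."""
    return min(allocations, key=lambda fw: dominant_share(allocations[fw], total))


# Hypothetical cluster and current allocations (assumed values).
total = {"cpus": 9.0, "mem": 18.0}
allocations = {
    "analytics": {"cpus": 3.0, "mem": 3.0},  # dominant resource: cpus, 3/9
    "training": {"cpus": 1.0, "mem": 8.0},   # dominant resource: mem, 8/18
}

chosen = next_framework(allocations, total)
```

In this example the "analytics" framework holds one third of the CPUs while "training" holds four ninths of the memory, so the next offer would go to "analytics". Note that the rule considers only fairness of shares; it does not ask whether the offered resources suit the framework's placement preferences, which is the gap the summary says the proposed allocator targets.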