Hadoop performance modeling and job optimization for big data analytics

Khan, Mukhtaj

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/11078

Title:	Hadoop performance modeling and job optimization for big data analytics
Authors:	Khan, Mukhtaj
Advisors:	Li, M
Keywords:	Resource provisioning;Phasor measurement units;Gene expression programming;Parallel detrended fluctuation analysis;Job estimation
Issue Date:	2015
Publisher:	Brunel University London
Abstract:	Big data has received a momentum from both academia and industry. The MapReduce model has emerged into a major computing model in support of big data analytics. Hadoop, which is an open source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as Amazon EC2 cloud have now supported Hadoop user applications. However, a key challenge is that the cloud service providers do not a have resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user responsibility to estimate the require amount of resources for their job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model employs Locally Weighted Linear Regression (LWLR) model to estimate execution time of a job and Lagrange Multiplier technique for resource provisioning to satisfy user job with a given deadline. The performance of the propose model is extensively evaluated in both in-house Hadoop cluster and Amazon EC2 Cloud. Experimental results show that the proposed model is highly accurate in job execution estimation and jobs are completed within the required deadlines following on the resource provisioning scheme of the proposed model. In addition, the Hadoop framework has over 190 configuration parameters and some of them have significant effects on the performance of a Hadoop job. Manually setting the optimum values for these parameters is a challenging task and also a time consuming process. This thesis presents optimization works that enhances the performance of Hadoop by automatically tuning its parameter values. It employs Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlation among the configuration parameters. For the purpose of optimization, Particle Swarm Optimization (PSO) is employed to find automatically an optimal or a near optimal configuration settings. The performance of the proposed work is intensively evaluated on a Hadoop cluster and the experimental results show that the proposed work enhances the performance of Hadoop significantly compared with the default settings.
Description:	This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London
URI:	http://bura.brunel.ac.uk/handle/2438/11078
Appears in Collections:	Electronic and Computer Engineering Dept of Electronic and Electrical Engineering Theses

Files in This Item:

File	Description	Size	Format
FulltextThesis.pdf		3.53 MB	Adobe PDF	View/Open

Show full item record