Munch: An efficient modularisation strategy on sequential source code check-ins

Arzoky, Mahir

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/13808

Title:	Munch: An efficient modularisation strategy on sequential source code check-ins
Authors:	Arzoky, Mahir
Advisors:	Swift, S Counsell, S
Keywords:	Clustering;Time-series;Search based software engineering;Seeding;Refactoring
Issue Date:	2015
Publisher:	Brunel University London
Abstract:	As developers are increasingly creating more sophisticated applications, software systems are growing in both their complexity and size. When source code is easy to understand, the system can be more maintainable, which leads to reduced costs. Better structured code can also lead to new requirements being introduced more efficiently with fewer issues. However, the maintenance and evolution of systems can be frustrating; it is difficult for developers to keep a fixed understanding of the system’s structure as the structure can change during maintenance. Software module clustering is the process of automatically partitioning the structure of the system using low-level dependencies in the source code, to improve the system’s structure. There have been a large number of studies using the Search Based Software Engineering approach to solve the software module clustering problem. A software clustering tool, Munch, was developed and employed in this study to modularise a unique dataset of sequential source code software versions. The tool is based on Search Based Software Engineering techniques. The tool constitutes of a number of components that includes the clustering algorithm, and a number of different fitness functions and metrics that are used for measuring and assessing the quality of the clustering decompositions. The tool will provide a framework for evaluating a number of clustering techniques and strategies. The dataset used in this study is provided by Quantel Limited, it is from processed source code of a product line architecture library that has delivered numerous products. The dataset analysed is the persistence engine used by all products, comprising of over 0.5 million lines of C++. It consists of 503 software versions. This study looks to investigate whether search-based software clustering approaches can help stakeholders to understand how inter-class dependencies of the software system change over time. It performs efficient modularisation on a time-series of source code relationships, taking advantage of the fact that the nearer the source code in time the more similar the modularisation is expected to be. This study introduces a seeding concept and highlights how it can be used to significantly reduce the runtime of the modularisation. The dataset is not treated as separate modularisation problems, but instead the result of the previous modularisation of the graph is used to give the next graph a head start. Code structure and sequence is used to obtain more effective modularisation and reduce the runtime of the process. To evaluate the efficiency of the modularisation numerous experiments were conducted on the dataset. The results of the experiments present strong evidence to support the seeding strategy. To reduce the runtime further, statistical techniques for controlling the number of iterations of the modularisation, based on the similarities between time adjacent graphs, is introduced. The convergence of the heuristic search technique is examined and a number of stopping criterions are estimated and evaluated. Extensive experiments were conducted on the time-series dataset and evidence are presented to support the proposed techniques. In addition, this thesis investigated and evaluated the starting clustering arrangement of Munch’s clustering algorithm, and introduced and experimented with a number of starting clustering arrangements that includes a uniformly random clustering arrangement strategy. Moreover, this study investigates whether the dataset used for the modularisation resembles a random graph by computing the probabilities of observing certain connectivity. This thesis demonstrates how modularisation is not possible with data that resembles random graphs, and demonstrates that the dataset being used does not resemble a random graph except for small sections where there were large maintenance activities. Furthermore, it explores and shows how the random graph metric can be used as a tool to indicate areas of interest in the dataset, without the need to run the modularisation. Last but not least, there is a huge amount of software code that has and will be developed, however very little has been learnt from how the code evolves over time. The intention of this study is also to help developers and stakeholders to model the internal software and to aid in modelling development trends and biases, and to try and predict the occurrence of large changes and potential refactorings. Thus, industrial feedback of the research was obtained. This thesis presents work on the detection of refactoring activities, and discusses the possible applications of the findings of this research in industrial settings.
Description:	This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London
URI:	http://bura.brunel.ac.uk/handle/2438/13808
Appears in Collections:	Computer Science Dept of Computer Science Theses

Files in This Item:

File	Description	Size	Format
FulltextThesis.pdf		2.79 MB	Adobe PDF	View/Open

Show full item record