Search based training data selection for cross project defect prediction

Hosseini, S; Turhan, B; Mäntylä, M

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/14704

Full metadata record

DC Field	Value	Language
dc.contributor.author	Hosseini, S	-
dc.contributor.author	Turhan, B	-
dc.contributor.author	Mäntylä, M	-
dc.date.accessioned	2017-06-08T12:38:17Z	-
dc.date.available	2016	-
dc.date.available	2017-06-08T12:38:17Z	-
dc.date.issued	2016	-
dc.identifier.citation	Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, Ciudad Real, Spain, 09 September, pp. 1-10, (2016)	en_US
dc.identifier.isbn	978-1-4503-4772-3	-
dc.identifier.uri	http://bura.brunel.ac.uk/handle/2438/14704	-
dc.description.abstract	Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP. Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels. Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that searches for solutions optimizing a combined measure of F-Measure and GMean on a validation set generated by the (NN)-Filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from the PROMISE repository to compare the performance of GIS with benchmark CPDP methods, namely the (NN)-Filter and naive CPDP, as well as with within project defect prediction (WPDP). Results: Our results show that GIS is significantly better than the (NN)-Filter in terms of F-Measure (p-value ≪ 0.001, Cohen’s d = 0.697) and GMean (p-value ≪ 0.001, Cohen’s d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p-value ≪ 0.001, Cohen’s d = 0.753) and GMean (p-value ≪ 0.001, Cohen’s d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p-value ≪ 0.001, Cohen’s d = 0.227) and GMean (p-value ≪ 0.001, Cohen’s d = 0.595) values. Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. In particular, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall at the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.	en_US
dc.format.extent	3:1 - 3:10	-
dc.language.iso	en	en_US
dc.publisher	ACM	en_US
dc.subject	Cross project defect prediction	en_US
dc.subject	Search based optimization	en_US
dc.subject	Genetic algorithms	en_US
dc.subject	Instance selection	en_US
dc.subject	Training data selection	en_US
dc.title	Search based training data selection for cross project defect prediction	en_US
dc.type	Conference Paper	en_US
dc.identifier.doi	http://dx.doi.org/10.1145/2972958.2972964	-
dc.relation.isPartOf	Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering	-
pubs.notes	acmid: 2972964 articleno: 3 interhash: 141967ca11c83faf539c09437733fc99 intrahash: 24187e884ce3b83df822e9fbe5cfbe2b location: Ciudad Real, Spain numpages: 10	-
Appears in Collections:	Dept of Computer Science Research Papers
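
The abstract refers to a combined measure of F-Measure and GMean without defining either term or the combination. The following is a minimal sketch of how such a measure could be computed from a confusion matrix; it is not taken from the paper. Two assumptions are made here: GMean is taken as the geometric mean of recall and specificity (one common definition), and the two scores are combined by multiplication. All function names are hypothetical.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the defective class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp: int, fp: int, tn: int, fn: int) -> float:
    """Assumed definition: geometric mean of recall and specificity."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(recall * specificity)


def combined_score(tp: int, fp: int, tn: int, fn: int) -> float:
    """Assumed combination: product of F-Measure and GMean."""
    return f_measure(tp, fp, fn) * g_mean(tp, fp, tn, fn)


if __name__ == "__main__":
    # Illustrative counts only (not data from the paper):
    # high recall (0.8) paired with low precision (0.4).
    print(combined_score(tp=8, fp=12, tn=28, fn=2))
```

With the illustrative counts above, recall is 0.8 while precision is only 0.4, the kind of trade-off the Conclusions describe for GIS.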

Files in This Item:

File	Description	Size	Format
FullText.pdf		372.92 kB	Adobe PDF
