A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction

Turhan, B; Mäntylä, M

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/14787

Full metadata record

DC Field	Value	Language
dc.contributor.author	Turhan, B	-
dc.contributor.author	Mäntylä, M	-
dc.date.accessioned	2017-06-20T12:30:38Z	-
dc.date.available	2017-06-20T12:30:38Z	-
dc.date.issued	2017	-
dc.identifier.citation	Information and Software Technology	en_US
dc.identifier.issn	0950-5849	-
dc.identifier.uri	http://bura.brunel.ac.uk/handle/2438/14787	-
dc.description.abstract	Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, feature selection and data quality are issues to consider in CPDP. Objective: We aim at utilizing the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, to produce validation sets for generating evolving training datasets to tackle CPDP while accounting for potential noise in defect labels. We also investigate the impact of using different feature sets. Method: We extend our proposed approach, Genetic Instance Selection (GIS), by incorporating feature selection in its setting. We use 41 releases of 11 multi-version projects to assess the performance of GIS in comparison with benchmark CPDP (NN-filter and Naive-CPDP) and within project (Cross-Validation (CV) and Previous Releases (PR)) approaches. To assess the impact of feature sets, we use two sets of features, SCM+OO+LOC (all) and CK+LOC (ckloc), as well as iterative info-gain subsetting (IG) for feature selection. Results: The GIS variant with info gain feature selection is significantly better than NN-Filter (all, ckloc, IG) in terms of F1 (p-values < 0.001, Cohen's d = {0.621, 0.845, 0.762}) and G (p-values < 0.001, Cohen's d = {0.899, 1.114, 1.056}), and Naive CPDP (all, ckloc, IG) in terms of F1 (p-values < 0.001, Cohen's d = {0.743, 0.865, 0.789}) and G (p-values < 0.001, Cohen's d = {1.027, 1.119, 1.050}). Overall, the performance of GIS is comparable to that of within project defect prediction (WPDP) benchmarks, i.e. CV and PR. In terms of the multiple comparisons test, all variants of GIS belong to the top ranking group of approaches. Conclusions: We conclude that datasets obtained from search based approaches combined with feature selection techniques are a promising way to tackle CPDP. In particular, the performance comparison with the within project scenario encourages further investigation of our approach.
However, the performance of GIS is based on high recall at the expense of a loss in precision. Using different optimization goals, utilizing other validation datasets and other feat	en_US
dc.language.iso	en	en_US
dc.subject	Cross Project Defect Prediction	en_US
dc.subject	Search Based Optimization	en_US
dc.subject	Genetic Algorithms	en_US
dc.subject	Instance Selection	en_US
dc.subject	Training Data Selection	en_US
dc.title	A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction	en_US
dc.type	Article	en_US
dc.relation.isPartOf	Information and Software Technology	-
pubs.publication-status	Accepted	-
Appears in Collections:	Dept of Computer Science Research Papers

Files in This Item:

File	Description	Size	Format
FullText.pdf		845.59 kB	Adobe PDF