Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/27695
Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Draghi, B | - |
| dc.contributor.author | Wang, Z | - |
| dc.contributor.author | Myles, P | - |
| dc.contributor.author | Tucker, A | - |
| dc.date.accessioned | 2023-11-21T20:06:57Z | - |
| dc.date.available | 2023-11-21T20:06:57Z | - |
| dc.date.issued | 2024-01-10 | - |
| dc.identifier | ORCID iD: Allan Tucker https://orcid.org/0000-0001-5105-3506 | - |
| dc.identifier | e24164 | - |
| dc.identifier.citation | Draghi, B. et al. (2024) 'Identifying and handling data bias within primary healthcare data using synthetic data generators', Heliyon, 10 (2), e24164, pp. 1 - 15. doi: 10.1016/j.heliyon.2024.e24164. | en_US |
| dc.identifier.uri | https://bura.brunel.ac.uk/handle/2438/27695 | - |
| dc.description | Data availability: The anonymised electronic healthcare record data used in this research is not publicly available but can be requested from CPRD subject to a data licence and research data governance (RDG) approval. The generated synthetic data set discussed in this paper can also be requested from CPRD subject to a data sharing agreement (DSA). Data access licence fees apply (https://cprd.com/data). Code availability: All our R code is available via GitHub (https://github.com/barbaraDraghi/BayesBoost). The R package bnlearn (v4.8.1) is used for all Bayesian network inference. | en_US |
| dc.description | Appendix A. Additional results are available online at: https://www.sciencedirect.com/science/article/pii/S2405844024001956#se0130 | - |
| dc.description.abstract | Copyright © 2024 The Authors. Advanced synthetic data generators can simulate data samples that closely resemble sensitive personal datasets while significantly reducing the risk of individual identification. The use of these advanced generators holds enormous potential in the medical field, as it allows for the simulation and sharing of sensitive patient data. This enables the development and rigorous validation of novel AI technologies for accurate diagnosis and efficient disease management. Despite the availability of massive ground truth datasets (such as UK-NHS databases that contain millions of patient records), the risk of biases being carried over to data generators still exists. These biases may arise from the under-representation of specific patient cohorts due to cultural sensitivities within certain communities or standardised data collection procedures. Machine learning models can exhibit bias in various forms, including the under-representation of certain groups in the data. This can lead to missing data and inaccurate correlations and distributions, which may also be reflected in synthetic data. Our paper aims to improve synthetic data generators by introducing probabilistic approaches to first detect difficult-to-predict data samples in ground truth data and then boost them when applying the generator. In addition, we explore strategies to generate synthetic data that can reduce bias and, at the same time, improve the performance of predictive models. | en_US |
| dc.description.sponsorship | NHSX grant: BEIS Innovate Regulatory Pioneer Fund, project: "Using High-fidelity Synthetic Data as synthetic control arms and to boost sample sizes in clinical trials". | en_US |
| dc.format.extent | 1 - 15 | - |
| dc.format.medium | Electronic | - |
| dc.language.iso | en_US | en_US |
| dc.publisher | Elsevier | en_US |
| dc.rights | Copyright © 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/). | - |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | - |
| dc.subject | synthetic data generators | en_US |
| dc.subject | data bias | en_US |
| dc.subject | over-sampling | en_US |
| dc.subject | Bayesian networks | en_US |
| dc.subject | machine learning | en_US |
| dc.title | Identifying and handling data bias within primary healthcare data using synthetic data generators | en_US |
| dc.type | Article | en_US |
| dc.identifier.doi | https://doi.org/10.1016/j.heliyon.2024.e24164 | - |
| dc.relation.isPartOf | Heliyon | - |
| pubs.issue | 2 | - |
| pubs.publication-status | Published | - |
| pubs.volume | 10 | - |
| dc.identifier.eissn | 2405-8440 | - |
| dc.rights.holder | The Authors | - |
Appears in Collections:Dept of Computer Science Research Papers

Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| FullText.pdf | Copyright © 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/). | 2.14 MB | Adobe PDF |


This item is licensed under a Creative Commons License.