Kokosi, Theodora; Harron, Katie (2022) Synthetic data in medical research. BMJ Medicine, 1 (1), Article e000167. 10.1136/bmjmed-2022-000167.
Abstract
Introduction
Demand to access high quality data at the individual level for medical and healthcare research is growing. Electronic health record data collected on whole populations can help to generate real world evidence and can be used for a range of secondary purposes, including testing new hypotheses and developing and evaluating different methodological and statistical approaches. Secondary analysis of primary research data, such as from clinical trials,1 is also valuable, for example, to conduct meta-analyses of individual participant data. However, several complex privacy requirements make accessing these data challenging.2 Information contained in electronic health records or in clinical trial data is highly sensitive, and access to these datasets can be an expensive and lengthy process.3 Data privacy and protection regulations are the main barriers to accessing these data for healthcare and medical research.4
Anonymisation (where potentially identifiable variables are removed) is one way to make data available; however, intensive anonymisation can degrade the data to the extent that they are no longer fit for purpose.5 For example, adding random noise to the data reduces precision and leads to wider confidence intervals. Several reidentification attempts on anonymised data have been successful and have harmed the trust of the public and of regulators in such methods.6 7 For instance, one study showed that patients could be identified by matching information from publicly available patient level data, attributing information obtained from newspapers, and contacting those patients directly.6
Use of information from clinical trials and electronic health records of large populations has the potential to benefit medical and healthcare research, which makes seeking new approaches to data access imperative. One solution is to use so-called synthetic data, or artificial data, which provide a realistic representation of the original data source.
Synthetic data look like the original data source, without containing any information on any real individuals. Synthetic data can attempt to preserve some of the statistical properties of the original data source (eg, distributions of continuous data, proportions of categorical data, correlations between variables, and other model parameters).
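As an illustration of what preserving statistical properties can mean in practice, the following is a minimal sketch in Python, not the method described in the article. It assumes a tiny, entirely made-up "original" dataset (the variables `ages`, `sbps`, and `sexes` and their values are hypothetical), fits only summary statistics (means, standard deviations, one correlation, and category proportions), and samples new records from those summaries. The function name `synthesise` is likewise an assumption introduced for this example. A synthesiser this simple offers no formal privacy guarantee and ignores any relationship between the categorical variable and the continuous ones.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def synthesise(ages, sbps, sexes, n, seed=0):
    """Draw n synthetic (age, sbp, sex) records from fitted summaries.

    Continuous variables: bivariate normal with the observed means,
    standard deviations, and correlation.
    Categorical variable: sampled from the observed proportions,
    independently of the continuous variables (a simplification).
    """
    rng = random.Random(seed)
    m_a, m_s = statistics.fmean(ages), statistics.fmean(sbps)
    sd_a, sd_s = statistics.stdev(ages), statistics.stdev(sbps)
    rho = pearson(ages, sbps)
    # Guard against tiny negative values caused by floating point rounding.
    resid = math.sqrt(max(0.0, 1.0 - rho ** 2))
    levels = sorted(set(sexes))
    weights = [sexes.count(level) for level in levels]

    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = m_a + sd_a * z1
        sbp = m_s + sd_s * (rho * z1 + resid * z2)
        sex = rng.choices(levels, weights=weights)[0]
        records.append((age, sbp, sex))
    return records


# Hypothetical "original" data: age (years), systolic blood pressure (mm Hg), sex.
ages = [34, 45, 52, 61, 29, 70, 48, 55, 39, 66]
sbps = [118, 125, 131, 140, 112, 150, 128, 135, 121, 144]
sexes = ["F", "M", "F", "F", "M", "M", "F", "M", "F", "F"]

synthetic = synthesise(ages, sbps, sexes, n=5000)

syn_ages = [r[0] for r in synthetic]
syn_sbps = [r[1] for r in synthetic]
syn_prop_f = sum(r[2] == "F" for r in synthetic) / len(synthetic)

# Compare summary statistics of the original and synthetic data.
print("mean age:", statistics.fmean(ages), "vs", statistics.fmean(syn_ages))
print("correlation:", pearson(ages, sbps), "vs", pearson(syn_ages, syn_sbps))
print("proportion F:", sexes.count("F") / len(sexes), "vs", syn_prop_f)
```

No synthetic record here is copied from an original row, yet the printed summaries should be close to each other. How much of the original structure a synthesiser retains, and how much disclosure risk remains, depends on the generating model and has to be assessed separately.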
Type: | Article |
---|---|
Title: | Synthetic data in medical research |
Open access status: | An open access version is available from UCL Discovery |
DOI: | 10.1136/bmjmed-2022-000167 |
Publisher version: | https://doi.org/10.1136/bmjmed-2022-000167 |
Language: | English |
Additional information: | This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third-party material in this article are included in the Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ |
UCL classification: | UCL; UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > UCL GOS Institute of Child Health > Population, Policy and Practice Dept; UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > UCL GOS Institute of Child Health; UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences |
URI: | https://discovery-pp.ucl.ac.uk/id/eprint/10156537 |