Pita, RD;
Pinto, C;
Sena, S;
Fiaccone, R;
Amorim, L;
Reis, S;
Barreto, M;
... Barreto, ME;
(2018)
On the Accuracy and Scalability of Probabilistic Data Linkage over the Brazilian 114 Million Cohort.
IEEE Journal of Biomedical and Health Informatics
10.1109/JBHI.2018.2796941.
(In press).
Text: 08293793.pdf - Published Version (816kB)
Abstract
Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health, where records from clinical, administrative and other surveillance databases are aggregated and used for research, decision-making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes, as well as similarity metrics and cut-off values, both to decide whether a given pair of records matches and to assess the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive datasets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12 seconds over heterogeneous CPU and GPU architectures.
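As a rough illustration of the general idea described in the abstract (comparing a combination of attributes with a similarity metric and applying cut-off values), the following is a minimal sketch in Python. It is not AtyImo's implementation: the bigram Dice coefficient, the field names, the weights and the two cut-offs are all assumptions chosen only for the example.

```python
from typing import Dict


def bigrams(text: str) -> set:
    """Return the set of character bigrams of a normalized string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.lower().strip() == b.lower().strip() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical comparison attributes and weights (assumptions for this sketch).
WEIGHTS: Dict[str, float] = {"name": 0.5, "mother_name": 0.3, "birth_date": 0.2}


def pair_score(rec_a: Dict[str, str], rec_b: Dict[str, str],
               weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-attribute similarities for one record pair."""
    total = sum(w * dice(rec_a[f], rec_b[f]) for f, w in weights.items())
    return total / sum(weights.values())


def classify(score: float, upper: float = 0.9, lower: float = 0.7) -> str:
    """Apply two hypothetical cut-offs: match, manual review, or non-match."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "review"
    return "non-match"


if __name__ == "__main__":
    a = {"name": "Maria da Silva", "mother_name": "Ana da Silva",
         "birth_date": "1990-05-12"}
    b = {"name": "Maria Silva", "mother_name": "Ana Silva",
         "birth_date": "1990-05-12"}
    s = pair_score(a, b)
    print(round(s, 3), classify(s))
```

Pairs whose score falls between the two cut-offs are the ones that would typically go to manual review, which is the step the abstract identifies as hard to scale when the number of candidate matches is very large.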
Type: | Article |
---|---|
Title: | On the Accuracy and Scalability of Probabilistic Data Linkage over the Brazilian 114 Million Cohort |
Open access status: | An open access version is available from UCL Discovery |
DOI: | 10.1109/JBHI.2018.2796941 |
Publisher version: | http://dx.doi.org/10.1109/JBHI.2018.2796941 |
Language: | English |
Additional information: | This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions. |
UCL classification: | UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics > Clinical Epidemiology |
URI: | https://discovery-pp.ucl.ac.uk/id/eprint/10044322 |