UCL Discovery Stage

On the Accuracy and Scalability of Probabilistic Data Linkage over the Brazilian 114 Million Cohort

Pita, RD; Pinto, C; Sena, S; Fiaccone, R; Amorim, L; Reis, S; Barreto, M; ... Barreto, ME; (2018) On the Accuracy and Scalability of Probabilistic Data Linkage over the Brazilian 114 Million Cohort. IEEE Journal of Biomedical and Health Informatics 10.1109/JBHI.2018.2796941. (In press). Green open access

08293793.pdf - Published Version


Abstract

Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health, where records from clinical, administrative and other surveillance databases are aggregated and used for research, decision-making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes, as well as similarity metrics and cut-off values, both to decide whether a given pair of records matches and to assess the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive datasets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12 seconds over heterogeneous CPU and GPU architectures.
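To make the roles of comparison attributes, similarity metrics and cut-off values concrete, the following is a minimal illustrative sketch of pairwise probabilistic matching. It is not AtyImo's implementation: the attribute names, the weights, the 0.85 cut-off and the choice of a Dice coefficient over character bigrams are all assumptions made for illustration only.

```python
# Illustrative sketch only; every name and value below is hypothetical.

# Assumed comparison attributes and their relative weights (sum to 1).
WEIGHTS = {"name": 0.5, "birth_date": 0.3, "city": 0.2}

# Assumed cut-off: pairs scoring at or above it are treated as matches.
CUTOFF = 0.85


def bigrams(text):
    """Return the set of character bigrams of a normalized string."""
    text = text.lower().strip()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to form bigrams: fall back to exact comparison.
        return 1.0 if a.lower().strip() == b.lower().strip() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def score(rec_a, rec_b):
    """Weighted similarity of two records over the comparison attributes."""
    return sum(w * dice(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def is_match(rec_a, rec_b):
    """Decide whether a pair of records is a match under the cut-off."""
    return score(rec_a, rec_b) >= CUTOFF


a = {"name": "Maria da Silva", "birth_date": "1980-05-17", "city": "Salvador"}
typo = {"name": "Maria da Sliva", "birth_date": "1980-05-17", "city": "Salvador"}
other = {"name": "Joao Pereira", "birth_date": "1975-11-02", "city": "Recife"}

print(score(a, typo), is_match(a, typo))
print(score(a, other), is_match(a, other))
```

In a sketch like this, raising the cut-off trades missed matches for fewer false matches, which is why the choice of cut-off and the assessment of accuracy are closely tied.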

Type: Article
Title: On the Accuracy and Scalability of Probabilistic Data Linkage over the Brazilian 114 Million Cohort
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/JBHI.2018.2796941
Publisher version: http://dx.doi.org/10.1109/JBHI.2018.2796941
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics > Clinical Epidemiology
URI: https://discovery-pp.ucl.ac.uk/id/eprint/10044322
