Exploring hybrid parallel systems for probabilistic record linkage

Boratto, M; Alonso, P; Pinto, C; Melo, P; Barreto, M; Denaxas, S; (2018) Exploring hybrid parallel systems for probabilistic record linkage. The Journal of Supercomputing 10.1007/s11227-018-2328-3. (In press). Green open access

Abstract

Record linkage is a technique widely used to bring together records stored in disparate data sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on whether common key attributes exist across all the data sources involved. The probabilistic approach is very time-consuming because of the number of records that must be compared, especially in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present algorithmic optimizations that provide high accuracy and improve performance by selecting the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.
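
To illustrate why the probabilistic approach is costly, the following is a minimal sketch of pairwise linkage in Python. It is not the method of the paper: the bigram Dice similarity, the function names (bigrams, dice, link), the sample names and the threshold value are all assumptions chosen for illustration. The point it shows is that every record in one source is compared with every record in the other, so the work grows with the product of the two input sizes.

```python
from itertools import product


def bigrams(text):
    """Return the set of character bigrams of a lowercased, space-free string."""
    s = text.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings (0.0 to 1.0).

    Assumption: Dice similarity is used here only as a simple example score.
    """
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Both strings are too short to have bigrams; fall back to equality.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def link(left, right, threshold=0.8):
    """Compare every left record with every right record.

    Performs len(left) * len(right) comparisons and returns
    (left_index, right_index, score) for pairs at or above the threshold.
    The threshold is a hypothetical tuning parameter.
    """
    matches = []
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        score = dice(a, b)
        if score >= threshold:
            matches.append((i, j, score))
    return matches


if __name__ == "__main__":
    source_a = ["Maria Silva", "Joao Souza"]
    source_b = ["Maria Sylva", "Pedro Alves"]
    for i, j, score in link(source_a, source_b, threshold=0.7):
        print(source_a[i], "<->", source_b[j], round(score, 3))
```

Because each pair is scored independently, the comparison loop can be split into chunks and distributed across CPU cores or GPUs, which is the kind of parallelism the abstract refers to.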

Type: Article
Title: Exploring hybrid parallel systems for probabilistic record linkage
Open access status: An open access version is available from UCL Discovery
DOI: 10.1007/s11227-018-2328-3
Publisher version: https://doi.org/10.1007/s11227-018-2328-3
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Probabilistic linkage, Public health, Performance evaluation, Multicore, Multi-GPU
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics > Clinical Epidemiology
URI: https://discovery-pp.ucl.ac.uk/id/eprint/10058414