Boratto, M; Alonso, P; Pinto, C; Melo, P; Barreto, M; Denaxas, S (2018) Exploring hybrid parallel systems for probabilistic record linkage. The Journal of Supercomputing. DOI: 10.1007/s11227-018-2328-3. (In press).
Abstract
Record linkage is a technique widely used to bring together data stored in disparate sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on whether common key attributes exist across all the data sources involved. The probabilistic approach is very time-consuming because of the number of records that must be compared, especially in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present algorithmic optimizations that provide high accuracy and improve performance by selecting the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.
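To illustrate why the probabilistic approach is costly, here is a minimal Python sketch of pairwise probabilistic matching. It is not the method evaluated in the paper: the abstract does not name a similarity measure, so the Dice coefficient over character bigrams, the threshold values, and the function names `bigrams`, `dice`, and `link` are all assumptions made for illustration.

```python
from itertools import product


def bigrams(text):
    """Return the set of character bigrams of a lower-cased, trimmed string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Both strings too short to form bigrams: treat as identical (assumption).
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def link(left, right, threshold=0.8):
    """Compare every record pair; keep pairs scoring at or above the threshold.

    The loop performs len(left) * len(right) comparisons, which is the cost
    that grows quickly with input size.
    """
    matches = []
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        score = dice(a, b)
        if score >= threshold:
            matches.append((i, j, score))
    return matches
```

Because no shared key is required, a pair such as "joao souza" and "joao sousa" can still be linked when the threshold is low enough, whereas a deterministic exact-key join would miss it. Each pair comparison is independent of the others, which is what makes this kind of workload a natural candidate for splitting across CPU cores and GPUs.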