CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability

Abstract Background Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge...


Bibliographic Details
Main Authors: George C. G. Barbosa, M. Sanni Ali, Bruno Araujo, Sandra Reis, Samila Sena, Maria Y. T. Ichihara, Julia Pescarini, Rosemeire L. Fiaccone, Leila D. Amorim, Robespierre Pita, Marcos E. Barreto, Liam Smeeth, Mauricio L. Barreto
Format: Article
Language: English
Published: BMC 2020-11-01
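Series: BMC Medical Informatics and Decision Making
Subjects: Accuracy; Data linkage; Entity resolution; Indexing; Information retrieval techniques; Scalability
Online Access: http://link.springer.com/article/10.1186/s12911-020-01285-w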
id doaj-829763e29b6e4d53945a61f7e72b8f09
record_format Article
spelling doaj-829763e29b6e4d53945a61f7e72b8f09
2020-11-25T04:01:35Z
eng
BMC
BMC Medical Informatics and Decision Making
1472-6947
2020-11-01
20
1
1
13
10.1186/s12911-020-01285-w
CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability
0 George C. G. Barbosa, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
1 M. Sanni Ali, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
2 Bruno Araujo, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
3 Sandra Reis, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
4 Samila Sena, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
5 Maria Y. T. Ichihara, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
6 Julia Pescarini, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
7 Rosemeire L. Fiaccone, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
8 Leila D. Amorim, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
9 Robespierre Pita, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
10 Marcos E. Barreto, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
11 Liam Smeeth, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine
12 Mauricio L. Barreto, Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia
Abstract
Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required.
Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight cores) computation settings.
Results: Overall, the CIDACS-RL algorithm performed best on both positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records.
Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors or distributed infrastructure.
http://link.springer.com/article/10.1186/s12911-020-01285-w
Accuracy
Data linkage
Entity resolution
Indexing
Information retrieval techniques
Scalability
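The Methods summary describes linkage as an indexing search followed by scoring. The following is a minimal Python sketch of that general pattern only, not the authors' implementation and not Apache Lucene: it assumes a plain in-memory inverted index over name tokens for candidate retrieval and the standard-library `difflib.SequenceMatcher` for scoring. The field names (`name`, `mother_name`, `birth_date`), the helper names, and the cutoff of 0.9 are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical linkage fields; the record does not list the attributes used.
FIELDS = ("name", "mother_name", "birth_date")


def tokens(record):
    """Lowercased word tokens of the fields used for candidate retrieval."""
    return set(f"{record['name']} {record['mother_name']}".lower().split())


def build_index(records):
    """Inverted index: token -> set of positions of records containing it."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        for tok in tokens(rec):
            index[tok].add(pos)
    return index


def similarity(a, b):
    """Mean string similarity over FIELDS, a value between 0 and 1."""
    ratios = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in FIELDS
    ]
    return sum(ratios) / len(ratios)


def link(query, records, index, cutoff=0.9):
    """Return (position, score) of the best candidate at or above cutoff, else None."""
    # Search step: only records sharing at least one token are compared.
    candidates = set()
    for tok in tokens(query):
        candidates |= index.get(tok, set())
    # Scoring step: rank the retrieved candidates and keep the best one.
    scored = ((pos, similarity(query, records[pos])) for pos in candidates)
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is not None and best[1] >= cutoff:
        return best
    return None


# Made-up records, for illustration only.
records = [
    {"name": "Maria Silva Santos", "mother_name": "Ana Silva", "birth_date": "1990-05-01"},
    {"name": "Joao Pereira Lima", "mother_name": "Rita Pereira", "birth_date": "1985-11-23"},
]
index = build_index(records)
query = {"name": "Maria Silva Santo", "mother_name": "Ana Silva", "birth_date": "1990-05-01"}
match = link(query, records, index)
```

Retrieving candidates through the index avoids comparing every query record against every indexed record, which is the usual motivation for an index-then-score design on large datasets.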
collection DOAJ
language English
format Article
sources DOAJ
author George C. G. Barbosa
M. Sanni Ali
Bruno Araujo
Sandra Reis
Samila Sena
Maria Y. T. Ichihara
Julia Pescarini
Rosemeire L. Fiaccone
Leila D. Amorim
Robespierre Pita
Marcos E. Barreto
Liam Smeeth
Mauricio L. Barreto
author_sort George C. G. Barbosa
title CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability
publisher BMC
series BMC Medical Informatics and Decision Making
issn 1472-6947
publishDate 2020-11-01
description Abstract
Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required.
Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight cores) computation settings.
Results: Overall, the CIDACS-RL algorithm performed best on both positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records.
Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors or distributed infrastructure.
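The Results report sensitivity, specificity, positive predictive value, and a serial versus multi-core timing comparison. As a small worked sketch of how such figures are conventionally computed, the Python below uses the standard confusion-matrix definitions. The function names are illustrative and the counts are made up; only the two execution times (550 and 150 seconds) and the core count (eight) come from the abstract.

```python
def sensitivity(tp, fn):
    """Share of true matches that the linkage found: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of true non-matches correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


def positive_predictive_value(tp, fp):
    """Share of reported links that are true matches: TP / (TP + FP)."""
    return tp / (tp + fp)


def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run is than the serial run."""
    return serial_seconds / parallel_seconds


# Made-up counts, for illustration only (not taken from the article).
tp, fn, fp, tn = 90, 10, 5, 95
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
ppv = positive_predictive_value(tp, fp)

# Times reported in the abstract: 550 s serial, 150 s on eight cores.
observed_speedup = speedup(550, 150)
parallel_efficiency = observed_speedup / 8
```

With the reported times the speedup is 550 / 150, roughly 3.7, which the abstract rounds to "about four times"; divided by eight cores, that is a parallel efficiency of roughly 0.46.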
topic Accuracy
Data linkage
Entity resolution
Indexing
Information retrieval techniques
Scalability
url http://link.springer.com/article/10.1186/s12911-020-01285-w
_version_ 1724446331919925248