CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability
Abstract Background Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge...
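Main Authors: | George C. G. Barbosa, M. Sanni Ali, Bruno Araujo, Sandra Reis, Samila Sena, Maria Y. T. Ichihara, Julia Pescarini, Rosemeire L. Fiaccone, Leila D. Amorim, Robespierre Pita, Marcos E. Barreto, Liam Smeeth, Mauricio L. Barreto |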
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2020-11-01 |
Series: | BMC Medical Informatics and Decision Making |
Subjects: | Accuracy, Data linkage, Entity resolution, Indexing, Information retrieval techniques, Scalability |
Online Access: | http://link.springer.com/article/10.1186/s12911-020-01285-w |
id |
doaj-829763e29b6e4d53945a61f7e72b8f09 |
---|---|
record_format |
Article |
spelling |
doaj-829763e29b6e4d53945a61f7e72b8f09; 2020-11-25T04:01:35Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2020-11-01; 20; 1; 1; 13; 10.1186/s12911-020-01285-w; CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability; George C. G. Barbosa, M. Sanni Ali, Bruno Araujo, Sandra Reis, Samila Sena, Maria Y. T. Ichihara, Julia Pescarini, Rosemeire L. Fiaccone, Leila D. Amorim, Robespierre Pita, Marcos E. Barreto and Mauricio L. Barreto (Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia); Liam Smeeth (Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine); http://link.springer.com/article/10.1186/s12911-020-01285-w; Accuracy; Data linkage; Entity resolution; Indexing; Information retrieval techniques; Scalability |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
George C. G. Barbosa M. Sanni Ali Bruno Araujo Sandra Reis Samila Sena Maria Y. T. Ichihara Julia Pescarini Rosemeire L. Fiaccone Leila D. Amorim Robespierre Pita Marcos E. Barreto Liam Smeeth Mauricio L. Barreto |
author_sort |
George C. G. Barbosa |
title |
CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability |
publisher |
BMC |
series |
BMC Medical Informatics and Decision Making |
issn |
1472-6947 |
publishDate |
2020-11-01 |
description |
Abstract. Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets pose a huge challenge for linkage; hence, an efficient linkage tool with reasonable accuracy and scalability is required. Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight cores) computation settings. Results: Overall, the CIDACS-RL algorithm outperformed the other tools in both positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures. |
topic |
Accuracy Data linkage Entity resolution Indexing Information retrieval techniques Scalability |
url |
http://link.springer.com/article/10.1186/s12911-020-01285-w |