Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes

We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and pr...


Bibliographic Details
Main Authors: Hejblum, Boris P. (Author), Weber, Griffin M. (Author), Liao, Katherine P. (Author), Palmer, Nathan P. (Author), Churchill, Susanne (Author), Shadick, Nancy A. (Author), Szolovits, Peter (Author), Murphy, Shawn N. (Author), Kohane, Isaac S. (Author), Cai, Tianxi (Author)
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory (Contributor), Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Contributor)
Format: Article
Language: English
Published: Springer Nature, 2019-11-11T16:22:01Z.
Online Access: Get fulltext (https://hdl.handle.net/1721.1/122815)
LEADER 02254 am a22002893u 4500
001 122815
042 |a dc 
100 1 0 |a Hejblum, Boris P.  |e author 
100 1 0 |a Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory  |e contributor 
100 1 0 |a Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science  |e contributor 
700 1 0 |a Weber, Griffin M.  |e author 
700 1 0 |a Liao, Katherine P.  |e author 
700 1 0 |a Palmer, Nathan P.  |e author 
700 1 0 |a Churchill, Susanne  |e author 
700 1 0 |a Shadick, Nancy A.  |e author 
700 1 0 |a Szolovits, Peter  |e author 
700 1 0 |a Murphy, Shawn N.  |e author 
700 1 0 |a Kohane, Isaac S.  |e author 
700 1 0 |a Cai, Tianxi  |e author 
245 0 0 |a Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes 
260 |b Springer Nature,   |c 2019-11-11T16:22:01Z. 
856 |z Get fulltext  |u https://hdl.handle.net/1721.1/122815 
520 |a We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and provides a posterior probability of matching for each patient pair, while considering all the data at once. Both in our simulation study (using an administrative claims dataset for data generation) and in two real use-cases linking patient electronic health records from a large tertiary care network, our method exhibits good performance and compares favourably to the standard baseline Fellegi-Sunter algorithm. We propose a scalable, fast and efficient open-source implementation in the ludic R package available on CRAN, which also includes the anonymized diagnosis code data from our real use-case. This work suggests it is possible to link de-identified research databases stripped of any personal health identifiers using only diagnosis codes, provided sufficient information is shared between the data sources. 
520 |a National Institutes of Health (U.S.) (Grant U54-HG007963) 
546 |a en 
655 7 |a Article 
773 |t Scientific Data
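
Illustrative sketch (not part of the catalog record): the abstract describes a posterior probability of matching computed from binarized diagnosis codes. The Python below is a minimal, generic illustration of that idea and is not the authors' model or the ludic package's API. It assumes each diagnosis code has a known prevalence, that each recorded indicator is a noisy copy of a latent true indicator flipped with a hypothetical error rate `eps`, and that codes are independent. It scores one pair at a time, whereas the abstract states that the published method considers all the data at once. All function names, parameter names, and numeric values are assumptions made for illustration.

```python
import math


def observed_rate(p, eps):
    """P(recorded code = 1) given true prevalence p and flip probability eps."""
    return p * (1 - eps) + (1 - p) * eps


def pair_log_likelihood_ratio(a, b, prevalence, eps):
    """Sum over codes of log P(a_k, b_k | match) / P(a_k, b_k | non-match).

    Assumes 0 < p < 1 for every prevalence and 0 < eps < 0.5, so that no
    probability is zero and the logarithm is always defined.
    """
    llr = 0.0
    for x, y, p in zip(a, b, prevalence):
        # Probability of recording value v when the latent true value is t.
        def noisy(v, t):
            return (1 - eps) if v == t else eps

        # Match: both records are noisy copies of the same latent indicator.
        p_match = p * noisy(x, 1) * noisy(y, 1) + (1 - p) * noisy(x, 0) * noisy(y, 0)

        # Non-match: the two records are independent draws.
        q = observed_rate(p, eps)
        p_non_match = (q if x else 1 - q) * (q if y else 1 - q)

        llr += math.log(p_match / p_non_match)
    return llr


def posterior_match_probability(a, b, prevalence, eps=0.05, prior=0.01):
    """Posterior P(match | a, b) from prior odds and the likelihood ratio."""
    logit = math.log(prior / (1 - prior)) + pair_log_likelihood_ratio(a, b, prevalence, eps)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical toy data: five rare codes, two agreeing and two disagreeing records.
prevalence = [0.02] * 5
record_a = [1, 1, 0, 0, 1]
record_b = [1, 1, 0, 0, 1]
record_c = [0, 0, 1, 1, 0]

p_same = posterior_match_probability(record_a, record_b, prevalence)
p_diff = posterior_match_probability(record_a, record_c, prevalence)
print(p_same > p_diff)
```

In this toy setting, agreement on rare codes raises the posterior above the prior and disagreement lowers it, which is the intuition behind linking on diagnosis codes alone "provided sufficient information is shared between the data sources".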