Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage
Introduction Applications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute valu...
Main Authors: | Thilina Ranbaduge, Peter Christen |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University, 2020-12-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/1445 |
id |
doaj-34481e2fbabe47f686c077eec1e3d166 |
---|---|
record_format |
Article |
spelling |
doaj-34481e2fbabe47f686c077eec1e3d166; 2021-02-10T16:43:18Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2020-12-01; volume 5, issue 5; doi:10.23889/ijpds.v5i5.1445; Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage; Thilina Ranbaduge and Peter Christen (both Research School of Computer Science, Australian National University, Canberra, Australia); https://ijpds.org/article/view/1445 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Thilina Ranbaduge, Peter Christen |
title |
Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage |
publisher |
Swansea University |
series |
International Journal of Population Data Science |
issn |
2399-4908 |
publishDate |
2020-12-01 |
description |
Introduction
Applications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute values. No study has systematically investigated how missing values affect the outcomes of different encoding techniques used in privacy-preserving linkage applications.
Objectives and Approach
Binary encodings, such as Bloom filters, are popular for linking sensitive databases and are now employed in real-world linkage applications. However, existing encoding techniques assume that the quasi-identifying attributes used for encoding are complete. Missing values can lead to incomplete encodings, which can decrease or increase similarities and therefore cause false non-matches or false matches. In this study we empirically evaluate three binary encoding techniques using real voter databases. Pairs of records that correspond to the same voter (with name or address changes) yielded files of 100,000 and 500,000 records containing from 0% to 50% missing values.
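As a rough illustration of how a missing attribute can shift similarities in either direction, the following is a minimal sketch of a CLK-style record-level Bloom filter compared with the Dice coefficient. It is not the authors' implementation: the function names (`qgrams`, `clk_encode`, `dice`), the filter length of 1000 bits, the two unkeyed hash functions, and the example voters are all assumptions chosen for readability. A real deployment would use keyed hashing (for example HMAC with a secret shared between the database owners).

```python
import hashlib


def qgrams(value, q=2):
    """Return the set of character q-grams of a value; a missing value yields no q-grams."""
    if not value:
        return set()
    text = value.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def clk_encode(record, length=1000, num_hashes=2):
    """Hash the q-grams of all attributes into one Bloom filter.

    The filter is represented as the set of bit positions that are 1.
    Unkeyed SHA-1/SHA-256 double hashing is an illustrative simplification.
    """
    bits = set()
    for value in record:
        for gram in qgrams(value):
            data = gram.encode("utf-8")
            h1 = int(hashlib.sha1(data).hexdigest(), 16)
            h2 = int(hashlib.sha256(data).hexdigest(), 16)
            for i in range(num_hashes):
                bits.add((h1 + i * h2) % length)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of 1-bit positions."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical records: (first name, last name, street, city).
voter_a = ("peter", "smith", "main street", "canberra")
voter_a_no_city = ("peter", "smith", "main street", None)
voter_b = ("peter", "jones", "high road", "sydney")

# Same voter, one attribute missing: similarity falls below 1 (risk of a false non-match).
print(dice(clk_encode(voter_a), clk_encode(voter_a_no_city)))

# Different voters with only the first name present: the encodings become identical
# (risk of a false match).
print(dice(clk_encode(voter_a[:1]), clk_encode(voter_b[:1])))
```

The first comparison shows the "decreased similarity" case, because the record with the missing city sets only a subset of the bits of the complete record. The second shows the "increased similarity" case, because two different voters who share a first name and have nothing else recorded produce the same filter.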
Results
We encoded between two and four of the attributes first name, last name, street, and city into three record-level binary encodings: cryptographic long-term key (CLK) [Schnell et al. 2009], record-level Bloom filter (RBF) [Durham et al. 2014], and tabulation Min-hashing (TBH) [Smith 2017]. Experiments showed an average drop of 10% to 25% in both precision and recall for all encoding techniques as the proportion of missing values increased. CLK showed the largest decrease in precision, while TBH showed the largest decrease in recall.
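For readers less familiar with the two quality measures reported here, the sketch below computes precision and recall from a set of predicted matching record pairs and a set of true matching pairs. The function name `precision_recall` and the toy record-pair identifiers are assumptions for illustration only and are not data from the study.

```python
def precision_recall(predicted, true_matches):
    """Return (precision, recall) for sets of record-pair identifiers."""
    true_positives = len(predicted & true_matches)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    return precision, recall


# Hypothetical pairs of (record id in database A, record id in database B).
true_matches = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(1, 1), (2, 2), (3, 5)}  # one false match, two missed true matches

precision, recall = precision_recall(predicted, true_matches)
print(precision, recall)
```

In this toy case two of the three predicted pairs are correct, and two of the four true pairs are found. False matches lower precision, while false non-matches lower recall.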
Conclusion
Binary encodings such as Bloom filters are now used in practical applications for linking sensitive databases. Our evaluation shows that such encoding techniques can result in lower linkage quality if there are missing values in quasi-identifying attributes. This highlights the need for novel encoding techniques that can overcome the challenge of missing values.
|
url |
https://ijpds.org/article/view/1445 |