Measuring the impact of spatial perturbations on the relationship between data privacy and validity of descriptive statistics

Abstract. Background: Like many scientific fields, epidemiology is addressing issues of research reproducibility. Spatial epidemiology, which often uses the inherently identifiable variable of participant address, must balance reproducibility with participant privacy. In this study, we assess the impact of several different data perturbation methods on key spatial statistics and patient privacy.


Bibliographic Details
Main Authors: Kelly Broen, Rob Trangucci, Jon Zelner
Format: Article
Language: English
Published: BMC 2021-01-01
Series: International Journal of Health Geographics
Subjects: Geomasking, Privacy, Spatial anonymity, Reproducibility
Online Access: https://doi.org/10.1186/s12942-020-00256-8
id doaj-fc31dcc1e2fb41f19c7d777a5bd04a3b
record_format Article
spelling doaj-fc31dcc1e2fb41f19c7d777a5bd04a3b | 2021-01-10T12:10:57Z | eng | BMC | International Journal of Health Geographics | 1476-072X | 2021-01-01 | 20 | 1 | 1 | 16 | 10.1186/s12942-020-00256-8 | Measuring the impact of spatial perturbations on the relationship between data privacy and validity of descriptive statistics | Kelly Broen (0); Rob Trangucci (1); Jon Zelner (2) | (0) Department of Epidemiology, University of Michigan School of Public Health; (1) Dept. of Statistics, University of Michigan; (2) Department of Epidemiology, University of Michigan School of Public Health | https://doi.org/10.1186/s12942-020-00256-8 | Geomasking; Privacy; Spatial anonymity; Reproducibility
collection DOAJ
language English
format Article
sources DOAJ
author Kelly Broen
Rob Trangucci
Jon Zelner
author_sort Kelly Broen
title Measuring the impact of spatial perturbations on the relationship between data privacy and validity of descriptive statistics
publisher BMC
series International Journal of Health Geographics
issn 1476-072X
publishDate 2021-01-01
description Abstract. Background: Like many scientific fields, epidemiology is addressing issues of research reproducibility. Spatial epidemiology, which often uses the inherently identifiable variable of participant address, must balance reproducibility with participant privacy. In this study, we assess the impact of several different data perturbation methods on key spatial statistics and patient privacy. Methods: We analyzed the impact of perturbation on spatial patterns in the full set of address-level mortality data from Lawrence, MA during the period from 1911 to 1913. The original death locations were perturbed using seven different published approaches to stochastic and deterministic spatial data anonymization. Key spatial descriptive statistics were calculated for each perturbation, including changes in spatial pattern center, Global Moran’s I, Local Moran’s I, distance to the k-th nearest neighbors, and the L-function (a normalized form of Ripley’s K). A spatially adapted form of k-anonymity was used to measure the privacy protection conferred by each method and its compliance with HIPAA and GDPR privacy standards. Results: Random perturbation at 50 m, donut masking between 5 and 50 m, and Voronoi masking maintained the validity of descriptive spatial statistics better than other perturbations. Grid center masking with both 100 × 100 m and 250 × 250 m cells led to large changes in descriptive spatial statistics. None of the perturbation methods adhered to the HIPAA standard that all points have a k-anonymity > 10. All other perturbation methods employed had at least 265 points, or over 6%, not adhering to the HIPAA standard. Conclusions: Using the set of published perturbation methods applied in this analysis, HIPAA- and GDPR-compliant de-identification was not compatible with maintaining key spatial patterns as measured by our chosen summary statistics. Further research should investigate alternative methods for balancing trade-offs between spatial data privacy and preservation of key patterns in public health data that are of scientific and medical importance.
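To make the perturbation methods named in the abstract more concrete, the following is a minimal Python sketch of three of them (random perturbation, donut masking, grid center masking) and of one common way to count spatial k-anonymity. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, coordinates are assumed to be projected in metres, displacement distance is drawn uniformly (the article may sample differently), and k is taken as the number of other original points lying within the displacement radius of the masked location, which may differ from the spatially adapted definition used in the study.

```python
import math
import random


def random_perturbation(x, y, max_dist, rng):
    """Move a point by a uniform random distance in [0, max_dist] in a random direction."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(0.0, max_dist)
    return x + dist * math.cos(angle), y + dist * math.sin(angle)


def donut_mask(x, y, min_dist, max_dist, rng):
    """Like random perturbation, but the point is moved at least min_dist away."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(min_dist, max_dist)
    return x + dist * math.cos(angle), y + dist * math.sin(angle)


def grid_center_mask(x, y, cell):
    """Deterministically snap a point to the center of its square grid cell."""
    cx = (math.floor(x / cell) + 0.5) * cell
    cy = (math.floor(y / cell) + 0.5) * cell
    return cx, cy


def spatial_k_anonymity(original, masked, index):
    """Count other original points within the displacement radius of the masked point.

    Assumed definition: the radius is the distance between the true and the
    masked location of point `index`; the circle is centered on the masked location.
    """
    ox, oy = original[index]
    mx, my = masked[index]
    radius = math.hypot(mx - ox, my - oy)
    return sum(
        1
        for j, (px, py) in enumerate(original)
        if j != index and math.hypot(px - mx, py - my) <= radius
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    points = [(rng.uniform(0, 1000), rng.uniform(0, 1000)) for _ in range(200)]
    masked = [donut_mask(x, y, 5.0, 50.0, rng) for x, y in points]
    ks = [spatial_k_anonymity(points, masked, i) for i in range(len(points))]
    # Share of points that fall short of a threshold of k > 10
    print(sum(k <= 10 for k in ks) / len(ks))
```

Grid center masking is deterministic, so every point in a cell collapses to the same location. This is one plausible reason a method of that kind can distort neighbor distances and clustering statistics more than a small random displacement does.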
topic Geomasking
Privacy
Spatial anonymity
Reproducibility
url https://doi.org/10.1186/s12942-020-00256-8