A theory and toolkit for the mathematics of privacy : methods for anonymizing data while minimizing information loss

Thesis (S.M.)--Massachusetts Institute of Technology, Engineering Systems Division, Technology and Policy Program; and, Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. === Includes bibliographical references (leaves 85-86). === Privac...


Bibliographic Details
Main Author: Katirai, Hooman
Other Authors: Peter Szolovits.
Format: Others
Language: English
Published: Massachusetts Institute of Technology 2006
Subjects:
Online Access: http://hdl.handle.net/1721.1/34526
id ndltd-MIT-oai-dspace.mit.edu-1721.1-34526
record_format oai_dc
collection NDLTD
language English
format Others
sources NDLTD
topic Technology and Policy Program.
Electrical Engineering and Computer Science.
description Thesis (S.M.)--Massachusetts Institute of Technology, Engineering Systems Division, Technology and Policy Program; and, Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. === Includes bibliographical references (leaves 85-86). === Privacy laws are an important facet of our society. But they can also serve as formidable barriers to medical research. The same laws that prevent casual disclosure of medical data have also made it difficult for researchers to access the information they need to conduct research into the causes of disease. But it is possible to overcome some of these legal barriers through technology. The US law known as HIPAA, for example, allows medical records to be released to researchers without patient consent if the records are provably anonymized prior to their disclosure. It is not enough for records to be seemingly anonymous. For example, one researcher estimates that 87.1% of the US population can be uniquely identified by the combination of their zip, gender, and date of birth - fields that most people would consider anonymous. One promising technique for provably anonymizing records is called k-anonymity. It modifies each record so that it matches k other individuals in a population - where k is an arbitrary parameter. This is achieved by, for example, changing specific information such as a date of birth to a less specific counterpart such as a year of birth. Previous studies have shown that achieving k-anonymity while minimizing information loss is an NP-hard problem; thus a brute force search is out of the question for most real world data sets. In this thesis, we present an open source Java toolkit that seeks to anonymize data while minimizing information loss. It uses an optimization framework and methods typically used to attack NP-hard problems, including greedy search and clustering strategies.
To test the toolkit, a number of previously unpublished algorithms and information loss metrics have been implemented. These algorithms and measures are then empirically evaluated using a data set consisting of 1000 real patient medical records taken from a local hospital. The theoretical contributions of this work include: (1) A new threat model for privacy that allows an adversary's capabilities to be modeled using a formalism called a virtual attack database. (2) Rationally defensible information loss measures - we show that previously published information loss measures are difficult to defend because they fall prey to what is known as the "weighted indexing problem." To remedy this problem we propose a number of information-loss measures that are in principle more attractive than previously published measures. (3) We show that suppression and generalization - two concepts that were previously thought to be distinct - are in fact the same thing, insofar as each generalization can be represented by a suppression and vice versa. (4) We show that Domain Generalization Hierarchies can be harvested to assist the construction of a Bayesian network to measure information loss. (5) A database can be thought of as a sub-sample of a population. We outline a technique that allows one to predict k-anonymity in a population. This allows us, under some conditions, to release records that match fewer than k individuals in a database while still achieving k-anonymity against an adversary according to some probability and confidence interval. While we have chosen to focus our thesis on the anonymization of medical records, our methodologies, toolkit and command line tools are equally applicable to any tabular data such as the data one finds in relational databases - the most common type of database today. === by Hooman Katirai. === S.M.
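The generalization step mentioned in the abstract (replacing a date of birth with a year of birth) can be illustrated with a minimal Python sketch. This is not the thesis's Java toolkit or any of its algorithms. The field names, the helper functions, and the four toy records are invented for illustration. The check uses the common convention that a table is k-anonymous when every record shares its quasi-identifier values with at least k - 1 other records, which is an assumption and may differ from the thesis's exact definition.

```python
from collections import Counter

# Hypothetical quasi-identifier field names (not taken from the thesis).
QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def smallest_group(records, fields=QUASI_IDENTIFIERS):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values(), default=0)


def is_k_anonymous(records, k, fields=QUASI_IDENTIFIERS):
    """Assumed convention: every group of matching records has at least k members."""
    return smallest_group(records, fields) >= k


def generalize_dob_to_year(record):
    """Return a copy with a 'YYYY-MM-DD' date of birth replaced by its year 'YYYY'."""
    out = dict(record)
    out["dob"] = record["dob"][:4]
    return out


# Invented toy data; every record is unique on (zip, gender, dob).
records = [
    {"zip": "02139", "gender": "F", "dob": "1970-03-14"},
    {"zip": "02139", "gender": "F", "dob": "1970-11-02"},
    {"zip": "02139", "gender": "M", "dob": "1982-06-30"},
    {"zip": "02139", "gender": "M", "dob": "1982-01-09"},
]

# After generalization the records fall into two groups of two.
generalized = [generalize_dob_to_year(r) for r in records]
```

In this toy table each original record is unique on its quasi-identifiers, while the generalized table passes the check for k = 2 at the cost of losing the month and day. Choosing which fields to generalize, and how far, so that this loss is minimized is the NP-hard optimization problem the abstract refers to.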
author2 Peter Szolovits.
author Katirai, Hooman
title A theory and toolkit for the mathematics of privacy : methods for anonymizing data while minimizing information loss
publisher Massachusetts Institute of Technology
publishDate 2006
url http://hdl.handle.net/1721.1/34526
spelling ndltd-MIT-oai-dspace.mit.edu-1721.1-34526 2019-05-02T16:06:53Z Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Massachusetts Institute of Technology. Technology and Policy Program. 2006-11-07T12:45:15Z 2006 Thesis http://hdl.handle.net/1721.1/34526 70902079 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 86 leaves 14904672 bytes 14904307 bytes application/pdf Massachusetts Institute of Technology