k-anonymity for High Dimensional and Sparse Data

Bibliographic Details
Main Authors: Cheng-Yen Wu, 吳政諺
Other Authors: Hon-Son Don
Format: Others
Language: zh-TW
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/04450471731025927163
Description
Summary: Master's === National Chung Hsing University === Department of Electrical Engineering === 105 === As information technology has developed rapidly, a variety of intelligent IoT applications are gradually entering daily life and generating large amounts of data. These data may contain valuable knowledge and could be used to develop more intelligent applications. However, they also contain a great deal of sensitive information, which raises concerns about privacy leakage and security risks. Once the data are leaked, the sensitive information may be mined and used in virtual or physical attacks, such as identifying an anonymous user, revealing a user's physical location, or exposing information that the user is reluctant to disclose. Therefore, data should be carefully examined and go through a privacy-protection process before being released. k-anonymity is one of the most frequently used privacy-preserving models. Generalization and perturbation are the techniques commonly used to conceal data and provide a probability-based guarantee. However, most generalization and perturbation techniques do not account for the high dimensionality and sparsity of the data, and thus lead to low data utility. Therefore, we formulate a new High-dimensional Data M-Subspace (HDMS) k-anonymity problem to preserve privacy for high-dimensional and sparse data. Moreover, we derive the Information Loss and Anonymity Crisis (ILAC) formula to calculate the similarity between data points, and propose six greedy algorithms to solve the HDMS k-anonymity problem. Our approaches aim to lower the "Information Loss" while ensuring privacy protection. To validate our methods, we conduct experiments on two public real-world datasets to study the key factors that influence the "Information Loss" and the computing time. According to the experimental results, our method outperforms traditional methods, achieving k-anonymity with lower "Information Loss".
The experimental results also show a long computing time, which is due to the high computational complexity. Future work includes designing efficient data structures or algorithms to make the technique practical.
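
For readers unfamiliar with the model named in the summary, the following is a minimal sketch of what the k-anonymity condition means and how generalization can satisfy it. It is not the thesis's HDMS formulation, ILAC formula, or any of its six greedy algorithms, none of which are specified in this record. The records, the field names (`age`, `zip`, `item`), the choice of quasi-identifiers, and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a coarse range, e.g. 34 -> '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical toy data: "age" and "zip" act as quasi-identifiers,
# "item" stands in for the sensitive attribute.
records = [
    {"age": 31, "zip": "402", "item": "A"},
    {"age": 34, "zip": "402", "item": "B"},
    {"age": 38, "zip": "402", "item": "C"},
    {"age": 52, "zip": "402", "item": "D"},
    {"age": 57, "zip": "402", "item": "E"},
]

# Every exact age is unique, so the raw table is not 2-anonymous.
raw_ok = is_k_anonymous(records, ["age", "zip"], k=2)

# After coarsening ages into decades, each group has at least two members.
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)

print(raw_ok, generalized_ok)
```

The coarsening step is where information is lost: the wider the age range, the easier it is to reach k members per group, but the less precise the released data. With many sparse attributes, groups of identical records become rare and much more coarsening is needed, which is the utility problem the summary describes.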