Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives

In the medical domain, it is imperative that the protection of information does not overlook any individual. The research community in this domain encourages the use of real-time datasets for research purposes. These real-time datasets contain structured and unstructured (natural language free tex...

Bibliographic Details
Main Authors: Saman Hina, Raheela Asif, Syed Abbas Ali
Format: Article
Language: English
Published: Mehran University of Engineering and Technology 2020-07-01
Series: Mehran University Research Journal of Engineering and Technology
Online Access: https://publications.muet.edu.pk/index.php/muetrj/article/view/1704
id doaj-e31157b2e2f04937b4992e57a3992ab8
record_format Article
spelling doaj-e31157b2e2f04937b4992e57a3992ab8
2020-11-25T02:36:36Z
eng
Mehran University of Engineering and Technology
Mehran University Research Journal of Engineering and Technology
ISSN 0254-7821, 2413-7219
2020-07-01
Volume 39, Issue 3, pp. 612-624
DOI 10.22581/muet1982.2003.16
Article ID 1704
Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives
Saman Hina, Department of Computer Science and Information Technology, NED University of Engineering and Technology, Karachi, Pakistan
Raheela Asif, Department of Software Engineering, NED University of Engineering and Technology, Karachi, Pakistan
Syed Abbas Ali, Department of Computer and Information Systems Engineering, NED University of Engineering and Technology, Karachi, Pakistan
https://publications.muet.edu.pk/index.php/muetrj/article/view/1704
collection DOAJ
language English
format Article
sources DOAJ
author Saman Hina
Raheela Asif
Syed Abbas Ali
title Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives
publisher Mehran University of Engineering and Technology
series Mehran University Research Journal of Engineering and Technology
issn 0254-7821
2413-7219
publishDate 2020-07-01
description In the medical domain, it is imperative that the protection of information does not overlook any individual. The research community in this domain encourages the use of real-time datasets for research purposes. These real-time datasets contain structured and unstructured (natural language free text) information that can be useful to researchers in various disciplines, including computational linguistics. However, these datasets cannot be distributed without anonymization of Protected Health Information (PHI), because disclosing PHI that can identify an individual (such as name, age, or address) is unethical. Therefore, we present a rule-based Natural Language Processing (NLP) anonymization system developed on a challenging corpus containing medical narratives and ICD-10 codes (medical codes). This anonymization module can be used to pre-process a corpus containing identifiable information. The corpus used in this research contains 2,534 PHIs across 1,984 medical records. 15% of the labelled corpus was used to refine the guidelines for identifying and classifying PHI groups, and the remaining 85% was held out for evaluation. Our anonymization system follows a two-step process: (1) identification and classification of PHIs into four PHI categories ('Patients Name', 'Doctors Name', 'Other Name' [names other than those of patients and doctors], and 'Place Name'); (2) anonymization of PHIs by replacing each identified PHI with its respective PHI category. Our method uses basic language processing, dictionaries, rules, and heuristics to identify, classify, and anonymize PHIs. We use standard metrics for evaluation; measured against a human-annotated gold standard, our system achieves an F-measure of 100%, an increase of 39% over the baseline results, which demonstrates the reliability of the data for research use.
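The two-step process in the description (identify and classify PHIs, then replace each PHI with its category label) can be illustrated with a minimal Python sketch. This is not the authors' system: the cue words, the tiny place dictionary, the `NAME` pattern, and the function names `anonymize` and `f_measure` are all hypothetical, chosen only to show how dictionary and rule lookups might feed a category replacement and how an F-measure against a gold standard could be computed.

```python
import re

# Hypothetical capitalized-name pattern; real medical narratives would need far richer rules.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical place dictionary (the paper's actual dictionaries are not given in this record).
PLACE_NAMES = {"Karachi", "Lahore"}
_places = "|".join(re.escape(p) for p in sorted(PLACE_NAMES))

# Each rule maps one of the four PHI categories to a pattern with a cue and the PHI itself.
# Treating "Mr./Mrs." as a patient cue is an assumption made purely for illustration.
RULES = [
    ("Doctors Name", re.compile(rf"\b(?P<cue>(?:Dr\.?|Prof\.)\s+)(?P<phi>{NAME})")),
    ("Patients Name", re.compile(rf"\b(?P<cue>(?:Mr\.|Mrs\.|Ms\.|Miss)\s+)(?P<phi>{NAME})")),
    ("Other Name", re.compile(
        rf"\b(?P<cue>(?:brother|sister|wife|husband|son|daughter)\s+)(?P<phi>{NAME})")),
    ("Place Name", re.compile(rf"\b(?P<cue>)(?P<phi>{_places})\b")),
]


def anonymize(text):
    """Step 1: find and classify PHIs. Step 2: replace each PHI with its category label.

    Returns the anonymized text and a list of (category, original_text) pairs.
    """
    found = []
    for category, pattern in RULES:
        def repl(match, category=category):
            found.append((category, match.group("phi")))
            # Keep the cue word (title or relation) and swap only the identifying span.
            return f"{match.group('cue')}[{category}]"
        text = pattern.sub(repl, text)
    return text, found


def f_measure(predicted, gold):
    """Balanced F-measure over sets of (category, text) annotations."""
    p, g = set(predicted), set(gold)
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    note = "Mr. Ali Raza was seen by Dr. Sara Ahmed in Karachi with his brother Bilal."
    anonymized, phis = anonymize(note)
    print(anonymized)
    print(phis)
```

In this sketch the rules run in a fixed order and each replacement leaves a bracketed category label in the text, so later rules cannot re-match a span that has already been anonymized. The sample sentence and names are invented for the demonstration.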
url https://publications.muet.edu.pk/index.php/muetrj/article/view/1704