Using Biomedical Text as Data and Representation Learning for Identifying Patients with an Osteoarthritis Phenotype in the Electronic Medical Record

Introduction Electronic medical records (EMRs) are increasingly used in health services research. Accurate/efficient identification of a target population with a specific disease phenotype is a necessary precursor to studying the health of these individuals. Objectives and Approach We explored t...


Bibliographic Details
Main Authors: Christopher Meaney, Jessica Widdifield, Liisa Jaakkimainen, Michael Escobar, Frank Rudzicz, Karen Tu
Format: Article
Language: English
Published: Swansea University 2018-08-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/761
id doaj-11f4540ea6a84e6ab5d0554a7d9dcffd
record_format Article
spelling doaj-11f4540ea6a84e6ab5d0554a7d9dcffd 2020-11-24T21:43:05Z eng Swansea University International Journal of Population Data Science 2399-4908 2018-08-01 3 4 10.23889/ijpds.v3i4.761 Using Biomedical Text as Data and Representation Learning for Identifying Patients with an Osteoarthritis Phenotype in the Electronic Medical Record
Christopher Meaney; Jessica Widdifield; Liisa Jaakkimainen; Michael Escobar; Frank Rudzicz; Karen Tu
University of Toronto; Institute for Clinical Evaluative Sciences; Institute for Clinical Evaluative Sciences; University of Toronto; University Health Network; University of Toronto
https://ijpds.org/article/view/761
collection DOAJ
language English
format Article
sources DOAJ
author Christopher Meaney
Jessica Widdifield
Liisa Jaakkimainen
Michael Escobar
Frank Rudzicz
Karen Tu
title Using Biomedical Text as Data and Representation Learning for Identifying Patients with an Osteoarthritis Phenotype in the Electronic Medical Record
publisher Swansea University
series International Journal of Population Data Science
issn 2399-4908
publishDate 2018-08-01
description Introduction: Electronic medical records (EMRs) are increasingly used in health services research. Accurate and efficient identification of a target population with a specific disease phenotype is a necessary precursor to studying the health of these individuals.
Objectives and Approach: We explored the use of biomedical text as input to supervised phenotype identification algorithms. We employed a two-stage classification approach. First, we mapped the discrete, sparse, high-dimensional biomedical text data to a dense, low-dimensional vector space using methods from unsupervised machine learning. Next, we used these learned vectors as inputs to supervised machine learning algorithms for phenotype identification. We demonstrated the applicability of the approach by identifying patients with an osteoarthritis (OA) phenotype using primary care data from the Electronic Medical Record Administrative data Linked Database (EMRALD) held at ICES.
Results: EMRALD contains approximately 20 GB of biomedical text data on approximately 500,000 patients. The unit of analysis for this study is the patient. We were interested in identifying OA patients using solely text data as features. Labelled outcome information was available from a random sample of 7,500 patients. We divided patients into training (N=6,000), validation (N=750) and test (N=750) cohorts. We learned low-dimensional representations of the input text data on the entire EMRALD corpus (N=500,000). We used the learned numeric vectors as inputs to supervised machine learning models for OA classification (N=6,000 training set patients). We compared models in terms of accuracy, sensitivity, specificity, PPV and NPV. The best learned models achieved approximately 90% sensitivity and 80% specificity. Classification accuracy varied as a function of the learned inputs.
Conclusion/Implications: We developed an approach to phenotype identification using solely biomedical text as an input. Preliminary results suggest our two-stage ML approach has improved operating characteristics compared to existing clinically derived decision rules for OA classification. Future work will explore the generalizability of this methodology to other disease phenotypes.
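The abstract compares models by accuracy, sensitivity, specificity, PPV and NPV. As a reading aid, the sketch below shows how these five operating characteristics are conventionally computed from binary labels (1 = OA, 0 = not OA). It is not the authors' code: the function name `operating_characteristics`, the dataclass and the toy labels are hypothetical, and the record does not say how the authors computed or thresholded their predictions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OperatingCharacteristics:
    accuracy: float
    sensitivity: float   # true positive rate: TP / (TP + FN)
    specificity: float   # true negative rate: TN / (TN + FP)
    ppv: float           # positive predictive value: TP / (TP + FP)
    npv: float           # negative predictive value: TN / (TN + FN)


def operating_characteristics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> OperatingCharacteristics:
    """Compute binary classification metrics from 0/1 labels (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Tally the four cells of the 2x2 confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return OperatingCharacteristics(
        accuracy=ratio(tp + tn, tp + tn + fp + fn),
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
        npv=ratio(tn, tn + fn),
    )


if __name__ == "__main__":
    # Toy labels for illustration only; these are not study data.
    reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    print(operating_characteristics(reference, predicted))
```

Sensitivity and specificity depend only on the classifier's behaviour within each true class, whereas PPV and NPV also depend on how common OA is in the evaluated cohort, which is one reason to report all five together.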
url https://ijpds.org/article/view/761