Learning Unsupervised Representations from Biomedical Text
Main Authors: Christopher Meaney (University of Toronto), Karen Tu (University of Toronto), Liisa Jaakkimainen (Institute for Clinical Evaluative Sciences), Michael Escobar (University of Toronto), Frank Rudzicz (University Health Network), Jessica Widdifield (Institute for Clinical Evaluative Sciences)
Format: Article
Language: English
Published: Swansea University, 2018-08-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/760
ISSN: 2399-4908
Description:
Introduction
Healthcare settings are becoming increasingly technological. Interactions and events involving healthcare providers and the patients they serve are captured as digital text, and healthcare organizations are amassing increasingly large and complex collections of biomedical text data. Researchers and policy makers are beginning to explore these holdings for structure, patterns, and meaning.
Objectives and Approach
EMRALD is a primary care electronic medical record (EMR) database comprising more than 40 family medicine clinics, nearly 400 primary care physicians, and over 500,000 patients. EMRALD includes full-chart extractions, with all clinical narrative data captured across a variety of fields.
The input data (raw text strings) are discrete, sparse, and high dimensional. We assessed scalable statistical models for high dimensional discrete data, fitting, evaluating, and exploring models from three broad statistical areas: i) matrix factorization/decomposition models, ii) probabilistic topic models, and iii) word-vector embedding models.
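The abstract does not say which factorization method was used, but the first model family can be sketched with scikit-learn's truncated SVD (latent semantic analysis) on a toy document-term matrix; the note texts below are invented stand-ins, since EMRALD data is not public.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hand-written toy stand-ins for clinical notes (EMRALD data is not public).
notes = [
    "patient reports knee pain, ibuprofen prescribed",
    "annual exam, blood pressure normal, diet reviewed",
    "follow up for knee pain, physiotherapy referral",
    "diet and nutrition counselling, weight stable",
]

# Documents x vocabulary count matrix: discrete, sparse, high dimensional.
X = CountVectorizer().fit_transform(notes)

# Factor X into a low-rank product; each note gets a dense 2-d representation.
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)
print(Z.shape)  # (4, 2)
```

The low-dimensional rows of `Z` can then be clustered, plotted, or fed to downstream models in place of the sparse counts.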
Results
EMRALD comprises 12 text data streams, structured into 84 million clinical notes (3.5 billion word/language tokens) occupying approximately 18 GB of storage. We employ a “text as data” pipeline: i) mapping raw strings to sequences of word/language tokens, ii) mapping token sequences to numeric arrays, and iii) using the numeric arrays as inputs to statistical models.
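The first two pipeline steps can be illustrated with scikit-learn's `CountVectorizer` (an assumed tool, not necessarily the one used in the study), again on invented toy notes:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hand-written toy stand-ins for clinical notes (EMRALD data is not public).
notes = [
    "patient reports knee pain, ibuprofen prescribed",
    "follow up for knee pain, physiotherapy referral",
    "annual exam, blood pressure normal, diet reviewed",
]

# Steps i) and ii): tokenize each raw string, then map the token
# sequences to a sparse documents-x-vocabulary array of counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(notes)

print(X.shape[0])                        # 3 documents
print("pain" in vectorizer.vocabulary_)  # True
```

Step iii) then treats `X` as the input to a factorization, topic, or embedding model.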
Fitted topic models yield useful thematic summaries of the EMRALD corpora. The topics discovered reflect core responsibilities of primary care physicians (e.g. women’s health, pain management, nutrition/diet).
Fitted vector embedding models capture syntactic and discourse structure. Related words are mapped to nearby locations in the vector space, and analogical reasoning is possible in the embedding space.
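Embedding-space analogical reasoning can be shown with vector arithmetic and cosine similarity. The 3-d vectors below are hand-crafted so the analogy works by construction; real embeddings (e.g. from word2vec-style models) are learned, hundreds of dimensions wide, and the word pairs here are purely illustrative.

```python
import numpy as np

# Hand-crafted toy "embeddings" (real vectors are learned from co-occurrence).
vecs = {
    "hypertension": np.array([1.0, 0.9, 0.1]),
    "amlodipine":   np.array([1.0, 0.1, 0.9]),
    "diabetes":     np.array([0.1, 0.9, 0.1]),
    "metformin":    np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Analogy: hypertension is to amlodipine as diabetes is to ... ?
query = vecs["amlodipine"] - vecs["hypertension"] + vecs["diabetes"]
best = max((w for w in vecs if w != "diabetes"),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # metformin
```

The same offset arithmetic (condition → treatment) is what makes analogical queries possible in a learned embedding space.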
Conclusion/Implications
“Text as data” research requires an understanding of statistical models for discrete, sparse, high dimensional data. We fit a variety of unsupervised statistical models to biomedical text data. Preliminary results suggest that the learned low-dimensional representations of the biomedical text are effective at uncovering meaningful patterns and structure.