Learning Unsupervised Representations from Biomedical Text

Bibliographic Details
Main Authors: Christopher Meaney, Karen Tu, Liisa Jaakkimainen, Michael Escobar, Frank Rudzicz, Jessica Widdifield
Format: Article
Language: English
Published: Swansea University, 2018-08-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/760
id doaj-d3abcbfd760b4686bbe318a324310085
record_format Article
spelling doaj-d3abcbfd760b4686bbe318a324310085 2020-11-24T22:45:26Z
doi 10.23889/ijpds.v3i4.760
author_affiliations Christopher Meaney (University of Toronto); Karen Tu (University of Toronto); Liisa Jaakkimainen (Institute for Clinical Evaluative Sciences); Michael Escobar (University of Toronto); Frank Rudzicz (University Health Network); Jessica Widdifield (Institute for Clinical Evaluative Sciences)
collection DOAJ
language English
format Article
sources DOAJ
author Christopher Meaney
Karen Tu
Liisa Jaakkimainen
Michael Escobar
Frank Rudzicz
Jessica Widdifield
author_sort Christopher Meaney
title Learning Unsupervised Representations from Biomedical Text
publisher Swansea University
series International Journal of Population Data Science
issn 2399-4908
publishDate 2018-08-01
description Introduction: Healthcare settings are becoming increasingly technological. Interactions and events involving healthcare providers and the patients they serve are captured as digital text, and healthcare organizations are amassing increasingly large and complex collections of biomedical text data. Researchers and policy makers are beginning to explore these text data holdings for structure, patterns, and meaning.

Objectives and Approach: EMRALD is a primary care electronic medical record (EMR) database comprising over 40 family medicine clinics, nearly 400 primary care physicians, and over 500,000 patients. EMRALD includes full-chart extractions, with all clinical narrative data across a variety of fields. The input data (raw text strings) are discrete, sparse, and high dimensional. We assessed scalable statistical models for high dimensional discrete data, fitting, evaluating, and exploring models from three broad statistical areas: i) matrix factorization/decomposition models, ii) probabilistic topic models, and iii) word-vector embedding models.

Results: EMRALD comprises 12 text data streams, structured into 84 million clinical notes (3.5 billion word/language tokens) and occupying approximately 18 GB of storage. We employ a “text as data” pipeline: i) mapping raw strings to sequences of word/language tokens, ii) mapping token sequences to numeric arrays, and iii) using the numeric arrays as inputs to statistical models. Fitted topic models yield useful thematic summaries of the EMRALD corpora; the discovered topics reflect core responsibilities of primary care physicians (e.g. women’s health, pain management, nutrition/diet). Fitted word-vector embedding models capture the structure of discourse and syntax: related words are mapped to nearby locations in the vector space, and analogical reasoning is possible in the embedding space.
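The three pipeline steps described above can be sketched in a few lines of Python. This is a minimal illustration on hypothetical toy notes (not EMRALD data), assuming a simple regex word tokenizer and a bag-of-words count array; the paper's actual preprocessing is not specified here.

```python
import re
from collections import Counter

def tokenize(note: str) -> list[str]:
    # Step i: raw string -> sequence of lowercase word tokens
    return re.findall(r"[a-z']+", note.lower())

def count_matrix(notes: list[str]) -> tuple[list[str], list[list[int]]]:
    # Step ii: token sequences -> numeric (document x term) count array
    token_seqs = [tokenize(n) for n in notes]
    vocab = sorted({tok for seq in token_seqs for tok in seq})
    rows = []
    for seq in token_seqs:
        counts = Counter(seq)          # Counter returns 0 for absent terms
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

# Step iii: the sparse count array is the numeric input to downstream
# models (matrix factorization, topic models, embeddings).
notes = ["Patient reports knee pain.", "Knee pain improving with rest."]
vocab, X = count_matrix(notes)
```

The resulting document-term array is exactly the discrete, sparse, high dimensional object the abstract refers to: most vocabulary terms never occur in any given note, so most entries are zero.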
Conclusion/Implications: Treating “text as data” requires an understanding of statistical models for discrete, sparse, high dimensional data. We fit a variety of unsupervised statistical models to biomedical text data. Preliminary results suggest that the learned low dimensional representations of the biomedical text are effective at uncovering meaningful patterns and structure.
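The analogical reasoning in embedding space mentioned in the results can be sketched as vector arithmetic plus a nearest-neighbour lookup. The 2-d vectors below are hand-picked toy values for illustration only (real embedding models learn hundreds of dimensions from co-occurrence statistics); the vocabulary is hypothetical, not drawn from EMRALD.

```python
import math

# Toy 2-d "embeddings": nearby vectors stand in for related words.
vec = {
    "knee":    [1.0, 0.1],
    "arm":     [1.0, 0.9],
    "kneecap": [2.0, 0.2],
    "forearm": [2.0, 1.0],
    "patient": [0.1, 2.0],
    "rest":    [0.2, 1.8],
}

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    # Solve "a is to b as c is to ?" by vector arithmetic:
    # query = vec[b] - vec[a] + vec[c]; answer = most similar other word.
    q = [vb - va + vc for va, vb, vc in zip(vec[a], vec[b], vec[c])]
    candidates = [w for w in vec if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vec[w], q))

result = analogy("knee", "kneecap", "arm")
```

With these toy vectors, kneecap − knee + arm lands closest to forearm, mirroring the kind of relational structure the abstract reports in the learned embedding space.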
url https://ijpds.org/article/view/760
work_keys_str_mv AT christophermeaney learningunsupervisedrepresentationsfrombiomedicaltext
AT karentu learningunsupervisedrepresentationsfrombiomedicaltext
AT liisajaakkimainen learningunsupervisedrepresentationsfrombiomedicaltext
AT michaelescobar learningunsupervisedrepresentationsfrombiomedicaltext
AT frankrudzicz learningunsupervisedrepresentationsfrombiomedicaltext
AT jessicawiddifield learningunsupervisedrepresentationsfrombiomedicaltext
_version_ 1725688532618444800