Learning Unsupervised Representations from Biomedical Text
Main Authors: Christopher Meaney (University of Toronto), Karen Tu (University of Toronto), Liisa Jaakkimainen (Institute for Clinical Evaluative Sciences), Michael Escobar (University of Toronto), Frank Rudzicz (University Health Network), Jessica Widdifield (Institute for Clinical Evaluative Sciences)
Format: Article
Language: English
Published: Swansea University, 2018-08-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/760
ISSN: 2399-4908
Description:
Introduction
Healthcare settings are becoming increasingly technological. Interactions and events involving healthcare providers and the patients they serve are captured as digital text, and healthcare organizations are amassing increasingly large and complex collections of biomedical text data. Researchers and policy makers are beginning to explore these holdings for structure, patterns, and meaning.
Objectives and Approach
EMRALD is a primary care electronic medical record (EMR) database comprising more than 40 family medicine clinics, nearly 400 primary care physicians, and over 500,000 patients. EMRALD includes full-chart extractions, with all clinical narrative data captured across a variety of fields.
The input data (raw text strings) are discrete, sparse, and high dimensional. We assessed scalable statistical models for high dimensional discrete data, fitting, evaluating, and exploring models from three broad statistical areas: i) matrix factorization/decomposition models, ii) probabilistic topic models, and iii) word-vector embedding models.
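The abstract does not say which factorization method was used, but the first model family can be sketched with scikit-learn's truncated SVD (latent semantic analysis) on a toy document-term matrix; the note texts below are invented stand-ins, since EMRALD data is not public.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hand-written toy stand-ins for clinical notes (EMRALD data is not public).
notes = [
    "patient reports knee pain, ibuprofen prescribed",
    "annual exam, blood pressure normal, diet reviewed",
    "follow up for knee pain, physiotherapy referral",
    "diet and nutrition counselling, weight stable",
]

# Documents x vocabulary count matrix: discrete, sparse, high dimensional.
X = CountVectorizer().fit_transform(notes)

# Factor X into a low-rank product; each note gets a dense 2-d representation.
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)
print(Z.shape)  # (4, 2)
```

The low-dimensional rows of `Z` can then be clustered, plotted, or fed to downstream models in place of the sparse counts.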
Results
EMRALD comprises 12 text data streams, structured into 84 million clinical notes (3.5 billion word/language tokens) occupying approximately 18 GB of storage. We employ a “text as data” pipeline: i) mapping raw strings to sequences of word/language tokens, ii) mapping token sequences to numeric arrays, and iii) using the numeric arrays as inputs to statistical models.
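The first two pipeline steps can be illustrated with scikit-learn's `CountVectorizer` (an assumed tool, not necessarily the one used in the study), again on invented toy notes:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hand-written toy stand-ins for clinical notes (EMRALD data is not public).
notes = [
    "patient reports knee pain, ibuprofen prescribed",
    "follow up for knee pain, physiotherapy referral",
    "annual exam, blood pressure normal, diet reviewed",
]

# Steps i) and ii): tokenize each raw string, then map the token
# sequences to a sparse documents-x-vocabulary array of counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(notes)

print(X.shape[0])                        # 3 documents
print("pain" in vectorizer.vocabulary_)  # True
```

Step iii) then treats `X` as the input to a factorization, topic, or embedding model.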
Fitted topic models yield useful thematic summaries of the EMRALD corpora. The topics discovered reflect core responsibilities of primary care physicians (e.g. women’s health, pain management, nutrition/diet).
Fitted vector embedding models capture syntactic and discourse structure. Related words are mapped to nearby locations in the vector space, and analogical reasoning is possible in the embedding space.
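Embedding-space analogical reasoning can be shown with vector arithmetic and cosine similarity. The 3-d vectors below are hand-crafted so the analogy works by construction; real embeddings (e.g. from word2vec-style models) are learned, hundreds of dimensions wide, and the word pairs here are purely illustrative.

```python
import numpy as np

# Hand-crafted toy "embeddings" (real vectors are learned from co-occurrence).
vecs = {
    "hypertension": np.array([1.0, 0.9, 0.1]),
    "amlodipine":   np.array([1.0, 0.1, 0.9]),
    "diabetes":     np.array([0.1, 0.9, 0.1]),
    "metformin":    np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Analogy: hypertension is to amlodipine as diabetes is to ... ?
query = vecs["amlodipine"] - vecs["hypertension"] + vecs["diabetes"]
best = max((w for w in vecs if w != "diabetes"),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # metformin
```

The same offset arithmetic (condition → treatment) is what makes analogical queries possible in a learned embedding space.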
Conclusion/Implications
“Text as data” research requires an understanding of statistical models for discrete, sparse, high dimensional data. We fit a variety of unsupervised statistical models to biomedical text data. Preliminary results suggest that the learned low-dimensional representations of the biomedical text are effective at uncovering meaningful patterns and structure.