PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES

Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health reco...

Bibliographic Details
Other Authors:	Richter, Aaron N. (author)
Format:	Others
Language:	English
Published:	Florida Atlantic University
Subjects:	Melanoma; Electronic Health Records; Machine learning > Technique; Big Data
Online Access:	http://purl.flvc.org/fau/fd/FA00013342

id	ndltd-fau.edu-oai-fau.digital.flvc.org-fau_41962
record_format	oai_dc
spelling	ndltd-fau.edu-oai-fau.digital.flvc.org-fau_41962 2019-10-17T03:26:52Z PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES FA00013342 Richter, Aaron N. (author) Khoshgoftaar, Taghi M. (Thesis advisor) Florida Atlantic University (Degree grantor) College of Engineering and Computer Science Department of Computer and Electrical Engineering and Computer Science 191 p. application/pdf Electronic Thesis or Dissertation Text English Florida Atlantic University Melanoma Electronic Health Records Machine learning--Technique Big Data Includes bibliography. Dissertation (Ph.D.)--Florida Atlantic University, 2019. FAU Electronic Theses and Dissertations Collection Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. http://purl.flvc.org/fau/fd/FA00013342 http://rightsstatements.org/vocab/InC/1.0/ https://fau.digital.flvc.org/islandora/object/fau%3A41962/datastream/TN/view/PREDICTING%20MELANOMA%20RISK%20FROM%20ELECTRONIC%20HEALTH%20RECORDS%20WITH%20MACHINE%20LEARNING%20TECHNIQUES.jpg
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Melanoma; Electronic Health Records; Machine learning--Technique; Big Data
description	Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health records collect an enormous amount of data about real-world patient encounters, treatments, and outcomes. This data can be mined to increase our understanding of melanoma as well as build personalized models to predict risk of developing the cancer. Cancer risk models built from structured clinical data are limited in current research, with most studies involving just a few variables from institutional databases or registries. This dissertation presents data processing and machine learning approaches to build melanoma risk models from a large database of de-identified electronic health records. The database contains consistently captured structured data, enabling the extraction of hundreds of thousands of data points each from millions of patient records. Several experiments are performed to build effective models, particularly to predict sentinel lymph node metastasis in known melanoma patients and to predict individual risk of developing melanoma. Data for these models suffer from high dimensionality and class imbalance. Thus, classifiers such as logistic regression, support vector machines, random forest, and XGBoost are combined with advanced modeling techniques such as feature selection and data sampling. Risk factors are evaluated using regression model weights and decision trees, while personalized predictions are provided through random forest decomposition and Shapley additive explanations. Random undersampling on the melanoma risk dataset shows that many majority samples can be removed without a decrease in model performance. 
To determine how much data is truly needed, we explore learning curve approximation methods on the melanoma data and three publicly-available large-scale biomedical datasets. We apply an inverse power law model as well as introduce a novel semi-supervised curve creation method that utilizes a small amount of labeled data. === Includes bibliography. === Dissertation (Ph.D.)--Florida Atlantic University, 2019. === FAU Electronic Theses and Dissertations Collection
author2	Richter, Aaron N. (author)
title	PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES
publisher	Florida Atlantic University
url	http://purl.flvc.org/fau/fd/FA00013342
_version_	1719269939797295104
