PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES

Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health reco...

Bibliographic Details
Other Authors: Richter, Aaron N. (author)
Format: Others
Language: English
Published: Florida Atlantic University
Subjects: Melanoma; Electronic Health Records; Machine learning--Technique; Big Data
Online Access: http://purl.flvc.org/fau/fd/FA00013342
id ndltd-fau.edu-oai-fau.digital.flvc.org-fau_41962
record_format oai_dc
spelling ndltd-fau.edu-oai-fau.digital.flvc.org-fau_41962 2019-10-17T03:26:52Z PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES FA00013342 Richter, Aaron N. (author) Khoshgoftaar, Taghi M. (Thesis advisor) Florida Atlantic University (Degree grantor) College of Engineering and Computer Science Department of Computer and Electrical Engineering and Computer Science 191 p. application/pdf Electronic Thesis or Dissertation Text English Florida Atlantic University Melanoma Electronic Health Records Machine learning--Technique Big Data Includes bibliography. Dissertation (Ph.D.)--Florida Atlantic University, 2019. FAU Electronic Theses and Dissertations Collection Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. http://purl.flvc.org/fau/fd/FA00013342 http://rightsstatements.org/vocab/InC/1.0/ https://fau.digital.flvc.org/islandora/object/fau%3A41962/datastream/TN/view/PREDICTING%20MELANOMA%20RISK%20FROM%20ELECTRONIC%20HEALTH%20RECORDS%20WITH%20MACHINE%20LEARNING%20TECHNIQUES.jpg
collection NDLTD
language English
format Others
sources NDLTD
topic Melanoma
Electronic Health Records
Machine learning--Technique
Big Data
description Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health records collect an enormous amount of data about real-world patient encounters, treatments, and outcomes. This data can be mined to increase our understanding of melanoma as well as build personalized models to predict risk of developing the cancer. Cancer risk models built from structured clinical data are limited in current research, with most studies involving just a few variables from institutional databases or registries. This dissertation presents data processing and machine learning approaches to build melanoma risk models from a large database of de-identified electronic health records. The database contains consistently captured structured data, enabling the extraction of hundreds of thousands of data points each from millions of patient records. Several experiments are performed to build effective models, particularly to predict sentinel lymph node metastasis in known melanoma patients and to predict individual risk of developing melanoma. Data for these models suffer from high dimensionality and class imbalance. Thus, classifiers such as logistic regression, support vector machines, random forest, and XGBoost are combined with advanced modeling techniques such as feature selection and data sampling. Risk factors are evaluated using regression model weights and decision trees, while personalized predictions are provided through random forest decomposition and Shapley additive explanations. Random undersampling on the melanoma risk dataset shows that many majority samples can be removed without a decrease in model performance. To determine how much data is truly needed, we explore learning curve approximation methods on the melanoma data and three publicly available large-scale biomedical datasets. We apply an inverse power law model as well as introduce a novel semi-supervised curve creation method that utilizes a small amount of labeled data. === Includes bibliography. === Dissertation (Ph.D.)--Florida Atlantic University, 2019. === FAU Electronic Theses and Dissertations Collection
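The random undersampling mentioned in the description can be illustrated with a minimal sketch. This is not the dissertation's implementation: the helper name random_undersample, the majority_to_minority parameter, and the toy 95:5 label split are all assumptions made for illustration. It keeps every minority-class sample and draws a random subset of the majority class.

```python
import random
from collections import Counter


def random_undersample(labels, majority_to_minority=1.0, seed=0):
    """Return sorted indices that keep every minority sample and a random
    subset of majority samples (hypothetical helper, binary labels only)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common orders classes from most to least frequent.
    (majority, _), (minority, n_min) = counts.most_common(2)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    # Never request more majority samples than exist.
    n_keep = min(len(maj_idx), int(round(n_min * majority_to_minority)))
    rng = random.Random(seed)  # seeded for reproducible draws
    kept_majority = rng.sample(maj_idx, n_keep)
    return sorted(min_idx + kept_majority)


# Toy data (assumed, not from the dissertation): 1 = melanoma, 0 = no melanoma.
labels = [0] * 95 + [1] * 5
kept = random_undersample(labels, majority_to_minority=3.0, seed=42)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```

The returned indices can be used to subset a feature matrix before training a classifier. Comparing model performance across several values of majority_to_minority is one way to check how many majority samples can be dropped.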
author2 Richter, Aaron N. (author)
title PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES
title_sort predicting melanoma risk from electronic health records with machine learning techniques
publisher Florida Atlantic University
url http://purl.flvc.org/fau/fd/FA00013342
_version_ 1719269939797295104