Establishing a medical record similarity query system based on the Spark distributed architecture

Bibliographic Details
Main Authors: Huang, Hsin-Chieh, 黃信傑
Other Authors: WU, FAN
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/8rmy9h
Description
Summary: Master's thesis === National Chung Cheng University === Executive Master's Program in Information Management === Academic year 107 === This research constructed a platform to predict whether a patient suffers from heart failure by evaluating the similarities among patients' electronic medical records, which are the most complete and reliable source for depicting a patient's health status. The research adopted a supervised machine-learning approach: patients were labeled according to their disease diagnoses, their similarities were assessed, and the risk of suffering the disease was then predicted. The data used in this study come from patients' medical histories at a regional teaching hospital, including diagnoses, medications, laboratory results, hospitalization information, and vital signs for each patient. The patient datasets labeled with heart failure were used for training and modeling. Two experiments were performed to analyze and compare the results: the first used Python and Spark as data processing tools in the data processing stage in order to compare the processing efficiency of the two approaches; the second compared the analysis results of the models, evaluating the effectiveness of scikit-learn and Spark on three machine-learning models: logistic regression, support vector machine, and decision tree. The results indicate that for small amounts of data, the processing time in the Python environment differs little from that in the Spark environment; but as the amount of data grows, the time Python spends processing it increases geometrically, and at that scale the Spark environment is the better choice. In terms of model analysis, scikit-learn outperforms the Spark classification algorithms, since the former's prediction indices are better than the latter's.
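
To make the data-processing comparison concrete, below is a minimal sketch of the same aggregation written once with pandas (single-machine Python) and once with PySpark (distributed). The file name and column names (lab_results.csv, patient_id, lab_value) are illustrative assumptions and do not come from the thesis.

```python
# Hypothetical sketch: the same per-patient aggregation in pandas and in PySpark.
# File and column names are assumptions, not taken from the thesis data.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# -- pandas: loads the whole CSV into memory on a single machine --
pdf = pd.read_csv("lab_results.csv")
pandas_means = pdf.groupby("patient_id")["lab_value"].mean()

# -- PySpark: the same aggregation, executed across the cluster --
spark = SparkSession.builder.appName("emr-preprocessing").getOrCreate()
sdf = spark.read.csv("lab_results.csv", header=True, inferSchema=True)
spark_means = sdf.groupBy("patient_id").agg(F.mean("lab_value").alias("lab_value_mean"))
spark_means.show(5)
```

The two versions express the same logic; the difference the thesis measures is where the work runs, which is why pandas stays competitive on small files but falls behind as the data grows.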
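For the model comparison, a similarly hedged sketch is given below: training and evaluating a logistic regression on a heart-failure label with scikit-learn and with Spark MLlib. The feature columns, label column, file name, and train/test split are assumptions for illustration only; the thesis also compares support vector machines and decision trees, which follow the same pattern.

```python
# Hypothetical sketch of the scikit-learn vs. Spark MLlib comparison.
# Feature/label names, file name, and the 70/30 split are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression as SparkLogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

feature_cols = ["age", "systolic_bp", "creatinine"]   # assumed feature columns
pdf = pd.read_csv("patients_labeled.csv")             # assumed file with a heart_failure label

# -- scikit-learn: single-machine training and AUC evaluation --
X_train, X_test, y_train, y_test = train_test_split(
    pdf[feature_cols], pdf["heart_failure"], test_size=0.3, random_state=42)
sk_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("scikit-learn AUC:", roc_auc_score(y_test, sk_model.predict_proba(X_test)[:, 1]))

# -- Spark MLlib: distributed training on the same features --
spark = SparkSession.builder.appName("emr-models").getOrCreate()
sdf = spark.createDataFrame(pdf)
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train, test = assembler.transform(sdf).randomSplit([0.7, 0.3], seed=42)
spark_model = SparkLogisticRegression(featuresCol="features", labelCol="heart_failure").fit(train)
evaluator = BinaryClassificationEvaluator(labelCol="heart_failure", metricName="areaUnderROC")
print("Spark MLlib AUC:", evaluator.evaluate(spark_model.transform(test)))
```

Comparing the two printed AUC values (or other prediction indices) per algorithm is the kind of head-to-head evaluation the thesis reports, in which the scikit-learn models scored better.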