Big Data Analytics on Liver-Cancer Literature using Spark

Bibliographic Details
Main Authors: LIN, YU-JU, 林玗儒
Other Authors: LIN, MING-YEN
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/g4j775
Description
Summary: Master's === Feng Chia University === Master's Program of Biomedical Informatics and Biomedical Engineering === 106 === Cancer is one of the main causes of death, and more than one hundred types of cancer are known to date. Liver cancer has long ranked among the leading causes of cancer death in Taiwan; in 2016 it ranked second, after cancer of the trachea and lung. The number of scientific articles related to cancer grows every year. In PubMed the number reaches as high as 20 million, so discovering useful information in such a massive collection is very difficult, and sifting through the articles on a single machine is very time-consuming. We therefore present a big data analytics framework that uses the distributed Apache Spark platform for text mining of the PubMed literature. After an efficient analysis of the large volume of liver cancer articles, a word cloud is constructed to highlight the important terms in them. We also build a prediction model so that researchers can check whether an article is related to liver cancer. In our experiments, the terms cell, patient, liver, cancer, and tumor are the most prominent in the word cloud, which agrees with general knowledge. Several classification models in Spark MLlib are used in the experiments: linear Support Vector Machines (SVM), Logistic Regression, Naïve Bayes, Decision Tree, and Random Forest. Relevance to liver cancer is further confirmed using MeSH (Medical Subject Headings) terms. In the experiments with hold-out validation, Logistic Regression is about 3 times faster than SVM, and the accuracy of both methods is close to 95%. When max_features is 500 and min_df ≤ 0.1 (or equal to 1), the accuracy may reach 96%. In the experiments with K-fold cross-validation, Decision Tree is 20 times faster than SVM, while the accuracy of both methods is 96%. The experimental results show that our prediction model can effectively classify liver cancer articles.
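
The summary names two vocabulary settings, max_features and min_df, without saying how they act. The names match parameters of scikit-learn's text vectorizers; whether the thesis used that library is an assumption, and the following is only a minimal pure-Python sketch of that convention, not the thesis code. In it, an integer min_df is a minimum document count, a fractional min_df is a minimum share of documents, and max_features keeps the most frequent remaining terms. The function names and the four toy documents are hypothetical. The resulting term counts are also the kind of weights a word cloud could be drawn from.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs, max_features=None, min_df=1):
    """Return {term: corpus frequency} after min_df and max_features pruning.

    min_df: an int is a minimum number of documents; a float is a minimum
    proportion of documents (assumed convention, as in scikit-learn).
    max_features: keep only the top terms ranked by corpus frequency.
    """
    tokenized = [tokenize(d) for d in docs]
    # Total occurrences of each term across the corpus.
    term_freq = Counter(t for doc in tokenized for t in doc)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    threshold = min_df if isinstance(min_df, int) else min_df * len(docs)
    kept = [t for t in term_freq if doc_freq[t] >= threshold]
    # Most frequent first; ties broken alphabetically for determinism.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return {t: term_freq[t] for t in kept}


# Hypothetical toy abstracts, invented for illustration only.
docs = [
    "Liver cancer cell growth in patient tumor samples",
    "Tumor cell response in liver cancer patient cohort",
    "Lung cancer screening of patient groups",
    "Protein folding dynamics",
]

# Keep terms found in at least half of the documents, then the top 3.
vocab = build_vocabulary(docs, max_features=3, min_df=0.5)
# With min_df=1 every term that occurs at all is kept.
full_vocab = build_vocabulary(docs, min_df=1)
print(vocab)
```

A lower min_df admits rarer terms and a larger max_features widens the feature space, which is presumably why the summary reports accuracy for specific values of both.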