Sentiment Analysis for Patient-Author Text: Using Word2Vec and Symptoms

Bibliographic Details
Main Authors: Zong-Yao Wu, 吳宗耀
Other Authors: Shih-Wen Ke
Format: Others
Language: zh-TW
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/q78nt2
id ndltd-TW-105CYCU5392031
record_format oai_dc
spelling ndltd-TW-105CYCU5392031 2019-05-15T23:39:16Z http://ndltd.ncl.edu.tw/handle/q78nt2 Sentiment Analysis for Patient-Author Text: Using Word2Vec and Symptoms 文字情感分析:利用病徵分析病患自撰之日誌 Zong-Yao Wu 吳宗耀 Master's Chung Yuan Christian University Graduate Institute of Information and Computer Engineering 105 Shih-Wen Ke 柯士文 2017 thesis 82 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Chung Yuan Christian University === Graduate Institute of Information and Computer Engineering === 105 === Sentiment analysis (SA) has recently been gaining popularity. Most previous work studied product reviews, using machine learning techniques to predict sentiment polarity, and focused on how to build patterns such as statistical language models or how to extract semantic features from texts. In this paper, we apply SA techniques to patient-authored text (PAT) from online medical communities. Our datasets consist of PATs from a well-known medical website, patientslikeme.com (PLM), where patients can share mood phrases, severity of symptoms, treatments, and quality of life. A PAT is more like a diary or journal in which patients reflect on themselves. The PLM datasets also have a distinctive feature: discussion of symptoms and diseases. We therefore examine the relationship between sentiment polarity and symptoms. Many studies have used bag-of-words to represent document features, but some studies have shown that bag-of-words loses part of a word's meaning. In our study, we explore the possibility of using “word vectors” to represent documents. Word2Vec is a tool whose central idea is to train word vectors that not only help find similar words but also capture multiple levels of meaning. In the first experiment, we used Word2Vec to generate word vectors and used five different methods to generate sentence vectors, including the most commonly used average method, the non-normalization method, the stop word method, and the sentiment method from the SA domain. We then used two classifiers, support vector machine (SVM) and k-nearest neighbors (k-NN) with cosine similarity, to classify the sentiment polarity of the PATs.
Some previous studies claimed that the corpus used to train the Word2Vec model is very important, so we also examined the effect of corpus composition on the classification results. We prepared two corpora for the second experiment to examine whether corpus quality or corpus volume is more helpful for classification. From past studies, we have observed that “PATs with reference to symptoms” have a large effect on classification. Our observation shows that negative polarity and reference to symptoms are highly correlated. We therefore built another training model and evaluated the results based on this observation. The results show that the non-normalization method is best at identifying positive polarity, while the sentiment method is best at identifying negative polarity. We also found that the normalization method produced worse classification results than the non-normalization method. In the second experiment, we used two different types of classifiers, SVM and k-NN. All results showed that the Word2Vec model trained on the medical corpora yielded better classification performance than the model trained on the Wikipedia corpus. This outcome indicates that the quality of the training corpus was more important than its volume when training Word2Vec models. In the future, we wish to further explore the use of explicit and implicit references to symptoms in the PATs.
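The sentence-vector and k-NN steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the three-dimensional toy vectors stand in for a trained Word2Vec model, the training examples and the names `sentence_vector`, `knn_predict`, and `classify` are hypothetical, and because the abstract does not define "normalization", the sketch assumes it means scaling each word vector to unit length before averaging.

```python
import numpy as np

# Hypothetical toy vectors; a real setup would load them from a trained Word2Vec model.
TOY_VECTORS = {
    "pain": np.array([-1.0, 0.2, 0.0]),
    "tired": np.array([-0.8, 0.1, 0.3]),
    "better": np.array([0.9, 0.3, 0.1]),
    "happy": np.array([1.0, 0.1, -0.2]),
    "the": np.array([0.0, 1.0, 0.0]),
    "today": np.array([0.1, 0.5, 0.5]),
}
STOP_WORDS = frozenset({"the"})

# Hypothetical labeled posts (already tokenized).
TRAIN = [
    (["pain", "today"], "negative"),
    (["tired", "the", "pain"], "negative"),
    (["happy", "today"], "positive"),
    (["better", "the"], "positive"),
]


def sentence_vector(tokens, vectors, stop_words=frozenset(), normalize_words=False):
    """Average the vectors of known, non-stop-word tokens.

    normalize_words=True scales each word vector to unit length first
    (an assumed reading of the "normalization method").
    """
    kept = [vectors[t] for t in tokens if t in vectors and t not in stop_words]
    if not kept:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    if normalize_words:
        kept = [v / np.linalg.norm(v) for v in kept if np.linalg.norm(v) > 0]
    return np.mean(kept, axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most cosine-similar to query."""
    sims = [cosine_similarity(query, v) for v in train_vecs]
    top = np.argsort(sims)[::-1][:k]
    votes = {}
    for i in top:
        votes[train_labels[i]] = votes.get(train_labels[i], 0) + 1
    return max(votes, key=votes.get)


def classify(tokens, k=3, normalize_words=False):
    """Classify one tokenized post against the toy training set."""
    def embed(toks):
        return sentence_vector(toks, TOY_VECTORS, STOP_WORDS, normalize_words)
    train_vecs = [embed(toks) for toks, _ in TRAIN]
    train_labels = [label for _, label in TRAIN]
    return knn_predict(embed(tokens), train_vecs, train_labels, k)
```

With these toy values, `classify(["the", "pain", "tired"])` votes "negative" and `classify(["happy", "better"])` votes "positive". Swapping the k-NN step for an SVM, or the toy vectors for models trained on different corpora, would follow the same structure.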
author2 Shih-Wen Ke
author_facet Shih-Wen Ke
Zong-Yao Wu
吳宗耀
author Zong-Yao Wu
吳宗耀
author_sort Zong-Yao Wu
title Sentiment Analysis for Patient-Author Text: Using Word2Vec and Symptoms
publishDate 2017
url http://ndltd.ncl.edu.tw/handle/q78nt2