Summary: | Sentiment analysis aims to determine users' opinions of a service or product. This thesis classifies the sentiment of comments that users have posted on the Douban movie website. I try two classification strategies: the first assigns each comment to one of five rating classes, from 1 to 5; the second assigns each comment to one of three classes: negative, neutral and positive. For the latter, ratings of 1 and 2 are grouped as negative, a rating of 3 is neutral, and ratings of 4 and 5 are grouped as positive. First, Term Frequency-Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for the machine learning algorithms, and Chi-Square and Mutual Information are used for feature selection. The selected features are fed into four machine learning methods: Logistic Regression, Linear SVC, SGD classifier and Multinomial Naive Bayes. The performance of the models with feature selection is compared with that of the models without feature selection, for both 5-class and 3-class classification. Second, fastText and Skip-Gram are used as embedding methods for the deep learning algorithms LSTM and BiLSTM; fastText is also used as a classifier in its own right. The aim is to compare machine learning and deep learning algorithms across vectorization methods to see which model performs best on 5-class and on 3-class classification. Finally, the two classification strategies are compared through error analysis, to identify the similarities and differences in the misclassifications each strategy makes.
|
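The grouping of the five rating classes into three sentiment classes described in the summary can be sketched as follows. This is a minimal illustration only, not the thesis's own code: the function name `to_three_class` and the string labels are assumptions chosen for readability.

```python
def to_three_class(rating: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    Hypothetical helper: ratings 1 and 2 become "negative",
    3 becomes "neutral", and 4 and 5 become "positive",
    following the grouping stated in the summary.
    """
    if rating in (1, 2):
        return "negative"
    if rating == 3:
        return "neutral"
    if rating in (4, 5):
        return "positive"
    # Anything outside 1-5 is not a valid rating in this scheme.
    raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")


# Example: derive the 3-class labels from a list of 5-class ratings.
ratings = [1, 2, 3, 4, 5]
labels = [to_three_class(r) for r in ratings]
```

Deriving the 3-class labels from the same ratings keeps the two strategies aligned comment by comment, which is what makes it possible to compare their misclassifications on identical comments.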