Homophobic and Hate Speech Detection Using Multilingual-BERT Model on Turkish Social Media

Homophobic expressions are a form of insulting the sexual orientation or personality of people. Severe psychological traumas may occur in people who are exposed to this type of communication. It is important to develop automatic classification systems based on language models to examine social media...

Bibliographic Details
Main Authors: Aci, Ç.İ (Author), Akdagli, A. (Author), Karayiğit, H. (Author)
Format: Article
Language: English
Published: Kauno Technologijos Universitetas 2022
Online Access: View Fulltext in Publisher
LEADER 02990nam a2200241Ia 4500
001 10.5755-j01.itc.51.2.29988
008 220718s2022 CNT 000 0 und d
020 |a 1392124X (ISSN) 
245 1 0 |a Homophobic and Hate Speech Detection Using Multilingual-BERT Model on Turkish Social Media 
260 0 |b Kauno Technologijos Universitetas  |c 2022 
856 |z View Fulltext in Publisher  |u https://doi.org/10.5755/j01.itc.51.2.29988 
520 3 |a Homophobic expressions are a form of insulting the sexual orientation or personality of people. Severe psychological traumas may occur in people who are exposed to this type of communication. It is important to develop automatic classification systems based on language models to examine social media content and distinguish homophobic discourse. This study aims to present a pre-trained Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) model that can successfully detect whether Turkish comments on social media contain homophobic or related hate comments (i.e., sexist, severe humiliation, and defecation expressions). Comments in the Homophobic-Abusive Turkish Comments (HATC) dataset were collected from Instagram to train the detection models. The HATC dataset was manually labeled at the sentence level and combined with the Abusive Turkish Comments (ATC) dataset that was developed in our previous study. The HATC dataset was balanced using the resampling method, and two forms of the dataset (i.e., resHATC and original HATC) were used in the experiments. Afterward, the M-BERT model was compared with DL-based models (i.e., Long Short-Term Memory, Bidirectional Long Short-Term Memory (BiLSTM), Gated Recurrent Unit), Traditional Machine Learning (TML) classifiers (i.e., Support Vector Machine, Naive Bayes, Random Forest), and Ensemble Classifiers (i.e., Adaptive Boosting, eXtreme Gradient Boosting, Gradient Boosting) for the best model selection. The performance of the detection models was evaluated using the F1-score, precision, and recall metrics. Results showed that the best performance (homophobic F1-score: 82.64%, hateful F1-score: 91.75%, neutral F1-score: 96.08%, average F1-score: 90.15%) was achieved with the M-BERT model on the HATC dataset. The M-BERT detection model can increase the effectiveness of filters in detecting Turkish homophobic and related hate speech in social networks. 
It can also be used to detect homophobic and related hate speech in other languages, because the M-BERT model is pre-trained on multilingual data. © 2022, Kauno Technologijos Universitetas. All rights reserved. 
650 0 4 |a deep learning 
650 0 4 |a Homophobic speech detection 
650 0 4 |a multilingual BERT 
650 0 4 |a sentiment analysis 
650 0 4 |a text classification 
650 0 4 |a transfer learning 
650 0 4 |a Turkish social media 
700 1 |a Aci, Ç.İ.  |e author 
700 1 |a Akdagli, A.  |e author 
700 1 |a Karayiğit, H.  |e author 
773 |t Information Technology and Control
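
The abstract in field 520 reports per-class F1-scores and an average F1-score. The sketch below shows how such figures are commonly computed from precision and recall. It is an illustration, not the authors' evaluation code. It assumes the average is an unweighted (macro) mean, which the record does not state; the function names and the toy labels are hypothetical.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str], labels: List[str]) -> Dict[str, float]:
    """Compute the F1-score of each class from true and predicted labels."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over classes (assumed averaging scheme)."""
    return sum(scores.values()) / len(scores)


# Toy data, invented for illustration only.
labels = ["homophobic", "hateful", "neutral"]
y_true = ["neutral", "hateful", "homophobic", "neutral"]
y_pred = ["neutral", "hateful", "neutral", "neutral"]
toy_scores = per_class_f1(y_true, y_pred, labels)

# Per-class F1-scores (in percent) as reported in the abstract.
reported = {"homophobic": 82.64, "hateful": 91.75, "neutral": 96.08}
reported_macro = macro_average(reported)
```

Under the macro-averaging assumption, the mean of the three reported class scores is about 90.16, close to the 90.15% given in the abstract; the small gap is presumably due to rounding or truncation in the source.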