A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter

Detecting cybersecurity intelligence (CSI) on social media such as Twitter is crucial because it allows security experts to respond to cyber threats in advance. In this paper, we devise a new text classification model based on deep learning to classify CSI-positive and -negative tweets from a collectio...

Bibliographic Details
Main Authors:	Han-Sub Shin, Hyuk-Yoon Kwon, Seung-Jin Ryu
Format:	Article
Language:	English
Published:	MDPI AG 2020-09-01
Series:	Electronics
Subjects:	cybersecurity intelligence; word embedding; deep learning; background knowledge; Twitter
Online Access:	https://www.mdpi.com/2079-9292/9/9/1527

id	doaj-9f24c932a15c426a8c88d56785d63036
record_format	Article
spelling	doaj-9f24c932a15c426a8c88d56785d63036 | 2020-11-25T03:23:11Z | eng | MDPI AG | Electronics | 2079-9292 | 2020-09-01 | 9 | 1527 | 1527 | 10.3390/electronics9091527 | A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter | Han-Sub Shin [0]; Hyuk-Yoon Kwon [1]; Seung-Jin Ryu [2] | [0] Department of Industrial Engineering, Seoul National University of Science and Technology, 232 Gongneung-Ro, Nowon-Gu, Seoul 01811, Korea; [1] Department of Industrial Engineering, The Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, 232 Gongneung-Ro, Nowon-Gu, Seoul 01811, Korea; [2] The Affiliated Institute of ETRI (Electronics and Telecommunications Research Institute), 1559 Yuseong-daero, Yuseong-gu, Daejeon 34044, Korea | https://www.mdpi.com/2079-9292/9/9/1527 | cybersecurity intelligence; word embedding; deep learning; background knowledge; Twitter
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Han-Sub Shin, Hyuk-Yoon Kwon, Seung-Jin Ryu
title	A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter
publisher	MDPI AG
series	Electronics
issn	2079-9292
publishDate	2020-09-01
description	Detecting cybersecurity intelligence (CSI) on social media such as Twitter is crucial because it allows security experts to respond to cyber threats in advance. In this paper, we devise a new text classification model based on deep learning to classify CSI-positive and -negative tweets from a collection of tweets. For this, we propose a novel word embedding model, called contrastive word embedding, that makes it possible to maximize the difference between base embedding models. First, we define CSI-positive and -negative corpora, which are used for constructing the embedding models. Here, to compensate for the imbalance of the tweet data sets, we additionally employ background knowledge for each tweet corpus: (1) the CVE data set for the CSI-positive corpus and (2) the Wikitext data set for the CSI-negative corpus. Second, we adopt deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and integrate the feature vectors into one classifier. To validate the effectiveness of the proposed model, we compare our method with two baseline classification models: (1) a model based on a single embedding model constructed with the CSI-positive corpus only and (2) another model constructed with the CSI-negative corpus only. As a result, we show that the proposed model achieves high accuracy, i.e., an F1-score of 0.934 and an area under the curve (AUC) of 0.935, which improves on the baseline models by 1.76 to 6.74% in F1-score and by 1.64 to 6.98% in AUC.
topic	cybersecurity intelligence; word embedding; deep learning; background knowledge; Twitter
url	https://www.mdpi.com/2079-9292/9/9/1527
