Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter

Discovering semantically coherent topics from the large amount of user-generated content (UGC) in social media would facilitate many downstream applications of intelligent computing. Topic models, among the most powerful algorithms for this purpose, have been widely used to discover the latent semantic patterns in t...

Bibliographic Details
Main Authors:	Yuan He, Cheng Wang, Changjun Jiang
Format:	Article
Language:	English
Published:	IEEE 2017-01-01
Series:	IEEE Access
Subjects:	Topic model; social network; short texts
Online Access:	https://ieeexplore.ieee.org/document/7941989/

id	doaj-c8dd3321c4a349b5bf8888315a19b6eb
record_format	Article
spelling	doaj-c8dd3321c4a349b5bf8888315a19b6eb | 2021-03-29T20:07:45Z | eng | IEEE | IEEE Access | 2169-3536 | 2017-01-01 | 5 | 10515 | 10525 | 10.1109/ACCESS.2017.2696558 | 7941989 | Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter | Yuan He (https://orcid.org/0000-0001-8462-3907), Cheng Wang, Changjun Jiang | Department of Computer Science and Engineering, Tongji University, Shanghai, China | https://ieeexplore.ieee.org/document/7941989/ | Topic model; social network; short texts
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Yuan He Cheng Wang Changjun Jiang
title	Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2017-01-01
description	Discovering semantically coherent topics from the large amount of user-generated content (UGC) in social media would facilitate many downstream applications of intelligent computing. Topic models, among the most powerful algorithms for this purpose, have been widely used to discover the latent semantic patterns in text collections. However, one key weakness of topic models is that they need documents of a certain length to provide reliable statistics for generating coherent topics. In Twitter, users' tweets are mostly short and noisy, so the observed word co-occurrences are hard for topic models to exploit. To deal with this problem, previous work tried to incorporate prior knowledge to obtain better results. However, this strategy is not practical for the fast-evolving UGC in Twitter. In this paper, we first cluster the users according to the retweet network, and the users' interests are mined as the prior knowledge. This knowledge is then applied to improve the performance of topic learning. A potential reason for the effectiveness of this approach is that users in the same community usually share similar interests, which results in less noisy sub-data sets. Our algorithm pre-learns two types of interest knowledge from the data set: the interest-word-sets and a tweet-interest preference matrix. Furthermore, a dedicated background model is introduced to judge whether a word is drawn from the background noise. Experiments on two real-life Twitter data sets show that our model achieves significant improvements over state-of-the-art baselines.
topic	Topic model; social network; short texts
url	https://ieeexplore.ieee.org/document/7941989/

Similar Items