Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter

Discovering semantically coherent topics from the large amount of user-generated content (UGC) in social media would facilitate many downstream applications of intelligent computing. Topic models, as one of the most powerful algorithms, have been widely used to discover the latent semantic patterns in t...

Bibliographic Details
Main Authors: Yuan He, Cheng Wang, Changjun Jiang
Format: Article
Language: English
Published: IEEE 2017-01-01
Series: IEEE Access
Subjects: Topic model; social network; short texts
Online Access: https://ieeexplore.ieee.org/document/7941989/
id doaj-c8dd3321c4a349b5bf8888315a19b6eb
record_format Article
spelling doaj-c8dd3321c4a349b5bf8888315a19b6eb; 2021-03-29T20:07:45Z; eng; IEEE; IEEE Access; 2169-3536; 2017-01-01; 5; 10515; 10525; 10.1109/ACCESS.2017.2696558; 7941989; Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter; Yuan He (0; https://orcid.org/0000-0001-8462-3907); Cheng Wang (1); Changjun Jiang (2); Department of Computer Science and Engineering, Tongji University, Shanghai, China (same affiliation listed for each of the three authors); https://ieeexplore.ieee.org/document/7941989/; Topic model; social network; short texts
collection DOAJ
language English
format Article
sources DOAJ
author Yuan He
Cheng Wang
Changjun Jiang
title Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2017-01-01
description Discovering semantically coherent topics from the large amount of user-generated content (UGC) in social media would facilitate many downstream applications of intelligent computing. Topic models, as one of the most powerful algorithms, have been widely used to discover the latent semantic patterns in text collections. However, one key weakness of topic models is that they need documents of a certain length to provide reliable statistics for generating coherent topics. In Twitter, the users' tweets are mostly short and noisy. Observations of word co-occurrences are incomprehensible for topic models. To deal with this problem, previous work tried to incorporate prior knowledge to obtain better results. However, this strategy is not practical for the fast-evolving UGC in Twitter. In this paper, we first cluster the users according to the retweet network, and the users' interests are mined as the prior knowledge. Such data are then applied to improve the performance of topic learning. The potential cause for the effectiveness of this approach is that users in the same community usually share similar interests, which will result in less noisy sub-data sets. Our algorithm pre-learns two types of interest knowledge from the data set: the interest-word-sets and a tweet-interest preference matrix. Furthermore, a dedicated background model is introduced to judge whether a word is drawn from the background noise. Experiments on two real-life Twitter data sets show that our model achieves significant improvements over state-of-the-art baselines.
topic Topic model
social network
short texts
url https://ieeexplore.ieee.org/document/7941989/