Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter
Discovering semantically coherent topics from the large amount of user-generated content (UGC) in social media would facilitate many downstream applications of intelligent computing. Topic models, among the most powerful algorithms, have been widely used to discover the latent semantic patterns in t...
Main Authors: | Yuan He, Cheng Wang, Changjun Jiang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2017-01-01 |
Series: | IEEE Access |
Subjects: | Topic model, social network, short texts |
Online Access: | https://ieeexplore.ieee.org/document/7941989/ |
id |
doaj-c8dd3321c4a349b5bf8888315a19b6eb |
---|---|
record_format |
Article |
spelling |
doaj-c8dd3321c4a349b5bf8888315a19b6eb; 2021-03-29T20:07:45Z; eng; IEEE; IEEE Access; 2169-3536; 2017-01-01; 5; 10515; 10525; 10.1109/ACCESS.2017.2696558; 7941989; Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter; Yuan He (https://orcid.org/0000-0001-8462-3907), Cheng Wang, Changjun Jiang; Department of Computer Science and Engineering, Tongji University, Shanghai, China; https://ieeexplore.ieee.org/document/7941989/; Topic model, social network, short texts |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Yuan He, Cheng Wang, Changjun Jiang |
author_sort |
Yuan He |
title |
Mining Coherent Topics With Pre-Learned Interest Knowledge in Twitter |
title_sort |
mining coherent topics with pre-learned interest knowledge in twitter |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2017-01-01 |
description |
Discovering semantically coherent topics from the large amount of user-generated content (UGC) in social media would facilitate many downstream applications of intelligent computing. Topic models, among the most powerful algorithms, have been widely used to discover latent semantic patterns in text collections. However, one key weakness of topic models is that they need documents of a certain length to provide reliable statistics for generating coherent topics. In Twitter, users' tweets are mostly short and noisy, so the observed word co-occurrences are hard for topic models to interpret. To deal with this problem, previous work tried to incorporate prior knowledge to obtain better results. However, this strategy is not practical for the fast-evolving UGC in Twitter. In this paper, we first cluster users according to the retweet network and mine the users' interests as prior knowledge. This knowledge is then applied to improve the performance of topic learning. A likely reason for the effectiveness of this approach is that users in the same community usually share similar interests, which results in less noisy sub-data sets. Our algorithm pre-learns two types of interest knowledge from the data set: interest-word-sets and a tweet-interest preference matrix. Furthermore, a dedicated background model is introduced to judge whether a word is drawn from background noise. Experiments on two real-life Twitter data sets show that our model achieves significant improvements over state-of-the-art baselines. |
topic |
Topic model, social network, short texts |
url |
https://ieeexplore.ieee.org/document/7941989/ |