Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification
Misinformation posted on social media during COVID-19 is a prime example of infodemic data. The phenomenon was prominent in China at the beginning of the COVID-19 outbreak. While large amounts of data can be collected from various social media platforms, publicly available infodemic detection data remains r...
Main Authors: | Jia Luo, Rui Xue, Jinglu Hu, Didier El Baz |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-08-01 |
Series: | Healthcare |
Subjects: | COVID-19; infodemic data; misinformation identification; deep learning |
Online Access: | https://www.mdpi.com/2227-9032/9/9/1094 |
id |
doaj-ad6f67c794f344f38f1bc0a5c4e199a1 |
---|---|
record_format |
Article |
spelling |
doaj-ad6f67c794f344f38f1bc0a5c4e199a1; 2021-09-26T00:14:23Z; eng; MDPI AG; Healthcare; 2227-9032; 2021-08-01; 9; 1094; 1094; 10.3390/healthcare9091094; Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification; Jia Luo (0), Rui Xue (1), Jinglu Hu (2), Didier El Baz (3); (0) College of Economics and Management, Beijing University of Technology, Beijing 100124, China; (1) College of Economics and Management, Beijing University of Technology, Beijing 100124, China; (2) Graduate School of Information, Production and Systems, Waseda University, Kitakyushu 808-0135, Japan; (3) LAAS-CNRS, Université de Toulouse, CNRS, 31031 Toulouse, France; Misinformation posted on social media during COVID-19 is a prime example of infodemic data. The phenomenon was prominent in China at the beginning of the COVID-19 outbreak. While large amounts of data can be collected from various social media platforms, publicly available infodemic detection data remains rare and is not easy to construct manually. Therefore, instead of developing techniques for infodemic detection, this paper constructs a Chinese infodemic dataset, “infodemic 2019”, by collecting widely spread Chinese infodemic records during the COVID-19 outbreak. Each record is labeled as true, false or questionable. After four rounds of adjustment, the original imbalanced dataset is converted into a balanced dataset by exploring the properties of the collected records. The final labels achieve high intercoder reliability with healthcare workers’ annotations, and the high-frequency words show a strong relationship between the proposed dataset and pandemic diseases. Finally, numerical experiments are carried out with RNN, CNN and fastText. All three achieve reasonable performance and provide baselines for future work.; https://www.mdpi.com/2227-9032/9/9/1094; COVID-19; infodemic data; misinformation identification; deep learning |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jia Luo; Rui Xue; Jinglu Hu; Didier El Baz |
title |
Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification |
publisher |
MDPI AG |
series |
Healthcare |
issn |
2227-9032 |
publishDate |
2021-08-01 |
description |
Misinformation posted on social media during COVID-19 is a prime example of infodemic data. The phenomenon was prominent in China at the beginning of the COVID-19 outbreak. While large amounts of data can be collected from various social media platforms, publicly available infodemic detection data remains rare and is not easy to construct manually. Therefore, instead of developing techniques for infodemic detection, this paper constructs a Chinese infodemic dataset, “infodemic 2019”, by collecting widely spread Chinese infodemic records during the COVID-19 outbreak. Each record is labeled as true, false or questionable. After four rounds of adjustment, the original imbalanced dataset is converted into a balanced dataset by exploring the properties of the collected records. The final labels achieve high intercoder reliability with healthcare workers’ annotations, and the high-frequency words show a strong relationship between the proposed dataset and pandemic diseases. Finally, numerical experiments are carried out with RNN, CNN and fastText. All three achieve reasonable performance and provide baselines for future work. |
topic |
COVID-19; infodemic data; misinformation identification; deep learning |
url |
https://www.mdpi.com/2227-9032/9/9/1094 |