Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification

Misinformation posted on social media during COVID-19 is a prime example of infodemic data. The phenomenon was prominent in China at the beginning of the COVID-19 outbreak. Although large amounts of data can be collected from social media platforms, publicly available infodemic detection data remains r...

Bibliographic Details
Main Authors: Jia Luo, Rui Xue, Jinglu Hu, Didier El Baz
Format: Article
Language: English
Published: MDPI AG 2021-08-01
Series: Healthcare
Subjects: COVID-19; infodemic data; misinformation identification; deep learning
Online Access: https://www.mdpi.com/2227-9032/9/9/1094
id doaj-ad6f67c794f344f38f1bc0a5c4e199a1
record_format Article
spelling doaj-ad6f67c794f344f38f1bc0a5c4e199a1 2021-09-26T00:14:23Z eng
MDPI AG, Healthcare, ISSN 2227-9032, 2021-08-01, Volume 9, Article 1094
DOI: 10.3390/healthcare9091094
Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification
Jia Luo (0), Rui Xue (1), Jinglu Hu (2), Didier El Baz (3)
(0) College of Economics and Management, Beijing University of Technology, Beijing 100124, China
(1) College of Economics and Management, Beijing University of Technology, Beijing 100124, China
(2) Graduate School of Information, Production and Systems, Waseda University, Kitakyushu 808-0135, Japan
(3) LAAS-CNRS, Université de Toulouse, CNRS, 31031 Toulouse, France
https://www.mdpi.com/2227-9032/9/9/1094
COVID-19; infodemic data; misinformation identification; deep learning
collection DOAJ
language English
format Article
sources DOAJ
author Jia Luo
Rui Xue
Jinglu Hu
Didier El Baz
title Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification
publisher MDPI AG
series Healthcare
issn 2227-9032
publishDate 2021-08-01
description Misinformation posted on social media during COVID-19 is a prime example of infodemic data. The phenomenon was prominent in China at the beginning of the COVID-19 outbreak. Although large amounts of data can be collected from social media platforms, publicly available infodemic detection data remain rare and are not easy to construct manually. Therefore, instead of developing techniques for infodemic detection, this paper constructs a Chinese infodemic dataset, “infodemic 2019”, by collecting widely spread Chinese infodemic records from the COVID-19 outbreak. Each record is labeled as true, false or questionable. After four rounds of adjustment based on the properties of the collected records, the originally imbalanced dataset is converted into a balanced one. The final labels achieve high intercoder reliability with healthcare workers’ annotations, and the high-frequency words show a strong relationship between the proposed dataset and pandemic diseases. Finally, numerical experiments are carried out with RNN, CNN and fastText. All three achieve reasonable performance and provide baselines for future work.
topic COVID-19
infodemic data
misinformation identification
deep learning
url https://www.mdpi.com/2227-9032/9/9/1094
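
Note on intercoder reliability: the description above reports high intercoder reliability between the final labels and healthcare workers' annotations, but this record does not say which statistic was used. As a rough illustration only, the sketch below computes Cohen's kappa, one common agreement measure for two coders, over the three labels named in the description. The function name `cohen_kappa` and the toy annotations are hypothetical and are not taken from the dataset or the paper.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(coder_a)
    # Observed agreement: share of records on which both coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: computed from each coder's own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations (assumption: NOT real records from the dataset).
dataset_labels = ["true", "false", "false", "questionable", "true", "false"]
worker_labels = ["true", "false", "true", "questionable", "true", "false"]

kappa = cohen_kappa(dataset_labels, worker_labels)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance. Whether the authors used this or another measure should be checked against the article itself.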