A cross-lingual COVID-19 fake news dataset

Overview

CrossFake

An English-Chinese COVID-19 fake&real news dataset from the ICDMW 2021 paper below:
Cross-lingual COVID-19 Fake News Detection.
Jiangshu Du, Yingtong Dou, Congying Xia, Limeng Cui, Jing Ma, Philip S. Yu.

Introduction

The COVID-19 pandemic poses a significant threat to global public health. Meanwhile, there is massive misinformation associated with the pandemic, which advocates unfounded or unscientific claims. Even major social media and news outlets have made an extra effort in debunking COVID-19 misinformation, most of the fact-checking information is in English, whereas some unmoderated COVID-19 misinformation is still circulating in other languages, threatening the health of less informed people in immigrant communities and developing countries (The Vox, New York Times).

In the above paper, we make the first attempt to detect COVID-19 misinformation in a low-resource language (Chinese) only using the fact-checked news in a high-resource language (English).

This repo contains a Chinese-English real & fake news dataset according to existing English fact-checking information. Details on this dataset are described in Dataset Detail.

The highlights of our dataset are as follows:

  • Bilingual news pieces for the same event (fact).
  • Multiple Chinese news pieces for the same event (fact).
  • Comprehensive metadata for each news (see below).

Dataset Detail

The table below shows the number of annotated news in each language:

Lang. Fake Real Total
ENG 55 82 137
CHN 101 118 219

The metadata of our dataset can be found at CrossFake_metadata.xlsx, which includes two sheets (news_fake and news_real). Given the news id, you can find the corresponding news body text in the body_text directory. The meanings of each column of the metadata are shown below:

  • Column A (id):

    News id. Chinese real & fake news is annotated according to existing English fact-checking information. Thus, each piece of English news may correspond to multiple pieces of Chinese news from different sources. For example, in the news_fake sheet, the ids 1_1 and 1_2 indicate one piece of English news, corresponding to two pieces of Chinese news.

  • Column B (fact_check_url):

    The fact-checking source of the corresponding English news.

  • Column C (type):

    The news type. Post and Article represent the news is from a social media post or an online article, respectively. Note that we also annotated some clickbait news whose title and body text present contradictory information.

  • Column D (source):

    The news source. Personal and Professional represent the news is from a personal account or professional source (WHO, NIH, etc.), respectively.

  • Column E (mixed?):

    Whether the news include mixed content? If a news body text only has the content related to the checked fact, the piece of news is annotated as not mixed. Accordingly, the news whose content includes events/facts besides the checked fact is regarded as mixed news.

  • Column F (platform):

    The platform where the news is published.

  • Column G (news_url):

    The news source URL. Note that some of the links are invalid due to the deletion/removal of the news. We have archived the accessible news (see Column H) during we curate the dataset.

  • Column H (archive):

    The archived news link. To permanently store the original news, we archived the news source URL.

  • Column I (newstitle):

    The news title.

  • Column J (publish_date):

    The news publishing date.

  • Columns K to R have the same meanings as Columns C to J, but they indicate the information of Chinese news.

Case Study

Besides the findings and conclusions presented in our paper. We have extra interesting findings during collecting the data:

  1. Mixed Fact. For some fake news, their corresponding Chinese news articles presented them in the form of a news digest with other news events. It brings an extra hurdle to fact-check those news pieces since only partial content of the news contains misinformation. A typical example is news_id 8_3 in the news_fake sheet. You can check out other news whose mixed? annotated as Yes.

  2. Misused Fact. For news_real id 9_2, we find a Chinese social post leveraging the fact that "coronavirus can live for up to 4 hours on copper" to promote their copper-made pot. In this case, even the title and most of the news content seem legit, but the connection between "the copper kills coronavirus" and "copper pot is good" is still questionable.

  3. Fake News Type. During we annotate the Chinese news based on the fact-checked English news. We find that most of the fact-checked fake news from Politifact have no corresponding Chinese news. Those news pieces usually are local news in the United States.

  4. Cross-lingual Fact-checking. For the news_real id 9_1, we find a Chinese news piece from a professional news outlet published five days earlier than the fact-checked English Facebook post. It suggests that we could leverage fact information from another language to help fact-check the news. Note that most of the Chinese news in our datasets are published later than the source English news since most of the checked news events are originated in English media.

Future Directions

Given the current dataset, some future research directions include:

  • The writing style/sentiment/stance differences between fake news and real news.
  • The writing style/sentiment/stance differences between professional news outlets and personal accounts.
  • The information distortion/loss from English news to Chinese news.
  • The temporal patterns of cross-lingual news migration.
  • The title patterns of different news.

Citation

If you use our code, please cite the paper below:

@inproceedings{du2021cross,
  title={Cross-lingual COVID-19 Fake News Detection},
  author={Du, Jiangshu and Dou, Yingtong and Xia, Congying and Cui, Limeng and Ma, Jing and Yu, Philip S},
  booktitle={Proceedings of the 21st IEEE International Conference on Data Mining Workshops (ICDMW'21)},
  year={2021}
}
Owner
Yingtong Dou
Ph.D. @ UIC. Graph Mining; Fraud Detection; Secure Machine Learning
Yingtong Dou
Deep-learning X-Ray Micro-CT image enhancement, pore-network modelling and continuum modelling

EDSR modelling A Github repository for deep-learning image enhancement, pore-network and continuum modelling from X-Ray Micro-CT images. The repositor

Samuel Jackson 7 Nov 03, 2022
This is the official pytorch implementation of Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation(TESKD)

Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation (TESKD) By Zheng Li[1,4], Xiang Li[2], Lingfeng Yang[2,4], Jian Yang[2], Zh

Zheng Li 9 Sep 26, 2022
Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation

Tiny-NewsRec The source codes for our paper "Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation". Requirements PyTorch == 1.6.0 Tensor

Yang Yu 3 Dec 07, 2022
Image Segmentation with U-Net Algorithm on Carvana Dataset using AWS Sagemaker

Image Segmentation with U-Net Algorithm on Carvana Dataset using AWS Sagemaker This is a full project of image segmentation using the model built with

Htin Aung Lu 1 Jan 04, 2022
Code accompanying "Adaptive Methods for Aggregated Domain Generalization"

Adaptive Methods for Aggregated Domain Generalization (AdaClust) Official Pytorch Implementation of Adaptive Methods for Aggregated Domain Generalizat

Xavier Thomas 15 Sep 20, 2022
Bravia core script for python

Bravia-Core-Script You need to have a mandatory account If this L3 does not work, try another L3. enjoy

5 Dec 26, 2021
Easy to use Audio Tagging in PyTorch

Audio Classification, Tagging & Sound Event Detection in PyTorch Progress: Fine-tune on audio classification Fine-tune on audio tagging Fine-tune on s

sithu3 15 Dec 22, 2022
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introdu

OATML 360 Dec 28, 2022
Implementation of DropLoss for Long-Tail Instance Segmentation in Pytorch

[AAAI 2021]DropLoss for Long-Tail Instance Segmentation [AAAI 2021] DropLoss for Long-Tail Instance Segmentation Ting-I Hsieh*, Esther Robb*, Hwann-Tz

Tim 37 Dec 02, 2022
This repo provides a demo for the CVPR 2021 paper "A Fourier-based Framework for Domain Generalization" on the PACS dataset.

FACT This repo provides a demo for the CVPR 2021 paper "A Fourier-based Framework for Domain Generalization" on the PACS dataset. To cite, please use:

105 Dec 17, 2022
(AAAI2022) Style Mixing and Patchwise Prototypical Matching for One-Shot Unsupervised Domain Adaptive Semantic Segmentation

SM-PPM This is a Pytorch implementation of our paper "Style Mixing and Patchwise Prototypical Matching for One-Shot Unsupervised Domain Adaptive Seman

W-zx-Y 10 Dec 07, 2022
🤗 Paper Style Guide

🤗 Paper Style Guide (Work in progress, send a PR!) Libraries to Know booktabs natbib cleveref Either seaborn, plotly or altair for graphs algorithmic

Hugging Face 66 Dec 12, 2022
A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Segnet is deep fully convolutional neural network architecture for semantic pixel-wise segmentation. This is implementation of http://arxiv.org/pdf/15

Pradyumna Reddy Chinthala 190 Dec 15, 2022
TensorFlow implementation of Barlow Twins (Barlow Twins: Self-Supervised Learning via Redundancy Reduction)

Barlow-Twins-TF This repository implements Barlow Twins (Barlow Twins: Self-Supervised Learning via Redundancy Reduction) in TensorFlow and demonstrat

Sayak Paul 36 Sep 14, 2022
It helps user to learn Pick-up lines and share if he has a better one

Pick-up-Lines-Generator(Open Source) It helps user to learn Pick-up lines Share and Add one or many to the DataBase Unique SQLite DataBase AI Undercon

knock_nott 0 May 04, 2022
RobustART: Benchmarking Robustness on Architecture Design and Training Techniques

The first comprehensive Robustness investigation benchmark on large-scale dataset ImageNet regarding ARchitecture design and Training techniques towards diverse noises.

132 Dec 23, 2022
Semi-Supervised Signed Clustering Graph Neural Network (and Implementation of Some Spectral Methods)

SSSNET SSSNET: Semi-Supervised Signed Network Clustering For details, please read our paper. Environment Setup Overview The project has been tested on

Yixuan He 9 Nov 24, 2022
Godot RL Agents is a fully Open Source packages that allows video game creators

Godot RL Agents The Godot RL Agents is a fully Open Source packages that allows video game creators, AI researchers and hobbiest the opportunity to le

Edward Beeching 326 Dec 30, 2022
Quadruped-command-tracking-controller - Quadruped command tracking controller (flat terrain)

Quadruped command tracking controller (flat terrain) Prepare Install RAISIM link

Yunho Kim 4 Oct 20, 2022
Prediction of MBA refinance Index (Mortgage prepayment)

Prediction of MBA refinance Index (Mortgage prepayment) Deep Neural Network based Model The ability to predict mortgage prepayment is of critical use

Ruchil Barya 1 Jan 16, 2022