A cross-lingual COVID-19 fake news dataset

Overview

CrossFake

An English-Chinese COVID-19 fake and real news dataset from the ICDMW 2021 paper below:
Cross-lingual COVID-19 Fake News Detection.
Jiangshu Du, Yingtong Dou, Congying Xia, Limeng Cui, Jing Ma, Philip S. Yu.

Introduction

The COVID-19 pandemic poses a significant threat to global public health. Meanwhile, there is massive misinformation associated with the pandemic, which advocates unfounded or unscientific claims. Even though major social media platforms and news outlets have made an extra effort to debunk COVID-19 misinformation, most of the fact-checking information is in English, whereas some unmoderated COVID-19 misinformation is still circulating in other languages, threatening the health of less informed people in immigrant communities and developing countries (The Vox, New York Times).

In the above paper, we make the first attempt to detect COVID-19 misinformation in a low-resource language (Chinese) using only the fact-checked news in a high-resource language (English).

This repo contains a Chinese-English real & fake news dataset built from existing English fact-checking information. Details on this dataset are described in Dataset Detail.

The highlights of our dataset are as follows:

  • Bilingual news pieces for the same event (fact).
  • Multiple Chinese news pieces for the same event (fact).
  • Comprehensive metadata for each news piece (see below).

Dataset Detail

The table below shows the number of annotated news pieces in each language:

Lang.   Fake   Real   Total
ENG       55     82     137
CHN      101    118     219

The metadata of our dataset can be found in CrossFake_metadata.xlsx, which includes two sheets (news_fake and news_real). Given the news id, you can find the corresponding news body text in the body_text directory. The meaning of each metadata column is given below:

  • Column A (id):

    News id. Chinese real & fake news is annotated according to existing English fact-checking information. Thus, each piece of English news may correspond to multiple pieces of Chinese news from different sources. For example, in the news_fake sheet, the ids 1_1 and 1_2 indicate one piece of English news, corresponding to two pieces of Chinese news.

  • Column B (fact_check_url):

    The fact-checking source of the corresponding English news.

  • Column C (type):

    The news type. Post and Article indicate that the news comes from a social media post or an online article, respectively. Note that we also annotated some clickbait news whose title and body text present contradictory information.

  • Column D (source):

    The news source. Personal and Professional indicate that the news comes from a personal account or a professional source (WHO, NIH, etc.), respectively.

  • Column E (mixed?):

    Whether the news includes mixed content. If the body text of a news piece only contains content related to the checked fact, the piece is annotated as not mixed. Accordingly, news whose content includes events/facts besides the checked fact is regarded as mixed news.

  • Column F (platform):

    The platform where the news is published.

  • Column G (news_url):

    The news source URL. Note that some of the links are no longer valid because the news has been deleted or removed. We archived the news that was still accessible (see Column H) while curating the dataset.

  • Column H (archive):

    The archived news link. To permanently store the original news, we archived the news source URL.

  • Column I (newstitle):

    The news title.

  • Column J (publish_date):

    The news publishing date.

  • Columns K to R have the same meanings as Columns C to J, but they describe the corresponding Chinese news.
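
The id convention in Column A can be handled with a few lines of Python. The sketch below is only an illustration: it assumes every id has the form "<English news index>_<Chinese news index>", which is inferred from the 1_1 and 1_2 example above, and the helper names split_news_id and group_by_english_news are hypothetical, not part of this repo.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_news_id(news_id: str) -> Tuple[int, int]:
    """Split an id such as "1_2" into (english_index, chinese_index).

    Assumption: ids always contain exactly one underscore.
    """
    english_part, chinese_part = news_id.strip().split("_")
    return int(english_part), int(chinese_part)


def group_by_english_news(news_ids: Iterable[str]) -> Dict[int, List[str]]:
    """Map each English news index to the ids of its Chinese counterparts."""
    groups = defaultdict(list)
    for news_id in news_ids:
        english_index, _ = split_news_id(news_id)
        groups[english_index].append(news_id)
    return dict(groups)


# Illustrative ids only; the real ones come from Column A of each sheet.
example_ids = ["1_1", "1_2", "2_1"]
groups = group_by_english_news(example_ids)
print(groups)
```

If pandas (with an Excel reader such as openpyxl) is installed, the ids could be obtained with pandas.read_excel("CrossFake_metadata.xlsx", sheet_name="news_fake") and passed to group_by_english_news; the exact header text of Column A in the spreadsheet should be checked first.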

Case Study

Besides the findings and conclusions presented in our paper, we made some additional observations while collecting the data:

  1. Mixed Fact. For some fake news, the corresponding Chinese news articles present it in the form of a news digest together with other news events. This adds an extra hurdle to fact-checking those news pieces, since only part of the content contains misinformation. A typical example is news_id 8_3 in the news_fake sheet. You can find other examples by looking for news whose mixed? column is annotated as Yes.

  2. Misused Fact. For news_real id 9_2, we find a Chinese social post leveraging the fact that "coronavirus can live for up to 4 hours on copper" to promote a copper-made pot. In this case, even though the title and most of the news content seem legitimate, the connection between "the copper kills coronavirus" and "copper pot is good" is still questionable.

  3. Fake News Type. While annotating the Chinese news based on the fact-checked English news, we found that most of the fact-checked fake news from Politifact has no corresponding Chinese news. Those news pieces are usually local news in the United States.

  4. Cross-lingual Fact-checking. For news_real id 9_1, we find a Chinese news piece from a professional news outlet that was published five days earlier than the fact-checked English Facebook post. This suggests that fact information from another language could be leveraged to help fact-check the news. Note that most of the Chinese news in our dataset was published later than the source English news, since most of the checked news events originated in English media.
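
The timing comparison in the last finding can be expressed as a signed publication lag. The following is a minimal sketch: the function name publish_lag_days is hypothetical, and the two dates are made up purely to show a five-day lead; they are not taken from the dataset. How the publish_date columns are formatted in the spreadsheet is not assumed here, so the dates must be parsed into datetime.date objects first.

```python
from datetime import date


def publish_lag_days(english_date: date, chinese_date: date) -> int:
    """Days from the English publication to the Chinese publication.

    A negative value means the Chinese piece was published first.
    """
    return (chinese_date - english_date).days


# Made-up dates for illustration only (not values from the dataset).
english_published = date(2020, 3, 20)
chinese_published = date(2020, 3, 15)

lag = publish_lag_days(english_published, chinese_published)
print(lag)  # -5: the Chinese piece precedes the English one by five days
```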

Future Directions

Given the current dataset, some future research directions include:

  • The writing style/sentiment/stance differences between fake news and real news.
  • The writing style/sentiment/stance differences between professional news outlets and personal accounts.
  • The information distortion/loss from English news to Chinese news.
  • The temporal patterns of cross-lingual news migration.
  • The title patterns of different news.

Citation

If you use our code, please cite the paper below:

@inproceedings{du2021cross,
  title={Cross-lingual COVID-19 Fake News Detection},
  author={Du, Jiangshu and Dou, Yingtong and Xia, Congying and Cui, Limeng and Ma, Jing and Yu, Philip S},
  booktitle={Proceedings of the 21st IEEE International Conference on Data Mining Workshops (ICDMW'21)},
  year={2021}
}