[LREC] MMChat: Multi-Modal Chat Dataset on Social Media

Last update: Jan 03, 2023

Overview

MMChat

This repo contains the code and data for the LREC2022 paper MMChat: Multi-Modal Chat Dataset on Social Media.

Dataset

MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat. Please read our paper for more details. The images in the dataset are hosted on Weibo's static image server. You can refer to the scripts provided in data_processing/weibo_image_crawler to download these images.
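As a rough illustration of the download step, here is a minimal Python sketch. It is not the crawler shipped in data_processing/weibo_image_crawler. It assumes you already have a list of full image URLs, and the helper names (image_filename, fetch_bytes, download_images) are hypothetical. Files that already exist are skipped, so an interrupted run can be resumed.

```python
import os
import urllib.request
from urllib.parse import urlparse


def image_filename(url):
    """Derive a local file name from the last path segment of an image URL."""
    name = os.path.basename(urlparse(url).path)
    if not name:
        raise ValueError(f"URL has no file name: {url!r}")
    return name


def fetch_bytes(url, timeout=10):
    """Default fetcher: download the raw bytes of one URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(urls, out_dir, fetch=fetch_bytes):
    """Download each URL into out_dir, skipping files that already exist.

    Returns (saved, failed): the URLs that were written to disk and the
    URLs whose download raised a network-level error.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        path = os.path.join(out_dir, image_filename(url))
        if os.path.exists(path):
            continue  # already downloaded in an earlier run
        try:
            data = fetch(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            failed.append(url)
            continue
        with open(path, "wb") as f:
            f.write(data)
        saved.append(url)
    return saved, failed
```

The fetch parameter is injectable so that the function can be exercised without network access, for example with a stub that returns fixed bytes. Some images may no longer be available on the server, which is why failures are collected rather than raised.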

Two sample dialogues from MMChat are given below (translated from Chinese):

MMChat is released in different versions:

Rule Filtered Raw MMChat

This version of MMChat contains raw dialogues filtered by our rules. The following table shows some basic statistics:

Item Description	Count
Sessions	4.257 M
Sessions with more than 4 utterances	2.304 M
Utterances	18.590 M
Images	4.874 M
Avg. utterance per session	4.367
Avg. image per session	1.670
Avg. character per utterance	14.104
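The quantities in the table above can be recomputed from the data with a few lines of Python. The sketch below assumes each session is a dict with a "dialog" list of utterance strings and an "images" list of image identifiers. These field names are hypothetical and may differ from the released files. Whether the "Images" row counts unique images is also an assumption, so both the unique count and the per-session average are reported.

```python
def session_statistics(sessions):
    """Compute basic corpus statistics for a list of dialogue sessions.

    Each session is assumed to be a dict with:
      "dialog": list of utterance strings
      "images": list of image identifiers (hypothetical field names)
    """
    n_sessions = len(sessions)
    n_long = sum(1 for s in sessions if len(s["dialog"]) > 4)
    n_utterances = sum(len(s["dialog"]) for s in sessions)
    n_characters = sum(len(u) for s in sessions for u in s["dialog"])
    # Unique images across the corpus; one image may appear in several sessions.
    unique_images = {img for s in sessions for img in s["images"]}
    # Image references counted per session, including repeats across sessions.
    n_image_refs = sum(len(s["images"]) for s in sessions)

    def safe_div(a, b):
        return a / b if b else 0.0

    return {
        "sessions": n_sessions,
        "sessions_with_more_than_4_utterances": n_long,
        "utterances": n_utterances,
        "images": len(unique_images),
        "avg_utterances_per_session": safe_div(n_utterances, n_sessions),
        "avg_images_per_session": safe_div(n_image_refs, n_sessions),
        "avg_characters_per_utterance": safe_div(n_characters, n_utterances),
    }
```

Character counts use Python's len on the string, so each Chinese character counts as one.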

We divide the above dialogues into 9 splits to facilitate downloading:

LCCC Filtered MMChat

This version of MMChat contains the dialogues that are filtered based on the LCCC (Large-scale Cleaned Chinese Conversation) dataset. Specifically, some dialogues in MMChat also appear in LCCC. We regard these dialogues as cleaner, since LCCC applies sophisticated schemes to filter out noise. This version of MMChat is obtained using the script data_processing/LCCC_filter.py. The following table shows some basic statistics:

Item Description	Count
Sessions	492.6 K
Sessions with more than 4 utterances	208.8 K
Utterances	1.986 M
Images	1.066 M
Avg. utterance per session	4.031
Avg. image per session	2.514
Avg. character per utterance	11.336
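The idea behind the LCCC-based filtering described in this section, keeping only dialogues that also occur in a cleaner reference corpus, can be sketched as a set-membership test. This is not the logic of data_processing/LCCC_filter.py, which may match dialogues differently (for example at the utterance level). The "dialog" field name and the function names are hypothetical, and whitespace is stripped on the assumption that one corpus may store text with spaces between tokens.

```python
def normalize(utterance):
    """Remove all whitespace so tokenized and raw text compare equal."""
    return "".join(utterance.split())


def dialogue_key(utterances):
    """Build a hashable key from the normalized utterances of one dialogue."""
    return tuple(normalize(u) for u in utterances)


def filter_by_reference(sessions, reference_dialogues):
    """Keep the sessions whose full dialogue also occurs in the reference corpus.

    sessions: list of dicts with a "dialog" list of strings (assumed field name)
    reference_dialogues: iterable of dialogues, each a list of strings
    """
    reference_keys = {dialogue_key(d) for d in reference_dialogues}
    return [s for s in sessions if dialogue_key(s["dialog"]) in reference_keys]
```

Building the key set once keeps each lookup constant-time on average, which matters when both corpora contain millions of dialogues.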

We divide the above dialogues into 9 splits to facilitate downloading:

MMChat

The MMChat dataset reported in our paper is given here. The Weibo content corresponding to each of these dialogues is "分享图片" (i.e., "Share Images" in English). The following table shows some basic statistics:

Item Description	Count
Sessions	120.84 K
Sessions with more than 4 utterances	17.32 K
Utterances	314.13 K
Images	198.82 K
Avg. utterance per session	2.599
Avg. image per session	2.791
Avg. character per utterance	8.521

The above dialogues can be downloaded from either Google Drive or Baidu Netdisk.

MMChat-hf

We perform human annotation on the sampled dialogues to determine whether the given images are related to the corresponding dialogues. The following table only shows the statistics for dialogues that are annotated as image-related.

Item Description	Count
Sessions	19.90 K
Sessions with more than 4 utterances	8.91 K
Utterances	81.06 K
Images	52.66 K
Avg. utterance per session	4.07
Avg. image per session	2.70
Avg. character per utterance	11.93

We annotated about 100K dialogues. All the annotated dialogues can be downloaded from either Google Drive or Baidu Netdisk.

Code

We are also releasing all the code used for our experiments. You can use the script run_training.sh in each folder to launch distributed training.

For models that require image features, you can extract the image features using the scripts in data_processing/extract_image_features.

The model shown in our paper can be found in dialog_image:

Reference

Please cite our paper if you find our work useful ;)

@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}

@inproceedings{wang2020chinese,
  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year      = {2020},
  url       = {https://arxiv.org/abs/2008.03946}
}
