[LREC] MMChat: Multi-Modal Chat Dataset on Social Media

Overview

MMChat

This repo contains the code and data for the LREC2022 paper MMChat: Multi-Modal Chat Dataset on Social Media.

Dataset

MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat. Please read our paper for more details. The images in the dataset are hosted on Weibo's static image server. You can refer to the scripts provided in data_processing/weibo_image_crawler to download these images.

Two sample dialogues form MMChat are given below (translated from Chinese): A sample dialogue from MMChat

MMChat is released in different versions:

Rule Filtered Raw MMChat

This version of MMChat contains raw dialogues filtered by our rules. The following table shows some basic statistics:

Item Description Count
Sessions 4.257 M
Sessions with more than 4 utterances 2.304 M
Utterances 18.590 M
Images 4.874 M
Avg. utterance per session 4.367
Avg. image per session 1.670
Avg. character per utterance 14.104

We devide above dialogues into 9 splits to facilitate the download:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

LCCC Filtered MMChat

This version of MMChat contains the dialogues that are filtered based on the LCCC (Large-scale Cleaned Chinese Conversation) dataset. Specifically, some dialogues in MMChat are also contained in LCCC. We regard these dialogues as cleaner dialogues since sophisticated schemes are designed in LCCC to filter out noises. This version of MMChat is obtained using the script data_processing/LCCC_filter.py The following table shows some basic statistics:

Item Description Count
Sessions 492.6 K
Sessions with more than 4 utterances 208.8 K
Utterances 1.986 M
Images 1.066 M
Avg. utterance per session 4.031
Avg. image per session 2.514
Avg. character per utterance 11.336

We devide above dialogues into 9 splits to facilitate the download:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

MMChat

The MMChat dataset reported in our paper are given here. The Weibo content corresponding to these dialogues are all "分享图片", (i.e., "Share Images" in English). The following table shows some basic statistics:

Item Description Count
Sessions 120.84 K
Sessions with more than 4 utterances 17.32 K
Utterances 314.13 K
Images 198.82 K
Avg. utterance per session 2.599
Avg. image per session 2.791
Avg. character per utterance 8.521

The above dialogues can be downloaded from either Google Drive or Baidu Netdisk.

MMChat-hf

We perform human annotation on the sampled dialogues to determine whether the given images are related to the corresponding dialogues. The following table only shows the statistics for dialogues that are annotated as image-related.

Item Description Count
Sessions 19.90 K
Sessions with more than 4 utterances 8.91 K
Utterances 81.06 K
Images 52.66K
Avg. utterance per session 4.07
Avg. image per session 2.70
Avg. character per utterance 11.93

We annotated about 100K dialogues. All the annotated dialogues can be downloaded from either Google Drive or Baidu Netdisk.

Code

We are also releasing all the codes used for our experiments. You can use the script run_training.sh in each folder to launch the distributed training.

For models that require image features, you can extract the image features using the scripts in data_processing/extract_image_features

The model shown in our paper can be found in dialog_image: Model

Reference

Please cite our paper if you find our work useful ;)

@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}
@inproceedings{wang2020chinese,
  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year      = {2020},
  url       = {https://arxiv.org/abs/2008.03946}
}
Owner
Silver
Dialogue System, Natural Language Processing
Silver
Learn other languages ​​using artificial intelligence with python.

The main idea of ​​the project is to facilitate the learning of other languages. We created a simple AI that will interact with you. Just ask questions that if she knows, she will answer.

Pedro Rodrigues 2 Jun 07, 2022
A fast implementation of bss_eval metrics for blind source separation

fast_bss_eval Do you have a zillion BSS audio files to process and it is taking days ? Is your simulation never ending ? Fear no more! fast_bss_eval i

Robin Scheibler 99 Dec 13, 2022
Semantic segmentation task for ADE20k & cityscapse dataset, based on several models.

semantic-segmentation-tensorflow This is a Tensorflow implementation of semantic segmentation models on MIT ADE20K scene parsing dataset and Cityscape

HsuanKung Yang 83 Oct 13, 2022
Pose estimation with MoveNet Lightning

Pose Estimation With MoveNet Lightning MoveNet is the TensorFlow pre-trained model that identifies 17 different key points of the human body. It is th

Yash Vora 2 Jan 04, 2022
Food recognition model using convolutional neural network & computer vision

Food recognition model using convolutional neural network & computer vision. The goal is to match or beat the DeepFood Research Paper

Hemanth Chandran 1 Jan 13, 2022
Code for "Learning Structural Edits via Incremental Tree Transformations" (ICLR'21)

Learning Structural Edits via Incremental Tree Transformations Code for "Learning Structural Edits via Incremental Tree Transformations" (ICLR'21) 1.

NeuLab 40 Dec 23, 2022
Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV2021

Spatial-Temporal Transformer for Dynamic Scene Graph Generation Pytorch Implementation of our paper Spatial-Temporal Transformer for Dynamic Scene Gra

Yuren Cong 119 Jan 01, 2023
Codebase for arXiv preprint "NeRF++: Analyzing and Improving Neural Radiance Fields"

NeRF++ Codebase for arXiv preprint "NeRF++: Analyzing and Improving Neural Radiance Fields" Work with 360 capture of large-scale unbounded scenes. Sup

Kai Zhang 722 Dec 28, 2022
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 03, 2023
A strongly-typed genetic programming framework for Python

monkeys "If an army of monkeys were strumming on typewriters they might write all the books in the British Museum." monkeys is a framework designed to

H. Chase Stevens 115 Nov 27, 2022
Efficient semidefinite bounds for multi-label discrete graphical models.

Low rank solvers #################################### benchmark/ : folder with the random instances used in the paper. ############################

1 Dec 08, 2022
A GridMixup augmentation, inspired by GridMask and CutMix

GridMixup A GridMixup augmentation, inspired by GridMask and CutMix Easy install pip install git+https://github.com/IlyaDobrynin/GridMixup.git Overvie

IlyaDo 42 Dec 28, 2022
Is RobustBench/AutoAttack a suitable Benchmark for Adversarial Robustness?

Adversrial Machine Learning Benchmarks This code belongs to the papers: Is RobustBench/AutoAttack a suitable Benchmark for Adversarial Robustness? Det

Adversarial Machine Learning 9 Nov 27, 2022
A python comtrade load library accelerated by go

Comtrade-GRPC Code for python used is mainly from dparrini/python-comtrade. Just patch the code in BinaryDatReader.parse for parsing a little more eff

Bo 1 Dec 27, 2021
A practical ML pipeline for data labeling with experiment tracking using DVC.

Auto Label Pipeline A practical ML pipeline for data labeling with experiment tracking using DVC Goals: Demonstrate reproducible ML Use DVC to build a

Todd Cook 4 Mar 08, 2022
D2Go is a toolkit for efficient deep learning

D2Go D2Go is a production ready software system from FacebookResearch, which supports end-to-end model training and deployment for mobile platforms. W

Facebook Research 744 Jan 04, 2023
Source code for Task-Aware Variational Adversarial Active Learning

Contrastive Coding for Active Learning under Class Distribution Mismatch Official PyTorch implementation of ["Contrastive Coding for Active Learning u

27 Nov 23, 2022
Official implementation of the ICCV 2021 paper "Joint Inductive and Transductive Learning for Video Object Segmentation"

JOINT This is the official implementation of Joint Inductive and Transductive learning for Video Object Segmentation, to appear in ICCV 2021. @inproce

Yunyao 35 Oct 16, 2022
Official implementation of Rich Semantics Improve Few-Shot Learning (BMVC, 2021)

Rich Semantics Improve Few-Shot Learning Paper Link Abstract : Human learning benefits from multi-modal inputs that often appear as rich semantics (e.

Mohamed Afham 11 Jul 26, 2022
Weight estimation in CT by multi atlas techniques

maweight A Python package for multi-atlas based weight estimation for CT images, including segmentation by registration, feature extraction and model

György Kovács 0 Dec 24, 2021