WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Overview

WIT : Wikipedia-based Image Text Dataset

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Key Advantages

A few unique advantages of WIT:

  • The largest multimodal dataset (publicly available at the time of this writing) by the number of image-text examples.
  • A massively multilingual dataset (first of its kind) with coverage for over 100+ languages.
  • A collection of diverse set of concepts and real world entities.
  • Brings forth challenging real-world test sets.

You can learn more about WIT Dataset from our arXiv paper.

Latest Updates

2021-04-14: Happy to share the good news that our paper got accepted at SIGIR Conference. From ACM site, you can find our paper, slides and presentation.

2021-09-14: WIT Image-Text Competition is live on Kaggle. Our collaborators from Wikimedia Research blogged about this and they have made available the raw pixels and resnet50 embeddings for the images in this set.

WIT Example

Wikipedia Page

For example, let's take the Wikipedia page for Half Dome, Yosemite in CA.

WIT Wikipedia Half Dome Image

From the Wikipedia page for Half Dome : Photo by DAVID ILIFF. License: CC BY-SA 3.0

Wikipedia Page with Annotations of what we can extract

From this page, we highlight the various key pieces of data that we can extract - images, their respective text snippets and some contextual metadata.

WIT Half Dome Page with Annotations

By extracting and filering these carefully, we get a clean high quality image-text example that can be used in multimodal modeling.

Motivation

Multimodal visio-linguistic models rely on a rich dataset to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore the lack of language coverage in existing datasets (which are mostly only in English) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.

To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.

The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

WIT: Dataset Numbers

Type Train Val Test Total / Unique
Rows / Tuples 37.13M 261.8K 210.7K 37.6M
Unique Images 11.4M 58K 57K 11.5M
Ref. Text 16.9M 150K 104K 17.2M / 16.7M
Attr. Text 34.8M 193K 200K 35.2M / 10.9M
Alt Text 5.3M 29K 29K 5.4M / 5.3M
Context Texts - - - 119.8M

WIT: Image-Text Stats by Language

Image-Text # Lang Uniq. Images # Lang
total > 1M 9 images > 1M 6
total > 500K 10 images > 500K 12
total > 100K 36 images > 100K 35
total > 50K 15 images > 50K 17
total > 14K 38 images > 13K 38

Get WIT

We believe that such a powerful diverse dataset will aid researchers in building better multimodal multilingual models and in identifying better learning and representation techniques leading to improvement of Machine Learning models in real-world tasks over visio-linguistic data.

WIT Dataset is now available for download. Please check the data page.

Citing WIT

If you use the WIT dataset, you can cite our work as follows.

@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}

License

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Projects using WIT

For information regarding MURAL (Multimodal, Multitask Retrieval Across Languages) paper accepted at EMNLP 2021.

Contact

For any questions, please contact [email protected].

If WIT dataset is useful to you, please do write to us about it. Be it a blog post, a research project or a paper, we are delighted to learn about it.

Owner
Google Research Datasets
Datasets released by Google Research
Google Research Datasets
Creating a chess engine using GPT-3

GPT3Chess Creating a chess engine using GPT-3 Code for my article : https://towardsdatascience.com/gpt-3-play-chess-d123a96096a9 My game (white) vs GP

19 Dec 17, 2022
Chinese version of GPT2 training code, using BERT tokenizer.

GPT2-Chinese Description Chinese version of GPT2 training code, using BERT tokenizer or BPE tokenizer. It is based on the extremely awesome repository

Zeyao Du 5.6k Jan 04, 2023
Simple Text-To-Speech Bot For Discord

Simple Text-To-Speech Bot For Discord This is a very simple TTS bot for discord made with python. For this bot you need FFMPEG, see installation to se

1 Sep 26, 2022
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 06, 2022
👄 The most accurate natural language detection library for Python, suitable for long and short text alike

1. What does this library do? Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a prepr

Peter M. Stahl 334 Dec 30, 2022
Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec

Wake Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec Abstract استخراج خودکار کلمات کلیدی متون کوتاه فارسی با استفاده از word2vec ب

Omid Hajipoor 1 Dec 17, 2021
Research code for the paper "Fine-tuning wav2vec2 for speaker recognition"

Fine-tuning wav2vec2 for speaker recognition This is the code used to run the experiments in https://arxiv.org/abs/2109.15053. Detailed logs of each t

Nik 103 Dec 26, 2022
Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles

Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles (TASLP 2022)

Zhuosheng Zhang 3 Apr 14, 2022
Python library for parsing resumes using natural language processing and machine learning

CVParser Python library for parsing resumes using natural language processing and machine learning. Setup Installation on Linux and Mac OS Follow the

nafiu 0 Jul 29, 2021
Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

Persian Lexicon This repo uses Uppsala Persian Corpus (UPC) to construct a lexic

Saman Vaisipour 7 Apr 01, 2022
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2

Google Research Datasets 52 Jun 21, 2022
NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles

NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles NewsMTSC is a dataset for target-dependent sentiment classification (TSC)

Felix Hamborg 79 Dec 30, 2022
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Maix Speech AI lib, including ASR, chat, TTS etc.

Maix-Speech 中文 | English Brief Now only support Chinese, See 中文 Build Clone code by: git clone https://github.com/sipeed/Maix-Speech Compile x86x64 c

Sipeed 267 Dec 25, 2022
Index different CKAN entities in Solr, not just datasets

ckanext-sitesearch Index different CKAN entities in Solr, not just datasets Requirements This extension requires CKAN 2.9 or higher and Python 3 Featu

Open Knowledge Foundation 3 Dec 02, 2022
spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines spaCy-wrap is minimal library intended for wrapping fine-tuned transformers from t

Kenneth Enevoldsen 32 Dec 29, 2022
The Easy-to-use Dialogue Response Selection Toolkit for Researchers

The Easy-to-use Dialogue Response Selection Toolkit for Researchers

GMFTBY 32 Nov 13, 2022
IEEEXtreme15.0 Questions And Answers

IEEEXtreme15.0 Questions And Answers IEEEXtreme is a global challenge in which teams of IEEE Student members – advised and proctored by an IEEE member

Dilan Perera 15 Oct 24, 2022