This project is part of Eleuther AI's quest to create a massive repository of high quality text data for training language models.

Last update: Dec 13, 2022


Overview

OpenWebText2


In brief, OpenWebText2 is a large, filtered dataset of text documents scraped from URLs found in Reddit submissions.

The plug-and-play version of OpenWebText2 contains:

17,103,059 documents
65.86 GB of uncompressed text
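
To put these two figures in perspective, the short sketch below derives the average document size from them. It assumes that "GB" means 10^9 bytes, which the source does not state; with binary gigabytes (2^30 bytes) the result would be about 7% larger. The function name is illustrative only.

```python
# Figures quoted above for the plug-and-play release.
DOCUMENTS = 17_103_059
UNCOMPRESSED_GB = 65.86

# Assumption: decimal gigabytes. Pass 2**30 instead if the figure is in GiB.
BYTES_PER_GB = 10**9


def average_document_bytes(total_gb, documents, bytes_per_gb=BYTES_PER_GB):
    """Return the mean uncompressed size of a document, in bytes."""
    return total_gb * bytes_per_gb / documents


avg = average_document_bytes(UNCOMPRESSED_GB, DOCUMENTS)
print(f"about {avg / 1000:.2f} KB of text per document on average")
```

Under the decimal assumption, this works out to a little under 4 KB of text per document.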

Download Dataset / Documentation

For further information, please visit our documentation.

Acknowledgements

researcher2 wrote much of this code, with inspiration from, and some straight copying of, the scraping code found here.
sdtblck kindly put together the Colab notebook and performed a chunk of the scraping.
leogao2 provided overall design guidance and lm_dataformat, and performed another chunk of the scraping.
Colaboratory VMs helped us with about 10% of our overall scraping.
The Eye hosts our processed datasets.
Read the Docs hosts our documentation.


Comments

Fixing an issue with sha256 checking

The pushshift.pushshift_to_sqlite method passes its arguments to best_download.download_file in the wrong order, so the code crashes. The dataset therefore cannot be reproduced without this modification.

Opened by ardacihaner
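
The sketch below shows the general kind of check involved and one way to avoid argument-order mistakes. It does not reproduce the signature of best_download.download_file, which is not given here. The helpers `sha256_of_file` and `verify_download` are hypothetical names built on the standard-library `hashlib` module. Calling such a helper with keyword arguments makes a swapped path and checksum much harder to write by accident.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large dumps never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(*, local_path, expected_sha256):
    """Hypothetical helper: raise ValueError if the checksum does not match.

    The bare `*` forces callers to name both arguments, so the path and
    the checksum cannot be swapped silently.
    """
    actual = sha256_of_file(local_path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch for {local_path}: expected "
            f"{expected_sha256.lower()}, got {actual}"
        )
    return True
```

A call then looks like `verify_download(local_path="dump.zst", expected_sha256="...")`, where the file name and the checksum are placeholders for real values.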

Releases (v1.0)

v1.0 (Aug 29, 2021)

Initial Release.

Owner

EleutherAI


