Russian GPT3 models.

Overview

ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large

This repository contains bunch of autoregressive transformer language models trained on a huge dataset of russian language.

Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) trained with 2048 sequence length with sparse and dense attention blocks. We also provide Russian GPT-2 large model (ruGPT2Large) trained with 1024 sequence length.

We suggest using ruGPT2Large or ruGPT3XL because this models are well tested and achieve the best perplexity.

Usage examples are described in detail here.

Old version of code you can find here

Table of contents

Setup and usage

Models can be used for inference or finetuning with two ways: 🤗 HuggingFace interface or our code based on this implementation.

For both ways install transformers:

pip install transformers==3.5.0

HuggingFace interface

We support 🤗 HuggingFace interface only for ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large models. For RuGPT3XL please use code in this repo because RuGPT3XL model was trained with sparse attention.

Here we can obtain examples of finetuning or generation.

Also this examples is adapted for google colab:

  • finetuning: finetuning
  • generation: generation

Basic usage:

from transformers import GPT2LMHeadModel, GPT2Tokenizer


model_name_or_path = "sberbank-ai/rugpt3large_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
model = GPT2LMHeadModel.from_pretrained(model_name_or_path).cuda()
text = "Александр Сергеевич Пушкин родился в "
input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
out = model.generate(input_ids.cuda())
generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)
# Output should be like this:
# Александр Сергеевич Пушкин родился в \n1799 году. Его отец был крепостным крестьянином, а мать – крепостной крестьянкой. Детство и юность Пушкина прошли в деревне Михайловское под Петербургом. В 1820-х годах семья переехала

For more information about 🤗 HuggingFace interface please follow this documentation.

Data issues

For training pass single txt file.

Megatron interface

Without deepspeed

For using our code for finetuning without deepspeed (not recommended) we should install apex:

%%writefile setup.sh

export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

sh setup.sh

Example of finetuning, generating and loading/convert megatron checkpoints here or Open In Colab

Note! This way is valid for all RuGPTs models except RuGPT3XL.

Megatron with deepspeed

For using our code for finetuning with deepspeed (recommended) we should install apex (see previous section) and deepspeed:

pip install deepspeed==0.3.7

Example of finetuning, generating and loading/convert megatron checkpoints here or Open In Colab

Note! For using deepspeed we should specify environ variable before all your python scripts and run with torch.distributed or mpi:

USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \
  --train-data-path "train.list" \
  --test-data-path "valid.list" \
  --max-files-per-process 100 \
  --save model \
  --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --fp16 \
  --checkpoint-activations \
  --deepspeed-activation-checkpointing \
  --deepspeed \
  --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_small_2048.json
Data issues

We use custom implementation of distributed dataset. For training and evaluating we should specify file file.list with list of paths to txt files. All files from file.list will be splitted between aviable GPUs. The logic of splitting is described by the following code:

shard_size = len(files) // world_size
shard_start = rank * shard_size
shard_end = (rank + 1) * shard_size
files = files[shard_start:shard_end]

For more details please see full code of dataset: src.dataset_rugpt3.RuGpt3TextDataset and example.

Note! This way is valid for all RuGPTs models except RuGPT3XL.

Megatron with deepspeed and sparsity

This section is used mostly for usage of RuGPT3XL model and training models with sparse attention.

apt-get install llvm-9-dev
pip install cpufeature
pip install triton==0.2.3
DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7

Test installation of deepspeed you can with the following command: ds_report.

Example of inference of RuGPT3XL here or Open In Colab

Example of finetune, load finetuned model and generate is here.

For using sparse layers in model use --sparse-mode and specify key "sparse_attention" at deepspeed_config (RuGPT3XL config example). Modes can be: fixed, bigbird, bslongformer, variable, dense.

More information about sparse attention here.

Pretraining details

All pretraining was done on Nvidia Tesla V100-SXM3 32 Gb GPUs on a Christofari Cluster. Following are the details of pretraining for each model.

Pretraining ruGPT3XL

Model was trained with 512 sequence length using Deepspeed and Megatron code by SberDevices team, on 80B tokens dataset for 4 epochs. After that model was finetuned 1 epoch with sequence length 2048.
Note! Model has sparse attention blocks.

Total training time was around 10 days on 256 GPUs.
Final perplexity on test set is 12.05.

🤗 HuggingFace model card link.

See more details for generation here or Open In Colab.

Example of finetune, load finetuned model and generate is here.

Our pretraining script here

Example of finetuning script here

Pretraining ruGPT3Large

Model was trained with sequence length 1024 using transformers lib by SberDevices team on 80B tokens for 3 epochs. After that model was finetuned 1 epoch with sequence length 2048.

Total training time was around 14 days on 128 GPUs for 1024 context and few days on 16 GPUs for 2048 context.
Final perplexity on test set is 13.6.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3large_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Medium

Model was trained with sequence length 1024 using transformers lib by SberDevices team on 80B tokens for 3 epoch. After that model was finetuned on 2048 context.

Total training time was around 16 days on 64 GPUs.
Final perplexity on test set is 17.4.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3medium_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Small

Model was trained with sequence length 1024 using transformers by SberDevices team on 80B tokens around 3 epoch. After that model was finetuned on 2048 context.

Total training time took around one week on 32 GPUs.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3small_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT2Large

Model was trained with sequence length 1024 using transformers by SberDevices team on 170Gb data on 64 GPUs 3 weeks.

You can obtain this model by using transformers with model name sberbank-ai/rugpt2large.

🤗 HuggingFace model card link

Advanced

Pretrained scripts (advanced)

Also we add pretraining scripts for all models (except RuGPT2Large). See scripts dir.

Note! All training params (such as lr, wd, ...) may was different while real training. This is just for example.

Convert checkpoint to HuggingFace

For converting megatron checkpoint to HuggingFace format use the following script (example for RuGPT3Small):

python convert2huggingface.py \
  --load /path/to/save/dir/ \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --max-position-embeddings 2048 \
  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \
  --no-load-optim \
  --export-huggingface /path/to/converted/checkpoint

After converting we can use HuggingFace model:

from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("/path/to/converted/checkpoint")

Note! Conversion is worked for all models except RuGPT3XL. For using of RuGPT3XL see example of inference of RuGPT3XL here or Open In Colab.

Owner
Sberbank AI
Sberbank AI
Just a Basic like Language for Zeno INC

zeno-basic-language Just a Basic like Language for Zeno INC This is written in 100% python. this is basic language like language. so its not for big p

Voidy Devleoper 1 Dec 18, 2021
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
Code for "Generative adversarial networks for reconstructing natural images from brain activity".

Reconstruct handwritten characters from brains using GANs Example code for the paper "Generative adversarial networks for reconstructing natural image

K. Seeliger 2 May 17, 2022
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
PortaSpeech - PyTorch Implementation

PortaSpeech - PyTorch Implementation PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech. Model Size Module Nor

Keon Lee 276 Dec 26, 2022
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
使用pytorch+transformers复现了SimCSE论文中的有监督训练和无监督训练方法

SimCSE复现 项目描述 SimCSE是一种简单但是很巧妙的NLP对比学习方法,创新性地引入Dropout的方式,对样本添加噪声,从而达到对正样本增强的目的。 该框架的训练目的为:对于batch中的每个样本,拉近其与正样本之间的距离,拉远其与负样本之间的距离,使得模型能够在大规模无监督语料(也可以

58 Dec 20, 2022
Resources for "Natural Language Processing" Coursera course.

Natural Language Processing course resources This github contains practical assignments for Natural Language Processing course by Higher School of Eco

Advanced Machine Learning specialisation by HSE 1.1k Jan 01, 2023
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
A method to generate speech across multiple speakers

VoiceLoop PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. VoiceLoop is a n

Facebook Archive 873 Dec 15, 2022
Application for shadowing Chinese.

chinese-shadowing Simple APP for shadowing chinese. With this application, it is very easy to record yourself, play the sound recorded and listen to s

Thomas Hirtz 5 Sep 06, 2022
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Moment-DETR QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries Jie Lei, Tamara L. Berg, Mohit Bansal For dataset de

Jie Lei 雷杰 133 Dec 22, 2022
Train 🤗-transformers model with Poutyne.

poutyne-transformers Train 🤗 -transformers models with Poutyne. Installation pip install poutyne-transformers Example import torch from transformers

Lennart Keller 2 Dec 18, 2022
Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch

COCO LM Pretraining (wip) Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch. They were a

Phil Wang 44 Jul 28, 2022
Telegram bot to auto post messages of one channel in another channel as soon as it is posted, without the forwarded tag.

Channel Auto-Post Bot This bot can send all new messages from one channel, directly to another channel (or group, just in case), without the forwarded

Aditya 128 Dec 29, 2022