Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

Last update: Jan 07, 2023

Overview

japanese-gpt2

This repository provides the code for training Japanese GPT-2 models. This code was used to produce japanese-gpt2-medium, which rinna released on the Hugging Face model hub.

Please open an issue (in English or Japanese) if you encounter any problem using the code or using our models via Hugging Face.

Train a Japanese GPT-2 from scratch on your own machine

Download the Japanese CC-100 training corpus and extract the ja.txt file.
Either move ja.txt to the location given by self.raw_data_dir in src/corpus/jp_cc100/config.py, or edit self.raw_data_dir in that config file so that it matches the location of ja.txt.
Split ja.txt into smaller files by running:

cd src/
python -m corpus.jp_cc100.split_to_small_files

Train a medium-sized GPT-2 on 4 GPUs by running:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain.train --n_gpus 4 --save_model True --enable_log True
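
The splitting step above can be pictured with a minimal sketch. This is not the repository's corpus.jp_cc100.split_to_small_files module; the function name split_text_file, the chunk naming scheme, and the default chunk size are all assumptions made for illustration. It only shows the general idea of streaming a large corpus file into fixed-size pieces without loading it into memory.

```python
from pathlib import Path


def split_text_file(src_path, out_dir, lines_per_file=100_000):
    """Split a large text file into numbered chunks (hypothetical helper).

    Each chunk holds at most `lines_per_file` lines. Returns the list of
    chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out_file = None
    lines_in_chunk = 0
    # Stream line by line so the whole corpus never has to fit in memory.
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            # Start a new chunk for the first line or when the current one is full.
            if out_file is None or lines_in_chunk >= lines_per_file:
                if out_file is not None:
                    out_file.close()
                chunk_path = out_dir / f"{len(written):05d}.txt"  # assumed naming scheme
                out_file = open(chunk_path, "w", encoding="utf-8")
                written.append(chunk_path)
                lines_in_chunk = 0
            out_file.write(line)
            lines_in_chunk += 1
    if out_file is not None:
        out_file.close()
    return written
```

Smaller files are easier to shuffle and to distribute across data-loading workers, which is presumably why the corpus is split before training.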

Interact with the trained model

Assume you have run the training script and saved your medium-sized GPT-2 to data/model/gpt2-medium-xxx.checkpoint. Run the following command to complete text on one GPU, sampling with top-k filtering (k=40) combined with nucleus (top-p) sampling (p=0.95):

CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.interact --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --gen_type top --top_p 0.95 --top_k 40
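
To make the --top_p and --top_k flags above more concrete, here is a minimal pure-Python sketch of combined top-k and nucleus filtering over one step's logits. It is an illustration under stated assumptions, not the code in task.pretrain.interact: the function names are hypothetical, and the ordering used here (top-k first, then top-p on the renormalized survivors) is one common convention that the repository may or may not follow.

```python
import math
import random


def top_k_top_p_filter(logits, top_k=40, top_p=0.95):
    """Return {token_id: probability} after top-k, then top-p filtering."""
    # Softmax over all logits (subtract the max for numerical stability).
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k step: keep only the k most probable token ids, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)

    # Nucleus step: keep the smallest prefix whose cumulative probability,
    # measured on the renormalized top-k distribution, reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving probabilities so they sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_next_token(logits, top_k=40, top_p=0.95, rng=random):
    """Draw one token id from the filtered distribution."""
    filtered = top_k_top_p_filter(logits, top_k, top_p)
    ids = list(filtered)
    weights = [filtered[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Lower values of either parameter restrict sampling to fewer high-probability tokens, which tends to make completions more conservative; higher values allow more varied output.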

Prepare files for uploading to Hugging Face

Create a Hugging Face account, create a model repo, and clone it to your local machine.
Create model and config files from a checkpoint by running:

python -m task.pretrain.checkpoint2huggingface --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --save_dir {huggingface's model repo directory}

Validate the created files by running:

python -m task.pretrain.check_huggingface --model_dir {huggingface's model repo directory}

Add files, commit, and push to your Hugging Face repo.
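
Before committing, it can also help to confirm that the repo directory actually contains the files you expect. The sketch below is a lightweight pre-flight check and is not a substitute for task.pretrain.check_huggingface. The helper names are hypothetical. Only config.json is assumed by default, since Hugging Face model repos conventionally include it; the names of the weight and tokenizer files depend on how the conversion script saves them, so pass those explicitly.

```python
import json
from pathlib import Path


def find_missing_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are not files inside model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


def load_config(model_dir):
    """Parse config.json, raising an error if it is missing or not valid JSON."""
    with open(Path(model_dir) / "config.json", encoding="utf-8") as f:
        return json.load(f)
```

An empty list from find_missing_files means every expected file is present; load_config then gives a quick way to inspect the model settings that will be published.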

Customize your training script

Check available arguments by running:

python -m task.pretrain.train --help
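
The commands in this README pass booleans as explicit strings, for example --save_model True. How task.pretrain.train parses them is not shown here, but if you add similar flags of your own, note that argparse's type=bool treats any non-empty string (including "False") as True. The sketch below shows one common workaround; the str2bool helper, the parser, and the default values are assumptions for illustration, not the repository's code.

```python
import argparse


def str2bool(value):
    """Convert strings such as 'True' or 'false' to a real boolean."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a hypothetical parser mirroring the flags used in this README."""
    parser = argparse.ArgumentParser(description="Hypothetical training CLI sketch")
    parser.add_argument("--n_gpus", type=int, default=1)  # default is an assumption
    parser.add_argument("--save_model", type=str2bool, default=False)  # assumed default
    parser.add_argument("--enable_log", type=str2bool, default=False)  # assumed default
    return parser
```

With this parser, build_parser().parse_args(["--save_model", "False"]) yields save_model set to False rather than True.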

License

The MIT license
