Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets


What is LASSL

LASSL stands for LAnguage Semi-Supervised Learning. It provides language model pretraining built on Huggingface's Transformers and Datasets libraries, so that anyone with data can easily train their own language model.

Environment setting

Install the required packages with the command below:

pip3 install -r requirements.txt

Alternatively, set up the environment with poetry.

# Install poetry
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
# Set up the poetry dependencies
poetry install

How to Use

1. Train Tokenizer

python3 train_tokenizer.py \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --sampling_ratio $SAMPLING_RATIO \
    --model_type $MODEL_TYPE \
    --vocab_size $VOCAB_SIZE \
    --min_frequency $MIN_FREQUENCY
# Using poetry
poetry run python3 train_tokenizer.py \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --sampling_ratio $SAMPLING_RATIO \
    --model_type $MODEL_TYPE \
    --vocab_size $VOCAB_SIZE \
    --min_frequency $MIN_FREQUENCY
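
To give a rough feel for what --vocab_size and --min_frequency control, the sketch below builds a word-level vocabulary in plain Python. It is a simplification under stated assumptions: the real script presumably trains a subword tokenizer matching the chosen model_type, and build_vocab, its parameters, and the special tokens are hypothetical names that are not part of LASSL.

```python
from collections import Counter


def build_vocab(lines, vocab_size, min_frequency, special_tokens=("<pad>", "<unk>")):
    """Word-level stand-in for tokenizer training (hypothetical helper, not LASSL code)."""
    counts = Counter(word for line in lines for word in line.split())
    # Drop words seen fewer than min_frequency times.
    kept = [(w, c) for w, c in counts.items() if c >= min_frequency]
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    # Special tokens count against the vocab_size budget.
    budget = max(vocab_size - len(special_tokens), 0)
    tokens = list(special_tokens) + [w for w, _ in kept[:budget]]
    return {token: idx for idx, token in enumerate(tokens)}


corpus = ["the cat sat", "the dog sat", "the bird flew"]
vocab = build_vocab(corpus, vocab_size=4, min_frequency=2)
print(vocab)
```

Only "the" and "sat" occur at least twice, so they are the only words that survive the frequency filter; raising vocab_size alone would not add more entries here.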

2. Serialize Corpora

python3 serialize_corpora.py \
    --model_type $MODEL_TYPE \
    --tokenizer_dir $TOKENIZER_DIR \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --max_length $MAX_LENGTH \
    --num_proc $NUM_PROC
# Using poetry
poetry run python3 serialize_corpora.py \
    --model_type $MODEL_TYPE \
    --tokenizer_dir $TOKENIZER_DIR \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --max_length $MAX_LENGTH \
    --num_proc $NUM_PROC
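
Conceptually, serialization turns raw text into fixed-length sequences of token ids that the pretraining step can read directly. The sketch below shows one common way to do this: concatenate token ids, then cut them into blocks of max_length. It illustrates the general technique only and is not LASSL's actual processor; pack_into_blocks and the toy encode function are hypothetical.

```python
def encode(text, vocab):
    """Toy whitespace tokenizer standing in for a trained tokenizer (hypothetical)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def pack_into_blocks(documents, vocab, max_length):
    """Concatenate token ids across documents and cut them into max_length blocks."""
    buffer, blocks = [], []
    for doc in documents:
        buffer.extend(encode(doc, vocab))
        # Emit every full block currently available in the buffer.
        while len(buffer) >= max_length:
            blocks.append(buffer[:max_length])
            buffer = buffer[max_length:]
    # The leftover tail shorter than max_length is dropped in this sketch.
    return blocks


vocab = {"<unk>": 0, "a": 1, "b": 2, "c": 3}
blocks = pack_into_blocks(["a b c", "b a", "c z a"], vocab, max_length=4)
print(blocks)
```

Every block has exactly max_length ids, which avoids padding during pretraining at the cost of letting a block span document boundaries.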

3. Pretrain Language Model

python3 pretrain_language_model.py --config_path $CONFIG_PATH
# Using poetry
poetry run python3 pretrain_language_model.py --config_path $CONFIG_PATH
# When using a TPU, use the command below. (The poetry environment does not provide PyTorch XLA by default.)
python3 xla_spawn.py --num_cores $NUM_CORES pretrain_language_model.py --config_path $CONFIG_PATH

Contributors

김보섭 류민호 류인제 박장원 김형석

Acknowledgements

LASSL was built with Cloud TPU support from the Tensorflow Research Cloud (TFRC) program.

Comments
  • Ready to release v0.1.0

    Summary

    The overall framework is basically in place. Before releasing v0.1.0, the following points need discussion:

    • The model_type values supported by serialize_corpora.py and train_tokenizer.py differ
      • serialize_corpora.py: roberta, gpt2, albert
      • train_tokenizer.py: bert-uncased, bert-cased, gpt2, roberta, albert, electra
    • README.md
    help wanted 
    opened by seopbo 10
  • Refactor codes relevant to pretrain

    • Added a DataCollatorFor{MODEL} for each PLM to be trained.
    • Defined model_type_to_collator in pretrain_language_model.py to fetch the collator for each model_type.
      • Collator args (e.g. mlm_probability) are read from the collator section of the config file.
    • Added code to pretrain_language_model.py for using an eval_dataset.
      • The test_size arg for setting up the eval_dataset is read from the data section of the config file.
    • Also ran isort and black.
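
The collator dispatch described in this pull request can be pictured roughly as follows. This is a minimal sketch under assumptions: the collator classes are empty stand-ins, their exact names are guesses following the DataCollatorFor{MODEL} pattern, and the config layout (model, collator and data sections) is inferred from the description rather than copied from the repository.

```python
from dataclasses import dataclass


@dataclass
class DataCollatorForBert:  # stand-in class, not the real implementation
    mlm_probability: float = 0.15


@dataclass
class DataCollatorForGpt2:  # stand-in class, not the real implementation
    pass


# Maps model_type to the collator class used for that model.
model_type_to_collator = {
    "bert": DataCollatorForBert,
    "gpt2": DataCollatorForGpt2,
}

# Assumed config layout; in practice this would be loaded from the config file.
config = {
    "model": {"model_type": "bert"},
    "collator": {"mlm_probability": 0.2},
    "data": {"test_size": 0.1},
}


def build_collator(config):
    """Look up the collator class by model_type and pass the collator args to it."""
    collator_cls = model_type_to_collator[config["model"]["model_type"]]
    return collator_cls(**config["collator"])


collator = build_collator(config)
print(collator)
```

Keeping the mapping in one dict means supporting a new model type only requires a new collator class and one extra entry.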

    Refs: #30

    opened by seopbo 6
  • Add UL2 Language Modeling

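    I already brought this up on Slack: the objective based on the Mixture of Denoisers, introduced in the Unifying Language Learning Paradigms paper, is reported to be better overall than the existing span corruption, MLM, and CLM objectives. I happen to be planning to try it at work, so I would like to implement a collator and processor for it in lassl. What do you think?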

    opened by DaehanKim 4
  • Support training BART

    Is your feature request related to a problem? Please describe.

    • Add a BART processor and collator.

    Describe the solution you'd like

    • Add the text_infilling method as a collator.
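
As a rough illustration of text infilling, the sketch below replaces each chosen span with a single mask token, so the model must also infer how many tokens are missing. In BART the span lengths are sampled from a Poisson distribution (λ = 3); here the spans are passed in explicitly to keep the example deterministic. The text_infilling function and the mask token string are hypothetical and are not the collator added to LASSL.

```python
def text_infilling(tokens, spans, mask_token="<mask>"):
    """Replace each (start, length) span with one mask token.

    spans must be sorted by start and must not overlap; a length of 0
    inserts a mask token without removing anything.
    """
    output, cursor = [], 0
    for start, length in spans:
        output.extend(tokens[cursor:start])  # copy untouched tokens
        output.append(mask_token)            # whole span collapses to one mask
        cursor = start + length
    output.extend(tokens[cursor:])
    return output


tokens = ["A", "B", "C", "D", "E", "F"]
corrupted = text_infilling(tokens, spans=[(1, 2), (4, 0)])
print(corrupted)
```

The first span removes "B" and "C" and leaves one mask; the second span has length 0, so it only inserts a mask before "E".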

    enhancement 
    opened by bzantium 4
  • Add keep_in_memory option in load_dataset

    Is your feature request related to a problem? Please describe.

    • While training on a TPU VM, the disk filled up because of caching even though there was enough memory.

    Describe the solution you'd like

    • Resolve this by adding a keep_in_memory option at the load_dataset step.
    • Serialized data is saved to disk, so the option is not needed at the train step; add it only to the tokenizer and serialize steps.
    opened by iron-ij 2
  • KoRobertaSmall training

    TODO

    Training tokenizer

    poetry run python3 train_tokenizer.py --corpora_dir corpora \
    --corpus_type sent_text \
    --model_type roberta \
    --vocab_size 51200 \
    --min_frequency 2
    

    Serializing corpora

    poetry run python3 serialize_corpora.py --model_type roberta \
    --tokenizer_dir tokenizers/roberta \
    --corpora_dir corpora \
    --corpus_type sent_text \
    --max_length 512 \
    --num_proc 96 \
    --batch_size 1000 \
    --writer_batch_size 1000
    

    ref:

    • https://github.com/huggingface/blog/blob/master/notebooks/13_pytorch_xla.ipynb
    help wanted 
    opened by seopbo 2
  • Support corpus_type

    • Defined corpus_type as one of "docu_text", "docu_json", "sent_text", "sent_json".
      • Updated the load_corpora function accordingly.
      • Renamed the loading script and class corresponding to "sent_text".
      • Updated the argument parser in serialize_corpora.py to match corpus_type.
      • Refactored train_tokenizer.py to match corpus_type.
      • Renamed model_name to model_type.

    Refs: #23

    enhancement 
    opened by seopbo 2
  • Support setting arguments of pretraining by a config file

    • Pass the arguments needed to run pretrain_language_model.py through a single config file.
    • Added the OmegaConf library to handle nested dicts.
    • Call class constructors via CONFIG_MAPPING.

    Refs: #16

    opened by seopbo 2
  • argument setting

    To Do

    • https://github.com/lassl/lassl/blob/c507a547e5e22a3bc89bf65e448712783e688211/pretrain_language_model.py#L47
    • set ModelArguments from config.json file
    • set TrainingArguments from config.json file
    enhancement 
    opened by alxiom 2
  • Single-stage Electra collator refactored

    src/lassl/collators.py

    1. Simplified the main operation (all-in-tensor)
    2. change the function name pad_for_token_type_ids -> _token_type_ids_with_pad for clarity
    documentation 
    opened by Doohae 1
  • Add config files #82

    Add config files for the following:

    • bert-small.yaml
    • albert-small.yaml
    • gpt2-small.yaml
    • roberta-small.yaml

    Also add a readme file with a brief explanation of config files in general. For issue #82 @seopbo
    opened by Doohae 1
  • Can you give some examples or benchmarks, that use this pretrain framework make downstream task better ?

    I think the project would be more attractive if you could show evidence that pretraining with this framework on a self-built corpus improves performance on some downstream task.

    opened by svjack 1
  • Change default save format to parquet

    TODO

    • Currently, serialize_corpora.py saves the encoded dataset with save_to_disk.
    • In this issue, we replace the call to save_to_disk with a call to to_parquet.

    cc: @Doohae @DaehanKim

    enhancement 
    opened by seopbo 0
Releases(v1.0.0)
  • v1.0.0(Nov 2, 2022)

    What's Changed

    • [mixed] refactor: Refactor for v1.0.0 by @seopbo in https://github.com/lassl/lassl/pull/102
    • Currently, lassl supports training bert, albert, roberta, gpt2, bart, t5, ul2.
    • Next, lassl will support training electra. In addition, train_universal_tokenizer.py will be added to lassl.
      • train_universal_tokenizer.py will train a tokenizer that can be used for all model types supported by lassl.

    Full Changelog: https://github.com/lassl/lassl/compare/v0.2.0...v1.0.0

  • v0.2.0(Sep 22, 2022)

    What's Changed

    • Support training BART by @seopbo in https://github.com/lassl/lassl/pull/81
    • Support training T5 model by @DaehanKim in https://github.com/lassl/lassl/pull/87
    • Add config files #82 by @Doohae in https://github.com/lassl/lassl/pull/88
    • Support Electra pretrain by @Doohae in https://github.com/lassl/lassl/pull/91
    • Add UL2 Language Modeling by @DaehanKim in https://github.com/lassl/lassl/pull/98

    New Contributors

    • @DaehanKim made their first contribution in https://github.com/lassl/lassl/pull/87

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.4...v0.2.0

  • v0.1.3(Mar 18, 2022)

    Summary

    • Refactor lassl for packaging modules to library
    • Add a function of dataset blending

    What's Changed

    • Add dataset blender by @hyunwoongko in https://github.com/lassl/lassl/pull/73
    • Remove poetry dependencies by @seopbo in https://github.com/lassl/lassl/pull/76

    New Contributors

    • @hyunwoongko made their first contribution in https://github.com/lassl/lassl/pull/73

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.2...v0.1.3

  • v0.1.2(Dec 30, 2021)

    Summary

    • Fix bugs in src/collators.py

    What's Changed

    • [python] fix: Fix importing a invalid module by @seopbo in https://github.com/lassl/lassl/pull/72

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.1...v0.1.2

  • v0.1.1(Dec 20, 2021)

    Summary

    • Update README.md
      • Support README.md in english.
      • Support README_ko.md in korean.
    • Fix bugs of training GPT2
    • Add examples configs for gpu, tpu environments.

    What's Changed

    • [docs] fix: Change a license by @seopbo in https://github.com/lassl/lassl/pull/64
    • [etc] docs: Add English version of README by @bzantium in https://github.com/lassl/lassl/pull/66
    • Add example configs for gpu, tpu by @seopbo in https://github.com/lassl/lassl/pull/65
    • [python] fix: debug GPT2 processor and collator by @bzantium in https://github.com/lassl/lassl/pull/69
    • Update README.md by @bzantium in https://github.com/lassl/lassl/pull/70

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.0...v0.1.1

  • v0.1.0(Dec 15, 2021)

    Summary

    • First release

    What's Changed

    • Feature/#2 by @seopbo in https://github.com/lassl/lassl/pull/4
    • feat: TPU compatibility by @monologg in https://github.com/lassl/lassl/pull/8
    • Feature/#3 Add GPT2Preprocessor by @iron-ij in https://github.com/lassl/lassl/pull/10
    • [docs] chore: Add authors by @seopbo in https://github.com/lassl/lassl/pull/13
    • Feature/#9 Add Processor and Collator for ALBERT by @bzantium in https://github.com/lassl/lassl/pull/14
    • [python] feat: Save tokenizer by @seopbo in https://github.com/lassl/lassl/pull/19
    • [python] mixed: Support sentence per line type doc by @seopbo in https://github.com/lassl/lassl/pull/20
    • Support setting arguments of pretraining by a config file by @seopbo in https://github.com/lassl/lassl/pull/22
    • Support corpus_type by @seopbo in https://github.com/lassl/lassl/pull/25
    • Support adding additional special tokens by @seopbo in https://github.com/lassl/lassl/pull/26
    • [python] feat: Add bert processor by @bzantium in https://github.com/lassl/lassl/pull/29
    • Refactor codes relevant to pretrain by @seopbo in https://github.com/lassl/lassl/pull/31
    • Update issue templates by @seopbo in https://github.com/lassl/lassl/pull/34
    • [python] fix: Add a sampling_ratio condition by @bzantium in https://github.com/lassl/lassl/pull/36
    • [python] chore: Update dependencies by @seopbo in https://github.com/lassl/lassl/pull/38
    • [python] fix: Fix a buffer in processing.py by @seopbo in https://github.com/lassl/lassl/pull/41
    • [mixed] fix: Change xla_spawn, add configs and comments by @bzantium in https://github.com/lassl/lassl/pull/44
    • [python] feat: add keep_in_memory option in serialize_corpora by @iron-ij in https://github.com/lassl/lassl/pull/43
    • [chore] fix: Fix a requirements.txt by @seopbo in https://github.com/lassl/lassl/pull/46
    • [python] fix: Remove the duplicate-sampling option when sampling by @bzantium in https://github.com/lassl/lassl/pull/48
    • [etc] docs: Add README by @bzantium in https://github.com/lassl/lassl/pull/39
    • [etc] docs: Add an introduction of the LASSL acronym to README by @bzantium in https://github.com/lassl/lassl/pull/52
    • [python] chore: Update dependencies by @seopbo in https://github.com/lassl/lassl/pull/54
    • [python] fix: Make GPT2 Collator inherit from CollatorForLM by @bzantium in https://github.com/lassl/lassl/pull/57
    • [etc] docs: Add additional information to doc by @seopbo in https://github.com/lassl/lassl/pull/59

    New Contributors

    • @seopbo made their first contribution in https://github.com/lassl/lassl/pull/4
    • @monologg made their first contribution in https://github.com/lassl/lassl/pull/8
    • @iron-ij made their first contribution in https://github.com/lassl/lassl/pull/10
    • @bzantium made their first contribution in https://github.com/lassl/lassl/pull/14

    Full Changelog: https://github.com/lassl/lassl/commits/v0.1.0

Owner
LASSL: LAnguage Self-Supervised Learning