🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Overview

Pretrained BigBird Model for Korean

What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation

한국어 | English


What is BigBird?

BigBird is a sparse-attention-based model introduced in BigBird: Transformers for Longer Sequences that can handle longer sequences than a standard BERT.

🦅 Longer Sequence - handles sequences of up to 4096 tokens, 8× the 512-token limit of standard BERT

โฑ๏ธ Computational Efficiency - Full attention์ด ์•„๋‹Œ Sparse Attention์„ ์ด์šฉํ•˜์—ฌ O(n2)์—์„œ O(n)์œผ๋กœ ๊ฐœ์„ 

How to Use

  • The model uploaded to the 🤗 Huggingface Hub can be used right away :)
  • We recommend transformers>=4.11.0, in which some issues have been fixed. (PR related to the MRC issue)
  • BertTokenizer must be used instead of BigBirdTokenizer. (AutoTokenizer loads BertTokenizer.)
  • For detailed usage, see the BigBird Transformers documentation.
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer
```
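Continuing from the snippet above, a long document can be encoded and run through the model in one forward pass. This is a minimal inference sketch; the sample text and max_length are illustrative, and the Hugging Face implementation may fall back to full attention (with a warning) when the input is short relative to the sparse-attention block size.

```python
import torch

text = "KoBigBird can encode an entire long document in a single forward pass."
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)  # BigBirdModel forward

print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```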

Pretraining

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Pretraining BigBird] ์ฐธ๊ณ 

|                     | Hardware | Max len | LR   | Batch | Train Step | Warmup Step |
| ------------------- | -------- | ------- | ---- | ----- | ---------- | ----------- |
| KoBigBird-BERT-Base | TPU v3-8 | 4096    | 1e-4 | 32    | 2M         | 20k         |
  • ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต
  • ITC (Internal Transformer Construction) ๋ชจ๋ธ๋กœ ํ•™์Šต (ITC vs ETC)
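For reference, the warmup and step counts in the table map directly onto the standard linear-warmup schedule in transformers. This is only a PyTorch sketch under our own assumptions — the actual pretraining ran on TPU, and the optimizer choice here is ours:

```python
from torch.optim import AdamW
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")
optimizer = AdamW(model.parameters(), lr=1e-4)   # LR from the table above
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=20_000,       # Warmup Step
    num_training_steps=2_000_000,  # Train Step
)
```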

Evaluation Result

1. Short Sequence (<=512)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Short Sequence Dataset] ์ฐธ๊ณ 

|                     | NSMC (acc) | KLUE-NLI (acc) | KLUE-STS (pearsonr) | Korquad 1.0 (em/f1) | KLUE MRC (em/rouge-w) |
| ------------------- | ---------- | -------------- | ------------------- | ------------------- | --------------------- |
| KoELECTRA-Base-v3   | 91.13      | 86.87          | 93.14               | 85.66 / 93.94       | 59.54 / 65.64         |
| KLUE-RoBERTa-Base   | 91.16      | 86.30          | 92.91               | 85.35 / 94.53       | 69.56 / 74.64         |
| KoBigBird-BERT-Base | 91.18      | 87.17          | 92.61               | 87.08 / 94.71       | 70.33 / 75.34         |
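For the extractive MRC tasks above (Korquad 1.0, KLUE MRC), the checkpoint can be loaded with a question-answering head. This is a minimal wiring sketch with placeholder question/context strings; the freshly initialized QA head still needs fine-tuning before its answers are meaningful.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model = AutoModelForQuestionAnswering.from_pretrained("monologg/kobigbird-bert-base")
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")

question = "Who created KoBigBird?"  # placeholder example, not from KorQuAD
context = "KoBigBird was created by Jangwon Park and Donggyu Kim."
enc = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# decode the highest-scoring answer span
start = out.start_logits.argmax()
end = out.end_logits.argmax()
print(tokenizer.decode(enc["input_ids"][0][start:end + 1]))
```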

2. Long Sequence (>=1024)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Long Sequence Dataset] ์ฐธ๊ณ 

|                     | TyDi QA (em/f1) | Korquad 2.1 (em/f1) | Fake News (f1) | Modu Sentiment (f1-macro) |
| ------------------- | --------------- | ------------------- | -------------- | ------------------------- |
| KLUE-RoBERTa-Base   | 76.80 / 78.58   | 55.44 / 73.02       | 95.20          | 42.61                     |
| KoBigBird-BERT-Base | 79.13 / 81.30   | 67.77 / 82.03       | 98.85          | 45.42                     |
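For document-level tasks like Fake News detection, a classification head can be attached to the pretrained encoder and fed sequences far longer than 512 tokens. A minimal sketch assuming a binary label set; the actual training loops, data, and hyperparameters live in the linked finetuning docs.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "monologg/kobigbird-bert-base", num_labels=2  # assumed binary task
)
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")

enc = tokenizer("a very long news article ...", return_tensors="pt",
                max_length=4096, truncation=True, padding="max_length")
outputs = model(**enc, labels=torch.tensor([1]))
print(outputs.loss, outputs.logits.shape)
```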

Docs

Citation

If you use KoBigBird, please cite it as follows:

```bibtex
@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}
```

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird was built with Cloud TPU support from the TensorFlow Research Cloud (TFRC) program.

We also thank Seyun Ahn for the wonderful logo.

