๐Ÿฆ… Pretrained BigBird Model for Korean (up to 4096 tokens)

Overview

Pretrained BigBird Model for Korean

What is BigBird โ€ข How to Use โ€ข Pretraining โ€ข Evaluation Result โ€ข Docs โ€ข Citation

ํ•œ๊ตญ์–ด | English

Apache 2.0 Issues linter DOI

What is BigBird?

BigBird: Transformers for Longer Sequences์—์„œ ์†Œ๊ฐœ๋œ sparse-attention ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋กœ, ์ผ๋ฐ˜์ ์ธ BERT๋ณด๋‹ค ๋” ๊ธด sequence๋ฅผ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿฆ… Longer Sequence - ์ตœ๋Œ€ 512๊ฐœ์˜ token์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š” BERT์˜ 8๋ฐฐ์ธ ์ตœ๋Œ€ 4096๊ฐœ์˜ token์„ ๋‹ค๋ฃธ

โฑ๏ธ Computational Efficiency - Full attention์ด ์•„๋‹Œ Sparse Attention์„ ์ด์šฉํ•˜์—ฌ O(n2)์—์„œ O(n)์œผ๋กœ ๊ฐœ์„ 

How to Use

  • ๐Ÿค— Huggingface Hub์— ์—…๋กœ๋“œ๋œ ๋ชจ๋ธ์„ ๊ณง๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:)
  • ์ผ๋ถ€ ์ด์Šˆ๊ฐ€ ํ•ด๊ฒฐ๋œ transformers>=4.11.0 ์‚ฌ์šฉ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. (MRC ์ด์Šˆ ๊ด€๋ จ PR)
  • BigBirdTokenizer ๋Œ€์‹ ์— BertTokenizer ๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. (AutoTokenizer ์‚ฌ์šฉ์‹œ BertTokenizer๊ฐ€ ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.)
  • ์ž์„ธํ•œ ์‚ฌ์šฉ๋ฒ•์€ BigBird Tranformers documentation์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer

Pretraining

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Pretraining BigBird] ์ฐธ๊ณ 

Hardware Max len LR Batch Train Step Warmup Step
KoBigBird-BERT-Base TPU v3-8 4096 1e-4 32 2M 20k
  • ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต
  • ITC (Internal Transformer Construction) ๋ชจ๋ธ๋กœ ํ•™์Šต (ITC vs ETC)

Evaluation Result

1. Short Sequence (<=512)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Short Sequence Dataset] ์ฐธ๊ณ 

NSMC
(acc)
KLUE-NLI
(acc)
KLUE-STS
(pearsonr)
Korquad 1.0
(em/f1)
KLUE MRC
(em/rouge-w)
KoELECTRA-Base-v3 91.13 86.87 93.14 85.66 / 93.94 59.54 / 65.64
KLUE-RoBERTa-Base 91.16 86.30 92.91 85.35 / 94.53 69.56 / 74.64
KoBigBird-BERT-Base 91.18 87.17 92.61 87.08 / 94.71 70.33 / 75.34

2. Long Sequence (>=1024)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Long Sequence Dataset] ์ฐธ๊ณ 

TyDi QA
(em/f1)
Korquad 2.1
(em/f1)
Fake News
(f1)
Modu Sentiment
(f1-macro)
KLUE-RoBERTa-Base 76.80 / 78.58 55.44 / 73.02 95.20 42.61
KoBigBird-BERT-Base 79.13 / 81.30 67.77 / 82.03 98.85 45.42

Docs

Citation

KoBigBird๋ฅผ ์‚ฌ์šฉํ•˜์‹ ๋‹ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird๋Š” Tensorflow Research Cloud (TFRC) ํ”„๋กœ๊ทธ๋žจ์˜ Cloud TPU ์ง€์›์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ ๋ฉ‹์ง„ ๋กœ๊ณ ๋ฅผ ์ œ๊ณตํ•ด์ฃผ์‹  Seyun Ahn๋‹˜๊ป˜ ๊ฐ์‚ฌ๋ฅผ ์ „ํ•ฉ๋‹ˆ๋‹ค.

You might also like...
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Generating Korean Slogans with phonetic and structural repetition
Generating Korean Slogans with phonetic and structural repetition

LexPOS_ko Generating Korean Slogans with phonetic and structural repetition Generating Slogans with Linguistic Features LexPOS is a sequence-to-sequen

Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ
Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ

korean extractive summarization 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ Leaderboard Notice Text Summarization with Pretrained Encoders์— ๋‚˜์˜ค๋Š” bertsumext๋ชจ๋ธ(ext

Training code for Korean multi-class sentiment analysis

KoSentimentAnalysis Bert implementation for the Korean multi-class sentiment analysis ์™œ ํ•œ๊ตญ์–ด ๊ฐ์ • ๋‹ค์ค‘๋ถ„๋ฅ˜ ๋ชจ๋ธ์€ ๊ฑฐ์˜ ์—†๋Š” ๊ฒƒ์ผ๊นŒ?์—์„œ ์‹œ์ž‘๋œ ํ”„๋กœ์ ํŠธ Environment: Pytorch, Da

Korean Sentence Embedding Repository

Korean-Sentence-Embedding ๐Ÿญ Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TenforFlow/Keras.

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet ๐Ÿฆ ๐Ÿ‡ฎ๐Ÿ‡ฉ 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

Crie tokens de autenticaรงรฃo รญntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) รฉ uma bilioteca criada para ser utilizada na geraรงรฃo de tokens seguros e รญntegros, ou seja, nรฃ

Comments
  • Pretraining Epoch ์งˆ๋ฌธ

    Pretraining Epoch ์งˆ๋ฌธ

    Checklist

    • [x] I've searched the project's issues

    โ“ Question

    ์•ˆ๋…•ํ•˜์„ธ์š” ์ €๋Š” ํ˜„์žฌ ์นœ๊ตฌ๋“ค๊ณผ ํ•จ๊ป˜ 4096 ํ† ํฐ์„ ์ž…๋ ฅ๋ฐ›์•„ ์š”์•ฝ ํƒœ์Šคํฌ๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒ˜์Œ์—” ๋น…๋ฒ„๋“œ + ๋ฒ„ํŠธ ์กฐํ•ฉ์œผ๋กœ ํ•ด๋ณด๋ ค๊ณ  ํ–ˆ๋Š”๋ฐ, ์ด๋ฏธ monologg ๋‹˜๊ป˜์„œ ๋งŒ๋“ค์–ด์ฃผ์…จ๋”๋ผ๊ตฌ์š” ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ๋กฑํฌ๋จธ + ๋ฐ”ํŠธ + ํŽ˜๊ฐ€์ˆ˜์Šค ์กฐํ•ฉ์œผ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. pretrained๋œ KoBart๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์–ดํ…์…˜์„ ๋กฑํฌ๋จธ๋กœ ๋ฐ”๊พผ ํ›„, ํŽ˜๊ฐ€์ˆ˜์Šค task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ตฌ์กฐ๋กœ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

    ํ˜„์žฌ 13GB์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์•„์„œ ์ „์ฒ˜๋ฆฌ์™€ ๋ฐ์ดํ„ฐ๋กœ๋” ์ž‘์„ฑ, ๋ชจ๋ธ ์ฝ”๋“œ๊นŒ์ง€๋Š” ์™„๋ฃŒํ•œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค. ์ด๋ฒˆ ์ฃผ ๋‚ด๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

    ์ €ํฌ๊ฐ€ ๊ฐ€์ง„ GPU๋กœ๋Š” ๋Œ€๋žต ์ดํ‹€์ด๋ฉด 1 ์—ํฌํฌ๋ฅผ ๋Œ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์€๋ฐ, monologg๋‹˜๊ป˜์„œ๋Š” KoBirBird ๋ชจ๋ธ ๊ฐœ๋ฐœ ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋„์…จ๋Š”์ง€ ์—ฌ์ญค๋ณด๊ณ  ์‹ถ์Šต๋‹ˆ๋‹ค.

    ์•„๋ฌด๋ž˜๋„ pretrained ๋œ ๋ชจ๋ธ์„ ๊ฐ€์ ธ๋‹ค ์“ฐ๋‹ค๋ณด๋‹ˆ ์—ํฌํฌ๋ฅผ ๋งŽ์ด ๋Œ ํ•„์š”๋Š” ์—†์„ ๊ฒƒ ๊ฐ™์€๋ฐ, ๊ธฐ์ค€์ ์œผ๋กœ ์‚ผ๊ณ  ์‹ถ์–ด์„œ์š”!

    ๋ง์ด ๊ธธ์–ด์กŒ๋Š”๋ฐ ์š”์•ฝํ•˜์ž๋ฉด, KoBirBird ํ•™์Šต ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฃผ์…จ๋Š”์ง€ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ทธ ๊ธฐ์ค€์€ ๋ฌด์—‡์œผ๋กœ ์‚ผ์œผ์…จ๋Š”์ง€๋„ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค.

    question 
    opened by KimJaehee0725 2
  • Specific information about this model.

    Specific information about this model.

    Checklist

    • [ x ] I've searched the project's issues

    โ“ Question

    • You mentioned "๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต" and I want to know the size of total corpus for pre-training.

    • Also I want to know the vocab size of this model.

    ๐Ÿ“Ž Additional context

    question 
    opened by midannii 2
  • Fix some minors

    Fix some minors

    Description

    ์ฝ”๋“œ์™€ ์ฃผ์„ ๋“ฑ์„ ์ฝ๋‹ค๊ฐ€ ๋ณด์ธ ์ž‘์€ ์˜คํƒ€ ๋“ฑ์„ ์ˆ˜์ •ํ–ˆ์Šต๋‹ˆ๋‹ค

    ๋‹ค์–‘ํ•œ ๋…ธํ•˜์šฐ๋ฅผ ์•„๋‚Œ์—†์ด ๊ณต์œ ํ•ด์ฃผ์‹  @monologg , @donggyukimc ์—๊ฒŒ ๊ฐ์‚ฌ์˜ ๋ง์”€๋“œ๋ฆฝ๋‹ˆ๋‹ค.

    ์ดํ›„์—๋Š” GPU ํ™˜๊ฒฝ์—์„œ finetuning์„ ํ…Œ์ŠคํŠธํ•ด ๋ณผ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค ๊ณ ๋ง™์Šต๋‹ˆ๋‹ค.

    Related Issue

    chore 
    opened by sackoh 0
Releases(v1.0.0)
Transformer - A TensorFlow Implementation of the Transformer: Attention Is All You Need

[UPDATED] A TensorFlow Implementation of Attention Is All You Need When I opened this repository in 2017, there was no official code yet. I tried to i

Kyubyong Park 3.8k Dec 26, 2022
This is a really simple text-to-speech app made with python and tkinter.

Tkinter Text-to-Speech App by Souvik Roy This is a really simple tkinter app which converts the text you have entered into a speech. It is created wit

Souvik Roy 1 Dec 21, 2021
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022
Google's Meena transformer chatbot implementation

Here's my attempt at recreating Meena, a state of the art chatbot developed by Google Research and described in the paper Towards a Human-like Open-Domain Chatbot.

Francesco Pham 94 Dec 25, 2022
Azure Text-to-speech service for Home Assistant

Azure Text-to-speech service for Home Assistant The Azure text-to-speech platform uses online Azure Text-to-Speech cognitive service to read a text wi

Yassine Selmi 2 Aug 06, 2022
๐Ÿ– Easy training and deployment of seq2seq models.

Headliner Headliner is a sequence modeling library that eases the training and in particular, the deployment of custom sequence models for both resear

Axel Springer Ideas Engineering GmbH 231 Nov 18, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".

LXMERT: Learning Cross-Modality Encoder Representations from Transformers Our servers break again :(. I have updated the links so that they should wor

Hao Tan 838 Dec 19, 2022
pyupbit ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ upbit์—์„œ ๋น„ํŠธ์ฝ”์ธ์„ ์ž๋™๋งค๋งคํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ์กฐ์ฝ”๋”ฉ ์œ ํŠœ๋ธŒ ์ฑ„๋„์—์„œ ์ž์„ธํ•œ ๊ฐ•์˜ ์˜์ƒ์„ ๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ๋น„ํŠธ์ฝ”์ธ ํˆฌ์ž ์ž๋™ํ™” ๊ฐ•์˜ ์ฝ”๋“œ by ์œ ํŠœ๋ธŒ ์กฐ์ฝ”๋”ฉ ์ฑ„๋„ pyupbit ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ upbit ๊ฑฐ๋ž˜์†Œ์—์„œ ๋น„ํŠธ์ฝ”์ธ ์ž๋™๋งค๋งค๋ฅผ ํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ํŒŒ์ผ ๊ตฌ์„ฑ test.py : ์ž”๊ณ  ์กฐํšŒ (1๊ฐ•) backtest.py : ๋ฐฑํ…Œ์ŠคํŒ… ์ฝ”๋“œ (2๊ฐ•) bestK.p

์กฐ์ฝ”๋”ฉ JoCoding 186 Dec 29, 2022
Final Project for the Intel AI Readiness Boot Camp NLP (Jan)

NLP Boot Camp (Jan) Synopsis Full Name: Prameya Mohanty Name of your School: Delhi Public School, Rourkela Class: VIII Title of the Project: iTransect

TheCodingHub 1 Feb 01, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT)

CIRPLANT This repository contains the code and pre-trained models for Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT) For d

Zheyuan (David) Liu 29 Nov 17, 2022
Code release for "COTR: Correspondence Transformer for Matching Across Images"

COTR: Correspondence Transformer for Matching Across Images This repository contains the inference code for COTR. We plan to release the training code

UBC Computer Vision Group 358 Dec 24, 2022
AI_Assistant - This is a Python based Voice Assistant.

This is a Python based Voice Assistant. This was programmed to increase my understanding of python and also how the in-general Voice Assistants work.

1 Jan 06, 2022
Official code for "Parser-Free Virtual Try-on via Distilling Appearance Flows", CVPR 2021

Parser-Free Virtual Try-on via Distilling Appearance Flows, CVPR 2021 Official code for CVPR 2021 paper 'Parser-Free Virtual Try-on via Distilling App

395 Jan 03, 2023
CCF BDCI 2020 ๆˆฟไบง่กŒไธš่Šๅคฉ้—ฎ็ญ”ๅŒน้…่ต›้“ Aๆฆœ47/2985

CCF BDCI 2020 ๆˆฟไบง่กŒไธš่Šๅคฉ้—ฎ็ญ”ๅŒน้… Aๆฆœ47/2985 ่ต›้ข˜ๆ่ฟฐ่ฏฆ่ง๏ผšhttps://www.datafountain.cn/competitions/474 ๆ–‡ไปถ่ฏดๆ˜Ž data: ๅญ˜ๆ”พ่ฎญ็ปƒๆ•ฐๆฎๅ’Œๆต‹่ฏ•ๆ•ฐๆฎไปฅๅŠ้ข„ๅค„็†ไปฃ็  model_bert.py: ็ฝ‘็ปœๆจกๅž‹็ป“ๆž„ๅฎšไน‰ adv_train

shuo 40 Sep 28, 2022
NL. The natural language programming language.

NL A Natural-Language programming language. Built using Codex. A few examples are inside the nl_projects directory. How it works Write any code in pur

2 Jan 17, 2022
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
An Explainable Leaderboard for NLP

ExplainaBoard: An Explainable Leaderboard for NLP Introduction | Website | Download | Backend | Paper | Video | Bib Introduction ExplainaBoard is an i

NeuLab 319 Dec 20, 2022
HAIS_2GNN: 3D Visual Grounding with Graph and Attention

HAIS_2GNN: 3D Visual Grounding with Graph and Attention This repository is for the HAIS_2GNN research project. Tao Gu, Yue Chen Introduction The motiv

Yue Chen 1 Nov 26, 2022