IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference).

Last update: Nov 30, 2022

IndoBERTweet 🐦 🇮🇩

1. Paper

Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Dominican Republic (virtual).

2. About

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.

In this paper, we show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.
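The average-pooling initialization can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the toy vocabulary, the example word "mager", and the helper names `wordpiece_tokenize` and `init_new_embedding` are all assumptions made for illustration. The idea is that a new domain-specific word is split into subwords already known to the base model, and its embedding starts as the mean of those subword vectors.

```python
from statistics import fmean

# Toy stand-in for the base model's subword embedding table (hypothetical values).
base_embeddings = {
    "ma": [1.0, 0.0, 2.0],
    "##ger": [3.0, 2.0, 0.0],
    "[UNK]": [0.0, 0.0, 0.0],
}


def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece split; returns ["[UNK]"] if no split exists."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def init_new_embedding(word, embeddings):
    """Initialize a new word's vector as the mean of its subword vectors."""
    pieces = wordpiece_tokenize(word, embeddings)
    vectors = [embeddings[p] for p in pieces]
    # zip(*vectors) groups values by dimension, so each mean is per-dimension.
    return [fmean(dim) for dim in zip(*vectors)]


# "mager" is split into "ma" + "##ger", and the two vectors are averaged.
new_vector = init_new_embedding("mager", base_embeddings)
print(new_vector)
```

In a real setting the table would be the embedding matrix of the base Indonesian BERT model and the split would come from its own tokenizer; the sketch only shows the pooling step.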

3. Pretraining Data

We crawl Indonesian tweets over a 1-year period, from December 2019 to December 2020, using the official Twitter API with 60 keywords covering 4 main topics: economy, health, education, and government. We obtain a total of 409M word tokens, twice the size of the training data used to pretrain IndoBERT. Due to Twitter policy, this pretraining data will not be released to the public.

4. How to use

Load model and tokenizer (tested with transformers==3.5.1)

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")

Preprocessing Steps:

Lower-case all words.
Convert user mentions and URLs into @USER and HTTPURL, respectively.
Translate emoticons into text using the emoji package.
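The preprocessing steps above can be sketched as follows. This is a minimal sketch, not the authors' script: the function name `preprocess_tweet` and both regular expressions are assumptions, and the exact patterns used for the released model may differ. The emoji step relies on `emoji.demojize` from the emoji package and is skipped if that package is not installed.

```python
import re

try:
    import emoji  # third-party package: pip install emoji
except ImportError:  # keep the sketch usable without it
    emoji = None

# Assumed patterns; the original preprocessing may match mentions/URLs differently.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def preprocess_tweet(text):
    """Apply the three preprocessing steps to a raw tweet."""
    # 1. Lower-case first, so the placeholders inserted below keep their casing.
    text = text.lower()
    # 2. Replace URLs, then user mentions, with fixed placeholders.
    text = URL_PATTERN.sub("HTTPURL", text)
    text = MENTION_PATTERN.sub("@USER", text)
    # 3. Translate emoji into their textual names, if the package is available.
    if emoji is not None:
        text = emoji.demojize(text)
    return text


print(preprocess_tweet("Halo @Budi cek https://t.co/abc SEKARANG"))
```

The returned string can then be passed to the tokenizer loaded above.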

5. Results over 7 Indonesian Twitter Datasets

Models	Sentiment (IndoLEM)	Sentiment (SmSA)	Emotion (EmoT)	Hate Speech (HS1)	Hate Speech (HS2)	NER (Formal)	NER (Informal)	Average
mBERT	76.6	84.7	67.5	85.1	75.1	85.2	83.2	79.6
malayBERT	82.0	84.1	74.2	85.0	81.9	81.9	81.3	81.5
IndoBERT (Wilie, et al., 2020)	84.1	88.7	73.3	86.8	80.4	86.3	84.3	83.4
IndoBERT (Koto, et al., 2020)	84.1	87.9	71.0	86.4	79.3	88.0	86.9	83.4
IndoBERTweet (1M steps from scratch)	86.2	90.4	76.0	88.8	87.5	88.1	85.4	86.1
IndoBERT + Voc adaptation + 200k steps	86.6	92.7	79.0	88.4	84.0	87.7	86.9	86.5
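The Average column is consistent with an unweighted mean over the seven dataset scores in each row. The short check below recomputes it for the last two rows; the dictionary keys are shortened labels chosen for this sketch, and the scores are copied from the table.

```python
# Per-dataset scores copied from the table:
# IndoLEM, SmSA, EmoT, HS1, HS2, NER Formal, NER Informal.
scores = {
    "from_scratch_1M": [86.2, 90.4, 76.0, 88.8, 87.5, 88.1, 85.4],
    "voc_adaptation_200k": [86.6, 92.7, 79.0, 88.4, 84.0, 87.7, 86.9],
}

# Unweighted mean per model, rounded to one decimal as in the table.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}
print(averages)
```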
