AfriBERTa: Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

This repository contains the code for the paper "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages", which appeared at the First Workshop on Multilingual Representation Learning at EMNLP 2021.

AfriBERTa was pretrained on 11 languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi, which count as two of the 11), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá. It was evaluated on named entity recognition (NER) and text classification spanning 10 languages, some of which it was not pretrained on. It outperforms mBERT and XLM-R on several languages and is competitive with them overall.

Pretrained Models and Dataset

Models:

We release the following pretrained models:

AfriBERTa Small (97M params)
AfriBERTa Base (111M params)
AfriBERTa Large (126M params)
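
As a rough illustration of how one of these checkpoints might be loaded, here is a minimal sketch using the Hugging Face transformers library. The model identifiers are an assumption: this README does not state them, and the pattern below simply combines the castorini organisation (which hosts the dataset linked in this README) with a guessed `afriberta_<size>` naming scheme. The helper names `hub_id` and `load_masked_lm` are hypothetical and not part of this repository.

```python
# Assumption: checkpoints live on the Hugging Face Hub under "castorini"
# and follow an "afriberta_<size>" naming scheme. Verify before relying on it.
HUB_ORG = "castorini"

# Parameter counts (in millions) as listed in this README.
PARAMS_M = {"small": 97, "base": 111, "large": 126}


def hub_id(size: str) -> str:
    """Build the assumed Hub identifier for a given model size."""
    if size not in PARAMS_M:
        raise ValueError(f"unknown size {size!r}; expected one of {sorted(PARAMS_M)}")
    return f"{HUB_ORG}/afriberta_{size}"


def load_masked_lm(size: str = "small"):
    """Download the tokenizer and masked-LM weights.

    Requires the `transformers` package and network access, so the import
    is deferred until the function is actually called.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    name = hub_id(size)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    return tokenizer, model
```

Calling `load_masked_lm("base")` would then return a tokenizer and model pair, provided the assumed identifier exists on the Hub.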

Dataset:

https://huggingface.co/datasets/castorini/afriberta-corpus

Reproducing Experiments

Datasets and Tokenizer

Below are details on how to obtain the datasets and the trained SentencePiece tokenizer:

Language Modelling: The data for language modelling can be downloaded from this URL

NER: To obtain the NER dataset, please download it from this repository

Text Classification: To obtain the topic classification dataset, please download it from this repository

Tokenizer: The trained SentencePiece tokenizer can be downloaded from this URL

Training

To train AfriBERTa and evaluate it on both downstream tasks, install the requirements listed in requirements.txt, download the relevant datasets, and run the following script:

bash run_all.sh

This script will:

Train the multilingual language model from scratch and save the model as well as relevant logs
Evaluate the trained language model on NER for all ten languages over 5 seeds
Evaluate the trained language model on text classification for both languages over 5 seeds
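
To make the size of the evaluation concrete, the sketch below enumerates the (language, seed) runs implied by the steps above and shows one common way to summarize per-seed scores as a mean and standard deviation. It is an illustration only, not the logic of run_all.sh: the language labels are placeholders, the seed numbering 1 to 5 is an assumption, and the helper names `run_grid` and `summarize` are hypothetical.

```python
from itertools import product
from statistics import mean, stdev

NUM_SEEDS = 5  # the README states 5 seeds per language


def run_grid(languages, num_seeds=NUM_SEEDS):
    """Return every (language, seed) pair; seeds are numbered 1..num_seeds (assumed)."""
    return list(product(languages, range(1, num_seeds + 1)))


def summarize(scores_by_language):
    """Map each language to (mean, sample standard deviation) of its per-seed scores."""
    return {lang: (mean(s), stdev(s)) for lang, s in scores_by_language.items()}


# Placeholder labels: the README gives only the counts (ten and two), not the names.
ner_languages = [f"ner_lang_{i}" for i in range(1, 11)]
cls_languages = ["cls_lang_1", "cls_lang_2"]

print(len(run_grid(ner_languages)))  # 50 NER fine-tuning runs
print(len(run_grid(cls_languages)))  # 10 text classification runs
```

Passing a dictionary of per-seed scores to `summarize`, for example the F1 values collected from each run's logs, yields one (mean, standard deviation) pair per language.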

Citation

@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi  and
      Zhu, Yuxin  and
      Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}
