AfriBERTa: Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

Overview

AfriBERTa: Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

This repository contains the code for the paper Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages which appears in the first workshop on Multilingual Representation Learning at EMNLP 2021.

AfriBERTa was trained on 11 languages - Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá. AfriBERTa was evaluated on NER and text classification spanning 10 languages (some of which it was not pretrained on). It outperformed mBERT and XLM-R on several languages and is very competitive overall.

Pretrained models

We release the following pretrained models:

Reproducing Experiments

Datasets and Tokenizer

Below are details on how to obtain the datasets and trained sentencepiece tokenizer:

Language Modelling: The data for language modelling can be downloaded from this URL

NER: To obtain the NER dataset, please download it from this repository

Text Classification: To obtain the topic classification dataset, please download it from this repository

Tokenizer: The trained sentencepiece tokenizer can be downloaded from this URL

Training

To train AfriBERTa and evaluate on both downstream tasks, simply install all requirements in requirements.txt, download the relevant datasets and run the following script:

bash run_all.sh

This script will:

  1. Train the multilingual language model from scratch and save the model as well as relevant logs
  2. Evaluate the trained language model on NER for all ten languages over 5 seeds
  3. Evaluate the trained language model on text classification for all two languages over 5 seeds

Citation

@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi  and
      Zhu, Yuxin  and
      Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}
Owner
Kelechi
nlp research
Kelechi
A Python library for Deep Graph Networks

PyDGN Wiki Description This is a Python library to easily experiment with Deep Graph Networks (DGNs). It provides automatic management of data splitti

Federico Errica 194 Dec 22, 2022
Optimized Gillespie algorithm for simulating Stochastic sPAtial models of Cancer Evolution (OG-SPACE)

OG-SPACE Introduction Optimized Gillespie algorithm for simulating Stochastic sPAtial models of Cancer Evolution (OG-SPACE) is a computational framewo

Data and Computational Biology Group UNIMIB (was BI*oinformatics MI*lan B*icocca) 0 Nov 17, 2021
Neural Network Libraries

Neural Network Libraries Neural Network Libraries is a deep learning framework that is intended to be used for research, development and production. W

Sony 2.6k Dec 30, 2022
Implementation of CVPR'2022:Reconstructing Surfaces for Sparse Point Clouds with On-Surface Priors

Reconstructing Surfaces for Sparse Point Clouds with On-Surface Priors (CVPR 2022) Personal Web Pages | Paper | Project Page This repository contains

151 Dec 26, 2022
Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging

Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging This repository contains an implementation

Computational Photography Lab @ SFU 1.1k Jan 02, 2023
Training deep models using anime, illustration images.

animeface deep models for anime images. Datasets anime-face-dataset Anime faces collected from Getchu.com. Based on Mckinsey666's dataset. 63.6K image

Tomoya Sawada 61 Dec 25, 2022
Defending graph neural networks against adversarial attacks (NeurIPS 2020)

GNNGuard: Defending Graph Neural Networks against Adversarial Attacks Authors: Xiang Zhang ( Zitnik Lab @ Harvard 44 Dec 07, 2022

Meaningful titles for tabs and PDF downloads! Also supports tab search.

arxiv-utils If you are a researcher that reads a lot on ArXiv, you'll benefit a lot from this web extension. Renames the title of PDF page to the pape

Johnson 174 Dec 20, 2022
PN-Net a neural field-based framework for depth estimation from single-view RGB images.

PN-Net We present a neural field-based framework for depth estimation from single-view RGB images. Rather than representing a 2D depth map as a single

1 Oct 02, 2021
QICK: Quantum Instrumentation Control Kit

QICK: Quantum Instrumentation Control Kit The QICK is a kit of firmware and software to use the Xilinx RFSoC to control quantum systems. It consists o

81 Dec 15, 2022
Code for paper "Which Training Methods for GANs do actually Converge? (ICML 2018)"

GAN stability This repository contains the experiments in the supplementary material for the paper Which Training Methods for GANs do actually Converg

Lars Mescheder 885 Jan 01, 2023
HackBMU-5.0-Team-Ctrl-Alt-Elite - HackBMU 5.0 Team Ctrl Alt Elite

HackBMU-5.0-Team-Ctrl-Alt-Elite The search is over. We present to you ‘Health-A-

3 Feb 19, 2022
f-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation

f-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation [Paper] [PyTorch] [MXNet] [Video] This repository provides code for training

Visual Understanding Lab @ Samsung AI Center Moscow 516 Dec 21, 2022
Disentangled Cycle Consistency for Highly-realistic Virtual Try-On, CVPR 2021

Disentangled Cycle Consistency for Highly-realistic Virtual Try-On, CVPR 2021 [WIP] The code for CVPR 2021 paper 'Disentangled Cycle Consistency for H

ChongjianGE 94 Dec 11, 2022
Diabet Feature Engineering - Predict whether people have diabetes when their characteristics are specified

Diabet Feature Engineering - Predict whether people have diabetes when their characteristics are specified

Şebnem 6 Jan 18, 2022
Harmonious Textual Layout Generation over Natural Images via Deep Aesthetics Learning

Harmonious Textual Layout Generation over Natural Images via Deep Aesthetics Learning Code for the paper Harmonious Textual Layout Generation over Nat

7 Aug 09, 2022
A Python library that provides a simplified alternative to DBAPI 2

A Python library that provides a simplified alternative to DBAPI 2. It provides a facade in front of DBAPI 2 drivers.

Tony Locke 44 Nov 17, 2021
Clustering is a popular approach to detect patterns in unlabeled data

Visual Clustering Clustering is a popular approach to detect patterns in unlabeled data. Existing clustering methods typically treat samples in a data

Tarek Naous 24 Nov 11, 2022
Self-Supervised Multi-Frame Monocular Scene Flow (CVPR 2021)

Self-Supervised Multi-Frame Monocular Scene Flow 3D visualization of estimated depth and scene flow (overlayed with input image) from temporally conse

Visual Inference Lab @TU Darmstadt 85 Dec 22, 2022
ML-Decoder: Scalable and Versatile Classification Head

ML-Decoder: Scalable and Versatile Classification Head Paper Official PyTorch Implementation Tal Ridnik, Gilad Sharir, Avi Ben-Cohen, Emanuel Ben-Baru

189 Jan 04, 2023