RoBERTa Marathi language model trained from scratch during the Hugging Face 🤗 x Flax community week

Overview

RoBERTa base model for Marathi Language (मराठी भाषा)

A model pretrained on the Marathi language using a masked language modeling (MLM) objective. RoBERTa was introduced in this paper and first released in this repository. We trained the RoBERTa model for Marathi during the community week hosted by Hugging Face 🤗 using JAX/Flax for NLP & CV.

Model description

Marathi RoBERTa is a transformers model pretrained on a large corpus of Marathi data in a self-supervised fashion.

Intended uses & limitations ❗️

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. We fine-tuned this model on the iNLTK and IndicNLP news text classification tasks. Since the Marathi mc4 dataset is made by scraping text from Marathi newspapers, it carries some biases, which will also affect all fine-tuned versions of this model.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-base-mr')
>>> unmasker("मोठी बातमी! उद्या दुपारी <mask> वाजता जाहीर होणार दहावीचा निकाल")
[{'score': 0.057209037244319916,
  'sequence': 'मोठी बातमी! उद्या दुपारी आठ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 2226,
  'token_str': 'आठ'},
 {'score': 0.02796074189245701,
  'sequence': 'मोठी बातमी! उद्या दुपारी २० वाजता जाहीर होणार दहावीचा निकाल',
  'token': 987,
  'token_str': '२०'},
 {'score': 0.017235398292541504,
  'sequence': 'मोठी बातमी! उद्या दुपारी नऊ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 4080,
  'token_str': 'नऊ'},
 {'score': 0.01691395975649357,
  'sequence': 'मोठी बातमी! उद्या दुपारी २१ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 1944,
  'token_str': '२१'},
 {'score': 0.016252165660262108,
  'sequence': 'मोठी बातमी! उद्या दुपारी  ३ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 549,
  'token_str': ' ३'}]

Training data 🏋🏻‍♂️

The RoBERTa Marathi model was pretrained on the mr subset of the multilingual C4 dataset:

C4 (Colossal Clean Crawled Corpus), introduced by Raffel et al. in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

The dataset can be downloaded in a pre-processed form from allennlp or from Hugging Face's datasets (mc4 dataset). The Marathi (mr) dataset consists of 14 billion tokens and 7.8 million docs, amounting to ~70 GB of text.

Data Cleaning 🧹

Although the initial mc4 Marathi corpus is ~70 GB in size, data exploration showed that it contains docs in other languages, especially Thai, Chinese, etc. So we had to clean the dataset before training the tokenizer and the model. Surprisingly, these are the results after cleaning the Marathi mc4 corpus:

Train set:

Clean docs count: 1581396 out of 7774331.
~20.34% of the whole Marathi train split is actually Marathi.

Validation set:

Clean docs count: 1700 out of 7928.
~19.90% of the whole Marathi validation split is actually Marathi.
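
The cleaning method itself is not described here. One way such a filter could work is a script-based heuristic, sketched below under loud assumptions: the helper names (`devanagari_ratio`, `is_probably_marathi`) and the 0.8 threshold are hypothetical and are not taken from the authors' pipeline. It only removes text written in other scripts (Thai, Chinese, Latin); it cannot tell Marathi apart from other Devanagari-script languages such as Hindi.

```python
def devanagari_ratio(text: str) -> float:
    """Share of letter-like characters that fall in the Devanagari Unicode block."""
    def in_devanagari(ch: str) -> bool:
        return "\u0900" <= ch <= "\u097F"

    # Count letters plus Devanagari signs (vowel marks are not "alpha" in Unicode).
    counted = [ch for ch in text if ch.isalpha() or in_devanagari(ch)]
    if not counted:
        return 0.0
    return sum(in_devanagari(ch) for ch in counted) / len(counted)


def is_probably_marathi(text: str, threshold: float = 0.8) -> bool:
    # The threshold is an illustrative assumption, not a value from the model card.
    return devanagari_ratio(text) >= threshold


# Toy corpus: one Devanagari doc, one Thai doc, one English doc.
docs = [
    "मोठी बातमी! उद्या दुपारी निकाल जाहीर होणार",
    "สวัสดีครับ",
    "This page is written in English.",
]
clean_docs = [doc for doc in docs if is_probably_marathi(doc)]
clean_fraction = len(clean_docs) / len(docs)
```

A real pipeline would likely combine a check like this with a language-identification model, since script alone is a coarse signal.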

Training procedure 👨🏻‍💻

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of the model are pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked with <s> and the end of one by </s>. The details of the masking procedure for each sentence are the following:

  • 15% of the tokens are masked.
  • In 80% of the cases, the masked tokens are replaced by <mask>.
  • In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
  • In the remaining 10% of the cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
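
The masking rules listed above can be sketched in plain Python as follows. This is a minimal illustration, not the training code used for this model: the function name `mask_tokens`, the ignore label `-100`, and the example token ids (including `mask_id=4`) are assumptions, and each token is selected independently with probability 0.15 rather than masking exactly 15% of a sequence.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_prob=0.15, rng=None):
    """Return (inputs, labels) after applying the 80/10/10 MLM corruption rule."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions that are not predicted
    for i, tok in enumerate(token_ids):
        # Skip special tokens and the ~85% of positions that are not selected.
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with <mask>
        elif roll < 0.9:
            # 10%: replace with a random token that differs from the original.
            new_tok = rng.randrange(vocab_size - 1)
            inputs[i] = new_tok + 1 if new_tok >= tok else new_tok
        # remaining 10%: leave the token unchanged
    return inputs, labels


# Hypothetical ids: 0 stands for <s>, 2 for </s>, 4 for <mask>.
example_ids = [0, 312, 87, 4051, 19, 2]
masked, labels = mask_tokens(
    example_ids, mask_id=4, vocab_size=50265, special_ids={0, 2}, rng=random.Random(0)
)
```

Calling `mask_tokens` again on the same sequence draws a fresh mask, which is what dynamic masking amounts to: the corruption is re-sampled every time a sequence is seen instead of being fixed once during preprocessing.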

Pretraining

The model was trained on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1000 GB of hard drive, 96 CPU cores, 8 v3 TPU cores) for 42K steps with a batch size of 128 and a sequence length of 128. The optimizer used is Adam with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-8, a weight decay of 0.01, learning rate warmup for 1,000 steps and linear decay of the learning rate after.
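
The warmup-then-decay schedule described above can be written out as a small function. This sketch assumes that warmup starts from zero and that the linear decay reaches zero exactly at the final step (42,000); neither detail is stated in the text, and the constant and function names are illustrative.

```python
PEAK_LR = 3e-4          # peak learning rate from the text
WARMUP_STEPS = 1_000    # linear warmup length from the text
TOTAL_STEPS = 42_000    # assumption: decay ends at the last training step


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed endpoints)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Sample the schedule at a few points to inspect its shape.
schedule_preview = {step: learning_rate(step) for step in (0, 500, 1_000, 21_500, 42_000)}
```

Under these assumptions the rate climbs to 3e-4 over the first 1,000 steps and is back to half of that at step 21,500, midway through the decay phase.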

We tracked experiments and hyperparameter tuning on the Weights & Biases platform. Here is the link to the main dashboard:
Link to Weights and Biases Dashboard for Marathi RoBERTa model

Pretraining Results 📊

The RoBERTa model reached an eval accuracy of 85.28% at around 35K steps, with a train loss of 0.6507 and an eval loss of 0.6219.

Fine Tuning on downstream tasks

We performed fine-tuning on downstream tasks. We used the following datasets for classification:

  1. IndicNLP Marathi news classification
  2. iNLTK Marathi news headline classification

Fine tuning on downstream task results (Segregated)

1. IndicNLP Marathi news classification

The IndicNLP Marathi news dataset consists of 3 classes - ['lifestyle', 'entertainment', 'sports'] - with the following number of docs per split:

| train | eval | test |
|-------|------|------|
| 9672  | 477  | 478  |

💯 Our Marathi RoBERTa model, **roberta-base-mr**, outperformed both classifiers mentioned in Arora, G. (2020). iNLTK and Kunchukuttan, Anoop et al. AI4Bharat-IndicNLP.

| Dataset | FT-W | FT-WC | INLP | iNLTK | roberta-base-mr 🏆 |
|---------|------|-------|------|-------|--------------------|
| iNLTK Headlines | 83.06 | 81.65 | 89.92 | 92.4 | 97.48 |

🤗 Hugging Face Model Hub repo:
roberta-base-mr fine-tuned on the iNLTK Headlines classification dataset:

flax-community/mr-indicnlp-classifier

🧪 Fine-tuning experiment's Weights & Biases dashboard link

2. iNLTK Marathi news headline classification

This dataset consists of 3 classes - ['state', 'entertainment', 'sports'] - with the following number of docs per split:

| train | eval | test |
|-------|------|------|
| 9658  | 1210 | 1210 |

💯 Here as well, roberta-base-mr outperformed the iNLTK Marathi news text classifier.

| Dataset | iNLTK ULMFiT | roberta-base-mr 🏆 |
|---------|--------------|--------------------|
| iNLTK news dataset (kaggle) | 92.4 | 94.21 |

🤗 Hugging Face Model Hub repo:
roberta-base-mr fine-tuned on the iNLTK news classification dataset:

flax-community/mr-inltk-classifier

Fine-tuning experiment's Weights & Biases dashboard link

Want to check how the above models generalise on real-world Marathi data?

Head to 🤗 Hugging Face Spaces 🪐 to play with all three models:

  1. Masked language modelling with the pretrained Marathi RoBERTa model:
    flax-community/roberta-base-mr
  2. Marathi Headline classifier:
    flax-community/mr-indicnlp-classifier
  3. Marathi news classifier:
    flax-community/mr-inltk-classifier
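
Outside of Spaces, the two classifiers could presumably be called through the same `pipeline` API used for masked language modeling. The sketch below is an assumption-laden illustration: the helper names `build_classifier` and `classify` are hypothetical, the label names each checkpoint returns are not documented here, and it assumes the checkpoints load with the framework installed locally.

```python
def build_classifier(model_id: str = "flax-community/mr-indicnlp-classifier"):
    """Create a text-classification pipeline for one of the fine-tuned checkpoints."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify(classifier, texts):
    """Run a pipeline-like callable and pair each text with its top label and score."""
    texts = list(texts)
    results = classifier(texts)  # one {'label': ..., 'score': ...} dict per input
    return [(text, res["label"], res["score"]) for text, res in zip(texts, results)]
```

With network access, `classify(build_classifier(), ["..."])` would download the checkpoint on first use; passing `"flax-community/mr-inltk-classifier"` to `build_classifier` selects the news classifier instead.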

[Image: Streamlit app of the pretrained Marathi RoBERTa model on Hugging Face Spaces]

Team Members

Credits

Huge thanks to the Hugging Face 🤗 & Google JAX/Flax teams for such a wonderful community week, and especially for providing such massive computing resources. Big thanks to @patil-suraj & @patrickvonplaten for mentoring during the whole week.

Owner
Nipun Sadvilkar