RoBERTa Marathi language model trained from scratch during the Hugging Face 🤗 x Flax community week

Overview

RoBERTa base model for Marathi language (मराठी भाषा)

A model pretrained on the Marathi language using a masked language modeling (MLM) objective. RoBERTa was introduced in this paper and first released in this repository. We trained this RoBERTa model for Marathi during the JAX/Flax for NLP & CV community week hosted by Hugging Face 🤗.


Model description

Marathi RoBERTa is a transformers model pretrained on a large corpus of Marathi data in a self-supervised fashion.

Intended uses & limitations ❗️

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. We used this model to fine-tune on the iNLTK and IndicNLP news text classification tasks. Since the Marathi mC4 dataset was created by scraping Marathi newspaper text, it carries biases which will also affect all fine-tuned versions of this model.

How to use ❓

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-base-mr')
>>> unmasker("मोठी बातमी! उद्या दुपारी <mask> वाजता जाहीर होणार दहावीचा निकाल")
[{'score': 0.057209037244319916,
  'sequence': 'मोठी बातमी! उद्या दुपारी आठ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 2226,
  'token_str': 'आठ'},
 {'score': 0.02796074189245701,
  'sequence': 'मोठी बातमी! उद्या दुपारी २० वाजता जाहीर होणार दहावीचा निकाल',
  'token': 987,
  'token_str': '२०'},
 {'score': 0.017235398292541504,
  'sequence': 'मोठी बातमी! उद्या दुपारी नऊ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 4080,
  'token_str': 'नऊ'},
 {'score': 0.01691395975649357,
  'sequence': 'मोठी बातमी! उद्या दुपारी २१ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 1944,
  'token_str': '२१'},
 {'score': 0.016252165660262108,
  'sequence': 'मोठी बातमी! उद्या दुपारी  ३ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 549,
  'token_str': ' ३'}]
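
You can also load the tokenizer and model directly and score the <mask> candidates yourself. The following is a minimal PyTorch sketch and not part of the original card: the top-k decoding is our own illustration, and if the hub repo only carries Flax weights you may need from_flax=True.

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-base-mr')
>>> model = AutoModelForMaskedLM.from_pretrained('flax-community/roberta-base-mr')  # add from_flax=True if only Flax weights are published
>>> text = "मोठी बातमी! उद्या दुपारी <mask> वाजता जाहीर होणार दहावीचा निकाल"
>>> inputs = tokenizer(text, return_tensors='pt')
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # position of the <mask> token in the input
>>> mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
>>> top_ids = logits[0, mask_pos].topk(5).indices[0]
>>> [tokenizer.decode([int(i)]) for i in top_ids]  # top-5 candidate fillers for <mask>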

Training data 🏋🏻‍♂️

The RoBERTa Marathi model was pretrained on the mr subset of the multilingual C4 (mC4) dataset:

C4 (Colossal Clean Crawled Corpus), introduced by Raffel et al. in "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer".

The dataset can be downloaded in a pre-processed form from AllenNLP or from Hugging Face's datasets library as the mC4 dataset. The Marathi (mr) split consists of ~14 billion tokens and 7.8 million documents, weighing ~70 GB of text.
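
If you want to pull the same data yourself, the mr config of mC4 can be streamed with the datasets library. A minimal sketch (the "mc4" dataset id is the identifier that was current at the time; recent datasets versions expose the same corpus under "allenai/c4"):

>>> from datasets import load_dataset
>>> # streaming avoids downloading the full ~70 GB split up front
>>> mr_mc4 = load_dataset("mc4", "mr", split="train", streaming=True)
>>> next(iter(mr_mc4))["text"][:200]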

Data Cleaning 🧹

Although the initial mC4 Marathi corpus is ~70 GB, data exploration showed that it contains documents in other languages, especially Thai, Chinese, etc. So we had to clean the dataset before training the tokenizer and the model (an illustrative filtering sketch follows the counts below). The results after cleaning the Marathi mC4 corpus were surprising:

Train set:

Clean docs count: 1,581,396 out of 7,774,331.
Only ~20.34% of the whole Marathi train split is actually Marathi.

Validation set:

Clean docs count: 1,700 out of 7,928.
Only ~19.90% of the whole Marathi validation split is actually Marathi.
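
The card does not spell out the exact cleaning heuristic. Purely as an illustration, a filter of the kind that separates Devanagari documents from Thai or Chinese ones can be sketched as below; the 0.5 ratio threshold and the function name are our own assumptions, not the actual script used:

>>> def is_probably_marathi(text, threshold=0.5):
...     # fraction of characters that fall in the Devanagari block (U+0900-U+097F)
...     devanagari = sum('\u0900' <= ch <= '\u097f' for ch in text)
...     return len(text) > 0 and devanagari / len(text) >= threshold
>>> is_probably_marathi("मोठी बातमी! उद्या दुपारी आठ वाजता")
True
>>> # e.g. clean_train = mc4_mr_train.filter(lambda doc: is_probably_marathi(doc['text']))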

Training procedure 👨🏻‍💻

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with <s> and the end of one with </s>. The details of the masking procedure for each sentence are the following (a short collator sketch follows the list):

  • 15% of the tokens are masked.
  • In 80% of the cases, the masked tokens are replaced by <mask>.
  • In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
  • In the 10% remaining cases, the masked tokens are left as is. Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
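
For reference, this 15% / 80-10-10 dynamic masking scheme is what the transformers data collator for masked language modeling applies each time a batch is built, so the masks change from epoch to epoch. A minimal sketch (the PyTorch collator shown here stands in for the Flax equivalent used during the community week):

>>> from transformers import AutoTokenizer, DataCollatorForLanguageModeling
>>> tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-base-mr')
>>> collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
>>> batch = collator([tokenizer("मोठी बातमी! उद्या दुपारी आठ वाजता")])
>>> batch['labels']  # -100 everywhere except the ~15% of positions that were masked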

Pretraining

The model was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1,000 GB of disk, 96 CPU cores), i.e. 8 TPU v3 cores, for 42K steps with a batch size of 128 and a sequence length of 128. The optimizer used is Adam with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-8, a weight decay of 0.01, learning rate warmup for 1,000 steps and linear decay of the learning rate afterwards.
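
Since training ran on JAX/Flax, an optimizer matching these hyperparameters can be sketched with optax; this is a reconstruction from the numbers above, not the exact training script:

>>> import optax
>>> warmup = optax.linear_schedule(init_value=0.0, end_value=3e-4, transition_steps=1_000)
>>> decay = optax.linear_schedule(init_value=3e-4, end_value=0.0, transition_steps=42_000 - 1_000)
>>> schedule = optax.join_schedules([warmup, decay], boundaries=[1_000])
>>> optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)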

We tracked experiments and hyperparameter tuning on the Weights & Biases platform. Here is the link to the main dashboard:
Link to the Weights & Biases dashboard for the Marathi RoBERTa model

Pretraining Results 📊

The RoBERTa model reached an eval accuracy of 85.28% at around ~35K steps, with a train loss of 0.6507 and an eval loss of 0.6219.

Fine-tuning on downstream tasks

We performed fine-tuning on downstream tasks. We used the following datasets for classification (a minimal fine-tuning sketch follows the list):

  1. IndicNLP Marathi news classification
  2. iNLTK Marathi news headline classification
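
As an illustration of such a fine-tuning setup, the sketch below uses the standard transformers sequence-classification head and Trainer; train_ds / eval_ds are placeholders for tokenized versions of the datasets above, and the hyperparameters are not the ones from the original experiments:

>>> from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
...                           Trainer, TrainingArguments)
>>> tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-base-mr')
>>> model = AutoModelForSequenceClassification.from_pretrained('flax-community/roberta-base-mr', num_labels=3)
>>> args = TrainingArguments(output_dir='mr-news-classifier', num_train_epochs=3,
...                          per_device_train_batch_size=16, evaluation_strategy='epoch')
>>> # train_ds / eval_ds: datasets with tokenized 'input_ids', 'attention_mask' and integer 'label' columns
>>> trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
>>> trainer.train()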

Fine-tuning results on downstream tasks (segregated)

1. IndicNLP Marathi news classification

The IndicNLP Marathi news dataset consists of 3 classes - ['lifestyle', 'entertainment', 'sports'] - with the following distribution of documents across splits:

train   eval   test
9672    477    478

💯 Our Marathi RoBERTa roberta-base-mr model outperformed both classifiers mentioned in Arora, G. (2020) (iNLTK) and Kunchukuttan, Anoop et al. (AI4Bharat-IndicNLP).

Dataset           FT-W    FT-WC   INLP    iNLTK   roberta-base-mr 🏆
iNLTK Headlines   83.06   81.65   89.92   92.4    97.48

🤗 Hugging Face Model Hub repo:
roberta-base-mr fine-tuned on the iNLTK Headlines classification dataset:

flax-community/mr-indicnlp-classifier

🧪 Fine-tuning experiment's Weights & Biases dashboard link

2. iNLTK Marathi news headline classification

This dataset consists of 3 classes - ['state', 'entertainment', 'sports'] - with the following distribution of documents across splits:

train   eval   test
9658    1210   1210

💯 Here as well, roberta-base-mr outperformed the iNLTK Marathi news text classifier.

Dataset                       iNLTK ULMFiT   roberta-base-mr 🏆
iNLTK news dataset (kaggle)   92.4           94.21

🤗 Hugging Face Model Hub repo:
roberta-base-mr fine-tuned on the iNLTK news classification dataset:

flax-community/mr-inltk-classifier

Fine-tuning experiment's Weights & Biases dashboard link

Want to check how the above models generalise on real-world Marathi data?

Head to 🤗 Hugging Face Spaces 🪐 to play with all three models:

  1. Masked language modeling with the pretrained Marathi RoBERTa model:
    flax-community/roberta-base-mr
  2. Marathi Headline classifier:
    flax-community/mr-indicnlp-classifier
  3. Marathi news classifier:
    flax-community/mr-inltk-classifier

Streamlit app of the pretrained Marathi RoBERTa model on Hugging Face Spaces

Team Members

Credits

Huge thanks to Hugging Face 🤗 & the Google JAX/Flax team for such a wonderful community week, and especially for providing such massive computing resources. Big thanks to @patil-suraj & @patrickvonplaten for mentoring throughout the week.

Owner
Nipun Sadvilkar