Pre-training BERT Masked Language Models (MLM)

This repository contains the code for pre-training a BERT model with a custom vocabulary. It was used to pre-train JuriBERT, presented in [https://arxiv.org/abs/2110.01485].

It also contains the code for the classification task that was used to evaluate JuriBERT.

Our models can be found at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and downloaded upon request.

Instructions

To pre-train a new BERT model, pass the path to a dataset containing raw text with --files. The paths for saving the model (--model_path) and the checkpoints (--checkpoint) are required. You can also specify an existing tokenizer for the model.

python pretrain.py \
      --files /path/to/text \
      --model_path /path/to/save/model \
      --checkpoint /path/to/save/checkpoints \
      --epochs 30 \
      --hidden_layers 2 \
      --hidden_size 128 \
      --attention_heads 2 \
      --save_steps 10 \
      --save_limit 0 \
      --min_freq 0
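For readers new to masked language modelling, the sketch below illustrates the standard BERT masking recipe that this kind of pre-training relies on: about 15% of the tokens are selected, and of those 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. It is an illustration only, not the code in pretrain.py, which may delegate masking to a library. The function name mask_tokens, the whitespace split standing in for the tokenizer, and the sample sentence are all assumptions made for the example.

```python
import random

MASK_TOKEN = "<mask>"  # same mask token as in the fill-mask example of this README


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the standard BERT masking recipe.

    labels[i] holds the original token at selected positions and None elsewhere.
    Hypothetical helper for illustration; not part of this repository.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, nothing to predict.
            masked.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_TOKEN)         # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(token)              # 10%: keep the original token
    return masked, labels


# Whitespace splitting stands in for the real tokenizer (assumption).
tokens = "le tribunal a rejeté la demande du requérant".split()
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

During pre-training the model only receives a loss at the positions where the label is not None, which is why most positions carry no label.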

To fine-tune on a classification task, you need the path to the pre-trained model and a CSV file containing the classification dataset. Specify the columns containing the category and the text, as well as the paths for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints
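As a rough picture of the CSV layout the command above expects, the following sketch reads a file with one text column and one category column and maps each category name to an integer id. It is not the loading code in classification.py. The helper name load_classification_rows, the sample rows, and the category names are hypothetical, and the column names simply reuse the placeholders from the command.

```python
import csv
import io


def load_classification_rows(csv_file, text_column, category_column):
    """Read texts, integer label ids and the label mapping from an open CSV file.

    Hypothetical helper for illustration; not part of this repository.
    """
    reader = csv.DictReader(csv_file)
    texts, categories = [], []
    for row in reader:
        texts.append(row[text_column])
        categories.append(row[category_column])
    # Map each distinct category name to a stable integer id.
    label_to_id = {name: i for i, name in enumerate(sorted(set(categories)))}
    label_ids = [label_to_id[name] for name in categories]
    return texts, label_ids, label_to_id


# Tiny in-memory stand-in for /path/to/data.csv (made-up rows).
sample = io.StringIO(
    "text-column,category-column\n"
    '"Pourvoi en cassation rejeté.",civil\n'
    '"Contrat de travail rompu.",social\n'
    '"Bail commercial résilié.",civil\n'
)
texts, label_ids, label_to_id = load_classification_rows(
    sample, text_column="text-column", category_column="category-column"
)
print(label_to_id)  # {'civil': 0, 'social': 1}
print(label_ids)    # [0, 1, 0]
```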

You can use --help with either script to see all the available options.

To test the masked language model use:

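from transformers import pipeline

# `tokenizer` must already hold the tokenizer the model was pre-trained with.
fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

# Returns the candidate fillers for <mask>, each with a score.
fill_mask("Paris est la capitale de la <mask>.")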
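For a single mask, the fill-mask pipeline returns a list of dictionaries with the keys score, token, token_str, and sequence. The sketch below shows one way to pull out the top candidates from such a list. The helper top_predictions is hypothetical, and the example results use made-up scores and token ids purely to show the shape of the output; they are not predictions from JuriBERT.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output.

    Hypothetical helper for illustration; not part of this repository.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 3)) for r in ranked[:k]]


# Placeholder results with made-up values, shaped like the pipeline's output.
example_results = [
    {
        "score": 0.10,
        "token": 12,
        "token_str": " ville",
        "sequence": "Paris est la capitale de la ville.",
    },
    {
        "score": 0.85,
        "token": 7,
        "token_str": " France",
        "sequence": "Paris est la capitale de la France.",
    },
]
print(top_predictions(example_results, k=1))  # [('France', 0.85)]
```

With a real model, the same helper can be applied directly to the list returned by the fill_mask call above.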

Owner

Stella Douka
