Pre-training BERT masked language models with custom vocabulary

Overview

Pre-training BERT Masked Language Models (MLM)

This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to pre-train JuriBERT presented in [https://arxiv.org/abs/2110.01485].

It also contains the code of the classification task that was used to evaluate JuriBERT.

Our models can be found at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and downloaded upon request.

Instructions

To pre-train a new BERT model you need the path to a dataset containing raw text. You can also specify an existing tokenizer for the model. Paths for saving the model and the checkpoints are required.

python pretrain.py \
      --files /path/to/text \
      --model_path /path/to/save/model \
      --checkpoint /path/to/save/checkpoints \
      --epochs 30 \
      --hidden_layers 2 \
      --hidden_size 128 \
      --attention_heads 2 \
      --save_steps 10 \
      --save_limit 0 \
      --min_freq 0

To finetune on a classification task you need the path to the pre-trained model and a CSV file containing the classification dataset. You need to specify the columns containing the category and the text as well as the path for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints 

You can use --help to see all the available commands.

To test the masked language model use:

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

fill_mask("Paris est la capitale de la <mask>.")
A simple Streamlit App to classify swahili news into different categories.

Swahili News Classifier Streamlit App A simple app to classify swahili news into different categories. Installation Install all streamlit requirements

Davis David 4 May 01, 2022
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
๐Ÿค— Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

English | ็ฎ€ไฝ“ไธญๆ–‡ | ็น้ซ”ไธญๆ–‡ State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow ๐Ÿค— Transformers provides thousands of pretrained mo

Hugging Face 77.2k Jan 03, 2023
Neural network sequence labeling model

Sequence labeler This is a neural network sequence labeling system. Given a sequence of tokens, it will learn to assign labels to each token. Can be u

Marek Rei 250 Nov 03, 2022
auto_code_complete is a auto word-completetion program which allows you to customize it on your need

auto_code_complete v1.3 purpose and usage auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the m

RUO 2 Feb 22, 2022
A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

Rule-Based-Classification-in-a-Banking-Case. A CRM department in a local bank works on classify their lost customers with their past datas. So they wa

ร–MER YILDIZ 4 Mar 20, 2022
Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization

Line as a Visual Sentence with LineTR This repository contains the inference code, pretrained model, and demo scripts of the following paper. It suppo

SungHo Yoon 158 Dec 27, 2022
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022
Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

42 Dec 31, 2022
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge This is an implementation of the paper,

Mutian He 19 Oct 14, 2022
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021
๋ฌธ์žฅ๋‹จ์œ„๋กœ ๋ถ„์ ˆ๋œ ๋‚˜๋ฌด์œ„ํ‚ค ๋ฐ์ดํ„ฐ์…‹. Releases์—์„œ ๋‹ค์šด๋กœ๋“œ ๋ฐ›๊ฑฐ๋‚˜, tfds-korean์„ ํ†ตํ•ด ๋‹ค์šด๋กœ๋“œ ๋ฐ›์œผ์„ธ์š”.

Namuwiki corpus ๋ฌธ์žฅ๋‹จ์œ„๋กœ ๋ฏธ๋ฆฌ ๋ถ„์ ˆ๋œ ๋‚˜๋ฌด์œ„ํ‚ค ์ฝ”ํผ์Šค. ๋ชฉ์ ์ด LM๋“ฑ์—์„œ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์ด๋ผ, ๋งํฌ/์ด๋ฏธ์ง€/ํ…Œ์ด๋ธ” ๋“ฑ๋“ฑ์ด ์ž˜๋ ค์žˆ์Šต๋‹ˆ๋‹ค. ๋ฌธ์žฅ ๋‹จ์œ„ ๋ถ„์ ˆ์€ kss๋ฅผ ํ™œ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋ผ์ด์„ ์Šค๋Š” ๋‚˜๋ฌด์œ„ํ‚ค์— ๋ช…์‹œ๋œ ๋ฐ”์™€ ๊ฐ™์ด CC BY-NC-SA 2.0

Jeong Ukjae 16 Apr 02, 2022
Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

sentello Sentello is a python script that simulates the anti-evasion and anti-analysis techniques used by malware. For techniques that are difficult t

Malwation 62 Oct 02, 2022
FedNLP: A Benchmarking Framework for Federated Learning in Natural Language Processing

FedNLP is a research-oriented benchmarking framework for advancing federated learning (FL) in natural language processing (NLP). It uses FedML repository as the git submodule. In other words, FedNLP

FedML-AI 216 Nov 27, 2022
chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

abhishek thakur 33 Dec 18, 2022
Pipeline for chemical image-to-text competition

BMS-Molecular-Translation Introduction This is a pipeline for Bristol-Myers Squibb โ€“ Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got

Maksim Zhdanov 7 Sep 20, 2022
This repository structures data in title, summary, tags, sentiment given a fragment of a conversation

Understand-conversation-AI This repository structures data in title, summary, tags, sentiment given a fragment of a conversation How to install: pip i

Juan Camilo Lรณpez Montes 1 Jan 11, 2022
Training code of Spatial Time Memory Network. Semi-supervised video object segmentation.

Training-code-of-STM This repository fully reproduces Space-Time Memory Networks Performance on Davis17 val set&Weights backbone training stage traini

haochen wang 128 Dec 11, 2022