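An ultra-lightweight PyTorch implementation of BERT, with extensive Chinese comments and an easy-to-modify structure; continuously updated.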

Last update: Dec 18, 2022

Overview

bert4pytorch

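Update, August 27, 2021:

Thanks for the stars, everyone. Some of you have recently reported a few small bugs, and I have noticed them too. Unfortunately, work has been too busy this month for timely updates. Around mid-September I plan to release a batch of updates: a version that is fully usable after a single pip install, with some additional key comments, plus adversarial training and a complete fine-tuning example.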

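Background

The most popular PyTorch BERT framework today is undoubtedly the Transformers project from the Hugging Face team. As the project has grown, however, it has become heavy. For beginners, and even for people with some NLP background, it is hard to follow the code logic and gain a deep understanding of BERT.

Modifying the low-level code of Transformers is also quite difficult, which makes it hard to customize the model architecture.

This project condenses the whole BERT architecture into a few files (mainly adapted from the open-source Transformers project), removes a large amount of non-essential code, and adds some features such as EMA and a warmup schedule. The core parts carry extensive Chinese comments that aim to answer questions readers may run into while using the code.

The core of the project is just three files: modeling, tokenization, and optimization, each only a few hundred lines long. Together with the extensive Chinese comments, this makes it quick to get a thorough understanding of BERT.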

Features

Implemented so far

Loading pretrained BERT and RoBERTa-wwm-ext weights for fine-tuning
An optimizer with warmup
Exponential moving average (EMA) of model weights
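
To make the EMA item concrete, here is a minimal, framework-agnostic sketch of the update rule. It is not bert4pytorch's implementation: the `EMA` class, its `update` method, and the `decay` value are hypothetical names chosen for illustration, and plain floats stand in for model tensors.

```python
class EMA:
    """Exponential moving average over a dict of named scalar weights.

    Hypothetical helper for illustration; not the bert4pytorch API.
    """

    def __init__(self, weights, decay=0.999):
        self.decay = decay
        # The shadow copy starts as a snapshot of the current weights.
        self.shadow = dict(weights)

    def update(self, weights):
        # shadow <- decay * shadow + (1 - decay) * current
        for name, value in weights.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * value
            )


weights = {"w": 1.0}
ema = EMA(weights, decay=0.9)

weights["w"] = 2.0  # pretend an optimizer step changed the weight
ema.update(weights)

print(ema.shadow["w"])  # averaged value, between the old and new weight
```

In a real training loop the same update would typically run after every optimizer step, and the shadow weights would be swapped in for evaluation.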

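Planned

Network architectures such as ALBERT, GPT, and XLNet
Adversarial training, conditional Layer Norm, and similar features (the ideas come from Su Jianlin's open-source bert4keras project, which in fact inspired bert4pytorch)
More examples and Chinese comments to lower the learning curve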

Installation

pip install bert4pytorch==0.1.2

Usage

Loading a pretrained model

from bert4pytorch.modeling import BertModel, BertConfig
from bert4pytorch.tokenization import BertTokenizer
from bert4pytorch.optimization import AdamW, get_linear_schedule_with_warmup
import torch

# Directory holding config.json, vocab.txt and the pretrained weights
model_path = "/model/pytorch_bert_pretrain_model"
config = BertConfig(model_path + "/config.json")

tokenizer = BertTokenizer(model_path + "/vocab.txt")
model = BertModel.from_pretrained(model_path, config)

# Encode one sentence ("I'm very happy today") into token ids and segment ids
input_ids, token_type_ids = tokenizer.encode("今天很开心")

# Add a batch dimension: shape (1, seq_len)
input_ids = torch.tensor([input_ids])
token_type_ids = torch.tensor([token_type_ids])

# Switch to evaluation mode (disables dropout)
model.eval()

with torch.no_grad():  # no gradients are needed for inference
    outputs = model(input_ids, token_type_ids, output_all_encoded_layers=True)

# other code
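
For readers new to BERT inputs, the following toy sketch shows what a single-sentence encoding usually looks like: the text is wrapped in `[CLS]` and `[SEP]`, and every position gets segment id 0. This is an assumption about the general BERT convention, not a copy of bert4pytorch's tokenizer; `toy_vocab` and `toy_encode` are hypothetical, and the ids are made up rather than taken from any real vocab.txt.

```python
# Toy vocabulary; ids are invented for illustration only.
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "今": 4, "天": 5, "很": 6, "开": 7, "心": 8,
}


def toy_encode(text, vocab):
    """Encode one sentence as [CLS] + characters + [SEP].

    Chinese BERT vocabularies are largely character-level, so this sketch
    simply splits the text into characters.
    """
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    # Unknown characters fall back to the [UNK] id.
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # A single sentence belongs entirely to segment 0.
    token_type_ids = [0] * len(input_ids)
    return input_ids, token_type_ids


input_ids, token_type_ids = toy_encode("今天很开心", toy_vocab)
print(input_ids)       # [2, 4, 5, 6, 7, 8, 3]
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0]
```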

Optimizer with warmup

# Exclude biases and LayerNorm parameters from weight decay
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer
                if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer
                if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5, correct_bias=False)

# train_batches, num_epoches and warmup_proportion must be defined
# by your own training script
num_training_steps = train_batches * num_epoches
num_warmup_steps = int(num_training_steps * warmup_proportion)  # step counts must be integers
schedule = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)
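
The shape of a linear warmup schedule can be sketched in plain Python. This assumes the common form (a linear ramp from 0 to the base learning rate, then a linear decay to 0), which is how the Transformers scheduler of the same name behaves; whether bert4pytorch matches it exactly is an assumption. The function name `linear_warmup_decay` is hypothetical.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier at a given step, between 0.0 and 1.0."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 (end of warmup) to 0 (last step).
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
for step in (0, 50, 100, 550, 1000):
    # The effective learning rate is the base rate times the multiplier.
    print(step, base_lr * linear_warmup_decay(step, 100, 1000))
```

With 100 warmup steps out of 1000, the multiplier is 0.5 at step 50, peaks at 1.0 at step 100, and is back to 0.5 at step 550.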

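Miscellaneous

I originally put this project together just for my own convenience. Lately I have been reading Su Jianlin's blog a lot. The posts are remarkably incisive, and what impresses me even more is that he regularly comes up with quite good tricks entirely on his own. All I can say is: hats off.

That is why this project's name mirrors bert4keras, as a thank-you to Su for his generous sharing.

Later I gradually came around to the idea of open-sourcing the small results of my own learning. More examples will be added over time; early on I will borrow examples from Su's bert4keras and implement them in PyTorch. If you have questions, feel free to start a discussion, and if this project is useful to you, please give it a star!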

Owner

muqiu
