A Transformer implementation that is easy to understand and customize.

Last update: Jan 20, 2022

Overview

Simple Transformer

I've written a series of articles on the transformer architecture and language models on Medium.

This repository contains an implementation of the Transformer architecture presented in the paper Attention Is All You Need by Ashish Vaswani et al.

My goal is to write an implementation that is easy to understand and that digs into the nitty-gritty details, which is where the devil is.

Python environment

You can use any Python virtual environment tool, such as venv or conda.

For example, with venv:

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install -e .

Spacy Tokenizer Data Preparation

To use spaCy's tokenizer, make sure to download the language models you need.

For example, the English and German tokenizers can be downloaded as follows:

python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm

Text Data from Torchtext

This project uses text datasets from Torchtext.

from torchtext import datasets

The default configuration uses the Multi30k dataset.

Training

python train.py config_path

The default config path is config/config.yaml.

It is possible to resume training from a checkpoint.

python train.py --checkpoint_path runs/20220108-164720-Multi30k-Transformer/checkpoint-010-2.3343.pt
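
The two invocations above imply an optional config path and an optional checkpoint flag. A minimal sketch of how such a command line could be parsed with argparse is shown below. The function name build_parser is hypothetical, and the actual train.py may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two train.py invocations in this README."""
    parser = argparse.ArgumentParser(description="Train a Transformer model.")
    # Optional positional argument; falls back to the documented default path.
    parser.add_argument("config_path", nargs="?", default="config/config.yaml")
    # Optional checkpoint to resume from; None means training starts from scratch.
    parser.add_argument("--checkpoint_path", default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try out.
args = build_parser().parse_args(
    ["--checkpoint_path", "runs/20220108-164720-Multi30k-Transformer/checkpoint-010-2.3343.pt"]
)
print(args.config_path)
print(args.checkpoint_path)
```

Making the config path optional keeps the plain "python train.py" call working with the default configuration.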

You can run TensorBoard to monitor the training progress.

tensorboard --logdir=runs

The logs are written to the runs folder.
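
When resuming training, it can help to pick the most recent checkpoint in a run folder automatically. The sketch below assumes that checkpoint files are named checkpoint-<epoch>-<loss>.pt, which is only inferred from the example path in this README and is not confirmed by the project. The names parse_checkpoint and latest_checkpoint are hypothetical helpers.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

# Assumed pattern: checkpoint-<epoch>-<loss>.pt (inferred from the example path only).
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)-(\d+(?:\.\d+)?)\.pt$")


class CheckpointInfo(NamedTuple):
    path: Path
    epoch: int
    loss: float


def parse_checkpoint(path: Path) -> Optional[CheckpointInfo]:
    """Return the epoch and loss parsed from the file name, or None if it does not match."""
    match = CHECKPOINT_RE.match(path.name)
    if match is None:
        return None
    return CheckpointInfo(path, int(match.group(1)), float(match.group(2)))


def latest_checkpoint(run_dir: Path) -> Optional[CheckpointInfo]:
    """Pick the checkpoint with the highest epoch number in a run folder."""
    parsed = (parse_checkpoint(p) for p in run_dir.glob("checkpoint-*.pt"))
    infos = [info for info in parsed if info is not None]
    return max(infos, key=lambda info: info.epoch, default=None)
```

The path returned by latest_checkpoint can then be passed to train.py through --checkpoint_path.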

Test

python test.py checkpoint_path

For example:

python test.py runs/20220108-164720-Multi30k-Transformer/checkpoint-010-2.3343.pt

When training starts, config.yaml is copied to the model folder. test.py expects to find that YAML config file there.
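
As an illustration of that requirement, the sketch below looks for the copied config next to a checkpoint and fails early if it is missing. It assumes that the model folder is the folder containing the checkpoint and that the copy keeps the name config.yaml. The function name find_config is hypothetical, and the real test.py may locate the file differently.

```python
from pathlib import Path


def find_config(checkpoint_path: str) -> Path:
    """Return the YAML config stored in the same folder as the checkpoint.

    Assumes the copied file keeps the name config.yaml; adjust if the project differs.
    """
    config_path = Path(checkpoint_path).parent / "config.yaml"
    if not config_path.is_file():
        # Fail early with a clear message instead of a later, less obvious error.
        raise FileNotFoundError(f"Expected a config file at {config_path}")
    return config_path
```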

Unit tests

There are some unit tests in the tests folder.

pytest tests

Owner

Naoki Shibuya
