A fast and easy implementation of Transformer with PyTorch.

Overview

FasySeq

FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which can be trained efficiently and modified easily. This toolkit is based on Transformer(Vaswani et al.), and will add more seq2seq models in the future.

Dependency

PyTorch >= 1.4
NLTK

Result

...

Structure

...

To Be Updated

  • top-k and top-p sampling
  • multi-GPU inference
  • length penalty in beam search
  • ...

Preprocess

Build Vocabulary

createVocab.py

NamedArguments Description
-f/--file The files used to build the vocabulary.
Type: List
--vocab_num The maximum size of vocabulary, the excess word will be discard according to the frequency.
Type: Int Default: -1
--min_freq The minimum frequency of token in vocabulary. The word with frequency less than min_freq will be discard.
Type: Int Default: 0
--lower Whether to convert all words to lowercase
--save_path The path to save voacbulary.
Type: str

Process Data

preprocess.py

NamedArguments Description
--source The path of source file.
Type: str
[--target] The path of target file.
Type: str
--src_vocab The path of source vocabulary.
Type: str
[--tgt_vocab] The path of target vocabulary.
Type: str
--save_path The path to save the processed data.
Type: str

Train

train.py

NamedArguments Description
Model -
--share_embed Source and target share the same vocabulary and word embedding. The max position of embedding is max(max_src_position, max_tgt_position) if the model employ share embedding.
--max_src_position The maximum source position, all src-tgt pairs which source sentences' lenght are greater than max_src_position will be cut or discard. If max_src_position > max source length, it wil be set to max source length.
Type: Int Default: inf
--max_tgt_position The maximum target position, all src_tgt pairs which target sentences' length are greater than max_tgt_position will be cut or discard. If max_tgt_position > max target length, it wil be set to max target length.
Type: Int Default: inf
--position_method The method to introduce positional information.
Option: encoding/embedding
--normalize_before Leveraging before layer normalization. See Xiong et al.
Checkpoint -
--checkpoint_path The path to save checkpoint file.
Type: str Default: None
--restore_file The checkpoint file to be loaded.
Type: str Default: None
--checkpoint_num Save the nearest checkpoint_num breakpoint.
Type: Int Default: inf
Data -
--vocab Vocabulary path. If you use share embedding, the vocabulary will be loaded from this path.
Type: str Default: None
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--file The training data file.
Type: str
--max_tokens The maximum tokens in each batch.
Type: Int Default: 1000
--discard_invalid_data The data which length of source or data is more than maximum position will be discard if use this option, otherwise the long sentences will be cut into max position.
Train -
--cuda_num The device's ID of GPU.
Type: List
--grad_accumulate The num of gradient accumulate.
Type: Int Default: 1
--epoch The total epoch to train.
Type: Int Default: inf
--batch_print_info The number of batch to print training information.
Type: Int Default: 1000

Inference

generator.py

NamedArguments Description
--cuda_num The device's ID of GPU.
Type: List
--file The inference data file which has been processed.
Type: str
--raw_file The raw inference data file, and will be preprocessed before generated.
Type: str
--ref_file The reference file.
Type: str
--max_length
--max_alpha
--max_add_token
Maximum generated length = min(max_length, max_alpha * max_src_len, max_add_token + max_src_token)
Type: Int Default: inf
--max_tokens The maximum tokens in each batch.
Type: Int Default: 1000
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--vocab Vocabulary path. If you use share embedding, the vocabulary will be loaded from this path.
Type: str Default: None
--model_path The path of pre-trained model.
Type: str
--output_path The path of output. the result will be saved into output_path/result.txt.
Type: str
--decode_method The decode method.
Option:greedy/beam
--beam Beam size.
Type: Int Default: 5

Postpreposs

avg_param.py

The average parameter code we employed is the same as fairseq.

License

FasySeq(-py) is Apache-2.0 License. The license applies to the pre-trained models as well.

You might also like...
Fast, general, and tested differentiable structured prediction in PyTorch
Fast, general, and tested differentiable structured prediction in PyTorch

Torch-Struct: Structured Prediction Library A library of tested, GPU implementations of core structured prediction algorithms for deep learning applic

A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

Transformer Embedder A Word Level Transformer layer based on PyTorch and 🤗 Transformers. How to use Install the library from PyPI: pip install transf

Reformer, the efficient Transformer, in Pytorch
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

An implementation of WaveNet with fast generation

pytorch-wavenet This is an implementation of the WaveNet architecture, as described in the original paper. Features Automatic creation of a dataset (t

Google's Meena transformer chatbot implementation
Google's Meena transformer chatbot implementation

Here's my attempt at recreating Meena, a state of the art chatbot developed by Google Research and described in the paper Towards a Human-like Open-Domain Chatbot.

Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.
Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.

LibreTranslate Try it online! | API Docs | Community Forum Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it d

An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text This repo aims at providing an easy to use and efficient code for extracting image &

xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.
xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Description xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building bl

Owner
宁羽
宁羽
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
HiFi DeepVariant + WhatsHap workflowHiFi DeepVariant + WhatsHap workflow

HiFi DeepVariant + WhatsHap workflow Workflow steps align HiFi reads to reference with pbmm2 call small variants with DeepVariant, using two-pass meth

William Rowell 2 May 14, 2022
DAGAN - Dual Attention GANs for Semantic Image Synthesis

Contents Semantic Image Synthesis with DAGAN Installation Dataset Preparation Generating Images Using Pretrained Model Train and Test New Models Evalu

Hao Tang 104 Oct 08, 2022
Share constant definitions between programming languages and make your constants constant again

Introduction Reconstant lets you share constant and enum definitions between programming languages. Constants are defined in a yaml file and converted

Natan Yellin 47 Sep 10, 2022
Python package for Turkish Language.

PyTurkce Python package for Turkish Language. Documentation: https://pyturkce.readthedocs.io. Installation pip install pyturkce Usage from pyturkce im

Mert Cobanov 14 Oct 09, 2022
ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in

241 Jan 04, 2023
NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels

NumPy String-Indexed NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels, rather than conventio

Aitan Grossman 1 Jan 08, 2022
In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

Transformers are all you need In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a

Aymen Berriche 8 Apr 13, 2022
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 02, 2023
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022
An implementation of the Pay Attention when Required transformer

Pay Attention when Required (PAR) Transformer-XL An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pd

7 Aug 11, 2022
ZUNIT - Toward Zero-Shot Unsupervised Image-to-Image Translation

ZUNIT Dependencies you can install all the dependencies by pip install -r requirements.txt Datasets Download CUB dataset. Unzip the birds.zip at ./da

Chen Yuanqi 9 Jun 24, 2022
Intent parsing and slot filling in PyTorch with seq2seq + attention

PyTorch Seq2Seq Intent Parsing Reframing intent parsing as a human - machine translation task. Work in progress successor to torch-seq2seq-intent-pars

Sean Robertson 159 Apr 04, 2022
Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

Milo Fultz 2 Oct 10, 2022
Implementation of ProteinBERT in Pytorch

ProteinBERT - Pytorch (wip) Implementation of ProteinBERT in Pytorch. Original Repository Install $ pip install protein-bert-pytorch Usage import torc

Phil Wang 92 Dec 25, 2022
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021