Implementation of the GBST block from the Charformer paper, in Pytorch

Overview

Charformer - Pytorch

Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.

AI Coffee Break with Letitia video

Install

$ pip install charformer-pytorch

Usage

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # the final downsample factor by which the sequence length will decrease by
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)

tokens = torch.randint(0, 257, (1, 1023)) # uneven number of tokens (1023)
mask   = torch.ones(1, 1023).bool()

# both tokens and mask will be appropriately downsampled

tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)

# now pass this on to your transformer

Citations

@misc{tay2021charformer,
    title   = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization}, 
    author  = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year    = {2021},
    eprint  = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
Comments
  • positional embedding

    positional embedding

    Screenshot from 2021-06-30 12-12-17

    in section 2.1.1 in the paper, the authors claim that by adding intra-block positional embeddings https://github.com/lucidrains/charformer-pytorch/blob/main/charformer_pytorch/charformer_pytorch.py#L90-L96 the block representations will be aware of the position of each character. however, if one were to be doing mean pooling as the author propose, wouldn't this amount to just adding the mean of the positional embeddings for every block? If anyone has any insights, please leave a comment

    help wanted 
    opened by lucidrains 3
  • Cannot tokenize on GPU

    Cannot tokenize on GPU

    Hi,

    I'm using Charformer to do some error corrections on Colab. But I found that after I pass tokens to CUDA and start tokenizing, this would show up: image

    Did I do it in a wrong way?

    opened by Shamepoo 2
  • example of how to read in/tokenize a text file, for use with HuggingFace Transformers?

    example of how to read in/tokenize a text file, for use with HuggingFace Transformers?

    Hello, I was attempting to adapt this guide for use with Charformer Pytorch. Colab notebook for that guide is here.

    I'd like to be able to use GBST on the same data, https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt, but I'm not sure how to pass that in.

    I tried looking at the source code, and the other issues here, but haven't yet found the details.

    Some specific questions:

    • how do I "train" this tokenizer on a .txt file?
    • is it compatible with this section of the HF notebook, aka can it be passed into LineByLineTextDataset?
    from transformers import LineByLineTextDataset
    
    dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path="./oscar.eo.txt",
        block_size=128,
    )
    

    When I tried doing that line, I got the following error:

    /usr/local/lib/python3.7/dist-packages/transformers/data/datasets/language_modeling.py:124: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the ๐Ÿค— Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
      FutureWarning,
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-38-1688c68b48be> in <module>()
          5     tokenizer=tokenizer,
          6     file_path="./oscar.eo.txt",
    ----> 7     block_size=128,
          8 )
    
    1 frames
    
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    TypeError: forward() got an unexpected keyword argument 'add_special_tokens'
    
    opened by cdleong 0
  • Sequence Length Problem in NMT

    Sequence Length Problem in NMT

    After downsampling, the length of the sequence has been shortened. But how can I return the sequence to its original length since I may need to do sentence generation in error correction?

    Thank you!

    opened by Shamepoo 1
  • Bytes vs. Characters

    Bytes vs. Characters

    The authors address the difference between bytes and characters in footnote 2, it seems like the byte is just the char embedding with dimension of 256. However, in the last sentence, For other languages, each character corresponds to 2โ€“3 bytes in general. For simplicity and to align with prior work, we will generally talk about characters unless stated otherwise. and the example ๅญ่ฏๅˆ†่ฏ, it becomes ๅญๅญๅญ่ฏ่ฏ่ฏๅˆ†ๅˆ†ๅˆ†่ฏ่ฏ่ฏ, with the 3 bytes in every character.

    What I want to know is, 3 bytes mean we replicate three times for every single character, then feed into embedding? If so, how to decide the number of bytes.

    Thank you.

    opened by jamfly 0
Releases(0.0.4)
Owner
Phil Wang
Working with Attention
Phil Wang
Dynamica causal Bayesian optimisation

Dynamic Causal Bayesian Optimization This is a Python implementation of Dynamic Causal Bayesian Optimization as presented at NeurIPS 2021. Abstract Th

nd308 18 Nov 22, 2022
Epidemiology analysis package

zEpid zEpid is an epidemiology analysis package, providing easy to use tools for epidemiologists coding in Python 3.5+. The purpose of this library is

Paul Zivich 111 Jan 08, 2023
Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization Algorithm,Immune Algorithm, Artificial Fish Swarm Algorithm, Differential Evolution and TSP(Traveling salesman)

scikit-opt Swarm Intelligence in Python (Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Algorithm, Immune Algorithm,A

้ƒญ้ฃž 3.7k Jan 03, 2023
Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".

Detecting Twenty-thousand Classes using Image-level Supervision Detic: A Detector with image classes that can use image-level labels to easily train d

Meta Research 1.3k Jan 04, 2023
Learning Modified Indicator Functions for Surface Reconstruction

Learning Modified Indicator Functions for Surface Reconstruction In this work, we propose a learning-based approach for implicit surface reconstructio

4 Apr 18, 2022
Experiments with differentiable stacks and queues in PyTorch

Please use stacknn-core instead! StackNN This project implements differentiable stacks and queues in PyTorch. The data structures are implemented in s

Will Merrill 141 Oct 06, 2022
Code of paper: "DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks"

DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks Abstract: Adversarial training has been proven to

ๅ€ชไป•ๆ–‡ (Shiwen Ni) 58 Nov 10, 2022
Non-stationary GP package written from scratch in PyTorch

NSGP-Torch Examples gpytorch model with skgpytorch # Import packages import torch from regdata import NonStat2D from gpytorch.kernels import RBFKernel

Zeel B Patel 1 Mar 06, 2022
Automated Evidence Collection for Fake News Detection

Automated Evidence Collection for Fake News Detection This is the code repo for the Automated Evidence Collection for Fake News Detection paper accept

Mrinal Rawat 2 Apr 12, 2022
VGGFace2-HQ - A high resolution face dataset for face editing purpose

The first open source high resolution dataset for face swapping!!! A high resolution version of VGGFace2 for academic face editing purpose

Naiyuan Liu 232 Dec 29, 2022
Repository for the electrical and ICT benchmark model developed in the ERIGrid 2.0 project.

Benchmark Model Electrical and ICT System This repository contains the documentation, code, and models for the electrical and ICT benchmark model deve

ERIGrid 2.0 1 Nov 29, 2021
A library for performing coverage guided fuzzing of neural networks

TensorFuzz: Coverage Guided Fuzzing for Neural Networks This repository contains a library for performing coverage guided fuzzing of neural networks,

Brain Research 195 Dec 28, 2022
Jaxtorch (a jax nn library)

Jaxtorch (a jax nn library) This is my jax based nn library. I created this because I was annoyed by the complexity and 'magic'-ness of the popular ja

nshepperd 17 Dec 08, 2022
A demo of how to use JAX to create a simple gravity simulation

JAX Gravity This repo contains a demo of how to use JAX to create a simple gravity simulation. It uses JAX's experimental ode package to solve the dif

Cristian Garcia 16 Sep 22, 2022
PyTorch reimplementation of the Smooth ReLU activation function proposed in the paper "Real World Large Scale Recommendation Systems Reproducibility and Smooth Activations" [arXiv 2022].

Smooth ReLU in PyTorch Unofficial PyTorch reimplementation of the Smooth ReLU (SmeLU) activation function proposed in the paper Real World Large Scale

Christoph Reich 10 Jan 02, 2023
Code for Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games How to run our algorithm? Create the new environment using: conda

MARL @ SJTU 8 Dec 27, 2022
Self-training with Weak Supervision (NAACL 2021)

This repo holds the code for our weak supervision framework, ASTRA, described in our NAACL 2021 paper: "Self-Training with Weak Supervision"

Microsoft 148 Nov 20, 2022
PyTorch code for JEREX: Joint Entity-Level Relation Extractor

JEREX: "Joint Entity-Level Relation Extractor" PyTorch code for JEREX: "Joint Entity-Level Relation Extractor". For a description of the model and exp

LAVIS - NLP Working Group 50 Dec 01, 2022
Malware Analysis Neural Network project.

MalanaNeuralNetwork Description Malware Analysis Neural Network project. Table of Contents Getting Started Requirements Installation Clone Set-Up VENV

2 Nov 13, 2021
Model parallel transformers in Jax and Haiku

Mesh Transformer Jax A haiku library using the new(ly documented) xmap operator in Jax for model parallelism of transformers. See enwik8_example.py fo

Ben Wang 4.8k Jan 01, 2023