A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer

  • Token id [0, 20000) are image tokens.
  • Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == '<unk>', icetk[20003] == '<pad>', icetk[20006] == ','.
  • Token id [20100, 83823) are English tokens.
  • Token id [83823, 145653) are Chinese tokens.
  • Token id [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
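
To get a quick feel for this layout you can index the tokenizer directly; the lookups below simply reuse the examples from the list above (a minimal sketch, assuming icetk is already installed):

from icetk import icetk
print(icetk[20006])   # ','  (common token range)
print(icetk[145803])  # 'α'  (rare token range)
# Tokens in [0, 20000) are image tokens and have no text form (see the list above).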

You can install the package via

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.' # always recovers perfectly (if there is no <unk>)

ids = icetk.encode('你好世界!这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
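
As a side note, the shapes in the example above are consistent with the image being split into an (image_size / compress_rate) by (image_size / compress_rate) grid of tokens; the arithmetic below is only an observation from that output, not a documented API guarantee.

image_size, compress_rate = 256, 8
num_image_tokens = (image_size // compress_rate) ** 2
print(num_image_tokens)  # 1024, matching ids.shape == torch.Size([1, 1024]) above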

Owner: THUDM (Data Mining Research Group at Tsinghua University)