Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

Last update: Dec 13, 2022

T-TA (Transformer-based Text Auto-encoder)

This repository contains code for the Transformer-based Text Auto-encoder (T-TA; paper: Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning), implemented in TensorFlow 2.

How to train T-TA on a custom dataset

Prepare datasets. You need plain-text files with one sentence per line.

Example:
```
Sentence 1.
Sentence 2.
Sentence 3.
```
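As a rough illustration of the expected input format, the sketch below writes a list of sentences to a text file with one sentence per line, skipping empty entries. The helper name `write_line_file` is hypothetical and not part of this repository.

```python
from pathlib import Path


def write_line_file(sentences, path):
    """Write one non-empty sentence per line (hypothetical helper, not from this repo)."""
    cleaned = []
    for sentence in sentences:
        # Collapse internal whitespace and newlines so each sentence stays on one line.
        line = " ".join(sentence.split())
        if line:
            cleaned.append(line)
    # Every kept sentence ends with a newline; an empty input yields an empty file.
    Path(path).write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)
```

The returned count is only a convenience for checking how many sentences survived the cleanup.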
Train the SentencePiece tokenizer. You can use train_sentencepiece.py or train a SentencePiece model yourself.

Train the T-TA model. Run train.py with the arguments you want to customize. Here is the usage:

```
$ python train.py --help
usage: train.py [-h] [--train-data TRAIN_DATA] [--dev-data DEV_DATA] [--model-config MODEL_CONFIG] [--batch-size BATCH_SIZE] [--spm-model SPM_MODEL]
                [--learning-rate LEARNING_RATE] [--target-epoch TARGET_EPOCH] [--steps-per-epoch STEPS_PER_EPOCH] [--warmup-ratio WARMUP_RATIO]

optional arguments:
    -h, --help            show this help message and exit
    --train-data TRAIN_DATA
    --dev-data DEV_DATA
    --model-config MODEL_CONFIG
    --batch-size BATCH_SIZE
    --spm-model SPM_MODEL
    --learning-rate LEARNING_RATE
    --target-epoch TARGET_EPOCH
    --steps-per-epoch STEPS_PER_EPOCH
    --warmup-ratio WARMUP_RATIO
```

I wanted to train for a fixed number of steps, so I added the --steps-per-epoch and --target-epoch arguments. The total number of training steps is steps_per_epoch * target_epoch.
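The relationship between these two arguments can be checked with a few lines of Python. The argument values below are assumptions chosen for illustration only, and `total_steps` is a hypothetical helper, not a function in train.py.

```python
def total_steps(steps_per_epoch: int, target_epoch: int) -> int:
    """Total number of training steps for a run (hypothetical helper)."""
    return steps_per_epoch * target_epoch


# Assumed example values: 10,000 steps per epoch for 100 epochs.
print(total_steps(10_000, 100))  # prints 1000000
```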

(Optional) Test your model on KorSTS data. I trained my model on a Korean corpus, so I evaluated it on KorSTS. You can compute the KorSTS score (Spearman correlation) with evaluate_unsupervised_korsts.py. Here is the usage:

```
$ python evaluate_unsupervised_korsts.py --help
usage: evaluate_unsupervised_korsts.py [-h] --model-weight MODEL_WEIGHT --dataset DATASET

optional arguments:
    -h, --help            show this help message and exit
    --model-weight MODEL_WEIGHT
    --dataset DATASET
$ # To evaluate on dev set
$ # python evaluate_unsupervised_korsts.py --model-weight ./path/to/checkpoint --dataset ./path/to/dataset/sts-dev.tsv
```

Training details

Training data: lovit/namuwikitext
Peak learning rate: 1e-4
Learning rate scheduler: linear warmup and linear decay
Warmup ratio: 0.05 (warmup steps: 1M * 0.05 = 50k)
Vocab size: 15000
Number of layers: 3
Intermediate size: 2048
Hidden size: 512
Attention heads: 8
Activation function: gelu
Max sequence length: 128
Tokenizer: sentencepiece
Total steps: 1M
Final validation accuracy of the auto-encoding task (ignoring padding): 0.5513
Final validation loss: 2.1691
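The learning rate schedule described above can be sketched in plain Python as follows. This is a minimal illustration using the values listed here, not the code in train.py. It assumes that the decay ends at zero at the final step, which the list above does not state. The function name `linear_warmup_linear_decay` is hypothetical.

```python
def linear_warmup_linear_decay(step, peak_lr=1e-4, total_steps=1_000_000, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup to `peak_lr`, then linear decay.

    Assumption: the decay reaches zero at `total_steps`.
    """
    warmup_steps = int(round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to peak_lr.
        return peak_lr * step / max(warmup_steps, 1)
    # Decay phase: shrink linearly from peak_lr to 0.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


for step in (0, 25_000, 50_000, 525_000, 1_000_000):
    print(step, linear_warmup_linear_decay(step))
```

With the defaults above, the peak of 1e-4 is reached at step 50k, matching the warmup step count in the list.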

Unsupervised KorSTS

| Model | Params | Development | Test |
| --- | --- | --- | --- |
| My Implementation | 17M | 65.98 | 56.75 |
| - | - | - | - |
| Korean SRoBERTa (base) | 111M | 63.34 | 48.96 |
| Korean SRoBERTa (large) | 338M | 60.15 | 51.35 |
| SXLM-R (base) | 270M | 64.27 | 45.05 |
| SXLM-R (large) | 550M | 55.00 | 39.92 |
| Korean fastText | - | - | 47.96 |

KorSTS development and test set scores (100 * Spearman correlation). Details of the other models are in the paper KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding.

How to use the pre-trained weights with tensorflow-hub

```
>>> import tensorflow as tf
>>> import tensorflow_text as text
>>> import tensorflow_hub as hub
>>> # load model
>>> model = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/model.tar.gz")
>>> preprocess = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/preprocess.tar.gz")
>>> # inference
>>> input_tensor = preprocess(["이 모델은 나무위키로 학습되었습니다.", "근데 이 모델 어디다가 쓸 수 있을까요?", "나는 고양이를 좋아해!", "나는 강아지를 좋아해!"])
>>> representation = model(input_tensor)
>>> representation = tf.reduce_sum(representation * tf.cast(input_tensor["input_mask"], representation.dtype)[:, :, tf.newaxis], axis=1)
>>> representation = tf.nn.l2_normalize(representation, axis=-1)
>>> similarities = tf.tensordot(representation, representation, axes=[[1], [1]])
>>> # results
>>> similarities
<tf.Tensor: shape=(4, 4), dtype=float32, numpy=
array([[0.9999999 , 0.76468784, 0.7384633 , 0.7181306 ],
       [0.76468784, 1.        , 0.81387675, 0.79722893],
       [0.7384633 , 0.81387675, 0.9999999 , 0.96217746],
       [0.7181306 , 0.79722893, 0.96217746, 1.        ]], dtype=float32)>
```
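The pooling steps in the session above (masked sum over tokens, L2 normalization, dot product) can be traced on toy data without TensorFlow. The following plain-Python sketch mirrors those three steps for a single pair of sentences. The helper names and the toy vectors are assumptions for illustration, not part of the released model.

```python
import math


def masked_sum_pool(token_vectors, mask):
    """Sum token vectors where mask is 1, ignoring padded positions."""
    dim = len(token_vectors[0])
    pooled = [0.0] * dim
    for vector, keep in zip(token_vectors, mask):
        if keep:
            for d in range(dim):
                pooled[d] += vector[d]
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def cosine_similarity(a, b):
    """Dot product of the L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy example: the third token is padding and must not affect the result.
sentence_a = masked_sum_pool([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]], [1, 1, 0])
sentence_b = masked_sum_pool([[2.0, 2.0], [9.0, -9.0]], [1, 0])
print(cosine_similarity(sentence_a, sentence_b))
```

Because the vectors are normalized before the dot product, summing and averaging over tokens give the same similarity.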

References

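Setting the short English text aside and writing for the Korean readers who probably make up most of the audience: this is simply a model I trained on my own to test whether a model architecture under consideration at my company would work well. I wrote the code because I was curious how well it would perform, so there was no hyperparameter tuning or careful dataset selection. The scores just turned out better than I expected during training, so I released the model together with the results. As you can probably guess from the commit log, I wrote the code quickly in about a day and trained the model on a small GPU for about 50 hours.

I tried to follow the values in the original paper as closely as possible, but I wrote the code at night, so some parts may be unclear and may differ from the original implementation. If you open an issue about such a part, I will take another look.

Thanks to Yeongmin Baek (@baekyeongmin) for help with troubleshooting.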

Releases(0)

0(Feb 6, 2021)
model.tar.gz(60.93 MB)
preprocess.tar.gz(507.45 KB)

Owner

Jeong Ukjae
