GULAG: GUessing LAnGuages with neural networks

Related tags

Deep Learninggulag
Overview

GULAG: GUessing LAnGuages with neural networks

Main Code style: black Checked with mypy GitHub license GitHub stars

cannon on sparrows

Classify languages in text via neural networks.

> Привет! My name is Egor. Was für ein herrliches Frühlingswetter, хутка расцвітуць дрэвы.
ru -- Привет
en -- My name is Egor
de -- Was für ein herrliches Frühlingswetter
be -- хутка расцвітуць дрэвы

Usage

Use requirements.txt to install necessary dependencies:

pip install -r requirements.txt

After that you can either train model:

python -m src.main train --gin-file config/train.gin

Or run inference:

python -m src.main infer

Training

All training details are covered by PyTorch-Lightning. There are:

Both modules have explicit documentation, see source files for usage details.

Dataset

Since extracting languages from a text is a kind of synthetic task, then there is no exact dataset of that. A possible approach to handle this is to use general multilingual corpses to create a synthetic dataset with multiple languages per one text. Although there is a popular mC4 dataset with large texts in over 100 languages. It is too large for this pet project. Therefore, I used wikiann dataset that also supports over 100 languages including Russian, Ukrainian, Belarusian, Kazakh, Azerbaijani, Armenian, Georgian, Hebrew, English, and German. But this dataset consists of only small sentences for NER classification that make it more unnatural.

Synthetic data

To create a dataset with multiple languages per example, I use the following sampling strategy:

  1. Select number of languages in next example
  2. Select number of sentences for each language
  3. Sample sentences, shuffle them and concatenate into single text

For exact details about sampling algorithm see generate_example method.

This strategy allows training on a large non-repeating corpus. But for proper evaluation during training, we need a deterministic subset of data. For that, we can pre-generate a bunch of texts and then reuse them on each validation.

Model

As a training objective, I selected per-token classification. This automatically allows not only classifying languages in the text, but also specifying their ranges.

The model consists of two parts:

  1. The backbone model that embeds tokens into vectors
  2. Head classifier that predicts classes by embedding vector

As backbone model I selected vanilla BERT. This model already pretrained on large multilingual corpora including non-popular languages. During training on a target task, weights of BERT were frozen to enhance speed.

Head classifier is a simple MLP, see TokenClassifier for details.

Configuration

To handle big various of parameters, I used gin-config. config folder contains all configurations split by modules that used them.

Use --gin-file CLI argument to specify config file and --gin-param to manually overwrite some values. For example, to run debug mode on a small subset with a tiny model for 10 steps use

python -m src.main train --gin-file config/debug.gin --gin-param="train.n_steps = 10"

You can also use jupyter notebook to run training, this is a convenient way to train with Google Colab. See train.ipynb.

Artifacts

All training logs and artifacts are stored on W&B. See voudy/gulag for information about current runs, their losses and metrics. Any of the presented models may be used on inference.

Inference

In inference mode, you may play with the model to see whether it is good or not. This script requires a W&B run path where checkpoint is stored and checkpoint name. After that, you can interact with a model in a loop.

The final model is stored in voudy/gulag/a55dbee8 run. It was trained for 20 000 steps for ~9 hours on Tesla T4.

$ python -m src.main infer --wandb "voudy/gulag/a55dbee8" --ckpt "step_20000.ckpt"
...
Enter text to classify languages (Ctrl-C to exit):
> İrəli! Вперёд! Nach vorne!
az -- İrəli
ru -- Вперёд
de -- Nach vorne
Enter text to classify languages (Ctrl-C to exit):
> Давайте жити дружно
uk -- Давайте жити дружно
> ...

For now, text preprocessing removes all punctuation and digits. It makes the data more robust. But restoring them back is a straightforward technical work that I was lazy to do.

Of course, you can use model from the Jupyter Notebooks, see infer.ipynb

Further work

Next steps may include:

  • Improved dataset with more natural examples, e.g. adopt mC4.
  • Better tokenization to handle rare languages, this should help with problems on the bounds of similar texts.
  • Experiments with another embedders, e.g. mGPT-3 from Sber covers all interesting languages, but requires technical work to adopt for classification task.
Owner
Egor Spirin
DL guy
Egor Spirin
Official implementation for NIPS'17 paper: PredRNN: Recurrent Neural Networks for Predictive Learning Using Spatiotemporal LSTMs.

PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning The predictive learning of spatiotemporal sequences aims to generate future

THUML: Machine Learning Group @ THSS 243 Dec 26, 2022
Deep Q-learning for playing chrome dino game

[PYTORCH] Deep Q-learning for playing Chrome Dino

Viet Nguyen 68 Dec 05, 2022
x-transformers-paddle 2.x version

x-transformers-paddle x-transformers-paddle 2.x version paddle 2.x版本 https://github.com/lucidrains/x-transformers 。 requirements paddlepaddle-gpu==2.2

yujun 7 Dec 08, 2022
Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral)

DSA^2 F: Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral) This repo is the official imp

如今我已剑指天涯 46 Dec 21, 2022
Source code related to the article submitted to the International Conference on Computational Science ICCS 2022 in London

POTHER: Patch-Voted Deep Learning-based Chest X-ray Bias Analysis for COVID-19 Detection Source code related to the article submitted to the Internati

Tomasz Szczepański 1 Apr 29, 2022
RIM: Reliable Influence-based Active Learning on Graphs.

RIM: Reliable Influence-based Active Learning on Graphs. This repository is the official implementation of RIM. Requirements To install requirements:

Wentao Zhang 4 Aug 29, 2022
A Free and Open Source Python Library for Multiobjective Optimization

Platypus What is Platypus? Platypus is a framework for evolutionary computing in Python with a focus on multiobjective evolutionary algorithms (MOEAs)

Project Platypus 424 Dec 18, 2022
GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled

Guidedog Authors: Kyuhee Jo, Steven Gunarso, Jacky Wang, Raghav Sharma GuideDog is an AI/ML-based mobile app designed to assist the lives of the visua

Kyuhee Jo 5 Nov 24, 2021
Research shows Google collects 20x more data from Android than Apple collects from iOS. Block this non-consensual telemetry using pihole blocklists.

pihole-antitelemetry Research shows Google collects 20x more data from Android than Apple collects from iOS. Block both using these pihole lists. Proj

Adrian Edwards 290 Jan 09, 2023
git《Joint Entity and Relation Extraction with Set Prediction Networks》(2020) GitHub:

Joint Entity and Relation Extraction with Set Prediction Networks Source code for Joint Entity and Relation Extraction with Set Prediction Networks. W

130 Dec 13, 2022
ColossalAI-Benchmark - Performance benchmarking with ColossalAI

Benchmark for Tuning Accuracy and Efficiency Overview The benchmark includes our

HPC-AI Tech 31 Oct 07, 2022
Robotics with GPU computing

Robotics with GPU computing Cupoch is a library that implements rapid 3D data processing for robotics using CUDA. The goal of this library is to imple

Shirokuma 625 Jan 07, 2023
Python periodic table module

elemenpy Hello! elements.py is a small Python periodic table module that is used for calling certain information about an element. Installation Instal

Eric Cheng 2 Dec 27, 2021
DCGAN LSGAN WGAN-GP DRAGAN PyTorch

Recommendation Our GAN based work for facial attribute editing - AttGAN. News 8 April 2019: We re-implement these GANs by Tensorflow 2! The old versio

Zhenliang He 408 Nov 30, 2022
Randstad Artificial Intelligence Challenge (powered by VGEN). Soluzione proposta da Stefano Fiorucci (anakin87) - primo classificato

Randstad Artificial Intelligence Challenge (powered by VGEN) Soluzione proposta da Stefano Fiorucci (anakin87) - primo classificato Struttura director

Stefano Fiorucci 1 Nov 13, 2021
This is an early in-development version of training CLIP models with hivemind.

A transformer that does not hog your GPU memory This is an early in-development codebase: if you want a stable and documented hivemind codebase, look

<a href=[email protected]"> 4 Nov 06, 2022
Facial Image Inpainting with Semantic Control

Facial Image Inpainting with Semantic Control In this repo, we provide a model for the controllable facial image inpainting task. This model enables u

Ren Yurui 8 Nov 22, 2021
Official Pytorch implementation of the paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV 2021

ACTOR Official Pytorch implementation of the paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV 2021. Please visit our we

Mathis Petrovich 248 Dec 23, 2022
Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection

Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection

61 Jan 07, 2023
My solutions for Stanford University course CS224W: Machine Learning with Graphs Fall 2021 colabs (GNN, GAT, GraphSAGE, GCN)

machine-learning-with-graphs My solutions for Stanford University course CS224W: Machine Learning with Graphs Fall 2021 colabs Course materials can be

Marko Njegomir 7 Dec 14, 2022