Yaspeller Dictionary (Auto)builder

Usage

# this sample command generates `./yaspeller_report.json`
# yaspeller --report json --ignore-digits --ignore-text "'.*" --ignore-latin --only-errors --file-extensions ".md" --lang ru

python -m venv env
source env/bin/activate
pip install 
python src/dictionary.py yaspeller_report.json

Why

Yaspeller is nice, but there are too many anglicisms in a usual documentation. Normally you just want to ignore that, but there's the only possibility to add a regexp-array to ignore words.

This generates a array of dictionary words including all lexems for all cases like

[
    "[бБ]аг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[дД]ифф(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[кК]оммит(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[пП]атчинг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[рР]убист(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[сС]амоорганизованн(ого|ом|ому|ую|ые|ый|ым|ыми|ых)",
    "[тТ]икет(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "коммитить"
]

from yaspeller errors (in text format looking like)

Spelling check:
✗ www.ruby-lang.org/ru/community/ruby-core/index.md 130 ms
-----
Typos: 9
1. патчингом (36:27)
2. коммитить (68:32, suggest: комитет)
3. багах (75:15, suggest: богах, баках, бегах)
4. баги (89:24, suggest: багги)
5. баг (96:25)
6. тикет (107:14, suggest: этикет)
7. дифф (115:18)
8. коммиту (147:24, suggest: комету, комнату)
9. коммита (148:58, suggest: комета)
-----

Live example

Initially created for www.ruby-lang.org translations spellchecking

🤕 spelling exceptions builder for lazy people

Related tags

Overview

Yaspeller Dictionary (Auto)builder

Usage

Why

Live example

Owner

Vlad Bokov

A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)

Natural Language Processing library built with AllenNLP 🌲🌱

The swas programming language

NLP command-line assistant powered by OpenAI

This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python

This code extends the neural style transfer image processing technique to video by generating smooth transitions between several reference style images

Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

[ICLR'19] Trellis Networks for Sequence Modeling

ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Client library to download and publish models and other files on the huggingface.co hub

Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Graph4nlp is the library for the easy use of Graph Neural Networks for NLP

New Modeling The Background CodeBase

A library for Multilingual Unsupervised or Supervised word Embeddings

An open-source NLP library: fast text cleaning and preprocessing.

Flaxformer: transformer architectures in JAX/Flax

Transformers and related deep network architectures are summarized and implemented here.

Lumped-element impedance calculator and frequency-domain plotter.