KaziText

KaziText is a tool for modelling common human errors. It estimates the probabilities of individual error types (so-called aspects) from grammatical error correction (GEC) corpora in M2 format.

The tool was introduced in Understanding Model Robustness to User-generated Noisy Texts.

Requirements

The required packages are listed in requirements.txt. In addition, a UDPipe model has to be downloaded for each language used (see http://hdl.handle.net/11234/1-3131) and linked in udpipe_tokenizer.py.

Overview

KaziText defines a set of aspects located in aspects. These model the following phenomena:

Casing Errors
Common Other Errors (for most common phrases)
Errors in Diacritics
Punctuation Errors
Spelling Errors
Errors in wrongly used suffixes/prefixes
Whitespace Errors
Word-Order Errors

Each aspect has a set of internal probabilities (e.g. the probability of a user typing the first letter of a sentence-initial word in lower case instead of upper case) that are estimated from M2 GEC corpora.
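To illustrate what estimating from an M2 corpus involves, the following minimal sketch reads M2 annotations and computes a per-token relative frequency for each error type. It relies only on the standard M2 layout (S lines with the tokenized sentence, A lines with fields separated by |||). The function name count_edit_types, the sample sentences, and the per-token frequency are assumptions made for illustration; the aspects in KaziText estimate their own, more detailed internal probabilities.

```python
from collections import Counter

# Tiny hand-written M2 sample (illustrative, not taken from any corpus).
M2_SAMPLE = """\
S this is a example sentence .
A 0 1|||R:ORTH|||This|||REQUIRED|||-NONE-|||0
A 2 3|||R:DET|||an|||REQUIRED|||-NONE-|||0

S It is fine .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""


def count_edit_types(m2_text):
    """Count edits per error type and the total number of source tokens."""
    type_counts = Counter()
    n_tokens = 0
    for line in m2_text.splitlines():
        if line.startswith("S "):
            # Source sentence: whitespace-separated tokens.
            n_tokens += len(line[2:].split())
        elif line.startswith("A "):
            # Annotation: "start end|||type|||correction|||...".
            fields = line[2:].split("|||")
            error_type = fields[1]
            if error_type != "noop":  # "noop" marks a sentence without edits
                type_counts[error_type] += 1
    return type_counts, n_tokens


counts, n_tokens = count_edit_types(M2_SAMPLE)
# Simplified estimate (assumption): edits of a given type per source token.
rates = {error_type: c / n_tokens for error_type, c in counts.items()}
print(rates)
```

A real profile would need context-dependent statistics (for example, casing errors only at eligible positions), which this frequency count deliberately leaves out.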

A complete set of aspects with their internal probabilities is called a profile. We provide precomputed profiles for Czech, English, Russian and German in profiles as JSON files. The profiles are additionally split into dev and test. There are also 4 profiles for Czech and 2 profiles for English, differing in the underlying user domain (e.g. native speakers vs. second-language learners).

To noise a text using a profile, use:

python introduce_errors.py $infile $outfile $profile $lang

The introduce_errors.py script offers a variety of switches (run python introduce_errors.py --help to display them). A noteworthy one is --alpha, which regulates the final text error rate: set it to a value lower than 1 to reduce the number of errors, or to a value greater than 1 to produce noisier text. Apart from the profiles themselves, we also precomputed sets of alphas, stored as .csv files in the respective profiles folders. They contain the alpha values needed to reach final word error rates of 5-30%, as well as the so-called reference-alpha, which yields the same word error rate as the original M2 files from which the profile was estimated. For example, to obtain text noised by the Romani profile at circa 5% word error rate, use --profile dev/cs_romi.json --alpha 0.2.
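The effect of alpha can be pictured with a toy example. The sketch below assumes that alpha acts as a simple multiplier on a base error probability; this is an assumption for illustration, not a description of how introduce_errors.py applies it. The helper names (scale_probability, noise_tokens, word_error_rate) and the single casing error are hypothetical.

```python
import random


def scale_probability(p, alpha):
    """Scale a base error probability by alpha, clipped to [0, 1]."""
    return max(0.0, min(1.0, p * alpha))


def noise_tokens(tokens, base_p, alpha, rng):
    """Toy noiser: swap the case of a token's first letter with probability base_p * alpha."""
    p = scale_probability(base_p, alpha)
    return [t[:1].swapcase() + t[1:] if rng.random() < p else t for t in tokens]


def word_error_rate(clean, noisy):
    """Share of positions whose token differs (valid here because lengths match)."""
    assert len(clean) == len(noisy)
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


rng = random.Random(0)
clean = ("the quick brown fox jumps over the lazy dog " * 200).split()
for alpha in (0.2, 1.0, 2.0):
    noisy = noise_tokens(clean, base_p=0.1, alpha=alpha, rng=rng)
    # The measured error rate should grow roughly in proportion to alpha.
    print(alpha, round(word_error_rate(clean, noisy), 3))
```

Under this assumption, an alpha below 1 lowers the measured word error rate and an alpha above 1 raises it, which matches the behaviour described for --alpha.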

Moreover, we provide several scripts (noise*.py) for noising specific data formats.

To estimate a profile for a given M2 file, run:

python estimate_all_ratios.py $m2_pattern $outfile

To estimate a file of normalization alphas, see estimate_alpha.sh, which describes the iterative process of noising clean texts with a given alpha, measuring the noisiness of the resulting text, and adjusting alpha accordingly.
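The iterative process can be sketched as a search over alpha. The example below uses bisection, which is an assumption chosen for illustration; estimate_alpha.sh may adjust alpha differently. The function measure_wer is a hypothetical stand-in for "noise a clean text with this alpha and measure its word error rate", and its base_p, n_tokens and seed parameters are invented for the toy setup.

```python
import random


def measure_wer(alpha, base_p=0.1, n_tokens=20000, seed=0):
    """Hypothetical stand-in: simulate noising n_tokens tokens and return the error rate."""
    rng = random.Random(seed)  # fixed seed keeps the measurement monotone in alpha
    p = min(1.0, base_p * alpha)
    return sum(rng.random() < p for _ in range(n_tokens)) / n_tokens


def estimate_alpha(target_wer, lo=0.0, hi=10.0, tol=0.001, max_iter=50):
    """Bisection on alpha: raise it when the text is too clean, lower it when too noisy."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        wer = measure_wer(mid)
        if abs(wer - target_wer) <= tol:
            break
        if wer < target_wer:
            lo = mid
        else:
            hi = mid
    return mid


alpha = estimate_alpha(target_wer=0.05)
print(alpha, measure_wer(alpha))
```

With a real noising pipeline, measure_wer would be replaced by a call that noises held-out clean text and compares it with the original; the search loop itself stays the same.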

Other notes

The Russian RULEC-GEC corpus was normalized using normalize_russian_m2.py.

Owner

ÚFAL
