KaziText is a tool for modelling common human errors.

Related tags

Deep Learningkazitext
Overview

KaziText

KaziText is a tool for modelling common human errors. It estimates probabilities of individual error types (so called aspects) from grammatical error correction corpora in M2 format.

The tool was introduced in Understanding Model Robustness to User-generated Noisy Texts.

Requirements

A set of requirements is listed in requirements.txt. Moreover, UDPipe model has to be downloaded for used languages (see http://hdl.handle.net/11234/1-3131) and linked in udpipe_tokenizer.py.

Overview

KaziText defines a set of aspects located in aspects. These model following phenomena:

  • Casing Errors
  • Common Other Errors (for most common phrases)
  • Errors in Diacritics
  • Punctuation Errors
  • Spelling Errors
  • Errors in wrongly used suffix/prefix
  • Whitespace Errors
  • Word-Order Errors

Each aspect has a set of internal probabilities (e.g. the probability of a user typing first letter of a starting word in lower-case instead of upper-case) that are estimated from M2 GEC corpora.

A complete set of aspects with their internal probabilities is called profile. We provide precomputed profiles for Czech, English, Russian and German in profiles as json files. The profiles are additionally split into dev and test. Also there are 4 profiles for Czech and 2 profiles for English differing in the underlying user domain (e.g. natives vs second learners).

To noise a text using a profile, use:

python introduce_errors.py $infile $outfile $profile $lang 

introduce_errors.py script offers a variety of switches (run python introduce_errors.py --help to display them). One noteworthy is --alpha that serves for regulating final text error rate (set it to value lower than 1 to reduce number of errors; set to to value bigger than 1 to have more noisy texts). Apart for profiles themselves, we also precomputed set of alphas that are stored as .csv files in respective profiles folders and store values for alphas to reach 5-30 final text word error rates as well as so called reference-alpha word error rate that corresponds to the same error rate as the original M2 files the profile was estimated from had. To have for example noisy text at circa 5% word error rate noised by Romani profile, use --profile dev/cs_romi.json --alpha 0.2.

Moreover, we provide several scripts (noise*.py) for noising specific data formats.

To estimate a profile for given M2 file, run:

python estimate_all_ratios.py $m2_pattern outfile

To estimate normalization alphas file, see estimate_alpha.sh that describes iterative process of noising clean texts with an alpha, measuring text's noisiness and changing alpha respectively.

Other notes

  • Russian RULEC-GEC was normalized using normalize_russian_m2.py
Owner
ÚFAL
Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
ÚFAL
A machine learning library for spiking neural networks. Supports training with both torch and jax pipelines, and deployment to neuromorphic hardware.

Rockpool Rockpool is a Python package for developing signal processing applications with spiking neural networks. Rockpool allows you to build network

SynSense 21 Dec 14, 2022
A unified 3D Transformer Pipeline for visual synthesis

Overview This is the official repo for the paper: NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion. NÜWA is a unified multimodal p

Microsoft 2.6k Jan 06, 2023
Towards End-to-end Video-based Eye Tracking

Towards End-to-end Video-based Eye Tracking The code accompanying our ECCV 2020 publication and dataset, EVE. Authors: Seonwook Park, Emre Aksan, Xuco

Seonwook Park 76 Dec 12, 2022
Implementation of the paper ''Implicit Feature Refinement for Instance Segmentation''.

Implicit Feature Refinement for Instance Segmentation This repository is an official implementation of the ACM Multimedia 2021 paper Implicit Feature

Lufan Ma 17 Dec 28, 2022
Speech Recognition using DeepSpeech2.

deepspeech.pytorch Implementation of DeepSpeech2 for PyTorch using PyTorch Lightning. The repo supports training/testing and inference using the DeepS

Sean Naren 2k Jan 04, 2023
An official implementation of "SFNet: Learning Object-aware Semantic Correspondence" (CVPR 2019, TPAMI 2020) in PyTorch.

PyTorch implementation of SFNet This is the implementation of the paper "SFNet: Learning Object-aware Semantic Correspondence". For more information,

CV Lab @ Yonsei University 87 Dec 30, 2022
Open-source Monocular Python HawkEye for Tennis

Tennis Tracking 🎾 Objectives Track the ball Detect court lines Detect the players To track the ball we used TrackNet - deep learning network for trac

ArtLabs 188 Jan 08, 2023
SimulLR - PyTorch Implementation of SimulLR

PyTorch Implementation of SimulLR There is an interesting work[1] about simultan

11 Dec 22, 2022
SciPy fixes and extensions

scipyx SciPy is large library used everywhere in scientific computing. That's why breaking backwards-compatibility comes as a significant cost and is

Nico Schlömer 16 Jul 17, 2022
A semismooth Newton method for elliptic PDE-constrained optimization

sNewton4PDEOpt The Python module implements a semismooth Newton method for solving finite-element discretizations of the strongly convex, linear ellip

2 Dec 08, 2022
Official implementation of DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations in TensorFlow 2

DreamerPro Official implementation of DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations in TensorFl

22 Nov 01, 2022
Code for our CVPR2021 paper coordinate attention

Coordinate Attention for Efficient Mobile Network Design (preprint) This repository is a PyTorch implementation of our coordinate attention (will appe

Qibin (Andrew) Hou 726 Jan 05, 2023
Context-Aware Image Matting for Simultaneous Foreground and Alpha Estimation

Context-Aware Image Matting for Simultaneous Foreground and Alpha Estimation This is the inference codes of Context-Aware Image Matting for Simultaneo

Qiqi Hou 125 Oct 22, 2022
Official PyTorch implementation of paper: Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation (ICCV 2021 Oral Presentation)

SML (ICCV 2021, Oral) : Official Pytorch Implementation This repository provides the official PyTorch implementation of the following paper: Standardi

SangHun 61 Dec 27, 2022
A modular domain adaptation library written in PyTorch.

A modular domain adaptation library written in PyTorch.

Kevin Musgrave 225 Dec 29, 2022
Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR)

This is the official implementation of our paper Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR), which has been accepted by WSDM2022.

Yongchun Zhu 81 Dec 29, 2022
(NeurIPS 2020) Wasserstein Distances for Stereo Disparity Estimation

Wasserstein Distances for Stereo Disparity Estimation Accepted in NeurIPS 2020 as Spotlight. [Project Page] Wasserstein Distances for Stereo Disparity

Divyansh Garg 92 Dec 12, 2022
MoveNetを用いたPythonでの姿勢推定のデモ

MoveNet-Python-Example MoveNetのPythonでの動作サンプルです。 ONNXに変換したモデルも同梱しています。変換自体を試したい方はMoveNet_tf2onnx.ipynbを使用ください。 2021/08/24時点でTensorFlow Hubで提供されている以下モデ

KazuhitoTakahashi 38 Dec 17, 2022
Codebase for ECCV18 "The Sound of Pixels"

Sound-of-Pixels Codebase for ECCV18 "The Sound of Pixels". *This repository is under construction, but the core parts are already there. Environment T

Hang Zhao 318 Dec 20, 2022