GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

Overview

GrammarTagger is an open-source toolkit for grammatical profiling for language learning. It analyzes text in English and Chinese and reports the grammatical items found in the input, along with the estimated difficulty level of the text.

Usage

GrammarTagger is written in Python (3.7+) and built on AllenNLP (2.1.0+). If you have conda installed, you can set up the environment as follows:

git clone https://github.com/octanove/grammartagger.git
cd grammartagger
conda create -n grammartagger python=3.7
conda activate grammartagger
pip install -r requirements.txt

Also, download the pretrained models (see below). After these steps, you can run GrammarTagger as follows:

English:

$ echo 'He loves to learn new languages, and last month he practiced some lessons in Spanish.' | python scripts/predict.py model-en-multi.tar.gz | jq
{
  "spans": [
    {
      "span": [0, 3],
      "tokens": ["[CLS]", "he", "loves", "to"],
      "label": "194:VP.SV.AFF"
    },
    {
      "span": [2, 2],
      "tokens": ["loves"],
      "label": "60:TA.PRESENT.does.AFF"
    },
    {
      "span": [2, 4],
      "tokens": ["loves", "to", "learn"],
      "label": "101:TO.VV_to_do"
    },
    ...
  ],
  "tokens": [
    "[CLS]", "he", "loves", "to", "learn", "new", "languages", ",",
    "and", "last", "month", "he", "practiced", "some", "lessons", "in", "spanish", ".", "[SEP]"
  ],
  "level_probs": {
    "c2": 0.008679441176354885,
    "b2": 0.005526999477297068,
    "c1": 0.05267713591456413,
    "b1": 0.06360447406768799,
    "a2": 0.06990284472703934,
    "a1": 0.7954732775688171
  }
}
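
The `level_probs` field assigns a probability to each CEFR level, and the keys are not sorted. As a minimal sketch of how the output might be consumed, the snippet below picks the most likely level. The helper name `most_likely_level` is hypothetical and not part of GrammarTagger; the data is an abridged copy of the output above with rounded probabilities.

```python
import json

# Abridged copy of the English output shown above (probabilities rounded).
output = json.loads("""
{
  "level_probs": {
    "c2": 0.0087, "b2": 0.0055, "c1": 0.0527,
    "b1": 0.0636, "a2": 0.0699, "a1": 0.7955
  }
}
""")


def most_likely_level(level_probs):
    """Return the (level, probability) pair with the highest probability.

    Hypothetical helper for illustration; not part of GrammarTagger.
    """
    return max(level_probs.items(), key=lambda item: item[1])


level, prob = most_likely_level(output["level_probs"])
print(f"{level}: {prob:.2f}")
```

For the sample sentence this selects `a1`, the level with the largest probability in the output above.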

Chinese:

$ echo '她住得很远,我想送她回去。' | python scripts/predict.py model-zh-multi.tar.gz | jq
{
  "spans": [
    {
      "span": [2, 5],
      "tokens": ["住", "得", "很", "远"],
      "label": "2.12.1:V 得 A:(using adverbs)"
    },
    {
      "span": [4, 4],
      "tokens": ["很"],
      "label": "1.06.2:很:very"
    },
    {
      "span": [8, 8],
      "tokens": ["想"],
      "label": "1.08.1:想:to want"
    }
  ],
  "tokens": ["[CLS]", "她", "住", "得", "很", "远", ",", "我", "想", "送", "她", "回", "去", "。", "[SEP]"],
  "level_probs": {
    "HSK 6": 9.971807230613194e-06,
    "HSK 5": 0.0011904890416190028,
    "HSK 3": 0.005279902834445238,
    "HSK 4": 0.00014815296162851155,
    "HSK 2": 0.9917035102844238,
    "HSK 1": 0.0016456041485071182
  }
}
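
Judging from the sample outputs, each `span` appears to hold inclusive start and end indices into the top-level `tokens` list (which includes `[CLS]` and `[SEP]`). This is an inference from the examples, not documented behavior. Under that assumption, the tokens of a span can be recovered as sketched below; `span_tokens` is a hypothetical helper, not part of GrammarTagger.

```python
# Tokens and spans copied from the Chinese output shown above.
tokens = ["[CLS]", "她", "住", "得", "很", "远", ",", "我", "想",
          "送", "她", "回", "去", "。", "[SEP]"]
spans = [
    {"span": [2, 5], "label": "2.12.1:V 得 A:(using adverbs)"},
    {"span": [4, 4], "label": "1.06.2:很:very"},
    {"span": [8, 8], "label": "1.08.1:想:to want"},
]


def span_tokens(tokens, span):
    """Return the tokens covered by a span.

    Assumes both indices are inclusive, as the sample outputs suggest.
    """
    start, end = span
    return tokens[start:end + 1]


for item in spans:
    print(item["label"], span_tokens(tokens, item["span"]))
```

The recovered tokens match the `tokens` field that GrammarTagger already reports for each span, so the helper is mainly useful for mapping spans back to positions in the tokenized input.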

Technical details

GrammarTagger is based on pretrained contextualizers, namely BERT (Devlin et al. 2019), and span classification. See the following paper for more details.

Hagiwara et al. 2021. GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education

Pretrained models

These pretrained models are licensed under CC BY-NC-ND 4.0 for academic and personal use. If you are interested in a commercial license, please contact [email protected]. We are also working on improved models with wider grammar coverage and higher accuracy.

Owner
Octanove Labs