A simple implementation of an N-gram language model.

Last update: Nov 24, 2021

About

A simple implementation of an N-gram language model.

Requirements

numpy

Data preparation

Corpus

The training data for the N-gram model is a text file, for example:

曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利

During training, each line is split into tokens by a delimiter. If no delimiter is given (the default), each line is split into individual characters.
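The splitting rule can be illustrated with a minimal sketch. The function name split_line and its delimiter parameter are hypothetical and chosen for illustration only; the actual training script may tokenize differently.

```python
def split_line(line, delimiter=None):
    """Split one corpus line into tokens.

    Assumption: with no delimiter, every character is its own token;
    otherwise the line is split on the delimiter and empty pieces are dropped.
    """
    line = line.strip()
    if not delimiter:
        return list(line)
    return [tok for tok in line.split(delimiter) if tok]


print(split_line("曼联加油"))                  # ['曼', '联', '加', '油']
print(split_line("曼联 加油", delimiter=" "))  # ['曼联', '加油']
```

Character-level splitting suits Chinese text such as the corpus above, since it needs no word segmenter.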

Tokens

The dictionary (vocabulary) for the model is a text file with one token per line. Each token appears only once in the file.

光
衰
戒
颅
阖
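A token file in this format could be turned into a token-to-id mapping roughly as follows. This is a sketch under assumptions: the helper build_vocab is hypothetical, and whether the real scripts assign ids in file order is not stated here.

```python
def build_vocab(lines):
    """Map each token to an integer id, in the order the lines appear.

    Assumption: one token per line; duplicates are rejected because
    every token is expected to be unique in the file.
    """
    vocab = {}
    for line in lines:
        token = line.rstrip("\n")
        if not token:
            continue  # skip blank lines
        if token in vocab:
            raise ValueError(f"duplicate token: {token!r}")
        vocab[token] = len(vocab)
    return vocab


# In practice `lines` would be an open file handle; a list works the same way.
vocab = build_vocab(["光\n", "衰\n", "戒\n", "颅\n", "阖\n"])
print(vocab)
```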

Training

Run the script train_n_gram.py to train an N-gram model.

python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2
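The internals of train_n_gram.py are not shown here. As a rough sketch of what training a count-based N-gram model involves, the code below assumes character tokens and a mapping from each head (the first n-1 tokens of an N-gram) to counts of the tail token that follows it. The name head2tail is borrowed from the test output; the function count_ngrams is hypothetical.

```python
from collections import Counter, defaultdict


def count_ngrams(lines, n=2):
    """Count, for every (n-1)-token head, how often each tail token follows it."""
    head2tail = defaultdict(Counter)
    for line in lines:
        tokens = list(line.strip())  # assumption: character-level tokens
        for i in range(len(tokens) - n + 1):
            head = tuple(tokens[i:i + n - 1])
            tail = tokens[i + n - 1]
            head2tail[head][tail] += 1
    return head2tail


counts = count_ngrams(["曼联加油", "曼联这局肯定胜利"], n=2)
print(counts[("曼",)])  # Counter({'联': 2})
```

Dividing each tail count by the total count for its head gives the conditional probability of the tail given the head.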

Testing

Run the script test_n_gram.py to test the trained N-gram model.

python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈

The test output will look like:

INFO - Loaded model from data/2-gram.model
INFO - Model info:
	n: 2
	head2tail length: 5947
	tokens: 5952
The most probable next token of the '哈哈' is '哈'.
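The prediction step can be sketched as a lookup of the last n-1 tokens of the input, followed by picking the most frequent tail. This is an illustration under assumptions, not the logic of test_n_gram.py: the function most_probable_next and the toy counts are hypothetical.

```python
from collections import Counter


def most_probable_next(head2tail, text, n=2):
    """Return (token, probability) for the likeliest token after `text`.

    Returns (None, 0.0) when the head was never seen in training.
    """
    # The head is the last n-1 characters of the input text.
    head = tuple(text[-(n - 1):]) if n > 1 else ()
    tails = head2tail.get(head)
    if not tails:
        return None, 0.0
    token, count = tails.most_common(1)[0]
    return token, count / sum(tails.values())


# Toy counts, made up for illustration.
toy = {("哈",): Counter({"哈": 3, "！": 1})}
token, prob = most_probable_next(toy, "哈哈", n=2)
print(token, prob)  # 哈 0.75
```

For a 2-gram model only the final character of the input matters, so '哈哈' and '哈' yield the same prediction.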
