This repository implements a brute-force spellchecker utilizing the Damerau-Levenshtein edit distance.

Overview

About spellchecker.py

Implementing a highly-accurate, brute-force, and dynamically programmed spellchecking program that utilizes the Damerau-Levenshtein string metric for measuring edit distance between two sequences of characters.

How to Write Your Own Test Cases

In the lib folder, you will see two different text files called 'candidate_words.txt' and 'incorrect_words.txt':

  • The candidate_words.txt text file can contain an unlimited amount of CORRECTLY spelled words, with each word written on a new line.
  • The incorrect_words.txt text file can contain an unlimited amount of INCORRECTLY spelled words, with each word written on a new line. However, each incorrectly spelled word in this list MUST have its correctly spelled counterpart contained somewhere in the 'candidate_words.txt' text file. It doesn't matter where, since the 'candidate_words.txt' file will be randomly shuffled anyway.

In the test folder, you will see a text file called target_words.txt:

  • The 'target_words.txt' file will contain the CORRECT spelling of each word contained in the 'incorrect_words.txt' text file, with each being on a new line in the same exact order that you inserted their incorrectly spelled counterparts in the 'incorrect_words.txt' text file. It is important that both the incorrectly and correctly spelled words are in the same order to be able to calculate the accuracy of the spell checker.

To view an example on how to create your own test cases, take a look at the files provided in either folder.

How to Run the Program

Enter the folder's directory using your terminal. Then, simply run python3 spellchecker.py

  • The only thing you will need to modify are the files in the lib and test folders if you want to try the program with your own test cases. The program does not need to be touched, unless you'd like to modify the global variable 'THRESHOLD', which is used as the threshold to find an incorrectly spelled word's closest approximation.
  • The incorrectly spelled words in 'incorrect_words.txt' will be run through the program to find its closest lexical match from the candidate_words.txt text file using the Damerau-Levenshtein algorithm.
  • The spellchecked words will then be, in order, cross checked against its intended counterparts in target_words.txt to calculate the overall accuracy of the spellchecking algorithm.

The results of the program will then be printed to your terminal.

Dependencies

Ensure that you have difflib installed for python3.

Final Words

Feel free to use or modify this program for your intended purposes!

Owner
Raihan Ahmed
Pursuing a degree in CS with concentrations in Computer Science, Computer Networks and Security, and Information Technology. Minoring in Linguistics.
Raihan Ahmed
Final Project Bootcamp Zero

The Quest (Pygame) Descripción Este es el repositorio de código The-Quest para el proyecto final Bootcamp Zero de KeepCoding. El juego consiste en la

Seven-z01 1 Mar 02, 2022
CLIPfa: Connecting Farsi Text and Images

CLIPfa: Connecting Farsi Text and Images OpenAI released the paper Learning Transferable Visual Models From Natural Language Supervision in which they

Sajjad Ayoubi 66 Dec 14, 2022
Facilitating the design, comparison and sharing of deep text matching models.

MatchZoo Facilitating the design, comparison and sharing of deep text matching models. MatchZoo 是一个通用的文本匹配工具包,它旨在方便大家快速的实现、比较、以及分享最新的深度文本匹配模型。 🔥 News

Neural Text Matching Community 3.7k Jan 02, 2023
Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.

CTC Decoding Algorithms Update 2021: installable Python package Python implementation of some common Connectionist Temporal Classification (CTC) decod

Harald Scheidl 736 Jan 03, 2023
Ray-based parallel data preprocessing for NLP and ML.

Wrangl Ray-based parallel data preprocessing for NLP and ML. pip install wrangl # for latest pip install git+https://github.com/vzhong/wrangl See exa

Victor Zhong 33 Dec 27, 2022
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Facebook Research 409 Oct 28, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
Understanding the Difficulty of Training Transformers

Admin Understanding the Difficulty of Training Transformers Guided by our analyses, we propose Adaptive Model Initialization (Admin), which successful

Liyuan Liu 300 Dec 29, 2022
APEACH: Attacking Pejorative Expressions with Analysis on Crowd-generated Hate Speech Evaluation Datasets

APEACH - Korean Hate Speech Evaluation Datasets APEACH is the first crowd-generated Korean evaluation dataset for hate speech detection. Sentences of

Kevin-Yang 70 Dec 06, 2022
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Hiring We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-traine

Microsoft 7.8k Jan 09, 2023
CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT

Microsoft 1k Jan 03, 2023
Open Source Neural Machine Translation in PyTorch

OpenNMT-py: Open-Source Neural Machine Translation OpenNMT-py is the PyTorch version of the OpenNMT project, an open-source (MIT) neural machine trans

OpenNMT 5.8k Jan 04, 2023
Client library to download and publish models and other files on the huggingface.co hub

huggingface_hub Client library to download and publish models and other files on the huggingface.co hub Do you have an open source ML library? We're l

Hugging Face 644 Jan 01, 2023
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

张博 1 Feb 02, 2022
Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022)

SyntaxGen Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022) In this repo, we upload all the scripts for this work. Due to siz

Zhuosheng Zhang 3 Jun 13, 2022
Transformation spoken text to written text

Transformation spoken text to written text This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, i

Nguyen Binh 16 Dec 28, 2022
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

41 Jan 03, 2023