Persian Lexicon

This repo uses the Uppsala Persian Corpus (UPC) to construct a lexicon of 70664 unique words. With all the excitement around the game Wordle, we also extracted words of different lengths (2, 3, 4, ..., 10) and stored them in separate files for easier access. Please note that these files might contain offensive words; I have not checked them manually.

GetWords.py can read these files and return words as a list of strings.
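As a rough illustration of what reading one of these files involves, here is a minimal sketch. It assumes each file is UTF-8 text with one word per line; the function name read_words is hypothetical and is not necessarily what GetWords.py exposes.

```python
from pathlib import Path


def read_words(path):
    """Return the non-empty lines of a UTF-8 word file as a list of strings.

    Assumption: the file holds one word per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, read_words("data/persian-words.txt") would load the main lexicon, provided the file follows the one-word-per-line layout assumed here.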

Cleanup details

Main Lexicon

The main lexicon (data/persian-words.txt) is built very liberally; we only filter out words that contain ASCII characters or Arabic numerals.
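The liberal filter described above could be sketched roughly as follows. This is not the repo's actual code: the function name is hypothetical, and "Arabic numerals" is interpreted here as any Unicode decimal digit (which covers Arabic-Indic and Persian digits), which is an assumption.

```python
def keep_in_main_lexicon(word):
    """Return True if the word has no ASCII characters and no decimal digits.

    Assumption: "Arabic numerals" means any Unicode decimal digit,
    e.g. Arabic-Indic and Persian digits as well as 0-9.
    """
    if not word:
        return False
    for ch in word:
        # Reject ASCII letters, digits and punctuation, plus any other decimal digit.
        if ch.isascii() or ch.isdecimal():
            return False
    return True
```

A list of candidate words can then be filtered with a comprehension such as [w for w in words if keep_in_main_lexicon(w)].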

Fixed length Lexicons

More conservative filtering has been applied to files with fixed word length. We drop all words that contain any of the following characters:

After applying these filters, we ended up with the following number of words per file:

2 letter words: 310 unique words
3 letter words: 2378 unique words
4 letter words: 7059 unique words
5 letter words: 10043 unique words
6 letter words: 9541 unique words
7 letter words: 7350 unique words
8 letter words: 4681 unique words
9 letter words: 2529 unique words
10 letter words: 1250 unique words
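To show how such per-length files might be derived from a word list, here is a minimal sketch. It assumes word length is counted in Unicode code points, which may differ from how the repo counts letters (for instance when a word contains a zero-width non-joiner); group_by_length and its parameters are hypothetical names.

```python
from collections import defaultdict


def group_by_length(words, min_len=2, max_len=10):
    """Group unique words by length, keeping lengths in [min_len, max_len].

    Assumption: length is the number of Unicode code points in the word.
    """
    groups = defaultdict(set)
    for word in words:
        if min_len <= len(word) <= max_len:
            groups[len(word)].add(word)  # the set removes duplicates
    # Return plain sorted lists, ordered by word length.
    return {n: sorted(ws) for n, ws in sorted(groups.items())}
```

Calling len() on each returned list gives per-length counts comparable to the table above, under the same length assumption.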

Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

Owner

Saman Vaisipour
