An algorithm that can solve the word puzzle Wordle with an optimal number of guesses on HARD mode.

Overview

WordleSolver

An algorithm that can solve the word puzzle Wordle with an optimal number of guesses on HARD mode.

How to use the program


Copy this project with git clone and run python3 solver.py in the terminal.

When you run the program, the algorithm will provide you with an educated guess. Then, you type the guess into Wordle. Once you get the result of how many letters were right, you input it back into the program and will get another guess back. This process will continue until you have solved the puzzle!

Inputting the result of your guesses is easy. If a character is gray, enter '_', if a character is yellow, enter the lowercase letter, and if a character is green, enter the uppercase letter. For example, if the program told you to guess "aeros" and the result of the guess was:

image

You would enter the result as: __r__

Here is another example:

image

You would enter the result as: DR_k_

How the algorithm works

Here's a quick run-down of how the algorithm works. We keep a list of words that the answer can be and keep removing from the list until only one word remains or we guess the right answer. Each word has a unique number associated with it. We can use this number to quickly determine if a word can be an answer based on the results of other guesses. If a word cannot be the answer, it will be removed from our list. The key to the accuracy and efficiency of this algorithm is how this unique number is generated.

The number is the product of a few prime numbers which lets us use modular arithmetic in a clever way! Each letter will have 6 prime numbers associated with it. One "yellow" number and five "green" numbers. We use the one yellow number when we know a letter is in the word but we don't know where. We use one of the green letters when we know that a letter is in a specific spot. You can see these prime numbers in charDict.json. To actually calculate the number of a word, we multiply all the yellow numbers of the characters that make up the word together as well as certain green numbers. The green number we multiply depends on the position the letter appears. If the letter D appears in the first spot, we multiply by its 1st green number. If it was instead in the last spot of the word, we multiply by its 5th green number. The reason we do this is we can utilize modulo to check if a certain word can be an answer based on the result of another guess. For example, if we guessed "aeros" and the word we were trying to find was "drink", we will find that r is somewhere in the word but not in the third spot. Let us say a word has number n. If n%r's yellow number does not equal 0, then we know that word cannot be zero and we can remove it from the list. Also, if n%r's third green number equals 0, we know that it cannot be the answer because r cannot be in the third spot. Similar logic is applied when multiple letters are yellow or some letters come up green. The value of each word does not change, so we can process this information once and store it in a txt file to be used later which is what I did in wordList.txt! If you would like to use a different set of words than what I used, feel free to change the words.txt file and run process.py to generate a new wordList file.

Optimizations

One way to make the algorithm take fewer guesses is to make smarter guesses. As such, an optimization I decided to make is to take into account letter frequency. Letters that appear more often have lower prime numbers associated with them and also that the word that is guessed always has the smallest number associated with it. Now, the primes associated with each letter aren't just chosen arbitrarily and actually tell us some information. "e" is the most common letter and as such has the six smallest prime numbers. I can sort the wordlist and make the algorithm guess the word with the smallest number. So, our algorithm is more likely to guess a word with "e" in it than "q" since words with "e" will probably be smaller. This is good because "e" is much more likely to be in the word than "q". Also, I only need to sort the list once in process.py so there is no significant performance hit!

A drawback of this approach is that words that are made up of repetitive common letters have very low values and are guessed much more. This is not good because words with repeating letters make it harder to narrow down our potential guesses! For example, consider the word "esses" which is made up of only of the two most common letters. It's good that our guesses consist of letters that are common but it is bad that we only get information about two different letters. The way I fixed this is by multiplying words that have characters repeated two or three times by a much bigger prime number so they are weighed down and guesses less often.

Another optimization I made is taking into account how common a word is. There are a lot of niche words in the list that are very rarely used which are likely not the answer to the puzzle. So, once I've narrowed down the possible words to less than a hundred, it makes sense to guess the more common words first. This is why I introduced a second number that is associated which each word. The second number is the frequency of a word in Wikipedia articles. Once there are less than 100 words in the list, the list is resorted by this second number rather than the first and so each guess will be the most common word remaining!

Further Optimizations

As I mentioned before, one of the optimizations I made was having more common letters correspond with smaller prime numbers and sorting the list of words based on the number associated with each word. This is all done just once for each set of words in process.py and is very computationally efficient. However, if more accuracy is desired, the prime number associated with each letter can be re-generated after each guess because the frequency of each letter is likely to change. This may increase accuracy slightly but will take much longer to process which is why I opted against it. After each guess, I would have to re-check the frequency of each letter, calculate the value of each word, and then resort to the entire list based on this new value.

Sources

  • Wordle is by PowerLanguage
  • List of 5 letter words is based on SOWPODS and was taken from Word Game Dictionary. I suspect that PowerLanguage used the same source for wordle as he used a similar source for another project.
  • The frequency of words was taken from lexepedia with a minimum frequency of 1, length of 5, and only includes Wiktionary Words.
Owner
Akil Selvan Rajendra Janarthanan
yo!
Akil Selvan Rajendra Janarthanan
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.

RoBERTaABSA This repo contains the code for NAACL 2021 paper titled Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoB

106 Nov 28, 2022
A toolkit for document-level event extraction, containing some SOTA model implementations

Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker Source code for ACL-IJCNLP 2021 Long paper: Document-le

84 Dec 15, 2022
A library for Multilingual Unsupervised or Supervised word Embeddings

MUSE: Multilingual Unsupervised and Supervised Embeddings MUSE is a Python library for multilingual word embeddings, whose goal is to provide the comm

Facebook Research 3k Jan 06, 2023
本插件是pcrjjc插件的重置版,可以独立于后端api运行

pcrjjc2 本插件是pcrjjc重置版,不需要使用其他后端api,但是需要自行配置客户端 本项目基于AGPL v3协议开源,由于项目特殊性,禁止基于本项目的任何商业行为 配置方法 环境需求:.net framework 4.5及以上 jre8 别忘了装jre8 别忘了装jre8 别忘了装jre8

132 Dec 26, 2022
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Approximately Correct Machine Intelligence (ACMI) Lab 21 Nov 24, 2022
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding This repository contains the official PyTorch implementation of th

Xiao Xu 26 Dec 14, 2022
Deep Learning Topics with Computer Vision & NLP

Deep learning Udacity Course Deep Learning Topics with Computer Vision & NLP for the AWS Machine Learning Engineer Nanodegree Program Tasks are mostly

Simona Mircheva 1 Jan 20, 2022
BERT score for text generation

BERTScore Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT (ICLR 2020). News: Features to appear in

Tianyi 1k Jan 08, 2023
KoBERT - Korean BERT pre-trained cased (KoBERT)

KoBERT KoBERT Korean BERT pre-trained cased (KoBERT) Why'?' Training Environment Requirements How to install How to use Using with PyTorch Using with

SK T-Brain 1k Jan 02, 2023
C.J. Hutto 3.8k Dec 30, 2022
Edge-Augmented Graph Transformer

Edge-augmented Graph Transformer Introduction This is the official implementation of the Edge-augmented Graph Transformer (EGT) as described in https:

Md Shamim Hussain 21 Dec 14, 2022
2021 2학기 데이터크롤링 기말프로젝트

공지 주제 웹 크롤링을 이용한 취업 공고 스케줄러 스케줄 주제 정하기 코딩하기 핵심 코드 설명 + 피피티 구조 구상 // 12/4 토 피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토 영상 편집 // ~12/11 토 웹크롤러 사람인_평균

Choi Eun Jeong 2 Aug 16, 2022
This repository contains examples of Task-Informed Meta-Learning

Task-Informed Meta-Learning This repository contains examples of Task-Informed Meta-Learning (paper). We consider two tasks: Crop Type Classification

10 Dec 19, 2022
Various capabilities for static malware analysis.

Malchive The malchive serves as a compendium for a variety of capabilities mainly pertaining to malware analysis, such as scripts supporting day to da

MITRE Cybersecurity 64 Nov 22, 2022
Perform sentiment analysis on textual data that people generally post on websites like social networks and movie review sites.

Sentiment Analyzer The goal of this project is to perform sentiment analysis on textual data that people generally post on websites like social networ

Madhusudan.C.S 53 Mar 01, 2022
结巴中文分词

jieba “结巴”中文分词:做最好的 Python 中文分词组件 "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation

Sun Junyi 29.8k Jan 02, 2023
Official code for "Parser-Free Virtual Try-on via Distilling Appearance Flows", CVPR 2021

Parser-Free Virtual Try-on via Distilling Appearance Flows, CVPR 2021 Official code for CVPR 2021 paper 'Parser-Free Virtual Try-on via Distilling App

395 Jan 03, 2023