Wordle solving algorithms using Information Theory Algoritmos de resolución de Wordle usando Teoría de la Información Authors: Ángel Benítez Gómez and Enrique Cavanillas Puga Supervisor: Manuel Núñez García Final Degree Project of the Double Degree in Computer Science Engineering and Mathematics Faculty of Computer Science Universidad Complutense de Madrid Year 2021-2022 I am so clever that sometimes I don’t understand a single word of what I am saying. – Oscar Wilde I II Abstract (English) Wordle is a popular web based game where players must guess five-letter words in six attempts or less. Players are given hints about which letters are incorrect, misplaced or correct, and must use this information to discard and select new candidates for their next guesses. In this project we explore the connection between Information Theory and puzzle solving by using entropy-related concepts and algorithms to solve Wordle. First, we make a practical introduction to Information Theory, entropy and its applications. We define multiple Greedy and Genetic Algorithms and analyze them in order to improve their average scores and reduce their miss rates. We test our algorithms against relaxed versions of the puzzles to estimate a lower bound for the average scores of algorithm-based solutions. Finally, we discuss our results, provide suggestions for future reexaminations and present our candidate for best Wordle starter. Keywords: Information Theory, Wordle, entropy, uncertainty, information, algorithms, genetic algorithms, puzzle solving. III IV Resumen (español) Wordle es un juego de navegador popular en el que los jugadores deben averiguar palabras de cinco letras en seis intentos o menos. Se dan pistas a los jugadores acerca de las letras que son incorrectas, están mal situadas o son correctas, y deben usar esta información para descartar y seleccionar nuevos candidatos para sus siguientes intentos. En este proyecto exploramos la conexión entre la Teoría de la Información y la resolución de puzles mediante conceptos y algoritmos relacionados con la entropía. Primero hacemos una introducción práctica a la Teoría de la Información, la entropía y sus aplicaciones. Definimos múltiples algoritmos voraces y genéticos y los analizamos para mejorar sus puntuaciones medias y reducir sus porcentajes de fallo. Evaluamos nuestros algoritmos frente a versiones relajadas de los puzles para estimar una cota inferior de las puntuaciones medias de soluciones basadas en algoritmos. Finalmente, estudiamos nuestros resultados, sugerimos mejoras para futuras revisiones y presentamos nuestra palabra candidata a mejor apertura del Wordle. Palabras clave: Teoría de la Información, entropía, incertidumbre, información, algo- ritmos, algoritmos genéticos, resolución de puzles. Contents 1 Introduction 1 1.1 Historical background. What is information? . . . . . . . . . . . . . . . . . . 1 1.2 Definition and Properties of Entropy . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Applications of Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Project Management and Scheduling . . . . . . . . . . . . . . . . . . . . . . . 10 2 Using entropy to solve Wordle 13 2.1 Wordle: Rules and Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Information Theory Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Obtaining the patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.2 Precomputation and best starters . . . . . . . . . . . . . . . . . . . . . 
19 2.2.3 Naïve Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.4 Greedy algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.5 Relaxations of the problem . . . . . . . . . . . . . . . . . . . . . . . . 22 3 Using Genetic Algorithms to solve Wordle 25 3.1 Introduction to Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Solution to Wordle with a Genetic Algorithm . . . . . . . . . . . . . . . . . . 31 3.2.1 First Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.2 Genetic Algorithm with Information Theory . . . . . . . . . . . . . . . 37 4 Individual contributions 41 4.1 Contribution of Ángel Benítez Gómez . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Contribution of Enrique Cavanillas Puga . . . . . . . . . . . . . . . . . . . . . 43 5 Results and Conclusions 47 5.1 Information Theory Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.1.1 Naïve Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.1.2 Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 V VI CONTENTS 5.1.3 Relaxations of the problem . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Genetic Information Theory Algorithm . . . . . . . . . . . . . . . . . . . . . . 51 5.3 Comparison to other algorithms and Conclusions . . . . . . . . . . . . . . . . 55 Bibliography 57 Chapter 1 Introduction Information Theory is a field of Mathematics that has made great contributions to the advancement of technology, and still has the potential to achieve major accomplishments in Computer Science, among many other fields. In this project we apply some of the core concepts of Information Theory to an apparently unrelated problem such as puzzle solving. In particular, we study the use of entropy in solving Wordle’s puzzles. The goal of this project is to be a self contained exposition of the algorithms we used, with a detailed and practical introduction to Information Theory. This chapter is an introduction to the key notions of Information Theory that will be used throughout the text and presents our work plan and descriptions of the design methods that we followed during the project. Chapter 2 describes how Wordle works and presents our Information Theory Algorithm. Chapter 3 consists of an introduction to Genetic Algorithms and our approach to solve Wordle using an algorithm of this kind. Chapter 4 describes our individual contributions. Finally, Chapter 5 contains a detailed discussion of the final results of our algorithms and our findings. 1.1 Historical background. What is information? Information Theory is the scientific field that studies information transmission and stor- age, founded by Harry Nyquist, Ralph Hartley and Claude Shannon. The works of Nyquist and Hartley regarding information transmission [1, 2] laid the mathematical groundwork whereupon Shannon would later lay the foundation of Information Theory [3]. Shannon developed a way to quantify information through probability, a measure he would name entropy, by separating information from meaning. To understand Information Theory and the metrics we will use to design our algorithms, 1 2 CHAPTER 1. INTRODUCTION one must first fully understand the concept of entropy and information. When studying information storage, Shannon noticed three fundamental features. • The amount of information in a message cannot be negative. When stripped of mean- ing, a message can only inform. 
• The most unlikely events give the most information when observed. For example, when doctors examine their patients, they look for uncommon characteristics before making any decisions. Unremarkable, very likely features, give little to no information. • The observation of independent events gives the same information as the observation of each event independently. The height and weight of a person will be somewhat related, so measuring one of these features will give us some information about the other. Learning the persons name, however, tells us nothing of their other features. The amount of information h given by an event with probability p ∈ (0, 1] is defined as h(p) := − log p, which trivially satisfies the three features of information we described. Example 1.1. Suppose we flip two coins. We can represent our sample space with 2 bits; each bit represents a coin, 0 represents heads and 1 represents tails. heads tails heads 00 01 tails 10 11 Figure 1.1: Sample space, two coins Suppose we now measure the outcome of the first coin and we observe heads. Now we only need 1 bit to represent what remains of our sample space. heads tails 0 1 Figure 1.2: Sample space, one coin By learning the outcome of the first coin, we gained one bit of information. This coincides exactly with h( 1 2 ) = − log2 1 2 = 1 (bit) Just like log2 n tells us the number of bits that are necessary to store n ∈ Z+ different values, h tells us how many bits of information we gain by observing an event, by measuring the reduction of the sample space in bits. 1.2. DEFINITION AND PROPERTIES OF ENTROPY 3 These concepts would later have a great impact on not just Communication Theory, but on many other scientific fields. Information Theory has made fundamental contributions to Computer Science, Probability and Statistics and Statistical Physics, among others. In the next section, we will define the entropy of a random variable as the expected amount of information given by its support. 1.2 Definition and Properties of Entropy Entropy is the most important metric in Information Theory. It measures the amount of information –sometimes called uncertainty– in a random variable, and it has proven to be very useful in many parts of Computer Science, most notably in Deep Learning algorithms. In this section, we will introduce its basic definition and properties, along with some other noteworthy concepts. We will then show some properties that these metrics hold. The concept of entropy was originally introduced by Claude E. Shannon [3] in his studies of the transmission of information. While there are many definitions of entropy, Shannon’s proposal is the most commonly used. Definition 1.1. Let X be a discrete random variable with alphabet X and probability mass function p(x) = Pr{X = x}, for x ∈ X . Then, the entropy of X is defined as: H(X) = − ∑ x∈X p(x) · log2 p(x) Remark 1.1. We usually choose base 2 for the logarithm because most digital systems are binary. By choosing base 2, we are also choosing the bit as the standard unit of information quantity. Entropy then becomes the average number of bits that are needed to describe a random variable. Remark 1.2. We will use the convention that 0 · log 0 = 0, based on the fact that when x→ 0 then x · log x→ 0 . This follows the idea that if we add elements with null probability to the support of a random variable, the amount of information of the variable does not change. Remark 1.3. Entropy can be understood as the expected amount of information of a random variable. 
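These definitions are straightforward to turn into code. The following is a minimal Python sketch (the helper names information and entropy are ours and are not part of the project code) that reproduces the one bit of information gained in Example 1.1 and evaluates the entropy of Definition 1.1 on a small distribution:

from math import log2

def information(p: float) -> float:
    # Information, in bits, gained by observing an event of probability p: h(p) = -log2(p).
    return -log2(p)

def entropy(probabilities) -> float:
    # Entropy of Definition 1.1, using the convention 0 * log 0 = 0.
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Example 1.1: observing the first coin (probability 1/2) gives exactly one bit.
print(information(1 / 2))        # 1.0
# The two-coin sample space has four equally likely outcomes, i.e. two bits of uncertainty.
print(entropy([1 / 4] * 4))      # 2.0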
Let I(x) := − log2 p(x) for x ∈ X such that p(x) ̸= 0, this function quantifies information of events as we described in Example 1.1. Then H(X) ≡ E[I(X)]. Let us show a simple example of how this definition may be applied when we are given the distribution of a random variable. Example 1.2. Suppose we have a discrete random variable X that can take 16 different values and is uniformly distributed, that is, each element in X has a probability of 1/16. 4 CHAPTER 1. INTRODUCTION Then the entropy of X is: H(X) = − ∑ x∈X p(x) · log2 p(x) = − 16∑ i=1 1 16 · log2 1 16 = log2 16 = 4 This result shows that we would need 4 bits to describe the information of the random variable, which is exactly the number of bits that we would normally use to code the 16 different values that the variable can take. This in fact leads to one of the most impor- tant applications of entropy in Computer Science, which we will study in more detail in Section 1.3. The next definitions take inspiration from Probability Theory. So far, we have only used one random variable in our definitions. Can we describe the information given by two random variables? What about the information given by one random variable knowing the value of a different variable? The answer to both of these questions is yes, by using joint probability and conditional probability. By making parallelisms with probability, we can study the uncertainty of anything that can be described through multiple random variables. Definition 1.2. Let X and Y be a pair of discrete random variables with supports X and Y, respectively, and joint distribution p(x, y) = Pr{X = x, Y = y}, for x ∈ X , y ∈ Y. Then, the joint entropy of (X,Y ) is defined as: H(X,Y ) = − ∑ x∈X ∑ y∈Y p(x, y) · log2 p(x, y) The conditional entropy of (X,Y ) is defined as: H(X|Y ) = ∑ y∈Y p(y) · H(X|Y = y) = − ∑ y∈Y p(y) ∑ x∈X p(x|y) · log2 p(x|y) = − ∑ x∈X ∑ y∈Y p(x, y) · log2 p(x|y) We will now introduce the concept of mutual information, which measures the quantity of information that a random variable contains about another random variable. In other words, it can be seen as the reduction in the uncertainty of a random variable given knowledge of another one. 1.2. DEFINITION AND PROPERTIES OF ENTROPY 5 Definition 1.3. Let X and Y be a pair of discrete random variables with marginal distri- butions p(x) = Pr{X = x}, x ∈ X , and p(y) = Pr{Y = y}, y ∈ Y, respectively, and joint distribution p(x, y) = Pr{X = x, Y = y}, for x ∈ X , y ∈ Y. Then, their mutual information is defined as: I(X;Y ) = ∑ x∈X ∑ y∈Y p(x, y) · log2 p(x, y) p(x) · p(y) To finish this section, we will show some properties of the previous definitions that are commonly used in studies that involve the use of entropy. Lemma 1.1. Let X be any discrete random variable. Then H(X) ≥ 0. Lemma 1.2. Let X and Y be a pair of discrete random variables. Then, I(X;Y ) ≥ 0 and I(X;Y ) = I(Y ;X). Theorem 1.1. Let X be a discrete random variable. Its entropy, H(X) is maximized when X is uniformly distributed. This last result is very important because during many experiments or studies we might not know how our random variables are distributed. In an attempt to solve this problem, investigators follow the Maximum Entropy Formalism, assuming a uniform distribution whenever possible to maximize uncertainty. Example 1.3. Let X be a random variable that takes the value 1 with probability p and 0 with probability 1− p. 
That is, X = { 1 with probability p 0 with probability 1− p Then, the entropy of X, expressed in terms of p is H(X) = −p log2 p− (1− p) log2(1− p) Figure 1.3 shows the graph of the value of the entropy for the different values of p. Con- sonantly with Theorem 1.1, the maximum is reached at p = 1/2, which corresponds to a uniform distribution in X. The proof of the following result can be easily obtained by rearranging terms in Definitions 1.2 - 1.3. Theorem 1.2. Let X and Y be two discrete random variables. Then we have the following results: H(X,Y ) = H(X) +H(Y |X) I(X;Y ) = H(X)−H(X|Y ) H(X)−H(X|Y ) = H(Y )−H(Y |X) I(X;Y ) = H(X) +H(Y )−H(X,Y ) 6 CHAPTER 1. INTRODUCTION 0.1 0.3 0.5 0.7 0.9 0 0.2 0.4 0.6 0.8 1 p H(X) Figure 1.3: Entropy of X for the different values of p Theorem 1.2 gives us an idea of how the different metrics are related to each other. Figure 1.4 places the different metrics in a Venn Diagram, making the relationships between them easier to appreciate. H(X) H(Y ) H(X,Y ) I(X;Y )H(X|Y ) H(Y |X) Figure 1.4: Relationship between entropy and mutual information 1.3 Applications of Entropy In this section, we will present some examples of the uses of entropy. The first example connects entropy and data compression. Suppose two people are communicating digitally using up to eight possible different messages. If we used regular binary coding for these messages, we would need three bits to represent each of them. For instance, we could use the mapping shown in Figure 1.5. This is not the only way we could code the messages, however. An alternative to regular coding is, for instance, Huffman Coding [4], which assigns binary codes to the messages in 1.3. APPLICATIONS OF ENTROPY 7 Message Code Message Code message1 000 message5 100 message2 001 message6 101 message3 010 message7 110 message4 011 message8 111 Figure 1.5: Regular coding for eight messages such a way that the most common ones use less bits than the least common ones. A possible Huffman coding for a random assortment of probabilities for the messages can be seen in Figure 1.6. Message Probability Code Message Probability Code message1 0.323 11 message5 0.025 10101 message2 0.217 01 message6 0.179 00 message3 0.017 101001 message7 0.002 101000 message4 0.124 100 message8 0.113 1011 Figure 1.6: Huffman coding for eight messages The weighted average length of the codes is L = 2.68 bits, which is less than the three bits that we would need regularly. It has been proven that entropy is a lower bound for the average length of any coding. Let X be the random variable over the possible messages from the previous example. The uncertainty of X , however, is H(X) = 2.43, meaning entropy could be used to estimate data compression minimum sizes. This concept is reflected in Shannon’s First Theorem, also known as the Source Coding Theorem. Theorem 1.3 (Shannon’s First Theorem). Let X be a discrete random variable. Consider an optimal code for the possible values of X that has an expected length of L = ∑ x∈X p(x)·l(x), where p(x) = Pr{X = x} and l(x) is the length of the code given to the element x. Then, H(X) ≤ L < H(X) + 1 Some Machine Learning algorithms also take advantage of entropy. Decision trees, for example, are a supervised learning method used to solve classification problems. ID3 is an algorithm invented by Ross Quinlan [5] that generates a decision tree from a data set using entropy. 
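The quantity that ID3 compares across attributes is the conditional entropy of Definition 1.2. As a minimal sketch (our own illustration, not the code used later in this project), it can be estimated from a list of observations as follows; the usage example takes the Outlook and Play Tennis columns of the weather data presented in the worked example below and reproduces the value H(X|O) ≈ 0.693 obtained there:

from collections import Counter
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

def conditional_entropy(pairs):
    # H(X | Y) of Definition 1.2, estimated from a list of (y, x) observations.
    n = len(pairs)
    y_counts = Counter(y for y, _ in pairs)
    result = 0.0
    for y, n_y in y_counts.items():
        x_counts = Counter(x for yy, x in pairs if yy == y)
        result += (n_y / n) * entropy(c / n_y for c in x_counts.values())
    return result

# Outlook and Play Tennis columns of the 14-day weather data shown below (Figure 1.7).
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(conditional_entropy(list(zip(outlook, play))))   # 0.6935..., i.e. the H(X|O) = 0.693 quoted below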
In every iteration, the algorithm calculates the entropy of each attribute and chooses the one with the lowest uncertainty value to make a tree node. To better understand how this algorithm uses entropy, we will give a more specific example. 8 CHAPTER 1. INTRODUCTION Day Outlook Temperature Humidity Windy Play Tennis 1 sunny hot high false no 2 sunny hot high true no 3 overcast hot high false yes 4 rain mild high false yes 5 rain cool normal false yes 6 rain cool normal true no 7 overcast cool normal true yes 8 sunny mild high false no 9 sunny cool normal false yes 10 rain mild normal false yes 11 sunny mild normal true yes 12 overcast mild high true yes 13 overcast hot normal false yes 14 rain mild high true no Figure 1.7: Weather conditions for the last 14 days Example 1.4. We have gathered information about the weather conditions for the last 14 days and determined whether tennis could be played or not. This information is shown in Figure 1.7. Let X be the random variable that describes whether tennis can be played. The algorithm checks which attribute – in this case among Outlook, Temperature, Humidity and Windy– has the lowest entropy in order to make the first node of the tree. Let O, T,H and W be the random variables over each of the attributes, respectively. The algorithm now computes the conditional entropy between X and these variables to choose the first attribute. Using Definition 1.2 and the conditional probabilities from Figure 1.8, we get: H(X|O) = 0.693 H(X|T ) = 0.911 H(X|H) = 0.790 H(X|W ) = 0.891 The algorithm now chooses the attribute that leads to the lowest conditional entropy: Outlook. In other words, choosing Outlook maximizes the mutual information between X and each of the other random variables. The idea behind this is that the first question you make in the decision tree should be the one that gives you the most information, so that you need as few questions as possible to classify a new day. The algorithm would then repeat this process recursively, to generate the rest of the nodes and build the tree. The final result ID3 would produce is shown in Figure 1.9. 1.3. APPLICATIONS OF ENTROPY 9 Outlook Play Tennis Probability Temperature Play Tennis Probability overcast Yes 1 hot Yes 2/4 No 0 No 2/4 rain Yes 3/5 mild Yes 4/6 No 2/5 No 2/6 sunny Yes 2/5 cool Yes 3/4 No 3/5 No 1/4 Humidity Play Tennis Probability Windy Play Tennis Probability normal Yes 6/7 true Yes 3/6 No 1/7 No 3/6 high Yes 3/7 false Yes 6/8 No 4/7 No 2/8 Figure 1.8: Conditional probabilities Outlook Humidity No high Yes normal sunny Yes overcast Windy Yes false No true rain Figure 1.9: Decision tree generated by the ID3 algorithm There are many other examples of how entropy can be used in different scientific fields. Many current investigations use entropy to define new metrics that could potentially have important consequences on their respective fields. For example, with regard to Software Testing, a metric called Squeeziness [6, 7], which is based on the concept of variation of entropy, has been defined to treat and analyze the propagation of errors in a program. Multiple studies regarding Squeeziness have been made already with promising results [8, 9, 10, 11]. Squeeziness, and its relation with Failed Error Propagation, is studied in detail in the Trabajos de Fin de Grado that we are presenting in the Faculty of Mathematics. 10 CHAPTER 1. INTRODUCTION This concludes our introduction to Information Theory, which we have mostly based on T. M. Cover and J. A. 
Thomas’s Elements of Information Theory [12]. In the next chapters we use these concepts to construct Information Theory algorithms with the goal of solving Wordle puzzles. 1.4 Project Management and Scheduling Now we will describe the general organization structure that we followed during the entire project’s duration, emphasizing the Project Management and Scheduling tasks. This project followed a horizontal hierarchy with both members taking the same amount of responsibility and the same authority over work flow decisions and goals. With the help of our supervisor, we discussed the goals for each stage of the project, as well as the resources and dedication that should be directed to each part. An effort was made by both of us to devote the same amount of resources to the elaboration of the project, although individual involvement in each section of the project naturally varies, the extent of which we will discuss in Sections 4.1 and 4.2. This report was entirely written in LATEX and our algorithms were coded in Python 3.9. We used Github in order to organize the different code files of the project because we were familiar with the environment and it eased the coordination between us when uploading or modifying the different versions of our algorithms. The repository1 has three different folders: • FirstExamples: includes some of the first experiments we made. • GeneticAlgorithm: includes the code for our Genetic Algorithms. • InformationTheoryAlgorithm: includes the code for our IT Algorithms. The decision to work on a Final Degree Project related with Information Theory was made in June 2021, when we first contacted our supervisor. Since we are students from the Double Degree in Computer Science Engineering and Mathematics, we needed to elaborate two Final Degree Projects and we both made our choices with Information Theory in mind. The project was officially assigned to us in September 2021. Initially, our project was to be based on the notion of Squeeziness [6] and its adaptation to quantum systems in black-box scenarios. For that reason, during the first semester of this academic year, we started in- vestigating about the core concepts of Information Theory and Quantum Computing. With the start of the second semester, we began writing this report focusing on the Introduction to Information Theory. 1https://github.com/enricava/CustomWordleAlgs https://github.com/enricava/CustomWordleAlgs 1.4. PROJECT MANAGEMENT AND SCHEDULING 11 While we were describing the fundamentals of Information Theory, we wanted to design original examples with real-world applications. At the time, Wordle was very interesting to us and we made several attempts to study it from an Information Theory perspective. After discussing our endeavour with our supervisor, in March 2022, we decided to move away from Quantum Computing and focus entirely on the Information Theory aspect of the project. From that point onward, we focused on designing our Wordle solving algorithms, maintaining constant communication, with continuous revisions and corrections of our work. The last month was mainly focused on revision to ensure the quality of the project. With regard to the process of creating and revising the code, we mainly followed an agile iterative and incremental design method. The process of planning, developing, testing and meeting to interpret the results was very helpful in keeping a good workflow. 
We feel that we made the right choice because we were able to deliver experimental results as soon as new ideas or techniques came to mind, and we felt that we were always making good progress. 12 Chapter 2 Using entropy to solve Wordle Wordle is a simple word game created by Josh Wardle in 2013 and published later in October 2021. This year, its puzzles are all the rage and garner more than three hundred thousand players on a daily basis [13, 14, 15]. The rules of the game are simple and the goal is even simpler. In this chapter we will briefly explain how the game works and we will design an Information Theory algorithm that can solve its puzzles in the fewest number of steps. 2.1 Wordle: Rules and Goal Wordle is not a new game. In fact, Wordle is just a simpler modification of the 1955 head-to-head logic game Jotto [16]. What makes Wordle stand out is its simplicity and widespread competition online. Every day, a new secret word is chosen as the daily puzzle’s solution. Players must follow the rules in order to find out which word it is. The daily solution is randomly chosen from a list of commonly used five-letter words. Our task is to find this secret word in six or less tries. In each attempt, the player is allowed to input a five-letter word and receives clues pointing towards the answer in the form of colors over each input letter. The goal of the game is not only to find the secret word but also to achieve it using as few attempts as possible. The clues are given to the player every time an input is entered, with the only restriction that it has to be a real five-letter word in the same language. We will illustrate each of the rules with an example, so that they become clear for the few who have not heard about this game yet. We encourage the reader to try to solve today’s puzzle after reading the rules, to internalize the game mechanics. For the sake of simplicity, we will use the word EULER as the solution of an example puzzle. 13 14 CHAPTER 2. USING ENTROPY TO SOLVE WORDLE 1. Every letter that is not present in the solution will have its background turn gray. E U L E R C H A I N 2. The background of letters that are present in the solution but are misplaced will become amber. E U L E R R O A R S Note that only the first R has an amber background. This is because the solution itself only has one R. 3. Letters that are both present in the solution and placed in the correct position will get a green background. This rule overrides the previous rules, and therefore duplicates of a correctly placed letter in the input will not receive an amber background unless the letter is also repeated in the solution. E U L E R R U L E R The next example showcases each of the rules in a possible sequence of inputs. We encourage the reader to follow closely if they are unfamiliar with the game mechanics. Example 2.1. The following is a real attempt to solve a daily puzzle, without following any particular algorithm. Note that the player can use words that are not commonly used, such as PALEA, although daily solutions are guaranteed to be somewhat common words. C R A N E P A L E A G R O A N S I N U S S Q U A D The goal of hardcore Wordle players is to guess the solution in the fewest tries – three attempts or less are considered an excellent score– and therefore there is a huge debate regarding the best opening inputs. In the next section we will propose an algorithm based solely on Information Theory concepts in an attempt to produce a game winning algorithm. 
As a side effect of this goal, we will also present our candidate for best opener word when we discuss the results in Section 5.1.

2.2 Information Theory Algorithm

The algorithm we created is based solely on Information Theory concepts and uses entropy to rank words according to how much information they provide. The intuition behind this choice is that entropy is a way to measure how a sample space is partitioned after knowing a piece of information. When a player submits a five-letter word as an attempt, the pattern that results after comparing the submitted word and the solution produces a reduction in the sample space: words that are incompatible with the pattern should be eliminated from it, or should not make it into further attempts to reach the solution. An argument can be made that testing words incompatible with the pattern is a good way to make out a couple of letters that can be present in the answer, which is why this algorithm is not presented as an optimal solution.

Suppose the player submitted the word BLACK and obtained the following answer (B and L gray; A, C and K green):

B L A C K

Then, the remaining 17 compatible words are:

chack crack drack frack kiack knack kyack quack shack smack snack stack swack thack track whack wrack

This reduction of the sample space brings conditional probability to mind: this is what gave us the idea to use Information Theory on Wordle. For any given word, we can assign to each pattern its probability of occurring by counting how many words would be left in the sample space, and assuming a uniform distribution over the set of allowed words. For this particular pattern, its probability is 17 divided by the number of allowed words.

We will identify each of the 3^5 = 243 possible patterns by its ternary representation, where 0 identifies gray, 1 identifies amber and 2 identifies green. The leftmost 'trit' represents the leftmost position in the word. Figure 2.1 presents the relative frequency of each pattern for the word BLACK.

Figure 2.1: Relative pattern frequencies for BLACK (bar plot; patterns in base-10 on the horizontal axis, relative frequency on the vertical axis)

The bar plot of Figure 2.1 is very intuitive: the pattern represented by 0 contains all the words that do not share any letter with BLACK, and it makes sense that it is also the most frequent pattern. The frequencies seem to be almost decreasing, and this is a consequence of the ternary representation we chose, where patterns represented by smaller numbers are produced by the words with the least similarities to BLACK. The appearance of spikes also makes sense given that words are not simply random sequences and, therefore, many words share similarities.

Figure 2.1 is also an interesting example to motivate why we chose entropy as a measure of how good a guess can be. The most improbable patterns are also the ones that give us the most information if they are chosen. For example, if the leftmost pattern (all grays) were to appear after using the word BLACK, we would not learn much about the solution since it is the most probable pattern. On the other hand, if we obtained the rightmost pattern (all greens and least probable) we would obtain all knowledge of the solution.

We need to know every pattern for every allowed word. The list of words that can be used has been made publicly available: it contains 12972 words and only a subset of those can be answers.
This subset of possible solutions can be obtained by inspecting the game’s code as well, but we will only use this knowledge to simulate the games and not as a cheat to reach the solutions in fewer guesses. While we explore this idea further in Section 2.2.5, we believe that our algorithm’s goal should be to play following the same rules as a human player, which entails not knowing which words can be solutions. Before we continue, we must set a boundary for the tools our algorithms can use: we may use the list of allowed guesses in our algorithms because inputting words that do not exist has no impact on the games –failed attempts to produce an allowed guess are not penalized by Wordle’s rules, and the game simply ignores the guesses until a valid one is entered–. By limiting our guesses to the list of allowed words, we are effectively removing the process of ‘producing valid guesses’ without breaking any rules or using knowledge that should be kept secret from the player. On the other hand, our algorithms may not use the knowledge of which words can ultimately be solutions because the player is not meant to be able to tell the difference. We will, however, design our simulation bench to test our algorithms against the list of solutions to be able to accurately benchmark the results. 2.2. INFORMATION THEORY ALGORITHM 17 2.2.1 Obtaining the patterns To find the pattern that we would obtain after submitting word2 when the solution is word1, we first do a green pass over both strings and mark the letters that are correctly positioned. Then, we do an amber pass with the remaining letters to find the ones that are misplaced. The rest of the letters are gray – corresponding to 0 in our ternary representation – and therefore need no further consideration. def pattern(word1 , word2): used1 = [False for _ in range (5)] used2 = [False for _ in range (5)] return ( greenPass(word1 , word2 , used1 , used2) + amberPass(word1 , word2 , used1 , used2) ) Figure 2.2: Pattern function 1 The green pass simply finds out which letters are the same when checking the same position in each word and marks them as used so that they are not checked again in the amber pass. The amber pass then checks for misplaced letters that have not already been previously used, that is, each letter in word1 can only account for one color change according to the rules of the game. def greenPass(word1 , word2 , used1 , used2): result = 0 for i in range (5): if word1[i] == word2[i]: result += 2 * 3**(4-i) used1[i] = used2[i] = True return result def amberPass(word1 , word2 , used1 , used2): result = 0 for (i,c1) in enumerate(word1): for (j,c2) in enumerate(word2): if c1 == c2 and not (used1[i] or used2[j]): result += 3**(4-j) used1[i] = used2[j] = True break return result Figure 2.3: Green and amber passes 1 18 CHAPTER 2. USING ENTROPY TO SOLVE WORDLE The first iterations of our algorithms used the code in Figures 2.2 and 2.3. The compu- tation of the patterns for every pair of words took around 15 minutes. In order to shorten execution time, we replaced loops by vectorized operations taking inspiration from Grant Sanderson’s implementation [17]. By using vectorized operations inside Python’s Numpy module, we can make use of SIMD instructions and decrease execution times substantially. After refactoring our code, we now compute every pattern simultaneously. Let list1 and list2 be lists of words with lengths l1 and l2. 
We will compute the patterns that result after submitting each word in list1 when the solutions are the words in list2. First, we will create an equality matrix with the results of comparing each pair of words in a tensor-like manner by using Numpy’s ufunc.outer() function. The code snippet in Figure 2.4 produces a l1 × l2 × 5 × 5 matrix where: equalities[a,b,i,j] = (list1[a][i] == list2[b][j]). import numpy as np from itertools import product ... equalities = np.zeros((l1 , l2, 5, 5), dtype=bool) for i, j in product(range (5), range (5)): equalities [:,:,i,j] = np.equal.outer(list1[:,i], list2[:,j]) Figure 2.4: Equality matrix Next, we follow the same steps as in Figures 2.2 and 2.3 and make green and amber passes. We will store the results in an l1 × l2 matrix. By exploiting matrix indexing, we are again taking advantage of modern processors’ architectures and speeding up the code. matrix = np.zeros((l1,l2), dtype=np.uint8) #Green pass for i in range (5): # matches[a, b] : words1[a][i] == words2[b][i] matches = equalities [:, :, i, i]. flatten () matrix.flat[matches] += 2 * 3**(4-i) for k in range (5): # Avoid amber pass equalities [:, :, k, i].flat[matches] = False equalities [:, :, i, k].flat[matches] = False Figure 2.5: Green pass 2 2.2. INFORMATION THEORY ALGORITHM 19 # Amber pass for i, j in product(range (5), range (5)): # matches[a, b] : words1[a][i] == words2[b][j] matches = equalities [:, :, i, j]. flatten () matrix.flat[matches] += 3**(4 -i) #Avoid next passes for k in range (5): equalities [:, :, k, j].flat[matches] = False equalities [:, :, i, k].flat[matches] = False Figure 2.6: Amber pass 2 After incorporating the instructions in Figures 2.5 and 2.6, the code for the computation of the patterns saw a significant speed up. 2.2.2 Precomputation and best starters Like we mentioned before, we are using a ternary representation for the patterns. There- fore, instead of storing lists of colours, we store the base-10 number that represents each of the possible 35 patterns. To further optimize storage space, we use Numpy’s uint8 variables for the patterns and we use positions in the list of allowed words to index our matrix. For example, the section of our pattern matrix where the words PIVOT and PIXEL are submitted against the solutions JOLLY, JOLTS and JOLTY is: ... jolly jolls jolty pivot [0,0,0,1,0] [0,0,0,1,1] [0,0,0,1,1] pixel [0,0,0,0,1] [0,0,0,0,1] [0,0,0,0,1] → ... 5585 5586 5587 8294 3 4 4 8295 1 1 1 Let fw,i be the relative frequency of the i-th pattern after submitting the word w. We compute the entropy of w as described in Definition 1.1, using a random variable Xw over the set of patterns with probability distribution pXw(i) = fw,i for 0 ≤ i ≤ 35 − 1. H(Xw) = − 35−1∑ i=0 pXw(i) log2 pXw(i) For example, the entropy of the word BLACK that we used as an example in Figure 2.1 is equal to: H(XBLACK) = 4.140 bits We are ready to rank each word according to its entropy. Figure 2.7 contains the top 10 words sorted by decreasing entropy. 20 CHAPTER 2. USING ENTROPY TO SOLVE WORDLE w H(Xw) TARES 6.1940525443754435 LARES 6.149918742453124 RALES 6.114343099454227 RATES 6.096242642514605 TERAS 6.076619177276175 NARES 6.066830765753897 SOARE 6.061395399096263 TALES 6.054987761401191 REAIS 6.049777632888325 TEARS 6.032338670239807 Figure 2.7: Naïve best starters We call this first list the naïve best starters. According to the list, TARES should be the most informing first word on average. 
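The ranking in Figure 2.7 can be reproduced with a few lines of NumPy. The sketch below assumes that the precomputed pattern matrix of this section is available as the array pattern_matrix, with entry [i, j] holding the pattern obtained when allowed word i is submitted against word j, and that words holds the corresponding list of allowed words (both names are ours):

import numpy as np

def entropies(pattern_matrix: np.ndarray) -> np.ndarray:
    # Entropy H(X_w), in bits, of every row (guess) of the pattern matrix.
    n_solutions = pattern_matrix.shape[1]
    h = np.empty(pattern_matrix.shape[0])
    for i, row in enumerate(pattern_matrix):
        # Relative frequency of each distinct pattern produced by guess i.
        _, counts = np.unique(row, return_counts=True)
        frequencies = counts / n_solutions
        h[i] = -np.dot(frequencies, np.log2(frequencies))
    return h

def rank_starters(words, pattern_matrix, top=10):
    # Words sorted by decreasing entropy: the naive best starters of Figure 2.7.
    h = entropies(pattern_matrix)
    return [(words[i], h[i]) for i in np.argsort(-h)[:top]]

# Usage, assuming the precomputation step has produced `words` and `pattern_matrix`:
#   rank_starters(words, pattern_matrix)   # [('tares', 6.194...), ('lares', 6.149...), ...]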
2.2.3 Naïve Algorithm Our first attempt naively assumes that entropy alone is the best metric for ranking the words. The algorithm starts with a user input opener and obtains a pattern. Then, it removes words from the sample space that are incompatible with the pattern and computes a new ranking for the remaining compatible words. The algorithm picks the best ranked word and repeats these steps until either finding the solution or reaching 9 attempts. def play(solution , allowed=words , starter=’tares ’, mode=’naive ’): remaining_words = np.array(allowed , dtype = ’str’) next = starter for attempt in range (9): pattern = get_pattern(next , solution) if pattern == 242: break # found the solution! remaining_patterns = get_pattern_matrix(next , remaining_words) remaining_words = remaining_words[remaining_patterns.flatten () == pattern] i, next = make_guess(remaining_words , mode) remaining_words = np.delete(remaining_words , i) return attempt Figure 2.8: Structure of the simulations As we mentioned before, the list of all possible solutions is available. So, in order to 2.2. INFORMATION THEORY ALGORITHM 21 test how well the algorithm performs, we simulated each of the 2309 possible games. To verify that the naïve best starters are not necessarily the best starters –or well sorted–, the simulations were run 10 additional times, starting the games with each of the top ten naïve best starters. After the simulations were finished, it became clear that TARES was not the best opener. By following our naïve strategy, TALES produced a lower score on average. We discuss the results of the algorithm in more detail in Section 5.1. The main flaw of this approach is that, while the solutions of the puzzles are guaranteed to be somewhat common words, the algorithm will often favor uncommon words that are more informative over words that could be the solution, resulting in a worse score. 2.2.4 Greedy algorithm The next step is to somehow distinguish common words from uncommon words. A cheat that produces excellent results is to only use words from the list of possible solutions. How- ever, we would like to design an algorithm that can solve Wordle puzzles without knowing which words can be solutions. We use Python’s module wordfreq to obtain the relative frequencies of each of the allowed guesses in the English language. We do not only want to use word frequencies to break ties. In addition, we would like our algorithm to sometimes prioritize guesses that are more common, even if they have a lower entropy. Let G be the set of allowed words and rw be the relative frequency of the word w ∈ G. The greedy algorithm will always choose the word that reaches the following maximum: max w∈G { g(rw) · H(Xw) } where g : R → (0, 1] is a continuous non decreasing function that we must define. In fact, most of the time invested in this adaptation of the naïve algorithm was spent deciding the behaviour of g. For example, when g is the identity function, words such as TARES, with a very high entropy, are never chosen because of their low relative frequency. In this case, rTARES = 7.94 · 10−8. After much consideration, it became clear that the values returned by g should be very close to binary values –that is, close to 0 and 1– so that the least common words have a very low prospect of being chosen and common words have a higher prospect of being chosen. We explicitly exclude 0 to allow cases where the expected amount of information to be gained balances a low relative frequency. 
Thus, with this behaviour we mimic the act of ‘sacrificing’ a guess to learn more information and achieve an overall lower score. Defining g as a step function with values 0 and 1 would be the same as restricting the set of allowed words to a subset, and, as we mentioned before, there are cases where we want 22 CHAPTER 2. USING ENTROPY TO SOLVE WORDLE to be able to use words with very low relative frequency. Instead of arbitrarily determining which words make it into this new subset of words, we use a shifted and adjusted sigmoid function such as: g(rw) = 1 1 + e−10·(rw−0.5) so that the transition between 0 and 1 is smoother and no words are immediately discarded. The algorithm that chooses the next guess can be seen in Figure 2.9. We use the @cache decorator from the functools module to take advantage of memoization and speed up the simulations, and compute the entropy of the words via a Numpy dot product. @cache def make_guess(wordlist , mode): patterns = get_pattern_matrix(wordlist , wordlist) l = len(wordlist) maxe = best = -1 for i,w in enumerate(patterns): # List of distinct patterns and their counts ps, counts = np.unique(w, return_counts=True) e = - np.dot(counts/l, np.log2(counts/l)) if mode == ’greedy1 ’: e *= get_word_priority(wordlist[i]) if e > maxe: maxe , best = e, i return best , wordlist[best] Figure 2.9: Next guess choice function Before advancing to Section 3.1, where we attempt to solve Wordle by using Genetic Algorithms, we will test our algorithms with different ‘relaxed’ versions of the original prob- lem. 2.2.5 Relaxations of the problem Although we have implicitly described the goal of our algorithms, we would like to precisely describe the problem that we are solving. We call the original problem P0. P0 Find the solution to any Wordle puzzle in as few attempts as possible using the words from the list of allowed words. 2.2. INFORMATION THEORY ALGORITHM 23 In order to test whether the Greedy Algorithm could be improved, we decided to modify the Naïve Algorithm so that its goal is to solve P1, a relaxation of P0. P1 Find the solution to any Wordle puzzle in as few attempts as possible using only words from the list of solutions. By removing many words from the set of allowed words, we are also removing uncertainty from the problem. In order to show this, let X and Y be uniformly distributed random variables over the set of allowed words G and the set of Wordle solutions S, respectively. Then: H(X)−H(Y ) = log2 |G| − log2 |S| = log2 12972− log2 2309 = 2.49 (bits) It makes sense that, by reducing the set of allowed words to a fourth of its size, the solutions can be reached in less attempts, especially when there are words in S with similar entropy to those in G. In Figure 2.10 we enumerate the 80 best naïve starters, with entropies between 6.2 and 5.8 bits. Note how there are many solutions –in blue– among the most informing words. 
TARES ARLES LANES SARED LEATS TAROS RAITS SNARE LARES TORES LAERS ROLES NEARS RILES RACES SIREN RALES SALET PARES TEALS TOEAS ARETS RITES PALES RATES AEROS CARES AURES STRAE HARES LARIS TONES TERAS DARES TIRES EARLS RONES TORAS RAPES RINES NARES SANER SAINE TAELS NATES SLATE LEANS LASER SOARE REALS SERAL RAISE EARNS DEARS AEONS DATES TALES LEARS MARES TRIES TASER SOLER DORES TILES REAIS LORES REANS ROTES TOLES GARES RAILS LAIRS TEARS SERAI ALOES SOREL DALES BARES STARE READS Figure 2.10: Solutions among best naïve starters After testing our Naïve Algorithm for the problem P1, we obtained a significantly better score, even better, in fact, than with the Greedy Algorithm and P0. We will discuss this further in Section 5.1, but the reason behind this is that the set S of solutions does not have strictly more common words than any of G \ S, because S has been handpicked without following any particular method. We also decided to test our algorithms for uninterrupted, competitive Wordle. Compet- itive Wordle players keep an average of their accumulated scores in order to compete with each other. Instead of playing singular Wordle puzzles, we decided to test the algorithms for the continued solution of Wordle’s daily puzzles. 24 CHAPTER 2. USING ENTROPY TO SOLVE WORDLE P2 Find the solution to all of Wordle’s daily puzzles in as few attempts as possible using the words from the list of allowed words. The introduction of P2 allows us to make use of a hidden rule of Wordle in our simulations: any daily puzzle can only happen once. Therefore, past solutions will no longer be considered. This relaxation has a peculiarity, which is that the average score depends on the order of the sequence of tested solutions. Finally, just for the sake of curiosity, we also tested our algorithms for a final relaxation of P0, which is a combination of the previous problems P1 and P2: P3 Find the solution to all of Wordle’s daily puzzles in as few attempts as possible using only words from the list of solutions. The simulations of P3 gave us the best scores for both of the algorithms and we consider these a lower bound for the scores that we can achieve without cheating. The results of our simulations for the problems P0, P1, P2 and P3 will be discussed in more detail in Section 5.1. Chapter 3 Using Genetic Algorithms to solve Wordle In this chapter, we will study other ways in which we may find solutions for Wordle. In particular, we will study the use of Genetic Algorithms. The chapter starts with an introduction to Genetic Algorithms, where we will present their general structure and show specific examples. Next, we show and discuss the techniques we used in our algorithm. 3.1 Introduction to Genetic Algorithms Genetic Algorithms [18, 19] are an attempt to mimic the processes that take place in natural evolution, invented by John Holland [20] in the 1960s. They aim to emulate a population’s evolution over time, taking inspiration from natural genetic variation, which involves the combination of chromosomes or their mutation, and natural selection or ‘sur- vival of the fittest’. Evolutionary computation techniques try to convert these properties of natural evolution into algorithms that can find optimal solutions for a class of problems. The task of a Genetic Algorithm is to find an optimal solution among the many possible solutions that may exist for a given problem. This would be done by generating populations with increasingly superior individuals. 
The advantages of Genetic Algorithms over other search algorithms are their simple operators and the very small number of parameters they require in comparison. In fact, the 'only' thing a Genetic Algorithm needs is a function that is able to describe how good or 'fit' a solution is.

The terminology used in Genetic Algorithms comes from the analogy made with genetics. For that reason, we shall start this introduction by defining some fundamental concepts:

• Chromosome: A candidate solution to the problem, often encoded as a string of bits.

• Genes: The parts into which a chromosome is divided. If a chromosome is a bit string, then a gene could be one of those bits, or a group of them if together they represent one aspect of the solution.

• Alleles: The possible values that a gene can take. In the case of bit strings, the alleles are 0 and 1.

• Fitness function: A function that assigns each chromosome a value in terms of how adequate it is. This function helps classify chromosomes heuristically; we say a chromosome is better than others if it has a higher fitness value. Fitness functions help us artificially select candidates for evolution, which is how we emulate natural selection in an attempt to find the optimal solution.

Note that the previous definitions mention the encoding of a possible solution. That is because the first step in a Genetic Algorithm is to encode solutions in a way that makes it easy to work with the operators used. The most common procedure is to use a string of bits to represent a possible solution, but other approaches are possible. For instance, a solution could be encoded as a string of numbers from 0 to 9. Later in this section we will provide an example of a Genetic Algorithm that illustrates these ideas.

Now that we have defined how a solution might be represented, or encoded, we need to perform some operations on our population of chromosomes to produce new generations of individuals. This is the part of Genetic Algorithms that ties in with the interactions that take place during natural evolution. The operators used are the following:

• Selection/Reproduction: The process in which individuals of the population are selected to produce offspring. Individuals with higher fitness values have a higher probability of producing offspring for the next generation. This is the operator that matches the idea of natural selection. One of the most typical and simple ways to implement this operator is to create a biased roulette wheel where each individual has a wheel slot sized proportionally to its fitness value. Let us show an example of this. Suppose our population is composed of the individuals presented in Figure 3.1. The weight of each chromosome is computed by the following expression:

p_i = fitness(x_i) / ∑_{j=1}^{n} fitness(x_j),   i = 1, 2, ..., n,

where {x_1, x_2, ..., x_n} are the individuals of the population. Therefore, for the previous example we would get the weights shown in Figure 3.2.
INTRODUCTION TO GENETIC ALGORITHMS 27 Chromosome Fitness Value x1 1 0 1 0 0 1 0 151 x2 0 1 0 1 1 1 1 245 x3 1 0 1 0 1 0 1 491 x4 1 0 0 0 1 0 1 94 Figure 3.1: Example of a population in a Genetic Algorithm Fitness Value Weight (pi) Cumulative Probability (qi) x1 151 151/981 = 0.154 0.154 x2 245 245/981 = 0.25 0.404 x3 491 491/981 = 0.501 0.905 x4 94 94/981 = 0.095 1 Figure 3.2: Information used to create a biased roulette wheel for a population The corresponding roulette wheel that would be used for this example population is shown in Figure 3.3. A simple method to select individuals from this structure is to generate a random number a between 0 and 1. Then, the individual xi will be selected if qi−1 < a ≤ qi, for 1 ≤ i ≤ n and q0 = 0, where qi are the cumulative probabilities given in Figure 3.2. x1 15.4% x2 25% x3 50.1% x4 9.5% Figure 3.3: Example of a roulette wheel used for the selection operator • Crossover: It is the process in which the selected individuals combine with each other. That is, two individuals combine their genes to produce a new chromosome. The most commonly used method to implement crossover is by dividing the strings that 28 CHAPTER 3. USING GENETIC ALGORITHMS TO SOLVE WORDLE represent individuals into two parts at a random point k ∈ [0, l − 2], where l is the length of the chromosomes, and swapping all the genes between positions k + 1 and l − 1, both included. For instance, consider the chromosomes x1 = 1 0 1 0 0 0 1 and x2 = 0 1 0 1 1 0 0. We would need to generate a random integer between 0 and 5. Suppose we get k = 3, then the offspring generated in this crossover would result from swapping the substrings from positions 4 to 6 between the chromosomes. The result- ing chromosomes would be x′1 = 1 0 1 0 1 0 0 and x′2 = 0 1 0 1 0 0 1. This method for implementing crossover is called 1-point crossover. It is important to note that depending on how the representation of the possible solutions is designed, a crossover might produce a new individual that does not meet the requirements of a solution. In these situations, we would need to define the crossover operator more carefully. Consider the following example: you are tasked to maximize a function f that takes 10 arguments in the form of numbers from 0 to 9, with the restriction that you can only use each number once. Suppose you have the following chromosomes in your population, and they both reach high values for f . Chromosome x1: 1 2 3 4 5 6 7 8 9 0 → f(1, 2, 3, 4, 5, 6, 7, 8, 9, 0) Chromosome x2: 6 5 8 3 9 0 1 7 2 4 → f(6, 5, 8, 3, 9, 0, 1, 7, 2, 4) It would be reasonable to think that by combining x1 and x2 we could maximize f further. If we implement a 1-point crossover like the one we have just described, we could get the following offspring (suppose k = 3): x′1 = 1 2 3 4 9 0 1 7 2 4 x′2 = 6 5 8 3 5 6 7 8 9 0 Clearly, none of the offspring produced fit the problem requirements. It is the task of the crossover operator to handle these situations: either by defining an operation that always generates possible solutions or by rejecting offspring which do not meet the requirements of the problem. • Mutation: It is an occasional random alteration of a gene in a chromosome. Although it could seem counter-intuitive, mutation can be useful to reach possibilities that might otherwise not have been examined with just the use of crossover. Genetic Algorithms usually set a low probability for mutation. When using the bit string representation, mutation often negates a random bit in the string. 
Similarly to the crossover operator, situations in which the new individual is not a possible solution must be taken care of. The only thing that remains to be seen is how the algorithms are initialized and when they stop. Initialization often consists of randomly generating a number of individuals which will be part of the first population. The algorithm will stop after the optimal solution is 3.1. INTRODUCTION TO GENETIC ALGORITHMS 29 Data: N > 0, maxNG ≥ 0, crossoverProb ∈ [0, 1], mutationProb ∈ [0, 1] Result: Best solution found for the problem population← createF irstGeneration(N) numberGens← 0 while not solutionFound and numberGens ̸= maxNG do getF itness(population) nextGeneration← empty for i = 0 to N/2 do parent1, parent2← select(population) r ← randomV alue(0, 1) if r < crossoverProb then child1, child2← crossover(parent1, parent2) else child1, child2← parent1, parent2 end mutation(child1,mutationProb) mutation(child2,mutationProb) add(nextGeneration, child1, child2) end population← nextGeneration numberGens++ end return bestIndividual(population) Algorithm 1: Pseudocode for a Genetic Algorithm found or after a previously set number of generations have been evaluated. In the latter case, the algorithm will end with a final population from which we may choose an individual as the best solution found. The general structure of Genetic Algorithms is given in Algorithm 1. To finish our introduction, we will explore the solution to a problem that can be solved through Genetic Algorithms, emphasizing the importance of the decisions that have to be made to design a Genetic Algorithm. Example 3.1. Suppose we have a device with five switches. For every combination of the switches, the device gives us an output value which we want to maximize; for simplicity we call this function f . Our task here is to find the combination of switches that maximizes the output of f . Note that we do not have any information about f –we can only test f by getting output values for specific combinations of the switches. In order to create a working Genetic Algorithm, the following steps should be taken: 30 CHAPTER 3. USING GENETIC ALGORITHMS TO SOLVE WORDLE • Defining the representation of a solution. We are working with a device that has five switches and each of them has two different positions, on and off. We choose the usual binary representation where 1 represents on and 0 represents off. As there are five switches, we will need a string of five bits to represent each combination of the switches. In other words, a possible solution will be a bit string of length 5. Therefore, using the terminology we have defined, we have: – Chromosome: A bit string of length 5. – Gene: Each bit of the string, which represents a switch. – Allele: 0 or 1 for each gene, which represents a switch being on or off. • Defining the fitness function. Our choice of fitness function should allow us to measure how good a given possible chromosome is. In this example, we can simply use f as our fitness function. • Implementing a selection operator. We will implement this operator with a biased roulette wheel, as we defined it previ- ously in this section. • Implementing a crossover operator. Given two possible solutions, note that a 1-point crossover between them produces chromosomes that still meet the problem’s requirements, that is, they will still be strings composed of only 0’s and 1’s. We choose a probability of 0.7 for a crossover occurring. • Implementing a mutation operator. 
The choice of binary representation allows us to define mutation as the bitwise negation of a random bit in the string. Another possible mutation operator could be defined as swapping two genes in a chromosome. We will give mutation a probability of 0.1.

Now that all of the algorithm's parameters have been set, we will show the first iteration of the execution of our Genetic Algorithm, following the structure of Algorithm 1.

1. Create the first generation. We will randomly generate n = 4 (small by Genetic Algorithm standards) bit strings of length 5. The first population is made up of the individuals shown in the next table:

    Chromosome   Fitness Value       Chromosome   Fitness Value
    0 0 0 0 1    5                   1 1 1 0 1    136
    1 0 1 0 1    24                  0 1 1 0 0    37

2. Selection. First, the results of implementing the roulette wheel are shown below (the total fitness of the population is 5 + 24 + 136 + 37 = 202):

    Chromosome   Weight (p_i)      Cumulative Probability (q_i)
    0 0 0 0 1    5/202 = 0.025     0.025
    1 0 1 0 1    24/202 = 0.119    0.144
    1 1 1 0 1    136/202 = 0.673   0.817
    0 1 1 0 0    37/202 = 0.183    1

Now, we generate two random numbers between 0 and 1: r1 = 0.8887 and r2 = 0.1801. The wheel's design means parent1 = 0 1 1 0 0 and parent2 = 1 1 1 0 1 are chosen as parents for a crossover.

3. Crossover. We generate a random number to check whether we should apply the operator or not. We get r = 0.2307 < 0.7, which means a 1-point crossover will happen, and k = 1 is randomly selected as the position for the crossover. The offspring are child1 = 0 1 1 0 1 and child2 = 1 1 1 0 0.

4. Mutation. We generate two random numbers that will determine whether a mutation takes place in each of the offspring. This can be seen below:

    r1 = 0.0415 < 0.1 → k = 3. Negate child1[3] → child1 mutates to 0 1 1 1 1
    r2 = 0.1471 > 0.1 → No mutation → child2 stays the same

5. The chromosomes generated by the crossover and mutation operators are then added to the next generation, and steps 2-5 are repeated until the maximum number of generations is reached.

Once the maximum number of generations have been evaluated and a final population is reached, the algorithm chooses the most fit individual from the final population as the solution to the problem.

3.2 Solution to Wordle with a Genetic Algorithm

From what we saw in the previous section, we could say that the idea behind Genetic Algorithms is to have an initial collection of possible solutions that, after combining with each other, lead to better and better solutions until the best one is found. This idea seemed very similar to how Wordle works in the sense that one starts with a word and tries to ‘improve’ guesses for the next attempts until the solution is found. This is what gave us the motivation to try to solve Wordle with a Genetic Algorithm. In this section, we will explain how we approached the problem to design a Genetic Algorithm for the game.

3.2.1 First Genetic Algorithm

Our first approach was to design our algorithm by following the exact structure of Genetic Algorithms, while using our knowledge of Information Theory. However, we encountered several problems in our endeavour to adapt some of the ideas that characterize these algorithms.

The first obstacle we encountered was the choice and behaviour of the fitness function. In order to determine the fitness value of a guess, we have to be able to measure how close it is to the solution of the puzzle. The fitness function is usually a ‘fixed’ function in the sense that it behaves the same way for every generation.
In this case, depending on what we know about the solution, the next guesses will have a varying degree of compatibility. For example, if we know that the solution has a letter B in the first position, the words BLACK and BLOCK could have very similar fitness values. If we then learn that the solution also contains a letter A, then BLACK should have a higher fitness value than BLOCK.

The simplest fitness function would be one that directly compares every individual with the solution, but we decided to dismiss this idea because our algorithm must abide by the rules of the game, and therefore should only compare guesses with the solution 6 times. Since the objective is to make as few guesses as possible, we decided the algorithm should only attempt to guess the word once for every generation. Our fitness function would have to take into account the letter matching of a word with the information that we have obtained about the solution. Entropy would then be used to ‘break ties’ between words that share similarities with the solution. The formula that we reached for the function is the following:

    fitness(w) = H(X_w) + 5·G(w) + 2·A(w) − 2·N(w),

where G(w) is the number of correctly placed letters in w, A(w) is the number of letters that we know are present in both w and the solution but are misplaced in w, and N(w) is the number of letters of w that are not contained in the solution. Note that we do not directly learn this information by using the solution, but with the data that we have acquired in the previous attempts at guessing the solution.

Suppose the information we have about the solution is that its first letter is a P, that it contains an R, but not in the third position, and that it does not contain the letters A or K. The algorithm will store this information as:

    P - R - -        gray = {A, K}

The fitness values for PREYS and SPACE are:

    fitness(PREYS) = H(X_PREYS) + 5·1 + 2·1 = 5.10480 + 7 = 12.10480
    fitness(SPACE) = H(X_SPACE) − 2·1 = 5.29281 − 2 = 3.29281

The code in Figure 3.4 shows the implementation of the fitness function. The pattern variable stores the ternary representation of the colors obtained for each of the letters of the previous attempt, while guessed stores the actual letters that have been guessed as green or amber. For example, if pattern[i] = 2 and guessed[i] = ‘b’, then the solution has a letter B in the i-th position. The letters that have been marked as incorrect are stored in gray. After this, the function computes the fitness value of a word w as we explained before. If the algorithm has no information about the solution, then its fitness will just be the entropy of the word.

    def fitness(w, guessed, pattern, gray):
        f = get_entropy(get_word_position(w))
        if guessed != '-----':  # '-----' means no information yet
            for i in range(5):
                if pattern[i] == 2 and guessed[i] == w[i]:
                    f += 5
                elif pattern[i] == 1 and guessed[i] != w[i] and guessed[i] in w:
                    f += 2
                if w[i] in gray:
                    f -= 2
        return f

Figure 3.4: Fitness function for the initial Genetic Algorithm

We mentioned that we would only attempt to guess the solution once every generation. Let us discuss how we implemented this step. The usual procedure in a Genetic Algorithm is to take two individuals from the population (selection operator) and combine them (crossover operator) to produce new individuals. This is the way the algorithm improves the population, by selecting solutions to generate offspring that could potentially be better fit than their parents.
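Before moving on to how attempts are made, the following is a short, self-contained replay of the PREYS/SPACE example above. It applies the same scoring rules as Figure 3.4, but the entropies are hard-coded stand-ins for the values quoted in the example, since in our implementation they come from precomputed data.

    # Illustrative replay of the PREYS/SPACE example; entropies are hard-coded
    # here, whereas the real fitness function reads them from precomputed data.
    entropy = {'preys': 5.10480, 'space': 5.29281}

    def fitness_example(w, guessed, pattern, gray):
        f = entropy[w]
        for i in range(5):
            if pattern[i] == 2 and guessed[i] == w[i]:
                f += 5   # correctly placed (green) letter
            elif pattern[i] == 1 and guessed[i] != w[i] and guessed[i] in w:
                f += 2   # misplaced (amber) letter present in the word
            if w[i] in gray:
                f -= 2   # letter known not to be in the solution
        return f

    guessed = 'p-r--'           # P confirmed in the first position, R seen as amber
    pattern = [2, 0, 1, 0, 0]   # 2 = green, 1 = amber, 0 = no information
    gray = {'a', 'k'}

    print(fitness_example('preys', guessed, pattern, gray))  # ≈ 12.1048
    print(fitness_example('space', guessed, pattern, gray))  # ≈ 3.2928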
As we mentioned before, the way Wordle players gain information is by submitting a word and receiving feedback for the comparison of said word with the solution. At first glance, crossover seems difficult to implement because, the vast majority of the time, combining the letters of two words will not result in an actual word –remember that Wordle only accepts real words. Our idea for initial tests was to define a selection operator in a way that it returned the ‘best’ individual of the population, which would be used as an attempt to guess the solution. The code for this operator is shown in Figure 3.5.

    def selection(population, guessed, pattern, gray):
        bestfit = fitness(population[0], guessed, pattern, gray)
        bestword = population[0]
        for i in range(1, len(population)):
            fit = fitness(population[i], guessed, pattern, gray)
            if fit >= bestfit:
                bestfit = fit
                bestword = population[i]
        return bestword

Figure 3.5: Selection operator for the initial Genetic Algorithm

Using the word obtained in the selection operator, we get a color pattern that can be used in the crossover operator to generate new words that match the overall information that we have about the solution. The code of the function that is in charge of doing this can be seen in Figure 3.6. Instead of giving crossover a probability of occurring, we decided that it should always take place. Figure 3.7 shows the implementation of the crossover operator.

    def attempt(best, solution, guessed, colors, gray):
        newpattern = get_pattern(get_word_position(best), get_word_position(solution))
        for i in range(5):
            if newpattern[i] > colors[i]:
                colors[i] = newpattern[i]
                guessed[i] = best[i]
            if newpattern[i] == 0:
                gray.add(best[i])
        return guessed, colors, gray

Figure 3.6: Function that simulates an attempt of guessing the solution

Mutation remains mostly unchanged from how it works in regular Genetic Algorithms, as it can be implemented by simply changing a letter in a word, making sure that the result is an allowed word. We will have to handle the cases in which a word cannot mutate. For example, the word IVORY will not produce a new word with the mutation operator because there are no other words that match the patterns -VORY, I-ORY, IV-RY, IVO-Y or IVOR-. Figure 3.8 shows the code used for the mutation operator.

To show how the crossover and mutation operators work, Figure 3.9 is an example of offspring produced by this algorithm and examples of mutations of the word BLACK. Note that, for this specific example, crossover guarantees that the chromosomes have a B in the first position and an A in a position that is not the third one –since it was shown as amber. If any of these words include letters that we know to be gray, their fitness will be lower.

In our first experiments, the algorithm took 5 attempts on average to reach the solution.
    def crossover(guessed, colors, attempts):
        found = False
        while not found:
            k = rnd.randint(0, len(words) - 1)
            word = words[k]
            if not word in attempts:
                found = True
                for i in range(5):
                    if colors[i] == 2:
                        found = found and word[i] == guessed[i]
                    elif colors[i] == 1:
                        found = found and word[i] != guessed[i]
                        found = found and guessed[i] in word
        return word

Figure 3.7: Crossover operator for the initial Genetic Algorithm

    def mutation(child, attempts, maxTries=100):
        found = False
        tries = 0
        while not found and tries < maxTries:
            w = list(child)
            i = rnd.randint(0, 4)
            w[i] = chr(rnd.randrange(ord('a'), ord('z')))
            new_word = ''.join(map(str, w))
            k = get_word_position(new_word)
            # Check if the new word exists
            found = k > 0 and k < len(words) and words[k] == new_word
            found = found and new_word != child
            found = found and not new_word in attempts
            tries += 1
        if found:
            return words[k]
        else:
            return child

Figure 3.8: Mutation operator for the initial Genetic Algorithm

    Pattern B - A - - : BULLA, BANJO, BARDS, BELAR, BACKS, BALKY, BIVIA
    Mutations of BLACK: BLOCK, FLACK, SLACK, BLANK, CLACK

Figure 3.9: Example of results given by the crossover and mutation operators

Words with high entropy would be reached quicker, within 2 or 3 tries, but some words took more than 15 attempts. This is because many of the operations that take place in Genetic Algorithms depend on random decisions, and the algorithm could be unlucky in obtaining new, better words. We observed that the number of attempts used by the algorithm highly depends on the initial population and the first choices. If the initial words are not similar to the solution, the algorithm might have to use 2 or 3 attempts to finally learn something about the solution. On the contrary, if the algorithm starts with a guess that is very similar to the solution, it sometimes reaches the solution very quickly. For example, while attempting to reach the solution MERRY, we obtained the following sequence of words:

    Solution: MERRY        Guesses: ORLES, MAIRE, MERRY

By learning that the letters E and R were misplaced in ORLES, the word MAIRE was randomly generated. Immediately after, MERRY was guessed. This example shows how dependent the algorithm is on chance, as simulations for the word MERRY gave an average number of attempts greater than 5.

An example of a very bad guess from this algorithm was found for the word FIZZY, which took 16 tries. In particular, the algorithm got the sequence CASTE, BARON, KLIEG, PICUL, MIKED, NIFTY, JIFFY, WITHY, WINEY, SIXTY, DIKEY, TIZZY, DIZZY, MIZZY, BIZZY and, finally, FIZZY. Not only can we see that the algorithm tried many words ending in IZZY, but it also took many tries until it finally learnt any information about the solution.

Although Genetic Algorithms usually take a little bit of time to find solutions, we decided to look for an improvement in the efficiency of this first approach, as we felt that some simulations took more time than they should. This is because of the way some of the operators were defined. Take the crossover operator, for example. We take a random word from the list of allowed words and check if it respects the information that we have already gathered. If we are lucky, the algorithm might take a word that fulfills all the current requirements for the solution, but it might have to repeat this process of checking the requirements for a lot of words until it finally finds a valid one.
This complicated running simulations and making comparisons to the previous algorithms, because sometimes a single simulation for a word took around 1 minute, and a simulation for all the possible solutions took more than an hour. Results obtained with this initial algorithm are shown in Chapter 5.

The huge gap in execution time that this algorithm presented compared to the other ones that we had studied made us reconsider the decisions that we had taken for this algorithm, so for a better iteration of a Genetic Algorithm we decided to take a different approach. Instead of trying to build a Genetic Algorithm from scratch, we would use the information and concepts used in the Information Theory Algorithm presented in Section 2.2, which we had already observed provides good results, and adapt them into the structure of a Genetic Algorithm to see if it could be improved with that. The next section shows the new algorithm that we designed.

3.2.2 Genetic Algorithm with Information Theory

As we explained in the previous section, our goal for this final algorithm was to use all the information that we had gathered in the Information Theory Algorithm described in Section 2.2 and adapt it into the structure of a Genetic Algorithm to see if its performance improves with it. In this section we will provide the details of how the algorithm was defined, explaining each of the operations used in the final Genetic Algorithm.

• Defining the representation of a solution.
For this algorithm, we decided to treat the solutions as integers, which represent a word from the allowed list. That is, the ‘solution’ i represents the i-th word from the allowed list. We decided to do this because, for many operations, we do not need to check the letters of the actual word like we did in our first Genetic Algorithm. Every time we need information relative to a word, we can easily access it in the precomputed data using the integer that represents the word.

• Defining the fitness function.
As we observed that entropy gives positive results for choosing which words to use for attempts in Wordle, we decided to define the fitness value of a word as the product of its entropy and its frequency, which are both precomputed. The use of the frequency of words is due to the improvement that it showed for the Information Theory Algorithm.

• Implementing the selection operator.
This time, we decided to implement the selection operator with a biased roulette wheel. We thought this should help in situations where the algorithm has to choose among very similar words. If we recall the example with the word MERRY, the initial algorithms would choose uncommon words like DERRY or HERRY before guessing the right word. Using the biased roulette wheel, as MERRY is a more usual word, it would have a higher chance of being chosen as an attempt to guess the solution in the game. The code that we made for this implementation can be seen in Figure 3.10.
    def selection(population):
        totalFitness = 0
        individuals = []
        for w in population:
            f = fitness(w)
            totalFitness += f
            individuals.append([w, f])
        probs = np.empty(len(individuals))
        probs[0] = individuals[0][1] / totalFitness
        for i in range(1, len(individuals)):
            probs[i] = probs[i-1] + individuals[i][1] / totalFitness
        r = rnd.uniform(0, 1)
        p = np.searchsorted(probs, r)
        return individuals[p][0]

Figure 3.10: Selection operator for the Genetic Information Theory Algorithm

• Implementing the crossover operator.
This operator will stay similar to how it was in our first Genetic Algorithm. The change that we added is that, once we get the pattern for the word obtained in selection and before executing crossover, we reduce the space of possible solutions to keep the words that match the pattern that we received, just like the Information Theory Algorithm does. At that point, the crossover operator gets a random word from the list of words that remain after the reduction (a brief illustrative sketch of this fitness and reduction step is given at the end of this section).

• Implementing the mutation operator.
Mutation functions like it did in our first Genetic Algorithm. Given a word, it randomly changes one of its letters, making sure the result is a word accepted in Wordle.

The final algorithm that we implemented follows the structure shown in Algorithm 2.

    Data: N > 0, mutationProb ∈ [0, 1]
    Result: List of attempts made until the solution is found
    population ← createFirstGeneration(N)
    remainingWords ← allowedWords
    attempts ← empty
    while not solutionFound do
        best ← selection(population)
        pattern ← color pattern with the solution
        add(attempts, best)
        solutionFound ← best = solution
        update remainingWords
        nextGeneration ← empty
        for i = 0 to N do
            child ← crossover(remainingWords)
            r ← randomValue(0, 1)
            if r < mutationProb then
                child ← mutation(child)
            end
            add(nextGeneration, child)
        end
        population ← nextGeneration
    end
    return attempts

Algorithm 2: Pseudocode for our Genetic Information Theory Algorithm

Next we show some examples of good executions of the algorithm to guess the word MERRY:

    TAPAS → LOUND → MICHE → MERRY
    FLEXO → MESIC → MERRY
    CHARK → SWORN → MITRE → MERRY

We can appreciate, like in our first Genetic Algorithm, that although some similarities can be observed in the obtained solutions, the algorithm can be very lucky and guess the solution in very few steps without having a lot of information about the solution. Unfortunately, it can also be very unlucky and encounter situations in which it tries, for example, the words DERRY, HERRY, KERRY or TERRY before it finally guesses the correct word.

In conclusion, we think that this algorithm represents an improvement with respect to the previous Genetic Algorithm. In the first experiments, this algorithm seemed to use fewer attempts to guess the solution, and it was more uncommon to see situations in which it needed more than 10 tries to guess the word. In terms of time, it also has much shorter execution times, closer to the Information Theory Algorithm's execution times –simulations for all the possible solutions took around a minute. At this point, we can say that with this algorithm we have gained more evidence that entropy is a good metric for dealing with puzzles like Wordle. The results observed for the algorithm will be studied in more detail in Chapter 5, where we will present how this algorithm works depending on the values of its parameters and give some ideas on how this algorithm could be improved in future revisions.
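As referenced above, the following is a small, hedged sketch of how the fitness value and the reduction of the candidate list for this final algorithm can be expressed. The names entropies, frequencies and get_pattern are stand-ins for the precomputed data and the pattern function of the Information Theory Algorithm, so this is an illustration of the idea rather than our exact implementation.

    import random as rnd

    # entropies, frequencies and get_pattern stand in for the precomputed data
    # and the pattern function described in Section 2.2.

    def fitness(word_index, entropies, frequencies):
        # Fitness of a word: the product of its precomputed entropy and frequency.
        return entropies[word_index] * frequencies[word_index]

    def reduce_candidates(remaining, guess, observed_pattern, get_pattern):
        # Keep only the words that are compatible with the color pattern received.
        return [w for w in remaining if get_pattern(guess, w) == observed_pattern]

    def crossover(remaining):
        # Crossover in the final algorithm: a random word from the reduced list.
        return rnd.choice(remaining)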
Chapter 4

Individual contributions

This chapter presents a description of the individual contributions made to the project by each of the members.

4.1 Contribution of Ángel Benítez Gómez

The first contact made with our supervisor goes back to June of 2021, after the list of available Final Degree Projects was published. An important factor that led us to choosing our project was the fact that you could combine the Computer Science Engineering and Mathematics Final Degree Projects. In other words, the same topic could be developed, following different directions, for both faculties and, for that reason, there are similarities between this project and the one for the Degree in Mathematics. These similar features especially relate to the general introduction to Information Theory that both projects have. It is important to note that during the entire elaboration of this project there has been constant communication between the two members, and it would be difficult to say that a particular part of the project was made by only one of them.

After the first meeting with our supervisor, we decided to start doing some research about notions of Information Theory and Quantum Computing, which were the initial topics that the project revolved around. Aside from some material given to us during the years of the degree, the main source that I used to study and research Information Theory was the book Elements of Information Theory [12], by T. M. Cover and J. A. Thomas. As for Quantum Computing, I did some general research on the key concepts and initially used IBM's Quantum Lab to get familiar with quantum circuits, which would have been needed for the project.

With the start of the second semester, the topic of the project for Computer Science Engineering changed slightly, leaving out the part related to Quantum Computing and entirely focusing on Information Theory. This is the moment when Enrique and I started to organize and plan the steps that we would follow and settled on the idea that we would try to communicate with each other as much as possible, so that everything done in this project had insight from both members. The first thing we wanted to do was to determine how the introduction to Information Theory would be made in this project, for which I gathered more information about the applications that entropy has in other scientific fields, including Deep Learning and Data Compression.

The meeting with our supervisor that led to the change of the topic of the project helped us come up with the idea of using some sort of Genetic Algorithm combined with concepts and ideas from Information Theory to solve Wordle. For that reason, I started doing more thorough research on Genetic Algorithms. We had seen an introduction to them in the subject of Artificial Intelligence that is taught in the degree. Using the material from this subject and more specific books gave me a better understanding of Genetic Algorithms, and I came up with several ideas of how they could be adapted to solve Wordle. The main books that I used to gather information were An introduction to genetic algorithms [18], by M. Mitchell, and Genetic Algorithms in Search, Optimisation and Machine Learning [19], by D. E. Goldberg, which also helped me elaborate the examples related to Genetic Algorithms that are presented in this project. At this point, we started to code our first versions of the algorithms.
We continuously talked about the ideas and possible implementations of the algorithms and, as I had made a slightly bigger effort in researching Genetic Algorithms, we decided that I should be in charge of programming the Genetic Algorithm version of Wordle's solver. Once the first versions of the algorithms were finished, we gave each other insight about possible improvements and future directions for them. Because the Information Theory Algorithm gave better results than the Genetic Algorithm, we decided to make an algorithm that would adapt all the ideas from the Information Theory Algorithm into a genetic structure. Once again, we decided I should be in charge of that, and so I had more impact on the Genetic Information Theory Algorithm.

Once the algorithms were completed and after the various meetings with our supervisor, we decided that our next step would be to evaluate how the algorithms worked under certain conditions or relaxations. Following the general structure that we had used to divide our tasks, my workmate focused on studying the final versions and results for the Information Theory Algorithm and I centered my attention on the corresponding work for the Genetic Algorithms.

The final goal that we had with our project was to compare it to other algorithms that people had made to solve Wordle. Although we had already observed how some of the already existing algorithms worked, we both did some more research and shared our findings in a joint session, which served to write the final results section together.

In regard to writing this report, the organization that we followed ties in with how we worked in the rest of the stages of the project. We did not ‘assign’ particular sections to any of the members; instead, we freely decided to write about the advances that we had made as they were happening. After the final topic of the project was changed, we brainstormed about how each of us pictured the project in their minds and the potential investigations that we considered interesting. This discussion led to a ‘final’ structure of the project that we used for this report. The fact that each of us programmed different algorithms made it so that each would have more control over the sections that involved them. That means that Sections 3.2 and 5.2 were initially developed entirely by me and were later revised by Enrique. As I mentioned, the rest of the sections were made collaboratively, so that I revised what Enrique wrote and he did the same with the parts that I wrote, making it so that the final text has been revised and accepted by both members.

To conclude, I would like to give special thanks to my workmate Enrique, as I consider his work has had a great impact on this project and his involvement was excellent. I would also like to thank our supervisor for his help and support during the entire development of this project. Some of the ideas that we thought about could unfortunately not be carried out in the time that we had to make this project and, hopefully, they can be worked upon in the future.
After researching more about the topics mentioned in the project's description and an exchange of emails with our supervisor, it was clear to me that this assignment had a lot of potential. Ángel and I have a very fruitful record of working together on assignments throughout the degree, so the decision to work on this jointly was natural when he also showed interest in the project.

On my part, after meeting with the project's supervisor and setting an initial course for the project, I started studying Information Theory and Quantum Computing through multiple sources, and often discussed my findings and doubts with Ángel and our supervisor. Information Theory was not exactly new to me –entropy was a concept that we had learnt in Artificial Intelligence I and II– but it was clear that we had barely scratched the surface of what it had to offer. To properly prepare for the project, I studied T. M. Cover and J. A. Thomas's introduction to Information Theory [12], and subsequently carefully examined previous research that related Squeeziness with Software Testing [6, 9, 11]. At the same time, I familiarized myself with Quantum Computing; I studied the basics mainly through other Final Degree Projects such as L. A. Galindo and J. Pellejero's Computación cuántica: pruebas de mutación [21], and tinkered generously with IBM's Quantum Lab, Q# and Qiskit.

It was after reading an extensive collection of papers on Information Theory that I realized that very little effort had been made by the authors to carefully explain the concept of information, most likely because it came naturally to experienced researchers and the extent of their papers is usually limited. After reading Shannon's original article on entropy, A Mathematical Theory of Communication [3], I decided to thoroughly explain his method for designing the formula for entropy, and determined to use many examples to explain the concept before moving on to Squeeziness and Quantum Computing.

While designing an example of the applications of Information Theory, I tried to connect entropy with Wordle, and started to code what would eventually become the first iteration of the Information Theory Algorithm. I met with Ángel and Dr. Núñez and commented that this example of the practicality of entropy could be a worthwhile experiment, and eventually we decided to pursue the goal of solving Wordle through Information Theory. Ángel and Dr. Núñez had the idea of designing a Genetic Algorithm that used Information Theory to solve some of the shortcomings of the original algorithm. At the same time, there were some obstacles that had to be overcome before our algorithms could be extended to simulate puzzles, mainly the execution times of the entropy calculations and the restructuring of the sample space after every guess made by the player. At this point, Ángel started to work on the structure of the Genetic Algorithm and I undertook the task of fixing the code's slow execution.

After restructuring some parts of the code, some for loops were still giving us slow execution speeds, so I investigated other attempts to solve Wordle. Grant Sanderson [17] mentioned the use of vectorization in the calculation of the patterns, so I did some research on SIMD instructions in Python, mainly by reading the documentation of the NumPy module [22]. Once Ángel and I had set the foundation of both algorithms, we started working together again by giving each other pointers on how to improve the code, the scores, and the execution times.
By working together and experiencing the differences in the answers given by both of our algorithms, we gained insight into how to improve all of our work. Since the Information Theory Algorithm was ahead in terms of average scores, it allowed me to tinker with its parameters and make many modifications in search of even better scores, like introducing word frequencies when looking for the next guesses. Our open communication and the freedom we had when working separately allowed us to produce better results after every meeting, and eventually we set out to find a practical lower bound for the average scores of our algorithms. I came up with some relaxations of the original problem that could perhaps point us in the right direction and modified our test bench accordingly.

When it came to writing the report, we both took the same amount of responsibility and distributed the sections among ourselves with no particular structure. Since I was more involved with the Information Theory Algorithm, I took on the task of writing the draft for Sections 2.2 and 5.1, and took the responsibility of revising and editing Ángel's work as well. The rest of the report was actively written and edited by both members equally.

Ángel and our supervisor's involvement in the project was invaluable; Ángel had excellent ideas and the resolve to always see them through. He also did not complain at all any time I pointed out our algorithms had to be completely redesigned to abide by the rules or to shave a couple of seconds off the execution time. Working with Manuel Núñez allowed us to see the inside of a researcher's mind, and he always had relevant suggestions and pointers for each of my qualms.

Chapter 5

Results and Conclusions

To conclude this project, this final chapter describes the results that we obtained for our different algorithms in full detail. We also compare these results with other algorithms that have been designed to solve Wordle. Finally, we present some thoughts and reflections on our findings.

5.1 Information Theory Algorithm

5.1.1 Naïve Algorithm

    [Bar chart of score frequencies. Starter = TARES. Score = 4.227]

Figure 5.1: Simulation results of the Naïve Algorithm

With the initial naïve strategy, we achieved an average score of 4.227 attempts after simulating games for each of the 2309 possible solutions using TARES as starter. As shown in Figure 5.1, 4.07% of the solutions could not be reached within 6 tries and required 7 or more attempts.

These results are not outstanding. In fact, the average score achieved by this first algorithm is on par with the average score for human players. As we mentioned in Section 2.2.3, one of the most likely reasons for this is that we are assuming a uniform distribution over the set of allowed words, which is a mistake. Since we know that solutions will always be common words, the algorithm should also use as many common words as possible when attempting to reach the solution. For example, when the solution is BLESS, the following sequence of attempts is made by the algorithm:

    Solution: BLESS        Guesses: TARES, DEILS, BLEYS, BLEBS, BLESS

The algorithm does not discern between common and uncommon words, which works well enough until a lot of information has been gained. Once the set of compatible words is very small, choices like BLEBS are made over BLESS, leading to unnecessary attempts.
The most notable example of this problem is the solution MERRY, which takes 9 attempts to be found and points to an additional problem of this naïve approach. The following sequence is obtained: TARES, RORIE, BERRY, DERRY, FERRY, HERRY, JERRY, KERRY, MERRY. Not only is the Naïve Algorithm favoring extremely uncommon words like DERRY and HERRY, but the sample space is so restricted that it just tests every word ending in -ERRY in alphabetical order.

5.1.2 Greedy Algorithm

In order to solve this problem, we decided to implement the Greedy Algorithm that considers the relative frequency of the words, so that even if we are very restricted to a handful of words with similar entropy, we always check the most common ones first. After incorporating word frequencies, the average scores saw a significant improvement, and the best starters also changed. Now, the highest ranked starter is RATES, which is almost identical in structure to TARES, but is a much more common word. The average score for RATES using this algorithm is 3.843, and less than 1% of the solutions required more than 6 attempts.

During the testing stage, we also found that some starters with lower entropy and lower g-priority score can achieve a better score than higher ranked words. The best score was obtained by starting with TALES, with 3.773 attempts and an even lower percentage of failed attempts than RATES, despite having less entropy and a sixth of its relative frequency. Thus, we present TALES as our candidate for best opener.

    [Two bar charts of score frequencies. Left: Starter = RATES, Score = 3.843. Right: Starter = TALES, Score = 3.773]

Figure 5.2: RATES vs TALES

5.1.3 Relaxations of the problem

As we mentioned in Section 2.2.5, we designed several relaxations of the original problem P0. These allowed us to estimate a lower bound for the scores of our algorithms, and gave us insight into how they could be improved in future investigations.

P1 Find the solution to any Wordle puzzle in as few attempts as possible using only words from the list of solutions.

P2 Find the solution to all of Wordle's daily puzzles in as few attempts as possible using the words from the list of allowed words.

P3 Find the solution to all of Wordle's daily puzzles in as few attempts as possible using only words from the list of solutions.

Problem P1 obtained a score as low as 3.527 after opening with SLATE and using the Naïve Algorithm, only missing 7 solutions. Remember that SLATE is the 56th best naïve opener when every word is considered. When restricting the set of allowed words to just the solutions, however, SLATE is the second ranked opener, slightly behind RAISE. P1 made us realize that, while using the frequency of words improved the scores significantly, our function g could never be adjusted to reach such good scores, because the solution set had been handpicked. For example, SANER is 100 times less frequent than TALES –our candidate for best starter– and yet TALES is not in the solution set, and SANER is. This human factor is a limitation that should be considered in further investigations.
    [Bar chart of score frequencies. Starter = SLATE. Score = 3.212]

Figure 5.3: Simulation result of the Naïve Algorithm for P3

Problem P2 obtained an average score of 3.535 for multiple random orderings of the solutions when opening with TALES and using the Greedy Algorithm, and missed only a couple of solutions on average. If we decided to make a ‘daily’ Wordle bot, we would certainly implement it this way, since it never reuses solutions and is not cheating by knowing exactly which words could be solutions. The only caveat is that the score obtained is slightly dependent on the ordering of the solutions, but in our testing the differences were consistently less than 1% of the score.

Finally, the last relaxation of the original problem, P3, obtained the lowest average score of 3.225, and usually reached most of the solutions using SLATE as starter and the Greedy Algorithm. Our greedy algorithm would be the perfect Wordle cheat for competitors who are looking to boast their scores without having to expend the effort of solving the puzzles. Figure 5.3 shows the bar plot of one of the simulations of P3.

We consider this score a lower bound of what can be achieved through ‘fair’ competition for P0, the original problem. This lower bound seems to be consistent with other investigations by Grant Sanderson [17], Jonathan Olson [23] and Alex Selby [24]. Sanderson goes as far as to compute the best starter and best follow-up through a brute force search and concludes that 1.16 bits of uncertainty remain in the puzzle, which translates into roughly 1.12 more attempts and leaves us with a lower bound of 3.12 for the average score.

5.2 Genetic Information Theory Algorithm

As we stated in Section 3.2.1, testing our first Genetic Algorithm did not seem very interesting because we could see that it had a worse performance, both in execution time and average tries, than the Information Theory Algorithm. For that reason, we made few simulations with it and focused more on the Genetic Information Theory Algorithm. Figure 5.4 shows an example of results obtained from simulations made for our first Genetic Algorithm.

An important highlight of this algorithm is how volatile it is. That is, although several simulations gave a similar average number of tries, simulating words individually several times gave very different results. In Section 3.2.1 we talked about how having the word FIZZY as the solution led to the algorithm guessing it in 16 tries, but other executions had the algorithm guess it in just 3 tries. It is also important to note that the simulations that we made for this algorithm always showed that more than 12% of the words could not be guessed in 6 or fewer tries.

    [Bar chart of score frequencies. Population size: 50. Mutation probability: 0.1. Average number of tries: 4.921. Maximum number of tries: 12. Failed words: 13.29%]

Figure 5.4: Simulation results for an execution of our first Genetic Algorithm

For the Genetic Information Theory Algorithm, we first decided to study the results that we would obtain using different values for its parameters, the population size and the probability of mutation. Starting with the size of the population, we decided to run several simulations to determine which sizes lead to better results. We measured the average number of tries that the algorithm takes to guess every word, the maximum number of tries that it needed to guess a word during the simulation and the percentage of words that the algorithm could not guess in 6 or fewer tries.
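These metrics can be gathered with a small test bench along the following lines. This is only a sketch: solve_wordle stands in for whichever algorithm is being simulated and is assumed to return the number of attempts needed for a given solution.

    # Hypothetical test bench; solve_wordle(solution) is assumed to return the
    # number of attempts the algorithm needed to guess the given solution.
    def simulate(solutions, solve_wordle):
        tries = [solve_wordle(s) for s in solutions]
        average = sum(tries) / len(tries)
        maximum = max(tries)
        failed = 100 * sum(1 for t in tries if t > 6) / len(tries)  # % not guessed in 6 tries
        return average, maximum, failed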
Note that, being a Genetic Algorithm, different executions would lead to different results because of the random choices made within the algorithm. However, the results were consistent, and Figure 5.5 shows what we obtained for some of these simulations, where the probability of mutation was always set to 0.1.

    Population size             10      25      50      100     250     500
    Average number of tries     4.737   4.764   4.718   4.707   4.699   4.722
    Maximum number of tries     11      12      12      12      11      11
    Failed words                8.7%    8.7%    9.2%    7.5%    8.5%    8.8%

Figure 5.5: Results of experiments with different population sizes

Although the results are very similar for the different values of the size of the population, we can see a slightly better performance of the algorithm in the range between 100 and 250. Figure 5.6 shows graphs for different executions using 100 as the population size. All these simulations with different population sizes showed us that the algorithm usually takes between 4 and 5 tries to guess a word, using up to 12 in the worst cases. More than 90% of the time, the algorithm guesses the word in 6 or fewer attempts.

    [Two bar charts of score frequencies for different executions with a population size of 100]

Figure 5.6: Frequencies of the different scores obtained using a population size of N = 100

Our next step was to analyze the impact of mutation on the algorithm. For that, we set the population size to 100 and tried executing the algorithm with different values for the probability of mutation. As we made the probability higher, we could see a tendency for the algorithm to get worse results. This made us think that mutation might be counterproductive for this algorithm, as very high values for the probability led to executions in which more than 15 tries were needed to guess a word. Figure 5.7 shows the results we obtained for this analysis.

We also studied how the algorithm works if we did not consider mutation, and the results were better than any that we had gathered. With no mutation involved in the algorithm, the average number of tries dropped to around 4.6, which, although not by much, is lower than what previous simulations showed. Figure 5.8 shows the best execution of the algorithm that we obtained after giving mutation a null probability.

    Mutation probability        0.1     0.2     0.3     0.4
    Average number of tries     4.760   4.852   4.951   5.086
    Maximum number of tries     12      12      14      15
    Failed words                8.3%    10.82%  13.16%  15.5%

Figure 5.7: Results of experiments with different mutation probabilities

    [Bar chart of score frequencies. Population size: 100. Average number of tries: 4.614. Maximum number of tries: 11. Failed words: 7.01%]

Figure 5.8: Best execution obtained using null mutation probability

To summarize the conclusions that we obtained from these simulations, the best population size that we found was N = 100, and using a mutation probability equal to zero led to the best results.

Following these simulations, we tried to incorporate ideas from the Information Theory Algorithm to try to improve the performance of the genetic version. We thought about how important it is to choose a good starting word when you play Wordle. However, the Genetic Algorithm starts with a random population, which means that you might get very good words from the beginning or that the words randomly generated might be very bad. For that reason, we thought that initializing the population with the best words could be of great help for the algorithm.
That is, instead of randomly generating N words, the initial population would be composed of the N words with the highest entropy. Using that, combined with the previous parameters that showed better results, we consistently obtained better executions of the algorithm. Figure 5.9 shows the best execution that we obtained during our process of testing the algorithm.

    [Bar chart of score frequencies. Population size: 100. Average number of tries: 4.463. Maximum number of tries: 9. Failed words: 6.06%]

Figure 5.9: Best execution obtained for the Genetic Information Theory Algorithm

The results that this algorithm obtained were encouraging, especially with how we were able to find configurations and tiny improvements that provided better results. When it comes to execution time, we were able to close the gap between the Genetic and Information Theory algorithms, and in terms of precision –the percentage of words that the algorithms guess in 6 or fewer tries– they also became more similar. The main goal now would be to reduce the average number of tries, which is what makes up the biggest difference between both algorithms and makes this one a worse option. As we could not focus a lot on that part of this algorithm, future work could revolve around upgrading the fitness function. That is, defining different fitness functions that provide better results for the selection operator.

    Word    Entropy   Priority   Fitness Value
    BERRY   4.143     0.841      3.484
    DERRY   4.319     0.779      3.367
    FERRY   3.970     0.843      3.349
    JERRY   3.683     0.855      3.149
    KERRY   4.098     0.834      3.421
    MERRY   4.233     0.833      3.526
    PERRY   4.248     0.853      3.624
    TERRY   4.480     0.853      3.825

Figure 5.10: Fitness values for the words that end in -ERRY

In some situations, the fitness values of several words were not ‘different enough’ to make common words stand out from other less common ones. The values in Figure 5.10 show the entropy and word priority of several words ending in -ERRY. Since we know that the only possible solutions are BERRY, FERRY and MERRY –these are also the most common words in the set– we want the fitness function to give them a considerably better fitness value. However, our fitness function gives them all very similar fitness scores, which means that the selection operator is going to choose among them with almost the same probability. Future investigations could also revolve around the initial population, which is something that we could tell made an improvement in this algorithm.

5.3 Comparison to other algorithms and Conclusions

The competition to make the best Wordle solver is fierce, although the goals vary among researchers. Jonathan Olson [23], for example, designed decision trees coupled with heuristic metrics in search of an algorithm that can find the solution to all words in less than 4 attempts. Currently, his research supports the idea that it is computationally verifiable that this is impossible to accomplish, and his algorithms fall short by 18 words, or 0.78% of the solutions. Many researchers use brute force methods to explore the entire decision tree, and use intelligent pruning methods to come up with good strategies. Tyler Glaiel's [25] algorithm uses metrics that reminded us of our first Genetic Algorithm (see Section 3.2.1) because it scores possible guesses based on the colours of their letters, and it originally achieved average scores similar to ours. The most comprehensive and formal research was done by Alex Selby [24].
His work includes his list of the best 100 starter words, obtained using strategies similar to the famous adversarial game, with the only drawback that, in his own words, his code takes very long to compute when approaching his best scores. Selby uses a very smart method of producing upper bounds for the scores of each word in a very short amount of time, after which he re-evaluates his candidates ensuring exact answers.

Grant Sanderson's algorithm is very similar to ours in the sense that his team also uses Information Theory metrics in their research [17]. Sanderson's work includes a 2-step calculation where words are ranked according to the conditional entropy of the best follow-up guess as well as their own entropy score. Sanderson's investigation is very similar to ours, and we were delighted to find that his best naïve starters matched ours in Figure 2.7. After reviewing Sanderson's code, we found he had designed a very astute way to compute the patterns, and we decided to overhaul our pattern-constructing algorithm to integrate SIMD instructions to speed up our code.

In general, most research seems to halt when the average scores reach 3.43. As we mentioned in Section 5.1, we believe there is a practical lower bound when closing in on a 3.2 average score, and lower scores necessarily must include some form of cheating. The problem with trying to break the 3.42 barrier is that execution times reach a tipping point of diminishing returns, with algorithms taking two days in Alex Selby's case.

    Researcher         Best starters          Average Best Score
    Grant Sanderson    SALET, CRATE, TRACE    3.43
    Alex Selby         SALET, CRATE, REAST    3.42
    Jonathan Olson     SALET, RANCE, REAST    3.42
    Tyler Glaiel       SOARE, ROATE, RAISE    3.43

Figure 5.11: Best starters and average best scores reported by other researchers

In our research, the words in Figure 5.11 were excellent starters and their structures seem similar to our initial and final candidates, TARES and TALES. We were pleased to find that our candidates are close to the top in the lists of best starters of the other researchers, even if they were not the top candidates. In the last stages of our project we had already begun to experiment with other metrics and obtained positive results. We feel encouraged to continue revising and improving the project in the future and already have some ideas to potentially enhance our algorithms. In particular, we believe that implementing multiple-step entropy calculations could improve our algorithms significantly, at the cost of some execution time. With regard to our Genetic Algorithms, we feel that an effort can be made to improve their performance through the definition of more elaborate fitness functions.

Overall, we had a very positive experience researching and developing the strategies we used in our investigation, and look forward to other opportunities to use our mathematical background to solve stimulating puzzles that at first glance have no relationship with computer science or mathematics. We are highly grateful to our supervisor, Manuel Núñez, for his support in the development of the project and for the positive atmosphere we were all able to participate in.

Bibliography

[1] H. Nyquist. “Certain Factors Affecting Telegraph Speed”. In: The Bell System Technical Journal 3.2 (1924), pp. 324–346.

[2] R. Hartley. “Transmission of information”. In: The Bell System Technical Journal 7.3 (1927), pp. 535–563.

[3] C. E. Shannon. “A Mathematical Theory of Communication”. In: The Bell System Technical Journal 27 (1948), pp. 379–423, 623–656.

[4] D. A. Huffman.
“A Method for the Construction of Minimum Redundancy Codes”. In: Proceedings of the Institute of Radio Engineers 40.9 (1952), pp. 1098–1101.

[5] J. R. Quinlan. “Induction of decision trees”. In: Machine Learning 1.1 (1986), pp. 81–106.

[6] D. Clark and R. M. Hierons. “Squeeziness: An information theoretic measure for avoiding fault masking”. In: Information Processing Letters 112.8-9 (2012), pp. 335–340.

[7] D. Clark, R. M. Hierons, and K. Patel. “Normalised Squeeziness and Failed Error Propagation”. In: Information Processing Letters 149 (2019), pp. 6–9.

[8] K. Androutsopoulos et al. “An analysis of the relationship between conditional entropy and failed error propagation in software testing”. In: 36th Int. Conf. on Software Engineering, ICSE’14. ACM Press, 2014, pp. 573–583.

[9] A. Ibias, M. Núñez, and R. M. Hierons. “Using Squeeziness to test component-based systems defined as Finite State Machines”. In: Information & Software Technology 112 (2019), pp. 132–147.

[10] A. Ibias and M. Núñez. “Estimating Fault Masking using Squeeziness based on Rényi’s Entropy”. In: 35th ACM Symposium on Applied Computing, SAC’20. ACM Press, 2020, pp. 1936–1943.

[11] K. Patel, R. M. Hierons, and D. Clark. “An information theoretic notion of software testability”. In: Information & Software Technology 143 (2022).

[12] T. M. Cover and J. A. Thomas. Elements of Information Theory. 2nd ed. Wiley, 2006.

[13] S. Machkovech. Wordle creator describes game’s rise, says NYT sale was “a way to walk away”. 2022. url: https://arstechnica.com/gaming/2022/03/wordle-creator-describes-games-rise-says-nyt-sale-was-a-way-to-walk-away/.

[14] D. Victor. Wordle is a love story. 2022. url: https://www.nytimes.com/2022/01/03/technology/wordle-word-game-creator.html.

[15] R. Hall. Wordle creator overwhelmed by global success of hit puzzle. 2022. url: https://www.theguardian.com/games/2022/jan/11/wordle-creator-overwhelmed-by-global-success-of-hit-puzzle.

[16] S. O’Sullivan. Jotto Game Review. 2002. url: http://www.panix.com/~sos/bc/jotto.html.

[17] G. Sanderson. Solving Wordle using Information Theory. 2022. url: 3blue1brown.com.

[18] M. Mitchell. An introduction to genetic algorithms. MIT Press, 1998.

[19] D. E. Goldberg. Genetic Algorithms in Search, Optimisation and Machine Learning. Addison-Wesley, 1989.

[20] J. H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, 1992.

[21] L. Aguirre Galindo and J. Pellejero Ortega. Computación cuántica: pruebas de mutación. Trabajo de Fin de Grado, Facultad de Informática, Universidad Complutense de Madrid. 2020.

[22] C. R. Harris et al. “Array programming with NumPy”. In: Nature 585.7825 (2020), pp. 357–362.

[23] J. Olson. Optimal Wordle Solutions. 2022. url: https://jonathanolson.net/experiments/optimal-wordle-solutions.

[24] A. Selby. The best strategies for Wordle. 2022. url: https://sonorouschocolate.com/notes/index.php?title=The_best_strategies_for_Wordle.

[25] T. Glaiel. The mathematically optimal first guess in Wordle. 2022. url: https://medium.com/@tglaiel/the-mathematically-optimal-first-guess-in-wordle-cbcb03c19b0a.