UNIVERSIDAD COMPLUTENSE DE MADRID
FACULTAD DE FILOLOGÍA

DOCTORAL THESIS

Attribution of Authorship of Arden of Faversham: A Forensic Linguistic Study of William Shakespeare and Christopher Marlowe

Atribución de autoría de Arden of Faversham: un estudio lingüístico forense de William Shakespeare y Christopher Marlowe

DISSERTATION SUBMITTED FOR THE DEGREE OF DOCTOR

SUBMITTED BY

Juan Antonio Latorre García

Supervisors

María Goicoechea de Jorge
Elena Martínez Caro

Madrid

© Juan Antonio Latorre García, 2022


UNIVERSIDAD COMPLUTENSE DE MADRID
FACULTAD DE FILOLOGÍA

ATTRIBUTION OF AUTHORSHIP OF ARDEN OF FAVERSHAM: A FORENSIC LINGUISTIC STUDY OF WILLIAM SHAKESPEARE AND CHRISTOPHER MARLOWE

ATRIBUCIÓN DE AUTORÍA DE ARDEN OF FAVERSHAM: UN ESTUDIO LINGÜÍSTICO FORENSE DE WILLIAM SHAKESPEARE Y CHRISTOPHER MARLOWE

Thesis submitted for the degree of Doctor by
Juan Antonio Latorre García

Supervisors:
Dr. María Goicoechea de Jorge
Dr. Elena Martínez Caro

Madrid, 2021


ACKNOWLEDGEMENTS 

As could not be otherwise, I want to begin this section by dedicating a few words to my supervisors, María Goicoechea de Jorge and Elena Martínez Caro. María, thank you for the brilliance of your ideas. If you had not told me about Arden of Faversham, this thesis would have been something completely different. I also want you to know that I will always be grateful for the human warmth and empathy you showed me at the most delicate moment of my academic life. Elena, thank you for having made me an infinitely better linguist than I was when we met. After these five years, I see you more as a friend than as a supervisor, and that reflects the way you care about the people around you.

Krzysztof Kredens, thank you for supervising my research mobility and for teaching me so much about forensic linguistics. When I look back, I realize I barely knew anything about the discipline when I got there. I admire you intellectually and I have enjoyed each of our conversations.

Rui Sousa-Silva and Gerardo Sierra, thank you for your commendable work as external reviewers of the thesis. The implementation of your suggestions has increased its quality substantially. I hope we can stay in touch after the oral defence, since I would love to keep learning from you.

Victoria Martín de la Rosa, thank you for being the first person at the Complutense who saw something special in me and for your constant interest throughout all these years. Although my shyness does not allow me to be more expressive on certain occasions, I hope you know that I appreciate you enormously and that you can always count on me.

David Vallejo and Alekos Camino, thank you for having answered, some one thousand two hundred and twenty-seven times, the question "Does this sentence sound right to you?" Thank you also for helping me with the layout of the thesis and, above all, for your infinite patience.

Carlos Antón, thank you for embarking on the ALTXA Project with me. Who would have thought it when we met fifteen years ago. Do you remember when you grew your hair long and people picked on you at school? I do not know if I ever told you, but I always admired you for having enough personality to keep wearing it that way for as long as you wanted. You are a good friend and I am proud of the person you are.

Irene Mezquita, thank you for having made me so, so happy. You will always be "my little flower".


If there are two people to whom I want to dedicate this thesis, they are my father, José Antonio Latorre, and my mother, Amalia García. You have taught me how to be a good father. Dad, thank you for instilling in me the importance of loving what I do. Mum, thank you for helping me develop my sensitivity. I hope I can enjoy you for much longer (Dad, stop smoking).

I have written these acknowledgements a few days before submitting the thesis and I feel like a Joyce character in the middle of an epiphany. Precisely for that reason, because I believe that right now I possess a vision of reality that will vanish sooner rather than later, I want to leave something written for myself. When you read this thesis again in a few months or years and think of a way to improve it or find a typo (I know you will not stop until you do), do not be a jerk to yourself. You can go on being one when judging the rest of the aspects of your life, but not this one. Do me that favour, mate.

Now then, as Humbert Humbert said, "look at this tangle of thorns".


 

TABLE OF CONTENTS 

 
Table of contents
Abstract
Resumen
List of tables
List of figures

 
CHAPTER 1 | INTRODUCTION
1.1. Background and rationale for research
1.2. Objectives and hypotheses
1.3. Overview and organization of the thesis

 
CHAPTER 2 | HISTORICAL AND LITERARY BACKGROUND
2.1. William Shakespeare
2.2. Christopher Marlowe
2.3. The anonymous play Arden of Faversham
2.4. Summary

 
CHAPTER 3 | LINGUISTIC BACKGROUND: AN INTRODUCTION TO FORENSIC LINGUISTICS AND AUTHORSHIP ATTRIBUTION STUDIES
3.1. Definition of forensic linguistics
3.2. Historical development of forensic linguistics
3.3. Areas of forensic linguistics
        3.3.1. The written language of the law
        3.3.2. The spoken language of the law
        3.3.3. The linguist as an expert witness
3.4. Authorship attribution studies
        3.4.1. Attribution of authorship in cases of plagiarism
        3.4.2. Attribution of authorship of criminal texts with an open set of suspects
        3.4.3. Attribution of authorship of criminal texts with a close set of suspects
        3.4.4. Attribution of authorship of historical texts
3.5. Summary

 
CHAPTER 4 | METHODOLOGY
4.1. Delimitation of the scope of the investigation
4.2. Data collection
4.3. Extraction and adaptation of the samples
4.4. Structure of the analysis
4.5. Selection of the authorship tests for the analysis and the role of ALTXA
        4.5.1. Quantification of the relative frequency of keywords
        4.5.2. Quantification of the average number of words per sentence
        4.5.3. Quantification of the lexical richness
        4.5.4. N-gram tracing
        4.5.5. The Zeta test
4.6. Summary

 
CHAPTER 5 | PRE-STUDIES
5.1. Pre-study on the calculation of the average number of words per sentence (Pre-study 1)
        5.1.1. Average number of words per sentence of scenes of between 100 and 450 words
        5.1.2. Average number of words per sentence of scenes of between 500 and 950 words
        5.1.3. Average number of words per sentence of scenes of between 1,100 and 1,700 words
        5.1.4. Average number of words per sentence of scenes of almost 2,000 words or more
        5.1.5. Conclusions derived from Pre-study 1
5.2. Pre-study on the calculation of the lexical richness (Pre-study 2)
        5.2.1. Lexical richness of scenes of between 100 and 450 words
        5.2.2. Lexical richness of scenes of between 500 and 950 words
        5.2.3. Lexical richness of scenes of between 1,100 and 1,700 words
        5.2.4. Lexical richness of scenes of almost 2,000 words or more
        5.2.5. Conclusions derived from Pre-study 2
5.3. Pre-study on n-gram tracing (Pre-study 3)
        5.3.1. N-gram tracing with scenes of between 100 and 450 words
        5.3.2. N-gram tracing with scenes of between 500 and 950 words
        5.3.3. N-gram tracing with scenes of between 1,100 and 1,700 words
        5.3.4. N-gram tracing with scenes of almost 2,000 words or more
        5.3.5. Conclusions derived from Pre-study 3
5.4. Pre-study on the conduction of the Zeta test (Pre-study 4)
        5.4.1. Zeta test with scenes of almost 2,000 words or more
        5.4.2. Interpretation of the results
        5.4.3. Conclusions derived from Pre-study 4
5.5. Summary

 
CHAPTER 6 | CASE STUDY: ATTRIBUTION OF AUTHORSHIP OF THE SCENES OF ARDEN OF FAVERSHAM
6.1. Scene I.i (5,135 words)
6.2. Scene II.i (916 words)
6.3. Scene II.ii (1,694 words)
6.4. Scene III.i (822 words)
6.5. Scene III.ii (516 words)
6.6. Scene III.iii (357 words)
6.7. Scene III.iv (240 words)
6.8. Scene III.v (1,293 words)
6.9. Scene III.vi (1,265 words)
6.10. Scene IV.i (838 words)
6.11. Scene IV.ii (263 words)
6.12. Scene IV.iii (593 words)
6.13. Scene IV.iv (1,251 words)
6.14. Scene V.i (3,477 words)
6.15. Scene V.ii (106 words)
6.16. Scene V.iii (179 words)
6.17. Scene V.iv (117 words)
6.18. Scene V.v (321 words)
6.19. Epilogue or Scene V.vi (148 words)
6.20. Summary

 
CHAPTER 7 | DISCUSSION OF THE RESULTS

CHAPTER 8 | CONCLUSION AND FUTURE LINES OF RESEARCH
8.1. Summary and implications of the findings
8.2. Limitations and future lines of research

PRIMARY SOURCES

BIBLIOGRAPHY AND REFERENCES

APPENDICES
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Appendix 5

 
 

ABSTRACT 

This research project has two main objectives. The first is to determine the authorship of the Elizabethan play Arden of Faversham through a forensic linguistic analysis that considers William Shakespeare and Christopher Marlowe as the possible candidates. The second is to develop the computational program ALTXA, which carries out authorship attribution tests within the disciplinary framework of forensic linguistics and offers an intuitive interface that will facilitate the work of other linguists and the spread of studies of this kind in educational contexts.

First, biographical data on Shakespeare and Marlowe are offered to establish a connection between the two that justifies their possible cooperation in the writing of Arden of Faversham, together with a historical and literary analysis of the play itself. Forensic linguistics is then defined, and a series of basic notions about its historical development and main areas of study are provided in order to narrow down the scope of the thesis progressively until authorship attribution studies are presented and explained in more depth, with special emphasis on previous investigations into the authorship of Arden of Faversham. These sections are not merely descriptive, since they include theoretical contributions that anticipate the methodological approach selected for the subsequent analysis.

To study the authorship of Arden of Faversham, a corpus of undisputed plays was compiled for each of the two candidates, following the hypothesis that, if the idiolect of an author is a dynamic phenomenon, these reference corpora should consist of plays written in a period similar to that in which the disputed work was created, with which they should also share a tragic tone. In addition, on the assumption that the validity of each attribution method depends on the type of text and the authors to which it is applied, the thesis is divided into a series of pre-studies and a case study.

The pre-studies evaluate which authorship attribution methods are highly effective at distinguishing between undisputed scenes by Shakespeare and Marlowe depending on their length. These scenes were divided into four groups with the following word ranges: from 100 to 450, from 500 to 950, from 1,100 to 1,700, and almost 2,000 or more. To carry out the pre-studies, five authorship attribution methods were selected and programmed as functionalities of ALTXA. These are based on the calculation of the relative frequency of a list of keywords selected by the researcher, the quantification of the average number of words per sentence of the texts, the quantification of their lexical richness, the tracing of common n-grams and the application of the Zeta test. The first of these methods was eventually discarded because of its reliance on subjective criteria, whereas the others were included in the pre-studies. The identification of common n-grams proved effective at distinguishing between Shakespearean and Marlowian scenes from all four groups, whereas the Zeta test proved reliable for analysing scenes from the fourth group. Consequently, these were the methods employed in the case study, that is, in the attribution of authorship of the scenes of Arden of Faversham, which were studied independently, since the play may have been written in collaboration.
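To make the four retained measures more concrete, the following Python sketch shows one possible way of computing them. It is not the ALTXA implementation, which is not reproduced here. All function names are hypothetical, lexical richness is assumed to be a simple type-token ratio, n-gram tracing is reduced to the share of distinct word n-grams that a disputed text shares with a reference text, and the Zeta score follows one widely used formulation (share of one author's segments containing a word plus share of the other author's segments lacking it).

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (apostrophes kept inside words)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def average_words_per_sentence(text):
    """Mean token count per sentence, splitting on ., ! and ? (an assumption)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def type_token_ratio(text):
    """Lexical richness approximated as distinct words / total words (an assumption)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shared_ngram_share(disputed, reference, n=2):
    """Share of the disputed text's distinct word n-grams also found in the reference."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    disputed_ngrams = ngrams(tokenize(disputed))
    if not disputed_ngrams:
        return 0.0
    return len(disputed_ngrams & ngrams(tokenize(reference))) / len(disputed_ngrams)


def zeta_scores(segments_a, segments_b):
    """Per word: share of A segments containing it plus share of B segments lacking it.

    Scores range from 0 (typical of B only) to 2 (typical of A only).
    Both segment lists must be non-empty.
    """
    sets_a = [set(tokenize(s)) for s in segments_a]
    sets_b = [set(tokenize(s)) for s in segments_b]
    vocabulary = set().union(*sets_a, *sets_b)
    return {
        word: sum(word in s for s in sets_a) / len(sets_a)
        + sum(word not in s for s in sets_b) / len(sets_b)
        for word in vocabulary
    }
```

In a setting like the one described above, each function would be applied to a disputed scene and to the reference corpus of each candidate, and the candidate whose corpus yields the closer values would be favoured. The thresholds and segment sizes needed for such a comparison are not specified here and would have to be chosen by the analyst.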

The results of the case study associate the authorship of 15 of the 19 scenes of the play with Marlowe, whereas only one of them shows a higher degree of resemblance to the Shakespearean idiolect. The three remaining scenes yield inconclusive results. Even though other Elizabethan playwrights need to be included as possible candidates in future research, this thesis provides sufficient evidence to suggest that Shakespeare's participation in the writing of Arden of Faversham is minor or non-existent, which is already a significant finding that contradicts what has been stated by other scholars. Furthermore, it also suggests that Marlowe's participation is undeniable, especially in Scene V.i, whose results are so overwhelming that it seems unthinkable that it could have been written by another author.

In sum, the present doctoral thesis attributes to Marlowe the authorship of a section of the Elizabethan play Arden of Faversham, which has been catalogued as anonymous for over four centuries. This breakthrough has been accomplished with the assistance of the software ALTXA, which will be used to build an educational project that aims to contribute to the development of the discipline, which has been evolving constantly over the last decades as a result of the emergence of new technologies.

 
 

RESUMEN 

This research pursues two main objectives. The first is to determine the authorship of the Elizabethan play Arden of Faversham by means of a forensic linguistic analysis with William Shakespeare and Christopher Marlowe as the possible candidates. The second is to develop the computer program ALTXA, which is capable of carrying out authorship attribution tasks common in the field of forensic linguistics through an intuitive interface, thereby facilitating the work of other linguists and the dissemination of this type of study in teaching contexts.

First, biographical data on Shakespeare and Marlowe are provided to establish a connection between them that justifies their possible collaboration in the writing of Arden of Faversham, as well as a brief literary and historical analysis of the play itself. Next, forensic linguistics is defined and a series of basic notions about its historical development and main areas of study are offered, with the purpose of progressively narrowing the focus of the thesis until authorship attribution studies are presented and explained more exhaustively, with special emphasis on previous studies of the authorship of Arden of Faversham. These sections of the thesis are not purely descriptive; rather, they include theoretical contributions that anticipate the methodological approach selected for the subsequent analysis.

To study the authorship of Arden of Faversham, a corpus of undisputed works was compiled for each of the two candidates under the hypothesis that, if the idiolect of an author is a dynamic phenomenon, these reference corpora must consist solely of plays written in a period similar to that of the disputed work, with which they must also share a tragic tone. Likewise, in the belief that the validity of each attribution method depends on the type of text and the authors to which it is applied, this thesis is divided into a series of pre-studies and a case study.

The pre-studies are intended to evaluate which authorship attribution methods are highly effective at distinguishing between undisputed scenes by Shakespeare and Marlowe according to their length; the scenes were divided into four groups. The word ranges of the groups are between 100 and 450, between 500 and 950, between 1,100 and 1,700, and almost 2,000 or more. For these pre-studies, five authorship attribution methods were chosen and programmed as functionalities of ALTXA. They are based on the calculation of the relative frequency of a list of keywords selected by the researcher, the quantification of the average number of words per sentence of the texts and of their lexical richness, the identification of shared n-grams and the application of the Zeta test. The first of these methods was ultimately discarded because of its subjective nature, whereas the others did form part of the pre-studies. The identification of shared n-grams proved effective at distinguishing between scenes by Shakespeare and Marlowe from all four groups, whereas the Zeta test proved effective with the scenes of the fourth group. For this reason, these were the methods employed in the case study, that is, in the attribution of authorship of the scenes of Arden of Faversham, which were studied independently, since the play may have been written in collaboration.

The results of the case study associate the authorship of 15 of the 19 scenes of the play with Marlowe, whereas only one of them shows a higher degree of similarity to the Shakespearean idiolect. The three remaining scenes yield inconclusive results. Although other Elizabethan playwrights need to be included as possible candidates in future research, this thesis offers sufficient evidence to suggest that Shakespeare's participation in the writing of Arden of Faversham is minor or non-existent, which is already a valuable finding that contradicts what other scholars have argued. It also suggests that Marlowe's participation is undeniable, especially in the first scene of the fifth act, where the results are so overwhelming that it seems unthinkable that it could have been written by another author.

In sum, the present doctoral thesis attributes to Christopher Marlowe the authorship of part of the Elizabethan play Arden of Faversham, which has remained catalogued as anonymous for more than four centuries. This finding was made possible by the software ALTXA, on which a teaching project is to be built that contributes to the development of the discipline, which has been evolving constantly over recent decades as a consequence of the emergence of new technologies.

 
 

LIST OF TABLES 

 
Table 1 | Length of the scenes of Arden of Faversham
Table 2 | Stage 1 of the pre-study on the average number of words per sentence
Table 3 | Stage 2 of the pre-study on the average number of words per sentence
Table 4 | Stage 3 of the pre-study on the average number of words per sentence
Table 5 | Stage 4 of the pre-study on the average number of words per sentence
Table 6 | Stage 1 of the pre-study on the lexical richness
Table 7 | Stage 2 of the pre-study on the lexical richness
Table 8 | Stage 3 of the pre-study on the lexical richness
Table 9 | Stage 4 of the pre-study on the lexical richness
Table 10 | N-gram tracing with Scene II.iii from Richard III
Table 11 | N-gram tracing with Scene III.iii from Richard III
Table 12 | N-gram tracing with Scene V.ii from Richard III
Table 13 | N-gram tracing with Scene II.iv from Richard II
Table 14 | N-gram tracing with Scene III.i from Richard II
Table 15 | N-gram tracing with Scene II.iii from Edward II
Table 16 | N-gram tracing with Scene III.i from Edward II
Table 17 | N-gram tracing with Scene IV.i from Edward II
Table 18 | N-gram tracing with Scene IV.iv from Edward II
Table 19 | N-gram tracing with Scene III.i from The Jew of Malta
Table 20 | N-gram tracing with Scene II.iv from Richard III
Table 21 | N-gram tracing with Scene III.iv from Richard III
Table 22 | N-gram tracing with Scene IV.ii from Richard III
Table 23 | N-gram tracing with Scene III.iv from Richard II
Table 24 | N-gram tracing with Scene V.i from Richard II
Table 25 | N-gram tracing with Scene II.i from Edward II
 

Table 26 | N-gram tracing with Scene III.iii from Edward II ....................................... 142 

Table 27 | N-gram tracing with Scene III.iii from The Jew of Malta ........................... 143 

Table 28 | N-gram tracing with Scene IV.v from The Jew of Malta ............................ 144 

Table 29 | N-gram tracing with Scene V.i from The Jew of Malta ............................... 145 

Table 30 | N-gram tracing with Scene I.i from Richard III .......................................... 146 

Table 31 | N-gram tracing with Scene II.ii from Richard III ........................................ 147 

Table 32 | N-gram tracing with Scene I.i from Richard II............................................ 148 

Table 33 | N-gram tracing with Scene II.ii from Richard II ......................................... 149 

Table 34 | N-gram tracing with Scene II.iii from Richard II ........................................ 150 

Table 35 | N-gram tracing with Scene I.i from Edward II ............................................ 151 

Table 36 | N-gram tracing with Scene III.ii from Edward II ........................................ 152 

Table 37 | N-gram tracing with Scene V.i from Edward II .......................................... 154 

Table 38 | N-gram tracing with Scene I.i from The Jew of Malta ................................ 155 

Table 39 | N-gram tracing with Scene IV.iv from The Jew of Malta ........................... 156 

Table 40 | N-gram tracing with Scene I.iii from Richard III ........................................ 157 

Table 41 | N-gram tracing with Scene IV.iv from Richard III ..................................... 158 

Table 42 | N-gram tracing with Scene V.iii from Richard III ...................................... 159 

Table 43 | N-gram tracing with Scene I.iii from Richard II ......................................... 160 

Table 44 | N-gram tracing with Scene II.i from Richard II .......................................... 161 

Table 45 | N-gram tracing with Scene I.iv from Edward II .......................................... 163 

Table 46 | N-gram tracing with Scene II.ii from Edward II ......................................... 164 

Table 47 | N-gram tracing with Scene I.ii from The Jew of Malta ............................... 165 

Table 48 | N-gram tracing with Scene II.iii from The Jew of Malta ............................. 166 

Table 49 | N-gram tracing with Scene I.i from Arden of Faversham ........................... 182 

Table 50 | N-gram tracing with Scene II.i from Arden of Faversham .......................... 186 

Table 51 | N-gram tracing with Scene II.ii from Arden of Faversham ......................... 187 

Table 52 | N-gram tracing with Scene III.i from Arden of Faversham......................... 189 



Table 53 | N-gram tracing with Scene III.ii from Arden of Faversham ....................... 191 

Table 54 | N-gram tracing with Scene III.iii from Arden of Faversham ...................... 192 

Table 55 | N-gram tracing with Scene III.iv from Arden of Faversham....................... 193 

Table 56 | N-gram tracing with Scene III.v from Arden of Faversham ........................ 194 

Table 57 | N-gram tracing with Scene III.vi from Arden of Faversham....................... 195 

Table 58 | N-gram tracing with Scene IV.i from Arden of Faversham ........................ 196 

Table 59 | N-gram tracing with Scene IV.ii from Arden of Faversham ....................... 197 

Table 60 | N-gram tracing with Scene IV.iii from Arden of Faversham ...................... 198 

Table 61 | N-gram tracing with Scene IV.iv from Arden of Faversham ...................... 199 

Table 62 | N-gram tracing with Scene V.i from Arden of Faversham .......................... 200 

Table 63 | N-gram tracing with Scene V.ii from Arden of Faversham......................... 203 

Table 64 | N-gram tracing with Scene V.iii from Arden of Faversham ....................... 204 

Table 65 | N-gram tracing with Scene V.iv from Arden of Faversham ........................ 205 

Table 66 | N-gram tracing with Scene V.v from Arden of Faversham ......................... 206 

Table 67 | N-gram tracing with Scene V.vi from Arden of Faversham ........................ 207 

Table 68 | Summary of the results derived from the case study ................................... 209 

 

LIST OF FIGURES 

 
Figure 1 | Interface of ALTXA for text analysis ............................................................ 90 

Figure 2 | Interface of ALTXA for n-gram tracing......................................................... 96 

Figure 3 | Interface of ALTXA for the Zeta test ........................................................... 101 

Figure 4 | Zeta test with Scene I.ii from Richard III .................................................... 169 

Figure 5 | Zeta test with Scene I.iii from Richard III ................................................... 171 

Figure 6 | Zeta test with Scene V.iii from Richard III .................................................. 172 

Figure 7 | Zeta test with Scene I.iii from Richard II ..................................................... 173 

Figure 8 | Zeta test with Scene IV.i from Richard II .................................................... 174 

Figure 9 | Zeta test with Scene I.iv from Edward II ..................................................... 175 

Figure 10 | Zeta test with Scene II.ii from Edward II ................................................... 176 

Figure 11 | Zeta test with Scene I.ii from The Jew of Malta ........................................ 177 

Figure 12 | Zeta test with Scene II.iii from The Jew of Malta ...................................... 178 

Figure 13 | Zeta test with Scene I.i from Arden of Faversham..................................... 185 

Figure 14 | Zeta test with Scene II.ii from Arden of Faversham .................................. 189 

Figure 15 | Zeta test with Scene V.i from Arden of Faversham ................................... 202 

 

CHAPTER 1 | INTRODUCTION 

1.1. Background and rationale for research 

The present doctoral thesis intends to conduct a forensic linguistic analysis of the 

authorship of the Elizabethan play Arden of Faversham considering William Shakespeare 

and Christopher Marlowe as the possible candidates. This analysis will be carried out

with a program named ALTXA, which has been designed specifically for this purpose

and whose implementation in educational and professional contexts constitutes the second

main objective of the thesis.

My interest in the forensic analysis of Elizabethan texts was developed in my MA 

dissertation entitled Attribution of Authorship of “The Merchant of Venice” and “Henry 

VI” through Linguistic Parameters: A Contrastive Study between William Shakespeare 

and Christopher Marlowe. While The Merchant of Venice has been attributed to 

Shakespeare without major doubts for centuries, Henry VI, Part I had been recently 

attributed to Shakespeare and Marlowe as a collaborative play (see Section 2.2) a few 

years before I started working on it. Given my lack of expertise in the subject, the main 

objective of my MA dissertation was to work on the authorship of well-attributed plays 

to determine whether its approach could reach conclusions similar to those presented by experts

in the field.  

As suggested by one of my supervisors –Dr. María Goicoechea–, I decided to focus 

on the authorship of Arden of Faversham in this doctoral thesis, given that it could 

constitute a natural continuation of my previous work. Arden of Faversham is an

Elizabethan play that remains anonymous, and hence this project could go a step beyond

the analysis of already well-attributed plays and fill a gap in knowledge, since little

research has addressed this issue from a forensic linguistic perspective.

Arden of Faversham was written around 1592 and, although several

studies have attempted to link its authorship to Shakespeare, Marlowe

and other playwrights (see Section 2.3), it is still considered anonymous due to a lack of 

conclusive evidence (Kinney, 2009). This work narrates the killing of a landowner from 

Faversham named Arden by his wife, his wife’s lover and two professional criminals. 

The play was inspired by a real event that had been documented by Raphael Holinshed 

in his historical work entitled Chronicles of England, Scotland and Ireland (1577; second 



edition, 1587). The fact that the text has remained anonymous makes it suitable for a 

study of this kind, whose approach will be briefly described in the following paragraphs. 

Forensic linguistics can be defined as a relatively recent branch of applied

linguistics that focuses on those legal cases in which the use of language is involved to 

some extent (Tiersma, 1993; McMenamin, 2002; Gibbons, 2003; Olsson, 2008; Momeni, 

2011; Perkins & Grant, 2012; Correa, 2013; Udina, 2017). One of the many applications 

of this discipline, which will be addressed in detail in Chapter 3, is the attribution of 

authorship of anonymous or disputed texts, such as threatening notes, suicide letters and, 

as in the case of the present research, literary texts.  

Even though the establishment of the authorship of Arden of Faversham has no major 

legal implications, the forensic approach adopted for the thesis is justified by the 

development of computational tools over the last decades, which allows researchers to 

take into account statistical variables that could not be accessed before and thus produce 

more precise results than previous studies conducted from both literary and linguistic 

perspectives (Kinney, 2009). In other words, the present investigation aims to fill a gap

in knowledge that has persisted for over four centuries by using computational

resources that facilitate the adoption of innovative empirical approaches that differ from 

more traditional ones, such as those that characterize the field of literary criticism. 

It is probable that Arden of Faversham was written in collaboration, since most of the 

plays that were written during the Elizabethan period had more than one author

(Kermode, 2005; Holland, 2007). Following this line of thought, the attribution of 

authorship of the disputed text, that is, Arden of Faversham, will consist of 19 distinct 

analyses, one for each of its scenes, given that if two or more playwrights were involved 

in the writing of the play, a reasonable possibility is that they divided it according to

the thematic content of its scenes (see Section 4.4). This means that the scenes of Arden 

of Faversham will be analysed as independent texts to obtain results that may provide 

substantial evidence for the presence of more than one author involved in its creation, 

which would reflect more faithfully the reality of the time in which it was written.

As will be developed in Section 4.5, the methods with which the scenes of Arden of 

Faversham will be analysed are based on the quantification of linguistic variables, given 

that the study belongs to the disciplinary field of forensic linguistics. Studies of this kind 

are built on the notion of idiolect, which stands as the variety of the language that each 



individual uses and is reflected in their written or spoken production (Coulthard, 2004). 

Hence, authorship attribution studies within the field of forensic linguistics are based on 

the study of the idiolectal features of the possible authors of the disputed text by analysing 

their undisputed works, that is, those texts that have been attributed to them beyond any 

reasonable doubt, for subsequent comparison with the disputed text in order to determine

which of the idiolectal models it resembles more closely (Coulthard et

al., 2010). The tests that will be considered for the attribution of authorship of Arden of

Faversham will be outlined in Section 1.2.

The criteria for the compilation of the corpora of undisputed works of each candidate 

of the study, also known as the reference corpora, are of paramount importance for

the development of the research and may have a crucial impact on its outcome, as can be 

inferred from the previous paragraph. The present doctoral thesis intends to suggest a 

distinct approach to compile these corpora in comparison with previous studies on the 

same subject, which will be briefly discussed in Section 1.2 and addressed in depth in 

Section 4.2. In addition, a series of methodological decisions will be made in the

application of certain tests to increase their effectiveness, which will also be mentioned

in Section 1.2 and discussed in more detail in Section 4.5. 

As pointed out at the beginning of the chapter, this doctoral thesis has two main 

interrelated objectives. It seeks to investigate the authorship of Arden of Faversham and 

to develop a computational tool for conducting authorship attribution

studies, which might facilitate the work of the forensic linguist and the

implementation of these studies in educational contexts. With this purpose in mind,

ALTXA, a program that presents an intuitive interface and supports

a wide range of authorship tests that are common in the field of forensic linguistics, has

been created in collaboration with computer programmer Carlos Antón and will be

offered as free software to the academic community. The main reasons underlying the

creation of this software will be addressed in the following section, whereas its 

functionalities and what makes it different from other computational tools for text 

analysis will be expounded in Section 4.5. 

In sum, the present investigation aims to analyse the authorship of Arden of 

Faversham, a literary text that has remained anonymous since the Elizabethan period, 

considering Shakespeare and Marlowe as the candidates for such attribution. An 

innovative approach for the compilation of the reference corpora and the application of 



certain authorship tests will be adopted. The analysis of the scenes of Arden of Faversham

will be carried out with ALTXA, a newly designed program that has been

developed specifically for this research, which in turn serves to

test the validity of the tool in authorship attribution studies within the framework of forensic

linguistics. These objectives and more specific questions will be expanded upon in the

following section. 

1.2. Objectives and hypotheses 

The overall objectives of the investigation are to discern the likeliest authorship of the 19 

scenes of the Elizabethan play Arden of Faversham considering William Shakespeare and 

Christopher Marlowe as the potential candidates and to develop the software ALTXA, 

which will be employed for such authorship analysis. With the purpose of meeting the 

abovementioned objectives, the following subgoals and hypotheses need to be addressed. 

The first subgoal consists in the compilation of a Shakespearean and a Marlowian 

reference corpus for subsequent comparison with Arden of Faversham to determine with

which of the two idiolectal models the play presents a higher degree of resemblance. This 

compilation will be built upon the most relevant hypothesis of the thesis, which is related 

to what can be considered a representative sample of an author’s idiolect. While many 

scholars have compiled the reference corpora of the candidates involved in the attribution 

of authorship of Arden of Faversham with texts that belong to distinct periods (see 

Kinney, 2009) and even to dissimilar literary genres (see Taylor, 2019), the 

Shakespearean and the Marlowian reference corpora of the present study will be formed 

by plays that were written within three years of the composition of Arden of

Faversham and are not comedies. This decision derives from the belief that, when two

authors with highly similar styles are compared, the most representative reference

corpora are not the largest, but those that reflect most faithfully the

conditions in which the disputed text was written. In other words, the present investigation

hypothesizes that Shakespeare and Marlowe may have adopted a

series of idiolectal features that were only present during a specific period of time and in

plays with a tragic tone, and that identifying and classifying these features can be more

useful for determining the likeliest authorship of the disputed text than relying on features that

are present throughout their work, which were probably shared by many playwrights

at the time. This issue will be discussed in depth in Chapter 4. 



While Richard III and Richard II will be used for the compilation of the 

Shakespearean reference corpus, Edward II and The Jew of Malta will make up the

Marlowian corpus (see Section 4.2 for an account of the reasons underlying the selection 

of these plays). The next subgoal of the thesis is therefore to clean these works as well as 

Arden of Faversham with the purpose of making the subsequent analysis as precise as 

possible. To this end, every stage direction or linguistic element that is not part of the

dialogue will be removed under the assumption that these constitute a distinct subgenre

within the play where idiolectal features are less likely to be found. In other words, only 

the direct interventions of the characters will be taken into consideration in the authorship 

analysis. 
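The cleaning step described above can be illustrated with a minimal sketch. It is not the procedure actually applied in the thesis (see Section 4.2); it assumes a hypothetical plain-text transcription in which stage directions appear in square brackets, speaker prefixes are written in capitals followed by a full stop, and act or scene headings begin with ACT or SCENE. The function name extract_dialogue and the sample lines are invented for illustration.

```python
import re


def extract_dialogue(play_text: str) -> str:
    """Keep only the spoken lines of a play transcription.

    Assumed (hypothetical) markup: stage directions in [square brackets],
    speaker prefixes such as 'FIRST SPEAKER.' at the start of a line,
    and headings that begin with ACT or SCENE.
    """
    # Remove bracketed stage directions, including those spanning several lines.
    text = re.sub(r"\[[^\]]*\]", " ", play_text)
    kept = []
    for line in text.splitlines():
        # Strip a leading speaker prefix written in capitals, e.g. "FIRST SPEAKER."
        line = re.sub(r"^\s*[A-Z][A-Z ]*\.\s*", "", line)
        # Collapse the extra whitespace left behind by the removals.
        line = re.sub(r"\s+", " ", line).strip()
        # Skip empty lines and act/scene headings.
        if not line or re.match(r"^(ACT|SCENE)\b", line):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = (
    "SCENE I\n"
    "FIRST SPEAKER. Good morrow to you.\n"
    "[Enter a second speaker.]\n"
    "SECOND SPEAKER. And to you, sir. [Aside.] He knows nothing."
)
print(extract_dialogue(sample))
```

A real edition would require rules adapted to its own typographical conventions, so the regular expressions above should be read as placeholders.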

The following subgoal consists in the selection of a series of authorship tests for the

study. These will be based on the quantification of the relative

frequency in the plays of a group of keywords chosen by the researcher, the calculation

of the plays' lexical richness and average number of words per sentence, the analysis of

the n-grams shared by the disputed text and the reference corpora, and the

application of the Zeta test (see Section 4.5 for a detailed explanation of these

procedures).  
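As an illustration of the first three measures, the following sketch computes the relative frequency of a set of keywords, a type-token ratio and the mean number of words per sentence. It rests on several assumptions that do not come from the thesis: the tokenisation rule, the use of the type-token ratio as the measure of lexical richness, and the function names are all hypothetical, and the formulas implemented in ALTXA (Section 4.5) may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens (an assumed, deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(text, keywords):
    """Occurrences of each keyword divided by the total number of tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in keywords}


def type_token_ratio(text):
    """Distinct word forms divided by total tokens (one common richness measure)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    """Average number of tokens per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


text = "The king is dead. Long live the king!"
print(relative_frequency(text, ["king", "queen"]))
print(type_token_ratio(text))
print(mean_sentence_length(text))
```

Because the type-token ratio is sensitive to text length, a comparison between scenes of different sizes would normally require some form of length normalisation, which this sketch omits.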

It seems reasonable to test the effectiveness of these methods before applying them in 

the analysis of the scenes of Arden of Faversham. To this end, a series of pre-studies will be

carried out in which the authorship of scenes taken from the Shakespearean and the Marlowian

reference corpora is attributed. The main reason for conducting these

pre-studies is to include in the final case study on the authorship of Arden of

Faversham only those tests that have proved reliable in distinguishing between samples

written by Shakespeare and Marlowe. 

Some of the methods selected for these pre-studies and the final

case study will be applied in a slightly different way from that found in the works of other scholars,

under the following hypotheses. The first is that word n-grams reflect more distinctive

linguistic constructions than character n-grams (see Section 4.5 for a thorough

explanation of the fundamentals of n-gram tracing). The second is that a Zeta

test should not compare one author against a group of authors, as has been done by other

scholars (see Kinney, 2009; Elliott & Greatley-Hirsch, 2017), but should only

compare candidates individually (see Section 4.5 for an account of the



reasons that justify the adoption of this principle, as well as of the procedures underlying 

the conduction of a Zeta test).  
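The two hypotheses can be made more concrete with a short sketch. The first function measures the share of distinct word n-grams of a disputed text that also occur in a reference corpus; the second computes a two-candidate Zeta score in the form commonly attributed to Craig's variant, namely the proportion of one candidate's segments that contain a word plus the proportion of the other candidate's segments that lack it. Both are simplified illustrations under assumed names and an assumed tokenisation rule, not the implementation found in ALTXA, and the sample segments are invented.

```python
import re


def tokenize(text):
    """Lower-case word tokens (an assumed, deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_ngrams(text, n):
    """Set of distinct word n-grams in a text."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def traced_share(disputed, reference, n):
    """Share of the disputed text's distinct n-grams also found in the reference."""
    target = word_ngrams(disputed, n)
    if not target:
        return 0.0
    return len(target & word_ngrams(reference, n)) / len(target)


def zeta_scores(segments_a, segments_b):
    """Two-candidate Zeta: scores near 2 mark words typical of A, near 0 of B."""
    if not segments_a or not segments_b:
        raise ValueError("both candidates need at least one segment")
    sets_a = [set(tokenize(s)) for s in segments_a]
    sets_b = [set(tokenize(s)) for s in segments_b]
    vocabulary = set().union(*sets_a, *sets_b)
    scores = {}
    for word in vocabulary:
        in_a = sum(word in s for s in sets_a) / len(sets_a)
        in_b = sum(word in s for s in sets_b) / len(sets_b)
        scores[word] = in_a + (1 - in_b)
    return scores


print(traced_share("the king is dead", "long live the king is he not", 2))
scores = zeta_scores(
    ["my lord the king", "the king my liege"],
    ["sweet love the night", "the night is long"],
)
print(sorted(scores.items(), key=lambda item: -item[1])[:3])
```

Restricting zeta_scores to exactly two sets of segments mirrors the principle of comparing candidates individually rather than setting one author against a pooled group.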

Lastly, a few basic notions about the computational tool that will be used to conduct 

the authorship tests of the pre-studies and the case study need to be provided to justify the 

selection of its development as one of the two main objectives of the thesis. As will be 

addressed more extensively in Section 4.5, the computer programs and programming 

languages that are currently available to carry out a forensic authorship analysis could be 

generally divided into those that present an intuitive interface, but lack some of the 

advanced functionalities that a study of this nature requires, as is the case with Voyant

Tools, and those that include a broad range of functionalities, but whose use is only

accessible to people with a solid IT background, as happens with the programming

language R. A need therefore arose for a tool that combines an intuitive interface with a wide

catalogue of the authorship tests that are common in forensic linguistic studies.

For this reason I decided to design, with the assistance of computer

programmer Carlos Antón, a program called ALTXA. This tool offers the possibility to

carry out the authorship tests selected for the conduction of the research and presents an 

accessible interface so that it can be used by linguists without experience in programming 

(see Section 4.5 for a tutorial on how to use the software and an account of what makes 

it different from others of this kind).  

There is a complementary relationship between the study of the authorship of the 

scenes of Arden of Faversham and the creation of a computational tool that can simplify 

the work of other forensic linguists and enhance the implementation of these studies in 

educational settings. Even though this discipline has generated a growing interest among 

students, it is not part of the curriculum of many European universities due to a lack of 

experts in the field and/or educational tools, with a few exceptions like Aston University 

and Cardiff University in the United Kingdom, or the Universidad Autónoma de Madrid 

and the Universitat Pompeu Fabra in Spain. In other words, the accomplishment of the 

two main objectives of this doctoral thesis can be seen as a contribution to the 

development of this relatively modern discipline in the academic community.  

This section has depicted the main objectives that this project seeks to accomplish, as 

well as a series of subgoals that allow for the fulfilment of these objectives and the main 

hypotheses on which the investigation is built. The following section will provide a 

general overview of the scope of the chapters in which the thesis is organized. 



1.3. Overview and organization of the thesis 

The present thesis is divided into eight chapters, whose thematic content will be briefly 

described in this section. The previous sections of this chapter have explained the 

background and the rationale for the research, the main objectives and the subgoals that 

it sets out to attain and a series of hypotheses that allow for the adoption of an innovative 

approach to the authorship tests on which it is built.

Following this introductory chapter, Chapters 2 and 3 will be devoted to providing 

the reader with a solid historical, literary and linguistic background that facilitates the 

understanding of the subsequent authorship analysis of Arden of Faversham.  

Chapter 2 will focus on the historical and literary background of the thesis and will 

be divided into three sections, the first one being a simplified biography of William 

Shakespeare that aims to offer some basic notions about this historical figure and the 

possible ways in which he might have been involved in the writing of Arden of

Faversham. Afterwards, the chapter will address the life of Christopher Marlowe and his

connections with William Shakespeare in order to provide substantial historical evidence

that suggests a possible cooperation between both playwrights in the composition of the play.

Lastly, Chapter 2 will offer an in-depth explanation of the story behind the play Arden of 

Faversham, its main literary features, the historical implications derived from its 

publication and the distinct approaches that have been adopted over the years to deal with 

the question of its disputed authorship. 

Chapter 3 will provide the reader with a general overview of what forensic linguistics 

consists in and the three main branches into which it is divided, the first one being the

so-called written language of the law, which focuses on the adaptation of legal documents

to make them more accessible to those citizens that do not have a deep understanding of 

the law. The second branch of the discipline is known as the spoken language of the law 

and focuses on the oral interactions underlying the legal proceedings, such as police 

investigative interviews. Lastly, the many applications of the third branch of forensic 

linguistics, entitled the forensic linguist as an expert witness, will be developed with a 

special focus on authorship attribution studies. The chapter will end with a critical review 

of previous studies on the attribution of authorship of literary texts in general and that of 

Arden of Faversham in particular, which will be of vital importance to justify the 

approach and the authorship tests selected for this investigation. 



Chapter 4 will focus on the methodological aspects of the research. It will start by 

explaining the reasons why Shakespeare and Marlowe have been selected as the 

candidates for the attribution of authorship of Arden of Faversham, instead of other 

playwrights that have also been suggested as its potential authors in previous studies. A 

detailed explanation of the criteria underlying the selection of the plays to compile the 

Shakespearean and the Marlowian reference corpora will also be provided, together with 

the process by which these texts and Arden of Faversham will be cleaned to optimize the 

effectiveness of the subsequent authorship analysis. Afterwards, this chapter will address 

the different methods selected for the research and the need to test them

in a series of pre-studies that will analyse well-attributed scenes by

Shakespeare and Marlowe as if they were disputed texts, not only to discern whether these

methods are effective enough to distinguish between the two authors, but also to estimate

what kind of results can be considered significant in the subsequent analysis of the scenes

of Arden of Faversham. This chapter will also provide an in-depth account of the creation 

of the software ALTXA, its functionalities and the niche that it could occupy in the 

academic community.  

Chapter 5 will present the results derived from the pre-studies. As underlined earlier, 

these will analyse undisputed scenes of Shakespeare and Marlowe to determine which 

methods can be considered effective enough to be included in the final case study on the 

authorship of the scenes of Arden of Faversham.  

Chapter 6 will show the results of the case study, where the authorship of the 19 scenes 

in which Arden of Faversham is divided will be analysed independently. Only those 

methods that have proved their reliability to distinguish between scenes written by 

Shakespeare and Marlowe will be used.  

The results of the case study will be discussed from a more holistic perspective in

Chapter 7, which will allow for the attribution of certain groups of scenes of the play to 

Shakespeare or Marlowe. In addition, Chapter 7 will assess whether the objectives that 

have been previously delineated in this introductory chapter have been accomplished or 

not, as well as the validity of the hypotheses that have also been formulated in the

previous section of this chapter. 



Finally, Chapter 8 will summarize the main findings of the doctoral thesis and how 

these relate to its objectives and hypotheses. It will also highlight the main limitations 

identified in the course of the research and suggest possible lines of future research.

  

CHAPTER 2 | HISTORICAL AND LITERARY BACKGROUND 

The present chapter intends to offer a historical and literary introduction to the authors

and the play that constitute the focus of the thesis. Considering that Arden of Faversham

was written around 1592, William Shakespeare’s life events until the last

decade of the sixteenth century and a complete biography of Christopher Marlowe will 

be provided, since the latter was murdered in the year 1593. In other words, this chapter 

aims to offer a general idea of what both playwrights had accomplished before and during 

the period in which Arden of Faversham was created. Additionally, the play itself will be 

presented and discussed from a historical and literary perspective that will address the 

question of its disputed authorship, which will allow for the establishment of a connection 

between this chapter of the thesis and the following, where the fundamentals of forensic 

linguistics and authorship attribution studies will be expounded. 

2.1. William Shakespeare 

The main objective of this section is to present a simplified biography of William 

Shakespeare that will mainly focus on the events that occurred until the period in which 

Arden of Faversham was written. His relationship with

Christopher Marlowe will be discussed after the biography of the latter is presented in the

next section. 

William Shakespeare (1564-1616) was the son of John Shakespeare, a Catholic glover 

who managed to become a successful businessman by selling wool and, ultimately, a 

distinguished member of the political elite in Stratford-upon-Avon, although he ended up 

facing economic and legal issues during the last years of his life (Fallow, 2016). Wood 

(2016) states that, despite the fact that there is little historical record of Mary Shakespeare, 

it is known that she inherited lands from her father and married John Shakespeare, with 

whom she had eight children, three of whom died prematurely. As a result,

William Shakespeare became the eldest of the five siblings who reached adulthood. 

Halliday (1964) suggests that if the social status of his parents is taken into

consideration, the likeliest possibility is that William Shakespeare had the opportunity to 

attend the local school in Stratford-upon-Avon, where he received a free education until 

the age of sixteen. According to Schoenbaum (1985) and Honigman (2001), the Bard 

attended the New King’s School, where he primarily focused on Ovid’s Metamorphoses, 

as well as on the works of Virgil, Plautus and Cicero. Even though the Elizabethan 



dramatist Ben Jonson accused him of knowing “little Latin and less Greek” in his famous 

poem,1 Honigman argues that “Shakespeare probably read Latin as easily as most 

graduates with honours in Latin today” (2001, p. 2). He further adds that the Bard was 

acquainted with Greek tragedies, “either in the original or in Seneca’s adaptations” (2001, 

p. 3). 

As previously mentioned, Shakespeare abandoned his studies at the age of sixteen, 

which marks the beginning of his so-called lost years (Holland, 2007), a name that reflects the gap

in historical knowledge regarding his whereabouts throughout the subsequent years.

There is considerable speculation about his development as a playwright after he left the 

King’s School, but the most accepted theory is that Shakespeare worked as a country 

schoolmaster in Lancashire (Losey, 1927; Honigman, 2001; Holland, 2007; Potter, 2012). 

This theory is built upon the figure of John Cottom, one of Shakespeare’s teachers during 

his last year at school who came back to his hometown in Lancashire with his brother, a 

Catholic who was eventually tried and executed. According to Holland, it was John Cottom

who “encouraged Shakespeare, as a member of a recusant Catholic family, to be a 

schoolteacher in a staunchly Catholic household in the north of England” (2007, p. 8). 

The main piece of evidence that has led to such supposition can be found in the will of 

Alexander de Hoghton of Lea Hall, where he advises his neighbour in Lancashire, Sir 

Thomas Hesketh, to hire someone called William Shakeshaft as a servant (Honigman, 

2001; Holland, 2007). Additionally, Potter explains that “Hoghton bequeathed his 

musical instruments and ‘play-clothes’ to his heir, in case he wanted to ‘keep players’” 

(2012, p. 48). Hence, Shakespeare may have started to write and perform in the 

abovementioned plays, given that “the performance of plays by boys was recommended 

by forward-looking schoolmasters” (Honigman, 2001, p. 3). 

Regardless of what Shakespeare did during those years, historians agree on the fact 

that he was back in Stratford by November 1582, since the license for his marriage with 

Anne Hathaway, who was already pregnant with their first daughter, Susanna, is still 

preserved (Honigman, 2001; Holland, 2007; Potter, 2012). Two years after their first 

daughter was born, William Shakespeare and Anne Hathaway had twins, named Judith 

and Hamnet, and Holland points out that “there are no records of further children” (2007, 

p. 10). One could ponder that it was unusual for a couple to only have three children at 

 
1 https://www.poetryfoundation.org/poems/44466/to-the-memory-of-my-beloved-the-author-mr-william-shakespeare



that time, and hence Honigman suggests that “it may have been shortly thereafter that he 

left home for a career in the theatre” (2001, p. 3). 

As Holland (2007) explains, what Shakespeare did between 1585 and 1592 remains 

unclear and has been an object of speculation for scholars. According to Potter (2012), it 

is probable that the Bard joined the Queen’s Men after they performed in Stratford in 

1587. This theory is built upon the idea that Shakespeare was incorporated as a 

replacement for William Knell, one of the leading actors of the company who was 

murdered in a fight that year. Regarding the reasons behind the selection of Shakespeare 

for such position, Potter states the following: 

The 23-year-old Shakespeare would have had to be very impressive to take over 

from the man who played the title role in The Famous Victories of Henry V; it 

would have been easier for this large and distinguished company to promote one 

of its own players. (2012, p. 54)  

It is worth mentioning that, with the purpose of supporting the abovementioned theory, 

Holland (2007) illustrated in his work the way in which some of the plays that were 

performed by the Queen’s Men may have had an influence on Shakespeare’s early plays. 

The first solid piece of evidence of Shakespeare’s reputation as a playwright dates 

back to 1592 and it is a written document in which Robert Greene, another dramatist, 

presented the Bard as an intruder who had undeservedly gained popularity among his 

contemporaries: 

In his Groat’s Worth of Wit Robert Greene addressed three “gentlemen, his 

quondam acquaintance, that spend their wits in making plays” (Marlowe, Peele, 

Nashe) and denounced “an upstart crow, beautified with our feathers, that with 

his ‘Tiger’s heart wrapped in a player’s hide’ supposes he is as well able to 

bombast out [i.e. write] a blank verse as the best of you: and, being an absolute 

Johannes fac totum, is in his own conceit the only Shake-scene in a country.” 

(Honigman, 2001, pp. 3-4) 

Honigman (2001) further states that Greene was clearly mocking the verse “O tiger’s 

heart wrapped in a woman’s hide” from Henry VI, Part III and was trying to create a 

distance between Shakespeare and the rest of the Elizabethan dramatists like Marlowe 

and himself, who had attended university, in contrast to the Bard. As can be inferred from



the quote presented above, by the year 1592, William Shakespeare was already 

established as a prominent playwright in London, where he combined the writing

of his plays with performing his own characters on stage, as could have been the case

with Arden of Faversham (see Section 2.3).

It seems impossible to discern the exact dates on which the Bard’s early plays were

written and thus experts cannot determine with certainty whether Marlowe was

Shakespeare’s predecessor or his contemporary (Honigman, 2001). As Holland explains, 

scholars have structured the chronology of these plays according to their own vision of 

the playwright, given that “each reordering produces a new narrative for Shakespeare’s 

contact with other plays and other dramatists, his reading, and his development as a 

dramatist” (2007, p. 14). In any case, Greene’s text in 1592 proves beyond reasonable 

doubt that Shakespeare had already written the three parts of Henry VI by that year, and 

the likeliest possibility is that he collaborated with Marlowe in their creation, as will be 

explained further on (see Section 2.2). 

Regardless of the exact dates on which they were written, scholars agree

that Shakespeare also wrote, among other plays, The Two Gentlemen of Verona, The 

Taming of the Shrew, Titus Andronicus and Richard III during the first half of the decade, 

that is, when Arden of Faversham was produced, and that he was already established as 

a prestigious playwright in London. Furthermore, due to the closing of theatres in the city 

between 1592 and 1594 because of the plague, Shakespeare wrote the poems Venus and 

Adonis and The Rape of Lucrece, which he dedicated to the Earl of Southampton 

(Halliday, 1964; Schoenbaum, 1985; Honigman, 2001; Holland, 2007; Potter, 2012).  

As time went by, Shakespeare proved to have inherited his father’s talent for business. 

Honigman explains that “as he prospered, he took on new responsibilities, with four 

distinct roles in his company: ‘sharer’ […] of the company’s assets […], ‘house-holder’ 

[…] of the Globe and Blackfriars theatres, dramatist [and] actor” (2001, p. 5). Even 

though the later years of his life do not constitute the focus of this biography, it is

worth mentioning that, at the beginning of the seventeenth century, the Bard

acquired more properties and experienced the most prolific period of his career as a

playwright, in which he created literary masterpieces such as Hamlet and Othello until 

his death in 1616 (Honigman, 2001). 



In sum, this simplified biography has provided an insight into William Shakespeare’s 

early education, as well as the most plausible speculations concerning his development 

and establishment as a playwright, which differs from the traditional academic path 

followed by authors like Christopher Marlowe, who constitutes the focus of the following 

section.  

2.2. Christopher Marlowe  

This section intends to offer a brief biography of the dramatist Christopher Marlowe that 

will address his educational background, the details of his alternative life as a spy and his 

premature death at the age of 29. Afterwards, his relationship with William Shakespeare

and the collaboration between both playwrights in the elaboration of Henry VI (Parts I, 

II and III) will be discussed. Lastly, the popular hypothesis about Marlowe’s allegedly 

fake death will be examined from a historical perspective with the aim of providing the 

reader with a background for the many speculations that have been created over the years 

concerning his figure. 

Christopher Marlowe (1564-1593) was the son of a humble shoemaker in Canterbury 

(Riggs, 2004; Hopkins, 2008; Greenblatt & Logan, 2012; Nicholl, 2016). According to 

Riggs, his education began at petty school when he was six. Such lessons were not

assigned a permanent building, but the likeliest possibility is that

Marlowe learned how to read and write in the church of St. George the Martyr, where the 

syllabus was mainly based on “religious instruction rather than practical skills” (2004, p. 

25), given that Queen Elizabeth regarded the education of children “as a way of 

fashioning obedient subjects” (2004, p. 27).  

During the year 1578, Marlowe got a scholarship to attend the King’s School in 

Canterbury, even though it is believed that he had already been studying there before he 

was given the scholarship. Two years later, in 1580, Marlowe moved to Cambridge, 

where he was admitted to Corpus Christi College once he was awarded the Parker

Scholarship (Honan, 2006; Hopkins, 2008).  

Hopkins highlights the fact that the Parker Scholarship was “essentially designed to 

be held primarily by students intending to proceed to holy orders” (2008, p. 5) and she

further suggests that it is unclear whether Marlowe really had in mind the idea of pursuing 

an ecclesiastical career or if he simply saw this scholarship as an opportunity to secure a 



high-quality education. In any case, the dramatist ended up being accused of atheism after 

he allegedly criticized and mocked Christianity in public, as will be developed further on. 

Marlowe completed his BA and MA degrees at Cambridge, where

he particularly focused on theology, philosophy and Greek and worked on translations

of Ovid and Lucan (Riggs, 2004; Hopkins, 2008). It was during those years

that the young dramatist established a close relationship with adult playwrights such as 

Robert Greene and Robert Sidney, who may have helped to shape his literary style

(Tallent, 2007; Hopkins, 2008; Nicholl, 2016). As a matter of fact, Hopkins states that it 

is “highly probable that he had already written one or both of Dido and the first part of 

Tamburlaine while still at the university” (2008, p. 8). As can be inferred from the 

biographical notes presented above, Marlowe was a precocious talent who managed to

stand out as a promising playwright from an early age. Furthermore, the dramatist

apparently combined his life as a student with working as a spy for the Protestants

(Honan, 2006; Hopkins, 2008; Greenblatt & Logan, 2012). 

The aforementioned theory about Marlowe’s collaboration with the Protestant regime

as a spy becomes considerably more plausible in light of the controversies that arose

when he applied for his MA degree in 1587. The university was reluctant to give

Marlowe his degree on the grounds that the young dramatist had the intention of going to

Rheims, which “had been the home of the seminary to which young English Catholic 

gentlemen could go in secret to train for the priesthood, which they were forbidden to do 

in Elizabeth’s Protestant England” (Hopkins, 2008, p. 10). Nevertheless, the Privy 

Council contacted the university and demanded that Marlowe should be granted his 

degree under the principle that “it is not Her Majesty’s pleasure […] that anyone 

employed as he had been in matters touching the benefit of his country should be defamed 

by those that are ignorant in the affairs he went about” (Greenblatt & Logan, 2012, p. 

1106), which makes it seem that Marlowe was sent to Rheims by the Protestants 

themselves to spy on the Catholics, according to the authors. Indeed, Tallent (2007) 

stresses the fact that his work as a spy was crucial for his development as a playwright

and that both professions were highly complementary. 

Despite his probable cooperation with the Protestant regime, Marlowe was accused 

of atheism by Thomas Kyd and Richard Baines, who testified that “it was the dramatist’s 

custom in table talk to jest at the Scriptures, gibe at the efficacy of prayer, and strive in 

argument to confute the sayings of prophets and holy men” (Kocher, 1948, p. 111). 



Hopkins (2008) states that Kyd probably gave such testimony under torture and that 

Richard Baines should not be considered a reliable witness, given that he and Marlowe 

had been arrested the previous year in Flushing for coining and had accused each other,

which might reflect the hostility that previously existed between them. Greenblatt and 

Logan explain that these accusations could have been a relevant factor for his premature 

death:  

On May 30, 1593, an informer named Richard Baines submitted a note to the 

Council which, on the evidence of Marlowe’s own alleged utterances, branded

him with atheism, sedition and homosexuality. Four days later, at an inn in the 

London suburb of Deptford, Marlowe was killed by a dagger thrust, purportedly 

in an argument over the bill. (2012, p. 1107) 

Even though Hopkins (2008) is unsure whether Richard Baines’ note was submitted on May

27 or June 2, she supports the theory that Marlowe was stabbed to death at an inn in

Deptford by a man called Ingram Frizer. Regarding the reasons behind the murder, the

author points out that there were three people with Marlowe at the crime scene:

Ingram Frizer himself, who was known to be involved in “shady business dealings”

(2008, p. 18) with Nicholas Skeres, who was also present at the inn, and Robert Poley, a

member of the intelligence services. She further suggests the following:

The fact that the men spent all day together before Marlowe died does not really 

suggest a premeditated killing; it perhaps indicates more negotiations that had 

gone wrong, or, as they themselves say, an unexpected disagreement, in which 

Marlowe was outnumbered. (2008, p. 19) 

As can be observed in the quote presented above, the events that led to Marlowe’s 

assassination remain mysterious and it seems impossible to discern if it was due to a 

simple argument about the bill, a negotiation that went wrong, or if the Protestant regime 

ordered his execution as a result of the accusations of atheism. It is worth

highlighting that “those who were arrested in connection with the

murder were briefly held and then quietly released” (Greenblatt & Logan, 2012, p. 1107). 

Now that the main events of Christopher Marlowe’s life have been outlined, it is time to

discuss his relationship with William Shakespeare. Even though it cannot be stated with

certainty that the two writers knew each other, this could be seen as a solid theory if certain

factors are taken into consideration: firstly, the fact that their residences in London were



considerably close (Hopkins, 2008) and, secondly, that both playwrights were widely 

acknowledged in their guild (Astrana, 1964). Lastly, it must be pointed out that, given the 

strict deadlines that had to be met, the cooperation between two or more playwrights in 

the production of their plays became a frequent practice during the Elizabethan period. 

Indeed, Holland suggests that “a minority of plays had a single author” (2007, p. 15) and 

Kermode (2005) further hypothesized that the five acts of some Elizabethan plays might 

have been elaborated by five distinct playwrights due to the abovementioned time 

constraints.  

For this reason, numerous studies have sought to offer substantial

evidence of collaboration in plays whose authorship has been

traditionally attributed to Shakespeare. These studies have been conducted from

historical, literary and, as in the case of this thesis, linguistic approaches (see Section 

3.4.4). In the light of the findings provided by these lines of research, The New Oxford 

Shakespeare has credited Christopher Marlowe as the co-author of Henry VI (Parts I, II 

and III).2  

Finally, the hypothesis that Marlowe’s assassination was a set-up in which he

exchanged his clothes with a corpse to leave the country and cover up later clandestine 

activities will be briefly addressed. Nicholl (2016) presented an extensive review of this 

conspiracy theory, which is based on the notion that Christopher Marlowe faked his own 

death with the purpose of escaping from the accusations of heresy and ran away to 

Europe, where he continued to work as a playwright and sent his works back to

England, which were ultimately signed by William Shakespeare. I will not dwell on the 

details of this hypothesis since, as Nicholl proves in his work, it lacks a solid historical 

basis, given that much of the information that has been presented as proof was indeed 

taken from fictional works. Consequently, the present thesis is built upon the idea that 

Marlowe was murdered in 1593, which allows it to focus on the authorship of Arden

of Faversham, since it was written before that year. The real events that inspired the 

elaboration of this play, as well as its main literary features and reception in the academic 

community, will be addressed in the following section. 

 
2 https://www.theguardian.com/culture/2016/oct/23/christopher-marlowe-credited-as-one-of-shakespeares-co-writers




2.3. The anonymous play Arden of Faversham 

The final section of this chapter aims to provide information about the plot of the play 

Arden of Faversham, its historical origin, the impact that it may have had on

Elizabethan society and the wide range of approaches that have been adopted over the

years to determine its likeliest authorship, given that it still remains unclear.

Lastly, the main reasons underlying the selection of this text as the focus of the 

investigation will be pointed out. 

In his Chronicles of England, Scotland and Ireland, Holinshed stated that “there was 

at Fa[v]ersham in Kent a gentleman named Arden, most cruell[y] murdered […] by the 

procurement of his own wife” (1587, p. 1062). The author further explained the whole 

story behind the assassination of Arden, a landowner from Faversham who was stabbed 

at his own residence while he was playing backgammon. This crime was perpetrated by 

Alice, who was his wife, Mosby, who was maintaining an adulterous relationship with 

Alice, and two criminals who were hired for this endeavour. Therefore, the idea for the 

play Arden of Faversham was inspired by a real event that had been depicted in 

Holinshed’s historical work, which was a common source of inspiration for dramatists 

(Barker & Hinds, 2003; Dudgeon, 2009). 

Before addressing Arden’s death, the play portrays a succession of attempts to kill

him that consistently fail, sometimes in a comical manner, for instance when

Black Will and Shakebag, the two killers that were hired by Alice, desperately try to find 

Arden in the middle of the fog until Shakebag falls into a ditch. It must be pointed out 

that all the characters of the play retain the names of those who were originally

described in Holinshed’s Chronicles of England, Scotland and Ireland except for the 

criminal Loosebag, who was portrayed as Shakebag in the play, which could be an 

indication of Shakespeare’s participation in its performance and, perhaps, its elaboration. 

The text is catalogued as a domestic play (Barker & Hinds, 2003; Richardson, 2006; 

Dudgeon, 2009; Christensen, 2017), which means that it deals with the life events of the 

middle classes, instead of narrating the misfortunes of kings and nobles, on whom 

Elizabethan tragedies were mainly focused. According to Barker and Hinds (2003), one 

of the most innovative aspects of the play is that it becomes hard for the audience to 

sympathize with any of the characters, given that even Arden, the victim of the crime, is 

presented as a sinner. Christensen points out the fact that, in addition to being a greedy 



landowner who shows no mercy to those whose lands he has taken throughout

the play, the character of Arden “comes home only long enough to leave again, attending 

to a succession of business obligations, yet he is also unwilling to transfer power at home” 

(2017, p. 33). As a result, every character is punished with death at the end of the play, 

with the notable exception of Franklin, Arden’s best friend and one of the few relatable 

characters for the audience together with Bradshaw, a goldsmith who was neither involved in

nor aware of the assassination plans but ended up being executed for delivering a letter

from the criminals.

Taking into consideration the abovementioned notions about the characters’ sins and 

their subsequent punishment, one could argue that the play intends to reinforce the

traditional family values that characterized the Protestant society. Nevertheless, Barker 

and Hinds state that if Arden of Faversham is not read as a fictional play, but as a historical 

text, the audience may shift its attention from the flaws of its characters to the

economic, social and political agents affecting their actions. They further indicate that 

“[f]rom this perspective, Arden of Faversham becomes a play profoundly concerned with 

the deleterious impact of the Reformation, and the consequent transfers of land 

ownership, on social kinship bonds and responsibilities” (2003, p. 78). In sum, the authors 

suggest that the interpretation of the play as a historical document enables a social 

analysis that would otherwise have been overlooked.

Once a summary of the play has been provided, together with a brief historical, 

literary and social analysis, the text will be approached from a legal perspective. Even 

though Arden of Faversham is still considered anonymous, it has been traditionally 

associated with Shakespeare, Marlowe and even Thomas Kyd (Barker & Hinds, 2003). 

The fact that the text was entered into the Register of the Stationers’ Company in 1592 and was

printed that same year by Edward White, who also published William Shakespeare’s Titus

Andronicus, Christopher Marlowe’s The Massacre at Paris and Thomas Kyd’s The

Spanish Tragedy could be considered as an indication of a common link among the three 

authors and their possible cooperation in the elaboration of the play (M. Goicoechea, 

personal communication, June 7, 2020). Nevertheless, as can be seen in the works of 

Kinney (2009) and Taylor (2019), there is a plethora of alternative candidates for the 

attribution of authorship of the play that differ from the three abovementioned 

playwrights, such as Robert Greene, Anthony Munday, George Peele or Thomas Watson, 

among others (see Section 3.4.4 for a more detailed list of the possible authors and Section 



4.1 for an explanation of the reasons underlying the selection of Shakespeare and 

Marlowe as the candidates for the present study). 

Kinney (2009) made a distinction among the three approaches that have been adopted 

over the years with the purpose of determining the likeliest authorship of Arden of 

Faversham. Firstly, between the sixteenth and the eighteenth century, attribution was based on

paratextual parameters, such as the claims that appeared on the title pages. Secondly, the 

author mentions the existence of a period in which the attribution of authorship of the 

play relied on literary criteria, for instance “shared common words, parallel passages, and 

even commonality of tone” (2009, p. 81). Lastly, he points out that the nineteenth century 

was the point of departure for a scientific approach to which the statistical procedures that 

currently characterise the field of forensic linguistics can be related. In fact, Kinney 

(2009) conducted a forensic linguistic analysis of the play where, although he did not 

achieve solid results for most of the scenes, he attributed the authorship of certain sections 

of the text to William Shakespeare using the Zeta test (see Section 3.4.4). 

Finally, the two main reasons behind the selection of Arden of Faversham as the focus 

of this thesis will be briefly highlighted. Firstly, the fact that the play remains anonymous 

is highly convenient for a study of this kind. In addition, the computational resources that 

have been developed over the last decades allow for the adoption of innovative ways to 

analyse historical texts, which can complement the works of other scholars (see Section

1.1). 

2.4. Summary 

On the whole, this chapter has provided the reader with a historical and literary

introduction to the play whose authorship constitutes the focus of the thesis and the

playwrights that will be considered as the potential candidates for such attribution. These 

candidates are William Shakespeare and Christopher Marlowe, who were active 

playwrights at the time when Arden of Faversham was published and are thought to

have worked together on Henry VI (Parts I, II and III). The play Arden

of Faversham has been portrayed as a literary work with a historical origin and its plot

and main literary features have been discussed in the belief that they will

aid understanding of the subsequent linguistic analysis. This study will be

conducted from a forensic linguistic approach, which belongs to a modern branch of 



applied linguistics that will be explained in depth throughout the following chapter, which 

will stand as the linguistic background for the investigation.  

  

CHAPTER 3 | LINGUISTIC BACKGROUND: AN INTRODUCTION TO 

FORENSIC LINGUISTICS AND AUTHORSHIP ATTRIBUTION STUDIES 

The present thesis intends to conduct a forensic linguistic study of the play Arden of 

Faversham to discern its likeliest authorship, and therefore the first section of this chapter 

aims to expound what forensic linguistics is. Afterwards, an explanation of its historical 

development and applications will be provided, as well as a review of previous research 

on authorship attribution studies in general and, ultimately, on the authorship of Arden of 

Faversham in particular, with the objective of progressively narrowing down the scope

of the investigation. 

3.1. Definition of forensic linguistics 

A series of complementary definitions of forensic linguistics will be presented to provide 

the reader with a basic notion of what this discipline is based on. The International 

Association for Forensic and Legal Linguistics (IAFLL) states on its website that the 

discipline “covers all areas where law and language intersect” (2020, About section). 

McMenamin defined it as “the scientific study of language as applied to forensic purposes 

and contexts” (2002, p. 67). Similarly, Perkins and Grant delineated it as “a branch of 

applied linguistics relating to the law and legal processes” (2012, p. 174) and Momeni 

stated that “forensic linguistics as a sub-branch of linguistics is a new-born science which 

makes a connection between linguistics and the law” (2011, p. 733). Gibbons presented 

a definition given by the AILA Scientific Commission on Forensic Linguistics, which was 

based on the idea that the objective of a forensic linguist is “to support the study of the 

link between language and law in all its forms” (2003, p. 12). Lastly, Olsson pointed out 

that there are two ways to describe what forensic linguistics is. On the one hand, it could 

be outlined “by considering the kinds of text forensic linguists are sometimes asked to 

examine. If a text is somehow implicated in a legal or criminal context then it is a forensic 

text” (2008, p. 1). On the other hand, it could be labelled as “the application of linguistics 

to legal questions and issues” (2008, p. 3). 

Therefore, the presence of expert linguists in courts should be normalized, as pointed 

out by Shuy:  

[…] specialists in any field often have something useful to contribute to lawyers 

as they try their cases. For many years, medical doctors, psychiatrists, engineers 



and others have been called on to testify many times in civil or criminal law cases. 

(2002, p. 24) 

In other words, the emergence of the figure of the forensic linguist could be seen as 

necessary, considering that there is no law without language (Tiersma, 1993; Correa, 

2013; Udina, 2017).  

In brief, forensic linguistics could be broadly designated as the intersection between 

language and law, and it is the inherent relationship between both which justifies the 

necessity of this discipline, given that the law is articulated with language. 

3.2. Historical development of forensic linguistics 

The birth of the term forensic linguistics dates back to the year 1968, when Professor Jan 

Svartvik published The Evans Statements: A Case for Forensic Linguistics. This 

investigation focused on four statements that Timothy John Evans, accused of murdering 

his wife and daughter, had allegedly dictated to police officers in 1949, incriminating 

himself in the homicides for which he was ultimately hanged a year later. In his work, 

Svartvik proved that those statements were unlikely to have been uttered by someone with 

Evans’s educational background and that they presented a series of idiolectal

inconsistencies among them, for instance when giving time indications:  

In the present case, it seems unlikely that the illiterate Evans would have said “the 

12.55 a.m. train”, particularly since in two previous statements and in the witness-

box at the trial, he is recorded as saying “the five to one train” in describing the 

same event. (1968, p. 20)  

After the execution of Evans, it was discovered that John Christie, who lived in the

same building as Evans, was the one who had killed Evans’s wife and daughter, which reinforces

Svartvik’s hypothesis that somebody edited the transcription of the statement provided 

by Evans.  

Even though the term forensic linguistics was not used until 1968, it should be pointed 

out that “forensic linguistics is a new discipline with a long history” (Goustos, 1995, p. 

99). As a matter of fact, the application of linguistic knowledge to legal contexts could 

be found in ancient Greece, where playwrights used to accuse each other of plagiarism 

(Olsson, 2008). Coulthard et al. explain that philosophers showed great interest in the 

relationship between language and law, and for instance Aristotle wrote in the fourth 



century B.C. a “typology of rhetoric according to the occasions it served, distinguishing 

between political, ceremonial and forensic oratory, the latter associated with the 

courtroom” (2010, p. 529). Similarly, during the first century, Gaius Aelius Gallus 

elaborated a monolingual dictionary in Latin with the purpose of providing accurate 

definitions for terms that were frequent in the legal contexts of the time (Coulthard et al., 

2010). 

These authors further explain that there have been many historical moments in

which laws have been enacted with a direct effect on linguistic practices, and they name a

few cases:

On a practical level, issues of language rights and language planning also have a 

long history. In England, the Pleading in English Act of 1362 was enacted to 

replace French with English in legal proceedings, and the Blasphemy Act of 1650 

penalized acts of, inter alia, “filthy and lascivious speaking”, although rather than 

being aimed at suppressing bad language, it was in fact an attempt to silence a 

Protestant sect known as the Ranters […]. A law with a significant impact on the 

linguistic situation in Spain was King Charles III’s 1768 decree giving the 

Castillian dialect priority in administration and education. (Coulthard et al., 2010, 

p. 530) 

Continuing on the subject of the historical relationship between language and law, Olsson 

(2004, 2008) points out that it was in the first decades of the eighteenth century when the 

earliest controversy about the Bible’s authorship was documented. This arose when a 

priest from Germany called H. B. Witter suggested that the Pentateuch may have been

written in collaboration by several unknown authors. These lines of thought were 

supported a century later by J.G. Eichhorn, a professor at the University of Jena, and by 

the end of the nineteenth century, the arrival of Darwinism generated a deeper interest in 

authorship attribution studies concerning the Bible. 

According to Olsson (2004, 2008), the attribution of authorship of Shakespearean 

texts has also been an object of speculation among scholars for over two centuries, 

especially after Reverend James Wilmot wrote in 1785 that Francis Bacon was the

actual author of some of the plays whose authorship had been traditionally attributed to

the Bard. 



The author further suggests in his work that the first properly scientific paper on 

authorship attribution was that of Mendenhall in 1887, in which, based on a letter sent by

Professor De Morgan thirty years before, he conducted a study that was built upon the 

following principle:  

[…] every writer makes use of a vocabulary which is peculiar to himself […]. In 

the use of that vocabulary in composition, personal peculiarities in the 

construction of sentences will, in the long-run, recur with such regularity that short 

words, long words, and words of medium length, will occur with definite relative 

frequencies. (1887, pp. 238-239)

As can be inferred from the quote presented above, Mendenhall argued that the average

number of letters per word could be a reliable discriminator to determine the likeliest 

authorship of a given text. To validate this hypothesis, he compared fragments of Charles 

Dickens’ works among themselves and, at the same time, with excerpts taken from other 

authors, and concluded that there was a high degree of resemblance among the extracts 

of Dickens and that these were different from the ones written by the other authors. 

Afterwards, he carried out similar procedures with the works of others, such as John 

Stuart Mill, in order to demonstrate that an author’s literary style remains stable and thus 

some of its features could be quantified for further studies on authorship attribution.  
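
As an illustration only, the principle described above can be sketched in a few lines of Python. The sketch is not the original procedure, which was carried out by hand; it assumes that a word is any unbroken run of ASCII letters, and the function names (word_lengths, mean_word_length, length_profile, profile_distance) are hypothetical names chosen for this example. It computes the average number of letters per word of a text and the relative frequency of each word length, and compares two such profiles by summing the absolute differences between them.

```python
import re
from collections import Counter


def word_lengths(text):
    """Return the number of letters in every word of the text.

    Assumption of this sketch: a word is an unbroken run of ASCII letters,
    so "don't" is counted as two words ("don" and "t").
    """
    return [len(word) for word in re.findall(r"[A-Za-z]+", text)]


def mean_word_length(text):
    """Average number of letters per word (0.0 for a text with no words)."""
    lengths = word_lengths(text)
    return sum(lengths) / len(lengths) if lengths else 0.0


def length_profile(text):
    """Relative frequency of each word length found in the text."""
    lengths = word_lengths(text)
    counts = Counter(lengths)
    total = len(lengths)
    return {n: counts[n] / total for n in sorted(counts)}


def profile_distance(profile_a, profile_b):
    """Sum of absolute differences between two profiles (0 means identical).

    This distance is an illustrative choice, not part of the 1887 study.
    """
    all_lengths = set(profile_a) | set(profile_b)
    return sum(
        abs(profile_a.get(n, 0.0) - profile_b.get(n, 0.0)) for n in all_lengths
    )
```

Under these assumptions, two excerpts by the same author would be expected to yield a smaller profile distance than excerpts by different authors, which is the regularity that the study described above set out to test. In practice, long samples would be needed for the frequencies to stabilise.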

During the twentieth century, there were studies that analysed the intersection 

between language and law before Svartvik coined the term forensic linguistics, for 

instance Philbrick’s Language and the Law: The Semantics of Forensic English (1949), 

where he deconstructed the language used in courts by analysing its principles; and 

Mellinkoff’s The Language of the Law (1963), in which he made an accurate description 

of the language used in British and American legal contexts, providing an exhaustive 

explanation of its development since the times before the Norman conquest until the 

twentieth century. Furthermore, he argued that legal language should be more intelligible, 

presenting cases where misunderstandings were prone to happen, this being a premonition 

of Gibbons’ current lines of research (see Section 3.3.1). 

After the publication of The Evans Statements: A Case for Forensic Linguistics, which 

marked, as pointed out at the beginning of this section, the official birth of the discipline, 

Coulthard (2010) explains that there was little research in the field over the following decades, 

with the exception of Robert Shuy’s contributions in America (see Section 3.3.3). 



Nevertheless, already a decade ago, Coulthard highlighted the fact that “during the past 

fifteen years there has been a rapid growth in the frequency with which legal professionals 

and courts in a number of countries have called upon the expertise of linguists” (2010, p. 

15). 

Forensic linguistics has turned into a relatively well-established discipline with its 

own association, which is called the International Association for Forensic and Legal 

Linguistics (IAFLL)3 and organizes a biennial international conference, as well as its own 

specialized scientific journal, entitled The International Journal of Speech, Language and 

the Law, formerly known as Forensic Linguistics. In addition, there has been an 

exponential growth in the number of specialized undergraduate courses on forensic 

linguistics and the Universities of Aston and Cardiff offer their own MA in Forensic 

Linguistics. Research in this discipline is currently divided into three main fields of study, 

which will be presented in the following section. 

3.3. Areas of forensic linguistics 

This section aims to offer a brief overview of the main research areas that can be found 

within the framework of forensic linguistic studies to provide the reader with a general 

idea of the types of investigation which have been developed over the last decades. A 

more exhaustive explanation of the principles of each area and a review of their most 

famous cases will be presented in the following sub-sections.  

Forensic linguistics is an interdisciplinary field, and this can be exemplified by the 

wide range of crimes that can be perpetrated through language to some extent. As 

highlighted by Momeni, “[l]anguage crimes are insult, foul language, bribery, perjury, 

false advertisement, etc. Even crimes like larceny, kidnapping and murder which require 

language before realization can be considered as language crimes; therefore, they need 

linguistic analysis” (2011, p. 733). 

Despite the bewildering variety of crimes in which language can be involved and the 

many legal contexts in which the figure of a linguist could be of use, the forensic linguistic 

community has reached an agreement on the three main branches into which their research 

can be divided, which could be delineated as follows: 

 
3 This name was selected in 2021 by the members of the association, who were given the opportunity to 

vote for it. Before that, it was called the International Association for Forensic Linguistics (IAFL). 



A) The written language of the law: This area of forensic linguistics is focused on the 

deconstruction of written legal texts, such as contracts and courtroom instructions, 

with the purpose of finding the linguistic patterns that characterize them. The main 

objective underlying these lines of research is to cope with the problems that may 

be found by average citizens when they do not understand the content and, by 

extension, the implications of these documents due to the complexity of their 

vocabulary and/or grammatical constructions. In other words, what forensic 

linguists try to do is to make the law more accessible to the majority of the 

population, regardless of their educational background (Coulthard, 2010; Perkins 

& Grant, 2012). 

B) The spoken language of the law: Forensic linguists are expected to examine the 

oral interactions that take place in legal contexts, for instance during the 

communication of rights at the moment of arrest, in police investigative interviews 

or in those courtroom interactions in which interpreters are required to assist 

someone who is involved directly in the case and does not speak the native 

language of the members in court (Coulthard, 2010; Coulthard et al., 2010; 

Kredens, 2016). 

There are cases in which the spoken and the written legal discourse may overlap, as 

happens with police cautions, which are written statements uttered by police officers 

during an arrest and before investigative interviews to inform the suspect about their 

rights. Hence, police cautions could be considered as “a written legal text that has to be 

performed as spoken interaction” (Perkins & Grant, 2012, p. 175). For that reason, these 

two branches of forensic linguistics could be unified and presented as “the language of 

the legal process” (Coulthard & Johnson, 2010, p. 11). 

C) The linguist as expert witness: Forensic linguists can be called to the stand and 

provide their expert testimony in court, as well as linguistic evidence that may 

have an impact on the result of the trial. In other words, this branch could be 

designated as “that portion of forensic linguistics which provides advice and 

opinions for investigative and evidential purposes” (Coulthard et al. 2010, p. 536). 

Even though one of the main areas of interest of this branch is authorship 

attribution, which is the focus of the present research, there are many other 

cases in which the expertise of a forensic linguist can determine the course of a 



trial. These research areas will be schematically presented below (Gibbons, 2011) 

and further explained in Sections 3.3.3 and 3.4: 

• Cases of inappropriate communication and language crimes, such as 

vilification or harassment. 

• Cases of legal disputes among trademarks. 

• Cases of meaning transfer. 

• Cases of disputed or anonymous authorship. 

The criteria for the admissibility of evidence tend to differ depending on the system where 

the trial takes place. More than a decade ago, Grant pointed out that “the United Kingdom 

jurisdictions do not yet use specific scientific criteria to decide on the acceptability of 

scientific evidence. It is the expert rather than the method that is approved” (2007, p. 3). 

Nevertheless, as the author predicted in his article, the British courts have progressively 

embraced the influence of the United States, where the admissibility of evidence in legal 

settings is determined by the so-called Daubert criteria, which were established after the 

resolution of a trial in which two people alleged that the serious defects with which they 

were born were the result of the Bendectin that their mothers had taken while 

pregnant (Daubert v. Merrell Dow Pharmaceuticals, Inc., 1993). Even though the 

plaintiffs presented the testimony of well-credentialed experts, the case was dismissed 

under the principle that the scientific techniques through which they incriminated the 

pharmaceutical company had to be unanimously considered reliable by the scientific 

community to be admissible in court. 

The outcome of this trial established a precedent that determined the nature of the 

evidence that can be admitted in court, especially when novel sciences such as forensic 

linguistics are involved in the case (Howald, 2008). According to the Daubert criteria, the 

requirements for the admissibility of evidence in a trial could be described as follows:  

Whether the theory or technique has been or can be tested; whether the technique 

has been subjected to peer review or publications; whether the technique is 

generally accepted in the scientific community; [and] whether the technique has 

a known or potential error rate. (Ishihara, 2014, p. 25) 

Due to the influence of these criteria, there has been a tendency over the last decades 

towards the usage of quantitative methods in forensic linguistics, given that statistical 

analyses allow for a properly scientific presentation of results, which increases the 



forensic expert’s credibility in court (Grant, 2007). For that reason, the present 

investigation will rely on statistical procedures (see Chapter 4). 

Now that the three main branches into which forensic linguistics can be divided have 

been briefly introduced, each of them will be explained in depth in the following 

sub-sections. 

3.3.1. The written language of the law 

This section intends to describe the main features of the language of the law and the 

consequences derived from the difficulties that the average reader has in 

understanding it. Afterwards, the Plain English Movement will be presented and supported 

with some practical examples concerning jury instructions and police cautions. 

Even though the laws of a country apply to its entire population, the way in which 

they are written creates a distance between them and many citizens (Gibbons, 2003; 

Tiersma, 2009; Coulthard et al., 2010; Perkins & Grant, 2012; Correa, 2013). The irony 

behind this situation is the point of departure for the forensic linguist’s work, as Gibbons 

points out: 

[…] the Common Law presumes that “ignorance of the law is no defence.” If the 

law is presented in language that cannot be understood by the people to whom it 

applies, this presumption can lead to grave injustice as well as logical absurdity. 

This means that legal language should be intelligible to the audience for that 

language, including the people affected by it […]. Perfect understanding of the 

law and the justice may prove unachievable, but its pursuit is imperative. (2003, 

pp. 162-163) 

The most distinctive linguistic features of legal language will be schematically presented 

below with the purpose of illustrating the abovementioned difficulties that many citizens 

have to face when they are exposed to legal documents and, as a result, the need for 

linguistic expertise to build a bridge between the two. 

A) The use of extremely long sentences with complex syntactic structures, such as 

embedded sub-clauses, which are more complicated to process at a cognitive level 

(Gibbons, 2003; Alcaraz, 2005; Correa, 2013). Furthermore, legal contracts in 

English are usually characterized not only by the prominence of excessively long 



sentences, but also by a lack of punctuation (Coulthard et al. 2010; Perkins & 

Grant, 2012). 

B) The abundance of archaisms like herein and self-referential terms, as well as 

lexical items which are only accessible to a specialized audience, as is the case 

with contingency (Gibbons, 2003; Coulthard et al., 2010; Perkins & Grant, 2012; 

Correa, 2013). 

C) The profusion of passive constructions without an agent, which hinders the 

identification of the participants involved in the action to which the text refers. As 

a matter of fact, there is also a considerable number of expressions in texts of this 

nature that constitute a source of ambiguity for the identification of participants, 

such as the party of the third part (Gibbons, 2003). 

D) The frequency of impersonal sentences (Correa, 2013). 

E) The coexistence of binomial terms in the same text, as in the case of will and 

testament, whose simultaneous usage could generate in the reader the idea that 

they have dissimilar meanings (Perkins & Grant, 2012). 

F) The usage of polysemic words, such as enterprise (Alcaraz, 2005).  

G) The tendency towards nominalization (Gibbons, 2003; Correa, 2013). 

H) The presence of double negatives, some of them “including ‘hidden’ negatives 

such as unless, forbid and deny” (Gibbons, 2003, p. 171). 

I) The use of formulaic or stereotyped expressions with a vague meaning, as in the 

case of beyond reasonable doubt (Dumas, 2002; Alcaraz, 2005).  

With the purpose of exemplifying some of the features presented above, I will briefly 

analyse a sample taken from Article 84 of the Spanish Criminal Code: 

Si se hubiera tratado de un delito cometido sobre la mujer por quien sea o haya 

sido su cónyuge, o por quien esté o haya estado ligado a ella por una relación 

similar de afectividad, aun sin convivencia, o sobre los descendientes, 

ascendientes o hermanos por naturaleza, adopción o afinidad propios o del 

cónyuge o conviviente, o sobre los menores o personas con discapacidad 

necesitadas de especial protección que con él convivan o que se hallen sujetos a 

la potestad, tutela, curatela, acogimiento o guarda de hecho del cónyuge o 

conviviente, el pago de la multa a que se refiere la medida 2.ª del apartado anterior 

solamente podrá imponerse cuando conste acreditado que entre ellos no existen 



relaciones económicas derivadas de una relación conyugal, de convivencia o 

filiación, o de la existencia de una descendencia común. (2015, p. 32) 

As can be observed in the text, this article was drafted with little consideration for the 

reader, given the length of the sentence (136 words), its syntactic 

complexity, the use of specialized terms in legal Spanish which are not accessible to an 

average reader, such as curatela, and the presence of the self-referential expression la 

medida 2.ª del apartado anterior.  
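
Some of the surface features discussed in this section, such as sentence length, archaisms and hidden negatives, lend themselves to a rough automatic count. The following sketch shows one way in which this could be done for an English text. The function names, the sentence-splitting rule and the sample clause are hypothetical, and the word lists are illustrative: only herein, unless, forbid and deny are taken from the features listed above.

```python
import re

# Assumption: illustrative English word lists; only "herein", "unless",
# "forbid" and "deny" come from the features listed in this section.
ARCHAISMS = {"herein", "hereby", "thereof", "aforesaid"}
HIDDEN_NEGATIVES = {"unless", "forbid", "deny"}


def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    # Crude rule: abbreviations such as "p. 32" would need special handling.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def tokens(text):
    """Lower-cased runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def surface_features(text):
    """Count a few surface markers of legal language in a text."""
    sentences = split_sentences(text)
    words = set(tokens(text))
    return {
        "longest_sentence_words": max((len(tokens(s)) for s in sentences), default=0),
        "archaisms": sorted(words & ARCHAISMS),
        "hidden_negatives": sorted(words & HIDDEN_NEGATIVES),
    }


# Invented clause, used only to exercise the function.
clause = "The tenant shall not, unless herein permitted, deny access. Rent is due monthly."
print(surface_features(clause))
```

Counts of this kind capture only the most superficial markers; features such as agentless passives or nominalization would require syntactic analysis.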

If the legal features presented above and the problems that may derive from them are 

taken into consideration, it seems that there is an urgent need to democratize law, that is, 

to make it accessible to all the citizens to whom it applies. For this reason, the Plain 

English Movement has become one of the main areas of action for forensic 

linguists. Felsenfeld defined it as “the first effective effort to […] write legal 

documents, particularly those used by consumers, in a manner that can be understood not 

just by the legal technicians who draft them, but by the consumers who are bound by their 

terms” (1981, p. 408). Nevertheless, with the notable exceptions of Tiersma’s redraft of 

the Pattern Jury Instructions of California in 2005 and Gibbons’ improvement of the New 

South Wales Police Caution in 2001, the work conducted by forensic linguists in this area 

has had little acceptance (Coulthard, 2010). The main reasons underlying the necessity to 

reform jury instructions and police cautions will be presented below, as well as a depiction 

of real cases that illustrate the consequences derived from their linguistic flaws and a 

series of measures suggested by linguistic experts that could facilitate the understanding 

of these legal texts. 

Peter M. Tiersma has devoted considerable research to describing the linguistic barrier 

between jury instructions and the average citizen and to suggest ways in which these 

difficulties may be overcome. The author clarified the difference between the role of the 

jury and the judge by stating that “the jury’s function is to determine what happened, or 

the facts, as well as to reach a verdict. It has become the exclusive duty of judges to decide 

the rules of law that apply to those facts” (2009, p. 1). He further explained that the judge 

has to communicate to the jury these rules in the form of jury instructions. The word 

communicate is of paramount importance in this context, since, as Tiersma highlights, 

“communication […] requires not just that you speak or read to someone but also that the 

audience actually understand what you intended to communicate” (2009, p. 1).  



Dumas (2002) reported a case in which the misunderstanding of jury instructions led 

to the execution of Bruce Charles Jacobs in Texas. Jacobs was accused of stabbing to 

death a sixteen-year-old boy in his own house and police officers found a knife close to 

the boy’s residence without recognisable prints. A number of witnesses claimed to have 

seen Jacobs in the neighbourhood, for which he was arrested. As a result of the jury’s 

verdict, Jacobs was found guilty of capital murder and ultimately executed. Nevertheless, 

the defence counsel gathered a team of linguistic experts to discern whether jurors had 

fully understood their instructions, and they concluded not only that the jurors had 

misunderstood the term reasonable doubt, but also that the following legal conditions had 

not been properly explained to them: 

That Jacobs could be convicted of a lesser included offence (murder or burglary 

of a habitation); that jurors needed to find that Jacobs committed a felony offence 

of burglary as well as one of murder; […] that the word deliberately does not 

mean the same as intentionally; that the word probability means something more 

than a possibility; [and] that the terms criminal acts of violence, continuing threat, 

reasonable expectation and society have legal definitions that may be different 

from ordinary meanings. (2002, pp. 246-247) 

As underlined earlier, the experts determined that Jacobs’ punishment may have been 

imposed by the members of a jury that did not properly understand the instructions that 

they were given. This execution does not constitute an isolated case and, indeed, many 

forensic linguists have devoted extensive research to assess the comprehensibility of jury 

instructions in distinct communities. For instance, Levi had already described a similar 

situation in the previous decade when she acted as an expert witness in a case in which 

James P. Free, who was sentenced to death for murder, argued that “his constitutional 

rights had been violated by the fundamental inadequacies of the instructions given to the 

jury in the sentencing phase of his trial” (1993, p. 20). 

With the purpose of preventing communicative misunderstandings of this type and 

their consequences, Tiersma, who contributed to redrafting the Pattern Jury 

Instructions of California in 2005, suggests four maxims for drafting these legal 

documents effectively, which could be delineated as follows: 

A) “Identify the parties clearly and consistently” (2009, p. 14). To illustrate the way 

in which the parties of a case tend to be introduced in certain jury instructions, 



Tiersma presented a real sample in which the three people involved in a rape were 

described as a person, another person and yet another person, which may be 

confusing for jurors. For that reason, he suggests that the best way to avoid 

ambiguity in the identification of participants is to consistently use their names 

or, at least, a descriptive term like the defendant. 

B) “Use an example or illustration to clarify a difficult point” (2009, p. 15). 

According to Tiersma, this becomes particularly useful when the jury has to face 

abstract concepts. 

C) “Develop a clear ‘template’ for the elements of a crime or cause of action” (2009, 

p. 15). 

D) “Give the jurors clear guidance on how to go about their task” (2009, p. 16). The 

author insists on the idea that these instructions must include clear directions on 

how to reach a verdict and, additionally, how to fill out the verdict form.  

In brief, a poor elaboration of jury instructions may have negative consequences for those 

involved in the course of a trial, and thus the implementation of these measures in the 

process of creating them could become crucial for the correct functioning of the justice 

system. 

Making police cautions more comprehensible constitutes another relevant field of 

action for forensic linguists. Gibbons (2003) points out that every person who is arrested 

or is about to face a police investigative interview must be properly informed about their 

rights, although there might be certain variations in the naming and the content of these 

instructions among countries:  

In the USA these are generally referred to as “warnings” and the most widely used 

warnings are the “Miranda Warnings.” In most other Common Law countries, 

including England, Australia and Malaysia, they are known as “cautions”, and 

derive from an original English source which over time has evolved differently in 

these varied contexts. (2003, pp. 186-187) 

The main problem behind many of these cautions is that, since they tend to be written in 

legal language, some citizens may find it hard to comprehend their actual meaning and 

hence they fail in the achievement of their objective (Rock, 2007). Despite the fact that 

police officers are allowed to change the original words of the cautions if that helps the 

detainee to understand their meaning, many studies prove the ineffectiveness of these 



cautions and the need to adopt a series of measures to ensure the fulfilment of their goal 

(Rock, 2007; Perkins & Grant, 2012). Even though there has been little acceptance of the 

work carried out by linguists in the legal community, John Gibbons successfully 

contributed to redrafting the New South Wales Police Caution. The stages of this process 

were described by the author as follows (2003, p. 188): 

1) The police sent out the old versions and I and others suggested revisions in 

writing; 

2) The police produced a draft revised Code of Practice which was sent out again 

for comment; 

3) The police made some changes on the basis of the comments; 

4) The revised draft was discussed at a large meeting involving many interested 

parties at Police Headquarters; 

5) The police produced the final version of the Code of Practice without further 

consultation. 

Gibbons explains that the original forty-one cautions included in the Code of Practice 

were reduced to five in the final version that was elaborated after his contribution. With 

the aim of exemplifying the kind of revisions implemented, the transformation of Caution 

1, which was used at the initial stage of police investigative interviews, will be analysed. 

The original version was the following (2003, p. 189): 

I am going to ask you certain questions which will be recorded on a videotape 

recorder. You are not obliged to answer or do anything unless you wish to do so, 

but whatever you say or do will be recorded and may later be used in evidence. 

Do you understand that?  

Gibbons’ analysis of the text concluded that the syntactic subordination and coordination 

of the second sentence could be complicated to process at a cognitive level for some 

audiences because of the presence of two passive constructions without an agent and, 

especially, the fact that this caution forces the reader to deal with more than one concept 

at a time. In other words, the caution asks if the interviewee has understood the right to 

remain silent and the fact that s/he is being recorded at the same time, instead of asking 

about such concepts one by one. For those reasons, the suggested revision presented by 

Gibbons was as follows (2003, p. 191): 



I am going to ask you some questions. You do not have to answer if you do not 

want to. Do you understand that? 

We will record what you say. We can use this recording in court/against 

you/against you in court. Do you understand that? 

Gibbons states that the text presented above divides the caution into its two main issues, 

which facilitates its understanding. In addition, expressions such as are obliged and unless 

have been replaced by others which are more comprehensible (have to and if not, 

respectively).  
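
The effect of this revision on sentence length can be checked with a short sketch that counts the words in each sentence of both versions. It is not part of Gibbons' analysis; the function name sentence_lengths and the splitting rule are assumptions, and only one of the alternative endings of the revised text (against you in court) is used.

```python
import re


def sentence_lengths(text):
    """Number of words in each sentence, splitting after '.', '?' or '!'."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    return [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]


original = (
    "I am going to ask you certain questions which will be recorded on a "
    "videotape recorder. You are not obliged to answer or do anything unless "
    "you wish to do so, but whatever you say or do will be recorded and may "
    "later be used in evidence. Do you understand that?"
)
# Assumption: of the alternative endings in the revised text, only
# "against you in court" is used here.
revised = (
    "I am going to ask you some questions. You do not have to answer if you "
    "do not want to. Do you understand that? We will record what you say. "
    "We can use this recording against you in court. Do you understand that?"
)

for label, text in (("original", original), ("revised", revised)):
    lengths = sentence_lengths(text)
    print(label, lengths, round(sum(lengths) / len(lengths), 2))
```

Under these assumptions, the longest sentence falls from 31 words in the original caution to 12 in the revised one, which is consistent with the aim of presenting one concept at a time.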

The author regrets the fact that he was never given the chance to sit with the police and 

work collectively on the abovementioned changes. Nevertheless, it seems undeniable that 

this process constitutes a breakthrough for the forensic linguistic community and has set 

a precedent for future cooperation, even though Gibbons himself admits that it was merely 

consultative. 

This section of the thesis has highlighted the linguistic difficulties that many citizens 

find when they are exposed to the written language of the law, with a special focus on the 

consequences derived from the lack of understanding of jury instructions and police 

cautions. In addition, a series of work methods suggested by relevant scholars in the field 

to pave the way for the development of the Plain English Movement have been presented.  

3.3.2. The spoken language of the law 

The forensic linguistic analysis of the spoken language of the law covers “from the 

moment of arrest and the first communication of rights through police interview, 

interrogation and charge, to the announcement of the verdict at the end of the trial” 

(Coulthard et al., 2010, p. 534). As mentioned earlier, the line between the spoken and 

the written language of the law is often blurred, given that there are many written 

documents which are meant to be performed orally, as is the case with police cautions, 

which have been addressed in the previous section.  

A case that reflects the importance of the oral aspect of police cautions will be 

presented and discussed. Tiersma (1993) explains that in Rhode Island v. Innis (1980), 

the police identified a man suspected of killing a taxi driver with a firearm. At the moment 

of arrest, he was read the Miranda warning and he asked for a lawyer, which meant that 

the police officers were not allowed to interrogate him until the attorney was there. While 



they were taking the suspect to the police station, one of the officers told the other that 

there were many handicapped children in the area and added “God forbid one of them 

might find a weapon with shells and they might hurt themselves” (1993, p. 279). It was 

at that moment that the suspect, who was worried about the children, told the officers where 

the gun was hidden. 

The debate that took place at the Supreme Court was whether those police officers 

had interrogated the suspect or not, considering that interrogation does not only include 

direct questioning, but also any functional equivalent of it. In its ruling, the Court 

determined that the police officers had not questioned the suspect in any form. Nevertheless, 

Tiersma states that the utterance produced by the police officer “conveyed that something 

very bad might happen unless he provided the information” (1993, p. 280), so it was an 

indirect way of interrogating the suspect. In my view, the suspect was interrogated, as 

Tiersma argues. If the utterance is analysed from a pragmatic point of view, even though 

the locutionary act, that is, what was literally said, does not reflect a prototypical 

question, its illocutionary force, that is, the intention behind those words, was to obtain 

that information and, indeed, the perlocutionary effect was that the suspect told the 

officers where the gun was (see Austin, 1962).  

The area of forensic linguistics that specializes in the spoken language of the law 

is characterized not only by the analysis of the oral interactions that take place at the moment 

of arrest, but also by a major focus on those that occur during police investigative 

interviews. Valero-Garcés (2018) points out that these interviews tend to have four main 

objectives. Firstly, to discern if a crime has been committed and, if the answer is 

affirmative, to determine what the crime was; secondly, to discover evidence that leads 

to the identification of the subjects that committed such crime; thirdly, to generate 

evidence that prevents the criminal from mounting an inappropriate defence in court; and 

lastly, to find out whether the witnesses are portraying the facts accurately or if they are 

exaggerating or twisting them. Oxburgh et al. reviewed the research that has 

been conducted over the years on the types of questions which are formulated in police 

interviews and discussed how these could be categorized, as well as what types of 

questions could allow interviewees to express their ideas more effectively, as is the 

case with open questions, which can “generate free narratives and longer responses from 

witnesses compared with closed questions” (2010, p. 48).  



Regarding the psychological aspect of these interactions, Baldwin stated after the 

analysis of 600 samples of audio and video tapes of police interviews that, even though 

police officers often claim to apply complex psychological principles in their interviews, 

they tend to lack social skills, for instance when they repeatedly interrupt the interviewee 

or when they ask questions “in such quick-fire succession that suspects [are] not 

given the opportunity to put their versions of events coherently” (1993, p. 349). 

It must be borne in mind that the evidence which is presented in court after 

these interviews have been conducted is not a literal transcription of the dialogue that takes place 

between the police officer and the suspect or the witness, but a simplified report written 

by the officer that may contaminate the original narrative (Vázquez Maroño, 2014; 

Haworth, 2018). Haworth argued that this transformation of the interviewee’s speech into 

an official written document tends to have a negative impact on their defence in court: 

[…] the credibility of a witness can be destroyed by counsel highlighting 

differences between what is said in court, and what was (recorded as being) said 

at interview […]. The effects can be devastating, especially for defendants, and 

so the accuracy of interview records must be crucial. (2018, p. 428) 

Coulthard (1996) presents the Bentley case as an example of how police officers have the 

chance to manipulate the suspect’s words and facilitate their imprisonment. In the 1950s, 

Derek Bentley and Chris Craig were arrested by the police while they were trying to break 

into a warehouse. At the moment of arrest, Derek allegedly told Chris, who was in 

possession of a revolver, “Let him have it, Chris”, and immediately after those words were 

uttered, Chris fired the gun and killed a police officer. The interpretation of the jury was 

that Derek was indirectly asking Chris to shoot, for which he was sentenced to death. The 

most remarkable aspect of Derek’s defence was the fact that he continually emphasized 

that he had not uttered such words at all, which was received with great scepticism. 

However, a few decades later, the case of Paul Dandy in 1989 changed the public’s 

opinion about the credibility of police officers, given that it was irrefutably proved by 

Electro-Static Deposition Analysis that the officers who conducted his interview added a 

couple of incriminating sentences in its record some hours after they had drafted the rest 

of the document. In order to avoid manipulations of this kind, many of the interviews 

conducted today in some jurisdictions “are video-recorded and almost all of the rest are 

audio-taped using stereo tapes with a pre-recorded voice announcing the time at ten 

seconds intervals, in order to prevent subsequent editing” (1996, p. 122). 



Despite the many improvements in the field over the last decades, Haworth (2018) 

complains about the fact that physical evidence is carefully preserved to avoid the 

slightest contamination before it is presented in court, whereas the treatment given to 

interview data is still far from being equal. For that reason, she suggests a series of 

measures to ensure the preservation of the original evidence extracted from investigative 

interviews and, as a result, to avoid the miscommunication of the interviewee’s words 

and its legal implications: 

A) “All police recording equipment should be switched to digital rather than outdated 

audio cassette tapes, in order to ensure better data quality at source” (2018, p. 

446). 

B) Police officers should embrace the usage of video recordings to complement the 

verbal production of the interviewee. 

C) There should be a common code of practice for the transcription of interviews that 

specifies what to do in cases of pauses, overlaps, etcetera. The author further 

explains that “this would ensure consistency in production and interpretation, 

which would be especially beneficial at the courtroom evidence stage” (2018, p. 

446). 

D) Transcribers should be properly trained. This means that they should learn some 

basic notions about legal language, the main differences between its written and 

spoken format and what principles they should follow during the editing process.  

E) The people in charge of evaluating the interview as part of the evidence should 

not only take into consideration the official transcript, but also the original 

recording. 

F) “The practice of reading aloud the interview transcript in court should be 

abandoned”, given that “it adds a further unnecessary layer of distortion, 

confusion and corruption to the interview data” (2018, p. 447). As pointed out 

earlier, this practice is usually beneficial for the prosecution, since the credibility 

of the witness is put into question if there is any minor difference between what 

they state in court and what they stated during the interview. 

The cooperation between police forces and linguists can provide a route to improving the 

quality of investigative interviews, which would have a positive impact on the treatment 

received by suspects and witnesses and, by extension, on the legal system as a whole.  



There is another type of interrogation that stands as a major concern for forensic 

linguists, which is the one that takes place in court between a lawyer and a hostile witness. 

Gibbons (2005) explains that when an interaction of this kind takes place, the lawyers are 

in a significantly better position than the witnesses, given that they have the advantage of 

asking the questions and, as a result, they are in control of the direction of the conversation. 

The author further suggests that lawyers tend to pursue four main objectives during the 

trial, which will be presented below together with the persuasive techniques through 

which they can be achieved:  

A) To reinforce their own version of the facts. Gibbons states that this objective tends 

to be achieved in two different ways. Firstly, the attorney may formulate the 

questions and portray the facts in a way that does not allow the witness to provide 

an alternative narrative, for instance when they ask a yes/no question. Secondly, 

the lawyer might force the witness to accept their version of the story with certain 

linguistic practices, such as the use of assertions like “He came into the room”, 

instead of asking “Did he come into the room?” (2005, p. 196).4 The author refers 

to these techniques as “controlling the information” and “controlling the person”, 

respectively (2005, p. 194).5  

B) To increase the credibility of the witness who is on their side by strengthening 

their social status during the interview, which is supposed to have a positive 

impact on the reliability of their story. 

C) To challenge the veracity of the testimonies provided by the hostile witness and 

their defence. The most common way to accomplish this is by finding 

contradictions between the witness’ latest statement and the previous one(s), as 

has been previously explained in this section of the thesis. 

D) To cast doubt on the credibility of the hostile witness who has been interrogated. 

In this case, lawyers usually adopt the opposite strategy to the one they used with 

their own witnesses. In other words, they try to create in the judge and the rest of 

the audience the impression that the hostile witness “lacks intelligence, maturity, 

moral ethics, emotional control, the ability to reason and reliability” (2005, p. 

194).6 

 
4 My own translation. 
5 My own translation. 
6 My own translation. 



The expertise of the forensic linguist is also required in situations with vulnerable 

witnesses, as is the case with those who cannot speak the language of the court. 

Kredens (2016) explains that the role of the public service interpreter (PSI) could be 

categorised into three main domains, the first one being associated with the translation of 

others’ utterances and, by extension, pragmatic and socio-pragmatic equivalence 

problems, the second one with economic and political differences that make the 

interpreter a cultural mediator and, lastly, Kredens highlights the fact that “roles can arise 

spontaneously in any PSI setting; the interpreter can become […] a confidant, an expert 

witness […], an ally […], or even a messenger […], and this list is by no means extensive” 

(2016, p. 66). 

Children are another group of vulnerable witnesses involved in legal disputes who need 

to be protected. Coulthard (2010) explains that there have been some major improvements 

in the way in which information is elicited from them, and a representative example of 

this progress is the fact that some judges are giving children permission to video-record 

their testimony before the trial takes place or allowing them to communicate from 

somewhere outside the courtroom. 

This section has expounded the main scopes of action of forensic linguists in oral 

legal contexts, which are the communication of rights during the moment of arrest and in 

police investigative interviews, as well as the spoken interactions that take place in such 

scenarios and the courtroom. The analysis of the abovementioned interactions, the 

implementation of the measures suggested by linguistic experts and the recognition of 

their work could be crucial to protect certain witnesses and suspects and ensure a better 

functioning of the legal system.  

3.3.3. The linguist as an expert witness 

Forensic linguists can be called upon to testify in court as expert witnesses when language 

is involved in a case. There is a plethora of roles which have been played by linguists in 

this area and, due to space constraints, only a few will be illustrated before authorship 

attribution studies, which constitute the actual focus of the thesis, are presented and 

explained in more depth.  

Linguists can be required in court to analyse legal cases which are considered 

language crimes. Among the many types of language crimes that can be examined by 



forensic linguists, vilification and performative crimes will be exemplified and briefly 

discussed.  

According to Gibbons, vilification “may target a specific individual or corporation, in 

which case it is handled in common law as ‘defamation’: slander if it is non-permanent 

form or libel if it is in a recorded form” (2011, p. 236). Shuy acted as an expert witness 

in a famous case of libel in which Frank Celebrezze, a member of the Ohio Supreme 

Court, accused The Cleveland Plain Dealer of making him lose his reelection after they 

published articles suggesting that “Celebrezze had cast his judicial vote in two criminal 

cases in exchange for campaign contributors” (2010, p. 99).  

On the other hand, offering a bribe or threatening someone are examples of 

performative crimes. Regarding the latter, Fraser states that a successful threat expresses 

three concepts, namely “the intention to perform an act, the belief that the state 

of the world resulting from that act is unfavourable to the addressee [and] the intention to 

intimidate the addressee” (1998, p. 162). The author further explains that not all threats 

are illegal and provides examples for both situations by stating that, while threatening 

to tell someone’s current partner about an infidelity is legal, threatening to 

expose an infidelity to someone’s partner or even to the press unless a 

payment is made is illegal. As can be inferred from the previous example, the line 

between what is legal and what is not is sometimes blurred and the expertise of a linguist 

may be required for a correct application of justice.  

Legal disputes over trademarks constitute another relevant scope of action for 

forensic linguists. Shuy points out that disputes of this kind tend to begin in two different 

ways: “with charges of trademark infringement and with charges of unfair competition” 

(2002, p. 44). He further adds that, when one of these situations takes place, “trademark 

attorneys tend to use accountants and other experts to deal with representations of actual 

damage and linguists to address the issues of linguistic similarities and the ways that 

language use can give clue to intentions” (2002, p. 45). With the purpose of illustrating 

the nature of these cases, the contribution of Shuy in McDonald’s Corporation v. Quality 

Inns, International, Inc. (1988), will be expounded. The author, who was called upon by 

Quality Inns as an expert witness, depicted the origin of the conflict as follows: 

In the fall of 1987, a large hotel chain, Quality Inns International, made public its 

plan to create a new chain of inexpensive hotels to complement its other market 



brands. The name of this new hotel was to be McSleep Inns and they planned to 

open some 200 McSleep franchises within three years. Three days after this initial 

announcement, the McDonald’s corporation, the famous fast-food marketer, sent 

a letter to Quality Inns alleging trademark infringement and demanding that 

Quality Inns not use the proposed McSleep name. (2002, p. 95) 

Quality Inns argued that it was unlikely that people could associate McDonald’s with 

McSleep, given that they belong to different types of business. They asked Shuy to 

analyse the case from a linguistic perspective and he compiled a corpus of words 

including the prefix Mc- that were unrelated to McDonald’s and categorized their 

meaning according to the linguistic context in which they were found. Among those 

words there were proper names, acronyms, products manufactured by Macintosh, parodies 

of fast-food products and certain words that intended to mean something “basic, 

convenient, inexpensive and standardized” (2002, p. 99). The latter group was of great 

interest for the case, and two notable examples of this type of term were McLaw, which 

was used in The California Law Review to depict cheap and accessible legal services; and 

McArt, which could be found in Forbes to describe the massive marketing campaigns that 

characterized certain art stores. In other words, the defence of Quality Inns was built upon 

the idea that the prefix had grown into a generalized lexical item and therefore it could 

have a separate meaning from the one that was associated with McDonald’s in some 

communicative contexts.  

According to Shuy (2002), McDonald’s called on another linguist who stated that, as 

a theoretical linguist, he considered that the only way to determine the meaning of a word 

is by asking people directly what they think of it. In my view, while Shuy’s contribution 

was well built, the statements of the McDonald’s linguist could be seen as biased, since 

they deny the validity of the findings provided by well-established disciplines of applied 

linguistics such as pragmatics. Nevertheless, the judge decided that the prefix Mc- was 

not generic and that it was possible to associate McDonald’s with McSleep, which 

implied a legal defeat for Quality Inns International. 

Up until this point, this chapter has presented a definition of forensic linguistics, a 

summary of its historical development and an explanation of its main applications with 

the aim of offering a general introduction to the discipline, especially for those readers 

who are not familiar with it. Due to space limitations, some of these applications have 

only been briefly discussed or even omitted, for which I apologise. Even though authorship 



attribution studies constitute another scope of action of the linguist as an expert witness 

and hence they could have been included in this section, a separate one will be devoted 

to discussing their theoretical foundations and methods in more depth, given that they 

constitute the actual focus of this investigation. 

3.4. Authorship attribution studies 

This section intends to provide a description of the main principles of authorship 

attribution studies, as well as a critical review of relevant investigations in the field. 

According to Bozkurt et al., authorship attribution studies could be defined as follows: 

Authorship attribution (AA) is the process of attempting to identify the likely 

authorship of a given document, given a collection of documents whose 

authorship is known. Applications of authorship attribution include plagiarism 

detection (e.g. college essays), deducing the writer of inappropriate 

communications that were sent anonymously or under a pseudonym (e.g. 

threatening or harassing letters), as well as resolving historical questions of 

unclear or disputed authorship. (2007, p. 1) 

In other words, the goal of forensic authorship attribution is the identification of the 

author(s) of anonymous or disputed documents through the analysis of their linguistic 

features under the assumption that each speaker has an individual variety of their native 

language, which is known as their idiolect, and that “this idiolect will manifest itself 

through distinctive and idiosyncratic choices in texts” (Coulthard, 2004, p. 431). As 

pointed out by Turell, “[t]he linguistic production of individual speakers and writers can 

sometimes reveal information about an individual’s age, gender, occupation, education, 

religion […], political background […], geographical origin [or] ethnicity” (2010, p. 212).  

Even though the term idiolect will be consistently used in the thesis, it is worth 

mentioning that there is great controversy around this concept. Turell highlighted that its 

generalized usage may derive from an idealised vision of language, since “it could be 

argued that it is impossible to determine whether a given feature observed in a recording 

or a written text is idiolectal, dialectal, sociolectal, genderlectal, constrained by age 

factors, etc.” (2010, p. 216). She therefore added that “idiolects can only be determined 

with countless amounts of data from each individual, something which never happens 

when dealing with real forensic linguistic data” (2010, p. 217), and suggested the use of 

the notion idiolectal style in forensic authorship contexts. This could be defined as “the 



set of options that writers take from the linguistic repertoire available to them as users of 

a specific language” (2010, p. 217), that is, the distinctive way in which an individual 

applies a linguistic system shared by many. 

Studies in forensic authorship contexts may concern several types of data, and hence 

Coulthard et al. (2010) make a distinction between single text and comparative authorship 

problems, or in other words, cases involving an open and a closed set of suspects, 

respectively.  

The authors explain that “a single text problem occurs where comparison texts are 

unavailable or where an investigation is not yet narrowly focused on a small pool of 

suspects” (2010, p. 536). They further suggest that these cases usually involve a set of 

texts that can be unified and analysed as a single document and that the forensic linguist 

is then expected to provide information about the author of such text by classifying its 

idiolectal features, as well as to clarify the possible meaning of ambiguous utterances.  

In contrast, studies in this area may involve a disputed text whose linguistic features 

have to be compared with the idiolect of a suspect or a set of suspects in order to determine 

its likeliest authorship. Coulthard et al. (2010) explain that a considerable proportion of 

the cases in which the expertise of a linguist is required are of this nature. Indeed, the 

research conducted in the present thesis could be classified into this category, given that 

its main goal is to define Shakespeare’s and Marlowe’s idiolect through a linguistic 

analysis of their undisputed works to discern the likeliest authorship of a disputed play. 

Queralt (2014) points out that before any type of investigation takes place, the forensic 

linguist must decide if the analysis of a certain text will constitute a proper linguistic case 

or not depending on its length and its quality. She explains that even though the academic 

community has not reached an agreement on the minimum length that is 

necessary to carry out a reliable linguistic analysis, qualitative studies can be conducted 

with shorter texts of around 150 words, whereas quantitative studies tend to require larger 

samples, since they generally imply the usage of computer programs to find linguistic 

patterns. On the other hand, the quality of a text is related to whether the sample contains 

enough linguistic features to reflect the idiolect of its author. Regarding this issue, 

Kredens (personal communication, February 17, 2019) defends the idea that the genre of 

a text has a major impact on the number of idiolectal features that it includes. This means 



that, for instance, it would be almost impossible to determine the authorship of a shopping 

list for obvious reasons. 

Once an explanation of the most basic concepts on which this discipline is built has 

been provided, the main applications into which authorship attribution studies can be 

divided will be addressed, with a special focus on historical questions of disputed 

authorship in general, and that of Arden of Faversham in particular.  

3.4.1. Attribution of authorship in cases of plagiarism 

The notion of plagiarism will be expounded together with the role of the forensic linguist 

in this area and the tools that have been created to facilitate their work. Olsson states that 

plagiarists act in three distinct ways (2008, p. 108): 

A) Archaeological plagiarists —the most common type— take an artefact and try 

to disguise its surface by substituting some of its parts and by re-arranging 

others. 

B) Diachronic plagiarists take an artefact from an earlier period and try to 

disguise its chronicity by translating it into an artefact of their own time. 

C) Cultural plagiarists transpose elements of their own culture onto a cultural 

artefact of another culture or, alternatively, try to take cultural artefacts from 

elsewhere and convert them into own culture substitutes. 

Barrón-Cedeño et al. (2014) suggest that a text can be plagiarised in four different ways, 

according to the categories delineated by Martin to describe the nature of this crime 

(2004). Firstly, someone can plagiarise another individual’s ideas or theories without giving 

them due recognition. Secondly, a section of a text or even a whole text can be copied 

word by word or with slight modifications. Thirdly, the sources of a text can be 

plagiarised when an author mentions those presented by another author without clarifying 

that these sources were extracted from his/her work. Finally, they mention authorship 

attribution issues, that is, when someone claims to have written a text that was in fact 

produced by someone else.  

In addition to these modalities of plagiarism, Sousa-Silva uses the term translingual 

plagiarism to refer to those cases in which “plagiarists lift the text from one language, 

have it translated to another language, and subsequently use it as their own” (2014, p. 72). 

As the author further explains, the identification and demonstration of translingual 



plagiarism is particularly complicated, since the resulting texts have no apparent 

similarities with the originals. 

The approaches adopted for the detection of plagiarism could be divided into external 

and intrinsic (Potthast et al., 2009, as cited in Sousa-Silva, 2013). Whereas the first one 

is oriented to the comparison of the suspicious text with a corpus of original manuscripts 

to find linguistic similarities, the latter “exclusively analyses the input document, i.e., 

does not perform comparisons to documents in a reference collection” (Foltýnek et al., 

2019, p. 10). The purpose of the intrinsic approach is to find stylistic inconsistencies that 

might reflect an attempt at plagiarising an external source, which would require a further 

external analysis. 

A well-known case of plagiarism in the Spanish academic community took place when 

the Rector of the Universidad Rey Juan Carlos, Fernando Suárez, was found guilty of 

making a literal transcription of a text produced by the former Dean of the Faculty of Law 

at the Universitat de Barcelona, Miguel Ángel Aparicio. When the news broke, professors 

and researchers from different universities signed a petition asking for the resignation of 

Fernando Suárez, who brought forward the elections to appoint a new Rector.7 This illustrates 

the frequency with which plagiarism occurs in academic settings and the need for linguistic 

expertise to protect the intellectual property rights of other scholars. To this end, an 

increasing number of computer programs have been developed to prevent plagiarism, as 

is the case with Turnitin, Unicheck and Urkund. Nevertheless, these programs are only 

meant to assist linguists by giving them sufficient proof to discern whether plagiarism has 

been committed or not. The final decision, as well as the legal measures that should be 

adopted, must be determined by the forensic expert (Barrón-Cedeño et al., 2014). 

3.4.2. Attribution of authorship of criminal texts with an open set of suspects 

Forensic linguists can be called upon to study the authorship of anonymous texts that 

cannot be associated with any possible suspect. In such cases, the forensic expert is 

expected to draw up a profile of the author based on their idiolectal features, which can 

provide crucial information for the development of the investigation (Coulthard et al., 

2010). One of the most well-known cases in which the analysis of the idiolectal features 

of an anonymous text was required to convict a terrorist is that of the Unabomber. 

According to Coulthard and Johnson (2007), between the years 1978 and 1995, an 

 
7 https://www.elmundo.es/madrid/2017/02/03/5894c721e2704e80678b4615.html 




American citizen who was later known as the Unabomber sent bombs to people working 

at universities and airlines through the post. In 1995, he sent a manuscript of 35,000 words 

called Industrial Society and its Future to six national journals and offered to stop sending 

bombs if his manuscript was released to the public. The Washington Post agreed to 

publish the document and, a few months later, a man contacted the FBI claiming that the 

text contained a series of expressions that were commonly used by his brother, who had 

not been in touch with him for more than ten years. He put a special emphasis on the fact 

that his brother used to repeat the expression cool-headed logician, which appeared in the 

manuscript and is a distinctive idiolectal feature that had a major impact on the subsequent 

analysis. When the FBI finally discovered where his brother was and arrested him, they 

found a 300-word document that he had written more than a decade earlier, and its linguistic 

analysis revealed that it presented a high degree of resemblance with the 35,000-word 

manifesto, which was the ultimate proof of his guilt. 

The anthrax case also stands as one of the best-known investigations that can be 

associated with the forensic linguistic analysis of criminal documents with an open set of 

suspects. Olsson (2004, 2008) reports that, after the attack on the Twin Towers on 

September 11, 2001, certain public figures received envelopes which were, allegedly, 

letters written by schoolchildren. Nevertheless, these documents contained anthrax, a 

lethal biological agent which caused the death of five people and sickened another seventeen, 

according to the FBI’s official website.8 Olsson (2004, 2008) explains that the American 

authorities linked this terrorist attack to Al-Qaida and tried to discern if the messages 

contained in the abovementioned envelopes had been written by an English or an Arabic 

native speaker. The author offers a transcription of the message contained in the envelope 

that was sent to Senator Daschle: “09-11-01. You can not stop us. We have this anthrax. 

You die now. Are you afraid? Death to America. Death to Israel. Allah is great” (2004, 

p. 104). He states that the style is considerably similar to that of the letter that was sent to 

Tom Brokaw: “09-11-01. This is next. Take penacilin now. Death to America. Death to 

Israel. Allah is great” (2004, p. 104). Olsson inferred the following conclusions after the 

study of the samples: 

Note the terseness of the style. It is far from easy for a learner of English to use 

the language in this concise, precise way. Moreover, it is probably indicative of 

 
8 https://www.fbi.gov/history/famous-cases/amerithrax-or-anthrax-investigation 




someone with a good education and —paradoxically— someone who is used to 

doing a lot of writing. The misspelling ‘penacilin’ and the pseudo-pidgin style 

‘You die now’ are probably just red herrings and should be ignored. (2004, pp. 

104-105) 

The FBI states on its website that an exhaustive review of the case led to the 

incrimination of Dr. Bruce Ivins, a researcher at the United States Army Medical Research 

Institute of Infectious Diseases (USAMRIID), who killed himself before charges could 

be filed. 

As has been reflected in the two cases presented above, the role of the forensic linguist 

in drawing a profile of the possible author of an anonymous criminal text may be crucial 

to narrow down the scope of a police investigation, since there are certain idiolectal 

features that can provide information about the individual’s gender, age, native language 

or educational background, among other details. The following section will address the 

role of the linguist in those cases in which the criminal text can be associated with a series 

of possible authors. 

3.4.3. Attribution of authorship of criminal texts with a closed set of suspects 

In many of the cases in which forensic experts are required to analyse the authorship of a 

criminal text, there is already a list of possible authors and, as a result, the linguist is 

expected to determine with which of the idiolects of the candidates the disputed text has 

a higher degree of resemblance (Coulthard et al., 2010). The protocols followed in these 

cases involving a closed set of suspects can be illustrated by describing that of Dulceliz 

Díaz, who had allegedly killed her 5-year-old daughter and committed suicide in 2007.  

James R. Fitzgerald, a former FBI agent who played a crucial role in the case, 

explains that the forensic linguistic analysis of suicide notes often intends to discern if the 

letter was indeed written by the victim or if it was written by someone else in an 

attempt to cover up a murder (2014). Therefore, he states that suicide notes should 

always be compared with undisputed texts of the victim and, at the same time, with texts 

produced by relatives or acquaintances that may be seen as potential suspects.  

According to the author, an alleged suicide letter was sent to three members of 

Dulceliz’s family through an email account that she shared with her former boyfriend, 

Alberto Pérez, who was also the father of her daughter. The email that was sent from 



Díaz’s account (see Appendix 1) was considered the disputed document that needed to be 

compared with undisputed texts produced by Díaz herself and the main suspect, Alberto 

Pérez. These undisputed texts were other emails, blog entries and forum posts.  

The linguistic analysis of the disputed email showed that it was highly probable that 

Pérez wrote it and, at the same time, that it was unlikely that Díaz had written it. The email 

contains the abbreviation gonna, which was used multiple times by Pérez in his 

undisputed texts, whereas there was a preference for the form gunna in Díaz’s samples. 

The expression peace out, which appears in the email, was also found in many of the suspect’s online 

posts and in none of Díaz’s. Similarly, there is an ellipsis in the email, which is something 

Díaz only wrote once in her 438 undisputed texts, while it appeared 119 times in Pérez’s 

393 reference samples. Agent Fitzgerald was called upon as an expert witness during the 

trial and states that this forensic linguistic analysis had a major impact on the final verdict, 

in which Alberto Pérez was sentenced to death for a double homicide. 

In sum, the attribution of authorship of a criminal text with a closed set of suspects is 

based on the classification of the idiolectal features of every suspect for further 

comparison with those of the disputed document.  

3.4.4. Attribution of authorship of historical texts 

Forensic experts may be asked to examine the authorship of literary and other types of 

historical texts by tracing the idiolectal features of the possible authors through a linguistic 

analysis of their undisputed works for further comparison with the features of the disputed 

sample. 

Although McMenamin listed the most salient authorship tests of his time almost three 

decades ago (1993), there is a bewildering variety of methods that have been used to study 

the authorship of historical texts in recent years owing to the emergence of new 

technologies and the possibilities that they offer, and thus this section will only discuss 

those procedures that will be taken into consideration for the present research or have 

been used by other scholars to analyse Arden of Faversham.  

Despite this, I would like to mention Canter and Chester’s investigation proving the 

lack of reliability of the Cusum technique (1997), which was proposed by Morton and 

Michaelson (1990) to discriminate between texts written by one author and collaborative 

texts; Larner’s research on the usage of formulaic expressions as authorship 



discriminators, even though he did not reach conclusive results (2014); and, more 

importantly, Grant’s study to identify effective authorship markers combining 

discriminant function analysis with Bayesian likelihood measures, where he argued for 

the importance of designing a method that leads to no cases of misattribution, even if its 

success rate is not as high as those of other methods that do have the potential to misattribute 

samples (2007). 

The first attempt at making a statistical description of an author’s literary style to 

prove that idiolectal features could be quantified for further authorship attribution studies 

was that of Mendenhall (1887), which was based on the calculation of the average number 

of letters per word (see Section 3.2). This work set a precedent and there have been many 

subsequent studies that have analysed the length and/or the frequency of words and other 

linguistic items for the same purpose. This approach can be observed in the research 

conducted by Moerk (1973), who focused on the samples provided by thirty American 

college students who were asked to write freely a short story that began with this sentence: 

“He (She — according to the sex of the subject) stood at the window, clasped his (her) 

hands behind his (her) back and stared out into the night” (1973, p. 51). According to the 

author, “this one sentence induces nearly all writers to adhere to an area of content 

concerning personal problems, social interactions, feelings and memories, so that content 

per se should produce no or minimal differences in style” (1973, p. 51). 

With the purpose of documenting a statistical description of the style of these literary 

texts, Moerk quantified the frequency of certain types of words according to their length, 

grammatical category or syntactic function, as well as the average number of words per 

sentence of the texts and other variables. Among these, the average number of words per 

sentence of a text, which can be calculated by dividing its total number of words by the

number of sentences that it has, seems like a potentially distinctive idiolectal feature,

given that it can reflect a preference for certain types of syntactic structures. In other words,

authors who present a low average number of words per sentence may have a

tendency towards the usage of simple sentences, whereas those whose average number of 

words per sentence is higher may prefer more complex syntactic constructions. For that 

reason, the calculation of this parameter will be considered for the present study and 

programmed as one of the tasks that can be carried out by the software ALTXA (see 

Section 4.5.2). 
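
The calculation described above could be sketched in Python as follows. This is only an illustrative sketch and not the ALTXA implementation: the function name, the rule used to delimit sentences (full stops, question marks and exclamation marks) and the rule used to delimit words (runs of letters, optionally joined by an apostrophe or a hyphen) are assumptions made for the example.

```python
import re


def average_words_per_sentence(text: str) -> float:
    """Total number of words divided by the number of sentences."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters, optionally joined by
    # an apostrophe or a hyphen (e.g. "o'er", "well-known").
    words = re.findall(r"[A-Za-z]+(?:['’-][A-Za-z]+)*", text)
    if not sentences:
        return 0.0
    return len(words) / len(sentences)


sample = "He looked at her. She seemed concerned."
print(average_words_per_sentence(sample))  # 7 words / 2 sentences
```

Early modern punctuation varies between editions, so the sentence-splitting rule would need to be adapted to the edition of the plays under study.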


Another simple but effective procedure in studies of this nature consists in the 

calculation of the relative frequency of a list of chosen keywords within a disputed sample 

for further comparison with the relative frequency that they have in the reference texts of 

the possible authors. The percentage of the relative frequency of a word in a text can be 

obtained by dividing the number of times that the word appears in such text by its total 

number of words and multiplying that result by a hundred. Thomas Merriam (1996) 

compared a Shakespearean corpus formed by the 36 plays that appear in the First Folio

with a corpus that included 7 plays of Christopher Marlowe in terms of the relative 

frequency of a series of words that he had delineated as idiolectal markers of the latter 

due to their prominent presence in his play Tamburlaine the Great. When these two 

reference corpora were compared, the keywords that he had selected presented a 

considerably higher relative frequency in the corpus of Marlowe, which proved the 

reliability of the method.  

Afterwards, the author calculated the frequency of those Marlowian keywords in each 

of the plays that formed the two reference corpora individually and noticed that their 

frequency in Henry VI, Part I, which had traditionally been attributed to Shakespeare alone,

was similar to the one that they had in the plays written by Marlowe, whereas their values 

in the rest of the Shakespearean texts were much lower. This stands as a remarkable result, 

given that this play was attributed to both authors as a collaborative text years later (see 

Section 2.2), which would explain the frequency of Marlowian keywords in the text. For this

reason, the quantification of the relative frequency of a set of keywords selected by the

researcher will be considered in this thesis and introduced as one of the

functionalities of the software ALTXA (see Section 4.5.1). 
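
The percentage of relative frequency defined above could be computed as in the following Python sketch. The function names and the tokenisation rule (lower-cased runs of letters) are assumptions introduced for illustration and do not describe how ALTXA or Merriam’s study handle the texts.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: case is ignored and a word is a run of letters,
    # optionally joined by an apostrophe or a hyphen.
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())


def relative_frequency(text: str, keyword: str) -> float:
    """Occurrences of `keyword` divided by the total number of words, times 100."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 100 * tokens.count(keyword.lower()) / len(tokens)


def keyword_profile(text: str, keywords: list[str]) -> dict[str, float]:
    """Relative frequency (%) of every keyword in a list chosen by the researcher."""
    return {k: relative_frequency(text, k) for k in keywords}


print(keyword_profile("The king and the queen", ["the", "king", "sword"]))
```

The same profile can be computed for the disputed sample and for each reference corpus, so that the values are compared keyword by keyword.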

A more complex methodology for the attribution of authorship of historical 

documents is based on the quantification of their percentage of lexical richness, which 

can be obtained by dividing the number of distinct words that a text contains, also known 

as its types, by its total number of words, also known as its tokens, and multiplying the 

resulting number by a hundred. Baker (1988) conducted a study where he compared the 

lexical richness⁹ of the plays and poems written by the two playwrights that constitute the

focus of this doctoral thesis, William Shakespeare and Christopher Marlowe, to discern 

if the results derived from the calculation of this parameter presented enough intra-author 

 
9 Baker referred to this as vocabulary richness.


consistency and inter-author variation, that is, if the Shakespearean values were similar 

among themselves and sufficiently different from the Marlowian ones, which were also 

expected to be consistent (see Turell, 2010 for a more detailed insight into the notions of 

intra-author consistency and inter-author variation in idiolectal studies). His analysis 

revealed that the lexical richness of the Shakespearean plays remained relatively stable, 

whereas that of the works of Marlowe presented more fluctuations, and Baker therefore 

suggested that Christopher Marlowe was able to adopt more registers than the Bard. In 

addition, the results showed that, despite the fluctuations, Marlowe could provide his texts 

with more lexical richness than Shakespeare.  

It is obvious that short texts are more likely to present a high lexical richness, given 

that the chances of repeating words are lower. Despite the fact that Baker suggests that 

this parameter is not dependent on the length of the texts unless there are overwhelming 

differences among them, I would argue that it is not rigorous to compare works of different

lengths on this measure. Nevertheless, the fact that Baker’s study associated many of the

works of Shakespeare and Marlowe among themselves using this discriminator seems 

like a solid reason to consider its usage in the present thesis, although it needs to be 

applied differently to avoid the inconsistent results that may derive from the comparison 

of samples of dissimilar lengths, as will be developed in the following chapter. Therefore, 

the quantification of this parameter will be programmed as one of the tasks that can be 

carried out by ALTXA (see Section 4.5.3). 
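
The percentage of lexical richness (types divided by tokens, multiplied by a hundred) and its sensitivity to the length of the sample could be illustrated with the following Python sketch. The function names and the tokenisation rule are assumptions made for the example; the sketch does not reproduce the adjusted procedure that this thesis develops later.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: case is ignored and a word is a run of letters,
    # optionally joined by an apostrophe or a hyphen.
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())


def lexical_richness(text: str) -> float:
    """Number of distinct words (types) over total words (tokens), times 100."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 100 * len(set(tokens)) / len(tokens)


short_text = "the cat saw the dog"
long_text = " ".join([short_text] * 2)  # same vocabulary, twice the length

print(lexical_richness(short_text))  # 4 types / 5 tokens
print(lexical_richness(long_text))   # 4 types / 10 tokens
```

In this toy case the vocabulary is identical in both samples, yet the longer one obtains half the value, which is the length effect discussed above.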

The main problem behind the procedures that have been described so far is their 

inaccuracy when they are used to analyse the authorship of short samples, which stands 

as one of the most complicated tasks within the disciplinary framework of forensic 

linguistics, since it is harder to identify quantifiable idiolectal features in them

(Queralt, 2014).  

Nevertheless, n-gram tracing, which constitutes a more modern method than those 

that have been previously described in this section, is known for its effectiveness in the 

attribution of authorship of short texts (Grieve et al., 2018). What n-grams are and how 

this method can be used for forensic linguistic purposes could be delineated as follows: 

 [A]n n-gram is defined as a sequence of one or more linguistic forms (e.g. 1-

grams or 2-grams) at any level of linguistic analysis (e.g. words or characters) 

[…]. The basic idea behind n-gram tracing is to calculate the percentage of n-


grams that occur in a questioned document that also occur at least once in a 

possible author writing sample. This process is repeated for each possible author 

and the text is then attributed to the possible author whose writing sample contains 

the highest percentage of the n-grams from the questioned document. (Grieve et 

al., 2018, p. 6) 

This relates to the conventional depiction of n-grams as combinations of consecutive 

characters or words that occur within the same sentence (see also Cheng et al., 2006;

Ishihara, 2014). For instance, if the sample He looked at her. She seemed concerned. is 

analysed from this perspective, the word 2-grams of this short text would be He looked, 

looked at, at her, She seemed and seemed concerned, whereas her She would not 

constitute a word 2-gram, given that these words belong to distinct sentences. 
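
Both the extraction of word n-grams that do not cross sentence boundaries and the percentage on which n-gram tracing relies could be sketched in Python as follows. The function names are hypothetical, and two choices are assumptions of this sketch rather than details taken from Grieve et al. (2018): case is ignored during tracing, and each distinct n-gram of the questioned document is counted once.

```python
import re


def sentence_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Word n-grams formed only by consecutive words of the same sentence."""
    grams = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z]+(?:['’-][A-Za-z]+)*", sentence)
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams


def traced_percentage(questioned: str, reference: str, n: int) -> float:
    """Percentage of n-grams of the questioned text found at least once in the reference."""
    q = set(sentence_ngrams(questioned.lower(), n))
    r = set(sentence_ngrams(reference.lower(), n))
    if not q:
        return 0.0
    return 100 * len(q & r) / len(q)


def attribute(questioned: str, candidates: dict[str, str], n: int) -> str:
    """Name of the candidate whose sample contains the highest percentage."""
    return max(candidates, key=lambda name: traced_percentage(questioned, candidates[name], n))


print(sentence_ngrams("He looked at her. She seemed concerned.", 2))
```

With the sample sentence used above, the function returns the five word 2-grams listed in the text and leaves out the pair formed by her and She.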

Grieve et al. (2018) used n-gram tracing to carry out an authorship analysis of the 

Bixby Letter, which is known to be a short message of 139 words in which Abraham 

Lincoln gave his condolences to a widow called Lydia Bixby after the loss of her five 

sons in the American Civil War. According to the authors, this piece of correspondence 

allegedly written by Abraham Lincoln has raised substantial debate among linguists, 

given that some historians claim that it was written by John Hay, who was his personal 

secretary. For that reason, they compiled a series of Hay’s undisputed written documents 

for his reference corpus and, to compile the reference corpus of Lincoln, they selected a 

group of texts that he wrote before he hired John Hay, in case Hay himself had

written other samples that have been traditionally attributed to Lincoln. Once these 

samples were gathered, they analysed the character and the word n-grams that the letter 

shared with the undisputed corpora of the two candidates and determined that John Hay 

was its likeliest author. It is worth mentioning that the authors had previously conducted 

a pre-study in which they analysed undisputed texts of Lincoln and Hay as if they were 

disputed to assess the reliability of n-gram tracing. Conducting a case study with

methods that have already been tested in a pre-study reflects an approach that has been

adopted for this doctoral thesis (see Chapter 4).  

Other investigations that have proved the effectiveness of n-gram tracing in the 

attribution of authorship of small samples are those of Wright (2017) and Cicres and 

Queralt (2019). Wright worked with a set of emails extracted from the Enron Email 

Corpus and correctly attributed the authorship of most of the samples among the 176 

possible authors. The success rate of the studies was especially high when these traced 


shared word n-grams of between two and six words. On the other hand, Cicres and Queralt 

analysed texts produced by a group of schoolchildren in Catalan and concluded that word 

3-grams and 2-grams could effectively classify the samples in terms of the age of the 

authors, which ranged from 6 to 11 years.

Given that the scenes of the play Arden of Faversham will be analysed as independent 

texts for the present investigation, the selection of a method that seems to be effective for 

the attribution of authorship of short samples is crucial. For this reason, n-gram tracing will

be considered and programmed as one of the functionalities of ALTXA

(see Section 4.5.4). 

As a matter of fact, n-gram tracing has already been used to analyse the play Arden 

of Faversham, since Taylor (2019) studied the authorship of the first 274 words of the 

tenth scene of the play, that is, Scene IV.i, using this method. The author states that he 

selected this excerpt because “(1) it owes nothing to the narrative sources of the play, and 

(2) it begins a long stretch of text that recent investigators […] agree was not written by 

Shakespeare” (2019, p. 859). It seems that the selection of such a specific portion of the 

text could be perceived as arbitrary, since analysing at least a complete scene of the play 

appears to be more rigorous than making an artificial cut of the manuscript. 

Taylor decided to consider 15 possible candidates for the attribution of authorship of 

the text. These were Munday, Greene, Nashe, Lodge, Shakespeare, Marlowe, Peele, Lyly, 

Kyd, Drayton, Wilson, Achelley, Chettle, Hathway and Thomas Watson, who was 

identified as the likeliest author of the text at the end of the study. To compile the

reference corpora of the abovementioned authors, he decided to include dramatic and 

non-dramatic texts that were written between 1585 and 1594, although he further stated 

that “for some candidates, however, it has been necessary to extend the date range” (2019, 

p. 857), given the lack of undisputed works of authors like Achelley and Chettle from that 

period. Therefore, the approach adopted by Taylor was based on the hypothesis that the 

reference corpora of the candidates should be compiled with texts that belong to distinct 

literary genres, which is contrary to the idea on which the present thesis is based.

The author explains that he identified the n-grams of two or more consecutive words 

that the disputed sample shared with the reference corpora of the possible candidates, 

which stands as a traditional approach towards n-gram tracing, but that also “searches 

were made for every collocation of two or more semantic words […] ten words before or 


after each other” (2019, p. 859), excluding the function words among them, which is a 

less conventional way of applying this method.  

Taylor concluded that, even though the disputed sample presented more unique 

matches with the Shakespearean reference corpus, the ratio of unique matches per word 

was higher for the corpus of Thomas Watson, and he therefore attributed its authorship to

him. There were 14 unique n-grams in common between the 274-word sample and the 

30,397 words that formed Watson’s corpus.  
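
The ratio of unique matches per word mentioned above can be reproduced from the two figures reported for Watson. In the Python sketch below, the function name and the expression of the ratio per 1,000 words are assumptions made for readability; no figures for the other candidates are included because they are not given here.

```python
def matches_per_1000_words(unique_matches: int, corpus_words: int) -> float:
    """Unique matches normalised by the size of the reference corpus."""
    # Assumption: the ratio is scaled to 1,000 words for readability.
    return 1000 * unique_matches / corpus_words


# Figures reported for Watson: 14 unique n-grams, 30,397-word corpus.
print(round(matches_per_1000_words(14, 30_397), 3))
```

Normalising by corpus size explains how a candidate with fewer raw matches can still obtain the highest ratio when his reference corpus is much smaller.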

The main differences between Taylor’s approach and that of the present investigation 

lie in the fact that this doctoral thesis intends to analyse all the scenes of Arden of 

Faversham, which could be considered as natural divisions of the play, whereas Taylor 

studied the authorship of a fragment that was artificially selected, as well as in the criteria for

the compilation of the reference corpora of the possible candidates. This research is based 

on the hypothesis that an author’s idiolect is dynamic and hence the inclusion of plays 

from dissimilar periods and genres in a reference corpus will diminish the effectiveness 

of the study, which is an issue that will be addressed in detail in Chapter 4, where the 

methodological foundations of the thesis will be expounded. 

The most advanced procedure that has been selected for this doctoral thesis is the 

variant of the Zeta test suggested by Craig and Kinney (2009) to analyse the authorship 

of Arden of Faversham and other Elizabethan plays, which could be explained as follows. 

The first step consists in the compilation of a reference corpus for each of the two 

candidates (or groups of candidates) of the study. Once these corpora have been compiled, 

the texts that they contain must be divided into fragments of 2,000 words and the residual

words at the end of each text must be combined with its last fragment. The disputed

sample should also be divided following the same criteria.
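
The segmentation described in this first step could be sketched in Python as follows. The function name is hypothetical, and the treatment of texts shorter than one fragment (kept as a single fragment) is an assumption of the sketch, since that case is not covered by the description above.

```python
def split_into_fragments(words: list[str], size: int = 2000) -> list[list[str]]:
    """Consecutive fragments of `size` words; residual words join the last fragment."""
    if not words:
        return []
    if len(words) < size:
        # Assumption: a text shorter than one fragment is kept whole.
        return [list(words)]
    n_full = len(words) // size
    fragments = [words[i * size:(i + 1) * size] for i in range(n_full)]
    # The words left over at the end of the text are added to the last fragment.
    fragments[-1].extend(words[n_full * size:])
    return fragments


text_words = ["w"] * 4500
print([len(f) for f in split_into_fragments(text_words)])
```

A text of 4,500 words therefore yields one fragment of 2,000 words and a final fragment of 2,500 words.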

The second step is to obtain a list of 500 words that are characteristic of each candidate 

(or group of candidates) not only by their prominence in that corpus, but also by their low 

frequency or lack of appearance in the corpus of the other candidate(s). The formula to 

obtain each of these 500 markers for both corpora is the following. The researcher is 

expected to identify how many fragments of 2,000 words (or more, in the case of those 

that include the last words of a text) of the corpus of the first candidate(s) contain a given 

word and how many fragments of the corpus of the second candidate(s) do not contain 

that word, regardless of how many times it appears in each fragment. If the proportion of 


fragments of the first candidate(s) that contain that specific word is transformed into a 

number from 0 to 1 and added to another number from 0 to 1 that stands as the proportion

of fragments of the second candidate(s) that do not contain the word and the result is 

higher than 1, this word can become a marker of the first candidate(s). The 500 words 

with the highest scores that are superior to 1 following this procedure will be considered 

the markers of the first candidate(s). Afterwards, a list of 500 markers for the second 

candidate(s) must be obtained with the opposite procedure. With the purpose of filling 

these lists with distinctive lexical items, most of the function words and certain lexical 

words that are so related to the context of the play where they appear that they do not 

reflect an authorial pattern are not considered for their elaboration, that is, they are 

ignored during the mathematical process described in this paragraph.  
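
The scoring procedure described in this second step could be sketched in Python as follows. The function names, the `stop_words` parameter and the alphabetical tie-breaking rule are assumptions introduced for the example; fragments are represented as lists of words.

```python
def zeta_scores(base_fragments, counter_fragments, stop_words=frozenset()):
    """Score = proportion of base fragments containing a word
    + proportion of counter fragments lacking it."""
    if not base_fragments or not counter_fragments:
        raise ValueError("both corpora need at least one fragment")
    # Only presence or absence in a fragment matters, not the number of occurrences.
    base_sets = [set(f) for f in base_fragments]
    counter_sets = [set(f) for f in counter_fragments]
    vocabulary = set().union(*base_sets) - set(stop_words)
    scores = {}
    for word in vocabulary:
        present = sum(word in s for s in base_sets) / len(base_sets)
        absent = sum(word not in s for s in counter_sets) / len(counter_sets)
        scores[word] = present + absent
    return scores


def top_markers(scores, limit=500):
    """Words scoring above 1, highest first (ties broken alphabetically)."""
    eligible = [w for w, s in scores.items() if s > 1]
    return sorted(eligible, key=lambda w: (-scores[w], w))[:limit]


first = [["gentle", "love", "sword"], ["gentle", "king"]]
second = [["sword", "king"], ["gentle", "yes"]]
print(top_markers(zeta_scores(first, second)))
```

The markers of the second candidate(s) are obtained by calling the same functions with the two corpora swapped.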

The final step is to place on a pair of coordinate axes the fragments into which the reference

corpora of the two candidates (or groups of candidates) of the study have been divided,

as well as the fragments into which the sample whose authorship is to be tested has

been divided. The value of the horizontal axis for each fragment stands as the number of 

markers of the first candidate(s) that it contains divided by its number of distinct words. 

This division compensates for the greater length of those fragments that include

residual words because they are at the end of a text. Similarly, the value of the vertical axis for

each fragment is the division of the number of markers of the second candidate(s) that it 

has by its number of distinct words.  
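
The two coordinates of a fragment could be obtained as in the following Python sketch. The function name is hypothetical, and the sketch assumes that each marker is counted once per fragment, however many times it occurs.

```python
def zeta_coordinates(fragment, markers_first, markers_second):
    """(x, y) = markers of each candidate present in the fragment,
    each divided by the number of distinct words of the fragment."""
    distinct = set(fragment)
    if not distinct:
        return (0.0, 0.0)
    x = len(distinct & set(markers_first)) / len(distinct)
    y = len(distinct & set(markers_second)) / len(distinct)
    return (x, y)


fragment = ["gentle", "love", "yes", "the", "the"]
print(zeta_coordinates(fragment, {"gentle", "love"}, {"yes"}))
```

In this toy fragment there are four distinct words, two of which are markers of the first candidate and one of the second.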

If the style of the two candidates (or groups of candidates) of the study is distinct 

enough, the fragments of the reference corpus of the first candidate(s) will occupy a 

specific area on the coordinate axis forming a cluster that is in a different position from 

the area occupied by the cluster created by the fragments of the reference corpus of the 

second candidate(s). Therefore, the proximity of the fragments of the disputed text to one 

cluster or the other will determine its likeliest authorship. 
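
The description above relies on the visual proximity of the disputed fragments to each cluster. One possible way of making that comparison explicit, which is an assumption of this sketch and not a step of the method as described, is to measure the Euclidean distance from a disputed fragment to the centroid of each cluster.

```python
from math import dist


def centroid(points):
    """Mean position of a non-empty list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def nearest_cluster(point, clusters):
    """Name of the cluster whose centroid is closest to `point`.

    `clusters` maps a candidate name to the (x, y) positions of the
    fragments of that candidate's reference corpus.
    """
    centroids = {name: centroid(pts) for name, pts in clusters.items()}
    return min(centroids, key=lambda name: dist(point, centroids[name]))


# Invented coordinates, for illustration only.
clusters = {
    "first": [(0.5, 0.1), (0.6, 0.2)],
    "second": [(0.1, 0.5), (0.2, 0.6)],
}
print(nearest_cluster((0.5, 0.2), clusters))
```

The coordinates in the example are invented, and a centroid comparison is only informative when the two clusters are clearly separated.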

Kinney (2009) analysed the authorship of the play Arden of Faversham with the Zeta 

test comparing Shakespeare with a group of more than 15 Elizabethan playwrights like 

Marlowe, Kyd, Heywood and Chettle, whose plays were combined in one corpus. The 

Shakespearean corpus was formed by 27 undisputed plays, whereas the non-

Shakespearean corpus included 109. A relevant factor about these samples is that they 

belong to distinct subgenres and were written between 1580 and 1619.


Kinney delineated a list of 500 Shakespearean markers where the word gentle 

occupied the first position, given that it was present in 69% of the Shakespearean 

fragments and it did not appear in 55% of the non-Shakespearean fragments, which is a 

total score of 1.24 if these percentages are transformed into numbers from 0 to 1 and 

added. On the other hand, the word that appeared at the top of the list that included the 

500 non-Shakespearean markers was yes, whose score was 1.27. Even though yes is a

function word, it seems that Kinney decided to keep it as a potential marker because

its usage reflects a choice made by the playwright, who could also have written yea

or ay. Nevertheless, a review of the literature on this issue shows that the selection of 

these linguistic forms in the Elizabethan period was more related to the dialect of the 

speakers and the linguistic context of the interaction than to their idiolect (see Culpeper, 

2018). Neither the list of ignored words to obtain the 500 markers of each reference 

corpus, that is, the stop list, nor the complete lists of 500 markers themselves were 

disclosed in the study, although they would have been of use to other researchers.

The author then placed on a coordinate axis the Shakespearean fragments, the non-

Shakespearean fragments, and those of the scenes of Arden of Faversham, which were 

analysed as independent texts (see Appendix 2). The value of every fragment on the 

horizontal axis stands as the number of Shakespearean markers that it contains divided 

by its number of distinct words, whereas the value of the vertical axis reflects the number 

of non-Shakespearean markers that it includes divided by its number of distinct words.  

As can be observed in the graphical representation of the results presented in 

Appendix 2, Kinney attributed to Shakespeare the authorship of six scenes of the play, 

whereas the fragments of the rest of the scenes occupied the area of the non-

Shakespearean cluster. The samples whose authorship was attributed to Shakespeare in 

the study were Scenes III.i, III.ii, III.iii, III.iv, III.vi and V.iii.  

I would like to comment on a few aspects about Kinney’s investigation on the 

authorship of Arden of Faversham. Firstly, it does not seem sensible to me to compile the 

corpora of the candidates with plays that were written between 1580 and 1619 and without 

making a distinction among subgenres. If the idiolect is defined as a dynamic 

phenomenon, the inclusion of plays that were written in distant periods and have different 

tones in the reference corpora reflects the opposite, that is, that the idiolect of an author 

stays fossilized throughout their entire career. One could hypothesize that the author 

behind Arden of Faversham adopted certain idiolectal features during the period in which 


s/he wrote the play, as well as when s/he was writing plays with a tragic tone that

differs from that of comedies. Therefore, if the plays where these idiolectal features can 

be found are mixed with others that are so dissimilar, the effectiveness of the study might 

diminish. This stance reflects one of the main hypotheses on which this thesis is built (see 

Section 1.2) and will be addressed in more depth in the following chapter, where its 

methodological approach will be discussed. 

Secondly, I would suggest that authors should be compared individually with this 

method. If, for instance, Marlowe had a tendency to write a highly distinctive word that 

could be of use to distinguish between his texts and those of Shakespeare, but his works 

are mixed with many of other playwrights who did not use it, the average values of the 

group would cause this solid marker to disappear from the study. This has also been 

suggested as a hypothesis at the beginning of the thesis (see Section 1.2) and will be 

developed in more detail in Section 4.5.5. 

Lastly, I would like to highlight that, since the reference corpora are divided into

fragments of 2,000 words or more, it does not seem statistically rigorous to compare them

with most of the scenes of Arden of Faversham in terms of the number of markers that 

they contain from the two lists of 500, even if these are then divided by the number of 

distinct words of the fragment, given that most of the scenes of Arden of Faversham do 

not even have 500 words. Perhaps this method should only be applied to disputed

fragments that have a comparable length to that of the fragments in which the reference 

corpora are divided (see Section 4.5.5). As a matter of fact, Kinney states that “some of 

the scenes are very short, and their placement [on the coordinate axis] cannot be regarded 

as reliable” (2009, p. 94). The solution that he adopted was to divide Arden of Faversham 

into four large segments that contained consecutive scenes of the play and conduct the Zeta

test again, but these results do not seem to be reliable either, since, as he himself admits, 

this approach “carries a greater risk of combining more than one author’s work in a single 

segment” (2009, p. 94). Following this approach, he did compare Shakespeare with 

Marlowe individually and, even though the graphical representation of the results derived 

from this study is not shown, the author states that the four segments were attributed to 

Shakespeare.  

Elliott and Greatley-Hirsch (2017) also analysed the authorship of Arden of 

Faversham using the variant of the Zeta test adopted by Kinney, among other tests. The 

most notable difference between their study and that of Kinney is in the criteria for the 


compilation of the reference corpora of the candidates. These authors used plays that were 

written between the years 1580 and 1594, while Kinney compiled his reference

corpora with plays that were written between 1580 and 1619. Neither of the two studies

considered the subgenre of the plays, which is something that this doctoral thesis intends 

to do. It is also worth mentioning that neither the stop list with all the ignored words for 

the calculation of the 500 markers of each candidate nor such lists of markers are shown 

in the study of Elliott and Greatley-Hirsch, as happens in that of Kinney. 

The authors divided Arden of Faversham “into overlapping blocks of 2,000 words 

advancing in 500-word increments, so that the first segment holds words 1-2,000, the 

second segment holds words 501-2,500, the third 1,001-3,000, and so on” (2017, p. 151). 

This contrasts with one of the aims of this thesis, which is to divide the play into its original

scenes and, if some of these are too short to be analysed with this method, to study their 

authorship with alternative procedures.  
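
The overlapping segmentation quoted above could be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that words left over after the last complete window are simply not covered, since the quotation does not state how the end of the play is handled.

```python
def rolling_segments(words, size=2000, step=500):
    """Overlapping windows of `size` words that advance `step` words at a time."""
    if not words:
        return []
    if len(words) <= size:
        return [list(words)]
    # Assumption: trailing words that do not fill a complete window are dropped.
    return [words[start:start + size]
            for start in range(0, len(words) - size + 1, step)]


# Word positions 1..3000 stand in for the words of a play.
positions = list(range(1, 3001))
for segment in rolling_segments(positions):
    print(segment[0], segment[-1])
```

With 3,000 word positions the sketch produces the three windows mentioned in the quotation: 1 to 2,000, 501 to 2,500 and 1,001 to 3,000.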

Elliott and Greatley-Hirsch compared in every case an author versus a group of 

authors. The candidates for the study were Greene, Kyd, Lodge, Lyly, Marlowe, Nashe, 

Peele, Shakespeare and Wilson. Therefore, they compared the plays of Shakespeare with 

the plays written by the other candidates of the study as a group and then they carried out 

the same procedure with Marlowe, Kyd and the others, which is an approach that 

contradicts one of the hypotheses suggested in this doctoral thesis, as has been previously 

explained. 

Their study concluded that “Shakespeare is the only authorial candidate to which it 

[i.e., the Zeta test] attributes any Arden of Faversham segments, and just six of them” 

(2017, p. 164). These fragments corresponded to the first part of Scene I.i and the totality 

of Scenes III.vi and IV.i.  

Finally, this section will briefly discuss other studies on the authorship of Arden of 

Faversham whose methods have not been adopted in this doctoral

thesis but have had a relevant impact on the academic community.  

Craig and Kinney (2009) described another method for the analysis of Elizabethan 

plays in which the frequency of function words is used to discriminate between two 

authors (or groups of authors). This procedure, called Principal Component Analysis, 

does not take into consideration whether function words appear or not in certain segments as the

Zeta test does with the words on which it focuses, given that the likeliest possibility is 


that almost every function word will be present in all the fragments. In contrast, this 

method “works in frequencies and combines them so as to bring out more subtle patterns 

of use” (2009, p. 28), and thus if the frequency of a function word is considered as a 

variable, this test will give “each word-frequency variable a weighting so as to highlight 

cumulative similarities and dissimilarities” (2009, p. 31). Kinney conducted a Principal 

Component Analysis to study the authorship of Arden of Faversham and the results 

coincided to some extent with those of his Zeta test, and he therefore suggested that

Shakespeare participated in the elaboration of “the middle section of the play” (2009, p. 

99), and that there was at least another author involved in the process. 

Macdonald P. Jackson has also devoted considerable research to demonstrating the

participation of Shakespeare in the creation of Arden of Faversham, especially in the 

Quarrel Scene, that is, Scene III.v. According to the author, only Shakespeare could have 

written a scene with such poetic value and emotional intensity between two characters.¹⁰

To prove this, he used a database where “words and phrases can be found, and so can 

instances of the proximity of one word or phrase to another” (2014, p. 17) called 

Literature Online (LION) to compare the play with the works of Elizabethan writers such 

as Shakespeare himself, Marlowe and Kyd. He searched for “[p]arallels in imagery and 

ideas […] only if passages had at least one prominent word in common”, as well as 

“[p]hrases and collocations rare enough to occur five or fewer times” (2014, p. 19). The 

results showed that the play with which Arden of Faversham shared the greatest number of these

parallels was Henry VI, Part III, followed by The Two Gentlemen of Verona and Henry 

VI, Part II, which led Jackson to state that this is solid proof for attributing the

authorship of the scene to Shakespeare. Nevertheless, according to recent research, the 

likeliest possibility is that the three parts of Henry VI were not written by Shakespeare

alone, but that he collaborated with Marlowe in their creation (see Section 2.2),

and thus these results might also reflect the participation of the latter in the elaboration of 

the scene. Jackson reinforced the results of the abovementioned quantitative analysis by 

tracing images in Arden of Faversham that can be associated with others found in 

Shakespearean plays. He later used the LION database to see the frequency of these 

images in the works of other playwrights with the objective of assessing how rare they 

were.  

 
10 Scene III.v from Arden of Faversham portrays a heated argument between the characters of Mosby and 

Alice. 


In contrast, Vickers has strongly argued that Arden of Faversham was written entirely 

by Thomas Kyd. He used the software Pl@giarism, which was originally designed to 

detect cases of plagiarism among students, to find a series of word sequences in common 

between the play and the undisputed works of Kyd (2008). In addition, Vickers himself 

encountered parallel passages between Arden of Faversham and the works of this 

playwright and reinforced his argument in favour of Kyd by stating that the play “is far 

ahead of Shakespeare’s abilities at the beginning of his career” (2015, p. 11), even though 

Shakespeare has been the candidate preferred by many scholars, as reflected in this section of the

thesis. Nevertheless, these findings and the methods with which they were obtained have 

been heavily criticized by Jackson (2015, 2017) and Taylor to the point that the latter 

stated that “[…] it is surprising, and unfortunate, that the Times Literary Supplement 

[where Vickers published the two articles that have been referenced earlier] continues to 

give Vickers a platform” (2015, p. 6).  

Similarly, I would like to express my disagreement with the approach adopted by 

Vickers, who has directly attributed the authorship of Arden of Faversham to a single 

author. According to historical and literary sources, the likeliest possibility is that the play 

was written in collaboration (see Chapter 2), and it therefore seems more reasonable to divide

it into smaller portions to study their authorship independently and only attribute it to a

single author if the results derived from such analyses coincide. Furthermore, the 

technique of finding parallel passages between a disputed text and the reference corpus 

of a possible author, which is an approach followed by Jackson and Vickers, does not 

seem to be conclusive enough in studies involving Elizabethan playwrights, given that 

their styles tend to present a high degree of resemblance and thus it is possible to find 

similarities between any play and the reference corpus of any candidate, in my view. This 

is one of the reasons why this doctoral thesis will only rely on statistical criteria. 

In sum, Section 3.4 has offered an overview of the fundamentals of authorship 

attribution studies and the distinct types of texts that can be analysed within this 

disciplinary framework, with a special emphasis on the study of the authorship of 

historical texts in general, and that of Arden of Faversham in particular, given the focus 

of the thesis. This review of previous studies allows for the establishment of a connection 

between this chapter and the following, where the approach and the methods adopted for 

the present investigation will be developed. 

 

3.5. Summary 

This chapter has provided the reader with a holistic perspective of forensic linguistics by 

commenting on its definition and historical development, as well as on its three main 

areas of study, known as the written language of the law, the spoken language of the law 

and the linguist as an expert witness. Firstly, the written language of the law has been 

presented as the branch of forensic linguistics that intends to make the laws more 

comprehensible through the Plain English Movement, which has been illustrated by 

classifying the prototypical features of legal documents and by analysing how to improve 

police cautions and jury instructions. Secondly, the spoken language of the law has been 

introduced as the area that examines the oral interactions that take place from the moment 

of arrest until a trial takes place, and the manner in which oral evidence can be

contaminated and the difficulties experienced by vulnerable participants involved in legal 

processes have been expounded. Lastly, the cases in which a forensic linguist is required 

to provide evidence for a case in which the use of language is involved or testify in court 

have been described with a special emphasis on authorship attribution studies, which have 

been addressed in a separate section.  

The main goal underlying the selection of this structure for the chapter has been to 

progressively narrow down its scope to previous studies on the authorship of Arden of

Faversham and the authorship tests considered for

this research. The following chapter of the thesis will provide an in-depth

account of the steps that will be taken to trace the idiolectal features of William 

Shakespeare’s and Christopher Marlowe’s undisputed plays for further comparison with 

those of Arden of Faversham. 

  

CHAPTER 4 | METHODOLOGY 

This chapter aims to provide the reader with a chronological explanation of the processes 

that have been followed for the conduction of the investigation. The criteria for the 

selection of William Shakespeare and Christopher Marlowe as the possible candidates for 

the attribution of authorship of Arden of Faversham will be addressed first. Afterwards, 

the manner in which the undisputed samples of each candidate have been compiled and 

adapted will be expounded, as well as the modifications that have been introduced in the 

disputed text, that is, Arden of Faversham, so that it can be compared with these reference 

corpora. This chapter will then present the tests that have been taken into consideration 

for the analysis and the need to evaluate their effectiveness in a series of pre-studies 

focused on the attribution of authorship of undisputed scenes of Shakespeare and 

Marlowe. Such pre-studies will be carried out with the aim of applying in the attribution 

of authorship of each scene of Arden of Faversham only those tests that have been proved 

to be reliable in a similar linguistic context. The lack of accessibility of the tools that can 

be used to conduct an analysis that includes some of the selected linguistic procedures 

generated the need to develop the software ALTXA. Its functionalities will be addressed 

in depth throughout this chapter, given that one of the main goals of the thesis is to 

facilitate the implementation of forensic linguistics in educational settings by offering this 

computational tool with an accessible interface as free software to the academic

community. Lastly, an explanation of the way in which the distinct functionalities of the 

software will be applied in the study of each type of scene will be provided, together with 

some guidelines on how to interpret every kind of outcome either in the pre-studies or in 

the final case study. 

4.1. Delimitation of the scope of the investigation 

The analysis of the authorship of Arden of Faversham has been selected as the focus of 

the thesis for two main reasons. Firstly, it represents a continuation of the research that I

developed in my previous work (see Section 1.1). Secondly, there is a scarcity of studies 

approaching this topic from a forensic linguistic perspective and those that have already 

been conducted could be considered inconclusive, since there is still much disagreement 

among scholars on which author(s) could have been involved in the elaboration of the 

text (see Section 3.4.4).  



Given the considerable length of the analysis, which will be expounded further on, as 

well as the many possible candidates that have been suggested for the authorship of each 

scene, the scope of this thesis needed to be narrowed down to the selection of two 

candidates. This decision has been made because of the way in which certain procedures 

like the Zeta test will be applied and, most importantly, to put the focus on establishing a 

solid methodological and computational basis for further studies involving the rest of the 

candidates.  

Even though there have been over fifteen playwrights considered as potential authors 

of the play in previous studies (see Section 3.4.4), scholars tend to agree on the fact that 

the three main candidates for the authorship of Arden of Faversham are William 

Shakespeare, Christopher Marlowe and Thomas Kyd (see Section 2.3). Many researchers 

who have studied the play from a linguistic point of view have suggested Shakespeare as 

its partial author (see Section 3.4.4), so it seemed reasonable to select him for the

study in the first place.  

The reasons why Marlowe has been selected over Kyd for the conduction of this 

research are that he is known to have collaborated with the Bard in the elaboration of 

Henry VI and that, if his life events are taken into consideration, one could argue that he

is a plausible author of an anonymous play of this kind (see Section 2.2). As a

matter of fact, his play Tamburlaine was published anonymously (Boas, 1940, as cited in 

Kinney, 2009). On the other hand, the studies that have supported Kyd’s authorship are 

far from being widely accepted in the academic community, as explained in Section 3.4.4, 

and there are only a few single-authored texts attributed to him, which would hinder a 

subsequent analysis. In any case, the fact that Shakespeare and Marlowe have been 

selected as the candidates for the attribution of authorship of Arden of Faversham in this 

study does not mean that the rest of the possible authors will not be taken into 

consideration in future lines of research, Thomas Kyd being the first one on the list (see 

Section 8.2).  

In sum, the objective of the thesis is not to determine the

authorship of Arden of Faversham conclusively, since there are other candidates that have

not been included in the analysis, but to discern whether the likeliest author of each scene is

Shakespeare or Marlowe, assuming that it was indeed written by one of them. Therefore, this

thesis should be seen as the first step of a long-term academic project whose priorities 

are, for the moment, the determination of Shakespeare or Marlowe as the likeliest author 



of each scene of the play for future comparisons with Thomas Kyd and the rest of the 

Elizabethan playwrights, the creation of a solid methodology that can pave the way for 

those future studies and the development of an accessible computer program with a wide 

range of functionalities for forensic authorship attribution (see Section 1.2). 

4.2. Data collection 

This section seeks to present the criteria for the compilation of the reference corpora, that 

is, the plays that will be used to delineate Shakespeare’s and Marlowe’s idiolect for 

further comparison with Arden of Faversham.  

On the one hand, Taylor’s study represents an approach for the compilation of the 

reference corpora that is opposite to that of the present thesis, given that it was based on 

the notion that “attribution problems in that period can be better understood if plays are 

tested against authorial canons that include non-dramatic as well as dramatic works” 

(2019, p. 1). 

On the other hand, there has been a tendency in studies of this kind to limit the 

selection of the undisputed works of the candidates to those that belong to the same genre 

and were written during a similar period to the one in which the disputed text was created, 

given that an author’s idiolect is not fossilized and thus a play that someone wrote in 1590 

may greatly differ from another text that the same person wrote in 1610, for instance. This 

approach can be observed if the study of Kinney (2009), which included plays that date 

from 1580 to 1619, is compared to that carried out by Elliott and Greatley-Hirsch (2017), 

where the selection of the plays for the analysis was restricted to those that were 

elaborated between 1580 and 1594.  

In my view, the style of Shakespeare and Marlowe evolved constantly and hence there 

might be significant stylistic inconsistencies among plays that were written more than

five years apart. For that reason, the first criterion for the selection of

the Shakespearean and Marlowian texts for the present research is that they need to have 

been written approximately between the years 1590 and 1595, given that Arden of 

Faversham was published in 1592 and probably written during that year or the year

before (see Section 2.3). This stands as a continuation of the approach suggested by Elliott 

and Greatley-Hirsch in their study (2017). 



In addition, I have noticed that the forensic linguistic studies of Elizabethan plays tend 

not to take into consideration something that seems to be crucial when these texts are 

analysed in the field of literature, which is their subgenre. The tone of a comedy seems 

considerably distinct from that of tragedies and history plays and, as a result, one could 

hypothesize that there might be notable idiolectal differences among subgenres that need 

to be taken into consideration in the compilation of the corpora. Therefore, the exclusion 

of comedies such as Shakespeare’s The Two Gentlemen of Verona will be considered as 

another criterion for the elaboration of the reference corpus of each author, given that 

Arden of Faversham is a domestic tragedy, despite the presence of a few comic scenes in 

the play. 

Narrowing down the selection of the reference plays of each candidate to those that 

were approximately written between 1590 and 1595 and are not comedies is an innovative 

way of carrying out the attribution of authorship of Elizabethan plays in general and that 

of Arden of Faversham in particular. This approach, which differs from those adopted by 

Kinney (2009), Elliott and Greatley-Hirsch (2017) and Taylor (2019), is built upon the 

hypothesis that the idiolect is such a dynamic phenomenon that significant changes may 

arise over short periods of time and among different subgenres and thus this question 

must be given maximum priority (see Section 1.2). Consequently, it is preferable to have 

shorter samples than other studies on this subject but to compile reference corpora that 

present almost identical idiolectal features to those that characterize the disputed text. It 

may seem that this approach has two main drawbacks, which are the inherent difficulties 

in finding sole-authored and undisputed plays that meet the abovementioned requirements 

and, most importantly, that the resulting corpora may not be considered representative 

enough to define Shakespeare’s and Marlowe’s idiolect due to an insufficient number of 

words, which is an issue that will be addressed further on. 

The plays selected for the compilation of the Marlowian corpus are The Jew of Malta 

(1589) and Edward II (1592),11 since Arden of Faversham and these two works were 

written no more than three years apart and they are both plays with a tragic tone attributed

to Marlowe without major doubts. Examples of Marlowe’s plays that have been discarded 

from the analysis are Dr. Faustus, given that it contains scenes which he probably wrote 

 
11 These dates have been taken from the official webpage of The Marlowe Society:

http://www.marlowe-society.org/christopher-marlowe/works/



in collaboration with other playwrights (Elliott & Greatley-Hirsch, 2017), and The 

Massacre at Paris, for being extremely short in comparison to other eligible texts. 

The two plays selected for the compilation of the Shakespearean corpus are Richard 

III (1592-1594) and Richard II (1595-1596).12 These plays are not comedies; they were 

elaborated during the period that was established as acceptable for the conduction of the 

study and there seems to be consensus within the linguistic and the literary community 

about the fact that they were written only by Shakespeare. Examples of other plays that 

have been considered for the compilation of the corpus because they were elaborated 

between 1590 and 1595 but do not meet the rest of the established criteria for such 

selection are Henry VI, Part I, due to studies suggesting that Shakespeare wrote

this text in collaboration with Christopher Marlowe (see Section 2.2), and The Comedy 

of Errors, for being a comedy. 

In sum, due to the period in which they were written, their subgenre and the fact that 

they are sole-authored and well-attributed plays, the selected texts for the compilation of 

the Marlowian reference corpus are The Jew of Malta and Edward II, while Richard III 

and Richard II have been chosen to delineate Shakespeare’s idiolect for further 

comparison with Arden of Faversham. Nevertheless, these undisputed works and Arden 

of Faversham itself needed to be carefully selected among many digital editions and later 

edited to make the subsequent analysis as precise as possible, which is an issue that will be

addressed in the following section. 

4.3. Extraction and adaptation of the samples 

This section will discuss the criteria followed to extract and adapt the five plays that 

constitute the focus of the analysis. During the process of selecting the most suitable 

digital edition of the texts, it has been of paramount importance to avoid major spelling 

inconsistencies by taking all of them from a single source with unified criteria. For this 

reason, the samples have been extracted from the archives of Project Gutenberg,13 since 

they have published the five plays and claim to have prioritized the preservation of the 

original words used in the first quarto of Arden of Faversham from 1592, a quarto from 

1598 of Marlowe’s Edward II, the first edition of The Jew of Malta, which is a quarto of 

 
12 These dates have been taken from the official webpage of The Royal Shakespeare Company: 

https://www.rsc.org.uk/shakespeares-plays/timeline 
13 https://www.gutenberg.org/ 



1633, and Shakespeare’s Richard III and Richard II from the 1623 edition of the First 

Folio (see Primary Sources for the links to access each edition).  

Even though no public access to a scanned version of the previously referenced 

Marlowian texts and the 1592 quarto of Arden of Faversham has been found, there is a 

scanned version of Shakespeare’s 1623 First Folio that can be accessed online,14 so

I compared the editions of Richard III and Richard II published by Project

Gutenberg with those of the Folio. The main goal behind this procedure was to ensure 

that the selection of words of the plays published by Project Gutenberg is faithful to that 

of the First Folio, even though Project Gutenberg has adapted their spelling slightly to be 

better understood by modern audiences.  

Afterwards, the five plays extracted from Project Gutenberg were compared among 

themselves to see if the spelling criteria are unified and no major differences could be 

found. This spelling adaptation and homogenization does not constitute a problem for the 

development of the research, given that its focus is on the selection of words of the plays 

rather than their spelling. This is because spelling inconsistencies are frequent in the 

previously referenced original editions, since these were usually transcribed by more than 

one person (see Ryskina et al., 2017 for an introduction to compositor attribution studies 

with Elizabethan texts), which is why the authorship analysis of published works stands 

as such a complex task. 

Should any inconsistencies have gone unnoticed and unedited, or should the samples from

Project Gutenberg contain minor editorial modifications, it seems that these

are not significant enough to threaten the precision of the large-scale

statistical analysis on which this investigation is built and can be seen as the low but

inevitable error rate that characterizes studies of this kind.

Once an explanation of the reasons underlying the selection of Project Gutenberg as 

the source for the compilation of the texts that constitute the focus of the research has 

been provided, the modifications that have been introduced in these plays to improve the 

quality of the subsequent analysis will be listed and exemplified below. 

A) In order to put the focus on the dialogues of the characters exclusively, all kinds 

of external indications have been erased, as can be observed in this example taken 

 
14 A scanned version of Shakespeare’s First Folio can be found in the following webpage: 

https://internetshakespeare.uvic.ca/Library/facsimile/overview/book/F1.html 



from the beginning of Scene I.i from Richard III, where the original text was the 

following. 

ACT I. SCENE I. 

[Enter RICHARD, DUKE OF GLOUCESTER, solus.] 

GLOUCESTER. Now is the winter of our discontent 

Made glorious summer by this sun of York; 

And all the clouds that lour'd upon our house 

In the deep bosom of the ocean buried. 

The piece of text presented above has been modified according to the criterion 

described earlier and this is the resulting sample.  

Now is the winter of our discontent 

Made glorious summer by this sun of York; 

And all the clouds that lour'd upon our house 

In the deep bosom of the ocean buried. 

Those stage directions embedded in the dialogues of the characters, such as 

[aside], have also been erased. This decision has been made to avoid contaminating

the samples, under the belief that idiolectal features are less likely to appear in

indications of this kind. In other words, stage directions can be seen as a different 

subgenre within the play where playwrights are only expected to provide basic 

instructions that do not reflect significant linguistic choices in the same way that 

a dialogue does. 

B) Given that one of the authorship tests selected for the conduction of the analysis 

is based on the average number of words per sentence of the texts (see Section 

4.5.2), a decision on how to proceed in those cases in which a character interrupts 

another has been made, as can be seen in the extract from Scene IV.iv from 

Richard III provided below. 

KING RICHARD. Then, by my self- 

QUEEN ELIZABETH. Thy self is self-misus'd. 



In this excerpt, the character of Queen Elizabeth interrupts that of King Richard 

and, if the names of the characters are erased, it could seem like it is a single 

sentence, when it is in fact a sentence that interrupts another. For this reason, 

interruptions of this kind have been divided with a period, so that the software 

ALTXA can count them as two separate sentences. Therefore, the resulting text 

is as follows. 

Then, by my self. 

Thy self is self-misus'd. 

C) The non-linguistic elements embedded in the texts, such as footnote

numbers, have been removed.

D) Since the software ALTXA cannot recognize certain characters, these have been

replaced; for instance, Æ has been turned into ae and ë into e.

E) The samples included a few typos that have been corrected. These could be found 

in sentences containing an opening bracket that was not followed by a closing 

bracket, in excerpts where the name of a character was mentioned within a 

dialogue but the whole name had been written in capital letters as if it were an

indication of who was speaking and, more predominantly, in fragments where two 

hyphens appeared together instead of a dash. 

F) The prologues and the epilogue written by Thomas Heywood in the 1633 quarto 

of The Jew of Malta (Elliott & Greatley-Hirsch, 2017) have been erased from the 

corpus. 

All the changes described above have been introduced manually and the resulting texts 

have been revised twice. This adaptation of the plays allows for the conduction of the 

analysis, whose structure will be expounded in the following section.  
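
Although the changes above were introduced manually, criteria A, B and D are mechanical enough to be illustrated in code. The following Python fragment is only a sketch under stated assumptions and not the procedure followed in this thesis: the function name adapt_sample, the regular expressions and the layout conventions they rely on (act and scene headings on their own line, stage directions in square brackets, speaker names in capital letters followed by a period) are hypothetical.

```python
import re

# Criterion D: replacements for characters that cannot be read (illustrative subset).
CHAR_MAP = {"Æ": "ae", "æ": "ae", "ë": "e"}

# Assumed layout: headings such as "ACT I. SCENE I." stand on their own line.
HEADING = re.compile(r"^(ACT|SCENE)\b.*$")
# Assumed layout: stage directions are enclosed in square brackets (criterion A).
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")
# Assumed layout: speaker prefix in capitals ending with a period, e.g. "KING RICHARD. ".
SPEAKER = re.compile(r"^[A-Z][A-Z ]*\.\s+")


def adapt_sample(text: str) -> str:
    """Return only the spoken lines of a play excerpt, one per line."""
    adapted = []
    for raw in text.splitlines():
        line = raw.strip()
        if HEADING.match(line):
            continue  # drop act and scene headings
        # Remove stage directions and collapse the whitespace they leave behind.
        line = " ".join(STAGE_DIRECTION.sub("", line).split())
        # Remove the name of the character who is speaking.
        line = SPEAKER.sub("", line)
        # Criterion B: an interrupted speech ends with a period so that it
        # is counted as a sentence of its own.
        if line.endswith("-"):
            line = line[:-1].rstrip() + "."
        for old, new in CHAR_MAP.items():
            line = line.replace(old, new)
        if line:
            adapted.append(line)
    return "\n".join(adapted)
```

Applied to the two excerpts from Richard III quoted above, this sketch yields the same adapted samples shown under criteria A and B. The typos described in criterion E are too irregular to be handled by rules of this kind, so manual revision would still be needed.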

4.4. Structure of the analysis 

Given the possibility that Arden of Faversham was written in collaboration, its scenes 

will be analysed as independent texts. In cases of possible cooperation between two or 

more playwrights in an Elizabethan play, forensic linguists can adopt two approaches, the 

first one consisting of the division of the play into even fragments for further analysis, and



the second one involving the analysis of the original scenes of the text.15 The latter has 

been selected for the study, given that it seems more plausible that, if

Shakespeare and Marlowe had written the play together, they would probably have

assigned certain scenes to one or the other depending on their thematic content. For instance,

one of them may have been in charge of the scenes with the characters of Black Will and 

Shakebag, while the other could have written the romantic scenes between Mosby and 

Alice. 

This approach has a major drawback, which is the disparity in the length of the scenes 

of the play. This can be seen in Table 1, which details the number of words of the scenes 

of Arden of Faversham, once it has been edited under the principles described in the 

previous section of the chapter. 

Table 1 | Length of the scenes of Arden of Faversham

Scene                    Length (words)
Scene I.i                5,133
Scene II.i               916
Scene II.ii              1,694
Scene III.i              822
Scene III.ii             516
Scene III.iii            357
Scene III.iv             240
Scene III.v              1,293
Scene III.vi             1,265
Scene IV.i               838
Scene IV.ii              263
Scene IV.iii             593
Scene IV.iv              1,250
Scene V.i                3,477
Scene V.ii               106
Scene V.iii              179
Scene V.iv               117
Scene V.v                321
Scene V.vi (Epilogue)    148

 
15 It is worth mentioning that “[t]he 1592 Quarto, the only substantive text, is not divided into acts or scenes. 

Modern editions […] divide the play into eighteen scenes and an epilogue. Each of these scenes ends with 

[…] ‘Exeunt’, and so there are clear-cut ‘natural’ divisions” (Kinney, 2009, p. 91). 



Despite the plethora of studies that have analysed the effectiveness of distinct methods in 

authorship attribution studies, I believe that the effectiveness of every procedure depends 

on the linguistic context where it is applied. In other words, the validity of any given 

method depends on the type of text where it is applied, its length and the idiolectal features 

of its potential authors, and thus an authorship test that has been proved to be effective to 

distinguish between Shakespeare’s and Marlowe’s fragments of 2,000 words may not be 

useful to determine the authorship of their shortest scenes, and a test that works well for 

these two authors may not find significant idiolectal differences if, for instance, 

Marlowe’s texts are replaced by others written by Thomas Kyd. 

The scenes of Arden of Faversham have been divided into four groups according to 

their size and a series of pre-studies will be conducted to analyse the authorship of 

undisputed scenes from the Shakespearean and the Marlowian reference corpora that have

a similar length to those included in each of these four groups as if they were anonymous. 

Such analyses will be carried out to discern which are the most reliable authorship tests 

for each type of scene of Arden of Faversham. In other words, the pre-studies will be 

carried out to only apply in the attribution of authorship of the scenes of Arden of 

Faversham those procedures that have been proved to be highly effective in the analysis 

of undisputed scenes of the two candidates of the study. In addition, the conduction of 

pre-studies of this kind allows the researcher to have a reference of what kind of outcome 

can be considered valid in the subsequent case study. 

The first group in which the scenes of Arden of Faversham have been divided, which 

is the largest, is formed by scenes that contain between 100 and 450 words, and the study 

of their authorship constitutes the most challenging task of the research. The second group 

includes scenes whose number of words ranges from 500 to 950. These samples are more 

representative than those of the first group, but they are still considerably short. The 

third group includes four scenes whose length ranges from 1,100 to 1,700 words.

Finally, the fourth group is constituted by the two largest scenes of the play, which have 

more than 2,000 words and seem to be the ones whose authorship can be attributed more 

easily, since idiolectal features are more likely to arise as the number of words of a sample 

increases.
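
The division into four groups can be reproduced from Table 1 with a few lines of Python. This is a minimal sketch for illustration only: the word counts are copied from Table 1 and the ranges from the paragraph above, whereas the names SCENE_LENGTHS, GROUP_RANGES, group_of and group_scenes are hypothetical and do not belong to ALTXA.

```python
# Word counts of the edited scenes, as listed in Table 1.
SCENE_LENGTHS = {
    "I.i": 5133, "II.i": 916, "II.ii": 1694, "III.i": 822, "III.ii": 516,
    "III.iii": 357, "III.iv": 240, "III.v": 1293, "III.vi": 1265,
    "IV.i": 838, "IV.ii": 263, "IV.iii": 593, "IV.iv": 1250,
    "V.i": 3477, "V.ii": 106, "V.iii": 179, "V.iv": 117, "V.v": 321,
    "V.vi": 148,
}

# Inclusive ranges as stated in the text; the fourth group is open-ended
# ("more than 2,000 words"), hence the missing upper bound.
GROUP_RANGES = {
    1: (100, 450),
    2: (500, 950),
    3: (1100, 1700),
    4: (2001, None),
}


def group_of(words):
    """Return the group number for a scene length, or None if no range fits."""
    for group, (low, high) in GROUP_RANGES.items():
        if words >= low and (high is None or words <= high):
            return group
    return None


def group_scenes(lengths):
    """Map each group number to the list of scenes that fall within its range."""
    groups = {group: [] for group in GROUP_RANGES}
    for scene, words in lengths.items():
        group = group_of(words)
        if group is None:
            raise ValueError(f"Scene {scene} ({words} words) fits no group")
        groups[group].append(scene)
    return groups


if __name__ == "__main__":
    for group, scenes in group_scenes(SCENE_LENGTHS).items():
        print(group, len(scenes), scenes)
```

Since the ranges leave gaps between groups, the sketch raises an error for any length that falls outside all of them, which makes it easy to check that every scene in Table 1 has been assigned to a group.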

In sum, this investigation will be divided into a series of pre-studies and a case study. 

The objective of the pre-studies is to analyse a representative number of scenes from the 

undisputed plays of Shakespeare and Marlowe as if they were disputed texts to only apply 



in the case study, that is, the attribution of authorship of the scenes of Arden of 

Faversham, those methods that have been proved to be reliable with samples written by 

these candidates and have a comparable size.  

There is a bewildering number of variables that can alter the results of an idiolectal 

study, which creates the need to control as many of them as possible (Kredens, personal 

communication, February 17, 2019). This justifies the pre-studies, the narrowing down of

the selection of the undisputed plays of Shakespeare and Marlowe

to such a specific period of time and the exclusion of comedies from these

reference corpora.

In response to the question raised earlier in this chapter about whether these

Shakespearean and Marlowian reference corpora may not appear to be representative 

enough to analyse the authorship of Arden of Faversham due to an insufficient number 

of words, I would suggest the following answer, which derives from one of the main 

hypotheses on which the investigation is built (see Section 1.2). The representativeness 

of a reference corpus is not only determined by its length, but also by the extent to which 

its texts are able to reflect the conditions in which the disputed sample was elaborated. 

The Shakespearean reference corpus contains 50,057 words, whereas that of Marlowe is 

constituted by a total of 38,434 words. Given that each scene of Arden of Faversham will 

be analysed as an independent text and most of them will not exceed 1,000 words, the 

length of these reference corpora seems sufficient to establish reliable comparisons. 

Furthermore, they have been compiled with undisputed texts that are similar to the scenes 

of Arden of Faversham in terms of their subgenre and the period in which they were 

written, which makes them representative of the idiolectal features of this

disputed text. The following section of the chapter will present the tests that have been 

taken into consideration for this study and the software that has been specifically 

developed for its conduction. 

4.5. Selection of the authorship tests for the analysis and the role of ALTXA 

This section will examine the authorship attribution methods selected for the analysis and 

how they can be accessed in the software ALTXA. These are based on the quantification 

of the relative frequency of a series of keywords in the plays, the calculation of their 

average number of words per sentence and their lexical richness, tracing common n-grams 

and the conduction of the Zeta test. A computational tool is required to carry out the 



abovementioned procedures, so the suitability of already available programs was

assessed at the initial stage of the investigation.  
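
To make three of the measures named above more concrete, the following Python sketch computes the relative frequency of a keyword, the average number of words per sentence and a simple index of lexical richness. It is an illustration under assumptions of my own and does not describe how ALTXA is programmed: words are taken to be runs of letters that may contain internal apostrophes or hyphens, sentences are taken to end in a period, an exclamation mark or a question mark, relative frequency is expressed as a percentage, and lexical richness is approximated by the type-token ratio. All function names are hypothetical.

```python
import re

# Assumption: a word is a run of letters with optional internal apostrophes or hyphens.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def tokenize(text):
    """Return the lowercased words of a text."""
    return [word.lower() for word in WORD.findall(text)]


def relative_frequency(text, keyword):
    """Occurrences of a keyword per 100 words of text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 100 * tokens.count(keyword.lower()) / len(tokens)


def average_sentence_length(text):
    """Average number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def type_token_ratio(text):
    """Distinct words divided by total words (one possible richness index)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

The type-token ratio is sensitive to the length of the sample, so values obtained from texts of very different sizes should not be compared directly.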

Some of them, such as WordSmith Tools, Voyant Tools

and AntConc, have an intuitive interface that is accessible to linguists and were programmed to conduct

simple tasks like the calculation of the relative frequency of a keyword in a text. However,

they cannot carry out some of the tests mentioned earlier. For instance, WordSmith

Tools and Voyant Tools do not include n-gram tracing among their functionalities, and

the Zeta test is not available in any of these programs.

WordSmith Tools is a computer program16 with three main functionalities. These are 

to identify the concordances of a word selected by the user in a corpus, to generate a list 

of all its words according to their frequency and to calculate the number of appearances 

and relative frequency of a set of keywords, which is a functionality that it shares with 

ALTXA (see Smith, 2021 for a review of the latest version of WordSmith Tools).  

Voyant Tools is an online platform17 with a simplified interface where the user can 

upload a corpus of one or more texts, press the button Reveal and have instant access to 

all the parameters that it measures. The functionalities that it shares with ALTXA are the 

calculation of the relative frequency of a set of chosen keywords in the corpus, its average 

number of words per sentence and its lexical richness (see Alhudithi, 2021 for a detailed 

list of all the functionalities of Voyant Tools). 

AntConc is a computer program18 that shares with ALTXA the ability to calculate the 

relative frequency of keywords selected by the user in a corpus (although it can also 

generate its own list of keywords using the log-likelihood or the chi squared method, 

which are procedures that ALTXA cannot carry out), as well as its lexical richness. It also 

allows the user to conduct a customized search for n-grams in a corpus, whereas ALTXA 

has been programmed to detect the n-grams that two corpora share (see Smith, 2021 for 

a review of one of the latest versions of AntConc). 
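
The difference between searching for n-grams within one corpus and detecting those that two corpora share can be illustrated with a short Python sketch. It is not ALTXA's code, which is written in Java and is not reproduced here; it assumes word n-grams, a simple tokenization in lowercase, and hypothetical function names.

```python
import re


def tokenize(text):
    """Lowercased words; internal apostrophes and hyphens are kept (assumption)."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def ngrams(tokens, n):
    """Set of all sequences of n consecutive words in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(text_a, text_b, n):
    """Word n-grams that occur in both texts."""
    return ngrams(tokenize(text_a), n) & ngrams(tokenize(text_b), n)
```

For example, shared_ngrams("Now is the winter of our discontent", "the winter of our content", 3) returns the two trigrams "the winter of" and "winter of our", each represented as a tuple of words.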

On the other hand, more powerful tools like the software Sketch Engine and the 

programming language R can be found on the Internet. Their strength lies in the wide

range of functionalities that they offer, but their usage might be complicated for those 

 
16 Available at https://www.lexically.net/wordsmith/ 
17 https://voyant-tools.org/  
18 Available at https://www.laurenceanthony.net/software/antconc/ 


who do not have a solid IT background, which is not only my case, but the case of many 

linguists.  

Sketch Engine is an online platform19 that offers many possibilities for the 

compilation and treatment of a corpus. Its main functionalities are presented on the 

interface as Word Sketch, Word Sketch Difference, Thesaurus, Concordance, Parallel 

Concordance, Wordlist, N-grams, Keywords, Trends and One-Click Dictionary, but these 

include a wide range of advanced settings (see Arias Rodríguez & Fernández-Pampillón 

Cesteros, 2020 for a thorough explanation of each of them). Its N-grams tool is more 

similar to the one that is present in AntConc than to that of ALTXA, given that it is mainly 

focused on providing the user with a customized search for n-grams within a corpus, 

rather than comparing those that two corpora share. Despite the many functionalities that 

Sketch Engine includes, the Zeta test is not programmed as one of them, as happens with 

the simpler tools presented earlier.  

R is a programming language that was mainly created for statistical computing. As 

also happens with Python and other programming languages, its possibilities are almost 

endless, if the user knows how to divide complex authorship attribution methods into 

simple tasks and program them. Despite the attempts at making its usage accessible for 

linguists through specialized courses (see Análisis de textos y estilometría usando R, 

organized by the Universidad Nacional de Educación a Distancia20), this requires 

considerable time and effort for those who do not have experience in programming. 

Given the limitations of these tools, I decided to develop software that included a 

representative catalogue of authorship tests within the disciplinary field of forensic 

linguistics among its functionalities and presented a simplified interface so that it could 

be accessible to all linguists. With that purpose in mind, I contacted computer 

programmer Carlos Antón and we invested a couple of years in the creation of a Java 

program named ALTXA, which is compatible with all operating systems and admits texts 

in Spanish and English. The implementation of ALTXA in professional and educational 

settings to contribute to the development of this relatively modern discipline has been 

delineated as one of the main goals of this doctoral thesis (see Section 1.2). The following 

 
19 https://www.sketchengine.eu/  
20 Information available at https://formacionpermanente.uned.es/tp_actividad/idactividad/10010  

subsections will address the functionalities of ALTXA and the manner in which they will 

be applied in the pre-studies and, if these are successful, in the final case study. 

4.5.1. Quantification of the relative frequency of keywords 

The first method that was selected for the study is based on the calculation of the relative 

frequency of a series of keywords chosen by the researcher in the disputed texts, that is, 

the scenes of Arden of Faversham, and the reference corpora (see Section 3.4.4 for an 

account of previous research involving this procedure). This selection of keywords was 

based on my personal judgement after having read most of the plays written by 

Shakespeare and Marlowe and consisted in a list of words from Arden of Faversham that 

I thought to be more characteristic of one author than the other.  

This was the first function programmed in ALTXA and can be accessed in the 

following way. The interface of the program includes tabs that allow for the conduction 

of distinct types of tests, which will be listed and explained throughout this chapter. When 

the user clicks the Text Analysis tab, they will find a file chooser called Text file where 

they are expected to upload a document in .txt format that contains the text where the 

analysis will be conducted. There will also be another file chooser called Keywords file 

where the user has the option to upload a .txt document with a list of keywords. Such 

keywords must be separated by single spaces. The program does not distinguish 

between capital and lowercase letters, so that the same word written with and without a 

capital letter is not counted as two distinct lexical items, as in You and 

you. When the button Execute is clicked on the interface of ALTXA, the program will 

count the number of times that each keyword appears in the sample, divide the result by 

the total number of words, that is, the tokens, and multiply it by a hundred to establish a 

percentage that stands as the relative frequency of the keyword. The software will then 

detail on the blank space of its interface the relative frequency of all the keywords selected 

by the researcher, as well as the average number of words per sentence of the text, its 

lexical richness and other parameters that have not been considered for this study, such 

as the average number of letters per word (see Figure 1). 
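
The calculation described in this paragraph can be sketched in Python as follows. This is only an illustrative sketch and not the Java code of ALTXA: the function names and the tokenization rule (a token is a run of letters or apostrophes, which suits English but not accented Spanish characters) are assumptions made for the example.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters or apostrophes, lowercased
    # so that "You" and "you" are counted as the same lexical item.
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(text, keywords_line):
    """Return the relative frequency (as a percentage) of each keyword."""
    tokens = tokenize(text)
    keywords = keywords_line.lower().split()  # keywords separated by spaces
    total = len(tokens)
    if total == 0:
        return {kw: 0.0 for kw in keywords}
    # occurrences of the keyword / total number of tokens * 100
    return {kw: 100 * tokens.count(kw) / total for kw in keywords}

sample = "You shall not go. I tell you, you shall stay."
print(relative_frequencies(sample, "You shall"))
```

In the invented sample, you accounts for three of the ten tokens and shall for two, so the sketch reports 30 and 20 per cent respectively.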

 
Figure 1 | Interface of ALTXA for text analysis 

 
A few months after the inclusion of this functionality in the software, I came across a 

study on the authorship of the Bixby Letter conducted by Grieve et al. (see Section 3.4.4) 

where they stated the following regarding the use of authorship methods that are based 

on the quantification of a set of features selected by the researcher: 

 In forensic linguistics, short texts are often attributed by manually selecting 

 linguistic features from the questioned document that appear to be relatively 

 distinctive or rare and then by searching for these forms in the writing samples of 

 each possible author. Although this method is logical and regularly applied 

 in casework, there are […] potential issues with its application. First, it is 

 unclear how to select an exhaustive or at least an unbiased feature set […]. It is 

 unclear how to judge whether differences in the use of forms in the 

 possible author writing samples are sufficient in the aggregate to attribute the 

 questioned document: because this approach relies on the judgement of the 

 analyst and therefore cannot be consistently or mechanically applied, it is 

 difficult to systematically evaluate the reliability of such methods. (2018, pp. 5-6) 

The excerpt presented above made me question the nature of this test and eventually 

discard it from the analysis. I realized that most of the selected keywords were mainly 

influenced by the works of Shakespeare for the simple reason that he was the author that 


I had read the most. In other words, my selection of keywords would have provided 

Shakespeare with more chances to be selected as the likeliest author of the scenes of 

Arden of Faversham, which is why I decided that all the tests involved in the analysis should 

not rely in any way on my judgement. Nevertheless, this function has been kept in 

ALTXA for other linguists who may decide to carry out an analysis of this kind, since 

there might be linguistic contexts in which the calculation of this parameter could be 

useful. 

In sum, the quantification of the relative frequency of keywords, which was initially 

selected as one of the tests for the conduction of the study, has been discarded because of 

its reliance on subjective criteria. The following subsections will show the fundamentals 

of the tests that will play a role in the pre-studies. 

4.5.2. Quantification of the average number of words per sentence 

The second test selected for the conduction of the study is based on the quantification of 

the average number of words per sentence of the samples. As pointed out in Section 3.4.4, 

it seems that this parameter could be effective to discern which author tends to write more 

complex syntactic constructions. To measure this parameter, ALTXA has been 

programmed to count the total number of words of a given sample, that is, its tokens, and 

divide the result by its number of sentences by considering a period, an exclamation mark, 

a question mark and a colon as the end of a sentence.  
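
A minimal Python sketch of this calculation is given below. It is not the code of ALTXA: the function name, the tokenization rule and the treatment of consecutive end marks (such as "?!") as a single sentence boundary are assumptions introduced for the example.

```python
import re

def average_words_per_sentence(text):
    # A period, exclamation mark, question mark or colon ends a sentence.
    # Assumption: consecutive marks such as "?!" count as one boundary.
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    # Assumption: a token is a run of letters or apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    return len(tokens) / len(sentences) if sentences else 0.0

sample = "Who is there? Speak: I charge thee! Stand and unfold yourself."
print(average_words_per_sentence(sample))
```

The invented sample contains eleven tokens and four sentences, because the colon after Speak closes a sentence in the same way as the other three marks.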

This function can be accessed if the Text analysis tab is clicked on the interface of 

ALTXA, which is the same where the quantification of the relative frequency of 

keywords in a sample and the calculation of its lexical richness can be conducted (see 

Figure 1). As underlined in the previous section, the user will find a file chooser called 

Text file, where a document in .txt format containing the text that they wish to put into 

analysis can be uploaded. Once the user clicks the button Execute, ALTXA will detail on 

the blank space of its interface the average number of words per sentence of the text, its 

lexical richness and, if they previously uploaded a .txt document with keywords to the 

file chooser Keywords file, it will also indicate the relative frequency of such keywords. 

This parameter will be included in a pre-study that will analyse undisputed scenes of 

Shakespeare and Marlowe to assess its effectiveness to distinguish between samples 

written by both authors. Four analyses, one for each of the four types of scenes that have 


been delineated earlier according to their length, will be conducted to determine if this 

authorship test can be used to analyse the authorship of Arden of Faversham.  

Firstly, five random scenes of the Shakespearean corpus whose length is between 100 

and 450 words will be extracted and their average number of words per sentence will be 

calculated. Afterwards, the same procedure will be conducted with five random scenes 

from the Marlowian corpus whose length also ranges from 100 to 450 words under the 

assumption that, if this test is effective in this linguistic context, the Shakespearean scenes 

will present similar values among themselves and, at the same time, that these results will 

be different enough from those derived from the analysis of the Marlowian fragments, 

whose values should also be similar among themselves. If that is the case, a subsequent 

calculation of the average number of words per sentence of the scenes of this length from 

Arden of Faversham could allow for their association with the values of one of the 

candidates.  

The same procedure will then be repeated with undisputed scenes of both playwrights 

from the other three groups, whose number of words is between 500 and 950, between 

1,100 and 1,700, and similar or superior to 2,000, respectively. If the results derived from 

the analysis of the scenes of any group show enough intra-author consistency and inter-

author variation, this method will be used to analyse the authorship of the scenes of Arden 

of Faversham that belong to the same group. Five scenes of each author from the second 

and the third group will be analysed, whereas, due to the scarcity of undisputed scenes of 

almost 2,000 words or more in the Marlowian reference corpus, the fourth stage of the 

pre-study will include five Shakespearean scenes and four of Marlowe. 

It is reasonable to believe that, as the size of the samples increases, so will the 

effectiveness of this method, given that a higher number of sentences will facilitate the 

stabilization of this value, which will allow for the existence of more intra-author 

consistency. Nevertheless, it is hard to predict if the average number of words per 

sentence of Shakespeare and Marlowe will overlap, which would automatically exclude 

this test from the final case study, since it would be impossible to associate a scene from 

Arden of Faversham with one of the authors in terms of this parameter. 
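
One simple way of making the notion of overlap concrete is to compare the range of values obtained for each author, as in the Python sketch below. The thesis does not fix a numerical criterion at this point, so the range comparison, the function names and the figures are assumptions used only for illustration.

```python
def value_range(values):
    """Return the lowest and highest value of a list of measurements."""
    return min(values), max(values)

def ranges_overlap(values_a, values_b):
    """True if the ranges of the two authors share at least one point."""
    low_a, high_a = value_range(values_a)
    low_b, high_b = value_range(values_b)
    return low_a <= high_b and low_b <= high_a

# Invented figures for illustration only; they are not results of the thesis.
author_a = [12.1, 12.8, 13.0, 12.4, 12.6]
author_b = [15.2, 14.9, 15.8, 15.1, 15.5]
print(ranges_overlap(author_a, author_b))
```

Under this assumed criterion, ranges that do not overlap would indicate both intra-author consistency and inter-author variation, whereas overlapping ranges would make the parameter unusable for the corresponding group of scenes.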

In sum, four analyses will be conducted to determine if the average number of words 

per sentence of the scenes of Shakespeare and Marlowe presents sufficient intra-author 


consistency and inter-author variation to be later used to determine the likeliest authorship 

of the scenes of Arden of Faversham (see Section 5.1). 

4.5.3. Quantification of the lexical richness 

The belief that the quantification of the lexical richness of the samples can be of use to 

discern which of the two candidates of the study handled a wider range of vocabulary has 

determined its inclusion in the research. To calculate this parameter, which can also be 

accessed in the Text analysis tab, ALTXA has been programmed to divide the number of 

distinct words of a sample, or types, by its total number of words, or tokens, and multiply 

the result by a hundred. As happens during the calculation of the relative frequency of 

keywords, the software will not make a distinction between capital and lowercase letters 

while measuring the lexical richness of a sample to avoid counting as two different types 

a word that has been written with and without a capital letter. The results derived from its 

calculation will appear on the blank space of the interface together with the average 

number of words per sentence of the sample and the relative frequency of the selected 

keywords, among other parameters (see Figure 1). 
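
The type-token calculation described above can be sketched in Python as follows. The function name and the tokenization rule are assumptions made for the example and do not reproduce the Java code of ALTXA.

```python
import re

def lexical_richness(text):
    """Return the percentage of distinct words (types) over all words (tokens)."""
    # Lowercasing ensures that "You" and "you" count as one type.
    # Assumption: a token is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    types = set(tokens)
    return 100 * len(types) / len(tokens)

sample = "You and you and I"
print(lexical_richness(sample))
```

The invented sample has five tokens but only three types (you, and, i), which yields a lexical richness of 60 per cent.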

With the purpose of assessing the reliability of this parameter to analyse the 

authorship of Arden of Faversham, a pre-study divided into four stages, one for each of 

the four types of scenes in terms of their length, will be carried out with undisputed scenes 

taken from the Shakespearean and the Marlowian corpora. The objective of these analyses 

is the same as in those on the average number of words per sentence, that is, to discern if 

there is enough intra-author consistency and inter-author variation to later associate with 

clarity the lexical richness of the scenes of Arden of Faversham with one of the two 

candidates. Five Shakespearean and five Marlowian scenes will be included in each stage 

of the pre-study, with the exception of the fourth, which will analyse five Shakespearean 

scenes and the only four scenes of almost 2,000 words or more that the Marlowian corpus 

contains, as in the previous pre-study. 

The scenes of each group will not be randomly selected as in the pre-study on the 

average number of words per sentence, given that, as the size of a sample becomes larger, 

the chances of repeating words are higher, and hence slight increases in the number of 

words of a sample may greatly lower its percentage of lexical richness. For that reason, 

the samples whose number of words is more similar will be selected for each stage of the 

pre-study, creating subgroups of scenes of almost identical length within each group to 


optimize the results even further. This contrasts with the criterion behind the selection of 

the scenes for the pre-study on the average number of words per sentence, given that this 

parameter is not so heavily affected by the size of the samples and thus it is enough to 

compare random scenes that belong to the same group. 
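
The selection of scenes of almost identical length could be automated along the lines of the following Python sketch, which picks the subgroup of k scenes with the smallest difference between its longest and shortest member. This is an assumption about how the criterion might be operationalized, not a description of how the scenes of this thesis were chosen; the scene labels and word counts are hypothetical.

```python
def closest_length_subgroup(scene_lengths, k):
    """Return the k scene names whose word counts are closest to one another.

    scene_lengths maps a scene label to its number of words.
    """
    ordered = sorted(scene_lengths.items(), key=lambda item: item[1])
    if k > len(ordered):
        raise ValueError("not enough scenes to form a subgroup of size k")
    # Slide a window of k scenes over the sorted list and keep the window
    # with the smallest spread between its longest and shortest scene.
    best_start = min(
        range(len(ordered) - k + 1),
        key=lambda i: ordered[i + k - 1][1] - ordered[i][1],
    )
    return [name for name, _ in ordered[best_start:best_start + k]]

# Hypothetical labels and word counts, for illustration only.
lengths = {"a": 120, "b": 300, "c": 310, "d": 305, "e": 440}
print(closest_length_subgroup(lengths, 3))
```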

It is worth mentioning that the scenes of almost 2,000 words or more of the two 

reference corpora have disparate lengths and hence cannot be divided into subgroups 

like those of the other three groups. For that reason, once their lexical richness has been 

calculated, a projection of what these values would be if their sizes were more balanced 

will be established to evaluate the results of the fourth stage of the pre-study more efficiently 

(see Section 5.2.4). 

It could be hypothesized that this discriminator needs larger samples than those 

involved in the four stages of the pre-study to present intra-author consistency, but this 

needs to be proved with concrete studies, which will be carried out in Section 5.2. 

In brief, the pre-study on the calculation of the lexical richness of undisputed scenes 

of Shakespeare and Marlowe aims to investigate whether there is enough intra-author 

consistency and inter-author variation in any of the four types of scenes to later apply this 

test in the attribution of authorship of the scenes of Arden of Faversham. 

4.5.4. N-gram tracing 

A study of the common n-grams between the disputed text and the reference corpora 

stands as the next authorship test selected for the analysis. Taylor (2019) conducted a 

study of this kind on the authorship of a small fragment of Arden of Faversham where he 

compiled a reference corpus for each of his candidates that included texts written in 

relatively different periods and that belong to distinct literary genres (see Section 3.4.4). 

The approach of the present study differs from Taylor’s in what has been considered a 

representative idiolectal sample, since the Shakespearean and the Marlowian corpora of 

this research have been compiled with plays that were written in a period which was close 

to the year 1592, that is, when Arden of Faversham was first published, and are not 

comedies, which may have a notable impact on the results, as has been argued throughout 

this chapter. Furthermore, this research follows the traditional vision of n-grams as 

combinations of linguistic forms that appear consecutively within the same sentence, which 

constitutes another difference with the analysis conducted by Taylor, who also traced 

certain types of non-consecutive combinations of linguistic forms. 


There are character n-grams and word n-grams, as pointed out during the explanation 

of previous studies of this kind (see Section 3.4.4). Even though both types of n-grams 

have been proved to be useful in certain linguistic contexts, I have decided to focus on 

word n-grams under the hypothesis that these reflect more distinctive linguistic 

constructions (see Section 1.2). In addition, this research will only study word n-grams 

of at least two words, which is a similarity with the research conducted by Taylor. There are 

two main reasons underlying the selection of this approach. Firstly, a combination of 

two or more words in common tends to be more distinctive than a combination of letters 

or a single word in common. Secondly, the Zeta test, which will be expounded later, 

identifies distinctive single words in common between the disputed text and the reference 

corpora, so a study of n-grams that also focused on single words would be redundant. As a 

matter of fact, most of the common words that n-gram tracing reveals are function words, 

which are not as distinctive as the lexical words on which the Zeta test mainly focuses, 

and thus the latter will present more significant results when studying single words in 

common.  

In any case, the identification of all types of word n-grams has been programmed in 

ALTXA, where the test can be accessed as follows. When the user clicks the tab called 

N-gram analysis (see Figure 2), they will find two file choosers called Text A file and Text 

B file, which only admit documents in .txt format. The first file chooser is expected to 

store the shorter sample, since, when the button Execute is clicked, the program will 

make a list of all the word n-grams of the document stored as Text A and then look for 

coincidences with Text B ignoring commas and other punctuation marks within the 

sentence. There is no problem in uploading the larger sample to the first file chooser, but 

the process will take longer for ALTXA.  

The software will then generate on the blank space of its interface a list of all the word 

n-grams shared between both samples. This list will not only indicate the number of 

n-grams of each type in common, but will also detail which n-grams these are 

and how many times they appear in each of the two samples. The order in which 

the n-grams are listed will be determined by their length, so that, for instance, 4-grams will 

appear before 3-grams, and so on. 
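
The procedure can be sketched in Python as follows. This is an illustrative approximation and not the Java implementation of ALTXA. Several details are assumptions: sentences are delimited by the same four marks used for the average number of words per sentence, words are lowercased before comparison, and the parameter min_n (set to 2, the minimum length studied in this research) and all function names are hypothetical.

```python
import re
from collections import Counter

def sentence_tokens(text):
    """Split a text into sentences and return each as a list of words.

    Commas and other punctuation inside a sentence are ignored.
    """
    sentences = re.split(r"[.!?:]+", text.lower())
    return [re.findall(r"[a-z']+", s) for s in sentences if s.strip()]

def ngram_counts(text, min_n=2):
    """Count every word n-gram (n >= min_n) that lies within one sentence."""
    counts = Counter()
    for words in sentence_tokens(text):
        for n in range(min_n, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def shared_ngrams(text_a, text_b, min_n=2):
    """List (n-gram, count in A, count in B), longest n-grams first."""
    a = ngram_counts(text_a, min_n)
    b = ngram_counts(text_b, min_n)
    shared = [(gram, a[gram], b[gram]) for gram in a if gram in b]
    return sorted(shared, key=lambda item: (-len(item[0]), item[0]))

text_a = "Here he comes, my lord. Speak!"
text_b = "And here he comes. My lord, speak."
for gram, in_a, in_b in shared_ngrams(text_a, text_b):
    print(len(gram), " ".join(gram), in_a, in_b)
```

In the invented example, the comma in Text A does not prevent here he comes my lord from forming n-grams, but the period in Text B does, so the longest shared n-gram is the 3-gram here he comes, followed by three shared 2-grams.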

 
Figure 2 | Interface of ALTXA for n-gram tracing 

 
The main problem behind the comparison between a text and the reference corpora of 

Shakespeare and Marlowe was that their length was dissimilar. While the corpus of 

Christopher Marlowe had 38,434 words after the editing process that has been described 

earlier in this chapter, Shakespeare’s corpus contained 50,057 words. The Shakespearean 

corpus would have always had more chances to present more n-grams in common with a 

disputed sample than the Marlowian corpus, so a solution needed to be adopted. 

The solution has been to remove a similar number of words from Richard III and Richard II, that 

is, the two samples included in the Shakespearean corpus, to ensure that both candidates 

are in equal conditions to become the likeliest author of every scene whose authorship is 

tested.  

Removing words from the beginning or the end of the plays may seem biased, so 

the starting point of the removal has been determined by a randomly generated number that 

indicated a word number of the play. To avoid leaving unfinished sentences in the corpus, 

the removal began in the sentence that followed the randomly generated word number. 

Following this procedure, 5,808 words have been removed from Richard III starting from 

the sentence that followed the excerpt “I lay it naked to the deadly stroke, and humbly 

beg the death upon my knee” in Scene I.ii, while 5,819 words have been erased from 

Richard II starting from the sentence that followed the excerpt “Then I must not say no” 

in Scene III.iv. The resulting reference corpus of the Bard is formed by 38,430 words, 


which is a similar number of words to those contained in the Marlowian corpus. This 

adaptation will allow for the conduction of unbiased studies in which the corpora of both 

candidates are in equal conditions to be compared with a smaller sample to determine its 

likeliest authorship. 
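
The removal procedure can be approximated with the Python sketch below. It is a reconstruction under stated assumptions rather than the procedure actually followed, which is not described in computational terms: the function names are hypothetical, the removal is assumed to proceed in whole sentences until the target number of words is reached, and a starting point near the end of the play simply removes fewer words.

```python
import random
import re

def word_count(sentence):
    return len(sentence.split())

def remove_block(text, target_words, start_word=None, rng=random):
    """Remove whole sentences after a (randomly chosen) word number.

    Returns the trimmed text and the number of words actually removed.
    """
    # Assumption: a sentence ends with a period, !, ? or colon.
    sentences = [s.strip() for s in re.findall(r"[^.!?:]+[.!?:]*", text) if s.strip()]
    total = sum(word_count(s) for s in sentences)
    if start_word is None:
        start_word = rng.randint(1, total)  # randomly generated word number
    # Find the sentence containing that word; removal starts at the next one.
    seen, first = 0, len(sentences)
    for i, sentence in enumerate(sentences):
        seen += word_count(sentence)
        if seen >= start_word:
            first = i + 1
            break
    removed, last = 0, first
    while last < len(sentences) and removed < target_words:
        removed += word_count(sentences[last])
        last += 1
    return " ".join(sentences[:first] + sentences[last:]), removed

play = "One two three. Four five. Six seven eight nine. Ten."
print(remove_block(play, target_words=5, start_word=2))
```

Because only complete sentences are removed, the number of words erased can slightly exceed the target, which is consistent with the figures reported above: 50,057 words minus the 5,808 and 5,819 words removed leaves 38,430.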

Once the size of both reference corpora is balanced, a pre-study to evaluate the 

effectiveness of n-gram tracing in the attribution of authorship of scenes written by 

Shakespeare and Marlowe will be carried out. This pre-study will be based on the 

extraction of five scenes of each author from every group of scenes to quantify the n-

grams that they share with the reference corpus from which they have been removed and 

with that of the other candidate to discern if the method can associate each sample with 

the corpus of the author from which it has been taken. 

Firstly, five Shakespearean and five Marlowian scenes whose length ranges from 100 

to 450 words will be extracted from their corpora and analysed as disputed texts to 

estimate if n-gram tracing is reliable enough to investigate the authorship of the scenes 

from Arden of Faversham that have a similar length. Each of these ten scenes will be 

extracted and compared with the two reference corpora independently, so this stage of the 

pre-study will be formed by ten different analyses. The purpose of these studies is to 

discern how many of the ten undisputed scenes present more n-grams in common with 

the corpus from which they have been taken than with that of the other candidate. The 

same procedure will then be carried out with scenes taken from the three other groups, 

which contain between 500 and 950 words, between 1,100 and 1,700 words and almost 

2,000 words or more, respectively. The fourth stage will include four Marlowian scenes 

instead of five, as in the two pre-studies described earlier.  

Two criteria will be followed during the conduction of this pre-study. The first one 

derives from the fact that the undisputed scenes that will be analysed as disputed texts 

will present n-grams in common with the corpus from which they have been taken that 

include proper names which are exclusive of the play where they belong. For instance, if 

Scene II.iii from Edward II is extracted from the Marlowian corpus and analysed with 

ALTXA, it will present a series of 2-grams in common with such corpus that include the 

names of characters and locations that only appear in that play, for instance Gaveston is 

and in Tynmouth. These circumstantial n-grams would help to attribute each undisputed 

scene to its author, but that will not occur if a scene taken from Arden of Faversham is 


analysed by this method, since it does not share specific characters and locations with any 

of the two reference corpora.  

For that reason, an exhaustive review of all the common n-grams of the pre-study will 

be made to eliminate the circumstantial ones manually from the lists provided by ALTXA 

and only take into consideration the type of n-grams that would be involved in the final 

case study. In contrast, those that include the names of characters and locations that are 

present in both reference corpora will be kept, for instance King Henry and England, 

which are mentioned multiple times by Shakespeare and Marlowe. This criterion ensures 

that the pre-study will offer a realistic outcome that enables the assessment of the validity 

of the method. 

Even though n-gram tracing is often seen as a quantitative method, its results can be 

“noisy” on certain occasions, which requires a subsequent qualitative analysis (Kredens, 

personal communication, February 17, 2019). In other words, the fact that two samples 

coincide in the use of an n-gram that contains many words may not necessarily be 

significant if it is a highly common construction. Therefore, the second criterion that will 

be adopted for the conduction of the pre-study is that if a disputed scene shares at least 

ten n-grams of a certain number of words with one of the reference corpora, these will be 

analysed quantitatively and the results will be presented in tables,21 whereas the others 

will be analysed from a qualitative perspective. This qualitative analysis of the larger but 

less frequent n-grams is only meant to complement the quantitative analysis, which will 

be given more importance in the attribution process. It makes sense to establish a 

statistical comparison of the results derived from the study of a type of n-grams that are 

shared a certain number of times by the texts, but it would be illogical to conduct a 

quantitative analysis of, for instance, 5-grams that are only shared once or twice by the 

samples and consider them as significant as the number of common 2-grams, which are 

much more frequent.  

This means that a 5-gram like and here he comes the will not be given much 

importance if it is the only common 5-gram between a scene and a reference corpus, even 

if there are no 5-grams in common between that scene and the other reference corpus, 

since it is not a distinctive combination of words. In contrast, a 5-gram like this hell of 

 
21 For the sake of clarity, the results derived from the quantitative analyses will be expressed in absolute 

figures, instead of using the overlap coefficient or the Jaccard index (see Grieve et al., 2018). 


grief is will be considered a solid idiolectal marker that can complement the results of the 

quantitative study, since it is an unusual combination of five words that holds a 

metaphorical meaning. This criterion will be kept for the final case study on the 

authorship of the scenes of Arden of Faversham if this method proves to be effective in 

any of the stages of the pre-study. 

In short, if a text whose authorship is being analysed shares at least ten n-grams of a 

certain length with one of the reference corpora, these will be analysed from a quantitative 

perspective and the results will be presented in tables. When these results are discussed, 

a qualitative analysis of the larger but less frequent n-grams in common will be provided.  
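
The threshold of ten n-grams can be expressed as a simple rule, sketched below in Python. The function name is hypothetical, and the sketch assumes that the threshold refers to the number of distinct n-grams of a given length shared with at least one of the two reference corpora, which is one possible reading of the criterion.

```python
from collections import Counter

def split_by_threshold(shared_with_a, shared_with_b, threshold=10):
    """Separate n-gram lengths into quantitative and qualitative analysis.

    Each argument is a collection of n-grams (tuples of words) that the
    disputed scene shares with one reference corpus.
    """
    per_length_a = Counter(len(gram) for gram in shared_with_a)
    per_length_b = Counter(len(gram) for gram in shared_with_b)
    lengths = set(per_length_a) | set(per_length_b)
    quantitative = sorted(
        n for n in lengths
        if max(per_length_a[n], per_length_b[n]) >= threshold
    )
    qualitative = sorted(lengths - set(quantitative))
    return quantitative, qualitative

# Invented n-grams for illustration only.
with_a = [("w%d" % i, "x") for i in range(12)] + [("a", "b", "c")]
with_b = [("y", "z")]
print(split_by_threshold(with_a, with_b))
```

In the invented example, the twelve shared 2-grams exceed the threshold and would be tabulated, whereas the single shared 3-gram would be left for qualitative discussion.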

In the elaboration of the final verdict on the authorship of a scene, three expressions 

will be used, the first one being that it seems highly probable that it was written by 

Shakespeare/Marlowe. This will be used on those occasions in which every type of n-

grams clearly links the scene to a specific author or in those cases in which, even if the 

results derived from the analysis of one type of n-grams are inconclusive, the others 

associate it with one of the candidates with great certainty. The second expression is that 

it seems slightly probable that the scene was written by Shakespeare/Marlowe, which will 

be used when the results provided by the analysis link the authorship of the sample to a 

specific author by a narrow margin. Finally, when the results lack clarity, the expression 

it seems uncertain if the scene was written by Shakespeare or Marlowe will be employed. 

In brief, the reliability of n-gram tracing will be assessed by extracting scenes from 

the two reference corpora, whose number of words has been balanced, and analysing if 

they share more n-grams with that from which they have been taken than with the other 

after the manual exclusion of those n-grams that include the names of characters and 

locations that are exclusive of the play where they belong. If a scene that is being analysed 

shares at least ten n-grams of a certain type with one of the reference corpora, these will 

be analysed quantitatively, whereas the others will be examined from a qualitative 

perspective. If the success rate of n-gram tracing is sufficiently solid in any of the four 

stages of the pre-study, this method will be used to determine the likeliest authorship of 

the scenes of Arden of Faversham of such group following the same criteria that have 

been delineated for the conduction of the pre-study. 

 
4.5.5. The Zeta test 

The last method selected for the conduction of the pre-studies is the Zeta test (see Section 

3.4.4 for a detailed explanation of the fundamentals of this procedure as well as of 

previous research involving its usage). This test can be accessed if the user clicks the 

ZTest tab on the interface of ALTXA, where they will find four file choosers called Text 

A file, Text B file, Text C file and Ignored words file (see Figure 3). 

The reference corpora of both candidates must be uploaded in .txt format to the first 

two file choosers with the combination of symbols @#@ written at the end of every text 

within a corpus. This will allow the software to divide the corpora properly, that is, into 

fragments of 2,000 words, with the residual words at the end of each play added to its last 

fragment. The disputed text is expected to be uploaded with the same format to the third 

file chooser. 
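
The division of a corpus into fragments can be sketched in Python as follows. The function name is hypothetical, and the handling of a text shorter than one fragment (kept whole) is an assumption, since that case is not discussed here; the sketch is not the code of ALTXA.

```python
def split_corpus(corpus_text, fragment_size=2000, delimiter="@#@"):
    """Split a corpus into fragments of fragment_size words per text.

    Texts are separated by the delimiter; the residual words at the end
    of each text are appended to its last fragment.
    """
    fragments = []
    for play in corpus_text.split(delimiter):
        words = play.split()
        if not words:
            continue  # nothing after the final delimiter
        full = len(words) // fragment_size
        if full == 0:
            # Assumption: a text shorter than one fragment is kept whole.
            fragments.append(words)
            continue
        chunks = [
            words[i * fragment_size:(i + 1) * fragment_size]
            for i in range(full)
        ]
        chunks[-1].extend(words[full * fragment_size:])  # residual words
        fragments.extend(chunks)
    return fragments

# A fragment size of 3 is used only to keep the example small.
demo = "a b c d e f g @#@ h i j k @#@"
print([len(f) for f in split_corpus(demo, fragment_size=3)])
```

With the default size, a play of 5,300 words would yield one fragment of 2,000 words and a final fragment of 3,300 words.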

To elaborate the lists of the 500 markers of each of the two reference corpora, ALTXA 

has been programmed to only take into consideration words that do not appear in a stop 

list that has to be uploaded in .txt format to the file chooser Ignored words file. This stop 

list needs to include the most common function words of the language in which the 

researcher is conducting their study,22 which can be easily found on the Internet, as well 

as proper names and other lexical items that they wish to ignore on the grounds that they 

are “more closely related to local, play-specific contexts rather than indicative of any 

consistent authorial pattern” (Elliott & Greatley-Hirsch, 2017, p. 151). All these words 

must be introduced without capital letters and separated by single spaces in the .txt 

document. The idea of creating an editable stop list to adapt the conduction of each Zeta 

test to the specific needs of the researcher is one of the innovations introduced by the 

software ALTXA that differentiates it from other computational tools. Appendix 3 

contains the stop list with all the words that have been ignored as potential markers for 

the conduction of the Zeta tests of this thesis. 

 
22 Following the criterion of Kinney (2009), some function words with a similar meaning, like yes and yea, 

whose usage may be seen as a choice of the author, have not been ignored for the conduction of this test, 

despite the fact that the abovementioned forms are mainly dialectal or context-dependent (see Section 

3.4.4). The combinations of two function words in a contracted form, such as I’ll, have not been ignored 

either, since they stand as idiolectal choices, in my view. 


Figure 3 | Interface of ALTXA for the Zeta test 

 
When the button Execute Zeta test analysis is clicked, ALTXA will quantify the 

proportion of fragments from the first reference corpus in which every word that is not 

present in the document Ignored words list appears and the proportion of fragments from 

the second reference corpus where they do not appear. If the percentages of appearance 

and non-appearance of each of these words are transformed into numbers from 0 to 1 and 

these are added, a word that is distinctive of the first author must produce a result that is higher than 1. The 500 

words of the first author with the highest results above 1 will be listed as their markers, 

and the opposite procedure will be simultaneously applied by the software to elaborate 

the list of 500 markers of the second candidate. In other words, the 500 markers of an 

author are not only chosen by their frequency in his/her corpus, but also by their low 

frequency or lack of appearance in the corpus of the other candidate. The lists will be 

generated by the software in an Excel document, together with a .png image file with the

graphical representation of each fragment of 2,000 words or more on a coordinate axis 

and another Excel document that details these coordinates. 
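
The selection procedure described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the ALTXA implementation: the function name zeta_markers, the representation of each fragment as a set of lowercase words and the toy fragments are all hypothetical.

```python
def zeta_markers(frags_a, frags_b, stop_words=frozenset(), top_n=500):
    """Rank candidate markers of author A against author B.

    frags_a, frags_b: lists of sets, each set holding the distinct
    lowercase words of one fragment. The score of a word is the
    proportion of A fragments that contain it plus the proportion of
    B fragments that do not contain it, so it ranges from 0 to 2.
    """
    vocabulary = set().union(*frags_a) - set(stop_words)
    scores = {}
    for word in vocabulary:
        present_in_a = sum(word in f for f in frags_a) / len(frags_a)
        absent_in_b = sum(word not in f for f in frags_b) / len(frags_b)
        scores[word] = present_in_a + absent_in_b
    # Keep only distinctive words (score above 1), highest score first.
    ranked = sorted((w for w in scores if scores[w] > 1),
                    key=lambda w: (-scores[w], w))
    return [(w, scores[w]) for w in ranked[:top_n]]


# Hypothetical toy fragments, for illustration only.
author_a = [{"gentle", "lord", "the"}, {"gentle", "king", "the"}]
author_b = [{"gold", "lord", "the"}, {"gold", "king", "the"}]
markers_a = zeta_markers(author_a, author_b, stop_words={"the"})
markers_b = zeta_markers(author_b, author_a, stop_words={"the"})
```

In this toy case, gentle is the only marker of the first author and gold the only marker of the second, since lord and king obtain a score of exactly 1 and the is excluded by the stop list.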

The application of the Zeta test to Elizabethan playwrights poses a problem,

which is that their samples include archaic forms like thine that tend not to be present on 

the lists of function words available on the Internet, and thus the software will identify 

them as potential markers if it is not ordered to ignore them. For that reason, the process 

by which the 500 markers of each author are obtained for these Zeta


 

tests needs to be repeated many times. Every time it is carried out, it is necessary to 

include all the function words and proper names that appear on these two lists in the stop 

list and execute the Zeta test repeatedly until there are only distinctive lexical items in 

both lists of markers, which is a laborious process (see Appendix 3 for the stop list with

all the words ignored as potential markers during the Zeta tests of this

thesis). For obvious reasons, this would not be such a complicated task if the texts were 

modern. 

As pointed out earlier, ALTXA will generate an Excel document with the lists of 500 

markers of each author that need to be revised until they only include the kind of words 

that the researcher wishes to consider for their study, a .png image with the graphical

representation of the results and an Excel document that includes the exact coordinates of 

every fragment on the coordinate axis. The latter document is generated in case the 

researcher wants to elaborate another graphical representation of the results on the Excel 

sheet itself or to export the coordinates to a distinct database easily.  

In the .png file generated by ALTXA, the fragments of the first reference corpus will

be represented by blue dots, those of the second reference corpus by red squares, and the 

fragments of the disputed text by black triangles (see Sections 5.4.1, 6.1, 6.3 and 6.14 for 

examples of these representations). Their position on the coordinate axis will be 

determined as follows. The position of a fragment on the horizontal axis is the number of

markers of the first reference corpus that it includes divided by its number of distinct

words, whereas its position on the vertical axis is the number of markers of the second

reference corpus that it contains divided by its number of distinct words. As explained in

Section 3.4.4, the Zeta test does not take into account the number of times that a marker

appears in a fragment, but only whether it appears or not. The number of markers that a

fragment contains is divided by its number of different words, or types, to compensate for

the smaller size of those fragments that include only a residual number of words because

they fall at the end of a text.
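
The calculation of the coordinates of a fragment can be sketched as follows. This is a minimal illustration, not the code of ALTXA: the function name fragment_coordinates and the toy fragment and marker lists are hypothetical.

```python
def fragment_coordinates(fragment_words, markers_first, markers_second):
    """Return the (x, y) position of one fragment.

    x = markers of the first corpus present / distinct words (types)
    y = markers of the second corpus present / distinct words (types)
    Only presence counts; repetitions of a marker are ignored.
    """
    types = set(fragment_words)
    if not types:
        raise ValueError("fragment is empty")
    x = len(types & set(markers_first)) / len(types)
    y = len(types & set(markers_second)) / len(types)
    return x, y


# Hypothetical fragment and marker lists, for illustration only.
fragment = ["gentle", "gentle", "lord", "gold", "crown"]
x, y = fragment_coordinates(fragment, {"gentle", "crown"}, {"gold"})
```

The toy fragment has four distinct words, two of which are markers of the first list and one of the second, so it is placed at (0.5, 0.25). The repetition of gentle has no effect on the result.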

The likeliest authorship of the fragments of the disputed text will therefore be 

determined by their proximity to the centroid of each of the two clusters formed by the 

fragments into which the two reference corpora have been divided. If it is not

evident at a glance which of the two clusters is closer to the fragments of the

disputed text, the coordinates of the centroid of a cluster can be determined by calculating 

the average value of all its X and Y coordinates. Afterwards, the distance between the 


 

coordinates of the centroid of a cluster and those of a fragment of the disputed text can 

be calculated, according to Professor Elisa Isabel Lozano (personal communication, 

January 10, 2020), with the formula $|\overrightarrow{AB}| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
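
The centroid and distance calculations described above can be sketched as follows. The function names and the coordinates are hypothetical and only serve to illustrate the arithmetic; they are not results of this thesis.

```python
import math


def centroid(points):
    """Average the X and Y coordinates of a cluster of fragments."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def distance(a, b):
    """Euclidean distance between two points on the coordinate axis."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


# Hypothetical coordinates, for illustration only.
first_cluster = [(0.30, 0.10), (0.34, 0.12)]
second_cluster = [(0.10, 0.30), (0.12, 0.34)]
disputed = (0.28, 0.14)

d_first = distance(centroid(first_cluster), disputed)
d_second = distance(centroid(second_cluster), disputed)
closer = "first" if d_first < d_second else "second"
```

With these values, the disputed fragment lies closer to the centroid of the first cluster than to that of the second.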

 The main differences between the Zeta tests conducted by Kinney (2009) and Elliott 

and Greatley-Hirsch (2017) to analyse the authorship of Arden of Faversham and those 

that will be applied in the present thesis are that the latter will compare

Shakespeare with Marlowe individually, instead of comparing one candidate with a group

of many, and that they differ in what has been considered a representative reference corpus for each

candidate. In addition, the type of scenes to which this method will be applied will also 

differ from those of the abovementioned studies, which is an issue that will be addressed 

further on. 

Firstly, I would like to suggest that using the Zeta test to compare a writer with a 

group of writers is not as efficient as comparing them individually, which could be 

exemplified in the following way. If the word gentle appears in many Shakespearean 

fragments and does not occur so often in the Marlowian fragments, it will be classified as 

a discriminator of the Bard if these candidates are compared individually. Nevertheless, 

if Shakespeare is compared with a group of writers that includes Marlowe, but whose 

majority uses the word gentle frequently, this word will not be selected as a marker 

because of the average values of the group and thus a reliable discriminator between these 

two authors will be lost. The idea of obtaining a set of markers from a group of writers in 

a Zeta test does not seem sensible to me, given that their corpus will not constitute a 

proper reflection of the idiolect of any of them, but a mixture of many idiolects that cannot 

fully represent any of its parts. For that reason, one of the main hypotheses suggested in 

this doctoral thesis is that the Zeta test should only compare authors individually (see 

Section 1.2), given that it is the only way of obtaining realistic discriminators among 

them. If a researcher wants to distinguish between Shakespeare and the rest of the 

Elizabethan authors, such comparisons should be made one by one. This is one of the 

reasons why the catalogue of candidates for the conduction of this thesis has been 

narrowed down to Shakespeare and Marlowe and why future studies will compare the 

likeliest author of every scene of Arden of Faversham according to this research with Kyd 

and the rest of possible candidates individually (see Section 8.2). 
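
The argument about the word gentle can be illustrated with a small worked example. The fragments below are entirely hypothetical and the helper zeta_score is an assumption of this sketch; it only shows how pooling several writers can hide a word that separates two of them.

```python
def zeta_score(word, frags_a, frags_b):
    """Proportion of A fragments with the word plus proportion of B fragments without it."""
    present_a = sum(word in f for f in frags_a) / len(frags_a)
    absent_b = sum(word not in f for f in frags_b) / len(frags_b)
    return present_a + absent_b


# Hypothetical fragments, each reduced to a single word for brevity.
shakespeare = [{"gentle"}, {"gentle"}, {"gentle"}, {"crown"}]
marlowe = [{"gold"}, {"gold"}, {"gold"}, {"gentle"}]
others = [{"gentle"}] * 8  # hypothetical writers who use the word often
group = marlowe + others

individual = zeta_score("gentle", shakespeare, marlowe)  # 0.75 + 0.75
pooled = zeta_score("gentle", shakespeare, group)        # 0.75 + 0.25
```

Compared individually, gentle scores 1.5 and would be selected as a marker. Against the pooled group it scores exactly 1.0, so it would not be selected, although it still separates the first two hypothetical authors.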


 

The second major difference between the Zeta tests of the present thesis and those 

applied by Kinney (2009) and Elliott and Greatley-Hirsch (2017) lies in the criteria for the

compilation of the reference corpora of the candidates. The corpora compiled by these 

authors included works of periods and subgenres that are different from those of Arden 

of Faversham, while the ones selected for the conduction of this doctoral thesis were 

written no more than three years apart from the creation of the play and none of them are 

comedies. This issue, which is associated with one of the most relevant hypotheses of the 

investigation (see Section 1.2), has been addressed in depth earlier in this chapter, so

no further explanation will be provided here.

The Zeta test places 2,000-word fragments on a coordinate axis according to the

markers of the two reference corpora that they contain, so it seems reasonable to

analyse with this method only scenes whose length is close to or above 2,000 words.

Despite the fact that Kinney (2009) also included the shortest scenes of Arden of 

Faversham in this procedure, I would say that it does not make much sense to compare 

fragments of 2,000 words with others of, for instance, 200, in terms of the number of 

markers that they have from two lists of 500, even if these numbers are then divided by 

the number of distinct words of each fragment. There are other methods, such as n-gram 

tracing, that can effectively analyse the authorship of scenes of this kind without making 

unbalanced comparisons. 

Therefore, only undisputed scenes whose length is close to or above 2,000 words

will be analysed with the Zeta test to evaluate its validity. This pre-study will consist in 

the extraction of five Shakespearean and four Marlowian scenes of such length to be 

analysed independently by this method. If the pre-study shows solid results (see Section 

5.4), the Zeta test will be used to determine the likeliest authorship of the scenes of Arden 

of Faversham of that group. 

In short, this doctoral thesis suggests that candidates should be compared individually 

during the Zeta test. It has also been suggested that the reference corpora of these 

candidates should be formed by texts which have similar characteristics to those of the 

disputed text. Lastly, the fragments into which the disputed text is divided should have a

length that is at least close to that of the fragments into which the reference corpora are

divided, so only scenes from the fourth group will be included in the pre-study to

assess the validity of the method.  


 

4.6. Summary 

This chapter has offered an exhaustive explanation of the distinct steps that have been 

and will be taken to conduct the study, which could be summarized as follows.

The selection of the authorship of Arden of Faversham as the focus of this research has 

been determined by the topic of my previous work and the inconclusive results of the few 

studies that have been conducted on this subject from a forensic linguistic perspective. 

Given the thoroughness of the analysis and its inherent length, the first methodological 

decision has been to only take into consideration two candidates for the authorship of the 

play and consider this thesis as the first milestone of a long-term project where the rest of 

the Elizabethan playwrights will be involved. The selection of Shakespeare as one of the 

candidates has been due to the influence of the research conducted by other scholars, 

whereas the selection of Marlowe is mainly due to his biographical data, which seem to 

make him a suitable candidate for a play of this nature, and the fact that he is known to 

have collaborated with Shakespeare in the creation of Henry VI.  

The selection of the candidates for the attribution of authorship of Arden of 

Faversham has been followed by the delimitation of a series of criteria to compile their 

reference corpora. These have been to select plays that are not comedies and were written 

no more than three years apart from the creation of Arden of Faversham. Such decisions 

respond to one of the main hypotheses on which the investigation is built, which is that 

authorship problems can be better addressed if the disputed text is compared to reference 

corpora that reflect faithfully the conditions in which it was created and thus are truly 

representative of its idiolectal features. The hypotheses that word n-grams reflect more 

distinctive constructions than character n-grams and that authors should be compared 

individually during a Zeta test have also been discussed in this chapter. 

Richard III and Richard II have been selected for the compilation of the 

Shakespearean corpus, whereas the corpus of Marlowe has been compiled with The Jew 

of Malta and Edward II. These plays and Arden of Faversham itself have been extracted 

from the archives of Project Gutenberg, which has prioritized the preservation of the

selection of words of the original manuscripts, on which the subsequent analysis

will focus, rather than spelling features. Afterwards, the texts have been adapted to 

optimize the results of such analysis following a series of criteria introduced by the 

researcher, for instance that only the direct interventions of the characters in the dialogues 


 

will be considered in the analysis, since they reflect more idiolectal features than stage 

directions. 

As a result of the belief that the effectiveness of an authorship attribution method is 

determined by the context where it is applied, this investigation will be divided into a 

series of pre-studies and a case study. The pre-studies will analyse the authorship of 

scenes of distinct lengths taken from the Shakespearean and the Marlowian reference 

corpora as if they were disputed texts, so that the case study, that is, the attribution

of authorship of the scenes of Arden of Faversham, applies only those procedures that have

proved to be solid enough in an identical linguistic context. In addition, these pre-studies

will allow the researcher to have a reference of what kind of outcomes can be considered 

conclusive in the case study. 

The functionalities of the software ALTXA and how they can be accessed on its 

interface have been thoroughly addressed in this chapter. These are the quantification of 

the relative frequency of a set of keywords selected by the researcher, which will not be 

included in the study for its reliance on subjective criteria, the calculation of the average 

number of words per sentence of a text and its lexical richness, n-gram tracing and the 

Zeta test. This reflects the importance of ALTXA in the thesis, given

that one of its main objectives is to prove the validity of this tool and establish a solid 

methodological basis for future studies involving other possible

candidates. 

  
 

CHAPTER 5 | PRE-STUDIES  

This chapter seeks to assess the reliability of the authorship tests that have been selected 

in Chapter 4 by applying them in the analysis of undisputed scenes taken from the 

Shakespearean and the Marlowian reference corpora. The only objective of these

pre-studies is to ensure that the final case study, that is, the attribution of authorship of

the scenes of Arden of Faversham, uses only those procedures that have proved to be

effective in a similar linguistic context. The first pre-study (see Section 5.1) will address whether the

Shakespearean and the Marlowian scenes can be effectively differentiated by calculating 

their average number of words per sentence, whereas the second one (see Section 5.2) 

will assess the reliability of the calculation of their lexical richness for the same purpose. 

The third pre-study (see Section 5.3) will evaluate the effectiveness of n-gram tracing to 

attribute the authorship of samples of both candidates and the fourth one (see Section 5.4) 

will do the same with the Zeta test. These pre-studies will be divided into four distinct 

stages, one for each of the four types of scenes that have been delineated in the previous 

chapter in terms of their length, except for the pre-study about the Zeta test, which will 

be conducted only with samples whose length is close to or above 2,000 words (see

Chapter 4 for a detailed explanation of the reasons underlying such decisions). 

5.1. Pre-study on the calculation of the average number of words per sentence (Pre-

study 1) 

This pre-study, which will analyse the authorship of undisputed scenes of Shakespeare 

and Marlowe in terms of their average number of words per sentence, will be based on 

the following principle. The scenes that were written by the same author should present 

similar values among themselves and, simultaneously, a sufficient degree of 

differentiation from those elaborated by the other author, which should also present intra-

author consistency. Random scenes from the two reference corpora of between 100 and 

450 words will be analysed first, followed by scenes of between 500 and 950, between 

1,100 and 1,700 and, lastly, of almost 2,000 words or more.  

The objective of this pre-study is therefore to assess the extent to which the calculation 

of the average number of words per sentence can be considered a reliable discriminator 

to distinguish between samples written by Shakespeare and Marlowe in any of the four 

abovementioned contexts to later apply it in the analysis of the scenes of Arden of 

Faversham. 
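
The discriminator assessed in this pre-study can be sketched as follows. This is not the ALTXA implementation: the function name words_per_sentence is hypothetical, and the rule that a sentence ends at a full stop, an exclamation mark or a question mark is an assumption of this sketch that may differ from the rule applied by the software.

```python
import re


def words_per_sentence(text):
    """Average number of words per sentence of a text.

    Assumption: a sentence ends at '.', '!' or '?'. ALTXA may use a
    different segmentation rule.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    total_words = sum(len(s.split()) for s in sentences)
    return total_words / len(sentences)


# Hypothetical sample with three sentences of 3, 2 and 1 words.
average = words_per_sentence("One two three. Four five! Six?")
```

The sample contains six words in three sentences, so its average is 2.0 words per sentence.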


 

5.1.1. Average number of words per sentence of scenes of between 100 and 450 words 

The average number of words per sentence of five Shakespearean and five Marlowian 

random scenes of between 100 and 450 words has been calculated by ALTXA and the 

results can be observed in Table 2. 

Table 2 | Stage 1 of the pre-study on the average number of words per sentence 

Shakespearean scenes | Words per sentence (w/s) | Marlowian scenes | Words per sentence (w/s)
Richard III, Scene II.iii (398 words) | 12.061 w/s | Edward II, Scene II.iii (218 words) | 12.824 w/s
Richard III, Scene III.iii (197 words) | 11.588 w/s | Edward II, Scene IV.iii (426 words) | 10.923 w/s
Richard III, Scene V.ii (188 words) | 17.091 w/s | The Jew of Malta, Scene III.i (253 words) | 14.056 w/s
Richard II, Scene III.i (342 words) | 21.375 w/s | The Jew of Malta, Scene III.ii (288 words) | 8.727 w/s
Richard II, Scene V.vi (411 words) | 17.125 w/s | The Jew of Malta, Scene IV.iii (351 words) | 9.75 w/s

The table presented above shows that these ten scenes have neither intra-author

consistency nor inter-author variation, as will be developed in the following paragraphs.  

The results of the Shakespearean scenes could be divided into those obtained by 

Scenes II.iii and III.iii from Richard III, which are relatively close (12.061 and 11.588, 

respectively), those of Scene V.ii from Richard III and Scene V.vi from Richard II, which 

are almost identical (17.091 and 17.125, respectively), and that of Scene III.i from 

Richard II, whose average number of words per sentence is 21.375. Hence, the

Shakespearean samples present a maximum difference of 9.787 points, which can be 

found if Scene III.iii from Richard III (11.588) is compared to Scene III.i from Richard 

II (21.375). This stands as a reflection of their lack of consistency.  

The Marlowian scenes are even more heterogeneous, since there is a difference of 

more than one point between consecutive samples if their values are ordered from the


 

lowest to the highest (8.727, 9.75, 10.923, 12.824 and 14.056). If Scene III.ii from The 

Jew of Malta is compared to Scene III.i from the same play, there is a difference of 5.329 

words per sentence between them, which is the highest within the Marlowian samples. 

Even though three of the scenes written by Shakespeare are the only ones that have 

more than 15 words per sentence and two of the Marlowian scenes are the only ones with 

fewer than 10, there are scenes of the two authors that present overlapping results. This is

the case of Shakespeare’s Scene II.iii from Richard III, with 12.061 words per sentence, 

and Marlowe’s Scene II.iii from Edward II, which has 12.824, that is, almost the same. 

Similarly, the average number of words per sentence of Shakespeare’s Scene III.iii from 

Richard III (11.588) overlaps with the results derived from the analysis of Scenes II.iii 

and IV.iii from Marlowe’s Edward II (12.824 and 10.923, respectively). This means that 

if the average number of words per sentence of a scene from Arden of Faversham is 

calculated by ALTXA and the result is within those values, it would be impossible to 

associate its authorship with either of the two candidates of the study.

In conclusion, the quantification of the average number of words per sentence of 

undisputed scenes of between 100 and 450 words written by Shakespeare and Marlowe 

has shown barely any intra-author consistency, and the results of both authors tend to

overlap, so this discriminator will not be used to determine the authorship of the

scenes of Arden of Faversham of the same length. 
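
The two checks applied in this stage, intra-author consistency and inter-author overlap, can be reproduced with the values of Table 2. The helper names spread and ranges_overlap are hypothetical and the sketch is only one possible way of expressing the comparison.

```python
# Words per sentence of the ten scenes, as reported in Table 2.
shakespeare = [12.061, 11.588, 17.091, 21.375, 17.125]
marlowe = [12.824, 10.923, 14.056, 8.727, 9.75]


def spread(values):
    """Maximum difference between any two values of one author."""
    return round(max(values) - min(values), 3)


def ranges_overlap(a, b):
    """True if the intervals [min, max] of the two authors intersect."""
    return max(min(a), min(b)) <= min(max(a), max(b))


spread_shakespeare = spread(shakespeare)  # 21.375 - 11.588 = 9.787
spread_marlowe = spread(marlowe)          # 14.056 - 8.727 = 5.329
overlap = ranges_overlap(shakespeare, marlowe)
```

The spreads match the maximum differences discussed above (9.787 and 5.329), and the two ranges intersect between 11.588 and 14.056 words per sentence.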

5.1.2. Average number of words per sentence of scenes of between 500 and 950 words 

The second stage of this pre-study consists in the quantification of the average number of 

words per sentence of five undisputed scenes of Shakespeare and five undisputed scenes 

of Marlowe whose length ranges from 500 to 950 words to evaluate if there is enough 

intra-author consistency and inter-author variation. The results derived from the analysis 

of these scenes, which have been randomly selected, can be observed in Table 3. 

 
 

Table 3 | Stage 2 of the pre-study on the average number of words per sentence 

Shakespearean scenes | Words per sentence (w/s) | Marlowian scenes | Words per sentence (w/s)
Richard III, Scene II.iv (591 words) | 10.368 w/s | Edward II, Scene III.iii (726 words) | 12.737 w/s
Richard III, Scene III.iv (860 words) | 13.871 w/s | Edward II, Scene V.iii (527 words) | 9.246 w/s
Richard III, Scene IV.ii (920 words) | 8.364 w/s | The Jew of Malta, Scene III.iii (521 words) | 8.683 w/s
Richard II, Scene I.ii (579 words) | 16.543 w/s | The Jew of Malta, Scene III.iv (847 words) | 10.329 w/s
Richard II, Scene III.iv (856 words) | 16.151 w/s | The Jew of Malta, Scene IV.v (532 words) | 10.231 w/s

Table 3 shows that there is a lack of intra-author consistency in the results derived from 

the analysis of the Shakespearean scenes. There is a dramatic difference of 8.179 points 

if the average number of words per sentence of Scene IV.ii from Richard III (8.364) is 

compared to that of Scene I.ii from Richard II (16.543). The results of Scenes II.iv and 

III.iv from Richard III remain in an intermediate position between the two abovementioned

scenes, although there is a considerable difference of more than three words per sentence 

between them (10.368 and 13.871, respectively). The only ones that seem to have similar 

values are Scenes I.ii and III.iv from Richard II (16.543 and 16.151, respectively). 

The Marlowian scenes present a maximum difference of 4.054 words per sentence, 

which can be found if Scene III.iii from Edward II is compared to Scene III.iii from The 

Jew of Malta (12.737 and 8.683, respectively). There is great intra-author consistency in 

the average number of words per sentence of Scenes III.iv and IV.v from The Jew of 

Malta (10.329 and 10.231, respectively), and the results of Scene V.iii from Edward II 

and Scene III.iii from The Jew of Malta are relatively close (9.246 and 8.683, 

respectively).  

The average number of words per sentence of the latter Marlowian scene is almost 

identical to that of Scene IV.ii from Shakespeare’s Richard III, which is 8.364.


 

Furthermore, the results of Scenes III.iv and IV.v from Marlowe’s The Jew of Malta are 

between 10 and 11 words per sentence (10.329 and 10.231, respectively), as happens with 

Shakespeare’s Scene II.iv from Richard III (10.368). 

This method has proved ineffective for distinguishing between scenes of the

second group written by Shakespeare and Marlowe, given the lack of intra-author

consistency and, especially, the high frequency with which the average numbers of words

per sentence of the two playwrights overlap.

5.1.3. Average number of words per sentence of scenes of between 1,100 and 1,700 

words 

The third stage of the pre-study focuses on the calculation of the average number of words 

per sentence of five Shakespearean and five Marlowian random scenes whose length 

ranges from 1,100 to 1,700 words. The results provided by the software ALTXA can be 

observed in the following table. 

Table 4 | Stage 3 of the pre-study on the average number of words per sentence 

Shakespearean scenes | Words per sentence (w/s) | Marlowian scenes | Words per sentence (w/s)
Richard III, Scene I.i (1,243 words) | 15.538 w/s | Edward II, Scene I.i (1,588 words) | 11.94 w/s
Richard III, Scene II.ii (1,214 words) | 13.64 w/s | Edward II, Scene III.ii (1,401 words) | 16.679 w/s
Richard III, Scene III.i (1,580 words) | 11.704 w/s | Edward II, Scene V.i (1,266 words) | 13.326 w/s
Richard II, Scene I.i (1,605 words) | 21.4 w/s | The Jew of Malta, Scene I.i (1,425 words) | 13.443 w/s
Richard II, Scene II.iii (1,377 words) | 17.213 w/s | The Jew of Malta, Scene IV.iv (1,135 words) | 11.823 w/s

Table 4 shows that the disparity of the average number of words per sentence of the 

Shakespearean scenes is evident, with a maximum difference of 9.696 points between 

Scene III.i from Richard III (11.704) and Scene I.i from Richard II (21.4), as well as a 

difference of almost two points or more between consecutive scenes if their values are


 

ordered from the lowest to the highest (11.704, 13.64, 15.538, 17.213 and 21.4). The five 

Shakespearean scenes that contain between 1,100 and 1,700 words have shown no intra-

author consistency. 

The results derived from the study of the Marlowian scenes are slightly more 

consistent, given that there is great similarity between the results of Scene I.i from 

Edward II (11.94) and Scene IV.iv from The Jew of Malta (11.823), as well as between

those of Scene V.i from Edward II (13.326) and Scene I.i from The Jew of Malta (13.443). 

Nevertheless, the average number of words per sentence of Scene III.ii from Edward II 

(16.679) is notably distinct from that of the other Marlowian scenes, creating a maximum 

difference of 4.856 points between it and Scene IV.iv from The Jew of Malta (11.823).

In addition to the lack of intra-author consistency, which is more evident in the case 

of the Shakespearean scenes, the results of the two candidates overlap. This can be 

observed if the average number of words per sentence of Shakespeare’s Scene III.i from 

Richard III (11.704) is compared to that obtained by Marlowe’s Scene I.i from Edward 

II (11.94) and Scene IV.iv from The Jew of Malta (11.823), or if the average number of

words per sentence of Shakespeare’s Scene II.ii from Richard III (13.64) is compared to 

that of Marlowe’s Scene V.i from Edward II (13.326) and Scene I.i from The Jew of 

Malta (13.443). 

In brief, it seems that this discriminator is not consistent enough to be used in the 

authorship analysis of the scenes of Arden of Faversham that have between 1,100 and 

1,700 words. 

5.1.4. Average number of words per sentence of scenes of almost 2,000 words or 

more 

The final stage of the pre-study aims to assess the effectiveness of this test with 

undisputed scenes of Shakespeare and Marlowe whose number of words is close to or

above 2,000. While it is possible to include five Shakespearean scenes of more than

2,000 words in this study, there are only three Marlowian scenes of that length,

so Scene II.ii from Edward II, which contains 1,995 words, has also been

included in this analysis. The results can be observed in Table 5.

 
 

Table 5 | Stage 4 of the pre-study on the average number of words per sentence 

Shakespearean scenes | Words per sentence (w/s) | Marlowian scenes | Words per sentence (w/s)
Richard III, Scene I.iii (2,845 words) | 13.678 w/s | Edward II, Scene I.iv (3,330 words) | 11.767 w/s
Richard III, Scene IV.iv (4,267 words) | 13.334 w/s | Edward II, Scene II.ii (1,995 words) | 9.975 w/s
Richard III, Scene V.iii (2,726 words) | 10.904 w/s | The Jew of Malta, Scene I.ii (2,929 words) | 11.623 w/s
Richard II, Scene I.iii (2,402 words) | 18.336 w/s | The Jew of Malta, Scene II.iii (3,034 words) | 11.669 w/s
Richard II, Scene II.i (2,372 words) | 17.701 w/s | |

The Shakespearean scenes present heterogeneous results, except for Scenes I.iii and IV.iv 

from Richard III, whose average number of words per sentence is quite similar (13.678 

and 13.334, respectively). Scene V.iii from Richard III presents an average number of 

words per sentence of 10.904, which creates a considerable difference of 7.432 points if 

it is compared with Scene I.iii from Richard II (18.336), and of 6.797 points with Scene 

II.i from Richard II (17.701). 

In contrast, the Marlowian scenes are highly homogeneous, since three of them 

present between 11 and 12 words per sentence, while the remaining one, which is Scene 

II.ii from Edward II, has 9.975. The main problem behind the homogeneity of the 

Marlowian samples, whose values range from 9.9 to 11.8, is that the average number of 

words per sentence of Scene V.iii from Shakespeare’s Richard III (10.904) overlaps with 

them. This means that if the average number of words per sentence of a scene from Arden 

of Faversham is calculated and the result is close to 11, the authorship of the disputed text 

could not be associated with Shakespeare or Marlowe with certainty.  

In sum, this stage of the pre-study has shown highly consistent results in the analysis 

of the Marlowian scenes, but great intra-author variation in those of Shakespeare, which 

invalidates the test automatically. In addition, even though the results of the Marlowian 

scenes are quite similar, they overlap with one of the results of the Shakespearean scenes, 


 

which would not allow for a reliable attribution of authorship of a scene from Arden of 

Faversham that has an average number of words per sentence within those values. 

5.1.5. Conclusions derived from Pre-study 1 

It has been proved that the calculation of the average number of words per sentence cannot 

distinguish with sufficient reliability a Shakespearean scene that belongs to a play that is 

not a comedy and was written between 1590 and 1595 from a Marlowian scene of the 

same characteristics. The pre-study has not achieved satisfactory results in any of the four 

types of scenes that have been put into analysis. Even though the intra-author consistency 

has improved as the size of the samples has increased, especially in the case of the 

Marlowian scenes, the overlapping results of both playwrights in the four categories have

undermined the reliability of this discriminator. Therefore, the calculation of the average 

number of words per sentence will not be used in the final case study. This does not mean 

that the method is ineffective, but that it is not effective enough in this specific linguistic 

context. In other words, the quantification of this discriminator could prove to be effective 

if the samples of one of the candidates are replaced with the works of a different playwright

or if the samples of Shakespeare and Marlowe are taken from a different period, for 

instance. The following section will present and discuss the results derived from the 

second pre-study. 

5.2. Pre-study on the calculation of the lexical richness (Pre-study 2) 

The second pre-study of the thesis intends to evaluate the reliability of the calculation of 

the lexical richness to distinguish between Shakespearean and Marlowian scenes. To

that end, four distinct analyses will be conducted, that is, one for each of the four types

of scenes according to their length, under the principle that if this discriminator is 

effective enough, the values of the scenes written by the same author should present 

certain consistency and, at the same time, that they should be sufficiently different from 

those of the other candidate. Scenes that contain between 100 and 450 words will be 

analysed first. Afterwards, scenes whose length ranges from 500 to 950 and from 1,100 

to 1,700 words will be studied. Finally, the effectiveness of the discriminator will be 

assessed with scenes that contain almost 2,000 words or more.  

The scenes involved in this pre-study will not be randomly selected, unlike those of the pre-study on the

average number of words per sentence, given that this parameter is greatly affected by 

small differences in the size of the samples. For that reason, the scenes of a more similar 


 

length in the first three groups will be selected and classified into subgroups, which will 

optimize the results and allow for a realistic evaluation of the effectiveness of this test. 

Since there is a lack of scenes of a similar length in the fourth group, an estimation of 

what their lexical richness would be if their size was balanced will be taken into 

consideration to assess the reliability of the procedure (see Section 4.5.3 for a more 

detailed explanation of the reasons underlying these decisions). The results will be 

presented in tables and later discussed.  
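
The percentages reported in the tables of this pre-study are consistent with a type-token ratio, that is, the number of distinct word forms divided by the total number of words of a scene (for example, 128 distinct forms in a 197-word scene would yield 64.975%). The following Python sketch illustrates that calculation under this assumption. It is not the implementation of ALTXA: the function name and the tokenisation rule (lower-casing and keeping runs of letters and apostrophes) are hypothetical choices made for illustration.

```python
import re


def lexical_richness(text: str) -> float:
    """Return the type-token ratio of `text` as a percentage.

    Assumption: a token is a lower-cased run of letters and apostrophes.
    The tokenisation applied by ALTXA may differ.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    # Distinct word forms (types) divided by running words (tokens).
    return 100 * len(set(tokens)) / len(tokens)


# Invented sample: 9 tokens, 6 of them distinct.
print(round(lexical_richness("the king is dead and the king is gone"), 3))
```

Because every repeated word lowers the ratio, longer samples tend to obtain lower values, which is why scenes of a similar length are compared in each stage.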

5.2.1. Lexical richness of scenes of between 100 and 450 words 

The lexical richness of five Shakespearean and five Marlowian scenes that have between 

100 and 450 words has been calculated by ALTXA to discern if there is sufficient 

intra-author consistency and inter-author variation. The results derived from this study can be 

observed in Table 6. 

Table 6 | Stage 1 of the pre-study on the lexical richness 

| Shakespearean scenes | Lexical richness (%) | Marlowian scenes | Lexical richness (%) |
|---|---|---|---|
| Richard III, Scene III.iii (197 words) | 64.975% | Edward II, Scene III.i (151 words) | 61.589% |
| Richard III, Scene III.vi (116 words) | 77.586% | Edward II, Scene IV.i (123 words) | 77.236% |
| Richard III, Scene V.iv (110 words) | 60.0% | The Jew of Malta, Scene III.i (252 words) | 64.032% |
| Richard III, Scene V.v (315 words) | 62.54% | The Jew of Malta, Scene III.ii (288 words) | 57.986% |
| Richard II, Scene III.i (342 words) | 56.725% | The Jew of Malta, Scene III.v (253 words) | 63.241% |

As underlined earlier, minor differences in the size of the samples could have a major 

impact on the results of this test, which is why the scenes of each author have been carefully 

selected in order to have two subgroups where their length is almost identical. For that 

reason, three of the Shakespearean scenes contain between 110 and 200 words (Scenes 

III.iii, III.vi and V.iv from Richard III), while the other two taken from his corpus present 



a length that ranges from 310 to 350 words (Scene V.v from Richard III and Scene III.i 

from Richard II).  

The results of Scenes III.iii, III.vi and V.iv from Richard III differ considerably, given 

that their lexical richness is 64.975%, 77.586% and 60.0% respectively, which means 

that no consistency can be found in the subgroup of Shakespearean scenes of between 

110 and 200 words. Similarly, if the lexical richness of the two Shakespearean scenes 

whose length is between 310 and 350 words is compared, there is a notable distance of 

more than five points between them, since the result of Scene V.v from Richard III is 

62.54%, whereas that of Scene III.i from Richard II is 56.725%. Hence, it has been 

proved that the quantification of the lexical richness of Shakespearean scenes of such a 

short length leads to disparate results, even if their number of words is highly similar. 

In the case of the Marlowian scenes, two of them present a length that ranges from 

120 to 160 words (Scenes III.i and IV.i from Edward II), and three of them contain 

between 250 and 290 words (Scenes III.i, III.ii and III.v from The Jew of Malta). These 

have been selected to observe if the results derived from the analysis of each subgroup 

present intra-author consistency, which would allow for a subsequent comparison with the 

scenes of the other candidate. 

Like the Shakespearean scenes of both subgroups, the Marlowian scenes whose 

number of words ranges from 120 to 160 present disparate results, since the 

lexical richness of Scene III.i from Edward II is 61.589% and the result achieved by 

Scene IV.i from the same play is 77.236%. The scenes from the other Marlowian 

subgroup present more consistency among them, since Scenes III.i and III.v from The 

Jew of Malta present a similar lexical richness (64.032% and 63.241%, respectively). 

Nevertheless, the third sample of this second subgroup, that is, Scene III.ii from The Jew 

of Malta, presents a lexical richness of 57.986%, which differs considerably from the 

results achieved by the two others of the same subgroup. 

In sum, the results derived from the analysis of undisputed scenes that contain 

between 100 and 450 words are erratic and present such intra-author 

variation that it is not necessary to compare the results of both playwrights to determine 

that this discriminator should not be applied in the analysis of the scenes of Arden of 

Faversham of a similar length. Even though the scenes have been carefully selected to 

create two subgroups per author where they have an almost identical number of words, it 



seems evident that the calculation of this parameter can only reach consistent results if 

the size of the texts increases dramatically. 
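
The intra-author variation discussed above can be made explicit by computing, for each subgroup of Table 6, the distance between its highest and its lowest value. The sketch below does so with the figures reported in the table. The use of the range as a measure of consistency is an assumption made for illustration, since the pre-study compares the values by inspection, and the names `subgroups` and `spread` are hypothetical.

```python
# Lexical richness values (%) copied from Table 6, grouped by author and length band.
subgroups = {
    "Shakespeare, 110-200 words": [64.975, 77.586, 60.0],
    "Shakespeare, 310-350 words": [62.54, 56.725],
    "Marlowe, 120-160 words": [61.589, 77.236],
    "Marlowe, 250-290 words": [64.032, 57.986, 63.241],
}


def spread(values):
    """Range (maximum minus minimum) in percentage points.

    Assumption: the range is used here as a simple indicator of consistency.
    """
    return max(values) - min(values)


for label, values in subgroups.items():
    print(f"{label}: spread of {spread(values):.3f} points")
```

A small spread would indicate intra-author consistency. In this stage every subgroup spans more than five points.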

5.2.2. Lexical richness of scenes of between 500 and 950 words 

The second stage of the pre-study consists in the calculation of the lexical richness of five 

scenes of between 500 and 950 words from each of the two reference corpora. Even 

though the samples of this stage of the pre-study are larger than those of the previous one 

and hence differences in their number of words should not have such a huge impact on 

the results, the decision to select scenes that have a similar length to create two 

subgroups of scenes per author has been made again, as can be observed in the following 

table. 

Table 7 | Stage 2 of the pre-study on the lexical richness 

| Shakespearean scenes | Lexical richness (%) | Marlowian scenes | Lexical richness (%) |
|---|---|---|---|
| Richard III, Scene II.iv (591 words) | 48.9% | Edward II, Scene II.iv (529 words) | 48.582% |
| Richard III, Scene III.iv (860 words) | 44.651% | Edward II, Scene II.v (849 words) | 38.634% |
| Richard II, Scene I.ii (579 words) | 54.059% | The Jew of Malta, Scene III.iii (521 words) | 49.712% |
| Richard II, Scene III.iv (856 words) | 47.196% | The Jew of Malta, Scene III.iv (847 words) | 43.92% |
| Richard II, Scene V.i (836 words) | 47.608% | The Jew of Malta, Scene IV.v (532 words) | 43.609% |

Table 7 shows that the Shakespearean samples could be divided into those that contain 

between 550 and 600 words, that is, Scene II.iv from Richard III and Scene I.ii from 

Richard II, and those whose number of words ranges from 830 to 860 words, which are 

Scene III.iv from Richard III and Scenes III.iv and V.i from Richard II.  

The results of the scenes of the first subgroup lack consistency, given that their lexical 

richness is 48.9% and 54.059% and thus there are more than five points between them. 



The scenes of the second subgroup are more consistent, since Scenes III.iv and V.i from 

Richard II have an almost identical lexical richness (47.196% and 47.608%, 

respectively), although their results differ slightly from that of Scene III.iv from Richard 

III (44.651%). 

The Marlowian scenes can be divided into a first subgroup, whose length ranges 

from 500 to 550 words and which contains Scene II.iv from Edward II as well as Scenes III.iii 

and IV.v from The Jew of Malta, and a second subgroup formed by Scene II.v from 

Edward II and Scene III.iv from The Jew of Malta, which have between 800 and 850 words. 

The length of the scenes of these two subgroups is similar to that of the Shakespearean 

subgroups with the purpose of facilitating a subsequent comparison between both 

playwrights, if necessary.  

There is a difference of more than six points if the lexical richness of Scene III.iii 

from The Jew of Malta (49.712%) is compared to that of Scene IV.v from the same play 

(43.609%) and hence it seems that the results of the first subgroup lack intra-author 

consistency, despite the fact that the lexical richness of Scene II.iv from Edward II 

(48.582%) is relatively close to that of Scene III.iii from The Jew of Malta. Similarly, 

there is a difference of more than five points if the lexical richness of the two scenes of 

the second subgroup, that is, Scene II.v from Edward II and Scene III.iv from The Jew of 

Malta, is compared (38.634% and 43.92%, respectively). 

This lack of intra-author consistency makes an exhaustive comparison between the 

results obtained by the scenes of both playwrights in each of the two subgroups 

unnecessary. In any case, it is worth mentioning that the lexical richness of the scenes of 

Shakespeare and Marlowe overlaps frequently. For instance, the results of Marlowe’s 

Scene II.iv from Edward II (48.582%) and Scene III.iii from The Jew of Malta (49.712%) 

are highly similar to that of Scene II.iv from Shakespeare’s Richard III (48.9%), whose 

length presents a high degree of resemblance with that of the abovementioned Marlowian 

scenes. 

In sum, the undisputed scenes of between 500 and 950 words that have been analysed 

in this stage of the pre-study present a lack of intra-author consistency and inter-author 

variation that does not allow for the inclusion of this parameter in the analysis of the scenes 

of Arden of Faversham of this length. 

 

5.2.3. Lexical richness of scenes of between 1,100 and 1,700 words 

The lexical richness of five Shakespearean and five Marlowian scenes of between 1,100 

and 1,700 words has been calculated by ALTXA. As in the previous stages of the 

pre-study, the scenes of both authors have been carefully selected to create two subgroups 

where they have a similar number of words. The results derived from this analysis can be 

observed in Table 8. 

Table 8 | Stage 3 of the pre-study on the lexical richness 

| Shakespearean scenes | Lexical richness (%) | Marlowian scenes | Lexical richness (%) |
|---|---|---|---|
| Richard III, Scene II.i (1,117 words) | 38.675% | Edward II, Scene I.i (1,588 words) | 38.854% |
| Richard III, Scene III.i (1,580 words) | 33.797% | Edward II, Scene III.ii (1,401 words) | 38.687% |
| Richard II, Scene I.i (1,605 words) | 40.561% | Edward II, Scene V.i (1,226 words) | 41.028% |
| Richard II, Scene II.ii (1,150 words) | 43.304% | The Jew of Malta, Scene I.i (1,425 words) | 40.0% |
| Richard II, Scene V.iii (1,163 words) | 42.304% | The Jew of Malta, Scene IV.iv (1,135 words) | 38.767% |

Table 8 shows that the Shakespearean scenes could be divided into those whose length 

ranges from 1,115 to 1,165 words, that is, Scene II.i from Richard III and Scenes II.ii and 

V.iii from Richard II; and those whose number of words is between 1,580 and 1,605, that 

is, Scene III.i from Richard III and Scene I.i from Richard II.  

The lexical richness of Scene II.i from Richard III, which belongs to the first subgroup, 

is 38.675%, which differs from the percentages achieved by the two other scenes of 

the subgroup, that is, Scenes II.ii and V.iii from Richard II, which are close to each other 

(43.304% and 42.304%, respectively). The lexical richness of the two Shakespearean 

scenes of the second subgroup seems even more inconsistent, given that Scene III.i from 

Richard III has scored a result of 33.797% and this creates a dramatic difference of almost 

seven points if it is compared to that of Scene I.i from Richard II (40.561%). 



The Marlowian scenes could be divided into those whose number of words is between 

1,130 and 1,230, that is, Scene IV.iv from The Jew of Malta and Scene V.i from Edward 

II, and those whose length ranges from 1,400 to 1,590 words, that is, Scenes I.i and III.ii 

from Edward II and Scene I.i from The Jew of Malta. 

As can be observed in Table 8, these scenes present more consistency than those of 

the other candidate and, as a matter of fact, the values of the scenes of both subgroups 

remain quite uniform. The maximum difference among the five scenes can be found if 

the lexical richness of Scene III.ii from Edward II (38.687%) is compared to that of Scene 

V.i from the same play (41.028%), which implies a low distance of less than two and a 

half points. For the first time in this pre-study, there is barely any intra-author variation 

in the lexical richness of the Marlowian scenes of both subgroups. 

The results of the Marlowian scenes overlap with those of the other candidate. For 

instance, the lexical richness of the scenes of the first subgroup of the Bard is between 

38.675% and 43.304%, and that of the Marlowian scenes of the first subgroup, whose 

length is similar to that of the Shakespearean scenes that have just been mentioned, is 

between 38.767% and 41.028%. The same problem occurs if the scenes of the second 

subgroup of each candidate are compared: for instance, the lexical richness of 

Shakespeare’s Scene I.i from Richard II (40.561%) is almost identical to that of 

Marlowe’s Scene I.i from The Jew of Malta (40%). This means that if the lexical richness 

of a scene from Arden of Faversham is calculated and the result is among these values, it 

would be impossible to associate its authorship with one of the candidates.  
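
The overlap described above can be checked mechanically by comparing the interval spanned by the values of each subgroup. The following sketch uses the figures of Table 8. The helper names are hypothetical, and the interval test is only one possible way of formalising the comparison.

```python
def value_range(values):
    """Return the (lowest, highest) lexical richness of a subgroup."""
    return min(values), max(values)


def ranges_overlap(first, second):
    """True if the intervals spanned by the two lists of values intersect."""
    (first_low, first_high) = value_range(first)
    (second_low, second_high) = value_range(second)
    return first_low <= second_high and second_low <= first_high


# Values (%) copied from Table 8, first subgroup of each author.
shakespeare_first = [38.675, 43.304, 42.304]
marlowe_first = [38.767, 41.028]

# Values (%) copied from Table 8, second subgroup of each author.
shakespeare_second = [33.797, 40.561]
marlowe_second = [38.854, 38.687, 40.0]

print(ranges_overlap(shakespeare_first, marlowe_first))    # True
print(ranges_overlap(shakespeare_second, marlowe_second))  # True
```

When the intervals intersect, a disputed scene whose value falls in the shared region cannot be assigned to either candidate on this basis alone.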

The Shakespearean scenes of the two subgroups have scored disparate values, 

whereas the Marlowian scenes present a high degree of homogeneity, even if they belong 

to distinct subgroups. In addition to the lack of consistency of the Shakespearean scenes, 

which does not allow for the inclusion of this discriminator in the case study, the results 

of both playwrights overlap, and hence it will not be used to study the authorship of the 

scenes of Arden of Faversham of between 1,100 and 1,700 words. 

5.2.4. Lexical richness of scenes of almost 2,000 words or more 

The last stage of the pre-study consists in the analysis of undisputed scenes whose length 

is almost 2,000 words or more. Since there are not five scenes in the Marlowian 

reference corpus of such length, only four will be included in the analysis. Due to this 

scarcity of Marlowian scenes of the fourth group, it is not possible to make a careful 



selection of them to create two subgroups of a specific range of words, as has been done 

with the Shakespearean samples.  

The main difficulty in conducting this analysis is that, unlike in the three 

previous stages of the pre-study, the length of the Shakespearean scenes is quite disparate 

from that of the Marlowian scenes, which makes it necessary to estimate what the lexical 

richness of these scenes would be if their size were more balanced to compare both 

playwrights. Only such estimations will be considered to assess the reliability of this test 

when samples of dissimilar length are compared, instead of directly analysing the results 

derived from the calculation of their lexical richness. These results, which have been 

provided by ALTXA, can be observed in the table presented below. 

Table 9 | Stage 4 of the pre-study on the lexical richness 

| Shakespearean scenes | Lexical richness (%) | Marlowian scenes | Lexical richness (%) |
|---|---|---|---|
| Richard III, Scene I.iii (2,845 words) | 33.111% | Edward II, Scene I.iv (3,330 words) | 28.919% |
| Richard III, Scene V.iii (2,726 words) | 33.786% | Edward II, Scene II.ii (1,995 words) | 37.644% |
| Richard II, Scene I.iii (2,402 words) | 36.178% | The Jew of Malta, Scene I.ii (2,929 words) | 30.522% |
| Richard II, Scene II.i (2,372 words) | 37.69% | The Jew of Malta, Scene II.iii (3,034 words) | 28.378% |
| Richard II, Scene IV.i (2,628 words) | 32.078% | | |

Table 9 shows that there are great differences between the length of the Shakespearean 

scenes and that of the scenes extracted from the Marlowian corpus, which justifies the 

distinct approach that will be adopted to analyse the results. 

The Shakespearean samples could be divided into a first subgroup formed by scenes 

that contain between 2,370 and 2,405 words, that is, Scenes I.iii and II.i from Richard II, 

and a second subgroup, whose scenes have a length of between 2,625 and 2,845 words and 

which includes Scenes I.iii and V.iii from Richard III, as well as Scene IV.i from Richard II.  



In the case of the first subgroup, Scene I.iii from Richard II presents a lexical richness 

of 36.178%, whereas that of Scene II.i from the same play is 37.69%. The homogeneity 

of the values of the first subgroup can also be found in the second subgroup, since there 

is a distance of less than two points among the lexical richness of Scene I.iii from Richard 

III (33.111%), Scene V.iii from Richard III (33.786%) and Scene IV.i from Richard II 

(32.078%). Hence, the results derived from the calculation of the lexical richness of 

Shakespearean scenes whose length exceeds 2,000 words could be seen as highly 

consistent in the two subgroups. 

The length of the Marlowian scenes is heterogeneous, since Scene II.ii from Edward 

II contains 1,995 words and therefore is considerably shorter than the other three. The 

number of words of Scene I.ii from The Jew of Malta (2,929) is similar to that of Scene 

II.iii from the same play (3,034), which allows for a direct comparison between them to 

discern if there is sufficient intra-author consistency. Finally, Scene I.iv from Edward II 

contains 3,330 words, which makes it a slightly larger sample than the two previous ones. 

As the size of the samples increases, the chances of repeating words are higher and 

hence their lexical richness tends to be lower. Therefore, it makes sense that the 

Marlowian scene with the highest lexical richness is Scene II.ii from Edward II 

(37.644%). The two scenes with the most similar length, that is, Scenes I.ii and II.iii from 

The Jew of Malta, present a slight difference of more than two points if their lexical 

richness is compared (30.522% and 28.378%, respectively). Lastly, it is surprising that 

the result of Scene I.iv from Edward II, whose length is moderately greater than that of the 

two abovementioned scenes, is 28.919%, which is in an intermediate position between 

the lexical richness of these two scenes. It could be said that the results of the Marlowian 

scenes are relatively consistent, but not as much as those of the other candidate. 
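
The inverse relation between sample size and lexical richness can be illustrated with a toy example. The sketch below computes the type-token ratio of growing prefixes of an invented sequence of words, which is not taken from either corpus. It assumes that lexical richness is the percentage of distinct word forms, and it does not attempt to reproduce the estimation procedure followed in this stage.

```python
def type_token_ratio(tokens):
    """Percentage of distinct word forms among `tokens` (assumed definition)."""
    return 100 * len(set(tokens)) / len(tokens)


# Invented word sequence used only to show the effect of sample size.
tokens = "the king is dead and the king is gone and the queen is gone".split()

for size in (4, 8, 14):
    prefix = tokens[:size]
    print(size, round(type_token_ratio(prefix), 1))
# 4 100.0
# 8 62.5
# 14 50.0
```

As more words are added, function words such as "the" and "is" recur and the ratio falls, which is the tendency invoked above to explain why the shortest Marlowian scene obtains the highest value.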

As pointed out at the beginning of the section, the disparity in the length of the scenes 

of both playwrights makes it necessary to establish an estimation of what the lexical 

richness of their scenes would be if their size was balanced to compare them. It seems 

that the Shakespearean scenes whose number of words ranges from 2,625 to 2,845 and 

have a lexical richness of between 32% and 33.8% would probably present a similar value 

to the Marlowian scenes that have between 2,920 and 3,330 words and a lexical richness 

of between 28.3% and 30.6% if their size increased to the point of being balanced with 

those of Marlowe. In other words, it seems evident that the lexical richness of both 

playwrights would overlap on many occasions if the size of these scenes was similar, which 

is why this method should not be considered highly effective to distinguish between both 

playwrights in this linguistic context. 

In sum, even though the disparity of the length of the samples does not allow for such 

a precise evaluation of the results as in the three previous stages of the pre-study, it seems 

that the lexical richness of the scenes of both playwrights presents more intra-author 

consistency than in any other stage, which could mean that the stability of this parameter 

increases with the size of the samples. Nevertheless, the probability with which the results 

of the two candidates may overlap can be seen as a threat to the precision of the 

analysis of a disputed scene, which is why this test will not be applied to determine the 

likeliest authorship of the scenes of Arden of Faversham of this length. 

5.2.5. Conclusions derived from Pre-study 2 

This pre-study has analysed the effectiveness of the calculation of the lexical richness to 

distinguish between scenes taken from plays that are not comedies and were written 

approximately between 1590 and 1595 by Shakespeare and Marlowe. The first stages of 

the pre-study have shown many inconsistencies in the values of the samples written by 

the same author, even though they have been carefully selected to have a similar number 

of words and optimize their results. As the size of the samples has increased, the lexical 

richness of those written by the same author has presented more consistency, but not 

enough inter-author variation. Therefore, this parameter has achieved a higher degree of 

intra-author consistency than the calculation of the average number of words per sentence, 

but the frequency with which the results of both playwrights tend to overlap does not 

guarantee the presence of clear results if disputed scenes are analysed.  

In conclusion, the quantification of the lexical richness will not be used for the 

attribution of authorship of Arden of Faversham, given that it has not been proved to be 

effective to distinguish between Shakespeare and Marlowe in any of the four types of 

scenes. Nevertheless, I would like to stress that this does not mean that the quantification 

of the lexical richness is ineffective in authorship attribution studies in general, but that it 

is not sufficiently reliable in this specific linguistic context. 

5.3. Pre-study on n-gram tracing (Pre-study 3) 

The objective of the third pre-study is to assess the effectiveness of n-gram tracing to 

distinguish between Shakespearean and Marlowian scenes. To this end, the authorship 



of ten undisputed scenes (five from each of the two reference corpora) whose number of 

words is between 100 and 450 will be analysed independently first. Afterwards, the same 

procedure will be followed with ten scenes that contain between 500 and 950 words, 

between 1,100 and 1,700 words and, due to the scarcity of Marlowian scenes of more 

than 2,000 words, the last stage of the pre-study will analyse five Shakespearean and four 

Marlowian scenes of almost 2,000 words or more. 

The reference corpora of both playwrights have been edited to have a similar number 

of words and be in equal conditions to present a higher number of n-grams in common 

with each sample whose authorship is analysed. Those common n-grams between a scene 

and the reference corpus from which it has been extracted that include the names of 

characters and locations that are exclusive to the play to which that scene belongs will be 

manually discarded from the list provided by ALTXA to produce an outcome that can 

reflect more faithfully the situation that would be faced if the authorship of a scene from 

Arden of Faversham was studied with this method. If a scene shares at least 10 n-grams 

of a certain type with one of the reference corpora, these will be analysed quantitatively, 

whereas the others will be analysed from a qualitative perspective that can complement 

those results. Depending on the clarity with which a scene can be associated with one of 

the reference corpora, its attribution will be presented as highly probable or slightly 

probable, whereas the expression “it seems uncertain if this scene was written by 

Shakespeare/Marlowe” will be used if the results are inconclusive (see Section 4.5.4 for a 

thorough explanation of the reasons underlying the abovementioned decisions).  
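
The core of the procedure described above, that is, listing the n-grams that a scene shares with a reference corpus and discarding those that contain play-specific names, can be sketched as follows. This is an illustrative approximation and not the code of ALTXA: the tokenisation, the decision to count each distinct n-gram once and the `excluded_names` parameter are assumptions, and the example sentences are invented.

```python
import re


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the set of distinct word n-grams found in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(scene, corpus, n, excluded_names=()):
    """N-grams common to scene and corpus, minus those containing excluded names.

    `excluded_names` is a hypothetical parameter holding single-word names of
    characters or locations that are exclusive to the play of the scene.
    """
    excluded = {name.lower() for name in excluded_names}
    common = ngrams(tokenize(scene), n) & ngrams(tokenize(corpus), n)
    return {gram for gram in common if not excluded.intersection(gram)}


# Invented sentences standing in for a scene and a reference corpus.
scene = "Richmond is come, and the king is dead."
corpus = "They say the king is dead and Richmond is come."

print(len(shared_ngrams(scene, corpus, 2)))                               # 5
print(len(shared_ngrams(scene, corpus, 2, excluded_names=["Richmond"])))  # 4
```

In an evaluation of an undisputed scene, the scene would first be removed from its own reference corpus, and the same call would then be repeated for each value of n and for each of the two corpora.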

5.3.1. N-gram tracing with scenes of between 100 and 450 words 

With the purpose of testing the effectiveness of n-gram tracing in determining the likeliest 

authorship of Shakespearean and Marlowian scenes that contain between 100 and 450 

words, the authorship of five random scenes of each author will be analysed 

independently. The ten analyses will be conducted by removing each scene from the 

reference corpus to which it belongs, identifying with ALTXA the n-grams that it shares 

with the two reference corpora and applying the methodological principles that have been 

pointed out at the beginning of this section, which were thoroughly explained in Section 

4.5.4.  

 

Scene II.iii from Shakespeare’s Richard III (398 words) 

The first scene that has been randomly selected for this stage of the pre-study is Scene 

II.iii from Richard III. It has been removed from the Shakespearean reference corpus and 

ALTXA has identified the n-grams that it has in common with such corpus and with that 

of the other candidate of the study. The results of the quantitative analysis of the 3-grams 

and 2-grams that the scene shares with the Shakespearean and the Marlowian corpora will 

be presented in Table 10 and then discussed. Afterwards, the common 4-grams will be 

mentioned and qualitatively analysed. 

Table 10 | N-gram tracing with Scene II.iii from Richard III 

| Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus |
|---|---|---|
| 3-grams | 17 | 13 |
| 2-grams | 148 | 124 |

Table 10 shows the number of 3-grams and 2-grams that Scene II.iii from Richard III 

shares with the Shakespearean corpus from which it has been extracted and with the 

Marlowian corpus. The scene has 17 common 3-grams with the Shakespearean corpus 

after the elimination of those that include the names of characters and locations that are 

exclusive to the play to which it belongs from the list provided by ALTXA, whereas it has 

13 3-grams in common, that is, four fewer, with the Marlowian corpus. The difference in 

the 2-grams in common is larger in absolute terms than that of the 3-grams, since the scene has 

148 with the Shakespearean corpus and 124, that is, twenty-four fewer, with the Marlowian 

corpus. 
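
The two margins reported above can be expressed both in absolute and in relative terms. The sketch below uses the counts of Table 10. The relative margin, that is, the difference as a percentage of the Marlowian count, is an additional measure assumed here for illustration, as the pre-study relies on the absolute counts; the function name is hypothetical.

```python
def margins(shakespeare_count, marlowe_count):
    """Return the absolute margin and the margin relative to the Marlowian count (%).

    Assumption: the relative margin is not part of the method of the pre-study.
    """
    absolute = shakespeare_count - marlowe_count
    if marlowe_count == 0:
        return absolute, float("inf")
    return absolute, 100 * absolute / marlowe_count


# Counts copied from Table 10.
for label, shakespeare, marlowe in [("3-grams", 17, 13), ("2-grams", 148, 124)]:
    absolute, relative = margins(shakespeare, marlowe)
    print(f"{label}: {absolute:+d} n-grams ({relative:+.1f}% relative to the Marlowian count)")
```

Under this reading, the 2-grams show the wider gap in raw counts (24 against 4), while the 3-grams show the wider gap in proportion to the Marlowian count.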

As a result of the low number of words that it contains, Scene II.iii from Richard III 

shares no 4-grams with the Marlowian corpus, while the Shakespearean corpus only 

presents two 4-grams in common with it after the removal of the circumstantial ones that 

include the names of characters and locations that are exclusive to the play to which the 

scene belongs from the list provided by ALTXA. These 4-grams are “I fear I fear” and “the 

king is dead”. The expression “the king is dead” includes two lexical words that are common 

in texts of this nature, so it should not be seen as a solid idiolectal marker, whereas 

the 4-gram “I fear I fear” could be considered more distinctive, given that the repetition of 

the expression “I fear” seems like a conscious decision of the author who elaborated the 

dialogue. 



According to the present study, it seems highly probable that Scene II.iii from Richard 

III was written by Shakespeare due to the clarity of the quantitative analysis of the 

common 3-grams and 2-grams, which has been slightly reinforced by the qualitative 

analysis of the 4-grams in common. In other words, this study has determined the 

authorship of the sample correctly. 

Scene III.iii from Shakespeare’s Richard III (197 words) 

The second scene involved in this stage of the pre-study is Scene III.iii from Richard III, 

which will be analysed under the same principles that have been delineated earlier. The 

results of the quantitative analysis of the 2-grams that it shares with the Shakespearean 

and the Marlowian corpora will be presented in Table 11. Since the scene does not have 

at least 10 3-grams in common with either of the two reference corpora, these will be 

analysed from a qualitative perspective. 
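
The choice between a quantitative and a qualitative treatment of a given type of n-gram follows the threshold of 10 shared n-grams established in Section 5.3. A minimal sketch of that rule is shown below. The function name is hypothetical, and it assumes that reaching the threshold with either of the two corpora is sufficient; the counts used in the example are those reported for this scene.

```python
QUANTITATIVE_THRESHOLD = 10  # minimum number of shared n-grams stated in Section 5.3


def analysis_mode(shakespeare_count, marlowe_count):
    """Return how a type of n-gram would be analysed under the assumed rule."""
    if max(shakespeare_count, marlowe_count) >= QUANTITATIVE_THRESHOLD:
        return "quantitative"
    return "qualitative"


print(analysis_mode(63, 53))  # 2-grams of Scene III.iii: quantitative
print(analysis_mode(6, 4))    # 3-grams of Scene III.iii: qualitative
```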

Table 11 | N-gram tracing with Scene III.iii from Richard III 

| Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus |
|---|---|---|
| 2-grams | 63 | 53 |

As can be observed in the table presented above, Scene III.iii from Richard III presents 

63 2-grams in common with the Shakespearean corpus and 53, that is, ten fewer, with the 

Marlowian corpus, which is a considerable difference for such a short text. 

Scene III.iii from Richard III also has six 3-grams in common with the Shakespearean 

corpus from which it has been extracted. These 3-grams are “and for my”, “from all the”, “for 

them as”, “and I for”, “to death and” and “let me tell”, which do not seem to be particularly 

distinctive. 

On the other hand, the scene has four 3-grams in common with the Marlowian corpus, 

that is, two fewer than with the corpus of the Bard. These 3-grams are “we meet again”, “and 

for my”, “we give to” and “and her princely”. The last 3-gram seems to be the most distinctive 

of the group due to the presence of the relatively uncommon adjective “princely”. 

No common n-grams of more than three words can be found between the scene and 

the reference corpora. This is not surprising, given that it only has 197 words and hence 

it has few chances of having larger constructions in common with them. 



In brief, the quantitative analysis of the common 2-grams reveals that Shakespeare is 

the likeliest author of Scene III.iii from Richard III by a margin of ten 2-grams, which is 

notable for such a short sample. The analysis of the 3-grams in common shows that the 

scene also shares more with the Shakespearean corpus than with that of Marlowe, even 

though one of the 3-grams that it shares with the Marlowian corpus could be seen as the 

most distinctive. In any case, if the results of the study are observed from a holistic 

perspective, it seems highly probable that this scene was written by Shakespeare, so 

the study has achieved its goal successfully. 

Scene V.ii from Shakespeare’s Richard III (189 words) 

The results derived from the quantitative analysis of the 3-grams and 2-grams that Scene 

V.ii from Richard III shares with the reference corpora of Shakespeare and Marlowe can 

be observed in Table 12. After these results are discussed, the number of 5-grams and 

4-grams in common will be revealed and these constructions will be analysed from a 

qualitative perspective. 

Table 12 | N-gram tracing with Scene V.ii from Richard III 

| Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus |
|---|---|---|
| 3-grams | 12 | 5 |
| 2-grams | 70 | 57 |

Scene V.ii from Richard III presents 12 common 3-grams with the corpus of the Bard, 

which is more than twice the number of 3-grams that it shares with the Marlowian corpus 

(5). Furthermore, there are 70 2-grams in common between the Shakespearean corpus 

and the scene, which shares 57 with the Marlowian corpus, that is, thirteen fewer than with 

the corpus of the Bard. 

There is a 5-gram shared between Scene V.ii from Richard III and the Shakespearean 

corpus, which is “to reap the harvest of”. This 5-gram, which can be divided into two 4-grams 

(“to reap the harvest” and “reap the harvest of”), seems to be distinctive, given that it contains 

two lexical words (reap and harvest) that are relatively uncommon.  
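
Any 5-gram contains two overlapping 4-grams, which is why a single shared 5-gram also counts as two shared 4-grams. The following sketch makes this explicit; the helper name is hypothetical.

```python
def sub_ngrams(ngram, n):
    """List the n-word sequences contained in a longer n-gram given as a string."""
    words = ngram.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(sub_ngrams("to reap the harvest of", 4))
# ['to reap the harvest', 'reap the harvest of']
```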

The Marlowian corpus presents a 4-gram in common with the scene, which is “the 

bowels of the”. This 4-gram is not as distinctive as the 5-gram that has been previously 

discussed, given that it only contains one lexical word. 



According to this study, it seems highly probable that Scene V.ii from Richard III was 

written by Shakespeare, given the clarity of the quantitative analysis of the common 

3-grams and 2-grams as well as the presence of a distinctive 5-gram in common between 

the scene and his reference corpus, which is unlikely to be found in the analysis of such a short 

sample. In other words, the study has linked the text to its author 

successfully. 

Scene II.iv from Shakespeare’s Richard II (192 words) 

The fourth Shakespearean scene involved in this stage of the pre-study belongs to Richard 

II, which is the second play used for the compilation of his reference corpus. The results 

of the quantitative analysis of the 3-grams and 2-grams that Scene II.iv from Richard II 

shares with the two reference corpora will be presented in Table 13 and later discussed. 

This will be followed by the qualitative analysis of the common 4-grams. 

Table 13 | N-gram tracing with Scene II.iv from Richard II 

| Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus |
|---|---|---|
| 3-grams | 13 | 6 |
| 2-grams | 67 | 59 |

Table 13 shows that Scene II.iv from Richard II has more 3-grams and 2-grams in 

common with the Shakespearean corpus from which it has been extracted than with the 

corpus of Marlowe. While the scene has 13 common 3-grams with the corpus of the Bard, 

it shares 6 with the Marlowian corpus, that is, less than half. In addition, it shares 67

2-grams with the Shakespearean corpus and eight fewer (59) with the reference corpus of the

other candidate. 
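
The counts reported in tables such as Table 13 can be understood through a minimal sketch of how shared n-grams might be counted. It is written under stated assumptions: the tokenisation rule, the function names and the two short texts are hypothetical, distinct n-gram types are counted rather than repeated occurrences, and nothing here reproduces the internal workings of ALTXA.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of distinct word n-grams in a text (assumed tokenisation)."""
    # Assumption: lower-case the text and keep letters and apostrophes only.
    tokens = re.findall(r"[a-z'’]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(scene: str, corpus: str, n: int) -> set[tuple[str, ...]]:
    """N-gram types that occur both in the questioned scene and in a reference corpus."""
    return word_ngrams(scene, n) & word_ngrams(corpus, n)


# Hypothetical toy texts; the reference corpus must not contain the scene itself.
scene = "the king is dead and the king is gone"
reference = "they say the king is dead my lord"

counts = {n: len(shared_ngrams(scene, reference, n)) for n in (3, 2)}
print(counts)
```

Running the same comparison against each candidate's reference corpus would yield one pair of counts per n-gram size, which is the shape of the tables in this section.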

Scene II.iv from Richard II has no larger n-grams in common with the Marlowian 

corpus, while it shares three 4-grams with the reference corpus of Shakespeare. These are

“friends are fled to”, “on the earth” and “and the king is dead”. The first of these 4-grams seems

to be an uncommon expression.  

According to this study, it seems highly probable that Scene II.iv from Richard II was 

written by Shakespeare. This is due to the clarity of the results of the quantitative analysis 

of the common 3-grams and 2-grams, which has been complemented by the presence of 



3 4-grams in common between the scene and his reference corpus, one of which seems to 

be moderately distinctive. Therefore, the study has achieved its goal successfully. 

Scene III.i from Shakespeare’s Richard II (342 words) 

The last Shakespearean scene of between 100 and 450 words included in this stage of the 

pre-study is Scene III.i from Richard II. The results of the quantitative analysis of the 

common 3-grams and 2-grams between the scene and the two reference corpora will be 

offered first in the form of a table. Afterwards, such results will be discussed and 

complemented by the qualitative analysis of the common 4-grams. 

Table 14 | N-gram tracing with Scene III.i from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 21 | 19
2-grams | 130 | 106

Table 14 shows that there is a small difference between the 3-grams that this scene shares 

with both reference corpora, given that it has 21 in common with the Shakespearean 

corpus and 19 with that of Marlowe. Scene III.i from Richard II also shares 130 2-grams 

with the Shakespearean corpus and 106 with that of the other candidate, which stands as 

a considerable difference of twenty-four points. 

While the scene has no larger n-grams in common with the Marlowian corpus, it 

shares three 4-grams with the corpus of the Bard, which are “and the hand of”, “I am a gentleman”

and “to the king in”. None of these three constructions seems to be distinctive of an author’s

idiolect, so they should not be seen as solid markers.

According to the study, it seems highly probable that Scene III.i from Richard II was 

written by Shakespeare, since it shares more 3-grams with his corpus than with the corpus 

of Marlowe by a low margin, as well as more 2-grams by a notable difference. 

Furthermore, while the scene has no common 4-grams with the Marlowian corpus, it 

shares a few with the corpus of the Bard, even though they are not particularly distinctive. 

In sum, the five Shakespearean scenes of between 100 and 450 words that have been 

analysed as disputed texts with n-gram tracing have been successfully attributed to their 

author. The next five scenes will be extracted from the Marlowian corpus. 



Scene II.iii from Marlowe’s Edward II (218 words) 

The results of the quantitative analysis of the common 3-grams and 2-grams between 

Scene II.iii from Edward II and the reference corpora of the two candidates of the study 

will be presented in Table 15 and later discussed. Afterwards, the qualitative analysis of 

the common 5-grams and 4-grams will be provided. 

Table 15 | N-gram tracing with Scene II.iii from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 16 | 19
2-grams | 86 | 96

Table 15 shows that the scene presents more 3-grams in common with the Marlowian 

corpus than with that of Shakespeare by a low margin (19 vs. 16). Furthermore, it presents 

96 common 2-grams with the corpus of Marlowe, which are ten more than those that it 

shares with the Shakespearean corpus (86). 

The scene presents a common 5-gram with the Shakespearean corpus, which is “hardy

as to touch the”. This construction, which can be divided into the 4-grams “hardy as to

touch” and “as to touch the”, seems to be unusual, so it could be seen as an idiolectal

marker.  

On the other hand, the scene shares the 4-gram “the earl of Lancaster” with the corpus

of Marlowe, which does not seem to be distinctive, since it also shares the 2-gram “of

Lancaster” with the corpus of Shakespeare. Therefore, it could be said that the qualitative

analysis of the common 5-grams and 4-grams associates the scene with Shakespeare, 

given that it shares a 5-gram with his reference corpus that is more distinctive than the 4-

gram that it has in common with the corpus of Marlowe. This contrasts with the results 

of the quantitative analysis of the common 3-grams and 2-grams. 

According to this study, it seems slightly probable that Scene II.iii from Edward II 

was written by Marlowe. The scene shares more 3-grams with the Marlowian corpus than 

with the corpus of Shakespeare by a low margin and more 2-grams by a difference of ten 

points, which is significant if the length of the scene is taken into consideration. 

Nevertheless, the qualitative analysis of the larger n-grams, which should complement 



these results, links the scene to Shakespeare. In any case, this study has successfully 

associated the authorship of the scene with its author. 

Scene III.i from Marlowe’s Edward II (151 words) 

The results derived from the quantitative analysis of the 3-grams and 2-grams that Scene 

III.i from Edward II shares with the reference corpora of the two playwrights that 

constitute the focus of the study will be presented in Table 16. This will be followed by a 

discussion of such results and the qualitative analysis of the common 6-grams, 5-grams 

and 4-grams. 

Table 16 | N-gram tracing with Scene III.i from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 11 | 20
2-grams | 61 | 74

Table 16 shows that the number of 3-grams in common between the scene and the 

Marlowian corpus is almost twice the number of 3-grams that it shares with the corpus

of the Bard (20 vs. 11). There is also a difference of more than ten points if the number 

of 2-grams that Scene III.i from Edward II shares with the reference corpus of Marlowe 

(74) is compared to those that it has in common with the Shakespearean corpus (61). 

These differences seem especially significant if the fact that the scene only contains 151 

words is taken into consideration. 
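
The remark about scene length can be made concrete by expressing each difference per 100 words. This normalisation is an illustrative assumption and is not a measure used in the study; the figures come from Tables 14 and 16 and from the word counts given in the scene headings.

```python
def margin_per_100_words(own: int, other: int, scene_words: int) -> float:
    """Difference in shared n-grams, scaled to a scene length of 100 words."""
    return round((own - other) / scene_words * 100, 2)


# Scene III.i from Edward II (151 words), counts from Table 16.
edward_3 = margin_per_100_words(20, 11, 151)
edward_2 = margin_per_100_words(74, 61, 151)

# Scene III.i from Richard II (342 words), counts from Table 14.
richard_3 = margin_per_100_words(21, 19, 342)
richard_2 = margin_per_100_words(130, 106, 342)

print(edward_3, edward_2, richard_3, richard_2)
```

On this hypothetical scale, the margins of the shorter scene are larger than those of the longer one for both n-gram sizes, which is consistent with the observation made above.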

In addition, the scene has a common 6-gram with the Marlowian corpus, which is 

“shall I not see the king”. This 6-gram, which can be divided into two 5-grams and three 4-grams,

seems to be distinctive, given that it is the largest construction in common found in this

stage of the pre-study. The text also presents two more 4-grams in common with the

Marlowian corpus, apart from those derived from the division of the 6-gram mentioned

above. These are “tell him that I” and “of all my bliss”, which do not seem to be solid idiolectal

markers.  
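
Separating the 4-grams that merely derive from a longer shared n-gram from those that are independent of it can be sketched as follows. The helper names are hypothetical, and the three derived 4-grams in the list are obtained by dividing the 6-gram reported above.

```python
def contains(longer: list[str], shorter: list[str]) -> bool:
    """True if the shorter word sequence occurs contiguously inside the longer one."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def independent_ngrams(shared: list[str], longer_ngrams: list[str]) -> list[str]:
    """Keep only the shared n-grams that are not part of a longer shared n-gram."""
    longer_tokens = [g.split() for g in longer_ngrams]
    return [
        g for g in shared
        if not any(contains(tokens, g.split()) for tokens in longer_tokens)
    ]


six_gram = "shall I not see the king"
shared_four_grams = [
    "shall I not see",
    "I not see the",
    "not see the king",
    "tell him that I",
    "of all my bliss",
]
remaining = independent_ngrams(shared_four_grams, [six_gram])
print(remaining)
```

Comparing word sequences rather than raw substrings avoids accidental matches inside longer words.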

There is also a common 4-gram between the scene and the corpus of the Bard,

which is “the king of heaven”. This expression appears to be relatively distinctive, given its

metaphorical meaning. 



The results derived from the quantitative analysis reveal that it is highly probable that 

Scene III.i from Edward II was written by Marlowe, given that it shares more 3-grams 

and 2-grams with his reference corpus than with that of the Bard by a considerable 

margin. This has been reinforced by the presence of a common 6-gram between the scene 

and the corpus of Marlowe that stands as a robust idiolectal marker, as well as a few 5-

grams and 4-grams, and thus this study has linked the authorship of the scene to its author 

effectively. 

Scene IV.i from Marlowe’s Edward II (123 words)

The attribution of authorship of this scene appears to be particularly difficult, given that it

is the one with the lowest number of words and, consequently, there are fewer chances of

finding common constructions between the text and the two reference corpora. The results 

of the quantitative analysis of the common 2-grams will be presented in Table 17 and 

later discussed. Afterwards, the qualitative analysis of the 4-grams and 3-grams in 

common will be provided. 

Table 17 | N-gram tracing with Scene IV.i from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
2-grams | 33 | 35

Table 17 shows that the quantification of the common 2-grams offers a very similar result 

for both playwrights. While Scene IV.i from Edward II has 33 2-grams in common with 

the Shakespearean corpus, it shares 35, that is, only two more, with the corpus of 

Marlowe. This small difference should not be surprising if the length of the scene is taken

into consideration. 

The software ALTXA has found no common 4-grams between Scene IV.i from 

Edward II and the Shakespearean corpus and only one between it and the corpus of 

Marlowe, which is “but hath your grace”. This 4-gram does not seem to be a distinctive

construction for a play of this kind. 

The scene has two 3-grams in common with the Shakespearean corpus and seven with the

corpus of Marlowe. The 3-grams that Scene IV.i from Edward II shares with the

Shakespearean corpus are “you my lord” and “hath my lord”, which seem to be frequent

constructions. 



The seven 3-grams that the scene shares with the Marlowian corpus are “you my lord”, which

can also be found in the Shakespearean corpus, “hath your grace”, “but hath your”, “to pass

in”, “me leave to”, “my country’s cause” and “for England’s good”. Among these constructions,

“my country’s cause” and “for England’s good” appear to be the most distinctive of the group

due to the use of the genitive. This stands as an idiolectal choice, given that the author

could have written “the cause of my country” and “for the good of England”.

According to the present study, it seems slightly probable that Scene IV.i from 

Edward II was written by Marlowe, given that it shares more 2-grams with his corpus 

than with that of Shakespeare by a very narrow margin and the qualitative analysis

of the common 4-grams and 3-grams links its authorship to him with certain clarity. 

Although the results are not as clear as on other occasions, the study has been successful.  

Scene IV.iv from Marlowe’s Edward II (228 words) 

Table 18 shows the common 3-grams and 2-grams between Scene IV.iv from Edward II 

and the reference corpora of the two candidates of the study. After these results are 

discussed, the qualitative analysis of the only common 4-gram that has been found by

ALTXA will be presented. 

Table 18 | N-gram tracing with Scene IV.iv from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 11 | 17
2-grams | 84 | 94

As can be observed in the table presented above, the scene has six more 3-grams in 

common with the Marlowian corpus (17) than with the corpus of the Bard (11). In 

addition, there is a difference of ten points if the number of 2-grams that Scene IV.iv from 

Edward II shares with the corpus of Marlowe (94) is compared to those that it has in 

common with the corpus of the other candidate (84). 

The scene does not share any larger n-grams with the Shakespearean corpus, whereas 

it has a common 4-gram with the corpus of Marlowe, which is “England’s wealth and

treasury”. This can be seen as a distinctive construction, given that it includes three lexical

words and the use of the genitive, which stands as an idiolectal choice, as pointed out in 

the analysis of the previous scene. 



In sum, it seems highly probable that this scene was written by Marlowe if the clarity 

of the results of the quantitative analysis of the common 3-grams and 2-grams is taken 

into account. Furthermore, these results have been complemented by the presence of a 

distinctive 4-gram in common between the scene and the reference corpus of Marlowe, 

so the study has achieved its goal successfully.

Scene III.i from Marlowe’s The Jew of Malta (253 words) 

The last scene included in this stage of the pre-study is Scene III.i from The Jew of Malta. 

The results of the quantitative analysis of the 3-grams and 2-grams that the scene shares 

with the two reference corpora will be presented in Table 19. These results will be discussed and

complemented by the qualitative analysis of the common 5-grams and 4-grams. 

Table 19 | N-gram tracing with Scene III.i from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 23 | 31
2-grams | 96 | 104

Scene III.i from The Jew of Malta presents eight more 3-grams in common with the 

Marlowian corpus than with the corpus of Shakespeare (31 vs. 23). Similarly, there is a 

difference of eight points between the 2-grams that the scene shares with the corpus of 

Marlowe (104) and those that it has in common with the corpus of the other candidate 

(96).  

There is a 5-gram in common between the scene and the Marlowian corpus. This is 

“or it shall go hard”, which does not seem to be especially distinctive. On the other hand,

the scene does not share any 5-grams with the corpus of Shakespeare. 

The analysis of the common 4-grams conducted by ALTXA reveals that Scene III.i 

from The Jew of Malta shares 2 with the corpus of Shakespeare and 8 with that of 

Marlowe. The two 4-grams that it shares with the Shakespearean corpus are “I know she is”

and “and here he comes”, which seem to be common combinations of words.

The eight 4-grams that Scene III.i from The Jew of Malta has in common with the

Marlowian corpus are, apart from those derived from the division of the 5-gram “or it shall

go hard”, “she is a courtesan”, “in such sort as”, “that ever I beheld”, “and in the night”, “and here

he comes” and “and yet I know”. Among these, the 4-gram “that ever I beheld” appears to be



the most distinctive construction of the group because of the inversion of the words “I” and

“ever”, which takes place in an affirmative sentence both in the extracted scene and the

remaining corpus. 

In the light of the findings provided by the study, it seems highly probable that Scene 

III.i from The Jew of Malta was written by Marlowe. The clarity of the results of the 

quantitative analysis of the common 3-grams and 2-grams has been reinforced by the 

presence of a 5-gram and 8 4-grams in common between the scene and the Marlowian 

corpus, one of which seems to be distinctive. 

Conclusions derived from the first stage of the pre-study 

The authorship of five Shakespearean and five Marlowian scenes that contain between 

100 and 450 words has been analysed as if they were disputed texts with the purpose of 

testing the validity of n-gram tracing to determine the authorship of the scenes of Arden 

of Faversham of the same length. The authorship of the ten scenes has been correctly 

attributed, with a high degree of certainty on eight occasions. Since the success rate

of this method has been 100%, it will be used in the analysis of the scenes of Arden of

Faversham that have between 100 and 450 words. The next stage of the pre-study will

examine scenes whose length ranges from 500 to 950 words.
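
As a cross-check on the quantitative part of this stage, the sketch below tallies which reference corpus shares more n-grams with each scene, using only the figures reported in Tables 13 to 19 (the 3-gram counts for Scene IV.i from Edward II come from the running text). The data structure and function name are hypothetical, and the sketch ignores the qualitative analysis of larger n-grams, which also informs the degree of certainty in the study.

```python
# (scene, true author, {n: (shared with Shakespearean corpus, shared with Marlowian corpus)})
RESULTS = [
    ("Richard II II.iv", "Shakespeare", {3: (13, 6), 2: (67, 59)}),
    ("Richard II III.i", "Shakespeare", {3: (21, 19), 2: (130, 106)}),
    ("Edward II II.iii", "Marlowe", {3: (16, 19), 2: (86, 96)}),
    ("Edward II III.i", "Marlowe", {3: (11, 20), 2: (61, 74)}),
    ("Edward II IV.i", "Marlowe", {3: (2, 7), 2: (33, 35)}),
    ("Edward II IV.iv", "Marlowe", {3: (11, 17), 2: (84, 94)}),
    ("The Jew of Malta III.i", "Marlowe", {3: (23, 31), 2: (96, 104)}),
]


def leader(shakespeare: int, marlowe: int) -> str:
    """Name the corpus with the higher count, or report a tie."""
    if shakespeare == marlowe:
        return "tie"
    return "Shakespeare" if shakespeare > marlowe else "Marlowe"


# Scenes whose true author leads on every n-gram size listed for them.
agreeing = [
    scene
    for scene, author, counts in RESULTS
    if all(leader(*pair) == author for pair in counts.values())
]
print(f"{len(agreeing)} of {len(RESULTS)} scenes favour their author on every count")
```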

5.3.2. N-gram tracing with scenes of between 500 and 950 words 

The second stage of this pre-study will analyse the authorship of five Shakespearean and 

five Marlowian scenes whose length ranges from 500 to 950 words using n-gram tracing. 

The scenes will be randomly selected and the analyses will be built upon the same 

methodological principles that have been followed in the previous stage of the pre-study. 
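
Random selection within a length band could be carried out along these lines. This is only a sketch: the inventory lists just the Shakespearean scenes whose word counts appear in this pre-study, the seed and the number of scenes drawn are arbitrary, and the study does not state how its random draw was performed.

```python
import random

# Word counts taken from the scene headings of this pre-study (partial inventory).
SCENE_LENGTHS = {
    "Richard II II.iv": 192,
    "Richard II III.i": 342,
    "Richard III II.iv": 591,
    "Richard III III.iv": 860,
    "Richard III IV.ii": 920,
    "Richard II III.iv": 856,
    "Richard II V.i": 836,
}


def draw_scenes(lengths: dict[str, int], low: int, high: int, k: int, seed: int) -> list[str]:
    """Draw k scenes at random, without replacement, from those within the length band."""
    eligible = sorted(name for name, words in lengths.items() if low <= words <= high)
    return random.Random(seed).sample(eligible, k)


selected = draw_scenes(SCENE_LENGTHS, low=500, high=950, k=3, seed=0)
print(selected)
```

Fixing the seed makes the draw repeatable, which would allow the same selection to be reproduced later.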

Scene II.iv from Shakespeare’s Richard III (591 words) 

The first scene that has been randomly selected for this second stage of the pre-study is 

Scene II.iv from Richard III. The results of the quantitative analysis of the 3-grams and 

2-grams that it shares with the reference corpora of the two candidates that constitute the 

focus of the study can be observed in Table 20. Afterwards, the discussion of such results 

and the qualitative analysis of the common 5-grams and 4-grams will be provided.  

 

Table 20 | N-gram tracing with Scene II.iv from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 43 | 29
2-grams | 208 | 208

Table 20 shows that Scene II.iv from Richard III has fourteen more 3-grams in common

with the Shakespearean corpus than with the Marlowian corpus (43 vs. 29), although the 

number of 2-grams shared between the scene and the two reference corpora is identical 

(208). 

There is a common 5-gram between the text and the Shakespearean corpus, although 

this 5-gram is “I will go with you”, which is a common expression that should not be seen

as a solid idiolectal marker. Apart from the two 4-grams into which the previous 5-gram can

be divided, the scene also shares another five 4-grams with the Shakespearean corpus, which

are “with all my heart”, “you have no cause”, “me my gracious lord”, “how doth the prince” and

“to-morrow or next day”. None of these constructions seems to be distinctive of an author’s

idiolect. 

The scene shares two 4-grams with the Marlowian corpus, which are “with all my heart”,

which can also be found on the list of common 4-grams between the scene and the

Shakespearean corpus, and “the ruin of my”, which includes three function words and

hence does not seem to be a solid marker either.

In brief, the quantitative analysis of the common 3-grams and 2-grams associates the 

authorship of the scene with Shakespeare, although it shares the same number of 2-grams 

with the two reference corpora. The scene also shares more 5-grams and 4-grams with 

the corpus of the Bard, but these are not distinctive. According to this study, it seems 

slightly probable that Scene II.iv from Richard III was written by Shakespeare. In other 

words, the study has successfully achieved its goal, although not with a high degree of 

certainty. 

Scene III.iv from Shakespeare’s Richard III (860 words) 

The results of the quantitative analysis of the 4-grams, 3-grams and 2-grams that Scene 

III.iv from Richard III shares with the two reference corpora will be presented in Table 



21. After the results provided in that table are discussed, a qualitative analysis of the

common 5-grams will be offered. 

Table 21 | N-gram tracing with Scene III.iv from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 11 | 3
3-grams | 57 | 45
2-grams | 325 | 294

Table 21 shows that the scene has more 4-grams, 3-grams and 2-grams in common with 

the Shakespearean corpus than with that of Marlowe. Scene III.iv from Richard III shares 

11 4-grams with the corpus of the Bard, which is a large number for a text of this length, and only

3 4-grams with the Marlowian corpus. It also has 57 3-grams in common with the 

Shakespearean corpus, which are twelve more than those that it shares with the corpus of 

Marlowe (45). Furthermore, there is a difference of more than thirty 2-grams between 

those that the scene shares with the corpus of the Bard (325) and those that it has in 

common with the Marlowian corpus (294). 

In addition, the scene shares two 5-grams with the Shakespearean corpus, which are “will

my lord with all” and “time here comes the duke”. The latter 5-gram could be seen as an

idiolectal marker, given that the combination of words “time here” does not appear in the

corpus of the other candidate. 
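
Checking that a word combination is absent from the other candidate’s corpus amounts to a set difference. The sketch below assumes a simple tokenisation and uses short texts invented for illustration rather than the actual reference corpora; the function names are hypothetical.

```python
import re


def word_ngrams(text: str, n: int) -> set[str]:
    """Return the distinct word n-grams of a text as space-joined strings."""
    tokens = re.findall(r"[a-z'’]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exclusive_ngrams(scene: str, own_corpus: str, other_corpus: str, n: int) -> set[str]:
    """N-grams shared by the scene and one corpus that never occur in the other corpus."""
    shared = word_ngrams(scene, n) & word_ngrams(own_corpus, n)
    return shared - word_ngrams(other_corpus, n)


# Texts invented for illustration only.
scene = "in good time here comes the duke"
own_corpus = "and in good time here comes my lord"
other_corpus = "but see where here comes the king in good state"

exclusive = exclusive_ngrams(scene, own_corpus, other_corpus, 2)
print(sorted(exclusive))
```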

According to the study, it seems highly probable that Scene III.iv from Richard III 

was written by Shakespeare, given that the quantitative analysis of the common 4-grams, 

3-grams and 2-grams associates the authorship of the scene with his reference corpus with 

great clarity. Moreover, ALTXA has identified 2 5-grams in common between the scene 

and the corpus of the Bard, one of which appears to be relatively distinctive. In sum, the 

study has effectively accomplished its goal. 

Scene IV.ii from Shakespeare’s Richard III (920 words) 

The results of the quantitative analysis of the 3-grams and 2-grams that this scene shares 

with the reference corpora of the two candidates will be presented in Table 22 and later 

discussed. Afterwards, the qualitative analysis of the common 5-grams and 4-grams

will be provided. 



Table 22 | N-gram tracing with Scene IV.ii from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 66 | 50
2-grams | 319 | 292

Table 22 shows that Scene IV.ii from Richard III shares 66 3-grams with the 

Shakespearean corpus, that is, sixteen more than with the Marlowian corpus, with which 

it shares 50. Furthermore, there is a difference of more than twenty-five points if the 

number of 2-grams shared between the scene and the Shakespearean corpus (319) is 

compared to those that it has in common with the reference corpus of Marlowe (292). 

There is a 5-gram in common between Scene IV.ii from Richard III and the 

Shakespearean corpus, which is “may it please to you”. This 5-gram, which can be divided

into two 4-grams, does not seem to stand as a particularly distinctive construction and, as a

matter of fact, the 3-gram “may it please” can also be found in the corpus of Marlowe. Apart

from the two 4-grams derived from the division of the abovementioned 5-gram, the scene

also shares with the corpus of Shakespeare “upon the stroke of”, “my lord I have”, “but I had

rather”, “me my gracious lord” and “and will no doubt”. Among the seven 4-grams that it has in

common with the Shakespearean corpus, “upon the stroke of” appears to be the most

unusual construction.  

Scene IV.ii from Richard III shares three 4-grams with the Marlowian corpus, which are

“no more but so”, “a friend of mine” and “what say’st thou now”. None of these constructions

seems to be a reliable authorship marker. 

The clarity of the quantitative analysis of the common 3-grams and 2-grams combined 

with that of the qualitative analysis of the 5-grams and 4-grams in common suggests that 

it is highly probable that this scene was written by Shakespeare, so the study has

attributed the scene to its author correctly. 

Scene III.iv from Shakespeare’s Richard II (856 words) 

Table 23 presents the 3-grams and 2-grams that Scene III.iv from Richard II shares with 

the corpora of Shakespeare and Marlowe. Such results will be discussed and 

complemented by the qualitative analysis of the common 5-grams and 4-grams. 

 

Table 23 | N-gram tracing with Scene III.iv from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 40 | 45
2-grams | 259 | 237

As can be observed in the table presented above, while the scene has more 3-grams in 

common with the corpus of Marlowe by a low margin (45 vs. 40), it presents more 

common 2-grams with the Shakespearean corpus by a considerable difference of twenty-

two points (259 vs. 237).  

According to the lists provided by ALTXA, the text presents three 4-grams in common

with the Shakespearean corpus, which are “had he done so”, “in the remembrance of” and “the

king shall be”. None of them seems to be distinctive.

The scene also presents two 5-grams in common with the Marlowian corpus, which are

“what was I born to” and “how can’st thou by this”. The 5-gram “what was I born to” appears

to be distinctive, given that it is part of a rhetorical question. These two 5-grams can be

divided into four 4-grams, which are the only ones that the scene shares with the reference

corpus of Marlowe. 

According to the present study, it seems uncertain whether Scene III.iv from Richard II was

written by Shakespeare or Marlowe. While the scene presents more 2-grams in common 

with the Shakespearean corpus by a notable difference of more than twenty points, it 

shares more 3-grams with the Marlowian corpus by a narrow margin. It also shares 2 5-

grams with the corpus of Marlowe, one of which is distinctive. This means that, for the 

first time in the pre-study, the authorship of a scene has not been

correctly associated with its author, although it has not been misattributed either. 

Scene V.i from Shakespeare’s Richard II (836 words) 

The last Shakespearean scene of between 500 and 950 words that has been randomly 

selected for this stage of the pre-study is Scene V.i from Richard II. Table 24 shows the 

3-grams and 2-grams that it shares with the two reference corpora. After these results are 

discussed, the common 5-grams and 4-grams will be revealed and analysed from a 

qualitative perspective. 

 

Table 24 | N-gram tracing with Scene V.i from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 33 | 33
2-grams | 259 | 241

Although Scene V.i from Richard II shares the same number of 3-grams (33) with the

two reference corpora, it has 259 2-grams in common with the Shakespearean corpus, 

which are almost twenty more than those that it shares with the Marlowian corpus (241). 

The scene shares the 5-gram “and yet not so for” with the Shakespearean corpus, which

does not seem to be especially distinctive, given that it does not include any uncommon

lexical words and the 3-gram “yet not so” can be found in the corpus of Marlowe. In

addition, the scene shares four 4-grams with the Shakespearean corpus, which are, apart

from the two 4-grams that derive from the division of the abovementioned 5-gram, “on my

head and” and “with a heavy heart”. The latter 4-gram holds a metaphorical meaning, so

it stands as a robust idiolectal marker.

The scene also has the 4-gram “I am dead and” in common with the Marlowian corpus,

which is relatively distinctive, given that it seems unusual to claim such a thing in the first

person. 

If these analyses are observed from a holistic perspective, it seems highly probable 

that Scene V.i from Richard II was written by Shakespeare. Even though the scene shares

the same number of 3-grams with the two reference corpora, it shares more 2-grams with 

the Shakespearean corpus by a notable margin and the qualitative analysis of the common 

5-grams and 4-grams also links the scene to him with clarity, so this study has

accomplished its objective effectively. 

In sum, the authorship of four of the five Shakespearean scenes that have been 

included in this stage of the pre-study has been correctly attributed, whereas the remaining 

one could not be clearly associated with either of the two candidates. The following scenes

will be extracted from the corpus of Marlowe. 

Scene II.i from Marlowe’s Edward II (649 words) 

The first Marlowian scene of between 500 and 950 words that has been randomly selected 

for this stage of the pre-study is Scene II.i from Edward II. The number of 3-grams and 



2-grams that this scene shares with the Marlowian corpus from which it has been 

extracted and with that of Shakespeare will be presented in Table 25. Afterwards, a 

qualitative analysis of the common 5-grams and 4-grams between the scene and these 

reference corpora will be conducted. 

Table 25 | N-gram tracing with Scene II.i from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 38 | 49
2-grams | 239 | 238

Scene II.i from Edward II shares eleven more 3-grams with the corpus of Marlowe (49) 

than with that of the Bard (38). Nevertheless, the scene presents one more 2-gram in 

common with the Shakespearean corpus (239 vs. 238), which contrasts with the previous 

comparison. 

The scene has seven 4-grams in common with the Marlowian corpus, which are “my lord

when I”, “my lord the king”, “the king and he”, “I humbly thank your”, “it shall be done”, “and now

and then” and “not so much as”. Among these 4-grams, the only one that appears to be

distinctive is “I humbly thank your” because of the way in which the verb “thank” is modified

by the adverb “humbly”, which stands as an idiolectal choice that cannot be found in the

corpus of the other candidate as a 2-gram.  

On the other hand, the text shares three 4-grams with the Shakespearean corpus, which

are “my lord the king”, “a friend of mine” and “he loves me well”. None of these 4-grams seems

to be a solid authorship marker. 

Taking into consideration that Scene II.i from Edward II only shares one more 2-gram 

with the Shakespearean corpus than with that of Marlowe, but it has eleven more 3-grams 

in common with the corpus of the latter and the qualitative analysis of the common 4-

grams also associates its authorship with him, it seems slightly probable that it was written 

by Marlowe. In other words, the study has been successful, although not with the same 

degree of certainty as on other occasions. 

Scene III.iii from Marlowe’s Edward II (726 words) 

The results of the quantitative analysis of the 3-grams and 2-grams in common between 

Scene III.iii from Edward II and the two reference corpora will be presented in Table 26 



and later discussed. Afterwards, the qualitative analysis of the common 5-grams and

4-grams will be provided.

Table 26 | N-gram tracing with Scene III.iii from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 36 | 28
2-grams | 211 | 236

Table 26 shows that the scene has eight more 3-grams in common with the Shakespearean 

corpus than with the corpus of Marlowe (36 vs. 28) and twenty-five more 2-grams in 

common with the Marlowian corpus than with the corpus of the Bard (236 vs. 211). In 

other words, the study of the common 3-grams associates the scene with the corpus of 

Shakespeare with moderate certainty and the study of the 2-grams in common does the 

same with the Marlowian corpus, so it could be said that the results of this

quantitative analysis are inconclusive. 
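
The conflict between the two counts can be made explicit with a small sketch that reports, for each n-gram size, which corpus leads and by how much. The function name and output format are hypothetical; the figures are those of Table 26, and the sketch does not replace the holistic judgement applied in the study.

```python
def compare_counts(counts: dict[int, tuple[int, int]]) -> dict[int, tuple[str, int]]:
    """Map each n-gram size to (leading corpus, margin); counts are (Shakespearean, Marlowian)."""
    summary = {}
    for n, (shakespeare, marlowe) in counts.items():
        if shakespeare == marlowe:
            summary[n] = ("tie", 0)
        elif shakespeare > marlowe:
            summary[n] = ("Shakespeare", shakespeare - marlowe)
        else:
            summary[n] = ("Marlowe", marlowe - shakespeare)
    return summary


table_26 = {3: (36, 28), 2: (211, 236)}
summary = compare_counts(table_26)

# The signals conflict when different n-gram sizes point to different candidates.
leaders = {name for name, _ in summary.values()}
conflicting = len(leaders - {"tie"}) > 1
print(summary, conflicting)
```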

The scene shares a 5-gram with each reference corpus. The 5-gram that it shares with

the corpus of the Bard is “the worst is death and”, while the 5-gram that it has in common

with the Marlowian corpus is “have you no doubt my”. While the first 5-gram seems

distinctive because it conveys a negative view towards death and thus it stands as a

specific vision of the world, the second one includes the combination of words “have you”

in an imperative sentence, which is something that cannot be found in the corpus of the 

other candidate.  

The 4-grams that the scene shares with the Shakespearean corpus and that are not derived

from the division of the 5-gram discussed above are “sound drums

and trumpets”, which includes three lexical words and appears to be distinctive, “it may not

be”, which seems to be a common construction in plays of this kind, and “are up in arms”,

which can also be found in the Marlowian corpus and thus should not be considered a

solid authorship marker.  

The 4-grams that Scene III.iii from Edward II has in common with the Marlowian 

corpus and do not come from the division of the 5-gram that has been already commented 

are ‘gainst law of arms, that appears to be a relatively unusual combination of words, I 


143 
 

doubt it not, which seems to be a frequent construction in texts of this nature, and are up 

in arms, which can also be found in the corpus of the other candidate. 
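
The filtering described in the two preceding paragraphs can be illustrated with a short Python sketch. It is only an illustration under stated assumptions and does not describe how ALTXA works: the function names are hypothetical, n-grams are treated as lowercase space-separated strings, and the 4-grams derived from each 5-gram are computed here rather than quoted from the analysis.

```python
def derived_four_grams(five_gram):
    """Return the two 4-grams obtained by dividing a 5-gram (hypothetical helper)."""
    words = five_gram.split()
    return [" ".join(words[i:i + 4]) for i in range(len(words) - 3)]


def independent(four_grams, five_grams):
    """Drop the 4-grams that merely result from dividing a shared 5-gram."""
    derived = {g for five in five_grams for g in derived_four_grams(five)}
    return [g for g in four_grams if g not in derived]


# Shared n-grams reported above for Scene III.iii from Edward II.
# The derived 4-grams are computed, not quoted (assumption).
shakespeare_5 = ["the worst is death and"]
marlowe_5 = ["have you no doubt my"]
shakespeare_4 = derived_four_grams(shakespeare_5[0]) + [
    "sound drums and trumpets", "it may not be", "are up in arms"]
marlowe_4 = derived_four_grams(marlowe_5[0]) + [
    "'gainst law of arms", "i doubt it not", "are up in arms"]

shakespeare_own = independent(shakespeare_4, shakespeare_5)
marlowe_own = independent(marlowe_4, marlowe_5)

# A 4-gram found in both corpora cannot discriminate between the candidates.
ambiguous = set(shakespeare_own) & set(marlowe_own)
shakespeare_only = [g for g in shakespeare_own if g not in ambiguous]
marlowe_only = [g for g in marlowe_own if g not in ambiguous]

print(shakespeare_only)
print(marlowe_only)
```

Under these assumptions, are up in arms is set aside as ambiguous, which leaves two candidate markers per corpus for the qualitative reading.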

According to this study, it remains uncertain whether Scene III.iii from Edward II was written

by Shakespeare or Marlowe, given that neither the quantitative analysis of the common

3-grams and 2-grams nor the qualitative analysis of the 5-grams and 4-grams in common

clearly links the scene to one of the reference corpora. Therefore, while the

study has not attributed the sample to the wrong author, it has not been able to provide 

substantial evidence to attribute its authorship to Marlowe. 

Scene III.iii from Marlowe’s The Jew of Malta (521 words) 

The number of 3-grams and 2-grams that Scene III.iii from The Jew of Malta shares with 

the two reference corpora will be presented in Table 27, together with a brief discussion 

of such results and the qualitative analysis of the common 5-grams and 4-grams. 

Table 27 | N-gram tracing with Scene III.iii from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 17 | 23
2-grams | 138 | 169

Scene III.iii from The Jew of Malta presents six more 3-grams in common with the 

Marlowian corpus (23) than with that of the other candidate (17). In addition, there is a 

dramatic difference of more than thirty points between the number of 2-grams that the 

scene shares with the Marlowian reference corpus (169) and those that it shares with the 

corpus of the Bard (138).  
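
The kind of count reported in these tables can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure implemented in ALTXA: it assumes that distinct n-gram types are counted (not repeated occurrences), that texts are lowercased and stripped of punctuation, and all function names are hypothetical. The strings are toy examples, not the scene or the reference corpora.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only word-like tokens (assumed normalisation)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_types(tokens, n):
    """Set of distinct n-grams (as tuples) found in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_counts(scene, corpora, sizes=(3, 2)):
    """Count the distinct n-grams that the scene shares with each reference corpus."""
    scene_tokens = tokenize(scene)
    counts = {}
    for n in sizes:
        scene_grams = ngram_types(scene_tokens, n)
        counts[n] = {
            name: len(scene_grams & ngram_types(tokenize(text), n))
            for name, text in corpora.items()
        }
    return counts


# Toy strings for illustration only; they are not the scene or the corpora.
toy_scene = "Nay, you shall pardon me"
toy_corpora = {
    "candidate_a": "you shall pardon me now",
    "candidate_b": "nay I will not",
}
print(shared_counts(toy_scene, toy_corpora))
```

In this toy case the scene shares two 3-grams and three 2-grams with candidate_a and none with candidate_b, which mirrors the comparison made in the table on a much smaller scale.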

The scene also shares the 5-gram nay you shall pardon me with the Marlowian corpus, 

which can be divided into 2 4-grams. The use of nay instead of other linguistic items 

could be seen as an idiolectal choice of the author or as a dialectal or context-dependent 

form, as happens with ay and yea (see Section 3.4.4). On the other hand, the scene does 

not share any 5-grams with the corpus of Shakespeare. 

Since Scene III.iii from The Jew of Malta shares more 3-grams and 2-grams with the 

Marlowian corpus and there is also a relatively distinctive 5-gram in common between 

the scene and his corpus, it seems highly probable that it was written by him, and

the study has therefore been effective. 



Scene IV.v from Marlowe’s The Jew of Malta (532 words) 

Table 28 shows the 3-grams and 2-grams that Scene IV.v from The Jew of Malta shares 

with the Marlowian reference corpus from which it has been extracted and with that of 

Shakespeare. After such results are discussed, the qualitative analysis of the common 4-

grams will be presented. 

Table 28 | N-gram tracing with Scene IV.v from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 25 | 42
2-grams | 175 | 216

Scene IV.v from The Jew of Malta has seventeen more 3-grams in common with the 

Marlowian corpus (42) than with the corpus of the Bard (25). Furthermore, the 

quantification of the common 2-grams reveals that the scene shares 216 with the reference 

corpus of Marlowe and 175, that is, forty-one fewer, with the corpus of Shakespeare, which

stands as a notable difference. 

The scene also has 4 4-grams in common with the corpus of Marlowe. These are and 

I shall die, in my power to, and when he comes and send me three hundred, which do not 

seem to be especially distinctive.  

On the other hand, it shares the 4-gram I cannot do it with the Shakespearean corpus, 

which is a highly frequent expression and, as a matter of fact, the 2-gram I cannot can be 

found multiple times in both reference corpora. 

According to the study, it seems highly probable that Scene IV.v from The Jew of 

Malta was written by Marlowe, given the clarity of the quantitative analysis of the 

common 3-grams and 2-grams and the fact that it also shares more 4-grams with the 

corpus of Marlowe, even though they are not distinctive. In other words, this study has 

successfully attributed the text to its author. 

Scene V.i from Marlowe’s The Jew of Malta (762 words) 

The study of Scene V.i from The Jew of Malta constitutes the last one of this stage of the 

pre-study. The results of the quantitative analysis of the common 3-grams and 2-grams 

between the scene and the two reference corpora will be presented in Table 29, which 



will then be discussed. This will be complemented by the qualitative analysis of the

common 5-grams and 4-grams. 

Table 29 | N-gram tracing with Scene V.i from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 3 | 10
3-grams | 35 | 65
2-grams | 248 | 289

As can be observed in Table 29, while the scene shares 3 4-grams with the corpus of 

Shakespeare, it has seven more in common, that is, 10, with the Marlowian corpus, which

is a large number for a text of this length. The scene also has 35 3-grams in common with the

Shakespearean corpus, whereas it presents 65, that is, thirty more, with the reference 

corpus of Marlowe. Lastly, there is a remarkable difference of more than forty points if 

the number of 2-grams that the scene shares with the corpus of Shakespeare (248) is 

compared to those that it has in common with the Marlowian corpus (289). 
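
The margins quoted in this paragraph follow directly from Table 29 and can be checked with a small Python sketch. The data structure and the function name are assumptions made for illustration; only the counts come from the table.

```python
# Counts copied from Table 29: (Shakespearean corpus, Marlowian corpus).
table_29 = {
    "4-grams": (3, 10),
    "3-grams": (35, 65),
    "2-grams": (248, 289),
}


def margins(table):
    """For each n-gram size, report which corpus leads and by how many n-grams."""
    result = {}
    for size, (shakespeare, marlowe) in table.items():
        difference = marlowe - shakespeare
        if difference > 0:
            leader = "Marlowe"
        elif difference < 0:
            leader = "Shakespeare"
        else:
            leader = "tie"
        result[size] = (leader, abs(difference))
    return result


for size, (leader, margin) in margins(table_29).items():
    print(f"{size}: {leader} leads by {margin}")
```

The same check can be run on any of the other tables by replacing the counts.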

Moreover, while the scene does not have any 5-grams in common with the corpus of 

Shakespeare, it shares 3 with the Marlowian corpus. These are my lord and here they, 

once more away with him and and I know not that, which do not seem to be distinctive 

constructions.  

In the light of the findings provided by the study, it seems highly probable that Scene 

V.i from The Jew of Malta was written by Marlowe, given the clarity of the results of the 

quantitative analysis of the 4-grams, 3-grams and 2-grams in common and the fact that 

the scene also shares 3 5-grams with his reference corpus, even though they are not 

particularly distinctive. Therefore, this study has achieved its goal effectively. 

Conclusions derived from the second stage of the pre-study 

Five Shakespearean and five Marlowian scenes of between 500 and 950 words have been 

analysed as if they were disputed texts to evaluate if n-gram tracing can determine their 

authorship correctly. Eight of these scenes have been successfully attributed to their 

author, whereas the results derived from the study of the two remaining scenes have been 

inconclusive.  



This method has proved to be 80% effective in this linguistic context

and, in those cases in which the scenes have not been associated with their author, they 

have not been misattributed either, which is of vital importance. For these two reasons, 

n-gram tracing will be used to analyse the scenes of Arden of Faversham that contain 

between 500 and 950 words.  
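
The two figures that support this decision, the rate of correct attributions and the absence of misattributions, can be tallied as in the following sketch. The outcome labels are hypothetical names chosen for illustration; the counts (eight correct attributions and two inconclusive results) are those reported above.

```python
from collections import Counter

# Outcomes of the ten scenes of between 500 and 950 words, as reported above.
outcomes = ["correct"] * 8 + ["inconclusive"] * 2

tally = Counter(outcomes)
total = len(outcomes)

effectiveness = tally["correct"] / total
# Counter returns 0 for labels that never occur, such as "misattributed".
misattribution_rate = tally["misattributed"] / total

print(f"effectiveness: {effectiveness:.0%}")
print(f"misattribution rate: {misattribution_rate:.0%}")
```

Keeping inconclusive results separate from misattributions makes explicit that the method failed to decide in two cases but never pointed to the wrong author.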

5.3.3. N-gram tracing with scenes of between 1,100 and 1,700 words 

This stage of the pre-study, which will be built upon the same methodological

foundations as the previous ones, will analyse the authorship of five Shakespearean and

five Marlowian randomly selected scenes whose length ranges from 1,100 to 1,700 words. 

Scene I.i from Shakespeare’s Richard III (1,243 words) 

The 4-grams, 3-grams and 2-grams that Scene I.i from Richard III shares with the two 

reference corpora can be observed in Table 30. After such results are commented, the 

larger n-grams in common will be revealed and analysed from a qualitative perspective. 

Table 30 | N-gram tracing with Scene I.i from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 10 | 3
3-grams | 82 | 66
2-grams | 412 | 397

Table 30 shows that Scene I.i from Richard III shares more 4-grams, 3-grams and 2-

grams with the Shakespearean reference corpus from which it has been extracted than 

with the corpus of Marlowe. The scene has 10 4-grams in common with the 

Shakespearean corpus and only 3, that is, seven fewer, with the Marlowian corpus. 

Furthermore, while it has 82 3-grams in common with the corpus of the Bard, it presents

sixteen fewer in common (66) with that of the other candidate. Similarly, there is a

difference of fifteen points between the 2-grams that the scene shares with the 

Shakespearean corpus (412) and with the corpus of Marlowe (397). 

There is also an 8-gram in common between the scene and the Shakespearean corpus, 

which is I do beseech your grace to pardon me. This is the largest n-gram in common 

that has been found in the pre-study so far and it seems to be a robust authorship marker 

not only because of its length, but also because of the inclusion of the auxiliary do between the



subject and the verb to emphasize the construction, which reflects a conscious linguistic 

choice of the author. The division of this 8-gram generates 2 7-grams, 3 6-grams and 4 5-

grams in common between the scene and the corpus of the Bard.  
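
The number of shorter n-grams that result from dividing a longer one follows from a simple rule: a sequence of n words contains n - k + 1 contiguous sequences of k words. The following Python sketch, whose function name is hypothetical, applies that rule to the 8-gram discussed above.

```python
def sub_ngrams(ngram, k):
    """Return every contiguous k-word sequence contained in a longer n-gram."""
    words = ngram.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


eight_gram = "I do beseech your grace to pardon me"

for k in (7, 6, 5):
    parts = sub_ngrams(eight_gram, k)
    print(f"{len(parts)} {k}-grams: {parts}")
```

The sketch yields 2 7-grams, 3 6-grams and 4 5-grams, in line with the figures given in the text.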

The clarity of the quantitative analysis of the common 4-grams, 3-grams and 2-grams 

combined with the presence of a highly distinctive 8-gram shared between the scene and 

the Shakespearean corpus provides solid evidence that it is highly probable that

Scene I.i from Richard III was written by him, and the study has therefore achieved its goal

effectively. 

Scene II.ii from Shakespeare’s Richard III (1,214 words) 

Table 31 includes the number of 3-grams and 2-grams that Scene II.ii from Richard III 

shares with the Shakespearean corpus from which it has been extracted and with the 

corpus of Marlowe. Afterwards, the 5-grams and 4-grams that the scene has in common 

with them will be qualitatively analysed. 

Table 31 | N-gram tracing with Scene II.ii from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 62 | 48
2-grams | 379 | 324

As can be observed in the table presented above, while the scene has 62 3-grams in 

common with the Shakespearean corpus, it shares fourteen fewer (48) with the corpus of

Marlowe. Furthermore, there is a dramatic difference of more than fifty points between 

the number of 2-grams that the scene shares with the corpus of the Bard (379) and with 

the Marlowian corpus (324). 

Scene II.ii from Richard III also has the 5-gram to reap the harvest of in common 

with the Shakespearean corpus, which includes the uncommon lexical words reap and 

harvest and therefore appears to be a distinctive construction. 

In addition, it has 7 4-grams in common with the corpus of Shakespeare. These are, 

apart from the 2 4-grams that derive from the division of the abovementioned 5-gram, 

will you go to, I hope the king, and make me die, God will revenge it and who shall hinder 

me. Among these 4-grams, it is worth mentioning that make me die appears as an 

imperative construction both in the scene and the reference corpus, and God will revenge 



it seems to be highly distinctive for the combination of the words God and revenge, which 

reflects a specific perception that the author has about God. 

The scene presents 3 common 4-grams with the Marlowian corpus, which are and so 

do I, and so will I and what noise is this. None of these constructions seems to be a solid 

authorship marker. 

Given the clarity of the results of the quantitative analysis of the common 3-grams 

and 2-grams, as well as the number of 5-grams and 4-grams that the scene shares with the 

Shakespearean corpus and how distinctive they are, it seems highly probable that it was

written by him, and therefore this study has successfully achieved its objective. 

Scene I.i from Shakespeare’s Richard II (1,605 words) 

Scene I.i from Richard II has been removed from the reference corpus where it belongs 

and the n-grams that it shares with such corpus and with that of Marlowe have been 

identified by ALTXA. The number of 3-grams and 2-grams that the scene shares with the 

two reference corpora can be observed in Table 32, which will be later commented. 

Afterwards, the 4-grams in common will be qualitatively analysed. 
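
Removing the scene from its own reference corpus matters because a text trivially shares all of its n-grams with a corpus that contains it. The following Python sketch illustrates the idea under stated assumptions: the corpus is represented as a dictionary of scene identifiers and texts, the function names are hypothetical, and the two strings are toy examples, not the actual scenes.

```python
def ngram_set(text, n):
    """Set of distinct n-grams (as tuples) in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def reference_ngrams(corpus, n, held_out=None):
    """Collect the n-grams of every scene in the corpus except the held-out one."""
    grams = set()
    for scene_id, text in corpus.items():
        if scene_id != held_out:
            grams |= ngram_set(text, n)
    return grams


# Toy corpus for illustration only.
toy_corpus = {
    "scene_a": "thou art a traitor",
    "scene_b": "against the duke of york",
}
disputed = toy_corpus["scene_a"]

# Without removal, the disputed scene matches itself inside the corpus.
with_scene = len(ngram_set(disputed, 2) & reference_ngrams(toy_corpus, 2))
# With removal, only genuine overlaps with the other scenes are counted.
without_scene = len(ngram_set(disputed, 2)
                    & reference_ngrams(toy_corpus, 2, held_out="scene_a"))

print(with_scene, without_scene)
```

Collecting n-grams scene by scene also avoids creating spurious n-grams across the boundary between two scenes, which is a design choice of this sketch and not a documented feature of ALTXA.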

Table 32 | N-gram tracing with Scene I.i from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 70 | 58
2-grams | 461 | 415

Scene I.i from Richard II shares more 3-grams with the Shakespearean corpus than with 

that of the other candidate by a margin of twelve points (70 vs. 58). Furthermore, there is 

a dramatic difference of forty-six points if the number of 2-grams that the scene has in 

common with the corpus of the Bard (461) is compared to those that it shares with the 

Marlowian corpus (415). 

The scene also has one more 4-gram in common with the corpus of Shakespeare than 

with that of the other candidate. The 3 4-grams that Scene I.i from Richard II shares with 

the Shakespearean corpus are against the duke of and thou art a traitor, which do not 

seem to be particularly distinctive, and the kindred of the, which contains the word kindred,

which is not present in the corpus of the other candidate. 



The 2 4-grams that the scene shares with the corpus of Marlowe are of the king and, 

which seems to be a highly frequent construction in texts of this nature, and be rul’d by 

me, which should not be seen as a solid marker either, since the 2-gram rul’d by can be 

found multiple times in the corpora of the two candidates. 

Taking into consideration the clarity of the quantitative analysis of the common 3-

grams and 2-grams, which has been slightly reinforced by the qualitative analysis of the 

shared 4-grams, it seems highly probable that Scene I.i from Richard II was written by 

Shakespeare, and thus this study has effectively attributed the text to its author.  

Scene II.ii from Shakespeare’s Richard II (1,150 words) 

The results of the quantitative analysis of the 4-grams, 3-grams and 2-grams that Scene 

II.ii from Richard II shares with the two reference corpora will be presented in Table 33, 

which will be later commented and complemented by the qualitative analysis of the 

common 6-grams and 5-grams. 

Table 33 | N-gram tracing with Scene II.ii from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 15 | 7
3-grams | 82 | 67
2-grams | 372 | 332

The table shows that Scene II.ii from Richard II shares eight more 4-grams with the 

Shakespearean corpus from which it has been extracted than with the corpus of the other 

candidate (15 vs. 7). It also has fifteen more 3-grams in common with the Shakespearean 

corpus (82) than with the corpus of Marlowe (67), and there is a dramatic difference of

forty points if the number of 2-grams that the scene shares with the corpus of the Bard 

(372) is compared to those that it has in common with the Marlowian corpus (332). 

Scene II.ii from Richard II has the 6-gram here comes the duke of York in common 

with the Shakespearean corpus, which includes the frequent construction here comes the 

duke and the 2-gram of York, which can be also found in the corpus of Marlowe. 

Apart from the 2 5-grams that derive from the division of the abovementioned 6-gram, 

the scene also shares the 5-gram myself I cannot do it with the Shakespearean corpus, 

which does not seem to be a distinctive combination of words. 



On the other hand, the scene shares the 5-gram it may be so but with the Marlowian 

corpus, which seems like an ordinary construction and therefore it should not be seen as 

a solid authorship marker either. 

According to this study, it seems highly probable that Scene II.ii from Richard II was 

written by Shakespeare, given the clarity of the results of the quantitative analysis of the 

common 4-grams, 3-grams and 2-grams, which has been reinforced by the presence of a 

6-gram and a few 5-grams in common between the scene and his reference corpus, even 

though they are not particularly distinctive. Therefore, the study has effectively associated 

the scene with its author. 

Scene II.iii from Shakespeare’s Richard II (1,377 words) 

The last Shakespearean scene selected for this stage of the pre-study is Scene II.iii from 

Richard II. The number of 4-grams, 3-grams and 2-grams that it has in common with the 

reference corpora of the two candidates of the study can be observed in Table 34, which 

will be commented. Afterwards, the common 5-grams will be qualitatively analysed. 

Table 34 | N-gram tracing with Scene II.iii from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 14 | 10
3-grams | 97 | 84
2-grams | 494 | 440

The scene has more 4-grams in common with the Shakespearean corpus than with that of 

Marlowe by a narrow margin of four points (14 vs. 10). It also has 97 3-grams in common 

with the corpus of the Bard and thirteen fewer (84) with the Marlowian corpus. Lastly, there

is a remarkable difference of more than fifty points if the number of 2-grams that the 

scene shares with the Shakespearean corpus (494) is compared to those that it has in 

common with the corpus of Marlowe (440). 

According to the analysis conducted by ALTXA, Scene II.iii from Richard II shares 

a 5-gram with each of the two reference corpora. The one that it shares with the 

Shakespearean corpus is I will go with you, whereas what would you have me is the one 

that it has in common with the corpus of Marlowe. Neither of these constructions seems to

be distinctive. 



Even though the qualitative analysis of the shared 5-grams does not associate the 

authorship of Scene II.iii from Richard II with either of the two candidates of the study, the

clarity of the quantitative analysis of the 4-grams, 3-grams and 2-grams in common 

suggests that it is highly probable that it was written by Shakespeare.  

In conclusion, the five Shakespearean scenes of between 1,100 and 1,700 words that 

have been analysed using n-gram tracing have been correctly attributed to their author 

with a high degree of certainty. The next five scenes will be extracted from the corpus of 

the other candidate. 

Scene I.i from Marlowe’s Edward II (1,588 words) 

The number of 4-grams, 3-grams and 2-grams that Scene I.i from Edward II shares with 

the Marlowian corpus and with that of Shakespeare can be observed in Table 35, which 

will be discussed and complemented by the qualitative analysis of the 6-grams and 5-

grams in common. 

Table 35 | N-gram tracing with Scene I.i from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 9 | 14
3-grams | 78 | 117
2-grams | 489 | 551

There is a relatively narrow margin of five points between the 4-grams that the scene 

shares with the Marlowian corpus (14) and with the corpus of the Bard (9). The scene 

also has 117 3-grams in common with the corpus of Marlowe and 78 with the 

Shakespearean corpus, which is a notable difference of almost forty points. Lastly, while 

the scene shares 551 2-grams with the Marlowian corpus, it presents sixty-two fewer in

common with the corpus of Shakespeare (489), which reflects the clarity with which this 

analysis associates the scene with its author.  

Scene I.i from Edward II has more 6-grams and 5-grams in common with the 

Shakespearean corpus than with the corpus of Marlowe, which contrasts with the results 

of the quantitative analysis presented earlier. While the scene has no 6-grams in common 

with the Marlowian corpus, it shares the 6-gram I cannot nor I will not with the corpus of 

Shakespeare, which does not seem to be particularly distinctive. 



Apart from the 2 5-grams that derive from the division of the abovementioned 6-gram, 

the scene shares another 5-gram with the Shakespearean corpus, which is to be reveng’d 

on thee. This construction is almost identical to the 4-gram to be reveng’d on that the 

scene shares with the Marlowian corpus, so it should not be seen as a robust

authorship marker.  

On the other hand, the scene presents the 5-gram the favourite of a king in common 

with the Marlowian corpus, which does not appear to be a distinctive combination of 

words either.  

Scene I.i from Edward II shares a few more 6-grams and 5-grams with the corpus of 

Shakespeare than with that of Marlowe, but none of them seems to be a robust marker if 

they are analysed from a qualitative perspective. Nevertheless, the quantitative analysis 

of the shared 4-grams, 3-grams and 2-grams associates the authorship of the scene with 

Marlowe with such clarity that it seems highly probable that it was written by him, and 

hence this study has been successful. 

Scene III.ii from Marlowe’s Edward II (1,401 words) 

The number of 3-grams and 2-grams in common between Scene III.ii from Edward II and 

the two reference corpora can be observed in Table 36. After the results presented in the 

table are discussed, the common 5-grams and 4-grams will be revealed and qualitatively 

analysed. 

Table 36 | N-gram tracing with Scene III.ii from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 73 | 74
2-grams | 421 | 447

Table 36 shows that the number of 3-grams that Scene III.ii from Edward II shares with 

the two reference corpora is almost identical, although it has one more in common with 

the Marlowian corpus than with that of the other candidate (74 vs. 73). The scene also 

has more 2-grams in common with the corpus of Marlowe, but the difference is notable 

in this case, given that it shares 447 with his corpus and 421, that is, twenty-six fewer, with

the corpus of the Bard. 



ALTXA has identified a 5-gram in common between the scene and each of the two 

reference corpora. The 5-gram that it shares with the Marlowian corpus is with the king 

of France, which does not seem to be distinctive, since France is mentioned multiple 

times in the reference corpora of the two playwrights and, as a matter of fact, the 4-gram 

the king of France can be also found in the Shakespearean corpus.  

The 5-gram that the scene has in common with the corpus of Shakespeare is lord I 

take my leave, which seems like a conventional way of bidding farewell, so it does

not appear to be a distinctive marker either. 

Scene III.ii from Edward II also presents 9 4-grams in common with the Marlowian 

corpus and 7 with the corpus of the Bard. The 4-grams that it shares with the corpus of 

Marlowe and that are not derived from the division of the 5-gram discussed

above are I here create thee, long live king Edward, drink your fill and, and by my

father’s, ’gainst law of arms, undertake to carry him and the king and do. Among these

4-grams, I here create thee stands out as a remarkable linguistic choice, given that it holds

a certain metaphorical meaning, and ’gainst law of arms seems to be an unusual

combination of words. 

The 7 4-grams that the scene shares with the Shakespearean corpus are, apart from 

those derived from the division of the 5-gram that has been mentioned earlier, the king of 

France, in the field and, the earl of Pembroke, my good lord for and yea my good lord. 

The 4-gram that seems to be the most distinctive of the group is yea my good lord, since 

the use of the word yea instead of yes can be seen as a linguistic choice of the author, 

although this linguistic form is more dialectal or context-dependent than idiolectal (see 

Section 3.4.4). 

The quantitative analysis of the common 3-grams and 2-grams associates Scene III.ii 

from Edward II with the Marlowian corpus. In addition, the scene also shares more 4-

grams with the corpus of Marlowe than with that of the other candidate, and some of these 

are relatively distinctive, so it seems highly probable that it was written by him. 

Hence, this study has been able to attribute the authorship of the scene to its author 

effectively. 

 

Scene V.i from Marlowe’s Edward II (1,226 words) 

Table 37 shows the number of 3-grams and 2-grams that Scene V.i from Edward II shares 

with the reference corpora of the two candidates of the study. The discussion of these 

results will be followed by the qualitative analysis of the common 4-grams. 

Table 37 | N-gram tracing with Scene V.i from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 68 | 80
2-grams | 389 | 397

The scene shares 80 3-grams with the corpus of Marlowe, which is twelve more than

those that it has in common with the Shakespearean corpus (68). It also has eight more 2-

grams in common with the Marlowian corpus (397) than with the corpus of Shakespeare 

(389), which stands as a narrow margin for a study of this kind. 

Scene V.i from Edward II has one more 4-gram in common with the Shakespearean 

corpus, with which it has 9, than with the Marlowian corpus from which it has been 

extracted. The 8 4-grams that the scene shares with the Marlowian corpus are I am a king, 

I know not but, my lord the king, my most gracious lord, man of noble birth, out of my 

sight, stay for rather than and what are you mov’d, which do not seem to stand as 

distinctive linguistic choices.  

The 4-grams that it shares with the corpus of the Bard are of my wrongs that, the name 

of king, upon my head and, we take our leave, me to my son, out of my sight, be guilty of 

so, my lord the king and I am a king. None of these constructions seems to be distinctive 

either and, as a matter of fact, some of them are similar or identical to those that the scene 

shares with the Marlowian corpus, so they should not be seen as robust markers.  

Even though Scene V.i from Edward II shares one more 4-gram with the corpus of 

Shakespeare than with that of Marlowe, the qualitative analysis of these constructions 

does not clearly link the scene to either of the two playwrights. The quantitative analysis of

the common 3-grams and 2-grams associates the authorship of the scene with the corpus 

of Marlowe, but not with the same degree of certainty as on other occasions. Taking all 

this into account, it seems slightly probable that the scene was written by Marlowe, and

the study has therefore achieved its goal. 



Scene I.i from Marlowe’s The Jew of Malta (1,424 words) 

The number of 3-grams and 2-grams that Scene I.i from The Jew of Malta shares with the 

corpus from which it has been extracted and with that of the other candidate can be 

observed in Table 38. After such results are discussed, the 5-grams and 4-grams in 

common will be revealed and qualitatively analysed. 

Table 38 | N-gram tracing with Scene I.i from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 42 | 54
2-grams | 348 | 382

As can be seen in Table 38, while Scene I.i from The Jew of Malta presents 54 3-grams 

in common with the corpus of Marlowe, it shares twelve fewer (42) with that of

Shakespeare. Furthermore, there is a notable difference of thirty-four points between the 

2-grams that the scene has in common with the corpus of Marlowe (382) and those that it 

shares with the Shakespearean corpus (348). 

Scene I.i from The Jew of Malta also has 4 4-grams in common with the Marlowian 

corpus, which are all the wealth in, but who comes here, it may be so and and here he 

comes. None of these 4-grams seems to contain a characteristic linguistic choice. 

The scene shares the 5-gram serve as well as I, which seems to be relatively unusual, 

with the Shakespearean corpus. It also shares 8 4-grams with his corpus, which are, apart 

from those that derive from the division of the abovementioned 5-gram, but who comes 

here, it may be so, in the council-house to, and here he comes, oft have I heard and the 

bowels of the. Among such 4-grams, in the council-house to seems to be relatively rare 

due to the presence of the compound word council-house. Moreover, oft have I heard 

appears to be a solid marker if the use of the word oft is taken into account, as well as the 

inversion of the words have and I, given that this construction appears in an affirmative 

sentence both in the extracted scene and the corpus of Shakespeare. The fact that the scene 

shares more 5-grams and 4-grams with the corpus of the Bard and that some of these are

distinctive contrasts with the results of the quantitative analysis presented earlier. 

In sum, the quantitative analysis of the common 3-grams and 2-grams attributes the 

authorship of Scene I.i from The Jew of Malta to Marlowe with great clarity, but the 



qualitative analysis of the larger n-grams in common, which is meant to complement these 

results, associates the scene with the Shakespearean corpus. Even though this quantitative 

analysis has accomplished its objective effectively, the qualitative analysis of the 

common 5-grams and 4-grams undermines the certainty of the final verdict. 

Consequently, according to this study, it seems slightly probable that the scene was 

written by Marlowe.  

Scene IV.iv from Marlowe’s The Jew of Malta (1,135 words) 

The last sample included in this stage of the pre-study is Scene IV.iv from The Jew of 

Malta. Table 39 shows the 4-grams, 3-grams and 2-grams that this text shares with the 

reference corpora of Shakespeare and Marlowe.  

Table 39 | N-gram tracing with Scene IV.iv from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 4 | 12
3-grams | 52 | 90
2-grams | 345 | 416

While Scene IV.iv from The Jew of Malta has 12 4-grams in common with the corpus of 

Marlowe, which stands as a large number for a study of this kind, it shares only 4 with 

the corpus of the other candidate. It also presents 90 3-grams in common with the 

Marlowian corpus and 52 with the corpus of the Bard, which creates a remarkable 

difference of thirty-eight points. The difference of the shared 2-grams is even more 

significant, given that the scene has seventy-one more in common with the corpus of 

Marlowe (416) than with the Shakespearean corpus (345). 

There are no larger n-grams in common between the scene and either of the two

reference corpora, so there will not be a qualitative analysis in this study. 

Given the clarity with which the quantitative analysis of the common 4-grams, 3-

grams and 2-grams associates Scene IV.iv from The Jew of Malta with the corpus of 

Marlowe, it seems highly probable that it was written by him, and thus this study has 

successfully achieved its goal. 

 

Conclusions derived from the third stage of the pre-study 

Five Shakespearean and five Marlowian scenes of between 1,100 and 1,700 words have 

been analysed as if their authorship was disputed to assess the effectiveness of n-gram 

tracing in this linguistic context. The ten scenes have been correctly attributed to their 

author, eight of which with a high degree of certainty, and thus this method will be used 

to analyse the authorship of the scenes of Arden of Faversham of the same length. 

5.3.4. N-gram tracing with scenes of almost 2,000 words or more 

This last stage of the pre-study will analyse the authorship of scenes of almost 2,000 

words or more applying the same methodological criteria that have been followed in the 

previous ones. Since there are only three Marlowian scenes of more than 2,000 words, 

Scene II.ii from Edward II, which has 1,995 words, will also be included. Therefore, five

randomly selected Shakespearean scenes of more than 2,000 words and the only four 

Marlowian scenes of almost 2,000 words or more will be analysed independently to assess 

the validity of n-gram tracing in this linguistic context. 

Scene I.iii from Shakespeare’s Richard III (2,845 words) 

The first Shakespearean scene that has been randomly selected for this stage of the pre-

study is Scene I.iii from Richard III. The number of 4-grams, 3-grams and 2-grams that 

it shares with the two reference corpora will be presented in Table 40, which will be 

discussed and complemented by the qualitative analysis of the common 5-grams. 

Table 40 | N-gram tracing with Scene I.iii from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 21 | 20
3-grams | 173 | 147
2-grams | 844 | 744

As can be observed in Table 40, Scene I.iii from Richard III presents only one more 4-

gram in common with the corpus of the Bard (21) than with the Marlowian corpus (20). 

Nevertheless, if the number of 3-grams that it shares with the Shakespearean corpus is 

compared to those that it has in common with the corpus of Marlowe, a notable difference 

of twenty-six points can be found (173 vs. 147). There is also a dramatic difference of 

one hundred 2-grams between those that it shares with the Shakespearean corpus (844) 


 

and those that it has in common with the Marlowian corpus (744), which reflects the 

clarity with which this quantitative analysis associates the scene with the Bard. 
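The kind of count reported in Table 40 can be illustrated with a minimal Python sketch. It assumes that a text is a list of word tokens and that each distinct n-gram is counted once; the function names, the toy sentences and these counting choices are illustrative assumptions and do not describe the internal procedure of ALTXA.

```python
def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_counts(scene, corpus_a, corpus_b, sizes=(4, 3, 2)):
    """For each n, count the distinct n-grams the scene shares with each corpus."""
    rows = {}
    for n in sizes:
        scene_grams = ngrams(scene, n)
        rows[n] = (
            len(scene_grams & ngrams(corpus_a, n)),  # shared with corpus A
            len(scene_grams & ngrams(corpus_b, n)),  # shared with corpus B
        )
    return rows


# Toy data (hypothetical, not taken from the reference corpora).
scene = "my lord we will not go with you".split()
corpus_a = "will you go with me my lord".split()
corpus_b = "my lord we will not yield".split()

result = shared_counts(scene, corpus_a, corpus_b)
for n, (with_a, with_b) in result.items():
    print(f"{n}-grams: {with_a} shared with A, {with_b} shared with B")
```

In this toy case the scene shares more n-grams of every size with corpus B, which is the type of margin that the tables of this section report for real scenes.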

Scene I.iii from Richard III has one more 5-gram in common with the Shakespearean 

corpus, with which it shares 5, than with the Marlowian corpus. The 5 5-grams that the 

scene shares with the corpus of Shakespeare are we wait upon your grace, will you go 

with me, good time of day unto, here come the lords of and vain flourish of my fortune. 

Among these 5-grams, vain flourish of my fortune appears to be a robust authorship 

marker, given that the description of a flourishing fortune holds a metaphorical meaning 

and therefore stands as a distinctive linguistic choice of the author. 

On the other hand, the 4 5-grams that the scene has in common with the Marlowian 

corpus are my lord we will not, I can no longer hold, in presence of the king and my lord 

as much as, which do not seem to be particularly distinctive. 

Given the clarity of the quantitative analysis of the common 4-grams, 3-grams and 2-

grams and the manner in which this has been reinforced by the qualitative analysis of the 

shared 5-grams, it seems highly probable that Scene I.iii from Richard III was written by 

Shakespeare, and thus this study has accomplished its objective successfully. 

Scene IV.iv from Shakespeare’s Richard III (4,268 words) 

The number of 4-grams, 3-grams and 2-grams that this scene shares with the reference 

corpora of the two candidates of the study can be observed in Table 41, which will be 

commented on. Afterwards, the common 5-grams between the scene and the two reference

corpora will be listed and qualitatively analysed. 

Table 41 | N-gram tracing with Scene IV.iv from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 16 | 9
3-grams | 173 | 150
2-grams | 1,053 | 963

While the scene has 16 4-grams in common with the corpus of the Bard, it shares 9 with 

the Marlowian corpus, which stands as an acceptable margin of seven points for a study 

of this kind. It also shares 173 3-grams with the Shakespearean corpus and twenty-three 

fewer (150) with the corpus of the other candidate. The largest difference can be found if


 

the number of 2-grams that the scene has in common with the two corpora is observed, 

since there is a gap of ninety points between those that it shares with the corpus of 

Shakespeare (1,053) and those that it has in common with the Marlowian corpus (963).

Scene IV.iv from Richard III also shares 2 5-grams with the corpus of the Bard and 1 

with that of Marlowe. The 2 5-grams that it has in common with the Shakespearean corpus 

are no my good lord therefore and will my lord with all, whereas the one that it shares 

with the corpus of the other candidate is I pray that I may. None of these 5-grams seems 

to stand as a distinctive linguistic choice. 

Taking into consideration the results of the quantitative analysis of the common 4-

grams, 3-grams and 2-grams, it seems highly probable that the scene was written by 

Shakespeare, even though the qualitative analysis of the common 5-grams has not 

reinforced such results. Therefore, this study has attributed the scene to its author 

correctly. 

Scene V.iii from Shakespeare’s Richard III (2,726 words) 

Table 42 shows the number of 4-grams, 3-grams and 2-grams that Scene V.iii from 

Richard III shares with the Shakespearean corpus from which it has been extracted and 

with that of the other candidate. After these results are discussed, the common 5-grams 

will be revealed and qualitatively analysed. 

Table 42 | N-gram tracing with Scene V.iii from Richard III 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 13 | 10
3-grams | 123 | 114
2-grams | 701 | 631

There is a narrow difference of three points between the 4-grams that the scene shares 

with the corpus of Shakespeare (13) and with that of Marlowe (10). Furthermore, the 

scene only shares nine more 3-grams with the Shakespearean corpus (123) than with the 

corpus of Marlowe (114), which also stands as a low margin for a study of this kind. In 

contrast, there is a remarkable difference of seventy points between the 2-grams that it 

shares with the corpus from which it has been extracted and with that of the other 

candidate (701 vs. 631). 


 

Scene V.iii from Richard III shares the 2 5-grams of the house of Lancaster and upon 

the stroke of four with the Shakespearean corpus. The first of these 5-grams includes the 

2-gram of Lancaster, which also appears multiple times in the corpus of Marlowe, so

it should not be seen as a solid marker. Nevertheless, upon the stroke of four seems

to be a more unusual construction.  

The scene has a 5-gram in common with the Marlowian corpus, which is I warrant 

you my lord. This expression does not appear to be distinctive and, as a matter of fact, the 

3-gram you my lord can be found many times in the reference corpora of both candidates. 

It seems highly probable that the scene was written by Shakespeare, given that the 

quantitative analysis of the common 4-grams, 3-grams and 2-grams associates its 

authorship with him, and this has been slightly reinforced by the qualitative analysis of 

the shared 5-grams. Hence, the study has effectively linked the scene to the corpus from 

which it has been extracted. 

Scene I.iii from Shakespeare’s Richard II (2,402 words) 

The number of 4-grams, 3-grams and 2-grams that Scene I.iii from Richard II shares with 

the two reference corpora will be presented in Table 43 and then discussed. This will

be complemented by the qualitative analysis of the 5-grams in common. 

Table 43 | N-gram tracing with Scene I.iii from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 11 | 11
3-grams | 93 | 104
2-grams | 661 | 618

Scene I.iii from Richard II presents 11 4-grams in common with the two reference corpora 

and it shares eleven more 3-grams with the corpus of Marlowe (104) than with the 

Shakespearean corpus from which it has been extracted (93), which is surprising. In 

contrast, it has more 2-grams in common with the corpus of the Bard than with the corpus 

of Marlowe by a considerable margin of forty-three points (661 vs. 618). As a result of 

this difference in the number of 2-grams in common, it could be said that the quantitative 

analysis suggests that Shakespeare is slightly more likely to have written the scene, 

although the results are far from being as clear as on other occasions. 


 

According to the analysis conducted by ALTXA, Scene I.iii from Richard II shares 

the 5-gram of you my noble cousin with the corpus of Shakespeare, which does not appear 

to be distinctive. 

The scene also has 4 5-grams in common with the Marlowian corpus, that is, three 

more than with the corpus of Shakespeare. These are the duty that you owe, lord I take 

my leave, by the grace of God and hardy as to touch the. Among these, hardy as to touch 

the seems to be the most unusual linguistic choice of the group. Nevertheless, the 4-gram 

hardy as to touch can also be found in the Shakespearean corpus, so this 5-gram

should not be seen as a solid authorship marker.

In sum, the quantitative analysis shows that Scene I.iii from Richard II shares the 

same number of 4-grams with the two reference corpora, more 3-grams with the corpus 

of Marlowe by a relatively narrow margin of eleven points and more 2-grams with the 

Shakespearean corpus by a notable difference of forty-three points. The qualitative 

analysis of the common 5-grams reveals that, even though the scene shares a few more 

with the corpus of Marlowe, none of them is distinctive. According to the study, it therefore

remains uncertain whether the scene was written by Shakespeare or Marlowe. This means

that, although the method has not attributed the sample to the wrong candidate, it has not

been able to associate it clearly with its author either.

Scene II.i from Shakespeare’s Richard II (2,372 words) 

The last Shakespearean scene that has been randomly selected for this stage of the pre-

study is Scene II.i from Richard II. The number of 3-grams and 2-grams that it shares 

with the two reference corpora can be observed in Table 44. After such results are 

discussed, the qualitative analysis of the common 4-grams will be provided. 

Table 44 | N-gram tracing with Scene II.i from Richard II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 81 | 75
2-grams | 634 | 576

There is a difference of seven points if the 3-grams that the scene shares with the corpus 

of Shakespeare (82) are compared to those that it has in common with the Marlowian 

corpus (75). In addition, it has 636 2-grams in common with the Shakespearean corpus 


 

and 576 with that of Marlowe, which creates a dramatic difference of sixty points and 

reflects the effectiveness with which this quantitative analysis associates the scene with 

the corpus from which it has been extracted. 

Since Scene II.i from Richard II shares fewer than 10 4-grams with each of the two

reference corpora, these will be analysed from a qualitative perspective. The scene has 8

4-grams in common with the Shakespearean corpus, which are the earl of Wiltshire, I do 

beseech your, to the duke of, so much for that, commends him to your, on my life and, and 

the hand of and the king is not. Among these 4-grams, I do beseech your seems to be the 

most distinctive of the group because of the presence of the auxiliary do between the

subject and the verb to emphasize the construction. The other 4-grams include 

combinations of words that are frequent in plays of this kind or similar to others that 

appear in the corpus of the other candidate. For instance, the 4-gram the earl of Wiltshire 

is almost identical to the 3-gram earl of Wiltshire that can be found in the corpus of 

Marlowe. 

The scene only shares 2 4-grams with the Marlowian corpus. These are of the king 

for, which is mainly formed by function words, and if it be so, which includes the 2-gram it

be, which also appears multiple times in the corpus of Shakespeare. 

It seems highly probable that Scene II.i from Richard II was written by Shakespeare, 

given the clarity of the results of the quantitative analysis of the common 3-grams and 2-

grams, which has been reinforced by the qualitative analysis of the shared 4-grams. 

In brief, four of the five Shakespearean scenes that have been studied in this stage of 

the pre-study have been correctly attributed to their author, whereas the analysis of the 

remaining scene has led to inconclusive results. The next four scenes will be extracted 

from the corpus of Marlowe. 

Scene I.iv from Marlowe’s Edward II (3,329 words) 

The number of 4-grams, 3-grams and 2-grams that Scene I.iv from Edward II has in 

common with the Marlowian corpus and with that of Shakespeare will be presented in 

Table 45 and then discussed. Afterwards, the larger n-grams that the scene shares with

the two reference corpora will be revealed and qualitatively analysed. 

 
 

Table 45 | N-gram tracing with Scene I.iv from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 22 | 39
3-grams | 181 | 267
2-grams | 881 | 986

The scene has seventeen more 4-grams in common with the Marlowian corpus (39) than 

with the corpus of the other candidate (22), which is a remarkable difference for a study 

of this kind. In addition, there is a gap of eighty-six points if the number of 3-grams that 

the scene shares with the corpus of Marlowe (267) is compared to those that it has in 

common with the Shakespearean corpus (181). There is also a dramatic difference of one 

hundred and five 2-grams if those that it shares with the Marlowian corpus (986) are 

compared to the ones that it has in common with the corpus of the Bard (881). The results 

of this quantitative analysis associate the scene with the corpus from which it has been 

extracted with a high degree of certainty. 

There is a 7-gram in common between the scene and the Marlowian corpus. This 7-

gram, which is lord I come to bring you news, could be seen as a solid marker because of 

its length and the fact that it contains four lexical words. In addition to the 2 6-grams and 

3 5-grams that derive from the division of this 7-gram, the scene also shares with the 

Marlowian corpus the 5-grams it shall be done my, whether I will or no and with the earl 

of Kent, which do not seem to be particularly distinctive. 

Scene I.iv from Edward II has 4 5-grams in common with the Shakespearean corpus. 

These are I will not yield to, my gracious lord I come, what would you have me and the 

duty that you owe, which seem to be frequent expressions in texts of this nature. 

Given the clarity of the quantitative analysis of the shared 4-grams, 3-grams and 2-

grams, which has been reinforced by the qualitative analysis of the larger n-grams in 

common, it seems highly probable that Scene I.iv from Edward II was written by 

Marlowe, and hence this study has correctly attributed the sample to its author. 

Scene II.ii from Marlowe’s Edward II (1,995 words) 

Table 46 shows the number of 4-grams, 3-grams and 2-grams that Scene II.ii from 

Edward II shares with the corpus from which it has been extracted and with that of the 


 

other candidate. After such results are discussed, the larger n-grams in common will be 

listed and analysed from a qualitative perspective. 

Table 46 | N-gram tracing with Scene II.ii from Edward II 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 10 | 31
3-grams | 119 | 164
2-grams | 569 | 649

There is a difference of twenty-one 4-grams between those that the scene shares with the 

corpus of Marlowe (31) and with the Shakespearean corpus (10). It is surprising that a 

text of 1,995 words has 31 4-grams in common with one of the corpora, which reflects 

the clarity with which the analysis links the scene to its author. There is also a gap of 

forty-five points if the 3-grams that the scene shares with the corpus of Marlowe (164) 

are compared to those that it has in common with the corpus of Shakespeare (119). Lastly, 

the scene presents eighty more 2-grams in common with the Marlowian corpus (649) than 

with the corpus of the Bard (569). 

While Scene II.ii from Edward II has no larger n-grams in common with the corpus 

of Shakespeare, it shares the 8-gram will be the ruin of the realm and with the corpus of 

Marlowe. This stands as the largest construction in common that has been found in the 

four stages of the pre-study.  

It also has 3 7-grams in common with the corpus of Marlowe, which are the 2 7-grams 

derived from the division of the abovementioned 8-gram and lord I come to bring you 

news. This construction could be seen as distinctive because of the number of words that 

it has and the fact that it is mainly formed by lexical words, as pointed out in the study of 

the previous scene, where this 7-gram was present as well. 

Apart from the 6-grams and 5-grams that derive from the division of the 8-gram and 

the 7-gram that have just been discussed, the scene shares with the Marlowian corpus

the 5-grams that I love thee well, which seems to be a frequent construction in texts of 

this nature, and I fear me he is, which appears to be relatively unusual. In total, Scene II.ii 

from Edward II shares 1 8-gram, 3 7-grams, 5 6-grams and 9 5-grams with the Marlowian 

corpus from which it has been extracted. 
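The totals given above can be checked with a short Python sketch, offered only as an illustration. It assumes that the n-grams derived from the division of a longer n-gram are all of its contiguous sub-sequences of five words or more; the helper name sub_ngrams is hypothetical.

```python
from collections import Counter


def sub_ngrams(phrase, min_n=5):
    """Return every contiguous word sequence of at least min_n words,
    including the full phrase itself."""
    words = phrase.split()
    return {
        tuple(words[i:i + n])
        for n in range(min_n, len(words) + 1)
        for i in range(len(words) - n + 1)
    }


# Larger n-grams that the scene shares with the Marlowian corpus, as listed in the text.
shared = [
    "will be the ruin of the realm and",  # 8-gram
    "lord I come to bring you news",      # 7-gram
    "that I love thee well",              # 5-gram
    "I fear me he is",                    # 5-gram
]

all_grams = set().union(*(sub_ngrams(p) for p in shared))
tally = Counter(len(g) for g in all_grams)
for n in sorted(tally, reverse=True):
    print(f"{n}-grams: {tally[n]}")
```

Under this assumption the tally reproduces the figures stated in the text: 1 8-gram, 3 7-grams, 5 6-grams and 9 5-grams.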


 

The clarity with which the quantitative analysis of the shared 4-grams, 3-grams and 

2-grams associates Scene II.ii from Edward II with the corpus of Marlowe and the manner 

in which these results have been reinforced by the qualitative analysis of the larger n-grams

in common suggest that it is highly probable that it was written by him, and thus

this study has successfully accomplished its objective.

Scene I.ii from Marlowe’s The Jew of Malta (2,929 words) 

The results of the quantitative analysis of the 4-grams, 3-grams and 2-grams that Scene 

I.ii from The Jew of Malta has in common with the two reference corpora will be 

presented in Table 47. Once the results of the table have been discussed, the 6-grams and

5-grams in common will be revealed and analysed from a qualitative perspective.

Table 47 | N-gram tracing with Scene I.ii from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 9 | 23
3-grams | 120 | 163
2-grams | 726 | 817

There is a notable variation of fourteen points between the 4-grams that the scene shares 

with the Marlowian corpus (23) and with that of the other candidate (9). Furthermore, 

while the scene has 163 3-grams in common with the corpus of Marlowe, it shares 120, 

that is, forty-three fewer, with the Shakespearean corpus. There is also a dramatic difference

of ninety-one points if the number of 2-grams that the scene shares with the corpus of

Marlowe (817) is compared to those that it has in common with the corpus of the Bard (726),

which stands as solid proof to suggest that this quantitative analysis has been successful. 

The scene also has a 6-gram in common with the Marlowian corpus, which is too or 

it shall go hard. This 6-gram, which can be divided into 2 5-grams, appears to be 

relatively distinctive because of the combination of words go hard, which cannot be found 

in the Shakespearean corpus.  

The 5 5-grams that the scene has in common with the Marlowian corpus are, apart 

from those that derive from the division of the aforementioned 6-gram, I must be forced 

to, o my lord we will and my lord and here they. Among these 5-grams, I must be forced 


 

to seems to stand out as the most distinctive of the group, whereas the others include the 

2-gram my lord, which is highly frequent in texts of this nature. 

On the other hand, the only 5-gram in common between Scene I.ii from The Jew of 

Malta and the Shakespearean corpus is it may be so but, which seems to be a common 

combination of words. 

Given the clarity of the results of the quantitative analysis of the common 4-grams, 3-

grams and 2-grams, which has been reinforced by the qualitative analysis of the shared 

6-grams and 5-grams, it seems highly probable that Scene I.ii from The Jew of Malta was 

written by Marlowe. Therefore, this study has attributed the scene to its author effectively. 

Scene II.iii from Marlowe’s The Jew of Malta (3,034 words) 

The last sample included in this stage of the pre-study is Scene II.iii from The Jew of 

Malta. The number of 4-grams, 3-grams and 2-grams that it shares with the two reference 

corpora can be observed in Table 48, which will be later discussed. This will be 

complemented by the qualitative analysis of the shared 6-grams and 5-grams. 

Table 48 | N-gram tracing with Scene II.iii from The Jew of Malta 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
4-grams | 6 | 31
3-grams | 144 | 212
2-grams | 785 | 898

Table 48 shows that there is a notable difference of twenty-five points between the 4-

grams that Scene II.iii from The Jew of Malta shares with the corpus from which it has 

been extracted (31) and with that of the other candidate (6). The scene also shares 212

3-grams with the corpus of Marlowe and 144, that is, sixty-eight fewer, with the corpus of

Shakespeare. The largest difference can be found if the number of 2-grams that it has in

common with the two reference corpora is compared, since it shares 898 with the

Marlowian corpus and 785, that is, one hundred and thirteen fewer, with the corpus of the

Bard. 

While Scene II.iii from The Jew of Malta does not share any larger n-grams with the 

corpus of Shakespeare, it presents the 6-gram too or it shall go hard in common with the 

Marlowian corpus. This 6-gram, which can also be found in the scene that has been


 

previously analysed, appears to be relatively distinctive because of the 2-gram go hard 

that it includes, which is not present in the Shakespearean corpus.  

The scene also shares 4 5-grams with the corpus of Marlowe. These are, apart from 

those that derive from the division of the abovementioned 6-gram, whether I will or no 

and o my lord we will, which do not seem to stand as unusual linguistic choices. 

According to the study, it seems highly probable that Scene II.iii from The Jew of 

Malta was written by Marlowe, since the quantitative analysis of the common 4-grams, 

3-grams and 2-grams and the qualitative analysis of the shared 6-grams and 5-grams 

clearly associate the scene with his reference corpus. 

Conclusions derived from the fourth stage of the pre-study 

This stage of the pre-study has used n-gram tracing to analyse five Shakespearean and four

Marlowian scenes of almost 2,000 words or more as if their authorship were disputed.

Eight of these scenes have been correctly attributed to their author with a high degree of 

certainty, whereas the results derived from the analysis of the remaining scene could not 

associate its authorship with any of the two candidates. 

This method has had an effectiveness of 88.9% with the scenes of this group. In

addition, on the only occasion on which it has not linked a scene to its author, the results have

been inconclusive, which means that no cases of misattribution have been found. For 

these two reasons, n-gram tracing will be used to analyse the authorship of the scenes of 

Arden of Faversham of almost 2,000 words or more. 

5.3.5. Conclusions derived from Pre-study 3 

This pre-study has assessed the effectiveness of n-gram tracing to distinguish between 

Shakespearean and Marlowian scenes from plays that are not comedies and were written 

between 1590 and 1595, approximately. The first stage of the pre-study has analysed the 

authorship of ten scenes of between 100 and 450 words independently and all of them 

have been correctly attributed to their author. The second stage has studied the authorship 

of ten scenes whose length ranges from 500 to 950 words and eight of them have been 

successfully associated with their author, whereas the results derived from the analysis of 

the other two scenes have been inconclusive. The third stage has analysed ten scenes of 

between 1,100 and 1,700 words and the method has had an effectiveness of 100%, as 

happened in the first stage. The last stage of the pre-study has analysed the authorship of 


 

nine scenes of almost 2,000 words or more, eight of which have been effectively linked 

to their author, whereas the remaining scene could not be attributed to any of the two 

candidates. 

If these results are evaluated from a holistic point of view, the present pre-study has 

analysed the authorship of thirty-nine undisputed scenes with n-gram tracing and thirty-

six of them have been successfully attributed to their author, which implies an 

effectiveness of 92.3%. In addition, there have not been any cases of misattribution, which

also reflects the reliability of the method, according to the line of thought suggested by 

Grant in 2007 (see Section 3.4.4). 
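The overall figure can be reproduced from the stage results reported in this section. The following Python lines are a simple arithmetic check; the variable names are illustrative.

```python
# (scenes analysed, scenes correctly attributed) in each of the four stages,
# as reported in the conclusions of this pre-study.
stages = [(10, 10), (10, 8), (10, 10), (9, 8)]

total = sum(analysed for analysed, _ in stages)
correct = sum(attributed for _, attributed in stages)
effectiveness = round(100 * correct / total, 1)

print(f"{correct} of {total} scenes attributed correctly: {effectiveness}%")
```

The three scenes that were not attributed correspond to inconclusive results, not to attributions to the wrong candidate.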

In sum, n-gram tracing has proved to be highly effective in the four stages of the

pre-study, and therefore all the scenes of Arden of Faversham will be analysed with this

method. The following pre-study will test the reliability of the Zeta test. 

5.4. Pre-study on the conduction of the Zeta test (Pre-study 4) 

This last pre-study will evaluate the effectiveness of the Zeta test to distinguish between 

Shakespearean and Marlowian scenes of almost 2,000 words or more (see Section 4.5.5 

for an account of the reasons why shorter scenes will not be included in the pre-study). 

Five random scenes from the Shakespearean corpus and the only four scenes of the 

Marlowian corpus of such length will be extracted and analysed independently. If this 

method can correctly associate these scenes with the corpus from which they have been 

removed, it will have proved effective enough to be used later in the analysis of the scenes of

Arden of Faversham. The stop list with all the words ignored as potential markers for the 

conduction of all the Zeta tests of this thesis has been included in Appendix 3. This list

of ignored words has been completed gradually by repeating the process of obtaining

the lists of 500 markers of the two authors in every Zeta test until all these lists of markers

contained only distinctive words. This means that common function words

whose usage does not reflect an idiolectal choice, proper names and lexical words that 

are heavily dependent on play-specific contexts and therefore do not reflect an authorial 

pattern have been discarded for the conduction of this method (see Section 4.5.5). 

If it is not discernible at first sight which centroid of the two clusters formed by the

fragments of the reference corpora is closer to the fragments of the disputed text when

these are placed on a coordinate axis, the formula $|\overrightarrow{AB}| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ will

be used to measure the exact distances.
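The distance comparison described above can be sketched in Python as follows. The sketch assumes that each centroid is the arithmetic mean of the coordinates of the fragments of a cluster; the function names and the coordinates are hypothetical and are not taken from any of the figures of this thesis.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def closer_cluster(disputed, cluster_a, cluster_b):
    """Return the label of the nearer centroid and both Euclidean distances.
    A tie is resolved in favour of cluster B (an arbitrary choice)."""
    d_a = math.dist(disputed, centroid(cluster_a))
    d_b = math.dist(disputed, centroid(cluster_b))
    return ("A" if d_a < d_b else "B", d_a, d_b)


# Hypothetical fragment coordinates for two reference clusters and a disputed scene.
cluster_a = [(0.10, 0.02), (0.12, 0.04)]
cluster_b = [(0.02, 0.10), (0.04, 0.12)]
label, d_a, d_b = closer_cluster((0.08, 0.03), cluster_a, cluster_b)
print(label, round(d_a, 4), round(d_b, 4))
```

Here math.dist computes the same square root of summed squared differences that the formula expresses.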


 

Once the scenes are analysed independently, the results derived from the pre-study 

will be interpreted from a holistic perspective and its main findings will be summarized. 

5.4.1. Zeta test with scenes of almost 2,000 words or more 

Scene I.ii from Shakespeare’s Richard III (2,062 words) 

The first Shakespearean scene that has been randomly selected to test the effectiveness of 

the Zeta test is Scene I.ii from Richard III, which has been removed from the reference 

corpus where it belongs and introduced in ALTXA as a disputed text. The software has 

calculated the 500 markers of each of the two reference corpora after the removal of the 

scene and placed every fragment in which these and the scene itself have been divided on 

a coordinate axis, as can be observed in Figure 4 (see Section 4.5.5 for a detailed 

explanation of the procedures underlying the calculation of such markers and the 

conduction of this test in general). 

Figure 4 | Zeta test with Scene I.ii from Richard III 

 
 

The red squares in Figure 4 represent the fragments in which the remaining 

Shakespearean corpus has been divided, the blue circles stand as the fragments derived 

from the division of the corpus of Marlowe and Scene I.ii from Richard III is represented 

by the black triangle. All these fragments contain 2,000 words or 2,000 plus the residual 

words at the end of a text. The value on the horizontal axis of each fragment is derived 

from the division of the number of Shakespearean markers that it contains by its number 

of distinct words, whereas its value on the vertical axis stands as the division of the 

Marlowian markers that it has by its number of distinct words. The division of the number 

of markers that the fragments have by their number of distinct words is made to 

compensate for the dissimilar size that some of them present (see Section 4.5.5).
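The way in which a fragment obtains its two coordinates can be sketched in Python as follows. This is a minimal illustration under two assumptions that the text does not confirm: that the number of markers a fragment contains means the number of distinct marker words present in it, and that the residual words at the end of a text are appended to its last fragment. The function names and the toy marker lists are hypothetical.

```python
def split_fragments(tokens, size=2000):
    """Split a token list into fragments of `size` words; any residual words
    at the end of the text are kept in the last fragment."""
    n_full = max(1, len(tokens) // size)
    fragments = [tokens[i * size:(i + 1) * size] for i in range(n_full)]
    fragments[-1] = tokens[(n_full - 1) * size:]  # last fragment absorbs the residue
    return fragments


def zeta_point(fragment, markers_x, markers_y):
    """Coordinates of a non-empty fragment: distinct markers of each list
    divided by the number of distinct words in the fragment."""
    distinct = set(fragment)
    x = len(distinct & markers_x) / len(distinct)
    y = len(distinct & markers_y) / len(distinct)
    return (x, y)


# Toy example with hypothetical marker lists.
frags = split_fragments(list(range(4500)), size=2000)
point = zeta_point("thou art a villain thou art".split(),
                   markers_x={"thou", "villain"},
                   markers_y={"art"})
print([len(f) for f in frags], point)
```

Dividing by the number of distinct words is what keeps a fragment of 2,000 words comparable with a longer final fragment.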

The black triangle that represents Scene I.ii from Richard III is clearly closer to the 

Shakespearean cluster, and thus this Zeta test has correctly linked the scene to the corpus 

from which it has been extracted. 

Scene I.iii from Shakespeare’s Richard III (2,845 words) 

Scene I.iii from Richard III has been introduced in ALTXA as a disputed text and, after 

calculating the 500 markers of the remaining Shakespearean corpus and those of the 

Marlowian corpus, the software has placed on a coordinate axis the fragments of 2,000 

words or more in which the three samples have been divided. 

As can be observed in Figure 5, the black triangle that represents Scene I.iii from 

Richard III is considerably closer to the cluster of red squares that stand as the fragments 

in which the Shakespearean corpus has been divided than to the blue circles that represent 

the fragments in which the corpus of Marlowe has been divided. Therefore, the Zeta test 

has successfully associated the authorship of the scene with the corpus from which it has 

been removed. 

 
 

Figure 5 | Zeta test with Scene I.iii from Richard III 

 
Scene V.iii from Shakespeare’s Richard III (2,726 words) 

Scene V.iii from Richard III has been extracted from the Shakespearean corpus and 

analysed as a disputed text by ALTXA using the Zeta test. The graphical representation 

of the results derived from this study can be observed in Figure 6. 

Figure 6 shows that Scene V.iii from Richard III, which is represented by the black 

triangle, is notably close to the cluster formed by the red squares that stand as the 

fragments in which the Shakespearean corpus has been divided, whereas the Marlowian 

cluster can be found in a distant position from them. Therefore, this Zeta test has correctly 

attributed the authorship of Scene V.iii from Richard III to Shakespeare. 

 
 

Figure 6 | Zeta test with Scene V.iii from Richard III 

 
Scene I.iii from Shakespeare’s Richard II (2,402 words) 

The authorship of Scene I.iii from Richard II, which has been extracted from the corpus 

where it belongs, has been analysed with ALTXA using the Zeta test. Once the 500 

markers of the remaining Shakespearean corpus and those of the corpus of Marlowe have 

been obtained, the fragments in which these corpora have been divided and the only 

fragment that stands as the disputed scene have been placed on a coordinate axis that can 

be observed in Figure 7. 

 
 

Figure 7 | Zeta test with Scene I.iii from Richard II 

 
Figure 7 shows that there is great proximity between the black triangle that stands as 

Scene I.iii from Richard II and the cluster of red squares formed by the fragments in 

which the Shakespearean corpus has been divided. In contrast, the blue circles that 

represent the fragments in which the corpus of Marlowe has been divided are considerably 

far from the black triangle. According to this Zeta test, the scene was written by 

Shakespeare, and thus the test has been successful.

Scene IV.i from Shakespeare’s Richard II (2,628 words) 

Scene IV.i from Richard II is the last Shakespearean scene included in this pre-study. 

After the calculation of the 500 markers of the Shakespearean corpus from which it has 

been extracted and those of the corpus of Marlowe, the samples have been divided into

fragments of 2,000 words or more and placed on a coordinate axis in terms of the markers 

from the two lists that they contain, as can be observed in Figure 8. 


 

Figure 8 | Zeta test with Scene IV.i from Richard II 

 
The blue circles of Figure 8 that represent the fragments in which the Marlowian corpus 

has been divided are distant from the black triangle and the red squares, which are 

relatively close and represent Scene IV.i from Richard II and the fragments in which the 

corpus of Shakespeare has been divided, respectively. It seems evident that this Zeta test 

associates the authorship of the scene with the Shakespearean corpus. 

In sum, the Zeta test has effectively associated the five Shakespearean scenes that 

have been analysed as disputed texts with the corpus from which they have been 

extracted. The next four scenes of the pre-study will be taken from the corpus of Marlowe. 

 
 

Scene I.iv from Marlowe’s Edward II (3,329 words) 

The first Marlowian sample included in this pre-study is Scene I.iv from Edward II, which 

has been extracted from the reference corpus where it belongs and analysed as a disputed

text using the Zeta test. The software ALTXA has quantified the 500 markers of the 

remaining Marlowian corpus, as well as those of the corpus of the other candidate, and it 

has placed the fragments in which these corpora and the scene have been divided on a 

coordinate axis according to the markers of both lists that they contain. 

Figure 9 | Zeta test with Scene I.iv from Edward II 

 
Figure 9 shows that the black triangle that represents Scene I.iv from Edward II is notably 

closer to the cluster of blue circles formed by the fragments in which the Marlowian 

corpus has been divided than to the Shakespearean cluster. Therefore, this Zeta test has 

correctly linked the scene to the corpus of Marlowe. 

 
 

Scene II.ii from Marlowe’s Edward II (1,995 words) 

Scene II.ii from Edward II has been removed from the corpus of Marlowe and ALTXA 

has analysed its authorship using the Zeta test. The software has elaborated a list of the 

500 markers of the remaining Marlowian corpus and a list of the 500 markers of the 

corpus of Shakespeare. Afterwards, it has placed on a coordinate axis the fragments of 

2,000 words or more in which these corpora and the scene have been divided in terms of 

the number of markers of each type that they present. 

Figure 10 | Zeta test with Scene II.ii from Edward II 

 
As can be seen in Figure 10, the results derived from this Zeta test associate the scene 

with the corpus from which it has been extracted with more clarity than on any previous 

occasion. The black triangle that represents Scene II.ii from Edward II is almost within 

the same area occupied by the Marlowian cluster, which reflects the effectiveness of the 

method. 


 

Scene I.ii from Marlowe’s The Jew of Malta (2,929 words) 

Scene I.ii from The Jew of Malta has been extracted from the Marlowian corpus and 

analysed following the same methodological principles described in the study of previous 

scenes. The graphical representation of the results provided by ALTXA can be observed 

in Figure 11. 

Figure 11 | Zeta test with Scene I.ii from The Jew of Malta 

 
It is evident at a glance that the black triangle that represents Scene I.ii from The

Jew of Malta is notably close to the Marlowian cluster and that these are distant from the 

area occupied by the red squares that represent the Shakespearean fragments. The Zeta 

test conducted by ALTXA has determined the authorship of the scene correctly. 

Scene II.iii from Marlowe’s The Jew of Malta (3,034 words)

The last sample selected to assess the effectiveness of the Zeta test is Scene II.iii from 

The Jew of Malta, which has been analysed by ALTXA following the same process 


 

described during the study of previous scenes. The position on the coordinate axis of the 

fragment that represents this scene and that of the fragments in which the two reference 

corpora have been divided can be observed in Figure 12. 

Figure 12 | Zeta test with Scene II.iii from The Jew of Malta 

 
Figure 12 shows that the Zeta test has effectively accomplished its objective, since the 

black triangle that stands as Scene II.iii from The Jew of Malta is considerably closer to 

the centroid of the Marlowian cluster than to that of the Shakespearean cluster. As in 

previous studies, it is not necessary to calculate the exact distances, given that the results 

are evident at a glance.

5.4.2. Interpretation of the results 

The reader may be surprised by the fact that the fragments of the scenes that have been 

analysed as disputed texts are not within the same area occupied by the cluster of the 

reference corpus from which they have been extracted, but in a near position, which 


 

contrasts with studies such as Kinney’s (see Appendix 2). I would like to provide the 

reason why they have occupied such positions on the coordinate axis. 

The reference corpora used by Kinney (2009) and Elliott and Greatley-Hirsch (2017) 

to analyse the authorship of Arden of Faversham were compiled with many plays from 

distinct periods, some of which are comedies, and this contrasts with one of the main 

hypotheses suggested in this thesis, which is related to the criteria for the compilation of 

the reference corpora in studies of this kind (see Chapter 4). This investigation has 

compiled the reference corpora of the two candidates with plays that have a similar tone 

to that of Arden of Faversham and were written in a similar period, and thus they include 

fewer texts than the corpora of the studies that have been previously referenced. This means

that the area of the clusters formed by their fragments will be smaller and, as a result, it 

is normal that the fragments of the disputed texts do not fall exactly within the same space. 

The graphical representations of the results of the Zeta tests conducted by the authors

mentioned above are more visually pleasing, given that almost the whole coordinate axis 

is filled with fragments. Nevertheless, most of these fragments belong to plays whose 

features are not closely related to those of the disputed text that they put into analysis, as 

has been argued in Chapter 4, which may even lead to misleading conclusions about its

authorship.  

The reliability of the approach adopted for the Zeta tests of this doctoral thesis is supported by the fact that the nine scenes from the reference corpora that have been analysed as disputed texts have been correctly attributed to their author, which means that the method has had a success rate of 100%. In addition, these studies allow the researcher

and the reader to have a reference of the type of outcome that can be considered acceptable 

for the attribution of authorship of the scenes of almost 2,000 words or more from Arden 

of Faversham that will be analysed with this method in Chapter 6. 

5.4.3. Conclusions derived from Pre-study 4 

This last pre-study has analysed the authorship of five Shakespearean and four Marlowian 

scenes of almost 2,000 words or more as if they were disputed texts with the Zeta test. 

The nine samples have been successfully associated with the reference corpus from which 

they have been extracted, and thus this method will be used to determine the likeliest 

authorship of the scenes of Arden of Faversham of such length. 


 

This pre-study has also been useful to observe the kind of outcome that can be 

expected in the analysis of the scenes of Arden of Faversham. Since the reference corpora 

of the candidates have been compiled considering certain variables that have been 

overlooked in other studies, there are differences in the graphical representation of the 

results that these Zeta tests provide in comparison with others. Nevertheless, those 

conducted following this approach have had an effectiveness of 100%, which stands as a 

reflection of its reliability. 

5.5. Summary 

This chapter has presented a series of pre-studies about the effectiveness of four 

authorship attribution methods to distinguish between Shakespearean and Marlowian 

scenes from plays that were written approximately between 1590 and 1595 and are not

comedies. These have been based on the calculation of the average number of words per 

sentence of the scenes (Pre-study 1), the calculation of their lexical richness (Pre-study 

2), n-gram tracing (Pre-study 3) and the conduction of the Zeta test (Pre-study 4). The 

first three pre-studies have included samples whose length ranges from 100 to 450 words, 

from 500 to 950, from 1,100 to 1,700 and samples whose length is similar to or greater than

2,000 words, and thus they have been divided into four stages. In contrast, the pre-study 

on the reliability of the Zeta test has only focused on scenes of almost 2,000 words or 

more (see Section 4.5.5 for a justification of this methodological decision).

The pre-studies on the quantification of the average number of words per sentence 

and the lexical richness of the scenes have assessed whether there is enough intra-author 

consistency and inter-author variation within the undisputed scenes, whereas the pre-

studies on n-gram tracing and the Zeta test have extracted scenes from the reference 

corpora to later discern if these methods could associate them with the corpus from which 

they have been removed. 

Firstly, the quantification of the average number of words per sentence has shown great intra-author variation when applied to the first groups of scenes. Even though intra-author consistency has increased slightly with the third and the fourth groups of scenes, these have not presented sufficient inter-author variation, and therefore this method has been discarded from the final case study.

Secondly, the results derived from the calculation of the lexical richness of the scenes 

have presented a similar tendency to that of the previous method. While the first groups 


 

of scenes have shown no intra-author consistency, this has increased with the size of the 

samples. In fact, the intra-author consistency of the third and the fourth group has been 

superior to that obtained by the previous method. Nevertheless, the overlapping results of 

the scenes of both playwrights have not allowed for the inclusion of this test in the final

case study. 

Thirdly, n-gram tracing has shown a high degree of effectiveness in the study of the 

four types of scenes. Furthermore, in those few cases in which this method could not 

associate the authorship of a scene with its author, it has not misattributed any of them. 

Consequently, n-gram tracing has been selected to carry out the authorship analysis of all 

the scenes of Arden of Faversham. 

Finally, the Zeta test has proved effective in analysing the authorship of Shakespearean and Marlowian samples that have almost 2,000 words or more, and therefore this method has been selected to study the scenes of Arden of Faversham of this length.

The following chapter will attribute to each scene of Arden of Faversham its likeliest 

authorship using n-gram tracing, which will be complemented by the Zeta test in those 

cases in which the disputed sample contains almost 2,000 words or more. 

  
 

CHAPTER 6 | CASE STUDY: ATTRIBUTION OF AUTHORSHIP OF THE SCENES OF ARDEN OF FAVERSHAM

This chapter will present the results derived from the study of the authorship of the scenes 

of Arden of Faversham. The methods that will be applied to each scene are those that

have proved reliable in the pre-studies that have analysed undisputed scenes

of distinct lengths written by Shakespeare and Marlowe (see Chapter 5). The scenes of

Arden of Faversham whose length is similar to or greater than 2,000 words will be analysed

with n-gram tracing and the Zeta test, whereas only n-gram tracing will be applied to

the shorter ones. The results of these studies will be provided and discussed in the same

order in which the scenes appear in the play.

6.1. Scene I.i (5,135 words) 

The first scene of Arden of Faversham is the longest of the play, with 5,135 words. 

Consequently, its authorship has been analysed with n-gram tracing, whose results will 

be provided first, and a Zeta test, which will be presented afterwards. 

Attribution of authorship of Scene I.i from Arden of Faversham with n-gram tracing 

Scene I.i from Arden of Faversham has been extracted from the play and ALTXA has 

identified the n-grams that it shares with the Shakespearean and the Marlowian corpora. 

The size of these reference corpora has been balanced so that both candidates are in equal 

conditions to become the likeliest author of the scene (see Section 4.5.4). The results

derived from the quantitative analysis of the 4-grams, 3-grams and 2-grams that it has in

common with the reference corpora are provided in Table 49 and discussed below.

This will be complemented by the qualitative analysis of the larger n-grams

in common (see Section 4.5.4 for an account of this methodological decision).

Table 49 | N-gram tracing with Scene I.i from Arden of Faversham 

Type of n-grams    Common n-grams with the    Common n-grams with the
                   Shakespearean corpus       Marlowian corpus
4-grams            24                         43
3-grams            259                        322
2-grams            1,279                      1,394

Table 49 shows that Scene I.i from Arden of Faversham has more 4-grams, 3-grams and 

2-grams in common with the Marlowian corpus than with that of Shakespeare. There is a 


 

significant difference of almost twenty 4-grams if those that the scene shares with the 

corpus of Marlowe (43) are compared with those that it has in common with the 

Shakespearean corpus (24). The scene also has sixty-three more 3-grams in common with 

the Marlowian corpus (322) than with the corpus of the Bard (259), which stands as a 

remarkable distance for a study of this kind. Furthermore, it shares 1,394 2-grams with 

the Marlowian corpus, which is a notable one hundred and fifteen more

than the ones that it has in common with the corpus of

Shakespeare (1,279).  
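The counts discussed above can be illustrated with a minimal sketch of how shared n-grams might be tallied. This is not the implementation of ALTXA: the function names, the simplified tokenisation and the choice to count distinct n-gram types rather than every occurrence are assumptions made here for illustration only.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase the text and keep runs of letters and apostrophes.
    return re.findall(r"[a-z'’]+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    # Set of distinct word n-grams found in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_count(scene: str, corpus: str, n: int) -> int:
    # Number of distinct n-grams that appear in both the scene and the corpus.
    return len(ngrams(tokenize(scene), n) & ngrams(tokenize(corpus), n))


if __name__ == "__main__":
    # Toy texts, not taken from the reference corpora.
    scene = "I know he loves me well"
    corpus = "but he loves me well enough"
    for n in (4, 3, 2):
        print(n, shared_ngram_count(scene, corpus, n))
```

Calling the hypothetical `shared_ngram_count` once per candidate corpus and per value of n would yield a table with the same layout as the one above, provided that both corpora have been balanced in size beforehand.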

Scene I.i from Arden of Faversham shares the 7-gram “I know he loves me well but” with the Shakespearean corpus. Even though this construction can be seen as relatively distinctive due to its length, it contains the expression “he loves me well”, which can also be found in the Marlowian corpus as a 4-gram and stands as a common combination of words in texts of this nature. The scene also shares with the corpus of the Bard the 2 6-grams and 3 5-grams that derive from the division of the previously referenced 7-gram, as well as the 5-gram “tell him what you say”. This 5-gram appears to be a common expression, for

which it should not be seen as a reliable idiolectal marker either. 
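The way in which a long shared n-gram gives rise to shorter ones can be made explicit with a short sketch. A sequence of L words contains L - k + 1 contiguous sequences of k words, which is why a 7-gram yields 2 6-grams and 3 5-grams. The function name `sub_ngrams` is hypothetical and is used here only for illustration.

```python
from typing import List


def sub_ngrams(ngram: str, k: int) -> List[str]:
    # All contiguous k-word sequences contained in a longer n-gram.
    words = ngram.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


if __name__ == "__main__":
    seven_gram = "I know he loves me well but"
    for k in (6, 5, 4):
        print(k, sub_ngrams(seven_gram, k))
```

This also shows why the shorter n-grams derived from a long match add no independent evidence: they are implied by the long match itself.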

On the other hand, the scene shares the 6-gram “for I had rather die than” with the reference corpus of Marlowe. The expression “I had rather die”, by which a character expresses the will to sacrifice their life, cannot be found in the Shakespearean corpus and hence it could be seen as distinctive. In other words, this 6-gram seems to contain a more unusual combination of words than the 7-gram that the scene has in common with the Shakespearean corpus. Scene I.i from Arden of Faversham also shares 3 5-grams with the Marlowian corpus, which are the 2 5-grams that derive from the division of the 6-gram that has been previously discussed and “I have it for you”, which does not seem to include a particular selection of words.

If this scene was written by Shakespeare or Marlowe, the present study suggests that 

it is highly probable that Marlowe is its author, given that the quantitative analysis of the 

common 4-grams, 3-grams and 2-grams clearly links the authorship of the scene to him, 

and these results have been reinforced by the qualitative analysis of the larger n-grams in 

common.  

 
 

Attribution of authorship of Scene I.i from Arden of Faversham with the Zeta test 

Given its length, the authorship of the scene has also been analysed with a Zeta test, whose 

graphical representation can be observed in Figure 13.  

The plays of the reference corpora of Shakespeare and Marlowe have been divided 

by ALTXA into fragments of 2,000 words and the residual words at the end of each play

have been added to its last fragment. Afterwards, ALTXA has elaborated a list of 500 

markers that appear in a considerable proportion of the Shakespearean fragments and are 

not present in many of the fragments in which the Marlowian corpus has been divided, as 

well as another list of 500 Marlowian markers (see Section 4.5.5 for a detailed explanation 

of the formulas underlying the calculation of the 500 markers for each author). The stop 

list with all the words that have been ignored as potential markers can be observed in 

Appendix 3, and Appendix 4 includes the lists of 500 markers for the conduction of the 

Zeta tests of this chapter. 
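The formulas applied by ALTXA are explained in Section 4.5.5 and are not reproduced in this chapter. As a rough illustration only, the sketch below assumes a commonly used formulation of Zeta, in which the score of a word is the proportion of fragments of one author that contain it plus the proportion of fragments of the other author that lack it. The function name, the tie-breaking rule and the toy data are hypothetical and may differ from the procedure actually followed by the software.

```python
from typing import List, Set


def zeta_markers(base_fragments: List[List[str]],
                 counter_fragments: List[List[str]],
                 stop_words: Set[str],
                 top_n: int = 500) -> List[str]:
    # Work with the set of distinct words of each fragment.
    base_sets = [set(f) for f in base_fragments]
    counter_sets = [set(f) for f in counter_fragments]
    vocabulary = set().union(*base_sets) - stop_words

    def score(word: str) -> float:
        # Assumed Zeta score: presence in base fragments + absence in counter fragments.
        present = sum(word in s for s in base_sets) / len(base_sets)
        absent = sum(word not in s for s in counter_sets) / len(counter_sets)
        return present + absent

    # Highest scores first; ties are broken alphabetically (an arbitrary choice).
    ranked = sorted(vocabulary, key=lambda w: (-score(w), w))
    return ranked[:top_n]


if __name__ == "__main__":
    # Invented fragments, not taken from the reference corpora.
    base = [["crown", "the", "blood"], ["crown", "sword"]]
    counter = [["the", "sword"], ["the", "gold"]]
    print(zeta_markers(base, counter, stop_words={"the"}, top_n=2))
```

Swapping the two lists of fragments would produce the marker list of the other candidate under the same assumption.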

The blue circles that appear on the upper left area of the coordinate axis forming a 

cluster represent the fragments in which the plays from the Marlowian reference corpus 

have been divided and their coordinates could be explained as follows. The value of the 

vertical axis stands as the number of Marlowian markers that a fragment contains divided 

by its number of distinct words, whereas the value of the horizontal axis is determined by 

the division of the number of Shakespearean markers that it has by its number of distinct 

words. The Shakespearean fragments have been placed under the same principles that 

have just been described and are represented by the red squares that can be found on the 

lower right area forming another cluster.  
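The coordinates described above can be expressed in a few lines. The sketch assumes that a marker is counted once per fragment, however many times it occurs, which seems consistent with the division by the number of distinct words but is not stated explicitly; the function name and the toy marker lists are hypothetical.

```python
from typing import List, Set, Tuple


def fragment_coordinates(fragment_tokens: List[str],
                         shakespearean_markers: Set[str],
                         marlowian_markers: Set[str]) -> Tuple[float, float]:
    # Distinct words (types) of the fragment.
    distinct = set(fragment_tokens)
    if not distinct:
        raise ValueError("the fragment contains no words")
    # Horizontal axis: Shakespearean markers present / distinct words.
    x = len(distinct & shakespearean_markers) / len(distinct)
    # Vertical axis: Marlowian markers present / distinct words.
    y = len(distinct & marlowian_markers) / len(distinct)
    return x, y


if __name__ == "__main__":
    # Invented fragment and marker lists for illustration.
    tokens = ["crown", "crown", "gold", "sword", "the"]
    print(fragment_coordinates(tokens, {"crown", "blood"}, {"gold", "sword"}))
```

Under this definition, a fragment rich in Marlowian markers and poor in Shakespearean ones lands in the upper left area, as described for the blue circles.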

Lastly, the black triangles represent the two fragments in which Scene I.i from Arden 

of Faversham has been divided. Since the scene has 5,135 words, one of the fragments 

contains 2,000 words, whereas the other one has 2,000 plus the residual 1,135 which are 

at the end of the scene. Their coordinates have also been determined by the criteria 

described in the previous paragraph. 
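The fragmentation rule described above, in which the residual words are added to the last full fragment, can be sketched as follows. The function name is hypothetical, and the treatment of texts shorter than one fragment (kept as a single fragment) is an assumption based on the way shorter scenes are handled as one disputed fragment.

```python
from typing import List


def split_into_fragments(tokens: List[str], size: int = 2000) -> List[List[str]]:
    if size <= 0:
        raise ValueError("size must be positive")
    n_full = len(tokens) // size
    if n_full == 0:
        # Assumption: a text shorter than one fragment is kept whole.
        return [list(tokens)] if tokens else []
    fragments = [list(tokens[i * size:(i + 1) * size]) for i in range(n_full)]
    # The residual words at the end are appended to the last fragment.
    fragments[-1].extend(tokens[n_full * size:])
    return fragments


if __name__ == "__main__":
    scene = ["word"] * 5135  # same length as Scene I.i
    print([len(fragment) for fragment in split_into_fragments(scene)])
```

For a text of 5,135 words this produces one fragment of 2,000 words and another of 3,135, which matches the division of Scene I.i.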

As can be observed in Figure 13, the fragments of

Scene I.i from Arden of Faversham are considerably closer to the Marlowian cluster than 

to the area occupied by that of Shakespeare, which means that the scene contains more 

markers from the Marlowian list than from that of the other candidate.  


 

Nevertheless, for the sake of clarity, the centroid of each cluster has been calculated 

by establishing the average values of the X and Y coordinates of all its fragments and the 

formula $|\overrightarrow{AB}| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ has been applied to measure the distances

between the two fragments in which Scene I.i from Arden of Faversham has been divided 

and these centroids. The distances between the two disputed fragments and the centroid 

of the Marlowian cluster on the coordinate axis are of 0.07085 and 0.0757 points, whereas 

their distances from the centroid of the Shakespearean cluster are of 0.1293 and 0.12066 

points. Therefore, this Zeta test reveals that Marlowe is the likeliest author of the scene. 
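The calculation of the centroids and the distances can be sketched as follows. The coordinates used in the example are invented for illustration and are not those plotted in the figures; the function names are likewise hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def centroid(points: List[Point]) -> Point:
    # Average of the X and Y coordinates of all the fragments of a cluster.
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def distance(a: Point, b: Point) -> float:
    # Euclidean distance between two points on the plane.
    return math.hypot(b[0] - a[0], b[1] - a[1])


def nearest_author(fragment: Point, clusters: Dict[str, List[Point]]) -> str:
    # Candidate whose cluster centroid lies closest to the disputed fragment.
    return min(clusters, key=lambda name: distance(fragment, centroid(clusters[name])))


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    clusters = {
        "Marlowe": [(0.10, 0.30), (0.20, 0.40)],
        "Shakespeare": [(0.40, 0.10), (0.50, 0.20)],
    }
    print(nearest_author((0.20, 0.30), clusters))
```

Comparing the two distances in this way makes the decision explicit in those cases in which the proximity of a fragment to a cluster is less obvious on the graph.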

Figure 13 | Zeta test with Scene I.i from Arden of Faversham 

 
In brief, the authorship of Scene I.i from Arden of Faversham has been analysed with n-

gram tracing and a Zeta test and both methods have concluded with great certainty that 

Marlowe is more likely to have written it than Shakespeare, if it was indeed written by 

one of them.  


 

6.2. Scene II.i (916 words) 

Scene II.i from Arden of Faversham contains 916 words, and therefore its authorship has only been analysed with n-gram tracing. The number of 3-grams and 2-grams that this scene shares with the Shakespearean and the Marlowian corpora is provided in Table 50. After these results are discussed, the 4-grams that it shares with them will be presented and qualitatively analysed.

Table 50 | N-gram tracing with Scene II.i from Arden of Faversham 

Type of n-grams    Common n-grams with the    Common n-grams with the
                   Shakespearean corpus       Marlowian corpus
3-grams            47                         62
2-grams            280                        294

Table 50 shows that Scene II.i from Arden of Faversham has more 3-grams and 2-grams in common with the Marlowian corpus than with the corpus of the Bard. While the scene shares 62 3-grams with the corpus of Marlowe, it has fifteen fewer in common, that is, 47, with the Shakespearean corpus, which stands as a significant distance if the fact that the text contains 916 words is taken into consideration. Furthermore, there is a difference of fourteen between the 2-grams that it shares with the Marlowian corpus (294) and those that it has in common with the corpus of Shakespeare (280).

While the scene has no larger n-grams in common with the Shakespearean corpus, it shares 6 4-grams with the corpus of Marlowe, which are “as if he had”, “I know not but”, “I must to the”, “what wilt thou give”, “and I am bound” and “me and I am”. These 4-grams are mainly formed by function words and the few lexical words that they include, such as “know” or “give”, appear to be quite frequent, and therefore it seems that none of these combinations of words is distinctive.

If Scene II.i from Arden of Faversham was written by Shakespeare or Marlowe, it 

seems highly probable that the latter is its author. The quantitative analysis of the common 

3-grams and 2-grams clearly associates the authorship of the scene with the Marlowian 

corpus, with which it also shares 6 4-grams, even though they are not particularly 

distinctive. 

 
 

6.3. Scene II.ii (1,694 words) 

The authorship of Scene II.ii from Arden of Faversham has been analysed with n-gram tracing and, even though it does not have 2,000 words, with the Zeta test. The main reason underlying this decision is that its length is close to that of the fragments into which the undisputed plays of Shakespeare and Marlowe are divided for the Zeta test, and therefore it seems sensible to include the scene in the procedure.

Attribution of authorship of Scene II.ii from Arden of Faversham with n-gram tracing 

The software ALTXA has identified the n-grams that Scene II.ii from Arden of Faversham has in common with the reference corpora of the two candidates of the study. The number of shared 3-grams and 2-grams can be observed in Table 51, which is discussed below. This will be complemented by the qualitative analysis of the 4-grams in common.

Table 51 | N-gram tracing with Scene II.ii from Arden of Faversham 

Type of n-grams    Common n-grams with the    Common n-grams with the
                   Shakespearean corpus       Marlowian corpus
3-grams            62                         85
2-grams            455                        501

The quantitative analysis of the common 3-grams and 2-grams clearly links the authorship of Scene II.ii from Arden of Faversham to Marlowe. The scene has 85 3-grams in common with the Marlowian corpus, whereas it shares twenty-three fewer, that is, 62, with the corpus of the Bard. There is also a notable difference of forty-six if the 2-grams that the scene shares with the corpus of Marlowe (501) are compared to those that it has in common with the Shakespearean corpus (455).

Scene II.ii from Arden of Faversham shares 5 4-grams with the Shakespearean corpus, which are “no more but this”, “that you have ta’en”, “and for her sake”, “as well as I” and “that thou hast done”. These 4-grams seem to be common combinations of words that should not be seen as solid authorship markers.

The scene shares 6 4-grams with the Marlowian corpus, that is, one more than with the corpus of the Bard. These are “not so much as”, “I must have more”, “what’s that to thee”, “as well as I”, “you will let my” and “if he be not”. Among these constructions, “if he be not” appears to be an uncommon combination of words that includes the subjunctive form “be” of the verb to be instead of the indicative “is”, which could be seen as an idiolectal choice.

According to this study, if Scene II.ii from Arden of Faversham was written by Shakespeare or Marlowe, it seems highly probable that the latter wrote it, given that the clear results of the quantitative analysis of the common 3-grams and 2-grams have been slightly reinforced by the qualitative analysis of the 4-grams in common.

Attribution of authorship of Scene II.ii from Arden of Faversham with the Zeta test 

As pointed out earlier, the fact that the scene contains almost 2,000 words makes it suitable for the Zeta test, since its size is comparable to that of the fragments into which the reference corpora of Shakespeare and Marlowe are divided for the test.

Figure 14 reflects the position on the coordinate axis of the fragments of the two 

reference corpora as well as that of the fragment that represents Scene II.ii from Arden of 

Faversham. The criteria for the division of these fragments and the determination of their 

coordinates are the same as in the study of Scene I.i (see Appendix 3 for the stop list with

all the words ignored as potential markers and Appendix 4 for the lists of 500 markers of

the two candidates).

The Marlowian fragments, which are represented by blue circles, form a cluster on 

the upper left area, whereas the Shakespearean fragments, which are represented by red 

squares, create a cluster on the lower right area. It is evident that the black triangle that 

stands as the fragment of 1,694 words of Scene II.ii from Arden of Faversham is 

considerably closer to the Marlowian cluster. The distance between this fragment and the

centroid of the Marlowian cluster is 0.06293 points, whereas its distance from the

centroid of the Shakespearean cluster is 0.13372 points. According to this Zeta test,

Christopher Marlowe is the likeliest author of the text.

The two methods employed to analyse the authorship of the text have presented 

consistent results and thus it could be argued that, if Scene II.ii from Arden of Faversham 

was written by Shakespeare or Marlowe, the latter is more likely to have written it. 

 
 

Figure 14 | Zeta test with Scene II.ii from Arden of Faversham 

 
6.4. Scene III.i (822 words) 

The first scene of the third act of Arden of Faversham contains 822 words, and therefore its authorship has only been studied with n-gram tracing. The results derived from the quantitative analysis of the common 3-grams and 2-grams between the scene and the two reference corpora can be observed in Table 52. After these results are discussed, the 4-grams that the scene shares with them will be qualitatively analysed.

Table 52 | N-gram tracing with Scene III.i from Arden of Faversham 

Type of n-grams    Common n-grams with the    Common n-grams with the
                   Shakespearean corpus       Marlowian corpus
3-grams            36                         30
2-grams            232                        222

 
 

Table 52 shows that Scene III.i from Arden of Faversham has a few more 3-grams and 2-grams in common with the Shakespearean corpus than with that of Marlowe, which breaks with the trend of the previous scenes. There is a narrow difference of six if the 3-grams that the scene shares with the Shakespearean corpus (36) are compared to those that it has in common with the Marlowian corpus (30). Furthermore, while it has 232 2-grams in common with the corpus of the Bard, it shares ten fewer, that is, 222, with the corpus of Marlowe, which stands again as a relatively low difference if the length of the scene is taken into consideration.

The analysis of the larger n-grams in common conducted by ALTXA reveals that Scene III.i from Arden of Faversham shares the 4-gram “the hour of death” with the Shakespearean corpus, which holds a metaphorical meaning and therefore can be seen as distinctive.

The scene also has 2 4-grams in common with the Marlowian corpus, which are “let us go to”, which seems to be a frequent expression, and “I like not this”. The negative construction “I like not” appears to be a conscious choice of the author, given that he could have written “I do not like”, and it cannot be found in the corpus of Shakespeare. This means that the scene has one distinctive construction of four words in common with each of the two candidates of the study.

The study suggests that if Scene III.i from Arden of Faversham was written by 

Shakespeare or Marlowe, the Bard is slightly more likely to be its author, given that the 

quantitative analysis of the shared 3-grams and 2-grams associates the scene with him by 

a narrow margin, while the qualitative analysis of the larger n-grams in common seems 

to be inconclusive. 

6.5. Scene III.ii (516 words) 

Scene III.ii from Arden of Faversham contains 516 words, and therefore its authorship has only been studied with n-gram tracing. The results derived from the quantitative analysis of the 3-grams and 2-grams that the scene has in common with the two reference corpora are provided in Table 53, which is discussed below. This will be complemented by the qualitative analysis of the shared 4-grams.

 
 

Table 53 | N-gram tracing with Scene III.ii from Arden of Faversham 

Type of n-grams    Common n-grams with the    Common n-grams with the
                   Shakespearean corpus       Marlowian corpus
3-grams            25                         37
2-grams            171                        175

As can be observed in Table 53, there is a difference of twelve between the 3-grams that the scene shares with the Marlowian corpus (37) and those that it has in common with the corpus of the Bard (25), which can be seen as relatively significant, given that the scene is only 516 words long. The number of 2-grams that it shares with the two reference corpora is more balanced, since it has only four more in common with the corpus of Marlowe (175) than with that of Shakespeare (171), which stands as a narrow margin.

The analysis conducted by ALTXA reveals that Scene III.ii from Arden of Faversham only shares the 4-gram “it will not be” with the corpus of Shakespeare, which seems to be a frequent construction and, as a matter of fact, can also be found in the Marlowian corpus, as will be mentioned in the next paragraph.

The scene has the 8-gram “and then let me alone to handle him” in common with the Marlowian corpus, which seems to be highly distinctive not only because of its length, but also because of the combination of words “let me alone to handle”, which cannot be found in the Shakespearean corpus. This 8-gram can be divided into 2 7-grams, 3 6-grams, 4 5-grams and 5 4-grams. In addition to these 5 4-grams, the scene also shares with the Marlowian corpus the 4-grams “it will not be”, “the pleasures of the” and “of the day and”, which do not seem to be distinctive.

According to this study, if Scene III.ii from Arden of Faversham was written by 

Shakespeare or Marlowe, it seems highly probable that the latter is its author. The results 

of the quantitative analysis of the shared 3-grams and 2-grams, which associate the 

authorship of the scene with Marlowe, have been reinforced by the presence of a highly 

distinctive 8-gram in common. 

6.6. Scene III.iii (357 words) 

Table 54 shows the number of 3-grams and 2-grams that Scene III.iii from Arden of 

Faversham shares with the reference corpora of the two candidates of the study. After 


 

these results are discussed, the qualitative analysis of the shared 4-grams will be 

conducted. 

Table 54 | N-gram tracing with Scene III.iii from Arden of Faversham 

Type of n-grams    Common n-grams with the    Common n-grams with the
                   Shakespearean corpus       Marlowian corpus
3-grams            10                         19
2-grams            90                         99

The scene has 19 3-grams in common with the Marlowian corpus, while it shares 10, that
is, nine fewer, with the corpus of the Bard. It also shares nine more 2-grams with the corpus
of Marlowe (99) than with the Shakespearean corpus (90). These differences appear to be

acceptable if the fact that the scene only contains 357 words is taken into consideration. 
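The kind of count reported in the table above can be illustrated with a minimal sketch of n-gram tracing. This is not the ALTXA software: the tokenisation rule, the function names and the two toy reference texts are assumptions made only for illustration, and the sketch assumes that each distinct n-gram is counted once.

```python
import re

def ngram_set(text, n):
    """Return the set of distinct word n-grams in a text (assumed tokenisation)."""
    tokens = re.findall(r"[a-z’']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(disputed, reference, n):
    """N-grams of size n that a disputed text has in common with a reference corpus."""
    return ngram_set(disputed, n) & ngram_set(reference, n)

# Toy texts built from phrases quoted in this chapter; they are NOT the real corpora.
scene = "it may be so but you shall go with me"
reference_a = "it may be so"
reference_b = "it may be so and you shall go with me"

# For each n, the pair (shared with reference_a, shared with reference_b).
counts = {
    n: (len(shared_ngrams(scene, reference_a, n)),
        len(shared_ngrams(scene, reference_b, n)))
    for n in (3, 2)
}
print(counts)  # {3: (2, 5), 2: (3, 7)}
```

In this toy case the disputed text shares more 3-grams and 2-grams with the second reference text, which is the type of difference that the quantitative analysis weighs against the length of each scene.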

According to the analysis conducted by ALTXA, Scene III.iii from Arden of
Faversham has 2 4-grams in common with the Shakespearean corpus, which are "I’ll bear
you company" and "it may be so". The first one should not be considered distinctive because
the 3-gram "bear you company" can also be found in the Marlowian corpus. Similarly, the
4-gram "it may be so" is also present in the Marlowian corpus, as will be revealed in the
following paragraph.

The scene shares the 5-gram "you shall go with me" with the corpus of Marlowe, which
does not seem to be a particular selection of words. It also has 3 4-grams in common with
his corpus, which are those that derive from the division of the aforementioned 5-gram
and "it may be so", which can also be found in the Shakespearean corpus, as pointed out

earlier. 
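The reasoning applied to the 4-gram "I’ll bear you company" in this section, which is discounted because one of its components also occurs in the corpus of the other candidate, can be sketched as follows. The function names and the one-line stand-in for the other corpus are hypothetical; the qualitative judgement in the thesis is made by the analyst, and this sketch only lists the overlapping components that such a judgement would take into account.

```python
def ngrams(tokens, n):
    """Contiguous n-word sequences in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_with_other_corpus(phrase, other_corpus, min_n=2):
    """List the components of a shared n-gram (from longest to shortest)
    that also occur in the corpus of the other candidate."""
    phrase_tokens = phrase.lower().split()
    corpus_tokens = other_corpus.lower().split()
    found = []
    for n in range(len(phrase_tokens), min_n - 1, -1):
        corpus_ngrams = set(ngrams(corpus_tokens, n))
        found.extend(" ".join(g) for g in ngrams(phrase_tokens, n) if g in corpus_ngrams)
    return found

# Hypothetical stand-in for the other candidate's corpus, NOT a real quotation.
other_corpus = "pray bear you company tonight"

hits = overlap_with_other_corpus("I'll bear you company", other_corpus)
print(hits)  # ['bear you company', 'bear you', 'you company']
```

An empty list would mean that no component of the shared n-gram appears in the other corpus, which is one of the conditions under which a construction is treated as potentially distinctive in this chapter.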

If Scene III.iii from Arden of Faversham was written by Shakespeare or Marlowe, it 

seems highly probable that the latter is its author, given that the quantitative analysis of 

the common 3-grams and 2-grams links the scene to him with clarity for such a short 

sample. In addition, the scene also shares more 5-grams and 4-grams with the Marlowian 

corpus than with that of Shakespeare, despite the fact that none of them seems to be 

distinctive. 

6.7. Scene III.iv (240 words) 

ALTXA has identified the n-grams that Scene III.iv from Arden of Faversham shares 

with the Shakespearean and the Marlowian reference corpora. The number of common 3-



grams and 2-grams can be observed in Table 55, which will be later discussed. The 

qualitative analysis of the shared 4-grams will be provided afterwards. 

Table 55 | N-gram tracing with Scene III.iv from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 6 | 14
2-grams | 73 | 91

Scene III.iv from Arden of Faversham has more 3-grams and 2-grams in common with 

the Marlowian corpus than with that of the Bard. While the scene has 14 3-grams in 

common with the corpus of Marlowe, it shares eight fewer, that is, 6, with the

Shakespearean corpus, which is an acceptable distance if the length of the scene is taken 

into consideration. In addition, there is a notable difference of eighteen points between 

the 2-grams that the scene shares with the Marlowian corpus (91) and those that it has in 

common with the Shakespearean corpus (73). 

Scene III.iv from Arden of Faversham has no larger n-grams in common with the 

Shakespearean corpus, but it shares 2 4-grams with the corpus of Marlowe. These are "this
shall be your", which seems to be a common construction, and "hear what he can", which is

part of an unusual request both in the Marlowian corpus and the disputed scene and 

therefore can be seen as distinctive.23 

According to this study, it seems highly probable that Marlowe is the author of Scene 

III.iv from Arden of Faversham if it was indeed written by one of the two candidates. 

This verdict is derived from the clarity of the quantitative analysis of the common 3-

grams and 2-grams, which has been slightly reinforced by the qualitative analysis of the 

shared 4-grams. 

6.8. Scene III.v (1,293 words) 

Scene III.v from Arden of Faversham contains 1,293 words and its authorship has only 

been studied with n-gram tracing, given that its length falls well short of the 2,000-word
fragments into which the undisputed works of Shakespeare and Marlowe are divided when
a Zeta test is conducted. The number of 3-grams and 2-grams that the scene has in

 
23 The linguistic contexts where this construction appears are "my lord, hear what he can allege", in the
case of the corpus of Marlowe, and "let’s hear what he can say", in the case of the disputed scene.



common with the Shakespearean and the Marlowian reference corpora can be observed 

in Table 56. After these results are discussed, a qualitative analysis of the shared 4-grams
will be provided.

Table 56 | N-gram tracing with Scene III.v from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 47 | 63
2-grams | 351 | 370

Table 56 shows that the Marlowian corpus shares more 3-grams and 2-grams with Scene 

III.v from Arden of Faversham than the Shakespearean corpus. There is a difference of 

sixteen points if the 3-grams that the scene has in common with the corpus of Marlowe 

(63) are compared to those that it shares with the corpus of the other candidate (47). In 

addition, while the scene has 370 2-grams in common with the corpus of Marlowe, it 

shares 351, that is, nineteen fewer, with the corpus of the Bard, which is a
considerable distance.

The analysis conducted by ALTXA reveals that Scene III.v from Arden of Faversham 

presents 2 4-grams in common with the Shakespearean corpus. These are "too good to be",
which seems to be a common combination of words, and "thou know’st it well". Since the
2-gram "thou know’st" can also be found in the corpus of Marlowe, the latter 4-gram should

not be seen as a solid marker either. 

The scene also shares 4 4-grams with the Marlowian corpus, that is, two more than 

with the corpus of the Bard. These are "to the gates of", "come let us in", "here she comes and"
and "I’ll none of that". Among these constructions, "I’ll none of that" stands out as a solid
marker due to the use of the word "none" immediately after "I’ll", which seems to be an

unusual combination. 

In sum, the clarity of the results of the quantitative analysis of the common 3-grams 

and 2-grams, which has been reinforced by the qualitative analysis of the shared 4-grams, 

suggests that it is highly probable that Marlowe is the author of Scene III.v from Arden 

of Faversham if it was written by one of the two candidates that constitute the focus of 

the study. 

 

6.9. Scene III.vi (1,265 words) 

The last scene of the third act of Arden of Faversham contains 1,265 words and its 

authorship has only been analysed with n-gram tracing for the same reason provided in 

the study of the previous scene. The number of 3-grams and 2-grams that Scene III.vi 

from Arden of Faversham shares with the Shakespearean and the Marlowian reference 

corpora can be observed in Table 57, which will be later discussed. This will be 

complemented by the qualitative analysis of the larger n-grams in common. 

Table 57 | N-gram tracing with Scene III.vi from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 55 | 72
2-grams | 362 | 379

Table 57 shows that Scene III.vi from Arden of Faversham shares 72 3-grams with the 

Marlowian corpus, whereas it has seventeen fewer in common, that is, 55, with the corpus

of the Bard, which stands as a notable difference. The scene also has seventeen more 2-

grams in common with the corpus of Marlowe (379) than with the Shakespearean corpus 

(362). 

The study conducted by ALTXA reveals that the scene has the 5-gram "ay my good
lord and" in common with the Shakespearean corpus. The selection of the word "ay" instead
of "yes" could be seen as a conscious choice of the author, although, as underlined in Section
3.4.4, this linguistic form is more dialectal or context-dependent than idiolectal. In any
case, the scene also shares the 4-gram "ay my good lord" with the Marlowian corpus, as
will be revealed further on, so the abovementioned 5-gram should not be seen as
a solid marker for this study.

Apart from the 2 4-grams that derive from the division of the previously referenced 

5-gram, the scene presents another 2 4-grams in common with the Shakespearean corpus. 

These are "to speak with you" and "that thou hast done", which are frequent constructions in

texts of this kind. 

On the other hand, Scene III.vi from Arden of Faversham has 6 4-grams in common
with the Marlowian corpus, that is, two more than with the corpus of Shakespeare, even
though they do not share any 5-grams. These 4-grams are "as thou hast done", "give him a
crown", "I have made a", "I would you were", "ay my good lord" and "on the sudden is". The
4-gram "I would you were" stands out as one of the most distinctive constructions that have
been found so far in this research, since the word "would" is used as a synonym of "want" or
"wish",24 which is a characteristic linguistic choice that cannot be found in the Shakespearean
corpus.

Therefore, if Scene III.vi from Arden of Faversham was written by one of the two 

playwrights that constitute the focus of the study, it seems highly probable that Marlowe 

is its author. The clarity of the results of the quantitative analysis of the shared 3-grams 

and 2-grams has been greatly reinforced by the presence of a unique 4-gram in common 

between the scene and his reference corpus. 

6.10. Scene IV.i (838 words) 

Scene IV.i from Arden of Faversham contains 838 words, so its authorship has

only been studied with n-gram tracing. Table 58 shows the number of 3-grams and 2-

grams that it has in common with the two reference corpora. The discussion of these 

results will be complemented by the qualitative analysis of the larger n-grams in common. 

Table 58 | N-gram tracing with Scene IV.i from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 41 | 40
2-grams | 249 | 262

The number of 3-grams that Scene IV.i from Arden of Faversham shares with the 

reference corpora of the two candidates of the study is almost identical, although it has 

one more in common with the Shakespearean corpus (41) than with that of Marlowe (40). 

In contrast, the scene shares thirteen more 2-grams with the Marlowian corpus (262) than 

with the corpus of the Bard (249), which is an acceptable difference.  

According to the analysis conducted by ALTXA, the scene shares the 5-gram "the time
hath been would" with the Shakespearean corpus. This 5-gram appears to be distinctive,
since the combination of words "time hath" cannot be found in the Marlowian corpus and
this includes the archaic form of the verb "has", which stands as a linguistic choice. Apart

 
24 This construction appears in "I would you were his father too", in the case of the Marlowian corpus, and
in "I would you were in state to tell it out", in the case of Scene III.vi from Arden of Faversham.



from the 2 4-grams that derive from the division of this 5-gram, the scene also shares with
the corpus of Shakespeare "and all the rest" and "go along with us", which seem to be frequent
constructions.

The scene shares 2 4-grams with the Marlowian corpus, that is, two fewer than with the
corpus of Shakespeare. These 4-grams are "I have lost my", which does not seem to be
distinctive, and "these arms of mine". The latter reflects a linguistic choice of the author,
given that the same idea could have been expressed with the construction "my arms". It is
worth mentioning that, while "these arms of mine" can be found in the scene and the
Marlowian corpus, the corpus of Shakespeare only includes the expression "my arms", and
thus this 4-gram could be seen as a robust idiolectal marker.

Scene IV.i from Arden of Faversham shares almost the same number of 3-grams with 

the corpora of the two candidates of the study, although it has more 2-grams in common 

with the Marlowian corpus by an acceptable margin. The qualitative analysis of the larger 

n-grams shows that, even though the scene shares a relatively distinctive 5-gram and a 

few 4-grams with the corpus of Shakespeare, it has a highly characteristic 4-gram in 

common with the Marlowian corpus. Therefore, it seems that Marlowe is slightly more 

likely to have written the scene than Shakespeare if it was indeed written by one of them.  

6.11. Scene IV.ii (263 words) 

Table 59 shows the number of 3-grams and 2-grams that Scene IV.ii from Arden of 

Faversham, which only contains 263 words, shares with the two reference corpora. Once
Table 59 has been discussed, the qualitative analysis of the shared 5-grams and 4-grams will

be presented. 

Table 59 | N-gram tracing with Scene IV.ii from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 14 | 12
2-grams | 90 | 93

The quantitative analysis of the common 3-grams and 2-grams shows inconclusive 

results, since the scene shares two more 3-grams with the Shakespearean corpus than with 

the corpus of Marlowe (14 vs. 12), but it presents three more 2-grams in common with 

the Marlowian corpus than with that of the Bard (93 vs. 90). 



While the scene has no larger n-grams in common with the Marlowian corpus, it 

shares the 5-gram "and I will follow you" with the corpus of the Bard, which can be divided

into 2 4-grams and does not seem to include a particular combination of words. 

According to this study, it seems uncertain if Scene IV.ii from Arden of Faversham 

was written by Shakespeare or Marlowe, given that the number of common 3-grams and 

2-grams does not clearly associate the scene with either of the two candidates and, even

though there is a 5-gram in common between the scene and the Shakespearean corpus, it 

does not seem to be significant enough to have an impact on the final verdict on the 

authorship of the scene after the inconclusive results of the quantitative analysis. 

6.12. Scene IV.iii (593 words) 

The number of 3-grams and 2-grams that Scene IV.iii from Arden of Faversham has in 

common with the two reference corpora will be presented in Table 60 and later discussed. 

This will be complemented by the qualitative analysis of the shared 5-grams and 4-grams. 

Table 60 | N-gram tracing with Scene IV.iii from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 25 | 35
2-grams | 187 | 204

Table 60 shows that there is a difference of ten 3-grams if those that the scene shares with 

the Marlowian corpus (35) are compared to those that it has in common with the corpus 

of the Bard (25). Furthermore, there is a difference of seventeen points between the 2-

grams that the scene shares with the corpus of Marlowe (204) and those that it has in 

common with the Shakespearean corpus (187), which is significant if the fact that this 

scene only has 593 words is taken into consideration. 

The analysis conducted by ALTXA reveals that, while Scene IV.iii from Arden of 

Faversham does not share any larger n-grams with the Shakespearean corpus, it has a 5-

gram and 5 4-grams in common with the corpus of Marlowe. The 5-gram that they share 

is "ay for a while but", which contains the collocation "for a while", which cannot be found in

the Shakespearean corpus.  

The 5 4-grams that the scene has in common with the Marlowian corpus are, apart
from those that derive from the division of the abovementioned 5-gram, "I hope to see",
"as we have done" and "my life for thine". The last of these 4-grams appears to be the most
unusual of the group.

The quantitative analysis of the common 3-grams and 2-grams and the qualitative 

analysis of the larger n-grams in common show that, if Scene IV.iii from Arden of 

Faversham was written by Shakespeare or Marlowe, it seems highly probable that the 

latter wrote it.

6.13. Scene IV.iv (1,251 words) 

The last scene of the fourth act of Arden of Faversham contains 1,251 words, so

its authorship has only been studied with n-gram tracing. Table 61 shows the number of 

3-grams and 2-grams that it shares with the two reference corpora, which will be 

discussed and complemented by the qualitative analysis of the common 4-grams. 

Table 61 | N-gram tracing with Scene IV.iv from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 48 | 66
2-grams | 370 | 376

There is a notable difference of eighteen points if the 3-grams that Scene IV.iv from Arden 

of Faversham has in common with the corpus of Marlowe (66) are compared to those that 

it shares with the Shakespearean corpus (48). Even though the scene also presents more 

2-grams in common with the Marlowian corpus, there is a narrow difference of six points 

between those that it shares with his corpus (376) and with that of the Bard (370), which 

appears to be surprising if the fact that the scene contains 1,251 words is taken into 

consideration. 

Scene IV.iv from Arden of Faversham shares 5 4-grams with the Shakespearean 

corpus, which are "I will perform it", "thee on thy way", "to show the world", "as I have heard"
and "may do thee good". Among these, "to show the world" could be seen as an authorship

marker, since it might be considered a metonymy. 

The scene also shares 8 4-grams with the Marlowian corpus, and these are "what hast
thou done", "what wilt thou do", "see where he comes", "know you what you", "with me and be",
"thee on thy way", "such prayers as these" and "let them have it". Among these 4-grams, "such
prayers as these" stands out as the most distinctive of the group, given that it seems to be
an unusual combination of words and the 2-gram "such prayers" cannot be found in the
Shakespearean corpus.

If Scene IV.iv from Arden of Faversham was written by Shakespeare or Marlowe, it 

seems slightly probable that the latter is its author, according to the study. The reason 

why Marlowe has not been suggested as the likeliest author with a high degree of certainty

is that, even though the quantitative analysis of the common 3-grams and 2-grams links 

the authorship of the text to him, the difference in the number of common 2-grams is quite 

narrow for such a large sample. Furthermore, although the scene has more 4-grams in 

common with his corpus and one of them appears to be distinctive, it also shares a 4-gram 

with the Shakespearean corpus that could be seen as a solid idiolectal marker. 

6.14. Scene V.i (3,477 words) 

The study of Scene V.i from Arden of Faversham could be seen as the most important of 

the thesis, given that it links the authorship of the scene to Marlowe with a degree of 

certainty that has no precedent either in the analysis of the previous scenes of the play
or in the pre-studies.

Since the scene contains 3,477 words, its authorship has been tested with n-gram 

tracing, whose results will be provided first, and a Zeta test, which will be presented 

afterwards. 

Attribution of authorship of Scene V.i from Arden of Faversham with n-gram tracing 

The number of 5-grams, 4-grams, 3-grams and 2-grams that Scene V.i from Arden of 

Faversham shares with the two reference corpora can be observed in Table 62. After these 

results are discussed, a qualitative analysis of the larger n-grams in common will be 

provided. 

Table 62 | N-gram tracing with Scene V.i from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
5-grams | 2 | 11
4-grams | 11 | 35
3-grams | 146 | 236
2-grams | 870 | 973



As explained in Section 4.5.4, the criterion for the inclusion of a type of n-grams in the 

quantitative analysis is that the disputed text and one of the reference corpora must share 

at least ten constructions of that kind. The fact that the common 5-grams have been 

included in this quantitative analysis reflects the high degree of resemblance that Scene 

V.i from Arden of Faversham has with the corpus of Marlowe, given that it is the only 

occasion on which this has occurred. The scene shares 11 5-grams with the Marlowian 

corpus and only 2 with that of Shakespeare. There is also a dramatic difference of twenty-

four points if the number of 4-grams that the scene has in common with the corpus of 

Marlowe (35) is compared to those that it shares with the Shakespearean corpus (11), 

which stands as the largest found so far. The difference in the shared 3-grams is also the 

largest found in the thesis, since the scene presents 236 in common with the corpus of 

Marlowe and 146, that is, ninety fewer, with the Shakespearean corpus. Lastly, while Scene
V.i from Arden of Faversham presents 973 2-grams in common with the corpus of
Marlowe, it shares one hundred and three fewer (870) with that of the other candidate.
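The inclusion criterion and the differences discussed in this paragraph can be reproduced with a short sketch that uses the figures of Table 62. The names `MIN_SHARED`, `included` and `table_62` are assumptions introduced here for illustration; they do not come from ALTXA.

```python
MIN_SHARED = 10  # threshold stated in Section 4.5.4

def included(shakespeare, marlowe, threshold=MIN_SHARED):
    """A type of n-gram enters the quantitative analysis if the disputed text
    shares at least `threshold` of them with one of the reference corpora."""
    return max(shakespeare, marlowe) >= threshold

# n-gram size -> (shared with Shakespeare, shared with Marlowe), from Table 62.
table_62 = {5: (2, 11), 4: (11, 35), 3: (146, 236), 2: (870, 973)}

included_sizes = [n for n, (s, m) in table_62.items() if included(s, m)]
differences = {n: m - s for n, (s, m) in table_62.items()}

print(included_sizes)  # [5, 4, 3, 2]
print(differences)     # {5: 9, 4: 24, 3: 90, 2: 103}
```

By the same rule, a scene that shares 9 and 8 3-grams with the two corpora, as reported for Scene V.iii, has its 3-grams left out of the quantitative analysis, since `included(9, 8)` is false.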

The analysis conducted by ALTXA reveals that Scene V.i from Arden of Faversham
does not have any n-grams larger than the 5-grams counted in Table 62 in common with the
corpus of Shakespeare. In contrast,
it shares the 10-gram "I have my wish in that I joy thy sight" with the corpus of Marlowe.
This stands as the largest construction in common found in the case study and the pre-
study, which reflects how unlikely it is that two texts share a combination of words of this
kind. In addition, it contains such a particular selection of words that it seems hard to
believe that two different authors may have chosen it. This is one of the main findings of the
thesis and provides solid evidence for the participation of Marlowe in the elaboration of
the play, as will be discussed in depth in Chapter 7.

One could argue that not only is Marlowe more likely than Shakespeare to have
written Scene V.i from Arden of Faversham, but that it seems difficult to suggest that
the scene could have been written by a different author if the numbers of 5-grams, 4-grams,
3-grams and 2-grams that it shares with his reference corpus are taken into consideration,
as well as the presence of such a unique 10-gram in common.

Attribution of authorship of Scene V.i from Arden of Faversham with the Zeta test 

Figure 15 shows the position on the coordinate axes of the fragments into which the two

reference corpora and Scene V.i from Arden of Faversham have been divided. The 

division of these three samples, the calculation of the 500 markers of each reference 



corpus and the determination of the coordinates of the fragments have followed the same 

criteria applied in previous studies. The ignored words for the elaboration of the two lists 

of 500 markers have been included in Appendix 3, while Appendix 4 contains the lists of 

markers. 

Figure 15 | Zeta test with Scene V.i from Arden of Faversham 

 
The blue circles that create a cluster on the upper left area represent the fragments into
which the undisputed plays of Marlowe have been divided, whereas the red squares that
occupy the lower right area forming another cluster represent the fragments into which the
undisputed plays of Shakespeare have been divided. The black triangle represents Scene
V.i from Arden of Faversham, which is considerably closer to the centroid of the
Marlowian cluster than to that of the Shakespearean cluster. The exact distance between
the centroid of the Marlowian cluster and the position of the fragment that represents the
scene is 0.0749 points, whereas its distance from the centroid of the Shakespearean
cluster is 0.12418 points. Therefore, this Zeta test suggests that Marlowe is the likeliest



author of the scene, which coincides with the results of the study conducted with n-gram 

tracing. 
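The comparison of distances described above can be sketched as follows. The sketch assumes that the distance is Euclidean and that the centroid is the mean position of the fragments of each cluster; the coordinates are hypothetical and have not been read from Figure 15, and the function names are introduced only for illustration.

```python
import math

def centroid(points):
    """Mean position of a cluster of (x, y) fragment coordinates."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def closest_candidate(scene_point, clusters):
    """Return the nearest candidate and the distance to every cluster centroid."""
    distances = {
        name: math.dist(scene_point, centroid(points))
        for name, points in clusters.items()
    }
    return min(distances, key=distances.get), distances

# Hypothetical coordinates, NOT the values plotted in Figure 15.
clusters = {
    "Marlowe": [(0.30, 0.60), (0.34, 0.64), (0.32, 0.62)],
    "Shakespeare": [(0.50, 0.40), (0.54, 0.44), (0.52, 0.42)],
}
scene = (0.36, 0.58)
candidate, distances = closest_candidate(scene, clusters)

# The same comparison applied to the distances reported for Scene V.i.
reported = {"Marlowe": 0.0749, "Shakespeare": 0.12418}
print(candidate, min(reported, key=reported.get))  # Marlowe Marlowe
```

With the reported values, the distance to the Marlowian centroid (0.0749) is smaller than the distance to the Shakespearean centroid (0.12418), which is the basis of the attribution suggested by the Zeta test.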

The clarity with which the Zeta test and, especially, n-gram tracing have associated 

the authorship of Scene V.i from Arden of Faversham with Marlowe makes it difficult

to suggest that the scene could have been written by a different playwright. This 

constitutes a crucial breakthrough in the investigation and will be thoroughly addressed 

in Chapter 7. 

6.15. Scene V.ii (106 words) 

The attribution of authorship of Scene V.ii from Arden of Faversham appears to be of 

great difficulty, since it only contains 106 words and hence the chances of finding 

common n-grams with the two reference corpora are lower than in the analysis of the 

previous scenes of the play. The number of 3-grams and 2-grams that the scene shares 

with the Shakespearean and the Marlowian reference corpora can be observed in Table 63,
which will be discussed later. This will be complemented by the qualitative analysis of

the only shared 4-gram. 

Table 63 | N-gram tracing with Scene V.ii from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 5 | 14
2-grams | 38 | 52

Table 63 shows that there is a difference of nine points if the 3-grams that the scene shares 

with the Marlowian corpus (14) are compared to those that it has in common with the 

corpus of the Bard (5). It also shares fourteen more 2-grams with the corpus of Marlowe 

(52) than with the Shakespearean corpus (38). These differences are remarkable if the fact 

that the scene only contains 106 words is taken into consideration. 

The software ALTXA has identified a common 4-gram between Scene V.ii from
Arden of Faversham and the corpus of Marlowe, which is "what care I though". This
4-gram appears to be distinctive, given that the omission of the auxiliary "do" stands as an
idiolectal choice of the author.

If Scene V.ii from Arden of Faversham was written by Shakespeare or Marlowe,

it seems highly probable that the latter is its author. The clarity of the quantitative analysis 



of the common 3-grams and 2-grams has been slightly reinforced by the presence of a 

distinctive 4-gram in common, which is surprising if the small size of the scene is 

considered. 

6.16. Scene V.iii (179 words) 

Scene V.iii from Arden of Faversham does not share 10 3-grams with either of the two
reference corpora, so only the common 2-grams will be quantitatively analysed.

This will be complemented by the qualitative analysis of the common 4-grams and 3-

grams. 

Table 64 | N-gram tracing with Scene V.iii from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
2-grams | 60 | 62

Table 64 shows that the scene presents almost the same number of 2-grams in common 

with the two reference corpora. While it shares 62 with the corpus of Marlowe, it has two 

fewer in common, that is, 60, with that of Shakespeare, which is an insufficient

distance to consider the results conclusive. 

The analysis conducted by ALTXA reveals that Scene V.iii from Arden of Faversham 

shares the 4-gram "what shall I say" with the corpus of Shakespeare, which does not seem
to be a distinctive construction. As a matter of fact, the 3-gram "what shall I" can be found

in the Marlowian corpus, as will be revealed further on. 

It also shares 9 3-grams with the Shakespearean corpus, which are those derived from 

the division of the abovementioned 4-gram, as well as "I have done", "I did not", "have done
this", "and I have", "me when we", "on me when" and "me and in". None of these 3-grams seems

to be distinctive.  

On the other hand, the scene has 8 3-grams in common with the corpus of Marlowe, 

that is, one fewer than with the corpus of Shakespeare. These are "wherefore stay we", "what
shall I", "and bear me", "I have done", "and I have", "not on me", "I did it" and "me and in". None of

them appears to be a solid marker either. 

It seems uncertain if Scene V.iii from Arden of Faversham was written by 

Shakespeare or Marlowe, according to the study. The scene shares only two more 2-grams 



with the corpus of Marlowe and, even though none of the larger n-grams that have been 

qualitatively analysed seems to be distinctive, it has one more 4-gram and one more 3-

gram in common with the corpus of Shakespeare. In other words, these results are not 

conclusive enough to link the authorship of the disputed text to either of the two candidates

of the study. 

6.17. Scene V.iv (117 words) 

As happened in the study of the previous scene, only the common 2-grams can be 

included in the quantitative analysis of Scene V.iv from Arden of Faversham, which will 

be presented in Table 65. This will be followed by the qualitative analysis of the common 

3-grams. 

Table 65 | N-gram tracing with Scene V.iv from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
2-grams | 37 | 43

As can be observed in Table 65, Scene V.iv from Arden of Faversham shares six more 2-

grams with the Marlowian corpus (43) than with the corpus of the Bard (37), which can 

be seen as an acceptable distance if the fact that the scene only has 117 words is taken 

into consideration.  

The software ALTXA has identified 6 3-grams in common between the scene and the 

Shakespearean corpus, which are "there is no", "my head and", "I have done", "that I have",
"that I can" and "but I am". None of them seems to be a solid authorship marker.

The scene also shares 8 3-grams with the Marlowian corpus, that is, two more than 

with the corpus of Shakespeare. These are "that I have", "there is no", "him and his", "I am sure",
"my head and", "I have done", "and cries for" and "but I am", which do not seem to be distinctive

either. 

If Scene V.iv from Arden of Faversham was written by Shakespeare or Marlowe, it 

seems slightly probable that the latter wrote it. Even though the qualitative analysis

of the common 3-grams seems to be inconclusive, the quantitative analysis of the shared 

2-grams associates the authorship of the scene with Marlowe with an acceptable degree 

of certainty for such a short sample. 



6.18. Scene V.v (321 words) 

Table 66 shows the number of 3-grams and 2-grams that Scene V.v from Arden of 

Faversham shares with the two reference corpora, which will be later discussed and 

complemented by the qualitative analysis of the 4-grams in common. 

Table 66 | N-gram tracing with Scene V.v from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
3-grams | 13 | 18
2-grams | 126 | 138

While the scene shares 18 3-grams with the Marlowian corpus, it has five fewer in common,

that is, 13, with the corpus of Shakespeare. In addition, there is a difference of twelve 

points if the 2-grams that the scene has in common with the corpus of Marlowe (138) are 

compared to those that it shares with the Shakespearean corpus (126).  

Scene V.v from Arden of Faversham shares 2 4-grams with each of the two reference 

corpora. The 2 4-grams that it has in common with the Shakespearean corpus are "how
long shall I", which seems to be a common combination of words, and "and bring away
the", which could be seen as relatively distinctive, since the combination of words "bring
away" cannot be found in the corpus of Marlowe.

The 2 4-grams that the scene has in common with the Marlowian corpus are "what
should I say", which does not seem to be distinctive, and "this hell of grief", which is a

metaphor that reflects the emotional pain of a character and therefore stands as a solid 

authorship marker. 

It seems highly probable that, if Scene V.v from Arden of Faversham was written by 

Shakespeare or Marlowe, the latter is its author. The quantitative analysis of the shared 

3-grams and 2-grams associates the authorship of the scene with him, and this has been 

reinforced by the presence of a highly distinctive 4-gram in common. 

6.19. Epilogue or Scene V.vi (148 words) 

The number of 2-grams that Scene V.vi from Arden of Faversham has in common with 

the two reference corpora will be presented in Table 67. This will be discussed and 

complemented by the qualitative analysis of the shared 4-grams and 3-grams. 



Table 67 | N-gram tracing with Scene V.vi from Arden of Faversham 

Type of n-grams | Common n-grams with the Shakespearean corpus | Common n-grams with the Marlowian corpus
2-grams | 40 | 42

Table 67 shows that Scene V.vi from Arden of Faversham shares only two more 2-grams 

with the corpus of Marlowe (42) than with that of Shakespeare (40), so this 

quantitative analysis does not clearly associate its authorship with either of the two 

candidates. 

The analysis conducted by ALTXA reveals that the scene shares the 4-gram "this above 

the rest" with the Marlowian corpus, which does not seem to be an unusual combination 

of words. It also has six 3-grams in common with that corpus, which are, apart from those 

that derive from the division of this 4-gram, "and in the", "as for the", "by force and" and 

"is to be". None of them seems to be distinctive.  

On the other hand, the scene has four 3-grams in common with the corpus of 

Shakespeare, which are "and in the", "deed was done", "the Lord Protector" and "as for the". 

Among these, the only one that seems to be relatively uncommon is "the Lord Protector", 

but the 2-gram "Lord Protector" is also present in the corpus of Marlowe, so it 

cannot be seen as a solid marker for the study.  

It seems uncertain whether Scene V.vi from Arden of Faversham was written by 

Shakespeare or Marlowe, according to the study. It only shares two more 2-grams with 

the corpus of Marlowe than with that of Shakespeare and, even though it also has one 

more 4-gram and two more 3-grams in common with his corpus, none of them seems to 

be distinctive. All these differences are so narrow that it does not seem fair to attribute 

the authorship of the text to him. 

6.20. Summary 

This chapter has analysed the authorship of the nineteen scenes of Arden of Faversham 

independently, considering the hypothesis that the play may have been written in 

collaboration and that William Shakespeare and Christopher Marlowe may have been 

involved in such a process. Depending on the length of the scenes, these have been analysed 

with n-gram tracing only or complementing n-gram tracing with the Zeta test, since these 

are the methods that have proved successful in distinguishing between undisputed 



scenes of the two playwrights in Chapter 5. The attribution of authorship of the scenes 

has been presented as highly probable, slightly probable or uncertain depending on the 

degree of certainty with which n-gram tracing and, if applied, the Zeta test have associated 

them with one of the candidates. Nevertheless, these studies need to be addressed from a 

holistic perspective with the purpose of drawing general conclusions about the authorship 

of the play and whether the objectives of the research have been met or not, which 

constitutes the focus of the next chapter. 

  

CHAPTER 7 | DISCUSSION OF THE RESULTS  

The scenes of Arden of Faversham have been analysed as independent texts in Chapter 6 

to discern their likeliest authorship considering William Shakespeare and Christopher 

Marlowe as the candidates for such attribution. These studies will be discussed from a 

holistic perspective in this chapter to extract a series of conclusions about the authorship 

of the play and the objectives and hypotheses delineated at the beginning of the thesis 

(see Section 1.2). 

The results of the nineteen studies conducted in Chapter 6 will be presented in the 

form of a table to assess which groups of scenes have been more easily attributed and 

which ones could be seen as problematic. This table will include the title of the scene, its 

length, the method or methods with which its authorship has been analysed, to which 

candidate it has been attributed and, if it has indeed been attributed to one of them, the 

degree of certainty with which the attribution has taken place. 

Table 68 | Summary of the results derived from the case study 

Title of the scene | Length of the scene | Methods involved in the attribution | Likeliest author | Certainty of the attribution
I.i | 5,135 words | N-gram tracing and Zeta test | Marlowe | Highly probable
II.i | 916 words | N-gram tracing | Marlowe | Highly probable
II.ii | 1,694 words | N-gram tracing and Zeta test | Marlowe | Highly probable
III.i | 822 words | N-gram tracing | Shakespeare | Slightly probable
III.ii | 516 words | N-gram tracing | Marlowe | Highly probable
III.iii | 357 words | N-gram tracing | Marlowe | Highly probable
III.iv | 240 words | N-gram tracing | Marlowe | Highly probable
III.v | 1,293 words | N-gram tracing | Marlowe | Highly probable
III.vi | 1,265 words | N-gram tracing | Marlowe | Highly probable
IV.i | 838 words | N-gram tracing | Marlowe | Slightly probable
IV.ii | 263 words | N-gram tracing | Uncertain | -
IV.iii | 593 words | N-gram tracing | Marlowe | Highly probable
IV.iv | 1,251 words | N-gram tracing | Marlowe | Slightly probable
V.i | 3,477 words | N-gram tracing and Zeta test | Marlowe | Highly probable
V.ii | 106 words | N-gram tracing | Marlowe | Highly probable
V.iii | 179 words | N-gram tracing | Uncertain | -
V.iv | 117 words | N-gram tracing | Marlowe | Slightly probable
V.v | 321 words | N-gram tracing | Marlowe | Highly probable
V.vi | 148 words | N-gram tracing | Uncertain | -

Table 68 shows that the only scene of Act I and the two scenes of Act II from Arden of 

Faversham have been attributed to Marlowe with a high degree of certainty. Two of these 

three scenes (I.i and II.ii) are long enough to be analysed with n-gram tracing and the Zeta 

test and both methods have clearly associated their authorship with him. The remaining 

scene (II.i) has only been analysed with n-gram tracing, which has also linked its 

authorship to Marlowe with clarity. Therefore, the present study can conclude that, if the 

first two acts of Arden of Faversham were written by Shakespeare or Marlowe, the latter 

can be considered their author without major doubts. 

In contrast, n-gram tracing has associated Scene III.i from Arden of Faversham with 

Shakespeare. The scene has been attributed to him with a low degree of certainty, given 

that the difference in the number of n-grams that it shares with the two reference corpora 

is narrow.  

The rest of the scenes of the third act of Arden of Faversham, that is, III.ii, III.iii, III.iv, 

III.v and III.vi, have followed the trend that can be found in the scenes of the first two 

acts of the play and have been attributed to Marlowe with a high degree of certainty using 

n-gram tracing. In other words, the scenes of Act III have been attributed to Marlowe 

without major doubts except for the first one, which has been associated with Shakespeare 

by a small margin. It is worth mentioning that this is the only scene of the play whose 

authorship has been attributed to him. 

The fourth act could be seen as the most problematic of the study, given that it 

contains four scenes whose authorship has been analysed with n-gram tracing and only 

one has been attributed with a high degree of certainty. Scene IV.i has been linked to 

Marlowe by a narrow margin, whereas the study of the second scene has shown 



inconclusive results. Scene IV.iii is the one that has been attributed to Marlowe with a 

high degree of certainty and Scene IV.iv has also been attributed to him, but without great 

clarity. Even though the authorship of three of the four scenes has been linked to Marlowe, 

it has occurred without great certainty on two occasions, so one could consider that 

there might be a different author involved in the elaboration of the fourth act of Arden of 

Faversham, as will be developed further on. 

The authorship of Scene V.i, which has been analysed with n-gram tracing and the 

Zeta test, has been associated with Marlowe with a degree of certainty that has no 

precedent in the thesis. This gives rise to the idea that it is extremely unlikely that the 

scene could have been written by a different playwright, which constitutes a significant 

breakthrough in the investigation that will be expounded further on.  

The five remaining scenes of Act V have only been analysed with n-gram tracing. 

While Scene V.ii has been linked to Marlowe with clarity, the results derived from the 

analysis of Scene V.iii have been inconclusive. The study of Scene V.iv suggests that 

Marlowe is slightly more likely than Shakespeare to have written it, and Scene V.v has 

also been attributed to Marlowe, but with a high degree of certainty. Lastly, the authorship 

of Scene V.vi has remained uncertain. In sum, if the first two scenes of Act V from Arden 

of Faversham were written by Shakespeare or Marlowe, it seems almost certain that the 

latter elaborated them. Nevertheless, the four remaining scenes are more problematic and, 

while two of them have been linked to Marlowe, one of them by a narrow margin, the 

other two have not been associated with either of the two candidates of the study.  

In total, Arden of Faversham contains 19 scenes and, when their authorship has been 

analysed independently considering William Shakespeare and Christopher Marlowe as 

the possible candidates, the latter has been suggested as the likeliest author on 15 

occasions. On 3 of the 4 occasions in which Marlowe has not been selected as the likeliest 

author of a scene, the results have been inconclusive, whereas the remaining scene (III.i) 

has been attributed to Shakespeare by a narrow margin. 
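These totals can be checked against Table 68. The following minimal Python sketch transcribes the table's last two columns and tallies them; the variable names are hypothetical, and scenes with an uncertain attribution carry no certainty label.

```python
from collections import Counter

# (scene, likeliest author, certainty) transcribed from Table 68.
RESULTS = [
    ("I.i", "Marlowe", "Highly probable"),
    ("II.i", "Marlowe", "Highly probable"),
    ("II.ii", "Marlowe", "Highly probable"),
    ("III.i", "Shakespeare", "Slightly probable"),
    ("III.ii", "Marlowe", "Highly probable"),
    ("III.iii", "Marlowe", "Highly probable"),
    ("III.iv", "Marlowe", "Highly probable"),
    ("III.v", "Marlowe", "Highly probable"),
    ("III.vi", "Marlowe", "Highly probable"),
    ("IV.i", "Marlowe", "Slightly probable"),
    ("IV.ii", "Uncertain", None),
    ("IV.iii", "Marlowe", "Highly probable"),
    ("IV.iv", "Marlowe", "Slightly probable"),
    ("V.i", "Marlowe", "Highly probable"),
    ("V.ii", "Marlowe", "Highly probable"),
    ("V.iii", "Uncertain", None),
    ("V.iv", "Marlowe", "Slightly probable"),
    ("V.v", "Marlowe", "Highly probable"),
    ("V.vi", "Uncertain", None),
]

# Number of scenes per likeliest author.
by_author = Counter(author for _, author, _ in RESULTS)

# Number of scenes per (author, certainty) pair.
by_certainty = Counter((author, certainty) for _, author, certainty in RESULTS)

print(by_author)
print(by_certainty)
```

Under this transcription, 12 of the 15 scenes linked to Marlowe are rated highly probable and 3 slightly probable.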

The first conclusion that can be inferred from the present research is that, if Arden of 

Faversham was written by Shakespeare and/or Marlowe, the latter elaborated most of the 

scenes of the play. Nevertheless, there have been many other playwrights considered as 

possible authors of the text (see Section 3.4.4), so this research should be 

perceived as the first milestone of a long-term project in which the candidate that has 



been designated as the likeliest author of every scene should be compared in future studies 

with the other candidates. These comparisons should start with Thomas Kyd, who has 

been suggested by scholars as the most solid alternative to Shakespeare and Marlowe 

(see Section 2.3). The inconclusive results derived from the analysis of certain scenes 

from the fourth and the fifth act may be interpreted as a consequence of their small size, 

which hinders the attribution of their authorship, or a reflection of the participation of a 

playwright other than the two candidates of the study in the creation of the disputed 

text. 

Even though one of the main objectives of the thesis is only to determine if 

Shakespeare is more likely than Marlowe to have written each scene of Arden of 

Faversham or vice versa, a conclusion that can be inferred from Chapter 6 is that the 

participation of Marlowe in the elaboration of the play seems quite probable for two main 

reasons. Firstly, it is possible that, if Marlowe had not been involved in the creation of 

the play, these authorship studies would have produced more balanced results between 

the two candidates of the study, instead of attributing 15 of the 19 scenes of the play to 

him with a high degree of certainty in most cases. Secondly, the manner in which 

the Zeta test and, especially, n-gram tracing have associated the authorship of Scene V.i 

from Arden of Faversham with Marlowe is so overwhelming that it seems complicated 

to suggest that a different playwright may have been involved in the elaboration of the 

scene. As pointed out in the corresponding analysis (see Section 6.14), the 

number of 5-grams, 4-grams, 3-grams and 2-grams in common between the scene and the 

Marlowian corpus cannot be found in any of the other 57 studies that have been conducted 

using n-gram tracing in the thesis. Furthermore, the qualitative analysis of the larger 

n-grams in common has revealed that the scene and the Marlowian corpus share the 

10-gram "I have my wish in that I joy thy sight". This common 10-gram, which stands as the 

largest found in the investigation, represents such a distinctive combination of words that 

it seems impossible that two different playwrights might have selected it, unless one of 

them was committing plagiarism.  

Therefore, this investigation has provided substantial evidence to suggest that, even 

though there is still the need to include more candidates in future studies about the 

authorship of Arden of Faversham, Christopher Marlowe was clearly involved in its 

creation, at least in that of Scene V.i, which constitutes a major breakthrough. 



The results provided in Chapter 6 also facilitate the extraction of a series of 

conclusions about the participation of William Shakespeare in the elaboration of Arden 

of Faversham. It seems that, if he participated in it, he had a minor contribution, given 

that only Scene III.i has been attributed to him and such attribution has been defined as 

slightly probable. This means that, if the authorship of Scene III.i is analysed by 

comparing the Shakespearean reference corpus with the corpora of playwrights other 

than Marlowe, the chances of finding a candidate whose idiolect presents a 

higher degree of resemblance with that displayed in the scene are solid. This finding can 

also be seen as significant, given that it differs from the results obtained by Kinney (2009), 

whose Zeta test reached the conclusion that Shakespeare was not only involved in the 

elaboration of Scene III.i, as suggested in this study, but also in that of Scenes III.ii, III.iii, 

III.iv, III.vi and V.iii, the first four of which have been attributed to Marlowe in this study 

with a high degree of certainty, whereas the authorship of Scene V.iii has remained uncertain 

(see Table 68). It also differs from the results presented by Elliott and Greatley-Hirsch (2017), 

who associated with Shakespeare the first part of Scene I.i and the totality 

of Scenes III.vi and IV.i after conducting the Zeta test, whereas these scenes have 

been attributed to Marlowe in this thesis (see Section 3.4.4 for an account of these 

studies).  

On the other hand, this thesis does not contradict the findings derived from Taylor’s 

investigation (2019), where he attributed the authorship of the first words of Scene IV.i 

to Thomas Watson (see Section 3.4.4), given that Watson has not been considered as a possible 

candidate here. As a matter of fact, Scene IV.i is one of the few scenes whose attribution 

to Marlowe is only slightly probable, and thus this could reflect the participation of a distinct 

playwright in its elaboration. It is also worth mentioning that this thesis has suggested the 

hypothesis that word n-grams of at least two words can be more effective than character 

n-grams and word 1-grams to deal with an authorship problem of this kind, which 

coincides with the approach followed by Taylor in his study. 

The two hypotheses that have led to such distinct results in comparison with the 

previously referenced studies of Kinney (2009) and Elliott and Greatley-Hirsch (2017) 

are that authors should be compared individually during the Zeta test and that the 

reference corpora of the candidates should be compiled with undisputed works that have 

similar characteristics to those of the disputed text.  

The comparison of a single author with a group of authors during the conduction of a 

Zeta test does not seem statistically sensible, as explained in Section 4.5.5. If, for instance, 



Shakespeare was compared with a group of ten playwrights, many discriminators that 

would be useful to distinguish between Shakespeare and one of those authors would 

probably be lost due to the average values of the other candidates of the group, which could 

be illustrated as follows. If the word beseech was highly frequent in the Shakespearean 

corpus and it was barely present in the Marlowian corpus, it would become one of the 

500 Shakespearean markers for the conduction of the test if this only compared these two 

authors. Nevertheless, if Marlowe was included in a group with nine other playwrights 

and they all used the word beseech frequently, this word would not be included as a 

discriminator in the study and thus a great opportunity to compare the two playwrights 

would be missed. It seems that the combination of many idiolects in the same corpus might 

not represent any of its parts properly and that they could mean nothing as a group, 

statistically speaking. For that reason, studies of this kind should compare authors 

individually, as has been consistently suggested throughout the thesis. 
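The loss of a discriminator described above can be made concrete with a minimal Python sketch. It assumes a Craig-style Zeta score, that is, the proportion of base-author segments that contain a word plus the proportion of counter-set segments that lack it (ranging from 0 to 2). The segment data are invented toy values, and neither the function name nor the scoring details are taken from ALTXA.

```python
def zeta(word, base_segments, counter_segments):
    """Craig-style Zeta score of `word` as a marker of the base author.

    Each segment is represented as the set of word types it contains.
    """
    in_base = sum(word in seg for seg in base_segments) / len(base_segments)
    absent_in_counter = sum(word not in seg for seg in counter_segments) / len(counter_segments)
    return in_base + absent_in_counter


# Toy segments (hypothetical): every Shakespearean segment uses "beseech",
# no Marlowian segment does, and nine other playwrights use it throughout.
shakespeare = [{"i", "beseech", "you"}] * 4
marlowe = [{"i", "pray", "you"}] * 4
nine_others = [{"we", "beseech", "thee"}] * 36

one_vs_one = zeta("beseech", shakespeare, marlowe)
one_vs_group = zeta("beseech", shakespeare, marlowe + nine_others)

print(f"Shakespeare vs Marlowe: {one_vs_one:.2f}")
print(f"Shakespeare vs group of ten: {one_vs_group:.2f}")
```

In the one-to-one comparison the word reaches the maximum score, whereas pooling Marlowe with the other nine playwrights drags the score towards 1, the value of a word that does not discriminate at all.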

As mentioned earlier, the most important hypothesis of the thesis is that, since the 

idiolect is a dynamic phenomenon, the style of an author may vary greatly depending on 

the period and the type of text that they write, and thus the reference corpora of the 

candidates should be narrowed down and only include plays that belong to a similar 

period to that in which the disputed text was created, with which they should also share a 

similar tone. Scholars state that there are certain idiolectal choices that remain uniform in 

all the creations of an author, which is something I agree with, but it must be borne in 

mind that playwrights were constantly imitating each other during the Elizabethan period 

and there are few stylistic differences among them, so those idiolectal choices that 

remain uniform in their entire work are probably shared by many. 

For that reason, it has been suggested that there might be a series of idiolectal features 

that can be only found in the plays that Shakespeare and Marlowe wrote during a certain 

period of time and that had a tragic tone, and thus their identification and quantification can 

be the key to discern the likeliest authorship of a disputed text of similar characteristics. 

Following that premise, their reference corpora have been compiled with plays that were 

elaborated between 1590 and 1595, since Arden of Faversham was written in approximately 

1592, and are not comedies, since this play is a tragedy. This approach differs from 

that adopted by Taylor (2019), who suggested that it is more effective to compile the 

reference corpora with texts that belong to dissimilar periods and even distinct genres, 

such as poetry. It also differs from those adopted by Kinney (2009) and Elliott and 



Greatley-Hirsch (2017), who compiled their reference corpora with plays written from 1580 to 

1619 and from 1580 to 1594, respectively, including comedies in both cases.  

It is impossible to validate or refute the hypotheses suggested for the conduction of 

this thesis until these are consistently tested in future research involving distinct types of 

texts, since there will never be studies able to attribute the authorship of Arden of 

Faversham beyond any reasonable doubt. Nevertheless, the fact that the adoption of these 

methodological principles has generated such a distinct outcome from that of previous 

studies could raise a debate on the reliability of each approach. I would suggest that the 

compilation of reference corpora that ignores variables such as the genre of the texts and 

the period in which they were written can be successful when two authors that have 

notably different idiolectal features are compared. In contrast, when facing the authorship 

of an Elizabethan text or any kind of sample whose potential authors present such similar 

styles, the most representative corpora are not the larger ones, but those that are able to 

reproduce more faithfully the conditions in which the disputed text was elaborated.  

The fact that forensic linguistics is a relatively new discipline and that many of the 

computational tools used for the conduction of these studies have only been available for 

the last few years explains the lack of consensus on the methodological approach that 

should be adopted. Therefore, this research could be seen as another contribution to the 

development of a discipline that has been constantly evolving over the last decades. 

One of the two main objectives of the thesis, which is to analyse the authorship of the 

nineteen scenes of Arden of Faversham independently, has been accomplished with the 

assistance of ALTXA, whose development as free software stands as the other main 

objective (see Section 1.2). This computational tool can quantify the relative frequency 

of a keyword in a text, as well as its average number of words per sentence and lexical 

richness. It can also identify the word n-grams that two samples share and conduct a Zeta 

test, and all these functionalities can be accessed in an intuitive interface that seeks to 

facilitate the work of other forensic linguists and the spread of authorship attribution 

studies in educational contexts, where there is usually a lack of experts in the field. The 

two main objectives of the thesis have contributed to each other's fulfilment, since 

the study of the play Arden of Faversham required the use of ALTXA, and ALTXA has 

proved its validity by being applied in the analysis of Arden of Faversham. 



Chapter 8 will summarize the steps taken for the conduction of this thesis, its main 

findings, how these can relate to the initial hypotheses and the manner in which its two 

main objectives have been accomplished.  

  

CHAPTER 8 | CONCLUSION AND FUTURE LINES OF RESEARCH 

This final chapter consists of two sections. The first one will summarize the objectives, 

hypotheses, theoretical foundations, methodology, results and main findings of the thesis, 

whereas the second one will highlight its limitations and the path that could be adopted 

in future research. 

8.1. Summary and implications of the findings 

This doctoral thesis had two main objectives. The first one was to study from a forensic 

linguistic perspective the authorship of the nineteen scenes of the Elizabethan play Arden 

of Faversham considering Shakespeare and Marlowe as the possible candidates. The 

other one was to develop a computer program that could allow for the conduction of such 

analyses and pave the way for the spread of the discipline in professional and academic 

contexts. 

A brief historical and literary introduction about the playwrights and the text that 

constitute the focus of the study was provided in Chapter 2. This presented a general 

overview about William Shakespeare’s life events until the last decade of the sixteenth 

century, that is, when Arden of Faversham was created, as well as a complete biography 

of Christopher Marlowe, since he died during that period. Shakespeare only received 

basic education and seems to have forged his way as a playwright by working as a 

schoolmaster first and joining the Queen’s Men as an actor afterwards, which made him 

look like an intruder to Robert Greene and others who followed a more traditional path, 

as is the case of Marlowe. The latter received an exquisite education at Cambridge, 

where he had the opportunity to learn from already established playwrights like Greene 

himself, and seems to have combined his literary career with working as a spy, which 

might justify his mysterious death and the many subsequent speculations. These 

biographical notes also established a connection between the two playwrights, who have 

been credited as the co-authors of the three parts of Henry VI. This, combined with the 

testimony of literary experts suggesting that the cooperation between two or more 

playwrights in the elaboration of plays like Arden of Faversham was a common practice 

at the time, justifies the approach followed for the analysis. The play itself was presented 

as a literary text built upon a historical event, since the assassination on which it focuses 

was documented by Holinshed in the Chronicles of England, Scotland and Ireland in 

1577. A summary of its plot and main literary features was presented under the belief that 



this would be of use to have a deeper understanding of the subsequent linguistic analysis, 

since both disciplines could be seen as complementary in a study of this kind. Lastly, a 

description of the approaches adopted to deal with the authorship of this anonymous play 

over the years was provided to connect the contents of this chapter with those of the 

following, which stands as the linguistic background of the thesis. 

Chapter 3 provided the reader with a definition of forensic linguistics and an account 

of its historical development and main applications, with a special emphasis on authorship 

attribution studies. Even though forensic linguistics is a discipline that has not been 

acknowledged as such until recently, since the term was first used in 1968 by Professor 

Jan Svartvik, there have been many legal cases throughout history where the use of 

language has played a crucial role, some of which were discussed in this chapter to 

illustrate the inherent relationship between the law and the language. The three main 

fields of study in which forensic linguistics is currently divided were presented and 

discussed. These are known as the written language of the law, which is related to the 

need to make legal documents accessible to the average citizen, the spoken language of 

the law, which analyses the oral interactions that take place in legal contexts such as 

police investigative interviews, and the linguist as an expert witness, which refers to those 

cases in which the linguist gives advice and provides evidence in legal processes. Each 

of these fields of study was carefully explained with practical cases under the belief that 

many linguists are not familiar with forensic linguistics yet and hence this thesis could be 

their first contact with the discipline. The chapter moved from the general to the specific, 

since authorship attribution studies, which correspond to one of the roles that the linguist 

may adopt when acting as an expert witness, were thoroughly explained in its last section. 

This discussed the fundamentals of plagiarism detection, the analysis of criminal texts 

with an open and a closed set of suspects and, lastly, the study of historical and literary 

texts, which was addressed in depth by pointing out its methodological foundations and 

reviewing previous research on the authorship of Arden of Faversham. Chapter 3 was not 

merely descriptive in any of its sections, given that it provided theoretical contributions 

during the discussion of certain concepts and practical cases to anticipate the approach 

adopted for the conduction of the research.  

The methodological aspects of the thesis were developed in Chapter 4. Under the 

belief that the idiolect of an author is such a dynamic phenomenon that it may vary 

depending on the period in which they write and the type of text that they produce, it was 



suggested that the reference corpora of Shakespeare and Marlowe should be compiled 

with plays that were written in a similar period to that of Arden of Faversham and that 

they should also present a similar tone. This, which stands as the main hypothesis of the 

thesis, led to the selection of Richard III and Richard II to compile the Shakespearean 

corpus and Edward II and The Jew of Malta to compile that of Marlowe, given that they 

were written no more than three years apart from Arden of Faversham and are plays with 

a tragic tone. These four plays and Arden of Faversham were extracted from the archives 

of Project Gutenberg, which preserved the word choice of the first published 

manuscripts. The subsequent analysis focused on this word choice rather than on spelling, 

since spelling is more likely to have been altered, even by those who transcribed the 

abovementioned manuscripts. After the extraction of the texts, these were cleaned to 

optimize the subsequent analysis by deleting every stage direction or linguistic element that 

was not part of a dialogue under the assumption that these are less likely to reflect 

idiolectal features. 
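A cleaning step of this kind could be sketched as follows in Python. The sketch assumes, purely for illustration, that stage directions appear in square brackets and that speaker labels are upper-case names followed by a full stop at the start of a line; the actual layout of each source text may differ, and the function name is hypothetical.

```python
import re


def clean_play(raw: str) -> str:
    """Keep only the dialogue, under the formatting assumptions stated above."""
    # Drop bracketed stage directions, including those that span several lines.
    text = re.sub(r"\[[^\]]*\]", " ", raw)
    kept = []
    for line in text.splitlines():
        # Drop a leading speaker label such as "FIRST SPEAKER." (assumed format).
        line = re.sub(r"^\s*[A-Z][A-Z ]+\.\s*", "", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


# Invented sample, not taken from any of the plays.
raw = (
    "FIRST SPEAKER. Good morrow to you. [Exit.]\n"
    "SECOND SPEAKER. And to you.\n"
    "[Enter a messenger.]"
)
print(clean_play(raw))
```

A rule this simple would also strip a line of dialogue that happens to open with a fully capitalised word followed by a full stop, so its output would need manual checking on real texts.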

The decision to structure the analysis into a series of pre-studies and a case study 

was of paramount importance to ensure that the methods used for the analysis of Arden 

of Faversham are effective in a similar linguistic context. Initially, five tests were selected 

for the conduction of the pre-studies, but the one based on the quantification of the relative 

frequency of a list of keywords in the samples was eventually discarded because of its 

reliance on subjective criteria. The four remaining methods, which consist in the 

quantification of the average number of words per sentence and the lexical richness of the 

texts, the identification of common n-grams and the conduction of the Zeta test, were 

included in the pre-studies. These analysed samples taken from the Shakespearean and 

the Marlowian reference corpora as if they were disputed in order to assess the reliability 

of each method with each type of scene depending on its length (from 100 to 450 

words, from 500 to 950, from 1,200 to 1,700 and almost 2,000 or more). Chapter 4 also 

showed the way in which the five linguistic procedures that have been referenced in this 

paragraph can be carried out with ALTXA and how the necessity to design it arose as a 

result of the lack of functionalities of some of the already existing computational tools 

and the lack of accessibility of others, which were mentioned to give the reader an idea 

of the niche that this software intends to occupy. 
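Two of the simpler measurements mentioned above can be sketched in a few lines of Python. The sketch assumes that sentences end at full stops, question marks or exclamation marks and that lexical richness is measured as a type-token ratio; both are assumptions made for illustration, since other definitions of lexical richness exist, and the function names are hypothetical rather than those of ALTXA.

```python
import re


def average_words_per_sentence(text: str) -> float:
    """Mean number of words per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(counts) / len(counts)


def type_token_ratio(text: str) -> float:
    """Distinct word types divided by total word tokens (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words)


# Invented sample text for illustration.
sample = "I came. I saw the town. I left the town at dawn!"
print(average_words_per_sentence(sample))
print(round(type_token_ratio(sample), 3))
```

The type-token ratio falls as a text grows longer, so it is only comparable across samples of similar size, which is consistent with grouping scenes by length before comparing them.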

The results derived from the pre-studies were presented in Chapter 5. The first one 

evaluated the effectiveness of the calculation of the average number of words per sentence 



as an authorship discriminator between Shakespeare and Marlowe. It was divided into 

four stages, that is, one for each of the four types of scenes according to their length. The 

purpose of this pre-study was to discern if there is enough intra-author consistency and 

inter-author variation in the scenes of the two playwrights. The average number of words 

per sentence of Shakespearean and Marlowian scenes from the four groups was calculated 

and the results showed that, even though the intra-author consistency increases with the 

size of the samples, especially that of Marlowe, the results of the scenes of both 

playwrights overlap frequently, so this parameter was not included in the case 

study. 

The second pre-study, which was also structured into four stages, focused on the 

calculation of the lexical richness of scenes from both reference corpora with the same 

purpose of discerning if there is enough intra-author consistency and inter-author 

variation. Even though the results of the larger scenes presented more intra-author 

consistency than those derived from the calculation of the average number of words per 

sentence, they did not show enough inter-author variation to include the calculation of 

this parameter in the case study either. 

The third pre-study evaluated the effectiveness of n-gram tracing by extracting scenes 

from the reference corpora of Shakespeare and Marlowe to discern if this method could 

associate them with the corpus from which they had been removed. As explained in 

Chapter 4, two methodological decisions were made and a hypothesis was formulated to 

optimize the reliability of these studies. The first decision was to balance the size of the 

two reference corpora by removing a similar number of words from the two plays that 

constitute that of Shakespeare, which was larger. The second was to analyse 

quantitatively the type of n-grams that a disputed text shares at least ten times 

with one of the reference corpora, whereas the others could be analysed qualitatively 

as a complement to the quantitative study. Lastly, the hypothesis that word n-grams 

of at least two words are more distinctive than character n-grams and word 1-grams was 

proposed to guide the studies. The results presented in Chapter 5 demonstrated the 

high effectiveness of n-gram tracing in determining the likeliest authorship of 

Shakespearean and Marlowian scenes from the four groups, which allowed for its 

inclusion in the case study. 
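The core idea of n-gram tracing can be sketched as follows: for a given n, count how many of the distinct word n-grams of the disputed text also occur in each reference corpus, and compare the resulting shares. This is a simplified sketch under stated assumptions. The tokenisation, the function names and the toy sentences are illustrative, character n-grams are omitted, and the threshold of ten shared n-grams described above is not modelled, although the returned count is the figure to which such a threshold would be applied.

```python
import re
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct word n-grams in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_share(disputed: str, reference: str, n: int) -> Tuple[int, float]:
    """Return (number of shared n-gram types, share of disputed types)."""
    disputed_types = word_ngrams(disputed, n)
    if not disputed_types:
        return 0, 0.0
    shared = disputed_types & word_ngrams(reference, n)
    return len(shared), len(shared) / len(disputed_types)


# Toy texts, not taken from the plays.
disputed = "the king is dead long live the king"
corpus_a = "long live the king and queen"
corpus_b = "the queen is dead"
print(shared_ngram_share(disputed, corpus_a, 2))
print(shared_ngram_share(disputed, corpus_b, 2))
```

In this toy case the disputed text shares more 2-gram types with corpus_a than with corpus_b, so corpus_a would be the likelier source for that n-gram type. Repeating the comparison for several values of n gives one result per n-gram type.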

The last pre-study evaluated the reliability of the Zeta test to analyse the authorship 

of scenes from the fourth group, that is, of almost 2,000 words or more, since the plays 



of the reference corpora are divided into segments of such length during the procedure and 

thus it seems sensible to compare them only with others of similar length. A controversial 

hypothesis was proposed in Chapter 4 for the Zeta test, namely that 

authors should be compared individually with this method, instead of comparing an 

author with a group of many, which does not seem rigorous from a statistical point 

of view, as illustrated in Section 4.5.5. All the Shakespearean and Marlowian scenes 

extracted from their corpus and analysed as disputed texts were correctly attributed to 

their author, so the Zeta test was used in the case study to analyse the scenes of 

almost 2,000 words or more. 
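A pairwise Zeta comparison of the kind described above can be sketched as follows. The sketch assumes one common formulation of Zeta, in which each corpus is cut into equal segments and a word is scored by the share of segments of author A that contain it minus the share of segments of author B that contain it. The function names, the tokenisation and the toy data are hypothetical, and the implementation in ALTXA may differ.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split `text` into lower-cased word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def segment_vocabularies(tokens: List[str], size: int) -> List[Set[str]]:
    """Cut `tokens` into full segments of `size` words; keep each vocabulary."""
    return [set(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, size)]


def zeta_scores(segs_a: List[Set[str]], segs_b: List[Set[str]]) -> Dict[str, float]:
    """Score each word from -1 (typical of B) to +1 (typical of A)."""
    if not segs_a or not segs_b:
        raise ValueError("each author needs at least one full segment")
    vocabulary = set().union(*segs_a, *segs_b)
    scores = {}
    for word in vocabulary:
        share_a = sum(word in seg for seg in segs_a) / len(segs_a)
        share_b = sum(word in seg for seg in segs_b) / len(segs_b)
        scores[word] = share_a - share_b
    return scores


# Toy corpora with a segment size of 2; real segments would be far longer.
segs_a = segment_vocabularies(tokenize("thou art thou king"), 2)
segs_b = segment_vocabularies(tokenize("you are king you"), 2)
scores = zeta_scores(segs_a, segs_b)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

The highest-scoring and lowest-scoring words serve as markers of each author. A disputed scene of comparable length can then be placed by the proportion of each marker set that it contains. Only two authors enter the computation, which matches the one-to-one comparison defended in this thesis.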

Chapter 6 analysed the authorship of the nineteen scenes of Arden of Faversham 

independently and Chapter 7 discussed these studies from a holistic perspective, which 

allowed for the extraction of conclusions about the authorship of the play, the validity of 

the hypotheses and the approach adopted in the thesis, and whether its 

main objectives have been accomplished.  

The results of the case study showed that Marlowe is more likely than Shakespeare to 

have written 15 of the 19 scenes of the play, which were attributed to him with a high 

degree of certainty in most cases. Only Scene III.i was linked to Shakespeare, 

and this attribution was made with a low degree of certainty, whereas the results derived 

from the analysis of the three remaining scenes were inconclusive. 

As discussed in Chapter 7, even though one of the main objectives of the investigation 

was only to discern if Shakespeare is more likely than Marlowe to have written each scene 

of the play or vice versa, the participation of Marlowe in its composition appears to be 

almost certain given the proportion of scenes that were attributed to him and the results 

derived from the study of Scene V.i. It is possible that, if neither of the two playwrights had 

been involved in the creation of the play, the number of scenes attributed to each in 

Chapter 6 would have been more balanced. In addition, n-gram tracing associated the 

authorship of Scene V.i with Marlowe with a degree of certainty that cannot be found in 

the rest of the thesis. The clarity of this quantitative analysis was superior to that of the 

other 57 conducted with this method, including those with undisputed scenes, and this 

was reinforced by the qualitative analysis of the larger n-grams in common, which 

revealed that Arden of Faversham and the reference corpus of Marlowe share the 10-gram 

"I have my wish in that I joy thy sight". This, which is the largest construction in common 

found in the thesis, stands as such a unique linguistic choice that it seems impossible that 



two different playwrights may have chosen it. The study of Scene V.i using n-gram 

tracing was complemented by a Zeta test that also attributed its 

authorship to Marlowe.  

In sum, these studies seem to offer substantive evidence of the involvement of 

Marlowe in the composition of Arden of Faversham, regardless of the 

comparisons that need to be made with other possible candidates in future research, and this 

stands as a major breakthrough. They also suggest that the contribution of Shakespeare is 

minor or non-existent, given that only one of the scenes was attributed to him and this 

occurred with a low degree of certainty, while there are still other playwrights that need 

to be considered to examine the authorship of the play. 

This thesis has reached conclusions that differ from those presented in the studies of 

Kinney (2009) and Elliott and Greatley-Hirsch (2017) as a result of the hypotheses 

formulated about the compilation of the reference corpora and the application of the Zeta 

test. While there has been a tendency in studies of this kind to compile the reference 

corpora of the candidates with plays that were written in distant periods and belong to 

different genres, this thesis has strongly argued for the need to take into account 

as many linguistic variables as possible when comparing authors with similar styles, as 

is the case with Elizabethan playwrights. In addition, it has presented several arguments 

against the comparison of groups of authors in a Zeta test. Even 

though these hypotheses cannot be validated or refuted until they are tested in other 

contexts, the fact that they have led to such distinct results from those of other studies on 

the authorship of Arden of Faversham could raise a debate about which approach is more 

reliable. As mentioned in Chapter 7, this could be seen as another contribution to the 

development of a budding discipline that has been constantly evolving over recent 

decades owing to the emergence of new technologies. 

The two main objectives delineated at the beginning of the thesis have been 

accomplished. The authorship of the nineteen scenes of Arden of Faversham has been 

attributed considering Shakespeare and Marlowe as the candidates, and these analyses 

have been carried out with the newly designed software ALTXA. This computational tool 

stands as the pillar of a future project that seeks to assist fellow researchers and facilitate 

the implementation of authorship attribution studies in educational contexts, as will be 

explained in the following section. 



8.2. Limitations and future lines of research 

This section will present the limitations of the thesis and suggest possible lines of future 

research. As pointed out in Chapter 4, the delimitation of the scope of the investigation 

to the sole consideration of Shakespeare and Marlowe as the possible candidates for the 

attribution of authorship of Arden of Faversham was a necessity, given its methodological 

approach. 

Every authorship method selected to carry out the research was tested in a pre-study 

divided into four stages depending on the length of the scenes, with the exception of the 

pre-study on the Zeta test, which was applied only to scenes from the fourth group. 

In addition, the hypothesis that authors should be compared individually in the 

Zeta test was formulated. This means that the inclusion of more 

candidates in the pre-studies and the case study would have produced an 

unmanageable amount of work for the time available or an excessively long thesis.  

Therefore, the main limitation of this work is that only Shakespeare and Marlowe 

have been considered as the possible candidates for the attribution of authorship of the 

disputed text and, for that reason, this should be seen as the first step of a long-term project 

in which other candidates need to be involved (see Section 4.1). The playwright 

designated as the likeliest author of every scene of Arden of Faversham in this thesis 

needs to be compared with others in future studies, where the authorship of each of these 

nineteen scenes must be analysed independently. Thomas Kyd is the first one with whom 

these comparisons should be made, since he has been presented as the most solid 

alternative to Shakespeare and Marlowe in previous research (see Sections 2.3 and 

3.4.4). Thomas Watson, who was suggested by Taylor (2019) as a contributor to the 

composition of the play, and other playwrights of the time who have been considered by 

scholars as potential candidates (see Section 3.4.4) should also be included in future 

studies to reach conclusive results about the authorship of the play. 

In sum, the rigour of the approach followed in this thesis makes the 

attribution of authorship of Arden of Faversham an arduous task. Every time two 

playwrights are compared, the validity of each method to distinguish between their 

undisputed scenes of distinct lengths needs to be tested, given that what has been proved 

to be effective with Shakespeare and Marlowe may be useless if it is used to compare 

Marlowe and Kyd, for instance (see Chapter 4). If authors are compared 



individually, instead of comparing a single author with a group of many, as other scholars 

did in procedures like the Zeta test, there is a wide range of possible 

combinations to examine. In other words, this approach is not compatible with immediate results. 

This thesis is the first milestone of a project where an approach that allows for a 

reliable comparison between two candidates to analyse the authorship of the scenes of 

Arden of Faversham has been designed. In addition, a computational tool on which future 

investigations and an educational project will be built has been developed. Even though 

the creation of the software ALTXA has consumed much time, this will be the key 

instrument to carry out the following studies quickly and effectively. The program will 

be used to start an initiative in 2022 to make forensic linguistics in general, and authorship 

attribution studies in particular, accessible to all types of audiences, which could facilitate 

the spread of the discipline. 

If the user clicks the Help button of ALTXA, a drop-down menu with the section 

About us will open on its interface. There, the user will have access to the official email 

account of the software to send questions and queries, its Twitter account to follow 

the latest updates and, most importantly, the YouTube channel Project ALTXA, where 

video tutorials in Spanish and English on the functionalities of ALTXA will be uploaded, 

as well as enjoyable talks about forensic linguistics and its main areas of study (see 

Appendix 5).  

The objective of this future project is to create an accessible learning environment 

where guest speakers and I will present brief videos addressing distinct topics 

related to the discipline that can be easily followed by students. These videos may discuss 

theoretical aspects, such as the Plain English Movement or the methodological 

foundations for plagiarism detection, or practical cases that can be solved either with or 

without the assistance of ALTXA, which can be of use for students starting their own 

investigations. 

The software and this educational project will be promoted in academic journals, 

conferences and social media. Their goal is to facilitate the establishment of forensic 

linguistics in academic contexts, where there is still a scarcity of experts in the field and 

teaching tools, which is a niche that ALTXA intends to occupy. This doctoral thesis has 

been motivated by the desire to democratize knowledge and the aspiration to solve 



an authorship problem that has been present for centuries and has its focus on some of the 

most gifted authors in the history of literature. 

  

PRIMARY SOURCES 

Anonymous (1592). Arden of Faversham [eBook edition]. Project Gutenberg. Retrieved 

on December 9, 2021, from https://www.gutenberg.org/files/43440/43440-

h/43440-h.htm 

Brandeis University (2018, December 31). Facsimile Viewer: First Folio (1623). Internet 

Shakespeare Editions. Retrieved on December 9, 2021, from 

https://internetshakespeare.uvic.ca/Library/facsimile/overview/book/F1.html 

Marlowe, C. (1598). Edward II [eBook edition]. Project Gutenberg. Retrieved on 

December 9, 2021, from 

https://www.gutenberg.org/cache/epub/20288/pg20288.html 

Marlowe, C. (1633). The Jew of Malta [eBook edition]. Project Gutenberg. Retrieved on 

December 9, 2021, from https://www.gutenberg.org/files/901/901-h/901-h.htm 

Shakespeare, W. (1623). Richard III [eBook edition]. Project Gutenberg. Retrieved on 

December 9, 2021, from https://www.gutenberg.org/cache/epub/1103/pg1103-

images.html 

Shakespeare, W. (1623). Richard II [eBook edition]. Project Gutenberg. Retrieved on 

December 9, 2021, from https://www.gutenberg.org/files/1111/1111.txt 

  

BIBLIOGRAPHY AND REFERENCES  

Alcaraz, E. (2005). La lingüística legal: el uso, el abuso y la manipulación del lenguaje 

jurídico. In Turell, M. T. (Ed.), Lingüística forense, lengua y derecho: Conceptos, 

métodos y aplicaciones (pp. 49-63). Barcelona: Documenta Universitaria.  

Alhudithi, E. (2021). Review of Voyant Tools: See through your Text. Language 

Learning & Technology, 25(3), pp. 43-50. 

Anthony, L. (2022). AntConc (Version 4.0.3) [Computer software]. Retrieved on January 

25, 2022, from https://www.laurenceanthony.net/software/antconc/  

Arias Rodríguez, I., & Fernández-Pampillón Cesteros, A. M. [Área de Lingüística 

General UCM]. (2020, May 27). Taller de Sketch Engine [Video]. YouTube. 

Retrieved on January 26, 2022, from 

https://www.youtube.com/watch?v=rLNs2UUVHB8  

Astrana, L. (1964). Vida inmortal de William Shakespeare. Barcelona: Editorial 

Atlántico. 

Austin, J. L. (1962). How to Do Things with Words. Oxford: Oxford University Press. 

Baker, J. C. (1988). Pace: A Test of Authorship Based on the Rate at which New Words 

Enter an Author’s Text. Literary and Linguistic Computing, 3(1), pp. 36-39.  

Baldwin, J. (1993). Police Interview Techniques: Establishing Truth or Proof? British 

Journal of Criminology, 33(3), pp. 325-352. 

Barker, S., & Hinds, H. (2003). The Routledge Anthology of Renaissance Drama. New 

York: Routledge. 

Barrón-Cedeño, A., Vila, M., & Rosso, P. (2014). Detección automática de plagio: De la 

copia exacta a la paráfrasis. In Garayzábal, E., Jiménez, M., & Reigosa, M. (Eds.), 

Lingüística forense: La lingüística en el ámbito legal y policial (pp. 123-152). 

Madrid: Euphonia Ediciones. 

Boas, F. S. (1940). Christopher Marlowe: A Biographical and Critical Study. Oxford: 

Oxford University Press. 

Bozkurt, I. N., Baghoglu, O., & Uyar, E. (2007). Authorship Attribution. Performance of 

Various Features and Classification Methods. 22nd International Symposium on 

Computer and Information Sciences, pp. 1-5. Retrieved on January 17, 2019, from 

https://ieeexplore.ieee.org/abstract/document/4456854/citations#citations 

Bryson, B. (2009). Shakespeare: El mundo como escenario. Barcelona: Editorial RBA. 

Canter, D., & Chester, J. (1997). Investigation into the Claim of Weighted Cusum in 

Authorship Attribution Studies. Forensic Linguistics, 4(2), pp. 252-261. 

Cheng, W., Greaves, C., & Warren, M. (2006). From N-gram to Skipgram to Concgram. 

International Journal of Corpus Linguistics, 11(4), pp. 411-433. 

Christensen, A. (2017). Separation Scenes: Domestic Drama in Early Modern England. 

Nebraska: University of Nebraska Press. 


Cicres, J., & Queralt, S. (2019). An N-gram Based Approach to the Automatic 

Classification of Schoolchildren’s Writing. Vigo International Journal of Applied 

Linguistics, 16, pp. 53-80. 

Clarke, I., & Kredens, K. (2018). “I consider myself to be a service provider”: Discursive 

Identity Construction of the Forensic Linguist Expert. The International Journal 

of Speech, Language and the Law, 25(1), pp. 79-107. 

Correa, M. (2013). Forensic Linguistics. An Overview of the Intersection and Interaction 

of Language and Law. Studies About Languages, 23, pp. 5-13. 

Coulthard, M. (1996). The Official Version. Audience Manipulation in Police Records of 

Interviews with Suspects. In Cmejrková, S., Hoffmannová, J., Müllerová, O., & 

Svetlá, J. (Eds.), Dialoganalyse VI: Proceedings of the VI Conference (pp. 121-

132). Prague: Max Niemeyer Verlag.  

Coulthard, M. (2004). Author Identification, Idiolect and Linguistic Uniqueness. Applied 

Linguistics, 25(4), pp. 431-447. 

Coulthard, M. (2010). Forensic Linguistics: The Application of Language Description in 

Legal Contexts. Langage et société, 132(2), pp. 15-33.  

Coulthard, M., Grant, T., & Kredens, K. (2010). Forensic Linguistics. In Wodak, R., 

Johnstone, B., & Kerswill, P. (Eds.), Handbook of Sociolinguistics (pp. 529-544). 

London: SAGE Publications. 

Coulthard, M., & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language 

in Evidence. New York: Routledge. 

Coulthard, M., & Johnson, A. (2010). The Routledge Handbook of Forensic Linguistics. 

New York: Routledge. 

Craig, H., & Kinney, A. F. (2009). Methods. In Craig, H., & Kinney, A. F. (Eds.), 

Shakespeare, Computers and the Mystery of Authorship (pp. 15-39). Cambridge: 

Cambridge University Press. 

Culpeper, J. (2018). Affirmatives in Early Modern English: Yes, yea and ay. Journal of 

Historical Pragmatics, 19(2), pp. 243-264. 

Daubert v. Merrell Dow Pharmaceuticals Inc., Volume 509 U.S. Page 579 (1993). 

Retrieved on May 31, 2019 from https://caselaw.findlaw.com/us-9th-

circuit/1430422.html 

Dudgeon, C. (2009). Forensic Performances: Evidentiary Narrative in Arden of 

Faversham. In Majeske, A., & Detmer-Goebel, E. (Eds.), Justice, Power and 

Women in English Renaissance Drama (pp. 98-117). New Jersey: Fairleigh 

Dickinson University Press. 

Dumas, B. K. (2002). Reasonable Doubt about Reasonable Doubt: Assessing Jury 

Instruction Adequacy in a Capital Case. In Cotterill, J. (Ed.), Language in the 

Legal Process (pp. 245-259). Hampshire: Palgrave Macmillan. 

Durant, A. (2010). Meaning in the Media: Discourse, Controversy and Debate. 

Cambridge: Cambridge University Press. 


El Mundo (2017, February). El rector de la URJC “plagió literalmente” una obra de un 

ex decano de la UB. Retrieved on February 21, 2020, from 

https://www.elmundo.es/madrid/2017/02/03/5894c721e2704e80678b4615.html 

Elliott, J., & Greatley-Hirsch, B. (2017). Arden of Faversham, Shakespearean 

Authorship, and “The Print of Many”. In Taylor, G., & Egan, G. (Eds.), The New 

Oxford Shakespeare: Authorship Companion (pp. 139-181). Oxford: Oxford 

University Press. 

Fallow, D. (2016). Su padre, John Shakespeare. In Edmondson, P., & Wells, S. (Eds.), El 

círculo de Shakespeare (pp. 47-67). Barcelona: Stella Maris. 

Federal Bureau of Investigation (n.d.). Amerithrax or Anthrax Investigation. Retrieved 

on February 15, 2020, from https://www.fbi.gov/history/famous-

cases/amerithrax-or-anthrax-investigation  

Felsenfeld, C. (1981). The Plain English Movement in the United States. FLASH: The 

Fordham Law Archive of Scholarship and History, 6, pp. 408-421. 

Fitzgerald, J. R. (2014). Atribución de autoría y supuestas notas de suicidio: Análisis 

lingüístico forense y su papel en los tribunales penales estadounidenses en dos 

crímenes violentos ocurridos en 2007. In Garayzábal, E., Jiménez, M., & Reigosa, 

M. (Eds.), Lingüística forense: La lingüística en el ámbito legal y policial (pp. 49-

77). Madrid: Euphonia Ediciones. 

Foltýnek, T., Meuschke, N., & Gipp, B. (2019). Academic Plagiarism Detection: A 

Systematic Literature Review. ACM Computing Surveys, 52(6), pp. 112:1-112:42. 

Fraser, B. (1998). Threatening Revisited. Forensic Linguistics, 5(2), pp. 159-173. 

French, P. (1994). An Overview of Forensic Phonetics with Particular Reference to 

Speaker Identification. International Journal of Speech, Language and the Law, 

1(2), pp. 169-181. 

Gibbons, J. (2003). Forensic Linguistics: An Introduction to Language in the Justice 

System. Oxford: Blackwell Publishing. 

Gibbons, J. (2005). El entramado lingüístico de los interrogatorios. In Turell, M. T. (Ed.), 

Lingüística forense, lengua y derecho: Conceptos, métodos y aplicaciones (pp. 

193-219). Barcelona: Documenta Universitaria. 

Gibbons, J. (2011). Towards a Framework for Communication Evidence. The 

International Journal of Speech, Language and the Law, 18(2), pp. 233-260. 

Gil, J., & San Segundo, E. (2014). La cualidad de voz en fonética judicial. In Garayzábal, 

E., Jiménez, M., & Reigosa, M. (Eds.), Lingüística forense: La lingüística en el 

ámbito legal y policial (pp. 153-197). Madrid: Euphonia Ediciones. 

Goutsos, D. (1995). Review of Forensic Stylistics, by G. McMenamin. Forensic 

Linguistics, 2(1), pp. 99-113. 

Grant, T. (2007). Quantifying Evidence in Forensic Authorship Analysis. The 

International Journal of Speech, Language and the Law, 14(1), pp. 1-25. 

Greenblatt, S., & Logan, G. (2012). The Sixteenth Century. In Greenblatt, S. (Ed.), The 

Norton Anthology of English Literature (pp. 531-1339). New York: Norton. 


Grieve, J., Clarke, I., Chiang, E., Gideon, H., Heini, A., Nini, A., & Waibel, E. (2018). 

Attributing the Bixby Letter Using N-gram Tracing. Digital Scholarship in the 

Humanities, 34(3), pp. 493-512. 

Halliday, F. E. (1964). Shakespeare: Biografía ilustrada. Barcelona: Ediciones Destino. 

Haworth, K. (2018). Tapes, Transcripts and Trials: The Routine Contamination of Police 

Interview Evidence. The International Journal of Evidence and Proof, 22(4), pp. 

428-450. 

Holinshed, R. (1587). Chronicles of England, Scotland and Ireland. London: The British 

Library. Retrieved on January 19, 2020, from 

http://english.nsms.ox.ac.uk/holinshed/texts.php?text1=1587_8324#p14902 

Holland, P. (2007). William Shakespeare. Oxford: Oxford University Press. 

Honan, P. (2006). Christopher Marlowe: Poet and Spy. Oxford: Oxford University Press. 

Honigmann, E. (2001). Shakespeare’s Life. In De Grazia, M., & Wells, S. (Eds.), The 

Cambridge Companion to Shakespeare (pp. 1-12). Cambridge: Cambridge 

University Press. 

Hopkins, L. (2008). Christopher Marlowe, Renaissance Dramatist. Edinburgh: 

Edinburgh University Press. 

Howald, B. S. (2008). Authorship Attribution under the Rules of Evidence: Empirical 

Approaches —a Layperson’s Legal System. The International Journal of Speech, 

Language and the Law, 15(2), pp. 219-247. 

International Association for Forensic and Legal Linguistics (2020). Forensic Linguistics. 

IAFLL. Retrieved on January 19, 2020, from https://www.iafl.org/forensic-

linguistics/  

Ishihara, S. (2014). A Likelihood Ratio Based Evaluation of Strength of Authorship 

Attribution Evidence in SMS Messages Using N-grams. The International 

Journal of Speech, Language and the Law, 21(1), pp. 23-49. 

Jackson, M. P. (2014). Determining the Shakespeare Canon: Arden of Faversham and A 

Lover’s Complaint. Oxford: Oxford University Press. 

Jackson, M. P. (2017). Shakespeare, Arden of Faversham, and A Lover’s Complaint: A 

Review of Reviews. In Taylor, G., & Egan, G. (Eds.), The New Oxford 

Shakespeare: Authorship Companion (pp. 123-135). Oxford: Oxford University 

Press. 

Jackson, M. P., & Taylor, G. (2015). Shakespearean Authorship. The Times Literary 

Supplement, 5849, p. 6.  

Jessen, M. (2008). Forensic Phonetics. Language and Linguistics Compass, 2(4), pp. 671-

711.  

Jonson, B. (1623). To the Memory of my Beloved the Author, Mr. William Shakespeare. 

Poetry Foundation. Retrieved on May 6, 2021, from 

https://www.poetryfoundation.org/poems/44466/to-the-memory-of-my-beloved-

the-author-mr-william-shakespeare  

Kermode, F. (2005). El tiempo de Shakespeare. Madrid: Debate. 


Kilgarriff, A., & Rychlý, P. (2003). Sketch Engine [Online tool]. Retrieved on January 

26, 2022, from https://www.sketchengine.eu/  

Kinney, A. F. (2009). Authoring Arden of Faversham. In Craig, H., & Kinney, A. F. 

(Eds.), Shakespeare, Computers and the Mystery of Authorship (pp. 78-99). 

Cambridge: Cambridge University Press. 

Kocher, P. L. (1948). Christopher Marlowe, Individualist. University of Toronto 

Quarterly, 17(2), pp. 111-120. 

Kredens, K. (2016). Conflict or Convergence? Interpreters’ and Police Officers’ 

Perceptions of the Role of the Public Service Interpreter. Language and Law, 3(2), 

pp. 65-77. 

Larner, S. (2014). A Preliminary Investigation into the Use of Formulaic Sequences as a 

Marker of Authorship. The International Journal of Speech, Language and the 

Law, 21(1), pp. 1-22. 

Latorre, J. A. (2017). Attribution of Authorship of The Merchant of Venice and Henry VI 

through Linguistic Parameters: A Contrastive Study between William 

Shakespeare and Christopher Marlowe [Master’s dissertation, Universidad 

Complutense de Madrid]. Retrieved on February 11, 2020, from 

https://eprints.ucm.es/47400/ 

Levi, J. N. (1993). Evaluating Jury Comprehension of Illinois Capital-Sentencing 

Instructions. American Speech, 68(1), pp. 20-49.  

Ley Orgánica 10/1995, de 23 de noviembre, del Código Penal (2015). Boletín Oficial del 

Estado, 281, sec. I, de 24 de noviembre de 1995, 33987 a 34058. Retrieved on 

May 4, 2020, from https://www.boe.es/buscar/doc.php?id=BOE-A-1995-25444 

Losey, F. D. (1927). The Kingsway Shakespeare. London: George G. Harrap & Co. 

Martin, B. (2004). Plagiarism: Policy Against Cheating or Policy Against Learning? 

University of Wollongong. Retrieved on January 4, 2020, from 

http://www.uow.edu.au/arts/sts/bmartin/ 

McDougall, K., & Duckworth, M. (2018). Individual Patterns of Disfluency Across 

Speaking Styles: A Forensic Phonetic Investigation of Standard Southern British 

English. International Journal of Speech, Language and the Law, 25(2), pp. 205-

230. 

McMenamin, G. (1993). Forensic Stylistics. Amsterdam: Elsevier. 

McMenamin, G. (2002). Advances in Forensic Stylistics. Florida: CRC Press. 

Mellinkoff, D. (1963). The Language of the Law. Oregon: Resource Publications. 

Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, ns-

9(214S), pp. 237-246. Retrieved on May 12, 2019, from 

https://science.sciencemag.org/content/ns-9/214S/237/tab-pdf 

Merriam, T. (1996). Tamburlaine Stalks in Henry VI. Computers and the Humanities, 

30(3), pp. 267-280.  

Moerk, E. L. (1973). An Objective, Statistical Description of Style. Linguistics: An 

Interdisciplinary Journal of the Language Sciences, 11(108), pp. 50-58. 


Momeni, N. (2011). Forensic Linguistics: A Conceptual Frame of Bribery with Linguistic 

and Legal Features (a Case Study in Iran). International Journal of Criminology 

and Social Theory, 4(2), pp. 733-744.  

Morton, A. Q., & Michaelson, S. (1990). The Q-sum Plot. Edinburgh: Department of 

Computer Science, University of Edinburgh. 

Nicholl, C. (2016). El caso de Marlowe. In Edmondson, P., & Wells, S. (Eds.), La verdad 

sobre Shakespeare (pp. 59-73). Barcelona: Stella Maris. 

Nini, A., & Grant, T. (2013). Bridging the Gap between Stylistic and Cognitive 

Approaches to Authorship Analysis Using Systemic Functional Linguistics and 

Multidimensional Analysis. The International Journal of Speech, Language and 

the Law, 20(2), pp. 173-202. 

Olsson, J. (2004). Forensic Linguistics: An Introduction to Language, Crime and the 

Law. London: Continuum International Publishing Group. 

Olsson, J. (2008). Forensic Linguistics. New York: Continuum International Publishing 

Group. 

Oxburgh, G. E., Myklebust, T., & Grant, T. (2010). The Question of Question Types in 

Police Interviews: A Review of Literature from a Psychological and Linguistic 

Perspective. International Journal of Speech, Language and the Law, 17(1), pp. 

45-66. 

Perkins, R., & Grant, T. (2012). Forensic Linguistics. In Siegel, J. A., & Saukko, P. J. 

(Eds.), Encyclopedia of Forensic Sciences, Second Edition (pp. 174-177). 

Amsterdam: Elsevier. 

Perraudin, F. (2016, October 23). Christopher Marlowe Credited as one of Shakespeare’s 

Co-writers. The Guardian. Retrieved on January 21, 2020, from 

https://www.theguardian.com/culture/2016/oct/23/christopher-marlowe-

credited-as-one-of-shakespeares-co-writers 

Philbrick, F. A. (1949). Language and the Law. The Semantics of Forensic English. New 

York: The Macmillan Company.  

Potter, L. (2012). The Life of William Shakespeare: A Critical Biography. Oxford: Wiley-

Blackwell. 

Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., & Rosso, P. (2009). Overview of 

the 1st International Competition on Plagiarism Detection. In Stein, B., Rosso, P., 

Stamatos, E., Koppel, M., & Agirre, E. (Eds.), Workshop on Uncovering 

Plagiarism, Authorship and Social Software Misuse (PAN 09) (pp. 1-9). CEUR-

WS.org. 

Quality Inns International, Inc. v. McDonald’s Corporation, Volume 695 F. Supp. 198 

(1988). Retrieved on April 10, 2019, from 

https://law.justia.com/cases/federal/district-courts/FSupp/695/198/2346281/ 

Queralt, S. (2014). Acerca de la prueba lingüística en atribución de autoría hoy. Revista 

de Llengua i Dret, 62, pp. 35-48. 

Rhode Island v. Innis, Volume 446 U.S. Page 291 (1980). Retrieved on February 2, 2019, 

from https://supreme.justia.com/cases/federal/us/446/291/ 


Richardson, C. (2006). Domestic Life and Domestic Tragedy in Early Modern England: 

The Material Life of the Household. Manchester: Manchester University Press. 

Riggs, D. (2004). The World of Christopher Marlowe. London: Faber and Faber. 

Rock, F. (2007). Communicating Rights: The Language of Arrest and Detention. 

Hampshire: Palgrave Macmillan. 

Royal Shakespeare Company (2021). Timeline of Shakespeare’s Plays. Retrieved on June 

7, 2021, from https://www.rsc.org.uk/shakespeares-plays/timeline  

Ryskina, M., Alpert-Abrams, H., Garrette, D., & Berg-Kirkpatrick, T. (2017). Automatic 

Compositor Attribution in the First Folio of Shakespeare. Proceedings of the 55th 

Annual Meeting of the Association for Computational Linguistics (Short Papers), 

pp. 411-416. 

Schoenbaum, S. (1985). William Shakespeare: Una biografía documentada. Barcelona: 

Editorial Argos Vergara. 

Scott, M. (2021). WordSmith Tools (Version 8) [Computer software]. Retrieved on 

January 25, 2022, from https://www.lexically.net/wordsmith/  

Shuy, R. (2002). Linguistic Battles in Trademark Disputes. Hampshire: Palgrave 

Macmillan. 

Shuy, R. (2010). The Language of Defamation Cases. New York: Oxford University 

Press. 

Sinclair, S., & Rockwell, G. (2022). Voyant Tools (Version 2.5.3) [Online tool]. Retrieved 

on January 25, 2022, from https://voyant-tools.org/  

Smith, E. L. (2021). A Review of the Computational Linguistics Tools WordSmith Tools 

(Version 8) and AntConc (Version 3.5.8). Renaissance and Reformation, 44(1), 

pp. 200-214.  

Solan, L. M. (1993). The Language of Judges. Chicago: University of Chicago Press. 

Sousa-Silva, R. (2013). Detecting Plagiarism in the Forensic Linguistics Turn [Doctoral 

dissertation, Aston University]. Retrieved on January 25, 2022, from 

https://research.aston.ac.uk/en/studentTheses/detecting-plagiarism-in-the-forensic-linguistics-turn

Sousa-Silva, R. (2014). Detecting Translingual Plagiarism and the Backlash against 

Translation Plagiarists. Language and Law, 1(1), pp. 70-94. 

Svartvik, J. (1968). The Evans Statements: A Case for Forensic Linguistics. Goteborg: 

Elanders boktryckeri aktiebolag. Retrieved on May 7, 2019, from 

https://www.thetext.co.uk/Evans%20Statements%20Part%201.pdf 

Tallent, L. (2007). Looking for Marlowe. College Literature, 34(1), pp. 213-222. 

Taylor, G. (2019). Finding “Anonymous” in the Digital Archives: The Problem of Arden 

of Faversham. Digital Scholarship in the Humanities, 34(4), pp. 855-873.  

The Marlowe Society (2021). Published Works. Retrieved on June 7, 2021, from 

http://www.marlowe-society.org/christopher-marlowe/works/  


Tiersma, P. (1993). The Judge as Linguist. Loyola of Los Angeles Law Review, 27(1), pp. 

269-283.  

Tiersma, P. (2009). Communicating with Juries: How to Draft more Understandable Jury 

Instructions. Loyola-LA Legal Studies Paper, 2009-44. Retrieved on November 

11, 2019, from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1507298 

Turell, M. T. (2005). El plagio en la traducción literaria. In Turell, M. T. (Ed.), Lingüística 

forense, lengua y derecho: Conceptos, métodos y aplicaciones (pp. 275-298). 

Barcelona: Documenta Universitaria. 

Turell, M. T. (2010). The Use of Textual, Grammatical and Sociolinguistic Evidence in 

Forensic Text Comparison. The International Journal of Speech, Language and 

the Law, 17(2), pp. 211-250. 

Udina, N. (2017). Forensic Linguistics Implications for Legal Education: Creating the

e-textbook on Language and Law. Procedia: Social and Behavioral Sciences, 237,

pp. 1337-1340.  

Universidad Nacional de Educación a Distancia (2017, September). Análisis de textos y 

estilometría usando R. Formación permanente UNED. Retrieved on January 15, 

2022, from https://formacionpermanente.uned.es/tp_actividad/idactividad/10010  

Valero-Garcés, C. (2018). Lingüística forense. Contextos, teoría y práctica. Madrid: 

Edisofer S.L. 

Vázquez Maroño, M. L. (2014). La entrevista policial, un diálogo transformado en 

monólogo. In Garayzábal, E., Jiménez, M., & Reigosa, M. (Eds.), Lingüística 

forense: La lingüística en el ámbito legal y policial (pp. 341-356). Madrid: 

Euphonia Ediciones. 

Vickers, B. (2008). Thomas Kyd, Secret Sharer. The Times Literary Supplement, 5481, 

pp. 13-15. 

Vickers, B. (2015). No Shakespeare to Be Found. The Times Literary Supplement, 5487, 

pp. 9-11. 

Wood, M. (2016). Su madre, Mary Shakespeare. In Edmondson, P., & Wells, S. (Eds.), El

círculo de Shakespeare (pp. 27-45). Barcelona: Stella Maris. 

Woolls, D., & Coulthard, M. (1998). Tools for the Trade. Forensic Linguistics, 5(1), pp. 

33-57. 

Wright, D. (2017). Using Word N-grams to Identify Authors and Idiolects. International 

Journal of Corpus Linguistics, 22(2), pp. 212-241. 

  

APPENDICES 

 

APPENDIX 1 

Transcript of the email that was sent from Dulceliz Díaz’s email account to three 

members of her family on January 15, 2007 (Fitzgerald, 2014, p. 53) 

Subject: do me a favor. 

I’m doing something today thast will affe4ct us all, I weant uou to do me a favor, get 

jajaira and eddie and all 4 of their kids, he raped me when i went to their hous e and she 

watched, so I want you to kill thenm, ill be watchin to make sure you do this, leave albert 

alone though just ‘tell albert I love him and this ist his fault, and its not the familys faut 

either, I just deont weant to live anymore, 

mommy and poppi i liove you, mio I love you carlos I love you and Brenda I love you, 

please tell albert that I will always love him……. i sorry that it has to be this way 

everyone, but this is what iv wantred to do for a very long time, peace3 out and I’ll be 

keeping an eye on all of you, and even though we argued and fight over stupid things, 

you guys are always gonna bwe in my heart,  

 

APPENDIX 2 

Graphical representation of the results of the Zeta test conducted by Kinney (2009,

pp. 92-93) to analyse the authorship of the scenes of Arden of Faversham, together

with a table of their coordinates to clarify which scenes have been attributed to

Shakespeare

 

APPENDIX 3 

Stop list, in alphabetical order, of all the words excluded as potential markers

in the Zeta tests conducted in this thesis

a  

abigail  

about  

above  

across  

adieu  

after  

afterwards  

again  

against  

alexandria  

all  

almost   

along  

already  

also  

although  

always  

am  

among  

amongst  

amoungst  

an  

and  

anne 

another  

any  

anyhow  

anyone  

anything  

anyway  

anywhere  

are  

around  

art 

arundel 

as  

at  

aumerle 

back 

bagot 

baldock 

barabas 

barnardine 

be   

because  

been  

before  

beforehand  

behind  

being  

below  

beside  

besides  

between 

beyond  

bolingbroke 

bolingbroke's 

borsa 

both  

brandon 

britaine 

buckingham 

bushy 

but  

by  

calymath 

can  

cannot  

catesby 

chamberlain 

clarence 

clarence' 

cornwall 

could  

could'st 

crosby 

danae 

dare  

de 

despite  

did  

did'st 

didst 

do  

does  

don 

done 

dorset  

dost 

doth 

down  

during  

durst 

each  

edmund 

edmund's 

edward's 

edwardum 

eg  

egypt 

eight 

eighth 

either  

eleven 

elizabeth 

else  

elsewhere  

ely 

'em 

england 



england's 

english 

enough  

est 

etc  

even  

ever  

every  

everyone  

everything  

everywhere  

except  

exeter 

ferneze 

ferneze's 

few   

fifth 

first 

five 

flint 

flint's 

florence 

for   

four 

fourth 

france 

frenchman 

from  

gaunt 

gaveston 

george 

george's 

glocester 

gloucester 

gloucester's 

gurney 

had  

hadst 

harry 

has  

hast 

hastings 

hath 

have  

he  

hebrew 

hebrews 

hence  

henry 

henry's 

her  

hercules 

here  

hereabouts  

hereafter  

hereby  

hereford 

hereford's 

herein  

hereinafter  

heretofore  

hereunder  

hereupon  

herewith  

hers  

herself  

him  

himself  

his  

how  

however  

hum 

hylas 

i  

if  

in  

indeed  

instead  

into  

ireland 

is  

isabel 

isabella 

it  

italian 

ithamore 

its  

itself  

jacomo 

jerusalem 

jesu 

'jesu 

jew 

jews 

jew's 

john 

jove 

jove's 

kent 

killingworth 

lancaster 

last  

latter  

latterly  

least  

leicester 

less 

levune 

lightborn 

like  

lodowick 

london 

longshanks' 

lot  

lots  

malta 

malta's 

many  

margaret 

margaret's 



marquis 

mathias 

matrevis 

may  

mayst 

me  

might  

mine  

more  

moreover  

mortimer 

mortimers 

mortimer's 

morton 

most  

mostly  

much  

must  

my  

myself  

namely  

near  

need  

ne'er 

neither  

never  

nevertheless  

next 

nine 

nineth 

no  

nobody  

nolite 

none  

noone  

nor  

norfolk 

normandy 

northumberland 

not  

nothing  

now  

nowhere  

o' 

occidere 

of  

off  

often  

oftentimes  

on  

once  

one  

o'er 

only  

onto  

or  

other  

others  

otherwise  

ought  

our  

ours  

ourselves  

out  

outside  

over  

paul 

paul's 

pembroke 

pembroke's 

per  

perhaps  

pilia 

plantagenet 

plantagenets 

pomfret 

quam 

ratcliff 

rather  

re  

rhodes 

richard 

richard's 

richmond 

richmond's 

rome 

rutland 

salisburg 

salisbury 

same  

second  

selim 

seven 

seventh 

several  

shall  

shalt 

she  

should  

sicily 

since  

six 

sixth 

so  

some  

somehow  

someone  

something  

sometime  

sometimes  

somewhat  

somewhere  

spain 

spenser 

spensers 

spenser's 

stanley 

still  

such  

't 



ten 

tenth 

tewksbury 

th' 

than  

that  

the  

thee 

their  

theirs  

them  

themselves  

then  

thence 

there  

thereabouts  

thereafter  

thereby  

therefore  

therein  

thereof  

thereon  

thereupon  

these  

they  

thine 

third  

this  

thither 

thomas 

those  

thou 

though  

three 

through  

throughout  

thru  

thus  

thy 

timere 

tis 

to  

together  

too  

top  

toward  

towards  

turk 

turkey 

turks 

twas 

twelve 

twenty 

two 

tynmouth 

tyrrel 

under  

until  

up  

upon  

us  

used  

valois 

vaughan 

venice 

very  

via  

walter 

warwick 

warwickshire 

was  

wast 

we  

well  

welshmen 

were  

wert 

westminster 

what  

whatever  

when  

whenas 

whence  

whenever  

where 

whereafter  

whereas  

whereby  

wherein  

whereupon  

wherever  

whether  

which  

while  

whither 

who  

whoever  

whole  

whom  

whose  

why  

whyever  

will  

william 

wilt 

wiltshire 

winchester 

with  

within  

without  

would  

wouldst 

ye 

yet  

york 

york's 

you  

your  

yours  

yourself  



yourselves  
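As an illustration of how a stop list such as the one above might be applied, the following minimal Python sketch removes stop-listed tokens from a text before the remaining words are counted as candidate markers. It is not the procedure used by the software employed in the thesis: the function names `tokenize` and `candidate_markers` and the tokenisation pattern are assumptions made for this example, and `STOP_WORDS` holds only a short excerpt of the list.

```python
import re
from collections import Counter

# Short excerpt of the stop list above, for illustration only (assumption:
# the full list would be loaded from a file in practice).
STOP_WORDS = {"a", "abigail", "adieu", "and", "of", "th'", "thy", "york's"}

def tokenize(text):
    """Lower-case the text and split it into word tokens.

    The pattern keeps leading, inner and trailing apostrophes and inner
    hyphens, so forms such as 'em, th', york's and to-day stay whole.
    """
    return re.findall(r"'?[a-z]+(?:['-][a-z]+)*'?", text.lower())

def candidate_markers(text, stop_words=STOP_WORDS):
    """Count the words of a text after discarding every stop-listed token."""
    return Counter(tok for tok in tokenize(text) if tok not in stop_words)

counts = candidate_markers(
    "Adieu, Abigail! Thy gold and thy wealth, th' gold of York's men"
)
print(counts.most_common())
```

Proper names and function words are dropped before counting, so only words such as "gold" or "wealth" remain available as potential markers.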

 

APPENDIX 4  

Lists of the 500 Shakespearean and 500 Marlowian markers used in the Zeta test to

attribute the authorship of the scenes of Arden of Faversham. The position of

each marker on its list is determined by its score according to the formula

provided in Section 4.5.5

 
Shakespearean markers 

1. duke 

2. god's 

3. god 

4. tongue 

5. hours 

6. wife 

7. royal 

8. deep 

9. children 

10. foul 

11. dangerous 

12. bloody 

13. sons 

14. eye 

15. arm 

16. holy 

17. noble 

18. days 

19. ill 

20. princes 

21. fearful 

22. seat 

23. weeping 

24. living 

25. hour 

26. just 

27. heavy 

28. cousin 

29. happy 

30. to-day 

31. sorrow 

32. bosom 

33. king 

34. right 

35. high 

36. rivers 

37. beseech 

38. grievous 

39. woe 

40. weary 

41. kindred 

42. bad 

43. peace 

44. thoughts 

45. brother's 

46. eyes 

47. set 

48. look'd 

49. souls 

50. truth 

51. duty 

52. guilty 

53. princely 

54. virtuous 

55. beat 

56. wail 

57. tender 

58. pluck 

59. forth 

60. amen 

61. anointed 

62. gracious 

63. black 

64. conscience 

65. won 

66. mortal 

67. prove 

68. thing 

69. bid 

70. widow 

71. deadly 

72. hearts 

73. tedious 

74. sour 

75. dull 

76. thrive 

77. breath 

78. womb 

79. shame 

80. slander 

81. degree 

82. coward 

83. ancient 

84. leisure 

85. mother 

86. grey 

87. brief 

88. deny 

89. ear 

90. counsel 

91. joy 

92. subjects 

93. cry 

94. knee 

95. self 

96. hand 



97. patience 

98. mighty 

99. issue 

100. bids 

101. age 

102. own 

103. spent 

104. pale 

105. cold 

106. title 

107. liege 

108. brothers 

109. stabb'd 

110. drown 

111. destruction 

112. joys 

113. guess 

114. husband 

115. shortly 

116. says 

117. humble 

118. sleeping 

119. party 

120. current 

121. morrow 

122. pluck'd 

123. world's 

124. sentence 

125. doom 

126. alack 

127. loyal 

128. bold 

129. mother's 

130. gain 

131. glory 

132. tidings 

133. win 

134. green 

135. side 

136. sad 

137. breathing 

138. virtue 

139. remember 

140. haste 

141. heart 

142. defend 

143. years 

144. hell 

145. young 

146. law 

147. wrong 

148. to-morrow 

149. matter 

150. loss 

151. gentlemen 

152. subject 

153. saw 

154. withal 

155. play 

156. touch 

157. thought 

158. light 

159. rage 

160. dreams 

161. untimely 

162. reverend 

163. quoth 

164. earnest 

165. prime 

166. dispatch 

167. outward 

168. guilt 

169. proportion 

170. terror 

171. heart's 

172. woeful 

173. yielded 

174. bones 

175. fathers 

176. vantage 

177. state 

178. tempest 

179. nurse 

180. barren 

181. withdraw 

182. profane 

183. designs 

184. grace 

185. model 

186. bleeding 

187. wound 

188. height 

189. boldly 

190. sacrament 

191. liest 

192. upright 

193. privilege 

194. adversaries 

195. treasons 

196. dread 

197. kinsman 

198. record 

199. looks 

200. put 

201. false 

202. body 

203. melancholy 

204. woman 

205. deed 

206. pretty 

207. rude 

208. woes 

209. dust 

210. broken 

211. bend 

212. lands 

213. buried 

214. office 

215. judge 

216. urg'd 



217. lend 

218. book 

219. jest 

220. shed 

221. lo 

222. times 

223. foe 

224. battle 

225. hot 

226. devil 

227. form 

228. deserve 

229. plain 

230. faith 

231. lies 

232. purpose 

233. country's 

234. despair 

235. course 

236. tear 

237. yea 

238. told 

239. clouds 

240. loving 

241. face 

242. presence 

243. seen 

244. son 

245. tyranny 

246. direction 

247. opposite 

248. shallow 

249. cousins 

250. voice 

251. prithee 

252. allies 

253. flourish 

254. grandam 

255. determin'd 

256. bless'd 

257. corse 

258. blunt 

259. beggar 

260. aunt 

261. creature 

262. divided 

263. plant 

264. prepare 

265. coronation 

266. glass 

267. beholding 

268. prophesy 

269. heinous 

270. confound 

271. contented 

272. usurp 

273. watch 

274. prevent 

275. scene 

276. damn'd 

277. angels 

278. windows 

279. loath 

280. trees 

281. yon 

282. dispers'd 

283. benefit 

284. pieces 

285. household 

286. urge 

287. intelligence 

288. bay 

289. heirs 

290. merry 

291. rights 

292. commends 

293. sickness 

294. mock 

295. envious 

296. tardy 

297. stopp'd 

298. tale 

299. awak'd 

300. faces 

301. sorrow's 

302. snow 

303. sullen 

304. heavier 

305. infant 

306. dire 

307. order 

308. victory 

309. reach 

310. ceremonious 

311. valour 

312. empty 

313. minister 

314. deputy 

315. spur 

316. ripe 

317. correction 

318. saint 

319. try 

320. balm 

321. grace's 

322. father 

323. lord 

324. knightly 

325. laid 

326. swear 

327. trembling 

328. frozen 

329. say 

330. nought 

331. blood 

332. devotion 

333. befall 

334. apparent 

335. sun 

336. comfort 



337. traitor 

338. wept 

339. prey 

340. dog 

341. prayer 

342. worthy 

343. beg 

344. behalf 

345. kill'd 

346. stroke 

347. sign 

348. manner 

349. assur'd 

350. keeps 

351. spake 

352. fellow 

353. fool 

354. affairs 

355. doubt 

356. sets 

357. kingdom 

358. number 

359. forward 

360. pain 

361. depose 

362. falls 

363. large 

364. summer 

365. cut 

366. end 

367. vow 

368. brother 

369. rough 

370. head 

371. reverence 

372. betwixt 

373. divine 

374. hate 

375. other's 

376. shadow 

377. lest 

378. ay 

379. awhile 

380. wash 

381. quick 

382. enemies 

383. kill 

384. nature 

385. rebels 

386. mild 

387. taste 

388. banishment 

389. bed 

390. justice 

391. fight 

392. move 

393. precious 

394. cause 

395. full 

396. old 

397. dignity 

398. leads 

399. daughters 

400. infer 

401. lovel 

402. boar 

403. knocks 

404. knot 

405. conqueror 

406. sanctuary 

407. growth 

408. uncles 

409. ungovern'd 

410. babes 

411. lamentation 

412. vice 

413. red 

414. wife's 

415. zounds 

416. remorse 

417. fain 

418. perpetual 

419. smother'd 

420. butcher'd 

421. reported 

422. meed 

423. acquaint 

424. warn 

425. cheerfully 

426. scarcely 

427. crept 

428. toad 

429. homicide 

430. murd'rous 

431. devilish 

432. fouler 

433. fairer 

434. ugly 

435. holes 

436. evil 

437. lip 

438. graces 

439. victorious 

440. contrary 

441. endur'd 

442. beggars 

443. nails 

444. wonders 

445. consorted 

446. conference 

447. consequence 

448. contempt 

449. pass'd 

450. cried 

451. piece 

452. humility 

453. dew 

454. sights 

455. boon 



456. looking-glass

457. usurp'd 

458. sovereignty 

459. duteous 

460. glories 

461. shook 

462. ages 

463. bond 

464. surrey 

465. revengeful 

466. plants 

467. sap 

468. government 

469. garden 

470. unruly 

471. saints 

472. scope 

473. bending 

474. bent 

475. cloudy 

476. signify 

477. slaughtered 

478. woe's 

479. women 

480. joints 

481. boys 

482. blot 

483. angel 

484. worldly 

485. blushing 

486. wand'ring 

487. outrage 

488. lower 

489. yields 

490. double 

491. fairly 

492. lineaments 

493. unrest 

494. wither'd 

495. pause 

496. accept 

497. employ'd 

498. foolish 

499. bounty 

500. process 
 


Marlowian markers 

1. gold 

2. cast 

3. wealth 

4. hard 

5. money 

6. words 

7. seeing 

8. pass 

9. yes 

10. we'll 

11. place 

12. serve 

13. hang 

14. here's 

15. governor 

16. content 

17. villains 

18. sure 

19. pay 

20. crowns 

21. got 

22. dissemble 

23. base 

24. brave 

25. tush 

26. hale 

27. soon 

28. town 

29. hundred 

30. runs 

31. bring 

32. receive 

33. there's 

34. gone 

35. nuns 

36. sirrah 

37. round 

38. wonder 

39. resolute 

40. treasury 

41. that's 

42. follow 

43. soldiers 

44. realm 

45. loved 

46. christians 

47. carry 

48. appoint 

49. cruel 

50. they'll 

51. what's 

52. pull 

53. droop 

54. is't 

55. seek 

56. madam 

57. lordship 

58. fleet 

59. sooner 

60. minion 

61. barons 

62. sell 

63. friend 

64. force 

65. he's 

66. let's 

67. banish 

68. accursed 

69. goods 

70. wicked 

71. fatal 

72. knew 

73. remains 

74. forsake 

75. return 

76. earl 

77. walk 

78. leave 

79. perish 

80. friar 

81. daughter 

82. distress 

83. happily 

84. esteem 

85. messenger 

86. policy 

87. commit 

88. harbour 

89. letters 

90. saith 

91. fire 

92. left 

93. fury 

94. rule 

95. sea 

96. wind 

97. moved 

98. nun 

99. abbess 

100. ruled 

101. looked 

102. nunnery 

103. ha' 

104. price 

105. poisoned 

106. powder 

107. written 

108. fingers 

109. winds 

110. unkind 

111. inflict 

112. seized 

113. weapons 

114. amain 

115. road 

116. gates 

117. slain 



118. gets 

119. requite 

120. fetch 

121. favour 

122. lov'st 

123. assure 

124. question 

125. hateful 

126. walks 

127. countenance 

128. seize 

129. cease 

130. ship 

131. anger 

132. wish 

133. nobles 

134. speeches 

135. bliss 

136. answer 

137. get 

138. farewell 

139. trouble 

140. bought 

141. grieves 

142. shake 

143. hardly 

144. who's 

145. life 

146. spare 

147. brook 

148. unto 

149. charge 

150. means 

151. christian 

152. riches 

153. device 

154. advance 

155. seest 

156. salute 

157. easily 

158. grieve 

159. remain 

160. ease 

161. levy 

162. safe 

163. reveng'd 

164. fast 

165. command 

166. pearl 

167. court 

168. villainy 

169. friars 

170. resolved 

171. tribute 

172. gotten 

173. coin 

174. paltry 

175. sum 

176. secret 

177. rescue 

178. sink 

179. betray'd 

180. promised 

181. fail 

182. asleep 

183. betray 

184. bravely 

185. desires 

186. message 

187. sail 

188. walls 

189. where's 

190. disdain 

191. master 

192. close 

193. train 

194. warrant 

195. lofty 

196. attempt 

197. overthrow 

198. fie 

199. honours 

200. pine 

201. torments 

202. yield 

203. wills 

204. wrathful 

205. proudest 

206. long 

207. strange 

208. peasant 

209. use 

210. rend 

211. goes 

212. guard 

213. worth 

214. conspire 

215. keep 

216. highness 

217. highly 

218. pride 

219. earls 

220. underneath 

221. grant 

222. lovely 

223. city 

224. sight 

225. writes 

226. begone 

227. courtesan 

228. nose 

229. revenged 

230. chaste 

231. governor's 

232. circumcised 

233. tormented 

234. prythee 

235. i'd 

236. turned 

237. convert 



238. forced 

239. tribute-money

240. search 

241. profession 

242. discharged 

243. galleys 

244. entry 

245. custom 

246. costly 

247. diamonds 

248. pearls 

249. entertained 

250. bags 

251. thirst 

252. senses 

253. rests 

254. gallows 

255. hapless 

256. ope 

257. multiply 

258. escape 

259. remove 

260. despatch 

261. fits 

262. beard 

263. drives 

264. knights 

265. threats 

266. shipping 

267. knell 

268. groaning 

269. alone 

270. gather 

271. bestow 

272. window 

273. silks 

274. passions 

275. forlorn 

276. behoof 

277. realm's 

278. nephew 

279. quickly 

280. poison 

281. ungentle 

282. felicity 

283. towers 

284. run 

285. casts 

286. pope 

287. certify 

288. rent 

289. ruin 

290. channel 

291. robes 

292. street 

293. envied 

294. display 

295. preach 

296. expel 

297. knees 

298. comes 

299. running 

300. hope 

301. soldier's 

302. wait 

303. favourite 

304. greater 

305. mean 

306. home 

307. villain 

308. sir 

309. further 

310. unhappy 

311. dies 

312. bears 

313. read 

314. rich 

315. chance 

316. murderer 

317. store 

318. vain 

319. nay 

320. suffer 

321. bear 

322. longer 

323. father's 

324. aside 

325. pierce 

326. away 

327. people 

328. monstrous 

329. thrust 

330. misery 

331. extreme 

332. crave 

333. say'st 

334. room 

335. clean 

336. offended 

337. unnatural 

338. request 

339. sorrows 

340. going 

341. ring 

342. trust 

343. ships 

344. fare 

345. stays 

346. looking 

347. treasure 

348. feast 

349. quiet 

350. having 

351. dishonour 

352. perform 

353. cursed 

354. sighs 

355. distressed 

356. loves 



357. worst 

358. speech 

359. favours 

360. prison 

361. kingly 

362. haughty 

363. sake 

364. gentle 

365. die 

366. hair 

367. reward 

368. peers 

369. enmity 

370. arms 

371. dearest 

372. servant 

373. challenge 

374. sums 

375. new-made 

376. girl 

377. determined 

378. demand 

379. silly 

380. brethren 

381. provide 

382. orient 

383. ashore 

384. trade 

385. laws 

386. naught 

387. keys 

388. door 

389. friendly 

390. vanish 

391. rice 

392. nation 

393. trow 

394. admit 

395. heaven's 

396. homage 

397. bills 

398. aid 

399. mean'st 

400. pledge 

401. stony 

402. spoil 

403. dearly 

404. sole 

405. commands 

406. uncontroll'd 

407. jar 

408. injuries 

409. arriv'd 

410. passionate 

411. lead 

412. humbly 

413. company 

414. argues 

415. mere 

416. chiefest 

417. 'twixt 

418. want 

419. groom 

420. slack 

421. work 

422. walking 

423. rob 

424. patiently 

425. higher 

426. banks 

427. swell 

428. pity 

429. equally 

430. discharge 

431. ocean 

432. write 

433. subscribe 

434. entreat 

435. waits 

436. violent 

437. bound 

438. regiment 

439. equal 

440. titles 

441. create 

442. poor 

443. exile 

444. know'st 

445. parley 

446. threaten 

447. stay 

448. 'tis 

449. offend 

450. cross 

451. spite 

452. parliament 

453. worship 

454. poverty 

455. multitude 

456. exil'd 

457. banquet 

458. shut 

459. hanged 

460. clothes 

461. whipt 

462. rogue 

463. shirt 

464. god-a 

465. monastery 

466. revealed 

467. lived 

468. lodging 

469. intolerable 

470. batter 

471. island 

472. sauced 

473. juice 

474. broth 

475. proverb 

476. rock 



477. carried 

478. neatly 

479. reveal 

480. counting-house

481. lik'st 

482. locks 

483. requisite 

484. laughed 

485. chambers 

486. 'faith 

487. usurer 

488. physic 

489. gallery 

490. thou'lt 

491. 'scape 

492. critical 

493. sessions 

494. afire 

495. doors 

496. sacrifice 

497. diamond 

498. barefoot 

499. market-place

500. bullets 
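The two lists above can be put to work in several ways. The Python sketch below shows one simple possibility, in the spirit of Zeta-style scatter plots: for a text segment, it computes the proportion of distinct word types that belong to each marker list, which yields a pair of coordinates for that segment. This is an assumption made for illustration and not the scoring formula of Section 4.5.5, which is not reproduced in this appendix; the names `word_types` and `marker_coordinates` are hypothetical, and the two sets contain only the first six entries of each list.

```python
import re

# First six entries of each list above, for illustration only.
SHAKESPEARE_MARKERS = {"duke", "god's", "god", "tongue", "hours", "wife"}
MARLOWE_MARKERS = {"gold", "cast", "wealth", "hard", "money", "words"}

def word_types(text):
    """Return the set of distinct lower-cased word forms in a text."""
    return set(re.findall(r"'?[a-z]+(?:['-][a-z]+)*'?", text.lower()))

def marker_coordinates(segment):
    """Return (x, y) for a segment.

    x is the share of the segment's word types found among the
    Shakespearean markers; y is the share found among the Marlowian ones.
    An empty segment is mapped to (0.0, 0.0).
    """
    types = word_types(segment)
    if not types:
        return (0.0, 0.0)
    x = len(types & SHAKESPEARE_MARKERS) / len(types)
    y = len(types & MARLOWE_MARKERS) / len(types)
    return (x, y)

print(marker_coordinates("gold wealth duke sir"))
```

Under this assumed scheme, a segment whose first coordinate clearly exceeds its second would lean towards the Shakespearean list, and vice versa; any threshold for an actual attribution would have to come from the methodology chapter.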

 

APPENDIX 5 

About us section on the interface of ALTXA 

 