UNIVERSIDAD COMPLUTENSE DE MADRID 
FACULTAD DE CIENCIAS FÍSICAS 

 
TESIS DOCTORAL 
 

Past climate studies with optimized networks using artificial 
intelligence 

 
Estudios del clima pasado con redes optimizadas usando 

inteligencia artificial 
 

MEMORIA PARA OPTAR AL GRADO DE DOCTOR 
 

PRESENTADA POR 
 

Fernando Jaume Santero 
 

Directores 
 

Ricardo García Herrera 
David Barriopedro Cepero 
Natalia Calvo Fernández 

 
Madrid 
 
 
© Fernando Jaume Santero, 2021 


Past climate studies with optimized
networks using artificial intelligence
Estudios del clima pasado con redes

optimizadas usando inteligencia artificial

Tesis que presenta
Fernando Jaume Santero

para optar al título de
Doctor

Directores:
Dr. Ricardo García Herrera

Dr. David Barriopedro Cepero
Dr. Natalia Calvo Fernández

Departamento de Física de la Tierra y Astrofísica
Facultad de Ciencias Físicas

Universidad Complutense de Madrid
Instituto de Geociencias, CSIC-UCM

Madrid, 2021


- A la memoria de Francisco Jaume Aguiló y Lindsay Pickler,
os echamos de menos. -


There ain’t no such thing as a free lunch.

Robert A. Heinlein


Acknowledgments

En primer lugar, tengo que agradecer a mis directores de tesis Ricardo Gar-
cía Herrera, David Barriopedro y Natalia Calvo por darme la oportunidad de
realizar este increíble doctorado. Además, me gustaría agradecer a mis com-
pañeros del grupo STREAM (y asociados): José Manuel, Antonio, Froila,
Maddalen, Javier, Leopoldo, Sancho, Samuel, Solange, Marina, Verónica,
Jacob, Marta, Carlos, Blanca, Álvaro, Adrián y Marie por los buenos mo-
mentos que pasamos juntos y las críticas constructivas que me han hecho
mejor científico. Gracias a la gente de Canadá (Almudena, Francisco, Ar-
lette, Ludovick, Lydia), a mis directores de máster Hugo Beltrami y Jean
Claude Mareschal, al grupo PalMA (en especial a Fidel y Marisa) y a Victor
Ocaña, porque sin ellos no me hubiera adentrado en el fascinante mundo de
las ciencias de la Tierra. Dar las gracias también a Stefan Brönnimann, Jörg
Franke y Raphael Neukom por la estancia que tuve el privilegio de hacer
en la bonita ciudad de Berna. Mi reconocimiento al Ministerio de Ciencia
e Innovación y al Ministerio de Universidades del Gobierno de España por
su apoyo económico a través de la beca FPI (BES-2016-077030) del proyecto
PALEOSTRAT (CGL2015-69699-R). Finalmente, me gustaría agradecer a
mis amigos de la Universidad de Zaragoza: Sergio, Ignacio Hermoso, Igna-
cio Hierro, Edu de 1◦, y Edu de 2◦, a mi colega Nick, a la doctora Jenny
Turton, y a todos mis familiares, especialmente a mi madre Marisa, a mi
hermana Clara y a Pablo, a mis tíos Teresa, Pedro Pablo y Carlos, a mis
primos Victoria, Gabriel, Lisa, Tolo y Miguel, a mi abuela María Teresa, a
mi familia canadiense (Jocelyne, Andy, Marie-Ève y Bear) y a la jefa del
mundo entero, la doctora Carolyne Pickler. Gracias por apoyarme y confiar
en mí, sin vuestro apoyo esta aventura no hubiera sido posible.

ix


List of Acronyms

20CRv3 . . . . . . . NOAA-CIRES-DOE 20th Century Reanalysis version 3

AH . . . . . . . . . . . . Azores High

AI . . . . . . . . . . . . . Artificial Intelligence

AM . . . . . . . . . . . Analogue Method

AR1 . . . . . . . . . . . lag-1 Autoregressive Model

AR5 . . . . . . . . . . . Fifth Assessment Report

CCA . . . . . . . . . . Canonical Correlation Analysis

CCSM4 . . . . . . . Community Climate System Model version 4

CE . . . . . . . . . . . . Common Era

CESM . . . . . . . . Community Earth System Model

CESM-LME . . CESM Last Millennium Ensemble

CFR . . . . . . . . . . Climate Field Reconstruction

CMIP6 . . . . . . . . Coupled Model Intercomparison Project phase 6

CRO . . . . . . . . . . Coral Reef Optimization

CRO-AM . . . . . Coral Reef Optimization coupled with the Analogue Method

CRO-CCA . . . . Coral Reef Optimization coupled with the Canonical Corre-
lation Analysis

CRO-MIN . . . . Minimum subset of perfect pseudo-proxies obtained with CRO-
AM necessary to outperform the reconstruction skill of the
full-proxy network

xi


xii List of Acronyms

CRO-OPT . . . . Optimized subset of perfect pseudo-proxies of the PAGES-2k
network obtained with CRO-AM that yields the best recon-
struction skill

E-Obs . . . . . . . . . European grid of Temperature Observations

EOF . . . . . . . . . . Empirical Orthogonal Function

FDR . . . . . . . . . . False Discovery Rate

FSNTOAC . . . Clear-sky net solar flux at top of the atmosphere

GCM. . . . . . . . . . General Circulation Models

GHG . . . . . . . . . . Greenhouse Gases

GISTEMP . . . . Goddard Institute for Space Studies Surface Temperature

GMT. . . . . . . . . . Global Mean Temperature

HadCRUT . . . . Hadley Centre/Climatic Research Unit Temperature

IL . . . . . . . . . . . . . Icelandic Low

IPCC . . . . . . . . . Intergovernmental Panel on Climate Change

LIA . . . . . . . . . . . Little Ice Age

LMR . . . . . . . . . . Last Millennium Reanalysis

MCA . . . . . . . . . . Medieval Climate Anomaly

ML . . . . . . . . . . . . Machine Learning

MSE . . . . . . . . . . Mean-Square Error

NAM . . . . . . . . . . Northern Annual Mode

NAO . . . . . . . . . . North Atlantic Oscillation

PAGES-2k . . . . Past Global Changes

PC . . . . . . . . . . . . Principal Components

PMIP4 . . . . . . . . Paleoclimate Modelling Intercomparison Project phase 4

RMSE . . . . . . . . Root-Mean-Square Error


List of Acronyms xiii

SLP . . . . . . . . . . . Sea Level Pressure

SLP-Obs . . . . . . Set of SLP Observations

SNR . . . . . . . . . . Signal to Noise Ratio

SST . . . . . . . . . . . Sea Surface Temperatures

SVD . . . . . . . . . . Singular Value Decomposition


Summary

The availability of high-quality climate records decreases backwards in time,
and the associated increase in uncertainty supports the use of complemen-
tary sources of climate information (such as model simulations) to understand
the underlying physics of the climate system, as well as its past and future
changes. In this Ph.D. thesis we assess the potential of Artificial Intelli-
gence as an additional efficient tool to solve complex problems in the field of
climate sciences. We show that these techniques can optimize the informa-
tion coming from different sets of climate networks such as meteorological
stations, historical records, and paleoclimate archives. Being employed to
address a plethora of questions, they share issues in terms of incompleteness.
Within this framework, we address different problems that are common in
the climate community by developing tailored methodologies with the same
goal of maximizing the extraction of information from incomplete climate
datasets. The developed approaches include metaheuristic algorithms and
cluster analyses and will be applied to incomplete datasets that are typically
employed for paleo-climate reconstructions and regional climate assessments,
respectively.

Recent developments in metaheuristics have shown the efficiency of evolu-
tionary algorithms to solve an extensive set of optimization problems. Specif-
ically, the Coral Reef Optimization algorithm has provided robust solutions
to high dimensional problems in atmospheric and climate sciences. Here, we
combine this optimization algorithm with different climate field reconstruc-
tion methods to find an optimal subset of pseudo-proxies that minimizes

xv


xvi Summary

the spatial bias of annual temperature reconstructions induced by the non-
homogeneous distribution of currently available records. The results indicate
that under certain conditions small subsets of records situated over represen-
tative locations can outperform the reconstruction skill of the full network.
These locations highlight the importance of high-latitude regions and ma-
jor teleconnection areas to reconstruct annual global temperature fields for
the last millennium and their simulated responses to external forcings and
internal variability. However, low frequency temperature variations such as
the transition between the Medieval Climate Anomaly and the Little Ice Age
are better resolved by pseudo-proxies situated at lower latitudes. According
to our idealized experiments a careful selection of proxy locations should be
performed depending on the targeted time scale of the phenomenon to be
reconstructed.

Moreover, we use the Coral Reef Optimization algorithm coupled with
the Analogue Method to obtain a new high-resolution (1◦ × 1◦) reconstruc-
tion of monthly North Atlantic Sea Level Pressure fields since 1750. After
assigning an optimized set of local weights to a network of land-based instru-
mental observations, the reconstruction skill is improved, particularly over
poorly sampled regions, such as that under the influence of the Azores High.
The reconstruction reproduces realistic variations of regional climate patterns
such as the North Atlantic Oscillation and the Azores High as compared to
other observational-based datasets. The results indicate that recent multi-
decadal changes in the winter Azores High intensity have been the highest of
the past 250 years, being concurrent with the prominent positive trend of the
winter North Atlantic Oscillation from the 1960s to the 1990s. Moreover, dif-
ferences in instrumental-based historical series of the winter North Atlantic
Oscillation are partially explained by disparities on the reconstruction of the
Azores High, rather than on the Iceland Low. Our findings also confirm
the importance of the summer Azores High for the European climate. In
particular, displacements of the Azores High center towards the north-east
coincided with extremely warm summers in western Europe, as inferred from
independent temperature reconstructions. Overall, the results suggest that


Summary xvii

substantial improvements in the characterization of the past North Atlantic
atmospheric variability could be achieved by reducing current uncertainties
of the Azores High past behaviour.

Finally, a new clustering technique (known as k-gaps) has been designed
to improve the clustering retrieved from traditional methods when applied
to climate datasets with sparse records and/or incomplete temporal infor-
mation. This method provides a new approach to cluster non-overlapping
and discontinuous time series, therefore exploiting information that other-
wise could be eliminated with data homogenization procedures. The method
has been applied to European station-based daily temperature series since
1950 and validated with synthetic datasets. The results show that k-gaps
performs well for limited networks in terms of both number of records and
temporal availability. The algorithm can generate a climatically consistent
clusterings similar to those obtained with complete time series, and outper-
forms other clustering methodologies developed to work with fragmentary
information. Therefore, k-gaps can provide a useful tool for regional assess-
ments of long-term trends and the detection of historical extreme events at
regional scales.


xviii Summary

Publications related to this Thesis

• Jaume-Santero, F., Barriopedro, D., García-Herrera, R., Calvo, N.
and Salcedo-Sanz, S. Selection of optimal proxy locations for temper-
ature field reconstructions using evolutionary algorithms. Sci. Rep.,
vol. 10(1), ISSN 2045-2322, doi:10.1038/s41598-020-64459-6, 2020.

• Carro-Calvo, L., Jaume-Santero, F., García-Herrera, R. and Salcedo-
Sanz, S. k-Gaps: a novel technique for clustering incomplete climatolog-
ical time series. Theor. Appl. Climatol., ISSN 1434-4483, doi:10.1007/s00704-
020-03396-w, 2020.

• Jaume-Santero, F., Barriopedro, D., García-Herrera, R. and Luter-
bacher, J. Monthly North Atlantic Sea Level Pressure reconstruction
back to 1750 CE using Artificial Intelligence optimization. J. Clim,
(Submitted) 2021.

Other publications

• Cuesta-Valero, F. J., García-García, A., Beltrami, H., Zorita, E. and
Jaume-Santero, F. Long-term Surface Temperature (LoST) database
as a complement for GCM preindustrial simulations. Clim. Past, vol.
15(3), pages 1099-1111, doi:10.5194/cp-15-1099-2019, 2019.

• Salcedo-Sanz, S., García-Herrera, R., Camacho-Gómez, C., Alexandre,
E., Carro-Calvo, L. and Jaume-Santero, F. Near-optimal selection
of representative measuring points for robust temperature field recon-
struction with the cro-sl and analogue methods. Glob. Planet. Change,
vol. 178, pages 15-34, doi:10.1016/j.gloplacha.2019.04.013, 2019.

• Franke, J., Valler, V., Brönnimann, S., Neukom, R. and Jaume-Santero,
F. The importance of input data quality and quantity in climate field
reconstructions - results from the assimilation of various tree-ring col-
lections. Clim. Past, vol. 16(3), pages 1061-1074, doi:10.5194/cp-16-
1061-2020, 2020.


Resumen

La disponibilidad de datos climáticos decrece exponencialmente a medida que
retrocedemos en el tiempo, siendo muchas veces necesario el uso de fuentes
complementarias de información (como las simulaciones de modelos de cir-
culación general) para comprender la física subyacente del sistema climático,
así como sus cambios pasados y futuros. En esta tesis doctoral evaluamos
el potencial de la Inteligencia Artificial como una herramienta eficiente que
se puede usar para resolver problemas complejos en la ciencia del clima.
Mostramos como estas técnicas pueden maximizar la información proveniente
de diferentes conjuntos de redes climáticas, como estaciones meteorológicas,
registros históricos y proxies paleoclimáticos. Todos ellos comparten un pro-
blema similar: son datos incompletos que proporcionan información por un
periodo de tiempo limitado. Por lo tanto, hemos abordado diferentes pro-
blemas cuyo objetivo común es maximizar la extracción de información de
conjuntos de datos incompletos. Los métodos desarrollados incluyen algorit-
mos metaheurísticos y análisis de conglomerados.

Recientes avances en metaheurística han demostrado la eficiencia de los
algoritmos evolutivos para resolver diferentes problemas de optimización. Es-
pecíficamente, el algoritmo de Coral Reef Optimization ha proporcionado
soluciones robustas a problemas de alta dimensionalidad en ciencias de la
Tierra y del clima. En esta tesis, combinamos este algoritmo de optimización
con diferentes métodos de reconstrucción climática para minimizar el sesgo
espacial inducido por la distribución no homogénea de datos paleoclimáticos.
Los resultados indican que subconjuntos de series situadas en zonas claves

xix


xx Resumen

pueden mejorar la reconstrucción obtenida con la red completa de datos en
paleo-reconstrucciones del último milenio. Se ha evidenciado la importan-
cia de las altas latitudes y áreas de teleconexión para reconstruir campos de
temperatura global anual y sus respuestas a los forzamientos externos y vari-
abilidad interna del sistema. Sin embargo, las variaciones de temperatura
a largo plazo, como la transición entre la Anomalía Climática Medieval y
la Pequeña Edad de Hielo, se resuelven mejor con registros situados en lat-
itudes más bajas. Nuestros experimentos indican que se debe realizar una
selección de ubicaciones representativas en función de la escala del fenómeno
investigado.

Siguiendo con la misma metodología, utilizamos el Coral Reef Optimiza-
tion junto con el Método de Análogos para obtener una nueva reconstrucción
en alta resolución (1◦× 1◦) de campos mensuales de Presión a Nivel del Mar
sobre el Atlántico Norte desde 1750. Estos campos han sido generados a
partir de una red optimizada de observaciones terrestres. Se ha comprobado
como la técnica de optimización maximiza la robustez de la reconstrucción de
los campos de presión en toda la región de estudio, incluso cuando solo hay
unas pocas observaciones disponibles, reproduciendo variaciones realistas de
los patrones climáticos regionales como la Oscilación del Atlántico Norte, y el
Anticiclón de las Azores desde mediados del siglo XVIII. Esta reconstrucción
optimizada muestra una tendencia positiva en la intensidad del Anticiclón
de las Azores en invierno, siendo la tendencia decenal más alta de los últi-
mos 250 años. Este cambio coincide con la tendencia positiva prominente
de la Oscilación del Atlántico Norte de invierno desde la década de 1960
hasta la de 1990. Además, también encontramos que las diferencias en las
reconstrucciones de la Oscilación del Atlántico Norte se explican en parte por
las diferencias en la reconstrucción del Anticiclón de las Azores, destacando
la importancia de reconstruir este sistema persistente de alta presión para
reproducir el clima de la región. Esta reconstrucción también muestra que
los desplazamientos de este anticiclón hacia el noreste provocaron eventos de
calentamiento extremo en Europa Occidental recogidos por fuentes de tem-
peratura independientes. Por tanto, los resultados sugieren que se podría


Resumen xxi

mejorar sustancialmente la caracterización y estudio de la variabilidad at-
mosférica del Atlántico Norte en el pasado, reduciendo las incertidumbres a
la hora de reconstruir el Anticiclón de las Azores.

Finalmente, hemos desarrollado una nueva técnica de conglomerados (de-
nominada k-gaps) con el objetivo de mejorar los analisis de climas regionales
utilizando conjuntos incompletos de datos climáticos. Este método propor-
ciona un nuevo enfoque para agrupar series de tiempo de diferentes longitudes
temporales, utilizando la mayor parte de la información recogida en conjun-
tos de series climáticas, que muchas veces se elimina durante los procesos de
homogeneización de datos. El método se ha aplicado a series de temperat-
uras diarias medidas en estaciones europeas desde 1950 y se ha validado con
conjuntos de datos sintéticos. Los resultados muestran que k-gaps es ideal
para el agrupamiento de pequeños conjuntos de datos climáticos (con pocas
muestras) y con huecos en las series temporales. El algoritmo puede generar
una regionalización climáticamente consistente similar a las obtenidas con
series de tiempo completas, superando a otras técnicas similares desarrolla-
das para trabajar con falta de información. Por lo tanto, k-gaps pueden ser
una herramienta útil para los análisis regionales de tendencias climáticas a
largo plazo y la detección de eventos extremos históricos a escalas regionales.


Contents

Acknowledgments ix

List of Acronyms x

Summary xv

Resumen xix

1 Introduction 1
1.1 Artificial Intelligence as a tool for Climate Science . . . . . . . 3
1.2 Main objectives and structure of the Thesis . . . . . . . . . . 5

1.2.1 Thesis structure . . . . . . . . . . . . . . . . . . . . . . 6

2 Data 9
2.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.1 Model simulations . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Observations . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Reanalysis . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Paleoclimate data . . . . . . . . . . . . . . . . . . . . . 15

2.2 Data post-processing . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Pseudo proxies . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Pseudo SLP observations . . . . . . . . . . . . . . . . . 17

3 Methodology 19
3.1 Climate Field Reconstruction methods . . . . . . . . . . . . . 20

3.1.1 Analogue Method . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Canonical Correlation Analysis . . . . . . . . . . . . . 24

3.2 The Coral Reef Optimization . . . . . . . . . . . . . . . . . . 26
3.2.1 Coral solutions . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . 29

xxiii


xxiv Contents

3.2.3 Search operators . . . . . . . . . . . . . . . . . . . . . 31
3.3 Clustering techniques . . . . . . . . . . . . . . . . . . . . . . . 33

3.3.1 Assumptions and Definitions . . . . . . . . . . . . . . . 34
3.3.2 The k-gaps algorithm . . . . . . . . . . . . . . . . . . . 36
3.3.3 Centroids calculation . . . . . . . . . . . . . . . . . . . 38
3.3.4 Assignment of records to clusters . . . . . . . . . . . . 41
3.3.5 Stop conditions . . . . . . . . . . . . . . . . . . . . . . 42

4 Selection of proxy locations for temperature reconstructions 43
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Selection of representative locations . . . . . . . . . . . . . . . 45
4.3 Reconstruction of temperature patterns . . . . . . . . . . . . . 59
4.4 Insights on past anomalous periods . . . . . . . . . . . . . . . 64

5 North Atlantic SLP reconstruction since 1750 73
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Optimized networks . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Skill of the optimized networks . . . . . . . . . . . . . . . . . 82
5.4 Climate variability and the Azores High . . . . . . . . . . . . 86

6 Clustering incomplete climatological time series 99
6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Ideal case with complete information . . . . . . . . . . . . . . 100
6.3 Experiment setting . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Performance assessment . . . . . . . . . . . . . . . . . . . . . 104
6.5 Applications on climate studies . . . . . . . . . . . . . . . . . 110

7 Conclusions and outlook 115

A Supplementary Figures 123

B Supplementary Tables 127

Bibliography 133


List of Figures

3.1 Pearson correlation (blue) and variability ratio (red) for AM
reconstructions of global temperature fields as a function of the
number of analogues. For each member of the CESM-LME the
remaining 12 full-forcing simulations are used to reconstruct
the global temperature fields from the full-proxy PAGES-2k
network of perfect pseudo-proxies, using different number of
analogues. Shading shows the spread (two standard deviations
with respect to the mean values). (a) Correlation and variabil-
ity ratio for AM reconstructions with 1 to 100 analogues. (b)
A zoom of the black dashed square in (a). . . . . . . . . . . . 22

3.2 Sea level pressure reconstruction skill of the Analogue Method.
The performance of the method is shown as a function of the
number of selected analogues for (a) the RMSE, (b) the Pear-
son correlation, and (c) the variability ratio between the re-
constructions using observations and the 20CRv3 reanalysis
during the 1836-2004 time period. . . . . . . . . . . . . . . . . 24

3.3 Coral solutions. (a) Binary selection of representative proxy
locations for the reconstruction of annual 2-m air temperature
fields from 850 to 2005 CE. (b) Weighting of weather stations
for the reconstruction of SLP fields from monthly observations
since 1750 CE. . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 CRO flowchart imitating the biological processes of corals within
a reef. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5 Data records considered (a and b), and the resulting series
obtained by merging them (c). All the records are presented
with their respective masks. . . . . . . . . . . . . . . . . . . . 35

3.6 Representations of two synthetic records. Series with different
mean values (a and b), but with correlated variabilty (c). . . . 36

xxv


xxvi List of Figures

3.7 k-Gaps flowchart. Circles contain general clustering proce-
dures, and squares describe specific k-gaps operations. The al-
gorithm’s conditions are represented as diamonds in the flowchart. 37

4.1 RMSE of 850-2005 CE global temperature fields reconstructed
with different subsets of perfect pseudo-proxies from the PAGES-
2k network. Orange dots represent the RMSE associated with
the reconstructions using the optimized subsets of 30, 120, and
400 perfect pseudo-proxies of the PAGES-2k network obtained
with the CRO-AM. Blue violins show the RMSE distribution
obtained from 10000 reconstructions using different combina-
tions of 30, 120, and 400 randomly selected pseudo-proxies
from the PAGES-2k network. RMSE are calculated with re-
spect to the global temperature fields of the target simulation
(the first member of the CESM-LME). . . . . . . . . . . . . . 47

4.2 RMSE of the 850-2005 CE global temperature fields recon-
structed with CRO-AM as a function of the number of se-
lected perfect (blue) and noisy pseudo-proxies (purple) of the
PAGES-2k network. The green and orange shades represent
the 13-member reconstruction skill spread obtained with all
(569) perfect and noisy pseudo-proxies (with SNR of 1) of
the PAGES 2-k network. The blue-shaded area represents the
spread obtained by using the optimized subset of N pseudo-
proxies obtained for the first CESM-LME member to recon-
struct the remaining members of the ensemble. The purple-
shaded area is the same as the blue-shaded one but for re-
constructions using noisy pseudo-proxies with SNR of 1. All
shades depict 2 standard deviations with respect to the mean. 48

4.3 RMSE of CCA reconstructions generated with the optimized
subsets of perfect pseudo-proxies of the PAGES-2k network se-
lected by the CRO-AM.The red dot and dashed line highlight
the minimum RMSE. . . . . . . . . . . . . . . . . . . . . . . . 49


List of Figures xxvii

4.4 Performance of the CRO-AM reconstruction with the optimal
subset of PAGES-2k records (CRO-OPT). (a) Spatial distribu-
tion of CRO-OPT records (orange dots) obtained from the full
PAGES-2k network (purple diamonds). (b) Spatial correla-
tion difference between the temperature reconstructions with
CRO-OPT and all perfect pseudo-proxies. Stippling points
illustrate significant correlation differences (p < 0.05). Ker-
nel density estimation of the (c), Normalized latitudinal dis-
tribution of records (in % with respect to the total num-
ber of pseudo-proxies) for the CRO-OPT subset (orange) and
the full-proxy PAGES-2k network (purple). (d) Latitudinal
mean Pearson correlations for the CRO-OPT (orange) and
full-proxy (purple) reconstructions. (e) Latitudinal logarithm
of the standard deviation ratio for the CRO-OPT (orange)
and full-proxy (purple) reconstructions (σrec) compared with
the target simulation (σori). The latitudinal axis is propor-
tional to the effective area. . . . . . . . . . . . . . . . . . . . . 51

4.5 Optimized subsets of 17 perfect pseudo-proxies of the PAGES-
2k network selected by CRO-AM (CRO-MIN) and CRO-CCA.
(a) 2-D and (b) latitudinal distributions of the CRO-MIN lo-
cations obtained with CRO-AM (orange dots and shading)
and the corresponding subset of perfect pseudo-proxies of the
PAGES-2k network (with the same size as CRO-MIN) ob-
tained with CRO-CCA (purple diamonds and line). . . . . . . 52

4.6 Sensitivity of CRO-AM reconstructions to pseudo-proxies with
different levels of observational error. (a) Latitudinal distribu-
tion of the optimized subsets of 30 locations selected by CRO-
AM from a PAGES-2k network of perfect pseudo-proxies (SNR
= ∞, orange shading), and noisy pseudo-proxies with SNR =
1 (purple line) and SNR = 0.5 (blue line). (b) RMSE of CRO-
AM reconstructions from pseudo-proxies with different SNR.
For each type of pseudoproxies, symbols indicate the RMSE
of the reconstruction obtained with the optimized subsets of
locations found for perfect pseudo-proxies (orange dots), and
noisy pseudo-proxies with SNR of 1 (purple diamonds) and
0.5 (blue crosses). Blue violins illustrate the RMSE distri-
butions of 10000 reconstructions obtained from subsets of 30
PAGES-2k locations selected at random. . . . . . . . . . . . . 53


xxviii List of Figures

4.7 RMSE difference between the CRO-OPT and full-proxy re-
constructions. Green (purple) color illustrates regions where
CRO-OPT yields lower (higher) RMSE than the reconstruc-
tion with the full PAGES-2k network of perfect pseudo-proxies.
RMSE are calculated with respect to the target field (the first
CESM-LME member). . . . . . . . . . . . . . . . . . . . . . . 54

4.8 Spatial map of e-folding distances of decorrelation for the an-
nual temperature of the first full-forcing CESM-LME member.
The distance (in kilometers) for each grid point defines the
area of the circle for which the averaged coefficient of deter-
mination (R2) has decayed below e−1. . . . . . . . . . . . . . . 54

4.9 As Fig. 4.4 but using the global temperature fields of the
CCC400 first ensemble member (1601-2005 CE) as target. The
reconstruction has been obtained from the optimized subset
of perfect pseudo-proxies of the PAGES-2k network (with the
same size as CRO-OPT) selected by CRO-AM in the CCC400
model ensemble. . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.10 Latitudinal distribution of optimized subsets of the PAGES-
2k network using different model ensembles as a pool for the
CRO-AM reconstruction of the first CESM-LMEmember. Each
subset includes 17 perfect pseudo-proxies obtained with the
CRO-AM using as a pool members of the CESM-LME (or-
ange shading) and the CCC400 ensemble (black line). In both
cases, the target is the 850-2005 CE global temperature fields
of the first member of the CESM-LME. The purple line illus-
trates the distribution of full-proxy PAGES-2k network. . . . . 57

4.11 Estimates of GMT anomalies (◦C) for the last millennium
as inferred from selected subsets of the PAGES-2k network.
(a) GMT anomalies from the LMR for 850-2000 CE (Inset
(a): GMT anomalies from HadCRUT4.2 for 1850-2000 CE).
Purple and orange lines show the GMTg of these datasets,
defined as the area-weighted temperature mean for the grid
points matching the PAGES-2k and CRO-OPT locations, re-
spectively. All anomalies are computed with respect to the
1961-1990 baseline. (b) Coefficient of determination between
the time series of GMT and GMTg from PAGES-2k (purple)
and CRO-OPT (orange) locations. Violins illustrate the dis-
tributions obtained for 10000 subsets (with the same size as
CRO-OPT) of randomly-selected locations from the PAGES-
2k network (blue) and the full global grid (red). . . . . . . . . 58


List of Figures xxix

4.12 Reconstruction skill of internal variability patterns with the
CRO-MIN subset of PAGES-2k records. Composite of annual
temperature anomalies (◦C, with respect to 850-2005 CE) for
El Niño events in (a), the target field (the first CESM-LME
full-forcing member). (b) The reconstructed field from the
CRO-MIN subset of perfect pseudo-proxies of the PAGES-2k
network (yellow dots). For each panel, crosses depict non-
significant temperature differences at 95% confidence level with
respect to its corresponding climatology inferred from a boot-
strap of 10000 random samples. El Niño events are defined as
years of the target simulation with standardized temperatures
above the 95th percentile at El Niño-3.4 region (black square). 60

4.13 Percentile distribution of simulated NAM values in the first
CESM-LME member (850-2005 CE) and their corresponding
reconstructions from the CRO-OPT subset of the PAGES-2k
network. Black diamonds represent the mean simulated NAM
for each percentile range, with vertical black lines showing
their respective minimum and maximum values. Blue violins
show the distribution of the reconstructed NAM values for the
same years included in each percentile range and 100 different
NAM reconstructions. Mean values of the violin distributions
are depicted as horizontal blue lines. . . . . . . . . . . . . . . 61

4.14 Annual temperature anomalies (oC) after Tambora’s eruption
(1815 CE) in (a) target simulation and (b) CRO-OPT recon-
struction. Anomalies are calculated as the difference between
the year after the eruption and the mean temperature of the
2 previous years. . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.15 Detection of external forcings in the reconstruction with the
CRO-OPT subset of the PAGES-2k network. (a) Annual mean
clear-sky net solar flux at top of the atmosphere for three
single-forcing ensemble simulations. Percentages of the 100
best analogue years selected from CRO-OPT with the same
dominant forcing as in the given year of the target simulation.
(b) volcanic, (c) greenhouse gases, and (d) solar forcing. Black
dashed lines depict the significance thresholds above which
there is an instantaneous detection of forced signals attributed
to the given forcing. . . . . . . . . . . . . . . . . . . . . . . . . 64

4.16 Spatial pattern of mean temperature difference (oC) between
MCA (950-1250 CE) and LIA (1450-1850 CE) in (a) the target
simulation and (b) the CRO-OPT reconstruction. . . . . . . . 66


xxx List of Figures

4.17 Distribution of representative locations for different experi-
ments with perfect pseudo-proxies and the CRO-AM. (a) 2-D
and (b-d), latitudinal distribution of optimized subsets of per-
fect pseudo-proxies (with the same size as CRO-MIN) selected
with the CRO-AM for different optimization problems. Opti-
mized reconstruction of the global annual temperature fields of
the last millennium from locations constrained to the PAGES-
2k network (CRO-MIN, orange) and from an unconstrained
selection (Free, purple). Optimized subsets of the PAGES-2k
network for the reconstruction of the global annual tempera-
ture fields of the MCA (red) and LIA (blue) periods separately,
and the spatial pattern of the mean temperature difference be-
tween the MCA and LIA (MCA-LIA, green). Latitudinal dis-
tribution of (b) CRO-MIN (orange shading) and Free (purple
line). (c) MCA-LIA. (d) MCA (red shading) and LIA (blue
line). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.18 Time series with the total frequency of analogues for all years
of the MCA (950-1250 CE, red) and LIA (1450-1850 CE, blue)
in the CRO-OPT reconstruction. For each year of the MCA
and LIA in the target simulation, the 100 best analogues of
the annual temperature at the CRO-OPT locations are se-
lected. Their respective years of occurrence are retained and
accumulated through the MCA and LIA periods, separately. . 68

4.19 Number of records at polar regions (latitudes above 65◦N/S) in
optimized subsets of 20 perfect pseudo-proxies of the PAGES-
2k network for CRO-CCA reconstructions of global tempera-
ture fields on different time scales. The CRO-CCA has been
run to find subsets of 20 perfect pseudo-proxy locations of
the PAGES-2k network that optimize the reconstruction of
the global temperature fields of the first CESM-LME member
at annual, decadal and centennial time scales by using a 1-,
10- and 100-year low-pass filter smoothing, respectively. The
distributions include 200 different optimized solutions from 5
CRO-CCA runs (40 best solutions of each run were kept).
Boxes represent the median (orange line) and interquartile
ranges of the distribution, with whiskers denoting extreme so-
lutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


List of Figures xxxi

4.20 Power density spectrum of GMTs series for 850-2005 CE. Color
lines show the power spectrum of area-weighted global mean
temperature anomalies calculated for the target simulation
(black), the CRO-OPT reconstruction (orange), and the CRO-
AM reconstructions generated with the full-proxy PAGES-2k
network of perfect pseudo-proxies (purple), and an optimized
subset of 150 noisy pseudo-proxies with SNR=1 (red). Opaque
lines depict the mean spectral density of the ensemble, and
transparent lines represent individual spectral densities of all
(13) members of the ensemble. . . . . . . . . . . . . . . . . . . 70

5.1 Spatio-temporal distribution of SLP observations. (a) Spa-
tial distribution of stations with monthly SLP observations
for 1750-2004 CE. Shading shows the percentage of time with
available observations over the 1750-2004 CE period, with
darker shading indicating longer time series. (b) Evolution
of the frequency of observations (in percentage with respect
to the total number of stations) for 1750-2004 CE. . . . . . . . 76

5.2 Spatial distribution of optimized monthly weights for the ob-
serving network obtained with the CRO-AM algorithm. Weights
(from 0 to 1) apply to the observing network of (a, c) all Jan-
uaries, and (b,d) all Junes of the (a,c) 1750-1835 CE and (b,d)
1836-2004 CE period. The size of the dot is proportional to
the magnitude of local weight, which is also indicated by shad-
ing. Grey crosses in (c) and (d) represent observations without
available information for 1750-1835 CE. . . . . . . . . . . . . . 78

5.3 As Fig. 5.2 but for the 20CRv3 reanalysis experiment of the
CRO-AM reconstruction. Monthly weights for (a) January
and (b) June obtained with a perfect (noise free) network of
SLP pseudo-observations for the period 1919-2004 CE taken
from the 20CRv3 reanalysis with the same gaps as the real
observations for the period 1750-1835 CE. . . . . . . . . . . . 80


xxxii List of Figures

5.4 Comparison of the SLP reconstruction skill obtained with and
without optimization. (Top panels) Difference of performance
(Pearson correlation coefficient with the 20CRv3 reanalysis)
between the CRO-AM and AM SLP reconstructions generated
with the observing network of: (a) 1750-1835 CE; (b) 1836-
2004 CE. Crossed regions show non-significant differences (p>0.05).
(c) Monthly mean evolution of the area-weighted root-mean-
square error of the North Atlantic SLP (with respect to 20CRv3)
for the CRO-AM (blue) and AM (orange) reconstruction and
the observing network of 1750-1835 CE (solid) and 1836-2004
CE (dashed). The performance of the 1750-1835 CE network
is evaluated over the 1919-2004 CE period of the reanalysis. . 83

5.5 Area-weighted RMSE difference between monthly SLP recon-
structions generated with and without optimization of the ob-
serving network. The area-weighted RMSE of the North At-
lantic SLP reconstructions generated with the observing net-
work of 1750-1835 (1836-2004) CE is calculated with respect
to the 1919-2004 (1836-2004) CE period of the 20CRv3 reanal-
ysis, and shown in the top (bottom) panel. Blue (red) colors
indicate that the optimized reconstruction has lower (higher)
RMSE than its non-optimized counterpart. . . . . . . . . . . . 84

5.6 Pearson correlation between SLP target fields and their opti-
mized and non-optimized reconstructions. Pearson correlation
coefficient with the 20CRv3 reanalysis between the CRO-AM
(a and b) and AM (c and d) SLP reconstructions generated
with the observing network of: (a and c) 1750-1835 CE; (b and
d) 1836-2004 CE. All grid-point correlations are significant for
a 95% confidence interval (p<0.05). . . . . . . . . . . . . . . . 85

5.7 As the top panels of Fig. 5.4 but for the SLP pseudo-reconstructions
(1750-1835 CE) of the MRI-ESM2-0 model. Shading shows
the difference in performance (Pearson correlation coefficient
with the 1750-1835 CE targeted SLP field of the MRI-ESM2-
0 model) between the SLP pseudo-reconstructions generated
with and without optimization of the pseudo-observing net-
work (the simulated SLP series at the grid points matching the
locations and availability of the observing network for 1750-
1835 CE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86


List of Figures xxxiii

5.8 SLP variability over the Azores High region from 1750 to 2004
CE. Climatological (1750-2004 CE) mean SLP (shading, in
hPa) obtained with the optimized reconstruction for: (a) win-
ter (DJF) and (b) summer (JJA). Seasonal mean time series
(1750-2004 CE) of the intensity of the (c) winter and (d) sum-
mer Azores High (blue lines, in hPa). Shading shows the un-
certainty range calculated as two standard deviations over the
5◦ × 5◦ area where the maximum of SLP is located. Dashed
lines illustrate the Azores High intensity for the HadSLP2 for
the 1850-2004 CE period. . . . . . . . . . . . . . . . . . . . . . 90

5.9 Decadal SLP trends of the Azores High pressure center in-
tensity. Decadal linear trends of the (a) winter and (b) sum-
mer SLP (blue line, in hPa per decade) obtained from the
CRO-SLP reconstruction (blue line), the 20CRv3 reanalysis
(red line), and the NCAR SLP dataset (dashed line), over the
Azores High center for 50-year running windows from 1750
to 2019 CE. Blue shading illustrates the uncertainty range
of CRO-SLP as two standard deviations with respect to the
mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.10 Contribution of the Azores High and Iceland Low to the win-
ter NAO. (a) Time series (1751-2004 CE) of the winter |AH|-
|IL| index, denoting the unbalanced contribution of the Azores
High and Iceland Low anomalies to the winter NAO. Posi-
tive (orange) and negative (blue) values show winters with
a leading influence of the Azores High and Iceland Low, re-
spectively. The red line represents the spread (standard de-
viation) of NAO indices for each winter of the 1900-2004 CE
period, calculated from a suite of instrumental-based NAO in-
dices standardized with respect to 1951-2000 CE (Table 5.1).
(b) Scatter plot (1850-2004 CE) of the spread of NAO indices
for winters dominated by the Iceland Low (blue section) and
the Azores High (orange section). Dashed lines represent sep-
arate linear regressions for each dominant component. Grey
shading shows the 95% confidence interval of the linear fits.
All series have been standardized with respect to the 1951-
2000 CE baseline. . . . . . . . . . . . . . . . . . . . . . . . . . 93


xxxiv List of Figures

5.11 Azores High shift in the extremely warm summer of 1783 CE.
(a) Summer mean SLP (shading, in hPa) for 1783 CE obtained
from the optimized reconstruction. (b) Summer mean temper-
ature anomalies (in oC, with respect to 1500-2002 CE) for 1783
CE. Black diamonds with error bars show the climatological
(1750-2002 CE) location (mean and two standard deviations)
of the Azores High center. Green crosses represent the center
of the Azores High for the summer of 1783 CE. . . . . . . . . 95

5.12 SLP and temperature difference between the top ten north-
easternmost minus top ten southwesternmost summer AH cen-
ters from 1750 to 2002 CE. (a) Summer mean SLP differ-
ence obtained from CRO-SLP. (b) Summer mean temperature
anomalies (in oC, with respect to 1500-2002 CE). White dots
show the location of the Azores High center. Red (blue) dia-
monds represent the center of the Azores High for the top ten
northeasternmost (southwesternmost) summers. . . . . . . . . 96

6.1 K-means clusterization of the E-Obs grid of daily summer tem-
peratures since 1950 to 2016 (k-means[EObs]) using absolute (a)
and standardized (b) temperatures. . . . . . . . . . . . . . . . 100

6.2 General representation of the spatial distribution of a Gaus-
sian model (blue shade) employed to generate sample-starved
datasets (with incomplete temperature series) where “×" de-
marcates the center. Darker blues indicate a higher concen-
tration of time series, whereas lighter blues depict fewer time
series. Inset: Time lengths of synthetic series generated with
random variations of predefined initial (tini) and final (tend)
days. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.3 Distributions of the number of time series per dataset (left)
and level of missing values related to the temporal length of
the records (right) associated with 500 synthetic datasets. . . . 103

6.4 Temporal (a) and spatial (b) distribution of a sample study
composed of 20 Gaussian models centered at different points
in Europe. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.5 Comparison of clustering techniques using adjusted Rand-Index
for absolute (a) and normalized (b) temperatures. The ad-
justed Rand-Index is calculated with respect to the clustering
obtained by applying k-means to the complete E-Obs gridded
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


List of Figures xxxv

6.6 Distribution of missing data for each one of the series included
in datasets P191, P280, and P404. Days without temperature
values are depicted in blue, and available daily temperatures
are shown in yellow. . . . . . . . . . . . . . . . . . . . . . . . 107

6.7 K-gaps clusterization of a study case with 815 time series
(black dots), for Basic (a) and Normalization (b) modes. . . . 108

6.8 K-gaps clusterization for P404 in Table 6.2 using the normal-
ization mode. The dataset is composed of 875 time series
(black dots) unevenly distributed over Europe. . . . . . . . . . 109

6.9 Histogram of trend errors estimated from differences between
k-gaps centroids and ideal centroids retrieved from k-means[EObs].
Temperature trends have been calculated for 500 synthetic
datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.10 Probability of detecting a heatwave within clusters obtained
using k-gaps (in normalization mode for 500 synthetic datasets)
as a function of the number of time series associated with each
cluster. The colorbar represents the number of clusters (regions).112

A.1 Sensitivity of the optimization process to area-weighting. (a)
Weights assigned to perfect pseudo-proxies as function of their
latitude in two experiments of the CRO-AM. (b) Latitudinal
distribution of optimized subsets of 17 perfect pseudo-proxies
from the PAGES-2k network obtained with area-weighted (or-
ange shading) and unweighted (purple line, CRO-MIN) ver-
sions of CRO-AM. . . . . . . . . . . . . . . . . . . . . . . . . 123

A.2 Latitudinal distributions of 17 representative locations obtained
with the CRO-CCA using years from 1850 to 2005 (orange
shade) and from 1900 to 2005 as calibration periods. . . . . . 124

A.3 CRO-SLP NAO(red and blue shade) and |AH|-|IL| (black line)
indices from 1751 to 1850 CE. Both series were standardized
with respect to the 1751-1850 CE baseline. . . . . . . . . . . . 125


List of Tables

2.1 List of datasets used in the experiments of the thesis. . . . . . 11
2.2 List of rejected observations from SLP-Obs database. . . . . . 14

4.1 RMSE of global temperature fields for 850-2005 CE (in ◦C)
using CRO-AM reconstructions with N representative AR(1)
pseudo-proxies of the PAGES-2k network and different SNR. . 49

4.2 GMT differences between the MCA (950-1250 CE) and the
LIA (1450-1850 CE). Area-weighted mean temperatures are
calculated globally and for the Northern Hemisphere (NH).
The target is the first ensemble member of the CESM-LME.
Reconstructions are generated with CRO-AM using perfect
pseudo-proxies at the locations of CRO-MIN, CRO-OPT and
the full-proxy network of PAGES-2k. . . . . . . . . . . . . . . 65

5.1 Definition of NAO and EA indices. Time series have been
obtained from the sources indicated below (when provided) or
calculated from the spatially resolved fields as detailed in the
second column. All indices have been re-standardized with
respect to 1951-2000 CE. . . . . . . . . . . . . . . . . . . . . . 88

5.2 Pearson correlation coefficients of winter NAO and EA indices.
Correlations have been calculated for the overlapping interval
of each pair of indices within the 1751-1886 CE period (to
avoid chucks in some of the series that were filled or extended
with observations from other datasets). Coefficients in bold
are statistically significant at the 95% confidence interval. In-
formation about the NAO and EA indices can be found in
Table 5.1. All indices are standardized with respect to 1951-
2000 CE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.1 Adjusted Rand-Index means of 500 synthetic case studies within
95% confidence interval for three clustering techniques. . . . . 105

xxxvii


xxxviii List of tables

6.2 Adjusted Rand-Index for k-gaps clusters with 3 synthetic datasets
selected from the pool of 500 datasets described in Subsection
6.3. Note that each dataset is composed of a different number
of time series (N), and each time series has lost, on average,
80% of the climate information for the period 1950-2016 CE. . 106

B.1 List of accepted observations from SLP-Obs database. . . . . . 127
B.2 Pearson correlations of the AH (rAH) and IL (rIL) with the

NAO spread calculated as the standard deviation of NAO in-
dices in Table 5.1 (a PC-based NAO Index from 20CRv3 is
also included). Different NAO spreads have been calculated
from 1900 to 2004 CE by excluding the series in the dropped
column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131


Chapter 1

Introduction

In 1950, Alan M. Turing wrote for the first time about computing and intel-
ligence (Turing, 1950). Since then, advances in new technologies have led to
a revolution in areas such as telecommunications, computer science, and au-
tomation that have completely redefined the way we interact with the world
(Hodson, 2018). From the way we learn to the way we work, almost every
aspect of our lives is influenced by innovative developments that improve
our well-being conditions, increase our productivity, or connect us in a vast
network that facilitates human communication, providing global access to
an (almost) infinite source of knowledge. From an engineering point of view
where the world is conceived as a set of problems, new technologies are con-
ceived as optimized solutions to everyday problems.

Optimization is therefore a key component of many fields in social (Row-
land, 1946; Ballings et al., 2016), natural (Kell, 2012; Tchemisova et al.,
2015), and physical (Bounds, 1987; Vadlamani et al., 2020) sciences. Here
is where advanced statistical techniques and fast computational power unite
to create what it is nowadays known as AI (Artificial Intelligence), a mul-
tidisciplinary field with different areas of expertise such as robotics (Rajan
and Saffiotti, 2017), machine learning (LeCun et al., 2015), and optimization
(Swarnkar and Swarnkar, 2019; Soto et al., 2019), that have shown their pro-
ficiency to master high-dimensional and non-linear problems beyond human

1


2 Chapter 1. Introduction

capabilities, sometimes without any kind of supervision (Silver et al., 2016,
2018). In this sense, technological advances in the last few decades have
transformed past Alan Turing’s ideas into practical solutions to problems of
the modern world (Kates-Harbeck et al., 2019; Ho, 2020; Jin et al., 2020).

Nevertheless, modernization has also come at an expensive cost for our
planet. Since 1950, the great acceleration of socio-economic trends (e.g. Stef-
fen et al. (2015); world population, energy and water use, transportation, and
urban construction, among many others) has exerted an important impact
on the natural environment (IPCC, 2014). These activities mainly boosted
by fossil-fuel consumption caused a global radiative energy imbalance and
major changes in the climate system as shown by many studies (e.g. Hansen
et al., 2011; Huber and Knutti, 2012) scrutinized by periodic reports of the
IPCC (Intergovernmental Panel on Climate Change) (Myhre et al., 2013). It
is also clear that these changes in Earth’s system have been primarily caused
by increased concentrations of atmospheric GHG (Greenhouse Gases) such
as carbon dioxide, nitrous oxide, and methane attributed to human activi-
ties (Ribes et al., 2017). Climate projections in the AR5 (Fifth Assessment
Report) of the IPCC obtained from state-of-the-art GCM (General Circula-
tion Models) (e.g. Flato et al., 2013; Hausfather et al., 2020) indicate that
global mean temperatures will increase up to intolerable levels by the second
half of the 21st century if no mitigation policies are implemented in the next
few years (e.g. Collins et al., 2013; IPCC, 2018). The implications of an-
thropogenic changes on the climate system are so undeniable for our society
(Sanderson and O’Neill, 2020) that United Nations has characterized them
as one of the most pressing issues of our time.

Within this framework, and taking into account the last advances in com-
puterization, it seems quite reasonable to make use of our acquired technical
background to address climate-related problems that can contribute to this
challenge through an improved characterization of the complex climate sys-
tem.


1.1. Artificial Intelligence as a tool for Climate Science 3

1.1 Artificial Intelligence as a tool for Climate

Science

The understanding of our own planet has notoriously increased in the last
decades with the implementation of systems that allow for the continuous
monitorization of the Earth system (i.e. atmosphere, ocean, land, etc). For
instance, the exponential accumulation of information obtained from ocean
and land-based stations, observation satellites, and model simulations (Aga-
piou, 2017) has expanded the ways to broad our knowledge about the dynam-
ics of the climate system, non-linear interactions and responses to external
forcings, as well as their roles in shaping the climate conditions at regional
and global scales (e.g. IPCC, 2013).

Despite these unquestionable benefits, the generation of large amounts
of observational and simulated datasets comes with a problem of big data
where expensive storage infrastructures and fast supercomputers are required
in advance to perform robust climate analyses (e.g. Schnase et al., 2016). In
this regard, ML (Machine Learning) techniques have been used as a tool to
manage large datasets by implementing massive data processing into daily
research (Knüsel, 2019) and tackle complex problems with data-driven ap-
proaches (Karpatne et al., 2019; Reichstein, 2019). They combine advanced
statistical techniques and fast computational power to generate self-learning
systems that are trained with large amounts of data without being pro-
grammed for any specific task (e.g. Murphy, 2012). Some of the problems in
climate sciences and meteorology that have been addressed using these pro-
cedures are: climate sensitivity prediction (Caldwell et al., 2014), statistical
downscaling (He et al., 2016), remote sensing (Maxwell et al., 2018), tropical
cyclone forecasting (Richman et al., 2017), soil moisture prediction (Prakash
et al., 2018), cloud physics representation (Rasp et al., 2018), land use /
land change estimations (Aburas et al., 2019) pattern recognition (Boers et
al., 2019), climate change adaptation (Biesbroek et al., 2020), and model
parameterizations (Gagne II et al., 2020). All these contributions were pos-


4 Chapter 1. Introduction

sible thanks to plentiful of data available in public and private repositories
ready to be analyzed. However, there are areas of scientific research where
observational data are scarce and/or can only be retrieved from expensive
measuring campaigns and therefore they become a scarce commodity. Such
is the case of past climate reconstructions (Masson-Delmotte et al., 2013;
Emile-Geay et al., 2017) which rely on instrumental observations and pale-
oclimate records (proxies), respectively, with limited temporal coverage and
uneven distribution over the globe. In these cases, it is not possible to apply
data-driven techniques, and solutions must then focus on maximizing the
extraction of information from a limited network of records. Solutions to
this kind of optimization problems have recently been developed using other
branches of AI such as metaheuristic algorithms (Salcedo-Sanz, 2016; Del
Ser et al., 2019) and cluster analysis (Kettenring, 2006; Omran et al., 2007;
Netzel and Stepinski, 2016).

In computer science, metaheuristic procedures such as evolutionary and
genetic algorithms use different search engines to find "good enough" solu-
tions to optimization problems (Yang et al., 2014; Abdel-Basset et al., 2018).
They are especially useful with datasets containing incomplete information
and high-dimensional combinatorial problems where discrete optimal solu-
tions are required. Although previous studies have already unveiled some
of the potential that these techniques have in several fields of geosciences
such as hydrology (Yoo and Kim, 2014), soil stabilization (Kashani et al.,
2016), and wind power reconstruction (Salcedo-Sanz et al., 2018), they still
remain underexploited by the climate community (e.g. Knüsel, 2019; Reich-
stein, 2019; Kadow et al., 2020).

On the other hand, cluster analyses are unsupervised classification tech-
niques able to organize datasets with heterogeneous information into groups
with similar features. This property makes them suitable for the study of
different fields in Earth science such as geochemistry (Zhou et al., 2018),
geophysics, (Song et al., 2010), seismology (Seydoux et al., 2020), and cli-
matology (Netzel and Stepinski, 2016). However most clustering algorithms


1.2. Main objectives and structure of the Thesis 5

require datasets with complete information, limiting the analysis to records
with continuous and simultaneous observations, or forcing data homogeneiza-
tion (i.e, series are truncated and/or interpolated to avoid gaps). While these
are not big issues when there is a fair amount of available data, it might
lead to substantial loss of information for small networks of sparse records
with limited data availability such as historical observations and paleoclimate
archives. This stresses the need to develop new cluster methodologies that
maximize the extraction of information from incomplete climate datasets
while minimizing the loss of information by homogenization procedures.

1.2 Main objectives and structure of the Thesis

Finding solutions to the aforementioned problems need methodologies that
provide some sort of information maximization or error minimization, indi-
cating that they can be defined as optimization problems. Hence, the main
goal of this thesis is to develop techniques that maximize the extraction of cli-
mate information contained in sample-starved datasets, testing the potential
and effectiveness of different AI systems for this task. Although sometimes
these techniques are seen as black-box models without any physical back-
ground (Rudin and Radin, 2019), previous studies have argued that their
predictive skill is not only based on correlations but also causality (Pietsch,
2016). In this thesis we will then show that they can be combined with
methodologies commonly used in climate science to solve high-dimensional
and non-linear problems. The main goal of this thesis has therefore been di-
vided into two well-defined objectives, aiming to address different problems
that are frequently found in observational climate datasets, namely: the im-
provement of spatially-resolved climate field reconstructions obtained from
sets of climate networks, and spatial clustering of datasets with incomplete
sets of records. The specific nature of the problem varies with the type of
targeted information and the characteristics of the observables, therefore re-
quiring tailored developments, which will be dissected in dedicated chapters,
accompanied by their respective backgrounds.


6 Chapter 1. Introduction

The methods are relevant for the scientific community because climate
datasets are obtained from meteorological stations and measuring campaigns
whose elevated costs impede to have a full coverage of the study region. This
leads to a problem of data scarcity and a non-homogeneous distribution of
records that can debase global and regional climate reconstructions, increas-
ing the uncertainty of different history fields the further they go back in
time. In this regard, the key resides not only in extracting the maximum
information possible from each record, but also in selecting the representa-
tive areas that should be measured to maximize the reconstruction skill of a
given observable with limited funding, a task that can also be tackled with
these techniques.

1.2.1 Thesis structure

This thesis is divided into seven chapters. After the introduction, Chapter
2 describes the datasets and it has been divided into two sections: Data
description (Section 2.1) and Data post-processing (Section 2.2). Chapter 3
details the climate and AI tools employed in the subsequent chapters of the
thesis, including climate reconstruction methods (Section 3.1), metaheuristic
algorithms (Section 3.2), and clustering techniques (Section 3.3) that will be
applied to different observational datasets with incomplete information.

The next three chapters are the main core of the thesis. Chapter 4com-
bines metaheuristic algorithms and reconstruction methods to find an opti-
mal subset of locations within a global pseudo-proxy network that reduces
the spatial bias of annual temperature reconstructions of the last millennium.
The main results of this chapter can be found in Jaume-Santero et al. (2020).

In Chapter 5, a metaheuristic approach is again used to obtain a high-
resolution reconstruction of monthly North Atlantic SLP (Sea Level Pres-
sure) for the past two and half centuries using optimized networks of land


1.2. Main objectives and structure of the Thesis 7

observations. The main results are under review at the moment of writing
this thesis (Jaume-Santero et al., Submitted, 2021).

On the other hand, Chapter 6 describes the development of a new clus-
tering technique (k-gaps) aiming to generate a robust regional analysis of a
given observable using climate datasets with incomplete information in space
and time. The algorithm is applied to daily European temperature series for
the second half of the 20th century, and tested with synthetic series in order
to retrieve a regional classification of mean and extreme conditions. Carro-
Calvo et al. (2020) contains the most relevant results of this chapter.

Finally, Chapter 7 summarizes the main conclusions extracted from this
thesis as well as the outlook of future research.


Chapter 2

Data

In this chapter we detail all datasets employed to obtain the results shown in
the thesis. It has been divided into two sections: Data description and Data
post-processing. Section 2.1 describes the original datasets utilized in the
study. They are classified into four different subsections depending on their
data type: Model simulations (Subsection 2.1.1), Observations (Subsection
2.1.2), Reanalysis (Subsection 2.1.3), and Paleoclimate archives (Subsection
2.1.4). However, some climate datasets need to be post-processed so that
they can be properly used in the experiments. Therefore, in Section 2.2 we
describe the data processing necessary to generate modified versions of the
aforementioned datasets such as pseudo proxy records from model simula-
tions matching the locations of real paleoclimate archives (Subsection 2.2.1)
and pseudo observations of SLP (Subsection 2.2.2).

On the other hand, in Chapter 6 the validation of new clutering techniques
for the study of regional climates required the generation of 500 synthetic
datasets with time series of daily summer European temperatures with high
levels of missing data irregularly distributed over the region. For readability
purposes, the process followed to obtain them can be found in the experiment
setting (Section 6.3) of that chapter.

9


10 Chapter 2. Data

2.1 Data description

Choosing datasets for research purposes is never as straightforward as one
could think in a first instance. Due to the exponential increasing capabilities
of computerization in the last few decades, together with the vast develop-
ment of telecommunications, it has never been easier to share, exchange, and
download scientific data. Climate products such as reanalysis, climate recon-
structions, and simulations are known to get more and more accurate (more
realistic) with time, improving our understanding of Earth’s climate system.
However, quality comes at a price. On one side, increases in temporal and
spatial resolutions of climate simulations have led to a high volume of output
files (big data problem) which are difficult to manage with normal desktop
computers. On the other there is a scarcity of instrumental observations
prior to the 20th century, leading to an increase of uncertainty in the recon-
struction of the climate of the past. It is therefore important to combine
different types of datasets to provide more robust climate insights. Table 2.1
shows relevant information about the datasets employed in the subsequent
chapters.

Table 2.1: List of datasets used in the experiments of the thesis.

Name Version Period (CE) Time res. Spatial res. Reference

Model simulations
CESM-LME CESM 1.1.2 850-2005 Monthly 1.9◦ × 2.5◦ Otto-Bliesner (2016)
CCC400 ECHAM5.4 1601-2005 Monthly 2◦ × 2◦ Franke et al. (2017)
MRI-ESM2-0 2.0 850-2014 Daily 1◦ × 1◦ Yukimoto et al. (2019)

Observations
HadCRUT 4.2 1850-2014 Monthly 5◦ × 5◦ Jones et al. (1999)
SLP-Obs 1.0 1750-2004 Monthly 121 series Küttel et al. (2010)
E-Obs 14.0 1950-2016 Daily 0.25◦ × 0.25◦ Haylock et al. (2008)

Reanalysis
LMR 2.0 850-2000 Yearly 4.3◦ × 5.7◦ Tardif et al. (2019)
20CR 3.0 1836-2014 Monthly 1◦ × 1◦ Slivinski et al. (2019)

Paleoclimate archives
PAGES-2k 2.0.0 0-2000 Varying* 692 series Emile-Geay et al. (2017)
∗ Proxy records have temporal resolutions that span from seasonal (e.g. tree-rings) to

multidecadal (e.g. borehole temperature profiles).


2.1. Data description 11

2.1.1 Model simulations

Climate simulations employed in this thesis have been extracted from dif-
ferent GCM outputs. GCMs are numerical models describing the physical
processes within Earth’s climate subsystems such as the atmosphere, the
ocean, the cryosphere, and the land surface.

In Chaper 4 we employ the history fields from two different simulation
ensembles: the CESM-LME (CESM Last Millennium Ensemble) and the
CCC400. The CESM-LME Project(Otto-Bliesner, 2016) has released 36
last millennium simulations for 850-2005 CE (Common Era) from NCAR’s
CESM (Community Earth System Model) 1.1.2 GCM, 13 of them including
all transient forcings (solar radiation, volcanic aerosols, greenhouse gases,
land use/land cover conditions and orbital parameters). The remaining runs
are ensembles of single-forcing simulations for the last millennium with only
one transient forcing (either solar, volcanic, greenhouse gases, land use/land
cover or ozone) and one control simulation (with forcings fixed at 1850 CE).
Annual mean air temperature fields at 2-m (TREFHT) and 1.9◦×2.5◦ spatial
resolution for the 850-2005 CE period of the full-forcing CESM-LME have
been used as testbed to carry out the experiments with pseudo-proxies (see
Subsection 2.2.1). Moreover, the clear-sky net solar flux at the top of the
atmosphere (FSNTOAC, in Wm−2) from the control and single-forcing sim-
ulations have also been used to identify the dominant forcing for each year of
the last millennium. In addition, SLP fields from the full-forcing simulations
are employed to compute the Northern Annual Mode, defined as the first
empirical orthogonal function of annual mean area-weighted SLP anomalies
for [20-90] ◦N and 850-2005 CE.

To test the independence of the results with respect to the model em-
ployed in Chapter 4, experiments are performed with 2-m air temperature
fields from the CCC400, an ECHAM5.4 model ensemble (Bhend et al., 2012;
Franke et al., 2017) composed of 30 fully forced members for the period 1601-
2005 CE at 2◦ × 2◦ spatial resolution.


12 Chapter 2. Data

In Chapter 5, we have used last millennium (850-1849 CE) and historical
(1850-2014 CE) outputs of SLP from the MRI-ESM2-0 model (Yukimoto et
al., 2019) to verify that the results of the optimization process are robust with
respect to the reference dataset. The Japanese model was selected because
its grid resolution is similar to the 20CRv3 reanalysis but it comes from an
independent branch of the model genealogy tree, guaranteeing the consis-
tency of the results from different data sources. History fields of 1◦×1◦ have
been extracted from the r1i1p1f1 realization which has been run following
the full-forcing specifications of the CMIP6 (Coupled Model Intercompari-
son Project phase 6 ) and PMIP4 (Paleoclimate Modelling Intercomparison
Project phase 4 ) experiments (Eyring et al., 2016; Jungclaus et al., 2017).

2.1.2 Observations

In Chapter 4 we have used the HadCRUT (Hadley Centre/Climatic Research
Unit Temperature) 4.2, a monthly global dataset of gridded temperature se-
ries obtained after combining SST (Sea Surface Temperatures) records from
the Hadley Center (HadSST3) and land surface temperatures (CRUTEM4)
from the Climatic Research Unit of the University of East Anglia(Jones et
al., 1999). The grid has a spatial resolution of 5◦ × 5◦ and spans over the
period 1850-2014 CE.

The experiments performed in Chapter 5 employ the largest currently
available network of quality-checked SLP observations SLP-Obs (Set of SLP
Observations) used by Luterbacher et al. (2002) and Küttel et al. (2010).
It consists of 121 monthly series of SLP with different time lengths within
the 1750-2004 CE period, distributed over the east coast of North Amer-
ica, Greenland and Europe. However, after screening we rejected 20 records
(most of them situated in weather stations over the Alps) that did not meet
the standard quality as shown in Table 2.2. The accepted series are shown
in Table B.1.


2.1. Data description 13

Table 2.2: List of rejected observations from SLP-Obs database.

Name Latitude(DD) Longitude (DD) Start (CE) End (CE)
Bad Ischl 47.72 13.63 1854 2003
Basel 47.60 7.60 1754 2004
Geneva 46.30 6.10 1767 2004
Graz 47.07 15.45 1836 2004
Gr. St. Bernhard 45.50 7.10 1863 2004
Hohenpeissenb 47.80 11.00 1780 2004
Innsbruck 47.27 11.40 1829 2004
Jerusalem 31.80 35.20 1860 2003
Karlsruhe 49.01 8.39 1869 2004
Klagenfurt 46.70 14.30 1843 2004
Kremsmunster 48.05 14.13 1821 2004
Lugano 46.00 8.60 1863 2004
Munich 48.10 11.70 1824 2004
Neuchatel 46.60 6.60 1863 2004
Santis 47.20 9.20 1882 2004
Salzburg 47.80 13.03 1841 2004
Sonnblick 47.01 12.95 1886 2004
Vienna 48.30 16.40 1774 2004
Zurich 47.37 8.55 1863 2004
Po Plain∗ 45.12 9.66 1765 2004
∗ In the same grid-point, Milan has higher correlation (0.96 vs 0.78) with respect to the

20CRv3 reanalysis during the 1836-2004 CE time period, and it is a longer record.

Finally, in Chapter 6 we use the E-Obs (European grid of Temperature Ob-
servations) (version 14.0) (Haylock et al., 2008). The E-Obs dataset provides
daily spatially resolved European field temperatures with a spatial resolution
of 0.25◦ × 0.25◦. However, due to the presence of missing values, locations
with less than 6000 days were removed, as well as time periods when any
of the remaining data points presented missing values. Hence, there were
enough time series to generate a complete set (in space and time) of 17,452
grid points with 5569 summer day mean temperatures contained between
latitudes 35◦ S and 72◦ N, and longitudes 20◦ W and 42◦ E, for a time range
of 66 years since 1950 CE.


14 Chapter 2. Data

2.1.3 Reanalysis

Climate reanalyses combine instrumental observations with climate models
to reproduce realistic grids of history fields such as air temperature, pressure
and wind fields at different pressure levels, as well as mono-level variables at
the surface (e.g. SST and SLP). By prescribing real boundary conditions to
the model and assimilating certain records from instrumental observations,
historical, and paleoclimate archives, they provide hybrid products with con-
tinuous data, full spatial coverage, high-resolution, and physical consistency.

Two reanalyses have been used. Chapter 4 introduces the LMR (Last
Millennium Reanalysis), a 4.3◦ × 5.7◦ proxy-based annual temperature re-
construction for 850-2000 CE based on the assimilation of 2892 paleoclimate
records from tree-rings, lake core, ice core, coral and speleothem archives
(Hakim et al., 2016; Tardif et al., 2019). The LMR uses an ensemble data
filter that estimates a prior state vector from CCSM4 (Community Climate
System Model version 4 ) outputs, later modified by the assimilation of proxy
records and subsequently calibrated by the GISTEMP (Goddard Institute for
Space Studies Surface Temperature).

On the other hand, monthly SLP fields from the 20CRv3 (NOAA-CIRES-
DOE 20th Century Reanalysis version 3 ) are employed in Chapter 5. The
20CRv3 reanalysis (Slivinski et al., 2019) provides 1◦× 1◦ history fields from
1836 to 2014 CE. The reanalysis assimilates observations of surface pres-
sure and prescribes sea surface temperatures and sea ice distribution to es-
timate the remaining climate variables which are publicly available and can
be downloaded at NOAA’s website. Despite the presence of uncertainties in
this reanalysis and the SLP-Obs, the combined use of instrumental and rean-
alyzed observations through AI lens (Barnes et al., 2019) can help to identify
inconsistencies in the datasets and maximize the performance of the CFR by
exploiting robust relationships between the local series and the large-scale
field (see Chapter 5).

https://psl.noaa.gov/data/gridded/data.20thC_ReanV3.html


2.2. Data post-processing 15

2.1.4 Paleoclimate data

Decades of scientific fieldwork have left abundant proxy datasets to recon-
struct the climate of the past. They provide information about climate
changes at local and regional scales, and have been pooled to derive spatially-
resolved climate reconstructions of the last millennium (Crowley, 2000; 2k Con-
sortium, 2013). Due to the increasing availability of high-resolution records,
the initiative PAGES-2k (Past Global Changes) has scrutinized thousands
of temperature-sensitive proxies, releasing a global archive with paleoclimate
records for the last two millennia (Emile-Geay et al., 2017). This database
has been used to assimilate climate information to reconstruct annual tem-
perature fields (Franke et al., 2020), as well as in the aforementioned LMR
(Tardif et al., 2019). Moreover, in Chapter 4, we have used the locations of
proxy records contained in the database to obtain the set of optimal proxy
locations that best reconstructs 2-meter air temperature fields for the period
850-2005 CE (Jaume-Santero et al., 2020).

2.2 Data post-processing

Some datasets had to be post-processed to obtain the required files for the
experiments included in this thesis. For instance, pseudo-proxies (synthetic
temperature series) were generated in Chapter 4 by perturbing 2-m air tem-
peratures from GCM simulations (as described in Subsection 2.2.1), emu-
lating the characteristics of real-world paleoclimate archives. These pseudo-
proxies served us as a testbed to carry out the experiments under a controlled
framework. Within this context, pseudo-observations of SLP with realistic
levels of noise and time length were made for the experiments shown in Chap-
ter 5, with the aim of mimicking the features of the SLP-Obs dataset in series
of sea level pressure extracted from outputs of the MRI-ESM2-0 model (see
Subsection 2.2.2).


16 Chapter 2. Data

2.2.1 Pseudo proxies

Three types of pseudo-proxies have been constructed in Chapter 4 at the
locations of the PAGES-2k multi-proxy database from the annual mean 2-
meter air temperature of the first CESM-LME full-forcing member and the
first member of the CCC400 ensemble. They include perfect proxies and
more realistic pseudo-proxies (Smerdon, 2011; Gómez-Navarro et al., 2017;
Neukom et al., 2018) derived by adding red noise to the temperature series
using AR1 (lag-1 Autoregressive Model) autocorrelation. Different levels of
SNR (Signal to Noise Ratio) have been tested, including values such as infi-
nite (perfect pseudo-proxies), 1 and 0.5. Note that pseudo-proxies are only
sensitive to temperature and the observational availability is complete for
the entire period.

Following the methodology of Neukom et al. (2018), pseudo proxies, P(t),
have been synthetized for each time-step, t, from the temperatures series,
T(t), using the model provided by Eq. 2.1.

P(t) = T(t) + n(t), (2.1)

where n(t) is red noise iteratively defined by Eq. 2.2

n(t) = γ · n(t− 1) + δ (2.2)

where γ is the AR1 autocorrelation coefficient for each pseudo-proxy, and
δ is generated as white noise with zero mean and variance, σ2(δ), given by
Eq. 2.3.

σ2(t) =
(1− γ2) · σ2

T

SNR2 (2.3)

where σ2
T is the variance for each temperature series. This parameteri-

zation implies that noisy pseudo-proxies are auto-correlated, and the noise
variance has the same amplitude as the variance of the climate signal.


2.2. Data post-processing 17

2.2.2 Pseudo SLP observations

Pseudo-observations of SLP have been generated in Chapter 5 from MRI-
ESM2-0 model outputs (Yukimoto et al., 2019) by extracting 101 time series
at the locations of SLP-Obs (only those that were accepted after screen-
ing). These series have subsequently been perturbed to match the SLP-Obs
database in terms of data quality and missing values. Complete model SLP
series from 1750 to 2004 CE were therefore truncated one by one to have the
same time length than the real observations. Moreover, to account for impre-
cisions in meteorological instruments, they were perturbed with white noise
so that each series had the same correlation with the unperturbed (perfect)
series of the model, as the correlation of real observations with the 20CRv3
reanalysis.

Realistic noise was added to the series by defining a SNR following Eq.
2.4

SNR =

√
r2

1− r2 (2.4)

where r is the Pearson correlation between a real SLP time series and
the SLP of the 20CRv3 reanalysis extracted at the same location as the
weather station where that series was recorded. Then pseudo-observations
were generated by following the steps described for the pseudo-proxies of
Subsection 2.2.1 (Eqs. 2.1, 2.2, and 2.3). Note that as in this case white
noise has been added, the autocorrelation coefficient (γ) is zero.


Chapter 3

Methodology

Section 3.1 details the methods employed for the reconstruction of climate
fields also known as CFR (Climate Field Reconstruction) such as tempera-
ture and SLP. CFRs are spatially-resolved reconstructions of climate vari-
ables generated from different sources of climate information such as instru-
mental observations, historical documents, and paleoclimate archives. The
two independent CFR techniques employed in the studies carried out in
Chapters 4 and 5 are the AM (Analogue Method) (Subsection 3.1.1) and
the CCA (Canonical Correlation Analysis) (Subsection 3.1.2).

Subsequently Section 3.2 introduces the definition and main uses of an
evolutionary algorithm known as the CRO (Coral Reef Optimization). Evo-
lutionary algorithms (Eiben and Smith, 2015) are a branch of Artificial In-
telligence based on the optimization of high dimensional (and non-linear)
systems (Knüsel, 2019) using algorithms biologically-inspired by processes
that imitate the evolution and survival of best adapted individuals under
ever-changing environmental conditions. The iterative technique that we
used for the optimization of sampling networks is known as the CRO, an
hybrid-type evolutionary algorithm (Salcedo-Sanz, 2017) that simulates the
reproduction and evolution of corals within a reef with a limited number of
latching spots (more details in Section 3.2). In our case, we coupled this
algorithm with the two CFR methods of Section 3.1 to assess whether it

19


20 Chapter 3. Methodology

was possible to improve the skill of spatially-resolved reconstructions by op-
timizing networks of climate records available on public repositories. Two
different approaches were followed to tackle this: the first one was performed
by finding the optimal subset of representative records that best reconstruct
the climate field, and the second was focused on searching for optimal sets
of weights that applied over climate records generate the reconstruction with
the highest skill possible.

Lastly, a description of clustering techniques and their use on climate re-
search is presented in Section 3.3. These iterative and self-organizing meth-
ods associate data objects into sets whose individuals tend to share more
similarities among them than with individuals of other sets (Hartigan and
Wong, 1979; Phillips, 2002). Although these techniques are nowadays widely
employed in climate research, they usually require complete datasets with
homogeneous information, forcing the truncation of longer time series with
the subsequently loss of information. Hence, due the necessity of maximizing
data contained within climate records, a novel clustering method has been
developed (Carro-Calvo et al., 2020) to cluster sets with incomplete climato-
logical time series, allowing for the study of past climate variations beyond
the capabilities of classical techniques.

3.1 Climate Field Reconstruction methods

There are multiple reconstruction methods used within the paleoclimate com-
munity to spatially resolve climate history fields. Although most of them
generate robust reconstructions, they use different approaches, showing sig-
nificant differences in terms of computational performance. We have selected
two CFR techniques among many other reconstruction methods for their high
reconstruction skill and fast performance, being the latter essential for the
general purpose of this thesis: finding optimal solutions to high dimensional
problems. Note that optimization of any sort requires to process the infor-
mation as fast as possible, especially in soft-computing where evolutionary


3.1. Climate Field Reconstruction methods 21

algorithms have to generate millions of solutions and check their suitability
in a limited amount of time. Here we show how the AM (Gómez-Navarro
et al., 2017) and CCA (Smerdon et al., 2010) methods work, the two fastest
reconstruction techniques on record.

3.1.1 Analogue Method

The AM reconstructs an unknown target field from an available reference
dataset by searching analogues, defined as the spatially resolved fields with
the largest resemblance to the target variable over the observing network
(Lorenz, 1969; Franke et al., 2011; Gómez-Navarro et al., 2015; Talento et
al., 2019; Bothe and Zorita, 2020). As such, it does not require calibration,
but a large pool of spatially resolved fields in the reference dataset to guar-
antee an adequate number of good analogues. The proximity of the pooled
fields to observations is measured with a distance metric. To account for
the diversity of large-scale fields that are compatible with the distribution
of records over the limited network, a minimum number N of analogues is
predefined, whose average yields the reconstructed field. AM reconstructions
often use constant N values (the best N analogues of the observable for each
time step) but there is no objective criterion to define an optimal number of
analogues, and therefore the choice is based on the reconstruction skill.

In Chapter 4, global annual temperature fields for the 850-2005 CE pe-
riod have been reconstructed using the AM, which provides spatially resolved
global fields of temperature from the limited sample of pseudo-proxy records,
matching the locations from the PAGES-2k archive. In this case, the target
field to reconstruct is the annual mean air temperature at 2 meters (TRE-
FHT) of the first CESM-LME full-forcing member. For each year, we sorted
(from best to worst) analogues of the target simulation from a pool formed by
the remaining 12 last millennium members of the full-forcing ensemble. Best
analogues are the years of the pool with minimum RMSE (Root-Mean-Square
Error) for the pseudo-proxy locations. For each target year, the global tem-
perature patterns of the best analogues are averaged at each grid point and


22 Chapter 3. Methodology

taken as the reconstructed global field. The correlation between the recon-
structed and target fields increases with the number of analogues employed
but, at the same time, the variability ratio decreases as seen in Fig. 3.1. Ac-
cordingly, we retained the three best analogues of each year, which is similar
to those used in previous studies (Gómez-Navarro et al., 2017).

20 40 60 80 100
0.4

0.5

0.6

0.7

0.8

Pe
ar

so
n 

co
rre

la
tio

n

(a)

r log( rec/ ori)

-1.00

-0.75

-0.50

-0.25

0.00

1 2 3 4 5 6 7 8 9 10
0.4

0.5

0.6

0.7

0.8

(b)
-1.00

-0.75

-0.50

-0.25

0.00

Va
ria

bi
lit

y 
ra

tio

Number of Analogues

Figure 3.1: Pearson correlation (blue) and variability ratio (red) for AM re-
constructions of global temperature fields as a function of the number of ana-
logues. For each member of the CESM-LME the remaining 12 full-forcing
simulations are used to reconstruct the global temperature fields from the
full-proxy PAGES-2k network of perfect pseudo-proxies, using different num-
ber of analogues. Shading shows the spread (two standard deviations with
respect to the mean values). (a) Correlation and variability ratio for AM
reconstructions with 1 to 100 analogues. (b) A zoom of the black dashed
square in (a).

The sensitivity of the AM to proxy weighting was also tested by using
area-weighted and un-weighted fields before the reconstruction procedure.
Similar results were found as seen in Fig. A.1, and hence no area weighting
was applied, for coherence with (Gómez-Navarro et al., 2017). The AM was
also employed to derive NAM (Northern Annual Mode) reconstructions for
the first ensemble member from the temperature field over the pseudo-proxy
locations. The reconstructed NAM is retrieved separately for each year by
randomly picking the SLP field of one of the 100 best analogue years of the
target temperature field of that year. This yielded 100 different SLP recon-
structions for 850-2005 CE, with their corresponding NAM series.


3.1. Climate Field Reconstruction methods 23

Similarly, in Chapter 5, for each month of 1750-2004 CE, we selected the
N best analogues of the SLP distribution defined by the available observations
of the SLP-Obs network (see Section 2.1.2). The reference pool is formed by
the SLP fields of the 20CRv3 reanalysis (Slivinski et al., 2019) except the
map that we intend to reconstruct. Most suitable analogues are defined as
the N maps of the pool with the lowest RMSE between the SLP observations
and the corresponding grid point values of the reanalysis. Note that in most
cases, best analogues correspond to the actual month intended to reconstruct
or the months before and after it (e.g., for December 1900, the best analogue
is December 1979, whereas for December 1800, the best analogue is November
1845). The reconstructed SLP field is given by the mean of the best N
analogues of each month. The standard deviation across the N best analogues
provides a measure of the uncertainty, i.e. how much the large-scale field
is constrained by the available set of observations. After testing different
N values ranging from 1 to 50, we used the N=10 best analogues of each
month. This number was chosen as a balance between the Pearson correlation
coefficient, which increases with N, and the variability ratio, which decreases
with N (Fig. 3.2).


24 Chapter 3. Methodology

10 20 30 40 50

2.4

2.6

2.8

R
M

SE

a)

Obs-20CRv3

10 20 30 40 50

0.86

0.87

0.88

0.89

0.90

Pe
ar

so
n 

co
rre

la
tio

n

b)

10 20 30 40 50
Number of analogues (01/1836-12/2004 CE)

-0.2

-0.1

0.0

Va
ria

bi
lit

y 
ra

tio

c)

Figure 3.2: Sea level pressure reconstruction skill of the Analogue Method.
The performance of the method is shown as a function of the number of
selected analogues for (a) the RMSE, (b) the Pearson correlation, and (c)
the variability ratio between the reconstructions using observations and the
20CRv3 reanalysis during the 1836-2004 time period.

3.1.2 Canonical Correlation Analysis

The CCA is a mathematical procedure that infers information from cross-
covariance matrices (Hotelling, 1936). For instance, if we want to find rela-
tionships between different variables, the CCA will generate linear combina-
tions of these variables that maximize the correlations among them. There-
fore it can be considered a general test of significance in statistics (Knapp,
1978), and it is commonly used as a dimensionality reduction technique in
atmospheric science (Wilks, 2011).


3.1. Climate Field Reconstruction methods 25

The CCA as described by Smerdon et al. (2010) is employed as CFR
method in paleoclimate reconstructions (Neukom et al., 2019) based on max-
imizing the correlation (specifically the canonical correlation coefficients) of
reduced eigenvector spaces from the proxy (P) and calibration (C) datasets
(the latter defined as the annual temperature field of the target simulation
since 1850 to 2005 CE). All temperature series are standardized by subtract-
ing their mean and divided by their standard deviation. To avoid degen-
erated eigenvalues (eigenvalues with a value close to 0), both datasets have
been truncated (Eqs. 3.1 and 3.2) with a minimum tolerance of 0.01 (eigen-
values below this threshold have been discarded) and a maximum number of
74 eigenvectors (optimal for full-proxy reconstructions).

P = Pr + δp = Ur
p ·Λr

p ·VrT
p + δp (3.1)

C = Cr + δc = Ur
c ·Λr

c ·VrT
c + δc (3.2)

where Ur and Vr are matrices composed of orthogonal eigenvectors that
diagonalize the reduced proxy (Pr) and calibration (Cr) datasets (where δ
is the corresponding residual after the truncation), obtaining their diagonal
matrices (Λr) from SVD (Singular Value Decomposition). The matrix of
regression coefficients (B) is then generated as in Eq. 3.3,

B = Ur
c ·Λr

c ·VrT
c ·Vr

p · (Λr
p)
−1 ·UrT

p (3.3)

and subsequently decomposed into canonical correlation coefficients (Eq.
3.4) using SVD over VrT

c ·Vr
p.

B = Ur
c ·Λr

c ·Oc ·Λcca ·OT
p · (Λr

p)
−1 ·UrT

p (3.4)

where Oc and Op are matrices composed of orthogonal eigenvectors that
transform B into the diagonal matrix Λcca, whose diagonal elements are the
canonical correlation coefficients. Thus, the temperature field reconstruction,


26 Chapter 3. Methodology

R, is obtained in Eq. 3.5

R = Ur
c ·Λr

c ·Oc ·Λcca ·WT
p ·P (3.5)

where Wp is the CCA proxy weighting matrix defined as Eq. 3.6

Wp = Ur
p · (Λr

p)
−1 ·Op. (3.6)

3.2 The Coral Reef Optimization

Evolutionary algorithms are soft-computing techniques inspired by biological
processes (Del Ser et al., 2019) such as genetic recombinations and mutations
that ensure the survival and evolution of best suited individuals within a na-
tural competitive environment (Eiben and Smith, 2015). These algorithms
are designed to provide optimal solutions through the combination (Forrest,
1993) and competition of previously-generated solutions. For high dimen-
sional problems the direct search of the best solution (assuming that it exists
and it is unique) is computationally infeasible (Knüsel, 2019). These evo-
lutionary algorithms generate near optimal solutions (Salcedo-Sanz et al.,
2018), and tend to outperform other indirect methods when dealing with
non-linear problems, such as those of climate.

From all evolutionary algorithms available, we decided to use a version
of the CRO with four substrate layers (i.e. search operators). The CRO is a
hybrid algorithm (Salcedo-Sanz, 2017) that emulates the living processes of
corals and their evolution within an ocean reef. By limiting the number of
corals within the reef, the best adapted species will have a higher probabil-
ity of surviving, promoting the subsequent evolution of best individuals over
next generations. Thus, the same way that best adapted coral generations
survive over time by transferring their genetic information to their descen-
dants, the CRO applies different recombination procedures (Forrest, 1993;
Vrugt and Robinson, 2007) to transfer parts of sub-optimal solutions into
new optimal ones generated at each iteration. Mathematically, this is imple-


3.2. The Coral Reef Optimization 27

mented through different techniques (Vrugt and Robinson, 2007) known as
search operators (see Subsection 3.2.3) that recombine the set of solutions to
iteratively generate better solutions from parts of previous ones. Moreover,
to avoid falling in a local minimum, there is also a small probability of spon-
taneous random perturbations in the set of solutions (known as mutations)
to expand the space of solutions.

In climatology, the CRO algorithm has already been used to find sets
of locations that best describe spatially resolved climate fields such as wind
speed (Salcedo-Sanz et al., 2018) or temperature (Salcedo-Sanz et al., 2019),
and their results have been applied to improve solar and wind power forecasts
(Salcedo-Sanz et al., 2018) at local scales.

3.2.1 Coral solutions

In the CRO algorithm, corals are defined by two elements: a solution for
the predefined problem and a health function that characterizes how good or
bad that solution is. For instance, in Chapter 4, we use the CRO algorithm
to search for subsets of representative proxy locations of the PAGES-2k net-
work that optimize the area-weighted RMSE between the target field and its
reconstruction, obtained from different CFR methods (Section 3.1). There-
fore, to solve this optimization problem, corals are composed of a binary
array of proxy locations (i.e. the coral solution) with "1" values when such
locations were selected and "0" when they were not (Fig. 3.3a). Afterwards,
the health function for each coral was calculated as the RMSE between the
reconstruction using the selected locations and the target field.


28 Chapter 3. Methodology

a) Selection of proxy locations (0 rejected, 1 selected):

Record 1 Record 2 Record 3 ... Record n-1 Record n

1 0 1 ... 1 0

b) Weather station weighting (from 0 to 1):

Record 1 Record 2 Record 3 ... Record n-1 Record n

0.85 0.22 0.76 ... 0.91 0.18

Figure 3.3: Coral solutions. (a) Binary selection of representative proxy
locations for the reconstruction of annual 2-m air temperature fields from
850 to 2005 CE. (b) Weighting of weather stations for the reconstruction of
SLP fields from monthly observations since 1750 CE.

On the other hand, a more realistic approach is attempted in Chapter
5. It pursues an optimized non-discrete set of weights (ranging from 0 to 1)
applied to a real network of SLP observations (SLP-Obs) affected by gaps
and observational errors (Fig. 3.3b). The CRO algorithm was run to find
optimized weights for the network of SLP records, which are applied during
the AM reconstruction. Herein, different corals represent different sets of
weights, which measure the degree of large-scale representativeness of the
local records, under the given restrictions of the observing network (errors,
data availability, etc.). For ideally perfect conditions, weights would only
depend on the relationships between local observations and its links with the
ground truth. However, in the presence of uncertainties contaminating these
relationships, weights are also affected by changes in the observing network
and observational errors over the reconstructed period. For example, as-
suming an idealized configuration of equally skillful continuous records, the
optimization would assign lower weights to records with larger errors due
to inconsistencies with the remaining observations and the large-scale field.
This makes optimal weighting necessary to minimize biases induced by un-
certainties in the climate network.


3.2. The Coral Reef Optimization 29

3.2.2 The algorithm

Fig. 3.4 illustrates the CRO general steps inspired by the multiple mech-
anisms of coral reproduction. These mechanisms have been defined in the
algorithm as different recombination procedures to generate new solutions.
Here we present a brief description of each step (see Salcedo-Sanz (2017) for
further details).

Random generation

Coral solutions (as described in Subsection 3.2.1) are randomly synthetized
and latched to the reef. Note that the reef has a limited number of spots and
corals have to latch on one of them to survive for the next iteration.

External sexual reproduction

Some corals are randomly selected by pairs (i.e. parents) to generate new
"baby corals" known as larvae. In this step, new solutions are generated
by combining parts of both solutions encoded in the parents. Subsection
3.2.3 describes four different search operators used to embed the genetic
information of adult corals into new larvae.

Internal sexual reproduction

The remaining corals (those that have not been reproduced by couples) gen-
erate new larvae by copying its solution (genetic information) and applying
random perturbations (mutations), so that new solutions are partially differ-
ent to the original ones.


30 Chapter 3. Methodology

Asexual reproduction

Asexual reproduction has been included in the algorithm the same way as
the internal sexual reproduction. However, only a given percentage of the
best solutions will produce new larvae through this method. This is intended
to increase the number of better solutions available for reproduction in the
next iteration.

Larvae setting

After larvae are formed, they need to latch onto one of the reef’s spots in
order to become a coral. Two different situations can happen at this moment:
the spot is empty or it is occupied by other coral. In the first case the larva
is always clinged to the spot. In the second, the latching spot is disputed
by competition, and the coral with better health function (Subsection 3.2.1)
will gain the battle. For instance, if the health function is defined as the skill
of a certain climate reconstruction, the coral with the best skill (i.e. lowest
error) will latch onto the spot and survive, while the other will perish. This
way, the survival and reproduction of best solutions are guaranteed after
successive iterations.

Coral predation

After reproduction and coral settlement, if the stop condition is not met (i.e.
a predefined number of iterations has not been reached), a percentage of the
worst coral solutions in the reef are preyed, leaving empty spots for future
coral generations.


3.2. The Coral Reef Optimization 31

Random coral generation

External sexual reproduction

Internal sexual reproduction

Larvae setting

Asexual reproduction

Coral reproduction

Coral predationStop?

Finish

Yes

No

Figure 3.4: CRO flowchart imitating the biological processes of corals within
a reef.

3.2.3 Search operators

Advanced versions of the CRO algorithm simulate substrate layers within the
reef, implemented in the code as different search operators (Salcedo-Sanz,
2017; Salcedo-Sanz et al., 2019) during the step of external sexual reproduc-
tion. We have included four of these operators for the studies presented in
Chapters 4 and 5.


32 Chapter 3. Methodology

One-point crossover

Coral parents generate two new larvae by interchanging their solutions along
a randomly chosen point.

Two-point crossover

Coral parents generate two new larvae by interchanging their solutions along
two randomly chosen points.

Multi-point crossover

Coral parents generate two new larvae by interchanging their solutions along
multiple points.

Differential Evolution

Differential evolution is based on adding differences of two parents into a
third coral (Salcedo-Sanz et al., 2019), following Eq. 3.7 for each encoded
parameter (xi) of the solution.

xlarvai = xcoral1i + σ · (xcoral2i − xcoral3i ) (3.7)

where xlarvai is the ith parameter of the new larva solution, xcoralni are the
parameters (i) of three coral parents, and σ determines the fraction of the
difference between coral2 and coral3 that gets transferred into coral1.


3.3. Clustering techniques 33

3.3 Clustering techniques

Classical clustering techniques, such as the k-means algorithm (Hartigan and
Wong, 1979; Phillips, 2002), have become widespread in the past few years
as dimensionality reduction methods are able to extract relevant information
from extensive databases (Bernard et al., 2013; Bador et al., 2015; Zhang et
al., 2016). In climatology, these methodologies can arrange data according
to their internal structure by defining spatial regions for datasets with ge-
olocated climate information (Rao and Srinivas, 2006). Therefore, clustering
algorithms have been used for several purposes such as the identification of
regional climates (Aliaga et al., 2017), air pollution (Gao et al., 2011; Wang
et al., 2015), and ecology (Miele et al., 2014; Cheruvelil et al., 2017).

These classical clustering techniques often require complete datasets, lim-
iting regional analyses to series without time gaps. Unfortunately, most avail-
able climate archives (Haylock et al., 2008; Glaser and Riemann, 2009; Emile-
Geay et al., 2017) contain missing values which must be properly handled
prior to clustering. Most straightforward approaches consist of removing data
points (deletion) that do not cover the requested time period (Dixon, 1979),
whereas more sophisticated methods intend to estimate missing values (impu-
tation) by means of statistical procedures (Henn et al., 2013). This limitation
restricts cluster analyses to periods with complete information, disregarding
earlier climate imprints contained in longer time series. In turn, there are only
a few methods designed to work with inhomogeneities in climate datasets.
Such is the case of k-POD (Chi et al., 2016), an algorithm that instead of
relying on deletion and imputation, uses a majorization-minimization algo-
rithm (Lange et al., 2000) to cluster observed data with missing values. It
is however difficult to find a method that performs well with sparse climate
datasets.

In this section a new clustering method known as the k-gaps algorithm
has been defined to classify sets of time series with a significant number
of missing values. Its structure is similar to those employed in classical


34 Chapter 3. Methodology

clustering methods such as the k-means algorithm, but with some key changes
that allow for the selection and attachment of records with different temporal
lengths. The validation of the method together with its multiple applications
in the climate field are in Chapter 6.

3.3.1 Assumptions and Definitions

Let us assume we have a dataset of climate records at different locations,
and maybe with different temporal lengths, which together describe the cli-
matology of a specific zone (Europe in our case), for a given period T :=

[n1, n2, · · · , nT ], where ni stands for the time step at which a certain climate
variable has been measured, and T represents the total number of time steps.
Thus, a given climate record A is defined for a subset of T, and may (or may
not) overlap with other climate records included in the dataset, (i.e. incom-
plete records are considered).

Generally, clustering climate records implies to compute distances be-
tween some centroids (in our case, temperature series that are representative
of certain regions) and the records, by considering a pre-defined metric (the
root-mean-square deviation, for example). However, note that for incomplete
datasets, these distances can only be estimated during the time intervals with
available information. Therefore, to identify periods where some time series
overlap, a vector mask ranging from the oldest to the most recent time step
in the dataset has been defined for each time series. These record masks
can be defined as indicator functions (Eq. 3.8), filled with “1”s when their
associated time series have available data, and “0”s for the remaining period.

MaskA(ni) :=

1, if ni ∈ A
0, if ni /∈ A

∀ ni ∈ T (3.8)

For instance, Fig. 3.5 illustrates two randomly-generated records (signals
A and B with no specific units) with different temporal lengths, and the
subsequent time series obtained when these two records are merged (Fig.


3.3. Clustering techniques 35

3.5c), for example in a centroid calculation procedure.

0 50 100 150 200 250 300 350
0

1

2

3

4

5

a ) b ) c )

Available sect ion of signal A

Mask of signal A

0 50 100 150 200 250 300 350
0

1

2

3

4

5
Available sect ion of signal B

Mask of signal B

Centroid from  m erge sect ions of A and B

Centroid m ask

0 50 100 150 200 250 300 350
0

1

2

3

4

5

Figure 3.5: Data records considered (a and b), and the resulting series ob-
tained by merging them (c). All the records are presented with their respec-
tive masks.

In our case, signals A and B were combined by averaging them in the
time range where there are data available in both series (i.e. during the over-
lapping interval). Otherwise, when only one of the signals is available, its
values are included to the merged series. Note that the mask in Fig. 3.5c
contains the number of time series with available data at each time step and,
although the resulting period is defined from 50 to 325 (i.e. the time interval
in which there is at least one time series available), the overlapping interval
used to combine them is in-between 200 and 300. Thus the resulting series
has data of signal A from time 50 to time 199, the averaged information of
signal A and signal B from time 200 to time 300, and the information of
signal B from time 301 to time 325. Note that the use of vector masks allows
to merge records with different temporal lengths (e.g. temperature observa-
tions, or historical archives), and provides information about the number of
time series used to calculate a resulting centroid. Moreover, these masks are
important because keeping the number of series available at each time step
allows us to discern between periods with robust mean values (i.e. time steps
with a high number of temperature series available) and those time intervals
with data scarcity.

It is noteworthy to mention that clustering techniques can help to iden-


36 Chapter 3. Methodology

tify regional climates by grouping together instrumental time-series with si-
milar mean temperatures or correlated variability. While the former can be
achieved by using the series as they are (absolute values), clusterizing time-
series focusing on their variability requires normalizing the series first. For
instance, Fig. 3.6 depicts two records with different mean (Fig. 3.6 a and
b), but correlated variability (Fig. 3.6c), that would be clustered together if
they were normalized.

Figure 3.6: Representations of two synthetic records. Series with different
mean values (a and b), but with correlated variabilty (c).

The k-gaps algorithm has therefore been designed to cluster time series
with similar mean (“basic” mode hereafter) and also with correlated variabil-
ity (“normalization” mode). While the basic mode is directly applied over
temperature records, the normalization mode requires a workaround because
homogeneous normalizations cannot be calculated from records with differ-
ent temporal lengths. This issue has been tackled by applying an adjustment
of the climate data to calculate the centroids (Subsection 3.3.3) and a linear
fitting to properly reclassify the time series (Subsection 3.3.4).

3.3.2 The k-gaps algorithm

Taken into account these previous definitions, we can now describe the k-
gaps algorithm for clustering records of incomplete time series. The general
structure of k-gaps is similar to the well known k-means algorithm (Hartigan
and Wong, 1979), which is an iterative method whose main purpose is to
classify series of data within clusters represented by central vectors known


3.3. Clustering techniques 37

as centroids (Cj). In k-means, these centroids are calculated by averaging
the series associated with previous clusters, and subsequently reassigning the
records to the nearest centroid for the next iteration. The proposed k-gaps
algorithm follows the same idea, but including some extra steps to treat the
problematic case of having incomplete records in the database.

Random cluster initialization

Centroid calculation

to fix?

Classification

yes

yes

no

End

Stop?

1

2

3

4

5 6

Estimated from incomplete series

Mask aware distance estimation

Generation of

 new clusters

Reparation

no

Figure 3.7: k-Gaps flowchart. Circles contain general clustering procedures,
and squares describe specific k-gaps operations. The algorithm’s conditions
are represented as diamonds in the flowchart.


38 Chapter 3. Methodology

Fig. 3.7 shows a flowchart of the k-gaps implementation, which can be
summarized in the following steps:

1. Randomly assign the existing records (time series) over k initial clus-
ters.

2. Proceed to centroid calculation, following the procedure described in
Subsection 3.3.3.

3. Reassign records to clusters by similarity with the centroids (Subsection
3.3.4).

4. Count the number of records for each cluster.
For empty clusters, proceed to step 5. Otherwise, jump to step 6.

5. Generation of new clusters by applying random Gaussian noise to ex-
istent groups, and restart from step 2.

6. Check whether the stopping condition has been fulfilled (Subsection
3.3.5). If it is not, go to step 2.

3.3.3 Centroids calculation

In clustering algorithms requiring complete datasets such as the k-means,
centroids are usually calculated by averaging the time series associated with
a certain cluster at each time step. However, this procedure can debase the
robustness of the clustering when applied over incomplete datasets (i.e. sets
of records defined for different time periods) by producing unrealistic clus-
ters when some time series do not overlap. This bias will propagate for the
subsequent iterations, clustering artificially-mixed regions. To circumvent
this problem, centroids have been estimated by only averaging records that
overlap for at least a minimum time interval. This minimum overlapping
must be estimated in each case depending on the temporal resolution and
the number of time steps that characterize the dataset aimed to cluster. Af-
ter trying different parametrizations, in Chapter 6 we found that the k-gaps


3.3. Clustering techniques 39

algorithm could not converge for values lower than 90 days, yielding signifi-
cantly different clusterings over successive runs. This indicates that clusters
are barely robust and their results must not be trusted. We tackled this issue
by setting a minimum overlap of 90 days. The procedure to calculate the
centroids in the algorithm will be described next. Note that it is different de-
pending whether we consider k-gaps in basic mode or in normalization mode:

Procedure for k-gaps basic mode (i.e. clustering records with similar
means):

1. Set the longest time series as the new centroid (Cj).

2. Select the records (Si) with the longest overlap with the centroid (Ovi,j)
using Equation (3.9).

Ovi,j =

nT∑
n=0

Fi,j(n)→ Fi,j(n) :=

{
0 if Maski(n) ·Maskj(n) = 0

1 if Maski(n) ·Maskj(n) > 0

(3.9)
where j ∈ [0, k) represents a certain cluster (of a total of k clusters), n
is the time, nT is the last time step, Maski is the mask associated with
Si, and Maskj is the centroid mask.

3. Combine the centroid (Cj) and Si following Equation (3.10):

Cj = Cj + Si ·Maski (3.10)

Centroids are firstly calculated by adding at each time step the tem-
peratures of their associated series.

4. Update the centroid mask using Equation (3.11):

Maskj = Maskj +Maski (3.11)

5. Get back to Step 2 until all records are checked out or overlapping is
below a predefined threshold.


40 Chapter 3. Methodology

6. Divide the centroids in Step 3 by their respective Maskj to obtain their
average temperature at each time step (note that centroid masks are
defined as the number of series available at each time step).

For each cluster in normalization mode (i.e. clustering time series with
similar variability):

1. Set the longest series as the new centroid (Cj).

2. Select the time series (Si) with the longest overlap with the centroid
(Ovi,j) using Equation (3.9).

3. Adjust the centroid using a linear regression estimated in the overlap-
ping section between the centroid and the series (Equation (3.12))

Si = c1 · Cj + c0 → S ′i =
Si − c0
c1

(3.12)

where c0 and c1 are two constants known as the intercept and slope
of the regression line respectively. Note that while the centroid Cj

(which is a time series that determines a certain cluster) is generated
by combining normalized series (S ′i), it is adjusted by linear regression
to the original series Si, preventing c1 from being zero. (The intercept
(c0) is subtracted from the original series (Si) and the result is divided
by the slope (c1), obtaining S ′i).

4. Combine the centroid and S ′i following Equation (3.10).

5. Update the centroid mask using Equation (3.11).

6. Divide the centroid by its mask. This process is undertaken for each
time series considered, since it is necessary to estimate the centroid to
apply Step 3 for the forthcoming record.

7. Back to Step 2 until all records are checked out or overlapping is below
a predefined threshold.


3.3. Clustering techniques 41

3.3.4 Assignment of records to clusters

Clustering algorithms associate records with different clusters by means of a
certain metric. In its basic mode, k-gaps assigns each record to the cluster
whose MSE (Mean-Square Error) is minimum, following Equation (3.13).

MSE(i, j) = Ov−1i,j

Ovi,j∑
n=0

Fi,j(n) [Si(n)− Cj(n)]
2 + P (3.13)

where Ovi,j is the overlapping length and P is a penalty associated with
short overlapping intervals between two time series. Note that the number
of superimposed values should be long enough to ensure that the metric used
for the classification of time series into clusters (MSE) is significative, be-
cause otherwise time series with good fit during a short overlapping interval
could be assigned to the wrong cluster (that is why it is important to add
the value P in Equation 3.13). Clusterings of this sort are considered barely
robust because they show very different classification patterns each time the
algorithm is run. Hence, the procedure to find the right value for P is by
testing different values until robust clusterings are obtained. We used se-
ries of daily temperatures and found that by setting a minimum overlapping
period of 30 days and a penalty of 100, the final pattern of the clusteriza-
tion did not significantly change over successive runs of the k-gaps algorithm.

In the normalization mode, the assignment process is different: in this
case linear regressions are computed between time series (Si) and clusters
(Cj) to find the centroid that best fits each record. In our case, cluster
assignment is performed by minimization of mean quadratic errors (ε2j) ob-
tained from residuals of the linear fit estimated using Equation (3.14).

Si = c1 · Cj + c0 + εj → Errori,j = Ov−1i,j

Ovi,j∑
n=0

ε2j(n) + P (3.14)

where c0 and c1 are once again the intercept and slope of the linear fit,
respectively. Note that the linear regression is performed in the overlapping


42 Chapter 3. Methodology

section between each centroid and the time series.

3.3.5 Stop conditions

In the classic k-means algorithm, the stop condition is usually reached when
convergence is detected (no change of the solution in a number of iterations)
or, after a given number of iterations that ensures the algorithm has obtained
a “good enough” solution. However, additional criteria are required in the
k-gaps to handle incomplete records. For instance, since centroids are based
on compositions of uneven time series, they might be affected by adding and
removing records from their corresponding clusters, and therefore, total con-
vergence cannot be ensured. Furthermore, looping assignation may occur
when some series move cyclically from one cluster to another during consec-
utive iterations, affecting the convergence of the algorithm. Therefore, to
provide a general stop condition for k-gaps, a detector of clustering changes
(D) has been defined (Equation (3.15)) as the summation of absolute values
obtained from differences between the number of time series in each cluster
at a given step (s), and the number of time series in those clusters for the
next iteration (s+ 1).

D =
k∑

j=0

|Sizej(s)− Sizej(s+ 1)| (3.15)

where k is the total number of cluster, and Sizej(s) represents the number
of records in cluster j during step s.

Hence, when the value of D tends to zero, we assume that the algorithm
has converged, and the stop condition has been reached. On the other hand,
when D values are repeated over successive iterations, it indicates that the
algorithm entered in one of the aforementioned loops, and a halting proce-
dure should be applied, starting again the algorithm with a different initial
condition.


Chapter 4

Selection of proxy locations for
temperature reconstructions

4.1 Background

Decades of scientific fieldwork have left abundant proxy datasets to recon-
struct the climate of the past. They provide information about climate
changes at local and regional scales, and have been pooled to derive CFR
of the last millennium (Crowley, 2000; 2k Consortium, 2013). They are
spatially-resolved reconstructions of climate variables generated from dif-
ferent paleoclimate archives. These reconstructions are affected by several
sources of uncertainty (Christiansen and Ljungqvist, 2017) related to the
reconstruction method (e.g. dimensionality, dependence on parameters, fre-
quency coherence) (Evans et al., 2014), the underlying proxy observations
(Neukom et al., 2018) (e.g. observational error, irregular chronologies, ob-
served resolution), their links with the target field (e.g. multivariate signals,
stationarity, spatial and temporal covariance, resolved resolution and sea-
sonality), or the non-uniform spatial distribution of the observing network
(Comboul et al., 2015). While most of these uncertainties have been thor-
oughly studied, spatial biases induced by sparse sampling locations are still
not well understood. Due to the limited availability of paleoclimate archives,
most proxy records are restricted to land, mainly the middle latitudes of

43


44Chapter 4. Selection of proxy locations for temperature reconstructions

the Northern Hemisphere, while there are extensive un-sampled regions in
the Southern Hemisphere and high latitudes. This unbalanced distribution
of paleoclimate records induces a spatial bias in global CFRs that remains
poorly quantified.

Circumventing this issue requires, in first place, a wise selection of prox-
ies (Bradley, 1996). Due to the increasing availability of high-resolution
records, the initiative PAGES-2k has scrutinized thousands of temperature-
sensitive proxies, releasing a global archive with paleoclimate records for the
last two millennia (Emile-Geay et al., 2017). This network has been recently
exploited to derive global multi-proxy CFRs of annual temperature from a
suite of reconstruction methods, including model-based assimilation schemes
similar to those employed in modern reanalyses (Hakim et al., 2016; Franke
et al., 2017). However, as these proxies are unevenly distributed, the spatial
bias is still present. Selecting records strategically situated over key regions
that capture the diversity of global patterns and simultaneously minimize
the spatial bias of the observing network represents a significant challenge
(Evans et al., 1998, 2001). Problems of this kind cannot be directly solved by
testing all possible combinations due to their high dimensionality. Efforts to
address this issue have only been attempted with sequential approaches (i.e.
step-wise solutions based on local search procedures that perform incremen-
tally by adding the proxy records with the best performance). Differently,
a branch of artificial intelligence based on soft-computing techniques has re-
cently emerged to solve high dimensional problems (Reichstein, 2019; Knüsel,
2019). For instance, biologically-inspired methods such as evolutionary algo-
rithms (Eiben and Smith, 2015) are global search procedures that outperform
sequential approaches in the task of finding optimal solutions to the repre-
sentative selection problem within large and complex datasets (Salcedo-Sanz
et al., 2019; Del Ser et al., 2019).

In this study we deal with the spatial bias in global CFRs of annual
temperature arising solely from the non-homogeneous distribution of the
currently available network of proxy records for the last millennium. To


4.2. Selection of representative locations 45

better constrain this bias, other sources of uncertainty are avoided by using
the CESM-LME (Otto-Bliesner, 2016) as a surrogated reality, where pseudo-
proxies (synthetic temperature series from the target simulation) matching
the locations of the PAGES-2k archive are artificially generated (see Meth-
ods). For these idealized conditions of the PAGES-2k proxy network (com-
plete observational availability over time, univariate signals without obser-
vational error, stationary relationships, etc.) we generate global CFRs of
annual temperature for the last millennium simulation of the CESM-LME
that are biased by the uneven distribution of real proxies. CFR techniques
(Gómez-Navarro et al., 2017) are then coupled with an evolutionary algo-
rithm to explore if optimized subsets of PAGES-2k locations can be used
instead without sacrificing the reconstruction skill. These pseudo-proxy ex-
periments allow us to address the following questions in the perfectly known
model′s world: Can we quantify the spatial bias due to the uneven distribu-
tion of records? Do we need all available records of the PAGES-2k network
to maximize the skill of global temperature field reconstructions of the last
millennium? If not, how many records are required to reconstruct the tem-
perature of the last millennium without degrading the skill? Can we find
a subset of PAGES-2k proxy locations that reduces the spatial bias of the
full-proxy network?

4.2 Selection of representative locations

Annual temperature global patterns of the first full-forcing CESM-LMEmem-
ber simulation spanning the 850-2005 period of the Common Era (CE) are
chosen as the target fields to reconstruct from pseudo-proxies (Smerdon,
2011) at the 569 grid-points matching the locations of the PAGES-2k archive.
The accuracy of the CFR is quantified as its RMSE against the spatially-
resolved global temperature patterns of the target simulation. Most exper-
iments have been set under idealized conditions by using perfect pseudo-
proxies directly assembled from the simulated temperature series of the tar-
get simulation. Unlike real reconstructions, our CFRs are only affected by


46Chapter 4. Selection of proxy locations for temperature reconstructions

biases due to the spatial distribution of paleoclimate archives and the CFR
methodology. This approach avoids other sources of uncertainty, allowing us
to better constrain spatial biases, which can be inferred from changes in the
reconstruction skill arising from different rearrangements of the pseudo-proxy
network.

To test the sensitivity of the results to the CFR method, two different
CFR techniques have been coupled to the CRO algorithm: The AM and the
CCA as described in Section 3.1. Note that this coupling brings physical
consistence to our method, proving that AI techniques can serve as a tool for
climate science without losing the underlying physics of the Earth system.
We used the 1850-2005 period of the first member of CESM-LME for the
calibration of the CCA (similar results are obtained for shorter calibration
periods as shown in Fig. A.2), while the remaining members of the same
model ensemble are employed as a pool of analogues to derive the CFR with
AM. Our results for the first member have also been tested in other realiza-
tions of the same model, a different model, and more realistic datasets, as
explained below. To add complexity and more realistic conditions, experi-
ments with additional sources of uncertainty were also performed. They allow
us to test the robustness of the results and benchmark the magnitude of the
spatial bias against that arising from other sources of uncertainty (the effect
of multiple uncertainty sources should not be considered additive, though).
In particular, we account for the amplitude of the observational error SNR
by adding different levels of red noise to the perfect pseudo-proxies with a
lag-1 autoregressive model (see Subection 2.2.1).

The AM-reconstructed global temperature fields using all 569 perfect
pseudo-proxies (i.e. SNR = ∞) leads to a measurable RMSE of 0.65 ◦C,
which should be ascribed to uncertainties in both the limited number and
distribution of records (spatial bias) and the CFR method.


4.2. Selection of representative locations 47

30 120 400
Number of proxies

0.60 0.60

0.65 0.65

0.70 0.70

0.75 0.75

0.80 0.80

R
M

SE
 (

C
)

Random search CRO-AM

Figure 4.1: RMSE of 850-2005 CE global temperature fields reconstructed
with different subsets of perfect pseudo-proxies from the PAGES-2k net-
work. Orange dots represent the RMSE associated with the reconstructions
using the optimized subsets of 30, 120, and 400 perfect pseudo-proxies of
the PAGES-2k network obtained with the CRO-AM. Blue violins show the
RMSE distribution obtained from 10000 reconstructions using different com-
binations of 30, 120, and 400 randomly selected pseudo-proxies from the
PAGES-2k network. RMSE are calculated with respect to the global tem-
perature fields of the target simulation (the first member of the CESM-LME).

To quantify the spatial bias, we derived new CFRs from reduced subsets of
N perfect pseudo-proxies constrained by CRO-AM (Coral Reef Optimization
coupled with the Analogue Method). For all optimized solutions of N records
tested, the CRO-AM generates more skillful reconstructions than selecting
subsets of N perfect pseudo-proxies at random (Fig. 4.1). Consequently,
the improvement obtained with CRO-AM is not related to the reduction
of random error, but a meaningful identification of representative locations
for the reconstruction of temperature fields. According to Fig. 4.2 (blue
curve), a minimum optimized set of 17 perfect pseudo-proxies, CRO-MIN
(Minimum subset of perfect pseudo-proxies obtained with CRO-AM necessary
to outperform the reconstruction skill of the full-proxy network) is enough
to obtain global temperature fields with the same RMSE as the full-proxy
reconstruction (green dashed line).


48Chapter 4. Selection of proxy locations for temperature reconstructions

30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570
Number of representative locations (N)

0.60 0.60

0.62 0.62

0.64 0.64

0.66 0.66

0.68 0.68

0.70 0.70

R
oo

t-M
ea

n-
Sq

ua
re

 E
rro

r (
C

)
Optimized perfect subset
Optimized noisy subset

Perfect Member spread
Noisy Member spread

569 perfect spread
569 noisy spread

Figure 4.2: RMSE of the 850-2005 CE global temperature fields reconstructed
with CRO-AM as a function of the number of selected perfect (blue) and
noisy pseudo-proxies (purple) of the PAGES-2k network. The green and
orange shades represent the 13-member reconstruction skill spread obtained
with all (569) perfect and noisy pseudo-proxies (with SNR of 1) of the PAGES
2-k network. The blue-shaded area represents the spread obtained by using
the optimized subset of N pseudo-proxies obtained for the first CESM-LME
member to reconstruct the remaining members of the ensemble. The purple-
shaded area is the same as the blue-shaded one but for reconstructions using
noisy pseudo-proxies with SNR of 1. All shades depict 2 standard deviations
with respect to the mean.

Therefore, under ideally perfect conditions, skillful reconstructions can
be achieved even when the number of records is reduced up to one order of
magnitude with respect to the PAGES-2k network. But, can we reduce
further the spatial bias of the full-proxy reconstruction with a subset of
well-distributed records larger than CRO-MIN? Our idealized experiments
indicate that the RMSE of full-proxy temperature field reconstructions can
be reduced up to 0.05 ◦C (Fig. 4.2) after selecting an optimal set of 120
perfect pseudo-proxies (CRO-OPT). Note that the associated error (0.60
◦C) approaches that obtained from a full global grid coverage of perfect
pseudo-proxies (0.54 ◦C). Besides minimizing the RMSE of the CFR for the
first member, CRO-OPT (Optimized subset of perfect pseudo-proxies of the
PAGES-2k network obtained with CRO-AM that yields the best reconstruc-
tion skill) locations also yield comparable levels of performance for the other
members of the ensemble (Fig. 4.2, shading). This indicates that the se-
lected network is representative across the ensemble, although the compara-


4.2. Selection of representative locations 49

tively higher RMSE in those members points to additional improvements if
the CRO-AM were applied separately to each realization. Similar estimates
of the spatial bias are also found for CRO-CCA (Coral Reef Optimization
coupled with the Canonical Correlation Analysis) reconstructions (a RMSE
reduction of 0.05 ◦C for an optimized subset of 120 perfect pseudo-proxies),
stressing the robustness of the results with respect to the CFR technique.
Indeed, the RMSE variation with the size of the selected network shown in
Fig. 4.2 is also reproduced by CCA reconstructions based on the locations
selected by the CRO-AM (Fig. 4.3).

100 200 300 400 500
Locations (from CRO-AM)

0.52

0.54

0.56

0.58

R
M

SE
 (

C
)

Min RMSE = 0.51 C, N = 125

Figure 4.3: RMSE of CCA reconstructions generated with the optimized
subsets of perfect pseudo-proxies of the PAGES-2k network selected by the
CRO-AM.The red dot and dashed line highlight the minimum RMSE.

Table 4.1: RMSE of global temperature fields for 850-2005 CE (in ◦C) using
CRO-AM reconstructions with N representative AR(1) pseudo-proxies of the
PAGES-2k network and different SNR.

N SNR = ∞ SNR = 1 SNR = 0.5
30 0.62 0.71 0.82
200 0.60 0.64 0.71
569 0.65 0.67 0.72


50Chapter 4. Selection of proxy locations for temperature reconstructions

The reduction of the RMSE with CRO-OPT is not negligible, since it is
equivalent to doubling the SNR in the full-proxy reconstruction (Table 4.1).
Focusing on the noise, the RMSE derived from noisy pseudo-proxies with
a SNR of 1 is only increased by 0.02 ◦C, whereas a SNR of 0.5 degrades
the reconstruction skill by 0.07 ◦C. In the case of SNR-1 pseudo-proxies, the
minimum number of records necessary to reach the same skill as their full-
proxy reconstruction increases with respect to that obtained from perfect
pseudo-proxies (60 instead of 17 in CRO-MIN). Still, it is possible to find
representative subsets of 150 noisy pseudo-proxies that outperform the skill
of the complete network of noisy pseudo-proxies (Fig. 4.2, lines). The same
conclusion is obtained for pseudo-proxies with more realistic components of
random noise (SNR=0.5), although previous experiments have shown that
this level of SNR provides pseudo-proxy CFRs with lower skill than real-world
reconstructions (Neukom et al., 2018). In addition, the noisy subsets that
optimize the CFR of the first member are able to reduce the reconstruction
error obtained for the other members of the ensemble from their complete
networks of noisy pseudo-proxies (Fig. 4.2, purple shading). For certain con-
ditions (SNR=1 in Fig. 4.2), optimized subsets of noisy pseudo-proxies can
even outperform the skill of the full network of perfect pseudo-proxies.

Therefore, with all other sources of uncertainty being absent, spatial bi-
ases induced by a non-uniform distribution of high-quality (ideally perfect)
proxies can potentially be larger than those obtained from a reduced subset
of well-distributed noisy proxies with SNR of 1. Most of the selected loca-
tions in CRO-OPT are situated at high latitudes (Fig. 4.4a), stressing the
importance of Arctic (Kaufman, 2009; Routson, 2019) and Antarctic (Steig
and Neff, 2018) regions. Although there is not a unique solution, the distri-
bution of optimal locations does not vary substantially with the number of
selected records (cf. Figs. 4.4a and 4.5), and the CFR method (Fig. 4.5).
In this regard, Figure 4.6a also shows that this distribution does not change
regardless of the noise associated with the pseudo-proxies even when higher
levels of SNR (e.g 0.5) are employed.


4.2. Selection of representative locations 51

(a) (b)
-0.4

-0.3

-0.2

-0.1

0.0

0.1

0.2

0.3

0.4

C
or

re
la

tio
n 

di
ffe

re
nc

e

0.0 0.0

0.5 0.5

1.0 1.0

1.5 1.5

Pr
ox

y 
di

st
rib

ut
io

n 
(%

) (c) PAGES CRO-OPT

0.3 0.3

0.5 0.5

0.7 0.7

Pe
ar

so
n 

co
rre

la
tio

n (d)

-80 -60 -40 -20 0 20 40 60 80
Latitude (DD)

-0.5 -0.5

-0.4 -0.4

-0.3 -0.3

lo
g(

re
c/

or
i)

(e)

Figure 4.4: Performance of the CRO-AM reconstruction with the optimal
subset of PAGES-2k records (CRO-OPT). (a) Spatial distribution of CRO-
OPT records (orange dots) obtained from the full PAGES-2k network (purple
diamonds). (b) Spatial correlation difference between the temperature recon-
structions with CRO-OPT and all perfect pseudo-proxies. Stippling points
illustrate significant correlation differences (p < 0.05). Kernel density esti-
mation of the (c), Normalized latitudinal distribution of records (in % with
respect to the total number of pseudo-proxies) for the CRO-OPT subset (or-
ange) and the full-proxy PAGES-2k network (purple). (d) Latitudinal mean
Pearson correlations for the CRO-OPT (orange) and full-proxy (purple) re-
constructions. (e) Latitudinal logarithm of the standard deviation ratio for
the CRO-OPT (orange) and full-proxy (purple) reconstructions (σrec) com-
pared with the target simulation (σori). The latitudinal axis is proportional
to the effective area.


52Chapter 4. Selection of proxy locations for temperature reconstructions

0 1 2 3 4 5 6 7 8 9

-80

-60

-40

-20

0

20

40

60

80

La
tit

ud
e 

(D
D

)

Number of locations

(a) (b)

Analogue Method Canonical Correlation Analysis

Figure 4.5: Optimized subsets of 17 perfect pseudo-proxies of the PAGES-2k
network selected by CRO-AM (CRO-MIN) and CRO-CCA. (a) 2-D and (b)
latitudinal distributions of the CRO-MIN locations obtained with CRO-AM
(orange dots and shading) and the corresponding subset of perfect pseudo-
proxies of the PAGES-2k network (with the same size as CRO-MIN) obtained
with CRO-CCA (purple diamonds and line).

Moreover, Figure 4.6b illustrates how this distribution is always better
than choosing locations at random, indicating that more robust reconstruc-
tions of the past climate are generated when representative locations are
selected. The improvement of the CRO-OPT reconstruction is also reflected
at local scales (Fig. 4.4b), implying that small changes in the global skill (Ta-
ble 4.1) hide large regional improvements. The Pearson correlation coefficient
of the target simulation with this reconstruction is significantly higher than
with the full-proxy reconstruction for almost the entire Southern Hemisphere
and the Arctic, and only performs worse in regions where the complete net-
work presents high density of perfect pseudo-proxies. This is consistent with
the spatial pattern of the RMSE difference between the CRO-OPT and full-
proxy reconstructions (Fig. 4.7). Interestingly, highly sampled regions by the
PAGES-2k network tend to coincide with areas of low spatial autocorrelation
(Fig. 4.8). As such, the regional details of the temperature field over these
regions are better captured by the denser full-proxy network than by CRO-


4.2. Selection of representative locations 53

OPT. However, our experiments indicate that the regional improvement of
the full-proxy network is attained with sacrifices in its global performance.
This is explained by the fact that improving the reconstruction of small areas
of the map comes at the expense of debasing the skill (i.e. lower correlation
and higher RMSE) of the entire study region.

-80 -60 -40 -20 0 20 40 60 80
Latitude (DD)

0

1

2

3

4

5

6

7

8

9

N
um

be
r o

f p
ro

xy
 lo

ca
tio

ns

SNR = 
SNR = 1
SNR = 0.5

1 0.5
SNR

0.65 0.65

0.70 0.70

0.75 0.75

0.80 0.80

0.85 0.85

0.90 0.90

0.95 0.95

R
M

SE
 (

C
)

(a) (b)

CRO-AM 
CRO-AM 1
CRO-AM 0.5

Figure 4.6: Sensitivity of CRO-AM reconstructions to pseudo-proxies with
different levels of observational error. (a) Latitudinal distribution of the
optimized subsets of 30 locations selected by CRO-AM from a PAGES-2k
network of perfect pseudo-proxies (SNR = ∞, orange shading), and noisy
pseudo-proxies with SNR = 1 (purple line) and SNR = 0.5 (blue line). (b)
RMSE of CRO-AM reconstructions from pseudo-proxies with different SNR.
For each type of pseudoproxies, symbols indicate the RMSE of the recon-
struction obtained with the optimized subsets of locations found for perfect
pseudo-proxies (orange dots), and noisy pseudo-proxies with SNR of 1 (pur-
ple diamonds) and 0.5 (blue crosses). Blue violins illustrate the RMSE dis-
tributions of 10000 reconstructions obtained from subsets of 30 PAGES-2k
locations selected at random.


54Chapter 4. Selection of proxy locations for temperature reconstructions

-0.3

-0.2

-0.1

0.0

0.1

0.2

0.3

C
R

O
-O

PT
 - 

PA
G

ES
 R

M
SE

 (K
)

Figure 4.7: RMSE difference between the CRO-OPT and full-proxy recon-
structions. Green (purple) color illustrates regions where CRO-OPT yields
lower (higher) RMSE than the reconstruction with the full PAGES-2k net-
work of perfect pseudo-proxies. RMSE are calculated with respect to the
target field (the first CESM-LME member).

1000

1500

2000

2500

3000

3500

4000

4500

5000

D
is

ta
nc

e 
(k

m
)

Figure 4.8: Spatial map of e-folding distances of decorrelation for the annual
temperature of the first full-forcing CESM-LME member. The distance (in
kilometers) for each grid point defines the area of the circle for which the
averaged coefficient of determination (R2) has decayed below e−1.


4.2. Selection of representative locations 55

These results show that spatial clusters of proxy records may debase
the skill of spatially-resolved global temperature reconstructions. Note that
the CRO-OPT distribution (Fig. 4.4c) is not simply a uniform one, and
fewer points are often selected over areas with high spatial autocorrela-
tion (e.g. the tropics). Still, its reconstruction leads to generalized lat-
itudinal improvements in terms of correlation (Fig. 4.4d) and variability
(Fig. 4.4e). These strategic locations also retrieve skillful reconstructions
of area-weighted GMT (Global Mean Temperature) for the last millennium.
The GMT of the reconstruction generated with CRO-OPT shows high skill
(RMSE of 0.09 and R2 of 0.88), improving that from the full-proxy recon-
struction (RMSE of 0.16 and R2 of 0.71). Recall that the selected locations
represent an optimal subset of the PAGES-2k network for the specific condi-
tions and target field of our idealized experiments. To test the independence
of these results from the model employed, the CRO-AM experiment has also
been run in the CCC400 model ensemble (Bhend et al., 2012; Franke et al.,
2017). The results for an optimized subset of 120 locations (the same number
as in CRO-OPT) are depicted in Fig. 4.9.

It is noteworthy to mention that False detection tests (Hu et al., 2017)
regarding serial correlation and test multiplicity have been applied for the de-
termination of global statistical significance in correlation maps between the
reconstructed and target fields (Fig. 4.4 and Fig. 4.9). Because of the large
N, effective degrees of freedom are high, leading to significant correlations at
the 95% confidence level for all grid points with significant differences in Figs.
4.4 and 4.9. Furthermore, FDR (False Discovery Rate) have been calculated
to assess whether there are falsely attributed significant correlations (Hu et
al., 2017). In all cases, correlation maps passed the multiplicity test for a
FDR of 5%. Note that these tests have been performed for pseudo-proxy
experiments under idealized conditions, and the high statistical significance
of the results is not necessarily representative of real proxy-based reconstruc-
tions, where additional sources of uncertainty are present.


56Chapter 4. Selection of proxy locations for temperature reconstructions

(a) (b)
-0.4

-0.3

-0.2

-0.1

0.0

0.1

0.2

0.3

0.4

C
or

re
la

tio
n 

di
ffe

re
nc

e

0.0 0.0

0.5 0.5

1.0 1.0

1.5 1.5

2.0 2.0

Pr
ox

y 
di

st
rib

ut
io

n 
(%

) (c) PAGES CRO-OPT

0.5 0.5

0.7 0.7

0.9 0.9

Pe
ar

so
n 

co
rre

la
tio

n (d)

-75 -60 -45 -30 -15 0 15 30 45 60 75
Latitude (DD)

-0.4 -0.4

-0.3 -0.3

-0.2 -0.2

-0.1 -0.1

lo
g(

re
c/

or
i)

(e)

Figure 4.9: As Fig. 4.4 but using the global temperature fields of the CCC400
first ensemble member (1601-2005 CE) as target. The reconstruction has
been obtained from the optimized subset of perfect pseudo-proxies of the
PAGES-2k network (with the same size as CRO-OPT) selected by CRO-AM
in the CCC400 model ensemble.

The spatial distribution of selected locations, the improvement of the cor-
relation at high latitudes and part of the Pacific Ocean, and the latitudinal
correlation and variability patterns resemble their CESM-LME counterparts
presented in Fig. 4.4. A similar distribution persists in out-of-sample vali-
dation tests where the CCC400 ensemble is used as a pool of analogues to
reconstruct the first member of the CESM-LME (Fig. 4.10), indicating that
the selection of representative locations is not sensitive to the training data
used in the process.


4.2. Selection of representative locations 57

-80 -60 -40 -20 0 20 40 60 80
Latitude (DD)

0

5

10

15

20

25

30

35

N
um

be
r o

f p
ro

xy
 lo

ca
tio

ns
 (%

)

CCC400 pool PAGES CESM-LME pool

Figure 4.10: Latitudinal distribution of optimized subsets of the PAGES-2k
network using different model ensembles as a pool for the CRO-AM recon-
struction of the first CESM-LME member. Each subset includes 17 perfect
pseudo-proxies obtained with the CRO-AM using as a pool members of the
CESM-LME (orange shading) and the CCC400 ensemble (black line). In
both cases, the target is the 850-2005 CE global temperature fields of the
first member of the CESM-LME. The purple line illustrates the distribution
of full-proxy PAGES-2k network.

We have also performed more stringent tests to assess whether the CRO-
OPT locations inferred from the CESM-LME are also informative of GMTs
in more realistic datasets. The exercise has been applied to instrumental
temperature data based on HadCRUT 4.2 (Jones et al., 1999) for the post-
industrial period (1850-2008 CE), as well as to proxy-based temperature
reconstructions for the last millennium (850-2000 CE) provided by the Last
Millennium Reanalysis (LMR, Methods). These products are independent
from the CCC400 and CESM-LME, except for the fact that the LMR uses
prior state estimates from an earlier version of the CESM atmospheric model.
For each of these datasets, we computed first guess GMTs (GMTg hereafter)
as the area-weighted mean temperature for two different sets of locations:
the CRO-OPT, previously obtained with the CESM-LME, and the complete
PAGES-2k network. Note that GMTg are directly obtained from the tem-
perature series at these specific locations, without reconstructing the global
temperature fields of the respective dataset. Although this approach is not
ideal, and differs from global aggregation strategies typically employed for the


58Chapter 4. Selection of proxy locations for temperature reconstructions

computation of GMTs (area-weighted global means from re-gridded fields),
it gains importance in the light of incomplete coverage and time-varying
availability of global records.

1000 1200 1400 1600 1800 2000
Year (CE)

-0.50

-0.25

0.00

0.25

0.50

0.75

1.00

1.25

LM
R

 G
M

T 
an

om
al

y 
(

C
)

(a)

1860 1880 1900 1920 1940 1960 1980 2000

-0.50

-0.25

0.00

0.25

0.50

0.75

H
ad

C
R

U
T 

G
M

T 
an

om
al

y 
(

C
)

Full grid
PAGES GMTg
CRO-OPT GMTg

LMR GMTg HadCRUT GMTg
0.70

0.75

0.80

0.85

0.90

0.95

1.00

R
2

(b)

CRO-OPT
PAGES

PAGES
Full-grid

Figure 4.11: Estimates of GMT anomalies (◦C) for the last millennium as
inferred from selected subsets of the PAGES-2k network. (a) GMT anoma-
lies from the LMR for 850-2000 CE (Inset (a): GMT anomalies from Had-
CRUT4.2 for 1850-2000 CE). Purple and orange lines show the GMTg of
these datasets, defined as the area-weighted temperature mean for the grid
points matching the PAGES-2k and CRO-OPT locations, respectively. All
anomalies are computed with respect to the 1961-1990 baseline. (b) Coef-
ficient of determination between the time series of GMT and GMTg from
PAGES-2k (purple) and CRO-OPT (orange) locations. Violins illustrate the
distributions obtained for 10000 subsets (with the same size as CRO-OPT)
of randomly-selected locations from the PAGES-2k network (blue) and the
full global grid (red).

Figure 4.11 shows how the HadCRUT 4.2 and LMR GMTs are overall
consistent with their GMTg. In both cases, the coefficients of determination
(R2) are higher for CRO-OPT than for the complete PAGES-2k network, and
they are also significantly higher than selecting random sets of locations from
the PAGES-2k archive (Fig. 4.11b). This evidences the representativeness
of CRO-OPT in observations and real-world reconstructions.


4.3. Reconstruction of temperature patterns 59

4.3 Reconstruction of temperature patterns

Despite its improvement, the optimized selection of locations is constrained
by the original distribution of proxy records in the PAGES-2k archive, which
displays limited coverage and a spatial bias towards mid-latitude land ar-
eas of the Northern Hemisphere, with very few proxies over oceanic regions.
Therefore, we explored the skill of this reduced subset to capture internal
variability and externally forced signals in the global temperature patterns.
This also allows us to assess whether the selected locations have physical
meaning or simply represent an optimized statistical distribution. El Niño-
Southern Oscillation (ENSO) is chosen as an example of the former. The per-
fect pseudo-proxy CFR with CRO-OPT locations explains more than 80% of
the SST variance over El Niño 3.4 region for the pre-industrial period of the
target simulation, and even the more constrained CRO-MIN reconstruction
captures global ENSO teleconnections reasonably well (Fig. 4.12).

Taking into account that ENSO is the main mode of internal variability
on interannual time-scales, it is surprising that only one of the selected lo-
cations in CRO-MIN is over the tropical Pacific (Fig. 4.12b). However, the
spatial autocorrelation of temperature variations in this region is among the
highest of the globe (Fig. 4.8), which arguably reduces the effective number
of records required to reproduce them. The ability of this subset to capture
ENSO signals can be further explained by the optimized distribution, with
some of the selected locations laying in areas strongly affected by ENSO tele-
connections, such as the nodes of the Pacific North/South American pattern
(Lewis and LeGrande, 2015). This suggests that the CRO-AM tends to prior-
itize those records strategically situated over major climate teleconnections,
which together explain a large fraction of the global temperature variance.


60Chapter 4. Selection of proxy locations for temperature reconstructions

(a)

-2 -1 0 1 2

(b)

Temperature anomaly ( C)
-2 -1 0 1 2

Figure 4.12: Reconstruction skill of internal variability patterns with the
CRO-MIN subset of PAGES-2k records. Composite of annual temperature
anomalies (◦C, with respect to 850-2005 CE) for El Niño events in (a), the tar-
get field (the first CESM-LME full-forcing member). (b) The reconstructed
field from the CRO-MIN subset of perfect pseudo-proxies of the PAGES-2k
network (yellow dots). For each panel, crosses depict non-significant temper-
ature differences at 95% confidence level with respect to its corresponding
climatology inferred from a bootstrap of 10000 random samples. El Niño
events are defined as years of the target simulation with standardized tem-
peratures above the 95th percentile at El Niño-3.4 region (black square).

Indeed, the CRO-OPT subset can also capture large-scale internal modes
of atmospheric circulation variability, such as the Northern Annular Mode
(NAM), herein reconstructed from the annual mean sea level pressure (SLP)
fields of the best temperature analogue years (Fig. 4.13). The reconstructed
NAM series follows the simulated variations in the target run, although with
considerable uncertainty, and a tendency to underestimate the target ampli-
tude. The latter suggests that, despite being informative, networks specifi-
cally optimized for a given target field do not necessarily represent the opti-
mized solutions for the reconstruction of other fields.


4.3. Reconstruction of temperature patterns 61

[0-10) [10-20) [20-30) [30-40) [40-50) [50-60) [60-70) [70-80) [80-90) [90-100)
Percentile range (%)

-3

-2

-1

0

1

2

3

N
AM

Reconstruction spread Target

Figure 4.13: Percentile distribution of simulated NAM values in the first
CESM-LME member (850-2005 CE) and their corresponding reconstructions
from the CRO-OPT subset of the PAGES-2k network. Black diamonds rep-
resent the mean simulated NAM for each percentile range, with vertical black
lines showing their respective minimum and maximum values. Blue violins
show the distribution of the reconstructed NAM values for the same years
included in each percentile range and 100 different NAM reconstructions.
Mean values of the violin distributions are depicted as horizontal blue lines.

As for the external forcings, the constrained reconstructions from subsets
of perfect pseudo-proxies also capture the timing and spatial fingerprint of
the simulated cooling response to major volcanic events of the Last Millen-
nium This is seen in Fig. 4.14 where a clear cooling with respect to the
mean temperatures of the two previous years appears in the target and re-
constructed field from CRO-OPT following the Tambora eruption in 1815.
Interestingly, CRO-AM replicates these forced responses by choosing years
from the CESM-LME with significant volcanic activity, indicating that the
reconstruction method is capturing the physical processes behind the climate
response to external forcings.


62Chapter 4. Selection of proxy locations for temperature reconstructions

(a)

-4 -3 -2 -1 0 1 2 3 4

(b)

Temperature anomaly ( C)
-4 -3 -2 -1 0 1 2 3 4

Figure 4.14: Annual temperature anomalies (oC) after Tambora’s erup-
tion (1815 CE) in (a) target simulation and (b) CRO-OPT reconstruction.
Anomalies are calculated as the difference between the year after the eruption
and the mean temperature of the 2 previous years.

This suggests that the CRO-OPT reconstruction can discern the volcanic
fingerprint in the global temperature patterns from the internal variability.
Therefore, in a more general context, we have assessed if CRO-OPT can de-
tect externally forced responses and attribute them to the responsible forcing.
This is done by first assigning each year of the CESM-LME to a dominant
factor (i.e. the one with the largest radiative forcing in the top of the atmo-
sphere for that year) by using single forcing simulations of the CESM-LME.

To determine if a particular forcing is dominant for a certain year, we
have used the area-weighted mean FSNTOAC (Clear-sky net solar flux at
top of the atmosphere) from the control simulation and the ensemble of
single-forcing simulations for the following available forcings: orbital, so-
lar, volcanic, land use/land cover, greenhouse gas concentrations and ozone.
For each year and single-forcing run, we computed the absolute FSNTOAC
difference with respect to the 1156-yr mean of the control simulation. By do-
ing this, we estimated the ensemble mean radiative imbalance for each year
and forcing. A given year is assigned to an external forcing if two conditions
are met: that the forcing has the highest absolute FSNTOAC difference,
and this value is above 2 standard deviations of the control mean (to avoid
false forcing assignations), which provides a measure of internal variability.


4.3. Reconstruction of temperature patterns 63

Otherwise, the year is not assigned to any external forcing, this subset com-
prising years of weak or multiple external forcings, which may obscure the
attribution exercise. With this approach, 60% of years on record could not
be assigned to a single forcing, whereas 21% and 10% of years were as-
sociated with volcanic and solar forcings, respectively. On the other hand,
only 2% of years on record have been associated with greenhouse gas emis-
sions (note that this forcing is not relevant before 1850 CE). The remaining
forcings, namely land use/land cover ( 4%), ozone ( 2%) and orbital (less
than 1%), dominated for very few years and are not considered, since their
frequency series for the last millennium did not show signals discernible from
the internal variability. Note that this attribution method is instantaneous
(i.e. it assigns dominant forcings year by year), and not fully independent
since forcing years are selected from different simulations of the same model.

Then, for each year of the target simulation we count the detected fre-
quency of each external forcing in the 100 best analogue years selected using
CRO-OPT locations and test whether this frequency is significantly larger
than that expected by random chance. The results for key forcings of the last
millennium (Fig. 4.15) confirm that CRO-OPT is able to detect some forced
signals and assign them to the right forcing. This is true for years following
large volcanic eruptions (e.g. Tambora 1816) and periods with high volcanic
activity (e.g. the 18th century). Similarly, the conspicuous warming of the
second half of the 20th century is preferentially reconstructed from years
with strong anthropogenic forcing, so that this signal can be attributed to
increasing concentration of greenhouse gases. On the contrary, solar signals
on annual time scales are not well discernible from the internal variability.
Note, however, that it does not mean a lack of solar forcing signals since
our approach is instantaneous (based on annual forcings and temperature
patterns) and hence it does not take into account lagged or low-frequency
responses to external forcings.


64Chapter 4. Selection of proxy locations for temperature reconstructions

-2 -2

-1 -1

0 0

1 1

FS
N

TO
AC

 (W
m

2 )
(a) Volcanic GHG Solar

0 0

20 20

40 40

60 60

80 80

100 100
(b)

0 0

10 10

20 20

30 30

Pe
rc

en
ta

ge
 o

f a
na

lo
gu

es

(c)

1000 1200 1400 1600 1800 2000
Years (CE)

0 0

10 10

20 20

30 30
(d)

Figure 4.15: Detection of external forcings in the reconstruction with the
CRO-OPT subset of the PAGES-2k network. (a) Annual mean clear-sky net
solar flux at top of the atmosphere for three single-forcing ensemble simula-
tions. Percentages of the 100 best analogue years selected from CRO-OPT
with the same dominant forcing as in the given year of the target simula-
tion. (b) volcanic, (c) greenhouse gases, and (d) solar forcing. Black dashed
lines depict the significance thresholds above which there is an instantaneous
detection of forced signals attributed to the given forcing.

4.4 Insights on past anomalous periods

We now focus on longer time-scales and explore how well the CRO-AM recon-
structions capture key anomalous periods of the past, such as the MCA (Me-
dieval Climate Anomaly) (950-1250 CE) (Bradley, 1996) and the LIA (Little
Ice Age) (1450-1850 CE) (Bradley and Jones, 1993). Although model sim-
ulations and proxy-based reconstructions agree on the existence of a global


4.4. Insights on past anomalous periods 65

mean temperature difference between both periods, there are discrepancies
in its magnitude and spatial pattern, which are still not well understood
(Mann, 2009; Fernández-Donado et al., 2013). Models usually yield a vari-
ety of spatial temperature patterns for the MCA-LIA transition as well as
weaker differences than proxy-based reconstructions (Fernández-Donado et
al., 2013).

Table 4.2: GMT differences between the MCA (950-1250 CE) and the LIA
(1450-1850 CE). Area-weighted mean temperatures are calculated globally
and for the Northern Hemisphere (NH). The target is the first ensemble
member of the CESM-LME. Reconstructions are generated with CRO-AM
using perfect pseudo-proxies at the locations of CRO-MIN, CRO-OPT and
the full-proxy network of PAGES-2k.

MCA-LIA Temperature (oC)
Map Global NH
Target 0.19 0.20

CRO-MIN (17) 0.08 0.10
CRO-OPT (120) 0.11 0.13
PAGES-2k (569) 0.09 0.11

The MCA-LIA GMT difference for our target simulation is 0.19 ◦C, si-
milar to the remaining members of the CESM-LME. However, this value is
halved in the reconstructions obtained with CRO-MIN, CRO-OPT and the
complete PAGES-2k network, even if we use ideally perfect pseudo-proxies
(Table 4.2). The attenuation of the global warming amplitude is further
evidenced by the spatial pattern of temperature differences depicted in Fig.
4.16. Note that all regions show milder differences in the reconstruction,
especially in the Southern Ocean and part of the Arctic .


66Chapter 4. Selection of proxy locations for temperature reconstructions

(a)
-0.75

-0.50

-0.25

0.00

0.25

0.50

0.75

M
C

A-
LI

A 
Ta

rg
et

(b)
-0.75

-0.50

-0.25

0.00

0.25

0.50

0.75

M
C

A-
LI

A 
R

ec
on

st
ru

ct
io

n 
(A

M
)

Figure 4.16: Spatial pattern of mean temperature difference (oC) between
MCA (950-1250 CE) and LIA (1450-1850 CE) in (a) the target simulation
and (b) the CRO-OPT reconstruction.

To deep further into the causes of the MCA-LIA underestimation, we first
assess whether they stem from a limited coverage of the current PAGES-2k
archive. To address this question, the CRO-AM was run to find the same
number of representative locations as in CRO-MIN, but from the full grid of
CESM-LME (i.e. without constraining the search to the PAGES-2k network,
Free run in Fig. 4.17).

The resulting distribution of selected locations has similarities with CRO-
MIN, stressing the relevance of high latitudes (Fig. 4.17b). Therefore, we
conclude that climatically representative regions for the last millennium are
sufficiently covered by the PAGES-2k archive. This also holds for the MCA
and LIA periods, whose reconstruction skills with CRO-MIN are comparable
to those obtained for the entire period of the last millennium. Dedicated
experiments to find specific subsets of the PAGES-2k network that optimize
the CFR for the MCA and LIA separately (MCA and LIA runs in Fig.
4.17) also bear strong resemblance with the distribution obtained in the last
millennium experiment (CRO-MIN). Indeed, good analogues of the MCA
and LIA global patterns can be found through most of the last millennium,
although with reduced probability during exceptionally cold and warm inter-
vals, respectively, including the 20th century (Fig. 4.18). Interestingly, MCA
patterns do not resemble those of the 20th century, indicating distinguishable
MCA signatures with respect to the organized global warming of the 20th

century in the model world.Therefore, the question is if there is any subset


4.4. Insights on past anomalous periods 67

0 1 2 3 4 5 6

-80

-60

-40

-20

0

20

40

60

80

Number of locations

(a) (b) (c) (d)0 1 2 3 4

-80

-60

-40

-20

0

20

40

60

80

0 1 2 3 4 5 6 7

-80

-60

-40

-20

0

20

40

60

80

La
tit

ud
e 

(D
D

)

CRO-MIN Free MCA-LIA MCA LIA

Figure 4.17: Distribution of representative locations for different experiments
with perfect pseudo-proxies and the CRO-AM. (a) 2-D and (b-d), latitu-
dinal distribution of optimized subsets of perfect pseudo-proxies (with the
same size as CRO-MIN) selected with the CRO-AM for different optimiza-
tion problems. Optimized reconstruction of the global annual temperature
fields of the last millennium from locations constrained to the PAGES-2k net-
work (CRO-MIN, orange) and from an unconstrained selection (Free, purple).
Optimized subsets of the PAGES-2k network for the reconstruction of the
global annual temperature fields of the MCA (red) and LIA (blue) periods
separately, and the spatial pattern of the mean temperature difference be-
tween the MCA and LIA (MCA-LIA, green). Latitudinal distribution of (b)
CRO-MIN (orange shading) and Free (purple line). (c) MCA-LIA. (d) MCA
(red shading) and LIA (blue line).

of PAGES-2k locations that can reproduce the MCA-LIA pattern (i.e. the
spatial pattern of temperature differences on multi-centennial scales). This
has been accomplished by modifying the health function of the CRO algo-
rithm in order to find subsets of perfect pseudo-proxies that minimize the
spatial RMSE of the MCA minus LIA temperature pattern (MCA-LIA run
in Fig. 4.17). The results of this experiment reveal that it is possible to find
a subset of the same size as CRO-MIN that matches the MCA-LIA pattern
with almost the same amplitude as in the target simulation (0.15 ◦C).


68Chapter 4. Selection of proxy locations for temperature reconstructions

1000 1200 1400 1600 1800 2000
Years (C.E.)

0 0

100 100

200 200

300 300
N

um
be

r o
f a

na
lo

gu
es

MCA (950-1250)
LIA (1450-1850)

Figure 4.18: Time series with the total frequency of analogues for all years of
the MCA (950-1250 CE, red) and LIA (1450-1850 CE, blue) in the CRO-OPT
reconstruction. For each year of the MCA and LIA in the target simulation,
the 100 best analogues of the annual temperature at the CRO-OPT locations
are selected. Their respective years of occurrence are retained and accumu-
lated through the MCA and LIA periods, separately.

However, this improvement in reproducing the MCA-LIA pattern is made
by sacrificing the reconstruction skill on annual scales, with the RMSE in-
creasing from 0.65 ◦C in CRO-MIN to 0.80 ◦C in the MCA-LIA run. Fur-
thermore, MCA-LIA locations are mainly situated over tropical and extra-
tropical latitudes (Fig. 4.17c), leading to a completely different latitudinal
distribution compared to that in the annually-resolved MCA, LIA, CRO-
MIN or Free experiments.

These results indicate that perfect pseudo-proxy locations that optimize
the reconstruction in the high-frequency do not necessarily bring an improve-
ment in the lower frequencies of the spectrum. To support this statement,
we have performed additional experiments where annual temperature fields
of the target simulation have been low-pass filtered with 10- and 100-year
running windows. The CRO-CCA has been run this time to find optimized
subsets of the PAGES-2k network targeting the CFR on each time scale. Fig-
ure 4.19 shows how the number of selected records at high latitudes (poleward
of 65◦ N) significantly decreases for lower frequencies.


4.4. Insights on past anomalous periods 69

Annual Decadal Centennial
Lowpass filtering

4

6

8

10

12
Pr

ox
ie

s 
at

 p
ol

ar
 re

gi
on

s 
(6

5
N

/S
)

Figure 4.19: Number of records at polar regions (latitudes above 65◦N/S)
in optimized subsets of 20 perfect pseudo-proxies of the PAGES-2k network
for CRO-CCA reconstructions of global temperature fields on different time
scales. The CRO-CCA has been run to find subsets of 20 perfect pseudo-
proxy locations of the PAGES-2k network that optimize the reconstruction
of the global temperature fields of the first CESM-LME member at annual,
decadal and centennial time scales by using a 1-, 10- and 100-year low-pass
filter smoothing, respectively. The distributions include 200 different opti-
mized solutions from 5 CRO-CCA runs (40 best solutions of each run were
kept). Boxes represent the median (orange line) and interquartile ranges of
the distribution, with whiskers denoting extreme solutions.

This agrees with the optimized distribution for the reconstruction of the
MCA-LIA spatial pattern shown in Fig. 4.17. As the large interannual
variability at high latitudes was removed for the low-pass filtered fields, we
hypothesize that fewer high-latitude records are needed to reconstruct long
term temperature variations. As a consequence, the optimality of the lo-


70Chapter 4. Selection of proxy locations for temperature reconstructions

cations selected for the annual CFR is progressively lost towards the lower
frequencies of the spectrum. This is further supported by the power vari-
ance spectrum of GMT reconstructions obtained with different subsets of
the PAGES-2k network (Fig. 4.20), which shows similar performance of the
CRO-OPT and full-proxy reconstructions on longer time scales for all en-
semble members.

10

10

20

20

50

50

100

100

200

200

Period (Years)

10
1

10
1

10
0

10
0

Sp
ec

tra
l D

en
si

ty

Target
PAGES

CRO-OPT
SNR=1

Figure 4.20: Power density spectrum of GMTs series for 850-2005 CE. Color
lines show the power spectrum of area-weighted global mean temperature
anomalies calculated for the target simulation (black), the CRO-OPT re-
construction (orange), and the CRO-AM reconstructions generated with the
full-proxy PAGES-2k network of perfect pseudo-proxies (purple), and an op-
timized subset of 150 noisy pseudo-proxies with SNR=1 (red). Opaque lines
depict the mean spectral density of the ensemble, and transparent lines rep-
resent individual spectral densities of all (13) members of the ensemble.

Both networks underestimate the low frequency variations of the true
GMT, in agreement with the reduced amplitude of their MCA-LIA recon-
structions. Although the ultimate causes are unclear and require dedicated
studies in more realistic conditions, our experiments point towards a signif-
icant dependence of the optimal subset of PAGES-2k locations on the time
scale variations of the target field that are pursued by the CFR.


4.4. Insights on past anomalous periods 71

Keynotes

These are the most important findings of Chapter 4 to take home:

• The bias induced by non-homogeneous distributions is similar to that
obtained by adding red noise with the same variance as the climate
signal.

• A reduced set of representative proxies generates reconstructions with
better skill than using the entire PAGES-2k network.

• There is a significant increase of correlation in reconstructions obtained
with proxies at representative locations.

• Annual temperature fields are better reconstructed with locations in
high latitudes and teleconnection regions.

• Long-term climate variations are better reconstructed with locations
over low and middle latitudes.

• Representative locations depend on the temporal resolution of the tar-
get.


Chapter 5

North Atlantic SLP
reconstruction since 1750

5.1 Background

Atmospheric conditions are nowadays continuously monitored by systems
ranging from land-based meteorological stations to geostationary satellites
orbiting the planet. However, the amount of available information from
instrumental records drastically decreases the further we go back in time,
leading to a problem of data scarcity, particularly acute in the pre-industrial
era (before 1850 of the Common Era, CE) and over the oceans (Küttel et
al., 2010; Cram et al., 2015; Franke et al., 2017; Brönnimann et al., 2020;
Noone et al., 2020). Within this context, new methods that maximize the
extraction of information from climate datasets have recently emerged (e.g.,
Ilyas et al., 2017; Benestad et al., 2019; Carro-Calvo et al., 2020; Kadow et
al., 2020; Vaccaro et al., 2021). They represent promising tools to manage
large amounts of current information (big data) as well as situations of data
scarcity including a better preservation of variability and resolve more ac-
curately the covariance structure for regions with data gaps. In this sense,
optimization techniques including evolutionary algorithms (Vrugt and Robin-
son, 2007; Eiben and Smith, 2015) have been developed in the area of AI to
maximize the skill of climate field reconstructions (CFR) as seen in Chapter

73


74 Chapter 5. North Atlantic SLP reconstruction since 1750

4 (Salcedo-Sanz et al., 2019; Jaume-Santero et al., 2020). As it is described
in Section 3.2, evolutionary algorithms (Eiben and Smith, 2015) are soft-
computing techniques inspired by biological and natural selection processes
(Del Ser et al., 2019) which are based on the reproduction and survival of
best suited individuals within a competitive environment.

In this chapter, we have coupled a CFR method with the CRO (i.e., the
evolutionary algorithm employed in Chapter 4) to obtain optimized high-
resolution (1◦ × 1◦) monthly SLP fields over the North Atlantic for 1750-
2004 CE from historical land-based observations over Europe, Greenland
and North America included in the SLP-Obs dataset (see Subsection 2.1.2).
The evolutionary algorithm is designed to find an optimized combination of
weights for the observing network that maximizes the reconstruction skill of
the SLP field (Section 5.2).This new monthly SLP data set presented herein
supersedes the statistically-derived 5◦ × 5◦ resolved gridded seasonal SLP
dataset of Luterbacher et al. (2002) and Küttel et al. (2010) that cover the
eastern North Atlantic Europe and the Mediterranean area back to 1750 CE
using terrestrial instrumental pressure series and marine wind information
from ship logbooks. In contrast to Luterbacher et al. (2002) and Küttel et
al. (2010), we use more station pressure series, provide a higher resolved spa-
tial reconstruction and apply a novel method that can deal more accurately
with a low number of station observations, and that can better preserve vari-
ability and better resolve the covariance structure, including a more realistic
representation of circulation extremes.

The reconstruction performance is compared against other reconstruc-
tions obtained using the same set of observations but without optimization
(Section 5.3), and allows us to study the extratropical climate variability of
the Eastern North Atlantic Ocean and Europe for over the past 255 years
focusing on the NAO (North Atlantic Oscillation). The NAO is the leading
mode of climate variability related to the rearrangement of air mass between
subtropical and polar latitudes, and whose alternating phases generate strong
changes in surface temperatures, wind, and precipitation over the Atlantic re-


5.2. Optimized networks 75

gion and its surroundings (Hurrell and Deser, 2010). Although several NAO
indices have been published previously, discrepancies among them (especially
present prior to the 19th century where the lack of information is notoriously
higher) impairs the study of changes in these phases and their influence on
regional climates (e.g., Schmutz et al. (2000); Hernández et al. (2020) and
references therein).

In this chapter, we apply the optimized network approach to reconstruct
SLP fields back to 1750 CE, that allow to study the change of atmospheric
circulation over the Eastern North Atlantic-European area as well as the as-
sociated atmospheric action centers such as the AH (Azores High) and the
Icelandic Low IL (Icelandic Low) (Section 5.4). The AH is a persistent sub-
tropical anticyclone situated around the Azores islands that is associated with
the descending branch of the Hadley Cell over the Atlantic Ocean (Zishka
and Smith, 1980; Davis et al., 1997; Ioannidou and Yau, 2008; Iqbal et al.,
2019). In spite of the important role of the AH in the NAO pattern dur-
ing winter time, and the influence of this high pressure system on European
hot temperatures and drier weather in summer (Hasanean, 2004), there are
not many high-resolution reconstructions of the AH due to the difficulty of
extracting data from that oceanic area. It is therefore crucial to optimize
land-based networks of climate observations to maximize the skill over the
Atlantic region.

5.2 Optimized networks

Note that the AM in its standalone version (see Subsection 3.1.1) employs
the full network of observations, without discrimination among the records,
which are considered equally informative (i.e. weights of 1). To test the effect
of optimizing the observing network, an additional CRO-AM reconstruction
was derived by imposing an optimized set of weights provided by the CRO
algorithm (see Section 3.2) during the AM reconstruction. These weights are
used in the computation of the metric (i.e., the RMSE), therefore affecting


76 Chapter 5. North Atlantic SLP reconstruction since 1750

the selection of the best analogues and the reconstructed field. In that way,
the information of the full network is exploited, while acknowledging differ-
ences in the predictive skill of individual sites. This approach is novel, and
represents an important step with respect to the experiments described in
Chapter 4, where a subset of optimal locations was selected (weights equal
to 1 or 0), therefore discarding observations that could still provide useful
information when the availability of the remaining records is compromised.

a)

40

50

60

70

80

90

100

Ti
m

e 
le

ng
th

 1
75

0 
- 2

00
4 

(%
)

1750 1800 1850 1900 1950 2000
Year (CE)

0 0

20 20

40 40

60 60

80 80

100 100

N
um

be
r o

f o
bs

er
va

tio
ns

 (%
) b)

Figure 5.1: Spatio-temporal distribution of SLP observations. (a) Spatial
distribution of stations with monthly SLP observations for 1750-2004 CE.
Shading shows the percentage of time with available observations over the
1750-2004 CE period, with darker shading indicating longer time series. (b)
Evolution of the frequency of observations (in percentage with respect to the
total number of stations) for 1750-2004 CE.


5.2. Optimized networks 77

Within this framework, weights were optimized for each calendar month
of the year by minimizing the area-weighted RMSE of the North Atlantic
SLP reconstruction obtained with the AM over the reconstructed period.
The performance of the CRO-AM and AM reconstructions, obtained with
and without optimized weights, are assessed with respect to the 1836-2004
CE validation period of the 20CRv3 reanalysis, allowing us to quantify the
improvements of the optimized reconstruction. While this dataset covers
the 1836-2004 CE period of the observations, it does not provide a ground
truth to optimize the observing network of 1750-1835 CE. Although using
the weights of the 1836-2004 CE interval through the entire reconstruction
period (1750-2004 CE) is the simplest way to address this issue, it would not
take into account changes in the predictive skill of the local records caused
by the actual gaps of the earlier period and the pronounced changes in the
spatial distribution of the observing network (Fig. 5.1). Therefore, the set
of monthly weights were optimized separately for 1750-1835 CE.

To do so, we took the observations of the 1919-2004 CE interval (which
comprises the same number of years) and reconstructed their concurrent
SLP fields by imposing the same constraints in data availability as in the
observing network of the 1750-1835 CE period. In that way, weights can be
optimized for a set of records that preserves the spatio-temporal distribution
of observations of the earlier period. This set of optimized weights is sub-
sequently applied to the actual observations of 1750-1835 CE to reconstruct
the North Atlantic SLP for this earlier period with the AM. This experiment
assumes that the relationships between the local records and the large-scale
field do not change substantially with time, and that temporal changes in
observational errors are not sufficiently large to affect the overall distribu-
tion of weights. While these assumptions may not hold for all local records,
they were preferred to unrealistic approximations (i.e., no changes in data
availability) that are systematically violated by the full network.


78 Chapter 5. North Atlantic SLP reconstruction since 1750

a) January (1836-2004) b) June (1836-2004)

0.0

0.2

0.4

0.6

0.8

1.0

W
ei

gh
ts

c) January (1750-1835) d) June (1750-1835)

Figure 5.2: Spatial distribution of optimized monthly weights for the observ-
ing network obtained with the CRO-AM algorithm. Weights (from 0 to 1)
apply to the observing network of (a, c) all Januaries, and (b,d) all Junes of
the (a,c) 1750-1835 CE and (b,d) 1836-2004 CE period. The size of the dot
is proportional to the magnitude of local weight, which is also indicated by
shading. Grey crosses in (c) and (d) represent observations without available
information for 1750-1835 CE.

Figure 5.2 shows the spatial distribution of optimized weights for two
representative months of the year and the two considered sub-periods: 1836-
2004 CE (Fig. 5.2a and b) and 1750-1835 CE (Fig. 5.2c and d). Local
weights for the earlier reconstruction period are substantially different from
their more recent counterparts. Overall, higher weighting values are assigned
to the reduced subset of observations of the 1750-1835 CE period than to the
records of the almost complete network of 1836-2004 CE. For instance, ∼85%
of the October observations from 1836 to 2004 CE have low weights (below
0.5), whereas this number decreases to ∼70% for the 1750-1835 CE period.
Therefore, observations gain representativeness when the lack of informa-
tion increases, and locations with low weights in a large network can be
very informative when considering a reduced subset of the network. In spite


5.2. Optimized networks 79

of this, there are no generalized high weights, even in the case of extreme
data scarcity. Indeed, for the earlier reconstruction period, only 7 out of 23
records have weights with values above 0.5. Therefore, the CRO algorithm
only assigns high values to a few locations, stressing the need for exploit-
ing the information of the full network, particularly when gaps are present.
The spatial distribution of weights in the 1836-2004 CE period is preserved
for different months of the year, with higher values for latitudes in-between
30◦N and 50◦N. However, the pattern of weights changes from one month
to another for 1750-1835 CE (Fig. 5.2c and d), indicating that a one-fits-all
pattern of monthly weights can be challenging when there are extensive un-
sampled areas. It is difficult to determine whether this seasonal cycle stems
from climatological aspects that are not evidenced in larger networks or from
peculiarities of the limited network of 1750-1835 CE.

Recall that weights apply to an incomplete network of observations, and
hence low weights can be caused by poor instrument calibration, reduced
data availability of records (especially in key areas such as the North At-
lantic Ocean), an overall weak predictive skill, or redundant information
with respect to that provided by the remaining records. Therefore, infer-
ences of physical links among stations (or between local records and the
large-scale flow) based on the detailed distribution of local weights within
the network can be misleading and should be interpreted with caution. Note
also that although changes in data availability were taken into account in the
optimization process, weights are time invariant during each reconstructed
period (except for the annual cycle). Strictly speaking, weights are expected
to change with time, following the configuration of the observing network
at any time, and ideally, they should be optimized for each month of the
1750-2004 CE period. However, that approach would be computationally
challenging and is not expected to cause large differences in the optimized
weights for relatively small changes in the distribution of records, partic-
ularly if the number of observations is large. The pronounced changes in
the coverage of observations for the earlier reconstruction period leave large
un-sampled areas (e.g., the northern part of the North Atlantic and the east


80 Chapter 5. North Atlantic SLP reconstruction since 1750

coast of North America) as compared to 1836-2004 CE, justifying a separated
optimization.

a) January (1750-1835) b) June (1750-1835)

0.0

0.2

0.4

0.6

0.8

1.0

W
ei

gh
ts

Figure 5.3: As Fig. 5.2 but for the 20CRv3 reanalysis experiment of the
CRO-AM reconstruction. Monthly weights for (a) January and (b) June
obtained with a perfect (noise free) network of SLP pseudo-observations for
the period 1919-2004 CE taken from the 20CRv3 reanalysis with the same
gaps as the real observations for the period 1750-1835 CE.

To further test the robustness of our results, we also performed a CRO-
AM reconstruction of the 1919-2004 CE SLP fields of the 20CRv3 reanaly-
sis by replacing the observations with the closest reanalysis grid point series
and imposing the same spatio-temporal availability as in 1750-1835 CE. Grid
point series are not perturbed so that they represent perfect local predictors
(SNR = ∞) of the 20CRv3 ground truth. This subset is also less affected by
artifacts (e.g. the mismatch of the spatial scales resolved by the reanalysis
and the station-based observations) and is more physically consistent with
the large-scale field targeted by the reconstruction. The resulting weights are
similar to those found for the observations (Fig. 5.3), stressing the coherence
of the station-based and reanalysis grid point series. As the SNR is higher
in the reanalysis experiment, the overall agreement also suggests a limited
influence of local observational errors on the distribution of weights of the
observing network.

Moreover, to verify that the results of the optimization are robust with
respect to the datasets employed, a model-based sensitivity experiment was


5.2. Optimized networks 81

performed with historical and past1000 simulations from CMIP6-PMIP4
(Eyring et al., 2016; Jungclaus et al., 2017). Both simulations are fully-
forced with standard forcing data sets following the specifications included
in the input4MIPs documentation (http://goo.gl/r8up31). As the optimiza-
tion process is time-consuming, a multi-model ensemble was not affordable.
Therefore, we used the MRI-ESM2-0 model (Yukimoto et al., 2019), whose
spatial resolution (1◦) is similar to that of the 20CRv3. Model outputs were
also regridded using a bilinear interpolation method to match the resolution
of the reanalysis (i.e., grid points at the same latitudes and longitudes). As
described in Subsection 2.2.2, we selected the SLP series from 1750 to 1835
of the model grid points that contain the observed records to create pseudo-
observations mimicking the characteristics of the observing network (which
has the same spatial distribution because of the regridding), and used them
to reconstruct the simulated SLP fields of the model. A similar spatial pat-
tern of weights for this model experiment is also found, indicating that the
optimization is little affected by the specific realization of internal variability.
The model experiment also suggests that model biases in the climatology or
fingerprints of external forcings are either small or play a minor role in the
optimized weights. These results, and the overall model agreement with the
observed distribution of weights for 1750-1835 CE lend support to the ap-
proach adopted for inferring the optimized weights of the observing network
for that period.


82 Chapter 5. North Atlantic SLP reconstruction since 1750

5.3 Skill of the optimized networks

Figure 5.4 summarizes the performance of the CRO-AM optimized networks
for 1750-1835 CE and 1836-2004 CE, and compares it to that obtained with
the AM only (without weighting). The skill is quantified with the area-
weighted RMSE of SLP over the North Atlantic, computed with respect to
the corresponding validation period of the 20CRv3 reanalysis employed dur-
ing the optimization process (1919-2004 CE and 1836-2004 CE, respectively).
The optimized network of the 1750-1835 CE period yields area-weighted
RMSE below 4 hPa all year round, almost doubling the RMSE retrieved
with the much denser network of the 1836-2004 CE period. In both cases,
the RMSE displays a pronounced annual cycle with maxima in winter and
minima in summer, as expected from the seasonal changes in variability of
the North Atlantic atmospheric circulation. The optimized networks have
higher skill than their unweighted counterparts for all months of the year.
Indeed, for some months the RMSE of the optimized but sparse network of
1750-1835 CE approaches that of the denser unweighted network of 1835-
2004 CE, and the weighting improvement is half of that obtained from a
4-fold increase in data density. Note that the improved skill is present for
most time-steps of the reconstruction as seen in Fig. 5.5, indicating that the
network optimization performs well regardless of the number and distribu-
tion of observations.

Although optimized and non-optimized reconstructions have significant
correlations (p < 0.05) in both periods (Fig. 5.6) with respect to the tar-
get field (20CRv3), CRO-AM reconstructions perform significantly better
than AM reconstructions for most regions, especially over the North Atlantic
Ocean and the Canadian North Pole, where Pearson correlation coefficients
increase by up to 0.2 above values obtained without optimization (Fig. 5.4a
and b).


5.3. Skill of the optimized networks 83

a) b)
-0.20

-0.15

-0.10

-0.05

0.00

0.05

0.10

0.15

0.20

C
or

re
la

tio
n 

di
ffe

re
nc

e

January February March April May June July August September October November December
Month

1.5 1.5

2.0 2.0

2.5 2.5

3.0 3.0

3.5 3.5

4.0 4.0

R
M

SE
 (h

Pa
)

c)

Weighted (101)
Unweighted (101)

Weighted (23)
Unweighted (23)

Figure 5.4: Comparison of the SLP reconstruction skill obtained with and
without optimization. (Top panels) Difference of performance (Pearson cor-
relation coefficient with the 20CRv3 reanalysis) between the CRO-AM and
AM SLP reconstructions generated with the observing network of: (a) 1750-
1835 CE; (b) 1836-2004 CE. Crossed regions show non-significant differences
(p>0.05). (c) Monthly mean evolution of the area-weighted root-mean-square
error of the North Atlantic SLP (with respect to 20CRv3) for the CRO-AM
(blue) and AM (orange) reconstruction and the observing network of 1750-
1835 CE (solid) and 1836-2004 CE (dashed). The performance of the 1750-
1835 CE network is evaluated over the 1919-2004 CE period of the reanalysis.

As a matter of fact, the only regions where the reconstruction is not
improved by network weighting are those with a higher density of stations
(e.g., specific areas of Europe). This indicates that the skill of the AM re-
construction is biased towards well sampled regions, and at the expense of
sacrificing its performance over regions with a sparser distribution of ob-
servations. Differently, the optimized networks maximize the reconstruction
skill of the whole study region, reducing the spatial bias induced by the
non-homogeneous and ever-changing distribution of climate observations.


84 Chapter 5. North Atlantic SLP reconstruction since 1750

1920 1930 1940 1950 1960 1970 1980 1990 2000
December
November

October
September

August
July

June
May
April

March
February
January

M
on

th
s

1840 1860 1880 1900 1920 1940 1960 1980 2000
Years (CE)

December
November

October
September

August
July

June
May
April

March
February
January

-2.0

-1.5

-1.0

-0.5

0.0

0.5

1.0

1.5

2.0

R
M

SE
 d

iff
er

en
ce

 (h
Pa

)

Figure 5.5: Area-weighted RMSE difference between monthly SLP recon-
structions generated with and without optimization of the observing network.
The area-weighted RMSE of the North Atlantic SLP reconstructions gener-
ated with the observing network of 1750-1835 (1836-2004) CE is calculated
with respect to the 1919-2004 (1836-2004) CE period of the 20CRv3 reanaly-
sis, and shown in the top (bottom) panel. Blue (red) colors indicate that the
optimized reconstruction has lower (higher) RMSE than its non-optimized
counterpart.

This is consistent with findings in Chapter 4 where pseudo-proxy recon-
structions of global temperature fields from reduced sets of representative
locations were improved at the expense of losing skill in over-sampled re-
gions. The optimized reconstruction in the MRI model experiment shows
the same pattern of improvement (note the similarity between Fig. 5.7 and
Fig. 5.4a), confirming that the reduction of biases in under-sampled regions
is a robust feature of the optimized network, and relatively insensitive to
high- and low-frequency changes in the background state. Interestingly, and
despite the large differences in the distribution of weights for the observing
networks of the early and late sub-periods (Fig. 5.2), they bring very simi-
lar patterns of improvement (Fig. 5.4). In particular, some of the largest
increases in skill are observed over the southern half of the North Atlantic
Ocean.


5.3. Skill of the optimized networks 85

a) b)

0.0

0.2

0.4

0.6

0.8

1.0

Pe
ar

so
n 

co
rre

la
tio

n

c) d)

Figure 5.6: Pearson correlation between SLP target fields and their optimized
and non-optimized reconstructions. Pearson correlation coefficient with the
20CRv3 reanalysis between the CRO-AM (a and b) and AM (c and d) SLP
reconstructions generated with the observing network of: (a and c) 1750-1835
CE; (b and d) 1836-2004 CE. All grid-point correlations are significant for a
95% confidence interval (p<0.05).

Being far enough from major continental areas and the well sampled Eu-
ropean territories to yield skillful reconstructions, this region has often been
disregarded in previous reconstructions. However, regional SLP variations in
this area are of paramount importance for the climate of Europe and North
and Central America by modulating the southern lobe of the NAO in winter
and the intensity and location of the Azores-Bermuda Subtropical High in
summer (Davis et al., 1997; Portis et al., 2001). The following sections focus
on these seasonal aspects of the atmospheric circulation.


86 Chapter 5. North Atlantic SLP reconstruction since 1750

-0.20

-0.15

-0.10

-0.05

0.00

0.05

0.10

0.15

0.20

C
or

re
la

tio
n 

di
ffe

re
nc

e

Figure 5.7: As the top panels of Fig. 5.4 but for the SLP pseudo-
reconstructions (1750-1835 CE) of the MRI-ESM2-0 model. Shading shows
the difference in performance (Pearson correlation coefficient with the 1750-
1835 CE targeted SLP field of the MRI-ESM2-0 model) between the SLP
pseudo-reconstructions generated with and without optimization of the
pseudo-observing network (the simulated SLP series at the grid points match-
ing the locations and availability of the observing network for 1750-1835 CE).

5.4 Climate variability and the Azores High

After demonstrating the added value of network weighting for the valida-
tion period of the CRO-AM reconstructions, we generated the SLP field
reconstructions (CRO-SLP, hereafter) of the 1750-2004 CE period with the
observations of the optimized networks for 1750-1835 and 1836-2004 CE.
They yield more than 250 years of high-resolution monthly SLP fields over
the North Atlantic, allowing us to study the major components of the North
Atlantic atmospheric circulation that govern the climate conditions of its
surroundings (i.e., Europe, Greenland, and the east coast of North Amer-
ica). Such is the case of the winter NAO (Hurrell and Deser, 2010) and the
East Atlantic (EA) pattern (Barnston and Livezey, 1987; Mellado-Cano et
al., 2019), the first and second leading modes of climate variability over the
region. By using the CRO-SLP reconstruction from 1751 to 2004 CE, and
the 20CRv3 reanalysis from 1836 to 2014, we derived the winter (DJF) in-


5.4. Climate variability and the Azores High 87

dices of the NAO and EA for both datasets, defined as the first and second
PC (Principal Components) of standardized SLP fields over [95◦W - 50◦E]
and [20 - 73]◦N. While the first EOF (Empirical Orthogonal Function) (asso-
ciated with the winter NAO) explains 49% of the total variance, the second
EOF (associated with the EA pattern) represents the 19.5%.

The same process was employed to generate the EA index from the sea-
sonal SLP 5◦×5◦ reconstruction of Küttel et al. (2010), which is provided for
1750-1886 CE and as an extension of the HadSLP2 (Allan and Ansell, 2006).
Deriving a PC-based NAO index in Küttel et al. (2010) was hampered by
its limited spatial coverage (the reconstruction stops at 40◦W), and hence
it was better obtained as the standardized SLP difference between Azores
and Iceland, these regions being defined as the mean of the four closest grid
points on the 5◦ × 5◦ grid (Luterbacher et al., 2001). Our indices have also
been compared against other NAO and EA instrumental-based indices (see
Table 5.1) from previous studies (Jones et al., 1997; Luterbacher et al., 2001;
Comas-Bru and Hernández, 2018), obtaining statistically significant correla-
tions (p < 0.05) with all of them, as shown in Table 5.2.

Overall, the correlations are significantly higher for the NAO than for the
EA index, arguably due to the degraded skill of the CRO-SLP reconstruc-
tion over Europe further back in time. This would affect the node of the EA
index as well as the European node in the dipolar definitions of the EA. Note
that, despite the diversity of data and methodologies employed in the defi-
nition of these indices, in all cases the CRO-SLP NAO and EA indices yield
higher correlations than their counterparts obtained from the reconstruction
of Küttel et al. (2010) that used wind records from ship logbooks over the
ocean, in addition to many of the land-based observations. Although the
comparison across indices must be taken with caution, and higher correla-
tions do not necessarily involve better reconstructions, these results suggest
that optimized networks of land-based observations might eventually outper-
form non-optimized networks including land and ocean records.


88 Chapter 5. North Atlantic SLP reconstruction since 1750

Table 5.1: Definition of NAO and EA indices. Time series have been obtained
from the sources indicated below (when provided) or calculated from the
spatially resolved fields as detailed in the second column. All indices have
been re-standardized with respect to 1951-2000 CE.

Index Definition Period

NAO
Jones et al. (1997) Normalized pressure difference between

Azores and Iceland
1825-2014

Luterbacher et al. (2001) Standardized SLP difference between
Azores and Iceland, each region defined as
the mean of 4 grid points on a 5o x 5o grid

1750-2001

Küttel et al. (2010) Standardized SLP difference between
Azores and Iceland, each region defined the
mean of 4 grid points on a 5o x 5o grid

1750-2004

Hurrell and Deser (2010) First principal component of standardized
SLP anomalies (20o-80oN, 90oW-40oE)

1899-2019

EA
Küttel et al. (2010) Second principal component of standard-

ized SLP anomalies (20o-70oN, 40oW-50oE)
1750-2004

Comas-Bru and Hernández (2018) Composite of EA series generated from his-
torical records at Bergen Florida and Valen-
tia, and five reanalyses

1852-2014

As the CRO-SLP reconstruction brings the largest improvement over the
AH pressure system, we performed a more detailed assessment of this sub-
tropical high for 1750-2004 CE. In CRO-SLP (Fig. 5.8a, b), the AH is readily
identified from the seasonally averaged SLP fields of all winters and summers
of the 1751-2004 CE period. Spatial patterns of the AH show significant sea-
sonal differences, exhibiting a wider high pressure center across the Atlantic
for summer, and a weaker system for winter as described in previous findings
(Davis et al., 1997; Wanner et al., 1997; Portis et al., 2001; Küttel et al.,
2010). The 1750-2004 CE evolution of the seasonal AH is depicted in Fig.
5.8c and d. Its intensity has been defined as the maximum 5◦ × 5◦ mean
SLP within the [20 - 55]◦N and [10 - 70]◦W domain. These criteria were cho-
sen to facilitate the identification of the AH center and avoid misdetections,
but the results are relatively robust to small changes in the selected domain.
There are seasonal differences in the interannual evolution of the AH pressure


5.4. Climate variability and the Azores High 89

Table 5.2: Pearson correlation coefficients of winter NAO and EA indices.
Correlations have been calculated for the overlapping interval of each pair
of indices within the 1751-1886 CE period (to avoid chucks in some of the
series that were filled or extended with observations from other datasets).
Coefficients in bold are statistically significant at the 95% confidence interval.
Information about the NAO and EA indices can be found in Table 5.1. All
indices are standardized with respect to 1951-2000 CE.

CRO-SLP Küttel et al. (2010)

NAO
Jones et al. (1997)
(1825-1886)

0.92 0.77

Luterbacher et al. (2001)
(1751 - 1886)

0.75 0.60

EA
Comas-Bru and Hernández (2018)
(1852-1886)

0.73 0.28

system, with summers yielding quite stable pressures around 1024 hPa, and
winters displaying larger variability on interannual and longer time scales,
including a long-term trend towards the end of the analyzed period.

To place this trend in the context of the last 250 years, we have computed
trends of the winter AH intensity for running windows of 50 years from 1751
to 2004 CE (Fig. 5.9) with CRO-SLP, from 1837 to 2013 CE with 20CRv3,
and from 1899 to 2019 CE with NCAR SLP (Hurrell et al., 2020). In win-
ter, decadal trends are comparatively smaller and show no large variations
from 1751 to 1900 CE, being followed by an increase during the mid-1920s,
a decrease during the 1940s, and a sharp increase during the 1960s onwards,
in agreement with Davis et al. (1997) and Hasanean (2004). The last change
is concurrent with the prominent positive trend of the winter NAO from
the 1960s to the 1990s (Pinto and Raible, 2012). Associated impacts of the
recent AH strengthening have already been reported (Falarz, 2019). The
CRO-SLP captures this trend and further reveals that it is unprecedented


90 Chapter 5. North Atlantic SLP reconstruction since 1750

since 1750 CE, with the last 50 years exhibiting the largest intensification of
the winter AH pressure system (0.55±0.19 hPa/10y in 2002) of the last two
and half centuries. In contrast, a recent intensification of the AH center is
not observed during summer, although some studies using stream function
as a diagnostic have reported a strengthening and westward movement of its
western ridge over North America (Li et al., 2012).

a) Winter b) Summer

995

1000

1005

1010

1015

1020

Se
a 

Le
ve

l P
re

ss
ur

e 
(h

Pa
)

1015 1015

1020 1020

1025 1025

1030 1030

Az
or

es
 H

ig
h 

(h
Pa

)

c) Winter

1800 1850 1900 1950 2000
Year (CE)

1015 1015

1020 1020

1025 1025

1030 1030d) Summer

Reconstruction HadSLP2

Figure 5.8: SLP variability over the Azores High region from 1750 to 2004
CE. Climatological (1750-2004 CE) mean SLP (shading, in hPa) obtained
with the optimized reconstruction for: (a) winter (DJF) and (b) summer
(JJA). Seasonal mean time series (1750-2004 CE) of the intensity of the (c)
winter and (d) summer Azores High (blue lines, in hPa). Shading shows the
uncertainty range calculated as two standard deviations over the 5◦×5◦ area
where the maximum of SLP is located. Dashed lines illustrate the Azores
High intensity for the HadSLP2 for the 1850-2004 CE period.


5.4. Climate variability and the Azores High 91

Time periods [year - 50, year) (CE)-1.0 -1.0

-0.5 -0.5

0.0 0.0

0.5 0.5

1.0 1.0

W
in

te
r S

LP
 tr

en
d

a)

CRO-SLP NCAR 20CRv3

1800 1825 1850 1875 1900 1925 1950 1975 2000
Time periods [year - 50, year) (CE)

-1.0 -1.0

-0.5 -0.5

0.0 0.0

0.5 0.5

1.0 1.0

Su
m

m
er

 S
LP

 tr
en

d

b)

Figure 5.9: Decadal SLP trends of the Azores High pressure center intensity.
Decadal linear trends of the (a) winter and (b) summer SLP (blue line, in
hPa per decade) obtained from the CRO-SLP reconstruction (blue line), the
20CRv3 reanalysis (red line), and the NCAR SLP dataset (dashed line), over
the Azores High center for 50-year running windows from 1750 to 2019 CE.
Blue shading illustrates the uncertainty range of CRO-SLP as two standard
deviations with respect to the mean.

The most striking feature of the long-term evolution of the summer AH
is an overall intensification since the second half of the 18th century and
during most of the 19th century (Fig. 5.8d and 5.9b). However, this trend
coincides with the period of largest uncertainties, and hence the overall AH
weakening towards the beginning of the analyzed period may result from
limitations of the observing network (e.g., analogue fields poorly constrained
by the scarcity of observations). Moreover, we have also assessed the contri-
bution of AH variations to the historical evolution of the NAO. To do so, the
NAO index was decomposed as the standardized sum of its AH and IL com-
ponents. They have been obtained separately as the projection of the winter
SLP anomalies onto the grid points where the NAO pattern was strictly pos-
itive (AH) and negative (IL), respectively. The first EOFs associated with
both projections describe the 55.4% of the total variance over the AH region
and the 49.7% over the IL region.


92 Chapter 5. North Atlantic SLP reconstruction since 1750

The sum of these AH and IL indices has a correlation (R2) of 0.99 and
a RMSE of 0.11 with respect to the original CRO-SLP NAO index used
in Table 5.2 for the 1751-2004 CE period. This linear behavior allows us
to quantify the AH and IL contributions to the NAO of each winter, and
discern the dominant component through the 1750-2004 CE period. The
leading contributor is easily identified from the absolute values of the AH
and IL indices. Fig. 5.10a shows the time series of |AH| minus |IL|, being
this difference negative if the IL was predominant for a certain year and
positive if the AH was the dominant one. Although AH and IL indices are
strongly anti-correlated, there are few years with almost equal contributions
to the NAO (e.g., |AH| - |IL| ∈ (-0.1, 0.1) for 6.7% of winters during the
1751-2004 time period). Although AH and IL dominant years tend to al-
ternate without a clear pattern of long-term trends, the time series displays
some low-frequency variability, with e.g. the AH (IL) dominating the NAO
over the second half of the 18th century (the first half of the 19th century),
which does not translate into concurrent variations in the sign of the NAO
index (Fig. A.3). Some of these periods are more evident at the beginning
of the series, and may be affected by larger uncertainties of the CRO-SLP
reconstruction at that time (Fig. 5.8).

Overall, Fig. 5.10a illustrates that differences in the reconstruction skill
of the AH and IL would cause time-varying uncertainties with an impact on
the magnitude and even the sign of the NAO (note that 45% of the winters
have absolute differences larger than 0.5 standard deviations). To further
address this issue, we have calculated the spread of the historical NAO time
series across the different NAO indices defined in Table 5.1 (the PC-based
NAO from the 20CRv3 reanalysis has also been included). The analysis has
been restricted to 1900-2004 CE, since the number of available indices de-
creases backwards. Interestingly, the evolution of the NAO spread from 1900
to 2004 CE tends to follow that of the |AH| - |IL| difference in AH dominant
years (|AH| - |IL|>0), indicating that the more dominant the AH was with
respect to the IL, the higher the differences between NAO reconstructions.


5.4. Climate variability and the Azores High 93

1800 1850 1900 1950 2000
Years (CE)

-3

-2

-1

0

1

2

3

PC
1 

di
ffe

re
nc

e 
(|A

H
| -

 |I
L|

) a)

AH dominant IL dominant
-1.0

-0.5

0.0

0.5

1.0

St
an

da
rd

 d
ev

ia
tio

n 
(

)

NAO Index spread ( )

-2 -1 0 1 2 3
PC1 difference (|AH| - |IL|)

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

N
AO

 In
de

x 
sp

re
ad

 (
)

b)

Figure 5.10: Contribution of the Azores High and Iceland Low to the winter
NAO. (a) Time series (1751-2004 CE) of the winter |AH|-|IL| index, denot-
ing the unbalanced contribution of the Azores High and Iceland Low anoma-
lies to the winter NAO. Positive (orange) and negative (blue) values show
winters with a leading influence of the Azores High and Iceland Low, re-
spectively. The red line represents the spread (standard deviation) of NAO
indices for each winter of the 1900-2004 CE period, calculated from a suite of
instrumental-based NAO indices standardized with respect to 1951-2000 CE
(Table 5.1). (b) Scatter plot (1850-2004 CE) of the spread of NAO indices
for winters dominated by the Iceland Low (blue section) and the Azores High
(orange section). Dashed lines represent separate linear regressions for each
dominant component. Grey shading shows the 95% confidence interval of the
linear fits. All series have been standardized with respect to the 1951-2000
CE baseline.

Indeed, the Pearson correlation coefficient between the NAO spread and |AH|
- |IL| series is 0.24, but increases to 0.36 (p<0.01) for AH dominant years.
This is better illustrated in Fig. 5.10b where there is a positive trend in


94 Chapter 5. North Atlantic SLP reconstruction since 1750

NAO spread for AH dominant years, and no significant correlation for IL
dominant periods. Accordingly, NAO indices tend to show better agreement
in years dominated by the IL and higher discrepancies for years when the
NAO was largely determined by AH anomalies. Part of the NAO spread
is expected to arise from differences in the NAO definition. However, we
still find significant correlations when one NAO index is dropped from the
spread, with a minimum correlation of 0.21 if the 20CRv3 is not included
and a maximum correlation of 0.42 if Jones et al. (1997) is not considered
(see Table B.2). Therefore, uncertainties in the AH represent an important
source of disagreement for instrumental NAO series. As these NAO indices
are obtained from station-based observations or instrumental SLP fields, the
result points to different levels of performance in these datasets to capture
the winter AH pressure system. This stresses the added value of the CRO-
SLP reconstruction, which brings a significant increase in the SLP skill over
the AH region (Fig. 5.2), and of optimization as a way to overcome potential
shortcomings affecting instrumental datasets.

In summer, the AH becomes detached from the summer NAO and in-
creases its areal extent and intensity. Although interannual changes in inten-
sity are relatively small (Fig. 5.8d), variations in location or extension can
be pronounced and affect the climate conditions of the surrounding continen-
tal regions in subtropical and mid latitudes. Thanks to the high resolution
of CRO-SLP (Iles et al., 2020), it has been possible to trace the AH center
from 1750 to 2004 CE, defined as the central location of the 5◦×5◦ box with
maximum averaged SLP, among those within the [20 - 55]◦N and [10 - 70]◦W
domain. The results indicate that the center of action has not experienced
long-term changes, being usually situated within [34 - 39]◦N and [26 - 39]◦W.
Despite the relatively stable locations of the summer AH over the 250 years,
we found some pronounced excursions. The largest one corresponds to an
extreme shift towards the north-east (43◦N and 18◦W) in summer of 1783
CE (Fig. 5.11).


5.4. Climate variability and the Azores High 95

a)

995

1000

1005

1010

1015

1020

Se
a 

Le
ve

l P
re

ss
ur

e 
(h

Pa
)

b)

-2.0

-1.5

-1.0

-0.5

0.0

0.5

1.0

1.5

2.0

Te
m

pe
ra

tu
re

 a
no

m
al

y 
(

C
)

Figure 5.11: Azores High shift in the extremely warm summer of 1783 CE.
(a) Summer mean SLP (shading, in hPa) for 1783 CE obtained from the
optimized reconstruction. (b) Summer mean temperature anomalies (in oC,
with respect to 1500-2002 CE) for 1783 CE. Black diamonds with error bars
show the climatological (1750-2002 CE) location (mean and two standard
deviations) of the Azores High center. Green crosses represent the center of
the Azores High for the summer of 1783 CE.

This year is remembered by the great dry fog in Europe (Stothers, 1996;
Thordarson and Self, 2003; Schmidt et al., 2012) after the Laki eruption
(Iceland) in June, and the ∼3oC cooling during the following winters. In
contrast, reconstructed temperatures (Luterbacher et al., 2004) for the sum-
mer of 1783 CE show an European-mean warming of∼3◦C, being particularly
pronounced in western Europe. Previous studies have already acknowledged
the difficulty of GCMs to reproduce such warming event as a fast response
to the volcanic forcing (Zambri et al., 2019), and have rather associated this
extreme summer to persistent atmospheric blocking conditions, more likely
caused by internal variability. Our results only partially support this hy-
pothesis. While blocking events often cause extremely warm conditions in
Europe (Barriopedro et al., 2011), they typically occur in northern latitudes
of the continent and are rarely accompanied by anomalies in the summer
AH such as those revealed by the CRO-SLP reconstruction. The latter are
more typically associated with meridional excursions of subtropical air masses
towards western Europe, which can cause simultaneous extreme conditions
over a large range of latitudes (e.g., Sousa et al. (2018, 2019); the 2003 mega-
heatwave, or the more recent 2019 European heatwave). Consistently, Fig.


96 Chapter 5. North Atlantic SLP reconstruction since 1750

5.11a shows how the AH pattern obtained from CRO-SLP was abnormally
elongated towards the north-east during that summer, resulting on higher-
than-normal SLP values over western Europe that are in good agreement
with the warming inferred from independent temperature reconstructions.
The meridional excursion of the summer AH is among the largest ones in our
250 year-long record, which could explain why extreme temperatures reached
unusual poleward latitudes, exceeding the record-breaking values of the 2019
warm air intrusion reported so far (Sousa et al., 2019).

a)

-3

-2

-1

0

1

2

3

Se
a 

Le
ve

l P
re

ss
ur

e 
an

om
al

y 
(h

Pa
)

b)

-1.00

-0.75

-0.50

-0.25

0.00

0.25

0.50

0.75

1.00

Te
m

pe
ra

tu
re

 a
no

m
al

y 
(

C
)

Figure 5.12: SLP and temperature difference between the top ten north-
easternmost minus top ten southwesternmost summer AH centers from 1750
to 2002 CE. (a) Summer mean SLP difference obtained from CRO-SLP. (b)
Summer mean temperature anomalies (in oC, with respect to 1500-2002 CE).
White dots show the location of the Azores High center. Red (blue) diamonds
represent the center of the Azores High for the top ten northeasternmost
(southwesternmost) summers.

Similar patterns are obtained when comparing mean SLP and tempera-
ture fields from summers with AH centers situated at the top ten northeast-
ernmost vs southwesternmost locations for the 1750-2002 CE period (Fig.
5.12), confirming the general increase in European temperatures (especially
in Northern Europe) associated with displacements of this high pressure sys-
tem. Future projections also indicate an intensification, poleward shift and
expansion of the summer AH, particularly towards the north-west and sec-
ondarily the north-east (He et al., 2017; Cherchi et al., 2018). Although


5.4. Climate variability and the Azores High 97

significant trends in the latter are not detected yet, Figs. 5.11 and 5.12 may
represent an example of European summers that are still to come.

Keynotes

These are the most important findings of Chapter 5:

• Evolutionary algorithms maximize the reconstruction skill of SLP fields
over the study region.

• General improvement in reconstructions are at the expense of sacri-
ficing skill at over-sampled locations which is compensated by larger
improvements over representative regions.

• The reconstruction of SLP over the North Atlantic allows to study the
NAO and its associated action centers.

• There is a positive winter AH intensity trend above 0.5 hPa per decade
during the second half of the 20th century.

• Differences in reconstructions of the NAO are partially explained by
disparities on the reconstruction of the AH.

• Displacements of the AH center towards the north-east caused extreme
warming events in Western Europe during the summer of 1783 CE.


Chapter 6

Clustering incomplete
climatological time series

6.1 Background

Marked variations in regional climate patterns arise as a response to persis-
tent changes of the climate system. Identifying these patterns is therefore
fundamental for a better understanding of past climate changes at local and
regional scales. Thus, with increasing computational power, the number
of classification methodologies providing robust characterizations of regional
climates has quickly escalated in the climate community, becoming a com-
mon tool for the study of past climatic patterns (Abatzoglou et al., 2009;
Perdinan and Winkler, 2015; Horton et al., 2015). This chapter is therefore
intended to validate the k-gaps algorithm, a novel technique to obtain robust
clusterings from sample-starved datasets using all the information contained
in their records. k-Gaps has already been described in Section 3.3 and the
next sections will show some of its potential uses in the study of past climate
events.

99


100 Chapter 6. Clustering incomplete climatological time series

6.2 Ideal case with complete information

To obtain the “ideal” case scenario where complete data are available for re-
gionalization, two k-means clusterings have been performed (Fig. 6.1) using
the EObs gridded dataset of daily temperatures described in Section 2.1.2
. Thus, absolute temperatures have been used to provide the clusterization
of areas with similar temperature means (k-gaps basic mode), whereas stan-
dardized temperatures have been employed for the clustering of records with
correlated variability (the normalization mode).

a)                                                                    b)

Figure 6.1: K-means clusterization of the E-Obs grid of daily summer temper-
atures since 1950 to 2016 (k-means[EObs]) using absolute (a) and standardized
(b) temperatures.

Note that, while the former (Fig. 6.1a) associates Northern regions with
high altitude locations such as the Alps (Rubel et al., 2017) (as expected
due to their similar lower mean temperatures), the latter (Fig. 6.1b) yields
coherent patterns in terms of regional climate variability. Interestingly, the
regionalization of Fig. 6.1a has certain similarities with the Köppen-Geiger
classification (Köppen, 1884), especially in the aforementioned association of


6.3. Experiment setting 101

high altitude temperatures with northernmost climates. However, it is not
possible to directly compare against these Köppen-Geiger climate classifica-
tions because their classification system also takes into account precipitation
patterns that are beyond the scope of this study. The regions obtained using
k-means with the entire grid of European temperatures (k-means[EObs], here-
after) will be assigned as target to test the skill of k-gaps and other clustering
techniques by comparison. The closer the regionalization to k-mean[EObs] re-
gions, the higher the skill of the method.

6.3 Experiment setting

To assess the robustness of different clustering methods, synthetic datasets
are generated from the complete E-Obs temperature grid using three main
parameters: the number of synthetic records (i.e. datasets have different
number of time series), their spatial distribution (i.e. temperatures are ex-
tracted from different locations for each dataset), and their time length (i.e.
missing information is different for each dataset). These parameters can be
interrelated since sampling networks and measuring campaigns have usually
been conducted by organizations at regional scales. So most sampling collec-
tions are regionally related in time, leading to changes in the spatial coverage
of the zone at different time periods. Synthetic data are therefore generated
using Gaussian models (Fig. 6.2) that imitate the distribution and time in-
tervals of real measuring campaigns. These models are spatially conformed
by marked centers where there is a higher concentration of records, and a
sparser coverage of their surroundings. On the other hand, different tempo-
ral lengths for synthetic time series are obtained by randomly altering the
start and end times (Inset of Fig. 6.2) of complete temporal records from
the E-Obs temperature dataset.


102 Chapter 6. Clustering incomplete climatological time series

t
tini

t
end

Spatial

 dispersion 

of the time 

series

Temporal distribution

Figure 6.2: General representation of the spatial distribution of a Gaussian
model (blue shade) employed to generate sample-starved datasets (with in-
complete temperature series) where “×" demarcates the center. Darker blues
indicate a higher concentration of time series, whereas lighter blues depict
fewer time series. Inset: Time lengths of synthetic series generated with
random variations of predefined initial (tini) and final (tend) days.

Hence, 500 synthetic datasets of temperature records were generated to
assess the performance of k-gaps, combining 20 Gaussian models with dif-
ferent centers, distributions, and time periods. Each one of them presents
less than 1100 time series, containing from 55% to 90% of missing values
within their chronologies (Fig. 6.3). These reduced number of records and
data missing level are similar to real historical (Küttel et al., 2010) and pa-
leoclimate (Emile-Geay et al., 2017) databases, and will serve as a test bed
to validate the method for future studies.


6.3. Experiment setting 103

0 200 400 600 800 1000 0 20 40 60 80 100

Number of time series

N
u
m

b
e
r 

o
f 

sy
n
th

e
ti

c 
d
a
ta

se
ts

N
u
m

b
e
r

o
f

sy
n
th

e
ti

c
d
a
ta

se
ts

Percentage of missing data

0

20

40

60

80

100

120

140

0

20

40

60

80

100

120

140

Figure 6.3: Distributions of the number of time series per dataset (left) and
level of missing values related to the temporal length of the records (right)
associated with 500 synthetic datasets.

For example, Fig. 6.4 illustrates one of these case studies. It is composed
of 815 time series with a higher density of temperature records over Cen-
tral Europe, while large territories of Southern Europe such as Italy, Spain
and Portugal remain almost uncovered. Moreover, the number of available
records is not constant and decreases back in time since 1980, restricting the
information of past temperature changes. These features are different for
each synthetic dataset and, therefore, provide a good framework to assess
the robustness of clustering algorithms.

1950 1960 1970 1980 1990 2000 2010 2020

0

50

100

150

200

250

a) b) 
Years

N
um

be
r 

of
  t

im
e 

se
ri

es

Figure 6.4: Temporal (a) and spatial (b) distribution of a sample study
composed of 20 Gaussian models centered at different points in Europe.


104 Chapter 6. Clustering incomplete climatological time series

6.4 Performance assessment

To assess the k-gaps performance, a statistical analysis has been carried out
using the adjusted Rand Index (de Vargas and Bedregal, 2013), a metric
that is used in data analysis to assess the similarity between clusterings.
The index is constructed to detect whether two clusters obtained from dif-
ferent methods have the same associated time series, indicating how much
those cluster look alike. In this sense, it can be seen as an accuracy measure,
as well as a comparison test for clustering methods. In our case, the Rand
Index was employed to compare the k-gaps results with the ideally perfect
k-means[EObs] clustering. The index ranges from 0 to 1, where a value of 0
means that all data points correspond to completely different clusters and
value of 1 would indicate that both clusterings are the same.

The performance of k-gaps has been compared with two other cluster-
ing techniques: the k-POD algorithm (Chi et al., 2016), and the k-means
algorithm (from now onwards k-means[rs], where rs stands for “reduced set”,
indicating that it is only applied over a few time series, in contraposition with
k-means[EObs] which uses the entire grid of temperatures). Note that, whereas
k-gaps and k-POD clusterize incomplete time series (i.e. series with differ-
ent temporal lengths that lead to gaps of missing information for some time
intervals), k-means[rs] clusterized the same 500 datasets but with complete
records (i.e. series covering the entire time period since 1950 CE). Thus, the
comparison of k-gaps and k-POD with k-means[rs] provides information about
the skill of these techniques to reproduce robust spatial patterns with frag-
mentary temporal data. Table 6.1 shows the adjusted Rand-Index means and
standard deviations associated with these three methods for both modes. k-
Gaps exhibited good results for most datasets with index values close enough
to k-means[rs]’s indices, outperforming the skill of k-POD to cluster uneven
time series.


6.4. Performance assessment 105

Table 6.1: Adjusted Rand-Index means of 500 synthetic case studies within
95% confidence interval for three clustering techniques.

Mode k-POD k-gaps k-means[rs]

Basic 0.12±0.09 0.47±0.17 0.56±0.18
Normalized 0.13±0.11 0.54±0.24 0.61±0.19

k-Gaps clusterings obtained higher Rand-indices than k-POD counter-
parts for all synthetic datasets. Moreover, all clustering techniques yielded
higher indices once the time series were normalized, indicating that these
methods achieve better skill when records are clusterized in terms of their
climate variability. Note that the capability to generate robust clusterings
with these methodologies depends on two main factors: the temporal lengths
of the time series, and their respective locations.

N
u
m

b
e
r 

o
f 

ca
se

s

N
u
m

b
e
r 

o
f 

ca
se

s

Figure 6.5: Comparison of clustering techniques using adjusted Rand-Index
for absolute (a) and normalized (b) temperatures. The adjusted Rand-Index
is calculated with respect to the clustering obtained by applying k-means to
the complete E-Obs gridded dataset.


106 Chapter 6. Clustering incomplete climatological time series

To see which factor is predominant, let us examine Fig. 6.5, where k-
means[rs] index distributions (blue bars) illustrate the impact of sparse sam-
pling locations on the skill to reproduce perfect clusterings (remember that
k-means[rs] uses time series with the same length whereas k-gaps does not).
In this regard, performance differences between k-gaps (red bars) and k-
means[rs] indices can be explained due to the loss of temporal information
(time series with gaps). As a matter of fact, k-gaps and k-means[rs] perfor-
mances are quite similar (i.e. their Rand Index distributions obtained from
500 synthetic datasets overlap), indicating that clusterings are more sensi-
tive to the distribution of sampling locations rather than to differences in the
temporal length of records. The skill of the method has also been assessed
as a function of the number of records used in the clustering process. Table
6.2 shows Rand indices for 3 different-sized datasets without any clear cor-
respondence between size and performance (for instance, P404 is the biggest
dataset with 875 time series, and its clustering has the worst performance
in the normalized mode), suggesting that the efficiency of k-gaps does not
strongly rely on the number of time series employed.

Table 6.2: Adjusted Rand-Index for k-gaps clusters with 3 synthetic datasets
selected from the pool of 500 datasets described in Subsection 6.3. Note that
each dataset is composed of a different number of time series (N), and each
time series has lost, on average, 80% of the climate information for the period
1950-2016 CE.

Dataset N Missing data (%) Basic Normalized

P191 815 87.9 0.52 0.69
P280 623 87.9 0.38 0.69
P404 875 83.6 0.51 0.38

Note that in these examples, the average amount of missing data per
series is above 80%, and the loss of information (i.e. the distribution of
missing data) is different for each dataset as shown in Fig. 6.6.


6.4. Performance assessment 107

P191

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500

200

400

600

800

P280

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500

100

200

300

400

500

600

P404

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500

200

400

600

800

Time (days)

Se
ri

es

Figure 6.6: Distribution of missing data for each one of the series included
in datasets P191, P280, and P404. Days without temperature values are
depicted in blue, and available daily temperatures are shown in yellow.

This implies that the algorithm is able to cluster time series that are al-
most empty, as long as the entire period of study is covered by the sum of
the different series included in the dataset. It is also shown that different
performances can be obtained for the same dataset depending on the cluster-
ing mode (e.g. dataset P280 in Table 6.2 shows a lower Rand index in Basic
mode than in Normalized mode), which indicates that a robust clustering
with absolute temperature series requires different locations than clusterings
with normalized series. This is consistent with the fact that regions with
similar mean temperatures (i.e. those selected in Basic mode) are not the
same as regions with similar variability (i.e. obtained in Normalized mode),
as shown in Fig. 6.1.


108 Chapter 6. Clustering incomplete climatological time series

To visualize the spatial patterns obtained in a regionalization with the k-
gaps algorithm, we have clusterized dataset 191 with an intermediate Rand
Index value (its clustering performance in terms of Rand Index is repre-
sentative of what can be expected from the clustering of the 500 synthetic
datasets). Fig. 6.7 displays the resulting clusterings of these series of tem-
perature together with their locations. Note that to facilitate the comparison
with Fig. 6.1, once P191 has been clusterized, we have interpolated the clus-
tering in the remaining grid points where temperatures are not available by
using the k-nearest neighbours algorithm (Cover and Hart, 1967). The ad-
justed Rand indices for basic and normalization modes are 0.52 and 0.69,
respectively. These values indicate that kgaps has similar performance than
k-means (which needs complete temporal information) using sample-starved
climate datasets with different temporal lengths.

a) b)

Figure 6.7: K-gaps clusterization of a study case with 815 time series (black
dots), for Basic (a) and Normalization (b) modes.


6.4. Performance assessment 109

On the other hand, lower Rand indices are related to irregular spatial dis-
tributions, where small regions are well characterized by a disproportionated
number of climate records, while large extensions of land remain unsampled.
Such is the case of dataset P404, chosen as a clustering with low Rand Index
(0.38 for normalization mode as seen in Table 6.2), and whose regionalization
using the normalization mode is depicted in Fig. 6.8. In this case, the dif-
ference between the number of time series in Central Europe and Southern
Europe has altered the formation of clusters by associating regions in the
Iberian Peninsula with the big aggregation of temperature records in France
as seen in Fig. 6.8 (black dots depict the spatial distribution of dataset
P404). At the same time, the concentration of points in Ireland produces a
new cluster for the British Isles which is not present in k-means[EObs]. This
indicates that special attention should be paid to disparities in the coverage
of the territory because they play an important role in the final structure
of the clustering, and an uneven distribution of time series can debase the
analysis of regional climates.

Figure 6.8: K-gaps clusterization for P404 in Table 6.2 using the normaliza-
tion mode. The dataset is composed of 875 time series (black dots) unevenly
distributed over Europe.


110 Chapter 6. Clustering incomplete climatological time series

However, while non-homogeneous distributions force the splitting and
merging of some clusters, some of the spatial patterns from Fig. 6.1 can be
identified, proving that realistic information about temperature variability
can still be extracted from low quality datasets such as P404.

6.5 Applications on climate studies

The full potential of k-gaps is evidenced in Fig. 6.7 where the resulting
spatial patterns for both modes exhibit important similarities with those ob-
tained for k-means[EObs] (Fig. 6.1). This is quite remarkable, because k-gaps
is applied over incomplete datasets (i.e. time series with gaps unevenly dis-
tributed over Europe) whose size in terms of number of time series is 21
times smaller than the homogeneous grid of temperatures which has com-
plete temperature information for the entire time period (no gaps) in all grid
points (complete spatial coverage). It is also noteworthy to mention that al-
though synthetic records are sparsely distributed across Europe, clusters are
defined for almost the same regions as for the gridded dataset, confirming
the robustness of the spatial clustering. For instance, spatial patterns over
Iberia and the United Kingdom are well reconstructed with the method even
when there is a lack of sampling records (black dots in Fig. 6.7 represent
the locations with temperature series) in the southern part of their territory.
This suggests that only a few number of locations are necessary to reproduce
most of the climatology within these areas.

Furthermore, k-gaps cluster analyses provide a useful framework for the
study of past climate trends and the detection of extreme events at regional
scales. In the first case, centroids obtained from k-gaps are time series ranging
the entire study period whose variations represent changes in the tempera-
ture of climatically different regions. Therefore, the linear regression of these
centroids can be utilized to estimate regional temperature trends for periods
beyond the scope of complete datasets that have been truncated using homo-
geneization procedures. For instance, the distribution of temperature trends


6.5. Applications on climate studies 111

obtained from clusterings of 500 synthetic datasets is shown Fig. 6.9. Trend
differences between k-means[EObs] and k-gaps are below 0.03 ◦C/year for 75%
of our synthetic datasets, indicating that although there could be small biases
induced by the irregular distribution of uneven (and scarce) time series, the
temperature trends obtained from k-gaps centroids are similar to those ob-
tained from the clustering of complete gridded datasets (i.e. k-gaps centroids
obtained with fragmentary information have similar trends as centroids gen-
erated with k-means[EObs]). Therefore, these k-gaps centroids can be used to
study changes in past temperature trends from incomplete datasets.

−0.4 −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 0.4

Tendency error in ◦C/year

0

100

200

300

400

500

600

N
u
m
b
er

o
f
ca
se
s

Figure 6.9: Histogram of trend errors estimated from differences between k-
gaps centroids and ideal centroids retrieved from k-means[EObs]. Temperature
trends have been calculated for 500 synthetic datasets.


112 Chapter 6. Clustering incomplete climatological time series

On the other hand, as the normalization mode of k-gaps associates time
series with similar climate variability, it is possible to detect extreme events
in regions where those time series are located. Take for instance series of
temperature observations, applying the k-gaps algorithm in normalization
mode will allow us to discern regions (delimited by the k-gaps clusters) where
temperature anomalies were significantly higher (or lower) with respect to
the climatology. This methodology can therefore be used to study when and
where extreme temperature events such as heatwaves (or cold spells) took
place.

P
ro

ba
bi

lit
y 

of
 d

et
ec

tin
g 

a
n 

ex
tr

em
e

 d
ay

Daily mean of available records per cluster

Figure 6.10: Probability of detecting a heatwave within clusters obtained
using k-gaps (in normalization mode for 500 synthetic datasets) as a function
of the number of time series associated with each cluster. The colorbar
represents the number of clusters (regions).


6.5. Applications on climate studies 113

Following Demuzere et al. (2011), we identified a heatwave in a cluster
when at least 95% of the time series associated with that cluster reached
values above the 95th percentile of temperatures on record. We can there-
fore assess the probability of detecting heatwaves using k-gaps clusterings
by comparing extreme temperature events detected with the 500 synthetic
datasets (i.e. datasets with incomplete series and fragmentary information)
against extreme events ideally detected with k-means using the complete
E-Obs grid of temperatures (i.e. a dataset with complete spatio-temporal
information). Note that the probability of detecting extreme temperatures
is higher for clusters composed of an elevated number of time series, however
Fig. 6.10 shows that the probability of detecting a heatwave using k-gaps
clusterings is high enough even when the number of temperature series per
day is reduced down to a few tens. For example, clusters with only 20 time
series have, on average, between 60% and 80% chance of detecting temper-
ature extremes (remember that we obtain these values by comparison with
the extreme events detected using k-means[EObs] clusters). This probability
increases above 90% for clusters with at least 30 time series, indicating that
the detection of extremes events within regions with similar climate variabil-
ity defined by the k-gaps algorithm is possible with sets of just a few time
series per cluster.


114 Chapter 6. Clustering incomplete climatological time series

Keynotes

These are the most important findings of Chapter 6:

• k-Gaps is a new clustering technique for climatological regionalizations.

• It does not require complete timeseries.

• It is able to use the full length of records without the need of filling the
gaps.

• It reproduces similar patterns as clusterings obtained with complete
time-series.

• Regional trends can be estimated from centroids of k-gaps that span for
longer time periods than centroids calculated using classical clustering
methods.

• Extreme events can be regionally detected from the clustering of nor-
malized time-series.


Chapter 7

Conclusions and outlook

The use of AI is growing in climate sciences, but applications have mainly
focused on big data, in order to synthesize and extract information from mas-
sive datasets where data availability is not an issue. Here we have shown that
AI procedures such as evolutionary algorithms and clustering techniques can
also be useful in equally important Earth science areas, where the opposite
problem of data scarcity is a major limiting factor (e.g. paleoclimatology).
Perhaps paradoxically, limited data availability also requires solving high
dimensional and non-linear problems, as there is a common optimization
goal of extracting the maximum useful information. By focusing on specific
climate problems (spatially resolved climate field reconstruction or climate
regionalization), we show the potential of these new methodologies to op-
timize the information of incomplete datasets, for a diversity of observing
networks in terms of observables (from perfect temperature pseudo-proxies
to instrumental observations of dynamical fields), temporal resolution (from
annual to daily) and targets (from global to regional). Interestingly, for the
cases analyzed here, we came to the same conclusion that more data is not
always better. Of course, this might not apply to other networks, arguably
those with very few observables or large errors in other sources of uncertainty.
However, our results suggest that even in those cases, some level of improve-
ment can be achieved through optimized rearrangements. If the skill gained
from the optimization is or not substantial will depend on the characteristics

115


116 Chapter 7. Conclusions and outlook

of the network and the pursued objective, which calls for individual assess-
ments that eventually may require tailored developments of AI procedures
to improve their efficiency, as illustrated through the chapters of this thesis.

In Chapter 4, we coupled different reconstruction methods to an evolu-
tionary algorithm in order to reconstruct global temperature fields simulated
by model ensembles of the last millennium from a pseudo-proxy network with
the same number and distribution of records as in the real world. In doing so,
we have quantified a measurable spatial bias in global temperature
reconstructions due to the non-uniform distribution of currently
available paleoclimate records (Emile-Geay et al., 2017). In our ideal-
ized experiments, the field can be reconstructed from an optimized
selection of pseudo-proxies from the full network without sacrific-
ing the skill of the reconstruction. For all experiments performed, the
skill of the full-proxy reconstruction can be improved by using subsets of
pseudo-proxies strategically situated over representative locations (Jaume-
Santero et al., 2020).

The set of optimal locations highlights the importance of po-
lar regions and major teleconnections areas to reconstruct annual
global temperature patterns. The optimized network also captures the
global responses to major external forcings and modes of internal variabil-
ity. However, while annual temperature fields are well described by
high-latitude records, low frequency fluctuations such as the MCA-
LIA transition are better represented by pseudo-proxies situated
at lower latitudes. As the optimal distribution of records varies with the
spectral frequency of the target field, this advises against pooling together
proxies that resolve different time scales and encourages further research
for weighting records depending on their location and response resolution.
Our results are robust to the reconstruction methodology and the ensemble
member or model employed, and they hold for more realistic pseudo-proxy
experiments and datasets. Still, important assumptions have been made
and all uncertainty sources have not been considered, which calls for caution


117

when extrapolating these results to real reconstructions. This was a major
motivation for Chapter 5, where complexity was added to the experiments
by using a network of historical instrumental observations with changes in
data availability.

Further experiments are encouraged in order to account for additional
sources of uncertainty not included in Chapters 4 or 5, such as the non-
univariate dependency of proxy records. Considering all sources of uncer-
tainty would require dedicated but in principle feasible implementations in
the CRO algorithm, potentially bringing changes in the structure of the op-
timized networks reported in Chapter 4. For example, more skillful recon-
structions could be attempted through optimization functions tackling with
the climate signal and response resolution embedded in different paleocli-
mate archives. Our approach also paves the way for determining valuable
regions to carry out future measuring campaigns of proxy records, which can
be elucidated by mapping areas with natural paleoclimate archives that are
under-represented in current proxy networks. Note that, the CRO algorithm
can be very useful to minimize the costs of expensive measuring campaigns
because of its suitability to find the minimum number of representative loca-
tions to set new meteorological stations and/or to sample proxy records. The
algorithm can also be modified to know where it should not be measured,
avoiding regions that might present the right observational conditions but
are not representative for the aimed task.

In Chapter 5 we made the leap to the real world by using station-based ob-
servations of monthly SLP over the North Atlantic from 1750 to 2004 CE. We
found that evolutionary algorithms can also outperform non-optimized recon-
structions of dynamical fields on regional and monthly scales (Luterbacher
et al., 2002; Küttel et al., 2010). Instead of selecting an optimal subset, in
this case we tunned the CRO to derive optimal sets of weights that maximize
the reconstruction skill of the observing network. The optimization process
exploits the information of the full dataset, taking into account changes in
data availability, as well as inconsistencies within the network and with the


118 Chapter 7. Conclusions and outlook

large-scale field caused by other uncertainties. The relationships learnt by the
CRO algorithm are transformed in local weights during the reconstruction of
the large-scale SLP field. The optimized reconstruction improves the
performance of reconstructions generated without optimal weight-
ing over almost the entire region (especially around the North Atlantic
Ocean). Additional reconstruction experiments using reanalysis and model
data with the same constraints in data availability show that the spatial
distribution of weights and the pattern of improvement are robust to the
reference dataset and internal variations of the large-scale field targeted by
the reconstruction. This is in agreement with the pseudo-reconstructions of
Chapter 4, that also demonstrated that the CRO brings similar patterns of
improvement for different reconstruction techniques.

According to our results, changes in spatio-temporal data availability sub-
stantially affect the representativeness of local observations in the network,
therefore arising as an important source of uncertainty in our SLP recon-
structions. This result justified a separate optimization of the observing net-
work for the earlier reconstruction period (1750-1835 CE), which displayed
marked changes in the number and coverage of observations as compared
to the remaining period. Despite major constraints in data availability, the
optimized weights for the earlier period bring a pattern of improvement that
resembles the one achieved by the denser network of the latter period. The
generalized improvement with respect to the non-optimized recon-
struction is reached at the expense of sacrificing some skill in the
better-sampled region of Europe. However, this is overcompen-
sated by comparatively larger improvements over the North At-
lantic Ocean. The loss of skill over continental areas can be admissible
for our reconstruction as well as other reconstructions of dynamical fields
concerned with internal and forced aspects of the large-scale atmospheric
circulation, which typically involve changes in major oceanic action centers.
Additional improvements could be attempted through optimal subsets tai-
lored to the temporal availability of the observing network (i.e. by deriving
weights that have been optimized for each configuration of the observing net-


119

work during the reconstruction period).

The CRO-SLP reconstruction provides high-resolution monthly
SLP fields over the North Atlantic for 1750-2004 CE, from which we
derived the longest records of seasonal indices for the main modes
of climate variability of the North Atlantic, such as the NAO or the
EA pattern, and its main action centers, the AH and the IL (albeit
with large uncertainties before the 1830s). In particular, we have focused
on the AH, because it covers the region with the largest improvement in
the CRO-SLP reconstruction, and different to the NAO, there have been
few attempts to reconstruct the AH before the 20th century, likely due to
the scarcity of observations over the southern half of the North Atlantic
Ocean. Despite the lack of long-term trends in the summer AH, it can
experience substantial interannual changes in location/extension that match
with anomalous European conditions inferred from independent temperature
reconstructions and might serve as historical analogues of upcoming summers
under the projected northern shift and expansion of the summer AH (Cherchi
et al., 2018).

Our reconstructions reveal an intensification of the winter AH
during the second half of the 20th century that had no precedents
in at least the last 250 years. The strengthening of the winter AH is
now declining and is timely with the well-reported positive NAO trend of
1960s-1990s. While different causes have been proposed for this NAO trend,
including anthropogenic climate change, several studies show that it is not
exceptional as compared to pre-industrial periods before 1650 CE or sta-
tistically distinguishable from atmosphere-ocean internal variability (Pinto
and Raible, 2012) and references therein). This may also be the case of the
trend in the winter AH, whose changes in intensity have been associated with
North Atlantic sea surface temperature anomalies (Falarz, 2019). Our SLP
reconstruction provides a longer benchmark of pre-industrial conditions to
characterize the low-frequency variability of the AH and explore the causes
of this recent anomalous behaviour.


120 Chapter 7. Conclusions and outlook

We also found that the spread among NAO indices is larger for
winters when the NAO was dominated by the AH. Although part
of these differences may arise from the different definitions of the southern
node of the NAO, they seem to be present across indices that follow the same
methodology. As such, according to our results, current discrepancies in
instrumental NAO indices would stem more from uncertainties in
the AH than in the IL. This points to limitations of current datasets to
capture accurately the historical evolution of the AH, and stresses the need
for improved SLP reconstructions over this region. Here is where the CRO
algorithm can help to find the right locations to set land-based stations of
pressure observations.

On the other hand, in Chapter 6 we have presented a novel cluster-
ing technique for incomplete datasets known as k-gaps (Carro-Calvo
et al., 2020). This algorithm is an iterative technique that allows for the
clustering of heterogeneous datasets using most of the information contained
in incomplete time series (i.e. with at least 55% of the information unavail-
able). The method is fully based on the structure of the well-known k-means
algorithm, but including different procedures in order to adapt the algorithm
to series with gaps and/or different temporal lengths. Thus, the k-gaps al-
gorithm allows to cluster climate fields whose sampling records are
only available for certain time periods (e.g. as it is typically the case
of historical datasets of basic meteorological variables such as temperature,
precipitation, wind or SLP). The results show that this classification algo-
rithm performs well with datasets containing high levels of temporal missing
values. In the case of daily European temperatures, k-gaps exhibited a good
performance with most of the 500 synthetic datasets (with an average of
around 80% of missing data per record) employed for its validation, yield-
ing mean and variability patterns similar to those obtained when traditional
methods (including k-means) are applied to the originally unbiased records
(i.e. complete series). Moreover, consistent climatic clusterings have been
achieved for a reduced availability of climate records even when the number
of time series were 21 times smaller than the original grid of temperatures.


121

Therefore, the k-gaps algorithm is well-suited for the analysis of regional cli-
mates from datasets with an uneven distribution of records in both space
and time (e.g. station-based observations with missing values). On the other
hand, additional experiments with synthetic datasets showed no clear corre-
spondence between the number of time series employed in the analysis and
the skill of the clustering, indicating that an adequate spatial distri-
bution of sampling records over the region of interest can be more
important than the size of the dataset. Therefore, and similarly to the
results obtained in Chapters 4 and 5 with the CRO, we came to the conclu-
sion that a dense network is not necessarily better than a subset of
well distributed records.

Furthermore, it has been shown that k-gaps is well suited for the
reconstruction of regional climate trends, and the detection of ex-
treme events at regional scales, as obtained from time series contained
in the clusters. If the clustering is performed in basic mode, the resulting
centroids can provide an estimation of the temperature trends of each clus-
ter, whereas if it is done in normalization mode, time series with correlated
variability are grouped together, allowing for the detection of extreme events
at regional scale. Future applications of the k-gaps algorithm could be ap-
plied to other climatically differentiated regions apart from Europe. Other
experiments can be carried out to test the robustness of the method with
several climate variables such as precipitation and SLP, and other temporal
resolutions (e.g. monthly seasonal, annual, etc). Moreover, although further
research should be carried out to test its performance under records with sig-
nificant amounts of noise, the k-gaps algorithm also paves the way for future
applications on regional analyses of past climate changes based on histor-
ical and paleoclimate archives. Finally, it is also possible to combine the
methodologies developed in Chapters 4 and 5 (CRO algorithm) with k-gaps.
For example, there are cases (e.g. climate extremes) for which increasing
trends may artificially result from increasing data availability. The method
would help to detect more robust trends in regional extremes, which would
be useful for detection and attribution exercises. A shortcoming that could


122 Chapter 7. Conclusions and outlook

be solved with evolutionary algorithms is that k-gaps does not deal with e.g.
temporal changes in data quality, which may also cause spurious trends. Fur-
ther improvements could be achieved by a CRO-based optimized selection of
time series that maximize the relationship with the regional field before the
application of the k-gaps.


Appendix A

Supplementary Figures

-80 -60 -40 -20 0 20 40 60 80
0.00

0.25

0.50

0.75

1.00

Pr
ox

y 
w

ei
gh

t

(a)

Unweighted Area-weighted

-80 -60 -40 -20 0 20 40 60 80
Latitude (DD)

0
1
2
3
4
5
6

N
um

be
r o

f l
oc

at
io

ns

(b)

Figure A.1: Sensitivity of the optimization process to area-weighting. (a)
Weights assigned to perfect pseudo-proxies as function of their latitude in two
experiments of the CRO-AM. (b) Latitudinal distribution of optimized sub-
sets of 17 perfect pseudo-proxies from the PAGES-2k network obtained with
area-weighted (orange shading) and unweighted (purple line, CRO-MIN) ver-
sions of CRO-AM.

123


124 Appendix A. Supplementary Figures

-80 -60 -40 -20 0 20 40 60 80
Latitude (DD)

0

10

20

30

40

50
N

um
be

r o
f p

ro
xy

 lo
ca

tio
ns

 (%
)

CCA (1900-2005)
CCA (1850-2005)

Figure A.2: Latitudinal distributions of 17 representative locations obtained
with the CRO-CCA using years from 1850 to 2005 (orange shade) and from
1900 to 2005 as calibration periods.


125

1760 1780 1800 1820 1840
Years (CE)

-3

-2

-1

0

1

2

3

zs
co

re

|AH|-|IL| NAO+ NAO-

Figure A.3: CRO-SLP NAO(red and blue shade) and |AH|-|IL| (black line)
indices from 1751 to 1850 CE. Both series were standardized with respect to
the 1751-1850 CE baseline.


Appendix B

Supplementary Tables

Table B.1: List of accepted observations from SLP-Obs database.

Name Latitude(DD) Longitude (DD) Start (CE) End (CE)
Boston 42.36 -71.03 1840 2004
Toronto 43.70 -79.42 1841 1980
Bermuda 32.32 -64.77 1836 2004
Nashville 36.16 -86.78 1854 1980
Cincinatti 39.90 -84.27 1850 2004
Chicago 41.87 -87.62 1853 1980
St Louis 38.63 -90.20 1854 1980
Havana 23.12 -82.38 1858 2002
Jacksonville 30.33 -81.65 1871 1980
Key West 24.55 -81.75 1850 2004
Charleston 32.78 -79.93 1850 2004
New Orleans 29.59 -90.15 1850 2004
New York 40.73 -73.93 1850 1980
Galveston 29.30 -94.79 1873 1980
Father Point 48.52 -68.47 1874 1980
Montreal 45.51 -73.59 1874 1980
Minneapolis 44.99 -93.26 1865 1980

Continued on next page

127


128 Appendix B. Supplementary Tables

Table B.1 – Continued from previous page

Name Latitude(DD) Longitude (DD) Start (CE) End (CE)
Jacobshavn 69.22 -51.10 1866 1980
Ivigtut 61.21 -48.17 1866 1980
Aberdeen 57.20 -2.22 1855 2004
Akureyri 65.68 -18.08 1873 2004
Archangelsk 64.60 40.50 1850 2004
Armagh 54.40 -6.70 1849 2002
Astrakhan 46.60 48.00 1850 2004
Athens 38.00 23.70 1857 2004
Barcelona 41.20 2.10 1779 2004
Bergen 60.40 5.30 1815 2002
Berlin 52.40 13.10 1850 2004
Bidston 53.40 -3.00 1845 2002
Boothville 29.19 -89.23 1873 2002
Budapest 47.50 19.00 1809 2004
Cadiz 36.50 -6.30 1786 2004
Cairo 30.10 31.40 1856 2003
Copenhague 55.70 12.60 1841 2004
De Bilt 52.10 5.20 1848 2004
Des Moines 41.60 -93.60 1878 2002
Dublin 53.40 -6.30 1830 2004
Edinburgh 56.00 -3.40 1769 2004
Florence 43.80 11.30 1813 2004
Funchal 32.60 -16.90 1849 2004
Gdansk 54.30 18.90 1806 2004
Genoa 44.24 8.55 1832 2004
Gibraltar 36.20 -5.40 1821 2004
Nuuk 64.17 -51.75 1873 2004
Goteborg 57.70 11.99 1859 2004
Haparanda 65.80 24.20 1860 2004
Harnosand 62.38 17.56 1859 2004

Continued on next page


129

Table B.1 – Continued from previous page

Name Latitude(DD) Longitude (DD) Start (CE) End (CE)
Helsinki 60.30 25.00 1850 2004
Istanbul 41.00 29.10 1855 2004
Kiev 50.40 30.50 1849 2004
La Coruna 43.21 -8.24 1866 2004
Lisbon 38.70 -9.20 1849 2004
Lockbourne 39.81 -82.97 1878 2002
London 51.20 -1.00 1773 2004
Lund 55.40 13.10 1779 2004
Luxembourg 49.36 6.70 1837 2004
Lvov 49.80 24.00 1850 2004
Madrid 40.40 -3.70 1785 2004
Malta 35.90 14.50 1851 2004
Marseille 43.50 5.20 1850 2004
Milan 45.50 9.10 1764 2004
Moscow 55.80 37.60 1837 2004
Nantes 47.20 -1.60 1850 2004
Nicosia 35.10 33.20 1866 2003
Nordby 55.52 8.57 1873 2002
Odessa 46.50 30.60 1841 2004
Oporto 41.10 -8.60 1862 2004
Oslo 60.00 10.70 1815 2004
Padua 45.40 11.80 1749 2004
Palma 39.60 2.70 1849 2004
Paris 49.00 2.50 1764 2004
Ponta Delgada 37.80 -25.70 1864 2002
Prag 50.10 14.30 1788 2004
Reykjavik 64.10 -21.80 1820 2004
Rome 41.80 12.20 1849 2003
Sibiu 45.80 24.20 1850 2004
Split 43.30 16.26 1849 2003

Continued on next page


130 Appendix B. Supplementary Tables

Table B.1 – Continued from previous page

Name Latitude(DD) Longitude (DD) Start (CE) End (CE)
Stockholm 59.40 18.10 1849 2004
St Petersburg 60.00 30.30 1821 2004
Stykkisholmur 65.08 -22.73 1849 2003
Sulina 45.15 29.67 1849 2004
Tbilisi 41.43 44.47 1843 2002
Torshavn 62.02 -6.77 1866 2004
Triestre 45.70 13.80 1840 2004
Tromso 69.40 18.56 1873 2004
Trondheim 63.50 10.90 1767 2004
Upernavik 72.47 -56.10 1873 2004
Uppsala 59.90 17.60 1749 2004
Valentia 51.93 -10.25 1865 2004
Vardo 70.40 31.10 1860 2004
Visby 57.38 18.17 1859 2002
Warsaw 52.20 21.00 1835 2004
Wroctaw 51.10 16.88 1849 2004
Zagreb 45.80 16.00 1861 2003
Azores 38.31 -28.38 1850 2004
Beirut 33.54 35.30 1850 2003
Tunis 36.83 10.22 1875 2004
Sable Isla 43.93 -60.02 1850 2004
Moosonee 51.27 -80.65 1850 2004
Merida 20.59 -89.39 1850 2004
Nassau 25.05 -77.47 1850 2004


131

Table B.2: Pearson correlations of the AH (rAH) and IL (rIL) with the NAO
spread calculated as the standard deviation of NAO indices in Table 5.1 (a
PC-based NAO Index from 20CRv3 is also included). Different NAO spreads
have been calculated from 1900 to 2004 CE by excluding the series in the
dropped column.

Dropped rAH rIL
None 0.36 -0.07
Küttel et al. (2010) 0.36 -0.03
Luterbacher et al. (2001) 0.30 -0.09
Hurrell and Deser (2010) 0.36 -0.07
20CRv3 0.21 -0.16
Jones et al. (1997) 0.42 -0.03


Bibliography

Abatzoglou, J. T., Redmond, K. T. and Edwards, L. M. Classification
of Regional Climate Variability in the State of California. J. Appl. Meteo-
rol. Climatol., vol. 48(8), pages 1527–1541, doi:10.1175/2009JAMC2062.1,
2009.

Abdel-Basset, M., Abdel-Fatah, L. and Sangaiah, A. K. Chapter
10 - Metaheuristic Algorithms: A Comprehensive Review , pages 185–231.
Intelligent Data-Centric Systems. Academic Press, 2018. ISBN 978-0-12-
813314-9.

Aburas, M. M., Ahamad, M. S. S. and Omar, N. Q. Spatio-temporal
simulation and prediction of land-use change using conventional and ma-
chine learning models: a review. Environ. Monit. Assess., vol. 191(4), page
205, ISSN 1573-2959, doi:10.1007/s10661-019-7330-6, 2019.

Agapiou, A. Remote sensing heritage in a petabyte-scale: satellite data
and heritage Earth Engine c© applications. Int. J. Digit. Earth, vol. 10(1),
pages 85–102, ISSN 1753-8947, doi:10.1080/17538947.2016.1250829, 2017.

Aliaga, V. S., Ferrelli, F. and Piccolo, M. C. Regionalization of
climate over the Argentine Pampas. Int. J. Climatol., vol. 37, pages 1237–
1247, ISSN 1097-0088, doi:10.1002/joc.5079, 2017.

Allan, R. and Ansell, T. A New Globally Complete Monthly His-
torical Gridded Mean Sea Level Pressure Dataset (HadSLP2): 1850-
2004. J. Clim., vol. 19(22), pages 5816–5842, ISSN 0894-8755,
doi:10.1175/JCLI3937.1, 2006.

133


134 Bibliography

Bador, M., Naveau, P., Gilleland, E., Castellà, M. and Ariv-

elo, T. Spatial clustering of summer temperature maxima from
the CNRM-CM5 climate model ensembles & E-OBS over Europe.
Weather Clim. Extremes , vol. 9, pages 17 – 24, ISSN 2212-0947,
doi:10.1016/j.wace.2015.05.003, 2015.

Ballings, M., Van den Poel, D. and M., B. Social media opti-
mization: Identifying an optimal strategy for increasing network size
on Facebook. Omega, vol. 59, pages 15 – 25, ISSN 0305-0483,
doi:10.1016/j.omega.2015.04.017, 2016.

Barnes, E. A., Hurrell, J. W., Ebert-Uphoff, I., Anderson, C.

and Anderson, D. Viewing Forced Climate Patterns Through an AI
Lens. Geophys. Res. Lett., vol. 46(22), pages 13389–13398, ISSN 0094-
8276, doi:10.1029/2019GL084944, 2019.

Barnston, A. G. and Livezey, R. E. Classification, Seasonality
and Persistence of Low-Frequency Atmospheric Circulation Patterns.
Mon. Weather Rev., vol. 115(6), pages 1083–1126, doi:10.1175/1520-
0493(1987)115<1083:CSAPOL>2.0.CO;2, 1987.

Barriopedro, D., Fischer, E. M., Luterbacher, J., Trigo, R. M.

and García-Herrera, R. The hot summer of 2010: Redrawing the
temperature record map of europe. Science, vol. 332(6026), page 220,
doi:10.1126/science.1201224, 2011.

Benestad, R. E., Erlandsen, H. B., Mezghani, A. and Parding,

K. M. Geographical Distribution of Thermometers Gives the Appearance
of Lower Historical Global Warming. Geophys. Res. Lett., vol. 46(13),
pages 7654–7662, ISSN 0094-8276, doi:10.1029/2019GL083474, 2019.

Bernard, E., Naveau, P., Vrac, M. and Mestre, O. Clustering of
Maxima: Spatial Dependencies among Heavy Rainfall in France. J. Clim.,
vol. 26(20), pages 7929–7937, doi:10.1175/JCLI-D-12-00836.1, 2013.


Bibliography 135

Bhend, J., Franke, J., Folini, D., Wild, M. and Brönnimann, S. An
ensemble-based approach to climate reconstructions. Clim. Past , vol. 8,
doi:10.5194/cp-8-963-2012, 2012.

Biesbroek, R., Badloe, S. and Athanasiadis, I. N. Machine Learn-
ing for research on climate change adaptation policy integration: an ex-
ploratory UK case study. Reg. Environ. Change, vol. 20(3), page 85, ISSN
1436-378X, doi:10.1007/s10113-020-01677-8, 2020.

Boers, N., Goswami, B., Rheinwalt, A., Bookhagen, B., Hoskins,

B. and Kurths, J. Complex networks reveal global pattern of extreme-
rainfall teleconnections. Nature, vol. 566(7744), pages 373–377, ISSN 1476-
4687, doi:10.1038/s41586-018-0872-x, 2019.

Bothe, O. and Zorita, E. Proxy surrogate reconstructions for Europe
and the estimation of their uncertainties. Clim. Past , vol. 16(1), pages
341–369, doi:10.5194/cp-16-341-2020, 2020.

Bounds, D. G. New optimization methods from physics and bi-
ology. Nature, vol. 329(6136), pages 215–219, ISSN 1476-4687,
doi:10.1038/329215a0, 1987.

Bradley, R. S. Are there optimum sites for global paleotemperature re-
construction? In (eds. Jones P.D., Bradley R.S., Jouzel J.) Climatic vari-
ations and forcing mechanisms of the last 2000 years. NATO ASI Series
(Series I: Global Environmental Change), pages 603–624. 1996.

Bradley, R. S. and Jones, P. D. ‘Little Ice Age’ summer temperature
variations: their nature and relevance to recent global warming trends.
Holocene, vol. 3, doi:10.1177/095968369300300409, 1993.

Brönnimann, S., Allan, R., Ashcroft, L., Baer, S., Barrien-

dos, M., Brázdil, R., Brugnara, Y., Brunet, M., Brunetti,

M., Chimani, B., Cornes, R., Domínguez-Castro, F., Filipiak,

J., Founda, D., Herrera, R. G., Gergis, J., Grab, S., Hannak,

L., Huhtamaa, H., Jacobsen, K. S., Jones, P., Jourdain, S.,


136 Bibliography

Kiss, A., Lin, K. E., Lorrey, A., Lundstad, E., Luterbacher,

J., Mauelshagen, F., Maugeri, M., Maughan, N., Moberg, A.,
Neukom, R., Nicholson, S., Noone, S., Nordli, Ø., Ólafsdót-

tir, K. B., Pearce, P. R., Pfister, L., Pribyl, K., Przybylak, R.,
Pudmenzky, C., Rasol, D., Reichenbach, D., Reznícková, L., Ro-

drigo, F. S., Rohr, C., Skrynyk, O., Slonosky, V., Thorne, P.,
Valente, M. A., Vaquero, J. M., Westcottt, N. E., Williamson,

F. and Wyszynski, P. Unlocking Pre-1850 Instrumental Meteorological
Records: A Global Inventory. BAMS , vol. 100(12), pages ES389–ES413,
ISSN 0003-0007, doi:10.1175/BAMS-D-19-0040.1, 2020.

Caldwell, P. M., Bretherton, C. S., Zelinka, M. D., Klein, S. A.,
Santer, B. D. and Sanderson, B. M. Statistical significance of cli-
mate sensitivity predictors obtained by data mining. Geophys. Res. Lett.,
vol. 41(5), pages 1803–1808, ISSN 0094-8276, doi:10.1002/2014GL059205,
2014.

Carro-Calvo, L., Jaume-Santero, F., García-Herrera, R. and
Salcedo-Sanz, S. k-Gaps: a novel technique for clustering incom-
plete climatological time series. Theor. Appl. Climatol., ISSN 1434-4483,
doi:10.1007/s00704-020-03396-w, 2020.

Cherchi, A., Ambrizzi, T., Behera, S., Freitas, A. C. V., Morioka,

Y. and Zhou, T. The Response of Subtropical Highs to Climate
Change. Curr. Clim., vol. 4(4), pages 371–382, ISSN 2198-6061,
doi:10.1007/s40641-018-0114-1, 2018.

Cheruvelil, K. S., Yuan, S., Webster, K. E., Tan, P.-N., Lapierre,

J.-F., Collins, S. M., Fergus, C. E., Scott, C. E., Henry, E. N.,
Soranno, P. A., Filstrup, C. T. and Wagner, T. Creating multi-
themed ecological regions for macroscale ecology: Testing a flexible, re-
peatable, and accessible clustering method. Ecol. Evol., vol. 7(9), pages
3046–3058, ISSN 2045-7758, doi:10.1002/ece3.2884, 2017.


Bibliography 137

Chi, J. T., Chi, E. C. and Baraniuk, R. G. k-POD: A Method for k-
Means Clustering of Missing Data. Am. Stat., vol. 70(1), pages 91–99,
doi:10.1080/00031305.2015.1086685, 2016.

Christiansen, B. and Ljungqvist, F. C. Challenges and perspectives
for large-scale temperature reconstructions of the past two millennia. Rev.
Geophys., vol. 55, doi:10.1002/2016rg000521, 2017.

Collins, M., Knutti, R., Arblaster, J., Dufresne, J.-L., Fichefet,

T., Friedlingstein, P., Gao, X., Gutowski, W. J., Johns, T.,
Krinner, G., Shongwe, M., Tebaldi, C., Weaver, A. J. and
Wehner, M. Long-term climate change: Projections, commitments and
irreversibility. In Climate Change 2013: The Physical Science Basis. Con-
tribution of Working Group I to the Fifth Assessment Report of the In-
tergovernmental Panel on Climate Change (ed. by T. F. Stocker, D. Qin,
G.-K. Plattner, M. Tignor, S. K. Allen, J. Doschung, A. Nauels, Y. Xia,
V. Bex and P. M. Midgley), pages 1029–1136. Cambridge University Press,
Cambridge, UK and New York, USA, 2013.

Comas-Bru, L. and Hernández, A. Reconciling north atlantic climate
modes: revised monthly indices for the east atlantic and the scandinavian
patterns beyond the 20th century. Earth Syst. Sci. Data, vol. 10(4), pages
2329–2344, doi:10.5194/essd-10-2329-2018, 2018.

Comboul, M., Emile-Geay, J., Hakim, G. J. and Evans, M. N. Pa-
leoclimate Sampling as a Sensor Placement Problem. J. Clim., vol. 28,
doi:10.1175/jcli-d-14-00802.1, 2015.

2k Consortium, P. Continental-scale temperature variability during the
past two millennia. Nat. Geosci., vol. 6, doi:10.1038/ngeo1797, 2013.

Cover, T. and Hart, P. Nearest neighbor pattern classification.
IEEE Trans. Inf. Theory , vol. 13(1), pages 21–27, ISSN 0018-9448,
doi:10.1109/TIT.1967.1053964, 1967.


138 Bibliography

Cram, T. A., Compo, G. P., Yin, X., Allan, R. J., McColl, C.,
Vose, R. S., Whitaker, J. S., Matsui, N., Ashcroft, L., Auch-

mann, R., Bessemoulin, P., Brandsma, T., Brohan, P., Brunet,

M., Comeaux, J., Crouthamel, R., Gleason Jr, B. E., Grois-

man, P. Y., Hersbach, H., Jones, P. D., Jónsson, T., Jourdain,

S., Kelly, G., Knapp, K. R., Kruger, A., Kubota, H., Lentini, G.,
Lorrey, A., Lott, N., Lubker, S. J., Luterbacher, J., Marshall,

G. J., Maugeri, M., Mock, C. J., Mok, H. Y., Nordli, Ø., Rod-

well, M. J., Ross, T. F., Schuster, D., Srnec, L., Valente, M. A.,
Vizi, Z., Wang, X. L., Westcott, N., Woollen, J. S. and Worley,

S. J. The International Surface Pressure Databank version 2. Geosci. Data
J., vol. 2(1), pages 31–46, ISSN 2049-6060, doi:10.1002/gdj3.25, 2015.

Crowley, T. J. Causes of climate change over the past 1000 years. Science,
vol. 289, doi:10.1126/science.289.5477.270, 2000.

Davis, R. E., Hayden, B. P., Gay, D. A., Phillips, W. L. and Jones,

G. V. The North Atlantic Subtropical Anticyclone. J. Clim., vol. 10(4),
pages 728–744, doi:10.1175/1520-0442(1997)010<0728:TNASA>2.0.CO;2,
1997.

Del Ser, J., Osaba, E., Molina, D., Yang, X., Salcedo-Sanz, S.,
Camacho, D., Das, S., N. Suganthan, P., Coello Coello, C. A.

and Herrera, F. Bio-inspired computation: Where we stand and what’s
next. Swarm Evol. Comput., vol. 48, pages 220 – 250, ISSN 2210-6502,
doi:10.1016/j.swevo.2019.04.008, 2019.

Demuzere, M., Kassomenos, P. and Philipp, A. The COST733 cir-
culation type classification software: an example for surface ozone con-
centrations in Central Europe. Theor. Appl. Climatol., vol. 105(1), pages
143–166, ISSN 1434-4483, doi:10.1007/s00704-010-0378-4, 2011.

Dixon, J. K. Pattern Recognition with Partly Missing Data. IEEE Trans.
Syst. Man. Cybern. B Cybern., vol. 9(10), pages 617–621, ISSN 0018-9472,
doi:10.1109/TSMC.1979.4310090, 1979.


Bibliography 139

Eiben, A. E. and Smith, J. From evolutionary computation to the evolution
of things. Nature, vol. 521, doi:10.1038/nature14544, 2015.

Emile-Geay, J., McKay, N. P., Kaufman, D. S., von Gunten, L.,
Wang, J., Anchukaitis, K. J., Abram, N. J., Addison, J. A., Cur-

ran, M. A., Evans, M. N., Henley, B. J., Hao, Z., Martrat, B.,
McGregor, H. V., Neukom, R., Pederson, G. T., Stenni, B.,
Thirumalai, K., Werner, J. P., Xu, C., Divine, D. V., Dixon,

B. C., Gergis, J., Mundo, I. A., Nakatsuka, T., Phipps, S. J.,
Routson, C. C., Steig, E. J., Tierney, J. E., Tyler, J. J., Allen,

K. J., Bertler, N. A., Björklund, J., Chase, B. M., Chen, M.-T.,
Cook, E., de Jong, R., DeLong, K. L., Dixon, D. A., Ekaykin,

A. A., Ersek, V., Filipsson, H. L., Francus, P., Freund, M. B.,
Frezzotti, M., Gaire, N. P., Gajewski, K., Ge, Q., Goosse,

H., Gornostaeva, A., Grosjean, M., Horiuchi, K., Hormes, A.,
Husum, K., Isaksson, E., Kandasamy, S., Kawamura, K., Kil-

bourne, K. H., Koç, N., Leduc, G., Linderholm, H. W., Lor-

rey, A. M., Mikhalenko, V., Mortyn, P. G., Motoyama, H.,
Moy, A. D., Mulvaney, R., Munz, P. M., Nash, D. J., Oerter, H.,
Opel, T., Orsi, A. J., Ovchinnikov, D. V., Porter, T. J., Roop,

H. A., Saenger, C., Sano, M., Sauchyn, D., Saunders, K. M., Sei-

denkrantz, M.-S., Severi, M., Shao, X., Sicre, M.-A., Sigl, M.,
Sinclair, K., St. George, S., St. Jacques, J.-M., Thamban, M.,
Kuwar Thapa, U., Thomas, E. R., Turney, C., Uemura, R., Viau,

A. E., Vladimirova, D. O., Wahl, E. R., White, J. W., Yu, Z.,
Zinke, J. and PAGES2k-Consortium. A global multiproxy database
for temperature reconstructions of the Common Era. Sci. Data, vol. 4(1),
page 170088, ISSN 2052-4463, doi:10.1038/sdata.2017.88, 2017.

Evans, M. N., A., K., A., C. M. and R., V. Globality and optimality
in climate field reconstructions from proxy data. In (ed. Markgraf V.)
Interhemispheric Climate Linkages., pages 53–72. Elsevier BV, 2001.

Evans, M. N., Kaplan, A. and Cane, M. A. Optimal sites for coral-based


140 Bibliography

reconstruction of global sea surface temperature. Paleoceanogr. Paleocli-
matol., vol. 13, doi:10.1029/98pa02132, 1998.

Evans, M. N., Smerdon, J. E., Kaplan, A., Tolwinski-Ward, S. E.

and González-Rouco, J. F. Climate field reconstruction uncertainty
arising from multivariate and nonlinear properties of predictors. Geophys.
Res. Lett., vol. 41, doi:10.1002/2014GL062063, 2014.

Eyring, V., Bony, S., Meehl, G. A., Senior, C. A., Stevens, B.,
Stouffer, R. J. and Taylor, K. E. Overview of the Coupled Model
Intercomparison Project Phase 6 (CMIP6) experimental design and organi-
zation. Geosci. Model Dev., vol. 9(5), pages 1937–1958, doi:10.5194/gmd-
9-1937-2016, 2016.

Falarz, M. Azores High and Hawaiian High: correlations, trends and
shifts (1948–2018). Theor. Appl. Climatol., vol. 138(1-2), pages 417–431,
doi:10.1007/s00704-019-02837-5, 2019.

Fernández-Donado, L., González-Rouco, J. F., Raible, C. C., Am-

mann, C. M., Barriopedro, D., García-Bustamante, E., Jung-

claus, J. H., Lorenz, S. J., Luterbacher, J., Phipps, S. J., Ser-

vonnat, J., Swingedouw, D., Tett, S. F. B., Wagner, S., Yiou,

P. and Zorita, E. Large-scale temperature response to external forcing
in simulations and reconstructions of the last millennium. Clim. Past , vol.
9(1), pages 393–421, doi:10.5194/cp-9-393-2013, 2013.

Flato, G., Marotzke, J., Abiodun, B., Braconnot, P., Chou, S. C.,
Collins, W., Cox, P., Driouech, F., Emori, S., Eyring, V., For-

est, C., Gleckler, P., Guilyardi, E., Jakob, C., Kattsov, V.,
Reason, C. and Rummukainen, M. Evaluation of Climate Models. In
Climate Change 2013: The Physical Science Basis. Contribution of Work-
ing Group I to the Fifth Assessment Report of the Intergovernmental Panel
on Climate Change (ed. by T. F. Stocker, D. Qin, G.-K. Plattner, M. Tig-
nor, S. K. Allen, J. Doschung, A. Nauels, Y. Xia, V. Bex and P. M. Midg-
ley), pages 741–866. Cambridge University Press, Cambridge, UK and New
York, USA, 2013.


Bibliography 141

Forrest, S. Genetic algorithms: principles of natural selection applied to
computation. Science, vol. 261, doi:10.1126/science.8346439, 1993.

Franke, J., Brönnimann, S., Bhend, J. and Brugnara, Y. A monthly
global paleo-reanalysis of the atmosphere from 1600 to 2005 for studying
past climatic variations. Sci. Data, vol. 4, doi:10.1038/sdata.2017.76, 2017.

Franke, J., González-Rouco, J. F., Frank, D. and Graham, N. E.

200 years of European temperature variability: insights from and tests of
the proxy surrogate reconstruction analog method. Clim. Dyn., vol. 37(1),
pages 133–150, ISSN 1432-0894, doi:10.1007/s00382-010-0802-6, 2011.

Franke, J., Valler, V., Brönnimann, S., Neukom, R. and Jaume-

Santero, F. The importance of input data quality and quantity in cli-
mate field reconstructions – results from the assimilation of various tree-
ring collections. Clim. Past , vol. 16(3), pages 1061–1074, doi:10.5194/cp-
16-1061-2020, 2020.

Gagne II, D. J., Christensen, H. M., Subramanian, A. C. and
Monahan, A. H. Machine Learning for Stochastic Parameterization:
Generative Adversarial Networks in the Lorenz ’96 Model. J. Adv.
Model. Earth Syst., vol. 12(3), page e2019MS001896, ISSN 1942-2466,
doi:10.1029/2019MS001896, 2020.

Gao, H., Chen, J., Wang, B., Tan, S. C., Lee, C. M., Yao,

X., Yan, H. and Shi, J. A study of air pollution of city clus-
ters. Atmos. Environ., vol. 45(18), pages 3069 – 3077, ISSN 1352-2310,
doi:10.1016/j.atmosenv.2011.03.018, 2011.

Glaser, R. and Riemann, D. A thousand-year record of tempera-
ture variations for Germany and Central Europe based on documen-
tary data. J. Quat. Sci., vol. 24(5), pages 437–449, ISSN 1099-1417,
doi:10.1002/jqs.1302, 2009.

Gómez-Navarro, J. J., Werner, J., Wagner, S., Luterbacher, J.

and Zorita, E. Establishing the skill of climate field reconstruction tech-
niques for precipitation with pseudoproxy experiments. Clim. Dyn., vol.


142 Bibliography

45(5), pages 1395–1413, ISSN 1432-0894, doi:10.1007/s00382-014-2388-x,
2015.

Gómez-Navarro, J. J., Zorita, E., Raible, C. C. and Neukom, R.

Pseudo-proxy tests of the analogue method to reconstruct spatially re-
solved global temperature during the Common Era. Clim. Past , vol. 13,
doi:10.5194/cp-13-629-2017, 2017.

Hakim, G. J., Emile-Geay, J., Steig, E. J., Noone, D., Anderson,

D. M., Tardif, R., Steiger, N. and Perkins, W. A. The Last Millen-
nium climate reanalysis project: Framework and first results. J. Geophys.
Res. Atmos., vol. 121(12), pages 6745–6764, doi:10.1002/2016JD024751,
2016.

Hansen, J., Sato, M., Kharecha, P. and von Schuckmann, K. Earth’s
energy imbalance and implications. Atmos. Chem. Phys., vol. 11, pages
13421–13449, doi:10.5194/acp-11-13421-2011, 2011.

Hartigan, J. A. and Wong, M. A. Algorithm AS 136: A K-Means Clus-
tering Algorithm. J. Royal Stat. Soc. Series C , vol. 28(1), pages 100–108,
ISSN 00359254, 14679876, doi:10.2307/2346830, 1979.

Hasanean, H. M. Variability of the North Atlantic subtropical high and
associations with tropical sea-surface temperature. Int. J. Climatol., vol.
24(8), pages 945–957, ISSN 0899-8418, doi:10.1002/joc.1042, 2004.

Hausfather, Z., Drake, H. F., Abbott, T. and Schmidt, G. A.

Evaluating the Performance of Past Climate Model Projections. Geo-
phys. Res. Lett., vol. 47(1), page e2019GL085378, ISSN 0094-8276,
doi:10.1029/2019GL085378, 2020.

Haylock, M. R., Hofstra, N., Klein Tank, A. M. G., Klok,

E. J., Jones, P. D. and New, M. A European daily high-
resolution gridded data set of surface temperature and precipitation for
1950-2006. J. Geophys. Res. Atmos., vol. 113(D20), ISSN 2156-2202,
doi:10.1029/2008JD010201, 2008.


Bibliography 143

He, C., Wu, B., Zou, L. and Zhou, T. Responses of the Summertime
Subtropical Anticyclones to Global Warming. J. Clim., vol. 30(16), pages
6465–6479, doi:10.1175/JCLI-D-16-0529.1, 2017.

He, X., Chaney, N. W., Schleiss, M. and Sheffield, J. Spa-
tial downscaling of precipitation using adaptable random forests. Wa-
ter Resour. Res., vol. 52(10), pages 8217–8237, ISSN 0043-1397,
doi:10.1002/2016WR019034, 2016.

Henn, B., Raleigh, M. S., Fisher, A. and Lundquist, J. D. A Compar-
ison of Methods for Filling Gaps in Hourly Near-Surface Air Temperature
Data. J. Hydrometeorol., vol. 14(3), pages 929–945, doi:10.1175/JHM-D-
12-027.1, 2013.

Hernández, A., Sánchez-López, G., Pla-Rabes, S., Comas-Bru, L.,
Parnell, A., Cahill, N., Geyer, A., Trigo, R. M. and Giralt,

S. A 2000-year Bayesian NAO reconstruction from the Iberian Peninsula.
Sci. Rep., vol. 10(1), page 14961, ISSN 2045-2322, doi:10.1038/s41598-020-
71372-5, 2020.

Ho, D. Artificial Intelligence in cancer therapy. Science, vol. 367(6481),
page 982, doi:10.1126/science.aaz3023, 2020.

Hodson, R. Digital Revolution. Nat. Outlook , vol. 563(7733), page 1,
doi:10.1038/d41586-018-07500-z, 2018.

Horton, D. E., Johnson, N. C., Singh, D., Swain, D. L., Rajarat-

nam, B. and Diffenbaugh, N. S. Contribution of changes in atmo-
spheric circulation patterns to extreme temperature trends. Nature, vol.
522(7557), page 465, doi:10.1038/nature14550, 2015.

Hotelling, H. Relation between two sets of variates*. Biometrika, vol.
28(3-4), pages 321–377, ISSN 0006-3444, doi:10.1093/biomet/28.3-4.321,
1936.


144 Bibliography

Hu, J., Emile-Geay, J. and Partin, J. Correlation-based interpretations
of paleoclimate data – where statistics meet past climates. Earth Planet.
Sci. Lett., vol. 459, doi:10.1016/j.epsl.2016.11.048, 2017.

Huber, M. and Knutti, R. Anthropogenic and natural warming inferred
from changes in Earth’s energy balance. Nat. Geosci., vol. 5(1), pages
31–36, ISSN 1752-0908, doi:10.1038/ngeo1327, 2012.

Hurrell, J., Trenberth, K. and NCAR. The Climate Data Guide:
NCAR Sea Level Pressure. In (Eds. National Center for Atmospheric
Research Staff ) Retrieved from https://climatedataguide.ucar.edu/climate-
data/ncar-sea-level-pressure. 2020.

Hurrell, J. W. and Deser, C. North Atlantic climate variability: The
role of the North Atlantic Oscillation. J. Mar. Syst., vol. 78(1), pages 28
– 41, ISSN 0924-7963, doi:10.1016/j.jmarsys.2008.11.026, 2010.

Iles, C. E., Vautard, R., Strachan, J., Joussaume, S., Eggen, B. R.

and Hewitt, C. D. The benefits of increasing resolution in global and
regional climate simulations for European climate extremes. Geosci. Model
Dev., vol. 13(11), pages 5583–5607, doi:10.5194/gmd-13-5583-2020, 2020.

Ilyas, M., Brierley, C. M. and Guillas, S. Uncertainty in regional
temperatures inferred from sparse global observations: Application to a
probabilistic classification of El Niño. Geophys. Res. Lett., vol. 44(17),
pages 9068–9074, ISSN 0094-8276, doi:10.1002/2017GL074596, 2017.

Ioannidou, L. and Yau, M. K. A climatology of the Northern Hemi-
sphere winter anticyclones. J. Geophys. Res. Atmos., vol. 113(D8),
doi:10.1029/2007JD008409, 2008.

IPCC. In Climate Change 2013: The Physical Science Basis. Contribution of
Working Group I to the Fifth Assessment Report of the Intergovernmental
Panel on Climate Change (ed. by T. F. Stocker, D. Qin, G.-K. Plattner,
M. Tignor, S. K. Allen, J. Doschung, A. Nauels, Y. Xia, V. Bex and P. M.
Midgley), pages 1–1535. Cambridge University Press, Cambridge, UK and
New York, USA, 2013.


Bibliography 145

IPCC. Summary for policymakers. In Climate Change 2014: Im-
pacts,Adaptation, and Vulnerability. Part A: Global and Sectoral Aspects.
Contribution of Working Group II to the Fifth Assessment Report of the
Intergovernmental Panel on Climate Change (ed. by C. Field, V. R. Barros,
D. J. Dokken, K. J. Mach, M. D. Mastrandrea, T. E. Bilir, M. Chatterjee,
K. L. Ebi, Y. O. Estrada, R. C. Genova, B. Girma, E. S. Kissel, A. N.
Levy, S. MacCracken, P. R. Mastrandrea and L. L. White), pages 1–32.
Cambridge University Press, Cambridge, UK and New York, USA, 2014.

IPCC. Global Warming of 1.5◦C. In An IPCC Special Report on the im-
pacts of global warming of 1.5◦C above pre-industrial levels and related
global greenhouse gas emission pathways, in the context of strengthening
the global response to the threat of climate change, sustainable development,
and efforts to eradicate poverty (ed. by V. Masson-Delmotte, P. Zhai, H. O.
Pörtner, D. Roberts, J. Skea, P. R. Shukla, A. Pirani, W. Moufouma-Okia,
C. Péan, R. Pidcock, S. Connors, J. B. R. Matthews, Y. Chen, X. Zhou,
M. Gomis, E. Lonnoy, T. Maycock, M. Tignor and T. Waterfield), pages
1–616. Cambridge University Press, Cambridge, UK and New York, USA,
2018.

Iqbal, M. J., Rehman, S. U., Hameed, S. and Qureshi, M. A. Changes
in Hadley circulation: the Azores high and winter precipitation over trop-
ical northeast Africa. Theor. Appl. Climatol., vol. 137(3-4), pages 2941–
2948, doi:10.1007/s00704-019-02765-4, 2019.

Jaume-Santero, F., Barriopedro, D., García-Herrera, R., Calvo,

N. and Salcedo-Sanz, S. Selection of optimal proxy locations for tem-
perature field reconstructions using evolutionary algorithms. Sci. Rep.,
vol. 10(1), ISSN 2045-2322, doi:10.1038/s41598-020-64459-6, 2020.

Jaume-Santero, F., Barriopedro, D., García-Herrera, R. F. and
Luterbacher, J. North Atlantic Sea Level Pressure reconstruction back
to 1750 CE using optimized networks of observations. J. Clim., Submitted,
2021.


146 Bibliography

Jin, C., Chen, W., Cao, Y., Xu, Z., Tan, Z., Zhang, X., Deng, L.,
Zheng, C., Zhou, J., Shi, H. and Feng, J. Development and evaluation
of an artificial intelligence system for COVID-19 diagnosis. Nat. Commun.,
vol. 11(1), page 5088, ISSN 2041-1723, doi:10.1038/s41467-020-18685-1,
2020.

Jones, P. D., Jonsson, T. and Wheeler, D. Extension
to the North Atlantic oscillation using early instrumental pres-
sure observations from Gibraltar and south-west Iceland. Int.
J. Clim., vol. 17(13), pages 1433–1450, doi:10.1002/(SICI)1097-
0088(19971115)17:13<1433::AID-JOC203>3.0.CO;2-P, 1997.

Jones, P. D., New, M., Parker, D. E., Martin, S. and Rigor, I. G.

Surface air temperature and its variations over the last 150 years. Rev.
Geophys., vol. 37, doi:10.1029/1999rg900002, 1999.

Jungclaus, J. H., Bard, E., Baroni, M., Braconnot, P., Cao, J.,
Chini, L. P., Egorova, T., Evans, M., González-Rouco, J. F.,
Goosse, H., Hurtt, G. C., Joos, F., Kaplan, J. O., Khodri, M.,
Klein Goldewijk, K., Krivova, N., LeGrande, A. N., Lorenz,

S. J., Luterbacher, J., Man, W., Maycock, A. C., Meinshausen,

M., Moberg, A., Muscheler, R., Nehrbass-Ahles, C., Otto-

Bliesner, B. I., Phipps, S. J., Pongratz, J., Rozanov, E., Schmidt,

G. A., Schmidt, H., Schmutz, W., Schurer, A., Shapiro, A. I.,
Sigl, M., Smerdon, J. E., Solanki, S. K., Timmreck, C., Toohey,

M., Usoskin, I. G., Wagner, S., Wu, C.-J., Yeo, K. L., Zanchet-

tin, D., Zhang, Q. and Zorita, E. The PMIP4 contribution to CMIP6
– Part 3: The last millennium, scientific objective, and experimental de-
sign for the PMIP4 past1000 simulations. Geosci. Model Dev., vol. 10(11),
pages 4005–4033, doi:10.5194/gmd-10-4005-2017, 2017.

Kadow, C., Hall, D. M. and Ulbrich, U. Artificial intelligence re-
constructs missing climate information. Nat. Geosci., vol. 13(6), pages
408–413, ISSN 1752-0908, doi:10.1038/s41561-020-0582-5, 2020.


Bibliography 147

Karpatne, A., Ebert-Uphoff, I., Ravela, S., Babaie, H. A. and Ku-

mar, V. Machine Learning for the Geosciences: Challenges and Opportu-
nities. IEEE Trans. Knowl. Data Eng., vol. 31(8), pages 1544–1554, ISSN
1558-2191, doi:10.1109/TKDE.2018.2861006, 2019.

Kashani, A. R., Gandomi, A. H. and Mousavi, M. Imperialistic Com-
petitive Algorithm: A metaheuristic algorithm for locating the critical slip
surface in 2-Dimensional soil slopes. Geosci. Front., vol. 7(1), pages 83–89,
ISSN 1674-9871, 2016.

Kates-Harbeck, J., Svyatkovskiy, A. and Tang, W. Predicting dis-
ruptive instabilities in controlled fusion plasmas through deep learning. Na-
ture, vol. 568(7753), pages 526–531, ISSN 1476-4687, doi:10.1038/s41586-
019-1116-4, 2019.

Kaufman, D. S. Recent warming reverses long-term Arctic cooling. Science,
vol. 325, doi:10.1126/science.1173983, 2009.

Kell, D. B. Scientific discovery as a combinatorial optimisation problem:
how best to navigate the landscape of possible experiments? BioEssays ,
vol. 34(3), pages 236–244, ISSN 1521-1878, doi:10.1002/bies.201100144,
2012.

Kettenring, J. R. The Practice of Cluster Analysis. J. Classif., vol. 23(1),
pages 3–30, ISSN 1432-1343, doi:10.1007/s00357-006-0002-6, 2006.

Knapp, T. R. Canonical correlation analysis: A general parametric
significance-testing system. Psychol. Bull., vol. 85(2), pages 410–416,
doi:10.1037/0033-2909.85.2.410, 1978.

Knüsel, B. Applying big data beyond small problems in climate research.
Nat. Clim. Change, vol. 9, doi:10.1038/s41558-019-0404-1, 2019.

Köppen, W. Die Wärmezonen der Erde, nach der Dauer der heissen,
gemässigten und kalten Zeit und nach der Wirkung der Wärme auf die
organische Welt betrachtet (The thermal zones of the earth according
to the duration of hot, moderate and cold periods and to the impact


148 Bibliography

of heat on the organic world). Meteorol. Z., vol. 1(21), pages 215–226,
doi:10.1127/metz/2016/0816, 1884.

Küttel, M., Xoplaki, E., Gallego, D., Luterbacher, J., García-

Herrera, R., Allan, R., Barriendos, M., Jones, P. D., Wheeler,

D. and Wanner, H. The importance of ship log data: reconstruct-
ing North Atlantic, European and Mediterranean sea level pressure fields
back to 1750. Clim. Dyn., vol. 34(7), pages 1115–1128, ISSN 1432-0894,
doi:10.1007/s00382-009-0577-9, 2010.

Lange, K., Hunter, D. R. and Yang, I. Optimization Transfer Using
Surrogate Objective Functions. J. Comput. Graph. Stat., vol. 9(1), pages
1–20, ISSN 10618600, doi:10.2307/1390605, 2000.

LeCun, Y., Bengio, Y. and Hinton, G. Deep Learning. Nature,
vol. 521(7553), pages 436–444, ISSN 1476-4687, doi:10.1038/nature14539,
2015.

Lewis, S. C. and LeGrande, A. N. Stability of ENSO and its tropical
Pacific teleconnections over the Last Millennium. Clim. Past , vol. 11,
doi:10.5194/cp-11-1347-2015, 2015.

Li, W., Li, L., Ting, M. and Liu, Y. Intensification of Northern Hemi-
sphere subtropical highs in a warming climate. Nat. Geosci., vol. 5(11),
pages 830–834, ISSN 1752-0908, doi:10.1038/ngeo1590, 2012.

Lorenz, E. N. Atmospheric Predictability as Revealed by Naturally
Occurring Analogues. J. Atmos. Sci., vol. 26(4), pages 636–646,
doi:10.1175/1520-0469(1969)26<636:APARBN>2.0.CO;2, 1969.

Luterbacher, J., Dietrich, D., Xoplaki, E., Grosjean, M. and Wan-

ner, H. European Seasonal and Annual Temperature Variability, Trends,
and Extremes Since 1500. Science, vol. 303(5663), pages 1499–1503, ISSN
0036-8075, doi:10.1126/science.1093877, 2004.

Luterbacher, J., Xoplaki, E., Dietrich, D., Jones, P. D., Davies,

T. D., Portis, D., Gonzalez-Rouco, J. F., von Storch, H., Gyal-


Bibliography 149

istras, D., Casty, C. and Wanner, H. Extending North Atlantic os-
cillation reconstructions back to 1500. Atmos. Sci. Lett , vol. 2(1-4), pages
114–124, doi:10.1006/asle.2002.0047, 2001.

Luterbacher, J., Xoplaki, E., Dietrich, D., Rickli, R., Jacobeit,

J., Beck, C., Gyalistras, D., Schmutz, C. and Wanner, H. Re-
construction of sea level pressure fields over the Eastern North Atlantic
and Europe back to 1500. Clim. Dyn., vol. 18(7), pages 545–561, ISSN
1432-0894, doi:10.1007/s00382-001-0196-6, 2002.

Mann, M. E. Global signatures and dynamical origins of the Lit-
tle Ice Age and Medieval Climate Anomaly. Science, vol. 326,
doi:10.1126/science.1177303, 2009.

Masson-Delmotte, V., Schulz, M., Abe-Ouchi, A., Beer, J.,
Ganopolski, A., Rouco, J. F. G., Jansen, E., Lambeck, K., Luter-

bacher, J., Naish, T., Osborn, T., Otto-Bliesner, B., Quinn, T.,
Ramesh, R., Rojas, M., Shao, X. and Timmermann, A. Informa-
tion from Paleoclimate Archives. In Climate Change 2013: The Physical
Science Basis. Contribution of Working Group I to the Fifth Assessment
Report of the Intergovernmental Panel on Climate Change (ed. by T. F.
Stocker, D. Qin, G.-K. Plattner, M. Tignor, S. K. Allen, J. Doschung,
A. Nauels, Y. Xia, V. Bex and P. M. Midgley), pages 383–464. Cambridge
University Press, Cambridge, UK and New York, USA, 2013.

Maxwell, A. E., Warner, T. A. and Fang, F. Implementation
of machine-learning classification in remote sensing: an applied review.
Int. J. Remote Sens., vol. 39(9), pages 2784–2817, ISSN 0143-1161,
doi:10.1080/01431161.2018.1433343, 2018.

Mellado-Cano, J., Barriopedro, D., García-Herrera, R., Trigo,

R. M. and Hernández, A. Examining the North Atlantic Oscillation,
East Atlantic Pattern, and Jet Variability since 1685. J. Clim., vol. 32(19),
pages 6285–6298, ISSN 0894-8755, doi:10.1175/JCLI-D-19-0135.1, 2019.


150 Bibliography

Miele, V., Picard, F. and Dray, S. Spatially constrained clustering of
ecological networks. Methods Ecol. Evol., vol. 5(8), pages 771–779, ISSN
2041-210X, doi:10.1111/2041-210X.12208, 2014.

Murphy, K. Machine Learning: A Probabilistic Perspective. The MIT
Press. ISBN 978-0-262-01802-9, 2012.

Myhre, G., Shindell, D., Bréon, F. M., Collins, W., Fuglestvedt,

J., Huang, J., Koch, D., Lamarque, J. F., Lee, D., Mendoza,

B., Nakajima, T., Robock, A., Stephens, G., Takemura, T. and
Zhang, H. Anthropogenic and Natural Radiative Forcing. In Climate
Change 2013: The Physical Science Basis. Contribution of Working Group
I to the Fifth Assessment Report of the Intergovernmental Panel on Cli-
mate Change (ed. by T. F. Stocker, D. Qin, G.-K. Plattner, M. Tignor,
S. K. Allen, J. Doschung, A. Nauels, Y. Xia, V. Bex and P. M. Midg-
ley), pages 659–740. Cambridge University Press, Cambridge, UK and New
York, USA, 2013.

Netzel, P. and Stepinski, T. On Using a Clustering Approach for Global
Climate Classification. J. Clim., vol. 29(9), pages 3387–3401, ISSN 0894-
8755, doi:10.1175/JCLI-D-15-0640.1, 2016.

Neukom, R., Schurer, A. P., Steiger, N. J. and Hegerl, G. C. Pos-
sible causes of data model discrepancy in the temperature history of the
Last Millennium. Sci. Rep., vol. 8, doi:10.1038/s41598-018-25862-2, 2018.

Neukom, R., Steiger, N., Gómez-Navarro, J. J., Wang, J. and
Werner, J. P. No evidence for globally coherent warm and cold pe-
riods over the preindustrial Common Era. Nature, vol. 571(7766), pages
550–554, ISSN 1476-4687, doi:10.1038/s41586-019-1401-2, 2019.

Noone, S., Atkinson, C., Berry, D. I., Dunn, R. J. H., Freeman,

E., Perez Gonzalez, I., Kennedy, J. J., Kent, E. C., Kettle, A.,
McNeill, S., Menne, M., Stephens, A., Thorne, P. W., Tucker,

W., Voces, C. and Willett, K. M. Progress towards a holistic land


Bibliography 151

and marine surface meteorological database and a call for additional con-
tributions. Geosci. Data J., ISSN 2049-6060, doi:10.1002/gdj3.109, 2020.

Omran, M. G., Engelbrecht, A. P. and Salman, A. An overview
of clustering methods. Intell. Data Anal., vol. 11, pages 583–605, ISSN
1571-4128, doi:10.3233/IDA-2007-11602, 2007.

Otto-Bliesner, B. L. Climate Variability and Change since 850 C.E.: An
Ensemble Approach with the Community Earth System Model (CESM).
BAMS , vol. 97, doi:10.1175/bams-d-14-00233.1, 2016.

Perdinan and Winkler, J. A. Selection of climate information for regional
climate change assessments using regionalization techniques: an example
for the Upper Great Lakes Region, USA. Int. J. Climatol., vol. 35(6),
pages 1027–1040, doi:10.1002/joc.4036, 2015.

Phillips, S. J. Acceleration of K-Means and Related Clustering Algo-
rithms. In Algorithm Engineering and Experiments (ed. by D. M. Mount
and C. Stein), pages 166–177. Springer Berlin Heidelberg, Berlin Heidel-
berg, 2002. ISBN 978-3-540-45643-8.

Pietsch, W. The Causal Nature of Modeling with Big Data. Philos. Tech-
nol., vol. 29(2), pages 137–171, ISSN 2210-5441, doi:10.1007/s13347-015-
0202-2, 2016.

Pinto, J. G. and Raible, C. C. Past and recent changes in the North
Atlantic oscillation. WIREs Clim. Change, vol. 3(1), pages 79–90, ISSN
1757-7780, doi:10.1002/wcc.150, 2012.

Portis, D. H., Walsh, J. E., El Hamly, M. and Lamb, P. J. Seasonality
of the North Atlantic Oscillation. J. Clim., vol. 14(9), pages 2069–2078,
doi:10.1175/1520-0442(2001)014<2069:SOTNAO>2.0.CO;2, 2001.

Prakash, S., Sharma, A. and Sahu, S. S. Soil Moisture Prediction Using
Machine Learning. In 2018 Second International Conference on Inventive
Communication and Computational Technologies (ICICCT), 2018 Second


152 Bibliography

International Conference on Inventive Communication and Computational
Technologies (ICICCT), pages 1–6. 2018.

Rajan, K. and Saffiotti, A. Towards a science of integrated AI
and Robotics. Artif. Intell., vol. 247, pages 1–9, ISSN 0004-3702,
doi:10.1016/j.artint.2017.03.003, 2017.

Rao, A. R. and Srinivas, V. V. Regionalization of watersheds by hybrid-
cluster analysis. J. Hydrol., vol. 318(1), pages 37 – 56, ISSN 0022-1694,
doi:10.1016/j.jhydrol.2005.06.003, 2006.

Rasp, S., Pritchard, M. S. and Gentine, P. Deep Learning to repre-
sent subgrid processes in climate models. PNAS , vol. 115(39), page 9684,
doi:10.1073/pnas.1810286115, 2018.

Reichstein, M. Deep learning and process understanding for data-driven
Earth system science. Nature, vol. 566, doi:10.1038/s41586-019-0912-1,
2019.

Ribes, A., Zwiers, F. W., Azaïs, J.-M. and Naveau, P. A new statis-
tical approach to climate change detection and attribution. Clim. Dyn.,
vol. 48(1), pages 367–386, ISSN 1432-0894, doi:10.1007/s00382-016-3079-6,
2017.

Richman, M. B., Leslie, L. M., Ramsay, H. A. and Klotzbach, P. J.

Reducing Tropical Cyclone Prediction Errors Using Machine Learning Ap-
proaches. Procedia Comput. Sci., vol. 114, pages 314–323, ISSN 1877-0509,
doi:10.1016/j.procs.2017.09.048, 2017.

Routson, C. C. Mid-latitude net precipitation decreased with Arctic warm-
ing during the Holocene. Nature, vol. 1476, doi:10.1038/s41586-019-1060-3,
2019.

Rowland, E. Theory of Games and Economic Behavior. Nature, vol.
157(3981), pages 172–173, ISSN 1476-4687, doi:10.1038/157172a0, 1946.


Bibliography 153

Rubel, F., Brugger, K., Haslinger, K. and Auer, I. The cli-
mate of the European Alps: Shift of very high resolution Köppen-
Geiger climate zones 1800-2100. Meteorol. Z., vol. 26(2), pages 115–125,
doi:10.1127/metz/2016/0816, 2017.

Rudin, C. and Radin, J. Why Are We Using Black Box Models in AI When
We Don’t Need To? A Lesson From An Explainable AI Competition.
Harvard Data Sci. Rev., vol. 1(2), doi:10.1162/99608f92.5a8a3a3d, 2019.

Salcedo-Sanz, S. Modern meta-heuristics based on nonlinear physics pro-
cesses: A review of models and design procedures. Phys. Rep., vol. 655,
pages 1–70, ISSN 0370-1573, doi:10.1016/j.physrep.2016.08.001, 2016.

Salcedo-Sanz, S. A review on the coral reefs optimization algorithm: new
development lines and current applications. Prog. Artif. Intell., vol. 6,
doi:10.1007/s13748-016-0104-2, 2017.

Salcedo-Sanz, S., García-Herrera, R., Camacho-Gómez, C.,
Alexandre, E., Carro-Calvo, L. and Jaume-Santero, F. Near-
optimal selection of representative measuring points for robust temperature
field reconstruction with the CRO-SL and analogue methods. Glob. Planet.
Change, vol. 178, pages 15–34, doi:10.1016/j.gloplacha.2019.04.013, 2019.

Salcedo-Sanz, S., García-Herrera, R., Camacho-Gómez, C.,
Aybar-Ruíz, A. and Alexandre, E. Wind power field reconstruction
from a reduced set of representative measuring points. Appl. Energ., vol.
228, doi:10.1016/j.apenergy.2018.07.003, 2018.

Sanderson, B. M. and O’Neill, B. C. Assessing the costs of historical
inaction on climate change. Sci. Rep., vol. 10(1), page 9173, ISSN 2045-
2322, doi:10.1038/s41598-020-66275-4, 2020.

Schmidt, A., Thordarson, T., Oman, L. D., Robock, A. and Self, S.

Climatic impact of the long-lasting 1783 Laki eruption: Inapplicability of
mass-independent sulfur isotopic composition measurements. J. Geophys.
Res. Atmos., vol. 117(D23), ISSN 0148-0227, doi:10.1029/2012JD018414,
2012.


154 Bibliography

Schmutz, C., Luterbacher, J., Gyalistras, D., Xoplaki, E. and
Wanner, H. Can we trust proxy-based NAO index reconstructions?
Geophys. Res. Lett., vol. 27(8), pages 1135–1138, ISSN 0094-8276,
doi:10.1029/1999GL011045, 2000.

Schnase, J. L., Lee, T. J., Mattmann, C. A., Lynnes, C. S., Cinquini,

L., Ramirez, P. M., Hart, A. F., Williams, D. N., Waliser, D.,
Rinsland, P., Webster, W. P., Duffy, D. Q., McInerney, M. A.,
Tamkin, G. S., Potter, G. L. and Carriere, L. Big Data Challenges
in Climate Science: Improving the next-generation cyberinfrastructure.
IEEE Geosci. Remote Sens. Mag., vol. 4(3), pages 10–22, ISSN 2168-6831,
doi:10.1109/MGRS.2015.2514192, 2016.

Seydoux, L., Balestriero, R., Poli, P., Hoop, M. d., Campillo,

M. and Baraniuk, R. Clustering earthquake signals and background
noises in continuous seismic data with unsupervised Deep Learning. Nat.
Commun., vol. 11(1), page 3972, ISSN 2041-1723, doi:10.1038/s41467-020-
17841-x, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den

Driessche, G., Schrittwieser, J., Antonoglou, I., Panneer-

shelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham,

J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M.,
Kavukcuoglu, K., Graepel, T. and Hassabis, D. Mastering the game
of Go with deep neural networks and tree search. Nature, vol. 529(7587),
pages 484–489, ISSN 1476-4687, doi:10.1038/nature16961, 2016.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M.,
Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lil-

licrap, T., Simonyan, K. and Hassabis, D. A general reinforcement
learning algorithm that masters chess, shogi, and Go through self-play.
Science, vol. 362(6419), page 1140, doi:10.1126/science.aar6404, 2018.

Slivinski, L. C., Compo, G. P., Whitaker, J. S., Sardeshmukh,

P. D., Giese, B. S., McColl, C., Allan, R., Yin, X., Vose, R.,


Bibliography 155

Titchner, H., Kennedy, J., Spencer, L. J., Ashcroft, L., Brön-

nimann, S., Brunet, M., Camuffo, D., Cornes, R., Cram, T. A.,
Crouthamel, R., Domínguez-Castro, F., Freeman, J. E., Ger-

gis, J., Hawkins, E., Jones, P. D., Jourdain, S., Kaplan, A., Kub-

ota, H., Blancq, F. L., Lee, T.-C., Lorrey, A., Luterbacher, J.,
Maugeri, M., Mock, C. J., Moore, G. K., Przybylak, R., Pud-

menzky, C., Reason, C., Slonosky, V. C., Smith, C. A., Tinz, B.,
Trewin, B., Valente, M. A., Wang, X. L., Wilkinson, C., Wood,

K. and Wyszynski, P. Towards a more reliable historical reanalysis: Im-
provements for version 3 of the Twentieth Century Reanalysis system. Q.
J. R. Meteorol. Soc., vol. 145(724), pages 2876–2908, doi:10.1002/qj.3598,
2019.

Smerdon, J. E. Climate models as a test bed for climate reconstruc-
tion methods: pseudoproxy experiments. WIREs: Clim. Change, vol.
3, doi:10.1002/wcc.149, 2011.

Smerdon, J. E., Kaplan, A., Chang, D. and Evans, M. N. A
pseudoproxy evaluation of the CCA and RegEM methods for recon-
structing climate fields of the Last Millennium. J. Clim., vol. 23,
doi:10.1175/2010JCLI3328.1, 2010.

Song, Y.-C., Meng, H.-D., O’Grady, M. J. and O’Hare, G. M. P. The
application of cluster analysis in geophysical data interpretation. Comput.
Geosci., vol. 14(2), pages 263–271, ISSN 1573-1499, doi:10.1007/s10596-
009-9150-1, 2010.

Soto, R., Gómez-Pulido, J. A., Caro, S. and Lanza-Gutiérrez,

J. M. Data Science and AI-Based Optimization in Scientific Pro-
gramming. Sci. Program., vol. 2019, page 7154765, ISSN 1058-9244,
doi:10.1155/2019/7154765, 2019.

Sousa, P. M., Barriopedro, D., Ramos, A. M., García-Herrera,

R., Espírito-Santo, F. and Trigo, R. M. Saharan air intrusions as
a relevant mechanism for Iberian heatwaves: The record breaking events


156 Bibliography

of August 2018 and June 2019. Weather. Clim. Extremes , vol. 26, page
100224, ISSN 2212-0947, doi:10.1016/j.wace.2019.100224, 2019.

Sousa, P. M., Trigo, R. M., Barriopedro, D., Soares, P. M. M. and
Santos, J. A. European temperature responses to blocking and ridge
regional patterns. Clim. Dyn., vol. 50(1), pages 457–477, ISSN 1432-0894,
doi:10.1007/s00382-017-3620-2, 2018.

Steffen, W., Broadgate, W., Deutsch, L., Gaffney, O. and
Ludwig, C. The trajectory of the Anthropocene: The Great Ac-
celeration. Anthr. Rev., vol. 2(1), pages 81–98, ISSN 2053-0196,
doi:10.1177/2053019614564785, 2015.

Steig, E. J. and Neff, P. D. The prescience of paleoclimatology and the
future of the Antarctic ice sheet. Nat. Commun., vol. 9(1), page 2730,
ISSN 2041-1723, doi:10.1038/s41467-018-05001-1, 2018.

Stothers, R. B. The Great Dry Fog of 1783. Clim. Change, vol. 32, pages
79–89, doi:10.1007/BF00141279, 1996.

Swarnkar, A. and Swarnkar, A. Artificial Intelligence Based Optimiza-
tion Techniques: A Review. In (eds. Kalam A., Niazi K., Soni A., Siddiqui
S., Mundra A.) Intelligent Computing Techniques for Smart Energy Sys-
tems. Lecture Notes in Electrical Engineering . 2019.

Talento, S., Schneider, L., Werner, J. and Luterbacher, J.

Millennium-length precipitation reconstruction over south-eastern Asia:
a pseudo-proxy approach. Earth Sys. Dyn., vol. 10(2), pages 347–364,
doi:10.5194/esd-10-347-2019, 2019.

Tardif, R., Hakim, G. J., Perkins, W. A., Horlick, K. A., Erb,

M. P., Emile-Geay, J., Anderson, D. M., Steig, E. J. and Noone,

D. Last Millennium Reanalysis with an expanded proxy database and
seasonal proxy modeling. Clim. Past , vol. 15(4), pages 1251–1273,
doi:10.5194/cp-15-1251-2019, 2019.


Bibliography 157

Tchemisova, T., Freitas, A., Plakhov, A. and Weber, G. W. Special
issue: Optimization in the natural sciences. Optimization, vol. 64(6), pages
1363–1365, doi:10.1080/02331934.2015.1027530, 2015.

Thordarson, T. and Self, S. Atmospheric and environmental effects
of the 1783–1784 Laki eruption: A review and reassessment. J. Geo-
phys.Res. Atmos., vol. 108(D1), pages AAC 7–1–AAC 7–29, ISSN 0148-
0227, doi:10.1029/2001JD002042, 2003.

Turing, A. M. I.-Computing Machinery and Intelligence. Mind , vol.
LIX(236), pages 433–460, ISSN 0026-4423, doi:10.1093/mind/LIX.236.433,
1950.

Vaccaro, A., Emile-Geay, J., Guillot, D., Verna, R., Morice, C.,
Kennedy, J. and Rajaratnam, B. Climate field completion via Markov
random fields – Application to the HadCRUT4.6 temperature dataset. J.
Clim., pages 1–66, doi:10.1175/JCLI-D-19-0814.1, 2021.

Vadlamani, S. K., Xiao, T. P. and Yablonovitch, E. Physics success-
fully implements Lagrange multiplier optimization. PNAS , vol. 117(43),
pages 26639–26650, ISSN 0027-8424, doi:10.1073/pnas.2015192117, 2020.

de Vargas, R. R. and Bedregal, B. R. C. A Way to Obtain the Quality
of a Partition by Adjusted Rand Index. In 2013 2nd Workshop-School on
Theoretical Computer Science, pages 67–71. IEEE, 2013.

Vrugt, J. A. and Robinson, B. A. Improved evolutionary optimiza-
tion from genetically adaptive multimethod search. PNAS , vol. 104,
doi:10.1073/pnas.0610471104, 2007.

Wang, S., Li, G., Gong, Z., Du, L., Zhou, Q., Meng, X., Xie, S.

and Zhou, L. Spatial distribution, seasonal variation and regionalization
of PM2.5 concentrations in China. Sci. China Chem., vol. 58(9), pages
1435–1443, ISSN 1869-1870, doi:10.1007/s11426-015-5468-9, 2015.

Wanner, H., Rickli, R., Salvisberg, E., Schmutz, C. and Schüepp,

M. Global climate change and variability and its influence on Alpine


158 Bibliography

climate — concepts and observations. Theor. Appl. Climatol., vol. 58(3),
pages 221–243, ISSN 1434-4483, doi:10.1007/BF00865022, 1997.

Wilks, D. S. Chapter 13 - Canonical Correlation Analysis (CCA), vol.
100 of Statistical Methods in the Atmospheric Sciences , pages 563–582.
Academic Press, 2011.

Yang, X.-S., Chien, S. F. and Ting, T. O. Computational Intelligence
and Metaheuristic Algorithms with Applications. Sci. World J., vol. 2014,
page 425853, ISSN 2356-6140, doi:10.1155/2014/425853, 2014.

Yoo, D. G. and Kim, J. H. Meta-heuristic algorithms as tools for hy-
drological science. Geosci. Lett., vol. 1(1), page 4, ISSN 2196-4092,
doi:10.1186/2196-4092-1-4, 2014.

Yukimoto, S., Kawai, H., Koshiro, T., Oshima, N., Yoshida, K.,
Urakawa, S., Tsujino, H., Deushi, M., Tanaka, T., Hosaka, M.,
Yabu, S., Yoshimura, H., Shindo, E., Mizuta, R., Obata, A.,
Adachi, Y. and Ishii, M. The Meteorological Research Institute Earth
System Model Version 2.0, MRI-ESM2.0: Description and Basic Evalua-
tion of the Physical Component. J. Meteorol. Soc. Jpn. Ser. II , vol. 97(5),
pages 931–965, doi:10.2151/jmsj.2019-051, 2019.

Zambri, B., Robock, A., Mills, M. J. and Schmidt, A. Modeling the
1783-1784 Laki Eruption in Iceland: 2. Climate Impacts. J. Geophys. Res.
Atmos., vol. 124(13), pages 6770–6790, doi:10.1029/2018JD029554, 2019.

Zhang, Y., Moges, S. and Block, P. Optimal Cluster Analysis for Objec-
tive Regionalization of Seasonal Precipitation in Regions of High Spatial-
Temporal Variability: Application to Western Ethiopia. J. Clim., vol.
29(10), pages 3697–3717, doi:10.1175/JCLI-D-15-0582.1, 2016.

Zhou, S., Zhou, K., Wang, J., Yang, G. and Wang, S. Application
of cluster analysis to geochemical compositional data for identifying ore-
related geochemical anomalies. Front. Earth Sci., vol. 12(3), pages 491–
505, ISSN 2095-0209, doi:10.1007/s11707-017-0682-8, 2018.


Bibliography 159

Zishka, K. M. and Smith, P. J. The Climatology of Cyclones and An-
ticyclones over North America and Surrounding Ocean Environs for Jan-
uary and July, 1950–77. Mon. Weather Rev., vol. 108(4), pages 387–401,
doi:10.1175/1520-0493(1980)108<0387:TCOCAA>2.0.CO;2, 1980.


	Tesis Fernando Jaume Santero
	Portada
	Dedicatoria
	Acknowledgments
	List of Acronyms
	Summary
	Resumen
	Índices
	Contents
	List of Figures
	List of Tables

	Chapter 1. Introduction
	Artificial Intelligence as a tool for Climate Science
	Main objectives and structure of the Thesis
	Thesis structure


	Chapter 2. Data
	Data description
	Model simulations
	Observations
	Reanalysis
	Paleoclimate data

	Data post-processing
	Pseudo proxies
	Pseudo SLP observations


	Chapter 3. Methodology
	Climate Field Reconstruction methods
	Analogue Method
	Canonical Correlation Analysis

	The Coral Reef Optimization
	Coral solutions
	The algorithm
	Search operators

	Clustering techniques
	Assumptions and Definitions
	The k-gaps algorithm
	Centroids calculation
	Assignment of records to clusters
	Stop conditions


	Chapter 4. Selection of proxy locations for temperature reconstructions
	Background
	Selection of representative locations
	Reconstruction of temperature patterns
	Insights on past anomalous periods

	Chapter 5. North Atlantic SLP reconstruction since 1750
	Background
	Optimized networks
	Skill of the optimized networks
	Climate variability and the Azores High

	Chapter 6. Clustering incomplete climatological time series
	Background
	Ideal case with complete information
	Experiment setting
	Performance assessment
	Applications on climate studies

	Chapter 7. Conclusions and outlook
	Appendix A. Supplementary Figures
	Appendix B. Supplementary Tables
	Bibliography