Janus II: a new generation application-driven computer for spin-system simulations

M. Baity-Jesi^{a,b,c}, R. A. Baños^{b,d}, A. Cruz^{d,b}, L. A. Fernandez^{a,b}, J. M. Gil-Narvion^{b}, A. Gordillo-Guerrero^{e,b}, D. Iñiguez^{b,k}, A. Maiorano^{c,b}, F. Mantovani^{f,1}, E. Marinari^{g}, V. Martin-Mayor^{a,b}, J. Monforte-Garcia^{b,d}, A. Muñoz Sudupe^{a}, D. Navarro^{h}, G. Parisi^{g}, S. Perez-Gaviro^{b,k}, M. Pivanti^{f}, F. Ricci-Tersenghi^{g}, J. J. Ruiz-Lorenzo^{i,b}, S. F. Schifano^{j}, B. Seoane^{c,b}, A. Tarancon^{d,b}, R. Tripiccione^{f}, D. Yllanes^{c,b}

a Departamento de Física Teórica I, Universidad Complutense, 28040 Madrid, Spain.
b Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50009 Zaragoza, Spain.
c Dipartimento di Fisica, Università di Roma "La Sapienza", 00185 Roma, Italy.
d Departamento de Física Teórica, Universidad de Zaragoza, 50009 Zaragoza, Spain.
e D. de Ingeniería Eléctrica, Electrónica y Automática, U. de Extremadura, 10071 Cáceres, Spain.
f Dipartimento di Fisica e Scienze della Terra, Università di Ferrara, and INFN, 44100 Ferrara, Italy.
g Dipartimento di Fisica, IPCF-CNR, UOS Roma Kerberos and INFN, Università di Roma "La Sapienza", 00185 Roma, Italy.
h D. de Ingeniería, Electrónica y Comunicaciones and I3A, U. de Zaragoza, 50009 Zaragoza, Spain.
i Departamento de Física, Universidad de Extremadura, 06071 Badajoz, Spain.
j Dipartimento di Matematica e Informatica, Università di Ferrara, and INFN, 44100 Ferrara, Italy.
k Fundación ARAID, Diputación General de Aragón, Zaragoza, Spain.

1 Now at Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain.

Abstract

This paper describes the architecture, the development and the implementation of Janus II, a new generation application-driven number cruncher optimized for Monte Carlo simulations of spin systems (mainly spin glasses). This domain of computational physics is a recognized grand challenge of high-performance computing: the resources necessary to study in detail theoretical models that can make contact with experimental data are far beyond those available using commodity computer systems. On the other hand, several specific features of the associated algorithms suggest that unconventional computer architectures – which can be implemented with available electronics technologies – may lead to order-of-magnitude increases in performance, reducing to acceptable values on human scales the time needed to carry out simulation campaigns that would take centuries on commercially available machines. Janus II is one such machine, recently developed and commissioned, that builds upon and improves on the successful JANUS machine, which has been used for physics since 2008 and is still in operation today. This paper describes in detail the motivations behind the project, the computational requirements, the architecture and the implementation of this new machine, and compares its expected performance with that of currently available commercial systems.

Keywords: Spin glass, Monte Carlo, Application-driven computers, FPGA computing

Preprint submitted to Computer Physics Communications, October 4, 2013.

1. Overview

Understanding glassy behavior is a major challenge in condensed matter physics (see for instance Refs. [1, 2]). Glasses are materials that do not reach thermal equilibrium on macroscopic time scales (e.g., years): bulk material properties of a macroscopic sample, such as the compliance modulus or the specific heat, change in time even if the sample is kept for days (or years) at constant experimental conditions.
This sluggish dynamics is a major obstacle for the theoretical and experimental investigation of glasses.

Spin glasses, usually regarded as prototypical glassy systems (or, more generally, prototypical complex systems), have been extensively studied theoretically; over the years this theoretical work has been widely supported by numerical simulations, mostly using Monte Carlo techniques. The Monte Carlo simulation of spin glass systems is a recognized grand challenge of computing, as it requires inordinately large resources and, at the same time, has number-crunching requirements largely at variance with mainstream computer developments.

In a typical spin-glass model (see later for a detailed description), the dynamical variables, the spins, are discrete and sit at the nodes of discrete D-dimensional lattices. In order to make contact with experiments, we may want to follow the evolution of a large enough 3D lattice, say 100^3 sites, for time periods of the order of 1 second. One Monte Carlo sweep (MCS) – the update of all the spins in the lattice – roughly corresponds to a time scale of 10^{-12} seconds for a real sample, so we need some 10^{12} such steps, that is, 10^{18} spin updates. Also, in order to properly account for disorder, we have to collect statistics on several (e.g., O(100)) copies of the system, adding up to 10^{20} Monte Carlo spin updates. One easily reckons that one needs a computer able to process on average one spin in 1 picosecond or less in order to carry out this simulation program within reasonable human timescales (say, less than one year).
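To make the last statement concrete, the budget implied by the figures just quoted can be spelled out as a back-of-the-envelope check:

```latex
10^{12}\ \mathrm{MCS} \times 100^3\ \text{spins} = 10^{18}\ \text{spin updates per sample},\\
10^{18} \times O(100)\ \text{samples} \simeq 10^{20}\ \text{spin updates in total},\\
10^{20} \times 1\ \mathrm{ps/update} = 10^{8}\ \mathrm{s} \approx 3\ \text{years}.
```

A one-year campaign therefore calls for average update times somewhat below one picosecond, or for several samples processed concurrently.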
The algorithms associated with the Monte Carlo simulation of spin glasses have several properties that – in principle – open the way to very efficient processing. First, as already noted, the degrees of freedom of several widely studied spin glass models are discrete, and their values can be mapped onto a small number of bits (just one, for several popular models); discrete bit-valued variables are operated upon with simple logic (as opposed to arithmetic) operations, which can be performed by just a few logic gates. Second, it is easy to identify a very large amount of parallelism in the required computation, as one can concurrently process spins that do not interact directly with one another.

Virtually no commercially available computer is able to exploit these properties in full; indeed, processors are optimized for arithmetic (integer or – even worse – floating-point) operations, each such operation requiring a large number of logic gates. The added burden of performing logic operations on hardware structures optimized for arithmetic also severely limits the amount of parallel computation that each processor is able to support.

On the other hand, these features, if consistently exploited, open the way to a conceptually simple and efficient application-driven computing architecture, carefully optimized for spin glass simulations, that promises huge performance advantages. Application-oriented systems have been used in many cases in computational physics, not only for spin-system simulations but also in Lattice QCD [3] and for the simulation of gravitationally coupled [4] and biological [5] systems.

Application-driven number crunchers for spin systems have a long history: the pioneering work by Pearson and Richardson [6] in the late 70s was followed by that of Ogielski and Condon [7] in the 80s; these early attempts were followed by the SUE project [8] and more recently by JANUS [9, 10, 11], of which the work described in this paper is the natural evolution. SUE and JANUS acknowledge that an optimal architecture for spin simulators requires a dedicated processor architecture, and use Field Programmable Gate Arrays (FPGAs) as the enabling technology to implement that architecture. FPGAs are integrated circuits that can be configured at will after they have been assembled in an electronic system.

In the last ten years, dedicated spin-glass crunchers, with their order-of-magnitude better performance than that available with traditional computers, have been instrumental in reaching several key results in spin glass physics; the most recent such machine, JANUS, commissioned in 2008 and still in operation today, has indeed made it possible to establish several new results (see later for details).

In the same time frame, several innovative developments in mainstream computer architecture – including many-core processors and GPUs – have made it possible to develop increasingly parallel implementations of Monte Carlo algorithms for spin systems, significantly boosting performance and largely closing the gap with JANUS. At the same time, and in parallel with mainstream computer systems, progress in electronics technology has also significantly boosted the level of parallelism and performance that can be harvested using FPGAs.

This background has motivated the development of Janus II, described in detail in this paper, which has the potential to provide order-of-magnitude better performance than commercial computers over a time window of at least the next five years, as well as superior energy efficiency.

Janus II is an FPGA-based massively parallel spin-glass number cruncher that architecturally builds on JANUS and improves on it in several directions: i) it uses latest-generation FPGA technology, corresponding to an order-of-magnitude increase in performance per processor; ii) it includes an improved communication interconnection among Janus II nodes, making it efficient to simulate large lattices using inter-node parallelism on top of intra-node parallelism; iii) it enlarges by two orders of magnitude the size of the memory available to the system; and iv) it tightly couples the dedicated number-cruncher nodes with traditional host computers, improving data throughput and allowing a mixed-mode operation of the system in which potentially complex control operations are handled efficiently by traditional programs. All these improvements help boost the expected performance of Janus II as a spin glass number cruncher; moreover, points iii) and iv) above enlarge the class of applications for which Janus II is a potentially efficient computer: while the project is still mainly motivated as a spin glass simulator, we expect interesting results in such diverse areas as graph theory, cryptography or the simulation of VLSI circuits.

This paper is structured in the following way: after this section, we present a short introduction to spin glass models and to the Monte Carlo techniques used to simulate them; the paper continues with a description of the Janus II architecture, which closely matches the requirements outlined in the preceding section.
A section on the programming and development environment available for this machine follows, which also contains some performance figures. This is followed by a section that – building on the expected performance of the machine – identifies several important questions in spin glass physics accessible to Janus II that were not within reach of JANUS. The following section compares Janus II performance with that of currently available computers and tries to forecast the extent of the window of opportunity of our new machine. The paper ends with some concluding remarks.

2. Spin Glass models

Both JANUS and Janus II have been designed from scratch to optimize their performance for a specific application: the Monte Carlo simulation of spin glasses. In this section we review the spin glass models that we want to study with this machine.

Spin glasses are disordered magnetic alloys whose low-temperature phase is a frozen disordered state, rather than the uniform patterns one finds in more conventional magnetic systems [12, 13]. They are important because they are widely regarded as the simplest possible model of a complex system. In fact, as we will see below, spin glass models are extremely simple to define. In spite of this, finding the lowest-energy configuration of a three-dimensional Ising spin glass is an NP-hard problem [14]. The main ingredients that make the problem so hard are randomness in the interactions and frustration. By frustration we refer to the impossibility of simultaneously satisfying all the demands that the interactions pose on individual spins.

One of the most famous families of spin-glass models was proposed by Edwards and Anderson [15] in the 70s. They consider a regular lattice and define spins sitting at the lattice nodes. Spins are unit-length vectors of n components: \vec{S}_i = (S_{i,1}, S_{i,2}, \ldots, S_{i,n}), with \vec{S}_i \cdot \vec{S}_i = 1. The interaction energy is

    H = -\sum_{i,j} J_{ij}\, \vec{S}_i \cdot \vec{S}_j ,    (1)

where the indices i and j run over the nodes of the lattice. The coupling constants J_{ij} are chosen randomly (quenched disorder). They are statistically independent and identically distributed. We shall be mostly concerned with short-ranged interactions (i.e., J_{ij} vanishes unless sites i and j are lattice nearest neighbors). An instance of the coupling constants {J_{ij}} defines a sample of the physical system.

The number n of components of the spins is also important. Some cases have special names: Heisenberg (n = 3), XY (n = 2) and Ising (n = 1). The Ising spin glass model, S_i = ±1, with short-range interactions (a prototypical material is Fe_{0.5}Mn_{0.5}TiO_3) has deserved special scientific attention for decades; its energy function reads

    H = -\sum_{\langle i,j \rangle} J_{ij} S_i S_j ,    (2)

where \langle i,j \rangle indicates that the sum runs only over nearest neighbors in the lattice. Since the spin S_i is a binary variable, it can be coded on just one bit. Computational opportunities arise from this simplicity, and we aim to explore some of them.

The goal of the game is to obtain assignments of the spin variables {S_i} – named configurations – statistically distributed according to the Boltzmann weight at temperature T:

    P_B({S_i}) = exp[-H({S_i})/T] / Z ,    Z = \sum_{{S_i}} exp[-H({S_i})/T] .    (3)

One may try to achieve this goal by means of Markov Chain Monte Carlo simulations (see, for instance, [16, 17, 18] for a detailed introduction). In principle, one just needs to implement some dynamics fulfilling detailed balance and run it for a long enough time.
However, in the limit of vanishing temperature, the Boltzmann weight goes to zero unless the spin configuration is a ground state of the system (a lowest-energy configuration). Since finding ground states for a typical three-dimensional spin-glass sample is an NP-hard problem, something must go wrong with our simple-minded strategy. The problem lies in the length of the simulation: the autocorrelation times of the Markov chain become inordinately large at low T; the simulation gets trapped for a long time in some of the many local minima of the energy (2). It is maybe worth mentioning that physical spin glasses (such as Ag_xMn_{1-x}, for instance) suffer from the same problem: the system does not reach thermal equilibrium even if it is allowed to evolve under constant laboratory conditions for hours, or even days.

Finding equilibrium configurations for a single sample {J_{ij}} is only half of the problem. In order to obtain physically meaningful answers one needs to average the thermal mean values [i.e., the mean values corresponding to the Boltzmann weight (3) of a given sample] over a fair number of samples [i.e., to perform the average over the quenched disorder]. The meaning of fair depends very much on the physical questions that one asks and on the lattice size: it may range from less than one hundred samples to maybe 10^5 samples.

The dynamics that implement our Markov Chain Monte Carlo on a given sample at temperature T are pretty standard: Metropolis or heat bath. For instance, the Metropolis procedure for the Ising spin glass starts from an arbitrary initial configuration and generates new configurations by picking one spin in the lattice (S_i) and tentatively flipping it. One then computes the energy difference ΔE associated with this tentative change, ΔE = 2 \sum_j J_{ij} S_i S_j [where j runs over all the nearest neighbors of site i]. If ΔE ≤ 0 the tentative flip is accepted and the algorithm moves to another lattice site. If, on the other hand, ΔE > 0, the tentative flip is accepted with probability e^{-βΔE} (β = 1/T is the inverse of the system temperature).
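The Metropolis step just described is compact enough to be spelled out in code. The following C sketch, for the 3D Edwards-Anderson model of Eq. (2), is purely illustrative: the data layout, the neighbor tables and the helper urand01() are our own assumptions, not the machine implementation discussed in Sec. 4.

```c
/* Minimal sketch of one Metropolis update for the 3D Edwards-Anderson
 * Ising spin glass, Eq. (2).  Array layout and the placeholder
 * random-number generator are illustrative assumptions. */
#include <math.h>
#include <stdlib.h>

#define L 16
#define N (L * L * L)

int    S[N];        /* spins, +1 or -1                       */
double J[N][6];     /* couplings to the 6 nearest neighbors  */
int    nbr[N][6];   /* neighbor indices (periodic lattice)   */

static double urand01(void)        /* uniform deviate in [0,1) */
{
    return rand() / (RAND_MAX + 1.0);
}

void metropolis_site(int i, double beta)
{
    /* energy change of flipping S[i]: dE = 2 S_i sum_j J_ij S_j */
    double h = 0.0;
    for (int k = 0; k < 6; k++)
        h += J[i][k] * S[nbr[i][k]];
    double dE = 2.0 * S[i] * h;

    /* accept if dE <= 0, otherwise with probability exp(-beta dE) */
    if (dE <= 0.0 || urand01() < exp(-beta * dE))
        S[i] = -S[i];
}
```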
One easily identifies a large degree of parallelism, as one can apply the procedure in parallel to any subset of spins that do not share a coupling term in the energy function (so that all ΔE terms are computed correctly): one usually partitions the lattice as a checkerboard and applies the algorithm first to all black sites and then to all white ones, corresponding to an available parallelism of degree L^D/2. In principle, we may schedule one full Monte Carlo sweep (MCS, the application of the algorithm to all sites of the lattice), for any lattice size, in just two computational steps, if enough computational resources are available.

Simulations at constant temperature are not up to the task if one wants to produce a thermalized set of configurations at low temperature. One then resorts to the parallel tempering (PT) algorithm [19]. We consider N_T temperatures T_1 < T_2 < T_3 < ... < T_{N_T}. For each temperature we consider a statistically independent spin configuration {S_{i,a}}, with a = 1, 2, ..., N_T:

    P_B({S_{i,a=1}}, {S_{i,a=2}}, ..., {S_{i,a=N_T}}) = \prod_{a=1}^{N_T} exp[-H({S_{i,a}})/T_a] / Z(T_a) .    (4)

Each of the N_T systems is independently simulated at its own temperature by means of one of the standard algorithms. However, every n_PT constant-T sweeps, one performs a parallel tempering step. The elementary parallel-tempering step is the attempted exchange of the configurations at two consecutive temperatures T_a and T_{a+1}. The configuration exchange is accepted with the Metropolis probability

    Prob_PT = min[ 1, exp( -H({S_{i,a+1}})/T_a - H({S_{i,a}})/T_{a+1} ) / exp( -H({S_{i,a}})/T_a - H({S_{i,a+1}})/T_{a+1} ) ] .    (5)

One attempts to exchange configurations towards ascending temperatures (so that, in principle, the configuration at the lowest temperature could reach the highest temperature in just one PT step). The rationale behind the parallel tempering algorithm is simple: if a configuration trapped in a local minimum is raised to a high enough temperature, it will be able to escape thanks to a thermal fluctuation.

Parallel tempering has several tunable parameters. First, the set of temperatures {T_a}_{a=1}^{N_T} should be such that the acceptance probability (5) is reasonable (say, ∼ 10%). This requires a relatively small temperature spacing. On the other hand, the largest temperature T_{N_T} should be high enough to ensure quick equilibration by means of the constant-T algorithm. One has to reach a compromise between these conflicting goals since, the larger N_T, the larger the needed computational resources. The parameter that controls the parallel tempering frequency, n_PT, can be tuned as well. In our experience, the algorithmic performance depends on n_PT only slightly. This is fortunate, because the parallel tempering step necessarily breaks parallelism. One may diminish its frequency (by increasing n_PT), although some tradeoff must also be found.
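In code, the temperature-exchange step is equally simple: Eq. (5) reduces to min[1, exp(Δβ ΔE)] with Δβ = 1/T_a - 1/T_{a+1} and ΔE = E_a - E_{a+1}. A minimal C sketch follows (the array layout is again an illustrative assumption; urand01() is the placeholder generator of the previous sketch):

```c
/* One parallel-tempering sweep over NT replicas, exchanging towards
 * ascending temperatures.  beta[a] = 1/T[a] is sorted so that a = 0
 * is the lowest temperature; E[a] is the current energy of the
 * configuration sitting at temperature slot a; replica_of[a] records
 * which configuration that is. */
#include <math.h>

extern double urand01(void);

void pt_sweep(int NT, const double beta[], double E[], int replica_of[])
{
    for (int a = 0; a < NT - 1; a++) {
        double dbeta = beta[a] - beta[a + 1];   /* > 0 by construction */
        double dE    = E[a] - E[a + 1];
        /* Eq. (5): accept with probability min(1, exp(dbeta * dE)) */
        if (dE >= 0.0 || urand01() < exp(dbeta * dE)) {
            int t = replica_of[a];
            replica_of[a] = replica_of[a + 1];
            replica_of[a + 1] = t;
            double e = E[a];  E[a] = E[a + 1];  E[a + 1] = e;
        }
    }
}
```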
Let us finally mention that one may extend the Edwards-Anderson model by including an external, site-dependent magnetic field h_i:

    H = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j - \sum_i h_i s_i .    (6)

In this case, a sample is defined by the set of coupling constants {J_{ij}, h_i}. The addition of the local magnetic fields h_i does not add any real complication to the numerical simulation, and it has the advantage of enlarging the set of problems that can be considered. Examples are the random-field Ising model (RFIM) and the diluted antiferromagnet in a field (DAFF) [12]. A further extension consists in considering integer-valued spins s_i = 1, 2, ..., Q (the so-called Q-state Potts model), which can be formulated in a similar way (see the chapter by Binder in Ref. [12]).

3. Janus II architecture

The Janus II architectural concept and its implementation follow directly from its predecessor (JANUS), built and commissioned in 2008 and still in operation today. The main guiding principle behind the old and new JANUS architectures is the attempt to leverage state-of-the-art electronics technology in order to: i) exploit the huge parallelism available in the simulation of one spin glass (SG) system to speed up the Monte Carlo evolution of that system; ii) simulate in parallel a relatively large number of system samples; and iii) connect as tightly as possible the dedicated, massively parallel number-crunching array with a traditional host computer system, so that complex and non-parallelizable computing functions (e.g., the proper handling of the parallel tempering temperature exchange) are performed with as little impact on global performance as possible.

The simulation of most SG models implies a mix of logic operations on bits (as opposed to arithmetic operations on long data words). Since virtually all commercially available computer architectures focus on arithmetic operations, they are conceptually a poor option for SG simulations. An optimal choice would be to hardwire all the logic gates that can be fabricated on one silicon die so that they perform exactly the set of required logic operations, developing a fully customized integrated circuit. This is possible in principle (integrated circuits designed to perform a specific function are called Application Specific Integrated Circuits, or ASICs), but the time and costs associated with their development rule out this option. We therefore choose the second-best option and adopt Field Programmable Gate Arrays (FPGAs) as the basic building block for Janus II. FPGAs are integrated circuits whose logic gates can be connected at will in order to perform a specific set of logic functions. FPGA configuration is a simple process that can be repeated at will, so the same FPGA can be used for widely different logic functions. Currently available FPGAs have hundreds of thousands of so-called logic cells, each able to perform any logic operation on several bits; equally important, FPGAs come with several tens of Mbit of embedded memory.

Figure 1: Architecture of the Janus II Processing Board (PB). The array of 16 FPGA-based Simulation Processors (SPs, right) is connected by a 2D (x and y) toroidal network. All SPs have an additional independent connection to the IOP processor; the latter is part of the CP complex, which includes a commodity PC (adopting the COM form factor) and runs the Linux operating system; the CP has Gbit-Ethernet and Infiniband networking ports to the external world. Additional high-speed connections are available for a tight coupling to other PBs in the z direction.

The overall architecture of Janus II is the parallel structure shown in Figure 1. The basic processing element of the system is the Simulation Processor (SP), whose computational structure is fully based on just one FPGA device. Each SP includes one Xilinx Virtex-7 XC7VX485T FPGA and two banks of DDR3 memory of 8 GByte each. The choice of this FPGA was based mainly on cost and availability considerations for this specific device. The selected FPGA has some 485000 logic cells and includes ∼ 32 Mbit of embedded memory. As shown later in detail, we expect to embed within each SP more than 2000 spin-flip engines, each updating one spin (all of the same color in a checkerboard structure) in one clock cycle. This corresponds to an average update rate of one spin every 2.5 ps (with a conservative clock frequency of 200 MHz).
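The quoted update rate follows directly from the number of engines and the clock frequency (the relation is formalized as Eq. (8) below):

```latex
T_{\mathrm{spin}} = \frac{1}{n_p\, f}
                  = \frac{1}{2000 \times 200\,\mathrm{MHz}} = 2.5\,\mathrm{ps}.
```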
A set of 16 SPs is mounted onto a Processing Board (PB); the SPs of each PB are logically assembled at the nodes of a 4 × 4 array. Each SP in the array has direct point-to-point bidirectional links with its 4 nearest neighbors; toroidal boundary conditions are applied. Each logical link is engineered as 8 physical links that we expect to operate at a bandwidth in the range from 3 to 5 Gbit/s.

All SPs belonging to each PB are directly connected to and controlled by a Control Processor (CP). The CP is a full-fledged computer, running the Linux operating system. The CP plays several roles in the Janus II system: first, it is able to configure the FPGAs of the SPs, so that they perform the desired logic operations; second, it moves data from/to the SPs, so that – for instance – initial data can be loaded onto the SPs and the results of a simulation can go back to the CP. Finally, the CP controls the operation of all SPs, e.g. starting a simulation program, monitoring their status, collecting results, executing those parts of the global computation that cannot be offloaded to the SPs and handling any errors.

The CP uses a commercially available Computer-on-Module (COM) system, based on an Intel Core i7 processor running at 2.2 GHz; it connects via the PCIe interface to a so-called Input-Output Processor (IOP), built inside yet another FPGA; the IOP actually manages all connections to all SPs, using a set of dedicated bidirectional high-speed links (one to each SP) running at ∼ 3 Gbit/s and a small number of dedicated control and status lines. The IOP formats and appropriately routes the data in transit from the CPU to the SPs, controls the configuration procedure of all SPs, controls their operation and monitors their status. Since the IOP is itself a configurable unit, we are considering using it – on a longer time scale – for additional computational/communication tasks; for instance, the IOP might support a full crossbar switch among all SPs, or directly handle the temperature-exchange phase of a PT algorithm distributed over several SPs.

The CP is the main architectural improvement of Janus II with respect to its predecessor: JANUS only had a Gigabit Ethernet link between a set of SPs and an external computer; the new arrangement increases the available bandwidth between the SP array and the host to 4 GByte/s (a factor ∼ 40× larger than in the previous system) and reduces the communication latency from ∼ 15 µs to ∼ 1 µs. A much more tightly coupled operation of the SP array becomes possible, allowing a simulation program to be split more finely between the control CPU and the SP array.

The combination of one CP and 16 SPs is the basic functional block of a Janus II system. All these components are assembled inside a box that also contains the power supplies and the forced-air cooling system. This module operates as an independent computing system and can be networked with other Janus II boxes and with traditional computers via Ethernet and Infiniband interfaces. A Janus II installation can be made of any number N of Janus II boxes; the boxes can be used as logically independent systems, running simulations of different physical systems, or the whole set can operate as just one larger system; in the latter case, the machine can be seen as a 3D structure of 4 × 4 × N SPs. Bidirectional links are in fact available on each SP to build the interconnection structure in the third dimension.

The project – at the present stage – has already assembled and tested a system with 16 Janus II boxes, installed at BIFI in Zaragoza. The Janus II team worked on the conceptual design of the system architecture, while our industrial partner – Link Engineering Srl, Bologna (Italy) [20] – carried out the detailed engineering design and the actual construction of the prototype and of the presently available system. Fig. 2 shows an SP module, while Fig. 3 shows a Janus II box. Fig. 4 is a partial close-up view of the fully assembled system.
4. Structuring and programming a spin glass simulation on Janus II

A Janus II program is a combination of a standard C program, running on the CP, and a computational kernel, running on one or more appropriately configured SPs and operating on data moved to the SPs by the CP-resident program.

Figure 2: Pictures of a Janus II SP module; the picture at left has a small heat radiator, providing a complete view of all components; the picture at right shows the large heat radiator needed to allow high-frequency operation of the machine.

This programming style is similar to the one usually adopted in processing systems that include some form of co-processor or accelerator: a perhaps familiar example is GPU programming, where the host processor sets up all required data structures, initializes data values and controls the outer loops of the program, while the computationally heavy kernels run on the GPU. The main difference is of course that, while GPUs execute a program written in an appropriate programming language (e.g., CUDA or OpenCL), the SPs in Janus II run the hardwired sequence of operations implied by the configured FPGA. Several development environments are available to assist in configuring FPGAs; we use VHDL, a relatively low-level language that requires a detailed description of the structures that store data, of the operations performed on data and of the instruction control: our experience shows, however, that only this low-level, largely handcrafted approach guarantees the high performance that we look for.

From the perspective outlined in the previous paragraph, Janus II might be seen as a (possibly exotic) general-purpose computer; however, the main driving force behind the project is of course that one expects outstanding performance when the SPs are configured for spin glass Monte Carlo simulations. Still, the fact that the Janus II processing elements can be configured in arbitrary ways keeps the door open for other uses of this machine.

The simplest operation mode for Janus II will be the one already adopted for JANUS: each SP performs a full Monte Carlo simulation of one SG system, while different replicas of the system, or physical systems at different temperatures, are assigned to several SPs.

Figure 3: Picture of a Janus II box; there are 16 SP modules (plugged vertically onto the printed circuit board), while the CP module is at the center of the structure; at left one sees the cooling fans and the power supplies.

The update engine for one lattice site has a very simple structure. We consider again, for definiteness, the Ising spin glass in 3D; one maps the spins and couplings onto bit-valued ({0,1}) variables:

    S_k → σ_k = (1 + S_k)/2 ,    J_{km} → j_{km} = (1 + J_{km})/2 .    (7)

Once this is done, the evaluation of ΔE = 2 \sum_j J_{ij} S_i S_j only implies 6 bit-wise logic XOR functions (replacing the products J_{ij} S_j), followed by an arithmetic sum of just six bit-valued operands. The result can be seen as a pointer into a small look-up table where the corresponding pre-computed values of e^{-βΔE} are stored. At this point, one arithmetically compares the value of the selected table entry with a freshly generated random number: according to the outcome of the comparison, the previous value of the spin is left unchanged or the flipped value is written to memory.
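In software, the whole per-site engine reads as follows. This C sketch mirrors the logic just described (six XORs, a six-operand sum, a look-up table, one comparison); the names, array layout and the prng32() helper are illustrative assumptions, and on the real machine the same data path is of course synthesized in VHDL rather than compiled:

```c
/* Sketch of the one-site update engine on the {0,1}-coded variables
 * of Eq. (7).  A bond k is satisfied (J_ij S_i S_j = +1) exactly when
 * jcpl ^ sigma_i ^ sigma_j == 1, so the count s of satisfied bonds
 * gives dE = 4s - 12.  lut[s] holds 2^32 * min(1, exp(-beta*(4s-12))),
 * saturated to UINT32_MAX for dE <= 0 so those flips are always taken. */
#include <stdint.h>

extern uint8_t  sigma[];      /* one spin bit per site            */
extern uint8_t  jcpl[][6];    /* one coupling bit per link        */
extern int      nbr[][6];     /* neighbor indices                 */
extern uint32_t lut[7];       /* pre-computed acceptance table    */
extern uint32_t prng32(void); /* 32-bit random number             */

void update_site(int i)
{
    int s = 0;                           /* satisfied-bond counter */
    for (int k = 0; k < 6; k++)
        s += jcpl[i][k] ^ sigma[i] ^ sigma[nbr[i][k]];

    if (prng32() < lut[s])               /* one comparison         */
        sigma[i] ^= 1;                   /* write flipped value    */
}
```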
The required sequence of operations is similar for more complex spin glass models or different Monte Carlo algorithms: different and possibly more complex logic manipulations may be needed; in most cases, the generation of pseudo-random numbers remains the most complex operation. On JANUS we were able to implement ∼ 1000 such basic engines in each FPGA, using the Parisi-Rapuano [21] generator. With Janus II we plan to double this number and to increase the operating frequency by a factor of 4. Under these conditions, the estimated power consumption of each SP – based on data made available by Xilinx – is between ∼ 25 and 30 Watts.
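For reference, the Parisi-Rapuano generator admits a compact software model. The sketch below follows the form commonly used in spin-glass codes – a 32-bit lagged-Fibonacci sum with lags 24 and 55, XOR-ed with the value 61 steps back, which accounts for the three 32-bit reads per random number mentioned below; seeding is omitted, and the details should be checked against Ref. [21]:

```c
/* Software model of the Parisi-Rapuano wheel [21], as commonly
 * implemented: x(k) = x(k-24) + x(k-55) (mod 2^32), output
 * x(k) ^ x(k-61).  The ring buffer must hold at least 62 entries;
 * seeding the wheel is left out of this sketch. */
#include <stdint.h>

#define PR_LEN 62

static uint32_t wheel[PR_LEN];   /* seeded elsewhere          */
static int      ip = 0;          /* slot about to be written  */

static uint32_t pr_next(void)
{
    uint32_t x = wheel[(ip + PR_LEN - 24) % PR_LEN]
               + wheel[(ip + PR_LEN - 55) % PR_LEN]; /* wraps mod 2^32 */
    uint32_t out = x ^ wheel[(ip + PR_LEN - 61) % PR_LEN];
    wheel[ip] = x;
    ip = (ip + 1) % PR_LEN;
    return out;
}
```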
Figure 4: Close-up view of the Janus II machine installed at BIFI (Zaragoza). The installation has 16 Janus II boxes (12 are visible in the picture). The cables supporting the data links in the z direction are mounted in loop-back mode for test purposes.

One should notice that processing each spin implies reading 13 bits, writing one bit of result (the new value of S_i) and reading a few 32-bit numbers (3 for the Parisi-Rapuano generator) to compute the next element in the sequence of random numbers. One quickly evaluates the overall memory traffic for 2000 spin-processing elements running at 200 MHz to be in excess of 4 TByte/s, orders of magnitude beyond the bandwidth available with the large memory banks outside the FPGA. The needed bandwidth is, on the other hand, available using the large number of memory blocks embedded inside our FPGAs; a rather complex memory allocation scheme that matches our requirements and can be efficiently implemented within the FPGA was devised for JANUS [10] and can be carried over directly to Janus II. This requires, however, that all data items required by the program fit inside the available on-chip memory. In our case, the size of the FPGA embedded memory is ∼ 32 Mbit, so we are able to handle 3D lattices with L < 200, taking into account that each lattice site needs 4 bits of data. Alternatively, one can squeeze 30 copies of a lattice of size 64^3 inside each SP, making it possible to run a large parallel tempering protocol on one or two SPs. In this case, the CP would collect the energies of the lattices at all temperatures {T_a} after n_PT Monte Carlo steps, re-assign temperatures according to Eq. (5) and start a new iteration.

If one wants to simulate larger lattices, all SPs can be used concurrently: under the same assumptions as above, the 16 SPs in one Janus II box are able to handle a 3D lattice with L ≈ 500, and even larger lattices fit the complete array of 16 boxes; in this case, the lattice is partitioned across the SPs in 1D or 2D slices, and the data associated with abutting faces of the sub-lattices are moved across SPs on the appropriate data links.

A combination of the strategies discussed above produces extremely high computing performance on Janus II. As discussed, we can partition the lattice on several SPs, slicing along one dimension. The average time to process one spin on each processor is

    T_spin = 1/(n_p f) ,    (8)

where n_p (n_p ∼ 2000) is the number of update cores available on each SP and f is the SP operating frequency, expected in the range from 125 to 250 MHz. If we partition our lattice over P processors (e.g., P = 16), the aggregate mean spin-update time is

    T_global = 1/(n_p f P) ,    (9)

corresponding to a T_global from 0.125 to 0.25 ps in our frequency range.

In order to sustain these processing rates, the node-to-node communication harness must provide a matching communication bandwidth: during the time in which one SP updates all the spins of its sub-lattice, we must move the data associated with the spin configuration of one face of the lattice from one SP to its neighbor. Each SP sweeps all the spins of its sub-lattice in a time

    T_lat = (1/(n_p f)) (L^3/P) .    (10)

The communication harness must move the data belonging to one 2D face of the lattice in the same amount of time (this is just one bit per site on the surface); assuming the network has n_l lanes, each with a communication bandwidth of f_c bit/s, we have

    T_dat = L^2/(n_l f_c) = [L^2/(n_l (f_c/f))] (1/f) .    (11)

Figure 5: Estimates of the computing time (T_lat(L), red) and the SP-to-SP communication time (T_com(L), blue) as a function of the lattice size L, assuming that the full lattice is split into 16 strips, each assigned to one SP within a Janus II box. One clearly sees that communication overheads are small for lattices of size L ∼ 150 or larger and become fully negligible as soon as L ≥ 250.

Communication is not a bottleneck as long as

    (1/n_p) (L^3/P) ≥ L^2/(n_l (f_c/f)) .    (12)

Figure 5 shows the behavior of the two sides of Eq. (12) as a function of the lattice size L, with the already stated values of the parameters and f_c/f = 15 (we expect that f_c/f will be somewhere in the 12 to 20 range): we see that the communication infrastructure is powerful enough to handle lattices with L ∼ 250 or larger.
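Indeed, solving Eq. (12) for L with n_p = 2000 and P = 16 as above, n_l = 8 (the eight physical lanes per logical link described in Sec. 3; an assumption on our part) and f_c/f = 15 reproduces this threshold:

```latex
L \;\ge\; \frac{n_p\, P}{n_l\,(f_c/f)}
  \;=\; \frac{2000 \times 16}{8 \times 15} \;\approx\; 267 .
```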
Let us consider a lattice that is very large by the current state of the art (e.g., L = 500); from either Eq. (9) or Eq. (10) one finds that the processing time for one sweep of the whole lattice is of the order of T_proc from 15 to 30 µs; in this simulation campaign, each Janus II box would run an independent replica of the system, so in one year of operation one can hope to follow ∼ 10 replicas of this very large system, at 3 or 4 values of the temperature, for several 10^{11} Monte Carlo steps.

5. Janus II impact on spin-glass simulations

To a large extent, Janus II is a follow-up of JANUS, which has been a major player in the field of spin glasses during the last five years [22, 23, 24, 25, 26, 27, 28, 29, 30]. Hence, it is natural to ask which important physics questions are accessible to Janus II that were not within reach of JANUS.

In the previous sections we have estimated that the computing power available from one SP in Janus II is roughly 10× larger than that available with JANUS. The (on-board) available memory is also 10× larger and, last but not least, SP-to-SP communications make it possible to efficiently simulate SG samples on just one SP or on a collection of SPs, allowing flexible ways to trade the simulation speed of one sample against the concurrent simulation of several samples.

With these figures in mind, a rather blunt comparison with JANUS would be as follows. The total number of spin updates in a simulation campaign is

    N_spin-flips = N_T × N_spins × N_MCS × N_samples ,    (13)

where N_T is the number of temperatures at which we simulate, N_spins is the number of spins in the simulated lattice (i.e., in D spatial dimensions, for a lattice of size L, N_spins = L^D), N_MCS is the number of full-lattice updates performed for a single sample and N_samples is the number of independent samples in the simulation. As we said above, for a given wall-clock time, on Janus II the l.h.s. of Eq. (13) will be roughly ten times larger than on JANUS.

In fact, depending on the setup and the goals of the simulation campaign, with Janus II we can select which of the factors in (13) we want to increase by 10×, or we can decide to spread the total gain over two or more such factors. In addition, thanks to the improved communications, it is possible to spread the simulation of a single sample over several FPGAs, thus further increasing N_spins or N_MCS at the cost of reducing N_samples. It turns out that increasing by one order of magnitude either N_spins or N_MCS or N_samples opens new opportunity windows.

Roughly speaking, typical SG simulations come in two flavors: non-equilibrium and equilibrium. Surprisingly enough, the two turn out to be complementary [26].

In non-equilibrium simulations one tries to analyze the relaxation processes that take place in experimental spin glasses such as CuMn. Below their glass temperature, such materials never reach thermal equilibrium. Hence, one should perform simulations at a single temperature (i.e., N_T = 1), with a dynamic rule, such as Metropolis or heat bath, that tries to mimic the real spin dynamics. These simulations should be as long as possible (i.e., N_MCS should be large), and the system size (i.e., N_spins) should be large enough to ensure that thermal equilibrium is never approached. The only good news is that the number of samples can be moderate, N_samples ∼ 100, because most of the quantities that one computes are self-averaging (i.e., their sample-to-sample fluctuations tend to zero as 1/N_spins^a, with a ≈ 1/2).

On the other hand, we have equilibrium simulations. Here we need to approach the equilibrium distribution, Eq. (3). We are not tied to any physical dynamics: any trick that one may invent is acceptable, provided that it verifies the balance condition [18]. In particular, we may employ the parallel tempering algorithm explained in Sec. 2, which requires N_T ∼ 40. As one may easily guess, the larger the system size, the more valuable the physical information obtained from the simulation. Unfortunately, the efficiency of parallel tempering is rather moderate: JANUS established a world record by equilibrating lattices with L = 32 in three dimensions [25]. Another big issue is that the interesting physical quantities are not self-averaging at equilibrium: sample-to-sample fluctuations are huge, which makes it desirable to simulate a large number of samples.

At this point we are ready to appreciate the benefits of increasing by a factor of 10 each of the individual factors on the r.h.s. of Eq. (13).

• Increasing system sizes will mostly benefit non-equilibrium simulations. Indeed, the coherence length ξ(t), the typical size of the glassy domains, grows with the simulation time as ξ(t) ∼ t^{1/z(T)}, with z(T) ≈ 6.86 T_c/T [22, 23] (we measure the time t in lattice sweeps; T_c = 1.109(10) is the critical temperature [31]). In experimental samples, ξ(t) is negligibly small compared with the system size: typical figures are L = 10^8 and ξ(t) ∼ 100 lattice spacings [32, 33]. In fact, we know that in order to stay in the non-equilibrium regime one should have L ≥ 7ξ(t) [22]. In other words, for any L there is a maximum safe simulation time t*. This t* was amply surpassed in some of the simulations performed with JANUS. Indeed, in a month of continued operation, one of the JANUS FPGAs simulated an L = 80 lattice up to t = 10^{11} (the equivalent of one tenth of a second in physical time). However, in particular close to the critical temperature, L = 80 is not large enough.
Finite-size effects were felt at t* = 10^9. Fortunately, in the same month of continued operation, Janus II will be able to reach t = 10^{11} for lattice sizes L = 180 (single FPGA), L ≃ 500 (16 FPGAs in a single board working in parallel) or L ≃ 700 (full machine). It is highly unlikely that finite-size effects will be relevant for t = 10^{11} and L ≃ 500.

• Increasing the number of samples. Previous JANUS campaigns were remarkable for the sizes of the simulated samples and the low temperatures reached. However, the number of simulated samples was typically in the range 1000-10000. Some important physical effects, however, can be traced only through rare events. Hence, an adequate investigation requires a significant boost in the number of samples (at least by a factor ∼ 10). There are at least two major problems where the sample-number issue is crucial. One is the so-called temperature chaos problem [34]. The other is the survival (or not) of the spin glass phase in the presence of an external magnetic field [35, 36, 37, 28, 29].

• Increasing the simulation time. Both equilibrium and non-equilibrium simulations may benefit from increasing N_MCS. Non-equilibrium simulations at temperatures T = 0.6 and below reached a quite modest coherence length ξ(t) at t = 10^{11} [22, 23]. Thus, extending the duration of these L = 80 simulations to t = 10^{12} will be informative while not endangering the non-equilibrium condition L ≥ 7ξ(t). Another off-equilibrium example is the dynamical study of the possible transition in the physics of the spin glass in a field in D = 3. With JANUS we were able to identify a dynamical transition, but our precision was not enough to decide definitively between several possible scenarios [29]. Extending the time window over which we follow the evolution of the system could be crucial to improve our understanding of this system.

In equilibrium simulations, one could either try to lower the reached temperature while keeping the system size fixed at L = 32, or increase the system size to L = 48 while holding fixed the lowest temperature T_min = 0.7026 [25]. By decreasing the lowest temperature, we could probe deep into the spin-glass phase to study its many intriguing features (ultrametricity, statistics of overlap distributions, temperature chaos, etc.). On the other hand, increasing the system size at fixed temperature should allow us to assess finite-size effects, and to make rare events less rare (the probability for a sample not to display a rare event is expected to go as exp[-N_spins Ω], with Ω small but positive [34]).

Finally, let us mention another frontier to be explored, namely the study of more sophisticated spin glasses. Indeed, JANUS' limited memory implied that, in practice, one was forced to consider only spin glasses with Ising spins. However, there are important problems [38, 39] that cannot be treated within this framework. Janus II should be able to simulate, at the very least, XY spins [n = 2 in Eq. (1)], maybe with some discretization. In fact, a Migdal-Kadanoff renormalization study of the discretization of the XY model by means of a clock model has recently appeared [40]. The discretization issue seems to be rather subtle and worth investigating in itself.
In short, the enhanced power of Janus II will allow us to improve our understanding of key topics in spin-glass physics that have already been investigated with JANUS (temperature chaos, ultrametricity, non-coarsening isothermal dynamics, presence of a phase transition in a field), but also to delve into new problems (more sophisticated spin-glass models, non-isothermal dynamics, etc.).

6. Performance comparison with commodity computers

When undertaking a major development project like Janus II, one should ensure that the performance gain over commodity computers is large enough to justify the effort, and that this gap can reasonably be forecast to persist for a long enough time window. In this section we compare the expected performance of Janus II with that available from several commodity systems, measured over the last few years, and try to derive reasonable forecasts for the near future.

We start by recalling that the discussion of the previous sections shows that our computational problem would be optimally suited by a super-slim processor that handles bit-valued variables. Since commodity processors have wide data words (and the current trend for recent processors is towards wider and wider vector words), efficient use of the computing resources mandates that spins and couplings of different sites of the same lattice be grouped together in the same (scalar or vector) data word and operated upon by bit-wise logic operations; this approach – which also naturally supports SIMD vectorization – is known in the literature as multi-spin coding [41, 42] (a minimal code sketch is given at the end of this discussion, below). One then maps V spins of a given sample onto the same computer word and processes these spins in parallel. In principle, V can be as large as the machine word size S, but one independent random value is needed for each spin, so, as V increases, the incremental performance gain quickly fades away. As a further optimization step, one can then process in parallel spins belonging to W independent samples (e.g., W = S/V), since just one random value can be used to process W spins belonging to independent samples, introducing a tolerable amount of sample-to-sample correlation; in the following we will say that we have a sample parallelism of degree V and a global parallelism of degree W.

The optimal trade-off for most commercial architectures is such that V is significantly smaller than S, implying that a large number W of samples is simulated concurrently. This is useful from the point of view of accumulating statistics over samples, but – we stress it once again – it in no way helps solve the key problem of speeding up the Monte Carlo dynamics of each sample; this is precisely where an application-driven architecture, for which V is O(10^3), produces its biggest dividends.

We will use performance metrics directly relevant for physics; we define the Sample spin Update Time (SUT) as the average time needed to update one spin of one lattice sample. For each SP in Janus II we have estimated in the previous section a SUT of 2 ps; for one full Janus II box working on one lattice, the SUT goes down to 0.125 ps. We also define the Global spin Update Time (GUT), appropriate when one simulation job handles several samples of the lattice at the same time; the GUT is simply defined as SUT/W. For Janus II, the GUT equals the SUT on each SP (and, when several SPs work on different samples, it can be defined as the SUT divided by the number of such SPs).
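The following C sketch illustrates the bit-packing at the heart of multi-spin coding, here with V = 1 and W = 64: the same site of 64 independent samples shares one 64-bit word, the six bond terms are XOR-ed in bulk, and the count of satisfied bonds is accumulated with bit-sliced full adders, so that all 64 samples can then be tested against a single random number. Names and layout are illustrative assumptions:

```c
/* Multi-spin coding sketch: one bit per sample (W = 64, V = 1). */
#include <stdint.h>

typedef uint64_t w64;

/* bit-sliced full adder: per-lane sum and carry of a + b + c */
static inline void fadd(w64 a, w64 b, w64 c, w64 *s, w64 *cy)
{
    *s  = a ^ b ^ c;
    *cy = (a & b) | (a & c) | (b & c);
}

/* Per-lane count (0..6) of satisfied bonds around one site.
 * si: the site's spin in all 64 samples; sn[k] and jw[k]: the k-th
 * neighbor spins and couplings.  The count, s0 + 2*s1 + 4*s2 in each
 * lane, indexes the acceptance look-up table as in Sec. 4. */
static void bond_count(w64 si, const w64 sn[6], const w64 jw[6],
                       w64 *s0, w64 *s1, w64 *s2)
{
    w64 b[6];
    for (int k = 0; k < 6; k++)
        b[k] = jw[k] ^ si ^ sn[k];      /* 1 = satisfied bond      */

    w64 p0, c0, p1, c1, cc;
    fadd(b[0], b[1], b[2], &p0, &c0);   /* first three bonds       */
    fadd(b[3], b[4], b[5], &p1, &c1);   /* last three bonds        */
    *s0 = p0 ^ p1;                      /* add the two partial     */
    cc  = p0 & p1;                      /* counts, bit-plane wise  */
    *s1 = c0 ^ c1 ^ cc;
    *s2 = (c0 & c1) | (c0 & cc) | (c1 & cc);
}
```

The acceptance mask for all 64 lanes is then built with boolean logic on the bit planes (s0, s1, s2) against thresholds pre-computed from the shared random number.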
When the JANUS project started, in early 2006, state-of-the-art commodity systems had dual-core CPUs; on those processors, carefully optimized codes had a SUT of ∼ 1000 ps and a GUT of ∼ 400 ps. In the following years, processors have changed significantly with the introduction of many-core CPUs and of general-purpose GPUs; these are better SG machines than traditional CPUs, as one maps the available parallelism onto more cores (or onto more threads, for GPUs). Over the years, we have compared [43, 44] JANUS with several multi-core systems. In Table 1 we report the best SUT measured on several processors for a simulation of a lattice of 64^3 sites. We clearly see that over the years the large performance gap of JANUS over commodity processors (e.g., the Core 2 Duo) has been significantly reduced; an interesting early example was the extremely efficient IBM Cell CPU, for which we measured a SUT of 150 ps. As of today, the best figure is offered by a 16-core Sandy Bridge system, for which the SUT is ≈ 60 ps. Processors like the Xeon Phi perform better on large lattices: for example, we have measured a SUT of 30 ps on a lattice of 128^3 sites, which improves on the Sandy Bridge performance by a factor of 2.

    System           Year   Power (W)   SUT (ps/flip)   Energy/flip (nJ/flip)
    Core 2 Duo       2007   150         1000            150
    CBE (16 cores)   2007   220         150             33
    JANUS            2008   35          16              0.56
    C1060            2009   200         720             144
    NH (8 cores)     2009   220         200             244
    C2050            2010   300         430             129
    SB (16 cores)    2012   300         60              18
    K20X             2012   300         230             69
    Xeon-Phi         2013   300         52              15.6
    Janus II         2013   25          2               0.05

Table 1: Spin-update time (SUT) of EA simulation codes on a 64^3 lattice on several architectures. CBE is a system based on the IBM Cell processor; the Tesla C1060, C2050 and K20X are NVIDIA GP-GPUs; NH (SB) are dual-socket systems based respectively on the 4-core Nehalem Xeon-5560 (8-core Sandy Bridge Xeon-E5-2680) processors; Xeon-Phi is the recently launched Intel MIC architecture. The table also shows rough estimates of the energy needed to perform all the computing steps associated with one spin flip.

Equally significant is the energy efficiency of the Janus II system; the data is shown again in Table 1, in which we display the approximate energy cost associated with the Monte Carlo update of one spin.
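To rough accuracy, the energy column of Table 1 is simply the product of the dissipated power and the spin-update time; two checks against the values quoted above:

```latex
E_{\mathrm{flip}}(\text{Janus II}) \approx 25\,\mathrm{W} \times 2\,\mathrm{ps/flip} = 0.05\,\mathrm{nJ/flip},\\
E_{\mathrm{flip}}(\text{SB})       \approx 300\,\mathrm{W} \times 60\,\mathrm{ps/flip} = 18\,\mathrm{nJ/flip}.
```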
All in all, a Janus II box will be able to simulate in parallel a large spin glass lattice more than 200 times faster than the best currently available commodity option, and using ∼ 300 times less energy.

The next obvious question one has to face when developing a custom system is how long it will keep its performance edge over commercial systems. Looking at Figure 6, which plots the data of Table 1, we see that the performance of spin-glass applications on commodity systems has increased over time following a regular trend. Conversely, application-specific projects evolve in steps, as there is no performance increase until a new generation is developed. The plot clearly shows three lines of evolution for commodity systems: they all scale according to Moore's law, with different pre-factors corresponding to different, broadly defined families of architecture.

Looking at the SUT figures for the Intel Nehalem and Sandy Bridge micro-architectures with respect to those of the Core 2 Duo processor, we clearly see an abrupt jump in the scaling behavior associated with Moore's law; we interpret this fact as the consequence of a performance jump that happened when multi-core processors were introduced, followed by regular Moore's-law behavior (compare the two Moore's-law lines in the picture).

Looking at the performance plots of the JANUS-class machines, we see that JANUS will remain competitive until the end of 2014, and Janus II comes into operation at the end of life of its predecessor; from this analysis we can reasonably look into our crystal ball and expect that Janus II should remain competitive through the year 2017. Our analysis also shows the outstanding performance of the IBM Cell processor, whose production has however been discontinued, and the poor performance of GPU-based accelerators, which suffer because they are strongly optimized for floating-point arithmetic and lack the cache systems that are crucial for this class of applications. Concerning the very recent Xeon Phi processors: in spite of very careful optimization, their performance is not better than that of a dual Sandy Bridge system for small lattices; on the other hand, the large on-chip caches of this processor keep its performance constant on larger lattices [45].

Figure 6: Performance trends (measured in spin flips per picosecond) for the simulation of the EA spin glass model with optimized programs on several commodity architectures and on JANUS and Janus II. The lines scale according to Moore's law. See the text for a complete discussion.

7. Conclusions

In this paper we have described the architecture and implementation of the Janus II application-driven machine, emphasizing its potential for performance in the simulation of spin glass systems. As described in detail in the previous sections, the new machine will make it possible to carry out Monte Carlo simulation campaigns that would take centuries if performed on currently available computer systems.

The possibility of obtaining such a large performance gap stems mainly from the fact that the number-crunching requirements associated with this class of simulations are very different from those for which state-of-the-art computers are optimized. At the same time, FPGAs offer an enabling technology that makes it possible to implement real machines with a reasonable engineering effort and at costs affordable to a small scientific collaboration.

Janus II builds and improves on the experience of its predecessor – JANUS – which has been running physics simulations for the last 6 years, and replaces the older machine at a point in time when the JANUS performance edge over commercial systems has been significantly reduced.

JANUS and Janus II have been designed with the main aim of speeding up the Monte Carlo simulation of (a wide class of) spin glass models. At the basic hardware level, however, neither machine is specialized for these classes of simulations, so their use for other computational tasks is in principle possible and efficient. In practice, attempts at using JANUS for other applications hit the serious bottleneck of the small size of the available memory. Janus II explicitly addresses this problem, since each SP node has 2 large banks of fast memory; we are now starting to work on the assessment of the potential efficiency of our machine for other applications, including such areas as cryptography, graph optimization and the simulation of VLSI circuits.

Acknowledgments

We warmly acknowledge the excellent work done by the Janus II team at Link Engineering. In particular, we thank Pietro Lazzeri, Pamela Pedrini, Roberto Preatoni, Luigi Trombetta and Alessandro Zambardi for their professional and enthusiastic work.
The Janus II project was supported by the European Regional Development Fund (ERDF/2007-2013; FEDER project UNZA08-4E-020); by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013; ERC grant agreement no. 247328); by MICINN (Spain) (contracts FIS2012-35719-C02 and FIS2010-16587); by the Junta de Extremadura (contract GR101583); and by the Italian Ministry of Education and Research (PRIN Grant 2010HXAW77 007).

References

[1] C. Angell, Science 267, 1924 (1995).
[2] P. Debenedetti, Metastable Liquids, Princeton University Press, Princeton (1997).
[3] R. Tripiccione, Comp. Phys. Comm. 169, 442 (2005).
[4] J. Makino et al., The Astrophysical Journal 480, 432 (1997).
[5] D. E. Shaw et al., Communications of the ACM 51, 91 (2008).
[6] R. Pearson, J. Richardson and D. Toussaint, A Special Purpose Machine for Monte Carlo Simulations, Tech. Report NSF-ITP-81-139, Inst. Theoretical Physics, Univ. California, Santa Barbara, 1981.
[7] J. H. Condon and A. T. Ogielski, Rev. Sci. Instruments 56, 1691 (1985); A. T. Ogielski, Phys. Rev. B 32, 7384 (1985).
[8] J. Pech et al., Comp. Phys. Comm. 106, 10 (1997); A. Cruz et al., Comp. Phys. Comm. 133, 165 (2001).
[9] F. Belletti et al., Computing in Science & Engineering 8, 41 (2006).
[10] F. Belletti et al., Computer Physics Communications 178, 208 (2008).
[11] F. Belletti et al., Computing in Science & Engineering 11, 48 (2009).
[12] A. P. Young (editor), Spin Glasses and Random Fields (World Scientific, Singapore, 1998).
[13] J. A. Mydosh, Spin Glasses: an Experimental Introduction (Taylor and Francis, London, 1993).
[14] F. Barahona, J. Phys. A 15, 3241 (1982).
[15] S. F. Edwards and P. W. Anderson, J. Phys. F: Metal Phys. 5, 965 (1975); ibid. 6, 1927 (1976).
[16] K. Binder and D. W. Heermann, Monte Carlo Simulation in Statistical Physics (Springer, Berlin, 2010).
[17] M. Creutz, Quantum Fields on the Computer, World Scientific, 1992.
[18] A. D. Sokal, in Functional Integration: Basics and Applications (1996 Cargèse School), C. DeWitt-Morette, P. Cartier and A. Folacci, eds. (Plenum, New York, 1997).
[19] E. Marinari and G. Parisi, Europhys. Lett. 19, 451 (1992); K. Hukushima and K. Nemoto, J. Phys. Soc. Jpn. 65, 1604 (1996); M. C. Tesi et al., J. Stat. Phys. 82, 155 (1996).
[20] http://www.linkengineering.it
[21] G. Parisi and F. Rapuano, Phys. Lett. B 157, 301 (1985).
[22] Janus Collaboration: F. Belletti et al., Phys. Rev. Lett. 101, 157201 (2008).
[23] Janus Collaboration: F. Belletti et al., J. Stat. Phys. 135, 1121 (2009).
[24] Janus Collaboration: A. Cruz et al., Phys. Rev. B 79, 184408 (2009).
[25] Janus Collaboration: R. A. Baños et al., J. Stat. Mech. P06026 (2010).
[26] Janus Collaboration: R. Alvarez Baños et al., Phys. Rev. Lett. 105, 177202 (2010).
[27] Janus Collaboration: R. A. Baños et al., Phys. Rev. B 84, 174209 (2011).
[28] Janus Collaboration: R. A. Baños et al., Proc. Natl. Acad. Sci. USA 109, 6452 (2012).
[29] Janus Collaboration: M. Baity-Jesi et al., arXiv:1307.4998.
[30] M. Baity-Jesi et al., The European Physical Journal: Special Topics 210, 33 (2012).
[31] M. Hasenbusch, A. Pelissetto and E. Vicari, Phys. Rev. B 78, 214205 (2008).
[32] Y. G. Joh et al., Phys. Rev. Lett. 82, 438 (1999).
[33] F. Bert et al., Phys. Rev. Lett. 92, 167203 (2004).
[34] L. A. Fernandez, V. Martin-Mayor, G. Parisi and B. Seoane, arXiv:1307.2361.
[35] A. J. Bray and M. A. Moore, Phys. Rev. B 83, 224408 (2011).
[36] A. P. Young and H. G. Katzgraber, Phys. Rev. Lett. 93, 207203 (2004).
[37] T. Jörg, H. Katzgraber and F. Krzakala, Phys. Rev. Lett. 100, 197202 (2008).
[38] A. P. Young and A. Sharma, Phys. Rev. B 83, 214405 (2011).
[39] V. Martin-Mayor and S. Perez-Gaviro, Phys. Rev. B 84, 024419 (2011).
[40] E. Ilker and A. Nihat Berker, Phys. Rev. E 87, 032124 (2013).
[41] C. Michael, Phys. Rev. B 33, 7861 (1986).
[42] G. Bhanot, D. Duke and R. Salvador, Phys. Rev. B 33, 7841 (1986).
[43] M. Guidetti et al., Spin Glass Monte Carlo Simulations on the Cell Broadband Engine, in Proc. of PPAM09, LNCS 6067, 467-476 (Springer, Heidelberg, 2010).
[44] M. Guidetti et al., Monte Carlo Simulations of Spin Systems on Multi-core Processors (K. Jonasson, ed.), LNCS 7133, 220-230 (Springer, Heidelberg, 2010).
[45] A. Gabbana, M. Pivanti, S. F. Schifano and R. Tripiccione, Benchmarking MIC architectures with Monte Carlo simulations of spin glass systems, in Proceedings of the High Performance Computing Conference, 2013, Bangalore (India), in press.