Janus II: a new generation application-driven computer for spin-system simulations

M. Baity-Jesi^{a,b,c}, R. A. Baños^{b,d}, A. Cruz^{d,b}, L. A. Fernandez^{a,b}, J. M. Gil-Narvion^{b}, A. Gordillo-Guerrero^{e,b}, D. Iñiguez^{b,k}, A. Maiorano^{c,b}, F. Mantovani^{f,1}, E. Marinari^{g}, V. Martin-Mayor^{a,b}, J. Monforte-Garcia^{b,d}, A. Muñoz Sudupe^{a}, D. Navarro^{h}, G. Parisi^{g}, S. Perez-Gaviro^{b,k}, M. Pivanti^{f}, F. Ricci-Tersenghi^{g}, J. J. Ruiz-Lorenzo^{i,b}, S. F. Schifano^{j}, B. Seoane^{c,b}, A. Tarancon^{d,b}, R. Tripiccione^{f}, D. Yllanes^{c,b}

a Departamento de Física Teórica I, Universidad Complutense, 28040 Madrid, Spain.
b Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50009 Zaragoza, Spain.
c Dipartimento di Fisica, Università di Roma "La Sapienza", 00185 Roma, Italy.
d Departamento de Física Teórica, Universidad de Zaragoza, 50009 Zaragoza, Spain.
e D. de Ingeniería Eléctrica, Electrónica y Automática, U. de Extremadura, 10071 Cáceres, Spain.
f Dipartimento di Fisica e Scienze della Terra, Università di Ferrara, and INFN, 44100 Ferrara, Italy.
g Dipartimento di Fisica, IPCF-CNR, UOS Roma Kerberos and INFN, Università di Roma "La Sapienza", 00185 Roma, Italy.
h D. de Ingeniería, Electrónica y Comunicaciones and I3A, U. de Zaragoza, 50009 Zaragoza, Spain.
i Departamento de Física, Universidad de Extremadura, 06071 Badajoz, Spain.
j Dipartimento di Matematica e Informatica, Università di Ferrara, and INFN, 44100 Ferrara, Italy.
k Fundación ARAID, Diputación General de Aragón, Zaragoza, Spain.

1 Now at Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain.

Abstract

This paper describes the architecture, the development and the implementation of Janus II, a new generation application-driven number cruncher optimized for Monte Carlo simulations of spin systems (mainly spin glasses). This domain of computational physics is a recognized grand challenge of high-performance computing: the resources necessary to study in detail theoretical models that can make contact with experimental data are far beyond those available using commodity computer systems. On the other hand, several specific features of the associated algorithms suggest that unconventional computer architectures – which can be implemented with available electronics technologies – may lead to order-of-magnitude increases in performance, reducing to acceptable values on human scales the time needed to carry out simulation campaigns that would take centuries on commercially available machines. Janus II is one such machine, recently developed and commissioned, that builds upon and improves on the successful JANUS machine, which has been used for physics since 2008 and is still in operation today. This paper describes in detail the motivations behind the project, the computational requirements, the architecture and the implementation of this new machine, and compares its expected performance with that of currently available commercial systems.

Keywords: Spin glass, Monte Carlo, Application-driven computers, FPGA computing

Preprint submitted to Computer Physics Communications, October 4, 2013.

1. Overview

Understanding glassy behavior is a major challenge in condensed matter physics (see for instance Refs. [1, 2]). Glasses are materials that do not reach thermal equilibrium on macroscopic time scales (e.g., years): bulk material properties of a macroscopic sample, such as the compliance modulus or the specific heat, change in time even if the sample is kept for days (or years) at constant experimental conditions.
This sluggish dynamics is a major obstacle for the theoretical and experimental investigation of glasses.

Spin glasses, usually regarded as prototypical glassy systems (or, more generally, prototypical complex systems), have been extensively studied theoretically; over the years this theoretical work has been widely supported by numerical simulations, mostly using Monte Carlo techniques. The Monte Carlo simulation of spin glass systems is a recognized grand challenge of computing, as it requires inordinately large resources and, at the same time, has number-crunching requirements largely at variance with mainstream computer developments.

In a typical spin-glass model (see later for a detailed description), the dynamical variables, the spins, are discrete and sit at the nodes of discrete D-dimensional lattices. In order to make contact with experiments, we may want to follow the evolution of a large enough 3D lattice, say 100^3 sites, for time periods of the order of 1 second. One Monte Carlo sweep (MCS) – the update of all the spins in the lattice – roughly corresponds to a time scale of 10^{-12} seconds for a real sample, so we need some 10^{12} such steps, that is, 10^{18} spin updates. Also, in order to properly account for disorder, we have to collect statistics on several (e.g., O(100)) copies of the system, adding up to 10^{20} Monte Carlo spin updates. One easily reckons that one needs a computer able to process on average one spin in 1 picosecond or less in order to carry out this simulation program within reasonable human timescales (say, less than one year).
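To make the last statement concrete, the budget implied by the figures just quoted can be spelled out as a back-of-the-envelope check:

```latex
10^{12}\ \mathrm{MCS} \times 100^3\ \text{spins} = 10^{18}\ \text{spin updates per sample},\\
10^{18} \times O(100)\ \text{samples} \simeq 10^{20}\ \text{spin updates in total},\\
10^{20} \times 1\ \mathrm{ps/update} = 10^{8}\ \mathrm{s} \approx 3\ \text{years}.
```

A one-year campaign therefore calls for average update times somewhat below one picosecond, or for several samples processed concurrently.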
The algorithms associated with the Monte Carlo simulation of spin glasses have several properties that – in principle – open the way to very efficient processing. First, as already noted, the degrees of freedom of several widely studied spin glass models are discrete, and their values can be mapped onto a small number of bits (just one, for several popular models); discrete bit-valued variables are operated upon with simple logic (as opposed to arithmetic) operations, which can be performed by just a few logic gates. Second, it is easy to identify a very large amount of parallelism in the required computation, as one can concurrently process spins that do not interact directly with one another.

Virtually no commercially available computer is able to exploit these properties in full; indeed, processors are optimized for arithmetic (integer or – even worse – floating-point) operations, each such operation requiring a large number of logic gates. The added burden of performing logic operations on hardware structures optimized for arithmetic also severely limits the amount of parallel computation that each processor is able to support.

On the other hand, these features, if consistently exploited, open the way to a conceptually simple and efficient application-driven computing architecture, carefully optimized for spin glass simulations, that promises huge performance advantages. Application-oriented systems have been used in many cases in computational physics, not only for spin-system simulations but also in Lattice QCD [3] and for the simulation of gravitationally coupled [4] and biological [5] systems.

Application-driven number crunchers for spin systems have a long history: the pioneering work by Pearson and Richardson [6] in the late 70s was followed by that of Ogielski and Condon [7] in the 80s; these early attempts were followed by the SUE project [8] and more recently by JANUS [9, 10, 11], of which the work described in this paper is the natural evolution. SUE and JANUS acknowledge that an optimal architecture for spin simulators requires a dedicated processor architecture, and use Field Programmable Gate Arrays (FPGAs) as the enabling technology to implement that architecture. FPGAs are integrated circuits that can be configured at will after they have been assembled in an electronic system.

In the last ten years, dedicated spin-glass crunchers, with their order-of-magnitude better performance than that available with traditional computers, have been instrumental in reaching several key results in spin glass physics; the most recent such machine, JANUS, commissioned in 2008 and still in operation today, has indeed made it possible to establish several new results (see later for details).

In the same time frame, several innovative developments in mainstream computer architecture – including many-core processors and GPUs – have made it possible to develop increasingly parallel implementations of Monte Carlo algorithms for spin systems, significantly boosting performance and largely closing the gap with JANUS. At the same time, and in parallel with mainstream computer systems, progress in electronics technology has also significantly boosted the level of parallelism and performance that can be harvested using FPGAs.

This background has motivated the development of Janus II, described in detail in this paper, which has the potential to provide order-of-magnitude better performance than commercial computers over a time window of at least the next five years, as well as superior energy efficiency.

Janus II is an FPGA-based massively parallel spin-glass number cruncher that architecturally builds on JANUS and improves on it in several directions: i) it uses latest-generation FPGA technology, corresponding to an order-of-magnitude increase in performance per processor; ii) it includes an improved communication interconnection among Janus II nodes, making it efficient to simulate large lattices using inter-node parallelism on top of intra-node parallelism; iii) it enlarges by two orders of magnitude the size of the memory available to the system; and iv) it tightly couples the dedicated number-cruncher nodes with traditional host computers, improving data throughput and allowing a mixed-mode operation of the system in which potentially complex control operations are handled efficiently by traditional programs. All these improvements help boost the expected performance of Janus II as a spin glass number cruncher; moreover, points iii) and iv) above enlarge the class of applications for which Janus II is a potentially efficient computer: while the project is still mainly motivated as a spin glass simulator, we expect interesting results in such diverse areas as graph theory, cryptography or the simulation of VLSI circuits.

This paper is structured in the following way: after this section, we present a short introduction to spin glass models and to the Monte Carlo techniques used to simulate them; the paper continues with a description of the Janus II architecture, which closely matches the requirements outlined in the preceding section.
A section on the programming and development environment available for this machine follows, which also contains some performance figures. This is followed by a section that – building on the expected performance of the machine – identifies several important questions in spin glass physics accessible to Janus II that were not within reach of JANUS. The following section compares Janus II performance with that of currently available computers and tries to forecast the extent of the window of opportunity of our new machine. The paper ends with some concluding remarks.

2. Spin Glass models

Both JANUS and Janus II have been designed from scratch to optimize their performance for a specific application: the Monte Carlo simulation of spin glasses. In this section we review the spin glass models that we want to study with this machine.

Spin glasses are disordered magnetic alloys whose low-temperature phase is a frozen disordered state, rather than the uniform patterns one finds in more conventional magnetic systems [12, 13]. They are important because they are widely regarded as the simplest possible model of a complex system. In fact, as we will see below, spin glass models are extremely simple to define. In spite of this, finding the lowest-energy configuration of a three-dimensional Ising spin glass is an NP-hard problem [14]. The main ingredients that make the problem so hard are randomness in the interactions and frustration. By frustration we refer to the impossibility of simultaneously satisfying all the demands that the interactions pose on individual spins.

One of the most famous families of spin-glass models was proposed by Edwards and Anderson [15] in the 70s. They consider a regular lattice and define spins sitting at the lattice nodes. Spins are unit-length vectors of n components: \vec{S}_i = (S_{i,1}, S_{i,2}, \ldots, S_{i,n}), with \vec{S}_i \cdot \vec{S}_i = 1. The interaction energy is

    H = -\sum_{i,j} J_{ij}\, \vec{S}_i \cdot \vec{S}_j ,    (1)

where the indices i and j run over the nodes of the lattice. The coupling constants J_{ij} are chosen randomly (quenched disorder). They are statistically independent and identically distributed. We shall be mostly concerned with short-ranged interactions (i.e., J_{ij} vanishes unless sites i and j are lattice nearest neighbors). An instance of the coupling constants {J_{ij}} defines a sample of the physical system.

The number n of components of the spins is also important. Some cases have special names: Heisenberg (n = 3), XY (n = 2) and Ising (n = 1). The Ising spin glass model, S_i = ±1, with short-range interactions (a prototypical material is Fe_{0.5}Mn_{0.5}TiO_3) has deserved special scientific attention for decades; its energy function reads

    H = -\sum_{\langle i,j \rangle} J_{ij} S_i S_j ,    (2)

where \langle i,j \rangle indicates that the sum runs only over nearest neighbors in the lattice. Since the spin S_i is a binary variable, it can be coded on just one bit. Computational opportunities arise from this simplicity, and we aim to explore some of them.

The goal of the game is to obtain assignments of the spin variables {S_i} – named configurations – statistically distributed according to the Boltzmann weight at temperature T:

    P_B({S_i}) = exp[-H({S_i})/T] / Z ,    Z = \sum_{{S_i}} exp[-H({S_i})/T] .    (3)

One may try to achieve this goal by means of Markov Chain Monte Carlo simulations (see, for instance, [16, 17, 18] for a detailed introduction). In principle, one just needs to implement some dynamics fulfilling detailed balance and run it for a long enough time.
However, in the limit of vanishing temperature, the Boltzmann weight goes to zero unless the spin configuration is a ground state of the system (a lowest-energy configuration). Since finding ground states for a typical three-dimensional spin-glass sample is an NP-hard problem, something must go wrong with our simple-minded strategy. The problem lies in the length of the simulation: the autocorrelation times of the Markov chain become inordinately large at low T; the simulation gets trapped for a long time in some of the many local minima of the energy (2). It is maybe worth mentioning that physical spin glasses (such as Ag_xMn_{1-x}, for instance) suffer from the same problem: the system does not reach thermal equilibrium even if it is allowed to evolve under constant laboratory conditions for hours, or even days.

Finding equilibrium configurations for a single sample {J_{ij}} is only half of the problem. In order to obtain physically meaningful answers one needs to average the thermal mean values [i.e., the mean values corresponding to the Boltzmann weight (3) of a given sample] over a fair number of samples [i.e., to perform the average over the quenched disorder]. The meaning of fair depends very much on the physical questions that one asks and on the lattice size: it may range from less than one hundred samples to maybe 10^5 samples.

The dynamics that implement our Markov Chain Monte Carlo on a given sample at temperature T are pretty standard: Metropolis or heat bath. For instance, the Metropolis procedure for the Ising spin glass starts from an arbitrary initial configuration and generates new configurations by picking one spin in the lattice (S_i) and tentatively flipping it. One then computes the energy difference ΔE associated with this tentative change, ΔE = 2 \sum_j J_{ij} S_i S_j [where j runs over all the nearest neighbors of site i]. If ΔE ≤ 0 the tentative flip is accepted and the algorithm moves to another lattice site. If, on the other hand, ΔE > 0, the tentative flip is accepted with probability e^{-βΔE} (β = 1/T is the inverse of the system temperature).
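The Metropolis step just described is compact enough to be spelled out in code. The following C sketch, for the 3D Edwards-Anderson model of Eq. (2), is purely illustrative: the data layout, the neighbor tables and the helper urand01() are our own assumptions, not the machine implementation discussed in Sec. 4.

```c
/* Minimal sketch of one Metropolis update for the 3D Edwards-Anderson
 * Ising spin glass, Eq. (2).  Array layout and the placeholder
 * random-number generator are illustrative assumptions. */
#include <math.h>
#include <stdlib.h>

#define L 16
#define N (L * L * L)

int    S[N];        /* spins, +1 or -1                       */
double J[N][6];     /* couplings to the 6 nearest neighbors  */
int    nbr[N][6];   /* neighbor indices (periodic lattice)   */

static double urand01(void)        /* uniform deviate in [0,1) */
{
    return rand() / (RAND_MAX + 1.0);
}

void metropolis_site(int i, double beta)
{
    /* energy change of flipping S[i]: dE = 2 S_i sum_j J_ij S_j */
    double h = 0.0;
    for (int k = 0; k < 6; k++)
        h += J[i][k] * S[nbr[i][k]];
    double dE = 2.0 * S[i] * h;

    /* accept if dE <= 0, otherwise with probability exp(-beta dE) */
    if (dE <= 0.0 || urand01() < exp(-beta * dE))
        S[i] = -S[i];
}
```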
One easily identifies a large degree of parallelism, as one can apply the procedure in parallel to any subset of spins that do not share a coupling term in the energy function (so that all ΔE terms are computed correctly): one usually partitions the lattice as a checkerboard and applies the algorithm first to all black sites and then to all white ones, corresponding to an available parallelism of degree L^D/2. In principle, we may schedule one full Monte Carlo sweep (MCS, the application of the algorithm to all sites of the lattice), for any lattice size, in just two computational steps, if enough computational resources are available.

Simulations at constant temperature are not up to the task if one wants to produce a thermalized set of configurations at low temperature. One then resorts to the parallel tempering (PT) algorithm [19]. We consider N_T temperatures T_1 < T_2 < T_3 < ... < T_{N_T}. For each temperature we consider a statistically independent spin configuration {S_{i,a}}, with a = 1, 2, ..., N_T:

    P_B({S_{i,a=1}}, {S_{i,a=2}}, ..., {S_{i,a=N_T}}) = \prod_{a=1}^{N_T} exp[-H({S_{i,a}})/T_a] / Z(T_a) .    (4)

Each of the N_T systems is independently simulated at its own temperature by means of one of the standard algorithms. However, every n_PT constant-T sweeps, one performs a parallel tempering step. The elementary parallel-tempering step is the attempted exchange of the configurations at two consecutive temperatures T_a and T_{a+1}. The configuration exchange is accepted with the Metropolis probability

    Prob_PT = min[ 1, exp( -H({S_{i,a+1}})/T_a - H({S_{i,a}})/T_{a+1} ) / exp( -H({S_{i,a}})/T_a - H({S_{i,a+1}})/T_{a+1} ) ] .    (5)

One attempts to exchange configurations towards ascending temperatures (so that, in principle, the configuration at the lowest temperature could reach the highest temperature in just one PT step). The rationale behind the parallel tempering algorithm is simple: if a configuration trapped in a local minimum is raised to a high enough temperature, it will be able to escape thanks to a thermal fluctuation.

Parallel tempering has several tunable parameters. First, the set of temperatures {T_a}_{a=1}^{N_T} should be such that the acceptance probability (5) is reasonable (say, ∼ 10%). This requires a relatively small temperature spacing. On the other hand, the largest temperature T_{N_T} should be high enough to ensure quick equilibration by means of the constant-T algorithm. One has to reach a compromise between these conflicting goals since, the larger N_T, the larger the needed computational resources. The parameter that controls the parallel tempering frequency, n_PT, can be tuned as well. In our experience, the algorithmic performance depends on n_PT only slightly. This is fortunate, because the parallel tempering step necessarily breaks parallelism. One may diminish its frequency (by increasing n_PT), although some tradeoff must also be found.
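In code, the temperature-exchange step is equally simple: Eq. (5) reduces to min[1, exp(Δβ ΔE)] with Δβ = 1/T_a - 1/T_{a+1} and ΔE = E_a - E_{a+1}. A minimal C sketch follows (the array layout is again an illustrative assumption; urand01() is the placeholder generator of the previous sketch):

```c
/* One parallel-tempering sweep over NT replicas, exchanging towards
 * ascending temperatures.  beta[a] = 1/T[a] is sorted so that a = 0
 * is the lowest temperature; E[a] is the current energy of the
 * configuration sitting at temperature slot a; replica_of[a] records
 * which configuration that is. */
#include <math.h>

extern double urand01(void);

void pt_sweep(int NT, const double beta[], double E[], int replica_of[])
{
    for (int a = 0; a < NT - 1; a++) {
        double dbeta = beta[a] - beta[a + 1];   /* > 0 by construction */
        double dE    = E[a] - E[a + 1];
        /* Eq. (5): accept with probability min(1, exp(dbeta * dE)) */
        if (dE >= 0.0 || urand01() < exp(dbeta * dE)) {
            int t = replica_of[a];
            replica_of[a] = replica_of[a + 1];
            replica_of[a + 1] = t;
            double e = E[a];  E[a] = E[a + 1];  E[a + 1] = e;
        }
    }
}
```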
Let us finally mention that one may extend the Edwards-Anderson model by including an external, site-dependent magnetic field h_i:

    H = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j - \sum_i h_i s_i .    (6)

In this case, a sample is defined by the set of coupling constants {J_{ij}, h_i}. The addition of the local magnetic fields h_i does not add any real complication to the numerical simulation, and it has the advantage of enlarging the set of problems that can be considered. Examples are the random-field Ising model (RFIM) and the diluted antiferromagnet in a field (DAFF) [12]. A further extension consists in considering integer-valued spins s_i = 1, 2, ..., Q (the so-called Q-state Potts model), which can be formulated in a similar way (see the chapter by Binder in Ref. [12]).

3. Janus II architecture

The Janus II architectural concept and its implementation follow directly from its predecessor (JANUS), built and commissioned in 2008 and still in operation today. The main guiding principle behind the old and new JANUS architectures is the attempt to leverage state-of-the-art electronics technology in order to: i) exploit the huge parallelism available in the simulation of one spin glass (SG) system to speed up the Monte Carlo evolution of that system; ii) simulate in parallel a relatively large number of system samples; and iii) connect as tightly as possible the dedicated, massively parallel number-crunching array with a traditional host computer system, so that complex and non-parallelizable computing functions (e.g., the proper handling of the parallel tempering temperature exchange) are performed with as little impact on global performance as possible.

The simulation of most SG models implies a mix of logic operations on bits (as opposed to arithmetic operations on long data words). Since virtually all commercially available computer architectures focus on arithmetic operations, they are conceptually a poor option for SG simulations. An optimal choice would be to hardwire all the logic gates that can be fabricated on one silicon die so that they perform exactly the set of required logic operations, developing a fully customized integrated circuit. This is possible in principle (integrated circuits designed to perform a specific function are called Application Specific Integrated Circuits, or ASICs), but the time and costs associated with their development rule out this option. We therefore choose the second-best option and adopt Field Programmable Gate Arrays (FPGAs) as the basic building block for Janus II. FPGAs are integrated circuits whose logic gates can be connected at will in order to perform a specific set of logic functions. FPGA configuration is a simple process that can be repeated at will, so the same FPGA can be used for widely different logic functions. Currently available FPGAs have hundreds of thousands of so-called logic cells, each able to perform any logic operation on several bits; equally important, FPGAs come with several tens of Mbit of embedded memory.

Figure 1: Architecture of the Janus II Processing Board (PB). The array of 16 FPGA-based Simulation Processors (SPs, right) is connected by a 2D (x and y) toroidal network. All SPs have an additional independent connection to the IOP processor; the latter is part of the CP complex, which includes a commodity PC (adopting the COM form factor) and runs the Linux operating system; the CP has Gbit-Ethernet and Infiniband networking ports to the external world. Additional high-speed connections are available for a tight coupling to other PBs in the z direction.

The overall architecture of Janus II is the parallel structure shown in Figure 1. The basic processing element of the system is the Simulation Processor (SP), whose computational structure is fully based on just one FPGA device. Each SP includes one Xilinx Virtex-7 XC7VX485T FPGA and two banks of DDR3 memory of 8 GByte each. The choice of this FPGA was based mainly on cost and availability considerations for this specific device. The selected FPGA has some 485000 logic cells and includes ∼ 32 Mbit of embedded memory. As shown later in detail, we expect to embed within each SP more than 2000 spin-flip engines, each updating one spin (all of the same color in a checkerboard structure) in one clock cycle. This corresponds to an average update rate of one spin every 2.5 ps (with a conservative clock frequency of 200 MHz).
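The quoted update rate follows directly from the number of engines and the clock frequency (the relation is formalized as Eq. (8) below):

```latex
T_{\mathrm{spin}} = \frac{1}{n_p\, f}
                  = \frac{1}{2000 \times 200\,\mathrm{MHz}} = 2.5\,\mathrm{ps}.
```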
A set of 16 SPs is mounted onto a Processing Board (PB); the SPs of each PB are logically assembled at the nodes of a 4 × 4 array. Each SP in the array has direct point-to-point bidirectional links with its 4 nearest neighbors; toroidal boundary conditions are applied. Each logical link is engineered as 8 physical links that we expect to operate at a bandwidth in the range from 3 to 5 Gbit/s.

All SPs belonging to each PB are directly connected to and controlled by a Control Processor (CP). The CP is a full-fledged computer, running the Linux operating system. The CP plays several roles in the Janus II system: first, it is able to configure the FPGAs of the SPs, so that they perform the desired logic operations; second, it moves data from/to the SPs, so that – for instance – initial data can be loaded onto the SPs and the results of a simulation can go back to the CP. Finally, the CP controls the operation of all SPs, e.g. starting a simulation program, monitoring their status, collecting results, executing those parts of the global computation that cannot be offloaded to the SPs and handling any errors.

The CP uses a commercially available Computer-on-Module (COM) system, based on an Intel Core i7 processor running at 2.2 GHz; it connects via the PCIe interface to a so-called Input-Output Processor (IOP), built inside yet another FPGA; the IOP actually manages all connections to all SPs, using a set of dedicated bidirectional high-speed links (one to each SP) running at ∼ 3 Gbit/s and a small number of dedicated control and status lines. The IOP formats and appropriately routes the data in transit from the CPU to the SPs, controls the configuration procedure of all SPs, controls their operation and monitors their status. Since the IOP is itself a configurable unit, we are considering using it – on a longer time scale – for additional computational/communication tasks; for instance, the IOP might support a full crossbar switch among all SPs, or directly handle the temperature-exchange phase of a PT algorithm distributed over several SPs.

The CP is the main architectural improvement of Janus II with respect to its predecessor: JANUS only had a Gigabit Ethernet link between a set of SPs and an external computer; the new arrangement increases the available bandwidth between the SP array and the host to 4 GByte/s (a factor ∼ 40× larger than in the previous system) and reduces the communication latency from ∼ 15 µs to ∼ 1 µs. A much more tightly coupled operation of the SP array becomes possible, allowing a simulation program to be split more finely between the control CPU and the SP array.

The combination of one CP and 16 SPs is the basic functional block of a Janus II system. All these components are assembled inside a box that also contains the power supplies and the forced-air cooling system. This module operates as an independent computing system and can be networked with other Janus II boxes and with traditional computers via Ethernet and Infiniband interfaces. A Janus II installation can be made of any number N of Janus II boxes; the boxes can be used as logically independent systems, running simulations of different physical systems, or the whole set can operate as just one larger system; in the latter case, the machine can be seen as a 3D structure of 4 × 4 × N SPs. Bidirectional links are in fact available on each SP to build the interconnection structure in the third dimension.

The project – at the present stage – has already assembled and tested a system with 16 Janus II boxes, installed at BIFI in Zaragoza. The Janus II team worked on the conceptual design of the system architecture, while our industrial partner – Link Engineering Srl, Bologna (Italy) [20] – carried out the detailed engineering design and the actual construction of the prototype and of the presently available system. Fig. 2 shows an SP module, while Fig. 3 shows a Janus II box. Fig. 4 is a partial close-up view of the fully assembled system.
4. Structuring and programming a spin glass simulation on Janus II

A Janus II program is a combination of a standard C program, running on the CP, and a computational kernel, running on one or more appropriately configured SPs and operating on data moved to the SPs by the CP-resident program.

Figure 2: Pictures of a Janus II SP module; the picture at left has a small heat radiator, providing a complete view of all components; the picture at right shows the large heat radiator needed to allow high-frequency operation of the machine.

This programming style is similar to the one usually adopted in processing systems that include some form of co-processor or accelerator: a perhaps familiar example is GPU programming, where the host processor sets up all required data structures, initializes data values and controls the outer loops of the program, while the computationally heavy kernels run on the GPU. The main difference is of course that, while GPUs execute a program written in an appropriate programming language (e.g., CUDA or OpenCL), the SPs in Janus II run the hardwired sequence of operations implied by the configured FPGA. Several development environments are available to assist in configuring FPGAs; we use VHDL, a relatively low-level language that requires a detailed description of the structures that store data, of the operations performed on data and of the instruction control: our experience shows, however, that only this low-level, largely handcrafted approach guarantees the high performance that we look for.

From the perspective outlined in the previous paragraph, Janus II might be seen as a (possibly exotic) general-purpose computer; however, the main driving force behind the project is of course that one expects outstanding performance when the SPs are configured for spin glass Monte Carlo simulations. Still, the fact that the Janus II processing elements can be configured in arbitrary ways keeps the door open for other uses of this machine.

The simplest operation mode for Janus II will be the one already adopted for JANUS: each SP performs a full Monte Carlo simulation of one SG system, while different replicas of the system, or physical systems at different temperatures, are assigned to several SPs.

Figure 3: Picture of a Janus II box; there are 16 SP modules (plugged vertically onto the printed circuit board), while the CP module is at the center of the structure; at left one sees the cooling fans and the power supplies.

The update engine for one lattice site has a very simple structure. We consider again, for definiteness, the Ising spin glass in 3D; one maps the spins and couplings onto bit-valued ({0,1}) variables:

    S_k → σ_k = (1 + S_k)/2 ,    J_{km} → j_{km} = (1 + J_{km})/2 .    (7)

Once this is done, the evaluation of ΔE = 2 \sum_j J_{ij} S_i S_j only implies 6 bit-wise logic XOR functions (replacing the products J_{ij} S_j), followed by an arithmetic sum of just six bit-valued operands. The result can be seen as a pointer into a small look-up table where the corresponding pre-computed values of e^{-βΔE} are stored. At this point, one arithmetically compares the value of the selected table entry with a freshly generated random number: according to the outcome of the comparison, the previous value of the spin is left unchanged or the flipped value is written to memory.
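In software, the whole per-site engine reads as follows. This C sketch mirrors the logic just described (six XORs, a six-operand sum, a look-up table, one comparison); the names, array layout and the prng32() helper are illustrative assumptions, and on the real machine the same data path is of course synthesized in VHDL rather than compiled:

```c
/* Sketch of the one-site update engine on the {0,1}-coded variables
 * of Eq. (7).  A bond k is satisfied (J_ij S_i S_j = +1) exactly when
 * jcpl ^ sigma_i ^ sigma_j == 1, so the count s of satisfied bonds
 * gives dE = 4s - 12.  lut[s] holds 2^32 * min(1, exp(-beta*(4s-12))),
 * saturated to UINT32_MAX for dE <= 0 so those flips are always taken. */
#include <stdint.h>

extern uint8_t  sigma[];      /* one spin bit per site            */
extern uint8_t  jcpl[][6];    /* one coupling bit per link        */
extern int      nbr[][6];     /* neighbor indices                 */
extern uint32_t lut[7];       /* pre-computed acceptance table    */
extern uint32_t prng32(void); /* 32-bit random number             */

void update_site(int i)
{
    int s = 0;                           /* satisfied-bond counter */
    for (int k = 0; k < 6; k++)
        s += jcpl[i][k] ^ sigma[i] ^ sigma[nbr[i][k]];

    if (prng32() < lut[s])               /* one comparison         */
        sigma[i] ^= 1;                   /* write flipped value    */
}
```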
The required sequence of operations is similar for more complex spin glass models or different Monte Carlo algorithms: different and possibly more complex logic manipulations may be needed; in most cases, the generation of pseudo-random numbers remains the most complex operation. On JANUS we were able to implement ∼ 1000 such basic engines in each FPGA, using the Parisi-Rapuano [21] generator. With Janus II we plan to double this number and to increase the operating frequency by a factor of 4. Under these conditions, the estimated power consumption of each SP – based on data made available by Xilinx – is between ∼ 25 and 30 Watts.
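For reference, the Parisi-Rapuano generator admits a compact software model. The sketch below follows the form commonly used in spin-glass codes – a 32-bit lagged-Fibonacci sum with lags 24 and 55, XOR-ed with the value 61 steps back, which accounts for the three 32-bit reads per random number mentioned below; seeding is omitted, and the details should be checked against Ref. [21]:

```c
/* Software model of the Parisi-Rapuano wheel [21], as commonly
 * implemented: x(k) = x(k-24) + x(k-55) (mod 2^32), output
 * x(k) ^ x(k-61).  The ring buffer must hold at least 62 entries;
 * seeding the wheel is left out of this sketch. */
#include <stdint.h>

#define PR_LEN 62

static uint32_t wheel[PR_LEN];   /* seeded elsewhere          */
static int      ip = 0;          /* slot about to be written  */

static uint32_t pr_next(void)
{
    uint32_t x = wheel[(ip + PR_LEN - 24) % PR_LEN]
               + wheel[(ip + PR_LEN - 55) % PR_LEN]; /* wraps mod 2^32 */
    uint32_t out = x ^ wheel[(ip + PR_LEN - 61) % PR_LEN];
    wheel[ip] = x;
    ip = (ip + 1) % PR_LEN;
    return out;
}
```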
Figure 4: Close-up view of the Janus II machine installed at BIFI (Zaragoza). The installation has 16 Janus II boxes (12 are visible in the picture). The cables supporting the data links in the z direction are mounted in loop-back mode for test purposes.

One should notice that processing each spin implies reading 13 bits, writing one bit of result (the new value of S_i) and reading a few 32-bit numbers (3 for the Parisi-Rapuano generator) to compute the next element in the sequence of random numbers. One quickly evaluates the overall memory traffic for 2000 spin-processing elements running at 200 MHz to be in excess of 4 TByte/s, orders of magnitude beyond the bandwidth available with the large memory banks outside the FPGA. The needed bandwidth is, on the other hand, available using the large number of memory blocks embedded inside our FPGAs; a rather complex memory allocation scheme that matches our requirements and can be efficiently implemented within the FPGA was devised for JANUS [10] and can be carried over directly to Janus II. This requires, however, that all data items required by the program fit inside the available on-chip memory. In our case, the size of the FPGA embedded memory is ∼ 32 Mbit, so we are able to handle 3D lattices with L < 200, taking into account that each lattice site needs 4 bits of data. Alternatively, one can squeeze 30 copies of a lattice of size 64^3 inside each SP, making it possible to run a large parallel tempering protocol on one or two SPs. In this case, the CP would collect the energies of the lattices at all temperatures {T_a} after n_PT Monte Carlo steps, re-assign temperatures according to Eq. (5) and start a new iteration.

If one wants to simulate larger lattices, all SPs can be used concurrently: under the same assumptions as above, the 16 SPs in one Janus II box are able to handle a 3D lattice with L ≈ 500, and even larger lattices fit the complete array of 16 boxes; in this case, the lattice is partitioned across the SPs in 1D or 2D slices, and the data associated with abutting faces of the sub-lattices are moved across SPs on the appropriate data links.

A combination of the strategies discussed above produces extremely high computing performance on Janus II. As discussed, we can partition the lattice on several SPs, slicing along one dimension. The average time to process one spin on each processor is

    T_spin = 1/(n_p f) ,    (8)

where n_p (n_p ∼ 2000) is the number of update cores available on each SP and f is the SP operating frequency, expected in the range from 125 to 250 MHz. If we partition our lattice over P processors (e.g., P = 16), the aggregate mean spin-update time is

    T_global = 1/(n_p f P) ,    (9)

corresponding to a T_global from 0.125 to 0.25 ps in our frequency range.

In order to sustain these processing rates, the node-to-node communication harness must provide a matching communication bandwidth: during the time in which one SP updates all the spins of its sub-lattice, we must move the data associated with the spin configuration of one face of the lattice from one SP to its neighbor. Each SP sweeps all the spins of its sub-lattice in a time

    T_lat = (1/(n_p f)) (L^3/P) .    (10)

The communication harness must move the data belonging to one 2D face of the lattice in the same amount of time (this is just one bit per site on the surface); assuming the network has n_l lanes, each with a communication bandwidth of f_c bit/s, we have

    T_dat = L^2/(n_l f_c) = [L^2/(n_l (f_c/f))] (1/f) .    (11)

Figure 5: Estimates of the computing time (T_lat(L), red) and the SP-to-SP communication time (T_com(L), blue) as a function of the lattice size L, assuming that the full lattice is split into 16 strips, each assigned to one SP within a Janus II box. One clearly sees that communication overheads are small for lattices of size L ∼ 150 or larger and become fully negligible as soon as L ≥ 250.

Communication is not a bottleneck as long as

    (1/n_p) (L^3/P) ≥ L^2/(n_l (f_c/f)) .    (12)

Figure 5 shows the behavior of the two sides of Eq. (12) as a function of the lattice size L, with the already stated values of the parameters and f_c/f = 15 (we expect that f_c/f will be somewhere in the 12 to 20 range): we see that the communication infrastructure is powerful enough to handle lattices with L ∼ 250 or larger.
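Indeed, solving Eq. (12) for L with n_p = 2000 and P = 16 as above, n_l = 8 (the eight physical lanes per logical link described in Sec. 3; an assumption on our part) and f_c/f = 15 reproduces this threshold:

```latex
L \;\ge\; \frac{n_p\, P}{n_l\,(f_c/f)}
  \;=\; \frac{2000 \times 16}{8 \times 15} \;\approx\; 267 .
```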
Let us consider a lattice that is very large by the current state of the art (e.g., L = 500); from either Eq. (9) or Eq. (10) one finds that the processing time for one sweep of the whole lattice is of the order of T_proc from 15 to 30 µs; in this simulation campaign, each Janus II box would run an independent replica of the system, so in one year of operation one can hope to follow ∼ 10 replicas of this very large system, at 3 or 4 values of the temperature, for several 10^{11} Monte Carlo steps.

5. Janus II impact on spin-glass simulations

To a large extent, Janus II is a follow-up of JANUS, which has been a major player in the field of spin glasses during the last five years [22, 23, 24, 25, 26, 27, 28, 29, 30]. Hence, it is natural to ask which important physics questions are accessible to Janus II that were not within reach of JANUS.

In the previous sections we have estimated that the computing power available from one SP in Janus II is roughly 10× larger than that available with JANUS. The (on-board) available memory is also 10× larger and, last but not least, SP-to-SP communications make it possible to efficiently simulate SG samples on just one SP or on a collection of SPs, allowing flexible ways to trade the simulation speed of one sample against the concurrent simulation of several samples.

With these figures in mind, a rather blunt comparison with JANUS would be as follows. The total number of spin updates in a simulation campaign is

    N_spin-flips = N_T × N_spins × N_MCS × N_samples ,    (13)

where N_T is the number of temperatures at which we simulate, N_spins is the number of spins in the simulated lattice (i.e., in D spatial dimensions, for a lattice of size L, N_spins = L^D), N_MCS is the number of full-lattice updates performed for a single sample and N_samples is the number of independent samples in the simulation. As we said above, for a given wall-clock time, on Janus II the l.h.s. of Eq. (13) will be roughly ten times larger than on JANUS.

In fact, depending on the setup and the goals of the simulation campaign, with Janus II we can select which of the factors in (13) we want to increase by 10×, or we can decide to spread the total gain over two or more such factors. In addition, thanks to the improved communications, it is possible to spread the simulation of a single sample over several FPGAs, thus further increasing N_spins or N_MCS at the cost of reducing N_samples. It turns out that increasing by one order of magnitude either N_spins or N_MCS or N_samples opens new opportunity windows.

Roughly speaking, typical SG simulations come in two flavors: non-equilibrium and equilibrium. Surprisingly enough, the two turn out to be complementary [26].

In non-equilibrium simulations one tries to analyze the relaxation processes that take place in experimental spin glasses such as CuMn. Below their glass temperature, such materials never reach thermal equilibrium. Hence, one should perform simulations at a single temperature (i.e., N_T = 1), with a dynamic rule, such as Metropolis or heat bath, that tries to mimic the real spin dynamics. These simulations should be as long as possible (i.e., N_MCS should be large), and the system size (i.e., N_spins) should be large enough to ensure that thermal equilibrium is never approached. The only good news is that the number of samples can be moderate, N_samples ∼ 100, because most of the quantities that one computes are self-averaging (i.e., their sample-to-sample fluctuations tend to zero as 1/N_spins^a, with a ≈ 1/2).

On the other hand, we have equilibrium simulations. Here we need to approach the equilibrium distribution, Eq. (3). We are not tied to any physical dynamics: any trick that one may invent is acceptable, provided that it verifies the balance condition [18]. In particular, we may employ the parallel tempering algorithm explained in Sec. 2, which requires N_T ∼ 40. As one may easily guess, the larger the system size, the more valuable the physical information obtained from the simulation. Unfortunately, the efficiency of parallel tempering is rather moderate: JANUS established a world record by equilibrating lattices with L = 32 in three dimensions [25]. Another big issue is that the interesting physical quantities are not self-averaging at equilibrium: sample-to-sample fluctuations are huge, which makes it desirable to simulate a large number of samples.

At this point we are ready to appreciate the benefits of increasing by a factor of 10 each of the individual factors on the r.h.s. of Eq. (13).

• Increasing system sizes will mostly benefit non-equilibrium simulations. Indeed, the coherence length ξ(t), the typical size of the glassy domains, grows with the simulation time as ξ(t) ∼ t^{1/z(T)}, with z(T) ≈ 6.86 T_c/T [22, 23] (we measure the time t in lattice sweeps; T_c = 1.109(10) is the critical temperature [31]). In experimental samples, ξ(t) is negligibly small compared with the system size: typical figures are L = 10^8 and ξ(t) ∼ 100 lattice spacings [32, 33]. In fact, we know that in order to stay in the non-equilibrium regime one should have L ≥ 7ξ(t) [22]. In other words, for any L there is a maximum safe simulation time t*. This t* was amply surpassed in some of the simulations performed with JANUS. Indeed, in a month of continued operation, one of the JANUS FPGAs simulated an L = 80 lattice up to t = 10^{11} (the equivalent of one tenth of a second in physical time). However, in particular close to the critical temperature, L = 80 is not large enough.
Finite-size effects were felt at t* = 10^9. Fortunately, in the same month of continued operation, Janus II will be able to reach t = 10^{11} for lattice sizes L = 180 (single FPGA), L ≃ 500 (16 FPGAs in a single board working in parallel) or L ≃ 700 (full machine). It is highly unlikely that finite-size effects will be relevant for t = 10^{11} and L ≃ 500.

• Increasing the number of samples. Previous JANUS campaigns were remarkable for the sizes of the simulated samples and the low temperatures reached. However, the number of simulated samples was typically in the range 1000-10000. Some important physical effects, however, can be traced only through rare events. Hence, an adequate investigation requires a significant boost in the number of samples (at least by a factor ∼ 10). There are at least two major problems where the sample-number issue is crucial. One is the so-called temperature chaos problem [34]. The other is the survival (or not) of the spin glass phase in the presence of an external magnetic field [35, 36, 37, 28, 29].

• Increasing the simulation time. Both equilibrium and non-equilibrium simulations may benefit from increasing N_MCS. Non-equilibrium simulations at temperatures T = 0.6 and below reached a quite modest coherence length ξ(t) at t = 10^{11} [22, 23]. Thus, extending the duration of these L = 80 simulations to t = 10^{12} will be informative while not endangering the non-equilibrium condition L ≥ 7ξ(t). Another off-equilibrium example is the dynamical study of the possible transition in the physics of the spin glass in a field in D = 3. With JANUS we were able to identify a dynamical transition, but our precision was not enough to decide definitively between several possible scenarios [29]. Extending the time window over which we follow the evolution of the system could be crucial to improve our understanding of this system.

In equilibrium simulations, one could either try to lower the reached temperature while keeping the system size fixed at L = 32, or increase the system size to L = 48 while holding fixed the lowest temperature T_min = 0.7026 [25]. By decreasing the lowest temperature, we could probe deep into the spin-glass phase to study its many intriguing features (ultrametricity, statistics of overlap distributions, temperature chaos, etc.). On the other hand, increasing the system size at fixed temperature should allow us to assess finite-size effects, and to make rare events less rare (the probability for a sample not to display a rare event is expected to go as exp[-N_spins Ω], with Ω small but positive [34]).

Finally, let us mention another frontier to be explored, namely the study of more sophisticated spin glasses. Indeed, JANUS' limited memory implied that, in practice, one was forced to consider only spin glasses with Ising spins. However, there are important problems [38, 39] that cannot be treated within this framework. Janus II should be able to simulate, at the very least, XY spins [n = 2 in Eq. (1)], maybe with some discretization. In fact, a Migdal-Kadanoff renormalization study of the discretization of the XY model by means of a clock model has recently appeared [40]. The discretization issue seems to be rather subtle and worth investigating in itself.
In short, the enhanced power of Janus II will allow us to improve our understanding of key topics in spin-glass physics that have already been investigated with JANUS (temperature chaos, ultrametricity, non-coarsening isothermal dynamics, presence of a phase transition in a field), but also to delve into new problems (more sophisticated spin-glass models, non-isothermal dynamics, etc.).

6. Performance comparison with commodity computers

When undertaking a major development project like Janus II, one should ensure that the performance gain over commodity computers is large enough to justify the effort, and that this gap can reasonably be forecast to persist for a long enough time window. In this section we compare the expected performance of Janus II with that available from several commodity systems, measured over the last few years, and try to derive reasonable forecasts for the near future.

We start by recalling that the discussion of the previous sections shows that our computational problem would be optimally suited by a super-slim processor that handles bit-valued variables. Since commodity processors have wide data words (and the current trend for recent processors is towards wider and wider vector words), efficient use of the computing resources mandates that spins and couplings of different sites of the same lattice be grouped together in the same (scalar or vector) data word and operated upon by bit-wise logic operations; this approach – which also naturally supports SIMD vectorization – is known in the literature as multi-spin coding [41, 42] (a minimal code sketch is given at the end of this discussion, below). One then maps V spins of a given sample onto the same computer word and processes these spins in parallel. In principle, V can be as large as the machine word size S, but one independent random value is needed for each spin, so, as V increases, the incremental performance gain quickly fades away. As a further optimization step, one can then process in parallel spins belonging to W independent samples (e.g., W = S/V), since just one random value can be used to process W spins belonging to independent samples, introducing a tolerable amount of sample-to-sample correlation; in the following we will say that we have a sample parallelism of degree V and a global parallelism of degree W.

The optimal trade-off for most commercial architectures is such that V is significantly smaller than S, implying that a large number W of samples is simulated concurrently. This is useful from the point of view of accumulating statistics over samples, but – we stress it once again – it in no way helps solve the key problem of speeding up the Monte Carlo dynamics of each sample; this is precisely where an application-driven architecture, for which V is O(10^3), produces its biggest dividends.

We will use performance metrics directly relevant for physics; we define the Sample spin Update Time (SUT) as the average time needed to update one spin of one lattice sample. For each SP in Janus II we have estimated in the previous section a SUT of 2 ps; for one full Janus II box working on one lattice, the SUT goes down to 0.125 ps. We also define the Global spin Update Time (GUT), appropriate when one simulation job handles several samples of the lattice at the same time; the GUT is simply defined as SUT/W. For Janus II, the GUT equals the SUT on each SP (and, when several SPs work on different samples, it can be defined as the SUT divided by the number of such SPs).
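The following C sketch illustrates the bit-packing at the heart of multi-spin coding, here with V = 1 and W = 64: the same site of 64 independent samples shares one 64-bit word, the six bond terms are XOR-ed in bulk, and the count of satisfied bonds is accumulated with bit-sliced full adders, so that all 64 samples can then be tested against a single random number. Names and layout are illustrative assumptions:

```c
/* Multi-spin coding sketch: one bit per sample (W = 64, V = 1). */
#include <stdint.h>

typedef uint64_t w64;

/* bit-sliced full adder: per-lane sum and carry of a + b + c */
static inline void fadd(w64 a, w64 b, w64 c, w64 *s, w64 *cy)
{
    *s  = a ^ b ^ c;
    *cy = (a & b) | (a & c) | (b & c);
}

/* Per-lane count (0..6) of satisfied bonds around one site.
 * si: the site's spin in all 64 samples; sn[k] and jw[k]: the k-th
 * neighbor spins and couplings.  The count, s0 + 2*s1 + 4*s2 in each
 * lane, indexes the acceptance look-up table as in Sec. 4. */
static void bond_count(w64 si, const w64 sn[6], const w64 jw[6],
                       w64 *s0, w64 *s1, w64 *s2)
{
    w64 b[6];
    for (int k = 0; k < 6; k++)
        b[k] = jw[k] ^ si ^ sn[k];      /* 1 = satisfied bond      */

    w64 p0, c0, p1, c1, cc;
    fadd(b[0], b[1], b[2], &p0, &c0);   /* first three bonds       */
    fadd(b[3], b[4], b[5], &p1, &c1);   /* last three bonds        */
    *s0 = p0 ^ p1;                      /* add the two partial     */
    cc  = p0 & p1;                      /* counts, bit-plane wise  */
    *s1 = c0 ^ c1 ^ cc;
    *s2 = (c0 & c1) | (c0 & cc) | (c1 & cc);
}
```

The acceptance mask for all 64 lanes is then built with boolean logic on the bit planes (s0, s1, s2) against thresholds pre-computed from the shared random number.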
When the JANUS project started, in early 2006, state-of-the-art commodity systems had dual-core CPUs; on those processors, carefully optimized codes had a SUT of ∼ 1000 ps and a GUT of ∼ 400 ps. In the following years, processors have changed significantly with the introduction of many-core CPUs and of general-purpose GPUs; these are better SG machines than traditional CPUs, as one maps the available parallelism onto more cores (or onto more threads, for GPUs). Over the years, we have compared [43, 44] JANUS with several multi-core systems. In Table 1 we report the best SUT measured on several processors for a simulation of a lattice of 64^3 sites. We clearly see that over the years the large performance gap of JANUS over commodity processors (e.g., the Core 2 Duo) has been significantly reduced; an interesting early example was the extremely efficient IBM Cell CPU, for which we measured a SUT of 150 ps. As of today, the best figure is offered by a 16-core Sandy Bridge system, for which the SUT is ≈ 60 ps. Processors like the Xeon Phi perform better on large lattices: for example, we have measured a SUT of 30 ps on a lattice of 128^3 sites, which improves on the Sandy Bridge performance by a factor of 2.

    System           Year   Power (W)   SUT (ps/flip)   Energy/flip (nJ/flip)
    Core 2 Duo       2007   150         1000            150
    CBE (16 cores)   2007   220         150             33
    JANUS            2008   35          16              0.56
    C1060            2009   200         720             144
    NH (8 cores)     2009   220         200             244
    C2050            2010   300         430             129
    SB (16 cores)    2012   300         60              18
    K20X             2012   300         230             69
    Xeon-Phi         2013   300         52              15.6
    Janus II         2013   25          2               0.05

Table 1: Spin-update time (SUT) of EA simulation codes on a 64^3 lattice on several architectures. CBE is a system based on the IBM Cell processor; the Tesla C1060, C2050 and K20X are NVIDIA GP-GPUs; NH (SB) are dual-socket systems based respectively on the 4-core Nehalem Xeon-5560 (8-core Sandy Bridge Xeon-E5-2680) processors; Xeon-Phi is the recently launched Intel MIC architecture. The table also shows rough estimates of the energy needed to perform all the computing steps associated with one spin flip.

Equally significant is the energy efficiency of the Janus II system; the data is shown again in Table 1, in which we display the approximate energy cost associated with the Monte Carlo update of one spin.
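To rough accuracy, the energy column of Table 1 is simply the product of the dissipated power and the spin-update time; two checks against the values quoted above:

```latex
E_{\mathrm{flip}}(\text{Janus II}) \approx 25\,\mathrm{W} \times 2\,\mathrm{ps/flip} = 0.05\,\mathrm{nJ/flip},\\
E_{\mathrm{flip}}(\text{SB})       \approx 300\,\mathrm{W} \times 60\,\mathrm{ps/flip} = 18\,\mathrm{nJ/flip}.
```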
All in all, a Janus II box will be able to simulate in parallel a large spin glass lattice more than 200 times faster than the best currently available commodity option, and using ∼ 300 times less energy.

The next obvious question one has to face when developing a custom system is how long it will keep its performance edge over commercial systems. Looking at Figure 6, which plots the data of Table 1, we see that the performance of spin-glass applications on commodity systems has increased over time following a regular trend. Conversely, application-specific projects evolve in steps, as there is no performance increase until a new generation is developed. The plot clearly shows three lines of evolution for commodity systems: they all scale according to Moore's law, with different pre-factors corresponding to different, broadly defined families of architecture.

Looking at the SUT figures for the Intel Nehalem and Sandy Bridge micro-architectures with respect to those of the Core 2 Duo processor, we clearly see an abrupt jump in the scaling behavior associated with Moore's law; we interpret this fact as the consequence of a performance jump that happened when multi-core processors were introduced, followed by regular Moore's-law behavior (compare the two Moore's-law lines in the picture).

Looking at the performance plots of the JANUS-class machines, we see that JANUS will remain competitive until the end of 2014, and Janus II comes into operation at the end of life of its predecessor; from this analysis we can reasonably look into our crystal ball and expect that Janus II should remain competitive through the year 2017. Our analysis also shows the outstanding performance of the IBM Cell processor, whose production has however been discontinued, and the poor performance of GPU-based accelerators, which suffer because they are strongly optimized for floating-point arithmetic and lack the cache systems that are crucial for this class of applications. Concerning the very recent Xeon Phi processors: in spite of very careful optimization, their performance is not better than that of a dual Sandy Bridge system for small lattices; on the other hand, the large on-chip caches of this processor keep its performance constant on larger lattices [45].

Figure 6: Performance trends (measured in spin flips per picosecond) for the simulation of the EA spin glass model with optimized programs on several commodity architectures and on JANUS and Janus II. The lines scale according to Moore's law. See the text for a complete discussion.

7. Conclusions

In this paper we have described the architecture and implementation of the Janus II application-driven machine, emphasizing its potential for performance in the simulation of spin glass systems. As described in detail in the previous sections, the new machine will make it possible to carry out Monte Carlo simulation campaigns that would take centuries if performed on currently available computer systems.

The possibility of obtaining such a large performance gap stems mainly from the fact that the number-crunching requirements associated with this class of simulations are very different from those for which state-of-the-art computers are optimized. At the same time, FPGAs offer an enabling technology that makes it possible to implement real machines with a reasonable engineering effort and at costs affordable to a small scientific collaboration.

Janus II builds and improves on the experience of its predecessor – JANUS – which has been running physics simulations for the last 6 years, and replaces the older machine at a point in time when the JANUS performance edge over commercial systems has been significantly reduced.

JANUS and Janus II have been designed with the main aim of speeding up the Monte Carlo simulation of (a wide class of) spin glass models. At the basic hardware level, however, neither machine is specialized for these classes of simulations, so their use for other computational tasks is in principle possible and efficient. In practice, attempts at using JANUS for other applications hit the serious bottleneck of the small size of the available memory. Janus II explicitly addresses this problem, since each SP node has 2 large banks of fast memory; we are now starting to work on the assessment of the potential efficiency of our machine for other applications, including such areas as cryptography, graph optimization and the simulation of VLSI circuits.

Acknowledgments

We warmly acknowledge the excellent work done by the Janus II team at Link Engineering. In particular, we thank Pietro Lazzeri, Pamela Pedrini, Roberto Preatoni, Luigi Trombetta and Alessandro Zambardi for their professional and enthusiastic work.
The Janus II project was supported by the European Regional Development Fund (ERDF/2007-2013; FEDER project UNZA08-4E-020); by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013; ERC grant agreement no. 247328); by MICINN (Spain) (contracts FIS2012-35719-C02 and FIS2010-16587); by the Junta de Extremadura (contract GR101583); and by the Italian Ministry of Education and Research (PRIN Grant 2010HXAW77 007).

References

[1] C. Angell, Science 267, 1924 (1995).
[2] P. Debenedetti, Metastable Liquids, Princeton University Press, Princeton (1997).
[3] R. Tripiccione, Comp. Phys. Comm. 169, 442 (2005).
[4] J. Makino et al., The Astrophysical Journal 480, 432 (1997).
[5] D. E. Shaw et al., Communications of the ACM 51, 91 (2008).
[6] R. Pearson, J. Richardson and D. Toussaint, A Special Purpose Machine for Monte Carlo Simulations, Tech. Report NSF-ITP-81-139, Inst. Theoretical Physics, Univ. California, Santa Barbara, 1981.
[7] J. H. Condon and A. T. Ogielski, Rev. Sci. Instruments 56, 1691 (1985); A. T. Ogielski, Phys. Rev. B 32, 7384 (1985).
[8] J. Pech et al., Comp. Phys. Comm. 106, 10 (1997); A. Cruz et al., Comp. Phys. Comm. 133, 165 (2001).
[9] F. Belletti et al., Computing in Science & Engineering 8, 41 (2006).
[10] F. Belletti et al., Computer Physics Communications 178, 208 (2008).
[11] F. Belletti et al., Computing in Science & Engineering 11, 48 (2009).
[12] A. P. Young (editor), Spin Glasses and Random Fields (World Scientific, Singapore, 1998).
[13] J. A. Mydosh, Spin Glasses: an Experimental Introduction (Taylor and Francis, London, 1993).
[14] F. Barahona, J. Phys. A 15, 3241 (1982).
[15] S. F. Edwards and P. W. Anderson, J. Phys. F: Metal Phys. 5, 965 (1975); ibid. 6, 1927 (1976).
[16] K. Binder and D. W. Heermann, Monte Carlo Simulation in Statistical Physics (Springer, Berlin, 2010).
[17] M. Creutz, Quantum Fields on the Computer, World Scientific, 1992.
[18] A. D. Sokal, in Functional Integration: Basics and Applications (1996 Cargèse School), C. DeWitt-Morette, P. Cartier and A. Folacci, eds. (Plenum, New York, 1997).
[19] E. Marinari and G. Parisi, Europhys. Lett. 19, 451 (1992); K. Hukushima and K. Nemoto, J. Phys. Soc. Jpn. 65, 1604 (1996); M. C. Tesi et al., J. Stat. Phys. 82, 155 (1996).
[20] http://www.linkengineering.it
[21] G. Parisi and F. Rapuano, Phys. Lett. B 157, 301 (1985).
[22] Janus Collaboration: F. Belletti et al., Phys. Rev. Lett. 101, 157201 (2008).
[23] Janus Collaboration: F. Belletti et al., J. Stat. Phys. 135, 1121 (2009).
[24] Janus Collaboration: A. Cruz et al., Phys. Rev. B 79, 184408 (2009).
[25] Janus Collaboration: R. A. Baños et al., J. Stat. Mech. P06026 (2010).
[26] Janus Collaboration: R. Alvarez Baños et al., Phys. Rev. Lett. 105, 177202 (2010).
[27] Janus Collaboration: R. A. Baños et al., Phys. Rev. B 84, 174209 (2011).
[28] Janus Collaboration: R. A. Baños et al., Proc. Natl. Acad. Sci. USA 109, 6452 (2012).
[29] Janus Collaboration: M. Baity-Jesi et al., arXiv:1307.4998.
[30] M. Baity-Jesi et al., The European Physical Journal: Special Topics 210, 33 (2012).
[31] M. Hasenbusch, A. Pelissetto and E. Vicari, Phys. Rev. B 78, 214205 (2008).
[32] Y. G. Joh et al., Phys. Rev. Lett. 82, 438 (1999).
[33] F. Bert et al., Phys. Rev. Lett. 92, 167203 (2004).
[34] L. A. Fernandez, V. Martin-Mayor, G. Parisi and B. Seoane, arXiv:1307.2361.
[35] A. J. Bray and M. A. Moore, Phys. Rev. B 83, 224408 (2011).
[36] A. P. Young and H. G. Katzgraber, Phys. Rev. Lett. 93, 207203 (2004).
[37] T. Jörg, H. Katzgraber and F. Krzakala, Phys. Rev. Lett. 100, 197202 (2008).
[38] A. P. Young and A. Sharma, Phys. Rev. B 83, 214405 (2011).
[39] V. Martin-Mayor and S. Perez-Gaviro, Phys. Rev. B 84, 024419 (2011).
[40] E. Ilker and A. Nihat Berker, Phys. Rev. E 87, 032124 (2013).
[41] C. Michael, Phys. Rev. B 33, 7861 (1986).
[42] G. Bhanot, D. Duke and R. Salvador, Phys. Rev. B 33, 7841 (1986).
[43] M. Guidetti et al., Spin Glass Monte Carlo Simulations on the Cell Broadband Engine, in Proc. of PPAM09, LNCS 6067, 467-476 (Springer, Heidelberg, 2010).
[44] M. Guidetti et al., Monte Carlo Simulations of Spin Systems on Multi-core Processors (K. Jonasson, ed.), LNCS 7133, 220-230 (Springer, Heidelberg, 2010).
[45] A. Gabbana, M. Pivanti, S. F. Schifano and R. Tripiccione, Benchmarking MIC architectures with Monte Carlo simulations of spin glass systems, in Proceedings of the High Performance Computing Conference, 2013, Bangalore (India), in press.