Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing

David Mallasén, Alberto A. Del Barrio, Senior Member, IEEE, and Manuel Prieto-Matias

Abstract—The accuracy requirements in many scientific computing workloads result in the use of double-precision floating-point arithmetic in the execution kernels. Nevertheless, emerging real-number representations, such as posit arithmetic, show promise in delivering even higher accuracy in such computations. In this work, we explore the native use of 64-bit posits in a series of numerical benchmarks and compare their timing performance, accuracy, and hardware cost to IEEE 754 doubles. In addition, we also study the conjugate gradient method for numerically solving systems of linear equations in real-world applications. For this, we extend the PERCIVAL RISC-V core and the Xposit custom RISC-V extension with posit64 and quire operations. Results show that posit64 can obtain up to 4 orders of magnitude lower mean squared error than doubles. This leads to a reduction in the number of iterations required for convergence in some iterative solvers. However, leveraging the quire accumulator register can limit the order of some operations, such as matrix multiplications. Furthermore, detailed FPGA and ASIC synthesis results highlight the significant hardware cost of 64-bit posit arithmetic and quire. Despite this, the large accuracy improvements achieved with the same memory bandwidth suggest that posit arithmetic may provide a potential alternative representation for scientific computing.

Index Terms—Arithmetic, Posit, IEEE-754, Floating point, Scientific computing, RISC-V, CPU, Matrix multiplication, PolyBench.

1 INTRODUCTION

Real-number arithmetic is at the core of many scientific workloads. Physical constants, data from sensors, and in general most inputs to experimental applications have to be represented accurately in a computer. Moreover, this accuracy has to be maintained throughout the execution of the algorithms that are at the root of scientific computing. Even minor errors can have significant consequences, potentially leading to incorrect predictions of the behavior of a system or inaccurate solutions to differential equations and optimization problems.

The most widely used representation of real numbers in a computer is the IEEE 754 standard for floating-point arithmetic [1]. Although this standard is considered a robust and reliable method for representing and operating with real numbers on a computer, it is not perfect. For instance, its results can be inconsistent across platforms, it does not ensure the associative property of additions and multiplications, it has signed zeros, and there is an excess of Not a Number (NaN) representations.

In the past years, other alternatives to this floating-point format have emerged. Some of these new arithmetic representations have been implemented by large technological companies, especially in the machine learning domain. Examples of this are Google's bfloat16 [2] or Nvidia's TensorFloat [3].
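As a concrete illustration of the non-associativity of IEEE 754 addition mentioned above, consider the following small C program (our own example, with values chosen so that one summand is absorbed by rounding):

```c
/* Non-associative IEEE 754 addition: the grouping changes the result,
 * because adding 1.0 to -1e16 falls below the rounding granularity. */
#include <stdio.h>

int main(void) {
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c); /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c)); /* prints 0 */
    return 0;
}
```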
In scientific computing, the solution to the accuracy requirements is to use wider floating-point representations such as double-precision floats. However, another solution is to explore emerging floating-point representations that provide more accuracy bits. One of the most promising alternatives for this purpose is posit numbers, which we study in this work.

All authors are with the Facultad de Informática, Universidad Complutense de Madrid, 28040 Madrid, Spain. E-mails: {dmallase, abarriog, mpmatias}@ucm.es

Targeting this goal, we have extended the PERCIVAL posit RISC-V core [4] to support 64-bit posits, as well as to include a more diverse and flexible design. This has allowed us to explore the native use of this arithmetic with a larger bit-width both at the hardware-cost level and at the accuracy and performance levels. For the first part, we have performed Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) synthesis of different configurations of Big-PERCIVAL, and we give a detailed analysis of the results. For the accuracy and performance comparison of the arithmetics, we have added posit support to the PolyBench benchmark suite [5]. In addition, we have also studied the accuracy of iterative linear solvers (conjugate gradient and biconjugate gradient) as well-known examples of widely used applications from science and engineering that can benefit from higher accuracy [6]. Results were measured on Big-PERCIVAL running on the Genesys II FPGA board.

Our main contributions can be summarized as follows:

• We present Big-PERCIVAL, an extension of the PERCIVAL posit RISC-V core [4] (https://github.com/artecs-group/PERCIVAL) which adds posit64 operations and increased flexibility. In particular, we support standard posit addition, subtraction, multiplication, division and square root, conversions to and from integer numbers, comparison operations, and register move instructions. Optionally, we also support quire operations and logarithm-approximate multiplication, division, and square root units.

• Detailed FPGA and ASIC synthesis results of the Posit Arithmetic Unit (PAU) in Big-PERCIVAL showcase the area impact of posit arithmetic and quire in different configurations of the core. These results are compared with the FPNew IEEE 754 Floating-Point Unit (FPU) [7]. An analysis of the individual units in the PAU gives insight into how the hardware resources are distributed among the different operations.

• Compiler support for posit64 numbers in the Xposit custom RISC-V extension in LLVM (https://github.com/artecs-group/llvm-xposit). This allows for easily embedding posit and quire instructions, including loads and stores, into C code.

• PolyBench benchmark results provide insight into how posit32 and posit64 numbers compare to IEEE 754 floats and doubles in terms of timing performance and accuracy. In particular, the impact of the quire accumulator register is also studied. Results show that 64-bit posits can provide up to 4 orders of magnitude lower Mean Squared Error (MSE) and up to 3 orders of magnitude lower Maximum Absolute Error (MaxAbsE) than 64-bit doubles. This provides improved accuracy with the same memory bandwidth as doubles.
• Iterative linear equation solvers, namely the conjugate gradient and biconjugate gradient algorithms, showcase how posit64 can reduce the number of iterations needed to reach a certain tolerance margin when solving large ill-conditioned systems, which are frequent in scientific computing and engineering problems.

• Detailed analysis of how leveraging the quire affects the order in which some operations are executed. For example, executing the General Matrix Multiplication (GEMM) kernel using a dot-product or a memory-aware method impacts the final timing performance and accuracy.

The rest of the paper is organized as follows: Section 2 introduces posit arithmetic. Related works on RISC-V cores, the use of posits in High-Performance Computing (HPC), and theoretical studies on posit64 are presented in Section 3. In Section 4 we describe the novelties in Big-PERCIVAL and the Xposit custom RISC-V extension. The synthesis results of the core and the individual posit64 units are analyzed in Section 5. Benchmark results using PolyBench targeting accuracy and timing performance are shown in Section 6, followed by the conjugate gradient use case in Section 7, and a more exhaustive analysis of the GEMM kernel in Section 8. Finally, Section 9 concludes this work.

2 POSIT ARITHMETIC

The posit number standard [8] defines a posit configuration from its total bit-width n. This allows for any posit size, but in the literature the most common ones are the byte-aligned posit8, posit16, posit32, and posit64 configurations. One of the main benefits of posit arithmetic is that it does not have a variety of special cases that have to be checked. Posits have only two special cases. The value zero is represented as 00...0, and the Not-a-Real (NaR) is represented as 10...0. The rest of the bit patterns are composed of the four fields shown in Figure 1.

Fig. 1. Posit format with sign, regime, exponent, and fraction fields.

These four bit-fields are:

• The sign bit S, whose value is s = 0 if the number is positive or s = 1 if it is negative.

• The variable-length regime field R, which consists of a series of k bits equal to R0 and terminated either by 1 − R0 or by the end of the posit. This field represents a long-range scaling factor r given by:

    r = −k       if R0 = 0
    r = k − 1    if R0 = 1

• The exponent field E, consisting of at most 2 bits. This field encodes an integer unbiased value e. Since the regime field is variable-length, one or both of the exponent bits may fall beyond the least significant bit of the posit. In that case, those bits take the value 0.

• The variable-length fraction field F, which is formed by the m remaining bits. Its value f is given by dividing the unsigned integer F by 2^m, and therefore 0 ≤ f < 1.

From these fields, we can calculate the real value p of a generic posit as:

    p = ((1 − 3s) + f) × 2^((1−2s)·(4r+e+s)).    (1)

This is the most efficient decoding of posits, as shown in [9], [10]. The most notable differences in this value representation between posit arithmetic and the IEEE 754 floating-point standard are the existence of the variable-length regime, the use of an unbiased exponent, and the value of the hidden bits [9]. In floating-point arithmetic, the hidden bit is fixed to 1, except for the subnormal numbers, where it is fixed to 0. However, in posit arithmetic, it is kept as 1 if the number is positive, or changed to −2 if the number is negative.

Fig. 2. Decoding example of a posit16.
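To make this decoding concrete, the following C sketch (illustrative code written for this article, not part of the Xposit toolchain) extracts the four fields of a 16-bit posit, with es = 2 as fixed by the 2022 standard [8], and applies Equation (1). The worked example below traces the same steps by hand.

```c
/* Decode a 16-bit posit (es = 2) to a double, following Equation (1). */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

double posit16_to_double(uint16_t bits) {
    if (bits == 0x0000) return 0.0;
    if (bits == 0x8000) return NAN;             /* NaR */
    int s = bits >> 15;                         /* sign bit */
    int r0 = (bits >> 14) & 1;                  /* first regime bit R0 */
    int k = 0, i = 14;
    while (i >= 0 && ((bits >> i) & 1) == r0) { k++; i--; }
    int r = r0 ? (k - 1) : -k;                  /* regime scaling factor */
    i--;                                        /* skip terminating regime bit */
    int e = 0;                                  /* up to 2 exponent bits; */
    for (int j = 0; j < 2; j++, i--)            /* missing bits count as 0 */
        e = (e << 1) | (i >= 0 ? (bits >> i) & 1 : 0);
    int m = i + 1;                              /* remaining fraction bits */
    double f = m > 0 ? (double)(bits & ((1u << m) - 1)) / (double)(1u << m) : 0.0;
    /* Equation (1): p = ((1 - 3s) + f) * 2^((1 - 2s)(4r + e + s)) */
    return ((1 - 3 * s) + f) * pow(2.0, (1 - 2 * s) * (4 * r + e + s));
}

int main(void) {
    /* The worked example below: 1111101010010110 = 0xFA96 */
    printf("%.9f\n", posit16_to_double(0xFA96)); /* -0.000043154... */
    return 0;
}
```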
As an example, let 1111101010010110 be the binary encoding of a posit16 (Figure 2). The first bit s = 1 indicates a negative number. The regime field 11110 gives k = 4 and therefore r = 3. The next two bits 10 represent the exponent e = 2. Finally, the remaining m = 8 bits, 10010110, encode a fraction value of f = 150/2^8 = 0.5859375. Hence, from Equation (1) we conclude that 1111101010010110 ≡ (−2 + 0.5859375) × 2^(−(4·3+2+1)) = −0.000043154.

The variable-length regime field acts as a long-range dynamic exponent, as can be seen in Equation (1), where it is multiplied by 4 or, equivalently, shifted left by the two exponent bits. Since the regime and the fraction are dynamic fields, they allow for more flexibility in the trade-off between accuracy and dynamic range that can be achieved by a posit. If the regime field occupies more bits, it represents larger numbers at the cost of lower accuracy. On the other hand, when the regime field consists of fewer bits, posits have higher accuracy in the neighborhoods of ±1.

In posit arithmetic, NaR has a unique representation that maps to the most negative 2's complement signed integer. Consequently, if used in comparison operations, it results in less than all other posits and equal to itself. Moreover, the rest of the posit values follow the same ordering as their corresponding bit representations. These characteristics allow posit numbers to be compared as if they were 2's complement signed integers, eliminating additional hardware for posit comparison operations.

Posit arithmetic also includes fused operations using the quire, a 16n-bit fixed-point 2's complement register. This special accumulation register allows for the execution of up to 2^31 − 1 Multiply-Accumulate (MAC) operations without intermediate rounding or accuracy loss. These operations are very common when computing dot products, matrix multiplications, or other more complex algorithms. The additional accuracy that can be achieved using the quire can allow the execution of these algorithms with narrower posit configurations [11], [12], [13], thus avoiding the limits that can occur in memory bandwidth.

Currently, one of the main drawbacks of posit arithmetic is its higher area cost [4]. For an accurate comparison between posits and floats, the FPU must be IEEE 754 compliant instead of being limited to normal floats only. The authors in [14] state that posit hardware is slightly more expensive than floating-point hardware that does not take subnormal numbers into account. Moreover, adding a wide quire accumulator register further increases the area cost of implementing these fused operations.

3 RELATED WORK

Common open-source application-class RISC-V cores include support for the F and D RISC-V extensions for IEEE 754 single- and double-precision floating-point numbers (which are part of the G compilation). Some of the most notable ones include the Rocket [15], the CVA6 (formerly Ariane) [16], the Berkeley Out-of-Order Machine (BOOM) [17], and the SHAKTI C-Class processor [18].

There have been several previous proposals for including some level of posit arithmetic capabilities in RISC-V cores. PERC [19] and PERI [20] included PAUs in the Rocket core and the SHAKTI C-class core, respectively. These proposals were constrained by the F and D extensions and thus did not include quire support. CLARINET [21] added fused MAC, and fused divide and accumulate operators using the quire to an RV32IMAFC RISC-V core.
In [22], the authors store IEEE floats in memory by first converting them to the posit representation. This allows storing the real-number values with a lower bit-width while performing the computations using the IEEE 754 FPU. More recently, a posit dot-product unit was presented in [23]. This open-source implementation allows performing high-throughput dot products for deep learning applications.

In the HPC field, [12] explores the use of 16-bit posits in a shallow water model as an example of a medium-complexity climate application. In [24], the authors adapt the NAS Parallel Benchmarks (NPB) to 32-bit posits, concluding that they obtain between 0.6 and 1.4 additional decimal digits of accuracy. However, the software emulation resulted in a 4× to 19× performance overhead. An evaluation of the numerical stability of posits for solving linear systems is presented in [25]. Here, the authors find that there is no big difference between posits and floats in the native range of the matrices that they test. However, when re-scaling the matrices to optimize the use of the posit representation, they obtained 4 extra bits of precision with 32-bit numbers and 2 extra bits with 16-bit numbers. The use of 32-bit posits for the conjugate gradient method is also studied in [26]. A PAU called POSAR is presented in [27]. This work analyzes 32-bit posits on one NPB scientific application and a Convolutional Neural Network (CNN) inference model.

The use of 64-bit posits is studied theoretically in [28] and in the initial presentation of this arithmetic [29]. At the hardware level, synthesis results of some posit64 operators are given in [10], [30], [31], and a GEMM accelerator that also supports 64-bit posits is presented in [32]. A more complex architecture with a reconfigurable posit tensor unit supporting posit64 is introduced in [33].

4 BIG-PERCIVAL CORE

The PERCIVAL [4] posit RISC-V core is based on the application-level CVA6 core [16], to which it adds tightly coupled 32-bit posit and quire capabilities. This allows both posits and IEEE 754 floats to coexist natively in hardware, which is essential for the evaluation of their strengths and weaknesses. The Big-PERCIVAL core we present in this paper is a modified version of PERCIVAL that supports either posit64 or posit32 numbers (see Figure 3), although in this work we have focused on the 64-bit version. This includes a parameterized 16n-bit quire, which results in 1024 bits for posit64 numbers. We also provide a 32-entry posit register file for either posit32 or posit64. The decision between single- or double-precision posits is taken at synthesis time.

In addition to adding 64-bit versions of all the computational and conversion units, we include some additional customizations to the PAU (see Figure 4). These are set with parameters in the Register-Transfer Level (RTL) description of the core, so they take effect at synthesis time. Furthermore, as the multiplicative arithmetic units are typically among the most power-hungry modules [34], [35], [36], we permit either the use of the logarithm-approximate units presented in [37] or exact units for the multiplication, division, and square root operations [38]. The exact units for these last two operations use a non-restoring division algorithm [39]. Also, all of the quire operations are optional, and the quire can be enabled or disabled completely. These options allow a fine-grained study of the individual impact of the quire accumulator register and of approximate versus exact computational units.
The input operands can be either 32 or 64 bits. However, the output will always be 64 bits, as the core follows the RV64 architecture (XLEN=64).

Fig. 3. Block diagram of Big-PERCIVAL. Extended modules are highlighted in green. The PAU with the new posit64 units is shown in orange.

Fig. 4. Internal structure of the Posit Arithmetic Unit (PAU): posit computational operations (ADD, and exact or approximate MUL, DIV, and SQRT), optional quire operations (MAC, Q2P) with a 512/1024-bit quire, and conversions to/from integers. The blocks modifiable with SystemVerilog parameters are shown in green.

The new 64-bit operations were validated with an extensive test suite. These tests cover the full range of instructions that Big-PERCIVAL can execute. To obtain the expected outputs of each series of operations we used the Universal Numbers software library [40].

To compile the new 64-bit instructions, we updated the modified LLVM compiler with support for the Xposit RISC-V custom extension [4] so that it can generate double-precision loads and stores. The rest of the instructions are maintained as in the original proposal, since we only support either posit32 or posit64 at any one time and the decoding can be reused. The updated list of the full Xposit instruction set can be found in Table 1.

5 SYNTHESIS RESULTS

In this section, we present FPGA and ASIC synthesis results of different configurations of Big-PERCIVAL, as well as detailed area costs of the individual posit units. This highlights the hardware cost of posit numbers and the quire, which is the main drawback we have observed.

5.1 FPGA

For the FPGA synthesis results we used Vivado v2021.2 targeting a Genesys II (Xilinx Kintex-7 XC7K325T-2FFG900C) board. In every case, the target frequency of 50 MHz was met. This parameter is defined by the base CVA6 core. With this setup, we synthesized different configurations of the PAU with or without quire and with or without approximate division and square root units. These results are shown in Table 2.

TABLE 1
Updated instruction set of the Xposit custom RISC-V extension.
            31..20     19..15  14..12  11..7     6..0
PLW         imm[11:0]  rs1     0x1     rd        0x0B
PLD         imm[11:0]  rs1     0x5     rd        0x0B

            31..25     24..20  19..15  14..12  11..7     6..0
PSW         imm[11:5]  rs2     rs1     0x3     imm[4:0]  0x0B
PSD         imm[11:5]  rs2     rs1     0x6     imm[4:0]  0x0B

            31..27  26..25  24..20  19..15  14..12  11..7  6..0
PADD.S      0x00    0x2     rs2     rs1     0x0     rd     0x0B
PSUB.S      0x01    0x2     rs2     rs1     0x0     rd     0x0B
PMUL.S      0x02    0x2     rs2     rs1     0x0     rd     0x0B
PDIV.S      0x03    0x2     rs2     rs1     0x0     rd     0x0B
PMIN.S      0x04    0x2     rs2     rs1     0x0     rd     0x0B
PMAX.S      0x05    0x2     rs2     rs1     0x0     rd     0x0B
PSQRT.S     0x06    0x2     0x0     rs1     0x0     rd     0x0B
QMADD.S     0x07    0x2     rs2     rs1     0x0     0x0    0x0B
QMSUB.S     0x08    0x2     rs2     rs1     0x0     0x0    0x0B
QCLR.S      0x09    0x2     0x0     0x0     0x0     0x0    0x0B
QNEG.S      0x0A    0x2     0x0     0x0     0x0     0x0    0x0B
QROUND.S    0x0B    0x2     0x0     0x0     0x0     rd     0x0B
PCVT.W.S    0x0C    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.WU.S   0x0D    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.L.S    0x0E    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.LU.S   0x0F    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.W    0x10    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.WU   0x11    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.L    0x12    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.LU   0x13    0x2     0x0     rs1     0x0     rd     0x0B
PSGNJ.S     0x14    0x2     rs2     rs1     0x0     rd     0x0B
PSGNJN.S    0x15    0x2     rs2     rs1     0x0     rd     0x0B
PSGNJX.S    0x16    0x2     rs2     rs1     0x0     rd     0x0B
PMV.X.W     0x17    0x2     0x0     rs1     0x0     rd     0x0B
PMV.W.X     0x18    0x2     0x0     rs1     0x0     rd     0x0B
PEQ.S       0x19    0x2     rs2     rs1     0x0     rd     0x0B
PLT.S       0x1A    0x2     rs2     rs1     0x0     rd     0x0B
PLE.S       0x1B    0x2     rs2     rs1     0x0     rd     0x0B

TABLE 2
Area results of the PAU for different configurations of Big-PERCIVAL.

PAU                DivSqrt  LUTs   FFs   DSPs  SRLs
32-bit  No quire   Approx.  5666   689   2     0
                   Exact    5923   1453  2     36
        Quire      Approx.  11605  2923  4     0
                   Exact    12908  3640  4     35
64-bit  No quire   Approx.  8561   1075  12    0
                   Exact    15959  4233  12    71
        Quire      Exact    29781  7274  24    584

When comparing with the FPU available in the CVA6 (FPNew [7]), we observe that 32-bit posits, and especially the 64-bit variant, require significantly more resources than floats of the same length. The 32-bit FPU occupies 4045 Lookup Tables (LUTs), 1008 flip-flops (FFs), and 2 Digital Signal Processing (DSP) blocks. The 64-bit-only variant of the FPU (no 32-bit support) requires 6243 LUTs, 1893 FFs, and 9 DSP blocks. The exact PAUs without quire need almost 50% more LUTs and FFs in the 32-bit case, and more than 2.5× as many resources in the 64-bit case. This shows that the growth in hardware usage is significantly steeper for posit numbers. Adding quire capabilities results in an even higher hardware cost but, as will be shown in Section 6, this allows obtaining more accurate results with the same memory bandwidth as double-precision floats.

If the target application can tolerate some errors in the division and square root outputs, using the logarithm-approximate units [37] allows for a significant reduction in LUT and FF usage. This is especially true in the 64-bit case, where the total PAU LUTs can be cut in half and the FFs reduced to only 25%.

Detailed area results for the individual units in the 64-bit PAU are shown in Table 3. As a reference, the whole Big-PERCIVAL core requires 77888 LUTs, 33437 FFs, and 51 DSP blocks.

TABLE 3
Area results of the 64-bit components of the PAU.

Name                LUTs  FFs   DSPs  SRLs
Quire MAC           9834  1909  12    515
Posit Div           5710  1903  0     61
Posit Sqrt          3447  1442  0     10
Quire to Posit      3290  236   0     0
Posit Add           1544  77    0     0
Posit Mult          1503  114   12    0
Posit Div Approx.   870   106   0     0
Posit Sqrt Approx.  618   68    0     0
Posit to ULong      595   0     0     0
Posit to Long       580   0     0     0
ULong to Posit      500   0     0     0
Long to Posit       429   0     0     0
UInt to Posit       356   0     0     0
Posit to UInt       342   0     0     0
Posit to Int        311   0     0     0
Int to Posit        202   0     0     0
Unsurprisingly, the largest unit corresponds to the quire MAC operation which, together with the quire-to-posit unit and the additional quire logic in the PAU, accounts for 50% of the hardware resources of the PAU. This is in line with the results for the 32-bit PAU with quire. The exact division and square root units are the next largest components of the PAU. The cost of these exact units is substantially reduced when using the corresponding approximate units. As the works in [36], [37], [41], [42] highlight, error-resilient applications may benefit from these. Such is the case of Deep Neural Networks (DNNs), filters, and other machine learning kernels.

5.2 ASIC

Regarding ASIC synthesis, we targeted TSMC's 28 nm standard-cell library to obtain more insight into the area and power cost of the 64-bit PAU and FPU. The synthesis was performed using Synopsys DC with a 5 ns timing constraint and a toggle rate of 0.1. The 64-bit FPU (without 32-bit support enabled, to match the PAU) requires an area of 21853 µm² and consumes 0.738 mW of power. On the other hand, the exact 64-bit PAU with quire occupies 114695 µm² and consumes 3.516 mW of power. The 64-bit PAU without quire uses significantly fewer resources: it spans 71090 µm² and consumes 1.958 mW. These ASIC area and power results are in line with the FPGA results we obtained previously, highlighting the main drawback of current hardware implementations of posit arithmetic.

6 POLYBENCH

In this work, we used the PolyBench suite [5] to benchmark Big-PERCIVAL. This benchmark suite contains a series of numerical computations with static control flow from various domains such as linear algebra computations or physics simulation. From these, we selected some representative algorithms to study how posit32 and posit64 compare to IEEE 754 floats and doubles in scientific computing calculations. PolyBench implements each benchmark in a single file, with some header parameters and a series of compile-time directives. To compile them, we employed the modified version of LLVM with Xposit. Furthermore, the compile-time directives allow both the study of the accuracy of the results and the accurate measurement of performance using cache flushing and multiple executions.

We have chosen to port the following representative benchmarks to posit arithmetic:

• Covariance: Computes the covariance of N data points, each with M attributes.
• GEMM: Generalized matrix multiply from BLAS [43], C = αAB + βC.
• 3mm: Linear algebra kernel that consists of three matrix multiplications, G = (AB)(CD).
• Cholesky: Cholesky decomposition of a positive-definite matrix A into a lower triangular matrix L such that A = LL^T.
• Durbin: An algorithm for solving Yule-Walker equations, which are a special case of Toeplitz systems.
• ludcmp: LU decomposition followed by forward and backward substitutions to solve a system of linear equations.
• fdtd-2d: Simplified Finite-Difference Time-Domain method for 2D data. This models electric and magnetic fields based on Maxwell's equations.
• seidel-2d: Gauss-Seidel style computation over 2D data with a 9-point stencil pattern.

Porting these benchmarks entailed translating the arithmetic kernels from IEEE 754 floats or doubles to inline assembly with posit32 or posit64 instructions. The code structure was kept unchanged, including the initialization phase that populates the input data of the algorithms. A sketch of this translation is shown below.
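The following sketch illustrates how a posit64 dot-product accumulation could be embedded in C through Xposit inline assembly. The mnemonics (QCLR.S, PLD, QMADD.S, QROUND.S, PSD) come from Table 1, but the posit register names (pt0-pt2), the lowercase spelling, and the operand constraints are our assumptions, and clobber lists are omitted for brevity; this is not verbatim code from the benchmark sources.

```c
/* Sketch of a posit64 dot product accumulated in the quire with a single
 * final rounding. Posit64 values are kept in memory as raw 64-bit words. */
#include <stddef.h>
#include <stdint.h>

static void posit64_dot(const uint64_t *a, const uint64_t *b,
                        uint64_t *result, size_t n) {
    asm volatile("qclr.s");                    /* clear the quire */
    for (size_t i = 0; i < n; i++) {
        asm volatile("pld pt0, 0(%0)\n\t"      /* load posit64 a[i] */
                     "pld pt1, 0(%1)\n\t"      /* load posit64 b[i] */
                     "qmadd.s pt0, pt1"        /* quire += a[i] * b[i] */
                     :
                     : "r"(&a[i]), "r"(&b[i]));
    }
    asm volatile("qround.s pt2\n\t"            /* round quire to a posit64 */
                 "psd pt2, 0(%0)"              /* store the rounded result */
                 :
                 : "r"(result));
}
```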
All tests were compiled with the -O3 flag of gcc and clang to optimize the execution of the kernels and minimize the difference between the original code and the straightforward translation to posit assembly. For each configuration, we provide six executions: single- and double-precision IEEE 754 arithmetic with fused MAC operations as optimized by the compiler, posit32 and posit64 arithmetic with quire fused MAC operations, and posit32 and posit64 arithmetic without quire, that is, replacing these MAC operations with individual multiplication and addition operations. The fdtd-2d and seidel-2d benchmarks do not benefit from fused MAC operations; hence, they have a single posit32 and posit64 execution.

Each of these benchmarks was executed with four problem sizes. These range from the MINI datasets, which require less than 16 KB of memory each, to the LARGE datasets, which occupy around 25 MB of memory. The cache hierarchy of PERCIVAL comprises a 32 KB 8-way set-associative L1 data cache with 16-byte lines. Therefore, these problem sizes stretch the whole memory range.

The performance results reported by the PolyBench timing script are shown in Table 4. As can be seen, float is faster than posit32 in practically all scenarios. However, when comparing doubles and posit64, there is no clear winner except for the GEMM benchmark, which will be studied in detail in Section 8. In the rest of the algorithms, some perform better using doubles and others using posit64. These variations are due in part to the translation of the execution kernels from the native C double datatype to the posit64 inline assembly of the Xposit RISC-V custom extension. Furthermore, the execution of 64-bit posits without quire is slower than the other options, since an individual multiplication plus addition has a higher latency than a single fused MAC operation, whether using doubles or posit64 with quire. Again, this does not hold in the GEMM benchmark, which will be detailed in Section 8. Therefore, we can conclude that 64-bit posits with quire, when integrated into the CVA6, do not suffer any noticeable performance degradation in comparison with double-precision floats. This system inherits the limitations of the underlying core, and since the critical path is not in the PAU or the FPU but in the load/store unit, further optimizations are out of the scope of this work.

Regarding accuracy, we have obtained two metrics: Mean Squared Error (MSE) and Maximum Absolute Error (MaxAbsE). To obtain both of these metrics, we compared the results of all the arithmetics under study with the same algorithm computed using the GNU MPFR multiple-precision library with 128 fraction bits, which we use as our golden solution. We chose the MSE as a general accuracy metric, and the MaxAbsE to also take into account the maximum error, which can be a critical value in certain applications [44], [45].

The MSE results are shown in Figure 5. Note the logarithmic scale on the Y-axis. The trend in every benchmark is a significantly lower error when using posit64 or posit32 numbers in comparison to doubles or floats, respectively. This is up to 4 orders of magnitude lower MSE, depending on the benchmark. In these plots we can also observe the difference in magnitude of the accuracy errors when using 32- or 64-bit numbers. The accuracy improvements are maintained across the whole range of problem sizes. When using posits without the quire accumulator, we also observe significant accuracy improvements.
This happens even though, in the posit case, there is an extra rounding between the multiplication and addition operations that is not present when using fused MAC with the IEEE representations.

The MaxAbsE metric follows the same pattern as the MSE. These results are shown in Figure 6, where the Y-axis also follows a logarithmic scale. Posits obtain up to 3 orders of magnitude lower error than floats of the same bit-width across all benchmarks and dataset sizes. There are also large accuracy improvements when executing posit64 without quire compared to doubles.

All in all, from this set of benchmarks we can conclude that 64-bit posits present significant accuracy improvements over IEEE 754 doubles, while maintaining the same memory bandwidth. This is also the case with 32-bit numbers. This better accuracy holds both in a general sense, as shown by the MSE, and in each particular value, since the MaxAbsE is also lower in every case. The use of the quire is beneficial both in terms of performance and accuracy, so it can compensate for its significant hardware cost shown in Section 5.

Fig. 5. PolyBench benchmarks: mean squared error of the different arithmetics studied with respect to the results obtained with GNU MPFR (per-benchmark panels: gemm, 3mm, cholesky, durbin, ludcmp, covariance, fdtd-2d, seidel-2d; series: Posit64, Posit64 No Quire, Double, Posit32, Posit32 No Quire, Float).

Fig. 6. PolyBench benchmarks: maximum absolute error of the different arithmetics studied with respect to the results obtained with GNU MPFR (same panels and series as Figure 5).

TABLE 4
PolyBench timing comparisons of floats and doubles with fused MAC, posit32 and posit64 with quire, and posit32 and posit64 without quire (posit32– and posit64–), measured in seconds of FPGA runtime.
                       MINI      SMALL    MEDIUM  LARGE
GEMM        float      0.004991  0.1004   4.579   591.8
            posit32    0.005175  0.1151   5.225   1251.3
            posit32–   0.007326  0.1638   6.382   846.5
            double     0.006025  0.1846   6.988   1004.9
            posit64    0.005845  0.1747   7.500   1973.3
            posit64–   0.007628  0.2175   7.981   1109.5
3mm         float      0.005591  0.1177   9.533   2208.0
            posit32    0.003778  0.1250   9.678   2265.2
            posit32–   0.009895  0.2230   14.01   2757.0
            double     0.005321  0.2307   16.09   3931.2
            posit64    0.004626  0.1903   14.14   3674.6
            posit64–   0.010466  0.3044   18.46   4208.8
Cholesky    float      0.002504  0.0617   3.670   484.0
            posit32    0.003005  0.0707   3.811   498.5
            posit32–   0.004960  0.1179   5.699   734.8
            double     0.003952  0.1190   6.323   870.0
            posit64    0.003453  0.0987   5.529   767.3
            posit64–   0.005044  0.1503   7.680   1038.1
Durbin      float      0.000526  0.00388  0.0394  1.011
            posit32    0.001091  0.00660  0.0676  1.681
            posit32–   0.000814  0.00724  0.0708  1.766
            double     0.000772  0.00703  0.0698  2.127
            posit64    0.000898  0.00732  0.0814  2.445
            posit64–   0.000928  0.00816  0.0851  2.539
ludcmp      float      0.004377  0.1079   9.162   2896.6
            posit32    0.006173  0.1459   10.37   2986.3
            posit32–   0.007162  0.1704   11.58   3147.2
            double     0.005918  0.2447   19.48   3309.6
            posit64    0.006388  0.2536   19.96   3311.2
            posit64–   0.007347  0.2859   20.47   3468.3
Covariance  float      0.004203  0.0939   4.381   1779.2
            posit32    0.004447  0.0903   4.149   1727.9
            posit32–   0.005377  0.1062   4.552   1793.1
            double     0.004478  0.1811   7.843   1949.5
            posit64    0.004817  0.1670   7.311   1883.0
            posit64–   0.005678  0.1875   7.818   1948.8
fdtd-2d     float      0.014354  0.3824   11.14   1457.3
            posit32    0.020327  0.5168   14.49   1893.7
            double     0.016546  0.6750   17.37   2469.6
            posit64    0.020668  0.7709   19.83   2711.4
seidel-2d   float      0.030509  0.5979   17.35   2346.2
            posit32    0.036817  0.7519   22.74   3164.3
            double     0.041236  0.8985   26.98   4071.4
            posit64    0.037041  0.8412   26.55   4097.4

7 CONJUGATE GRADIENT

In addition to the PolyBench algorithms shown in Section 6, we have studied the use of posit64 in iterative linear equation solvers, in particular the conjugate gradient (CG) and the biconjugate gradient (BiCG) methods. These serve as larger real-world applications in which posit64 could be used.

The conjugate gradient algorithm numerically solves a system of linear equations Ax = b for the vector x, where the real matrix A is symmetric and positive-definite. We have executed this algorithm on Big-PERCIVAL with a tolerance margin of 10^−12 on four matrices extracted from the Matrix Market repository (https://math.nist.gov/MatrixMarket/); concretely, on a subset of the BCS Structural Engineering Matrices from the Harwell-Boeing Collection. This provides a real use case in which we can analyze the use of posit64 and compare it with IEEE doubles.

Figure 7 shows the results of these executions. The bcsstk01, bcsstk04, bcsstk07, and bcsstk08 matrices have sizes of 48 × 48, 132 × 132, 420 × 420, and 1074 × 1074, respectively. In the smallest case, both posit64 and double converge in the same number of iterations (136), but in the rest of the problems, posit64 converges in fewer iterations. This amounts to a reduction of 2-10% in the number of iterations of the algorithm, depending on the input matrix.

The biconjugate gradient method is a generalization of CG that solves systems of linear equations in which the matrix A is not restricted to being symmetric and positive-definite. On the other hand, its computational cost is around double that of CG.
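For reference, the following is a minimal C sketch of the CG iteration described above, written in double precision for this article (the helper names dot, matvec, and cg_solve are ours). In the posit64 variant, each dot product and matrix-vector product would accumulate in the quire through the Xposit fused MAC instructions before a single final rounding.

```c
/* Dense CG sketch: solves A x = b for symmetric positive-definite A,
 * stopping when the 2-norm of the residual falls below tol.
 * Returns the number of iterations performed. */
#include <math.h>
#include <stdlib.h>

static double dot(const double *x, const double *y, int n) {
    double s = 0.0;       /* posit64 would accumulate this sum in the quire */
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

static void matvec(const double *A, const double *x, double *y, int n) {
    for (int i = 0; i < n; i++) y[i] = dot(&A[i * n], x, n);  /* y = A x */
}

int cg_solve(const double *A, const double *b, double *x, int n, double tol) {
    double *r = malloc(n * sizeof *r);   /* residual r = b - A x */
    double *p = malloc(n * sizeof *p);   /* search direction     */
    double *Ap = malloc(n * sizeof *Ap); /* A * p                */
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = p[i] = b[i]; }
    double rs = dot(r, r, n);
    int it = 0;
    while (sqrt(rs) > tol) {
        matvec(A, p, Ap, n);
        double alpha = rs / dot(p, Ap, n);           /* step length */
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rs_new = dot(r, r, n);
        double beta = rs_new / rs;                   /* direction update */
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rs = rs_new;
        it++;
    }
    free(r); free(p); free(Ap);
    return it;
}
```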
Following the same setup, we tested BiCG on a set of four increasingly larger unsymmetric matrices from the Harwell-Boeing Collection. These were impcol_b, impcol_c, west0381, and gre_1107, which have sizes of 59 × 59, 137 × 137, 381 × 381, and 1107 × 1107, respectively. The tolerance margin was again 10^−12. The results can be seen in Figure 8. In every case, using posit64 results in fewer iterations before the target tolerance is met. This reduction reaches almost 20% in the west0381 matrix.

Fig. 7. Conjugate gradient iterative residual results on four increasingly larger Matrix Market problems.

Fig. 8. Biconjugate gradient iterative residual results on four increasingly larger Matrix Market problems.

8 GEMM

The GEMM kernel in PolyBench is optimized from the memory perspective by performing loop interchange (see loops k and j in Figure 9). This version is the one whose results are shown in Section 6. When executing the GEMM kernel in hardware for sufficiently large matrices, a considerable time penalty can be observed in the posit64 case (see Table 4). This is due to the fact that exploiting the quire accumulator register limits the order in which the matrix multiplication is computed (see Figure 10), which results in a higher number of cache misses because of the long dot-product computations. This is not the case for the execution without the quire, where the performance results are comparable.

The GEMM operation is typically optimized to reduce the number of memory accesses. In this section, we describe the impact on both timing performance and accuracy of executing the GEMM kernel using posit64 and quire with block tiling optimizations. The matrices are kept the same as in the PolyBench GEMM benchmark, but we modified the algorithm to use a standard 6-loop tiling approach (see Figure 11). The tile size was varied at compile time to better study the impact of this method, allowing the compiler to perform optimizations. We tested all tile sizes between 5 and 25, and also larger tiles from 30 to 40 in steps of 2 to check the observed trends.

Figure 12 shows the timing results of executing the tiled version of GEMM when varying the tile size. These results are for the LARGE dataset, which sets the values of ni, nj, and nk to 1000, 1100, and 1200, respectively.
Input: Double matrices A (ni×nk), B (nk×nj) and C (ni×nj). Scalar values a and b.
Output: Double matrix C = aAB + bC.

for i = 0 to ni-1 do
    for j = 0 to nj-1 do
        C[i][j] *= b
    end for
    for k = 0 to nk-1 do
        for j = 0 to nj-1 do
            C[i][j] += a * A[i][k] * B[k][j]
        end for
    end for
end for

Fig. 9. PolyBench double GEMM pseudocode with loop interchange.

Input: Posit64 matrices A (ni×nk), B (nk×nj) and C (ni×nj). Scalar values a and b.
Output: Posit64 matrix C = aAB + bC.

for i = 0 to ni-1 do
    for j = 0 to nj-1 do
        C[i][j] *= b
    end for
    for j = 0 to nj-1 do
        quire = C[i][j]
        for k = 0 to nk-1 do
            quire += a * A[i][k] * B[k][j]
        end for
        C[i][j] = round(quire)
    end for
end for

Fig. 10. Posit GEMM pseudocode using the quire accumulator.

Input: Posit64 matrices A (ni×nk), B (nk×nj) and C (ni×nj). Scalar values a and b. Tile size nt.
Output: Posit64 matrix C = aAB + bC.

for ii = 0 to ni-1 in steps of nt do
    for jj = 0 to nj-1 in steps of nt do
        for i = ii to min(ii+nt, ni)-1 do
            for j = jj to min(jj+nt, nj)-1 do
                C[i][j] *= b
            end for
        end for
        for kk = 0 to nk-1 in steps of nt do
            for i = ii to min(ii+nt, ni)-1 do
                for j = jj to min(jj+nt, nj)-1 do
                    quire = C[i][j]
                    for k = kk to min(kk+nt, nk)-1 do
                        quire += a * A[i][k] * B[k][j]
                    end for
                    C[i][j] = round(quire)
                end for
            end for
        end for
    end for
end for

Fig. 11. Posit GEMM tiled pseudocode using the quire accumulator.

Fig. 12. Tiled GEMM timing results (execution time in seconds vs. tile size, double vs. posit64, LARGE dataset).

For very small tile sizes between 5 and 10, there is a relatively large variation in the performance, but this stabilizes for larger tile sizes. In the case of posit64 numbers, there is a big jump in execution time between a tile size of 8 and a tile size of 9. This is due to compiler optimizations: with a tile size of 8 the compiler can unroll the main computational loop, which does not happen with a tile size of 9. In the case of doubles, this happens between sizes 9 and 10: with a tile size of 10 the compiler decides to unroll the main computational loop, which is not the case for a tile size of 9.

The extra execution time required by the posit64 kernel is due to the extra instructions needed to initialize the quire and round it back to a posit value after each series of accumulations inside a tile. For large dot-product computations, these extra instructions are amortized over the long accumulations, but for smaller batches this overhead is noticeable. All in all, the performance comparison of posit64 and doubles in the tiled GEMM benchmark is closer, and it should scale better for even larger matrix sizes, as the memory pressure is reduced.

Even though the posit64 execution of this kernel is slower, there are significant benefits regarding the accuracy of the computations. Figure 13 shows the MSE and MaxAbsE results of the same execution of the tiled GEMM benchmark. As can be seen from the logarithmic scale on the Y-axis, posit64 obtains between 4 and 5 orders of magnitude lower MSE and around 2 orders of magnitude lower MaxAbsE than doubles. The accuracy improves with larger tile sizes, and lies between the posit64 with-quire and without-quire values shown in Figures 5 and 6. This is to be expected, as the execution in tiles adds extra rounding steps in the computation of each value of the output matrix. With larger tiles, the number of intermediate roundings is lower and thus the final value is more accurate.
However, note that even for small tile sizes, the accuracy improvements obtained by posit arithmetic are about 4 orders of magnitude.

Fig. 13. Tiled GEMM error comparison of posit64 and double with respect to the results obtained with GNU MPFR: (a) MSE, (b) MaxAbsE.

9 CONCLUSIONS

In this work, we presented Big-PERCIVAL, an extension of the PERCIVAL posit RISC-V core that adds support for 64-bit posits and provides increased flexibility. We studied the hardware cost, accuracy, and performance of 64-bit posit arithmetic compared to double-precision IEEE 754 floating-point arithmetic using the PolyBench benchmark suite.

Synthesis results of the 64-bit PAU in Big-PERCIVAL have shown that it requires 2.5× as many resources as the double-precision FPNew FPU. Moreover, we studied the impact of the corresponding 1024-bit quire accumulator register, which increased the total hardware cost to a third of the area of the core. Detailed area results illustrated how the hardware resources are distributed among the different operations. In particular, the most resource-hungry elements are the quire-related units and the posit division and square root units.

The PolyBench numerical benchmarks executed on Big-PERCIVAL running on the Genesys II board provided insight into the native use of 64-bit posits. Furthermore, the conjugate gradient and biconjugate gradient linear solvers demonstrated the use of posit64 in real-world problems. Additionally, the use of the quire accumulator requires some extra thought about the order in which the operations will be executed in some instances. Regarding accuracy, which is one of the main requirements in scientific computing, we have seen that 64-bit posits obtain up to 4 orders of magnitude lower MSE and up to 3 orders of magnitude lower MaxAbsE than 64-bit doubles. This provides a high-accuracy solution that can reduce the number of steps in iterative solvers without additional impact on the memory bandwidth.

Overall, our contributions show the potential of posit arithmetic as an alternative to IEEE 754 floating-point arithmetic in scientific computing, and Big-PERCIVAL provides a flexible platform for exploring this alternative. We believe that this work provides a starting point for future research on 64-bit hardware and software implementations of posit arithmetic and contributes to the development of more accurate and efficient scientific computing systems.

ACKNOWLEDGMENTS

This work was supported by grants PID2021-123041OB-I00 and PID2021-126576NB-I00 funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", and by the CM under grant S2018/TCS-4423.

REFERENCES

[1] IEEE Computer Society, "IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1–84, Jul. 2019.
[2] "BFloat16: The secret to high performance on Cloud TPUs," https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.
[3] P. Kharya, "NVIDIA Blogs: TensorFloat-32 Accelerates AI Training HPC up to 20x," https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/, May 2020.
[4] D. Mallasén, R. Murillo, A. A. Del Barrio, G. Botella, L. Piñuel, and M. Prieto-Matias, "PERCIVAL: Open-Source Posit RISC-V Core With Quire Capability," IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 3, pp. 1241–1252, 2022.
[5] L.-N. Pouchet and T. Yuki, "PolyBench/C 4.2," https://sourceforge.net/projects/polybench/, May 2016.
[6] Y. Durand, E. Guthmuller, C. Fuguet, J. Fereyre, A. Bocco, and R. Alidori, "Accelerating Variants of the Conjugate Gradient with the Variable Precision Processor," in 2022 IEEE 29th Symposium on Computer Arithmetic (ARITH), Sep. 2022, pp. 51–57.
[7] S. Mach, F. Schuiki, F. Zaruba, and L. Benini, "FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 4, pp. 774–787, Apr. 2021.
[8] Posit Working Group, "Standard for Posit Arithmetic (2022)," Feb. 2022. [Online]. Available: https://posithub.org/docs/posit_standard-2.pdf
[9] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, "Comparing Different Decodings for Posit Arithmetic," in Next Generation Arithmetic, J. Gustafson and V. Dimitrov, Eds. Cham: Springer International Publishing, 2022, vol. 13253, pp. 84–99.
[10] Y. Uguen, L. Forget, and F. de Dinechin, "Evaluating the Hardware Cost of the Posit Number System," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). Barcelona, Spain: IEEE, Sep. 2019, pp. 106–113.
[11] R. Chaurasiya, J. Gustafson, R. Shrestha, J. Neudorfer, S. Nambiar, K. Niyogi, F. Merchant, and R. Leupers, "Parameterized Posit Arithmetic Hardware Generator," in 2018 IEEE 36th International Conference on Computer Design (ICCD), Oct. 2018, pp. 334–341.
[12] M. Klöwer, P. D. Düben, and T. N. Palmer, "Posits as an alternative to floats for weather and climate models," in Proceedings of the Conference for Next Generation Arithmetic 2019. Singapore: ACM, Mar. 2019, pp. 1–8. [Online]. Available: https://dl.acm.org/doi/10.1145/3316279.3316281
[13] N. Neves, P. Tomás, and N. Roma, "Dynamic Fused Multiply-Accumulate Posit Unit with Variable Exponent Size for Low-Precision DSP Applications," in 2020 IEEE Workshop on Signal Processing Systems (SiPS), Oct. 2020, pp. 1–6.
[14] A. Guntoro, C. De La Parra, F. Merchant, F. De Dinechin, J. L. Gustafson, M. Langhammer, R. Leupers, and S. Nambiar, "Next Generation Arithmetic for Edge Computing," in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). Grenoble, France: IEEE, Mar. 2020, pp. 1357–1365.
[15] K. Asanović, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, "The rocket chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, Apr. 2016.
[16] F. Zaruba and L. Benini, "The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 11, pp. 2629–2640, Nov. 2019.
[17] C. Celio, D. A. Patterson, and K. Asanović, "The Berkeley Out-of-Order Machine (BOOM): An industry-competitive, synthesizable, parameterized RISC-V processor," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-167, Jun. 2015.
[18] N. Gala, A. Menon, R. Bodduna, G. S. Madhusudan, and V. Kamakoti, "SHAKTI Processors: An Open-Source Hardware Initiative," in 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID), Jan. 2016, pp. 7–8.
[19] M. V. Arunkumar, S. G. Bhairathi, and H. G. Hayatnagarkar, "PERC: Posit Enhanced Rocket Chip," in 4th Workshop on Computer Architecture Research with RISC-V (CARRV'20), 2020, p. 8.
[20] S. Tiwari, N. Gala, C. Rebeiro, and V. Kamakoti, "PERI: A Configurable Posit Enabled RISC-V Core," ACM Transactions on Architecture and Code Optimization, vol. 18, no. 3, pp. 1–26, Jun. 2021.
[21] N. N. Sharma, R. Jain, M. M. Pokkuluri, S. B. Patkar, R. Leupers, R. S. Nikhil, and F. Merchant, "CLARINET: A quire-enabled RISC-V-based framework for posit arithmetic empiricism," Journal of Systems Architecture, p. 102801, Dec. 2022.
[22] M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara, "A Lightweight Posit Processing Unit for RISC-V Processors in Deep Neural Network Applications," IEEE Transactions on Emerging Topics in Computing, no. 01, pp. 1–1, Oct. 2021.
[23] Q. Li, C. Fang, and Z. Wang, "PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications," Feb. 2023.
[24] S. W. D. Chien, I. B. Peng, and S. Markidis, "Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications," in Parallel Processing and Applied Mathematics, R. Wyrzykowski, E. Deelman, J. Dongarra, and K. Karczewski, Eds. Cham: Springer International Publishing, 2020, vol. 12043, pp. 301–310.
[25] N. Buoncristiani, S. Shah, D. Donofrio, and J. Shalf, "Evaluating the Numerical Stability of Posit Arithmetic," in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 612–621.
[26] D. Mallasén Quintana, "Leveraging Posits for the Conjugate Gradient Linear Solver on an Application-Level RISC-V Core," KTH Royal Institute of Technology, Tech. Rep., 2022.
[27] S. D. Ciocirlan, D. Loghin, L. Ramapantulu, N. Tapus, and Y. M. Teo, "The Accuracy and Efficiency of Posit Arithmetic," arXiv:2109.08225 [cs], Sep. 2021.
[28] F. de Dinechin, L. Forget, J.-M. Muller, and Y. Uguen, "Posits: The good, the bad and the ugly," in Proceedings of the Conference for Next Generation Arithmetic 2019, ser. CoNGA'19. New York, NY, USA: Association for Computing Machinery, 2019.
[29] J. L. Gustafson and I. T. Yonemoto, "Beating floating point at its own game: Posit arithmetic," Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, Apr. 2017.
[30] L. Forget, Y. Uguen, and F. de Dinechin, "Comparing posit and IEEE-754 hardware cost," Apr. 2021.
[31] S. Jean, A. Raveendran, A. D. Selvakumar, G. Kaur, S. G. Dharani, S. G. Pattanshetty, and V. Desalphine, "P-FMA: A Novel Parameterized Posit Fused Multiply-Accumulate Arithmetic Processor," in 2021 34th International Conference on VLSI Design and 2021 20th International Conference on Embedded Systems (VLSID), Feb. 2021, pp. 282–287.
[32] L. Ledoux and M. Casas, "A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). New York City, NY, USA: IEEE, May 2022, pp. 1–10.
[33] N. Neves, P. Tomás, and N. Roma, "A Reconfigurable Posit Tensor Unit with Variable-Precision Arithmetic and Automatic Data Streaming," Journal of Signal Processing Systems, vol. 93, no. 12, pp. 1365–1385, Dec. 2021.
[34] W. Liu and A. Nannarelli, "Power efficient division and square root unit," IEEE Transactions on Computers, vol. 61, no. 8, pp. 1059–1070, 2012.
[35] A. A. Del Barrio, R. Hermida, and S. O. Memik, "A partial carry-save on-the-fly correction multispeculative multiplier," IEEE Transactions on Computers, vol. 65, no. 11, pp. 3251–3264, 2016.
[36] M. S. Kim, A. A. Del Barrio, L. T. Oliveira, R. Hermida, and N. Bagherzadeh, "Efficient Mitchell's Approximate Log Multipliers for Convolutional Neural Networks," IEEE Transactions on Computers, vol. 68, no. 5, pp. 660–675, 2019.
[37] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, "PLAUs: Posit logarithmic approximate units to implement low-cost operations with real numbers," in Proceedings of the Conference for Next Generation Arithmetic 2023, ser. CoNGA'23, 2023.
[38] R. Murillo, A. A. Del Barrio, and G. Botella, "A Suite of Division Algorithms for Posit Arithmetic," in 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP). Porto, Portugal: IEEE, Jul. 2023, pp. 41–44.
[39] K. Jun and E. E. Swartzlander, "Modified non-restoring division algorithm with improved delay profile and error correction," in 2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2012, pp. 1460–1464.
[40] E. T. L. Omtzigt, P. Gottschling, M. Seligman, and W. Zorn, "Universal Numbers Library: Design and implementation of a high-performance reproducible number systems library," arXiv:2012.11011, 2020.
[41] M. S. Ansari, B. F. Cockburn, and J. Han, "An improved logarithmic multiplier for energy-efficient neural computing," IEEE Transactions on Computers, vol. 70, no. 4, pp. 614–625, 2021.
[42] R. Murillo, A. A. Del Barrio Garcia, G. Botella, M. S. Kim, H. Kim, and N. Bagherzadeh, "PLAM: A Posit Logarithm-Approximate Multiplier," IEEE Transactions on Emerging Topics in Computing, pp. 1–1, 2021.
[43] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff, "A set of level 3 basic linear algebra subprograms," ACM Transactions on Mathematical Software, vol. 16, no. 1, pp. 1–17, Mar. 1990.
[44] X. Gao, S. Bayliss, and G. A. Constantinides, "SOAP: Structural optimization of arithmetic expressions for high-level synthesis," in 2013 International Conference on Field-Programmable Technology (FPT), 2013, pp. 112–119.
[45] J. Villalba-Moreno, J. Hormigo, and S. González-Navarro, "Unbiased rounding for hub floating-point addition," IEEE Transactions on Computers, vol. 67, no. 9, pp. 1359–1365, 2018.

David Mallasén Quintana received a BSc degree in Computer Science and a BSc degree in Mathematics in 2020 from the Complutense University of Madrid (UCM). In 2022 he obtained an MSc degree in Embedded Systems at KTH Royal Institute of Technology, specializing in embedded platforms. Currently, he is pursuing a Ph.D. in Computer Engineering at UCM. He has carried out a Ph.D. research stay at the Embedded Systems Laboratory at EPFL (Switzerland). His main research areas include computer arithmetic, computer architecture, embedded systems, and high-performance computing.

Alberto A. Del Barrio (SM'19) received the Ph.D. degree in Computer Science from the Complutense University of Madrid (UCM), Madrid, Spain, in 2011. He has performed stays at Northwestern University, University of California at Irvine, and University of California at Los Angeles.
Since 2021, he has been an Associate Professor (tenure-track, civil servant) of Computer Science with the Department of Computer Architecture and System Engineering, UCM. His main research interests include Design Automation, Next Generation Arithmetic, and Quantum Computing. Dr. Del Barrio has been the PI of the PARNASO project, funded by the Leonardo Grants program of Fundación BBVA, and currently he is the PI of the ASIMOV project, funded by the Spanish MICINN, which includes a work package researching the deployment of posits on RISC-V cores. He has been an IEEE Senior Member since 2019 and an ACM Senior Member since December 2020.

Manuel Prieto-Matias obtained a Ph.D. degree from the Complutense University of Madrid (UCM) in 2000. Since 2002, he has been a Professor at the Department of Computer Architecture at UCM, being a Full Professor since 2019. His research interests include high-performance computing, non-volatile memory technologies, accelerators, and code generation and optimization. His current focus is on effectively managing resources on emerging computing platforms, emphasizing the interaction between the system software and the underlying architecture. Manuel has co-authored over 100 scientific publications in journals and conferences on parallel computing and computer architecture. He is a member of the ACM.