This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2022.3187199, IEEE Transactions on Emerging Topics in Computing. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

PERCIVAL: Open-Source Posit RISC-V Core with Quire Capability

David Mallasén, Raul Murillo, Alberto A. Del Barrio, Senior Member, IEEE, Guillermo Botella, Senior Member, IEEE, Luis Piñuel and Manuel Prieto-Matias

Abstract—The posit representation for real numbers is an alternative to the ubiquitous IEEE 754 floating-point standard. In this work, we present PERCIVAL, an application-level posit RISC-V core based on CVA6 that can execute all posit instructions, including the quire fused operations. This solves the obstacle encountered by previous works, which included only partial posit support or had to emulate posits in software. In addition, Xposit, a RISC-V extension for posit instructions, is incorporated into LLVM. Therefore, PERCIVAL is the first work that integrates the complete posit instruction set in hardware. These elements allow for the native execution of posit instructions as well as of the standard floating-point ones, further permitting the comparison of these representations. FPGA and ASIC synthesis show the hardware cost of implementing 32-bit posits and highlight the significant overhead of including a quire accumulator. However, results show that the quire enables a more accurate execution of dot products. In general matrix multiplications, the accuracy error is reduced by up to 4 orders of magnitude.
Furthermore, performance comparisons show that these accuracy improvements do not hinder execution, as posits run as fast as single-precision floats and exhibit better timing than double-precision floats, thus potentially providing an alternative representation.

Index Terms—Arithmetic, Posit, IEEE-754, Floating point, RISC-V, CPU, CVA6, LLVM, Matrix multiplication.

1 INTRODUCTION

Representing real numbers and executing arithmetic operations on them in a microprocessor presents unique challenges. Compared with the simpler set of integers, working with reals introduces notions such as precision. For decades, the representation of real numbers in virtually all computers has followed the IEEE 754 standard for floating-point arithmetic [1]. However, this standard has some flaws, such as rounding and reproducibility issues, signed zero, or an excess of Not a Number (NaN) representations.

To face these challenges, alternative real number representations have been proposed in the literature. Posits [2] are a promising substitute proposed in 2017 that provide compelling benefits. They deliver a good trade-off between dynamic range and accuracy, encounter fewer exceptions when operating, and have tapered precision. This means that numbers near ±1 have more precision, while very large and very small numbers have less. The posit standard includes fused operations, which can be used to compute a series of multiplications and accumulations without intermediate rounding. Furthermore, posits are consistent between implementations, as they use a single rounding scheme and include only two special cases: a single 0 and ±∞. Therefore, they potentially simplify the hardware implementation [3]. Nonetheless, posits are still under development, and it is still not clear whether they could completely replace IEEE floats [4]. Including Posit Arithmetic Units (PAUs) into cores in hardware is a crucial step to study the efficiency of this representation further.
When designing such a core and its arithmetic operations, an important decision is which Instruction Set Architecture (ISA) to implement. RISC-V [5] is a promising open-source ISA that is gaining significant traction both in academia and in industry. Thanks to its openness and flexibility, multiple RISC-V cores have been developed targeting diverse purposes in recent years. In the case of studying the performance of posits, a core that can run application-level software is needed.

Some works have studied the use of posits by emulating their execution in software [6], [7], [8]. However, this approach has the significant drawback of requiring excessive execution times, thus limiting the scalability of the applications. To overcome these limitations, we propose to include native posit and quire support in hardware by leveraging a high-performance RISC-V core. A comparison of four of the leading open-source application-class RISC-V cores is presented in [9], CVA6 among them. In this work, we have extended the datapath of the CVA6 [10] RISC-V core with a 32-bit PAU with quire and a posit register file. Together with the Xposit compiler extension, this core allows the native hardware execution of high-level applications that leverage the posit number system. Therefore, the main contributions of this paper are the following:

• We present PERCIVAL, an oPEn-souRCe1 posIt risc-V core with quire cApabiLity based on the CVA6 that can execute all 32-bit posit instructions, including the quire fused operations.
• Compiler support for the Xposit RISC-V extension in LLVM. This allows posit instructions to be easily embedded into a C program that can be run natively on

All authors are with the Department of Computer Architecture and Automation, Complutense University of Madrid, 28040 Madrid, Spain. E-mails: {dmallase, ramuri01, abarriog, gbotella, lpinuel, mpmatias}@ucm.es. Manuscript received -; revised -.
1. https://github.com/artecs-group/PERCIVAL
arXiv:2111.15286v3 [cs.AR] 7 Jul 2022

PERCIVAL or any other core that implements these opcodes.
• To the best of our knowledge, the PERCIVAL core together with the Xposit extension is the first work that integrates in hardware standard posit addition, subtraction, and multiplication together with quire fused operations. It also includes posit logarithmic-approximate hardware for division and square root operations. Furthermore, all comparison operations and conversions to and from integer numbers are also included in PERCIVAL.
• Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) synthesis results showcasing the resource usage of posit arithmetic and quire capabilities on a RISC-V CPU. These results are compared with the native IEEE 754 Floating-Point Unit (FPU) available in the CVA6 and with previous works.
• Accuracy and timing performance of posit numbers and IEEE 754 floats are compared on PERCIVAL using General Matrix Multiplication (GEMM) and max-pooling benchmarks. Results show that 32-bit posits can be up to 4 orders of magnitude more accurate than 32-bit floats thanks to the quire register. Furthermore, this improvement does not imply a trade-off in execution time, as they can perform as fast as 32-bit floats, and thus execute faster than 64-bit floats.

The rest of the paper is organized as follows: Section 2 introduces the necessary background about the posit format, the RISC-V ISA and the CVA6 RISC-V core. Related works from the literature are surveyed in Section 3, both as standalone PAUs and at the core level. In Section 4 the PERCIVAL posit core is described and in Section 5 the necessary compiler support for the Xposit RISC-V extension is introduced. The FPGA and ASIC synthesis results of the core are presented, as well as compared with other implementations, in Section 6. Subsequently, in Section 7 posits and IEEE 754 floats are compared regarding accuracy and timing performance.
Finally, Section 8 concludes this work.

2 BACKGROUND

2.1 Posit Format

Posit numbers [2] were introduced in 2017 as an alternative to the predominant IEEE 754 floating-point standard to represent and operate with real numbers. Posits provide reproducible results across platforms and few special cases. Furthermore, they do not support overflow or underflow, which reduces the complexity of exception handling.

A posit number configuration is defined using two parameters as Posit〈n, es〉, where n is the total bit-width and es is the maximum bit-width of the exponent. Although in the literature [4], [6], [11] the most widespread posit formats have been Posit〈8, 0〉, Posit〈16, 1〉 and Posit〈32, 2〉, in the latest Posit Standard 4.12 Draft [12] the value of es is fixed to 2. This has the advantage of simplifying the hardware design and facilitating the conversion between different posit sizes.

Fig. 1. Posit format with sign, regime, exponent and fraction fields.

Posits only distinguish two special cases: zero and Not-a-Real (NaR), which are represented as 0···0 and 10···0 respectively. The rest of the representations are composed of four fields, as shown in Figure 1:
• The sign bit S;
• The variable-length regime field R, consisting of k bits equal to R0 followed by the complement of R0 or by the end of the posit. This field encodes a scaling factor r given by Equation (1);
• The exponent E, consisting of at most es bits, which encodes an unbiased integer value e. If any of its bits would be located after the least significant bit of the posit, that bit takes the value 0;
• The variable-length fraction field F, formed by the remaining m bits. Its value 0 ≤ f < 1 is given by dividing the unsigned integer F by 2^m.

    r = −k        if R0 = 0
    r = k − 1     if R0 = 1          (1)

The real value p of a generic posit is given by Equation (2). The main differences with the IEEE 754 floating-point format are the existence of the regime field, the use of an unbiased exponent, and the value of the fraction hidden bit.
Usually, in floating-point arithmetic, the hidden bit is considered to be 1. However, in the latest representation of posits, it is considered to be 1 if the number is positive, or −2 if the number is negative. This simplifies the decoding stage of the posit representation [3], [13].

    p = ((1 − 3s) + f) × 2^((1−2s)×(4r+e+s))          (2)

In posit arithmetic, NaR has a unique representation that maps to the most negative 2's complement signed integer. Consequently, if used in comparison operations, it results in less than all other posits and equal to itself. Moreover, the rest of the posit values follow the same ordering as their corresponding bit representations. These characteristics allow posit numbers to be compared as if they were 2's complement signed integers, eliminating additional hardware for posit comparison operations.

The variable-length regime field acts as a long-range dynamic exponent, as can be seen in Equation (2), where it is multiplied by 4 or, equivalently, shifted left by the two exponent bits. Since it is a dynamic field, it can occupy more bits to represent larger numbers or leave more bits to the fraction field when looking for accuracy in the neighborhoods of ±1. However, detecting these variable-sized fields adds some hardware overhead.

As an example, let 11101010 be the binary encoding of a Posit8, i.e. a Posit〈8, 2〉 according to the latest Posit Standard 4.12 Draft [12]. The first bit s = 1 indicates a negative number. The regime field 110 gives k = 2 and therefore r = 1. The next two bits 10 represent the exponent e = 2. Finally, the remaining m = 2 bits, 10, encode a fraction value of f = 2/2^2 = 0.5. Hence, from (2) we conclude that 11101010 ≡ (−2 + 0.5) × 2^(−(4+2+1)) = −0.01171875.

In addition to the standard representation, posits include fused operations using the quire, a 16n-bit fixed-point 2's complement register, where n is the posit bit-width.
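The worked Posit8 example above can be reproduced with a short script. The following is an illustrative software model of Equations (1) and (2) with es fixed to 2, written for this article; it is not the hardware decoder used in PERCIVAL:

```python
def decode_posit(bits, n=8, es=2):
    """Software model of Equation (2) for a Posit<n,es> bit pattern."""
    if bits == 0:
        return 0.0                      # the single zero
    if bits == 1 << (n - 1):
        return float("nan")             # NaR (10...0)
    s = bits >> (n - 1)                 # sign bit
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]  # bits after the sign
    r0 = body[0]
    k = 0                               # regime run length
    while k < len(body) and body[k] == r0:
        k += 1
    r = k - 1 if r0 == 1 else -k        # Equation (1)
    rest = body[k + 1:]                 # skip the regime-terminating bit
    e = 0
    for i in range(es):                 # exponent bits cut off by the posit end are 0
        e = (e << 1) | (rest[i] if i < len(rest) else 0)
    frac = rest[es:]
    m = len(frac)
    f = int("".join(map(str, frac)), 2) / (1 << m) if m else 0.0
    # Equation (2): p = ((1 - 3s) + f) * 2^((1 - 2s) * (2^es * r + e + s))
    return ((1 - 3 * s) + f) * 2.0 ** ((1 - 2 * s) * ((1 << es) * r + e + s))

print(decode_posit(0b11101010))  # -0.01171875, matching the worked example
```

Note how the negative hidden bit of the 2's complement formulation appears directly as the (1 − 3s) term, so no separate sign-magnitude conversion step is needed.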
This allows executing up to 2^31 − 1 Multiply-Accumulate (MAC) operations without intermediate rounding or accuracy loss. The quire can represent either NaR, similarly to regular posits, or the value given by 2^(16−8n) times the 2's complement signed integer represented by the 16n concatenated bits.

2.2 RISC-V ISA

The open-source RISC-V ISA [5] emanates from the ideas of Reduced Instruction Set Computers (RISCs). It is structured as a base integer ISA plus a set of optional standard and non-standard extensions to customize and specialize the final set of instructions. There are two main base integer ISAs, RV32I and RV64I, which establish the user address space as 32-bit or 64-bit respectively.

The RISC-V general standard extensions include, among others, functionality for integer multiply/divide (M), atomic memory operations (A) and single- (F) and double-precision (D) floating-point arithmetic following the IEEE 754 standard. This set of general-purpose standard extensions IMAFD, together with the instruction-fetch fence (Zifencei) and the control and status register (Zicsr) extensions, forms the general-purpose G abbreviation. In general, following the RISC principles, all extensions have fixed-length 32-bit instructions. However, there is also a compressed instruction extension (C) that provides 16-bit instructions.

Expanding the RISC-V ISA with specialized extensions is supported by the standard to allow for customized accelerators. Non-standard extensions can be added to the encoding space leveraging the four major opcodes reserved for custom extensions. A proposal of the changes that should be made to the F standard extension in order to have a 32-bit posit RISC-V extension is described in [14].

2.3 CVA6

The CVA6 [10] (formerly known as Ariane) is a 6-stage, in-order, single-issue CPU which implements the RV64GC RISC-V standard. The core implements three privilege levels and can run a Linux operating system.
The primary goal of its micro-architecture is to reduce the critical path length. It was initially developed as part of the PULP ecosystem, but it is currently maintained by the OpenHW Group, which is developing a complete, industrial-grade pre-silicon verification. CVA6 is written in SystemVerilog and is licensed under an open-source Solderpad Hardware License.

As execution units, its datapath includes an integer ALU, a multiply/divide unit and an IEEE 754 FPU [15]. This FPU claims to be IEEE 754-2008 compliant, except for some issues in the division and square root operations. For the sake of comparison, it is important that the FPU is IEEE 754 compliant instead of being limited to normal floats only, since in theory, posit hardware is slightly more expensive than floating-point hardware that does not take subnormal numbers into account [3].

3 RELATED WORK

There has been a great deal of interest in hardware implementations of posit arithmetic since its first appearance. Standalone PAUs with different degrees of capabilities or basic posit functional units have been described in the literature [11], [16], [17], [18]. These units provide the building blocks to execute posit arithmetic. However, on their own they cannot execute whole posit algorithms.

Recently, some works adding partial posit support to RISC-V cores have been presented. CLARINET [19] incorporates the quire into a RV64GC 5-stage in-order core. However, not all posit capabilities are included in this work. Most operations are performed in IEEE floating-point format, and the values are converted to posit when using the quire. The only posit functionalities added to the core are fused MAC with quire, fused divide and accumulate with quire, and conversion instructions. PERC [20] integrates a PAU into the Rocket Chip generator, replacing the 32 and 64-bit FPU.
However, this work does not include quire support, as it is constrained by the F and D RISC-V extensions for IEEE 754 floating-point numbers. More recently, PERI [21] added a tightly coupled PAU to the SHAKTI C-class core, a 5-stage in-order RV32IMAFC core. This proposal does not include quire support either, as it reuses the F extension instructions. Nonetheless, it allows dynamic switching between es=2 and es=3 posits. In [22], the authors include a PAU named POSAR in a RISC-V Rocket Chip core. Again, this proposal does not include quire support and replaces the FPU present in Rocket Chip to reuse the floating-point instructions.

A different approach is taken in [23], where the authors use the posit representation as a way to store IEEE floats in memory with a lower bit-width while performing the computations using the IEEE FPU. For this purpose, they include a light posit processing unit in the CVA6 core that converts between 8 or 16-bit posits and 32-bit IEEE floats. They also develop an extension of the RISC-V ISA to include these conversion instructions.

4 PERCIVAL POSIT CORE

In this work, we have integrated full posit capabilities, including quire and fused operations, into an application-level RISC-V core. In addition to the design of the functional units that execute the posit and quire operations, the novelty of our design is that it is fully compatible both at the software and hardware level with the F and D RISC-V extensions. Therefore, both posit and IEEE floating-point numbers can be used simultaneously on the same core. To the best of our knowledge, this is the first work that integrates practically all of the posit and quire operations specified in the posit standard into a core.

4.1 PAU Design

The Posit Arithmetic Unit (PAU) is in charge of executing most posit operations and also contains the quire register, as shown in Figure 2. Posit comparisons are executed in the integer Arithmetic Logic Unit (ALU).
As mentioned above, this is one of the benefits of the posit representation.

Fig. 2. Internal structure of the proposed Posit Arithmetic Unit (PAU).

When designing the micro-architecture of the PAU, our objective was to achieve a latency and throughput similar to those of the FPU operations, to obtain fair comparisons. The throughput is limited, as there is no pipeline in the FPU nor in the PAU. Nevertheless, all of the operations are multi-cycle. The latency of the PAU units is the following:
• PADD, PSUB, QMADD, and QMSUB: 2 cycles.
• PMUL, PDIV, PSQRT, and QROUND: 1 cycle.
All other operations have no latency, i.e. they output their result at the next clock cycle after receiving the inputs. As a comparison, the 32-bit FADD, FSUB, FMADD, FMSUB, and FMUL instructions in the FPU have a latency of 2 clock cycles, but the analogous 64-bit instructions have a latency of 3 cycles. It is noteworthy that the comparisons in the FPU have a latency of 1, while the posit comparisons that reuse the integer hardware have no latency. Conversions to and from integer values also take an extra clock cycle in the FPU.

Depending on the operation, the input operands are directed to the corresponding posit unit and the result is forwarded as an output of the PAU. There are three main blocks: computational operations (COMP), conversion operations (CONV), and operations that make use of the quire register (FUSED) (Figure 2). Regarding COMP, the ADD unit is used both for addition and subtraction, calculating the two's complement of the second operand when subtracting. In this group, all the modules use both operands except the square root, which uses only operand A. In addition, the operands and the result correspond to the posit register file. It must be noted that the posit division and square root units are approximate, as this type of arithmetic simplifies the designs and thus reduces the hardware cost of the system.
They are logarithm-approximate units based on Mitchell's Approximate Log Multipliers and our previous work [11]. These units have been demonstrated to have a maximum relative error of 11.11%, and have less impact on area/performance than the exact hardware operators. On the other hand, exact division and square root algorithms could be implemented in software leveraging the MAC unit, thus eliminating the need for dedicated hardware. However, this is out of the scope of this work. In the CONV group, only operand A is used for conversions. Depending on the operation, the input data and the result belong to the posit or the integer register file.

The quire register is the most singular addition to this number format. According to the posit standard, it must be an architectural register accessible by the programmer that is also allowed to be dumped into memory. However, being so wide, the cost of including this functionality into the core's datapath could be too high for the benefits it would add. In the vast majority of cases, the quire is used as an accumulator to avoid overflows in the MAC operations, and this does not require quire load and store operations. Instead, we can initialize the quire to zero (QCLR.S), negate it if needed (QNEG.S), accumulate the partial products in it without rounding or storing in memory (QMADD.S and QMSUB.S), and, when the whole operation is finished, round and output the result (QROUND.S). The necessary support for all of these operations related to the quire is included in our proposal (see Table 2 below). The hardware cost of including the quire as an internal register in the PAU is studied in Section 6.

4.2 Core Integration

The proposed PAU has been integrated into the CVA6 RV64GC core while maintaining compatibility with all existing extensions, including single- and double-precision floating point. Moreover, since we work with Posit32 numbers, i.e.
Posit〈32, 2〉, the core adds a 32-bit posit register file in addition to the integer and floating-point registers. The instruction decoder has been extended to support posit instructions. The inner workings of the decoder are described in Figure 3. As part of the decoding process, each posit instruction selects from which register file it must obtain its operands and to which register file it must forward its result.

Require: Instruction to decode instr.
Ensure: Scoreboard entry sc_instr, which contains the operation op and the destination functional unit fu.
switch (instr.opcode)
  ...
  case POSIT:
    switch (instr.func3)
      case 000: {Computational posit instruction}
        switch (instr.func5)
          case 00000: {PAU instruction}
            sc_instr.fu = PAU
            sc_instr.op = PADD
            ...
          case 00100: {ALU instruction}
            sc_instr.fu = ALU
            sc_instr.op = PMIN
            ...
        end switch
      case 001: {Posit load instruction}
        sc_instr.fu = LOAD
        sc_instr.op = PLW
      case 011: {Posit store instruction}
        sc_instr.fu = STORE
        sc_instr.op = PSW
    end switch
  ...
  default: {Instruction not decoded in any switch/case}
    illegal_instr = true
end switch

Fig. 3. Pseudocode describing the decoding of posit instructions.

The CVA6 core uses scoreboarding for dynamically scheduled instructions and allows out-of-order write-back of each functional unit. The scoreboard tracks which instructions are issued, their functional unit and in which register they will write back. Our design has enlarged the scoreboard to include posit registers and instructions. In this manner, we can discern whether the input data of posit operations are retrieved from a register or forwarded directly as a result of a previous operation.

As mentioned in Section 2.1, posit numbers have the benefit of being able to reuse the comparison hardware of 2's complement signed integers. Therefore, the integer ALU has also been extended to accept posit operands and to be able to forward the result of these instructions with minimal hardware overhead.
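This reuse of the integer comparators can be illustrated in software: reinterpreting the raw posit bit patterns as 2's complement signed integers and comparing them as plain integers yields the same ordering as comparing the decoded real values. A minimal sketch (the 8-bit patterns and decoded values follow the worked example of Section 2.1; this is a model written for this article, not the core's RTL):

```python
def as_signed(bits, n=8):
    """Reinterpret an n-bit posit pattern as a 2's complement signed integer."""
    return bits - (1 << n) if bits & (1 << (n - 1)) else bits

def posit_lt(a, b, n=8):
    """PLT.S-style 'less than' computed with integer comparison only."""
    return as_signed(a, n) < as_signed(b, n)

# 0b11101010 decodes to -0.01171875 and 0b01000000 decodes to 1.0,
# so the integer comparison must agree that the first is smaller.
assert posit_lt(0b11101010, 0b01000000)
# NaR (10...0) maps to the most negative integer: less than everything else.
assert posit_lt(0b10000000, 0b11101010)
```

This is exactly why the decoder in Figure 3 can route PMIN, PMAX and the comparison instructions to the existing ALU instead of the PAU.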
Furthermore, the PAU has been integrated into the execution stage of the processor in parallel to the ALU and the FPU, connecting the issue module with the aforementioned scoreboard. Finally, the complete datapath has been adapted to include the posit signals and all necessary additional interconnections.

5 COMPILER SUPPORT: XPOSIT EXTENSION

The assembly output of a RISC-V compiler when processing programs that use floating-point arithmetic includes instructions from the corresponding F and D extensions. To produce a similar output targeting posit numbers, a new extension must be introduced that translates posit instructions and posit operators to binary code. Therefore, in this section, the Xposit RISC-V extension targeting posit arithmetic is presented. As part of this work, Xposit has been integrated into the LLVM 12 backend [24] to allow the compilation of high-level applications.

This modified version of LLVM can compile C code. However, posit instructions must be written at the assembly level, as there is currently no support for writing posit or quire operations directly in C. Therefore, previous codes can be reused in PERCIVAL, and only the computational kernels have to be manually written in assembly. An example of this is shown in Section 7.

The posit instruction set follows the structure of the F RISC-V standard extension for single-precision floating point [25]. The Xposit extension mostly follows the adaptation to the posit format proposed in [14]. The differences with this proposal are the following:
• We include 32 posit registers p0-p31, as in the F standard extension.
• Similarly to the integer operations in CVA6, there is no flag signaling division by zero.
• We do not include the possibility of loading and storing the quire in memory.

The Xposit extension uses the 0001011 opcode (custom-0), occupying the space indicated in Table 1 as POSIT.
If more operations were needed in the future, especially posit load and store instructions of other word lengths, the 0101011, 1011011, and 1111011 opcodes (custom-1, custom-2, and custom-3) could be leveraged. In this way, an approach similar to that of the F and D RISC-V extensions could be followed, which utilize the OP-FP, LOAD-FP and STORE-FP opcodes.

The format and fields of the Xposit instructions are described in Figure 4. Posit load and store use the same base+offset addressing as the corresponding floating-point instructions, with the base address in register rs1 and a signed 12-bit byte offset. Thus, the PLW instruction loads a posit value from memory into the rd posit register, and the PSW instruction stores a posit value from the rs2 posit register to memory. The rest of the Xposit operations keep the POSIT opcode and differ from the previous instructions by the funct3 field. Finally, it must be noted that the fmt field is fixed to 01, indicating that the instructions are for single-precision (32-bit) posits. The complete instruction set of the proposed Xposit RISC-V extension is detailed in Table 2.

An important addition of the Xposit extension is the quire instructions. Since the quire is a single internal register of the PAU, the instructions that operate with it do not have to specify a quire register number. For example, the quire clear instruction does not have any parameters. It is decoded and then executed internally by the PAU, which simply sets the quire register to 0. The quire fused operations only have to specify the posit registers of the two values that will be multiplied. Then, the accumulation is performed implicitly on the quire.

6 SYNTHESIS RESULTS

In this section, we present the FPGA and ASIC synthesis results of PERCIVAL. The details of its PAU and of the IEEE 754 FPU using 32 and 64-bit formats are also included. In this manner, the hardware cost of posit numbers and the quire is highlighted and compared with other implementations.
TABLE 1
RISC-V base opcode map + POSIT extension; inst[1:0]=11

inst[6:5] | 000    | 001      | 010      | 011      | 100    | 101      | 110            | 111 (>32b)
00        | LOAD   | LOAD-FP  | POSIT    | MISC-MEM | OP-IMM | AUIPC    | OP-IMM-32      | 48b
01        | STORE  | STORE-FP | custom-1 | AMO      | OP     | LUI      | OP-32          | 64b
10        | MADD   | MSUB     | NMSUB    | NMADD    | OP-FP  | reserved | custom-2/rv128 | 48b
11        | BRANCH | JALR     | reserved | JAL      | SYSTEM | reserved | custom-3/rv128 | ≥80b

Fig. 4. Internal structure and fields of Xposit instructions.

6.1 FPGA Synthesis

The FPGA synthesis was performed using Vivado v.2020.2 targeting a Genesys II (Xilinx Kintex-7 XC7K325T-2FFG900C) FPGA. Different configurations of FPU and PAU were tested, the results of which are shown in Table 3. Since the critical path does not traverse the arithmetic units of the core, in all of the cases the timing constraint of 20 ns was met and the timing slack was +0.177 ns.

The bare CVA6 without an FPU or PAU requires 28950 Lookup Tables (LUTs) and 19579 Flip-flops (FFs). Including support for 32-bit floating-point numbers increases the number of LUTs and FFs by 6452 and 2039 respectively. This difference grows to 12310 LUTs and 4366 FFs when also using the double-precision D extension. Note that these values are larger than the FPU area alone, since they also include other elements such as the floating-point register file, instruction decoding and interconnections. These other non-FPU elements require 2406 LUTs and 1066 FFs in the 32-bit case and 4147 LUTs and 2122 FFs in the 64-bit case.

Comparing the overall cost of including posit support with the cost of including IEEE floating-point support, a significant difference can be seen. Adding 32-bit posit operations and quire support to the CVA6 requires 15743 LUTs and 4057 FFs, which is comparable to the FD floating-point configuration. Out of this area, 3864 LUTs and 1072 FFs are occupied by the non-PAU elements mentioned in the previous floating-point analysis.
The synthesis results reveal that the PAU requires significantly more resources than the FPU available in the CVA6. In particular, the 32-bit PAU with quire occupies 2.94 times as many LUTs and 3.07 times as many FFs as the 32-bit FPU. To better understand these results, Table 4 presents the area requirements of the different modules inside the PAU. The most interesting value shown in this table is the area occupied by the posit MAC unit, which corresponds to almost half of the total area of the PAU.

When comparing with the floating-point units, which do not include an accumulation register, the area requirements of the quire can be separated out. Thus, the posit MAC and the quire rounding to posit can be subtracted from the total PAU area to obtain a value of 5346 LUTs and 1318 FFs. This outcome is much closer to the synthesis results of the FPU, as the PAU without quire occupies 1.32 times as many LUTs and 1.35 times as many FFs. These results match previous works [22], where the authors also report an increase of around 30% in FPGA resources when comparing their 32-bit PAU without quire with a 32-bit FPU. In our case, the actual cost of not including a quire would be even smaller, as the cost of allocating the 512-bit quire in the PAU and computing its 2's complement, which are included in the PAU top module, should also be subtracted. However, the synthesis tool does not provide this level of detail.

6.2 ASIC Synthesis

The 32-bit PAU with quire and the 32-bit FPU configuration present in PERCIVAL were synthesized targeting TSMC's 45 nm standard-cell library to further study their hardware cost in ASICs. The synthesis was performed using Synopsys Design Compiler with a timing constraint of 5 ns, which was met in both cases, and a toggle rate of 0.1. The 32-bit FPU within CVA6 requires an area of 30691 µm² and consumes 27.26 mW of power. On the other hand, the 32-bit PAU with quire requires an area of 76970 µm² and consumes 67.73 mW of power.
This follows the same trend shown in the FPGA synthesis, as the PAU with quire is significantly larger (2.51x) and consumes more power (2.48x) than the FPU. In addition, to better assess these values in comparison with other proposals, the PAU available in CLARINET [19] was also synthesized with the same parameters. We have chosen to evaluate this work because it integrates, to the best of our knowledge, the only other PAU that contains a quire. In this case, the 32-bit PAU with quire requires an area of 69920 µm² and consumes 68.31 mW of power. This is a decrease of around 10% in area and a slight increase in power compared to our proposal, although ours implements a much larger set of posit functionality.

TABLE 2
Instruction set of the proposed Xposit RISC-V extension.

31-27      26-25  24-20  19-15  14-12  11-7      6-0      Instruction
imm[11:0]                rs1    001    rd        0001011  PLW
imm[11:5]         rs2    rs1    011    imm[4:0]  0001011  PSW
00000      10     rs2    rs1    000    rd        0001011  PADD.S
00001      10     rs2    rs1    000    rd        0001011  PSUB.S
00010      10     rs2    rs1    000    rd        0001011  PMUL.S
00011      10     rs2    rs1    000    rd        0001011  PDIV.S
00100      10     rs2    rs1    000    rd        0001011  PMIN.S
00101      10     rs2    rs1    000    rd        0001011  PMAX.S
00110      10     00000  rs1    000    rd        0001011  PSQRT.S
00111      10     rs2    rs1    000    00000     0001011  QMADD.S
01000      10     rs2    rs1    000    00000     0001011  QMSUB.S
01001      10     00000  00000  000    00000     0001011  QCLR.S
01010      10     00000  00000  000    00000     0001011  QNEG.S
01011      10     00000  00000  000    rd        0001011  QROUND.S
01100      10     00000  rs1    000    rd        0001011  PCVT.W.S
01101      10     00000  rs1    000    rd        0001011  PCVT.WU.S
01110      10     00000  rs1    000    rd        0001011  PCVT.L.S
01111      10     00000  rs1    000    rd        0001011  PCVT.LU.S
10000      10     00000  rs1    000    rd        0001011  PCVT.S.W
10001      10     00000  rs1    000    rd        0001011  PCVT.S.WU
10010      10     00000  rs1    000    rd        0001011  PCVT.S.L
10011      10     00000  rs1    000    rd        0001011  PCVT.S.LU
10100      10     rs2    rs1    000    rd        0001011  PSGNJ.S
10101      10     rs2    rs1    000    rd        0001011  PSGNJN.S
10110      10     rs2    rs1    000    rd        0001011  PSGNJX.S
10111      10     00000  rs1    000    rd        0001011  PMV.X.W
11000      10     00000  rs1    000    rd        0001011  PMV.W.X
11001      10     rs2    rs1    000    rd        0001011  PEQ.S
11010      10     rs2    rs1    000    rd        0001011  PLT.S
11011      10     rs2    rs1    000    rd        0001011  PLE.S

TABLE 3
Comparison of FPGA synthesis results with different configurations of FPU, marked as F and D for 32 and 64-bit numbers respectively, and 32-bit PAU with quire.

                      With PAU                                                     Without PAU
                      F               D               FD              -            F               D               FD              -
Total core (LUT, FF)  (50318, 25727)  (55900, 27652)  (57129, 27996)  (44693, 23636)  (35402, 21618)  (40740, 23599)  (41260, 23945)  (28950, 19579)
FPU area (LUT, FF)    (3726, 1008)    (6352, 1905)    (7612, 2245)    -            (4046, 973)     (6626, 1905)    (8163, 2244)    -
PAU area (LUT, FF)    (11796, 2979)   (11810, 2979)   (11803, 2979)   (11879, 2985)   -               -               -               -

TABLE 4
FPGA synthesis area results of the PAU disaggregated into its individual components.

Name            LUTs   FFs
PAU top         593    1063
Posit Add       784    106
Posit Mult      736    73
Posit ADiv      413    43
Posit ASqrt     426    33
Posit MAC       5644   1541
Quire to Posit  889    126
Int to Posit    176    0
Long to Posit   331    0
ULong to Posit  425    0
Posit to Int    499    0
Posit to Long   379    0
Posit to UInt   228    0
Posit to ULong  358    0
PAU total       11879  2985
PAU w/o quire   5346   1318

Similarly to Section 6.1, the area and power results of the different elements inside the PAU are presented in Table 5. As can be seen, when subtracting the cost of the quire from the PAU, the outcome is still higher than the 32-bit FPU, but it becomes much closer. The 32-bit PAU occupies 1.32 times as much area and consumes 1.38 times as much power as the 32-bit IEEE FPU FPNew [15]. However, it is noteworthy that some aspects of posit arithmetic are not yet fully studied. For example, most of the works presenting posit units have tackled the decoding and encoding phases using sign-magnitude. Nonetheless, more recent studies show that a 2's complement approach is more efficient [13].

7 POSIT VS IEEE-754 BENCHMARKS

One of the benefits of PERCIVAL is that an accurate and fair comparison can be made between posit and IEEE floating point.
The main advantage of having support for native posit and IEEE floating point simultaneously on the same core is that identical benchmarks can be run on both number representations to compare them.

TABLE 5
ASIC synthesis area and power results of the 32-bit PAU with quire, broken down into its individual components.

Name             Area (µm²)  Power (mW)
PAU top           13462.15       12.69
Posit Add          4075.31        3.59
Posit Mult         8635.37        9.98
Posit ADiv         2540.87        2.41
Posit ASqrt        1722.84        1.61
Posit MAC         30419.12       26.07
Quire to Posit     6026.76        4.04
Int to Posit        905.99        0.68
Long to Posit      1423.43        0.96
UInt to Posit       869.77        0.66
ULong to Posit     1353.11        0.94
Posit to Int        966.67        0.71
Posit to Long      1810.33        1.38
Posit to UInt       958.44        0.68
Posit to ULong     1800.22        1.33
PAU total         76970.38       67.73
PAU w/o quire     40524.62       37.62
CLARINET PAU      69920.02       68.31

In this work, we have chosen to benchmark General Matrix Multiplication (GEMM) and the max-pooling layer, which is used to down-sample intermediate representations in neural networks. These examples showcase the use of the quire and of posits both in the PAU and in the ALU, loading and storing from memory and leveraging the posit register file. The GEMM and max-pooling codes for posits and IEEE floats have been written in C, including inline assembly for the required posit and float instructions. The floating-point code has also been written in inline assembly to provide exactly the same sequence of instructions to the core. The GEMM code for floats is shown in Figure 5, and the analogous version for posits using the quire is shown in Figure 6. These codes have been compiled using the modified version of LLVM with the Xposit RISC-V extension described in Section 5, and serve as an example of how this extension can be leveraged. The final target architecture is therefore RV64GCXposit. The -O2 optimization flag has been used to obtain optimized code in every case.

7.1 Accuracy

The accuracy differences between posits and floats are studied for the GEMM benchmark.
Furthermore, each arithmetic is executed with and without fused MAC operations, which in posit arithmetic include the quire. In the cases without quire or FMADD, each fused operation is substituted by a multiplication followed by an addition. The results obtained using the 64-bit IEEE 754 format are considered the golden solution and used to compute the Mean Squared Error (MSE) of 32-bit posits and of 32-bit IEEE 754 floating point. In all cases, the inputs are square matrices with the same random values. These input values are generated from a uniform distribution over intervals of the form [−10^i, 10^i], i ∈ {−1, 0, 1, 2, 3}, resulting in 5 different sets of inputs. These intervals allow for a study of the impact of the input data range on the GEMM.

Require: Float matrices a and b of size n×n.
Ensure: Float matrix c = ab.
for i = 0 to n-1 do
  for j = 0 to n-1 do
    asm("fmv.w.x ft0,zero":::);            {Set ft0 to 0}
    for k = 0 to n-1 do
      asm("flw ft1,0(%0)"                  {Load float a and b}
          "flw ft2,0(%1)"
          "fmadd.s ft0,ft1,ft2,ft0"        {Accumulate on ft0}
          :: "r"(&a[i * n + k]), "r"(&b[k * n + j]) :);
    end for
    asm("fsw ft0,0(%1)"                    {Store the result in c}
        : "=rm"(c[i * n + j]) : "r"(&c[i * n + j]) :);
  end for
end for
Fig. 5. 32-bit floating-point GEMM using the F RISC-V extension.

Require: Posit matrices a and b of size n×n.
Ensure: Posit matrix c = ab.
for i = 0 to n-1 do
  for j = 0 to n-1 do
    asm("qclr.s":::);                      {Clear the quire}
    for k = 0 to n-1 do
      asm("plw pt0,0(%0)"                  {Load posit a and b}
          "plw pt1,0(%1)"
          "qmadd.s pt0,pt1"                {Accumulate on the quire}
          :: "r"(&a[i * n + k]), "r"(&b[k * n + j]) :);
    end for
    asm("qround.s pt2"                     {Round the quire to a posit}
        "psw pt2,0(%1)"                    {Store the result in c}
        : "=rm"(c[i * n + j]) : "r"(&c[i * n + j]) :);
  end for
end for
Fig. 6. Posit GEMM using the Xposit RISC-V extension with the quire accumulator.

These random
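The accuracy benefit of deferring rounding, as the quire does in Fig. 6, can be illustrated without posit hardware. The sketch below is a simplified software analogue, not the paper's benchmark: binary32 floats stand in for posits, and Python's exactly rounded math.fsum stands in for the quire. It contrasts a dot product that rounds after every multiply and add with one that accumulates exactly and rounds once.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a Python double to the nearest IEEE 754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_step_rounded(a, b):
    """Binary32 rounding after every multiply and every add,
    analogous to the 'no quire / no FMADD' configurations."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_deferred(a, b):
    """Accumulate all products exactly and round once at the end,
    the role played in Fig. 6 by qmadd.s into the quire plus qround.s."""
    return to_f32(math.fsum(x * y for x, y in zip(a, b)))

# 1.0 followed by 64 tiny terms: each tiny addend is below half an ulp
# of 1.0 in binary32 (2^-24), so step-by-step rounding drops every one.
a = [1.0] + [2.0 ** -25] * 64
b = [1.0] * len(a)
assert dot_step_rounded(a, b) == 1.0            # all 64 small products lost
assert dot_deferred(a, b) == 1.0 + 2.0 ** -19   # deferred rounding keeps them
```

The adversarial input makes the effect exact rather than statistical; on random inputs the same mechanism produces the gradual error accumulation measured in Table 6.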
values are generated as 64-bit IEEE 754 numbers and then converted to the two other formats with the aid of the SoftPosit [26] library. The MSE results are shown in Table 6 for different matrix sizes and input ranges. Additionally, Figure 7 shows the MSE in the [−1, 1] case. We give slightly more attention to this case since many applications normalize their values. As can be seen, for 256×256 matrices, the difference between MSEs is around four orders of magnitude when using fused operations. This is reduced to two orders of magnitude if the quire is not used. Note that when using floats, the accuracy difference between employing fused FMADD operations or not is minimal. If we compare how the MSE scales when increasing the matrix size, posit numbers present a better behavior thanks to the quire register. This is true in all ranges of input values. Overall, the impact of the quire is significant across all test cases, and its extra cost is justified by the results. These results are in line with our previous work [27], where a similar benchmark was performed using hardware simulations with an input interval of [−2, 2]. The MSE results on 32-bit floats and posits follow the same trends shown in Table 6. When removing the quire, posits still have a lower MSE than floats except in the [−1000, 1000] case. This can be explained by posits' tapered precision. When the numbers' exponents are close to 0, they fall in the so-called "golden zone" of posits [4]. This is the region where posits have more accuracy bits than floats thanks to their variable-length fields. However, when the accumulated values are very large or very small, IEEE floats gain an advantage over posits without quire. In particular, this "golden zone" comprises values roughly in the interval [10^−6, 10^6]. In the test with input values in [−1000, 1000], the absolute value of the final outputs averages 1.2×10^6 in the 16×16 matrix and 4.3×10^6 in the 256×256 case.
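The "golden zone" can be made concrete by counting significand bits. In posit⟨32,2⟩, a value with regime k (covering scales 2^(4k+e), e ∈ {0..3}) spends 1 bit on the sign, k+2 bits (k ≥ 0) or −k+1 bits (k < 0) on the regime, and 2 bits on the exponent, leaving the rest for the fraction. The sketch below is our own illustration of this bookkeeping, not code from the paper, comparing that count against the fixed 23 stored fraction bits of binary32:

```python
def posit32_fraction_bits(k: int) -> int:
    """Fraction bits available in posit<32,2> for regime value k.

    Regime encoding: k >= 0 -> (k+1) ones then a zero (k+2 bits);
    k < 0 -> (-k) zeros then a one (-k+1 bits). es = 2 exponent bits.
    """
    regime_len = k + 2 if k >= 0 else -k + 1
    return max(0, 32 - 1 - regime_len - 2)

FLOAT32_FRACTION_BITS = 23

# Near 1.0 (k = 0, scales 2^0..2^3) posits carry 27 fraction bits,
# four more than binary32: the heart of the golden zone.
assert posit32_fraction_bits(0) == 27

# The advantage shrinks as |k| grows and disappears around scales of
# 2^(+-20), i.e. roughly 10^(+-6), matching the interval quoted above.
assert posit32_fraction_bits(4) == FLOAT32_FRACTION_BITS   # scales 2^16..2^19
assert posit32_fraction_bits(5) < FLOAT32_FRACTION_BITS    # scales 2^20..2^23
assert posit32_fraction_bits(-5) == FLOAT32_FRACTION_BITS  # scales 2^-20..2^-17
assert posit32_fraction_bits(-6) < FLOAT32_FRACTION_BITS   # scales 2^-24..2^-21
```

This is why floats overtake quire-less posits once the accumulated magnitudes climb past roughly 10^6, as observed in the [−1000, 1000] test.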
As a comparison, even in the 256×256 multiplication, the [−100, 100] input range only averages 4.3×10^4.

7.2 Performance

Besides the synthesis data presented in Section 6, execution time is a critical metric to study the hardware performance of posits and floats. The test has been performed by executing the same GEMM and max-pooling benchmarks described previously on PERCIVAL, avoiding cold misses and averaging over 10 executions to obtain more accurate measurements. The range of the input values does not affect performance. Thus, the values shown in Table 7 for GEMM are an average of the timings obtained in the 5 cases described previously, for a total of 50 executions of the GEMM operation. In this case, when using fused MAC operations and the quire, the execution time of 32-bit posits is practically the same as that of single-precision floats for the larger matrix sizes, where the overhead of the extra qround.s instruction becomes negligible (see Figure 6). This instruction is executed on the order of O(n²) times, compared with the O(n³) running time of the algorithm. This cost is noticeable for smaller values of n, where 32-bit posits are slightly slower than 32-bit and 64-bit floats. However, for larger matrix sizes, which are common in scientific applications and Deep Neural Networks (DNNs), 32-bit posits perform equally to 32-bit floats and outperform 64-bit floats, since 64-bit instructions require more clock cycles to compute. Furthermore, as seen in the previous accuracy benchmark, 32-bit posits are orders of magnitude more accurate than 32-bit floats when performing this calculation. Therefore, they provide an alternative solution for the execution of kernels that make use of the dot product. The quire and fused MAC operations have a positive impact on timing performance in all test cases.
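The qround.s overhead argument can be checked with a quick operation count. The model below is our own back-of-the-envelope sketch, not a measurement from the paper: it counts one qround.s per output element of an n×n GEMM against the n fused MACs each element requires.

```python
def qround_fraction(n: int) -> float:
    """Fraction of quire operations that are qround.s in an n x n GEMM:
    n^2 rounds (one per output element) vs. n^3 qmadd.s (one per i,j,k)."""
    macs = n ** 3
    rounds = n ** 2
    return rounds / (macs + rounds)

# The overhead falls as 1/(n+1): visible at n = 16, negligible at n = 256,
# consistent with the timing convergence seen in Table 7.
assert abs(qround_fraction(16) - 1 / 17) < 1e-12
assert qround_fraction(256) < 0.004
```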
Again, this performance increase stems from the extra clock cycles needed for a separate multiplication followed by an addition, compared to a single fused operation. Additionally, for the sake of completeness, we have performed the same GEMM timing test on a commercial core with support for posit arithmetic. RacEr, a GPGPU FPGA provided by VividSparks, supports computation with Posit32 but does not include quire support, so its accuracy results are the same as the Posit32 no-quire case. It has 512 CPUs running at 300 MHz with 32 GB of DDR4 RAM. Table 7 also includes the results of the GEMM benchmark on this platform. As can be seen, our proposal provides significantly faster results than this commercial accelerator.

Regarding the max-pooling layers, three different configurations have been tested, following common DNNs. In LeNet-5, the input of this layer is 28×28×6 and the 2×2 pooling kernel is applied with a stride of 2, creating a 14×14×6 output representation. In AlexNet, the input size is 54×54×96 and the 3×3 kernel is applied with a stride of 2, generating an output of size 26×26×96. Finally, ResNet-50 is the largest configuration we have tested: its input is 112×112×64 and the 3×3 pooling kernel is again applied with a stride of 2, creating a 55×55×64 output representation. The results of executing these layers on PERCIVAL using the 32- and 64-bit IEEE floating-point and Posit32 representations are shown in Table 8. Results show that 32-bit posits perform as fast as 32-bit floats, but without the need for extra hardware, as the posit maximum operation is carried out reusing the integer ALU. Double-precision floats are slower than 32-bit posits and floats by a factor of 1.4-1.7× due to the latency difference in the units, as seen in the GEMM benchmark.

8 CONCLUSIONS

This paper has presented PERCIVAL, an extension of the application-level CVA6 RISC-V core that includes all 32-bit posit instructions as well as the quire fused operations.
These capabilities, integrated into a Posit Arithmetic Unit together with a posit register file, are natively incorporated while preserving IEEE 754 single- and double-precision floats. Furthermore, the RISC-V ISA has been extended with Xposit, which includes support for all posit and quire instructions. This allows the compilation and execution on PERCIVAL of application-level programs that make use of posits and floats simultaneously. To the best of our knowledge, this is the first work that enables complete posit and quire capabilities in hardware. Synthesis results show that half of the area dedicated to the PAU is occupied by the quire and its operations. Compared with the only previous work that includes quire capabilities [19], our proposal consumes slightly less power and requires only 10% more area, while also providing full posit operation support. Focusing on the 32-bit PAU without the quire, our proposal requires 32% more area and 38% more power than the 32-bit FPU. This is in line with the results of recent works that reuse the F RISC-V extension [22], whose authors report a 30% increase in FPGA resources when comparing their PAU to the FPU. The posit vs. IEEE 754 benchmark results show that 32-bit posits are up to 4 orders of magnitude more accurate than 32-bit floats when calculating the GEMM

TABLE 6
GEMM MSE comparison between IEEE 754 floating-point and posit numbers.
Input values   Method             16×16         32×32         64×64         128×128       256×256
[-0.1, 0.1]    IEEE 754           1.385×10^-18  4.429×10^-18  1.523×10^-17  6.347×10^-17  2.407×10^-16
               Posit32            3.157×10^-21  6.110×10^-21  1.158×10^-20  2.014×10^-20  3.497×10^-20
               IEEE 754 no FMADD  1.515×10^-18  4.752×10^-18  1.566×10^-17  6.524×10^-17  2.432×10^-16
               Posit32 no quire   2.146×10^-20  6.726×10^-20  2.371×10^-19  7.805×10^-19  2.203×10^-18
[-1, 1]        IEEE 754           1.490×10^-14  4.251×10^-14  1.602×10^-13  6.019×10^-13  2.361×10^-12
               Posit32            1.138×10^-17  2.355×10^-17  4.729×10^-17  9.430×10^-17  1.937×10^-16
               IEEE 754 no FMADD  1.324×10^-14  4.637×10^-14  1.686×10^-13  6.246×10^-13  2.416×10^-12
               Posit32 no quire   5.028×10^-17  1.727×10^-16  6.457×10^-16  2.447×10^-15  9.870×10^-15
[-10, 10]      IEEE 754           1.371×10^-10  3.998×10^-10  1.581×10^-9   5.922×10^-9   2.378×10^-8
               Posit32            8.549×10^-13  1.475×10^-12  3.055×10^-12  6.355×10^-12  1.295×10^-11
               IEEE 754 no FMADD  1.300×10^-10  4.304×10^-10  1.708×10^-9   6.026×10^-9   2.447×10^-8
               Posit32 no quire   3.878×10^-12  1.341×10^-11  7.500×10^-11  3.282×10^-10  1.41×10^-9
[-100, 100]    IEEE 754           1.412×10^-6   4.206×10^-6   1.544×10^-5   6.402×10^-5   2.405×10^-4
               Posit32            4.819×10^-8   8.266×10^-8   1.760×10^-7   6.150×10^-7   1.506×10^-6
               IEEE 754 no FMADD  1.293×10^-6   5.052×10^-6   1.595×10^-5   6.503×10^-5   2.440×10^-4
               Posit32 no quire   3.077×10^-7   1.230×10^-6   4.295×10^-6   2.804×10^-5   1.569×10^-4
[-1000, 1000]  IEEE 754           1.503×10^-2   3.936×10^-2   1.509×10^-1   6.069×10^-1   2.391
               Posit32            5.293×10^-3   8.573×10^-3   1.900×10^-2   3.746×10^-2   8.265×10^-2
               IEEE 754 no FMADD  1.675×10^-2   4.815×10^-2   1.644×10^-1   6.323×10^-1   2.433
               Posit32 no quire   4.168×10^-2   1.570×10^-1   5.669×10^-1   2.365         9.586

[Bar chart omitted; the plotted values are the [−1, 1] rows of Table 6.]
Fig. 7. MSE results of posits and floats with respect to doubles in the GEMM test with input values in [−1, 1]. Note the logarithmic Y-axis. Blue (green) bars show the results with (without) fused MAC and quire operations.

TABLE 7
GEMM timing comparison between IEEE 754 floating-point and posit numbers.

Matrix size                   16×16     32×32    64×64    128×128  256×256
32-bit float                  0.978 ms  6.58 ms  52.1 ms  1.48 s   13.9 s
64-bit float                  0.920 ms  6.64 ms  69.4 ms  1.74 s   15.0 s
Posit32                       0.949 ms  7.30 ms  57.7 ms  1.48 s   13.9 s
32-bit float no FMADD         1.16 ms   8.69 ms  68.6 ms  1.61 s   15.0 s
64-bit float no FMADD         1.26 ms   9.36 ms  92.6 ms  1.92 s   16.7 s
Posit32 no quire              1.27 ms   9.63 ms  69.1 ms  1.61 s   15.0 s
VividSparks Posit32 no quire  7.95 ms   48.9 ms  345 ms   2.63 s   21.1 s

TABLE 8
Max-pooling timing comparison between IEEE 754 floating-point and posit numbers.

Max-pooling layer       32-bit float  64-bit float  Posit32
LeNet-5 (28×28×6)       0.715 ms      1.211 ms      0.688 ms
AlexNet (54×54×96)      0.115 ms      0.160 ms      0.116 ms
ResNet-50 (112×112×64)  0.337 ms      0.470 ms      0.340 ms

due to the quire. Moreover, they do not show a performance degradation compared with floats, thus providing a potential alternative when operating with real numbers. In addition, our proposal performs significantly better than available commercial solutions, obtaining up to 8× speedup when multiplying small matrices. Some known limitations concern the use of the quire. As it is a single internal register in the PAU, PERCIVAL cannot support parallel accumulation into different independent accumulators. This also prevents safe automatic context switches, as the value of the quire cannot be loaded from or stored to memory. Therefore, this must be taken into account when developing programs for PERCIVAL, so as not to overwrite the value of the quire.
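The hazard behind this limitation can be sketched in software terms. Below, a toy emulated accumulator (our own illustration; the hardware quire is of course not a Python object) shows how two logically independent accumulations sharing the single quire corrupt each other when interleaved, e.g. by a context switch:

```python
class SingleQuire:
    """Toy model of a single shared fused accumulator (not the RTL)."""
    def __init__(self):
        self.acc = 0.0

    def qclr(self):
        """Analogue of qclr.s: clear the one shared accumulator."""
        self.acc = 0.0

    def qmadd(self, x, y):
        """Analogue of qmadd.s: fused multiply-accumulate."""
        self.acc += x * y

    def qround(self):
        """Analogue of qround.s: read out the accumulated value."""
        return self.acc

quire = SingleQuire()

# Two independent dot products, interleaved as a context switch would:
quire.qclr()
quire.qmadd(1.0, 2.0)           # task A accumulates 2.0
quire.qclr()                    # task B starts and clears the shared quire
quire.qmadd(3.0, 4.0)           # task B accumulates 12.0
assert quire.qround() == 12.0   # task B is fine...
# ...but task A's partial sum is gone: with a single quire that cannot
# be saved or restored, such interleaving must be avoided by software.
```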
As future work, we plan to implement and evaluate on PERCIVAL large-scale scientific applications that make use of dot products, leveraging the accuracy gains of fused operations.

ACKNOWLEDGMENTS
This work was supported by a 2020 Leonardo Grant for Researchers and Cultural Creators from the BBVA Foundation, whose id is PR2003 20/01, by the EU (FEDER) and the Spanish MINECO under grant RTI2018-093684-B-I00, and by the CM under grant S2018/TCS-4423.

REFERENCES
[1] IEEE Computer Society, "IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1–84, Jul. 2019.
[2] J. L. Gustafson and I. T. Yonemoto, "Beating floating point at its own game: Posit arithmetic," Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, Apr. 2017.
[3] A. Guntoro, C. De La Parra, F. Merchant, F. De Dinechin, J. L. Gustafson, M. Langhammer, R. Leupers, and S. Nambiar, "Next Generation Arithmetic for Edge Computing," in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). Grenoble, France: IEEE, Mar. 2020, pp. 1357–1365.
[4] F. de Dinechin, L. Forget, J.-M. Muller, and Y. Uguen, "Posits: The good, the bad and the ugly," in Proceedings of the Conference for Next Generation Arithmetic 2019, ser. CoNGA'19. New York, NY, USA: Association for Computing Machinery, 2019.
[5] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović, "The RISC-V instruction set manual, volume I: User-level ISA, version 2.0," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-54, May 2014.
[6] R. Murillo, A. A. Del Barrio, and G. Botella, "Deep PeNSieve: A deep learning framework based on the posit number system," Digital Signal Processing, vol. 102, p. 102762, Jul. 2020.
[7] G. Raposo, P. Tomás, and N. Roma, "PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 7908–7912.
[8] H. F.
Langroudi, V. Karia, Z. Carmichael, A. Zyarah, T. Pandit, J. L. Gustafson, and D. Kudithipudi, "ALPS: Adaptive Quantization of Deep Neural Networks with GeneraLized PositS," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Nashville, TN, USA: IEEE, Jun. 2021, pp. 3094–3103.
[9] A. Dörflinger, M. Albers, B. Kleinbeck, Y. Guan, H. Michalik, R. Klink, C. Blochwitz, A. Nechi, and M. Berekovic, "A comparative survey of open-source application-class RISC-V processor implementations," in Proceedings of the 18th ACM International Conference on Computing Frontiers, ser. CF '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 12–20.
[10] F. Zaruba and L. Benini, "The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 11, pp. 2629–2640, Nov. 2019.
[11] R. Murillo, A. A. Del Barrio Garcia, G. Botella, M. S. Kim, H. Kim, and N. Bagherzadeh, "PLAM: A Posit Logarithm-Approximate Multiplier," IEEE Transactions on Emerging Topics in Computing, pp. 1–1, 2021.
[12] Posit Working Group, "Posit Standard Documentation Release 4.12-draft," Jul. 2021. [Online]. Available: https://posithub.org/posit_standard4.12.pdf
[13] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, "Comparing Different Decodings for Posit Arithmetic," in Conference on Next Generation Arithmetic (CoNGA), 2022.
[14] J. L. Gustafson, "RISC-V Proposed Extension for 32-bit Posits," https://posithub.org/docs/RISC-V/RISC-V.htm, Jun. 2018.
[15] S. Mach, F. Schuiki, F. Zaruba, and L. Benini, "FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 4, pp. 774–787, Apr. 2021.
[16] R. Chaurasiya, J. Gustafson, R. Shrestha, J. Neudorfer, S.
Nambiar, K. Niyogi, F. Merchant, and R. Leupers, "Parameterized Posit Arithmetic Hardware Generator," in 2018 IEEE 36th International Conference on Computer Design (ICCD), Oct. 2018, pp. 334–341.
[17] M. K. Jaiswal and H. K.-H. So, "PACoGen: A Hardware Posit Arithmetic Core Generator," IEEE Access, vol. 7, pp. 74586–74601, 2019.
[18] R. Murillo, A. A. Del Barrio, and G. Botella, "Customized Posit Adders and Multipliers using the FloPoCo Core Generator," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Oct. 2020, pp. 1–5.
[19] N. Sharma, R. Jain, M. Mohan, S. Patkar, R. Leupers, N. Rishiyur, and F. Merchant, "CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism," arXiv:2006.00364 [cs], Oct. 2021.
[20] M. V. Arunkumar, S. G. Bhairathi, and H. G. Hayatnagarkar, "PERC: Posit Enhanced Rocket Chip," in 4th Workshop on Computer Architecture Research with RISC-V (CARRV'20), 2020, p. 8.
[21] S. Tiwari, N. Gala, C. Rebeiro, and V. Kamakoti, "PERI: A Configurable Posit Enabled RISC-V Core," ACM Transactions on Architecture and Code Optimization, vol. 18, no. 3, pp. 1–26, Jun. 2021.
[22] S. D. Ciocirlan, D. Loghin, L. Ramapantulu, N. Tapus, and Y. M. Teo, "The Accuracy and Efficiency of Posit Arithmetic," arXiv:2109.08225 [cs], Sep. 2021.
[23] M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara, "A Lightweight Posit Processing Unit for RISC-V Processors in Deep Neural Network Applications," IEEE Transactions on Emerging Topics in Computing, no. 01, pp. 1–1, Oct. 2021.
[24] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in International Symposium on Code Generation and Optimization, 2004. CGO 2004., Mar. 2004, pp. 75–86.
[25] "The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20191213," Dec. 2019. [Online]. Available: https://riscv.org/technical/specifications/
[26] S. H. Leong, "SoftPosit," Mar. 2020. [Online].
Available: https://gitlab.com/cerlane/SoftPosit
[27] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, "Energy-Efficient MAC Units for Fused Posit Arithmetic," in 2021 IEEE 39th International Conference on Computer Design (ICCD), Oct. 2021, pp. 138–145.

David Mallasén David Mallasén Quintana received a BSc Degree in Computer Science and a BSc Degree in Mathematics in 2020 from the Complutense University of Madrid (UCM). From 2020 to 2022 he completed an MSc Degree in Embedded Systems at KTH Royal Institute of Technology, specializing in embedded platforms. He is currently pursuing a Ph.D. in Computer Engineering at UCM. His main research areas include computer arithmetic, computer architecture, embedded systems, and high-performance computing.

Raul Murillo Raul Murillo studied Mathematics and Computer Science at Complutense University of Madrid (UCM), Spain, where he also received an MSc Degree in Computer Science in 2021. His main research interests include Approximate Computing, new Computer Arithmetic, and Deep Neural Networks (DNNs). He is currently pursuing a Ph.D. at UCM related to these areas.

Alberto A. Del Barrio Alberto A. Del Barrio received the Ph.D. degree in Computer Science from the Complutense University of Madrid (UCM), Madrid, Spain, in 2011. He has performed stays at Northwestern University, the University of California at Irvine and the University of California at Los Angeles. Since 2021, he has been an Associate Professor (tenure-track, civil servant) of Computer Science with the Department of Computer Architecture and System Engineering, UCM. His main research interests include Design Automation, Arithmetic and their application to the field of Artificial Intelligence. He is leading the PARNASO project, funded by the Leonardo Grants program of Fundación BBVA. Its main objective is to natively integrate the posit format in a hardware/software platform.
He has been an IEEE Senior Member since 2019 and an ACM Senior Member since December 2020.

Guillermo Botella Guillermo Botella received the M.A.Sc. degree in Physics (Fundamental) in 1998, the M.A.Sc. degree in Electronic Engineering in 2001, and the Ph.D. degree in Computer Engineering in 2007, all from the University of Granada, Spain. He was a research fellow funded by the EU working at the University of Granada, Spain, and at the Vision Research Laboratory of University College London, UK. He then joined the Department of Computer Architecture and Automation of Complutense University of Madrid, Spain, as an Assistant Professor, where he is currently an Associate Professor. From 2008 to 2012 he performed research stays, also acting as visiting professor, at the Department of Electrical and Computer Engineering, Florida State University, Tallahassee, USA. His current research interests include Image and Video Processing for VLSI, FPGAs, GPGPUs, Embedded Systems, and novel computing paradigms such as analog and quantum computing. He has been an IEEE Senior Member since 2019.

Luis Piñuel Luis Piñuel is an Associate Professor in the Department of Computer Architecture and Systems Engineering at the Universidad Complutense de Madrid (UCM), Spain. He received his M.Sc. and Ph.D. degrees in Computer Science from UCM in 1996 and 2003, respectively. His research interests include computer architecture, high-performance computing, embedded systems, and resource management for emerging computing systems. In these fields, he has co-authored more than 70 publications in prestigious journals and international conferences, as well as several book chapters, and he has advised or co-advised 5 PhD dissertations. Committed to improving knowledge transfer between research institutions and industry, he has directed more than 12 research contracts with different companies (Texas Instruments, Imagination Technologies, Indra, ...).
He has also served as an evaluator for several national agencies and has been a member of the Board of Directors of the Spanish Computer Architecture Society (SARTECO).

Manuel Prieto-Matias Manuel Prieto Matias obtained a Ph.D. degree from Complutense University of Madrid (UCM) in 2000. Since 2002, he has been a Professor at the Department of Computer Architecture at UCM, and a Full Professor since 2019. His research interests include high-performance computing, non-volatile memory technologies, accelerators, and code generation and optimization. His current focus is on effectively managing resources on emerging computing platforms, emphasizing the interaction between the system software and the underlying architecture. He has co-authored over 100 scientific publications in journals and conferences on parallel computing and computer architecture. He is a member of the ACM.