Big-PERCIVAL: Exploring the Native Use of 64-Bit Posit Arithmetic in Scientific Computing

David Mallasén, Alberto A. Del Barrio, Senior Member, IEEE, and Manuel Prieto-Matias

Abstract—The accuracy requirements in many scientific computing workloads result in the use of double-precision floating-point arithmetic in the execution kernels. Nevertheless, emerging real-number representations, such as posit arithmetic, show promise in delivering even higher accuracy in such computations. In this work, we explore the native use of 64-bit posits in a series of numerical benchmarks and compare their timing performance, accuracy, and hardware cost to IEEE 754 doubles. In addition, we also study the conjugate gradient method for numerically solving systems of linear equations in real-world applications. For this, we extend the PERCIVAL RISC-V core and the Xposit custom RISC-V extension with posit64 and quire operations. Results show that posit64 can obtain up to 4 orders of magnitude lower mean squared error than doubles. This leads to a reduction in the number of iterations required for convergence in some iterative solvers. However, leveraging the quire accumulator register can limit the order of some operations, such as matrix multiplications. Furthermore, detailed FPGA and ASIC synthesis results highlight the significant hardware cost of 64-bit posit arithmetic and quire. Despite this, the large accuracy improvements achieved with the same memory bandwidth suggest that posit arithmetic may provide a potential alternative representation for scientific computing.

Index Terms—Arithmetic, Posit, IEEE-754, Floating point, Scientific computing, RISC-V, CPU, Matrix multiplication, PolyBench.

1 INTRODUCTION

Real-number arithmetic is at the core of many scientific workloads. Physical constants, data from sensors, and in general most inputs to experimental applications have to be represented accurately in a computer. Moreover, this accuracy has to be maintained throughout the execution of the algorithms that are at the root of scientific computing. Even minor errors can have significant consequences, potentially leading to incorrect predictions of the behavior of a system or inaccurate solutions to differential equations and optimization problems.

The most widely used representation of real numbers in a computer is the IEEE 754 standard for floating-point arithmetic [1]. Although this standard is considered a robust and reliable method for representing and operating with real numbers on a computer, it is not perfect. For instance, its results can be inconsistent across platforms, it does not ensure the associative property of additions and multiplications, it has signed zeros, and there is an excess of Not a Number (NaN) representations.

In the past years, other alternatives to this floating-point format have emerged. Some of these new arithmetic representations have been implemented by large technological companies, especially in the machine learning domain. Examples of this are Google's bfloat16 [2] or Nvidia's TensorFloat [3].
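As a concrete illustration of the non-associativity of IEEE 754 addition mentioned above, consider the following small C program (our own example, with values chosen so that one summand is absorbed by rounding):

```c
/* Non-associative IEEE 754 addition: the grouping changes the result,
 * because adding 1.0 to -1e16 falls below the rounding granularity. */
#include <stdio.h>

int main(void) {
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c); /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c)); /* prints 0 */
    return 0;
}
```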
In scientific computing, the solution to the accuracy requirements is to use wider floating-point representations such as double-precision floats. However, another solution is to explore emerging floating-point representations that provide more accuracy bits. One of the most promising alternatives for this purpose is posit numbers, which we study in this work.

All authors are with the Facultad de Informática, Universidad Complutense de Madrid, 28040 Madrid, Spain. E-mails: {dmallase, abarriog, mpmatias}@ucm.es

Targeting this goal, we have extended the PERCIVAL posit RISC-V core [4] to support 64-bit posits, as well as to include a more diverse and flexible design. This has allowed us to explore the native use of this arithmetic with a larger bit-width both at the hardware-cost level and at the accuracy and performance levels. For the first part, we have performed Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) synthesis of different configurations of Big-PERCIVAL, and we give a detailed analysis of the results. For the accuracy and performance comparison of the arithmetics, we have added posit support to the PolyBench benchmark suite [5]. In addition, we have also studied the accuracy of iterative linear solvers (conjugate gradient and biconjugate gradient) as well-known examples of widely used applications from science and engineering that can benefit from higher accuracy [6]. Results were measured on Big-PERCIVAL running on the Genesys II FPGA board.

Our main contributions can be summarized as follows:

• We present Big-PERCIVAL, an extension of the PERCIVAL posit RISC-V core [4] (https://github.com/artecs-group/PERCIVAL) which adds posit64 operations and increased flexibility. In particular, we support standard posit addition, subtraction, multiplication, division and square root, conversions to and from integer numbers, comparison operations, and register move instructions. Optionally, we also support quire operations and logarithm-approximate multiplication, division, and square root units.

• Detailed FPGA and ASIC synthesis results of the Posit Arithmetic Unit (PAU) in Big-PERCIVAL showcase the area impact of posit arithmetic and quire in different configurations of the core. These results are compared with the FPNew IEEE 754 Floating-Point Unit (FPU) [7]. An analysis of the individual units in the PAU gives insight into how the hardware resources are distributed among the different operations.

• Compiler support for posit64 numbers in the Xposit custom RISC-V extension in LLVM (https://github.com/artecs-group/llvm-xposit). This allows for easily embedding posit and quire instructions, including loads and stores, into C code.

• PolyBench benchmark results provide insight into how posit32 and posit64 numbers compare to IEEE 754 floats and doubles in terms of timing performance and accuracy. In particular, the impact of the quire accumulator register is also studied. Results show that 64-bit posits can provide up to 4 orders of magnitude lower Mean Squared Error (MSE) and up to 3 orders of magnitude lower Maximum Absolute Error (MaxAbsE) than 64-bit doubles. This provides improved accuracy with the same memory bandwidth as doubles.
• Iterative linear equation solvers, namely the conjugate gradient and biconjugate gradient algorithms, showcase how posit64 can reduce the number of iterations needed to reach a certain tolerance margin when solving large ill-conditioned systems, which are frequent in scientific computing and engineering problems.

• Detailed analysis of how leveraging the quire affects the order in which some operations are executed. For example, executing the General Matrix Multiplication (GEMM) kernel using a dot-product or a memory-aware method impacts the final timing performance and accuracy.

The rest of the paper is organized as follows: Section 2 introduces posit arithmetic. Related works on RISC-V cores, the use of posits in High-Performance Computing (HPC), and theoretical studies on posit64 are presented in Section 3. In Section 4 we describe the novelties in Big-PERCIVAL and the Xposit custom RISC-V extension. The synthesis results of the core and the individual posit64 units are analyzed in Section 5. Benchmark results using PolyBench targeting accuracy and timing performance are shown in Section 6, followed by the conjugate gradient use case in Section 7, and a more exhaustive analysis of the GEMM kernel in Section 8. Finally, Section 9 concludes this work.

2 POSIT ARITHMETIC

The posit number standard [8] defines a posit configuration from its total bit-width n. This allows for any posit size, but in the literature the most common ones are the byte-aligned posit8, posit16, posit32, and posit64 configurations. One of the main benefits of posit arithmetic is that it does not have a variety of special cases that have to be checked. Posits have only two special cases. The value zero is represented as 00...0, and the Not-a-Real (NaR) is represented as 10...0. The rest of the bit patterns are composed of the four fields shown in Figure 1.

Fig. 1. Posit format with sign, regime, exponent, and fraction fields.

These four bit-fields are:

• The sign bit S, whose value is s = 0 if the number is positive or s = 1 if it is negative.

• The variable-length regime field R, which consists of a series of k bits equal to R0 and terminated either by 1 − R0 or by the end of the posit. This field represents a long-range scaling factor r given by:

    r = −k       if R0 = 0
    r = k − 1    if R0 = 1

• The exponent field E, consisting of at most 2 bits. This field encodes an integer unbiased value e. Since the regime field is variable-length, one or both of the exponent bits may fall beyond the least significant bit of the posit. In that case, those bits take the value 0.

• The variable-length fraction field F, which is formed by the m remaining bits. Its value f is given by dividing the unsigned integer F by 2^m, and therefore 0 ≤ f < 1.

From these fields, we can calculate the real value p of a generic posit as:

    p = ((1 − 3s) + f) × 2^((1−2s)·(4r+e+s)).    (1)

This is the most efficient decoding of posits, as shown in [9], [10]. The most notable differences in this value representation between posit arithmetic and the IEEE 754 floating-point standard are the existence of the variable-length regime, the use of an unbiased exponent, and the value of the hidden bits [9]. In floating-point arithmetic, the hidden bit is fixed to 1, except for the subnormal numbers, where it is fixed to 0. However, in posit arithmetic, it is kept as 1 if the number is positive, or changed to −2 if the number is negative.

Fig. 2. Decoding example of a posit16.
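To make this decoding concrete, the following C sketch (illustrative code written for this article, not part of the Xposit toolchain) extracts the four fields of a 16-bit posit, with es = 2 as fixed by the 2022 standard [8], and applies Equation (1). The worked example below traces the same steps by hand.

```c
/* Decode a 16-bit posit (es = 2) to a double, following Equation (1). */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

double posit16_to_double(uint16_t bits) {
    if (bits == 0x0000) return 0.0;
    if (bits == 0x8000) return NAN;             /* NaR */
    int s = bits >> 15;                         /* sign bit */
    int r0 = (bits >> 14) & 1;                  /* first regime bit R0 */
    int k = 0, i = 14;
    while (i >= 0 && ((bits >> i) & 1) == r0) { k++; i--; }
    int r = r0 ? (k - 1) : -k;                  /* regime scaling factor */
    i--;                                        /* skip terminating regime bit */
    int e = 0;                                  /* up to 2 exponent bits; */
    for (int j = 0; j < 2; j++, i--)            /* missing bits count as 0 */
        e = (e << 1) | (i >= 0 ? (bits >> i) & 1 : 0);
    int m = i + 1;                              /* remaining fraction bits */
    double f = m > 0 ? (double)(bits & ((1u << m) - 1)) / (double)(1u << m) : 0.0;
    /* Equation (1): p = ((1 - 3s) + f) * 2^((1 - 2s)(4r + e + s)) */
    return ((1 - 3 * s) + f) * pow(2.0, (1 - 2 * s) * (4 * r + e + s));
}

int main(void) {
    /* The worked example below: 1111101010010110 = 0xFA96 */
    printf("%.9f\n", posit16_to_double(0xFA96)); /* -0.000043154... */
    return 0;
}
```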
As an example, let 1111101010010110 be the binary encoding of a posit16 (Figure 2). The first bit s = 1 indicates a negative number. The regime field 11110 gives k = 4 and therefore r = 3. The next two bits 10 represent the exponent e = 2. Finally, the remaining m = 8 bits, 10010110, encode a fraction value of f = 150/2^8 = 0.5859375. Hence, from Equation (1) we conclude that 1111101010010110 ≡ (−2 + 0.5859375) × 2^(−(4·3+2+1)) = −0.000043154.

The variable-length regime field acts as a long-range dynamic exponent, as can be seen in Equation (1), where it is multiplied by 4 or, equivalently, shifted left by the two exponent bits. Since the regime and the fraction are dynamic fields, they allow for more flexibility in the trade-off between accuracy and dynamic range that can be achieved by a posit. If the regime field occupies more bits, it represents larger numbers at the cost of lower accuracy. On the other hand, when the regime field consists of fewer bits, posits have higher accuracy in the neighborhoods of ±1.

In posit arithmetic, NaR has a unique representation that maps to the most negative 2's complement signed integer. Consequently, if used in comparison operations, it results in less than all other posits and equal to itself. Moreover, the rest of the posit values follow the same ordering as their corresponding bit representations. These characteristics allow posit numbers to be compared as if they were 2's complement signed integers, eliminating additional hardware for posit comparison operations.

Posit arithmetic also includes fused operations using the quire, a 16n-bit fixed-point 2's complement register. This special accumulation register allows for the execution of up to 2^31 − 1 Multiply-Accumulate (MAC) operations without intermediate rounding or accuracy loss. These operations are very common when computing dot products, matrix multiplications, or other more complex algorithms. The additional accuracy that can be achieved using the quire can allow the execution of these algorithms with narrower posit configurations [11], [12], [13], thus avoiding the limits that can occur in memory bandwidth.

Currently, one of the main drawbacks of posit arithmetic is its higher area cost [4]. For an accurate comparison between posits and floats, the FPU must be IEEE 754 compliant instead of being limited to normal floats only. The authors in [14] state that posit hardware is slightly more expensive than floating-point hardware that does not take subnormal numbers into account. Moreover, adding a wide quire accumulator register further increases the area cost of implementing these fused operations.

3 RELATED WORK

Common open-source application-class RISC-V cores include support for the F and D RISC-V extensions for IEEE 754 single- and double-precision floating-point numbers (which are part of the G compilation). Some of the most notable ones include the Rocket [15], the CVA6 (formerly Ariane) [16], the Berkeley Out-of-Order Machine (BOOM) [17], and the SHAKTI C-Class processor [18].

There have been several previous proposals for including some level of posit arithmetic capabilities in RISC-V cores. PERC [19] and PERI [20] included PAUs in the Rocket core and the SHAKTI C-class core, respectively. These proposals were constrained by the F and D extensions and thus did not include quire support. CLARINET [21] added fused MAC, and fused divide and accumulate operators using the quire to an RV32IMAFC RISC-V core.
In [22], the authors store IEEE floats in memory by first converting them to the posit representation. This allows storing the real-number values with a lower bit-width while performing the computations using the IEEE 754 FPU. More recently, a posit dot-product unit was presented in [23]. This open-source implementation allows performing high-throughput dot products for deep learning applications.

In the HPC field, [12] explores the use of 16-bit posits in a shallow water model as an example of a medium-complexity climate application. In [24], the authors adapt the NAS Parallel Benchmarks (NPB) to 32-bit posits, concluding that they obtain between 0.6 and 1.4 additional decimal digits of accuracy. However, the software emulation resulted in a 4× to 19× performance overhead. An evaluation of the numerical stability of posits for solving linear systems is presented in [25]. Here, the authors find that there is no big difference between posits and floats in the native range of the matrices that they test. However, when re-scaling the matrices to optimize the use of the posit representation, they obtained 4 extra bits of precision with 32-bit numbers and 2 extra bits with 16-bit numbers. The use of 32-bit posits for the conjugate gradient method is also studied in [26]. A PAU called POSAR is presented in [27]. This work analyzes 32-bit posits on one NPB scientific application and a Convolutional Neural Network (CNN) inference model.

The use of 64-bit posits is studied theoretically in [28] and in the initial presentation of this arithmetic [29]. At the hardware level, synthesis results of some posit64 operators are given in [10], [30], [31], and a GEMM accelerator that also supports 64-bit posits is presented in [32]. A more complex architecture with a reconfigurable posit tensor unit supporting posit64 is introduced in [33].

4 BIG-PERCIVAL CORE

The PERCIVAL [4] posit RISC-V core is based on the application-level CVA6 core [16], to which it adds tightly coupled 32-bit posit and quire capabilities. This allows both posits and IEEE 754 floats to coexist natively in hardware, which is essential for the evaluation of their strengths and weaknesses. The Big-PERCIVAL core we present in this paper is a modified version of PERCIVAL that supports either posit64 or posit32 numbers (see Figure 3), although in this work we have focused on the 64-bit version. This includes a parameterized 16n-bit quire, which results in 1024 bits for posit64 numbers. We also provide a 32-entry posit register file for either posit32 or posit64. The decision between single- or double-precision posits is taken at synthesis time.

In addition to adding 64-bit versions of all the computational and conversion units, we include some additional customizations to the PAU (see Figure 4). These are set with parameters in the Register-Transfer Level (RTL) description of the core, so they take effect at synthesis time. Furthermore, as the multiplicative arithmetic units are typically among the most power-hungry modules [34], [35], [36], we permit either the use of the logarithm-approximate units presented in [37] or exact units for the multiplication, division, and square root operations [38]. The exact units for these last two operations use a non-restoring division algorithm [39]. Also, all of the quire operations are optional, and the quire can be enabled or disabled completely. These options allow a fine-grained study of the individual impact of the quire accumulator register and of approximate versus exact computational units.
The input operands can be either 32 or 64 bits. However, the output will always be 64 bits, as the core follows the RV64 architecture (XLEN=64).

Fig. 3. Block diagram of Big-PERCIVAL. Extended modules are highlighted in green. The PAU with the new posit64 units is shown in orange.

Fig. 4. Internal structure of the Posit Arithmetic Unit (PAU): posit computational operations (ADD, and exact or approximate MUL, DIV, and SQRT), optional quire operations (MAC, Q2P) with a 512/1024-bit quire, and conversions to/from integers. The blocks modifiable with SystemVerilog parameters are shown in green.

The new 64-bit operations were validated with an extensive test suite. These tests cover the full range of instructions that Big-PERCIVAL can execute. To obtain the expected outputs of each series of operations we used the Universal Numbers software library [40].

To compile the new 64-bit instructions, we updated the modified LLVM compiler with support for the Xposit RISC-V custom extension [4] so that it can generate double-precision loads and stores. The rest of the instructions are maintained as in the original proposal, since we only support either posit32 or posit64 at any one time and the decoding can be reused. The updated list of the full Xposit instruction set can be found in Table 1.

5 SYNTHESIS RESULTS

In this section, we present FPGA and ASIC synthesis results of different configurations of Big-PERCIVAL, as well as detailed area costs of the individual posit units. This highlights the hardware cost of posit numbers and the quire, which is the main drawback we have observed.

5.1 FPGA

For the FPGA synthesis results we used Vivado v2021.2 targeting a Genesys II (Xilinx Kintex-7 XC7K325T-2FFG900C) board. In every case, the target frequency of 50 MHz was met. This parameter is defined by the base CVA6 core. With this setup, we synthesized different configurations of the PAU with or without quire and with or without approximate division and square root units. These results are shown in Table 2.

TABLE 1
Updated instruction set of the Xposit custom RISC-V extension.
            31..20     19..15  14..12  11..7     6..0
PLW         imm[11:0]  rs1     0x1     rd        0x0B
PLD         imm[11:0]  rs1     0x5     rd        0x0B

            31..25     24..20  19..15  14..12  11..7     6..0
PSW         imm[11:5]  rs2     rs1     0x3     imm[4:0]  0x0B
PSD         imm[11:5]  rs2     rs1     0x6     imm[4:0]  0x0B

            31..27  26..25  24..20  19..15  14..12  11..7  6..0
PADD.S      0x00    0x2     rs2     rs1     0x0     rd     0x0B
PSUB.S      0x01    0x2     rs2     rs1     0x0     rd     0x0B
PMUL.S      0x02    0x2     rs2     rs1     0x0     rd     0x0B
PDIV.S      0x03    0x2     rs2     rs1     0x0     rd     0x0B
PMIN.S      0x04    0x2     rs2     rs1     0x0     rd     0x0B
PMAX.S      0x05    0x2     rs2     rs1     0x0     rd     0x0B
PSQRT.S     0x06    0x2     0x0     rs1     0x0     rd     0x0B
QMADD.S     0x07    0x2     rs2     rs1     0x0     0x0    0x0B
QMSUB.S     0x08    0x2     rs2     rs1     0x0     0x0    0x0B
QCLR.S      0x09    0x2     0x0     0x0     0x0     0x0    0x0B
QNEG.S      0x0A    0x2     0x0     0x0     0x0     0x0    0x0B
QROUND.S    0x0B    0x2     0x0     0x0     0x0     rd     0x0B
PCVT.W.S    0x0C    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.WU.S   0x0D    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.L.S    0x0E    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.LU.S   0x0F    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.W    0x10    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.WU   0x11    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.L    0x12    0x2     0x0     rs1     0x0     rd     0x0B
PCVT.S.LU   0x13    0x2     0x0     rs1     0x0     rd     0x0B
PSGNJ.S     0x14    0x2     rs2     rs1     0x0     rd     0x0B
PSGNJN.S    0x15    0x2     rs2     rs1     0x0     rd     0x0B
PSGNJX.S    0x16    0x2     rs2     rs1     0x0     rd     0x0B
PMV.X.W     0x17    0x2     0x0     rs1     0x0     rd     0x0B
PMV.W.X     0x18    0x2     0x0     rs1     0x0     rd     0x0B
PEQ.S       0x19    0x2     rs2     rs1     0x0     rd     0x0B
PLT.S       0x1A    0x2     rs2     rs1     0x0     rd     0x0B
PLE.S       0x1B    0x2     rs2     rs1     0x0     rd     0x0B

TABLE 2
Area results of the PAU for different configurations of Big-PERCIVAL.

PAU                DivSqrt  LUTs   FFs   DSPs  SRLs
32-bit  No quire   Approx.  5666   689   2     0
                   Exact    5923   1453  2     36
        Quire      Approx.  11605  2923  4     0
                   Exact    12908  3640  4     35
64-bit  No quire   Approx.  8561   1075  12    0
                   Exact    15959  4233  12    71
        Quire      Exact    29781  7274  24    584

When comparing with the FPU available in the CVA6 (FPNew [7]), we observe that 32-bit posits, and especially the 64-bit variant, require significantly more resources than floats of the same length. The 32-bit FPU occupies 4045 Lookup Tables (LUTs), 1008 flip-flops (FFs), and 2 Digital Signal Processing (DSP) blocks. The 64-bit-only variant of the FPU (no 32-bit support) requires 6243 LUTs, 1893 FFs, and 9 DSP blocks. The exact PAUs without quire need almost 50% more LUTs and FFs in the 32-bit case, and more than 2.5× as many resources in the 64-bit case. This shows that the growth in hardware usage is significantly steeper for posit numbers. Adding quire capabilities results in an even higher hardware cost but, as will be shown in Section 6, this allows obtaining more accurate results with the same memory bandwidth as double-precision floats.

If the target application can tolerate some errors in the division and square root outputs, using the logarithm-approximate units [37] allows for a significant reduction in LUT and FF usage. This is especially true in the 64-bit case, where the total PAU LUTs can be cut in half and the FFs reduced to only 25%.

Detailed area results for the individual units in the 64-bit PAU are shown in Table 3. As a reference, the whole Big-PERCIVAL core requires 77888 LUTs, 33437 FFs, and 51 DSP blocks.

TABLE 3
Area results of the 64-bit components of the PAU.

Name                LUTs  FFs   DSPs  SRLs
Quire MAC           9834  1909  12    515
Posit Div           5710  1903  0     61
Posit Sqrt          3447  1442  0     10
Quire to Posit      3290  236   0     0
Posit Add           1544  77    0     0
Posit Mult          1503  114   12    0
Posit Div Approx.   870   106   0     0
Posit Sqrt Approx.  618   68    0     0
Posit to ULong      595   0     0     0
Posit to Long       580   0     0     0
ULong to Posit      500   0     0     0
Long to Posit       429   0     0     0
UInt to Posit       356   0     0     0
Posit to UInt       342   0     0     0
Posit to Int        311   0     0     0
Int to Posit        202   0     0     0
Unsurprisingly, the largest unit corresponds to the quire MAC operation which, together with the quire-to-posit unit and the additional quire logic in the PAU, accounts for 50% of the hardware resources of the PAU. This is in line with the results for the 32-bit PAU with quire. The exact division and square root units are the next largest components of the PAU. The cost of these exact units is substantially reduced when using the corresponding approximate units. As the works in [36], [37], [41], [42] highlight, error-resilient applications may benefit from these. Such is the case of Deep Neural Networks (DNNs), filters, and other machine learning kernels.

5.2 ASIC

Regarding ASIC synthesis, we targeted TSMC's 28 nm standard-cell library to obtain more insight into the area and power cost of the 64-bit PAU and FPU. The synthesis was performed using Synopsys DC with a 5 ns timing constraint and a toggle rate of 0.1. The 64-bit FPU (without 32-bit support enabled, to match the PAU) requires an area of 21853 µm² and consumes 0.738 mW of power. On the other hand, the exact 64-bit PAU with quire occupies 114695 µm² and consumes 3.516 mW of power. The 64-bit PAU without quire uses significantly fewer resources: it spans 71090 µm² and consumes 1.958 mW. These ASIC area and power results are in line with the FPGA results we obtained previously, highlighting the main drawback of current hardware implementations of posit arithmetic.

6 POLYBENCH

In this work, we used the PolyBench suite [5] to benchmark Big-PERCIVAL. This benchmark suite contains a series of numerical computations with static control flow from various domains such as linear algebra computations or physics simulation. From these, we selected some representative algorithms to study how posit32 and posit64 compare to IEEE 754 floats and doubles in scientific computing calculations. PolyBench implements each benchmark in a single file, with some header parameters and a series of compile-time directives. To compile them, we employed the modified version of LLVM with Xposit. Furthermore, the compile-time directives allow both the study of the accuracy of the results and the accurate measurement of performance using cache flushing and multiple executions.

We have chosen to port the following representative benchmarks to posit arithmetic:

• Covariance: Computes the covariance of N data points, each with M attributes.
• GEMM: Generalized matrix multiply from BLAS [43], C = αAB + βC.
• 3mm: Linear algebra kernel that consists of three matrix multiplications, G = (AB)(CD).
• Cholesky: Cholesky decomposition of a positive-definite matrix A into a lower triangular matrix L such that A = LL^T.
• Durbin: An algorithm for solving Yule-Walker equations, which are a special case of Toeplitz systems.
• ludcmp: LU decomposition followed by forward and backward substitutions to solve a system of linear equations.
• fdtd-2d: Simplified Finite-Difference Time-Domain method for 2D data. This models electric and magnetic fields based on Maxwell's equations.
• seidel-2d: Gauss-Seidel style computation over 2D data with a 9-point stencil pattern.

Porting these benchmarks entailed translating the arithmetic kernels from IEEE 754 floats or doubles to inline assembly with posit32 or posit64 instructions. The code structure was kept unchanged, including the initialization phase that populates the input data of the algorithms. A sketch of this translation is shown below.
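The following sketch illustrates how a posit64 dot-product accumulation could be embedded in C through Xposit inline assembly. The mnemonics (QCLR.S, PLD, QMADD.S, QROUND.S, PSD) come from Table 1, but the posit register names (pt0-pt2), the lowercase spelling, and the operand constraints are our assumptions, and clobber lists are omitted for brevity; this is not verbatim code from the benchmark sources.

```c
/* Sketch of a posit64 dot product accumulated in the quire with a single
 * final rounding. Posit64 values are kept in memory as raw 64-bit words. */
#include <stddef.h>
#include <stdint.h>

static void posit64_dot(const uint64_t *a, const uint64_t *b,
                        uint64_t *result, size_t n) {
    asm volatile("qclr.s");                    /* clear the quire */
    for (size_t i = 0; i < n; i++) {
        asm volatile("pld pt0, 0(%0)\n\t"      /* load posit64 a[i] */
                     "pld pt1, 0(%1)\n\t"      /* load posit64 b[i] */
                     "qmadd.s pt0, pt1"        /* quire += a[i] * b[i] */
                     :
                     : "r"(&a[i]), "r"(&b[i]));
    }
    asm volatile("qround.s pt2\n\t"            /* round quire to a posit64 */
                 "psd pt2, 0(%0)"              /* store the rounded result */
                 :
                 : "r"(result));
}
```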
All tests were compiled with the -O3 flag of gcc and clang to optimize the execution of the kernels and minimize the difference between the original code and the straightforward translation to posit assembly. For each configuration, we provide six executions: single- and double-precision IEEE 754 arithmetic with fused MAC operations as optimized by the compiler, posit32 and posit64 arithmetic with quire fused MAC operations, and posit32 and posit64 arithmetic without quire, that is, replacing these MAC operations with individual multiplication and addition operations. The fdtd-2d and seidel-2d benchmarks do not benefit from fused MAC operations; hence, they have a single posit32 and posit64 execution.

Each of these benchmarks was executed with four problem sizes. These range from the MINI datasets, which require less than 16 KB of memory each, to the LARGE datasets, which occupy around 25 MB of memory. The cache hierarchy of PERCIVAL comprises a 32 KB 8-way set-associative L1 data cache with 16-byte lines. Therefore, these problem sizes stretch the whole memory range.

The performance results reported by the PolyBench timing script are shown in Table 4. As can be seen, float is faster than posit32 in practically all scenarios. However, when comparing doubles and posit64, there is no clear winner except for the GEMM benchmark, which will be studied in detail in Section 8. In the rest of the algorithms, some perform better using doubles and others using posit64. These variations are due in part to the translation of the execution kernels from the native C double datatype to the posit64 inline assembly of the Xposit RISC-V custom extension. Furthermore, the execution of 64-bit posits without quire is slower than the other options, since an individual multiplication plus addition has a higher latency than a single fused MAC operation, whether using doubles or posit64 with quire. Again, this does not hold in the GEMM benchmark, which will be detailed in Section 8. Therefore, we can conclude that 64-bit posits with quire, when integrated into the CVA6, do not suffer any noticeable performance degradation in comparison with double-precision floats. This system inherits the limitations of the underlying core, and since the critical path is not in the PAU or the FPU but in the load/store unit, further optimizations are out of the scope of this work.

Regarding accuracy, we have obtained two metrics: Mean Squared Error (MSE) and Maximum Absolute Error (MaxAbsE). To obtain both of these metrics, we compared the results of all the arithmetics under study with the same algorithm computed using the GNU MPFR multiple-precision library with 128 fraction bits, which we use as our golden solution. We chose the MSE as a general accuracy metric, and the MaxAbsE to also take into account the maximum error, which can be a critical value in certain applications [44], [45].

The MSE results are shown in Figure 5. Note the logarithmic scale on the Y-axis. The trend in every benchmark is a significantly lower error when using posit64 or posit32 numbers in comparison to doubles or floats, respectively. This is up to 4 orders of magnitude lower MSE, depending on the benchmark. In these plots we can also observe the difference in magnitude of the accuracy errors when using 32- or 64-bit numbers. The accuracy improvements are maintained across the whole range of problem sizes. When using posits without the quire accumulator, we also observe significant accuracy improvements.
This happens even though, in the posit case, there is an extra rounding between the multiplication and addition operations that is not present when using fused MAC with the IEEE representations.

The MaxAbsE metric follows the same pattern as the MSE. These results are shown in Figure 6, where the Y-axis also follows a logarithmic scale. Posits obtain up to 3 orders of magnitude lower error than floats of the same bit-width across all benchmarks and dataset sizes. There are also large accuracy improvements when executing posit64 without quire compared to doubles.

All in all, from this set of benchmarks we can conclude that 64-bit posits present significant accuracy improvements over IEEE 754 doubles, while maintaining the same memory bandwidth. This is also the case with 32-bit numbers. This better accuracy holds both in a general sense, as shown by the MSE, and in each particular value, since the MaxAbsE is also lower in every case. The use of the quire is beneficial both in terms of performance and accuracy, so it can compensate for its significant hardware cost shown in Section 5.

Fig. 5. PolyBench benchmarks: mean squared error of the different arithmetics studied with respect to the results obtained with GNU MPFR (per-benchmark panels: gemm, 3mm, cholesky, durbin, ludcmp, covariance, fdtd-2d, seidel-2d; series: Posit64, Posit64 No Quire, Double, Posit32, Posit32 No Quire, Float).

Fig. 6. PolyBench benchmarks: maximum absolute error of the different arithmetics studied with respect to the results obtained with GNU MPFR (same panels and series as Figure 5).

TABLE 4
PolyBench timing comparisons of floats and doubles with fused MAC, posit32 and posit64 with quire, and posit32 and posit64 without quire (posit32– and posit64–), measured in seconds of FPGA runtime.
                       MINI      SMALL    MEDIUM  LARGE
GEMM        float      0.004991  0.1004   4.579   591.8
            posit32    0.005175  0.1151   5.225   1251.3
            posit32–   0.007326  0.1638   6.382   846.5
            double     0.006025  0.1846   6.988   1004.9
            posit64    0.005845  0.1747   7.500   1973.3
            posit64–   0.007628  0.2175   7.981   1109.5
3mm         float      0.005591  0.1177   9.533   2208.0
            posit32    0.003778  0.1250   9.678   2265.2
            posit32–   0.009895  0.2230   14.01   2757.0
            double     0.005321  0.2307   16.09   3931.2
            posit64    0.004626  0.1903   14.14   3674.6
            posit64–   0.010466  0.3044   18.46   4208.8
Cholesky    float      0.002504  0.0617   3.670   484.0
            posit32    0.003005  0.0707   3.811   498.5
            posit32–   0.004960  0.1179   5.699   734.8
            double     0.003952  0.1190   6.323   870.0
            posit64    0.003453  0.0987   5.529   767.3
            posit64–   0.005044  0.1503   7.680   1038.1
Durbin      float      0.000526  0.00388  0.0394  1.011
            posit32    0.001091  0.00660  0.0676  1.681
            posit32–   0.000814  0.00724  0.0708  1.766
            double     0.000772  0.00703  0.0698  2.127
            posit64    0.000898  0.00732  0.0814  2.445
            posit64–   0.000928  0.00816  0.0851  2.539
ludcmp      float      0.004377  0.1079   9.162   2896.6
            posit32    0.006173  0.1459   10.37   2986.3
            posit32–   0.007162  0.1704   11.58   3147.2
            double     0.005918  0.2447   19.48   3309.6
            posit64    0.006388  0.2536   19.96   3311.2
            posit64–   0.007347  0.2859   20.47   3468.3
Covariance  float      0.004203  0.0939   4.381   1779.2
            posit32    0.004447  0.0903   4.149   1727.9
            posit32–   0.005377  0.1062   4.552   1793.1
            double     0.004478  0.1811   7.843   1949.5
            posit64    0.004817  0.1670   7.311   1883.0
            posit64–   0.005678  0.1875   7.818   1948.8
fdtd-2d     float      0.014354  0.3824   11.14   1457.3
            posit32    0.020327  0.5168   14.49   1893.7
            double     0.016546  0.6750   17.37   2469.6
            posit64    0.020668  0.7709   19.83   2711.4
seidel-2d   float      0.030509  0.5979   17.35   2346.2
            posit32    0.036817  0.7519   22.74   3164.3
            double     0.041236  0.8985   26.98   4071.4
            posit64    0.037041  0.8412   26.55   4097.4

7 CONJUGATE GRADIENT

In addition to the PolyBench algorithms shown in Section 6, we have studied the use of posit64 in iterative linear equation solvers, in particular the conjugate gradient (CG) and the biconjugate gradient (BiCG) methods. These serve as larger real-world applications in which posit64 could be used.

The conjugate gradient algorithm numerically solves a system of linear equations Ax = b for the vector x, where the real matrix A is symmetric and positive-definite. We have executed this algorithm on Big-PERCIVAL with a tolerance margin of 10^−12 on four matrices extracted from the Matrix Market repository (https://math.nist.gov/MatrixMarket/); concretely, on a subset of the BCS Structural Engineering Matrices from the Harwell-Boeing Collection. This provides a real use case in which we can analyze the use of posit64 and compare it with IEEE doubles.

Figure 7 shows the results of these executions. The bcsstk01, bcsstk04, bcsstk07, and bcsstk08 matrices have sizes of 48 × 48, 132 × 132, 420 × 420, and 1074 × 1074, respectively. In the smallest case, both posit64 and double converge in the same number of iterations (136), but in the rest of the problems, posit64 converges in fewer iterations. This amounts to a reduction of 2-10% in the number of iterations of the algorithm, depending on the input matrix.

The biconjugate gradient method is a generalization of CG that solves systems of linear equations in which the matrix A is not restricted to being symmetric and positive-definite. On the other hand, its computational cost is around double that of CG.
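For reference, the following is a minimal C sketch of the CG iteration described above, written in double precision for this article (the helper names dot, matvec, and cg_solve are ours). In the posit64 variant, each dot product and matrix-vector product would accumulate in the quire through the Xposit fused MAC instructions before a single final rounding.

```c
/* Dense CG sketch: solves A x = b for symmetric positive-definite A,
 * stopping when the 2-norm of the residual falls below tol.
 * Returns the number of iterations performed. */
#include <math.h>
#include <stdlib.h>

static double dot(const double *x, const double *y, int n) {
    double s = 0.0;       /* posit64 would accumulate this sum in the quire */
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

static void matvec(const double *A, const double *x, double *y, int n) {
    for (int i = 0; i < n; i++) y[i] = dot(&A[i * n], x, n);  /* y = A x */
}

int cg_solve(const double *A, const double *b, double *x, int n, double tol) {
    double *r = malloc(n * sizeof *r);   /* residual r = b - A x */
    double *p = malloc(n * sizeof *p);   /* search direction     */
    double *Ap = malloc(n * sizeof *Ap); /* A * p                */
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = p[i] = b[i]; }
    double rs = dot(r, r, n);
    int it = 0;
    while (sqrt(rs) > tol) {
        matvec(A, p, Ap, n);
        double alpha = rs / dot(p, Ap, n);           /* step length */
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rs_new = dot(r, r, n);
        double beta = rs_new / rs;                   /* direction update */
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rs = rs_new;
        it++;
    }
    free(r); free(p); free(Ap);
    return it;
}
```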
Following the same setup, we tested BiCG on a set of four increasingly larger unsymmetric matrices from the Harwell-Boeing Collection. These were impcol_b, impcol_c, west0381, and gre_1107, which have sizes of 59 × 59, 137 × 137, 381 × 381, and 1107 × 1107, respectively. The tolerance margin was again 10^−12. The results can be seen in Figure 8. In every case, using posit64 results in fewer iterations before the target tolerance is met. This reduction reaches almost 20% in the west0381 matrix.

Fig. 7. Conjugate gradient iterative residual results on four increasingly larger Matrix Market problems.

Fig. 8. Biconjugate gradient iterative residual results on four increasingly larger Matrix Market problems.

8 GEMM

The GEMM kernel in PolyBench is optimized from the memory perspective by performing loop interchange (see loops k and j in Figure 9). This version is the one whose results are shown in Section 6. When executing the GEMM kernel in hardware for sufficiently large matrices, a considerable time penalty can be observed in the posit64 case (see Table 4). This is due to the fact that exploiting the quire accumulator register limits the order in which the matrix multiplication is computed (see Figure 10), which results in a higher number of cache misses because of the long dot-product computations. This is not the case for the execution without the quire, where the performance results are comparable.

The GEMM operation is typically optimized to reduce the number of memory accesses. In this section, we describe the impact on both timing performance and accuracy of executing the GEMM kernel using posit64 and quire with block tiling optimizations. The matrices are kept the same as in the PolyBench GEMM benchmark, but we modified the algorithm to use a standard 6-loop tiling approach (see Figure 11). The tile size was varied at compile time to better study the impact of this method, allowing the compiler to perform optimizations. We tested all tile sizes between 5 and 25, and also larger tiles from 30 to 40 in steps of 2 to check the observed trends.

Figure 12 shows the timing results of executing the tiled version of GEMM when varying the tile size. These results are for the LARGE dataset, which sets the values of ni, nj, and nk to 1000, 1100, and 1200, respectively.
Input: Double matrices A (ni×nk), B (nk×nj) and C (ni×nj). Scalar values a and b.
Output: Double matrix C = aAB + bC.

for i = 0 to ni-1 do
    for j = 0 to nj-1 do
        C[i][j] *= b
    end for
    for k = 0 to nk-1 do
        for j = 0 to nj-1 do
            C[i][j] += a * A[i][k] * B[k][j]
        end for
    end for
end for

Fig. 9. PolyBench double GEMM pseudocode with loop interchange.

Input: Posit64 matrices A (ni×nk), B (nk×nj) and C (ni×nj). Scalar values a and b.
Output: Posit64 matrix C = aAB + bC.

for i = 0 to ni-1 do
    for j = 0 to nj-1 do
        C[i][j] *= b
    end for
    for j = 0 to nj-1 do
        quire = C[i][j]
        for k = 0 to nk-1 do
            quire += a * A[i][k] * B[k][j]
        end for
        C[i][j] = round(quire)
    end for
end for

Fig. 10. Posit GEMM pseudocode using the quire accumulator.

Input: Posit64 matrices A (ni×nk), B (nk×nj) and C (ni×nj). Scalar values a and b. Tile size nt.
Output: Posit64 matrix C = aAB + bC.

for ii = 0 to ni-1 in steps of nt do
    for jj = 0 to nj-1 in steps of nt do
        for i = ii to min(ii+nt, ni)-1 do
            for j = jj to min(jj+nt, nj)-1 do
                C[i][j] *= b
            end for
        end for
        for kk = 0 to nk-1 in steps of nt do
            for i = ii to min(ii+nt, ni)-1 do
                for j = jj to min(jj+nt, nj)-1 do
                    quire = C[i][j]
                    for k = kk to min(kk+nt, nk)-1 do
                        quire += a * A[i][k] * B[k][j]
                    end for
                    C[i][j] = round(quire)
                end for
            end for
        end for
    end for
end for

Fig. 11. Posit GEMM tiled pseudocode using the quire accumulator.

Fig. 12. Tiled GEMM timing results (execution time in seconds vs. tile size, double vs. posit64, LARGE dataset).

For very small tile sizes between 5 and 10, there is a relatively large variation in the performance, but this stabilizes for larger tile sizes. In the case of posit64 numbers, there is a big jump in execution time between a tile size of 8 and a tile size of 9. This is due to compiler optimizations: with a tile size of 8 the compiler can unroll the main computational loop, which does not happen with a tile size of 9. In the case of doubles, this happens between sizes 9 and 10: with a tile size of 10 the compiler decides to unroll the main computational loop, which is not the case for a tile size of 9.

The extra execution time required by the posit64 kernel is due to the extra instructions needed to initialize the quire and round it back to a posit value after each series of accumulations inside a tile. For large dot-product computations, these extra instructions are amortized over the long accumulations, but for smaller batches this overhead is noticeable. All in all, the performance comparison of posit64 and doubles in the tiled GEMM benchmark is closer, and it should scale better for even larger matrix sizes, as the memory pressure is reduced.

Even though the posit64 execution of this kernel is slower, there are significant benefits regarding the accuracy of the computations. Figure 13 shows the MSE and MaxAbsE results of the same execution of the tiled GEMM benchmark. As can be seen from the logarithmic scale on the Y-axis, posit64 obtains between 4 and 5 orders of magnitude lower MSE and around 2 orders of magnitude lower MaxAbsE than doubles. The accuracy improves with larger tile sizes, and lies between the posit64 with-quire and without-quire values shown in Figures 5 and 6. This is to be expected, as the execution in tiles adds extra rounding steps in the computation of each value of the output matrix. With larger tiles, the number of intermediate roundings is lower and thus the final value is more accurate.
However, note that even for small tile sizes, the accuracy improvements obtained by posit arithmetic are about 4 orders of magnitude.

Fig. 13. Tiled GEMM error comparison of posit64 and double with respect to the results obtained with GNU MPFR: (a) MSE, (b) MaxAbsE.

9 CONCLUSIONS

In this work, we presented Big-PERCIVAL, an extension of the PERCIVAL posit RISC-V core that adds support for 64-bit posits and provides increased flexibility. We studied the hardware cost, accuracy, and performance of 64-bit posit arithmetic compared to double-precision IEEE 754 floating-point arithmetic using the PolyBench benchmark suite.

Synthesis results of the 64-bit PAU in Big-PERCIVAL have shown that it requires 2.5× as many resources as the double-precision FPNew FPU. Moreover, we studied the impact of the corresponding 1024-bit quire accumulator register, which increased the total hardware cost to a third of the area of the core. Detailed area results illustrated how the hardware resources are distributed among the different operations. In particular, the most resource-hungry elements are the quire-related units and the posit division and square root units.

The PolyBench numerical benchmarks executed on Big-PERCIVAL running on the Genesys II board provided insight into the native use of 64-bit posits. Furthermore, the conjugate gradient and biconjugate gradient linear solvers demonstrated the use of posit64 in real-world problems. Additionally, the use of the quire accumulator requires some extra thought about the order in which the operations will be executed in some instances. Regarding accuracy, which is one of the main requirements in scientific computing, we have seen that 64-bit posits obtain up to 4 orders of magnitude lower MSE and up to 3 orders of magnitude lower MaxAbsE than 64-bit doubles. This provides a high-accuracy solution that can reduce the number of steps in iterative solvers without additional impact on the memory bandwidth.

Overall, our contributions show the potential of posit arithmetic as an alternative to IEEE 754 floating-point arithmetic in scientific computing, and Big-PERCIVAL provides a flexible platform for exploring this alternative. We believe that this work provides a starting point for future research on 64-bit hardware and software implementations of posit arithmetic and contributes to the development of more accurate and efficient scientific computing systems.

ACKNOWLEDGMENTS

This work was supported by grants PID2021-123041OB-I00 and PID2021-126576NB-I00 funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", and by the CM under grant S2018/TCS-4423.

REFERENCES

[1] IEEE Computer Society, "IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1–84, Jul. 2019.
[2] "BFloat16: The secret to high performance on Cloud TPUs," https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.
[3] P. Kharya, "NVIDIA Blogs: TensorFloat-32 Accelerates AI Training HPC up to 20x," https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/, May 2020.
[4] D. Mallasén, R. Murillo, A. A. Del Barrio, G. Botella, L. Piñuel, and M. Prieto-Matias, "PERCIVAL: Open-Source Posit RISC-V Core With Quire Capability," IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 3, pp. 1241–1252, 2022.
[5] L.-N. Pouchet and T. Yuki, "PolyBench/C 4.2," https://sourceforge.net/projects/polybench/, May 2016.
[6] Y. Durand, E. Guthmuller, C. Fuguet, J. Fereyre, A. Bocco, and R. Alidori, "Accelerating Variants of the Conjugate Gradient with the Variable Precision Processor," in 2022 IEEE 29th Symposium on Computer Arithmetic (ARITH), Sep. 2022, pp. 51–57.
[7] S. Mach, F. Schuiki, F. Zaruba, and L. Benini, "FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 4, pp. 774–787, Apr. 2021.
[8] Posit Working Group, "Standard for Posit Arithmetic (2022)," Feb. 2022. [Online]. Available: https://posithub.org/docs/posit_standard-2.pdf
[9] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, "Comparing Different Decodings for Posit Arithmetic," in Next Generation Arithmetic, J. Gustafson and V. Dimitrov, Eds. Cham: Springer International Publishing, 2022, vol. 13253, pp. 84–99.
[10] Y. Uguen, L. Forget, and F. de Dinechin, "Evaluating the Hardware Cost of the Posit Number System," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). Barcelona, Spain: IEEE, Sep. 2019, pp. 106–113.
[11] R. Chaurasiya, J. Gustafson, R. Shrestha, J. Neudorfer, S. Nambiar, K. Niyogi, F. Merchant, and R. Leupers, "Parameterized Posit Arithmetic Hardware Generator," in 2018 IEEE 36th International Conference on Computer Design (ICCD), Oct. 2018, pp. 334–341.
[12] M. Klöwer, P. D. Düben, and T. N. Palmer, "Posits as an alternative to floats for weather and climate models," in Proceedings of the Conference for Next Generation Arithmetic 2019. Singapore: ACM, Mar. 2019, pp. 1–8. [Online]. Available: https://dl.acm.org/doi/10.1145/3316279.3316281
[13] N. Neves, P. Tomás, and N. Roma, "Dynamic Fused Multiply-Accumulate Posit Unit with Variable Exponent Size for Low-Precision DSP Applications," in 2020 IEEE Workshop on Signal Processing Systems (SiPS), Oct. 2020, pp. 1–6.
[14] A. Guntoro, C. De La Parra, F. Merchant, F. De Dinechin, J. L. Gustafson, M. Langhammer, R. Leupers, and S. Nambiar, "Next Generation Arithmetic for Edge Computing," in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). Grenoble, France: IEEE, Mar. 2020, pp. 1357–1365.
[15] K. Asanović, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, "The rocket chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, Apr. 2016.
[16] F. Zaruba and L. Benini, "The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 11, pp. 2629–2640, Nov. 2019.
[17] C. Celio, D. A. Patterson, and K. Asanović, "The Berkeley Out-of-Order Machine (BOOM): An industry-competitive, synthesizable, parameterized RISC-V processor," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-167, Jun. 2015.
[18] N. Gala, A. Menon, R. Bodduna, G. S. Madhusudan, and V. Kamakoti, "SHAKTI Processors: An Open-Source Hardware Initiative," in 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID), Jan. 2016, pp. 7–8.
[19] M. V. Arunkumar, S. G. Bhairathi, and H. G. Hayatnagarkar, "PERC: Posit Enhanced Rocket Chip," in 4th Workshop on Computer Architecture Research with RISC-V (CARRV'20), 2020, p. 8.
[20] S. Tiwari, N. Gala, C. Rebeiro, and V. Kamakoti, "PERI: A Configurable Posit Enabled RISC-V Core," ACM Transactions on Architecture and Code Optimization, vol. 18, no. 3, pp. 1–26, Jun. 2021.
[21] N. N. Sharma, R. Jain, M. M. Pokkuluri, S. B. Patkar, R. Leupers, R. S. Nikhil, and F. Merchant, "CLARINET: A quire-enabled RISC-V-based framework for posit arithmetic empiricism," Journal of Systems Architecture, p. 102801, Dec. 2022.
[22] M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara, "A Lightweight Posit Processing Unit for RISC-V Processors in Deep Neural Network Applications," IEEE Transactions on Emerging Topics in Computing, no. 01, pp. 1–1, Oct. 2021.
[23] Q. Li, C. Fang, and Z. Wang, "PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications," Feb. 2023.
[24] S. W. D. Chien, I. B. Peng, and S. Markidis, "Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications," in Parallel Processing and Applied Mathematics, R. Wyrzykowski, E. Deelman, J. Dongarra, and K. Karczewski, Eds. Cham: Springer International Publishing, 2020, vol. 12043, pp. 301–310.
[25] N. Buoncristiani, S. Shah, D. Donofrio, and J. Shalf, "Evaluating the Numerical Stability of Posit Arithmetic," in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 612–621.
[26] D. Mallasén Quintana, "Leveraging Posits for the Conjugate Gradient Linear Solver on an Application-Level RISC-V Core," KTH Royal Institute of Technology, Tech. Rep., 2022.
[27] S. D. Ciocirlan, D. Loghin, L. Ramapantulu, N. Tapus, and Y. M. Teo, "The Accuracy and Efficiency of Posit Arithmetic," arXiv:2109.08225 [cs], Sep. 2021.
[28] F. de Dinechin, L. Forget, J.-M. Muller, and Y. Uguen, "Posits: The good, the bad and the ugly," in Proceedings of the Conference for Next Generation Arithmetic 2019, ser. CoNGA'19. New York, NY, USA: Association for Computing Machinery, 2019.
[29] J. L. Gustafson and I. T. Yonemoto, "Beating floating point at its own game: Posit arithmetic," Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, Apr. 2017.
[30] L. Forget, Y. Uguen, and F. de Dinechin, "Comparing posit and IEEE-754 hardware cost," Apr. 2021.
[31] S. Jean, A. Raveendran, A. D. Selvakumar, G. Kaur, S. G. Dharani, S. G. Pattanshetty, and V. Desalphine, "P-FMA: A Novel Parameterized Posit Fused Multiply-Accumulate Arithmetic Processor," in 2021 34th International Conference on VLSI Design and 2021 20th International Conference on Embedded Systems (VLSID), Feb. 2021, pp. 282–287.
[32] L. Ledoux and M. Casas, "A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). New York City, NY, USA: IEEE, May 2022, pp. 1–10.
[33] N. Neves, P. Tomás, and N. Roma, "A Reconfigurable Posit Tensor Unit with Variable-Precision Arithmetic and Automatic Data Streaming," Journal of Signal Processing Systems, vol. 93, no. 12, pp. 1365–1385, Dec. 2021.
[34] W. Liu and A. Nannarelli, "Power efficient division and square root unit," IEEE Transactions on Computers, vol. 61, no. 8, pp. 1059–1070, 2012.
[35] A. A. Del Barrio, R. Hermida, and S. O. Memik, "A partial carry-save on-the-fly correction multispeculative multiplier," IEEE Transactions on Computers, vol. 65, no. 11, pp. 3251–3264, 2016.
[36] M. S. Kim, A. A. Del Barrio, L. T. Oliveira, R. Hermida, and N. Bagherzadeh, "Efficient Mitchell's Approximate Log Multipliers for Convolutional Neural Networks," IEEE Transactions on Computers, vol. 68, no. 5, pp. 660–675, 2019.
[37] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, "PLAUs: Posit logarithmic approximate units to implement low-cost operations with real numbers," in Proceedings of the Conference for Next Generation Arithmetic 2023, ser. CoNGA'23, 2023.
[38] R. Murillo, A. A. Del Barrio, and G. Botella, "A Suite of Division Algorithms for Posit Arithmetic," in 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP). Porto, Portugal: IEEE, Jul. 2023, pp. 41–44.
[39] K. Jun and E. E. Swartzlander, "Modified non-restoring division algorithm with improved delay profile and error correction," in 2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2012, pp. 1460–1464.
[40] E. T. L. Omtzigt, P. Gottschling, M. Seligman, and W. Zorn, "Universal Numbers Library: Design and implementation of a high-performance reproducible number systems library," arXiv:2012.11011, 2020.
[41] M. S. Ansari, B. F. Cockburn, and J. Han, "An improved logarithmic multiplier for energy-efficient neural computing," IEEE Transactions on Computers, vol. 70, no. 4, pp. 614–625, 2021.
[42] R. Murillo, A. A. Del Barrio Garcia, G. Botella, M. S. Kim, H. Kim, and N. Bagherzadeh, "PLAM: A Posit Logarithm-Approximate Multiplier," IEEE Transactions on Emerging Topics in Computing, pp. 1–1, 2021.
[43] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff, "A set of level 3 basic linear algebra subprograms," ACM Transactions on Mathematical Software, vol. 16, no. 1, pp. 1–17, Mar. 1990.
[44] X. Gao, S. Bayliss, and G. A. Constantinides, "SOAP: Structural optimization of arithmetic expressions for high-level synthesis," in 2013 International Conference on Field-Programmable Technology (FPT), 2013, pp. 112–119.
[45] J. Villalba-Moreno, J. Hormigo, and S. González-Navarro, "Unbiased rounding for hub floating-point addition," IEEE Transactions on Computers, vol. 67, no. 9, pp. 1359–1365, 2018.

David Mallasén Quintana received a BSc degree in Computer Science and a BSc degree in Mathematics in 2020 from the Complutense University of Madrid (UCM). In 2022 he obtained an MSc degree in Embedded Systems at KTH Royal Institute of Technology, specializing in embedded platforms. Currently, he is pursuing a Ph.D. in Computer Engineering at UCM. He has carried out a Ph.D. research stay at the Embedded Systems Laboratory at EPFL (Switzerland). His main research areas include computer arithmetic, computer architecture, embedded systems, and high-performance computing.

Alberto A. Del Barrio (SM'19) received the Ph.D. degree in Computer Science from the Complutense University of Madrid (UCM), Madrid, Spain, in 2011. He has performed stays at Northwestern University, University of California at Irvine, and University of California at Los Angeles.
Since 2021, he has been an Associate Professor (tenure-track, civil servant) of Computer Science with the Department of Computer Architecture and System Engineering, UCM. His main research interests include Design Automation, Next Generation Arithmetic, and Quantum Computing. Dr. Del Barrio has been the PI of the PARNASO project, funded by the Leonardo Grants program of Fundación BBVA, and currently he is the PI of the ASIMOV project, funded by the Spanish MICINN, which includes a work package researching the deployment of posits on RISC-V cores. He has been an IEEE Senior Member since 2019 and an ACM Senior Member since December 2020.

Manuel Prieto-Matias obtained a Ph.D. degree from the Complutense University of Madrid (UCM) in 2000. Since 2002, he has been a Professor at the Department of Computer Architecture at UCM, being a Full Professor since 2019. His research interests include high-performance computing, non-volatile memory technologies, accelerators, and code generation and optimization. His current focus is on effectively managing resources on emerging computing platforms, emphasizing the interaction between the system software and the underlying architecture. Manuel has co-authored over 100 scientific publications in journals and conferences on parallel computing and computer architecture. He is a member of the ACM.