1549-7747 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more
information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCSII.2021.3071188, IEEE Transactions on Circuits and Systems II: Express Briefs

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS 1

Low-Delay FPGA-Based Implementation of Finite
Field Multipliers

José L. Imaña

Abstract— Arithmetic operations over binary extension fields
GF (2m) have many important applications in domains such
as cryptography, code theory and digital signal processing.
These applications must be fast, so low-delay implementations
of arithmetic circuits are required. Among GF (2m) arithmetic
operations, field multiplication is considered the most important
one. For hardware implementation of multiplication over binary
finite fields, irreducible trinomials and pentanomials are normally
used. In this brief, low-delay FPGA-based implementations of
bit-parallel GF (2m) polynomial basis multipliers are presented,
where a new multiplier based on irreducible trinomials is given.
Several post-place and route implementation results in Xilinx
Artix-7 FPGA for different GF (2m) finite fields are reported.
Experimental results show that the proposed multiplier exhibits
the best delay, with a delay improvement of up to 4.7%, and
the second best Area× T ime complexities when compared with
similar multipliers found in the literature.

I. INTRODUCTION

High-speed hardware implementations of arithmetic operations
over binary extension fields GF (2m) are greatly desirable due
to their use in cryptography, digital signal processing and code
theory [1]. Multiplication is the most complex and important arith-
metic operation, because exponentiation, division and inversion
can be realized by consecutive use of finite field multiplication
[2], [3], [4], [5]. Among the different basis for representation of
GF (2m) elements, polynomial basis (PB) is normally used [6],
[7]. In addition to the representation basis, the complexity of the
finite field multiplier also depends on the irreducible polynomial
f(y) selected for the generation of the binary extension field.
In order to perform hardware implementations, low Hamming
weight polynomials (trinomials and pentanomials) are used.
GF (2m) polynomial basis multiplication requires a multipli-

cation of polynomials followed by a modular reduction using an
irreducible polynomial [8]. Several bit-parallel PB multipliers in
which a product matrix combines the above two steps together
have been proposed in the literature [9], [10]. In [11], a PB
multiplication method based on the decomposition of a product
matrix was given. The approach in [11] was applied to five types
of irreducible trinomials, and two functions, Si and Ti, were
introduced in such a way that the addition of these functions
were used for the computation of the product of two GF (2m)

elements. Functions Si and Ti are given as addition of products
of the coordinates of the two operands. In [12], the additions
of products of coordinates included in the Si and Ti functions
obtained for type II irreducible pentanomials were split into sums
of 2k product terms that can be implemented with k-depth binary

J.L. Imaña is with the Department of Computer Architecture and Automa-
tion, Faculty of Physics, Complutense University, 28040 Madrid, Spain. E-
mail: jluimana@ucm.es.

This work has been supported by the Spanish MINECO and CM under
grants S2018/TCS-4423 and RTI2018-093684-B-I00.

trees of XOR gates. It was shown in [12] that the pairwise addition
of binary trees with the same depth, given by parentheses used in
the expressions of the product coordinates, reduces the theoretical
delay of the GF (2m) multiplier. However, this splitting method
enforces strong parenthesized constraints that could hinder a
synthesis tool the mapping of Si and Ti terms into FPGA’s logic
blocks. A new approach was introduced in [13] for Si and Ti

functions obtained for type II irreducible pentanomials where the
splitting is carried out, but the constraint imposed by the pairwise
parenthesized addition of binary trees is deleted.

In this brief, low-delay FPGA-based Xilinx implementations of
GF (2m) bit-parallel PB multipliers using irreducible trinomials
are given. The novelty of the new multiplier here presented
is that the splitting method with the removal of parenthesized
constraints has been applied to irreducible trinomials, in such a
way that the synthesis tool is free to optimize the implementation.
Based on the work in [11], a new general algorithm for the
computation of the product coordinates for irreducible trinomials
using the new splitting approach without parenthesized constraints
is also given. Several finite field multipliers, including SECG
[14] recommended fields, have been described in VHDL and
the results obtained from their implementations in Xilinx Artix-7
FPGAs have been reported. Experimental post-place and route
results show that the new approach for multiplication applied
to irreducible trinomials exhibits the best delay, with a delay
improvement of up to 4.7%, and the second best Area × T ime
complexities when compared with similar bit-parallel polynomial
basis multipliers found in the literature.

II. BACKGROUND

Let f(y) =
∑m

i=0 fiy
i be an irreducible polynomial of de-

gree m over the binary field GF (2) and let α be a root of
f(y). Any element X of the binary extension field GF (2m) =

GF (2)[y]/(f(y)) can be represented in the polynomial basis
{1, α, . . . , αm−1} as X =

∑m−1
i=0 xiα

i, with xi ∈ GF (2). Two-
step classic PB multiplication in GF (2m) requires a multipli-
cation of polynomials followed by a reduction modulo f(y),
although several efficient multiplication methods combining these
two steps together by means of a product matrix have been
proposed [9], [10]. A new GF (2m) PB multiplication method
was given in [11] for the computation of the product P = X · Z
mod f(y). This method defined functions Si and Ti given by the
addition (XOR) of terms vk = (xkzk) and wj

i = (xizj + xjzi),
with xi, zi ∈ GF (2), that can be constructed as binary trees of
XOR gates with a bottom level of AND gates (that corresponds
with the xizj products). Functions Si (1 ≤ i ≤ m) and Ti

(0 ≤ i ≤ m− 2) are given by [12]

Si = xp +

p−1∑
h=0

zi−h−1
h , Ti = xq +

r−(i+1)∑
j=1

zm−j
i+j (1)

Authorized licensed use limited to: Imperial College London. Downloaded on May 17,2021 at 13:40:42 UTC from IEEE Xplore.  Restrictions apply. 


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS 2

where p = bi/2c and q = (dm/2e+ bi/2c). In these expressions,
vp = xpzp only appears for i odd and vq only appears for m
and i even or for m and i odd, where r = q. Otherwise, the term
vq does not occur and r = (dm/2e+ di/2e). As an example, for
GF (27) the terms Si and Ti are the following:

S1 = v0 = x0z0,
S2 = w1

0 = (x0z1 + x1z0),
S3 = v1 + w2

0 = x1z1 + (x0z2 + x2z0),
S4 = w3

0 + w2
1 = (x0z3 + x3z0) + (x1z2 + x2z1),

S5 = v2 +w4
0 +w3

1 = x2z2 + (x0z4 + x4z0) + (x1z3 + x3z1),
S6 = w5

0+w
4
1+w

3
2 = (x0z5+x5z0)+(x1z4+x4z1)+(x2z3+

x3z2),
S7 = v3 + w6

0 + w5
1 + w4

2 = x3z3 + (x0z6 + x6z0) + (x1z5 +

x5z1) + (x2z4 + x4z2),
T0 = w6

1+w
5
2+w

4
3 = (x1z6+x6z1)+(x2z5+x5z2)+(x3z4+

x4z3),
T1 = v4 +w6

2 +w5
3 = x4z4 + (x2z6 + x6z2) + (x3z5 + x5z3),

T2 = w6
3 + w5

4 = (x3z6 + x6z3) + (x4z5 + x5z4),
T3 = v5 + w6

4 = x5z5 + (x4z6 + x6z4),
T4 = w6

5 = (x5z6 + x6z5), and
T5 = v6 = x6z6.
The coordinates of the product P = X·Z, with X,Z ∈ GF (27),

can be determined by the XOR of these terms.
Irreducible polynomials with low Hamming weight, such as

trinomials and pentanomials, are normally used for hardware
implementations of GF (2m) multipliers. Irreducible trinomials
[15] f(y) = ym+yn+1 are important because they are abundant
and, for a given m, an irreducible trinomial can always be found
when irreducible pentanomials do not exist. Polynomial basis
multiplication using irreducible trinomials was considered in [11],
where expressions for the computation of the coefficients of the
GF (2m) product were set in terms of Si and Ti functions. As an
example, for the binary field GF (27) generated by the trinomial
f(y) = y7 + y3 + 1, the coefficients ci of the product computed
using [11] are given in the second column of Table III. In these
expressions, the parentheses point out that the corresponding
terms must be shared for different coefficients of the product with
the purpose of reducing the area requirements of the multiplier.
For example, the addition (T0 +T4) in Table III appears in c0
and c3, so it can be shared among the two coordinates.

III. DELAY REDUCTION

If the objective is the reduction of the delay, the approach intro-
duced in [11] presents the problem of the unbending construction
of Si and Ti terms. For example, for the field GF (27) the addition
S1+T3 = x0z0+(x5z5+(x4z6+x6z4)), where the parenthesized
terms must be XORed previously to the XOR with the other
terms, will result in a binary tree with three levels of XOR gates.
However, the addition S1+T3 requires the sum of the four terms
x0z0, x5z5, x4z6 and x6z4. If the term x0z0 is first added with
x5z5 and then the sum with (x4z6 + x6z4) is carried out in the
form S1 +T3 = (x0z0 + x5z5) + (x4z6 + x6z4), then a 2-level
binary tree of XOR gates can be used for the implementation.
This concept was used in [12] in order to describe GF (2m) PB
multipliers based on type II irreducible pentanomials. In [12],
functions Si and Ti were split in the form Si = sikS

k
i +. . .+s

i
0S

0
i

and Ti = tikT
k
i + . . . + ti0T

0
i , where sij , t

i
j ∈ GF (2) and

k = blog2mc. Terms Sj
i and Tj

i represent the sum of 2j products
xkzl, so they can be implemented using binary trees with j

levels of XOR gates. Pairwise addition (XOR) of Sj
i or Tj

i terms

TABLE I
FUNCTIONS Si AND Ti FOR GF (27).

22 21 20 binary

S1 0 0 S0
1 = v0 001

T5 0 0 T0
5 = v6

S2 0 S1
2 = w1

0 0 010
T4 0 T1

4 = w6
5 0

S3 0 S1
3 = w2

0 S0
3 = v1 011

T3 0 T1
3 = w6

4 T0
3 = v5

S4 S2
4 = (w3

0 + w2
1) 0 0 100

T2 T2
2 = (w6

3 + w5
4) 0 0

S5 S2
5 = (w4

0 + w3
1) 0 S0

5 = v2 101
T1 T2

1 = (w6
2 + w5

3) 0 T0
1 = v4

S6 S2
6 = (w4

1 + w3
2) S1

6 = w5
0 0 110

T0 T2
0 = (w5

2 + w4
3) T1

0 = w6
1 0

S7 S2
7 = (w5

1 + w4
2) S1

7 = w6
0 S0

7 = v3 111

 
S42 

0 1 2 3 4 

T01 

p3 
	

T02 

T30 

T31 

T41 

Fig. 1. Implementation of coefficient p3 for f(y) = y7 + y3 + 1.

therefore results in a new (j+1)-level binary tree of XOR gates.
It can be concluded that if the addition of Si and Ti functions is
done by the pairwise sums of Sj

i and Tj
i terms with the same depth

j, then the delay of the GF (2m) multiplier is reduced by means of
the diminishing of the XOR levels needed for the implementation.

The above splitting representation introduced in [12] satisfies
that the coefficients (sik, . . . , s

i
1, s

i
0) and (tik, . . . , t

i
1, t

i
0) are given

by the binary representations of the subindex i for Si and of the
value m−1−i for Ti, respectively. In this way, the sik coefficients
are ’1’ if and only if the binary representation of i has a ’1’
in the position with weight 2k, and the tik coefficients take the
value ’1’ if and only if the binary representation of m − 1 − i
has a ’1’ in the position with weight 2k. For example, the term
S6 for GF (27) is represented by S6 = s62S

2
6 + s61S

1
6 + s60S

0
6

= 1 · S2
6 + 1 · S1

6 + 0 · S0
6 = S2

6 + S1
6, where the binary vector

(s62, s
6
1, s

6
0)2 = (1, 1, 0)2 = 610. Similarly, the term T1 for GF (27)

is given by T1 = t12T
2
1 + t11T

1
1 + t10T

0
1 = 1 ·T2

1 +0 ·T1
1 +1 ·T0

1

= T2
1 + T0

1, where the binary vector (t12, t
1
1, t

1
0)2 = (1, 0, 1)2 is

the binary representation of the value (7− 1− 1)10 = 510.
Following [12], functions Si and Ti for GF (27) are given in

Table I, where each function is given as the XOR of the Sj
i and

Tj
i terms, respectively, existent in their rows. In Table I, terms Si

and Tm−1−i with the same binary representation are included
in a row. Furthermore, the expressions for the corresponding
Sj
i and Tj

i terms are also given in Table I. For GF (27), the
coefficients of the product generated by the irreducible trinomial
f(y) = y7 + y3 + 1, with (m,n) = (7,3), using the algorithms

Authorized licensed use limited to: Imperial College London. Downloaded on May 17,2021 at 13:40:42 UTC from IEEE Xplore.  Restrictions apply. 


1549-7747 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more
information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCSII.2021.3071188, IEEE Transactions on Circuits and Systems II: Express Briefs

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS 3

TABLE II
ALGORITHM FOR MULTIPLICATION FOR f(y) = ym + yn + 1, WITH

1 ≤ n ≤ (m+ 1)/2.

if n > 1 then
for i = 0 to n− 2 do
p(i) =

∑k
h=0(s

i+1
h Sh

i+1 + tihTh
i + ti+l

h Th
i+l)

end for
end if
p(n− 1) =

∑k
h=0(s

n
hSh

n + tn−1
h Th

n−1)
for i = n to 2n− 2 do

if m even and n = m/2 then
p(i) =

∑k
h=0(s

i+1
h Sh

i+1 + ti−n
h Th

i−n)
else

if i = m− 1 then
p(i) =

∑k
h=0(s

i+1
h Sh

i+1 + ti−n
h Th

i−n + ti+l−n
h Th

i+l−n)
else
p(i) =

∑k
h=0(s

i+1
h Sh

i+1 + tihTh
i + ti−n

h Th
i−n + ti+l−n

h Th
i+l−n)

end if
end if

end for
for i = 2n− 1 to m− 1 do

j = i
if j ≤ m− 2 then
p(i) =

∑k
h=0(s

i+1
h Sh

i+1 + tjhTh
j + ti−n

h Th
i−n)

else
p(i) =

∑k
h=0(s

i+1
h Sh

i+1 + ti−n
h Th

i−n)
end if

end for

given in [11] and the splitting method given in [12] are given as
follows, where parenthesized terms point out that they have to be
XORed previously to the XOR with the other terms to reduce the
delay of the multiplier:
c0 = (S0

1 +T1
0) + (T2

0 +T1
4),

c1 = (S1
2 + (T0

1 +T0
5)) +T2

1,
c2 = (S0

3 + S1
3) +T2

2),
c3 = (S2

4 +T2
0) + ((T1

0 +T1
4) + (T0

3 +T1
3)),

c4 = (S0
5 + S2

5) + (T2
1 + ((T0

1 +T0
5) +T1

4),
c5 = S2

6 + (T2
2 + (S1

6 +T0
5)), and

c6 = (S1
7 + S2

7) + ((S0
7 +T0

3) +T1
3).

In order to exemplify the approach, the implementation of the
coefficient p3 for the field GF (27) is given in Figure 1. This
coefficient is the most complex one and can be used to determine
the maximum delay of the multiplier for GF (27) with (m,n) =
(7,3). In Figure 1, the levels of XOR trees are represented by
vertical dashed lines and terms Sj

i and Tj
i are represented by

black circles, in such a way that S4 = S2
4, T0 = T2

0 + T1
0,

T3 = T1
3 + T0

3 and T4 = T1
4. For example, S2

4 represents the
2-level binary tree S2

4 = (w3
0 + w2

1) = (x0z3 + x3z0) + (x1z2 +

x2z1). Furthermore, terms T0 and T3 are represented with circles
enclosed within ellipses. It can be observed in Figure 1 that the
pairwise addition of terms starts with the 0-level term T0

3 and the
1-level term T1

3. In this example, p3 can be implemented by a
binary tree of XOR gates with depth 4, so the delay complexity
of the multiplier is TA + 4TX, with TA and TX standing for the
delay of 2-input AND and XOR gates, respectively.

IV. LOW-DELAY FPGA-BASED POLYNOMIAL BASIS

MULTIPLIER

As previously given, splitting approach applied to GF (2m)

PB multipliers enforces strong parenthesized constraints on the
sum of Sj

i and Tj
i terms in order to reduce the number of

XOR levels. However these constraints could hinder a synthesis

tool the mapping of these terms into FPGA’s configurable logic
blocks. If parenthesized constraints are removed, the synthesizer
could have more freedom to efficiently implement the GF (2m)

multiplier into the FPGA’s logic blocks. This approach was used
in [13] for type II irreducible pentanomials where optimized
FPGA implementations of bit-parallel multipliers were given. The
removal of parenthesized constraints has been applied in this
work to irreducible trinomials, obtaining a different set of product
equations from those given in [13] due to the different number of
existing not-null terms in both polynomials and therefore to the
use of different algorithms for the product computation.

In [11], expressions for the coefficients of the GF (2m) product
were given for specific trinomials f(y) = ym + yn + 1 with
n = m−1,m/2, (m+1)/2, (m−1)/2 and n = 1. However, more
general expressions can be given for irreducible trinomials with
1 ≤ n ≤ (m+1)/2. Table II shows a new algorithm in which the
expressions for the computation of the coefficients of the product
given in [11] are generalized for any 1 ≤ n ≤ (m + 1)/2. In
Table II, the new approach without parenthesized constraints is
used for the computation of the expressions. This is given by
the summation from h = 0 to k = blog2mc of sihSh

i and tihTh
i

terms, in such a way that the sih coefficients are ’1’ if and only
if the binary representation of i has a ’1’ in the position with
weight 2h, and the tih coefficients are ’1’ if and only if the binary
representation of m− 1− i has a ’1’ in the position with weight
2h. In Table II, the term l = m− n has been used.

Using the new algorithm given in Table II, the expressions
for the coefficients of the GF (27) PB multiplier based on the
irreducible trinomial with (m,n) = (7,3) are given in the third
column of Table III, where the constraints enforced by the
parenthesized additions of Sj

i and Tj
i terms have been deleted.

In this case, the synthesis tool can be free to better optimize the
implementation of the multiplier.

TABLE III
COEFFICIENTS OF THE PRODUCT FOR TRINOMIAL GF (27) WITH n = 3.

p0 S1 + (T0 +T4) S0
1 +T2

0 +T1
0 +T1

4
p1 S2 + (T1 +T5) S1

2 +T2
1 +T0

1 +T0
5

p2 S3 +T2 S1
3 + S0

3 +T2
2

p3 S4 + (T0 +T4) +T3 S2
4 +T2

0 +T1
0 +T1

3 +T0
3 +T1

4
p4 S5 + (T1 +T5) +T4 S2

5 + S0
5 +T2

1 +T0
1 +T1

4 +T0
5

p5 S6 +T2 +T5 S2
6 + S1

6 +T2
2 +T0

5
p6 S7 +T3 S2

7 + S1
7 + S0

7 +T1
3 +T0

3

The architecture of the new proposed multiplier for irreducible
trinomials is shown in Figure 2, where k = blog2mc. The first
(left) block receives the two input operands and generates the
vi = (xizi) functions as the product (AND) of the coordinates of
the operands and the wj

i = (xizj+xjzi) functions as the addition
(XOR) of coordinate products. The vi and wj

i terms implemented
in the first block are the inputs to the second block that generates
the individual Sj

i and Tj
i small terms as given in Section III. The

last block in Figure 2 receives the Sj
i and Tj

i terms and computes
the product coordinates as the XOR of these terms using the new
multiplication algorithm given in Table II. The main characteristic
of this approach in comparison with other multiplication methods
is that the coordinates of the product are computed using the
individual small terms Sj

i and Tj
i rather than using more complex

(parenthesized) expressions, in such a way that a synthesis tool
has more freedom to map these terms into FPGA’s configurable
blocks and therefore optimize the implementation.

Authorized licensed use limited to: Imperial College London. Downloaded on May 17,2021 at 13:40:42 UTC from IEEE Xplore.  Restrictions apply. 


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS 4

(x0,x1,…,xm-1)

m

(z0,z1,…,zm-1)

m

(p0,p1,…,pm-1)

m

v0

vm-1

w

w

1
0

m-1
m-2

S 0
1 S k

1

k+1

S 0
m S k

m

k+1

T 0
0 T k

0

k+1

T 0
m-2 T k

m-2

k+1

Fig. 2. Block diagram of the proposed multiplier.

TABLE IV
RESULTS OF P&R IMPLEMENTATION FOR SECG TRINOMIALS.

LUT SLC T SLC × T LUT × T DI(%)
(m,n) = (113, 9)

[9] 5429 2847 22.96 65367.12 124649.84 100%
[10] 5394 2714 21.82 59219.48 117697.08 95.0%
[11] 5820 2521 20.21 50949.41 117622.20 88.0%
[6] 6995 3321 21.45 71218.85 150007.78 93.4%
[7] 4453 1845 21.09 38892.60 93869.24 91.9%
[12] 5439 2305 19.79 45615.95 107637.81 86.2%

Fig.2 5443 2303 19.52 44954.56 106247.36 85.0%
(m,n) = (113, 15)

[9] 5430 2866 22.16 63510.56 120328.80 100%
[10] 5397 2706 21.03 56907.18 113498.91 94.9%
[11] 5767 2409 20.37 49071.33 117473.79 91.9%
[6] 6707 3141 20.53 64487.87 137701.42 92.6%
[7] 4460 1874 20.64 38669.99 92032.10 93.1%
[12] 5443 2441 20.22 49357.02 110057.46 91.2%

Fig.2 5431 2226 19.65 43740.90 106719.15 88.7%
(m,n) = (113, 30)

[9] 5424 2793 22.33 62367.69 121117.92 98.0%
[10] 5396 2738 21.30 58319.40 114934.80 93.5%
[11] 5656 2602 20.52 53393.04 116061.12 90.1%
[6] 6088 2792 22.78 63596.18 138672.46 100%
[7] 4432 1860 20.66 38418.30 91542.96 90.7%
[12] 5418 2377 20.53 48799.81 111231.54 90.1%

Fig.2 5429 2361 19.96 47125.56 108362.84 87.6%

V. FPGA IMPLEMENTATION RESULTS

The new proposed polynomial basis multiplier given in Figure
2 that implements the new algorithm given in Table II for irre-
ducible trinomials has been described in VHDL and implemented
with Xilinx ISE 14.7 using XST synthesizer in the Artix-7
XC7A200T-FFG1156. In order to compare the new approach with
similar PB multipliers found in the literature, the multiplication
method given in [10], the bit-parallel version of the multiplier
presented in [9] and the multiplier given in [11] that introduced
the Si and Ti functions have been described in VHDL and
implemented in Artix-7. Furthermore, recent methods using PB
based on the Chinese Remainder Theorem [6] and a combination
of Montgomery Multiplication with a Karatsuba-based approach
[7] have also been implemented. Finally, the splitting method
with strong parenthesized constraints given in [12] applied to
irreducible trinomials has also been implemented in order to com-
pare their results with those obtained by the new non-restricted
approach. Several finite fields have been implemented in all cases

TABLE V
RESULTS OF P&R IMPLEMENTATION OF GF (2124) MULTIPLIERS.

LUT SLC T SLC × T LUT × T DI(%)
(m,n) = (124, 19)

[9] 6522 3122 24.25 75708.50 158158.50 100%
[10] 6494 2205 20.20 44541.00 131178.80 83.3%
[11] 6925 2404 21.18 50916.72 146671.50 87.3%
[6] 8014 2905 20.18 58622.90 161722.52 83.2%
[7] 5614 2357 21.69 51123.33 121767.66 89.4%
[12] 6543 2353 20.79 48918.87 136028.97 85.7%

Fig.2 6529 2052 19.94 40916.88 130188.26 82.2%
(m,n) = (124, 37)

[9] 6509 3115 23.86 74323.90 155304.74 100%.
[10] 6497 2194 20.31 44560.14 131954.07 85.1%
[11] 6708 2216 21.30 47200.80 142880.40 89.3%
[6] 7187 2565 20.96 53800.88 150747.33 87.8%
[7] 5507 2330 21.31 49649.97 117348.66 89.3%
[12] 6560 2008 21.62 43412.96 141827.20 90.6%

Fig.2 6501 1971 19.85 39124.35 129044.85 83.2%
(m,n) = (124, 45)

[9] 6512 3126 22.88 71522.88 148994.56 100%
[10] 6497 2182 21.66 47262.12 140725.02 94.7%
[11] 6730 2287 21.70 49627.90 146041.00 94.8%
[6] 6811 2518 22.04 55501.76 150128.06 96.3%
[7] 5495 2450 22.63 55431.25 124324.38 98.9%
[12] 6570 2131 20.31 43280.61 133436.70 88.8%

Fig.2 6526 2016 20.35 41025.60 132804.10 88.9%
(m,n) = (124, 55)

[9] 6510 3132 24.95 78143.40 162424.50 100%
[10] 6495 2046 20.21 41349.66 131263.95 81.0%
[11] 6656 2187 20.45 44724.15 136115.20 82.0%
[6] 6488 2369 21.63 51234.36 140315.98 86.7%
[7] 5486 2381 21.66 51579.60 118843.22 86.8%
[12] 6590 2055 21.15 43463.25 139378.50 84.8%

Fig.2 6513 2086 19.88 41469.68 129478.44 79.7%

using speed high optimizations and same pin assignments.
Table IV shows experimental post-place and route results for

the binary fields GF (2113) recommended by SECG (Standards
for Efficient Cryptography Group) [14] with n = 9, 15 and 30.
Table V includes results for GF (2m) multipliers using irreducible
trinomials with values (m,n) = (124,19), (124,37), (124,45),
(124,55), and Table VI shows results for (162,27), (162,63) and
(162,81). Area complexity is given by the number of LUTs

and Slices (SLC) used, and time results T (in nanoseconds)
correspond with the critical path delay of the multipliers. The
Area × T ime metrics are given by the products SLC × T

and LUT × T in order to compare the area and delay (less is
better). Furthermore, DI(%) represents the delay improvement
with respect to the worst time delay (100%).

From the experimental results, it can be observed that the new
multiplier exhibits the lowest delay in all cases except for the
multiplier (m,n) = (124,45), where the delay of the proposed
multiplier is only 0.2% higher than the approach given in [12]
using splitting method with hard parenthesized constraints. In
the remaining implementations, it must be noted that the delay
improvement of the new multiplier with respect to the second best
time delays ranges from 1.3%, for the multiplier (124,19) in [10],
to 4.7%, for (162,27) in [12]. Furthermore, the delay improvement
with respect to the worst time delay in [9] ranges from 11.2%
for (124,45) to 20.3% for (124,55). With respect to the area
complexity, the multiplier given in [7] presents the lowest number
of LUTs in all cases, and the lowest number of slices in five
out of the ten implemented multipliers (for GF (2113), (162,27)
and (162,81)). In four of the remaining cases, the multiplier here
presented exhibits the lowest number of slices. With regard to

Authorized licensed use limited to: Imperial College London. Downloaded on May 17,2021 at 13:40:42 UTC from IEEE Xplore.  Restrictions apply. 


1549-7747 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more
information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCSII.2021.3071188, IEEE Transactions on Circuits and Systems II: Express Briefs

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS 5

TABLE VI
RESULTS OF P&R IMPLEMENTATION OF GF (2162) MULTIPLIERS.

LUT SLC T SLC × T LUT × T DI(%)
(m,n) = (162, 27)

[9] 11089 4491 24.27 108996.57 269130.03 100%
[10] 11082 3683 23.81 87692.23 263862.42 98.1%
[11] 11657 3869 21.89 84692.41 255171.73 90.2%
[6] 13498 4856 22.22 107880.90 299871.57 91.6%
[7] 8893 3027 23.15 70059.92 205828.49 95.4%
[12] 11101 3925 21.66 85015.50 240447.66 89.2%

Fig.2 11085 3468 20.64 71579.52 228794.40 85.0%
(m,n) = (162, 63)

[9] 11061 4050 23.91 96835.50 264468.51 100%
[10] 11101 3666 21.65 79368.90 240336.65 90.5%
[11] 11281 3704 21.50 79636.00 242541.50 89.9%
[6] 11344 3987 22.28 88818.40 252710.29 93.2%
[7] 9125 3420 22.16 75800.88 202246.50 92.7%
[12] 11137 3741 21.67 81067.47 241338.79 90.6%

Fig.2 11056 3397 20.99 71303.03 232065.44 87.8%
(m,n) = (162, 81)

[9] 11070 3657 23.02 84184.14 254831.40 97.1%
[10] 11108 3694 21.30 78682.20 236600.40 89.9%
[11] 11800 3922 22.65 88833.30 267270.00 95.6%
[6] 10777 3831 23.01 88139.82 247946.44 97.1%
[7] 8897 3428 23.70 81226.46 210814.42 100%
[12] 11065 3501 21.12 73941.12 233692.80 89.1%

Fig.2 11058 3623 20.74 75141.02 229342.92 87.5%

the Area × T ime metrics, the multipliers given in [7] exhibits
the lowest LUT×T values and the lowest SLC×T values in
five out of the ten multipliers. For all the GF (2124) multipliers
and for (162,63), the multiplier here proposed presents the lowest
SLC×T values, with an improvement with respect to the second
best SLC×T values ranging from 5.2% in (124,45) to 9.9% for
(124,37). It is important to note that the work in [7] performs the
Montgomery GF (2m) polynomial basis multiplication given by
P = X ·Z · yh mod f(y), with 1 ≤ h ≤ m, that is different from
the multiplication P = X · Z mod f(y) here presented. If we
do not consider the Montgomery multiplier given in [7], then the
proposed multiplier presents the lowest SLC×T values except
for (124,55) and (162,81), with improvements with respect to the
second best values ranging from 1.4% for (113,9) to 15.5% in
(162,27). With regard to LUT×T values, the proposed multiplier
presents the best values in all cases, with improvements ranging
from 0.5% for (124,45) to 4.8% in (162,27).

It can also be observed that experimental results show that
the multiplier here presented that applies the new non-restricted
splitting approach to irreducible trinomials exhibits the lowest
delay and Area×T ime values in comparison with the splitting
method with hard parenthesized constraints given in [12]. This
is because in the parenthesized implementation the synthesizer
can not optimize the mapping into the FPGA’s configurable logic
blocks due to the constraints imposed by the parenthesis. This
optimization can be performed in the non-parenthesized version
here presented that gives the synthesizer more freedom to find
efficient implementations of GF (2m) bit-parallel PB multipliers.

VI. CONCLUSION

In this brief, low-delay FPGA-based implementations of
GF (2m) bit-parallel polynomial basis multipliers using irre-
ducible trinomials have been given. The novelty of the new
multiplier here presented is that the splitting method without
parenthesized constraints has been applied to irreducible trino-
mials, in such a way that the synthesis tool is free to optimize

the implementation. The architecture of the new proposed mul-
tiplier and a new general algorithm for the computation of the
product for irreducible trinomials f(y) = ym + yn + 1 for any
1 ≤ n ≤ (m + 1)/2 using the new splitting approach without
parenthesized constraints have also been given. Several GF (2m)

polynomial basis multiplication methods have been described in
VHDL and implemented for several field sizes, including SECG
recommended fields. Post-place and route implementation results
in Xilinx Artix-7 have also been reported. Experimental results
have shown that the multiplier here presented exhibits the best
delay, with a delay improvement of up to 4.7%, and the second
best Area × T ime complexities when compared with similar
multipliers found in the literature. It has also been observed
that the new proposed multiplier that applies the non-restricted
splitting approach to irreducible trinomials exhibits the lowest
delay and Area×T ime values in comparison with the splitting
method with hard parenthesized constraints. This is because in
the parenthesized implementation the synthesizer can not optimize
the mapping into the FPGA’s configurable logic blocks due to the
constraints imposed by the parenthesis. This optimization can be
performed in the new non-parenthesized version here presented
that gives the synthesizer more freedom to find efficient imple-
mentations of GF (2m) bit-parallel polynomial basis multipliers.

REFERENCES

[1] A.A.H. Abd-Elkader, M. Rashdan, E-S.A.M. Hasaneen and H.F.A.
Hamed, ‘FPGA-Based Optimized Design of Montgomery Modular Mul-
tiplier’, IEEE Trans. Circuits and Systems II-Express Briefs, early access,
2021.

[2] P. H. Namin, R. Muscedere and M. Ahmadi, ‘A Fully Serial-In Parallel-
Out Digit-level Finite Field Multiplier in F2m Using Redundant Repre-
sentation’, IEEE Trans. Circuits and Systems II-Express Briefs, vol. 64,
no. 11, pp. 1337-1341, November 2017.

[3] Y. Li, Y. Zhang, X. Guo and C. Qi, “N-Term Karatsuba Algorithm and Its
Application to Multiplier Designs for Special Trinomials”, IEEE Access,
vol.6, pp. 43056-43069, Aug. 2018.

[4] S.-M. Park, K.-Y. Chang, D. Hong and C. Seo, “Low Space Complexity
GF (2m) Multiplier for Trinomials Using n-Term Karatsuba Algorithm”,
IEEE Access, vol.7, pp. 27047-27064, March 2019.

[5] S.-M. Park, K.-Y. Chang, D. Hong and C. Seo “Efficient Bit-Parallel
Multiplier for All Trinomials Based on n-Term Karatsuba Algorithm”,
IEEE Access, vol.8, pp. 173491-173507, Oct. 2020.

[6] H. Fan, “A Chinese remainder Theorem Approach to Bit-Parallel
GF (2m) Polynomial Basis Multipliers for Irreducible Trinomials”, IEEE
Trans. Comput., vol.65, no.2, pp. 343-352, Feb. 2016.

[7] Y. Li and Y. Chen, “New bit-parallel Montgomery multiplier for trinomi-
als using squaring operation”, INTEGRATION, the VLSI journal, vol.52,
pp. 142-155, Jan. 2016.

[8] C. Paar, “Efficient VLSI Architectures for Bit Parallel Computation in
Galois Fields”, PhD Thesis, Universität GH Essen, 1994.

[9] B. Rashidi, R.R. Farashahi and S.M. Sayedi, “Efficient Implementation
of Low Time Complexity and Pipelined Bit-Parallel Polynomial Basis
Multiplier over Binary Finite Fields”, Int. Journal of Information Security,
vol. 7, no. 2, pp. 101-114, July 2015.

[10] A. Reyhani-Masoleh and M.A. Hasan, “Low Complexity Bit Parallel
Architectures for Polynomial Basis Multiplication over GF (2m)”, IEEE
Trans. Computers, vol. 53, no. 8, pp. 945-959, August 2004.

[11] J.L. Imaña, J.M. Sánchez and F. Tirado, “Bit-Parallel Finite Field
Multipliers for Irreducible Trinomials”, IEEE Trans. Computers, vol. 55,
no. 5, pp. 520-533, May 2006.

[12] J.L. Imaña, ‘High-Speed Polynomial Basis Multipliers over GF (2m)
for Special Pentanomials’, IEEE Trans. Circuits and Systems I-Regular
Papers, vol. 63, no. 1, pp. 58-69, January 2016.

[13] J.L. Imaña, ‘Reconfigurable implementation of GF (2m) bit-parallel
multipliers’, Design, Automation & Test in Europe Conference & Ex-
hibition DATE 2018, pp. 899-902, March 2018.

[14] SEC 2. Standards for Efficient Cryptography Group, “Recommended
Elliptic Curve Domain Parameters”. Version 1.0, 2000.

[15] H. Fan and Y. Dai, “Fast Bit Parallel GF (2m) Multiplier for All
Trinomials”, IEEE Trans. Comput., vol.54, no.4, pp. 485-490, Apr. 2005.

Authorized licensed use limited to: Imperial College London. Downloaded on May 17,2021 at 13:40:42 UTC from IEEE Xplore.  Restrictions apply.