RESEARCH Open Access

Acceleration of block-matching algorithms using
a custom instruction-based paradigm on a Nios II
microprocessor
Diego González*, Guillermo Botella, Carlos García, Manuel Prieto and Francisco Tirado

Abstract

This contribution focuses on the optimization of matching-based motion estimation algorithms widely used for
video coding standards using an Altera custom instruction-based paradigm and a combination of synchronous
dynamic random access memory (SDRAM) with on-chip memory in Nios II processors. A complete profile of the
algorithms is obtained before the optimization in order to locate the execution-time bottlenecks; afterward, a custom
instruction set is created and added to the specific design, enhancing the original system. In addition, every possible
memory combination between on-chip memory and SDRAM has been tested to achieve the best performance.
The final throughput of the complete designs is shown. This manuscript outlines a low-cost system, mapped
using very large scale integration technology, which accelerates software algorithms by converting them into
custom hardware logic blocks and showing the best combination between on-chip memory and SDRAM for the
Nios II processor.

Keywords: Computer vision, Optical flow, MPEG compression, Block-matching algorithm, Nios II, FPGA, Custom
instructions, Embedded systems

1. Introduction
Real-time motion estimation is an important task in machine vision and
multimedia applications. For example, one of the most time-consuming
issues in video coding and transmission standards concerns the ubiquitous
portable consumer electronic devices, all with multimedia capabilities,
which require the efficient implementation of video coding algorithms and
impose a trade-off between accuracy, efficiency, and power consumption.
There is a profusion of motion estimation algorithms and systems; many of
them are frequently used in multimedia tasks and video coding standards,
such as motion compensation and coding [1,2].
When considering motion estimation for multimedia purposes, the main point
is to remove the temporal redundancy of video data for storage and
transmission [2,3]. Motion estimation for multimedia coding is achieved
mostly through block-matching techniques [4-8] that analyze the macro
blocks (blocks of pixels, commonly called MBs) of the reference frame in
order to find the closest block to the one in the current frame.
Accordingly, the motion vector (MV) is defined as the offset from the MB
coordinates in the current frame to the MB coordinates in the reference
frame. An overview of the process is shown in Figure 1.
Coding a frame by means of motion estimation in this way is also known as
inter-frame coding.
There are several previous works regarding motion
estimation hardware acceleration [9-11] and, specific-
ally, block-matching algorithms [12], though none of
them explore the custom instruction paradigm. Looking
into block-matching techniques, three frequently used
techniques can be classified: the full-search technique
(FST) [4], the three-step-search technique (TSST) [13],
and the two-dimensional logarithmic-search technique
(2DLOG) [14].

* Correspondence: dgonzalez@grupobme.es
Department of Computer Architecture and Automation, Complutense
University, Ciudad Universitaria s/n, Madrid 28040, Spain

© 2013 González et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction
in any medium, provided the original work is properly cited.

González et al. EURASIP Journal on Advances in Signal Processing 2013, 2013:118
http://asp.eurasipjournals.com/content/2013/1/118

The FST [4] matches all possible blocks within a search window in the
reference frame to determine the closest block to the one fixed in the
current frame (Figure 2). The closest is the one with the minimum
summation of absolute differences (SAD), which is defined as:

$$\mathrm{SAD}(x, y; u, v) = \sum_{x=0}^{31} \sum_{y=0}^{31} \left| I_t(x, y) - I_{t-1}(x + u, y + v) \right|, \qquad (1)$$

where I_t(x, y) represents the pixel value at the coordinate (x, y) in
frame t, and (u, v) is the displacement of the candidate MB. For example,
for a 32 × 32 block, the FST algorithm requires 1,024 subtractions and
1,023 additions to calculate one SAD. The required number of checking
blocks is (1 + 2d)^2 when the search window is limited to ± d pixels;
d is commonly chosen as a power of two.
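
As an illustration of Equation 1 and of the exhaustive search, a minimal C sketch is given next. It is not the code evaluated in this work: the `Frame` layout (8-bit samples, row-major), the names `block_sad`, `candidate_inside`, and `full_search`, and the decision to skip candidates that fall outside the reference frame are all assumptions made for the example, and the block size `n` is left general instead of being fixed to 32.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical frame layout: 8-bit samples, row-major, `stride` bytes per row. */
typedef struct {
    const uint8_t *pix;
    int width, height, stride;
} Frame;

typedef struct { int u, v; unsigned cost; } MotionVector;

/* SAD between the n x n block at (bx, by) in `cur` and the block displaced
   by (u, v) in `ref` (Equation 1 with a general block size n). */
static unsigned block_sad(const Frame *cur, const Frame *ref,
                          int bx, int by, int u, int v, int n)
{
    unsigned sad = 0;
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            int a = cur->pix[(by + y) * cur->stride + (bx + x)];
            int b = ref->pix[(by + v + y) * ref->stride + (bx + u + x)];
            sad += (unsigned)abs(a - b);
        }
    }
    return sad;
}

/* Returns 1 when the displaced block lies fully inside the reference frame. */
static int candidate_inside(const Frame *ref, int bx, int by, int u, int v, int n)
{
    return bx + u >= 0 && by + v >= 0 &&
           bx + u + n <= ref->width && by + v + n <= ref->height;
}

/* Full search: evaluates every displacement in [-d, d] x [-d, d],
   that is, up to (1 + 2d)^2 candidates, and keeps the minimum SAD. */
static MotionVector full_search(const Frame *cur, const Frame *ref,
                                int bx, int by, int n, int d)
{
    MotionVector best = {0, 0, UINT_MAX};
    for (int v = -d; v <= d; v++) {
        for (int u = -d; u <= d; u++) {
            if (!candidate_inside(ref, bx, by, u, v, n))
                continue; /* assumption: out-of-frame candidates are skipped */
            unsigned cost = block_sad(cur, ref, bx, by, u, v, n);
            if (cost < best.cost) {
                best.u = u;
                best.v = v;
                best.cost = cost;
            }
        }
    }
    return best;
}
```

For d = 8, 16, and 32, the double loop visits at most 289, 1,089, and 4,225 candidates per block, which is why the SAD computation dominates the run time of the FST.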
The TSST [13,15,16] selects nine candidate points, including the center
point and eight checkpoints on the boundary of the search area defined by
the current step size, moves the center to the matching point with the
minimum SAD, and reduces the step size by half in each of its three steps
(Figure 3). The final step stops the search process and returns the MV
that corresponds to the minimum SAD found.
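
The three steps can be sketched in C as follows. This is an illustrative reading of the description above, not the evaluated implementation: the cost is abstracted behind a hypothetical callback `CostFn` (in practice it would compute the SAD of a candidate block), the initial step is assumed to be half the search range, and `demo_cost` is a toy cost used only to exercise the search.

```c
#include <stdlib.h>

/* Hypothetical cost callback: returns the matching cost (for example the
   SAD) of displacement (u, v); `ctx` carries whatever the cost needs. */
typedef unsigned (*CostFn)(int u, int v, void *ctx);

typedef struct { int u, v; unsigned cost; } MotionVector;

/* Three-step search inside a +/- d window. Each step tests the eight
   neighbours at distance `step` around the current best point, recentres
   on the minimum, and halves the step. */
static MotionVector three_step_search(CostFn cost, void *ctx, int d)
{
    MotionVector best = {0, 0, cost(0, 0, ctx)};
    int step = (d + 1) / 2; /* assumption: d = 7 gives steps 4, 2, 1 */

    for (int s = 0; s < 3 && step >= 1; s++, step /= 2) {
        int cu = best.u, cv = best.v; /* centre of this step */
        for (int dv = -step; dv <= step; dv += step) {
            for (int du = -step; du <= step; du += step) {
                if (du == 0 && dv == 0)
                    continue; /* centre cost is already known */
                int u = cu + du, v = cv + dv;
                if (u < -d || u > d || v < -d || v > d)
                    continue; /* stay inside the search window */
                unsigned c = cost(u, v, ctx);
                if (c < best.cost) {
                    best.u = u;
                    best.v = v;
                    best.cost = c;
                }
            }
        }
    }
    return best;
}

/* Toy cost for experimentation: city-block distance to a target
   displacement passed through ctx as int[2]. */
static unsigned demo_cost(int u, int v, void *ctx)
{
    const int *target = (const int *)ctx;
    return (unsigned)(abs(u - target[0]) + abs(v - target[1]));
}
```

Each step evaluates at most eight new points, so a complete search costs at most 25 cost evaluations regardless of d, against (1 + 2d)^2 for the FST.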
The 2DLOG [14] uses a cross-shaped (+) search pattern at each step until
the step size is one pixel, with the initial step size being d/4. The step
size is reduced by half only when the minimum point of the previous step
is at the center or the current minimum point reaches the search window
boundary (Figure 4). If neither of these two conditions is met, the step
size remains the same.
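
A minimal C sketch of this rule is shown next, under stated assumptions: the cost is again abstracted behind a hypothetical callback `CostFn`, the initial step is taken as d/4 (at least one pixel), and the search is assumed to stop once a step of one pixel leaves the minimum at the center or on the window boundary. `demo_cost` is a toy cost for experimentation only.

```c
#include <stdlib.h>

/* Hypothetical cost callback, e.g. the SAD of the block displaced by (u, v). */
typedef unsigned (*CostFn)(int u, int v, void *ctx);

typedef struct { int u, v; unsigned cost; } MotionVector;

/* Two-dimensional logarithmic search inside a +/- d window. */
static MotionVector log2d_search(CostFn cost, void *ctx, int d)
{
    static const int du[4] = {0, -1, 1, 0}; /* cross (+) pattern */
    static const int dv[4] = {-1, 0, 0, 1};
    MotionVector best = {0, 0, cost(0, 0, ctx)};
    int step = d / 4 > 1 ? d / 4 : 1;

    for (;;) {
        int cu = best.u, cv = best.v; /* centre of this step */
        for (int k = 0; k < 4; k++) {
            int u = cu + du[k] * step, v = cv + dv[k] * step;
            if (u < -d || u > d || v < -d || v > d)
                continue; /* stay inside the search window */
            unsigned c = cost(u, v, ctx);
            if (c < best.cost) {
                best.u = u;
                best.v = v;
                best.cost = c;
            }
        }
        int at_center = (best.u == cu && best.v == cv);
        int at_border = (abs(best.u) == d || abs(best.v) == d);
        if (at_center || at_border) {
            if (step == 1)
                break;  /* assumed stopping rule at one-pixel resolution */
            step /= 2;  /* halve only under the two conditions above */
        }
        /* otherwise the centre has moved and the step size is kept */
    }
    return best;
}

/* Toy cost: city-block distance to a target displacement (int[2] in ctx). */
static unsigned demo_cost(int u, int v, void *ctx)
{
    const int *target = (const int *)ctx;
    return (unsigned)(abs(u - target[0]) + abs(v - target[1]));
}
```

The loop terminates because every iteration either moves to a strictly cheaper point or reduces the step size.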
The organization of the paper is as follows: in Section 2, a Nios II
processor overview is presented. Section 2.1 outlines the custom
instruction types. Section 2.2 discusses the different memory
architectures. Section 3 shows the methodology used in this work.
Section 4 shows and discusses the results from the different designs.
Conclusions are presented in Section 5.

2. Nios II processor
Nios II [17,18] is a 32-bit general soft-core embedded
processor, which allows the acceleration of time-critical
software algorithms by adding custom instructions to its
instruction set. The family has three members (fast, economy, and
standard), each one optimized for a specific
price and performance range.
The Nios II/f (fast) central processing unit (CPU) is optimized for
maximum performance [19,20]; it delivers up to 220 DMIPS in the Stratix II
family of field programmable gate arrays (FPGAs), placing it squarely in
the advanced RISC machine (ARM) 9 [21] class of processors. This
performance can be tuned to meet design constraints using custom
instructions, a high-bandwidth switch fabric, and hardware accelerators.
It supports fixed and variable cycle operations. The Nios II/e (economy)
CPU is optimized for the lowest cost, resulting in a smaller FPGA
footprint. The Nios II/s (standard) CPU delivers over 120 DMIPS while
consuming only 930 LEs (Stratix II [22]), creating a balance between
processing performance and logic element usage.

Figure 1 Motivation part of the MPEG-4 scheme.

Figure 2 Full-search technique.

2.1 Nios II custom instructions
A better design can be achieved with the Nios II custom instruction-based
paradigm [23] by designing custom logic blocks adjacent to the arithmetic
logic unit (ALU) in the processor's datapath (Figure 5), thus allowing the
designer to reduce a complex sequence of standard instructions to a single
instruction implemented in hardware. The Nios II processor uses GNU
compiler collection (GCC) built-in functions to map custom instructions
[24]; therefore, macros can be used directly in the C or C++ application
code, which avoids having to write assembly code to access the custom
instructions. The Nios II processor supports different types of custom
instructions. Figure 6 lists the additional ports that accommodate the
different custom instruction types; only the ports used by the specific
custom instruction implementation are required.
There are four types of custom instructions available to meet each
application's constraints and requirements.
The combinational type of custom instruction consists of a logic block
that completes its logic function in a single clock cycle. The multi-cycle
(or sequential) type consists of a logic block that requires two or more
clock cycles to complete an operation. The extended type allows a single
custom logic block to implement several different operations, using an
index to specify which operation the logic block has to perform. The
internal register file type gives the custom instruction access to its own
internal register file, providing the flexibility to specify whether the
custom instruction reads its operands from the Nios II processor's
register file or from the custom instruction's own internal register file.
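
To make the C-level access concrete, the following sketch wraps a two-operand custom instruction behind an ordinary function. It is only an illustration: the opcode index `CI_ABSDIFF_N` and the function name `absdiff` are hypothetical (in a real project the index and a wrapper macro come from the generated system header), and the operation chosen, an absolute difference, is an assumption. The GCC built-in `__builtin_custom_inii` is the variant that takes two integer operands and returns an integer; the software branch lets the same source be built and checked on a host PC.

```c
#include <stdint.h>

/* Hypothetical opcode index of the custom logic block. In a real design
   this value is assigned when the instruction is added to the processor. */
#define CI_ABSDIFF_N 0

/* Absolute difference of two operands, mapped to a custom instruction on
   Nios II and to plain C elsewhere. */
static inline int absdiff(int a, int b)
{
#if defined(__nios2__)
    /* GCC built-in: custom instruction with two int inputs, int result. */
    return __builtin_custom_inii(CI_ABSDIFF_N, a, b);
#else
    /* Software model with the same result, for builds on a host PC. */
    return a > b ? a - b : b - a;
#endif
}
```

Keeping a software model next to the hardware call makes it possible to compare both versions on the same input data before and after the instruction is added to the design.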

2.2 Memory system design for machine vision
implementation
Initially, the design was implemented without using a
custom instruction-based paradigm; instead, the memory
types were managed in order to reach a better combin-
ation to achieve a faster design. According to the Nios II
specifications [17], there are four types of memory that
could be used in the Nios II processor-based design: on-
chip memory, external static random access memory
(SRAM), flash memory, and synchronous dynamic ran-
dom access memory (SDRAM). Here is a brief analysis
of the advantages and disadvantages of each:

On-chip memory
On-chip memory requires no external connection, because it is embedded
inside the FPGA. As an advantage, this is the fastest type of memory that
can be used in an FPGA-based embedded system. It allows for the pipelining
of transactions, and it does not require additional circuit-board wiring,
which translates to a very low cost. As disadvantages, it is volatile, so
it loses its contents when power is disconnected, and it has limited
capacity, because the available memory capacity depends on the specific
FPGA device. Considering the advantages and disadvantages, on-chip
memories are mainly utilized for storing boot code or look-up tables
(LUT).

Figure 3 Three-step-search technique: (A) first step, (B) second step, (C) third step.

Figure 4 The search path of the 2DLOG search algorithm.

External SRAM
SRAM is implemented outside the FPGA and connected to it by a shared and
simple bidirectional bus. As an advantage, the throughput remains high,
though still lower than that of on-chip memories, and the storage capacity
is larger. However, SRAM is more expensive per MByte than other
high-capacity memory types, such as SDRAM, and it consumes more board
space per MByte than both SDRAM and FPGA on-chip memory, which consumes
almost none.

Flash memory
Flash memory is a non-volatile memory type external to the FPGA, since
FPGAs do not contain it. It has several advantages: it retains the data
after the power is off, and it is low cost, erasable, and durable. One
disadvantage is that it has a very high write latency, because writing
requires specific commands and bus transactions. Moreover, before flash
memory can be written, it must be erased, and since individual words
cannot be erased, entire sections of the flash must be erased as a unit.
Because of these advantages and disadvantages, flash memories are mainly
used to hold microprocessor boot code as well as any data that need to be
preserved in the case of a power failure.

Figure 5 Nios II embedded processor. Source: Altera [13].

Figure 6 Custom instructions types. Source: Altera [15].

Figure 7 FST flow chart.

Figure 8 TSST flow chart.

SDRAM
SDRAM is another type of volatile memory. It organizes the memory space in
columns, rows, and banks. It is similar to SRAM, but it must be refreshed
periodically to keep its data. It requires a specific hardware controller,
which handles the timing, the address multiplexing, and the refresh
cycles. Its advantages are that it is low cost and has a large capacity.
Moreover, its power consumption is lower than that of SRAM. As a drawback,
the required SDRAM controller takes up a major part of the interface.
SDRAM latency is always greater than that of regular external SRAM or FPGA
on-chip memory, although some types of SDRAM can achieve higher clock
frequencies than SRAM. Due to these advantages and drawbacks, the devices
that work with SDRAM are usually low cost and high capacity.
From these four types of memory, two were used here. Flash memory was
discarded as an option due to its high latency and low capacity;
additionally, non-volatile memory was not needed for this design. SRAM was
discarded because its cost per MByte is higher than that of SDRAM, and
designing a low-cost system was of prime importance. On-chip memory was
utilized because it is the fastest memory type available on the FPGA and
it is low cost. SDRAM was also used in the design because of its low cost
and very high capacity; moreover, another memory, apart from the on-chip
memory, was needed to hold the entire project.

Figure 9 2DLOG flow chart.

Figure 10 CopyBlock flow chart. For each byte in the source address range, copy the byte from the source address to the destination address.

3. Methodology
The methodology is presented in two parts: the first presents the
parameters of the designs obtained through the use of Nios II custom
instructions, and the second presents the improvement in throughput
achieved with every valid combination of on-chip memory and SDRAM in a
design using the Nios II processor.

3.1 Nios II custom instructions
Once the different custom instruction types and the advantages and
disadvantages of each one were determined, a profile of the three
presented algorithms was made using the well-known Code::Blocks tool [25].
This made it possible to go directly to the execution-time bottlenecks,
where the greatest performance improvement could be achieved by replacing
source code functions with custom instructions. For a better comprehension
of the profiling, flow charts are provided in Figures 7, 8, and 9 for FST
[4], TSST [13,15,16], and 2DLOG [14], respectively.
Figures 10, 11, and 12 show the CopyBlock, GetBlock, and GetCost
functions, respectively.
Tables 1, 2, and 3 show the profiling results obtained for complete
executions of the motion estimation process for FST [4], TSST [13,15,16],
and 2DLOG [14], respectively.
By examining the profiling results, some conclusions can be drawn about
the most appropriate part of the source code to be replaced by a specific
custom instruction.

Figure 11 GetBlock flow chart. For each row in the block, call Do DMA with the block row and the destination address.

Figure 12 GetCost flow chart.

The GetCost function was the first focus since, as shown in Tables 1 to 3,
it takes almost all the execution time. Considering the custom instruction
types, their weaknesses, and the structure of the GetCost source code, the
best approach was clearly a combinational custom instruction, due to its
speed (only one clock cycle). Although, at first glance, it might seem
that a multi-cycle custom instruction should be utilized, this was
impossible because of the result that must be accumulated when calculating
the SAD between each pair of pixels from the selected blocks. The extended
custom instruction was discarded because only one kind of operation was
needed between each pair of pixels (calculating the SAD). The internal
register file custom instruction extends the multi-cycle one, which was
discarded for the reasons previously discussed. The best candidate was
therefore the combinational (single-cycle) custom instruction, because of
its low latency (only one cycle) and its simplicity.
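
The division of work implied by this choice can be sketched as follows. The sketch is an assumption-laden illustration, not the GetCost source used in this work: `ci_absdiff` is a software model of a stateless single-cycle operation on one pair of pixels, and `get_cost` is a hypothetical GetCost-like loop over two blocks stored contiguously. The point it shows is that the custom logic keeps no state, so the running sum stays in an ordinary C variable.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of a combinational operation: it depends only on its two
   inputs and keeps no state between calls. */
static inline uint32_t ci_absdiff(uint32_t a, uint32_t b)
{
    return a > b ? a - b : b - a;
}

/* Hypothetical GetCost-like function: SAD of two blocks of n_pixels
   samples each. The accumulation is done in software. */
static uint32_t get_cost(const uint8_t *cur, const uint8_t *ref, size_t n_pixels)
{
    uint32_t sad = 0; /* accumulator lives in a processor register */
    for (size_t i = 0; i < n_pixels; i++)
        sad += ci_absdiff(cur[i], ref[i]);
    return sad;
}
```

On the target, the body of `ci_absdiff` would be replaced by the call to the custom instruction, leaving the loop and the accumulator unchanged.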
After applying the custom instruction, the performance enhancement for
every video-coding motion estimation algorithm on two different video
sequences is presented. Additionally, each search window size (8, 16, and
32 pixels), previously explained in Figure 2, was analyzed in every
possible combination with MB sizes of 16, 32, and 64 pixels. Every Nios II
processor type was also used in our experiments. All results for the
Foreman test sequence [26] are shown in Figure 13, and all results for the
Carphone test sequence [26] are shown in Figure 14. Tables 4 and 5 show
the execution time reduction achieved for the Foreman and Carphone test
benches [26], respectively.
Improvements did not depend on the search window size; results depended
only on the macro block size, the processor, and the algorithm, with a
larger macro block size correlating with a lower improvement. For example,
for the FST algorithm, the improvement ranged from about 10% to 55%, while
for the algorithms other than the pure exhaustive full search (TSST and
2DLOG), the improvement ranged from about 0% to 35%.
Regarding the FST algorithm, a noteworthy saving of execution time was
shown. Looking at each processor, a large improvement was seen with the
Nios II/e processor: around 40% of the execution time was saved,
independent of the search window, using macro block sizes of 16 and 32
pixels, and around 30% using a macro block size of 64 pixels. Using the
Nios II/s processor, the saved execution time dropped to around 20%,
roughly half. Using the Nios II/f processor, the maximum saving with FST
was obtained, around 45%, which fell to around 10% when using a macro
block size of 64 pixels.
Regarding 2DLOG, a noteworthy improvement in execution time, around 35%,
was shown with the Nios II/e processor using macro block sizes of 16 and
32 pixels, and around 20% using a macro block size of 64 pixels. With the
Nios II/s processor, the saving was roughly 10%, which is small compared
with the saving achieved with the FST algorithm. With the Nios II/f
processor, the improvement was smaller still, around 5%.
Finally, regarding the TSST algorithm, the tables show a large improvement
using the Nios II/e processor, saving around 35%, as with the 2DLOG
algorithm, whereas with the Nios II/s processor the improvement was around
10% of the execution time. The Nios II/f processor showed the smallest
improvement, only around 5%.
From the results, it can be determined that the obtained saving does not
depend on the window size; it only depends on the macro block size and the
Nios II processor type. The best improvements are achieved with the FST
algorithm and with the Nios II/f processor.
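
These figures can be read against a simple Amdahl-style bound, given here only as an illustrative sketch. The symbols p (fraction of the execution time spent in GetCost) and s (speed-up of that fraction obtained with the custom instruction) are notation introduced for this note, not quantities reported by the measurements.

```latex
\[
  \text{saved fraction of execution time} \;=\; p\left(1 - \frac{1}{s}\right) \;\le\; p
\]
```

With the FST profile of Table 1 (p ≈ 0.91), the saving cannot exceed roughly 91% however fast the instruction is. This assumes that the profile is representative of the processor being measured, which may not hold equally for the three Nios II cores.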

3.2 Memory system design
As previously seen, the advantages of each kind of memory can be exploited
to achieve a better design. The design was tested using every possible
combination of the selected memories (on-chip and SDRAM) with two
different video sequences. The on-chip memory was chosen because it is the
fastest memory available on the FPGA, and SDRAM because of its large
capacity and lower cost with good performance. The window size was fixed
to 32 pixels and the macro block size to 16 pixels to obtain a viable
comparison of the results achieved with each of the tested algorithms
(FST, 2DLOG, and TSST). Figures 15, 16, and 17 present the memory
combinations that resulted in a valid design and produced a correct
program output on our testing platform (Altera DE2 board) [27], which
incorporates a Cyclone II EP2C35F672C6 chip [28].

Table 1 FST [4] algorithm profiling

Function (FST)      % Time             Calls
CopyBlock           8.57               4844640
GetBlock            0.48               302790
GetCost             90.95              302691
FST                 approximately 0    99

Table 2 TSST [13,15,16] algorithm profiling

Function (TSST)     % Time             Calls
CopyBlock           approximately 0    38832
GetBlock            approximately 0    2427
GetCost             approximately 100  2328
TSST                approximately 0    99

Table 3 2DLOG [14] algorithm profiling

Function (2DLOG)    % Time             Calls
CopyBlock           approximately 0    43280
GetBlock            approximately 0    2705
GetCost             approximately 100  2606
2DLOG               approximately 0    99

Table 6 shows the exact memory system design used for each measurement.
Each of the memory regions that can be assigned by the designer is shown,
together with the kind of memory (on-chip or SDRAM) chosen for it.
Configuration types 2 and 3 (in italics) deliver the best performance,
since program memory is allocated in on-chip memory.
The second-best configuration groups are designs 6 to 8 and designs 13 to
16, whose main characteristic is that the Stack is allocated in on-chip
memory. The baseline case (number 1) uses SDRAM for every memory parameter
of the microprocessor design.

4. Final results: custom instructions and memory
choice
Following the two previous approaches, the embedded system was built by
combining them in order to enhance the performance results. All possible
memory configurations between on-chip and SDRAM memories were tested in
order to present the entire scope of possibilities when running the three
presented algorithms, with and without the designed custom instruction in
place of the source GetCost function, on every available Nios II processor
(e, s, and f). Table 7 presents an overview of the FPGA resources used for
each of the possible FPGA configurations, which cover all the
possibilities for implementing the tested designs.
In Table 7, the FPGA configurations are ordered as follows: the first four
rows correspond to the Nios II processor ‘e,’ the following four rows
correspond to the Nios II processor ‘s,’ and the last four rows correspond
to the Nios II processor ‘f.’ Within each processor, the group of four
rows is divided into two groups of two rows. The first two rows correspond
to the FPGA configuration without the custom instruction, and the last two
rows correspond to the FPGA configuration using the custom instruction. In
every pair of rows, the first one has the processor vectors (the reset
vector and the exception vector) allocated in the on-chip memory, and the
second one has them allocated in the SDRAM.

Figure 13 Throughput obtained for each algorithm, for each processor, and for each macro block size. With/without using the custom
instruction in the GetCost function with the Foreman test bench [26].

Figure 14 Throughput obtained for each algorithm, for each processor, and for each macro block size. With/without using the custom
instruction in the GetCost function with the Carphone test bench [26].

Table 4 Achieved improvements for the Foreman test bench [26]

                        Nios II/e                  Nios II/s                  Nios II/f
Technique  Window size  MB 16    MB 32    MB 64    MB 16    MB 32    MB 64    MB 16    MB 32    MB 64
FST        8 pixels     41.72%   41.88%   32.23%   20.02%   16.73%   15.84%   39.74%   36.22%   9.13%
FST        16 pixels    42.58%   42.90%   33.29%   21.08%   18.86%   17.91%   50.22%   44.96%   11.15%
FST        32 pixels    44.23%   43.21%   33.62%   21.61%   18.99%   18.42%   54.00%   53.85%   11.49%
2DLOG      8 pixels     32.35%   27.63%   20.63%   8.78%    7.25%    4.20%    6.36%    5.15%    4.13%
2DLOG      16 pixels    34.05%   30.05%   23.31%   11.63%   6.54%    3.94%    4.63%    8.00%    4.65%
2DLOG      32 pixels    35.45%   35.46%   25.61%   10.11%   10.33%   11.24%   5.50%    14.91%   5.39%
TSST       8 pixels     34.97%   33.46%   26.18%   11.24%   10.56%   8.98%    0.93%    11.21%   4.71%
TSST       16 pixels    34.67%   33.14%   26.50%   11.11%   9.44%    9.52%    4.42%    11.21%   4.12%
TSST       32 pixels    34.37%   33.46%   25.79%   9.55%    10.22%   8.77%    ~0.00%   11.21%   7.39%

Figures 18, 19, and 20 show at-a-glance results obtained running FST,
2DLOG, and TSST using a macro block size of 16 pixels.
It can be seen from the charts that the first eight designs form a fully
representative set of the memory system designs shown. For this reason,
results for macro block sizes of 32 and 64 pixels are presented using only
those eight memory system designs. Figures 21, 22, and 23 show the results
obtained running FST, 2DLOG, and TSST using a macro block size of 32
pixels.
Figures 24, 25, and 26 show the results obtained running FST, 2DLOG, and
TSST with the macro block size fixed to 64 pixels.
By examining the above charts, some conclusions can be drawn regarding the
tests run without custom instructions.
Regarding the Nios II processor ‘e,’ the designs are divided into three
families, regardless of the macro block size. The first family is formed
by designs 1, 4, 5, 9, 10, 11, and 12, which have the program text (.text)
and the Stack allocated in the SDRAM. The second family is formed by
designs 2 and 3, which differ in storing the program text in the on-chip
memory, though the Stack is stored in the SDRAM. The third family is
formed by designs 6, 7, 8, 13, 14, 15, and 16, which store the program
text in the SDRAM but the Stack in the on-chip memory. The second family
is the fastest, followed by the third family and, finally, the first one.
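
The grouping just described can be written down as a small rule over the memory assignments of Table 6. The C sketch below is only a reading aid: the type and function names (`Mem`, `MemDesign`, `nios2e_family`) are hypothetical, the table is transcribed from Table 6, and the rule encodes the three families reported for the Nios II processor ‘e’ (program text on-chip, Stack on-chip, or neither).

```c
/* Memory kinds available in the design. */
typedef enum { MEM_SDRAM = 0, MEM_ONCHIP = 1 } Mem;

/* One row of Table 6: where each memory region is placed. */
typedef struct {
    Mem reset, exception, stack, heap, rwdata, rodata, text;
} MemDesign;

#define S MEM_SDRAM
#define O MEM_ONCHIP

/* Designs 1 to 16 of Table 6 (index 0 holds design 1). */
static const MemDesign designs[16] = {
    {S, S, S, S, S, S, S}, {S, S, S, S, S, S, O},
    {S, S, S, S, S, O, O}, {S, S, S, S, O, S, S},
    {S, S, S, S, O, O, S}, {S, S, O, S, S, S, S},
    {S, S, O, S, O, S, S}, {S, S, O, S, O, O, S},
    {O, O, S, S, S, S, S}, {O, O, S, S, S, O, S},
    {O, O, S, S, O, S, S}, {O, O, S, S, O, O, S},
    {O, O, O, S, S, S, S}, {O, O, O, S, S, O, S},
    {O, O, O, S, O, S, S}, {O, O, O, S, O, O, S},
};

#undef S
#undef O

/* Family of a design on the Nios II processor 'e':
   2 = program text on-chip, 3 = Stack on-chip, 1 = both in SDRAM. */
static int nios2e_family(const MemDesign *d)
{
    if (d->text == MEM_ONCHIP)
        return 2;
    if (d->stack == MEM_ONCHIP)
        return 3;
    return 1;
}
```

Applied to the sixteen rows, the rule reproduces the membership listed above: designs 2 and 3 fall in the second family, designs 6 to 8 and 13 to 16 in the third, and the remaining seven in the first.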
Regarding the Nios II processor ‘s,’ the designs are divided into two
families, regardless of the macro block size. The first family is formed
by designs 1, 2, 3, 4, 5, 9, 10, 11, and 12, in which the Stack is stored
in the SDRAM. The second family, and the fastest, is formed by designs 6,
7, 8, 13, 14, 15, and 16, in which the Stack is placed in the on-chip
memory.

Table 5 Achieved improvements for the Carphone test bench [26]

                        Nios II/e                  Nios II/s                  Nios II/f
Technique  Window size  MB 16    MB 32    MB 64    MB 16    MB 32    MB 64    MB 16    MB 32    MB 64
FST        8 pixels     42.43%   41.76%   31.72%   19.12%   17.00%   15.67%   40.39%   35.83%   9.87%
FST        16 pixels    43.58%   42.87%   33.24%   21.02%   18.57%   17.74%   51.73%   49.03%   10.98%
FST        32 pixels    43.91%   43.17%   33.69%   21.44%   18.99%   18.48%   56.01%   53.73%   11.53%
2DLOG      8 pixels     28.76%   28.53%   17.92%   6.79%    7.91%    ~0.00%   0.92%    5.26%    1.75%
2DLOG      16 pixels    31.65%   30.63%   20.22%   10.44%   7.84%    5.56%    0.88%    6.06%    3.17%
2DLOG      32 pixels    31.05%   34.76%   26.27%   10.15%   9.36%    7.55%    5.13%    10.81%   4.94%
TSST       8 pixels     30.51%   33.65%   27.02%   9.84%    8.94%    8.54%    8.47%    10.38%   5.39%
TSST       16 pixels    30.99%   33.85%   25.61%   10.77%   10.11%   7.32%    6.72%    11.82%   5.39%
TSST       32 pixels    30.92%   33.52%   25.43%   10.42%   10.22%   7.78%    9.24%    10.19%   5.88%

Figure 15 Throughput obtained for each memory system design using FST. Time (msec) versus memory system design (1 to 16); FST, window size 32 pixels, macro block 16 pixels.

Regarding the Nios II processor ‘f,’ all the designs produce a similar
throughput using macro block sizes of 16 and 32 pixels, but with a macro
block size of 64 pixels, the designs are divided into two families that
correspond to the two families described for the Nios II processor ‘s.’
The effect of using the custom instruction is described next for each
algorithm.
Focusing on the FST algorithm, the use of custom instructions with the
Nios II processor ‘e’ reduces the time spent by the algorithm by nearly
50%, on average, in every family, regardless of the design and the macro
block size, except with a macro block size of 64 pixels, where a slightly
lower reduction, around 45% on average, was achieved. With the Nios II
processor ‘s,’ the benefit of the custom instruction depends only on the
design family: the first family achieves a gain between 30% and 35%,
whereas the second family obtains an improvement of nearly 50%. Finally,
with custom instructions on the Nios II processor ‘f,’ an improvement of
around 55% was achieved in every design when using macro block sizes of 16
and 32 pixels, but only around 20% when using a macro block size of 64
pixels.
Focusing on the 2DLOG algorithm, the use of the custom instruction on the
Nios II processor ‘e’ yields an improvement of around 35% for the first
design family, an improvement of 15% in execution time for the second
family, and a gain between 30% and 35% for the third. On the Nios II
processor ‘s,’ a gain between 10% and 15% was achieved using custom
instructions, regardless of the particular design, whereas the custom
instruction on the Nios II processor ‘f’ yielded an improvement of around
10% in every design.
Focusing on the TSST algorithm, using the custom instruction with the Nios
II processor ‘e’ reduced the execution time by between 25% and 35% for the
first design family, by 10% for the second family, and by between 30% and
35% for the third family. Using the Nios II processor ‘s’ and custom
instructions, an improvement between 10% and 15% was achieved, regardless
of the particular design or macro block size. Finally, using the Nios II
processor ‘f’ and custom instructions, a gain between 0% and 10% was
obtained in every design, regardless of the macro block size.
Gathering all the previous results from the experiments, the following
conclusions were drawn. For the Nios II processor ‘e,’ the fastest family
is the second one, which yielded an average gain that depends on the
algorithm: nearly 75% for FST, around 60% for 2DLOG, and nearly 60% for
TSST (in each case for MB sizes of 16, 32, and 64). For the Nios II
processor ‘s,’ the fastest family was also the second one, reducing the
execution time, on average, by nearly 75% for FST, around 20% for 2DLOG,
and nearly 40% for TSST (MB sizes of 16, 32, and 64). Finally, for the
Nios II processor ‘f,’ the average improvement was around 55% for FST,
around 40% for 2DLOG, and nearly 20% for TSST (MB sizes of 16, 32, and
64).

Figure 16 Throughput obtained for each memory system design using 2DLOG. Time (msec) versus memory system design (1 to 16); 2DLOG, window size 32 pixels, macro block 16 pixels.

Figure 17 Throughput obtained for each memory system design using TSST. Time (msec) versus memory system design (1 to 16); TSST, window size 32 pixels, macro block 16 pixels.

To validate these experiments, the most significant
subset of the memory system designs described previously
(designs 1 to 8) was also tested using the Carphone test
sequence [26], also described previously.

Table 6 Memory system design configuration

Design  Reset vector  Exception vector  Stack    Heap   Read/write data  Read-only data  Program
                                                        (.rwdata)        (.rodata)       (.text)
1       SDRAM         SDRAM             SDRAM    SDRAM  SDRAM            SDRAM           SDRAM
2       SDRAM         SDRAM             SDRAM    SDRAM  SDRAM            SDRAM           On-chip
3       SDRAM         SDRAM             SDRAM    SDRAM  SDRAM            On-chip         On-chip
4       SDRAM         SDRAM             SDRAM    SDRAM  On-chip          SDRAM           SDRAM
5       SDRAM         SDRAM             SDRAM    SDRAM  On-chip          On-chip         SDRAM
6       SDRAM         SDRAM             On-chip  SDRAM  SDRAM            SDRAM           SDRAM
7       SDRAM         SDRAM             On-chip  SDRAM  On-chip          SDRAM           SDRAM
8       SDRAM         SDRAM             On-chip  SDRAM  On-chip          On-chip         SDRAM
9       On-chip       On-chip           SDRAM    SDRAM  SDRAM            SDRAM           SDRAM
10      On-chip       On-chip           SDRAM    SDRAM  SDRAM            On-chip         SDRAM
11      On-chip       On-chip           SDRAM    SDRAM  On-chip          SDRAM           SDRAM
12      On-chip       On-chip           SDRAM    SDRAM  On-chip          On-chip         SDRAM
13      On-chip       On-chip           On-chip  SDRAM  SDRAM            SDRAM           SDRAM
14      On-chip       On-chip           On-chip  SDRAM  SDRAM            On-chip         SDRAM
15      On-chip       On-chip           On-chip  SDRAM  On-chip          SDRAM           SDRAM
16      On-chip       On-chip           On-chip  SDRAM  On-chip          On-chip         SDRAM
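Table 6 can also be read as a mapping from each design number to the set of sections placed in on-chip memory, with everything else in SDRAM. The sketch below transcribes the table into such a structure to make comparisons between designs easier; the names `ON_CHIP`, `memory_map`, and `designs_with_on_chip` are illustrative assumptions and do not come from the Nios II tools.

```python
# Sections in the column order of Table 6.
SECTIONS = ("reset", "exception", "stack", "heap", ".rwdata", ".rodata", ".text")

# Sections placed in on-chip memory for each design, transcribed from Table 6.
# Any section not listed for a design is placed in SDRAM.
ON_CHIP = {
    1: set(),
    2: {".text"},
    3: {".rodata", ".text"},
    4: {".rwdata"},
    5: {".rwdata", ".rodata"},
    6: {"stack"},
    7: {"stack", ".rwdata"},
    8: {"stack", ".rwdata", ".rodata"},
    9: {"reset", "exception"},
    10: {"reset", "exception", ".rodata"},
    11: {"reset", "exception", ".rwdata"},
    12: {"reset", "exception", ".rwdata", ".rodata"},
    13: {"reset", "exception", "stack"},
    14: {"reset", "exception", "stack", ".rodata"},
    15: {"reset", "exception", "stack", ".rwdata"},
    16: {"reset", "exception", "stack", ".rwdata", ".rodata"},
}


def memory_map(design):
    """Return {section: 'On-chip' or 'SDRAM'} for one design of Table 6."""
    on_chip = ON_CHIP[design]
    return {s: ("On-chip" if s in on_chip else "SDRAM") for s in SECTIONS}


def designs_with_on_chip(section):
    """List the designs that place the given section in on-chip memory."""
    return sorted(d for d, secs in ON_CHIP.items() if section in secs)


if __name__ == "__main__":
    print(memory_map(8))
    print(designs_with_on_chip("stack"))
```

Two properties of the table are easy to confirm from this form: the heap stays in SDRAM in all 16 designs, and designs 9 to 16 are the ones that move the reset and exception vectors on-chip.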

Table 7 FPGA used resources

Logic     Dedicated  I/O        Memory   M4Ks  DSP       DSP    DSP      Pins  Virtual  LUT-only  Register-only  LUT/register
cells     logic      registers  bits           elements  9 × 9  18 × 18        pins     LCs       LCs            LCs
          registers
2,202(1)  1,050(0)   52(52)     306,176  78    0         0      0        56    0        1,152(1)  148(0)         902(0)
2,198(1)  1,050(0)   52(52)     306,176  78    0         0      0        56    0        1,148(1)  148(0)         902(0)
2,352(1)  1,051(0)   52(52)     306,176  78    0         0      0        56    0        1,301(1)  148(0)         903(0)
2,348(1)  1,051(0)   52(52)     306,176  78    0         0      0        56    0        1,297(1)  146(0)         905(0)
3,152(1)  1,755(0)   52(52)     341,632  87    4         0      2        56    0        1,397(1)  228(0)         1,527(0)
3,150(1)  1,755(0)   52(52)     341,632  87    4         0      2        56    0        1,395(1)  225(0)         1,530(0)
3,284(1)  1,756(0)   52(52)     341,632  87    4         0      2        56    0        1,528(1)  223(0)         1,533(0)
3,285(1)  1,756(0)   52(52)     341,632  87    4         0      2        56    0        1,529(1)  223(0)         1,533(0)
3,868(1)  2,175(0)   52(52)     377,088  98    4         0      2        56    0        1,693(1)  363(0)         1,812(0)
3,858(1)  2,175(0)   52(52)     377,088  98    4         0      2        56    0        1,683(1)  360(0)         1,815(0)
3,973(1)  2,178(0)   52(52)     377,088  98    4         0      2        56    0        1,795(1)  360(0)         1,818(0)
3,968(1)  2,178(0)   52(52)     377,088  98    4         0      2        56    0        1,790(1)  360(0)         1,818(0)



Figure 18 Performance of final throughput (custom instruction + memory optimization) for the FST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 16; FST, window size 32 pixels, macro block 16 pixels; series: processors E, S, and F, each with custom instructions (CI) off and on.)

Figure 19 Performance of final throughput (custom instruction + memory optimization) for the 2DLOG algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 16; 2DLOG, window size 32 pixels, macro block 16 pixels; series: processors E, S, and F, each with CI off and on.)

Figure 20 Performance of final throughput (custom instruction + memory optimization) for the TSST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 16; TSST, window size 32 pixels, macro block 16 pixels; series: processors E, S, and F, each with CI off and on.)



Figure 21 Performance of final throughput (custom instruction + memory optimization) for the FST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; FST, window size 32 pixels, macro block 32 pixels; series: processors E, S, and F, each with custom instructions (CI) off and on.)

Figure 22 Performance of final throughput (custom instruction + memory optimization) for the 2DLOG algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; 2DLOG, window size 32 pixels, macro block 32 pixels; series: processors E, S, and F, each with CI off and on.)

Figure 23 Performance of final throughput (custom instruction + memory optimization) for the TSST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; TSST, window size 32 pixels, macro block 32 pixels; series: processors E, S, and F, each with CI off and on.)



Figure 24 Performance of final throughput (custom instruction + memory optimization) for the FST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; FST, window size 32 pixels, macro block 64 pixels; series: processors E, S, and F, each with custom instructions (CI) off and on.)

Figure 25 Performance of final throughput (custom instruction + memory optimization) for the 2DLOG algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; 2DLOG, window size 32 pixels, macro block 64 pixels; series: processors E, S, and F, each with CI off and on.)

Figure 26 Performance of final throughput (custom instruction + memory optimization) for the TSST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; TSST, window size 32 pixels, macro block 64 pixels; series: processors E, S, and F, each with CI off and on.)



Figure 27 Performance of final throughput (custom instruction + memory optimization) for the FST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; FST, window size 32 pixels, macro block 16 pixels; series: processors E, S, and F, each with custom instructions (CI) off and on.)

Figure 28 Performance of final throughput (custom instruction + memory optimization) for the 2DLOG algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; 2DLOG, window size 32 pixels, macro block 16 pixels; series: processors E, S, and F, each with CI off and on.)

Figure 29 Performance of final throughput (custom instruction + memory optimization) for the TSST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; TSST, window size 32 pixels, macro block 16 pixels; series: processors E, S, and F, each with CI off and on.)



Figure 30 Performance of final throughput (custom instruction + memory optimization) for the FST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; FST, window size 32 pixels, macro block 32 pixels; series: processors E, S, and F, each with custom instructions (CI) off and on.)

Figure 31 Performance of final throughput (custom instruction + memory optimization) for the 2DLOG algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; 2DLOG, window size 32 pixels, macro block 32 pixels; series: processors E, S, and F, each with CI off and on.)

Figure 32 Performance of final throughput (custom instruction + memory optimization) for the TSST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; TSST, window size 32 pixels, macro block 32 pixels; series: processors E, S, and F, each with CI off and on.)



Figure 33 Performance of final throughput (custom instruction + memory optimization) for the FST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; FST, window size 32 pixels, macro block 64 pixels; series: processors E, S, and F, each with custom instructions (CI) off and on.)

Figure 34 Performance of final throughput (custom instruction + memory optimization) for the 2DLOG algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; 2DLOG, window size 32 pixels, macro block 64 pixels; series: processors E, S, and F, each with CI off and on.)

Figure 35 Performance of final throughput (custom instruction + memory optimization) for the TSST algorithm on the Nios II processor (‘economic’, ‘standard’, and ‘fast’).
(Plot of time (msec) versus design, 1 to 8; TSST, window size 32 pixels, macro block 64 pixels; series: processors E, S, and F, each with CI off and on.)



Figures 27, 28, and 29 show the results obtained by
running FST, 2DLOG, and TSST with the macro block size
fixed to 16 pixels. Figures 30, 31, and 32 show the results
with the macro block size fixed to 32 pixels. Finally,
Figures 33, 34, and 35 show the results with the
macro block size fixed to 64 pixels.
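For readers less familiar with the workload being timed in these figures, the following is a minimal sketch of an exhaustive (full-search) block matcher. It assumes the sum of absolute differences (SAD) as the matching cost and a maximum displacement `p` in each direction; both are assumptions made here for illustration, as are all function and parameter names, and the sketch is not the implementation measured in this work.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` whose
    top-left corner is (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Test every displacement in [-p, p] x [-p, p] that keeps the candidate
    block inside `ref`; return the best (dx, dy) and its SAD cost."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue  # candidate block would fall outside the frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


if __name__ == "__main__":
    size = 12
    # Synthetic reference frame with no repeated structure.
    ref = [[x * x + 3 * y * y + x * y for x in range(size)] for y in range(size)]
    # Current frame: the reference content displaced by (dx, dy) = (2, 1).
    cur = [[ref[min(y + 1, size - 1)][min(x + 2, size - 1)] for x in range(size)]
           for y in range(size)]
    print(full_search(cur, ref, bx=4, by=4, n=4, p=2))
```

Under these assumptions the exhaustive search evaluates up to (2p + 1)² candidate positions per block, each costing n² absolute differences, whereas step-based searches visit only a small subset of positions. This is consistent with the FST plots using a time axis roughly two orders of magnitude larger than the 2DLOG and TSST plots.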

5. Conclusions
In this paper, the problem of accelerating block-matching
motion estimation algorithms, widely used in MPEG video
coding (H.264/6), was addressed by building a low-cost
FPGA-based embedded system on the Altera DE2 platform
and customizing the well-known Nios II soft-core
microprocessor. Two techniques were first evaluated
separately: a custom instruction paradigm, implemented
through a combinational instruction, and an efficient
combination of on-chip memory and SDRAM for the reset
vector, exception vector, stack, heap, read/write data
(.rwdata), read-only data (.rodata), and program text
(.text) of the design. The two techniques were then
combined to build the final embedded system.
Custom instructions reduced the execution time by 23% on
average and by close to 55% in the best case, which is a
considerable saving. A better use of the memory types
available in the design reduced the execution time by
61%. Combining both techniques reduced the execution
time by 75% with respect to the base case.
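As a rough plausibility check on these figures, one can ask what combined reduction would result if the two techniques acted independently, so that the remaining fractions of execution time multiply. This independence is an assumption made here for illustration only and is not claimed in this work; the function names are hypothetical.

```python
def combined_reduction(*reductions):
    """Combined fractional time reduction if each technique independently
    scales the remaining execution time by (1 - r)."""
    remaining = 1.0
    for r in reductions:
        if not 0.0 <= r < 1.0:
            raise ValueError("each reduction must be in [0, 1)")
        remaining *= 1.0 - r
    return 1.0 - remaining


def speedup_from_reduction(r):
    """Speedup factor equivalent to a fractional time reduction r."""
    if not 0.0 <= r < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - r)


if __name__ == "__main__":
    # Average custom-instruction gain (23%) with the memory gain (61%).
    print(f"{combined_reduction(0.23, 0.61):.1%}")
    # Best-case custom-instruction gain (55%) with the memory gain (61%).
    print(f"{combined_reduction(0.55, 0.61):.1%}")
    # Speedup equivalent to the reported 75% reduction.
    print(f"{speedup_from_reduction(0.75):.1f}x")
```

Under this assumption the average custom-instruction gain combined with the memory gain gives about 70%, and the best-case gain gives about 82%. The reported 75% lies between the two and corresponds to a speedup of 4 over the base case.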

Competing interests
The authors declare that they have no competing interests.

Acknowledgments
The authors would like to thank Altera for the hardware and software
provided under its University Program. The authors would also
like to thank Professor Uwe Meyer-Bäse from Florida State University for
his help and support regarding digital signal processing with FPGAs.
This work has been partially supported by Spanish Projects TIN 2008/508 and
TIN 2012/32180.

Received: 15 February 2013 Accepted: 29 May 2013
Published: 16 June 2013

References
1. D Marpe, T Wiegand, GJ Sullivan, The H.264/MPEG4 advanced video coding
standard and its applications. IEEE Commun Mag 44, 134–143 (2006)

2. ITU-T Recommendation H.264 (draft), International standard for advanced
video coding (ITU-T, Geneva, 2003)

3. ITU-T Recommendation H.264 & ISO/IEC 14496-10 (MPEG-4) AVC, Advanced
Video Coding for Generic Audiovisual Services (ITU-T, Geneva, 2005)

4. J Konrad, Estimating motion in image sequences. IEEE Signal Process Mag
16, 70–91 (1999)

5. S Kappagantula, K-R Rao, Motion compensated interframe image
prediction. IEEE Trans Commun 33, 1011–1015 (1985)

6. C-J Kuo, C-H Yeh, S-F Odeh, Polynomial search algorithms for motion
estimation, in Proceedings of the 1999 IEEE International Symposium on
Circuits and Systems (ISCAS’99), vol. 4 (Orlando, 1999), pp. 215–218

7. S Zhu, K-K Ma, A new diamond search algorithm for fast block-matching
motion estimation. IEEE Trans Image Process 9, 287–290 (2000)

8. S Zhu, Fast motion estimation algorithms for video coding. M.S. thesis,
Nanyang Technological University, Singapore, 1998

9. F Ayuso, G Botella, C García, M Prieto, F Tirado, GPU-based acceleration of
bio-inspired motion estimation model. Concurrency and Computation:
Practice and Experience 25, 1037–1056 (2013). doi:10.1002/cpe.2946

10. G Botella, A García, M Rodriguez-Alvarez, E Ros, U Meyer-Bäse, MC Molina,
Robust bioinspired architecture for optical-flow computation. IEEE Trans.
VLSI Syst. 18(4), 616–629 (2010)

11. C Garcia, G Botella, F Ayuso, M Prieto, F Tirado, Multi-GPU based on
multicriteria optimization for motion estimation system. EURASIP Journal
on Advances in Signal Processing 2013, 23 (2013)

12. D González, G Botella, U Meyer-Baese, C García, C Sanz, M Prieto-Matías, F
Tirado, A low-cost matching motion estimation sensor based on the NIOS
II microprocessor. Sensors 12, 13126–13149 (2012)

13. T Koga, K Iinuma, A Hirano, Y Iijima, T Ishiguro, Motion compensated
interframe coding for video conferencing, in Proc. of the Nat.
Telecommunications Conference (New Orleans, LA, 1981), pp. G5.3.1–G5.3.5

14. J-R Jain, A-K Jain, Displacement measurement and its application in
interframe image coding. IEEE Trans Commun 29, 1799–1808 (1981)

15. B Liu, A Zaccarin, New fast algorithms for estimation of block motion
vectors. IEEE Trans. Circuit. Syst. Video Technol. 3, 148–157 (1993)

16. R Li, B Zeng, M-L Liou, A new three-step search algorithm for block motion
estimation. IEEE Trans. Circuit. Syst. Video Technol. 4, 438–442 (1994)

17. Altera, Nios II processor: the world's most versatile embedded processor,
(2013). http://www.altera.com/devices/processor/nios2/ni2-index.html.
Accessed 10 June 2013

18. P Chu, Embedded SoPC Design with NIOS II Processor and Examples (Wiley,
Hoboken, 2012)

19. Altera, Nios II performance benchmarks, (2013). http://www.altera.com/
literature/ds/ds_nios2_perf.pdf. Accessed 10 June 2013

20. Altera, Documentation: Nios processor, (2013). http://www.altera.com/
literature/lit-nio.jsp. Accessed 10 June 2013

21. Arm, ARM: the architecture for the digital world, (2013). http://www.arm.
com/products/processors/classic/arm9/. Accessed 10 June 2013

22. Altera, Stratix II FPGA: high performance with great signal integrity, (2013).
http://www.altera.com/devices/fpga/stratix-fpgas/stratix-ii/stratix-ii/st2-index.
jsp. Accessed 10 June 2013

23. Altera, Hardware acceleration, (2013). http://www.altera.com/devices/
processor/nios2/benefits/performance/ni2-acceleration.html. Accessed 10
June 2013

24. Altera, Nios II custom instruction user guide, (2013). http://www.altera.com/
literature/ug/ug_nios2_custom_instruction.pdf. Accessed 10 June 2013

25. Y Mandravellos, Code::Blocks IDE, (2006). https://launchpad.net/codeblocks.
Accessed 10 June 2013

26. C Yushin, CIPR sequences, (2013). http://www.cipr.rpi.edu/resource/
sequences/. Accessed 10 June 2013

27. Altera, DE2 development and education board, (2013). http://www.altera.
com/education/univ/materials/boards/de2/unv-de2-board.html.
Accessed 10 Feb 2013

28. Altera, Cyclone II FPGAs at cost that rivals ASICs, (2012). http://www.altera.
com/devices/fpga/cyclone2/cy2-index.jsp. Accessed 10 June 2013

doi:10.1186/1687-6180-2013-118
Cite this article as: González et al.: Acceleration of block-matching
algorithms using a custom instruction-based paradigm on a Nios II
microprocessor. EURASIP Journal on Advances in Signal Processing 2013
2013:118.

