A Hardware Implementation of a Run-Time Scheduler for Reconfigurable Systems

Juan Antonio Clemente, Javier Resano, Carlos González and Daniel Mozos

Manuscript received August 25, 2009. This work was supported by the Spanish Department of Science and Innovation under grants TIN2009-09806 and AYA2009-13300. J. A. Clemente is with the Computer Architecture Department, Universidad Complutense de Madrid, Madrid, Spain (phone: +34 620390713; fax: +34 913947510; e-mail: ja.clemente@fdi.ucm.es). J. Resano was with the Computer Architecture Department, Universidad Complutense de Madrid, Madrid, Spain. He is now with the Computer Engineering Department, Universidad de Zaragoza, Zaragoza, Spain (e-mail: jresano@unizar.es). C. González and D. Mozos are with Universidad Complutense de Madrid, Madrid, Spain (e-mails: carlosgonzalez@fdi.ucm.es; mozos@fis.ucm.es).

Abstract—New generation embedded systems demand high performance, efficiency and flexibility. Reconfigurable hardware can provide all these features. However, the costly reconfiguration process and the lack of management support have prevented a broader use of these resources. To solve these issues we have developed a scheduler that deals with task-graphs at run-time, steering their execution on the reconfigurable resources while carrying out both prefetch and replacement techniques that cooperate to hide most of the reconfiguration delays. In our scheduling environment task-graphs are analyzed at design-time to extract useful information. This information is used at run-time to obtain near-optimal schedules, escaping from local-optimum decisions, while carrying out only simple computations. Moreover, we have developed a hardware implementation of the scheduler that applies all the optimization techniques while introducing a delay of only a few clock cycles. In the experiments our scheduler clearly outperforms conventional run-time schedulers based on As-Soon-As-Possible techniques. In addition, our replacement policy, specially designed for reconfigurable systems, achieves almost optimal results regarding both reuse and performance.

Index Terms—Field Programmable Gate Arrays, Reconfigurable Architectures, Task scheduling.

I. INTRODUCTION

In the last few years embedded devices have become more and more complex, including functionality initially developed for general-purpose platforms, such as multimedia support (sound processing, texture rendering, image and video displaying…). In fact the new generation of portable devices has inherited the area and energy constraints of embedded systems, and at the same time they must achieve the performance required by multimedia applications. The best way to meet these constraints is to include some HW support that can speed up the execution, and even reduce the energy consumption. Traditionally this migration was carried out by developing application-specific integrated circuits (ASICs), which are silicon circuits customized for a particular use. However, although using ASICs is a very efficient option, three important drawbacks prevent their use as a general solution. Firstly, the HW area in embedded/portable devices is very constrained; hence only very critical functionality can be migrated to HW. Secondly, developing a new ASIC involves an increase in time-to-market, which is frequently a key factor for the success of a platform. Finally, their functionality is fixed, and cannot be updated in order to fix detected bugs or improve the efficiency of the system.
One interesting option to overcome these three limitations is to include reconfigurable HW resources: run-time reconfiguration allows reusing the same HW for different functionalities in order to meet the area constraints; the time-to-market is considerably shorter for reconfigurable HW than for ASICs, because the physical platform has already been tested, and the new functionality can be tested on the target board from the beginning of the design cycle; finally, it offers an interesting trade-off between performance and flexibility. Thus, this technology is especially suitable for applications that have dynamic and/or unpredictable behavior. In fact Sony™ has developed its own reconfigurable architecture, and has included it in some portable devices [1].

In embedded systems, applications are often represented as one or several Directed Acyclic Graphs (DAGs), where the nodes specify computational tasks and the edges represent precedence constraints. Managing the execution of these graphs efficiently is critical for embedded systems. Therefore, it is essential to optimize it. When dealing with reconfigurable systems several issues must be taken into account in order to deal with DAGs efficiently:

• The system must manage the task-graph information and must guarantee that the execution meets the precedence constraints.

• It must schedule the task execution attempting to achieve the required performance.

• It must also efficiently schedule the run-time reconfigurations. Most current reconfigurable devices include only one reconfiguration circuit, and the reconfiguration latencies are frequently very significant (of the order of milliseconds); hence, if many reconfigurations are demanded in a short period of time, the performance of the system can be seriously affected. When this happens the reconfigurations of the most critical tasks must be scheduled first in order to minimize the reconfiguration overhead.

• Finally, it must attempt to optimize the use of the reconfigurable resources. On the one hand, the system must assign to each task a suitable resource. On the other hand, since the system usually deals with recurring tasks the scheduler must promote the reuse of the most important tasks, avoiding their delay due to the reconfiguration latency. This can be done applying the proper replacement policy.

Moreover, all these problems must be addressed in such a way that the run-time delay generated by all the management/scheduling techniques is as low as possible, in order to prevent performance degradations. In this regard, we have developed a complete scheduling flow targeting DAGs in a HW multi-tasking system, as well as an efficient implementation of a scheduler that includes all these optimization techniques. We have also tested our implementation in a Virtex-II PRO FPGA.

The rest of the paper is organised as follows: next section describes the contributions of this article. Section III shows a motivational example and Section IV overviews the related work. Section V describes in detail the proposed scheduling flow. Section VI presents the implementation details of the proposed scheduler and Section VII presents the experimental results. Finally, Section VIII explains our conclusions and indicates some lines for future work.

II.
CONTRIBUTIONS OF THIS WORK Our target system comprises a fixed number of reconfigurable units (RUs) with similar area (as it is shown in Fig. 1), where tasks can be reconfigured and executed. These RUs are connected among them by means of an interconnection network as it was initially proposed in [2]. Our scheduler receives as input a task graph (represented as a DAG) and steers its execution in our HW multi-tasking system taking into account its internal data dependencies. A task is the basic scheduling unit (i.e. a node of a task graph). Other processing elements may be present in the platform, as processors, DSPs, GPUs… We assume that the OS or middleware assigns the task graphs to the processing elements. The proposed algorithm is a mixed design-time/run-time approach, since we are looking for a good-quality schedule but without carrying out too many computations at run-time. At design-time, graphs are analyzed in order to extract some useful information that will be used at run-time. Basically we characterize each task with three parameters: weight, criticality, and mobility. At run-time the scheduler uses the weight parameter to decide the reconfiguration order. The idea is to reconfigure first those tasks that have a greater impact in the critical path of the graph. The criticality identifies the delays that the reconfiguration of each task may generate, and it is used to assign greater priority to the tasks that generate greater delays. Finally, the mobility is used to escape from local-optimum scheduling decisions delaying at run-time some reconfigurations. The scheduler steers the execution of the task graphs sequentially in the RUs taking into account both the precedence constraints and the available resources. In addition it applies a prefetch approach in order to carry out the reconfigurations in advance. The configuration that is fetched in advance is selected according to the weight of the candidates. Since several important scheduling decisions must be taken at run-time, we initially selected an As Soon As Possible (ASAP) scheduling strategy since it provides a good trade-off between the run-time complexity and the quality of the schedules. However, it is well-known that greedy ASAP strategies often fall into locally optimum decisions that can reduce the performance of the system. To attempt to escape from these local-optimum solutions, while increasing the run- time complexity as little as possible, we have developed an extended ASAP approach that sometimes delays the reconfiguration of a task taking into account its mobility and the state of the system. In addition the scheduler also applies a replacement policy designed not only to maximize the task reuse when dealing with recurring applications, but also to improve the overall system performance collaborating with the prefetch technique. In fact, the replacement policy is the key factor to decide whether to delay the reconfiguration of a task or not. We have evaluated our scheduler by executing task graphs extracted from actual multimedia applications. We have compared the results obtained by our scheduler with several reference systems that apply an ASAP scheduling approach combined with well-known replacement policies, such as LRU (Last Recently Used) or LFD (Longest Forward Distance). The latter case is especially interesting since LFD is the replacement policy that guarantees the optimal reuse percentage as it was demonstrated in [3]. However, LFD Interconnection network DSPProc. 2 RU1 RU2 RU3 RU4 Rec. HW Proc. 
1 OS/Middleware: Assigns tasks to the different resources DAG Scheduler Running Applications t=2 t=8t=5 t=12 … Target platform I/O SRAM SDRAM Fig. 1. Target architecture and execution scheme. TVLSI-00330-2009 3 cannot be applied in dynamic scenarios, because it can only be used in static systems where the future events are known. Nevertheless, LFD can be used as a reference to obtain an upper-bound of the reuse percentage for a given experiment. The experiments demonstrate that our replacement heuristic achieves almost as good results regarding reuse as LFD, and that our scheduler always obtains better results regarding performance than an ASAP approach, even when it applies the LFD replacement heuristic. We have also evaluated the run-time overheads generated by our approach. For this purpose, we have developed two different versions of the system. The first one is a SW module that is executed in an embedded processor. However, for some benchmarks this version generates significant overheads, mostly due to the management of the complex data-structures and the HW/SW communications. To reduce these overheads, we have also developed an efficient implementation of the run- time scheduler using some of the reconfigurable resources. This implementation delays the execution only a few clock cycles. These two versions offer different trade-offs between the run-time management overhead and the cost needed to implement the scheduler. There are many other areas of interest regarding HW multi- tasking systems. One of them is task placement. This may be sometimes a major problem, since a sub-optimal placement can lead to infeasible schedules that do not meet the system constraints, as it is proved in [4]. Frequently the reason is that the communication topology is not compatible with the selected mapping. In our target architecture we assume that the communication infrastructure is contention-aware and provides enough bandwidth so that even the worst-case communication is performed correctly and efficiently. This can be achieved by means of a bus with enough bandwidth or a contention-aware Network on Chip that implements a wormhole routing algorithm or any similar technique that guarantees that the latency of a transmitted message between 2 whatever nodes is almost constant, no matter the distance between them. Hence, under these assumptions we can safely assume that any task can be safely implemented on any of the RUs. Other areas of interest regarding HW multi-tasking systems are inter-task communications, HW/SW partitioning, area fragmentation, low power concerns… However, these issues are orthogonal to the problem that we are targeting. Hence, we will also assume that the system OS or the middleware will take care of these decisions and we will just focus on what happens once a task graph has been assigned to the reconfigurable resources. To test our scheduler we have developed a simplified simulation environment and implemented it in a FPGA. This environment simulates the reconfiguration and the execution of task-graphs using programmable timers to simulate the behaviour of the RUs, and provides clock-cycle precision to measure the overheads generated by the scheduler. III. MOTIVATIONAL EXAMPLE Fig. 2 shows a motivational example that illustrates how our scheduler can escape from local-optimum solutions delaying some reconfigurations. In this example the task graph is executed twice in a system with 3 RUs, applying prefetch with an ASAP approach (a) and delaying some reconfigurations (b). 
Fig. 2. Execution of a task graph in a platform with three RUs applying an ASAP approach (a) and using our scheduling technique, which achieves the optimal solution by delaying one of the reconfigurations (b).

In both approaches it can be seen that the prefetch technique is very powerful when dealing with DAGs since it hides the latency of most reconfigurations. The simplest way to apply prefetch is using an ASAP approach, since it is very simple to implement and it generally leads to good schedules. When more than one task can be reconfigured we select the one with the greatest weight (as we will describe later). In this example, both tasks 2 and 3 are ready for reconfiguration simultaneously, and task 3 is selected since it is part of the critical path. With this approach the latency of three of the four tasks is hidden. Task 1 is the only one that generates delays due to its reconfigurations. In our scheduling environment we will say that task 1 is critical whereas the others are non-critical. However, although our replacement policy assigns greater priority to Task 1, in Fig. 2 (a) this task is replaced by Task 4, because in that instant it was the only available candidate. For this reason in the second iteration the reconfiguration of Task 1 introduces again a delay in the execution. This delay will disappear if we delay the reconfiguration of Task 4 (Fig. 2 (b)). In this case, our scheduler knows that the reconfiguration of Task 4 can be delayed without any performance degradation (in Section V we will explain how this information is obtained). Hence, when Task 1 is selected as the replacement victim, the scheduler decides to delay this reconfiguration waiting for the following event (i.e. the end of the execution of Task 2). Thus, when Task 2 finishes its execution the replacement policy can select between two replacement victims (tasks 1 and 2), and it will select Task 2 since it is non-critical. As a consequence, when the task graph is executed again, Task 1 will be directly reused and no reconfiguration will generate any delay in the execution.

IV. RELATED WORK

During recent years many research groups have developed techniques that attempt to reduce the reconfiguration overhead. An interesting survey of most of them can be found in [5]. Many authors, such as [2, 6, 7, 8, 9], have proposed to build a HW multi-tasking platform by dividing the entire reconfigurable area into smaller RUs. The first implementation of such a system in a commercial FPGA was presented by Marescaux et al. in [2]. In this work, the authors propose to divide the entire reconfigurable area into identical tiles, connected among them by an interconnection network (ICN). Recently Nollet et al. [10] have proposed to extend this approach applying the idea of configuration hierarchy to build a reconfigurable Multi-Processor System-on-a-Chip (MP-SoC). Basically, instead of executing dedicated HW tasks in the reconfigurable resources, they use these resources to implement programmable softcores that will execute SW tasks. They have developed task mapping heuristics to assign tasks to the different processors (both hard- and soft-processors), taking into account the communication topology.
We believe that the ideas of this work are compatible with our approach, although we are not focusing on soft-cores. In [6], Walder et al. present a run-time environment to execute HW tasks by partially reconfiguring a Xilinx™ Virtex II PRO. This work does not deal with task-graphs, but with independent tasks, and the main objective is to develop a partition technique to decide the optimal size of the RUs. In our work we assume that this has already been decided, hence this work is again orthogonal to our work. In [7], Qu et al. propose adding more reconfiguration controllers to carry out several reconfigurations in parallel. This could improve the efficiency of a reconfigurable multi- tasking system, but currently the commercial platforms only include one reconfiguration controller. In [11] and [12] the authors present two interesting approaches for HW multi-tasking. In this case the objective is to take full advantage of the data-parallelism for a given application by replicating the same task several times. This is again compatible with our work. Several versions of the same task with different level of data parallelism can be identified at design-time, and at run-time the OS or the middleware could select the appropriated one and then send it to our DAG scheduler. In fact this is what we have done in order to generate one of the benchmarks that we will use in the experimental section (Parallel-JPEG). Other interesting contributions for HW multi-tasking systems are the research efforts to develop OS support to simplify the use of the reconfigurable HW. Some relevant examples are the works developed by Kosciuszkiewicz et al. [13] and H. Kwok-Hay So et al. [14]. They propose to extend an Embedded Linux OS to support HW tasks. The idea is to provide support to let the user transparently deal with HW tasks at run-time as they were regular threads. A related topic is how to decide, again transparently to the user, which tasks must be executed in the reconfigurable resources. HW/SW partitioning has been a very active research field during the last decade: in [15], Fu and Compton deal with a HW/SW multitasking system, which includes reconfigurable resources and a UltraSparc processor. In order to improve the performance some tasks are assigned to the reconfigurable HW. They have developed several algorithms to select these tasks taking into account the needs of each application, the available resources, and the reconfiguration overheads. Previously some research groups have proposed to include HW scheduling support for reconfigurable systems. In [8], the authors propose a HW micro-architecture to deal with task management at run-time efficiently applying a list-scheduling heuristic. However, the authors did not implement their design, but they only included it in their specific simulation environment. In [9] the authors have extended [8] in order to support control dependencies in the task graph. However, again they do not implement their approach. Moreover, they assume that all the reconfigurations can be carried out in parallel; hence they do not need to schedule the reconfigurations. This simplifies considerably their scheduler. However, as it was explained before, currently this is not a realistic assumption. In addition, they do not provide replacement support, and they follow a greedy list-scheduling approach that cannot escape from local-optimum decisions. In [16] the authors propose a scheduling methodology for a real- time system based on reconfigurable HW and implement it in a FPGA. 
They use an earliest-deadline first approach but grouping several tasks in order to apply efficiently full reconfigurations. Since they do not support partial reconfiguration, they do not have to apply any prefetch approach or a replacement policy because they only have one RU. In addition they deal with independent periodic tasks; hence the scheduling support that they have developed is completely different from our approach. Hence our work clearly provides important novel contributions. First, we not only provide a prefetch approach, which was initially proposed in [17], but also a replacement technique designed to collaborate with our scheduler and improve the overall results. Second, we have found a simple way to escape from local optimum decisions. In this regard, many researchers have proposed off-line scheduling techniques that can escape from local optimum solutions. However, when dealing with on-line systems this is much more complex and only simple (and mostly greedy) algorithms are used as list-scheduling or earliest-deadline first. Third, we have implemented our scheduler and tested it in a FPGA using graphs extracted from actual multimedia applications. Fourth, we have implemented TVLSI-00330-2009 5 an equivalent SW version and analyzed the design trade-offs. Our research group has been already active in this field for many years. We already presented a scheduling environment for reconfigurable systems in [18] and [19]. In these works we proposed several techniques that also deal with task-graphs at run-time. However, in a recent work [20] we identified that even our simple techniques generate important overheads when dealing at run-time with complex data structures, such as task-graphs, especially if they were executed in an embedded processor. Hence we decided to develop a generic HW task- graph execution manager for reconfigurable systems [21]. This module receives as inputs a task-graph and its schedule and steers the execution of that task-graph following the instructions given in the schedule. The main achievement presented here was a HW implementation of that manager, which significantly speeds up this management process with respect to an equivalent SW version. The next step has been to add a scheduling layer to this management module, thereby developing a task-graph scheduler for reconfigurable systems. Of course, we have tried to keep those ideas that proved to be useful in our previous work, such as a replacement policy that collaborates with the scheduler, and trying to extract useful information at design- time. However, we have developed new techniques that can be efficiently implemented in HW, and we have extracted some extra information at design-time. A preliminary version of this work has been published in [22]. At that time our scheduler applied a simple ASAP approach, and our HW implementation was very area-hungry. Since then, we have developed a new scheduler that delays some reconfigurations in order to escape from local-optimum decisions, we have drastically improved the area scalability of our design, we have improved the replacement strategy, and we have carried out new experiments in order to better test our approach. V. THE SCHEDULING FLOW In this section we will explain the proposed scheduling flow. In the following, we will refer to a node of a task graph as a task. Our scheduler flow includes two phases: design-time and run-time. 
At design-time, the task graphs are analyzed in order to extract some information that the scheduler will use to optimize its decisions. At run-time the scheduler steers the execution of the graph in the set of RUs of the proposed architecture, applying a prefetch technique and a replacement policy and taking into account the mobility of the tasks. The next sub-sections describe the design-time and the run-time phases, respectively.

A. Design-time phase

This phase is basically a compilation phase, which is necessary in order to extract useful information that will be used at run-time. Each task is characterized with three parameters: weight, criticality and mobility.

1) Weight calculation

Initially the scheduler assigns a weight to each task according to this simple algorithm: the weight of a leaf task (those tasks that have no successors) is just its execution time. For the rest of the tasks, their weight is the sum of their execution time and the maximum of the weights of all their successors. An example of this process is shown in Fig. 3 (a). Thus, according to the figure, W(task4)=6; W(task2)=8+6=14; W(task3)=16+6=22 and W(task1)=12+max(14, 22)=34.

We assume that we have reliable estimations of the execution time of each task. Nevertheless, if the execution time of the tasks were variable (for instance, depending on the input data), the scheduler could adopt a solution similar to the one proposed in [23], where the authors suggest creating several graphs for the same task (called scenarios) and then choosing the one that best fits the current conditions. These scenarios are identified at design-time; hence if it were necessary to apply this technique, the scheduling flow that we propose would continue being valid.

The weights represent the impact of the tasks in the critical path of the task graph. Thus, if Weight(A) > Weight(B), it means that A must be loaded before B. Hence, once this process is performed, all the tasks are sorted in decreasing order of their weights to determine the sequence of reconfigurations that the scheduler will follow at run-time. Thus, in Fig. 3 (a), this sequence is 1-3-2-4.

Fig. 3. Example of the algorithms for the weight calculation (a) and the critical-task identification (b). The reconfiguration latency is always 4 ms.

2) Critical tasks identification

The second step identifies which tasks are especially critical for the system. The goal of the algorithm is to obtain the minimum set of tasks that fulfill the following condition: if they are reused (and therefore they do not generate any delay due to the reconfiguration latency), the scheduler will be able to hide the reconfiguration overhead of the remaining (i.e. non-critical) tasks. It is important to point out that at design-time we know how our scheduler will work at run-time; hence we can identify the critical tasks at design-time, saving run-time computations. These tasks will be labeled as "critical tasks", and the scheduler will assign them a value of criticality that represents the delay that they will generate when they are not reused.
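Stepping back to the first of these parameters, the weight computation just described can be illustrated with a short sketch. This is our own illustration, not the authors' code: the task record and the function names are assumptions introduced only for this example.

#include <stdio.h>

#define MAX_SUCC 4

/* Hypothetical task record: only the fields needed for the weight
 * computation of Section V.A.1 are included. */
typedef struct task {
    int exec_time;                 /* estimated execution time (ms) */
    int n_succ;                    /* number of successors          */
    struct task *succ[MAX_SUCC];   /* successor tasks               */
    int weight;                    /* computed below                */
} task_t;

/* weight(t) = exec_time(t)                      for leaf tasks
 * weight(t) = exec_time(t) + max(weight(succ))  otherwise
 * (shared successors are simply recomputed; memoization is omitted
 * for brevity in this sketch). */
static int compute_weight(task_t *t)
{
    int max_w = 0;
    for (int i = 0; i < t->n_succ; i++) {
        int w = compute_weight(t->succ[i]);
        if (w > max_w)
            max_w = w;
    }
    t->weight = t->exec_time + max_w;
    return t->weight;
}

int main(void)
{
    /* Graph of Fig. 3 (a): 1 -> {2, 3}, 2 -> 4, 3 -> 4 */
    task_t t4 = { .exec_time = 6,  .n_succ = 0 };
    task_t t2 = { .exec_time = 8,  .n_succ = 1, .succ = { &t4 } };
    task_t t3 = { .exec_time = 16, .n_succ = 1, .succ = { &t4 } };
    task_t t1 = { .exec_time = 12, .n_succ = 2, .succ = { &t2, &t3 } };

    compute_weight(&t1);
    /* Prints 34 22 14 6, i.e. the reconfiguration order 1-3-2-4 */
    printf("%d %d %d %d\n", t1.weight, t3.weight, t2.weight, t4.weight);
    return 0;
}

The criticality and mobility parameters are obtained with the algorithms of Figs. 4 and 5, described next.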
At run-time, the replacement policy will take into account this information assigning greater priority to the critical tasks. Fig. 4 shows the pseudo-code of this algorithm. Firstly, the set of critical tasks (CT) is initialized to the whole set of tasks in the task-graph (line 1). Then, the function schedule (task_graph, CT) (line 2) is called. This function schedules the task-graph assuming that all the tasks in the CT set are reused and returns a reference schedule (ref_sch). Any scheduling algorithm can be used in this step. Since these computations are carried out at design-time, in this case we use a branch&bound-based scheduler that guarantees the optimal solution. Hence, as initially all the tasks have been assigned to CT, in this step we obtain an ideal schedule with no reconfiguration overhead. This ideal schedule is used as a reference during the critical-task identification process. The objective of this process is to find a schedule that provides the same performance than ref_sch, but with the minimum number of tasks assigned to CT. We initialize CT as empty (line 3) and we start an iterative process (lines 5-12). The while loop starts computing a new schedule (current_sch, line 6) and compares its execution time with the execution time of ref_sch (line 7). If both schedules have the same execution time, the iterative process finishes. Otherwise the function compare identifies which tasks have been delayed in current_sch (line 10). In the next step (line 11) the function max_weight identifies the delayed task with the greatest weight, and finally this task is added to the CT set (line 12). Fig. 3 (b) depicts an example of this process. As it can be seen in the figure, all the tasks in the graph are initially labeled as non critical. Then, the algorithm performs the first iteration of the loop and identifies a 4 ms delay (reconfiguration of Task 1) compared to the ideal execution. Hence, this task is labeled as critical, with a criticality of 4. In the following iteration of the loop, Task 1 is reused, and in this case the configurations of Tasks 2, 3 and 4 do not generate any execution time overhead. Hence the algorithm stops and tasks 2, 3 and 4 remain labeled as non critical. 3) Mobility calculation Finally, the scheduler assigns a value of mobility to each task. This parameter identifies how many times a reconfiguration can be delayed without generating any overhead in the task-graph execution. More precisely, this value represents how many events can be skipped before reconfiguring a given task without generating any delay. This means that the reconfiguration circuitry is ready, and that at least one of the RUs is available, but for some reason our scheduler decides to delay the reconfiguration. In fact the reason is always the same: the replacement policy has selected as a victim a task that the scheduler prefers not to replace. Hence the scheduler will check the mobility, and if there is still margin for delaying the reconfiguration, it will wait until the following scheduling event hoping that by that time the replacement policy will find a less important victim. Fig. 5 shows the pseudo-code of this algorithm. Initially all the mobility of each task is initialized to 0. Then the process starts identifying the critical and the non-critical tasks of the task graph and storing them in CT and NCT, respectively (1-2). (Note that by definition the critical tasks have no mobility because even if their reconfiguration is scheduled ASAP they will generate a delay). 
In the main loop (3-15) the algorithm sequentially extracts a task t from NCT (4), and the function schedule() obtains a first schedule (ref_sch) assuming that t has mobility 0. Then, the do-while loop (6-13) tentatively assigns a greater mobility value (7) to t and calculates a new schedule (new_sch), but this time delaying the reconfiguration of t as many times as the value of t.mobility. This is again done in the function schedule() (8). Then, the algorithm checks if it is feasible to assign that new mobility to t without degrading the performance of the system (9). Thus, if diff>0 (10), an extra overhead has been generated; hence the algorithm restores the former mobility value to t (11) and exits from the do-while loop (13). Finally, the task is removed from NCT. The while loop continues until NCT is empty.

Critical tasks (CT) identification:
 1. CT := whole_set_of_tasks (task_graph);
 2. ref_sch := schedule (task_graph, CT);
 3. CT := Ø;
 4. end := 0;
 5. WHILE (not end){
 6.   current_sch := schedule (task_graph, CT);
 7.   IF (ex_time (current_sch) == ex_time (ref_sch)){
 8.     end := 1;
 9.   }ELSE{
10.     delayed_tasks := compare (current_sch, ref_sch);
11.     t := max_weight (delayed_tasks);
12.     add_critical_task (t, CT);
13.   }
14. }
15. RETURN CT;

Fig. 4. Algorithm to identify the critical tasks.

Mobility assignment:
 1. CT := obtain_critical_tasks (task_graph);
 2. NCT := obtain_non_critical_tasks (task_graph);
 3. WHILE (NCT ≠ Ø){
      /* Get a task from NCT */
 4.   t := get_task(NCT);
      /* Schedule with previous t.mobility */
 5.   ref_sch := schedule (task_graph, CT);
 6.   DO{
 7.     t.mobility++;
        /* Schedule with new t.mobility */
 8.     new_sch := schedule (task_graph, CT);
 9.     diff := ex_time (new_sch) - ex_time (ref_sch);
10.     IF (diff > 0){
11.       t.mobility--;
12.     }
13.   }WHILE (diff == 0);
14.   remove (t, NCT);
15. }

Fig. 5. Algorithm to assign the mobility to the non-critical tasks.

Fig. 6 illustrates this process with an example. Firstly, the algorithm selects the non-critical tasks of the graph: 2, 3 and 4. For Task 3, it attempts to assign a mobility of 1. However, when the scheduler delays its reconfiguration it generates a delay in the execution of 4 ms. Hence, the mobility of Task 3 is set to 0. On the contrary, Tasks 2 and 4 can be delayed without any performance degradation. Hence the scheduler assigns them a mobility of 1 in the first iteration. Afterwards it attempts to assign a value of 2, but in both cases new overheads are generated. Hence, the mobilities of Tasks 2 and 4 are set to 1.

Fig. 6. Example of the mobility calculation. The reconfiguration latency is always 4 ms.

B. Run-time phase

At run-time, the scheduler steers the execution of the task graph taking into account the information obtained at design-time and the available resources. To make all these run-time decisions as quickly as possible the scheduler only considers some discretized time values following an event-triggered approach. When certain events happen the scheduler will look for reconfigurations or tasks that are ready to be scheduled. Three different events trigger the execution of the scheduler: new_graph, which is generated when the information of a new task graph is received; end_of_reconfiguration, which is generated by a RU when a new configuration has been loaded or when the RU has identified that a configuration can be reused since it was already loaded in a previous execution; and end_of_execution, which is generated by a RU when a task finishes its execution.
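Before walking through the pseudo-code of the run-time phase (Fig. 7), the event-triggered structure can be summarized as a simple dispatch loop. The following C sketch is our own illustration; the helper names are placeholders for the actions listed in Fig. 7, not functions of the actual implementation.

/* The three events that trigger the run-time scheduler (Section V.B). */
typedef enum {
    NEW_GRAPH,               /* information of a new task graph received  */
    END_OF_RECONFIGURATION,  /* a RU finished loading (or reusing) a task */
    END_OF_EXECUTION         /* a RU finished executing a task            */
} event_type_t;

typedef struct {
    event_type_t type;
    int          task_id;    /* task that triggered the event             */
} event_t;

/* Hypothetical helpers: the concrete actions are those of Fig. 7. */
extern int  pop_event(event_t *ev);   /* returns 0 while no event is pending */
extern void handle_new_graph(void);
extern void handle_end_of_reconfiguration(int task_id);
extern void handle_end_of_execution(int task_id);

/* The scheduler only runs when one of the three events occurs; between
 * events no run-time computation is performed.                          */
void run_time_scheduler(void)
{
    event_t ev;
    for (;;) {
        if (!pop_event(&ev))
            continue;
        switch (ev.type) {
        case NEW_GRAPH:              handle_new_graph();                        break;
        case END_OF_RECONFIGURATION: handle_end_of_reconfiguration(ev.task_id); break;
        case END_OF_EXECUTION:       handle_end_of_execution(ev.task_id);       break;
        }
    }
}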
This approach greatly reduces the complexity of the run-time scheduling process, but at the same time it provides enough flexibility to optimize the execution. Basically, the scheduler processes these events sequentially and carries out the proper actions. Fig. 7 depicts this process. When a new task graph arrives and the reconfiguration circuitry is free (1-3), the system attempts to schedule the reconfiguration of the first task in the reconfiguration sequence of the received graph. For the end_of_reconfiguration and reused_task events, the system checks the dependencies of the task that has been loaded or reused in order to identify if it can start its execution. To this end, the scheduler will check if all its predecessors have already finished their execution (5). In that case, the task is ready to be executed (6) and the system starts its execution (7). Then, it attempts to schedule a new reconfiguration (8). Finally, for the end_of_execution event, the scheduler checks again if the reconfiguration circuitry is idle. If so, it looks for a new reconfiguration (10-11). After that, it updates the task-graph dependencies, decreasing the number of predecessors of each successor of the finished task (12). Finally, the system checks if any of the tasks that are currently loaded in any RU can start its execution (13-15).

/* task = task that triggered the event
 * RC = Reconfiguration Circuitry */
CASE event IS:
 1. new_graph:
 2.   IF (RC == idle){
 3.     look_for_reconfiguration (&rec_sequence);
 4.   }
 5. end_of_reconfiguration or reused_task:
 6.   check_dependencies (&task);
 7.   IF (is_ready (&task)){
 8.     start_execution (task);
 9.   }
10.   look_for_reconfiguration (rec_sequence);
11. end_of_execution:
12.   IF (RC == idle){
13.     look_for_reconfiguration (rec_sequence);
14.   }
15.   update_task_dependencies (&task)
16.   FOR (i := 0 TO NUMBER_OF_RUS){
17.     IF (RU_state == IDLE) AND (is_ready (&task)){
18.       start_execution ();
19.     }
20.   }

Fig. 7. Pseudo-code of the run-time phase.

The following subsections describe the replacement policy and the run-time mobility management, respectively. Both techniques are applied inside the look_for_reconfiguration function.

1) Replacement policy

Transparently to the event management, the scheduler also applies a replacement policy especially designed for this system. We have called it Look Forward + Critical (LF+C) because it takes into account the criticality of the tasks and whether they are going to be loaded in the near future or not. We assume that the "near future" means "in the context of the graph currently in execution" because this is the only region of the future that the scheduler can analyze. According to this information, the scheduler classifies the RUs into 6 categories:

• Busy: RUs with a task in execution, or with a configuration that has recently been prefetched and that is waiting for execution. These RUs are never selected for replacement.

• Free candidates (FC): These RUs do not have any task loaded.
When the system is initialized, all the RUs are free.

• Non-critical candidates (NC): RUs with non-critical tasks that are not going to be executed in the near future.

• Non-critical reusable candidates (NRC): RUs with non-critical tasks that are going to be executed in the near future.

• Critical candidates (CC): RUs with critical tasks that are not going to be executed in the near future.

• Reusable critical candidates (RCC): RUs with critical tasks that are going to be executed in the near future.

The replacement technique assigns the maximum priority to the RCCs and the minimum to the FCs. Hence, it attempts to replace those tasks that are not critical and/or are not going to be executed soon. If all the possible victims of the replacement policy are critical and belong to the same category (CC or RCC), the scheduler uses the criticality value to select the victim. If all of them have the same criticality, the victim is selected randomly.

It may be unclear whether it is better to assign more priority to the NRC candidates or to the CC candidates. On the one hand, assigning more priority to the NRC can be a good strategy, since if those tasks are replaced the scheduler will have to load them again soon. On the other hand, according to our definition of critical tasks, replacing a critical task will generate delays in the execution if that task is needed in the future, whereas replacing a non-critical task will not generate any delay. In [22] we propose to assign more priority to the NRC category to reduce the number of replacements needed. However, after extensive simulations we have identified that assigning more priority to the CCs always provides better performance, and that the percentage of reused tasks was very similar. Hence we have modified our previous replacement module and currently we assign more priority to the CCs than to the NRCs. Hence RCCs have the maximum priority, then CCs, then NRCs, then NCs and finally FCs.

We have designed a new replacement policy for three reasons. Firstly, conventional replacement policies, such as LRU, can easily fall into a configuration thrashing problem when the number of active tasks is greater than the number of RUs. Secondly, since the scheduler deals with task graphs, it has some information about the tasks that are going to be executed in the near future, and it can take advantage of that. Finally, the objective of conventional replacement policies is to improve the reuse. This is a good objective, but in our replacement policy it is only a secondary goal. In this case, the main objective is to improve the performance by attempting to reuse as many critical tasks as possible.

Fig. 8 depicts an example that illustrates the benefits of our replacement policy.

Fig. 8. Execution of three graphs in a system with 5 RUs and 4 ms of reconfiguration latency.
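Before walking through that example, the victim-selection order of LF+C can be summarized with a small sketch. This is our own illustration: only the category ordering follows the paper, while the data layout, the function name and the preference for the smaller criticality among equally protected candidates are assumptions made for the example (the actual scheduler resolves exact ties randomly).

/* Replacement categories ordered from most to least replaceable:
 * FC has the lowest protection priority, RCC the highest. */
typedef enum { FC = 0, NC, NRC, CC, RCC, BUSY } ru_category_t;

typedef struct {
    ru_category_t category;    /* classification described above              */
    int           criticality; /* design-time criticality of the loaded task  */
} ru_t;

/* Returns the index of the replacement victim, or -1 if every RU is busy. */
int select_victim(const ru_t *ru, int n_rus)
{
    int victim = -1;
    for (int i = 0; i < n_rus; i++) {
        if (ru[i].category == BUSY)
            continue;                        /* never selected for replacement */
        if (victim < 0 ||
            ru[i].category < ru[victim].category ||
            (ru[i].category == ru[victim].category &&
             ru[i].category >= CC &&         /* CC or RCC: compare criticality */
             ru[i].criticality < ru[victim].criticality))
            victim = i;                      /* less protected candidate found */
    }
    return victim;
}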
To simplify the figure we have mixed categories 3 and 4 (non-critical candidates in the figure) and categories 5 and 6 (critical candidates in the figure). The figure presents a system that executes three task graphs twice using three different replacement policies: LRU, LFD and LF+C.

The figure shows that the well-known LRU policy does not reuse any task, since there are 7 tasks competing for 5 RUs. Hence, it provides the worst execution: 88 ms. This is an example of configuration thrashing. The second replacement policy (LFD) is the optimal one regarding reuse, as it was proved in [3]. Basically, this strategy replaces the task that will be requested farthest into the future. LFD cannot be applied in dynamic systems, since in order to apply LFD the system needs to know all the future events, but it can be used as a reference. In this case the system reuses 5 tasks and the execution time is 80 ms.

Finally, the figure also shows what happens when the system uses LF+C. For the first two graphs, the replacement policy simply selects a Free Candidate (FC). When Graph 3 starts, RU1 and RU4 are critical candidates, whereas RU2, RU3 and RU5 are non-critical candidates. For this graph the scheduler selects the first two non-critical candidates (RU2 and RU3) because critical candidates have a greater priority. When Graph 1 is executed again, the first task is reused, and the scheduler loads the two remaining nodes in RU3 and RU5 respectively, since they are the only non-critical candidates available. Although only one task has been reused, no delays have been generated due to the reconfigurations, since this was the only critical task. When Graph 2 starts again, its first task, which is the only critical task of the graph, is reused. For the other task the replacement policy selects the first non-critical RU (in this case RU3). Finally, when Graph 3 is executed, the system reuses its first task (which again is critical) and loads its second task in RU3, again an NC candidate.

As it can be seen in the figure, LF+C does not achieve the optimal result regarding reuse. Whereas LFD reused 5 tasks, LF+C only reused 3. However, as it was explained before, reuse is only a secondary objective for our heuristic. The main objective is performance, and in this example it achieves the optimal performance, hiding the latency of all the reconfigurations in the second execution of the graphs. Using LF+C the 6 graphs are executed in 76 ms, 4 ms less than using LFD and 12 ms less than using LRU. As a conclusion, the objective of LF+C is not only to reuse more, but also to reuse better.

2) Mobility management

In Fig. 7 there were three calls to the function look_for_reconfiguration(). In this function, the scheduler must decide whether to schedule the reconfiguration of the following task in the reconfiguration sequence or not. To this end it takes into account the mobility of the task and the victim selected by the replacement policy. Fig. 9 shows the pseudo-code of this function. Firstly, the function extracts the first task in the reconfiguration sequence (1). Then, the scheduler selects a victim applying our replacement policy (2).
If all the RUs are busy, this would not be possible (3). Otherwise, the function checks if the victim is critical (4). In this case the scheduler will try to delay the reconfiguration. To this end it checks if the mobility of the task is greater than the number of total skipped events at that moment (5). In that case, the system will just increase the number of skipped events (6). Otherwise, the function triggers the reconfiguration of the new_task (9), replacing the selected victim and removing new_task from the reconfiguration sequence (10).

void look_for_reconfiguration (){
 1.  new_task := next_task_in_rec_sequence();
 2.  apply_replacement (victim);
 3.  IF (victim ≠ Ø){
 4.    IF (critical_candidate (victim) == TRUE){
 5.      IF (new_task.mobility > skipped_events){
 6.        skipped_events++; /* initially 0 */
 7.      }
 8.    }ELSE{
 9.      load (&new_task);
10.      delete_task_in_rec_sequence (&new_task);
11.    }
12.  }
}

Fig. 9. Pseudo-code of the function look_for_reconfiguration().

As it was shown in the motivational example of Fig. 2, this approach allows reusing more critical tasks, leading to an improvement in the system performance.

VI. IMPLEMENTATION DETAILS

In this section we will explain in detail the proposed architecture. As mentioned before, we have implemented two different versions of our scheduler: a SW version and a HW one. In both cases, we have used a Xilinx™ Virtex II PRO FPGA and the EDK development tool to carry out our experiments. We have not implemented a complete HW multi-tasking system, but just a simulation platform where the execution of the tasks in the RUs and the reconfigurations are simulated using programmable timers. The system includes an additional timer that is used to measure the execution time of a given simulation with clock-cycle accuracy.

In the SW version (Fig. 10 (a)) the scheduler is a program that runs in a PowerPC™, and that communicates with a set of timers representing the RUs. This program is C code compiled for the PowerPC architecture. The assembly code is stored in an internal memory in the FPGA. During the run-time scheduling phase the timers communicate with the processor generating interrupts, and the processor can directly access the timer interfaces using the system bus. When a task is assigned to a RU the processor loads its information (the reconfiguration latency or the execution time) in the timer and activates it when the reconfiguration or the execution must be simulated. When the timer finishes the countdown, it generates an interrupt that will trigger the corresponding event (end_of_reconfiguration or end_of_execution). The PowerPC can then carry out the proper scheduling actions.

In the HW version (Fig. 10 (b)) the scheduler is a HW module implemented using some of the available reconfigurable resources in the device. This module only provides support for the run-time phase of our scheduling flow, since the design-time phase must be carried out at compile time. The information regarding the different task-graphs is stored in an internal memory, and when the execution of a task-graph must be simulated in this platform, the PowerPC transfers all the task graph information to the HW scheduler. When the simulation finishes, the scheduler generates an interrupt. In this implementation the RUs interact directly with the HW scheduler using point-to-point communication lines. This approach reduces the penalties due to HW/SW communications by one to two orders of magnitude in our experiments.
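As a rough illustration of how the programmable timers drive the SW version of the scheduler, the interrupt path from a RU timer to the event queue could look like the following sketch. This is an assumption for illustration only: the names, the state encoding and the handler structure are ours, the event_t type is the one assumed in the dispatch sketch of Section V.B, and the actual PowerPC/EDK interrupt code is not shown in the paper.

#define NUM_RUS 4                       /* illustrative platform size          */

/* Per-RU simulation state (idle, reconfiguring, ready to execute, executing). */
typedef enum { RU_IDLE, RU_RECONFIGURING, RU_READY, RU_EXECUTING } ru_state_t;

extern ru_state_t ru_state[NUM_RUS];
extern int        loaded_task[NUM_RUS]; /* task currently mapped on each RU    */
extern void       push_event(event_t ev);

/* Called when the programmable timer of RU ru_index expires: the finished
 * countdown is interpreted according to what the RU was simulating.          */
void ru_timer_isr(int ru_index)
{
    event_t ev = { .task_id = loaded_task[ru_index] };

    if (ru_state[ru_index] == RU_RECONFIGURING) {
        ru_state[ru_index] = RU_READY;   /* simulated reconfiguration finished */
        ev.type = END_OF_RECONFIGURATION;
    } else {
        ru_state[ru_index] = RU_IDLE;    /* simulated execution finished       */
        ev.type = END_OF_EXECUTION;
    }
    push_event(ev);                      /* consumed by the scheduler loop     */
}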
A block diagram of the HW scheduler is depicted in Fig. 11. Next we will describe these blocks in detail. A. Reconfiguration queue and task-graph table These modules are initially used to store the reconfiguration sequence and the task-graph information. During the task- graph execution this information is updated according to the run-time events. The Task-graph table stores the information of the current task graph including all the precedence constraints. This table is used to guarantee the proper task execution. For this purpose, the table supports the following operations: Insertion: used to store the information of a new task graph. The latency of this operation is NT clock cycles, where NT is the number of tasks. Check: This operation enquires about whether a task is ready to be executed or not, which happens when all its precedence constraints have already been met. This operation just requires one clock cycle. Hit: This operation enquires about whether a task is currently waiting for execution. This information will be used by the replacement module. This operation just requires one clock cycle. Update: This operation updates the task graph information when a task finishes its execution. It updates the state of the task and its precedence constraints. To this end the successors of that task are extracted and sequentially updated, decrementing the value stored in Nº predec. This value represents the number of predecessors that have not finished their execution. The table includes some specific HW support for this operation that is not included in the figure for simplicity. The latency of this operation in 2*NS+1 clock cycles, where NS is the number of successors of the task. Initially, we implemented an associative version of this table, which allowed having a table with a relatively small number of entries, since the particular position of a task in the table was not relevant. In addition it provided very high performance for small graphs. However, as the complexity of the tested task graphs grew, it was also necessary to use a larger table, which consumed an unacceptable amount of resources for large task-graphs (in some cases this grew up to 99% of the scheduler). Furthermore, its operation frequency greatly decreased, which limited the performance of the system. In order to solve these problems, we implemented a non-associative version of the table using one of the SRAM blocks (BRAMs) available in the FPGA and some additional HW support. In this second version, each task is stored in the table according to its ID, where the ID is its unique identifier. For instance, the information of Task N is stored in the Table[N] entry. The main drawback is that the non-associative table needs as many memory positions as different tasks can be executed in the system. However, this is not a big problem, since in our experiments just one BRAM was enough to deal with all of our tested task graphs. With this approach, the HW area required for the task-graph table has been drastically reduced. Moreover its operation frequency is more than four times faster than the associative version (up to 400 MHz), even for a large number of entries. B. Replacement module This module implements the replacement technique. It demands a very affordable HW support as it can be seen in Fig. 12. The RUs are analyzed sequentially, extracting the task Rec. 
queue task k task j task i task k task j task i Task info unit state Task info unit state RU 1 info RU n info Control unit Task-graph table Event queue end execution task_loaded task reused end execution task_loaded task reused Task info unit state Task info unit state … Replacement module Task ID Nº predec. Nº succ. Succ’s IDs 0 1 ... n Fig. 11. Block diagram of the HW scheduler. System bus RU 1 RU N... INT_REQ Power PC Power PC System bus RU 1 RU N... INT_REQ Scheduler ... Virtex-II Pro Virtex-II Pro (a) (b) Fig. 10. Simulation platform. (a) SW implementation of the scheduler. (b) HW implementation of the scheduler. TVLSI-00330-2009 11 information using a counter and some multiplexers. This information includes whether the task is critical or not. In order to check if the task is going to be executed in the near future, the replacement module uses the hit operation of the task-graph table. With this information, it is straight forward to identify the category of the task. The replacement module also includes a simple state machine and a register that stores the information of the best candidate found. Finally, it includes a comparator that identifies whether a given task can be reused (in this case the reconfiguration is cancelled and the system generates a reused_task event). The comparator and the implementation of the state machine are not included in the figure for simplicity. C. Control unit and event management As mentioned before, we have integrated our scheduler in a simulation platform that allows simulating the reconfiguration and execution of the RUs using programmable timers. In this platform two registers are associated with each timer. These registers store the state of the RU (idle, reconfiguring, ready to execute and executing) and the information of the loaded task. The RUs also include HW support to generate some run-time events whenever the corresponding timer indicates that a task has finished its reconfiguration or execution. These events are stored in an event queue. Since it is possible that several RU modules generate events simultaneously, the queue includes an arbiter with a fixed priority scheme that prevents access conflicts. The control unit is basically a state machine that extracts the events from the queue and carries out the actions described previously in Fig. 7. This unit also takes into account the mobility of the tasks to decide whether to delay a given reconfiguration or not. Fig. 13 depicts the HW support needed to make this decision. Basically a counter stores the number of events skipped so far and a comparator identifies if the reconfiguration can be delayed without introducing overheads (i.e. if the mobility of the task is greater then the number of skipped events). In that case, if the replacement module has selected a critical task as victim, the reconfiguration is delayed and the skipped events counter is incremented. D. Synthesis results Fig. 14 shows the resources used for the HW implementation of our scheduler in the Xilinx™ Virtex II PRO XC2VP30 FPGA for both versions of the task-graph table. The resources needed by the scheduler with the associative table grow linearly with the number of table entries. This limits the scalability of the scheduler, since for large systems it consumes an important percentage of the FPGA resources. Our current implementation has solved this problem and consumes less than 5% of the available resources even for large tables. 
Moreover, in both cases, the implementation of the scheduler demands only 1% of the BRAMs (this is not presented in the figure for simplicity). An additional 4% of the FPGA slices has also been used to implement a DMA controller in order to optimize the transactions in the system bus. This controller is not part of our scheduler and can be used by any other component of the system. Regarding the supported clock frequency, the current scheduler supports frequencies up to 100 MHz even for tables with a large number of entries. Our previous implementation also supported a 100 MHz clock frequency, but only for small tables. For larger tables the clock frequency was significantly reduced. For instance, the maximum frequency for the associative table with 128 was 65 MHz. Hence the new design of the task-graph table reduces the area cost and improves the performance of the scheduler. VII. PERFORMANCE EVALUATION In this section we will evaluate the performance of our scheduler. For that purpose, we have tested it using a set of Task mobility COMP > Skipped Events Replacement module Critical task selected Delay_reconfiguration increment Fig. 13. HW support to implement the skip-event feature. 0 5 10 15 20 8 16 32 64 128 Flip Flops (%) Assoc. LUTs (%) Assoc. Flip Flops (%) non Assoc. LUTs (%) non Assoc. Number of entries Scheduler: % resources used Fig. 14. Resources consumption of the scheduler both for the associative table (AT) and the non-associative one. RU 0 Task_id0 C0 Free0 RU n… Task graph Table hit Counter NC NRC task_id … … … … Busy0 Busy Free CC RCC Fig. 12. HW to label the candidates. TVLSI-00330-2009 12 task graphs extracted from actual multimedia applications. The applications used to extract the task graphs are: two versions of a JPEG decoder (JPEG and Parallel-JPEG), a MPEG-1 encoder, a Pattern recognition application (that applies the Hough transform), and a 3D rendering application based on the open source Pocket-GL library (Pocket GL). In the latter case the application includes 20 different task graphs. Table 1 presents the number of nodes, and the initial execution time of each graph. This initial execution time represents an ideal scenario with no delays due to the run-time management or due to the reconfigurations. We have tested our system separately for two groups of graphs. In the first group we have included the first four applications, and in the second group we have used the graphs from the Pocket-GL application. For simplicity, in Table 1 these graphs have been grouped in four categories (from A to D) taking into account the number of nodes. The table presents the average execution time for each category. It also includes the delays due to the reconfiguration overheads when using an on-demand approach, assuming a reconfiguration latency of 4 ms, which is the time needed to reconfigure the larger task used. As it can be seen in the table the original reconfiguration overheads are very important, especially for the Pocket GL application, where it can even be greater than the task-graph execution time. As a first experiment we have tested the efficiency of our replacement policy (LF+C). To this end we have compared it with two well-known replacement policies: Least Recently Used (LRU) and Longest Forward Distance (LFD). As it was already explained in Section II, LFD is the optimal policy regarding reuse, but it can only be applied when all the future events are known, hence it cannot be applied in dynamic systems. 
For the task graphs of the first group we have evaluated our system when executing all the possible combinations of two of these task graphs. For each combination, we execute the two graphs alternately, simulating an execution pattern (e.g. JPEG – Parallel-JPEG – JPEG – Parallel-JPEG...). Each combination has been executed for several iterations, and we have measured the average results excluding the first iteration, because during the first execution it is not possible to reuse anything. On average, 10.5 tasks were executed in each of these experiments; hence there were always significantly more active tasks than RUs. We are particularly interested in analyzing these situations, since these experiments are so demanding that the only way to obtain good performance is to apply an efficient replacement policy. Fig. 15 (a) and (b) show the average results regarding the reconfiguration overhead and task reuse for a system with a variable number of RUs. The Remaining Reconfiguration Overhead percentage is defined as the percentage of the original reconfiguration overhead that remains after applying the scheduling techniques, and the Reuse percentage as the number of reused tasks divided by the total number of executed tasks.
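Restating these two metrics compactly (the symbols below are shorthand introduced only for this restatement):

\[
\text{Remaining overhead}\ (\%) = 100 \cdot \frac{T_{\mathrm{rec}}^{\mathrm{remaining}}}{T_{\mathrm{rec}}^{\mathrm{original}}},
\qquad
\text{Reuse}\ (\%) = 100 \cdot \frac{N_{\mathrm{reused\ tasks}}}{N_{\mathrm{executed\ tasks}}}
\]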
The system applies the same ASAP prefetch approach in the three cases. As can be seen in Fig. 15 (a) and (b), the prefetch technique is a powerful way to reduce the reconfiguration overheads, because it can remove most of the original overhead even when no significant reuse is achieved (for instance, for three RUs using LRU no task is reused). However, the remaining overhead is still important (35% of the original overhead). Hence, a good replacement technique may have a very positive impact on the execution of these applications. Of course, as more RUs are included in the system, the reuse percentage increases and this helps to reduce the overhead for the three replacement policies. Nevertheless, the results obtained when using an LRU approach are clearly suboptimal. LFD greatly increases the reuse percentage and reduces the overheads. Finally, LF+C obtains almost as good results regarding reuse as LFD, and in some cases even better performance. The reason is that although LF+C reuses a slightly smaller percentage of tasks than LFD, most of these tasks are critical, since LF+C assigns them more priority, and reusing critical tasks has a direct impact on the performance of the system.

Fig. 15. Evaluation of the execution of the task-graphs in groups 1 (a) and 2 (b) with different replacement policies; and comparison between an ASAP scheduler and our scheduler (Skip Events), applying in both cases the LF+C replacement policy, for tasks in group 1 (c) and 2 (d).

For the Pocket GL graphs (Fig. 15 (b)) we have simulated 500 iterations. In each iteration one of the 20 task graphs was randomly selected from the Pocket-GL library. Although there are many different task graphs, reusing tasks was possible because all of them include some common tasks (in total there are 10 different tasks). The results for this experiment confirm that our replacement policy achieves almost optimal results regarding reuse (just 3% worse than LFD). In this case the prefetch approach hides 55% of the reconfiguration overheads. However, if the system includes enough RUs and applies our efficient replacement policy the overhead disappears, whereas if it uses the LRU approach, a significant percentage (almost 10%) penalizes the system performance.

All these results have been obtained combining the replacement policies with an ASAP scheduling approach. These techniques are broadly used at run-time because they provide good results and do not generate an excessive penalty due to the run-time computations. However, they often lead to local-optimum decisions. Hence we have included in our scheduler some simple support to delay certain reconfigurations in order to find better schedules, as explained in Section V.a.2. Fig. 15 (c) and (d) present the benefits of this approach both regarding reuse and performance. The results demonstrate that delaying some reconfigurations can significantly improve the performance of the system, and even increase the percentage of reused tasks. The results of Fig. 15 (d) are especially good, since our scheduler clearly outperforms the purely ASAP approach.

Finally, Fig. 16 shows the results of our experiments regarding execution time. Fig. 16 (a) and (b) show the average execution times for the task graphs of groups 1 and 2, respectively. In both cases, we compare our results with the optimal execution times, which have been obtained assuming that there is no reconfiguration overhead and using a branch&bound scheduler. Of course these optimal schedules cannot be obtained at run-time, since the branch&bound scheduler needs from hundreds of milliseconds to seconds to find the optimal schedules even for these relatively small graphs. In any case, these results demonstrate that our scheduler provides near-optimal results as long as enough RUs are available to take advantage of our replacement policy. The average number of tasks executed in the experiments with the first group is 11.5, whereas in the second group it is 10. Hence we are obtaining near-optimal results even when the number of tasks doubles the number of RUs.

Fig. 16. Comparison of the average execution times (e.t.) in our experiments with the optimal execution times.

We have also evaluated the delays that the run-time scheduler generates in the task-graph executions due to the run-time computations. The SW implementation of our scheduler generates on average an execution-time overhead of only 1% for the graphs in group 1. However, it generates almost a 10% overhead in the execution of the Pocket GL graphs.
On average, 36% of these overheads are due to the HW/SW communications, and the remaining ones are due to the run-time scheduling computations, which must deal with complex data structures such as graphs and lists. If the overheads generated by our SW scheduler are not acceptable for a given system, the best solution is to include the HW implementation, which can almost completely eliminate them. In this case the scheduling computations are carried out at run-time in just a few clock cycles, and the average delay generated in the execution of a graph is 0.02 ms, including the time needed to send the information of the graph from the PowerPC to our scheduler.

VIII. CONCLUSIONS

An efficient HW multitasking system based on reconfigurable HW needs some specific support to reduce the reconfiguration overheads. We have developed a scheduler that provides this support by applying a prefetch scheduling technique, in order to load most configurations in advance, and a replacement technique, which reduces the number of reconfigurations needed and even improves the results obtained by the prefetch technique by assigning more priority to the most critical tasks. As the experimental results have shown, the cooperation between the prefetch and the replacement techniques provides very good results, hiding most of the reconfiguration delays. Moreover, the possibility of delaying some reconfigurations in order to escape from local-optimum decisions can greatly improve the results for some run-time conditions (as in Fig. 15 (d)). Finally, with the developed HW support, and using the information extracted at design-time, all these techniques can be applied at run-time while generating a negligible delay due to their computations. Hence our scheduler provides a transparent and efficient management of the execution of task-graphs for reconfigurable multi-tasking systems. This management can significantly improve the performance of the system, and even reduce the energy consumption, because the replacement technique can largely reduce the number of reconfigurations needed.

IX. REFERENCES

[1] www.sony.net/Products/SC-HP/cx_news/vol42/pdf/sideview42.pdf
[2] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde and R. Lauwereins, "Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs", in International Conference on Field Programmable Logic and Applications (FPL), Montpellier, France, 2002, pp. 795-805.
[3] L. A. Belady, "A study of replacement algorithms for virtual storage computers", IBM Systems Journal, vol. 5, no. 2, pp. 78-101, 1966.
[4] S. Banerjee, E. Bozorgzadeh and N. Dutt, "Integrating physical constraints in HW-SW partitioning for architectures with partial dynamic reconfiguration", IEEE Trans. on VLSI Systems, vol. 14, pp. 1189-1202, November 2006.
[5] E. Pérez-Ramo, J. Resano, D. Mozos and F. Catthoor, "Reducing the reconfiguration overhead: a survey of techniques", in International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, USA, 2007, pp. 191-194.
[6] H. Walder and M. Platzner, "Online scheduling for block-partitioned reconfigurable devices", in Design, Automation & Test in Europe (DATE), Munich, Germany, 2003, pp. 290-295.
[7] Y. Qu, J.-P. Soininen and J. Nurmi, "A parallel configuration model for reducing the run-time reconfiguration overhead", in Design, Automation & Test in Europe (DATE), Munich, Germany, 2006, pp. 1-6.
[8] J. Noguera and R. M. Badia, "Multitasking on reconfigurable architectures: microarchitecture support and dynamic scheduling", ACM Trans. on Embedded Computing Systems, vol. 3, pp. 385-406, May 2004.
[9] Z. Pan and B. E. Wells, "Hardware supported task scheduling on dynamically reconfigurable SoC architectures", IEEE Trans. on VLSI Systems, vol. 16, pp. 1465-1474, November 2008.
[10] V. Nollet, P. Avasare, H. Eeckhaut, D. Verkest and H. Corporaal, "Run-time management of a MPSoC containing FPGA fabric tiles", IEEE Trans. on VLSI Systems, vol. 16, pp. 24-33, January 2008.
[11] K. N. Vikram and V. Vasudevan, "Mapping data-parallel tasks onto partially reconfigurable hybrid processor architectures", IEEE Trans. on VLSI Systems, vol. 14, pp. 1010-1023, September 2006.
[12] S. Banerjee, E. Bozorgzadeh and N. Dutt, "Exploiting application data-parallelism on dynamically reconfigurable architectures: placement and architectural considerations", IEEE Trans. on VLSI Systems, vol. 17, no. 2, pp. 234-247, 2009.
[13] K. Kosciuszkiewicz, F. Morgan and K. Kepa, "Run-time management of reconfigurable hardware tasks using embedded Linux", in International Conference on Field Programmable Technology (ICFPT), Kitakyushu, Japan, 2007, pp. 209-215.
[14] H. Kwok-Hay So and R. W. Brodersen, "A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH", ACM Trans. on Embedded Computing Systems, vol. 7, pp. 259-264, 2008.
[15] W. Fu and K. Compton, "Scheduling intervals for reconfigurable computing", in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, USA, 2008, pp. 87-96.
[16] K. Danne, R. Mühlenbernd and M. Platzner, "Executing hardware tasks on dynamically reconfigurable devices under real-time conditions", in International Conference on Field Programmable Logic and Applications (FPL), Madrid, Spain, 2006, pp. 1-6.
[17] Z. Li and S. Hauck, "Configuration prefetching techniques for partial reconfigurable coprocessor with relocation and defragmentation", in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, USA, pp. 187-195.
[18] J. Resano, D. Mozos, D. Verkest and F. Catthoor, "A reconfiguration manager for dynamically reconfigurable hardware", IEEE Design & Test, vol. 22, pp. 452-460, September 2005.
[19] J. Resano, D. Mozos and F. Catthoor, "A hybrid prefetch scheduling heuristic to minimize at run-time the reconfiguration overhead of dynamically reconfigurable hardware", in Design, Automation & Test in Europe (DATE), Munich, Germany, 2005, pp. 106-111.
[20] J. Resano, J. A. Clemente, C. González, D. Mozos and F. Catthoor, "Efficiently scheduling runtime reconfigurations", ACM Trans. on Design Automation of Electronic Systems, vol. 13, pp. 58-70, September 2008.
[21] J. A. Clemente, C. González, J. Resano and D. Mozos, "A task graph execution manager for reconfigurable multi-tasking systems", Microprocessors & Microsystems, vol. 34, pp. 73-83, June 2010.
[22] J. A. Clemente, C. González, J. Resano and D. Mozos, "A hardware task-graph scheduler for reconfigurable multi-tasking systems", in International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 2008, pp. 79-84.
[23] C. Wong, P. Marchal and P. Yang, "Task concurrency management methodology to schedule the MPEG4 IM1 player on a highly parallel processor platform", in International Conf. on HW/SW Codesign and System Synthesis (CODES), Copenhagen, Denmark, 2001, pp. 170-175.

Juan A. Clemente was born in 1984. He began a Computer Science degree at Universidad Complutense de Madrid (UCM), Spain, in 2002 and finished it in 2007. Since then he has been a PhD student there, working as a teaching assistant and as a researcher in the GHADIR group. He also collaborates actively with the Embedded Systems Laboratory at the École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. His work mainly focuses on developing scheduling techniques to hide reconfiguration latencies in dynamically reconfigurable systems.

Javier Resano received the Bachelor's degree in Physics in 1997, a Master's degree in Computer Science in 1999, and the PhD degree in 2005, all from the Universidad Complutense de Madrid, Spain. He is currently an Associate Professor at the Computer Engineering Department of the Universidad de Zaragoza, and he is a member of the GHADIR research group of the Universidad Complutense and of the GAZ research group of the Universidad de Zaragoza. He has also collaborated with the Digital Design Technology Group of the IMEC laboratory since 2002. His research focuses on hardware/software co-design, task scheduling techniques, dynamically reconfigurable hardware, and FPGA design.

Carlos González was born in 1984. He began a Computer Science degree at Universidad Complutense de Madrid (UCM) in 2002 and finished it in 2007. Since then he has been a PhD student there, and since 2008 he has also worked as a teaching assistant. His work mainly focuses on applying run-time reconfiguration to aerospace applications; he has recently started on this topic, working with algorithms that deal with hyperspectral images. He is also currently collaborating with Juan Antonio Clemente on developing techniques to hide reconfiguration latencies in dynamically reconfigurable systems.

Daniel Mozos obtained a B.S. in Physics and a Ph.D. in Computer Science from the Universidad Complutense de Madrid. He is a permanent professor in the Computer Architecture and Automation Department of the Universidad Complutense de Madrid, where he leads the GHADIR research group on dynamically reconfigurable architectures. His research interests include design automation, computer architecture, and reconfigurable computing.