Implementación y análisis de un motor de renderizado fotorrealista en SYCL, para arquitecturas heterogéneas

Implementation and analysis of a photo-realistic rendering engine in SYCL, for heterogeneous architectures

Enrique de la Calle Montilla
Trabajo de Fin de Máster
Máster en Métodos Formales en Ingeniería Informática
Facultad de Informática
Universidad Complutense de Madrid
Dirigido por Carlos García Sánchez
Madrid, 2023

Abstract

Implementation and analysis of a photo-realistic rendering engine in SYCL, for heterogeneous architectures

This work covers the development, implementation, and analysis of Eleven Render, an architecture-independent photo-realistic rendering engine. It is written in C++ and uses the SYCL programming model, which allows the same code to run on a wide range of heterogeneous architectures such as CPUs, GPUs, and FPGAs. This gives Eleven Render a level of portability rarely seen in other engines and makes it one of the first SYCL rendering engine implementations. A new shading language called ASL has been formally specified and implemented in Coq. Using Coq's verification tools and the Vellvm project, which endows LLVM-IR with a formally specified semantics, a semantics-preserving translator from ASL to LLVM-IR has been implemented and verified. The resulting LLVM-IR shader code is then fed to Intel's SYCL implementation, DPC++, and compiled for the desired target architecture. The entire process has been evaluated and benchmarked on different architectures.

Keywords
Render 3D, SYCL, DPC++, Path Tracing, Ray Tracing, Coq, Vellvm, ASL

Resumen

Implementación y análisis de un motor de renderizado fotorrealista en SYCL, para arquitecturas heterogéneas

Este trabajo cubre el desarrollo, la implementación y el análisis de Eleven Render, un motor de renderizado fotorrealista independiente de la plataforma de hardware.
Está hecho en C++ y hace uso del modelo de programación SYCL, que permite ejecutar código en un gran abanico de arquitecturas heterogéneas, tales como CPUs, GPUs y FPGAs, dotando a Eleven Render de una portabilidad raramente vista en otros motores y siendo así una de las primeras implementaciones en SYCL de un motor gráfico fotorrealista. Se ha propuesto también un nuevo lenguaje de shading llamado ASL, que ha sido formalmente especificado e implementado en Coq. Haciendo uso de las herramientas de verificación de Coq y del proyecto Vellvm, que dota al lenguaje LLVM-IR de una semántica formalmente especificada, se ha implementado y verificado un traductor de ASL a LLVM-IR. El código resultante de dicha traducción es entonces incluido en el proceso de compilación llevado a cabo por la implementación de SYCL hecha por Intel: DPC++. Todo este proceso ha sido posteriormente evaluado y sometido a pruebas de rendimiento en diferentes arquitecturas de hardware.

Palabras Clave
Render 3D, SYCL, DPC++, Path Tracing, Trazado de Rayos, Coq, Vellvm, ASL

Contents

1 Introduction
  1.1 Background and Motivation
    1.1.1 Photo-realistic Rendering
    1.1.2 SYCL / DPC++
    1.1.3 LLVM
    1.1.4 Coq / OCaml
    1.1.5 iTrees (Interaction Trees)
    1.1.6 Vellvm (Verified LLVM)
  1.2 Previous Work
  1.3 Goals
  1.4 Work Plan
2 Render Architecture
  2.1 High level architecture overview
  2.2 Data Handling, Communication, and Rendering Processes
    2.2.1 IPC
    2.2.2 SYCL port
  2.3 Including a custom shading language in the pipeline
    2.3.1 Shading kernel execution
    2.3.2 IR Populator
    2.3.3 C++ Shader example
  2.4 Miscellaneous improvements
3 Eleven Blender
  3.1 Geometry loading
  3.2 Material interface
4 Abstract Shading Language
  4.1 Syntax
    4.1.1 AST
    4.1.2 Lexer
    4.1.3 Parser
  4.2 Semantics
  4.3 LLVM-IR Translation
  4.4 Translation verification
    4.4.1 The Vellvm memory model
    4.4.2 Renv
    4.4.3 Bisimilarity
    4.4.4 Vellvm typ vs dtyp
    4.4.5 The missing lemma
5 Benchmarking
  5.1 Backend comparison
  5.2 Anatomy of a rendering session
  5.3 Cycles comparison
6 Conclusions

Chapter 1
Introduction

1.1 Background and Motivation

This Master's Thesis merges different technologies and concepts that range from the abstract world of formal verification to the concrete hard silicon, passing through the simulations of computer graphics. In the following subsections, the main notions are briefly explained, along with a short motivation that helps to understand the decisions taken.

1.1.1 Photo-realistic Rendering

Rendering is the process by which digital content is visually represented. There are two primary categories: offline rendering and online (or real-time) rendering. This work focuses specifically on offline photo-realistic rendering, which seeks to obtain the highest visual fidelity, usually by means of compute-intensive algorithms such as Path Tracing [1][2]. Path Tracing achieves those results by simulating light in a way similar to how it behaves in the real world: in essence, mimicking the paths along which photons travel, and approximating by the Monte Carlo method what is known as "The Rendering Equation" [3]. As one might expect, the convergence of the Monte Carlo method requires substantial computational resources in order to accurately approximate The Rendering Equation, which is required for achieving high visual fidelity. These light simulations have traditionally been run almost exclusively on CPUs instead of other architectures like GPUs. One of the main reasons is the lack of memory.
GPU VRAM availability has always lagged behind traditional and more scalable CPU architectures, forcing developers of GPU photo-realistic rendering engines to rely on techniques such as out-of-core memory, with a huge performance penalty when the rendered scene exceeds the memory available on the target device. (Out-of-core memory is a technique that schedules data transfers to the device at run time, allowing big datasets to be processed without requiring all the data to be loaded onto the device.)

Scenes in modern feature films are on the order of hundreds of GBs. For example, Walt Disney Animation Studios released a dataset that allows anyone to render a sample scene from the 2016 movie "Moana". This dataset was provided with the purpose of developing, testing and benchmarking new rendering algorithms [4]. The Base Package weighs 93 GB and includes the raw data required for rendering one frame of the scene. For comparison, the latest consumer Nvidia 4000-series line-up peaks at 24 GB, while the professional Nvidia Quadro line-up peaks at 48 GB for single-GPU configurations.

Render                      | GPU acceleration         | Open Source
V-Ray Render (Chaos Group)  | Nvidia only              | No
RenderMan (Disney Pixar)    | Nvidia only              | No
MoonRay (Dreamworks)        | Nvidia only              | Yes
Arnold (Autodesk)           | Nvidia only              | No
Redshift (Maxon)            | Nvidia, AMD (only MacOS) | No
Mantra (SideFX)             | No                       | No
Octane (Otoy)               | Nvidia only              | No
Cycles (Blender)            | Nvidia, AMD, Intel       | Yes

Figure 1.1: Most popular render engines and their GPU support

Nonetheless, the gap between GPU memory and rendering memory requirements is closing. Consumer GPUs in 2023 are approaching a point where they can cover some professional rendering work. For that reason, many of the current production renderers have started releasing their own GPU-accelerated versions, as can be seen in the Figure 1.1 table, which covers the most popular offline rendering engines.

Shading Languages

Approximating The Rendering Equation is usually just one part of the process.
Artists require more flexibility in deciding how light should behave in the scene, beyond the standard models. For that reason, shading languages exist. In the offline rendering context, shading languages are DSLs (Domain Specific Languages) that usually allow users to code their own light interactions. There are dozens of shading languages, but the most notable example is Open Shading Language, or OSL, developed by Sony Pictures as an effort to standardize the diverse shading language landscape.

In this master's thesis, a shading model has been proposed and implemented in Eleven Render. This shading model could be described as parameter-driven shading: the light behavior is defined by a hard-coded function (Disney's Principled BRDF [5]), but the input parameters of that function can be described programmatically with custom code executed on the device. This sits halfway between a fully expressive shading model and using fixed parameters. A visual example of this shading model is a shader that describes the color of a certain material without relying on texture data, for example a noise function. Figure 1.2 shows how a noise texture shader can be linked to the albedo input parameter (a), where it describes the color of the surface, or can also be linked to the metallic input (b) to describe the amount of reflection at each point of the surface.

(a) Noise shader linked to albedo input  (b) Noise shader linked to metallic input
Figure 1.2: Noise shader applied to the albedo and metallic inputs of the Principled BRDF

One of the downsides of this approach is that it does not allow the light behaviour to be modeled completely. It is possible to parameterize the variables of the BRDF function, but this is not flexible enough to tell the light how it should behave. At first sight, controlling the light behaviour in the path tracing algorithm shouldn't be a hard task.
In theory, it would be enough to tell the bouncing ray its new direction and the energy lost in the bounce. The main issue is importance sampling. Importance sampling is a way to speed up convergence times (it is explained in more detail in Chapter 4.2 of the previous work [6]) and is required in any modern rendering engine, but it interferes with the probability of the ray bounce direction. That means that to model the light behaviour completely, it is necessary to take into account the probability and the distribution of the sampling determined by the importance sampling method. Modern shading languages like OSL overcome this complexity by making use of closures, which are a symbolic representation of that behaviour that allows a series of operations over it, and which delegate the complexity of handling the distribution probabilities to the closure implementation. OSL, like most languages, is released with a natural language specification (OSL 1.12 spec [7]), which can sometimes be ambiguous and imprecise. In this work, a different approach has been used: formally defining a shading language called ASL. This has been done in the name of academic interest, with the purpose of showing how Formal Methods can be integrated into a development project and how they can provide value to it. ASL provides Eleven with a rigorous and correct implementation of a shading language.

1.1.2 SYCL / DPC++

SYCL is a cross-platform abstraction layer that enables code for heterogeneous processors to be written using modern C++, and that manages data resources and code execution on those devices. The main purpose of the SYCL standard is to unify the development for multiple platforms into a single codebase. As can be seen in Figure 1.1, many rendering engines develop their Graphics Processing Unit (GPU) versions choosing only one platform (usually Nvidia's CUDA), leaving other vendors' devices behind. The GPU render versions are developed in parallel with the main CPU renderer, creating two separate codebases.
This usually implies limited features for the GPU branch, which tends to lag behind the CPU one [8, 9]. For that reason, a framework that unifies the development process for a diverse set of target devices is of high value from the rendering engine development standpoint. That is where SYCL plays a crucial role in this work. The main Eleven codebase has been developed in SYCL with the purpose of making Eleven as portable as possible.

The main drawback of a programming model that targets multiple different architectures is that it is limited, from a feature standpoint, by the weakest of the architectures. For that reason, SYCL does not support virtual function calls, function pointers, exceptions or global device variables. These limitations make the development of Eleven more challenging, as will be seen in the subsequent chapters.

The SYCL ecosystem is still emerging; most implementations are not even up to date with the specification. This is the list of all SYCL rendering engines found on the internet as of the writing of this report:

• Cycles' oneAPI backend
• ChameleonRT
• triSYCL path tracer demo

The triSYCL path tracer is just a demo lacking more advanced features. ChameleonRT [10] offers a more complete path tracing engine, but lacks a shader loading system, and its SYCL implementation (3) is only compatible with Intel devices due to the limitations of Embree's SYCL library. Cycles' oneAPI backend allows the well-known open source renderer Cycles to run on oneAPI devices, but it won't work with different SYCL backends, only with Intel Xe-HPG architecture GPUs. Given the current landscape, Eleven stands as one of the pioneering SYCL rendering engines, making it an early and valuable contribution to the SYCL community and to heterogeneous architecture development.
It is the first SYCL engine to make use of the flexibility of SYCL's multi-target capabilities.

DPC++

DPC++ (Data Parallel C++) [11] is an open source C++ compiler implementation with SYCL standard support, maintained by Intel, and the one chosen among several options for this work. It not only adds some features to the original SYCL specification, but also includes a compiler built on top of the LLVM project, which makes it easy to extend to devices other than native Intel architectures. DPC++ is an outstanding component of the oneAPI toolkit (4), a comprehensive initiative designed to unify programming across various devices through a single API. Furthermore, the toolkit incorporates optimized libraries for diverse device types, including oneMKL for linear algebra and oneDAL for machine learning, and adds additional functionality with the Rendering Toolkit, designed to accelerate these types of workloads with a set of rendering and ray tracing libraries. Every toolkit and library is actively maintained by Intel to support their broad portfolio of devices and architectures. It is worth mentioning that Codeplay, which was acquired by Intel in 2022, has expanded SYCL support to a wide array of devices beyond Intel products through the implementation of a plugin interface.

Links for the SYCL engines mentioned above: https://github.com/blender/cycles/tree/main/src/device/oneapi, https://github.com/Twinklebear/ChameleonRT/tree/master, https://github.com/triSYCL/path_tracer

(3) ChameleonRT is compatible with a huge variety of devices; the limitation stated here applies only to the SYCL+Embree backend.
(4) https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#gs.3v44bt
The main device targets currently supported by DPC++ are:

• spir64_x86_64: Intel CPUs
• spir64_fpga: Intel FPGAs
• spir64_gen: Intel GPUs
• nvptx64-nvidia-cuda: Nvidia GPUs (CUDA)
• amdgcn-amd-amdhsa: AMD GPUs (HIP)

The considerable number of targets available out of the box is one of the reasons why DPC++ was chosen for the development of Eleven. Nevertheless, in theory, Eleven's source code could be compiled with other SYCL implementations such as ComputeCpp from Codeplay (5), triSYCL from Xilinx (6), or hipSYCL from Heidelberg University (7), currently known as AdaptiveCpp.

(5) https://developer.codeplay.com/products/computecpp/ce/home
(6) https://github.com/triSYCL/triSYCL
(7) https://github.com/OpenSYCL/OpenSYCL

Plugin interface: https://intel.github.io/llvm-docs/design/PluginInterface.html

DPC++ Compilation

Figure 1.3: AOT/JIT compilation figure, extracted from the Data Parallel C++ book [11, Figure 13.2]

SYCL works with two kinds of compilation systems: JIT and AOT. Just In Time (JIT) code is compiled at run time, while Ahead Of Time (AOT) code is compiled beforehand. AOT mode is selected with the -fsycl-targets flag, which compiles code for each selected target. That code is bundled later, and a fat binary is produced which holds the code for each device. In this project, this compilation system is exploited to include arbitrary shader code in those fat binaries and execute it on the device.

1.1.3 LLVM

The LLVM Project is an open source compiler infrastructure. The high-level compilation steps for a project built on LLVM are shown in Figure 1.4: a front-end (e.g. Clang for C and C++, rustc for Rust) produces LLVM-IR; IR passes perform optimization, analysis and transformation; and a back-end emits code for a CPU target (x86, ARM, ...) or a GPU target (AMDGPU for AMD, NVPTX for Nvidia, SPIR-V, ...).
Figure 1.4: LLVM high-level compilation diagram (front-ends translate source code to LLVM-IR; IR passes optimize, analyze and transform it; back-ends emit target machine code)

First, a front-end translates code in an arbitrary programming language to an intermediate representation called LLVM-IR (an acronym for LLVM Intermediate Representation). Clang is perhaps the most popular front-end for LLVM because it works with C/C++ code. The previously mentioned Intel DPC++ compiler is effectively a front-end for DPC++ code (while the DPC++ compiler can be understood as an LLVM front-end, it provides much more functionality, such as orchestrating and coordinating the linking and driver functionality for heterogeneous architectures). The resulting IR from the front-end can be analyzed, transformed and optimized according to the requirements of the project, and finally, a back-end translates the LLVM-IR code to the required target. LLVM has some advantages that make it interesting from the heterogeneous architecture development perspective, which explains why many of the most important SYCL implementations, like AdaptiveCpp, Intel's DPC++ and neoSYCL, are built on the LLVM project.

LLVM-IR

The intermediate representation LLVM-IR is flexible enough to work as a middle-step language between many architectures, while maintaining the ability to be analyzed and optimized easily, thanks to its Static Single Assignment (SSA) form and its clearly defined semantics. Knowing these characteristics, it is possible to see why LLVM-IR is a good candidate to serve as the target language of the verified translator for ASL, the custom shading language proposed in this work.

1.1.4 Coq / OCaml

Coq is described on its webpage as "a formal proof management system which provides a formal language to write mathematical definitions, executable algorithms and theorems together with an environment for semi-interactive development of machine-checked proofs. Typical applications include the certification of properties of programming languages (e.g.
the CompCert [12] compiler certification project)..." [13]. It allows programming in a functional manner and proving properties over that code with its proof system. Thanks to Coq, it is possible to define the ASL syntax and semantics, and to reason about them in a later step. From this point to the end of this work, any time the word "verification" appears, it is in the context of Coq's proof system.

OCaml

OCaml is a programming language with a significant presence in static analysis and formal methods, because of the simple semantics of its functional features. Coq is implemented in OCaml, and is often used to provide verification tools for OCaml development workflows. OCaml comes with two code-generating tools: ocamllex, which creates a lexical analyzer based on given regular expressions and their corresponding semantic actions, and ocamlyacc, which constructs a parser based on a specified grammar. These tools are similar to the well-known lex and yacc utilities commonly found in C development environments.

Code extraction

One of the main features of Coq is the ability to extract Coq definitions to different functional programming languages like OCaml, Haskell and Scheme. This way, it is possible to move from Coq's encapsulated and formal context to a production environment, while keeping a high degree of assurance in the correctness of the generated code. This Coq feature is extensively used to bring the ASL abstract syntax and the verified code translation to the OCaml toolchain.

1.1.5 iTrees (Interaction Trees)

Interaction Trees are a way to represent mutually recursive, nonterminating computations in Coq. They acquire special importance when taking into account the termination constraints of Coq, which make the representation of potentially infinite computations hard to implement and to reason about.
The iTree library achieves all of this thanks to coinduction, but its rich library hides the coinduction details, so the end user only needs to reason equationally.

Figure 1.5: itree Coq definition [14, Figure 1]

Bisimilarity

itrees allow different types of equivalence, as explained in the original paper [14, Section 2.2]. These equivalence relations are a key point in the rewriting strategies used when reasoning about computations.

• Strong bisimulation. Two itrees satisfy t1 ≅ t2 if all the itree nodes are exactly the same.
• Weak bisimulation. Weak bisimulation relaxes the relationship between two itrees, t1 ≈ t2, ignoring the Tau steps: t1 ≅ t2 → t1 ≈ t2. It is required when reasoning about iteration and other processes that add internal Taus.
• Heterogeneous bisimulation. In the previous bisimulations, to decide whether two Ret nodes were the same, they were compared under the propositional equality relation Eq, so Ret(a) ≅ Ret(b) if a = b. But a broader definition of bisimulation is needed when comparing two itrees with different return types. For that, heterogeneous bisimulation under a relation R, written eutt R, is used, where R is a relation between the two types.

Both strong and weak bisimulation are concrete instances of heterogeneous bisimulation under propositional equality (Leibniz equality).

One of the most notable contributions of the Interaction Trees project is that the authors provide an example of two programming languages implemented as interaction trees: Imp and Asm. The first is a simple imperative language, and the second is a language similar to assembly. Then, they implement a translation from Imp to Asm, and prove such a translation correct.

1.1.6 Vellvm (Verified LLVM)

The Vellvm project [15] formalizes, in an iTree-based form, the semantics of a representative subset of the LLVM Intermediate Representation, known as VIR (Verified IR).
It also provides a rich theory library to reason about such semantics, but the project lacks one element: a verified translation of a simple imperative language, like the one done in iTrees [14, Ch. 5]. That point is partially covered by Helix [16], a specific math DSL which achieves the conversion of math to executable code. But its high specificity and complex language features leave a hole to be covered in the Vellvm project: the verification of a translation of a simple imperative language. Here is where ASL can contribute to the Vellvm project. ASL aims to be a simple imperative language with a verified translation to VIR. Such a verified translation can serve as a reference point for other verified translations of other languages to VIR.

The Vellvm project is actually very complete and its library is extensive. The main issue found during development was the entry barrier and the lack of documentation. After understanding the project and reading the code base, one can grasp the concepts and start picking up speed, but there is a very steep curve until reaching that point, aggravated heavily if one does not have a good command of Coq. This work seeks to serve as a small entry point to reduce that barrier.

1.2 Previous Work

This Master's thesis is the continuation of Eleven Render's development, explained in the previous Bachelor's thesis "Evaluation and acceleration of Path Tracing algorithm in heterogeneous architectures" [6]. In that previous work, a photo-realistic rendering engine was coded in CUDA from a very fundamentalist point of view, where most of the code was written from scratch with the objective of fully understanding all the steps in the development of a rendering engine. The Bachelor's thesis mostly covered the implementation and analysis of the Path Tracing algorithm on Nvidia GPUs.
That implementation has been maintained, adapted and built upon, while keeping the rendering core.

1.3 Goals

Eleven is an ambitious project: it tries to merge many different projects into one. Because it covers the continuous development of a rendering engine, the main goal could be ambiguous. For that reason, here is the list of the main objectives designated for this master's thesis before starting to work on it:

1. Make a rendering engine which works for almost any target architecture, and evaluate it.
2. Make it usable with some 3D software.
3. Include a working shading model with runtime shader loading.
4. Formally specify an expressive abstract shading language which would allow other shading languages to be converted to it.
5. Create a translator from such a language to LLVM-IR and verify that translation.

These subgoals have been revisited after finishing the project. The 1st point has been successfully completed: the SYCL port is finished, except for further optimizations and some code improvements. The 2nd point is also successfully completed: interoperability with the Blender 3D software was achieved thanks to the "ElevenBlender" add-on and architectural changes in the engine. The 3rd point has been achieved partially: the parameter-driven shading model is limited from an expressiveness perspective, while the original goal was a fully expressive shading model, and run-time shader loading wasn't achieved because of the complexities of the DPC++ toolchain. The final two points are where expectations were off. A shading language has been formally specified and a translator verified, but the language in particular is just too simple to be useful. The verification took far longer than expected and was more tedious than initially devised. Even so, a minimal proof of concept has been completed and included in the project, which demonstrates it is possible to further develop the main goal.
1.4 Work Plan

The work plan devised for this work has been to include features until the deadline. This plan allowed fitting a huge number of features and a rich final result, but it has proven hard to maintain. Many of those features have been deprecated in favor of better-fitting solutions. One example of this is the previous version of ASL, initially done in September of 2022, which was a more featureful interpreted language (instead of the current, correct "compiled" approach); another is shadergraph2osl (https://github.com/101001000/shadergraph2osl), an Eleven plugin to convert Blender's shader graph into OSL shaders by using Model Driven Engineering, which had to be halted because of the lack of time. A timeline with the main milestones reached is shown below:

Sep 2021: Finished Eleven Render Bachelor's degree.
Feb 2022: Several upgrades to the original codebase (denoiser, texture formats, passes, .mtl loading).
Jul 2022: SYCL port finished.
Aug 2022: First working Blender integration.
Sep 2022: ASL POC initial design and implementation.
Nov 2022: Blender plug-in first release.
Dec 2022: TCP architecture.
Apr 2023: Verified ASL assignment.
Jul 2023: ASL translation verified.
Aug 2023: Shading system working.
Aug 2023: AST mainly verified translation.
Sep 2023: Benchmarks done.

Table 1.1: Timeline of the development of this master's thesis

Chapter 2
Render Architecture

The previous Eleven Render Bachelor's Thesis focused on the inner workings of the rendering kernel. In contrast, this new version is heavily focused not so much on the path-tracing algorithm, but on how the architecture has been reworked to make it work with heterogeneous architectures.
This chapter covers the development decisions and architectural details, as well as how compilation is handled in such a way that custom IR code can be side-loaded into device kernels. The final result of this chapter is a fully working rendering engine, available in the https://github.com/101001000/ElevenRender repository, together with a working CMake compilation infrastructure, which won't be detailed for brevity but whose details are available for consultation in the CMakeLists.txt file.

2.1 High level architecture overview

This section provides an overview of Eleven's high-level operation. Subsequent sections will dig deeper into the specifics of its architectural design.

Eleven can be seen as a process with three main inputs of data: input commands, scene data and LLVM-IR shaders. Scene data holds the scene representation as well as material information, textures, geometry data, etc. Input commands are a way to interface with Eleven, through which the main actions like data loading and render starting are requested. Finally, LLVM-IR shaders are fed in during Eleven's compilation phase, as they require some processing and must be compiled for the target architecture. The input LLVM-IR shaders can have two origins: the first is being obtained from the ASL LLVM-IR translator explained in Section 4.3; the second is being translated with some LLVM front-end from some other programming language. An example of this second procedure is explained in Section 2.3.3, where a C++ shader is included in Eleven.

Limiting shader loading to the compilation phase is one of the limitations of the current architecture, which is expected to change in future versions in favor of run-time shader loading. This could be done thanks to the JIT capabilities of LLVM, but there are some caveats, discussed in Section 2.3.
After loading the data into Eleven, it is pre-processed and sent to the device. The device then runs the path-tracing algorithm along with the shader execution. The data is retrieved from the device asynchronously to facilitate progressive rendering, and is periodically sent back as output data when requested. This output data holds the render result, render telemetry such as the render progress, and other relevant information such as which devices are available and their status.

2.2 Data Handling, Communication, and Rendering Processes

2.2.1 IPC

Eleven runs as a TCP server implemented with boost::asio, which makes it possible to run the rendering process on a different machine from the one running the 3D software. Inter Process Communication (IPC) in Eleven Render is handled through the same TCP socket used for communicating with the 3D software; that socket is also used for sending the scene data. The main advantage of this method is flexibility over efficiency: it is not ideal when the 3D software and Eleven share the same machine, because the data needs to be copied to be sent over the socket, which means data redundancy on the same machine. For example, sending 4 GB of texture data (16 textures of 4096x4096, 4 float channels) with the added message overhead and texture color space conversion takes 11.84 s on the local machine, which translates into an effective 337 MB/s, a poor result for an IPC. IPC scene loading is currently the main bottleneck in most scenarios. This operation could be replaced in the future by reading the data directly from the 3D software application; in the case of Python and C++, this is usually done through "Python bindings".
Figure 2.1: Data handling from Blender to Eleven (Blender and Eleven Render exchange input commands, scene data and render data through the TCP socket, the file system and Eleven's Input Manager)

Input Commands

Input commands are implemented with boost::program_options, in a similar way to how one would interact with program arguments. They are the way the user communicates with Eleven.

• Load commands: in charge of retrieving data from the socket or the filesystem (using the --path option). The load commands currently implemented are the following:
– load config
– load texture
– load camera
– load hdri
– load brdf material

The --mirror x and --mirror y options, when loading an HDRI or a texture, mirror the image along the selected axis. The --recompute normals option, if present when loading objects, applies the normal computation algorithm described in Section 2.4.

• Get commands: retrieve information from the device, such as rendering progress, available devices or the render result.
– get info
– get sycl info
– get pass

• Action commands: instruct the rendering kernel how to operate.
– start
– pause
– abort

2.2.2 SYCL port

For the porting work, the "Intel DPC++ Compatibility Tool" (DPCT1, recently renamed SYCLomatic2) [17, 18] was tried first, without much success. This tool converts CUDA code to SYCL code, but it had some limitations at the time it was used, mainly with global variables and the cuRAND library. The resulting code was only partially converted and did not compile. For that reason, a manual port was done. One of the manual steps was to replace cuRAND with a simple pseudorandom number generator, described in Figure 2.2. Another required manual step was to replace all the math.h library functions with their namespaced sycl:: alternatives. This is especially required when using the CUDA backend, as the generated code will otherwise fail to link because of the namespace clash.
RngGenerator::RngGenerator(uint32_t _seed) {
    this->state = hashf(_seed + 1);
}

float RngGenerator::next() {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return (static_cast<float>(state) / static_cast<float>(UINT_MAX));
}

uint32_t hashf(uint32_t seed) {
    uint32_t hash = 0;
    for (size_t i = 0; i < sizeof(uint32_t); i++) {
        hash += (seed >> (i * 8)) & 0xFF;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

Figure 2.2: Pseudorandom number generator

Another big change from the porting effort is the handling of device data transmission.

1Intel® DPC++ Compatibility Tool: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html#gs.5bmosm
2SYCLomatic repository: https://github.com/oneapi-src/SYCLomatic

Device data transmission

SYCL provides two mechanisms for memory movement between host and devices:

• Buffers and Accessors.
• USM (Unified Shared Memory).

Buffers represent data which can live on both host and device; SYCL then manages the memory copies implicitly when the data is accessed through Accessors, resolving data dependencies by itself. USM, on the other hand, is a pointer-based approach which allows both implicit and explicit memory management. In Eleven Render, the explicit USM memory management approach has been chosen, and it is performed by the SYCLCopy.cpp file. There are two reasons for this decision: practicality, as the explicit USM model is very similar to the one implemented in the older Eleven Render CUDA version, and performance, as according to Intel, when using Buffers, "The array references are implemented with more complex code and are significantly slower than native C++ array references.
While it is acceptable to reference a small amount of data, computationally intensive algorithms using host accessor pay a large performance penalty and should be avoided." [19]. The explicit USM model is also what the officially supported oneAPI Blender Cycles implementation uses.

Eleven's current rendering process is monolithic from the data transmission perspective: it sends all the scene information in one step, without granularity for different scene elements. In the future, interactive rendering3 could require other memory management mechanisms, as the synchronization of scene assets could be simplified if implicit copies were used. Granularity of data transmission is the next logical step for Eleven's architecture, as it allows for lower data redundancy when copies of small chunks of data are deleted before the next ones are sent.

SYCL rendering kernel execution

A SYCL queue is an abstraction for submitting actions to a single device, and there are two queues in Eleven: k q and d q. The first one controls the execution of the main rendering kernel. The kernel render enqueue function submits synchronous render calls to the k q queue, one for each required sample; because it waits for each sample's rendering call to end with the .wait() directive, it needs to run in a secondary, non-blocking thread.
3Interactive rendering is the process by which a scene is rendered while its elements can be modified, with almost real-time visualization of the changes.

In theory, a synchronous queue should not be needed, as the code can be refactored so that rendering results are independent between executions. However, enqueueing kernels, even asynchronously, added an unexpectedly high time overhead (probably because of the SYCL mechanisms for resolving dependencies during kernel scheduling), so the best solution was to queue them synchronously in another thread.

nd range

As was shown in the previous bachelor's thesis, preserving spatial locality of emitted rays is a must for rendering performance [6, sec. 2.2.2]. In CUDA, that was done by setting the blocksize parameter and matching the 2D dimensions of the sensor. SYCL abstracts kernel execution in a slightly more complex way. It provides a construct called nd range, quite similar to its CUDA counterpart, which requires setting global and local dimensions for the execution block. SYCL does not include any heuristic for the optimal blocksize as CUDA does, so an empirical test over the blocksize has been carried out to find a good candidate for this parameter. The optimal value depends on many factors: the available registers, the kernel executed, and even the kind of scene can affect performance.
Figure 2.3: Rendering times by blocksize parameter. (a) Nvidia RTX3090, time (ms) vs. blocksize; (b) Intel i9 13900KF, time (ms) vs. blocksize.

For the Nvidia GPU, blocks of four by four or larger are good candidates, and for the Intel CPU, any value is probably good. The lower value at blocksize 2 is, with high probability, an outlier.

2.3 Including a custom shading language in the pipeline

This section covers how arbitrary LLVM-IR shading code is included into Eleven's compilation pipeline and subsequently executed by the target architecture. The end goal is to provide a language-agnostic workflow for including shaders into Eleven. Executing shading code has traditionally been a challenging task for rendering engines, even more so when targeting as many different architectures as the SYCL programming model does. For that reason, the implementation explored in this section is non-trivial and relies on a non-traditional solution which is subject to change. The preferred way to execute custom kernels in SYCL is what the SYCL 2020 specification defines as a kernel bundle, a way to represent kernels which can be defined and executed at run-time. The main challenge here is that those kernels need to be built beforehand. The current way to build those kernels at run-time is to rely directly on the underlying API. For example, one could build an OpenCL kernel by doing something similar to what is shown in Figure 2.4. The issue with this approach is that each API needs to be called individually (which would break Eleven's portability), and most of them do not support LLVM-IR code as input. Another option is to implement a way to load LLVM-IR code as uncompiled kernels (bundle state::input) and use the functions sycl::build and sycl::compile to produce a runtime executable, or even convert the LLVM-IR code back to C.
These options may be explored in the future, but for now, a more rudimentary static compilation approach has been used.

cl_kernel ocl_kernel = clCreateKernel(ocl_program, "add", nullptr);
auto mybundle = sycl::make_kernel_bundle(ocl_program, ctx);

Figure 2.4: Making a kernel bundle from an OpenCL program

The static compilation approach explored here is not a simple task. Eleven's materials can hold more than one shader, specifically one per input of Disney's BRDF function, so the challenge becomes more complex as soon as more than one shading function is added. SYCL's virtual function limitations [20, p. 15] do not allow the use of function pointers for the required shading functions when there is more than one, so each function header needs to be declared at compile time and dispatched individually at run-time. For that same reason, the implementation relies heavily on the C++ macros offered by the Boost Preprocessor library, which simplifies the handling of hundreds of function definitions.

2.3.1 Shading kernel execution

Shader code execution in the device kernel happens through a call to the shade(int shader id, shader params...) function. shader params... are the input and output parameters of the shader, and shader id is a run-time identifier of the required shader function. In the function implementation, shader id is fed to a switch-case construct which calls the corresponding shade function. The maximum number of shader functions is defined by the MAX SHADERS macro, which indicates how many shade function headers will be generated. Figure 2.5 shows such an implementation, along with the resulting code after macro expansion.

#define ASL_SHADE_CASE(z, n, text) \
    case n: shade##n##_(shading pars); break;

void shade(int id, shading pars...) {
    switch (id) {
        BOOST_PP_REPEAT(MAX_SHADERS, ASL_SHADE_CASE, )
    }
}

// after macro expansion:
void shade(int id, shading pars...) {
    switch (id) {
        case 0: shade0_(shading pars...); break;
        case 1: shade1_(shading pars...
); break;
        case 2: shade2_(shading pars...); break;
        ...
    }
}

Figure 2.5: Shade function dispatcher implementation

The expanded functions shade0 , shade1 , shade2 , ... are also declared in a header file and decorated with the SYCL EXTERNAL macro to indicate external linkage, but their implementation is left empty on purpose, as the contents will be filled with the IR code generated by the shader IR compilation step. Instead, a file called shader placeholder.cpp holds an empty implementation of each such function. This is because providing raw IR code is not enough for heterogeneous architectures: as was shown in Section 1.1.2, when compiling with AOT, the same code is compiled for each defined target, and the result is multiple definitions of the same function. Something similar is needed when handling the IR, but generating different code for each target. DPC++ does this for the empty functions of shader placeholder.cpp, and then the contents of the resulting file are substituted with the IR code.

Figure 2.6: Shader compilation process. The ASL compiler turns shader code.asl into asl compiled.ll; DPC++ with -emit-llvm turns shader placeholder.cpp into shader empty.ll; the IR populator combines them into shader.ll, which DPC++ then compiles together with Eleven's source code and dependencies into the final executable.

2.3.2 IR Populator

A Python script, function body replace.py, loops over every file in the shaders directory. The shaders need to be named "0.ll", "1.ll", etc., where each number is the shader identifier a material will reference. From each file, the function body delimited by the entry: and ret void tags is stripped of its metadata and placed into the corresponding multiple-target shade functions in the shader.ll file.

Opaque Pointers

One particularity of recent LLVM versions is the introduction of Opaque Pointers. Opaque pointers are simply pointers whose type information has been stripped off.
"The community consensus is that the costs of pointee types outweigh the benefits, and that they should be removed" [21]. For that reason, since Clang 15.0, pointers do not carry type information by default. This is not entirely true for all LLVM backends, however; for example, the CUDA backend keeps pointer types. To homogenize the generated shader.ll file, pointers without type information are substituted with their float* counterparts.

0.ll:
shade_asl (...) {
entry:
  <shader0 body>
  ret void
}

1.ll:
shade_asl (...) {
entry:
  <shader1 body>
  ret void
}

shader.ll:
spir shade_asl0 (...) { entry: ret void }
cuda shade_asl0 (...) { entry: ret void }
spir shade_asl1 (...) { entry: ret void }
cuda shade_asl1 (...) { entry: ret void }

Figure 2.7: Function body replacement

2.3.3 C++ Shader example

One of the main advantages of this shader loading system is that almost any shader coded in any language that can be compiled to LLVM-IR can be loaded. As an example, a Shadertoy GLSL shader4, adapted to C++, will be put into Eleven. First, a function is created with the main output arguments (Cr, Cg, Cb) as float references. If additional arguments are required (for example the uv coordinates), they must have the same names as the arguments defined in ASL. Only the function body is used, so the function declaration itself is irrelevant.

4Shadertoy is a website where people publish their own OpenGL shaders written in the GLSL shading language, which can be rendered interactively.
(Shader source: https://www.shadertoy.com/view/mtyGWy)

inline __attribute__((always_inline))
void palette(float t, float& ox, float& oy, float& oz) {
    ox = 0.5 + 0.5 * cos(6.28318 * (1.0 * t + 0.263));
    oy = 0.5 + 0.5 * cos(6.28318 * (1.0 * t + 0.416));
    oz = 0.5 + 0.5 * cos(6.28318 * (1.0 * t + 0.557));
}

void shade(float u, float v, float& Cr, float& Cg, float& Cb) {
    float t = 1.0;
    float u1 = u;
    float v1 = v;
    Cr = 0.0;
    Cg = 0.0;
    Cb = 0.0;
    for (float i = 0.0; i < 4.0; i++) {
        u1 = fract(u1 * 1.5) - 0.5;
        v1 = fract(v1 * 1.5) - 0.5;
        float d = length(u1, v1) * exp(-length(u, v));
        float colx, coly, colz;
        palette(length(u, v) + i * 0.4 + t * 0.4, colx, coly, colz);
        d = sin(d * 8.0 + t) / 8.0;
        d = abs(d);
        d = pow(0.1 / d, 1.2);
        Cr += colx * d;
        Cg += coly * d;
        Cb += colz * d;
    }
}

Figure 2.8: Shadertoy shader C++ adaptation

The code in Figure 2.8 is a C++ adaptation of the GLSL shader code. Bear in mind that the functions fract, length, abs, exp, pow, sqrt, sin and cos also need to be implemented in the same file and decorated with inline and __attribute__((always_inline)) to force inlining in Clang, because the end result must be a single, self-contained function with no calls to other functions. In this example, those functions have been implemented manually but are omitted for simplicity. Then the Clang compiler is run over the file to produce the .ll code with the following command:

$ clang++ -c -S -emit-llvm shader.cpp -O3 -ffp-contract=off -o 0.ll

The -ffp-contract=off flag is required to avoid fusing multiply-add operations, which are not yet supported in the current pipeline. Then, the "0.ll" shader can be added to the "shaders" folder and the Eleven compilation run.
By doing this, if a rendering operation is executed with the "albedo shader id" set to "0" in some material, a result similar to Figure 2.9 is obtained.

Figure 2.9: Shadertoy shader rendered in Eleven Render

2.4 Miscellaneous improvements

Intel Open Image denoiser

One of the most common "hacks" for speeding up convergence times is to apply an end-image denoiser. Intel provides an open source solution, broadly used in the industry, as part of its oneAPI Rendering Toolkit, which uses machine learning as a novel way to denoise images. This solution has been chosen because of its ease of implementation and has been included in the current version of Eleven. It is enabled by default and can be selectively disabled.

Figure 2.10: Visual differences for denoise application in a sample scene. (a) 10 samples, no denoise; (b) 10 samples, denoised; (c) 500 samples, no denoise; (d) 500 samples, denoised.

As can be observed in Figure 2.10, denoising gives an overall better-looking image, but it will not recover detail unless that detail is minimally present in the pre-denoised image. For example, the leaf marks on the teapot are washed out by the denoising from (a) to (b), but are preserved from (c) to (d), as the denoiser can do a better job of identifying details in less noisy images.

Normal re-computation

Eleven's rendering can show shading artifacts when trying to render a flat surface whose normals have not been computed as flat. This happens, for example, when Blender applies a smooth-shading normal computation over flat surfaces. As a simple solution, an algorithm to recompute the normals has been implemented to address this issue and compute more suitable normal vectors. This algorithm averages the face normals surrounding a vertex, weighted by the size of each face, and is implemented in the recompute normals face weight function in the ObjLoader.cpp file.
As shown in Figure 2.11, after recomputing the normals, some shading artifacts are successfully removed.

Figure 2.11: Eleven Render comparison with and without normal re-computation. (a) original normals; (b) face-weighted normals.

Per-face material assignment

The previous version of Eleven retrieved the materials from an intersection on a per-object basis; each MeshObject had one material assigned. The common approach, however, is to do it per face, so each triangle in an object can have a different material assigned.

Chapter 3 Eleven Blender

A big part of making a rendering engine is the way it integrates with 3D applications. Most 3D design applications provide an API to develop custom plugins or add-ons, as most of them rely on third-party rendering engines. That is also the case for the well-known open source 3D application Blender, the application I have chosen to be the first to interact with Eleven Render. To achieve this, I have released a custom add-on called ElevenBlender, whose repository is available at https://github.com/101001000/ElevenBlender. This add-on is fully implemented in Python and is in charge of managing the whole connection with Eleven, converting the scene data to Eleven's format, and setting up and configuring the render parameters.

Figure 3.1: Eleven Blender's configuration panel (panel.py)

Once the add-on is loaded and Eleven is selected as the main rendering engine, a configuration panel like the one shown in Figure 3.1 becomes available in the render settings. It is possible to connect to a running instance over TCP by providing an IP and a port. After a successful connection, the available devices on the machine where the rendering instance is running are listed. This list excludes the devices for which Eleven has not been compiled with the AOT flag -fsycl-targets, as they would not be able to run shader code.
After that, it is possible to render the current scene with the standard Blender rendering command. The next subchapters detail how the data is delivered to Eleven.

3.1 Geometry loading

The current method for sending the scene geometry data is inefficient but easy to implement, and it takes advantage of the previous Eleven Render Wavefront loading code. Ideally, the geometry data would be read directly from Blender's memory; instead, Eleven Blender calls Blender's Wavefront IO export module to create a .obj file with all the scene geometry and a .mtl file with the material data. Both files are then sent to and loaded by Eleven. Blender has two functions to export Wavefront files: bpy.ops.export scene.obj and bpy.ops.wm.obj export. The first one is the legacy one, deprecated in newer Blender versions, but it was compatible with rapidobj, the Wavefront OBJ loading library used by Eleven Blender; the second had compatibility issues with rapidobj due to discrepancies in the material formatting (even though both work under the same Wavefront standard). The solution has been to strip all the data from the exported material information and delete the empty-name materials which were crashing rapidobj. This is a temporary solution, as it requires opening and processing a file (which can be on the order of 100 million lines) line by line to search for the material inconsistencies; in the future, an issue should probably be opened against the rapidobj library and the problem fixed at the library level. Both methods have been tested, with very different results.
The second one is notably faster: almost 24 times faster for the demo scene "living-roomCC0" detailed in Figure 5.1, as can be seen in Figure 3.2.

Figure 3.2: Comparison between both Wavefront exporting methods of Blender. Time to export: export scene.obj (old) 12.61 s; wm.obj export (new) 0.52 s.

3.2 Material interface

Blender offers two options when making a custom rendering engine: using Blender's shading nodes, or defining custom shading nodes (which is the default option) via the parameter bpy.types.RenderEngine.bl use shading nodes custom. Most third-party rendering engines that work with Blender take the second approach, as it is much easier: they can share their custom shading node system across many different 3D applications, and they avoid having to maintain compatibility with Blender's nodes, which change with each release1. Eleven, on the other hand, prioritizes flexibility with different shading systems over having its own. For that reason, Eleven uses the already existing Blender node system, even if that means a very limited implementation and limited material compatibility. That being said, the way Eleven processes Blender's materials is quite simple: it just reads the material surface input and, if it is a Principled BSDF, uses it; if not, it discards the material. The Principled BSDF node is in essence Disney's Principled model, which Eleven makes use of, so the parameters are paired one by one with Eleven's material representation. Regarding the parameters, it is important to note that, because of the incompatibility with the rest of Blender's shading system, the current system will only retrieve value inputs which do not come from nodes, except for texture nodes (or a normal node with a texture attached); in that case, the texture is retrieved and sent to Eleven.
Custom shaders

If Eleven has been compiled with some shaders as described in Section 2.3, it is possible to select them as material inputs with the custom "Eleven Shader Selector Node", which takes an ID as its value and tells Eleven to compute that shader for the input it is connected to.

Figure 3.3: Eleven Shader Selector node

1Disney's RenderMan can use Blender's shading nodes thanks to its support for OSL. Some nodes are translated manually to its own shading system, but for the rest it relies on the OSL implementation of Blender's node system.

Chapter 4 Abstract Shading Language

Abstract Shading Language was designed to be an expressive language to control light interaction in a ray-traced scene, with the advantage of being abstract enough to work as a middle step between other shading languages and Eleven. That intention was ambitious, as the process of verifying the translation to VIR turned out to be harder and more tedious than initially anticipated. This chapter covers the specification of the language, together with the formalization of its semantics in Coq and the verified VIR translation. All the implementation details of ASL, such as the Coq-defined semantics, the OCaml parser, the translation implementation and all the concrete formal proofs, can be found in the ASL repository: https://github.com/101001000/ASL. Because of the large amount of work and time required for verification, the goal changed from a full-featured shading language to creating the simplest possible language and demonstrating a proof of concept for its inclusion in the project. For that reason, at the moment, the language only supports variable declarations and integer assignments. An example of ASL code can be seen in Listing 4.1.

var x;
var y;
x = 5;
y = 4;

Listing 4.1: ASL code example

4.1 Syntax

4.1.1 AST

The abstract syntax tree of ASL is defined in the Coq file AST.v, and its definition is provided in Figure 4.1.
There are two key aspects to discuss here.

The first aspect is why the declarations are separated from the statements. This decision restricts the flexibility of adding variable declarations in other parts of the code, but it simplifies the verification by splitting the problem into two steps: an allocation step and an execution step. This approach makes it possible to verify the execution step assuming all the variables are allocated. This will be explained in more detail in the verification section.

The second aspect concerns the datatypes ASL uses: int32, from Vellvm.DynamicValues, and the list datatype. int32 was chosen to facilitate the translation verification, but it might be preferable to reduce the dependency on Vellvm data types, to achieve more independence from the Vellvm semantics, by choosing another datatype to represent integers. The challenge would then be creating an equivalence between both types, which is problematic for operations outside the range of an int32 in the case of unbounded integers. ASL uses Coq's native polymorphic list datatype, instead of the classic tree recursion of simple, formally defined imperative languages like the While language [22], because ASL represents a sequence of instructions, which will become instruction blocks once control structures are included.

Inductive dec :=
| Var (x : string).

Inductive expr : Type :=
| Lit (n : int32).

Inductive stmt : Type :=
| Assign (x : string) (e : expr)
| Skip.

Inductive prog :=
| Prog (ds : list dec) (ss : list stmt).

Figure 4.1: AST.v

Inductive list (A : Type) : Type :=
| nil : list A
| cons : A → list A → list A.

Listing 4.2: Coq's list implementation

4.1.2 Lexer

Lexing in OCaml is relatively simple thanks to the tool ocamllex. To use it, one only needs to define a lexer.mll file like the one shown in Figure 4.2 and then run the tool over it.
The result is an OCaml module, lexer.ml, which can be used for lexing text inputs. One of the most important parts of lexing is making sure that the correct types are being used. As OCaml is a strongly typed language, and Coq extraction is extremely tricky with the exported types (one can have many different implementations of the same datatype), the types are usually incompatible and require conversions. CompCert, the formally verified C compiler [12], provides a useful OCaml module for these type conversions between Coq and OCaml, called Camlcoq.ml. This module is used in this project to handle most of the datatype conversions.

{
open Parser
exception Error of string
}

let digit = ['0'-'9']
let letter_or_underscore = ['a'-'z' 'A'-'Z' '_']
let letter_or_underscore_or_digit = ['a'-'z' 'A'-'Z' '_' '0'-'9']

rule token = parse
| [' ' '\t' '\n'] { token lexbuf }
| "var" { VAR }
| '=' { EQ }
| ';' { SEMI }
| letter_or_underscore letter_or_underscore_or_digit* { ID(Lexing.lexeme lexbuf) }
| digit+ as i { INT (Camlcoq.Z.of_sint32(Int32.of_string i)) }
| eof { EOF }
| _ { raise (Error (Printf.sprintf "Unexpected char at %d.\n" (Lexing.lexeme_start lexbuf))) }

Figure 4.2: ASL Lexer

4.1.3 Parser

The parsing process is done similarly to the lexing, but with the tool menhir. In this case, the most remarkable parts are: how both statements and variable declarations are handled as lists; how the INT token carries a Camlcoq.Z.t, which represents the same type as DynamicTypes.Int32 from ASL.v; and how the explicit constructors for expressions and variable declarations have been removed, because Coq extraction optimizes singleton constructors. This could be avoided in the extraction code by using the flag Extraction KeepSingleton, but that throws other errors.
%{ open AST %}

%token INT
%token ID
%token SEMI
%token VAR
%token EQ
%token EOF
%token SKIP
%start main
%%

main:
| ds = decs ss = stmts EOF { Prog (ds, ss) }

stmts:
| s = stmt SEMI { [s] }
| s = stmt SEMI t = stmts { s :: t }

stmt:
| a = assign { a }
| sk = skip { sk }

decs:
| d = dec SEMI { [d] }
| d = dec SEMI t = decs { d :: t }

dec:
| VAR x = ID { Camlcoq.coqstring_of_camlstring x }

expr:
| i = INT { i }

assign:
| i = ID EQ e = expr { Assign(Camlcoq.coqstring_of_camlstring i, e) }

skip:
| SKIP { Skip }

Figure 4.3: ASL Parser

4.2 Semantics

Denotation

The semantics of ASL can be found in the file Semantics.v. They are a trimmed-down version of the Imp language described in the interaction trees paper [14], where SetVar is an uninterpreted event which can be "triggered" (trigger is just a fancy way of calling Vis e (fun x -> Ret x)). SetVar is a variant of the State monad, which comes from Coq's ext-lib extended library.

Definition denote_expr (e : expr) : itree eff int32 :=
  match e with
  | Lit n ⇒ ret n
  end.

Definition denote_stmt (s : stmt) : itree eff unit :=
  match s with
  | Assign x e ⇒ v ← denote_expr e ;; trigger (SetVar x v)
  | Skip ⇒ ret tt
  end.

Fixpoint denote_stmts (ss : list stmt) : itree eff unit :=
  match ss with
  | nil ⇒ ret tt
  | h :: t ⇒ denote_stmt h ;; denote_stmts t
  end.

Fixpoint denote_decs (ds : list dec) : itree eff unit :=
  match ds with
  | nil ⇒ ret tt
  | h :: t ⇒ match h with
             | Var x ⇒ trigger (SetVar x (Int32.repr 0%Z))
             end ;; denote_decs t
  end.

Definition denote_prog (p : prog) : itree eff unit :=
  match p with
  | Prog ds ss ⇒ denote_decs ds ;; denote_stmts ss
  end.

Figure 4.4: Program denotation

Interpretation

Once the construction of the denotation interaction tree is defined, it is necessary to define the interpretation of such a tree, i.e., how those events should be "triggered".
The State events, like SetVar, are interpreted as map insertions over a zero-initialized string to int32 map: mapE string (Int32.repr 0%Z). This behaviour is encapsulated in the mapE event and the interp_map definition of the interaction trees library Map, which abstracts the interaction with such a map.

https://github.com/101001000/ASL/blob/master/src/coq/Lang/Semantics.v
https://github.com/coq-community/coq-ext-lib/blob/master/theories/Data/Monads/StateMonad.v

Definition handle_State {E: Type → Type} `{mapE string (Int32.repr 0%Z) -< E} : State ~> itree E :=
  fun _ e ⇒ match e with
            | SetVar x v ⇒ insert x v
            end.

Definition interp_asl {E A} (t : itree (State +' E) A) : stateT env (itree E) A :=
  let t' := interp (bimap handle_State (id_ E)) t in
  interp_map t'.

4.3 LLVM-IR Translation

This section covers the file Compiler.v, which holds a translator from the ASL AST to the Vellvm AST. The top-level translation starts with the compile definition shown in Figure 4.5, which takes a prog input and concatenates the result of compiling the declarations (compile_decs_s ds) with the result of compiling the statements (compile_stmts_s ss). These functions return the Vellvm datatype code, which holds a sequence of VIR instructions. The Vellvm datatype code is polymorphic and accepts two kinds of instructions: dynamically typed and statically typed. A deeper explanation of the type differences in Vellvm is given in Section 4.4.4. For now, it is only necessary to know that most compiling definitions are defined twice: one with the _s suffix, returning statically typed code (code typ), and another one without the suffix, returning the same code but dynamically typed (code dtyp).

Definition compile (p : AST.prog) : code typ :=
  match p with
  | Prog ds ss ⇒ (compile_decs_s ds) ++ (compile_stmts_s ss)
  end.
Figure 4.5: Top level compile definition

Both compile_decs_s and compile_stmts_s are just recursive traversals over a list, applying to each element their non-recursive counterparts, compile_dec_s and compile_stmt_s, which select the required code generation definition. In the case of the compilation of a variable declaration, the code for allocating a variable of name x is concatenated with the initialization of that same variable to the value 0. Compiling a Skip statement returns an empty list, as expected, while compiling an assignment results in the LLVM store instruction. These two definitions are available in Figure 4.6.

https://github.com/DeepSpec/InteractionTrees/blob/master/theories/Events/Map.v
https://github.com/101001000/ASL/blob/master/src/coq/Compiler/Compiler.v

Definition compile_dec_s (d : dec) : code typ :=
  match d with
  | Var x ⇒ [gen_alloc_s x; gen_store_s x (Int32.repr 0%Z)]
  end.

Definition compile_stmt_s (s : stmt) : code typ :=
  match s with
  | Skip ⇒ nil
  | Assign x v ⇒ match v with
                 | Lit n ⇒ [gen_store_s x n]
                 end
  end.

Figure 4.6: Translation code for declaration and statement

gen_allocate and gen_store

These two definitions return the construction of an LLVM instruction:

Definition gen_allocate_s (x:string) :=
  (LLVMAst.IId (LLVMAst.Name x),
   LLVMAst.INSTR_Alloca (TYPE_I 32) None None).

Definition gen_store_s (x:string) (i:Int32.int) :=
  (IVoid 0%Z,
   (INSTR_Store false
     ((TYPE_I 32%N), (EXP_Integer (Int32.unsigned i)))
     (TYPE_Pointer (TYPE_I 32), (EXP_Ident (ID_Local (Name x))))
     None)).

4.4 Translation verification

4.4.1 The Vellvm memory model

Vellvm memory works with three elements: global_env, local_env and memory_stack. This project only focuses on the local_env and the memory_stack, ignoring the global environment. The local environment is a map from identifiers to values (which, in the ASL context, are always pointers to memory blocks).
The memory stack is a map from block addresses to blocks holding data values.

get_val

get_val is a helpful definition included to retrieve values from Vellvm memory. It retrieves the pointer associated with some variable in the local environment and returns the int32 value hosted in the block the pointer is addressing. There are several reasons why this operation could fail; in any such case, it returns a None option value:

1. There is no such variable in the local environment.
2. The variable does not hold a pointer.
3. There is no block associated with the pointer.
4. The data inside the block is not an int32.

In retrospect, it would have been easier from a verification standpoint to distinguish the different errors instead of just returning None.

Definition get_val (g: global_env) (l: local_env) (m: memory_stack) (x: string) : option int32 :=
  let addr := FMapAList.alist_find (Name x) l in
  match addr with
  | Some (UVALUE_Addr ptr) ⇒
      let val := read m ptr (DTYPE_I 32%N) in
      match val with
      | inl _ ⇒ None
      | inr v ⇒ match v with
                | UVALUE_I32 i ⇒ Some i
                | _ ⇒ None
                end
      end
  | _ ⇒ None
  end.

4.4.2 Renv

This section covers the file Renv.v, which holds the proposition relating the Vellvm environment with the ASL environment. It can be split into four sub-propositions.

The first proposition ensures that every value retrieved successfully with the get_val definition from the Vellvm environment must be the same as the value retrieved from the ASL environment, and the other way around. This is the core of the Renv relationship and ensures that all the values are the same between the two data models.
https://github.com/101001000/ASL/blob/master/src/coq/Compiler/Renv.v

Definition Renvh (env_asl : env) (g_llvm : global_env) (l_llvm : local_env) (m_llvm : memory_stack) : Prop :=
  forall k v,
    (alist_In k env_asl v ↔ get_val g_llvm l_llvm m_llvm k = Some v) ∧
    (FMapAList.alist_find k env_asl = None ↔ FMapAList.alist_find (Name k) l_llvm = None) ∧
    unique_ptrs l_llvm ∧
    no_next_key l_llvm m_llvm.

The issue is that the first proposition alone is not strong enough. One could imagine a scenario where some variable in the local Vellvm environment points to an unallocated memory region; such a state would still satisfy the relationship. For that reason, the relationship needs to be hardened with the second proposition, ensuring that if some variable is not in the ASL environment, it is not in the local Vellvm environment either.

The third proposition ensures that there are no repeated pointers in the local environment, and the fourth one asserts that the pointer of the next empty block is not in the local environment. These last two propositions act more like invariants that will be used in the reasoning.

The objective now is to prove that the Renv proposition still holds after performing certain operations over both environments at the same time. For example:

Lemma RenvhAssign : forall env_asl g_llvm l_llvm m_llvm x i,
  Renvh env_asl g_llvm l_llvm m_llvm →
  Renvh (FMapAList.alist_add x i env_asl)
        g_llvm
        (FMapAList.alist_add (LLVMAst.Name x) (UVALUE_Addr (next_logical_key m_llvm, 0%Z)) l_llvm)
        (mem_stack_add m_llvm x i).

The following lemma proves something similar, but for overwriting variables. It requires some preconditions: for example, overwriting a variable will fail if it has not been allocated previously, and it is necessary to ensure that the variable maps to some pointer in the Vellvm environment, and not to another kind of value.
These requirements are encoded in the hypotheses allocated ptr m, FMapAList.alist_find (Name x) l = Some (UVALUE_Addr ptr) and FMapAList.alist_find x e = Some i0:

Lemma Renv_holds : forall e g l m i x ptr i0,
  allocated ptr m →
  FMapAList.alist_find (Name x) l = Some (UVALUE_Addr ptr) →
  FMapAList.alist_find x e = Some i0 →
  Renvh e g l m →
  Renvh (FMapAList.alist_add x i e) g l
        match write m ptr (DVALUE_I32 i) with
        | inl _ ⇒ empty_memory_stack
        | inr x1 ⇒ x1
        end.

4.4.3 Bisimilarity

The final step of the compilation verification is in the file CompilerCorrectness.v.

Well formedness

The ASL code in Figure 4.7 is actually legal ASL, and its result is x -> 1, y -> 0, z -> 2, because ASL does not require allocations the way Vellvm does. But if that code is compiled with the proposed compiler, executing the resulting Vellvm code would give a different result, because z has not been allocated.

var x;
var y;
x = 1;
z = 2;

Figure 4.7: ASL legal code

This is actually one of the limitations of the proposed compiler. One way to solve it could be to create a more complex environment relationship taking this edge case into account; instead, for simplicity, it has been solved by limiting the ASL code that can be compiled, with a proposition called well_formed. An ASL program is well formed if every assigned variable has been declared previously. Formally:

Definition well_formed p :=
  match p with
  | Prog ds ss ⇒ forall x e, (In (Assign x e) ss → In (Var x) ds)
  end.

Figure 4.8: Well formedness proposition

where In v l is a proposition which states that some value v is in the list l. It could be interesting to extract this Coq definition in order to check in OCAML whether an ASL program is well formed or not. This would require a boolean definition instead of a propositional one, because Coq will not extract propositions, as it is not clear how a proposition can be computed.
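As suggested above, a boolean (and therefore extractable) counterpart of well_formed is straightforward. The following Python sketch shows what such a decidable check would compute; the tuple-based AST shape is an assumption of the sketch, not the extracted code:

```python
def well_formed(decs, stmts):
    """Decidable well-formedness: every assigned variable is declared.
    decs is a list of declared names; stmts is a list of ('assign', x, e)
    or ('skip',) tuples (an assumed, simplified AST shape)."""
    declared = set(decs)
    return all(s[1] in declared for s in stmts if s[0] == "assign")
```

On the program of Figure 4.7 this check fails, since z is assigned but never declared, which is exactly the class of programs the compiler theorem excludes.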
https://github.com/101001000/ASL/blob/master/src/coq/Compiler/CompilerCorrectness.v

4.4.4 Vellvm typ vs dtyp

One could think that the translation is already verified, but there is one step left. The compiler_correct theorem verifies the compilation of declarations and statements, but only for the dynamically typed version of the code. There are two kinds of types in Vellvm: DynamicTypes (dtyp) and StaticTypes (typ). The main difference is that DynamicTypes strips the pointer type information, as the semantics in Vellvm does not require knowing the type of the pointer1, which would only add complexity to the reasoning. There is a file called TypToDtyp.v which strips that information, converting a typ into a dtyp. Vellvm requires its input code to be typed with dynamic types.

The main issue is that even if Vellvm can work without the pointer type, LLVM-IR source code requires that type information to be available, which means the ASL translator needs to output statically typed code. That is the reason why the compile function outputs typ code, so it can be properly printed with the relevant type information. So it is necessary to add a last theorem which covers this typ/dtyp conversion:

Theorem compiler_correct_convert: forall (p : AST.prog),
  well_formed_prog p →
  bisimilar TT (denote_prog p) (denote_code (TypToDtyp.convert_typ [] (compile p))).

where the function TypToDtyp.convert_typ [] is applied to the compiled code to get the dynamic version of the code. The issue is that the convert_typ function makes the verification process even more tedious, so to get rid of it, two lemmas are demonstrated:

TypToDtyp.convert_typ [] (compile_decs_s ds) = (compile_decs ds)

and:

TypToDtyp.convert_typ [] (compile_stmts_s ss) = (compile_stmts ss)

After that, it is possible to rely on a list application lemma to prove the final compiler theorem.
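The effect of convert_typ on the two types ASL actually uses can be pictured with a small sketch. This is a deliberate simplification of TypToDtyp.v (types are modeled as Python tuples, and only the two cases relevant here are covered): integer types survive unchanged, while pointers forget their pointee type.

```python
def convert_typ(t):
    """Simplified picture of typ -> dtyp: integer types pass through,
    while a pointer forgets what it points to (cf. opaque pointers)."""
    if t[0] == "TYPE_I":
        return ("DTYPE_I", t[1])
    if t[0] == "TYPE_Pointer":
        return ("DTYPE_Pointer",)   # pointee type is stripped
    raise ValueError("type not covered by this sketch")
```

For example, convert_typ(("TYPE_Pointer", ("TYPE_I", 32))) returns ("DTYPE_Pointer",): the i32 pointee, which gen_store_s needed to print valid LLVM-IR, is gone in the dynamic view.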
4.4.5 The missing lemma

There is one lemma left which has not been demonstrated yet to fully verify the compilation:

Lemma next_logical_key_mem_stack_add : forall m x i,
  next_logical_key (mem_stack_add m x i) = ((next_logical_key m) + 1%Z)%Z.

It states that after adding some element to the memory, the key of the next empty block will be the key before the addition, plus one. Verifying this lemma does not pose a different challenge from the ones faced previously, but it is tedious and could not be completed in time.

1The Vellvm team came to the same conclusion as others: the type information of a pointer is not necessary for the semantics of LLVM programs and only adds complexity, which matches the LLVM project's design decision of using Opaque Pointers, explained in Section 2.3.2.

Chapter 5

Benchmarking

This chapter aims to provide empirical data about the evaluation of Eleven in scenarios similar to the ones an end user would face.

5.1 Backend comparison

This section compares Eleven across different devices with different back-ends. The objective is to give a quick summary of the relative differences between devices that could be found in rendering workstations. The scene chosen to be rendered on these devices is the one shown in Figure 5.1.

Figure 5.1: Shader demo scene for backend comparison

It has been chosen for a couple of reasons. First, it is completely open source, with license CC0.
All the models and resources are available in polyhaven.com. Second, it has a moderately complex geometry (with a very complex scene, most of the time would be spent on the CPU building the data structures required for ray tracing intersections, which is not relevant for SYCL performance). Third, it has a memory footprint of roughly 1 GB, most of it 1024x1024 pixel textures, so it should fit in most consumer GPUs. And last, it includes three shaders: the shader from Section 2.3.3, compiled from C++ code, which is heavy, requires some resources and is applied to the upper triangular sphere; and two lightweight color ASL shaders for the two spheres, compiled with the ASL compiler. The results of the rendering execution across the devices listed are available in Figure 5.2.

OS      | Compiler version  | Vendor | Model            | Device type | Backend   | Result
Windows | DPC++ 2022-12     | Intel  | i9 13900KF @3Ghz | CPU         | OpenCL    | 8:10m
Windows | DPC++ 2022-12     | Intel  | i5 12600         | CPU         | OpenCL    | 15:57m
Windows | DPC++ 2022-12     | Nvidia | RTX3090          | GPU         | NVPTX     | 3:20m
Linux   | DPC++ (#5801970)  | AMD    | 6700XT           | GPU         | AMD HIP   | 1:33m
Linux   | ICPX 2023.2.0     | Intel  | i9 13900K        | CPU         | OpenCL    | 7:02m
Linux   | ICPX 2023.2.0     | Intel  | ARC A770         | GPU         | LevelZero | 5:17m
Linux   | ICPX 2023.2.0     | Intel  | ARC A770         | GPU         | OpenCL    | 5:20m

Figure 5.2: Backend comparison

The two main points to highlight here are the low performance of the high-end RTX3090 compared to the mid-range AMD 6700XT, with the latter performing more than twice as fast, and the minimal difference between the LevelZero and OpenCL backends on the same device.

5.2 Anatomy of a rendering session

The same scene used in the previous section has been used to illustrate the timeline of an execution of Eleven, with the objective of providing a deeper understanding of where time is spent. This particular execution is the same as the one shown in Figure 5.2 with the Nvidia RTX3090. The timeline is the following:

60ms · · · · · ·• Sending and loading configuration.
3230ms · · · · · ·• Sending and loading textures.
130ms · · · · · ·• Sending and loading materials.
700ms · · · · · ·• Object processing.
880ms · · · · · ·• Sending and loading geometry data.
2450ms · · · · · ·• BVH construction.
1170ms · · · · · ·• Scene device copy.
40ms · · · · · ·• Setup kernel.
138000ms · · · · · ·• Rendering kernel.

Table 5.1: Timeline of an Eleven rendering session

5.3 Cycles comparison

One of the main criticisms of the previous bachelor's thesis was that there were no direct comparisons with a state-of-the-art rendering engine to measure the performance of Eleven in the real world. This was in large part because setting up Eleven's scenes was a tedious task, and their format did not really match other engines' scene formats. Now, thanks to the new architecture and ElevenBlender, that has changed. This subsection covers the awaited performance comparison against the well-known and established Cycles rendering engine. Because Cycles is unable to run Eleven's shaders, the three elements with custom shaders from the previous scene have been removed.

Cycles CUDA vs Eleven CUDA

(a) 100 samples, Blender Cycles  (b) 100 samples, Eleven Render  (c) (a) image zoomed in  (d) (b) image zoomed in

Figure 5.3: Comparison between Cycles and Eleven in CUDA

The Eleven scene has required 4:28 min, while the same scene in Cycles has required 33 s, which makes Cycles 8.1x faster than Eleven in this scenario. It is also important to mention that Cycles makes better use of those 100 samples than Eleven does, probably because of a better implementation of importance sampling. An important consideration to keep in mind is that DPC++ uses the LLVM NVPTX back-end to generate the code which will run on the Nvidia GPU, while Cycles uses nvcc, a proprietary Nvidia compiler, with different optimizations and code generation. This is another factor to take into account, and future optimization work will need to keep this point in mind.
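The speedup figures reported in this chapter are plain ratios of wall-clock times; for instance, the 8.1x figure above follows from:

```python
def speedup(slow_s, fast_s):
    """Ratio of two render times given in seconds."""
    return slow_s / fast_s

eleven = 4 * 60 + 28              # 4:28 min for Eleven, in seconds
cycles = 33                       # 33 s for Cycles
ratio = speedup(eleven, cycles)   # ~8.12, reported as 8.1x
```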
Cycles CPU vs Eleven OpenCL

The same procedure has been repeated targeting the OpenCL back-end, resulting in 3:44 min for Eleven's execution and 1:27 min for Cycles', making Cycles 2.57x faster. This has been a pleasant surprise: usually, CPU rendering code is better optimized than GPU code, so getting relatively close performance without any specific optimization, just by targeting the OpenCL back-end, was not something expected.

Chapter 6

Conclusions

In this Master's thesis, the development of Eleven Render has been documented. It includes the architectural changes which have allowed a remote and flexible client-server workflow, and the ability to run over multiple heterogeneous architectures. Many improvements have been included, such as an AI denoiser, and a shading system which supports any language that can be translated to LLVM-IR has been defined, implemented and included in the compilation chain. Also, a custom plugin for the well-known open source 3D software Blender has been coded in Python, making use of Blender's API bpy.

A custom shading language, ASL, has been formalized and implemented in Coq, along with its own verified translation to LLVM-IR. Because of the complexity and the tedious work of verifying such a translation, ASL ended up as an extremely simple language, considered just a proof of concept, which deviates from the complete shading language initially planned. It has also been shown how code in this ASL shading language can be included in an Eleven scene. Finally, Eleven has been benchmarked and profiled on the main DPC++ target architectures, and a comparison with a state-of-the-art renderer (Cycles) has shown that it is relatively efficient.

This work will serve as a documented open source project for people who want to step up their "Toy Renders"1, and who want to dig into the details of the inner workings of an offline rendering engine.
It will also serve as the seed of what I consider a missing piece of the Vellvm project: an IMP-to-VIR verified and documented translation.

1A toy render is a simple 3D renderer, usually written as a coding exercise, which covers just the fundamentals of rendering.

Bibliography

[1] P. Christensen, J. Fong, J. Shade, W. Wooten, B. Schubert, A. Kensler, S. Friedman, C. Kilpatrick, C. Ramshaw, M. Bannister, et al., "RenderMan: An advanced path-tracing architecture for movie rendering," ACM Transactions on Graphics (TOG), vol. 37, no. 3, pp. 1–21, 2018.

[2] M. Pharr, W. Jakob, and G. Humphreys, Physically Based Rendering: From Theory to Implementation. MIT Press, 2023.

[3] J. T. Kajiya, "The rendering equation," in Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, pp. 143–150, 1986.

[4] Walt Disney Animation Studios, "The Moana island scene." http://technology.disneyanimation.com/islandscene/. https://datasets.disneyanimation.com/moanaislandscene/island-README-v1.1.pdf.

[5] B. Burley and Walt Disney Animation Studios, "Physically-based shading at Disney," in ACM SIGGRAPH, vol. 2012, pp. 1–7, 2012.

[6] E. de la Calle Montilla, "Evaluación y aceleración del algoritmo path tracing en arquitecturas heterogéneas." Trabajo de Fin de Grado en Ingeniería Informática, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, Curso 2020-2021, 2021.

[7] L. Gritz, "Open Shading Language spec 1.12." https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/doc/osl-languagespec.pdf, 2022.

[8] Chaos Group, "V-Ray GPU supported features." https://docs.chaos.com/display/VMAX/V-Ray+GPU+Supported+Features.

[9] Pixar, "RenderMan features and limitations." https://rmanwiki.pixar.com/display/REN24/Features+and+Limitations.
[10] W. Usher, "ChameleonRT." https://github.com/Twinklebear/ChameleonRT, 2019.

[11] J. Reinders, B. Ashbaugh, J. Brodman, M. Kinsner, J. Pennycook, and X. Tian, Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL. Springer Nature, 2021.

[12] X. Leroy, S. Blazy, D. Kästner, B. Schommer, M. Pister, and C. Ferdinand, "CompCert - a formally verified optimizing compiler," in ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress, 2016.

[13] Inria, "Coq proof assistant."

[14] L.-y. Xia, Y. Zakowski, P. He, C.-K. Hur, G. Malecha, B. C. Pierce, and S. Zdancewic, "Interaction trees: representing recursive and impure programs in Coq," Proceedings of the ACM on Programming Languages, vol. 4, no. POPL, pp. 1–32, 2019.

[15] Y. Zakowski, C. Beck, I. Yoon, I. Zaichuk, V. Zaliva, and S. Zdancewic, "Modular, compositional, and executable formal semantics for LLVM IR," Proceedings of the ACM on Programming Languages, vol. 5, no. ICFP, pp. 1–30, 2021.

[16] V. Zaliva, HELIX: From Math to Verified Code. Carnegie Mellon University, 2020.

[17] Z. Jin and J.
Vetter, "Evaluating CUDA portability with HIPCL and DPCT," in 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 371–376, IEEE, 2021.

[18] G. Castaño, Y. Faqir-Rhazoui, C. García, and M. Prieto-Matías, "Evaluation of Intel's DPC++ Compatibility Tool in heterogeneous computing," Journal of Parallel and Distributed Computing, vol. 165, pp. 120–129, 2022.

[19] Intel, "Performance impact of USM and buffers." https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/performance-impact-of-usm-and-buffers.html.

[20] Khronos Group, "SYCL specification." https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf, 2023. Accessed: April 18, 2023.

[21] LLVM Project, "Opaque pointers." https://releases.llvm.org/15.0.0/docs/OpaquePointers.html.

[22] H. R. Nielson and F. Nielson, Semantics with Applications: An Appetizer. Springer Science & Business Media, 2007.