Implementación y análisis de un motor de renderizado fotorrealista en SYCL, para arquitecturas heterogéneas

Implementation and analysis of a photo-realistic rendering engine in SYCL, for heterogeneous architectures

Enrique de la Calle Montilla
Trabajo de Fin de Máster
Máster en Métodos Formales en Ingeniería Informática
Facultad de Informática
Universidad Complutense de Madrid
Dirigido por Carlos García Sánchez
Madrid, 2023

Abstract

Implementation and analysis of a photo-realistic rendering engine in SYCL, for heterogeneous architectures

This work covers the development, implementation, and analysis of Eleven Render, an architecture-independent photo-realistic rendering engine. It is written in C++ and uses the SYCL programming model, which allows the same code to run on a wide range of heterogeneous architectures such as CPUs, GPUs, and FPGAs. This gives Eleven Render a level of portability rarely seen in other engines and makes it one of the first SYCL rendering engine implementations. A new shading language called ASL has been formally specified and implemented in Coq. Using Coq's verification tools and the Vellvm project, which endows LLVM-IR with a formally specified semantics, a semantics-preserving translator from ASL to LLVM-IR has been implemented and verified. The resulting LLVM-IR shader code is then fed to Intel's SYCL implementation, DPC++, and compiled for the desired target architecture. The entire process has been evaluated and benchmarked on different architectures.

Keywords
Render 3D, SYCL, DPC++, Path Tracing, Ray Tracing, Coq, Vellvm, ASL

Resumen

Implementación y análisis de un motor de renderizado fotorrealista en SYCL, para arquitecturas heterogéneas

Este trabajo cubre el desarrollo, la implementación y el análisis de Eleven Render, un motor de renderizado fotorrealista independiente de la plataforma de hardware.
Está hecho en C++ y hace uso del modelo de programación SYCL, que permite ejecutar código en un gran abanico de arquitecturas heterogéneas, tales como CPUs, GPUs y FPGAs, dotando a Eleven Render de una portabilidad raramente vista en otros motores y siendo así una de las primeras implementaciones en SYCL de un motor gráfico fotorrealista. Se ha propuesto también un nuevo lenguaje de shading llamado ASL, que ha sido formalmente especificado e implementado en Coq. Haciendo uso de las herramientas de verificación de Coq y del proyecto Vellvm, que dota al lenguaje LLVM-IR de una semántica formalmente especificada, se ha implementado y verificado un traductor de ASL a LLVM-IR. El código resultante de dicha traducción es entonces incluido en el proceso de compilación llevado a cabo por la implementación de SYCL hecha por Intel: DPC++. Todo este proceso ha sido posteriormente evaluado y sometido a pruebas de rendimiento en diferentes arquitecturas de hardware.

Palabras Clave
Render 3D, SYCL, DPC++, Path Tracing, Trazado de Rayos, Coq, Vellvm, ASL

Contents

1 Introduction
  1.1 Background and Motivation
    1.1.1 Photo-realistic Rendering
    1.1.2 SYCL / DPC++
    1.1.3 LLVM
    1.1.4 Coq / OCaml
    1.1.5 iTrees (Interaction Trees)
    1.1.6 Vellvm (Verified LLVM)
  1.2 Previous Work
  1.3 Goals
  1.4 Work Plan
2 Render Architecture
  2.1 High level architecture overview
  2.2 Data Handling, Communication, and Rendering Processes
    2.2.1 IPC
    2.2.2 SYCL port
  2.3 Including a custom shading language in the pipeline
    2.3.1 Shading kernel execution
    2.3.2 IR Populator
    2.3.3 C++ Shader example
  2.4 Miscellaneous improvements
3 Eleven Blender
  3.1 Geometry loading
  3.2 Material interface
4 Abstract Shading Language
  4.1 Syntax
    4.1.1 AST
    4.1.2 Lexer
    4.1.3 Parser
  4.2 Semantics
  4.3 LLVM-IR Translation
  4.4 Translation verification
    4.4.1 The Vellvm memory model
    4.4.2 Renv
    4.4.3 Bisimilarity
    4.4.4 Vellvm typ vs dtyp
    4.4.5 The missing lemma
5 Benchmarking
  5.1 Backend comparison
  5.2 Anatomy of a rendering session
  5.3 Cycles comparison
6 Conclusions

Chapter 1
Introduction

1.1 Background and Motivation

This Master's Thesis merges different technologies and concepts that range from the abstract world of formal verification to the concrete hard silicon, passing through the simulations of computer graphics. In the following subsections, the main notions are briefly explained, along with a short motivation that helps to understand the decisions taken.

1.1.1 Photo-realistic Rendering

Rendering is the process by which digital content is visually represented. There are two primary categories: offline rendering and online (or real-time) rendering. This work focuses specifically on offline photo-realistic rendering, which seeks to obtain the highest visual fidelity, usually by means of compute-intensive algorithms such as Path Tracing [1][2]. Path Tracing achieves those results by simulating light in a way similar to how it behaves in the real world: in essence, mimicking the paths along which photons travel, and approximating by the Monte Carlo method what is known as "The Rendering Equation" [3]. As one might expect, the convergence of the Monte Carlo method requires substantial computational resources in order to accurately approximate The Rendering Equation, which is required for achieving high visual fidelity. These light simulations have traditionally been run almost exclusively on CPUs instead of other architectures like GPUs. One of the main reasons is the lack of memory.
GPU VRAM availability has always lagged behind traditional and more scalable CPU architectures, forcing developers of GPU photo-realistic rendering engines to rely on techniques such as out-of-core memory, with a huge performance penalty when the rendered scene exceeds the memory available on the target device. (Out-of-core memory is a technique that schedules data transfers to the device at run time, allowing big datasets to be processed without requiring all the data to be loaded onto the device.)

Scenes in modern feature films are on the order of hundreds of GBs. For example, Walt Disney Animation Studios released a dataset that allows anyone to render a sample scene from the 2016 movie "Moana". This dataset was provided with the purpose of developing, testing and benchmarking new rendering algorithms [4]. The Base Package weighs 93 GB and includes the raw data required for rendering one frame of the scene. For comparison, the latest consumer Nvidia 4000-series line-up peaks at 24 GB, while the professional Nvidia Quadro line-up peaks at 48 GB for single-GPU configurations.

Render                      | GPU acceleration         | Open Source
V-Ray Render (Chaos Group)  | Nvidia only              | No
RenderMan (Disney Pixar)    | Nvidia only              | No
MoonRay (Dreamworks)        | Nvidia only              | Yes
Arnold (Autodesk)           | Nvidia only              | No
Redshift (Maxon)            | Nvidia, AMD (only MacOS) | No
Mantra (SideFX)             | No                       | No
Octane (Otoy)               | Nvidia only              | No
Cycles (Blender)            | Nvidia, AMD, Intel       | Yes

Figure 1.1: Most popular render engines and their GPU support

Nonetheless, the gap between GPU memory and rendering memory requirements is closing. Consumer GPUs in 2023 are approaching a point where they can cover some professional rendering work. For that reason, many of the current production renderers have started releasing their own GPU-accelerated versions, as can be seen in the Figure 1.1 table, which covers the most popular offline rendering engines.

Shading Languages

Approximating The Rendering Equation is usually just one part of the process.
Artists require more flexibility in deciding how light should behave in the scene, beyond the standard models. For that reason, shading languages exist. In the offline rendering context, shading languages are DSLs (Domain Specific Languages) that usually allow users to code their own light interactions. There are dozens of shading languages, but the most notable example is Open Shading Language, or OSL, developed by Sony Pictures as an effort to standardize the diverse shading language landscape.

In this master's thesis, a shading model has been proposed and implemented in Eleven Render. This shading model could be described as parameter-driven shading: the light behavior is defined by a hard-coded function (Disney's Principled BRDF [5]), but the input parameters of that function can be described programmatically with custom code executed on the device. This sits halfway between a fully expressive shading model and using fixed parameters. A visual example of this shading model is a shader that describes the color of a certain material without relying on texture data, for example a noise function. Figure 1.2 shows how a noise texture shader can be linked to the albedo input parameter (a), where it describes the color of the surface, or can also be linked to the metallic input (b) to describe the amount of reflection at each point of the surface.

(a) Noise shader linked to albedo input  (b) Noise shader linked to metallic input
Figure 1.2: Noise shader applied to the albedo and metallic inputs of the Principled BRDF

One of the downsides of this approach is that it does not allow the light behaviour to be modeled completely. It is possible to parameterize the variables of the BRDF function, but this is not flexible enough to tell the light how it should behave. At first sight, controlling the light behaviour in the path tracing algorithm shouldn't be a hard task.
In theory, it would be enough to tell the bouncing ray its new direction and the energy lost in the bounce. The main issue is importance sampling. Importance sampling is a way to speed up convergence times (it is explained in more detail in Chapter 4.2 of the previous work [6]) and is required in any modern rendering engine, but it interferes with the probability of the ray bounce direction. That means that to model the light behaviour completely, it is necessary to take into account the probability and the distribution of the sampling determined by the importance sampling method. Modern shading languages like OSL overcome this complexity by making use of closures, which are a symbolic representation of that behaviour that allows a series of operations over it, and which delegate the complexity of handling the distribution probabilities to the closure implementation. OSL, like most languages, is released with a natural language specification (OSL 1.12 spec [7]), which can sometimes be ambiguous and imprecise. In this work, a different approach has been used: formally defining a shading language called ASL. This has been done in the name of academic interest, with the purpose of showing how Formal Methods can be integrated into a development project and how they can provide value to it. ASL provides Eleven with a rigorous and correct implementation of a shading language.

1.1.2 SYCL / DPC++

SYCL is a cross-platform abstraction layer that enables code for heterogeneous processors to be written using modern C++, and that manages data resources and code execution on those devices. The main purpose of the SYCL standard is to unify the development for multiple platforms into a single codebase. As can be seen in Figure 1.1, many rendering engines develop their Graphics Processing Unit (GPU) versions choosing only one platform (usually Nvidia's CUDA), leaving other vendors' devices behind. The GPU render versions are developed in parallel with the main CPU renderer, creating two separate codebases.
This usually implies limited features for the GPU branch, which tends to lag behind the CPU one [8, 9]. For that reason, a framework that unifies the development process for a diverse set of target devices is of high value from the rendering engine development standpoint. That is where SYCL plays a crucial role in this work. The main Eleven codebase has been developed in SYCL with the purpose of making Eleven as portable as possible.

The main drawback of a programming model that targets multiple different architectures is that it is limited, from a feature standpoint, by the weakest of the architectures. For that reason, SYCL does not support virtual function calls, function pointers, exceptions or global device variables. These limitations make the development of Eleven more challenging, as will be seen in the subsequent chapters.

The SYCL ecosystem is still emerging; most implementations are not even up to date with the specification. This is the list of all SYCL rendering engines found on the internet as of the writing of this report:

• Cycles' oneAPI backend
• ChameleonRT
• triSYCL path tracer demo

The triSYCL path tracer is just a demo lacking more advanced features. ChameleonRT [10] offers a more complete path tracing engine, but lacks a shader loading system, and its SYCL implementation (3) is only compatible with Intel devices due to the limitations of Embree's SYCL library. Cycles' oneAPI backend allows the well-known open source renderer Cycles to run on oneAPI devices, but it won't work with different SYCL backends, only with Intel Xe-HPG architecture GPUs. Given the current landscape, Eleven stands as one of the pioneering SYCL rendering engines, making it an early and valuable contribution to the SYCL community and to heterogeneous architecture development.
It is the first SYCL engine to make use of the flexibility of SYCL's multi-target capabilities.

DPC++

DPC++ (Data Parallel C++) [11] is an open source C++ compiler implementation with SYCL standard support, maintained by Intel, and the one chosen among several options for this work. It not only adds some features to the original SYCL specification, but also includes a compiler built on top of the LLVM project, which makes it easy to extend to devices other than native Intel architectures. DPC++ is an outstanding component of the oneAPI toolkit (4), a comprehensive initiative designed to unify programming across various devices through a single API. Furthermore, the toolkit incorporates optimized libraries for diverse device types, including oneMKL for linear algebra and oneDAL for machine learning, and adds additional functionality with the Rendering Toolkit, designed to accelerate these types of workloads with a set of rendering and ray tracing libraries. Every toolkit and library is actively maintained by Intel to support their broad portfolio of devices and architectures. It is worth mentioning that Codeplay, which was acquired by Intel in 2022, has expanded SYCL support to a wide array of devices beyond Intel products through the implementation of a plugin interface.

Links for the SYCL engines mentioned above: https://github.com/blender/cycles/tree/main/src/device/oneapi, https://github.com/Twinklebear/ChameleonRT/tree/master, https://github.com/triSYCL/path_tracer

(3) ChameleonRT is compatible with a huge variety of devices; the limitation stated here applies only to the SYCL+Embree backend.
(4) https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#gs.3v44bt
The main device targets currently supported by DPC++ are:

• spir64_x86_64: Intel CPUs
• spir64_fpga: Intel FPGAs
• spir64_gen: Intel GPUs
• nvptx64-nvidia-cuda: Nvidia GPUs (CUDA)
• amdgcn-amd-amdhsa: AMD GPUs (HIP)

The considerable number of targets available out of the box is one of the reasons why DPC++ was chosen for the development of Eleven. Nevertheless, in theory, Eleven's source code could be compiled with other SYCL implementations such as ComputeCpp from Codeplay (5), triSYCL from Xilinx (6), or hipSYCL from Heidelberg University (7), currently known as AdaptiveCpp.

(5) https://developer.codeplay.com/products/computecpp/ce/home
(6) https://github.com/triSYCL/triSYCL
(7) https://github.com/OpenSYCL/OpenSYCL

Plugin interface: https://intel.github.io/llvm-docs/design/PluginInterface.html

DPC++ Compilation

Figure 1.3: AOT/JIT compilation figure, extracted from the Data Parallel C++ book [11, Figure 13.2]

SYCL works with two kinds of compilation systems: JIT and AOT. Just In Time (JIT) code is compiled at run time, while Ahead Of Time (AOT) code is compiled beforehand. AOT mode is selected with the -fsycl-targets flag, which compiles code for each selected target. That code is bundled later, and a fat binary is produced which holds the code for each device. In this project, this compilation system is exploited to include arbitrary shader code in those fat binaries and execute it on the device.

1.1.3 LLVM

The LLVM Project is an open source compiler infrastructure. The high-level compilation steps for a project built on LLVM are shown in Figure 1.4: a front-end (e.g. Clang for C and C++, rustc for Rust) produces LLVM-IR; IR passes perform optimization, analysis and transformation; and a back-end emits code for a CPU target (x86, ARM, ...) or a GPU target (AMDGPU for AMD, NVPTX for Nvidia, SPIR-V, ...).
Figure 1.4: LLVM high-level compilation diagram (front-ends translate source code to LLVM-IR; IR passes optimize, analyze and transform it; back-ends emit target machine code)

First, a front-end translates code in an arbitrary programming language to an intermediate representation called LLVM-IR (an acronym for LLVM Intermediate Representation). Clang is perhaps the most popular front-end for LLVM because it works with C/C++ code. The previously mentioned Intel DPC++ compiler is effectively a front-end for DPC++ code (while the DPC++ compiler can be understood as an LLVM front-end, it provides much more functionality, such as orchestrating and coordinating the linking and driver functionality for heterogeneous architectures). The resulting IR from the front-end can be analyzed, transformed and optimized according to the requirements of the project, and finally, a back-end translates the LLVM-IR code to the required target. LLVM has some advantages that make it interesting from the heterogeneous architecture development perspective, which explains why many of the most important SYCL implementations, like AdaptiveCpp, Intel's DPC++ and neoSYCL, are built on the LLVM project.

LLVM-IR

The intermediate representation LLVM-IR is flexible enough to work as a middle-step language between many architectures, while maintaining the ability to be analyzed and optimized easily, thanks to its Static Single Assignment (SSA) form and its clearly defined semantics. Knowing these characteristics, it is possible to see why LLVM-IR is a good candidate to serve as the target language of the verified translator for ASL, the custom shading language proposed in this work.

1.1.4 Coq / OCaml

Coq is described on its webpage as "a formal proof management system which provides a formal language to write mathematical definitions, executable algorithms and theorems together with an environment for semi-interactive development of machine-checked proofs. Typical applications include the certification of properties of programming languages (e.g.
the CompCert [12] compiler certification project)..." [13]. It allows programming in a functional manner and proving properties over that code with its proof system. Thanks to Coq, it is possible to define the ASL syntax and semantics, and to reason about them in a later step. From this point to the end of this work, any time the word "verification" appears, it is in the context of Coq's proof system.

OCaml

OCaml is a programming language with a significant presence in static analysis and formal methods, because of the simple semantics of its functional features. Coq is implemented in OCaml, and is often used to provide verification tools for OCaml development workflows. OCaml comes with two code-generating tools: ocamllex, which creates a lexical analyzer based on given regular expressions and their corresponding semantic actions, and ocamlyacc, which constructs a parser based on a specified grammar. These tools are similar to the well-known lex and yacc utilities commonly found in C development environments.

Code extraction

One of the main features of Coq is the ability to extract Coq definitions to different functional programming languages like OCaml, Haskell and Scheme. This way, it is possible to move from Coq's encapsulated and formal context to a production environment, while keeping a high degree of assurance in the correctness of the generated code. This Coq feature is extensively used to bring the ASL abstract syntax and the verified code translation to the OCaml toolchain.

1.1.5 iTrees (Interaction Trees)

Interaction Trees are a way to represent mutually recursive, nonterminating computations in Coq. They acquire special importance when taking into account the termination constraints of Coq, which make the representation of potentially infinite computations hard to implement and to reason about.
The iTree library achieves all of this thanks to coinduction, but its rich library hides the coinduction details, so the end user only needs to reason equationally.

Figure 1.5: itree Coq definition [14, Figure 1]

Bisimilarity

itrees allow different types of equivalence, as explained in the original paper [14, Section 2.2]. These equivalence relations are a key point in the rewriting strategies used when reasoning about computations.

• Strong bisimulation. Two itrees satisfy t1 ≅ t2 if all the itree nodes are exactly the same.
• Weak bisimulation. Weak bisimulation relaxes the relationship between two itrees, t1 ≈ t2, ignoring the Tau steps: t1 ≅ t2 → t1 ≈ t2. It is required when reasoning about iteration and other processes that add internal Taus.
• Heterogeneous bisimulation. In the previous bisimulations, to decide whether two Ret nodes were the same, they were compared under the propositional equality relation Eq, so Ret(a) ≅ Ret(b) if a = b. But a broader definition of bisimulation is needed when comparing two itrees with different return types. For that, heterogeneous bisimulation under a relation R, written eutt R, is used, where R is a relation between the two types.

Both strong and weak bisimulation are concrete instances of heterogeneous bisimulation under propositional equality (Leibniz equality).

One of the most notable contributions of the Interaction Trees project is that the authors provide an example of two programming languages implemented as interaction trees: Imp and Asm. The first is a simple imperative language, and the second is a language similar to assembly. Then, they implement a translation from Imp to Asm, and prove such a translation correct.

1.1.6 Vellvm (Verified LLVM)

The Vellvm project [15] formalizes, in an iTree-based form, the semantics of a representative subset of the LLVM Intermediate Representation, known as VIR (Verified IR).
It also provides a rich theory library to reason about such semantics, but the project lacks one element: a verified translation of a simple imperative language, like the one done in iTrees [14, Ch. 5]. That point is partially covered by Helix [16], a specific math DSL which achieves the conversion of math to executable code. But its high specificity and complex language features leave a hole to be covered in the Vellvm project: the verification of a translation of a simple imperative language. Here is where ASL can contribute to the Vellvm project. ASL aims to be a simple imperative language with a verified translation to VIR. Such a verified translation can serve as a reference point for other verified translations of other languages to VIR.

The Vellvm project is actually very complete and its library is extensive. The main issue found during development was the entry barrier and the lack of documentation. After understanding the project and reading the code base, one can grasp the concepts and start picking up speed, but there is a very steep curve until reaching that point, aggravated heavily if one does not have a good command of Coq. This work seeks to serve as a small entry point to reduce that barrier.

1.2 Previous Work

This Master's thesis is the continuation of Eleven Render's development, explained in the previous Bachelor's thesis "Evaluation and acceleration of Path Tracing algorithm in heterogeneous architectures" [6]. In that previous work, a photo-realistic rendering engine was coded in CUDA from a very fundamentalist point of view, where most of the code was written from scratch with the objective of fully understanding all the steps in the development of a rendering engine. The Bachelor's thesis mostly covered the implementation and analysis of the Path Tracing algorithm on Nvidia GPUs.
That implementation has been maintained, adapted and built upon, while keeping the rendering core.

1.3 Goals

Eleven is an ambitious project: it tries to merge many different projects into one. Because it covers the continuous development of a rendering engine, the main goal could be ambiguous. For that reason, here is the list of the main objectives designated for this master's thesis before starting to work on it:

1. Make a rendering engine which works for almost any target architecture, and evaluate it.
2. Make it usable with some 3D software.
3. Include a working shading model with runtime shader loading.
4. Formally specify an expressive abstract shading language which would allow other shading languages to be converted to it.
5. Create a translator from such a language to LLVM-IR and verify that translation.

These subgoals have been revisited after finishing the project. The 1st point has been successfully completed: the SYCL port is finished, except for further optimizations and some code improvements. The 2nd point is also successfully completed: interoperability with the Blender 3D software was achieved thanks to the "ElevenBlender" add-on and architectural changes in the engine. The 3rd point has been achieved partially: the parameter-driven shading model is limited from an expressiveness perspective, while the original goal was a fully expressive shading model, and run-time shader loading wasn't achieved because of the complexities of the DPC++ toolchain. The final two points are where expectations were off. A shading language has been formally specified and a translator verified, but the language in particular is just too simple to be useful. The verification took far longer than expected and was more tedious than initially devised. Even so, a minimal proof of concept has been completed and included in the project, which demonstrates it is possible to further develop the main goal.
1.4 Work Plan

The work plan devised for this work has been to include features until the deadline. This plan allowed fitting a huge number of features and a rich final result, but it has proven hard to maintain. Many of those features have been deprecated in favor of better-fitting solutions. One example of this is the previous version of ASL, initially done in September of 2022, which was a more featureful interpreted language (instead of the current, correct "compiled" approach); another is shadergraph2osl (https://github.com/101001000/shadergraph2osl), an Eleven plugin to convert Blender's shader graph into OSL shaders by using Model Driven Engineering, which had to be halted because of the lack of time. A timeline with the main milestones reached is shown below:

Sep 2021: Finished Eleven Render Bachelor's degree.
Feb 2022: Several upgrades to the original codebase (denoiser, texture formats, passes, .mtl loading).
Jul 2022: SYCL port finished.
Aug 2022: First working Blender integration.
Sep 2022: ASL POC initial design and implementation.
Nov 2022: Blender plug-in first release.
Dec 2022: TCP architecture.
Apr 2023: Verified ASL assignment.
Jul 2023: ASL translation verified.
Aug 2023: Shading system working.
Aug 2023: AST mainly verified translation.
Sep 2023: Benchmarks done.

Table 1.1: Timeline of the development of this master's thesis

Chapter 2
Render Architecture

The previous Eleven Render Bachelor's Thesis focused on the inner workings of the rendering kernel. In contrast, this new version is heavily focused not so much on the path-tracing algorithm, but on how the architecture has been reworked to make it work with heterogeneous architectures.
This chapter covers the development decisions and architectural details, as well as how compilation is handled in such a way that custom IR code can be side-loaded into device kernels. The final result of this chapter is a fully working rendering engine, available in the https://github.com/101001000/ElevenRender repository, together with a working CMake compilation infrastructure, which won't be detailed for brevity but whose details are available for consultation in the CMakeLists.txt file.

2.1 High level architecture overview

This section provides an overview of Eleven's high-level operation. Subsequent sections will dig deeper into the specifics of its architectural design.

Eleven can be seen as a process with three main inputs of data: input commands, scene data and LLVM-IR shaders. Scene data holds the scene representation as well as material information, textures, geometry data, etc. Input commands are a way to interface with Eleven, through which the main actions like data loading and render starting are requested. Finally, LLVM-IR shaders are fed in during Eleven's compilation phase, as they require some processing and must be compiled for the target architecture. The input LLVM-IR shaders can have two origins: the first is being obtained from the ASL LLVM-IR translator explained in Section 4.3; the second is being translated with some LLVM front-end from some other programming language. An example of this second procedure is explained in Section 2.3.3, where a C++ shader is included in Eleven.

Limiting shader loading to the compilation phase is one of the limitations of the current architecture, which is expected to change in future versions in favor of run-time shader loading. This could be done thanks to the JIT capabilities of LLVM, but there are some caveats, discussed in Section 2.3.
After loading the data into Eleven, it is pre-processed and sent to the device. The device then runs the path-tracing algorithm along with the shader execution. The data is retrieved from the device asynchronously to facilitate progressive rendering, and is periodically sent back as output data when requested. This output data holds the render result, render telemetry such as the render progress, and other relevant information such as which devices are available and their status.

2.2 Data Handling, Communication, and Rendering Processes

2.2.1 IPC

Eleven runs as a TCP server implemented with boost::asio, which makes it possible to run the rendering process on a different machine from the one running the 3D software. Inter Process Communication (IPC) in Eleven Render is handled through the same TCP socket used for communicating with the 3D software; that socket is also used for sending the scene data. The main advantage of this method is flexibility over efficiency: it is not ideal when the 3D software and Eleven share the same machine, because the data needs to be copied to be sent over the socket, which means data redundancy on the same machine. For example, sending 4 GB of texture data (16 textures of 4096x4096, 4 float channels) with the added message overhead and texture color space conversion takes 11.84 s on the local machine, which translates into an effective 337 MB/s, a poor result for an IPC. IPC scene loading is currently the main bottleneck in most scenarios. This operation could be replaced in the future by reading the data directly from the 3D software application; in the case of Python and C++, this is usually done through "Python bindings".
Figure 2.1: Data handling from Blender to Eleven (Blender and Eleven Render exchange input commands, scene data and render data through the TCP socket, the file system and Eleven's Input Manager)

Input Commands

Input commands are implemented with boost::program_options, in a similar way to how one would interact with program arguments. They are the way the user communicates with Eleven.

• Load commands: in charge of retrieving data from the socket or the filesystem (using the --path option). The load commands currently implemented are the following:
– load config
– load texture
– load camera
– load hdri
– load brdf material

The --mirror x and --mirror y options, when loading an HDRI or a texture, mirror the image along the selected axis. The --recompute normals option, if present when loading objects, applies the normal computation algorithm described in Section 2.4.

• Get commands: retrieve information from the device, such as rendering progress, available devices or the render result.
– get info
– get sycl info
– get pass

• Action commands: instruct the rendering kernel how to operate.
– start
– pause
– abort

2.2.2 SYCL port

For the porting work, the "Intel DPC++ Compatibility Tool" (DPCT1, recently renamed SYCLomatic2) [17, 18] was tried first, without much success. This tool converts CUDA code to SYCL code, but it had some limitations at the time it was used, mainly with global variables and the cuRAND library. The resulting code was only partially converted and did not compile. For that reason, a manual port was done. One of the manual steps was to replace cuRAND with a simple pseudorandom number generator, described in Figure 2.2. Another required manual step was to replace all the math.h library functions with their namespaced sycl:: alternatives. This is especially required when using the CUDA backend, as the generated code will otherwise fail to link because of the namespace clash.
RngGenerator::RngGenerator(uint32_t _seed) {
    this->state = hashf(_seed + 1);
}

float RngGenerator::next() {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return (static_cast<float>(state) / static_cast<float>(UINT_MAX));
}

uint32_t hashf(uint32_t seed) {
    uint32_t hash = 0;
    for (size_t i = 0; i < sizeof(uint32_t); i++) {
        hash += (seed >> (i * 8)) & 0xFF;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

Figure 2.2: Pseudorandom number generator

Another big change from the porting effort is the handling of device data transmission.

1Intel® DPC++ Compatibility Tool: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html#gs.5bmosm
2SYCLomatic repository: https://github.com/oneapi-src/SYCLomatic

Device data transmission

SYCL provides two mechanisms for memory movement between host and devices:

• Buffers and Accessors.
• USM (Unified Shared Memory).

Buffers represent data which can live on both host and device; SYCL then manages the memory copies implicitly when the data is accessed through Accessors, resolving data dependencies by itself. USM, on the other hand, is a pointer-based approach which allows both implicit and explicit memory management. In Eleven Render, the explicit USM memory management approach has been chosen, and it is performed by the SYCLCopy.cpp file. There are two reasons for this decision: practicality, as the explicit USM model is very similar to the one implemented in the older Eleven Render CUDA version, and performance, as according to Intel, when using Buffers, "The array references are implemented with more complex code and are significantly slower than native C++ array references.
While it is acceptable to reference a small amount of data, computationally intensive algorithms using host accessor pay a large performance penalty and should be avoided." [19]. The explicit USM model is also what the officially supported oneAPI Blender Cycles implementation uses.

Eleven's current rendering process is monolithic from the data transmission perspective: it sends all the scene information in one step, without granularity for different scene elements. In the future, interactive rendering3 could require other memory management mechanisms, as the synchronization of scene assets could be simplified if implicit copies were used. Granularity of data transmission is the next logical step for Eleven's architecture, as it allows for lower data redundancy when copies of small chunks of data are deleted before the next ones are sent.

SYCL rendering kernel execution

A SYCL queue is an abstraction for submitting actions to a single device, and there are two queues in Eleven: k q and d q. The first one controls the execution of the main rendering kernel. The kernel render enqueue function submits synchronous render calls to the k q queue, one for each required sample; because it waits for each sample's rendering call to end with the .wait() directive, it needs to run in a secondary, non-blocking thread.
3Interactive rendering is the process by which a scene is rendered while its elements can be modified, with almost real-time visualization of the changes.

In theory, a synchronous queue should not be needed, as the code can be refactored so that rendering results are independent between executions. However, enqueueing kernels, even asynchronously, added an unexpectedly high time overhead (probably because of the SYCL mechanisms for resolving dependencies during kernel scheduling), so the best solution was to queue them synchronously in another thread.

nd range

As was shown in the previous bachelor's thesis, preserving spatial locality of emitted rays is a must for rendering performance [6, sec. 2.2.2]. In CUDA, that was done by setting the blocksize parameter and matching the 2D dimensions of the sensor. SYCL abstracts kernel execution in a slightly more complex way. It provides a construct called nd range, quite similar to its CUDA counterpart, which requires setting global and local dimensions for the execution block. SYCL does not include any heuristic for the optimal blocksize as CUDA does, so an empirical test over the blocksize has been carried out to find a good candidate for this parameter. The optimal value depends on many factors: the available registers, the kernel executed, and even the kind of scene can affect performance.
Figure 2.3: Rendering times by blocksize parameter. (a) Nvidia RTX3090, time (ms) vs. blocksize; (b) Intel i9 13900KF, time (ms) vs. blocksize.

For the Nvidia GPU, blocks of four by four or larger are good candidates, and for the Intel CPU, any value is probably good. The lower value at blocksize 2 is, with high probability, an outlier.

2.3 Including a custom shading language in the pipeline

This section covers how arbitrary LLVM-IR shading code is included into Eleven's compilation pipeline and subsequently executed by the target architecture. The end goal is to provide a language-agnostic workflow for including shaders into Eleven. Executing shading code has traditionally been a challenging task for rendering engines, even more so when targeting as many different architectures as the SYCL programming model does. For that reason, the implementation explored in this section is non-trivial and relies on a non-traditional solution which is subject to change. The preferred way to execute custom kernels in SYCL is what the SYCL 2020 specification defines as a kernel bundle, a way to represent kernels which can be defined and executed at run-time. The main challenge here is that those kernels need to be built beforehand. The current way to build those kernels at run-time is to rely directly on the underlying API. For example, one could build an OpenCL kernel by doing something similar to what is shown in Figure 2.4. The issue with this approach is that each API needs to be called individually (which would break Eleven's portability), and most of them do not support LLVM-IR code as input. Another option is to implement a way to load LLVM-IR code as uncompiled kernels (bundle state::input) and use the functions sycl::build and sycl::compile to produce a runtime executable, or even convert the LLVM-IR code back to C.
These options may be explored in the future, but for now, a more rudimentary static compilation approach has been used.

cl_kernel ocl_kernel = clCreateKernel(ocl_program, "add", nullptr);
auto mybundle = sycl::make_kernel_bundle(ocl_program, ctx);

Figure 2.4: Making a kernel bundle from an OpenCL program

The static compilation approach explored here is not a simple task. Eleven's materials can hold more than one shader, specifically one per input of Disney's BRDF function, so the challenge becomes more complex as soon as more than one shading function is added. SYCL's virtual function limitations [20, p. 15] do not allow the use of function pointers for the required shading functions when there is more than one, so each function header needs to be declared at compile time and dispatched individually at run-time. For that same reason, the implementation relies heavily on the C++ macros offered by the Boost Preprocessor library, which simplifies the handling of hundreds of function definitions.

2.3.1 Shading kernel execution

Shader code execution in the device kernel happens through a call to the shade(int shader id, shader params...) function. shader params... are the input and output parameters of the shader, and shader id is a run-time identifier of the required shader function. In the function implementation, shader id is fed to a switch-case construct which calls the corresponding shade function. The maximum number of shader functions is defined by the MAX SHADERS macro, which indicates how many shade function headers will be generated. Figure 2.5 shows such an implementation, along with the resulting code after macro expansion.

#define ASL_SHADE_CASE(z, n, text) \
    case n: shade##n##_(shading pars); break;

void shade(int id, shading pars...) {
    switch (id) {
        BOOST_PP_REPEAT(MAX_SHADERS, ASL_SHADE_CASE, )
    }
}

// after macro expansion:
void shade(int id, shading pars...) {
    switch (id) {
        case 0: shade0_(shading pars...); break;
        case 1: shade1_(shading pars...
); break;
        case 2: shade2_(shading pars...); break;
        ...
    }
}

Figure 2.5: Shade function dispatcher implementation

The expanded functions shade0 , shade1 , shade2 , ... are also declared in a header file and decorated with the SYCL EXTERNAL macro to indicate external linkage, but their implementation is left empty on purpose, as the contents will be filled with the IR code generated by the shader IR compilation step. Instead, a file called shader placeholder.cpp holds an empty implementation of each such function. This is because providing raw IR code is not enough for heterogeneous architectures: as was shown in Section 1.1.2, when compiling with AOT, the same code is compiled for each defined target, and the result is multiple definitions of the same function. Something similar is needed when handling the IR, but generating different code for each target. DPC++ does this for the empty functions of shader placeholder.cpp, and then the contents of the resulting file are substituted with the IR code.

Figure 2.6: Shader compilation process. The ASL compiler turns shader code.asl into asl compiled.ll; DPC++ with -emit-llvm turns shader placeholder.cpp into shader empty.ll; the IR populator combines them into shader.ll, which DPC++ then compiles together with Eleven's source code and dependencies into the final executable.

2.3.2 IR Populator

A Python script, function body replace.py, loops over every file in the shaders directory. The shaders need to be named "0.ll", "1.ll", etc., where each number is the shader identifier a material will reference. From each file, the function body delimited by the entry: and ret void tags is stripped of its metadata and placed into the corresponding multiple-target shade functions in the shader.ll file.

Opaque Pointers

One particularity of recent LLVM versions is the introduction of Opaque Pointers. Opaque pointers are simply pointers whose type information has been stripped off.
"The community consensus is that the costs of pointee types outweigh the benefits, and that they should be removed" [21]. For that reason, since Clang 15.0, pointers do not carry type information by default. This is not entirely true for all LLVM backends, however; for example, the CUDA backend keeps pointer types. To homogenize the generated shader.ll file, pointers without type information are substituted with their float* counterparts.

0.ll:
shade_asl (...) {
entry:
  <shader0 body>
  ret void
}

1.ll:
shade_asl (...) {
entry:
  <shader1 body>
  ret void
}

shader.ll:
spir shade_asl0 (...) { entry: ret void }
cuda shade_asl0 (...) { entry: ret void }
spir shade_asl1 (...) { entry: ret void }
cuda shade_asl1 (...) { entry: ret void }

Figure 2.7: Function body replacement

2.3.3 C++ Shader example

One of the main advantages of this shader loading system is that almost any shader coded in any language that can be compiled to LLVM-IR can be loaded. As an example, a Shadertoy GLSL shader4, adapted to C++, will be put into Eleven. First, a function is created with the main output arguments (Cr, Cg, Cb) as float references. If additional arguments are required (for example the uv coordinates), they must have the same names as the arguments defined in ASL. Only the function body is used, so the function declaration itself is irrelevant.

4Shadertoy is a website where people publish their own OpenGL shaders written in the GLSL shading language, which can be rendered interactively.
(Shader source: https://www.shadertoy.com/view/mtyGWy)

inline __attribute__((always_inline))
void palette(float t, float& ox, float& oy, float& oz) {
    ox = 0.5 + 0.5 * cos(6.28318 * (1.0 * t + 0.263));
    oy = 0.5 + 0.5 * cos(6.28318 * (1.0 * t + 0.416));
    oz = 0.5 + 0.5 * cos(6.28318 * (1.0 * t + 0.557));
}

void shade(float u, float v, float& Cr, float& Cg, float& Cb) {
    float t = 1.0;
    float u1 = u;
    float v1 = v;
    Cr = 0.0;
    Cg = 0.0;
    Cb = 0.0;
    for (float i = 0.0; i < 4.0; i++) {
        u1 = fract(u1 * 1.5) - 0.5;
        v1 = fract(v1 * 1.5) - 0.5;
        float d = length(u1, v1) * exp(-length(u, v));
        float colx, coly, colz;
        palette(length(u, v) + i * 0.4 + t * 0.4, colx, coly, colz);
        d = sin(d * 8.0 + t) / 8.0;
        d = abs(d);
        d = pow(0.1 / d, 1.2);
        Cr += colx * d;
        Cg += coly * d;
        Cb += colz * d;
    }
}

Figure 2.8: Shadertoy shader C++ adaptation

The code in Figure 2.8 is a C++ adaptation of the GLSL shader code. Bear in mind that the functions fract, length, abs, exp, pow, sqrt, sin and cos also need to be implemented in the same file and decorated with inline and __attribute__((always_inline)) to force inlining in Clang, because the end result must be a single, self-contained function with no calls to other functions. In this example, those functions have been implemented manually but are omitted for simplicity. Then the Clang compiler is run over the file to produce the .ll code with the following command:

$ clang++ -c -S -emit-llvm shader.cpp -O3 -ffp-contract=off -o 0.ll

The -ffp-contract=off flag is required to avoid fusing multiply-add operations, which are not yet supported in the current pipeline. Then, the "0.ll" shader can be added to the "shaders" folder and the Eleven compilation run.
By doing this, if a rendering operation is executed with the "albedo shader id" set to "0" in some material, a result similar to Figure 2.9 is obtained.

Figure 2.9: Shadertoy shader rendered in Eleven Render

2.4 Miscellaneous improvements

Intel Open Image denoiser

One of the most common "hacks" for speeding up convergence times is to apply an end-image denoiser. Intel provides an open source solution, broadly used in the industry, as part of its oneAPI Rendering Toolkit, which uses machine learning as a novel way to denoise images. This solution has been chosen because of its ease of implementation and has been included in the current version of Eleven. It is enabled by default and can be selectively disabled.

Figure 2.10: Visual differences for denoise application in a sample scene. (a) 10 samples, no denoise; (b) 10 samples, denoised; (c) 500 samples, no denoise; (d) 500 samples, denoised.

As can be observed in Figure 2.10, denoising gives an overall better-looking image, but it will not recover detail unless that detail is minimally present in the pre-denoised image. For example, the leaf marks on the teapot are washed out by the denoising from (a) to (b), but are preserved from (c) to (d), as the denoiser can do a better job of identifying details in less noisy images.

Normal re-computation

Eleven's rendering can show shading artifacts when trying to render a flat surface whose normals have not been computed as flat. This happens, for example, when Blender applies a smooth-shading normal computation over flat surfaces. As a simple solution, an algorithm to recompute the normals has been implemented to address this issue and compute more suitable normal vectors. This algorithm averages the face normals surrounding a vertex, weighted by the size of each face, and is implemented in the recompute normals face weight function in the ObjLoader.cpp file.
As shown in Figure 2.11, after recomputing the normals, some shading artifacts are successfully removed.

Figure 2.11: Eleven Render comparison with and without normal re-computation. (a) original normals; (b) face-weighted normals.

Per-face material assignment

The previous version of Eleven retrieved the materials from an intersection on a per-object basis; each MeshObject had one material assigned. The common approach, however, is to do it per face, so each triangle in an object can have a different material assigned.

Chapter 3 Eleven Blender

A big part of making a rendering engine is the way it integrates with 3D applications. Most 3D design applications provide an API to develop custom plugins or add-ons, as most of them rely on third-party rendering engines. That is also the case for the well-known open source 3D application Blender, the application I have chosen to be the first to interact with Eleven Render. To achieve this, I have released a custom add-on called ElevenBlender, whose repository is available at https://github.com/101001000/ElevenBlender. This add-on is fully implemented in Python and is in charge of managing the whole connection with Eleven, converting the scene data to Eleven's format, and setting up and configuring the render parameters.

Figure 3.1: Eleven Blender's configuration panel (panel.py)

Once the add-on is loaded and Eleven is selected as the main rendering engine, a configuration panel like the one shown in Figure 3.1 becomes available in the render settings. It is possible to connect to a running instance over TCP by providing an IP and a port. After a successful connection, the available devices on the machine where the rendering instance is running are listed. This list excludes the devices for which Eleven has not been compiled with the AOT flag -fsycl-targets, as they would not be able to run shader code.
After that, it is possible to render the current scene with the standard Blender rendering command. The next subchapters detail how the data is delivered to Eleven.

3.1 Geometry loading

The current method for sending the scene geometry data is inefficient but easy to implement, and it takes advantage of the previous Eleven Render Wavefront loading code. Ideally, the geometry data would be read directly from Blender's memory; instead, Eleven Blender calls Blender's Wavefront IO export module to create a .obj file with all the scene geometry and a .mtl file with the material data. Both files are then sent to and loaded by Eleven. Blender has two functions to export Wavefront files: bpy.ops.export scene.obj and bpy.ops.wm.obj export. The first one is the legacy one, deprecated in newer Blender versions, but it was compatible with rapidobj, the Wavefront OBJ loading library used by Eleven Blender; the second had compatibility issues with rapidobj due to discrepancies in the material formatting (even though both work under the same Wavefront standard). The solution has been to strip all the data from the exported material information and delete the empty-name materials which were crashing rapidobj. This is a temporary solution, as it requires opening and processing a file (which can be on the order of 100 million lines) line by line to search for the material inconsistencies; in the future, an issue should probably be opened against the rapidobj library and the problem fixed at the library level. Both methods have been tested, with very different results.
The second one is notably faster: almost 24 times faster for the demo scene "living-roomCC0" detailed in Figure 5.1, as can be seen in Figure 3.2.

Figure 3.2: Comparison between both Wavefront exporting methods of Blender. Time to export: export scene.obj (old) 12.61 s; wm.obj export (new) 0.52 s.

3.2 Material interface

Blender offers two options when making a custom rendering engine: using Blender's shading nodes, or defining custom shading nodes (which is the default option) via the parameter bpy.types.RenderEngine.bl use shading nodes custom. Most third-party rendering engines that work with Blender take the second approach, as it is much easier: they can share their custom shading node system across many different 3D applications, and they avoid having to maintain compatibility with Blender's nodes, which change with each release1. Eleven, on the other hand, prioritizes flexibility with different shading systems over having its own. For that reason, Eleven uses the already existing Blender node system, even if that means a very limited implementation and limited material compatibility. That being said, the way Eleven processes Blender's materials is quite simple: it just reads the material surface input and, if it is a Principled BSDF, uses it; if not, it discards the material. The Principled BSDF node is in essence Disney's Principled model, which Eleven makes use of, so the parameters are paired one by one with Eleven's material representation. Regarding the parameters, it is important to note that, because of the incompatibility with the rest of Blender's shading system, the current system will only retrieve value inputs which do not come from nodes, except for texture nodes (or a normal node with a texture attached); in that case, the texture is retrieved and sent to Eleven.
Custom shaders

If Eleven has been compiled with some shaders as described in Section 2.3, it is possible to select them as material inputs with the custom "Eleven Shader Selector Node", which takes an ID as its value and tells Eleven to compute that shader for the input it is connected to.

Figure 3.3: Eleven Shader Selector node

1Disney's RenderMan can use Blender's shading nodes thanks to its support for OSL. Some nodes are translated manually to its own shading system, but for the rest it relies on the OSL implementation of Blender's node system.

Chapter 4 Abstract Shading Language

Abstract Shading Language was designed to be an expressive language to control light interaction in a ray-traced scene, with the advantage of being abstract enough to work as a middle step between other shading languages and Eleven. That intention was ambitious, as the process of verifying the translation to VIR turned out to be harder and more tedious than initially anticipated. This chapter covers the specification of the language, together with the formalization of its semantics in Coq and the verified VIR translation. All the implementation details of ASL, such as the Coq-defined semantics, the OCaml parser, the translation implementation and all the concrete formal proofs, can be found in the ASL repository: https://github.com/101001000/ASL. Because of the large amount of work and time required for verification, the goal changed from a full-featured shading language to creating the simplest possible language and demonstrating a proof of concept for its inclusion in the project. For that reason, at the moment, the language only supports variable declarations and integer assignments. An example of ASL code can be seen in Listing 4.1.

var x;
var y;
x = 5;
y = 4;

Listing 4.1: ASL code example

4.1 Syntax

4.1.1 AST

The abstract syntax tree of ASL is defined in the Coq file AST.v, and its definition is provided in Figure 4.1.
There are two key aspects to discuss here.

The first aspect is why the declarations are separated from the statements. This decision restricts the flexibility of adding variable declarations in other parts of the code, but it simplifies the verification by splitting the problem into two steps: an allocation step and an execution step. This approach makes it possible to verify the execution step assuming all the variables are allocated. This will be explained in more detail in the verification section.

The second aspect concerns the datatypes ASL uses: int32, from Vellvm.DynamicValues, and the list datatype. int32 was chosen to facilitate the translation verification, but it might be preferable to reduce the dependency on Vellvm data types, to achieve more independence from the Vellvm semantics, by choosing another datatype to represent integers. The challenge would then be creating an equivalence between both types, which is problematic for operations outside the range of an int32 in the case of unbounded integers. ASL uses Coq's native polymorphic list datatype, instead of the classic tree recursion of simple, formally defined imperative languages like the While language [22], because ASL represents a sequence of instructions, which will become instruction blocks once control structures are included.

Inductive dec :=
| Var (x : string).

Inductive expr : Type :=
| Lit (n : int32).

Inductive stmt : Type :=
| Assign (x : string) (e : expr)
| Skip.

Inductive prog :=
| Prog (ds : list dec) (ss : list stmt).

Figure 4.1: AST.v

Inductive list (A : Type) : Type :=
| nil : list A
| cons : A → list A → list A.

Listing 4.2: Coq's list implementation

4.1.2 Lexer

Lexing in OCaml is relatively simple thanks to the tool ocamllex. To use it, one only needs to define a lexer.mll file like the one shown in Figure 4.2 and then run the tool over it.
The result is an OCaml module, lexer.ml, which can be used for lexing text inputs. One of the most important parts of lexing is making sure that the correct types are being used. As OCaml is a strongly typed language, and Coq extraction is extremely tricky with the exported types (one can have many different implementations of the same datatype), the types are usually incompatible and require conversions. CompCert, the formally verified C compiler [12], provides a useful OCaml module for these type conversions between Coq and OCaml, called Camlcoq.ml. This module is used in this project to handle most of the datatype conversions.

{
open Parser
exception Error of string
}

let digit = ['0'-'9']
let letter_or_underscore = ['a'-'z' 'A'-'Z' '_']
let letter_or_underscore_or_digit = ['a'-'z' 'A'-'Z' '_' '0'-'9']

rule token = parse
| [' ' '\t' '\n'] { token lexbuf }
| "var" { VAR }
| '=' { EQ }
| ';' { SEMI }
| letter_or_underscore letter_or_underscore_or_digit* { ID(Lexing.lexeme lexbuf) }
| digit+ as i { INT (Camlcoq.Z.of_sint32(Int32.of_string i)) }
| eof { EOF }
| _ { raise (Error (Printf.sprintf "Unexpected char at %d.\n" (Lexing.lexeme_start lexbuf))) }

Figure 4.2: ASL Lexer

4.1.3 Parser

The parsing process is done similarly to the lexing, but with the tool menhir. In this case, the most remarkable parts are: how both statements and variable declarations are handled as lists; how the INT token carries a Camlcoq.Z.t, which represents the same type as DynamicTypes.Int32 from ASL.v; and how the explicit constructors for expressions and variable declarations have been removed, because Coq extraction optimizes singleton constructors. This could be avoided in the extraction code by using the flag Extraction KeepSingleton, but that throws other errors.
%{ open AST %}

%token INT
%token ID
%token SEMI
%token VAR
%token EQ
%token EOF
%token SKIP
%start main
%%

main:
| ds = decs ss = stmts EOF { Prog (ds, ss) }

stmts:
| s = stmt SEMI { [s] }
| s = stmt SEMI t = stmts { s :: t }

stmt:
| a = assign { a }
| sk = skip { sk }

decs:
| d = dec SEMI { [d] }
| d = dec SEMI t = decs { d :: t }

dec:
| VAR x = ID { Camlcoq.coqstring_of_camlstring x }

expr:
| i = INT { i }

assign:
| i = ID EQ e = expr { Assign(Camlcoq.coqstring_of_camlstring i, e) }

skip:
| SKIP { Skip }

Figure 4.3: ASL Parser

4.2 Semantics

Denotation

The semantics of ASL can be found in the file Semantics.v. They are a trimmed-down version of the Imp language described in the interaction trees paper [14], where SetVar is an uninterpreted event which can be "triggered" (trigger is just a fancy way of calling Vis e (fun x -> Ret x)). SetVar is a variant of the State monad, which comes from Coq's ext-lib extended library.

Definition denote_expr (e : expr) : itree eff int32 :=
  match e with
  | Lit n ⇒ ret n
  end.

Definition denote_stmt (s : stmt) : itree eff unit :=
  match s with
  | Assign x e ⇒ v ← denote_expr e ;; trigger (SetVar x v)
  | Skip ⇒ ret tt
  end.

Fixpoint denote_stmts (ss : list stmt) : itree eff unit :=
  match ss with
  | nil ⇒ ret tt
  | h :: t ⇒ denote_stmt h ;; denote_stmts t
  end.

Fixpoint denote_decs (ds : list dec) : itree eff unit :=
  match ds with
  | nil ⇒ ret tt
  | h :: t ⇒ match h with
             | Var x ⇒ trigger (SetVar x (Int32.repr 0%Z))
             end ;; denote_decs t
  end.

Definition denote_prog (p : prog) : itree eff unit :=
  match p with
  | Prog ds ss ⇒ denote_decs ds ;; denote_stmts ss
  end.

Figure 4.4: Program denotation

Interpretation

Once the construction of the denotation interaction tree is defined, it is necessary to define the interpretation of such a tree, i.e., how those events should be "triggered".
The State events, like SetVar, are interpreted as map insertions over a zero-initialized string to int32 map: mapE string (Int32.repr 0%Z). This behaviour is encapsulated in the mapE event and the interp_map definition of the interaction trees library Map, which abstracts the interaction with such a map.

https://github.com/101001000/ASL/blob/master/src/coq/Lang/Semantics.v
https://github.com/coq-community/coq-ext-lib/blob/master/theories/Data/Monads/StateMonad.v

Definition handle_State {E: Type → Type} `{mapE string (Int32.repr 0%Z) -< E} : State ~> itree E :=
  fun _ e ⇒ match e with
            | SetVar x v ⇒ insert x v
            end.

Definition interp_asl {E A} (t : itree (State +' E) A) : stateT env (itree E) A :=
  let t' := interp (bimap handle_State (id_ E)) t in
  interp_map t'.

4.3 LLVM-IR Translation

This section covers the file Compiler.v, which holds a translator from the ASL AST to the Vellvm AST. The top-level translation starts with the compile definition shown in Figure 4.5, which takes a prog input and concatenates the result of compiling the declarations (compile_decs_s ds) with the result of compiling the statements (compile_stmts_s ss). These functions return the Vellvm datatype code, which holds a sequence of VIR instructions. The Vellvm datatype code is polymorphic and accepts two kinds of instructions: dynamically typed and statically typed. A deeper explanation of the type differences in Vellvm is given in Section 4.4.4. For now, it is only necessary to know that most compiling definitions are defined twice: one with the _s suffix, returning statically typed code (code typ), and another one without the suffix, returning the same code but dynamically typed (code dtyp).

Definition compile (p : AST.prog) : code typ :=
  match p with
  | Prog ds ss ⇒ (compile_decs_s ds) ++ (compile_stmts_s ss)
  end.
Figure 4.5: Top level compile definition

Both compile_decs_s and compile_stmts_s are just recursive traversals over a list, applying to each element their non-recursive counterparts, compile_dec_s and compile_stmt_s, which select the required code generation definition. In the case of the compilation of a variable declaration, the code for allocating a variable of name x is concatenated with the initialization of that same variable to the value 0. Compiling a Skip statement returns an empty list, as expected, while compiling an assignment results in the LLVM store instruction. These two definitions are available in Figure 4.6.

https://github.com/DeepSpec/InteractionTrees/blob/master/theories/Events/Map.v
https://github.com/101001000/ASL/blob/master/src/coq/Compiler/Compiler.v

Definition compile_dec_s (d : dec) : code typ :=
  match d with
  | Var x ⇒ [gen_alloc_s x; gen_store_s x (Int32.repr 0%Z)]
  end.

Definition compile_stmt_s (s : stmt) : code typ :=
  match s with
  | Skip ⇒ nil
  | Assign x v ⇒ match v with
                 | Lit n ⇒ [gen_store_s x n]
                 end
  end.

Figure 4.6: Translation code for declaration and statement

gen_allocate and gen_store

These two definitions return the construction of an LLVM instruction:

Definition gen_allocate_s (x:string) :=
  (LLVMAst.IId (LLVMAst.Name x),
   LLVMAst.INSTR_Alloca (TYPE_I 32) None None).

Definition gen_store_s (x:string) (i:Int32.int) :=
  (IVoid 0%Z,
   (INSTR_Store false
     ((TYPE_I 32%N), (EXP_Integer (Int32.unsigned i)))
     (TYPE_Pointer (TYPE_I 32), (EXP_Ident (ID_Local (Name x))))
     None)).

4.4 Translation verification

4.4.1 The Vellvm memory model

Vellvm memory works with three elements: global_env, local_env and memory_stack. This project only focuses on the local_env and the memory_stack, ignoring the global environment. The local environment is a map from identifiers to values (which, in the ASL context, are always pointers to memory blocks).
The memory stack is a map from block addresses to blocks holding data values.

get_val

get_val is a helpful definition included to retrieve values from Vellvm memory. It retrieves the pointer associated with some variable in the local environment and returns the int32 value hosted in the block the pointer is addressing. There are several reasons why this operation could fail; in any such case, it returns a None option value:

1. There is no such variable in the local environment.
2. The variable does not hold a pointer.
3. There is no block associated with the pointer.
4. The data inside the block is not an int32.

In retrospect, it would have been easier from a verification standpoint to distinguish the different errors instead of just returning None.

Definition get_val (g: global_env) (l: local_env) (m: memory_stack) (x: string) : option int32 :=
  let addr := FMapAList.alist_find (Name x) l in
  match addr with
  | Some (UVALUE_Addr ptr) ⇒
      let val := read m ptr (DTYPE_I 32%N) in
      match val with
      | inl _ ⇒ None
      | inr v ⇒ match v with
                | UVALUE_I32 i ⇒ Some i
                | _ ⇒ None
                end
      end
  | _ ⇒ None
  end.

4.4.2 Renv

This section covers the file Renv.v, which holds the proposition relating the Vellvm environment with the ASL environment. It can be split into four sub-propositions.

The first proposition ensures that every value retrieved successfully with the get_val definition from the Vellvm environment must be the same as the value retrieved from the ASL environment, and the other way around. This is the core of the Renv relationship and ensures that all the values are the same between the two data models.
https://github.com/101001000/ASL/blob/master/src/coq/Compiler/Renv.v

Definition Renvh (env_asl : env) (g_llvm : global_env) (l_llvm : local_env) (m_llvm : memory_stack) : Prop :=
  forall k v,
    (alist_In k env_asl v ↔ get_val g_llvm l_llvm m_llvm k = Some v) ∧
    (FMapAList.alist_find k env_asl = None ↔ FMapAList.alist_find (Name k) l_llvm = None) ∧
    unique_ptrs l_llvm ∧
    no_next_key l_llvm m_llvm.

The issue is that the first proposition alone is not strong enough. One could imagine a scenario where some variable in the local Vellvm environment points to an unallocated memory region; such a state would still satisfy the relationship. For that reason, the relationship needs to be hardened with the second proposition, ensuring that if some variable is not in the ASL environment, it is not in the local Vellvm environment either.

The third proposition ensures that there are no repeated pointers in the local environment, and the fourth one asserts that the pointer of the next empty block is not in the local environment. These last two propositions act more like invariants that will be used in the reasoning.

The objective now is to prove that the Renv proposition still holds after performing certain operations over both environments at the same time. For example:

Lemma RenvhAssign : forall env_asl g_llvm l_llvm m_llvm x i,
  Renvh env_asl g_llvm l_llvm m_llvm →
  Renvh (FMapAList.alist_add x i env_asl)
        g_llvm
        (FMapAList.alist_add (LLVMAst.Name x) (UVALUE_Addr (next_logical_key m_llvm, 0%Z)) l_llvm)
        (mem_stack_add m_llvm x i).

The following lemma proves something similar, but for overwriting variables. It requires some preconditions: for example, overwriting a variable will fail if it has not been allocated previously, and it is necessary to ensure that the variable maps to some pointer in the Vellvm environment, and not to another kind of value.
These requirements are encoded in the hypotheses allocated ptr m, FMapAList.alist_find (Name x) l = Some (UVALUE_Addr ptr) and FMapAList.alist_find x e = Some i0:

Lemma Renv_holds : forall e g l m i x ptr i0,
  allocated ptr m →
  FMapAList.alist_find (Name x) l = Some (UVALUE_Addr ptr) →
  FMapAList.alist_find x e = Some i0 →
  Renvh e g l m →
  Renvh (FMapAList.alist_add x i e) g l
        match write m ptr (DVALUE_I32 i) with
        | inl _ ⇒ empty_memory_stack
        | inr x1 ⇒ x1
        end.

4.4.3 Bisimilarity

The final step of the compilation verification is in the file CompilerCorrectness.v.

Well formedness

The ASL code in Figure 4.7 is actually legal ASL, and its result is x -> 1, y -> 0, z -> 2, because ASL does not require allocations the way Vellvm does. But if that code is compiled with the proposed compiler, executing the resulting Vellvm code would give a different result, because z has not been allocated.

var x;
var y;
x = 1;
z = 2;

Figure 4.7: ASL legal code

This is actually one of the limitations of the proposed compiler. One way to solve it could be to create a more complex environment relationship taking this edge case into account; instead, for simplicity, it has been solved by limiting the ASL code that can be compiled, with a proposition called well_formed. An ASL program is well formed if every assigned variable has been declared previously. Formally:

Definition well_formed p :=
  match p with
  | Prog ds ss ⇒ forall x e, (In (Assign x e) ss → In (Var x) ds)
  end.

Figure 4.8: Well formedness proposition

where In v l is a proposition which states that some value v is in the list l. It could be interesting to extract this Coq definition in order to check in OCAML whether an ASL program is well formed or not. This would require a boolean definition instead of a propositional one, because Coq will not extract propositions, as it is not clear how a proposition can be computed.
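As suggested above, a boolean (and therefore extractable) counterpart of well_formed is straightforward. The following Python sketch shows what such a decidable check would compute; the tuple-based AST shape is an assumption of the sketch, not the extracted code:

```python
def well_formed(decs, stmts):
    """Decidable well-formedness: every assigned variable is declared.
    decs is a list of declared names; stmts is a list of ('assign', x, e)
    or ('skip',) tuples (an assumed, simplified AST shape)."""
    declared = set(decs)
    return all(s[1] in declared for s in stmts if s[0] == "assign")
```

On the program of Figure 4.7 this check fails, since z is assigned but never declared, which is exactly the class of programs the compiler theorem excludes.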
https://github.com/101001000/ASL/blob/master/src/coq/Compiler/CompilerCorrectness.v

4.4.4 Vellvm typ vs dtyp

One could think that the translation is already verified, but there is one step left. The compiler_correct theorem verifies the compilation of declarations and statements, but only for the dynamically typed version of the code. There are two kinds of types in Vellvm: DynamicTypes (dtyp) and StaticTypes (typ). The main difference is that DynamicTypes strips the pointer type information, as the semantics in Vellvm does not require knowing the type of the pointer1, which would only add complexity to the reasoning. There is a file called TypToDtyp.v which strips that information, converting a typ into a dtyp. Vellvm requires its input code to be typed with dynamic types.

The main issue is that even if Vellvm can work without the pointer type, LLVM-IR source code requires that type information to be available, which means the ASL translator needs to output statically typed code. That is the reason why the compile function outputs typ code, so it can be properly printed with the relevant type information. So it is necessary to add a last theorem which covers this typ/dtyp conversion:

Theorem compiler_correct_convert: forall (p : AST.prog),
  well_formed_prog p →
  bisimilar TT (denote_prog p) (denote_code (TypToDtyp.convert_typ [] (compile p))).

where the function TypToDtyp.convert_typ [] is applied to the compiled code to get the dynamic version of the code. The issue is that the convert_typ function makes the verification process even more tedious, so to get rid of it, two lemmas are demonstrated:

TypToDtyp.convert_typ [] (compile_decs_s ds) = (compile_decs ds)

and:

TypToDtyp.convert_typ [] (compile_stmts_s ss) = (compile_stmts ss)

After that, it is possible to rely on a list application lemma to prove the final compiler theorem.
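The effect of convert_typ on the two types ASL actually uses can be pictured with a small sketch. This is a deliberate simplification of TypToDtyp.v (types are modeled as Python tuples, and only the two cases relevant here are covered): integer types survive unchanged, while pointers forget their pointee type.

```python
def convert_typ(t):
    """Simplified picture of typ -> dtyp: integer types pass through,
    while a pointer forgets what it points to (cf. opaque pointers)."""
    if t[0] == "TYPE_I":
        return ("DTYPE_I", t[1])
    if t[0] == "TYPE_Pointer":
        return ("DTYPE_Pointer",)   # pointee type is stripped
    raise ValueError("type not covered by this sketch")
```

For example, convert_typ(("TYPE_Pointer", ("TYPE_I", 32))) returns ("DTYPE_Pointer",): the i32 pointee, which gen_store_s needed to print valid LLVM-IR, is gone in the dynamic view.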
4.4.5 The missing lemma

There is one lemma left which has not been demonstrated yet to fully verify the compilation:

Lemma next_logical_key_mem_stack_add : forall m x i,
  next_logical_key (mem_stack_add m x i) = ((next_logical_key m) + 1%Z)%Z.

It states that after adding some element to the memory, the key of the next empty block will be the key before the addition, plus one. Verifying this lemma does not pose a different challenge from the ones faced previously, but it is tedious and could not be completed in time.

1The Vellvm team came to the same conclusion as others: the type information of a pointer is not necessary for the semantics of LLVM programs and only adds complexity, which matches the LLVM project's design decision of using Opaque Pointers, explained in Section 2.3.2.

Chapter 5

Benchmarking

This chapter aims to provide empirical data about the evaluation of Eleven in scenarios similar to the ones an end user would face.

5.1 Backend comparison

This section compares Eleven across different devices with different back-ends. The objective is to give a quick summary of the relative differences between devices that could be found in rendering workstations. The scene chosen to be rendered on these devices is the one shown in Figure 5.1.

Figure 5.1: Shader demo scene for backend comparison

It has been chosen for a couple of reasons. First, it is completely open source, with license CC0.
All the models and resources are available in polyhaven.com. Second, it has a moderately complex geometry (with a very complex scene, most of the time would be spent on the CPU building the data structures required for ray tracing intersections, which is not relevant for SYCL performance). Third, it has a memory footprint of roughly 1 GB, most of it 1024x1024 pixel textures, so it should fit in most consumer GPUs. And last, it includes three shaders: the shader from Section 2.3.3, compiled from C++ code, which is heavy, requires some resources and is applied to the upper triangular sphere; and two lightweight color ASL shaders for the two spheres, compiled with the ASL compiler. The results of the rendering execution across the devices listed are available in Figure 5.2.

OS      | Compiler version  | Vendor | Model            | Device type | Backend   | Result
Windows | DPC++ 2022-12     | Intel  | i9 13900KF @3Ghz | CPU         | OpenCL    | 8:10m
Windows | DPC++ 2022-12     | Intel  | i5 12600         | CPU         | OpenCL    | 15:57m
Windows | DPC++ 2022-12     | Nvidia | RTX3090          | GPU         | NVPTX     | 3:20m
Linux   | DPC++ (#5801970)  | AMD    | 6700XT           | GPU         | AMD HIP   | 1:33m
Linux   | ICPX 2023.2.0     | Intel  | i9 13900K        | CPU         | OpenCL    | 7:02m
Linux   | ICPX 2023.2.0     | Intel  | ARC A770         | GPU         | LevelZero | 5:17m
Linux   | ICPX 2023.2.0     | Intel  | ARC A770         | GPU         | OpenCL    | 5:20m

Figure 5.2: Backend comparison

The two main points to highlight here are the low performance of the high-end RTX3090 compared to the mid-range AMD 6700XT, with the latter performing more than twice as fast, and the minimal difference between the LevelZero and OpenCL backends on the same device.

5.2 Anatomy of a rendering session

The same scene used in the previous section has been used to illustrate the timeline of an execution of Eleven, with the objective of providing a deeper understanding of where time is spent. This particular execution is the same as the one shown in Figure 5.2 with the Nvidia RTX3090. The timeline is the following:

60ms · · · · · ·• Sending and loading configuration.
3230ms · · · · · ·• Sending and loading textures.
130ms · · · · · ·• Sending and loading materials.
700ms · · · · · ·• Object processing.
880ms · · · · · ·• Sending and loading geometry data.
2450ms · · · · · ·• BVH construction.
1170ms · · · · · ·• Scene device copy.
40ms · · · · · ·• Setup kernel.
138000ms · · · · · ·• Rendering kernel.

Table 5.1: Timeline of an Eleven rendering session

5.3 Cycles comparison

One of the main criticisms of the previous bachelor's thesis was that there were no direct comparisons with a state-of-the-art rendering engine to measure the performance of Eleven in the real world. This was in large part because setting up Eleven's scenes was a tedious task, and their format did not really match other engines' scene formats. Now, thanks to the new architecture and ElevenBlender, that has changed. This subsection covers the awaited performance comparison against the well-known and established Cycles rendering engine. Because Cycles is unable to run Eleven's shaders, the three elements with custom shaders from the previous scene have been removed.

Cycles CUDA vs Eleven CUDA

(a) 100 samples, Blender Cycles  (b) 100 samples, Eleven Render  (c) (a) image zoomed in  (d) (b) image zoomed in

Figure 5.3: Comparison between Cycles and Eleven in CUDA

The Eleven scene has required 4:28 min, while the same scene in Cycles has required 33 s, which makes Cycles 8.1x faster than Eleven in this scenario. It is also important to mention that Cycles makes better use of those 100 samples than Eleven does, probably because of a better implementation of importance sampling. An important consideration to keep in mind is that DPC++ uses the LLVM NVPTX back-end to generate the code which will run on the Nvidia GPU, while Cycles uses nvcc, a proprietary Nvidia compiler, with different optimizations and code generation. This is another factor to take into account, and future optimization work will need to keep this point in mind.
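The speedup figures reported in this chapter are plain ratios of wall-clock times; for instance, the 8.1x figure above follows from:

```python
def speedup(slow_s, fast_s):
    """Ratio of two render times given in seconds."""
    return slow_s / fast_s

eleven = 4 * 60 + 28              # 4:28 min for Eleven, in seconds
cycles = 33                       # 33 s for Cycles
ratio = speedup(eleven, cycles)   # ~8.12, reported as 8.1x
```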
Cycles CPU vs Eleven OpenCL

The same procedure has been repeated targeting the OpenCL back-end, resulting in 3:44 min for Eleven's execution and 1:27 min for Cycles', making Cycles 2.57x faster. This has been a pleasant surprise: usually, CPU rendering code is better optimized than GPU code, so getting relatively close performance without any specific optimization, just by targeting the OpenCL back-end, was not something expected.

Chapter 6

Conclusions

In this Master's thesis, the development of Eleven Render has been documented. It includes the architectural changes which have allowed a remote and flexible client-server workflow, and the ability to run over multiple heterogeneous architectures. Many improvements have been included, such as an AI denoiser, and a shading system which supports any language that can be translated to LLVM-IR has been defined, implemented and included in the compilation chain. Also, a custom plugin for the well-known open source 3D software Blender has been coded in Python, making use of Blender's API bpy.

A custom shading language, ASL, has been formalized and implemented in Coq, along with its own verified translation to LLVM-IR. Because of the complexity and the tedious work of verifying such a translation, ASL ended up as an extremely simple language, considered just a proof of concept, which deviates from the complete shading language initially planned. It has also been shown how code in this ASL shading language can be included in an Eleven scene. Finally, Eleven has been benchmarked and profiled on the main DPC++ target architectures, and a comparison with a state-of-the-art renderer (Cycles) has shown that it is relatively efficient.

This work will serve as a documented open source project for people who want to step up their "Toy Renders"1, and who want to dig into the details of the inner workings of an offline rendering engine.
It will also serve as the seed of what I consider a missing piece of the Vellvm project: an IMP-to-VIR verified and documented translation.

1A toy render is a simple 3D renderer, usually written as a coding exercise, which covers just the fundamentals of rendering.

Bibliography

[1] P. Christensen, J. Fong, J. Shade, W. Wooten, B. Schubert, A. Kensler, S. Friedman, C. Kilpatrick, C. Ramshaw, M. Bannister, et al., "RenderMan: An advanced path-tracing architecture for movie rendering," ACM Transactions on Graphics (TOG), vol. 37, no. 3, pp. 1–21, 2018.

[2] M. Pharr, W. Jakob, and G. Humphreys, Physically Based Rendering: From Theory to Implementation. MIT Press, 2023.

[3] J. T. Kajiya, "The rendering equation," in Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, pp. 143–150, 1986.

[4] Walt Disney Animation Studios, "The Moana island scene." http://technology.disneyanimation.com/islandscene/. https://datasets.disneyanimation.com/moanaislandscene/island-README-v1.1.pdf.

[5] B. Burley and Walt Disney Animation Studios, "Physically-based shading at Disney," in ACM SIGGRAPH, vol. 2012, pp. 1–7, 2012.

[6] E. de la Calle Montilla, "Evaluación y aceleración del algoritmo path tracing en arquitecturas heterogéneas." Trabajo de Fin de Grado en Ingeniería Informática, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, Curso 2020-2021, 2021.

[7] L. Gritz, "Open Shading Language spec 1.12." https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/doc/osl-languagespec.pdf, 2022.

[8] Chaos Group, "V-Ray GPU supported features." https://docs.chaos.com/display/VMAX/V-Ray+GPU+Supported+Features.

[9] Pixar, "RenderMan features and limitations." https://rmanwiki.pixar.com/display/REN24/Features+and+Limitations.
[10] W. Usher, "ChameleonRT." https://github.com/Twinklebear/ChameleonRT, 2019.

[11] J. Reinders, B. Ashbaugh, J. Brodman, M. Kinsner, J. Pennycook, and X. Tian, Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL. Springer Nature, 2021.

[12] X. Leroy, S. Blazy, D. Kästner, B. Schommer, M. Pister, and C. Ferdinand, "CompCert - a formally verified optimizing compiler," in ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress, 2016.

[13] Inria, "Coq proof assistant."

[14] L.-y. Xia, Y. Zakowski, P. He, C.-K. Hur, G. Malecha, B. C. Pierce, and S. Zdancewic, "Interaction trees: representing recursive and impure programs in Coq," Proceedings of the ACM on Programming Languages, vol. 4, no. POPL, pp. 1–32, 2019.

[15] Y. Zakowski, C. Beck, I. Yoon, I. Zaichuk, V. Zaliva, and S. Zdancewic, "Modular, compositional, and executable formal semantics for LLVM IR," Proceedings of the ACM on Programming Languages, vol. 5, no. ICFP, pp. 1–30, 2021.

[16] V. Zaliva, HELIX: From Math to Verified Code. Carnegie Mellon University, 2020.

[17] Z. Jin and J.
Vetter, "Evaluating CUDA portability with HIPCL and DPCT," in 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 371–376, IEEE, 2021.

[18] G. Castaño, Y. Faqir-Rhazoui, C. García, and M. Prieto-Matías, "Evaluation of Intel's DPC++ Compatibility Tool in heterogeneous computing," Journal of Parallel and Distributed Computing, vol. 165, pp. 120–129, 2022.

[19] Intel, "Performance impact of USM and buffers." https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/performance-impact-of-usm-and-buffers.html.

[20] Khronos Group, "SYCL specification." https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf, 2023. Accessed: April 18, 2023.

[21] LLVM Project, "Opaque pointers." https://releases.llvm.org/15.0.0/docs/OpaquePointers.html.

[22] H. R. Nielson and F. Nielson, Semantics with Applications: An Appetizer. Springer Science & Business Media, 2007.