IMPLEMENTATION AND PERFORMANCE EVALUATION OF A SEMANTIC IMAGE SEGMENTATION SYSTEM ON A MOBILE DEVICE

IMPLEMENTACIÓN Y EVALUACIÓN DE RENDIMIENTO DE UN SISTEMA DE SEGMENTACIÓN SEMÁNTICA DE IMÁGENES SOBRE UN DISPOSITIVO MÓVIL

TRABAJO FIN DE GRADO
CURSO 2020-2021

AUTORA
ESTHER CARREÑO ALOCÉN

DIRECTORES
LUIS PIÑUEL MORENO
FRANCISCO D. IGUAL

GRADO EN INGENIERÍA INFORMÁTICA
FACULTAD DE INFORMÁTICA
UNIVERSIDAD COMPLUTENSE DE MADRID

MADRID, SEPTIEMBRE DE 2021

ACKNOWLEDGEMENTS

To the project directors, Luis Piñuel Moreno and Francisco D. Igual, for all their trust and support. To my good friend Martín Bárez, who has always been there for me. To my boyfriend, for helping me not to fall into the quicksand of my mind.

ABSTRACT

Interest in semantic segmentation has grown in recent times, and its use in tasks such as autonomous driving, medical diagnosis or video surveillance is now crucial. The training and inference processes of Deep Neural Networks (DNNs) are usually performed in data centres, which can cause unbearable latency. Edge Computing is a response to this limitation; nevertheless, it is restricted by the computing power and energy consumption of the devices. This project proposes the implementation of algorithms for semantic segmentation of images using DeepLab and TensorFlow as a basis, along with their adaptation to a mobile device and an evaluation of their performance, in terms of response time, across different semantic segmentation models. A Raspberry Pi was used together with a Coral USB Accelerator by Google, which provides an Edge TPU that accelerates Machine Learning inference on quantized models. The final goal is to prove that an efficient implementation is possible on this low energy consumption architecture.

Keywords

Semantic segmentation, DNNs, Edge Computing, DeepLab, Coral, Energy Efficiency, Latency, Raspberry Pi

RESUMEN

En los últimos tiempos ha habido un aumento en el interés por la segmentación semántica, y su uso en tareas como la conducción autónoma, el diagnóstico médico o la videovigilancia es crucial. Los procesos de entrenamiento e inferencia en Redes Neuronales Profundas (DNNs) se realizan en centros de datos, lo que causa una latencia insostenible. El Edge Computing es una respuesta a esta limitación, pero está restringido por la potencia computacional y el consumo de energía de los dispositivos. Este proyecto propone la implementación de algoritmos de segmentación semántica de imágenes usando como base DeepLab y TensorFlow, además de su adaptación y evaluación de rendimiento según el tiempo de respuesta en dispositivos móviles entre diferentes modelos de segmentación semántica. Para ello se ha utilizado una Raspberry Pi y se ha optado por el acelerador USB Coral, de Google, que ofrece un Edge TPU para acelerar la inferencia en Machine Learning mediante la cuantización. El objetivo final es demostrar que una implementación eficiente en una arquitectura de bajo consumo energético es posible.
Palabras clave

Segmentación semántica, DNNs, Edge Computing, DeepLab, Coral, Eficiencia Energética, Latencia, Raspberry Pi

CONTENT INDEX

Acknowledgements III
Abstract IV
Resumen V
Content Index VII
Figure Index XI
Table Index XIII
Capítulo 1 - Introduction 1
1.1 Motivation 1
1.1.1 Semantic segmentation 1
1.2 Goals 3
1.3 Project plan 3
Capítulo 2 - State of the art 5
2.1 Convolutional Neural Networks 6
2.2 Transfer Learning 9
2.2.1 Transfer Learning in Edge-TPU 9
2.3 Quantization 10
2.3.1 How to quantize 10
2.3.2 Risks of quantization 12
Capítulo 3 - Hardware devices and architectures used 13
3.1 Laptop i7 NVIDIA® GeForce® GTX 13
3.2 Virtual Machines 13
3.3 Raspberry Pi 4 Model B 8GB RAM 13
3.4 USB Accelerator Google Coral 15
3.4.1 Edge TPU 16
Capítulo 4 - Frameworks used 17
4.1 TensorFlow 17
4.2 TensorFlow Lite 17
4.2.1 Post-training quantization 18
4.2.1.1 Full integer post-training quantization 19
4.2.1.2 TensorFlow Lite converter 20
4.3 DeepLab 21
4.3.1 How does DeepLab work 21
4.3.1.1 Spatial pyramid pooling 22
4.3.1.2 Atrous Convolutions 22
4.3.1.3 Depthwise Separable Convolutions 23
4.3.2 Datasets 23
4.3.3 Model Zoo 24
4.3.4 Evaluation metrics 26
4.3.5 Advantages of DeepLab 26
4.4 OpenCV 26
Capítulo 5 - Implementation and experimentation 27
5.1 DeepLab local installation 27
5.1.1 Laptop installation 27
5.1.1.1 Training 28
5.1.1.2 Evaluation 30
5.1.1.3 Visualization 31
5.1.1.4 Script (draft) 32
5.2 Virtual Machine 34
5.2.1 Virtual Machine configuration 35
5.2.2 Updated script 35
5.2.3 Measured times for each model 36
5.3 DeepLab in Raspberry Pi 4 37
5.3.1 Raspberry Pi configuration 37
5.3.2 Measured times for each model 38
5.4 Coral USB Accelerator in Raspberry Pi 4 44
5.4.1 Coral installation 44
5.4.2 How to run a model on the Edge TPU 44
5.4.3 How to run TFLite object detection models on the Raspberry Pi 45
5.4.4 How to run Edge TPU object detection models on the Raspberry Pi using Coral USB Accelerator 46
5.4.5 How to quantize a model to run it on Edge TPU 48
Capítulo 6 - Results Comparison and Analysis 53
6.1 Laptop vs. Virtual Machine execution times 53
6.2 Virtual Machine image input vs. Virtual Machine webcam input 53
6.3 Virtual Machine vs. Raspberry Pi 54
6.4 Raspberry Pi: regular model vs. Edge TPU model 55
Capítulo 7 - Conclusions and future work 57
Bibliografía 59
Apéndice A - Configuration Guidelines 65

FIGURE INDEX

Figure 1-1. Example of semantic image segmentation
Figure 2-1. Example of a CNN sequence to classify handwritten numbers
Figure 2-2. 4x4x3 RGB Image
Figure 2-3. Convolution process
Figure 2-4. Quantization technique decision tree
Figure 3-1. Raspberry Pi 4 Model B architecture
Figure 3-2. USB Accelerator Google Coral
Figure 3-3. Two Edge TPU chips on a US penny
Figure 4-1. Workflow to create a model for Edge TPU
Figure 4-2. TFLite conversion diagram
Figure 4-3. Segmentation result example on Flickr image
Figure 4-4. Spatial pyramid pooling
Figure 4-5. Example mobilenetv2_coco_voc_trainaug
Figure 4-6. Example xception65_coco_voc_trainval
Figure 5-1. Console output when running model_test.py
Figure 5-2. Recommended Directory structure for Training and Evaluation
Figure 5-3. Command used to run train.py
Figure 5-4. Terminal output when training DeepLab using the official repository (train.py)
Figure 5-5. Command used to run eval.py
Figure 5-6. Command used to run vis.py
Figure 5-7. Segmentation results after running vis.py
Figure 5-8. mobilenetv2_coco_voctrainaug segmentation laptop output example
Figure 5-9. mobilenetv2_coco_voctrainval segmentation laptop output example
Figure 5-10. xception_coco_voctrainaug segmentation laptop output example
Figure 5-11. xception_coco_voctrainval segmentation laptop output example
Figure 5-12. Raspberry Pi installation interface
Figure 5-13. RasPi Unclear input, xception_coco_voctrainval
Figure 5-14. RasPi Unclear input, xception_coco_voctrainaug
Figure 5-15. RasPi Unclear input, mobilenetv2_coco_voctrainval
Figure 5-16. RasPi Unclear input, alternative mobilenetv2_coco_voctrainval
Figure 5-17. RasPi Unclear input, mobilenetv2_coco_voctrainaug
Figure 5-18. RasPi Clear input, xception_coco_voctrainval
Figure 5-19. RasPi Clear input, xception_coco_voctrainaug
Figure 5-20. RasPi Clear input, mobilenetv2_coco_voctrainval
Figure 5-21. RasPi Clear input, mobilenetv2_coco_voctrainaug
Figure 5-22. RasPi Clear input, alternative mobilenetv2_coco_voctrainaug
Figure 5-23. RasPi Several objects, xception_coco_voctrainval
Figure 5-24. RasPi Several objects, xception_coco_voctrainaug
Figure 5-25. RasPi Several objects, mobilenetv2_coco_voctrainval
Figure 5-26. RasPi Several objects, mobilenetv2_coco_voctrainaug
Figure 5-27. Edge TPU inference results
Figure 5-28. TFLite object detection models on a Raspberry Pi
Figure 5-29. Edge TPU object detection models on a Raspberry Pi using Coral
Figure 5-30. Terminal output when converting the TF model to a TF Lite model
Figure 5-31. Terminal output when compiling the TF Lite model to Edge TPU
Figure 5-32. Inference result of dummy_edgetpu.tflite model with input image
Figure 5-33. Inference result of smarty_edgetpu.tflite model with input image
Figure 5-34. Inference result of smarty_edgetpu.tflite model with webcam input

TABLE INDEX

Table 1-1. Gantt diagram with an overview of the tasks done
Table 2-1. Compared inference times of DNN models for edge and cloud computing
Table 2-2. Quantization methods and their performance in TensorFlow Lite
Table 4-1. Post-training quantization techniques
Table 4-2. Computation complexity (in terms of Multiply-Adds and CPU Runtime) and segmentation performance (in terms of mIOU) on the PASCAL VOC val or test set
Table 5-1. Execution time of semantic segmentation (in seconds) using different models
Table 5-2. Execution time of semantic segmentation (in seconds) using different models in the Virtual Machine
Table 5-3. Semantic segmentation execution times (in seconds) of webcam images in the Virtual Machine
Table 5-4. Semantic segmentation execution times (in seconds) of webcam images in the Raspberry Pi 4
Table 6-1. Inference time comparison (in seconds) between Virtual Machine and Raspberry Pi
Table 6-2. Time comparison (in seconds) between regular and EdgeTPU semantic segmentation model in Raspberry Pi

Capítulo 1 - Introduction

1.1 Motivation

In the last few years there has been a major breakthrough in Artificial Intelligence. This great stride was possible thanks to the progress of processors' computing capacity. Nowadays, a large number of services have been automated, and the tendency to avoid human interaction in different tasks keeps growing, to the point where autonomous processes are rapidly expanding. For that purpose, we use Deep Learning, so that the computer behaves like a human while carrying out a task. In particular, we use Convolutional Neural Networks (CNNs), because they are designed to process image pixel data.
1.1.1 Semantic segmentation

Semantic segmentation is a branch of computer vision. It can be described as the assignment of a label (which refers to a concept, for instance a car or a person) to each pixel in an image, according to the object that has been detected. It has become a key process for multiple tasks, such as autonomous driving (detecting signs and obstacles), automated image-based medical diagnosis (which could be used for early detection of tumours), crowd management and people counting, and video surveillance, among others. To this effect, DeepLab, a semantic segmentation model developed by Google and based on TensorFlow, is noteworthy.

Figure 1-1. Example of semantic image segmentation. [18]

As Figure 1-1 shows, semantic segmentation labels each pixel in the image with a category label; however, it does not differentiate instances. This means that it does not separate objects of the same category. In this example, both cows in the image are treated as a single set of pixels that corresponds to the cow category.

There has been a growth of interest in this field, motivated by the revival of Deep Neural Networks (DNNs) and the improvement of processors' computing capacity. However, it has to be taken into consideration that the training and inference processes of DNNs are usually performed in data centres (in the cloud), which often leads to an unbearable response time or latency. In response to this limitation, Edge Computing is being used, which consists in performing the operations near the device that needs the results. This approach reduces the latency, but it is limited by the computing power as well as the energy consumption of these devices. Note that inference refers to the process of making predictions on unseen data using a trained DNN model.

1.2 Goals

This project aims to study a neural network model and to compare its metrics, such as precision and latency, depending on the trained models used, their datasets and the target architecture, by creating a semantic segmentation application. The final goal is to prove that an efficient implementation is possible on a low energy consumption architecture.

1.3 Project plan

Table 1-1. Gantt diagram with an overview of the tasks done

Capítulo 2 - State of the art

Training and inference processes in DNNs are usually performed in data centres. Although some cloud services, such as AWS, provide inference services like Amazon Polly, Rekognition and Lex, this is often not enough. When devices send data to the cloud for processing, the reception of the final results is totally dependent on network congestion as well as on the latency of the internet connection. Moreover, there is a lack of cost and energy efficiency when streaming dense information, like images and video. It should be noted that, when working with real-time applications, results need to be returned quickly to maintain a positive user experience and to meet the requirements of the application. Because of this, an adequate inference latency is crucial: the faster each prediction is, the more predictions per time unit we get, and hence a generally reduced cost. [23]

In response to these limitations of the cloud, Edge Computing comes into play; it consists in performing the operations near the device that needs those results. This approach reduces the latency, but it is limited by the computing power as well as the energy consumption of these devices.

Table 2-1. Compared inference times of DNN models for edge and cloud computing. [23]
Table 2-1 shows a comparison of the inference times of different DNN models for both cloud and on-premise (edge) computing. From this information it can be gathered that there is a trade-off between accuracy and latency. In addition, it also compares GPU and CPU. Later on in this project the TPU will also be discussed, as the Google Coral USB Accelerator provides an Edge TPU coprocessor for the Raspberry Pi 4 Model B, which reduces the latency even further.

2.1 Convolutional Neural Networks

Computer Vision (CV) is an area of Artificial Intelligence that imitates the human vision system, in such a manner that computers can identify and process items in images and videos just like humans do. [25]

Figure 2-1. Example of a CNN sequence to classify handwritten numbers. [36]

Convolutional Neural Networks (CNNs) are Deep Learning algorithms used in the field of CV. CNNs take an image as input and assign importance (in the form of learnable weights) to the different objects in that image in order to differentiate one from the other. This kind of network is inspired by the neuron connectivity of the human brain. [54]

Figure 2-2. 4x4x3 RGB Image. [36]

Figure 2-2 represents the three color planes of an RGB image. In this case, it is only 4x4 pixels, but even when the image has larger dimensions, the CNN reduces the processing complexity while keeping the information necessary for a good prediction. In order to implement the first convolution operation in the convolutional layer, for a 5x5x1 input image a Filter (K) is used, which is selected to be 3x3x1:

K = [ 1 0 1
      0 1 0
      1 0 1 ]

Figure 2-3. Convolution process. [36]

Then, the filter shifts 9 times, each time performing an element-wise multiplication with the corresponding portion (P) of the image and summing the results. The convolution operation aims to extract high-level attributes from the image.

After the convolutional layer, the pooling layer further reduces the size of the image. This way, processing the data does not require as much computational power. Max Pooling is generally the preferred type of pooling, given that it returns the maximum value from P and acts as a noise suppressant, discarding noisy activations that do not provide useful information.

As stated in [36], "by adding a Fully-Connected layer the network learns non-linear combinations of high-level features". Then the image matrix is flattened into a column vector, which is used as the input of a feed-forward neural network; backpropagation is applied in each iteration of the training process to compute the loss and its gradients. Finally, the features of the image are classified using the Softmax classification technique [45], which turns the network outputs into class probabilities so that the most likely label can be selected. [36]

2.2 Transfer Learning

Transfer Learning (sometimes also called "fine-tuning") focuses on applying the knowledge previously obtained for one task to solve similar problems. [37] In this way, it is not necessary to learn everything from scratch, which would consume time and computing resources. The key motivation for using transfer learning is that models which solve complex tasks need a large quantity of labelled data, but labelling that data can be costly in terms of time and labour. Moreover, transfer learning can be very useful when data becomes outdated easily, because previously obtained labelled data may no longer follow the same distribution. [29]
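As an illustration of this idea, the following minimal sketch retrains only a new classification head on top of a frozen pre-trained network using the tf.keras API. The base model, number of classes and dataset name are generic placeholders, not the ones used later in this project:

import tensorflow as tf

# Pre-trained CNN used as a frozen feature extractor (weights learned on ImageNet).
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights='imagenet')
base.trainable = False  # keep the previously learned knowledge untouched

# New classification head, trained on a smaller task-specific dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation='softmax')  # 5 new classes, as an example
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# new_train_ds would be a small labelled dataset for the new task (hypothetical):
# model.fit(new_train_ds, epochs=5)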
In semantic segmentation, and in computer vision in general, transfer learning relies on pre-trained models based on large CNNs. [24] To sum up, we can take an already trained model and perform extra training with a smaller dataset to teach the model new classifications.

2.2.1 Transfer Learning in Edge-TPU

Using transfer learning, it is possible to retrain an existing model compatible with the Edge TPU. This can be done by two methods, which are explained in [44]:

● "Retraining the whole model by adjusting the weights across the whole network."

● "Removing the final layer that performs classification, and training a new layer on top that recognizes your new classes. Once you're happy with the model's performance, simply convert it to TensorFlow Lite and then compile it for the Edge TPU." This will be explained later in this work.

2.3 Quantization

As explained previously, NNs have high computational costs and consume a lot of memory. Because of this, it is important to optimize their training and inference. Nowadays, more and more models move from servers to the edge because of the latency advantages that the edge offers. When running models on the edge, network optimization is even more crucial, given the limited computing power of these devices.

One technique for reducing the complexity of CNNs is quantization. The main idea of quantization is to convert floating-point weights and inputs into nearby integers. This helps to consume less memory and may lead to faster calculations, depending on the hardware. In other words, "it makes the model smaller and faster". [32] Even if the 8-bit integer representation is less precise, the inference accuracy of the NN is not remarkably compromised. [31]

There are several ways of representing floating-point numbers and integers, but float32 and int8 (which ranges from -128 to 127) are the most common for quantization. [11] Although the calculation speed depends on the hardware used, int8 is typically faster than float32. Nevertheless, it must be taken into account that float32 is the default type for NN training and inference. [3]

2.3.1 How to quantize

Quantization is applied to the matrix multiplications that dominate NN inference, mapping the float32 operands to integers. This procedure is an approximation; hence, some information is lost in the process. However, this loss may be acceptable. [11] There are two main ways to perform quantization:

● Post-training quantization: the model is trained using the default float32 weights and inputs, and the weights are quantized afterwards. It is easy to perform, but the disadvantage is that it can lead to accuracy loss.

● Quantization-aware training: the weights are quantized during the training phase. Its int8 quantization offers better results, but it is a more complex option.

Table 2-2. Quantization methods and their performance in TensorFlow Lite. [26]

As we can see in Table 2-2, there are multiple quantization techniques available for TensorFlow Lite.

Figure 2-4. Quantization technique decision tree. [26]

Figure 2-4 shows a decision tree to help select the most appropriate quantization method. It takes into account the size and the expected precision of the model.

2.3.2 Risks of quantization

NNs are immensely complex functions and, in spite of being continuous, they can change very quickly. Quantization implies surrendering some accuracy. Therefore, for tasks in which safety is vital, it is necessary to be extremely cautious. [26]
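To make the float32-to-int8 conversion concrete, the following small NumPy sketch maps a floating-point weight tensor to int8 and back. It is illustrative only: real frameworks such as TensorFlow Lite choose the scale and zero point per tensor or per channel during conversion.

import numpy as np

def quantize_int8(x):
    # Affine quantization: real_value ~= scale * (int8_value - zero_point)
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)   # toy float32 weights
q, scale, zp = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale, zp)).max()
print(q.dtype, 'max abs error:', error)   # small, but non-zero: quantization loses precision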
Capítulo 3 - Hardware devices and architectures used

3.1 Laptop i7 NVIDIA® GeForce® GTX

I used my Windows laptop to install DeepLab locally, as well as the GPU version of TensorFlow. Then I trained, evaluated and visualized the semantic segmentation of images using different models.

3.2 Virtual Machines

I configured a virtual machine to be able to execute DeepLab models. I first used VMware Workstation 15 Player; nevertheless, I had problems trying to capture images from the webcam that I used. Because of that, I finally decided to employ VM VirtualBox, where I used Ubuntu 20.04 as the Operating System. Overall, keeping the Virtual Machine (VM) in a working state was complicated because of the continuous problems I encountered. As a consequence, I had to configure the VM multiple times (every time one broke). I wrote a configuration script with all the steps necessary to correctly configure the VM, which can be found in Apéndice A.

3.3 Raspberry Pi 4 Model B 8GB RAM

The Raspberry Pi 4 Model B is a small computer. Each new model brings improvements in processor speed, multimedia performance, memory and connectivity, while trying to keep a similar power consumption. It offers desktop performance comparable to entry-level x86 PC systems. [34]

Figure 3-1. Raspberry Pi 4 Model B architecture. [33]

Among its key features, it offers a high-performance 64-bit quad-core processor, a pair of micro-HDMI ports that support 4K dual-display output, dual-band 2.4/5.0 GHz wireless LAN, Bluetooth 5.0, Gigabit Ethernet, USB 3.0, and PoE capability [34] (Power over Ethernet, which allows a networking device to be powered through the same cable that transmits the data). [48] In regards to graphics, it includes hardware video decoding at up to 4Kp60 and video rendering at 1080p and 30 fps. Finally, it must be emphasized that the model used has 8GB of LPDDR4 RAM, a synchronous dynamic random-access memory deployed in mobile devices due to its low energy consumption.

3.4 USB Accelerator Google Coral

"The Coral USB Accelerator is a USB device designed by Google that accelerates inference time for machine learning models. It provides an Edge TPU as a coprocessor for the computer" [46] (in my case, for the Raspberry Pi 4 Model B). It works on Linux, Mac and Windows.

Figure 3-2. USB Accelerator Google Coral. [46]

To get started, it is necessary to install the Edge TPU runtime and the PyCoral library on the host computer. [17] Then, the accelerator is used to run TensorFlow Lite models that are compatible with the Edge TPU. To be able to run my own models on it, I need to use post-training quantization, which will be explained later in this work.

3.4.1 Edge TPU

According to Google [8] and Coral [49] documentation, "Tensor Processing Units (TPUs) are Google's application-specific integrated circuits (ASICs)." "The Edge TPU provides high performance machine learning inferencing for low-power devices." "One Edge TPU can perform 4 trillion fixed-point operations per second (4 TOPS) using 2 watts of power."

Figure 3-3. Two Edge TPU chips on a US penny. [49]

The Edge TPU is an alternative to Cloud TPUs, also by Google. "Cloud TPUs run in a Google data center [...], and can perform 420 teraflops." This means that they are well suited for training large, complex models. "On the other hand, Edge TPU is ideal for small, low-power devices, since it provides extremely fast and power-efficient on-device ML inferencing." [49] It needs to be noted that the Edge TPU only supports the TensorFlow Lite framework.
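Once the Edge TPU runtime and PyCoral are installed, a quick way to check that the accelerator is detected is the small sketch below, based on the PyCoral utility API:

from pycoral.utils.edgetpu import list_edge_tpus

# Lists every Edge TPU found on the system (USB or PCIe).
devices = list_edge_tpus()
if devices:
    for d in devices:
        print('Edge TPU found:', d)
else:
    print('No Edge TPU detected; check the runtime installation and the USB connection.')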
Capítulo 4 - Frameworks used

4.1 TensorFlow

TensorFlow is an open source platform for Machine Learning developed by the Google Brain team. "It offers its own ecosystem of tools, libraries and resources to enable the development of new innovations in the field of Machine Learning." [40]

4.2 TensorFlow Lite

TensorFlow Lite (TFLite) [42] is a set of tools that enables on-device (edge) machine learning by helping to run models directly on devices. Its key features include:

● Optimization for on-device machine learning: it addresses latency (there is no round-trip to a server), privacy (no personal data leaves the device), connectivity (no internet connection is required), size (reduced model and binary size) and power consumption (efficient inference without network connections).
● Multiple platform support.
● Multiple language support, such as Java, C++ and Python.
● High performance, with hardware acceleration and model optimization.
● End-to-end examples for frequent ML tasks.

Previously, some quantization techniques available for TFLite were discussed. Now, the process of creating a TFLite model for the Edge TPU will be studied. As a brief introduction, the main steps are:

1. Picking a new model or retraining an existing one.
2. Converting the TF model (.pb) into a compressed FlatBuffer (.tflite) with the TFLite Converter.
3. Deploying: taking the compressed .tflite file and loading it onto the device.
4. Optimizing: quantizing the model.

Figure 4-1. Workflow to create a model for Edge TPU. [13]

Figure 4-1 shows the two ways of generating an Edge TPU-compatible model: quantization-aware training and post-training quantization, which were previously introduced. As shown in the figure, quantization-aware training is a more involved process. For the sake of simplicity, I will focus on post-training quantization. [13]

4.2.1 Post-training quantization

"Post-training quantization [31] is a conversion technique that is able to reduce the model size as well as improving the [...] latency", without compromising the accuracy too much. The quantization can be implemented by converting a trained TensorFlow model to the TFLite format using the TFLite Converter. As mentioned before, it is necessary to simplify models in order to bring pretrained models to architectures with fewer resources.

Table 4-1. Post-training quantization techniques. [31]

4.2.1.1 Full integer post-training quantization

As Table 4-1 indicates, there are several techniques to choose from. Given that the Coral USB Accelerator will be used, the chosen technique is full integer quantization, since it is supported by the Edge TPU. Its main benefits are that it makes the model about 4x smaller and provides a speedup of 3x or more. [31] This quantization technique transforms an already trained network into a quantized model, hence it does not require modifications to the network. [32] "In addition, it offers further latency improvements, cuts down peak memory usage, and is compatible with integer-only hardware devices or accelerators" [31] (Coral in this case). It is required to calibrate the range of all floating-point tensors in the model. To calibrate variable tensors such as the model inputs and outputs, it is necessary to run a few inference cycles with a small representative dataset. [16] Later in this project, I will explain that for my own quantization I used dummy images and webcam images as representative datasets.
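A condensed sketch of this full integer post-training quantization flow, written with the tf.compat.v1.lite.TFLiteConverter API used later in Section 5.4.5, is shown below. The frozen graph path, tensor names, input shape and the use of random "dummy" calibration images are placeholders and assumptions, not the exact values used in this project:

import numpy as np
import tensorflow as tf  # TensorFlow 1.15

def representative_dataset():
    # A few samples that cover the expected input range; random dummy images
    # stand in here for real calibration images.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

# Placeholder frozen graph and tensor names; the real ones depend on the exported model.
converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    'frozen_inference_graph.pb',
    input_arrays=['input'],
    output_arrays=['output'])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization so the model can run entirely on the Edge TPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8   # uint8 I/O with the TF 1.x converter;
converter.inference_output_type = tf.uint8  # newer converters also support int8

tflite_model = converter.convert()
with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_model)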
4.2.1.2 TensorFlow Lite converter

This converter transforms a TF model into a TFLite model (an optimized FlatBuffer, .tflite). There are two ways to use the converter: through the Python API or from the command line.

Figure 4-2. TFLite conversion diagram. [41]

To create a TFLite model for the Edge TPU, the first step is to convert the model to TFLite. Note that, to create a compatible model with post-training quantization, TensorFlow 1.15 must be the installed version (in order to set the input and output types to int8); at the time of writing it was not possible to use TensorFlow 2.0 for this, because its converter only supported float inputs and outputs. Finally, it is necessary to compile the model for compatibility with the Edge TPU. [19]

4.3 DeepLab

"DeepLab [35] is a state-of-the-art deep learning model for semantic segmentation of images that aims to assign meaningful labels to all pixels of the input image." It was designed and open-sourced by Google in 2016. Many improvements have been made since then, including DeepLab V2 [4], DeepLab V3 [5] and DeepLab V3+ [6].

Figure 4-3. Segmentation result example on Flickr image. [53]

DeepLab allows users to train the model, evaluate the results in terms of mIoU (mean intersection-over-union) and visualize the segmentation results.

4.3.1 How does DeepLab work

To understand DeepLab we will focus on DeepLabv3+ [6], its latest version. The model consists of two stages: [28]

● Encoding phase: the goal is to extract the key information from the image. This is done by a pre-trained CNN.
● Decoding phase: the previously extracted information is used to rebuild an output of the correct dimensions.

4.3.1.1 Spatial pyramid pooling

To make the model robust to changes in object size, spatial pyramid pooling (SPP) networks are employed; they capture multi-scale information by using differently scaled variants of the input during training. Then, the crucial features that represent most of the information are combined.

Figure 4-4. Spatial pyramid pooling. [28]

"Encoder-Decoder networks transform the input into a dense form that can represent all the input information." [28]

4.3.1.2 Atrous Convolutions

● To mitigate the increase in the computational and memory requirements of training caused by SPP, atrous convolutions are introduced.
● Atrous convolutions gather information from a broader effective field of view while keeping the same computational complexity. [52]

Their generalized form (in which a normal convolution corresponds to rate r = 1) is, for a 1-D input x, filter w and output y:

y[i] = Σ_k x[i + r·k] · w[k]

Atrous Spatial Pyramid Pooling (ASPP) is the combination of SPP with atrous convolutions; it normally consists of 4 parallel operations.

4.3.1.3 Depthwise Separable Convolutions

This technique reduces the number of computations needed to perform convolutions. For instance, suppose the input is 12 x 12 x 3 and a 5 x 5 convolution producing an 8 x 8 x 1 output is desired. For this purpose, the convolution is divided into two steps:

● Depthwise convolution: a 5 x 5 x 1 convolution is applied to each channel separately, and the obtained output is 8 x 8 x 3.
● Pointwise convolution: then, the channels are combined. 1 x 1 kernels with a depth of 3 (the depth of the input) are used. With this 1 x 1 x 3 convolution, the output is 8 x 8 x 1.
To increase the number of output channels, as many 1 x 1 x 3 convolutions as desired can be applied.

4.3.2 Datasets

I have mainly studied these datasets for training the models, each with different characteristics:

● PASCAL VOC 2012 [30]: it includes about 1,400 images for training and a similar number for validation, with 20 object categories such as animals, vehicles and other daily life objects.
● ADE20K [2]: this dataset contains more than 27K images and over 3K object categories. It is very dense, as there are many objects per image; this does not happen, for example, in PASCAL VOC 2012, which includes few objects in each image.
● Cityscapes [7]: it is focused on stereo video sequences recorded in streets and roads of 50 cities. There are about 3,000 images classified into 30 categories and divided into 8 groups (nature, sky, humans…). These images have been chosen from key frames of the videos.

4.3.3 Model Zoo

The DeepLab model zoo [51] offers models trained on PASCAL VOC 2012, Cityscapes and ADE20K. I have worked the most with PASCAL VOC 2012, whose model directories include:

● A frozen inference graph: frozen_inference_graph.pb
● A checkpoint: model.ckpt.data-00000-of-00001, model.ckpt.index

The checkpoints used were:

● mobilenetv2_coco_voc_trainaug
● mobilenetv2_coco_voc_trainval
● xception65_coco_voc_trainaug
● xception65_coco_voc_trainval

The checkpoints have been pre-trained on the PASCAL VOC 2012 train_aug set or on the train_aug + trainval set. Note that MobileNet-v2 based models do not use the ASPP and decoder modules, in order to speed up computation. This suggests, at a glance, that their accuracy should be lower than that of Xception_65.

Table 4-2. Computation complexity (in terms of Multiply-Adds and CPU Runtime) and segmentation performance (in terms of mIOU) on the PASCAL VOC val or test set. [51]

Looking at the PASCAL mIOU column [10], the precision is indeed lower for MobileNet-v2 than for Xception_65; however, MobileNet-v2 is faster. Let's try the model with the lowest mIOU (mobilenetv2_coco_voc_trainaug) and the one with the highest mIOU (xception65_coco_voc_trainval) with this input [1].

Figure 4-5. Example mobilenetv2_coco_voc_trainaug.
Figure 4-6. Example xception65_coco_voc_trainval.

As expected, at a glance it is possible to see that the accuracy is higher for Xception_65.

4.3.4 Evaluation metrics

The metric used to measure accuracy is mIoU (Mean Intersection over Union). It consists in calculating the IoU between the ground truth and the output predicted by the NN: the region common to both is compared with their union to obtain a similarity percentage, which is then averaged over all classes. [10]

4.3.5 Advantages of DeepLab

There are three main advantages of using DeepLab: [4]

● Speed: thanks to atrous convolution.
● Accuracy: state-of-the-art results are obtained on complex datasets, such as PASCAL VOC 2012.
● Simplicity: the system consists of two fixed modules, DCNNs and CRFs.

4.4 OpenCV

OpenCV [27] is an open source Computer Vision library. I have used it in the Python program that I wrote to capture images from my webcam and recognize the objects in them.

Capítulo 5 - Implementation and experimentation

As seen in the project plan, after researching semantic segmentation and DeepLab, and experimenting with a DeepLab demo, I proceeded to install DeepLab and test models locally, both on my laptop and in a virtual machine.
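The DeepLab demo mentioned above wraps a pretrained frozen graph and runs it on a single image. A simplified sketch of that wrapper is shown here; the tensor names and the 513-pixel input size follow the official deeplab_demo notebook, while the file paths are placeholders and details may differ between releases:

import numpy as np
import tensorflow as tf  # TensorFlow 1.15
from PIL import Image

class DeepLabModel:
    """Loads a frozen DeepLab graph and runs semantic segmentation on one image."""
    INPUT_TENSOR = 'ImageTensor:0'
    OUTPUT_TENSOR = 'SemanticPredictions:0'
    INPUT_SIZE = 513  # DeepLab models from the zoo expect inputs resized to 513 px

    def __init__(self, frozen_graph_path):
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(frozen_graph_path, 'rb') as f:
            graph_def.ParseFromString(f.read())
        self.graph = tf.Graph()
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name='')
        self.sess = tf.compat.v1.Session(graph=self.graph)

    def run(self, image):
        # Resize keeping the aspect ratio so the longest side is INPUT_SIZE.
        width, height = image.size
        ratio = self.INPUT_SIZE / max(width, height)
        resized = image.convert('RGB').resize(
            (int(ratio * width), int(ratio * height)), Image.ANTIALIAS)
        seg_map = self.sess.run(
            self.OUTPUT_TENSOR,
            feed_dict={self.INPUT_TENSOR: [np.asarray(resized)]})[0]
        return resized, seg_map  # seg_map holds a class id per pixel

# model = DeepLabModel('frozen_inference_graph.pb')   # placeholder path
# resized, seg_map = model.run(Image.open('input.jpg'))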
5.1 DeepLab local installation

5.1.1 Laptop installation

As a first step, I cloned the git repository of DeepLab [35] on my Windows laptop with the help of Git Bash in order to perform a local installation. I installed a TensorFlow version with GPU support (python -m pip install tensorflow-gpu==1.15.3) as well as some libraries and drivers required for it to run correctly on the GPU, such as CUDA 9. The software dependencies table can be found at [9]. Once everything is installed, model_test.py can be run to check that the installation works correctly.

Figure 5-1. Console output when running model_test.py

5.1.1.1 Training

As a first step, I decided to start by running DeepLab on the PASCAL VOC 2012 dataset. [30] The converted dataset is saved in the directory ./deeplab/datasets/pascal_voc_seg/tfrecord, following this structure:

Figure 5-2. Recommended Directory structure for Training and Evaluation. [50]

Then, I trained the models using the train.py file (in the models/research/deeplab folder). The flags can be modified according to the desired configuration. In this case, a training job using xception_65 is launched with this command:

Figure 5-3. Command used to run train.py. [50]

According to the DeepLab repository [50], ${PATH_TO_INITIAL_CHECKPOINT} is the path to the initial checkpoint, ${PATH_TO_TRAIN_DIR} is the directory to which training checkpoints and events will be written, and ${PATH_TO_DATASET} is the directory in which the PASCAL VOC 2012 dataset resides. I specified the paths as follows:

● PATH_TO_INITIAL_CHECKPOINT=\deeplab\datasets\pascal_voc_seg\deeplabv3_pascal_trainval\model.ckpt
● PATH_TO_TRAIN_DIR=\deeplab\datasets\pascal_voc_seg\exp\train_on_train_set\train
● PATH_TO_DATASET=\deeplab\datasets\pascal_voc_seg\tfrecord

Figure 5-4. Terminal output when training DeepLab using the official repository (train.py)

This trains the model on the dataset and saves the checkpoint files to train_logdir, which will later be used in the evaluation.

5.1.1.2 Evaluation

The official DeepLab repository offers an evaluation file (eval.py). To evaluate a model trained with xception_65 this command is used:

Figure 5-5. Command used to run eval.py. [50]

According to the DeepLab repository, ${PATH_TO_CHECKPOINT} is the path to the trained checkpoint (i.e., the path to train_logdir) and ${PATH_TO_EVAL_DIR} is the directory to which evaluation events will be written. I specified the paths as follows:

● PATH_TO_CHECKPOINT=\deeplab\datasets\pascal_voc_seg\exp\train_on_train_set\train
● PATH_TO_EVAL_DIR=\deeplab\datasets\pascal_voc_seg\exp\train_on_train_set\eval
● PATH_TO_DATASET=\deeplab\datasets\pascal_voc_seg\tfrecord

However, I could not get any relevant information from the evaluation. When running the evaluation script I got "nan" ("not a number") as the mIoU value instead of a numeric result. Despite trying to obtain a meaningful result, I could not find a solution.

5.1.1.3 Visualization

To perform inference using the official DeepLab repository, the file vis.py is used. This command launches a visualization job using xception_65:

Figure 5-6. Command used to run vis.py. [50]

I specified the paths as follows:

● PATH_TO_CHECKPOINT=\deeplab\datasets\pascal_voc_seg\exp\train_on_train_set\train
● PATH_TO_VIS_DIR=\deeplab\datasets\pascal_voc_seg\exp\train_on_train_set\vis
● PATH_TO_DATASET=\deeplab\datasets\pascal_voc_seg\tfrecord

As a result, the segmentation results can be found at \deeplab\datasets\pascal_voc_seg\exp\train_on_train_set\vis:

Figure 5-7. Segmentation results after running vis.py
5.1.1.4 Script (draft)

After experimenting with the local version, I created a script to perform semantic segmentation on given images and measure the execution times. Firstly, I created a program to perform semantic segmentation on a sample input image; the output consisted of the semantic labels overlaid on the original image. This was based on a DeepLab Jupyter Demo. [12] The program performed these tasks:

● Load the latest version of the pretrained DeepLab model
● Load the colormap from the PASCAL VOC dataset
● Assign colors to the different labels (object types)
● Visualize the input image, and add the overlay of colors on the regions corresponding to the different objects

I then modified the program to measure the inference execution time for each model (a simplified sketch of this timing logic is shown later in this chapter). Next, I tested the different models and collected their execution times. I measured each model 20 times to obtain the maximum and minimum values:

Model                          Execution time
mobilenetv2_coco_voctrainaug   max: 0.937139 s    min: 0.782747 s
mobilenetv2_coco_voctrainval   max: 1.179126 s    min: 0.876957 s
xception_coco_voctrainaug      max: 9.069711 s    min: 8.678269 s
xception_coco_voctrainval      max: 10.749623 s   min: 8.941553 s

Table 5-1. Execution time (in seconds) of semantic segmentation using different models

In addition, I saved the output images in .png format. Here are some examples of the results of the different models:

Figure 5-8. mobilenetv2_coco_voctrainaug segmentation laptop output example
Figure 5-9. mobilenetv2_coco_voctrainval segmentation laptop output example
Figure 5-10. xception_coco_voctrainaug segmentation laptop output example
Figure 5-11. xception_coco_voctrainval segmentation laptop output example

As stated previously, MobileNet-v2 models do not use the ASPP and decoder modules, in order to speed up computation. Table 5-1 shows that the -val models have slightly higher inference execution times than their -aug versions. It must be taken into account that the models have been pre-trained on the PASCAL VOC 2012 train_aug set or on the train_aug + trainval set, which means that the -val models have been trained on more data than the -aug models. This could explain the higher accuracy of xception_coco_voctrainval and mobilenetv2_coco_voctrainval that can be observed in the figures above.

5.2 Virtual Machine

In the virtual machine I installed DeepLab locally, just as I did on my laptop. Having obtained more problems than relevant outcomes from the official scripts, I will now present the execution times of the different models run in the Virtual Machine with the script discussed earlier, using the exact same input image as before.

Model                          Execution time
mobilenetv2_coco_voctrainaug   max: 1.37438035 s   min: 1.037979 s
mobilenetv2_coco_voctrainval   max: 1.2677140 s    min: 1.101976156 s
xception_coco_voctrainaug      max: 10.826671 s    min: 8.763001 s
xception_coco_voctrainval      max: 12.645364 s    min: 9.4737105 s

Table 5-2. Execution time (in seconds) of semantic segmentation using different models in the Virtual Machine

Table 5-2 shows that the execution times in the Virtual Machine are slightly higher than, and roughly proportional to, the times of the models run directly on the laptop. However, there is not much difference between the first two models: mobilenetv2_coco_voctrainaug has a higher maximum value than mobilenetv2_coco_voctrainval, unlike what happened on the laptop, but its minimum value is proportional to the laptop results.
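For reference, the timing logic used to obtain Tables 5-1 and 5-2 can be sketched as follows. This is a simplified version, assuming a DeepLabModel wrapper like the one sketched at the beginning of this chapter, not the exact script used in the project:

import time
from PIL import Image

def measure_inference(model, image_path, iterations=20):
    image = Image.open(image_path)
    times = []
    for _ in range(iterations):
        start = time.time()
        model.run(image)            # semantic segmentation of the same input image
        times.append(time.time() - start)
    return max(times), min(times)

# max_t, min_t = measure_inference(DeepLabModel('frozen_inference_graph.pb'), 'input.jpg')
# print('max: %.6f s  min: %.6f s' % (max_t, min_t))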
I continued developing the script to perform semantic segmentation on webcam images and to measure the different execution times separately. This new version of the script is discussed below.

5.2.1 Virtual Machine configuration

The configuration of the VM VirtualBox virtual machine, with all the settings necessary to run the script that I implemented to execute DeepLab models, can be found in Apéndice A. I used Ubuntu 20.04 as the Operating System, activated a virtual environment with Python 3.7 and installed TensorFlow 1.15.

5.2.2 Updated script

As an improvement to the program, the input image is now captured through a webcam with the help of the OpenCV library. [27] Moreover, instead of computing the execution time as a whole, different times are measured:

● Image capture time
● Image redimension (resizing) time
● Inference time

5.2.3 Measured times for each model

mobilenetv2_coco_voctrainaug
  Capture time:     max_1: 0.744371   max2: 0.026295   min: 0.004427
  Inference time:   max_1: 1.566632   max2: 1.184288   min: 1.079641
  Redimension time: max: 0.042289     max2: 0.028708   min: 0.007805
mobilenetv2_coco_voctrainval
  Capture time:     max_1: 0.712017   max2: 0.018788   min: 0.003908
  Inference time:   max_1: 1.785160   max2: 1.590308   min: 1.144132
  Redimension time: max: 0.033032     max2: 0.0235562  min: 0.012508
xception_coco_voctrainaug
  Capture time:     max_1: 0.716405   max2: 0.013771   min: 0.004082
  Inference time:   max_1: 11.369783  max2: 10.720958  min: 8.296429
  Redimension time: max: 0.019892     max2: 0.015657   min: 0.005462
xception_coco_voctrainval
  Capture time:     max_1: 0.742603   max2: 0.015859   min: 0.005048
  Inference time:   max_1: 16.157057  max2: 9.322600   min: 8.121144
  Redimension time: max: 0.025819     max2: 0.023149   min: 0.008753

Table 5-3. Semantic segmentation execution times (in seconds) of webcam images in the Virtual Machine

These are the maximum and minimum values of the capture, inference and redimension times, extracted from 20 iterations. Two maximums are presented because the first measurement always has the highest value (max_1) for the capture and inference times. This first iteration is so high that it is not representative of the real times; because of that, max2 is also reported, which corresponds to the highest value excluding the first measurement. This does not happen for the redimension time, but for the sake of coherence two max values are also included.

Table 5-3 shows that the inference time is much higher for the Xception models; in particular, xception_coco_voctrainaug has the highest inference time range. Regarding MobileNet-v2, mobilenetv2_coco_voctrainval appears to have the highest times.

5.3 DeepLab in Raspberry Pi 4

As a further step, I used a Raspberry Pi as the architecture on which to execute the program. To do so, I configured the device as explained in the next section. In addition, I measured the execution times as done before in the Virtual Machine.

5.3.1 Raspberry Pi configuration

Firstly, I installed Ubuntu 20.10 on a microSD card to use as the Operating System of the Raspberry Pi. This process was very user friendly thanks to the installation interface. I followed this guide. [20]

Figure 5-12. Raspberry Pi installation interface. [20]

I installed the Operating System successfully; however, I had certain issues with software and library dependencies. I finally decided to install Ubuntu 20.04 on the microSD card instead, which was definitely a more complex process: I installed Ubuntu Server 20.04 LTS on the Raspberry Pi 4, and the GNOME 3 desktop environment on top of it, following this tutorial. [39] Then I had to install all the software needed, for instance TensorFlow and OpenCV. In this case I installed TensorFlow 2.4 and the latest numpy version. [21]
Finally, I connected the camera and executed my script, which ran successfully.

5.3.2 Measured times for each model

mobilenetv2_coco_voctrainaug
  Capture time:     max_1: 0.265220   max2: 0.003352   min: 0.001896
  Inference time:   max_1: 3.361115   max2: 2.792313   min: 2.239232
  Redimension time: max: 0.022485     max2: 0.022106   min: 0.019185
mobilenetv2_coco_voctrainval
  Capture time:     max_1: 0.265913   max2: 0.003488   min: 0.0019512
  Inference time:   max_1: 3.518164   max2: 2.715697   min: 2.364746
  Redimension time: max: 0.023496     max2: 0.023123   min: 0.018469
xception_coco_voctrainaug
  Capture time:     max_1: 0.261291   max2: 0.009885   max3: 0.003544   min: 0.001897
  Inference time:   max_1: 33.316450  max2: 25.020431  min: 23.625263
  Redimension time: max: 0.020824     max2: 0.020595   min: 0.018410
xception_coco_voctrainval
  Capture time:     max_1: 0.263705   max2: 0.007603   max3: 0.002768   min: 0.002304
  Inference time:   max_1: 32.254001  max2: 23.877033  min: 21.974183
  Redimension time: max: 0.024083     max2: 0.020609   min: 0.018630

Table 5-4. Semantic segmentation execution times (in seconds) of webcam images in the Raspberry Pi 4

These results have also been extracted from 20 iterations. The same behaviour observed in the Virtual Machine measurements appears here: the first measurement has the highest capture and inference times, so max_1 corresponds to that value and max2 to the next highest one. In some cases I have also included max3, because max2 was still extremely high. I can conclude that the semantic segmentation times on the Raspberry Pi are approximately twice as high as the results obtained in the Virtual Machine. Moreover, xception_coco_voctrainaug still has the highest values.

I will now show the segmentation results captured through the webcam with the different models, and their characteristics. Firstly, I used the model xception_coco_voctrainval, the one with the highest mIoU (87%), as seen in Table 4-2. I concluded that this model needs clear input images, with good lighting and well-defined figures, to make a prediction. If the image is not totally clear, the prediction output is all "background". Moreover, when it segments an object, the borders are delineated very accurately, unlike in the models xception_coco_voctrainaug and mobilenetv2_coco_voctrainaug, which I measured later.

It must be noted that when I performed the segmentation with both Xception models, the Raspberry Pi reached a very high temperature and I had to cool it down, a consequence of the long inference times. I also concluded that xception_coco_voctrainaug is not worth its inference time in relation to its accuracy (82-83% mIoU), which is almost the same as that of mobilenetv2_coco_voctrainval (80%), a model with a much lower inference time.

When using mobilenetv2_coco_voctrainaug, the model with the lowest mIoU, I noted that even without a very clear input image this model still makes a prediction; sometimes the inference is correct and sometimes it is not. Moreover, this model does not offer very accurate segmentation of the objects: the perimeter it covers is generally larger than the actual object. However, the inference time is much lower for MobileNet than for Xception. On the other hand, accuracy is not much of a problem for mobilenetv2_coco_voctrainval (80% mIoU versus the 75-77% of mobilenetv2_coco_voctrainaug), and it does not need totally clear images to make a correct prediction. I will now compare the inference behaviour of the models in similar situations.

Figure 5-13. RasPi Unclear input, xception_coco_voctrainval
Figure 5-14. RasPi Unclear input, xception_coco_voctrainaug
Figure 5-15. RasPi Unclear input, mobilenetv2_coco_voctrainval
Figure 5-16. RasPi Unclear input, alternative mobilenetv2_coco_voctrainval
Figure 5-17. RasPi Unclear input, mobilenetv2_coco_voctrainaug

When the image is not totally clear (too light or too dark), Figures 5-13 to 5-17 show how the Xception models have more trouble segmenting the images when they are not sure which label the object has. On the contrary, the MobileNet models perform inference even when the image is not clear, which can produce correct (Figure 5-15) or wrong (Figure 5-16) results.

Figure 5-18. RasPi Clear input, xception_coco_voctrainval
Figure 5-19. RasPi Clear input, xception_coco_voctrainaug
Figure 5-20. RasPi Clear input, mobilenetv2_coco_voctrainval
Figure 5-21. RasPi Clear input, mobilenetv2_coco_voctrainaug
Figure 5-22. RasPi Clear input, alternative mobilenetv2_coco_voctrainaug

When the input image has good lighting, the Xception models generally behave well, though xception_coco_voctrainaug (Figure 5-19) may not be as precise as xception_coco_voctrainval (Figure 5-18). mobilenetv2_coco_voctrainval behaves correctly too (Figure 5-20). However, mobilenetv2_coco_voctrainaug may identify the correct object but highlight other areas of the image in which the object does not appear (Figure 5-21), or it can even indicate a different label that does not correspond to the object in the input image (Figure 5-22).

Figure 5-23. RasPi Several objects, xception_coco_voctrainval
Figure 5-24. RasPi Several objects, xception_coco_voctrainaug
Figure 5-25. RasPi Several objects, mobilenetv2_coco_voctrainval
Figure 5-26. RasPi Several objects, mobilenetv2_coco_voctrainaug

In this situation there are several objects and the lighting is not ideal. It is surprising that xception_coco_voctrainval does not identify the second monitor; however, its inference of the other objects is very accurate. The other thing to be noted is that, in this case, mobilenetv2_coco_voctrainval looks less accurate than mobilenetv2_coco_voctrainaug.

To conclude, the model mobilenetv2_coco_voctrainval is a good solution because of its balance of accuracy and inference time. It is not as accurate as xception_coco_voctrainval, but it is much faster (as can be seen in the tables), which makes it a good option when precision is not absolutely crucial.

5.4 Coral USB Accelerator in Raspberry Pi 4

To be able to bring pretrained models to architectures with fewer resources, it is necessary to simplify those models. This is done through quantization, a technique that produces smaller models in TensorFlow Lite format.

5.4.1 Coral installation

To get started, it is necessary to install the Edge TPU runtime and the PyCoral library on the computer. All the installation steps are specified in the Coral documentation [17] for the different Operating Systems.

5.4.2 How to run a model on the Edge TPU

Firstly, I plugged in the Coral USB Accelerator. Then, I followed the Coral documentation to perform image classification with a MobileNet v2 model, using a Coral repository [22] to download the model, the labels and a photo. The last step is to run the image classifier on the photo.

Figure 5-27. Edge TPU inference results

Figure 5-27 shows the results of the inference performed on the Edge TPU using TensorFlow Lite. The top classification label is shown with its confidence score, from 0 to 1.0. In this case, the program reports a confidence of 0.75781 that the image shows a scarlet macaw.
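The classification example behind Figure 5-27 boils down to a few PyCoral calls. A condensed sketch is shown below; the model, label and photo file names correspond to the Coral getting-started example and act here as placeholders:

from PIL import Image
from pycoral.adapters import classify, common
from pycoral.utils.dataset import read_label_file
from pycoral.utils.edgetpu import make_interpreter

# Model compiled for the Edge TPU, its label file and a test photo.
interpreter = make_interpreter('mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite')
interpreter.allocate_tensors()

size = common.input_size(interpreter)
image = Image.open('parrot.jpg').convert('RGB').resize(size, Image.ANTIALIAS)
common.set_input(interpreter, image)

interpreter.invoke()               # the inference runs on the Edge TPU
labels = read_label_file('inat_bird_labels.txt')
for c in classify.get_classes(interpreter, top_k=1):
    print('%s: %.5f' % (labels.get(c.id, c.id), c.score))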
5.4.3 How to run TFLite object detection models on the Raspberry Pi

For this next procedure, I used this repository [15], following this guide [14]. I had already installed TensorFlow and OpenCV on my Raspberry Pi, so I skipped those steps. To set up the TensorFlow Lite (TFLite) detection model, I used one of Google's sample TFLite models, which can be found in the Sample_TFLite_model folder. The sample is a quantized SSDLite-MobileNet-v2 object detection model, trained on the MSCOCO dataset and converted to run on TFLite. To run the TFLite model, I executed the following command:

python3 TFLite_detection_webcam.py --modeldir=Sample_TFLite_model

Figure 5-28. TFLite object detection models on a Raspberry Pi

Figure 5-28 shows the webcam being used to capture the input video; the object detection takes place in real time. As the image shows, this model does not perform semantic segmentation, just object detection: in semantic segmentation each pixel is labelled with an object class, which is not the case in object detection. At the top right corner, Figure 5-28 displays "5,98 FPS". The Frames Per Second value indicates how fast the object detection model processes the input video and produces the output. [38]

5.4.4 How to run Edge TPU object detection models on the Raspberry Pi using Coral USB Accelerator

As explained before, the Coral USB Accelerator uses the Edge TPU, an ASIC chip with highly parallelized ALUs (arithmetic logic units) that are directly connected to each other, unlike what happens in GPUs. [14] All of this makes object detection models run faster.

To set up Coral, the first step is to install the libedgetpu1 library. There are two options: I chose libedgetpu1-std, which achieves 22.6 FPS (frames per second); the other option, libedgetpu1-max, achieves 26.1 FPS, but the Coral gets hotter.

sudo apt-get install libedgetpu1-std

The next step is to set up the Edge TPU detection model. This model is compiled specifically to run on Edge TPU devices like the Coral, and is stored in a .tflite file. There are two alternatives:

● Using Google's sample model, which is compiled from the quantized SSDLite-MobileNet-v2 used in the previous step.
● Using my own custom Edge TPU model. However, the edgetpu-compiler package does not work on the Raspberry Pi, only on a Linux PC.

In the end, I used Google's sample model to allow a fairer comparison with the previous step.

Figure 5-29. Edge TPU object detection models on a Raspberry Pi using Coral

Figure 5-29 shows the object detection. At the top right corner, it displays "22,47 FPS". Compared to the TFLite object detection model on a Raspberry Pi without Coral (5,98 FPS), the use of the Coral USB Accelerator has clearly increased the speed at which the object detection model processes the video and generates the desired output. [38] Here, the average inference performance (equivalent to frames per second) is computed as the total number of inferences performed divided by the total duration of the timing windows. [43]
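Under the hood, the difference between sections 5.4.3 and 5.4.4 is whether the TFLite interpreter is created with the Edge TPU delegate. A minimal sketch follows; the model paths are placeholders and the delegate library name shown is the Linux one:

from tflite_runtime.interpreter import Interpreter, load_delegate

# Without Coral: the model runs on the Raspberry Pi CPU.
cpu_interpreter = Interpreter(model_path='detect.tflite')

# With Coral: the Edge TPU delegate offloads the compiled ops to the accelerator.
tpu_interpreter = Interpreter(
    model_path='detect_edgetpu.tflite',
    experimental_delegates=[load_delegate('libedgetpu.so.1')])

tpu_interpreter.allocate_tensors()
print(tpu_interpreter.get_input_details()[0]['shape'])  # expected input shape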
5.4.5 How to quantize a model to run it on Edge TPU

As explained earlier, to create an Edge TPU model runnable by the Coral USB Accelerator, the TensorFlow model needs to be quantized. I will focus on post-training quantization. The main steps are to use the TensorFlow Lite Converter to obtain a TF Lite model (.tflite), and then to compile this model specifically for the Edge TPU to get an Edge TPU model, which can then be deployed to the Coral hardware. The TF Lite Converter can be used through the Python API or from the command line. All these conversion steps were done in the Virtual Machine because, as mentioned above, the Edge TPU compiler does not run on the Raspberry Pi.

In my case, I used the tf.compat.v1.lite.TFLiteConverter Python API to convert a frozen graph (.pb) from a file. I used the model Mobilenet_1.0_224, which is small, low-latency and low-power. For this, I created a script that first converted the model and then saved it as a .tflite file.

Figure 5-30. Terminal output when converting the TF model to a TF Lite model

Once I had the TFLite file, the next step was to compile it for the Edge TPU, which I did following the Coral compiler documentation. [13] Firstly, I had to follow some instructions to install the compiler (edgetpu-compiler) on my Linux system. The compiler is invoked with the following command:

edgetpu_compiler [options] model…

Figure 5-31. Terminal output when compiling the TF Lite model to Edge TPU

When the model compiles successfully, an _edgetpu.tflite file is generated. Then it is time to run this model on the Raspberry Pi with the PyCoral API. The first thing to do is to install PyCoral with this command:

sudo apt-get install python3-pycoral

After this, I tried an inference example from Coral to test that the model worked correctly. Unfortunately, that was not the case: I tested a cat image and the model detected a clock. In addition, this model was not suitable for semantic segmentation. I later noticed that Figure 5-31 indicates that no operations would run on the Edge TPU, which means that the type of quantization I needed had not been applied successfully.

Then, I created a Google Colab program to perform the quantization, which worked properly. I needed a representative dataset to perform the full integer post-training quantization and be able to run the resulting model on an Edge TPU architecture (the Coral Accelerator). For this purpose, I first generated a dataset of randomly generated dummy images to test that the process worked. I later created my own dataset of 30 images taken with my webcam to use as the representative dataset. Once this step was completed, the mobilenetv2_dm05_coco_voc_trainval model was transformed into a TF Lite model. I compared the bytes saved by the quantization using the dummy images:

TensorFlow Model is 3042785 bytes
TFLite Model is 963896 bytes
Post training int8 quantization saves 2078889 bytes

And using my webcam image dataset:

TensorFlow Model is 3042785 bytes
TFLite Model is 2789284 bytes
Post training int8 quantization saves 253501 bytes

In this experiment, the higher the quality of the representative dataset, the smaller the number of bytes saved by the quantization.

The next step is to compile the TF Lite model for the Edge TPU (to get an _edgetpu.tflite model) and to download it in order to perform inference on the Raspberry Pi using the Coral Accelerator. It must be noted that the only model I found whose operations can all run on the Edge TPU is mobilenetv2_dm05_coco_voc_trainval.

Figure 5-32. Inference result of dummy_edgetpu.tflite model with input image

When I used the dummy_edgetpu.tflite model (the model quantized using dummy images as the representative dataset) with a downloaded image as input, I got the results displayed in Figure 5-32.
Figure 5-32. Inference result of dummy_edgetpu.tflite model with input image

When I used the dummy_edgetpu.tflite model (the model quantized using dummy images as the representative dataset) with a downloaded image as input, I got the results displayed in Figure 5-32.

Figure 5-33. Inference result of smarty_edgetpu.tflite model with input image

When I used the smarty_edgetpu.tflite model (the model quantized using webcam images as the representative dataset) with a downloaded image as input, I got the result shown in Figure 5-33. It can be noted that neither is very accurate. To perform inference with these models I created a script, which can be executed as follows:

python3 sem_seg_edgetpu_pics.py --model dummy_edgetpu.tflite --input bird.bmp --keep_aspect_ratio --output segmentation_result_dummy.jpg

Figure 5-34. Inference result of smarty_edgetpu.tflite model with webcam input

However, when using the dummy model with a webcam image as input, the inference results were almost always labeled as "background", even though several objects were present in those images. The smarty model, in contrast, predicted some object labels, even if they were not always correct or accurate, as the example in Figure 5-34 shows. Regarding inference time, the dummy model takes slightly longer than the smarty model. A comparison between the Edge TPU models and the non-quantized models will be presented later.
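The script follows the same structure as Coral's PyCoral semantic segmentation example; a minimal sketch is shown below. The file names are the ones used above, but the colour handling is simplified and this is not the exact sem_seg_edgetpu_pics.py code:

import numpy as np
from PIL import Image
from pycoral.adapters import common, segment
from pycoral.utils.edgetpu import make_interpreter

MODEL = "dummy_edgetpu.tflite"   # or smarty_edgetpu.tflite
INPUT_IMAGE = "bird.bmp"
OUTPUT_IMAGE = "segmentation_result_dummy.jpg"

# Load the Edge TPU model and prepare the input at the size the model expects.
interpreter = make_interpreter(MODEL)
interpreter.allocate_tensors()
width, height = common.input_size(interpreter)
image = Image.open(INPUT_IMAGE).convert("RGB").resize((width, height), Image.LANCZOS)
common.set_input(interpreter, image)

# Run inference on the Edge TPU and take the per-pixel class with the highest score.
interpreter.invoke()
result = segment.get_output(interpreter)
if result.ndim == 3:
    result = np.argmax(result, axis=-1)

# Save the label map as a grayscale image (scaled only to make classes visible).
Image.fromarray((result * 10).astype(np.uint8)).save(OUTPUT_IMAGE)

A colour palette could be applied to the label map instead, to produce outputs similar to the figures above.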
Capítulo 6 - Results Comparison and Analysis

The inference, capture and redimension times of the different models on the various architectures have already been shown. In this section the results are compared and analyzed in order to better understand their meaning.

6.1 Laptop vs. Virtual Machine execution times

The overall execution times of the semantic segmentation executed on the laptop (Table 5-1) are slightly lower than the execution times in the Virtual Machine (Table 5-2). There is no big difference in the models' behaviour depending on the architecture used: the times on the laptop are proportional to the times in the virtual machine. These results were obtained with the same program using the same input images.

6.2 Virtual Machine image input vs. Virtual Machine webcam input

Later, I modified the program to measure the inference, capture and redimension times separately, unlike the previous tables, where all these times were lumped together as the execution time. This time the program did not perform the semantic segmentation on pictures given as input, but on images captured through the webcam, so the capture time was also measured (Table 5-3). In that table it can be noted that the capture time shows no significant variance between models. Moreover, the first capture time of each model is significantly higher than the rest. The same phenomenon occurs during inference, but far less so during the redimension. When comparing the inference on webcam images with the execution time on input images, the times are lower on the laptop than in the virtual machine for the MobileNet V2 models, bearing in mind that the laptop execution time includes both the inference and the redimension time. For the Xception models the picture is more ambiguous, but the minimum values are somewhat smaller in the virtual machine than on the laptop. Finally, the redimension time has the lowest values for the Xception models, specifically the xception_coco_voctrainaug model, while the MobileNet V2 models have higher redimension times.

6.3 Virtual Machine vs. Raspberry Pi

When it comes to the Raspberry Pi, the capture times are significantly smaller than in the Virtual Machine, and there is not much time difference between the models (Table 5-4).

Model                         | Virtual Machine inference time                     | Raspberry Pi inference time
mobilenetv2_coco_voctrainaug  | max_1: 1.566632, max_2: 1.184288, min: 1.079641    | max_1: 3.361115, max_2: 2.792313, min: 2.239232
mobilenetv2_coco_voctrainval  | max_1: 1.785160, max_2: 1.590308, min: 1.144132    | max_1: 3.518164, max_2: 2.715697, min: 2.364746
xception_coco_voctrainaug     | max_1: 11.369783, max_2: 10.720958, min: 8.296429  | max_1: 33.316450, max_2: 25.020431, min: 23.625263
xception_coco_voctrainval     | max_1: 16.157057, max_2: 9.322600, min: 8.121144   | max_1: 32.254001, max_2: 23.877033, min: 21.974183

Table 6-1. Inference time comparison (in seconds) between Virtual Machine and Raspberry Pi

However, the inference time on the Raspberry Pi is roughly two to three times as long as in the Virtual Machine. Table 6-1 shows the inference time comparison between both architectures. Regarding the redimension time, the Xception models have smaller values in the Virtual Machine than on the Raspberry Pi. In particular, the minimum values are extremely small, while the maximum values are very similar to those on the Raspberry Pi.

6.4 Raspberry Pi: regular model vs. Edge TPU model

The quantization and Edge TPU compilation were performed on a mobilenetv2_dm05_coco_voc_trainval model. This section compares the capture, inference and redimension times of the non-quantized (regular) model and the _edgetpu.tflite models.

Model                  | Capture time                                       | Inference time                                     | Redimension time
regular                | max_1: 0.303826, max_2: 0.0177173, min: 0.001894   | max_1: 5.633136, max_2: 2.095738, min: 1.611738    | max: 0.058004, min: 0.020867
dummy_edgetpu.tflite   | max: 0.269128, min: 0.266331                       | max: 0.008964, min: 0.008219                       | max: 0.033319, min: 0.029464
smarty_edgetpu.tflite  | max: 0.273280, min: 0.268579                       | max: 0.007211, min: 0.007089                       | max: 0.031751, min: 0.030280

Table 6-2. Time comparison (in seconds) between regular and Edge TPU semantic segmentation models on the Raspberry Pi

The table shows, as already seen with the other regular models, that the regular mobilenetv2_dm05_coco_voc_trainval model always has a higher capture and inference time on the first iteration (max_1), but this pattern does not appear for the redimension time. For the _edgetpu.tflite models I implemented a script that performs just one operation each time it is run, unlike the regular model's script, which has a loop that captures webcam images until it is told to stop. Because of this, the capture times of the _edgetpu.tflite models are higher: every measured capture time corresponds to the first (and only) capture of that execution, and is therefore the highest. Overall, it seems that the capture time of the regular model may still be a little higher. With regard to inference, the _edgetpu.tflite models are dramatically faster thanks to the Coral Accelerator. If the same first-iteration effect observed for the capture time also applies here, the inference time could be even lower in subsequent iterations if the script ran in a loop. It is also interesting to see that the _edgetpu.tflite models have very similar maximum and minimum values. The redimension time is very similar for all models, although, as noted, the difference between maximum and minimum values is very small for the _edgetpu.tflite models. Moreover, the dummy model has a slightly higher inference time than the smarty model.
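For clarity, the per-stage measurement discussed throughout this chapter can be sketched roughly as follows. This is a hypothetical illustration rather than the actual measurement scripts; run_inference and input_size are placeholders:

import time
import cv2

def measure(run_inference, input_size=(513, 513), iterations=10, camera_index=0):
    # Record capture, redimension and inference times per iteration so the higher
    # first-iteration values can be separated from the steady-state ones.
    capture = cv2.VideoCapture(camera_index)
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        ok, frame = capture.read()
        if not ok:
            break
        t1 = time.perf_counter()
        resized = cv2.resize(frame, input_size)
        t2 = time.perf_counter()
        run_inference(resized)
        t3 = time.perf_counter()
        times.append({"capture": t1 - t0, "redimension": t2 - t1, "inference": t3 - t2})
    capture.release()
    return times  # times[0] typically shows the first-iteration overhead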
Capítulo 7 - Conclusions and future work

Semantic segmentation has become essential nowadays. People already use this technology every day on their phones, and everyone should be able to take advantage of it. Thanks to the advances in Edge Computing, this is now possible. The results obtained in this project are positive. By simplifying a semantic segmentation model through quantization and using the Coral USB Accelerator, I was able to perform real-time inference on images captured through the webcam on a device restricted by computing power and energy consumption (a Raspberry Pi). The human eye cannot distinguish the individual frames of a sequence shown at 24 FPS, which is equivalent to roughly 0.0417 seconds per frame; we perceive it as video rather than as independent images. I obtained average inference times of 0.0085 and 0.0071 seconds with the two Edge TPU models, so if the script that runs the inference had a loop, the semantic segmentation would operate like a video, in real time. However, to obtain these results, accuracy has to be sacrificed. Nonetheless, using a regular (non-quantized) semantic segmentation model on the Raspberry Pi, or even in a Linux virtual machine, with DeepLab as a basis, an accurate implementation is possible, although the inference times are not efficient. The next steps would be to deploy DeepLab to other platforms such as Intel NCS and Imagination Technologies PowerVR.

BIBLIOGRAFÍA

1. (n.d.). picsum. https://i.picsum.photos/id/1012/3973/2639.jpg?hmac=s2eybz51lnKy2ZHkE2wsgc6S81fVD1W2NKYOSh8bzDc
2. ADE20K. (n.d.). MIT CSAIL. https://groups.csail.mit.edu/vision/datasets/ADE20K/
3. Briot, A., Viswanath, P., & Yogamani, S. (2018). Analysis of Efficient CNN Design Techniques for Semantic Segmentation. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 10.1109/CVPRW.2018.00109
4. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2016). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, pp. 834-848. 10.1109/TPAMI.2017.2699184
5. Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv:1706.05587.
6. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. European Conference on Computer Vision (ECCV), pp. 833-851.
7. The Cityscapes Dataset. (n.d.). https://www.cityscapes-dataset.com
8. Cloud Tensor Processing Units (TPUs). (n.d.). Google Cloud. https://cloud.google.com/tpu/docs/tpus
9. Tested build configurations. (n.d.). TensorFlow. https://www.tensorflow.org/install/source?hl=es-419#tested_build_configurations
10. CYBORG NITR. (2020, May 9). MIoU Calculation. Medium. https://medium.com/@cyborg.team.nitr/miou-calculation-4875f918f4cb
11. Danka, T. (2020, Jun 29). How to accelerate and compress neural networks with quantization. Towards Data Science. https://towardsdatascience.com/how-to-accelerate-and-compress-neural-networks-with-quantization-edfbbabb6af7
12. DeepLab Demo. (n.d.). https://colab.research.google.com/github/tensorflow/models/blob/master/research/deeplab/deeplab_demo.ipynb?authuser=1
13. Edge TPU Compatibility overview. (n.d.). Coral. https://coral.ai/docs/edgetpu/models-intro/#compatibility-overview
14. EdjeElectronics. (n.d.). Part 2 - How to Run TensorFlow Lite Object Detection Models on the Raspberry Pi (with Optional Coral USB Accelerator). GitHub. https://github.com/EdjeElectronics/TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi/blob/master/Raspberry_Pi_Guide.md
15. EdjeElectronics. (n.d.). TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi. GitHub. https://github.com/EdjeElectronics/TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi
16. Full integer quantization. (n.d.). TensorFlow. https://www.tensorflow.org/lite/performance/post_training_quantization#full_integer_quantization
17. Get started with the USB Accelerator. (n.d.). Coral. https://coral.ai/docs/accelerator/get-started
18. Hasan, T. (2018, May 12). Semantic Segmentation. Data Science Portfolio. https://tariq-hasan.github.io/concepts/computer-vision-semantic-segmentation/
19. How do I create a TensorFlow Lite model for the Edge TPU? (n.d.). Coral. https://coral.ai/docs/edgetpu/faq/#how-do-i-create-a-tensorflow-lite-model-for-the-edge-tpu
20. How to install Ubuntu Desktop on Raspberry Pi 4. (n.d.). Ubuntu. https://ubuntu.com/tutorials/how-to-install-ubuntu-desktop-on-raspberry-pi-4#1-overview
21. Install Ubuntu 20.04 + OpenCV + TensorFlow (Lite) on Raspberry Pi 4. (n.d.). Q-engineering. https://qengineering.eu/install-ubuntu-20.04-on-raspberry-pi-4.html
22. Kovalev, D. (n.d.). PyCoral API. GitHub. https://github.com/google-coral/pycoral
23. Makaya, C., Iyer, A., Salfity, J., Athreya, M., & Lewis, M. A. (2020). Cost-effective Machine Learning Inference Offload for Edge Computing. arXiv:2012.04063.
24. Marcelino, P. (2018, Oct 23). Transfer learning from pre-trained models. Towards Data Science. https://towardsdatascience.com/transfer-learning-from-pre-trained-models-f2393f124751
25. Mihajlovic, I. (2019, Apr 25). Everything You Ever Wanted To Know About Computer Vision. Towards Data Science. https://towardsdatascience.com/everything-you-ever-wanted-to-know-about-computer-vision-heres-a-look-why-it-s-so-awesome-e8a58dfb641e
26. Model optimization. (n.d.). TensorFlow. https://www.tensorflow.org/lite/performance/model_optimization
27. OpenCV. (n.d.). https://opencv.org
28. Pal, S. (2019, February 26). Semantic Segmentation: Introduction to the Deep Learning Technique Behind Google Pixel’s Camera! Analytics Vidhya. https://www.analyticsvidhya.com/blog/2019/02/tutorial-semantic-segmentation-google-deeplab/
29. Pan, S. J., & Yang, Q. (2009). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10). 10.1109/TKDE.2009.191
30. PASCAL VOC 2012 dataset. (n.d.). https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/pascal.md
31. Post-training quantization. (n.d.). TensorFlow. https://www.tensorflow.org/lite/performance/post_training_quantization
32. Quantization. (n.d.). Coral. https://coral.ai/docs/edgetpu/models-intro/#quantization
33. Raspberry Pi 4 Model B - Specifications. (n.d.). Raspberry Pi. https://www.raspberrypi.org/products/raspberry-pi-4-model-b/specifications/
34. Raspberry Pi (Trading) Ltd. (2021). Raspberry Pi 4 Computer Model B. https://datasheets.raspberrypi.org/rpi4/raspberry-pi-4-product-brief.pdf
35. ruslo. (n.d.). DeepLab repository. GitHub. https://github.com/tensorflow/models/tree/master/research/deeplab
36. Saha, S. (2018, Dec 15). A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way. Towards Data Science. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
37. Sarkar, D. (2018, Nov 14). A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning. Towards Data Science. https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
38. Saxena, P. (2020, Sep 3). Increase Frame Per Second (FPS) rate in the Custom Object Detection Step by step. Towards Data Science. https://towardsdatascience.com/no-gpu-for-your-production-server-a20616bb04bd
39. Shovon, S. (2020, Sep). Install Ubuntu Desktop 20.04 LTS on Raspberry Pi 4. linuxhint. https://linuxhint.com/install-ubuntu-desktop-20-04-lts-on-raspberry-pi-4/
40. TensorFlow. (n.d.). https://www.tensorflow.org/?hl=es-419
41. TensorFlow Lite converter. (n.d.). TensorFlow. https://www.tensorflow.org/lite/convert
42. TensorFlow Lite Guide. (n.d.). TensorFlow. https://www.tensorflow.org/lite/guide
43. Torelli, P., & Bangale, M. (n.d.). Measuring Inference Performance of Machine-Learning Frameworks on Edge-class Devices with the MLMark™ Benchmark.
44. Transfer learning. (n.d.). Coral. https://coral.ai/docs/edgetpu/models-intro/#transfer-learning
45. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017, May). ImageNet Classification with Deep Convolutional Neural Networks. Communications of the ACM, 60(6), pp. 84-90.
46. USB Accelerator datasheet. (n.d.). Coral. https://coral.ai/docs/accelerator/datasheet/
47. Visual Object Classes Challenge 2012 (VOC2012). (n.d.). http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html
48. What Is Power over Ethernet (PoE)? (n.d.). CISCO. https://www.cisco.com/c/en/us/solutions/enterprise-networks/what-is-power-over-ethernet.html
49. What is the Edge TPU? (n.d.). Coral. https://coral.ai/docs/edgetpu/faq/#what-is-the-edge-tpu
50. YknZhu. (n.d.). Running DeepLab on PASCAL VOC 2012 Semantic Segmentation Dataset. GitHub. https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/pascal.md#recommended-directory-structure-for-training-and-evaluation
51. YknZhu. (n.d.). TensorFlow DeepLab Model Zoo. GitHub. https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md
52. Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. International Conference on Learning Representations (ICLR).
53. yukun. (n.d.). DeepLab Repository. https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/img/vis3.png
54. Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 6848-6856.
55. Ultimate Spinach. (1968). Mind Flowers (Behold & See).

APÉNDICE A

Configuration Guidelines

The purpose of this guideline is to help users configure the virtual machine or system intended to execute the semantic segmentation script explained throughout this work. The script runs a program in which images captured through a webcam are used to perform semantic segmentation with different models.

Introduction

To begin with, these steps are much easier to follow from a Linux environment, so no other operating systems are considered here. If the user has a Windows computer, it is recommended to install Oracle VM VirtualBox. I downloaded Ubuntu 20.04 to use as the operating system of my virtual machine.

Virtual environment

Once Ubuntu is correctly installed, the next step is to create a virtual environment for Python. We install the environment tools and the Python version needed, in this case Python 3.7:

sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7

Afterwards, it is necessary to change into the directory where the environment is going to be located and then create the environment:

sudo apt-get install python3-pip
sudo apt install python3-virtualenv python3-venv
virtualenv environment_name -p python3.7

To activate the environment (from the directory it was created in):

source environment_name/bin/activate

Tools installation

The next step is to install the TensorFlow version corresponding to the previously installed Python version. This can be done with pip:

pip install tensorflow==1.15

In order to use images captured from the webcam as input for the program, OpenCV needs to be installed:

sudo apt update
python -m pip install opencv-python

To verify that the correct OpenCV version was installed, the following command prints the version currently installed on the machine.
python3 -c "import cv2; print(cv2.__version__)"

Output example: 4.2.0

Webcam configuration

Now that all the necessary tools are installed, it is time to connect the camera to the virtual machine. While the virtual machine is running, the upper part of the window shows the 'Devices' ('Dispositivos') menu, which should include the option 'Webcams' ('Cámaras web'). At this point, we just need to select our webcam, which should appear listed under that option. If, on the contrary, the 'Webcams' option does not appear in the menu, it means that we need to install the VirtualBox Extension Pack corresponding to our VirtualBox version.

The final step

Once the webcam is selected, the script should run correctly, obtaining the input images through the webcam and semantically segmenting them.
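If the user wants to check that the webcam is actually reachable from the virtual machine before launching the full script, a minimal OpenCV test can be used (the file names here are hypothetical):

# check_webcam.py -- quick sanity check that OpenCV can grab frames inside the VM.
import cv2

capture = cv2.VideoCapture(0)          # first webcam exposed by VirtualBox
ok, frame = capture.read()
capture.release()

if ok:
    print("Webcam OK, frame shape:", frame.shape)
    cv2.imwrite("webcam_test.jpg", frame)  # the saved frame can be inspected visually
else:
    print("Could not read from the webcam; review the VirtualBox 'Webcams' menu.")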