Olcoz Herrero, Katzalin

Olcoz Herrero
Universidad Complutense de Madrid
Physical Sciences
Computer Architecture and Automation
Computer Architecture and Technology
    A unified cloud-enabled discrete event parallel and distributed simulation architecture
    (Elsevier, 2022-07) Risco Martín, José Luis; Henares Vilaboa, Kevin; Mittal, Saurabh; Almendras Aruzamen, Luis Fernando; Olcoz Herrero, Katzalin
    Cloud infrastructure provides rapid resource provision for on-demand computational requirements. Cloud simulation environments today are largely employed to model and simulate complex systems for remote accessibility and variable capacity requirements. In this regard, scalability issues in Modeling and Simulation (M&S) computational requirements can be tackled through the elasticity of on-demand Cloud deployment. However, implementing a high-performance cloud M&S framework following these elastic principles is not a trivial task, as parallelizing and distributing existing architectures is challenging. Indeed, parallel and distributed M&S developments have evolved along separate paths. Parallel approaches have always focused on ad-hoc solutions, while distributed approaches have led to the definition of standard distributed frameworks like the High Level Architecture (HLA) or influenced the use of distributed technologies like the Message Passing Interface (MPI). Only a few developments have been able to evolve with the current resilience of computing hardware resources deployment, largely focused on the implementation of Simulation as a Service (SaaS), albeit independently of the parallel ad-hoc methods branch. In this paper, we present a unified parallel and distributed M&S architecture with enough flexibility to deploy parallel and distributed simulations in the Cloud with low effort, without modifying the underlying model source code, and reaching important speedups against the sequential simulation, especially in the parallel implementation. Our framework is based on the Discrete Event System Specification (DEVS) formalism. The performance of the parallel and distributed framework is tested using the xDEVS M&S tool, Application Programming Interface (API) and the DEVStone benchmark with up to eight computing nodes, obtaining maximum speedups of 15.95x and 1.84x, respectively.
    Genome sequence alignment-design space exploration for optimal performance and energy architectures
    (Institute of Electrical and Electronics Engineers (IEEE), 2021-12-01) Qureshi, Yasir Mahmood; Herruzo, José M.; Zapater, Marina; Olcoz Herrero, Katzalin; González Navarro, Sonia; Plata, Óscar; Atienza, David
    Next generation workloads, such as genome sequencing, have an astounding impact on the healthcare sector. Sequence alignment, the first step in genome sequencing, has experienced recent breakthroughs, which resulted in next generation sequencing (NGS). As NGS applications are memory bound with random memory access patterns, we propose the use of high bandwidth memories like 3D stacked HBM2, instead of traditional DRAMs like DDR4, along with energy efficient compute cores to improve both performance and energy efficiency. Three state-of-the-art NGS applications, Bowtie2, BWA-MEM, and HISAT2, are used as case studies to explore and optimize NGS computing architectures. Then, using the gem5-X architectural simulator, we obtain an overall 68 percent performance improvement and 71 percent energy savings using HBM2 instead of DDR4. Furthermore, we propose an architecture based on ARMv8 cores and demonstrate that 16 ARMv8 64-bit OoO cores with HBM2 outperform 32 cores of the Intel Xeon Phi Knights Landing (KNL) processor with 3D stacked memory. Moreover, we show that by using frequency scaling we can achieve up to 59 percent and 61 percent energy savings for ARM in-order and OoO cores, respectively. Lastly, we show that many ARMv8 in-order cores at 1.5GHz match the performance of fewer OoO cores at 2GHz, while attaining 4.5x energy savings.
    Gem5-X: a many-core heterogeneous simulation platform for architectural exploration and optimization
    (Association for Computing Machinery, 2021-12) Qureshi, Yasir Mahmood; Simon, William Andrew; Zapater, Marina; Olcoz Herrero, Katzalin; Atienza, David
    The increasing adoption of smart systems in our daily life has led to the development of new applications with varying performance and energy constraints, and suitable computing architectures need to be developed for these new applications. In this article, we present gem5-X, a system-level simulation framework, based on gem5, for architectural exploration of heterogeneous many-core systems. To demonstrate the capabilities of gem5-X, real-time video analytics is used as a case study. It is composed of two kernels, namely, video encoding and image classification using convolutional neural networks (CNNs). First, we explore through gem5-X the benefits of the latest 3D high bandwidth memory (HBM2) in different architectural configurations. Then, using a two-step exploration methodology, we develop a new optimized clustered-heterogeneous architecture with HBM2 in gem5-X for the video analytics application. In this proposed clustered-heterogeneous architecture, an ARMv8 in-order cluster with an in-cache computing engine executes the video encoding kernel, giving 20% performance and 54% energy benefits compared to baseline ARM in-order and Out-of-Order systems, respectively. Furthermore, thanks to gem5-X, we conclude that ARM Out-of-Order clusters with HBM2 are the best choice to run visual recognition using CNNs, as they outperform DDR4-based systems by up to 30% both in terms of performance and energy savings.
    Gem5-X: a gem5-based system level simulation framework to optimize many-core platforms
    (IEEE, 2019) Mahmood Qureshi, Yasir; Simon, William Andrew; Zapater, Marina; Atienza, David; Olcoz Herrero, Katzalin
    The rapid expansion of online-based services requires novel energy and performance efficient architectures to meet power and latency constraints. Fast architectural exploration has become a key enabler in the proposal of architectural innovation. In this paper, we present gem5-X, a gem5-based system level simulation framework, and a methodology to optimize many-core systems for performance and power. As real-life case studies of many-core server workloads, we use real-time video transcoding and image classification using convolutional neural networks (CNNs). Gem5-X allows us to identify bottlenecks and evaluate the potential benefits of architectural extensions such as in-cache computing and 3D stacked High Bandwidth Memory. For real-time video transcoding, we achieve 15% speed-up using in-order cores with in-cache computing when compared to a baseline in-order system and 76% energy savings when compared to an Out-of-Order system. When using HBM, we further accelerate real-time transcoding and CNNs by up to 7% and 8% respectively.
    Optimization of a Line detection algorithm for autonomous vehicles on a RISC-V with accelerator
    (Universidad Nacional de La Plata, 2022-10) Belda Beneyto, María José; Olcoz Herrero, Katzalin; Castro Rodríguez, Fernando; Tirado Fernández, Francisco
    In recent years, autonomous vehicles have attracted the attention of many research groups, both in academia and business, including researchers from leading companies such as Google, Uber and Tesla. Vehicles of this type are equipped with systems that are subject to very strict requirements, essentially aimed at performing safe operations (both for potential passengers and pedestrians) as well as carrying out the processing needed for decision making in real time. In many instances, general-purpose processors alone cannot ensure that these safety, reliability and real-time requirements are met, so it is common to implement [...] paper explores the acceleration of a line detection ap[...] running without accelerator.
    A machine learning-based framework for throughput estimation of time-varying applications in multi-core servers
    (IEEE, 2019) Iranfar, Arman; Souza, Wellington Silva de; Zapater, Marina; Olcoz Herrero, Katzalin; Souza, Samuel Xavier de; Atienza, David
    Accurate workload prediction and throughput estimation are key to efficient proactive power and performance management of multi-core platforms. Although hardware performance counters available on modern platforms contain important information about the application behavior, employing them efficiently is not straightforward when dealing with time-varying applications, even if they have iterative structures. In this work, we propose a machine learning-based framework for workload prediction and throughput estimation using hardware events. Our framework enables throughput estimation over various available system configurations, namely, number of parallel threads and operating frequency. In particular, we first employ workload clustering and classification techniques along with Markov chains to predict the next workload for each available system configuration. Then, the predicted workload is used to estimate the next expected throughput through a machine learning-based regression model. The comparison with the state of the art demonstrates that our framework is able to improve Quality of Service (QoS) by 3.4x, while consuming 15% less power thanks to the more accurate throughput estimation.
    Resource management for power-constrained HEVC transcoding using reinforcement learning
    (IEEE Computer Society, 2020-12-01) Costero Valero, Luis María; Iranfar, Arman; Zapater, Marina; Atienza, David; Olcoz Herrero, Katzalin
    The advent of online video streaming applications and services, along with the users' demand for high-quality content, requires High Efficiency Video Coding (HEVC), which provides higher video quality and more compression at the cost of increased complexity. On one hand, HEVC exposes a set of dynamically tunable parameters to provide trade-offs among Quality-of-Service (QoS), performance, and power consumption of multi-core servers in the video providers' data center. On the other hand, resource management of modern multi-core servers is in charge of adapting system-level parameters, such as operating frequency and multithreading, to deal with concurrent applications and their requirements. Therefore, efficient multi-user HEVC streaming necessitates joint adaptation of application- and system-level parameters. Nonetheless, dealing with such a large and dynamic design space is challenging and difficult to address through conventional resource management strategies. Thus, in this work, we develop a multi-agent Reinforcement Learning framework to jointly adjust application- and system-level parameters at runtime to satisfy the QoS of multi-user HEVC streaming in power-constrained servers. In particular, the design space, composed of all design parameters, is split into smaller independent sub-spaces. Each design sub-space is assigned to a particular agent so that it can explore it faster, yet accurately. The benefits of our approach are revealed in terms of adaptability and quality (with up to 4x improvements in terms of QoS when compared to a static resource management scheme), and learning time (6x faster than an equivalent mono-agent implementation). Finally, we show that the power-capping techniques formulated outperform the hardware-based power capping with respect to quality.
    Containergy-a container-based energy and performance profiling tool for next generation workloads
    (MDPI, 2020-05) Souza, Wellington Silva de; Iranfar, Arman; Braulio, Anderson; Zapater, Marina; Souza, Samuel Xavier de; Olcoz Herrero, Katzalin; Atienza, David
    Run-time profiling of software applications is key to energy efficiency. Even the most optimized hardware combined with optimally designed software may become inefficient if operated poorly. Moreover, the diversification of modern computing platforms and the broadening of their run-time configuration space make the task of optimally operating software ever more complex. With the growing financial and environmental impact of data center operation and cloud-based applications, optimal software operation becomes increasingly relevant to existing and next-generation workloads. In order to guide software operation towards energy savings, energy and performance data must be gathered to provide a meaningful assessment of the application behavior under different system configurations, which is not appropriately addressed in existing tools. In this work we present Containergy, a new performance evaluation and profiling tool that uses software containers to perform application run-time assessment, providing energy and performance profiling data with negligible overhead (below 2%). It is focused on energy efficiency for next generation workloads. Practical experiments with emerging workloads, such as video transcoding and machine-learning image classification, are presented. The profiling results are analyzed in terms of performance and energy savings under a Quality-of-Service (QoS) perspective. For video transcoding, we verified that wrong choices in the configuration space can lead to an increase above 300% in energy consumption for the same task and operational levels. Considering the image classification case study, the results show that the choice of the machine-learning algorithm and model significantly affects the energy efficiency. Profiling datasets of AlexNet and SqueezeNet, which present similar accuracy, indicate that the latter yields 55.8% energy savings compared to the former.
    Applying game-learning environments to power capping scenarios via reinforcement learning
    (Springer International Publishing, 2022-08-05) Hernández Aguado, Pablo; Costero Valero, Luis María; Olcoz Herrero, Katzalin; Igual Peña, Francisco Daniel
    Research in deep learning for video game playing has received much attention and provided very relevant results in recent years. Frameworks and libraries have been developed to ease game playing research leveraging Reinforcement Learning techniques. In this paper, we propose to use two of them (RLlib and Gym) in a very different scenario: learning to apply resource management policies in a multi-core server. Specifically, we leverage the facilities of both frameworks together to derive policies for power capping. Using RLlib and Gym enables implementing different resource management policies in a simple and fast way and, as they are based on neural networks, guarantees the efficiency of the solution and the use of hardware accelerators for both training and inference. The results demonstrate that game-learning environments provide effective support to cast a completely different scenario, and open new research avenues in the field of resource management using reinforcement learning techniques with minimal development effort.
    A QoS and container-based approach for energy saving and performance profiling in multi-core servers
    (IEEE, 2019) Souza, Wellington Silva de; Iranfar, Arman; Silva, Anderson; Zapater, Marina; Souza, Samuel Xavier de; Olcoz Herrero, Katzalin; Atienza, David
    In this work we present ContainEnergy, a new performance evaluation and profiling tool that uses software containers to perform application runtime assessment, providing energy and performance profiling data. It is focused on energy efficiency for next generation workloads and IT infrastructure.