Improving the Performance, Portability, and Productivity of Hardware Accelerators

Martínez Sánchez, Pablo Antonio

Supervised by:

José Manuel García Carrasco (Director)
Gregorio Bernabé García (Director)

Defence university: Universidad de Murcia

Defence date: 22 June 2023

Committee:

Manuel Eugenio Acacio Sánchez (Chair)
Julio Sahuquillo Borras (Secretary)
Antonio García Guirado (Committee member)

Department:

Computer Engineering and Technology

Type: Thesis

Teseo: 833236

Abstract

With the end of Moore's Law and Dennard scaling, attention is shifting to new ways of enhancing computer performance. Improving microprocessor performance is becoming increasingly difficult, while the demand for computational power keeps growing rapidly. In recent years, we have witnessed a paradigm change: rather than using a single chip, the CPU, to compute everything, computers are evolving into more heterogeneous organizations. In this new configuration, multiple specialized chips compute specific workloads while the CPU orchestrates them and performs actual computation only when no other chip can. These specialized chips are usually called accelerators. Because they are highly specialized, their architectures still have considerable room for improvement, unlike CPUs. Accelerators are far more efficient than CPUs in terms of performance, energy consumption, or both.
Like multicores, accelerators bring great benefits to computer performance, but also notable challenges to the programming workflow. In environments with multiple accelerators, writing code for each of them is very inefficient, since each accelerator is programmed in a different language. Performance is also a concern, because programming languages often struggle to exploit the full potential of the hardware. Lastly, portability is complicated because a program designed for a specific accelerator cannot be executed on a different one. Designing programming languages that provide productivity, performance, and portability is known as the P3 problem. To tackle it, in this thesis we have studied how two different single-source programming languages perform in real-world scenarios. After evaluating them in each of the three P3 categories, we found that they struggle to achieve good performance, portability, and productivity at the same time.
Therefore, we have proposed a new domain-specific language (DSL) specialized in deep neural networks (DNNs) that supports multiple heterogeneous architectures and reaches superior results in all P3 aspects.
Even though we can develop programs with decent portability, productivity, and performance in heterogeneous environments, a large amount of code has already been written. To target new hardware, this code would have to be rewritten in new languages. In this thesis, we propose a compiler that automatically matches existing code and replaces it with API calls. Since the target API can be reconfigured easily, our compiler can target either an optimized CPU library, which is more efficient than executing the handwritten code, or an API that relies on a hardware accelerator. Our proposal is designed for C/C++ and recognizes linear algebra and tensor codes. Its main strength is the ability to recognize simple code (e.g., the three-loop structure of matrix multiplication) as well as complex code constructs (such as the Strassen algorithm or hand-optimized vectorized code).
Furthermore, an increasingly common trend in SoC design is to include a sea of disparate accelerators inside the chip. Even though the hardware already offers performance improvements never seen before, the software is still struggling to take advantage of it. For example, there is no clear way to manage multiple accelerators to speed up a given workload, or to assign accelerators to the right tasks automatically. Using multiple accelerators concurrently, much as instruction-level parallelism (ILP) exploits multiple functional units, is called Accelerator-Level Parallelism (ALP). In this thesis, we present a new proposal for exploiting ALP in heterogeneous environments: a framework capable of orchestrating multiple accelerators to run a single task jointly, significantly improving performance.
We apply our framework to matrix multiplication and convolution use cases, demonstrating that it automatically schedules tasks across accelerators with a low prediction error and a work distribution very close to the optimal.
As multicores did before it, heterogeneous computing is increasing the complexity of software development and making computer architecture more and more complex because of the growing variety of hardware. All advances in computer architecture have come with increasing hardware complexity, which we must tame to make computers practical and useful. New architectures, different from the long-lasting CPU, bring unprecedented levels of performance and energy efficiency. In this thesis, we have shown that performance portability is possible with single-source languages, and we have presented a novel DSL for DNNs that achieves excellent performance, productivity, and portability in heterogeneous environments. We have also designed a novel methodology for automatically detecting acceleratable parts of CPU code and compiling them to specialized hardware accelerators. Lastly, we have proposed a framework for exploiting Accelerator-Level Parallelism in heterogeneous environments. We expect that the proposals described in this thesis will help to improve the usability and the performance of heterogeneous computing, which will inevitably become the standard for future-generation computing systems.
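As a rough illustration of what distributing a single task across several accelerators can look like, the following minimal sketch splits the rows of a matrix multiplication among devices in proportion to an estimated throughput, so that all devices finish at about the same time. This is not the scheduler proposed in the thesis: the function names `split_rows` and `estimated_time`, the device labels, and the throughput figures are all hypothetical, and the proportional rule is only one simple assumption about how such a split could be made.

```python
def split_rows(total_rows, throughput):
    """Split total_rows among devices in proportion to their throughput.

    throughput maps a (hypothetical) device name to an estimated
    rows-per-second figure. Returns a dict of device -> row count.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if not throughput or any(t <= 0 for t in throughput.values()):
        raise ValueError("throughput values must be positive")

    total = sum(throughput.values())
    # Ideal (fractional) share of each device.
    exact = {dev: total_rows * t / total for dev, t in throughput.items()}
    # Round down first, then hand the leftover rows to the devices
    # with the largest fractional parts (largest-remainder rounding).
    shares = {dev: int(x) for dev, x in exact.items()}
    leftover = total_rows - sum(shares.values())
    by_fraction = sorted(exact, key=lambda d: exact[d] - shares[d], reverse=True)
    for dev in by_fraction[:leftover]:
        shares[dev] += 1
    return shares


def estimated_time(shares, throughput):
    """Time of the joint run: the slowest device determines the total."""
    return max(shares[dev] / throughput[dev] for dev in shares)


# Assumed example: the GPU is estimated to be three times faster than the CPU.
print(split_rows(100, {"cpu": 1.0, "gpu": 3.0}))  # {'cpu': 25, 'gpu': 75}
```

In this toy model, the quality of the split depends entirely on how accurate the throughput estimates are, which is why a low prediction error matters for getting close to the optimal distribution.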