Hardware techniques for the design of efficient inference accelerators of deep neural networks

Author:
  1. Muñoz Martínez, Francisco
Supervised by:
  1. Manuel Eugenio Acacio Sánchez (Director)
  2. José Luis Abellan Miguel (Director)

Defence university: Universidad de Murcia

Defence date: 19 December 2022

Committee:
  1. Antonio González Colás (Chair)
  2. José Cano Reyes (Secretary)
  3. Davide Bertozzi (Committee member)
Department:
  1. Computer Engineering and Technology

Type: Thesis

Abstract

The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming research area nowadays. While first-generation rigid accelerator proposals use simple fixed dataflows tailored to dense DNNs, more recent architectures have argued for flexibility in order to efficiently support a wide variety of layer types, dimensions, and degrees of sparsity. As the complexity of these accelerators grows, the analytical models currently used for design-space exploration are unable to capture execution-time subtleties, leading to inexact results in many cases. This opens up a need for cycle-level simulation tools that allow for fast and accurate design-space exploration of DNN accelerators, and for rapid quantification of the efficacy of architectural enhancements during the early stages of a design. To this end, the first contribution of this thesis is STONNE, a cycle-level microarchitectural simulation framework that, plugged into a high-level DNN framework, allows for full-model evaluation of state-of-the-art DNN accelerators.

With a validated simulator in hand, the second contribution of this thesis focuses on flexible architectures for DNNs. DNN accelerators use three separate NoCs, namely the distribution, multiplier, and reduction networks (DN, MN, and RN, respectively), between the global buffer(s) and the compute units (multipliers/adders). These NoCs enable data delivery and, more importantly, on-chip reuse of operands and outputs to minimize expensive off-chip memory accesses. Among them, the RN, used to generate and reduce the partial sums produced during DNN processing, accounts for the largest fraction of chip area and power dissipation, and thus represents a first-order driver of the accelerator's energy efficiency. RNs can be orchestrated to exploit a Temporal, Spatial, or Spatio-Temporal reduction dataflow; among these, the latter has shown superior performance (the three dataflows are illustrated in a code sketch at the end of this abstract). However, as we demonstrate, a state-of-the-art implementation of the Spatio-Temporal reduction dataflow, based on the addition of Accumulators (Ac) to the RN (i.e., the RN+Ac strategy), can incur significant area and energy overheads. To cope with this important issue, we propose STIFT (Spatio-Temporal Integrated Folding Tree), which implements the Spatio-Temporal reduction dataflow entirely on the RN hardware substrate, i.e., without the need for extra accumulators. STIFT achieves significant area and power savings with respect to the more complex RN+Ac strategy while preserving its performance.

The third contribution of this thesis increases the flexibility of current sparse accelerators by adding support for several dataflows within the same hardware substrate. Existing Sparse-Sparse Matrix Multiplication (SpMSpM) accelerators are tailored to one particular SpMSpM dataflow, which determines their overall efficiency. We demonstrate that this static decision inherently results in a suboptimal dynamic solution, because different SpMSpM kernels exhibit varying features (i.e., dimensions, sparsity pattern, and sparsity degree) that make each dataflow better suited to different data sets. Motivated by this observation, we propose Flexagon, the first reconfigurable SpMSpM accelerator capable of performing SpMSpM computation using the particular dataflow that best matches each case.
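To make this dataflow sensitivity concrete, here is a minimal Python sketch (illustrative only; the nested-dict sparse representation and the function names are our assumptions, not code from the thesis) contrasting two classic SpMSpM dataflows: Gustavson's row-wise dataflow, used by GAMMA-like designs, and the outer-product dataflow, used by SpArch-like designs. Both compute the same product but differ in when partial results are produced and merged, which is exactly what makes each better suited to different matrix dimensions and sparsity patterns:

    # Minimal illustrative sketch (not code from the thesis): two classic
    # SpMSpM dataflows computing C = A x B, with sparse matrices stored as
    # nested dicts {row: {col: value}} holding only the nonzeros.
    from collections import defaultdict

    def gustavson(A, B):
        """Row-wise (Gustavson) dataflow: for each nonzero A[i][k],
        scale row k of B and merge it into row i of C on the fly."""
        C = defaultdict(dict)
        for i, row in A.items():
            for k, a in row.items():
                for j, b in B.get(k, {}).items():
                    C[i][j] = C[i].get(j, 0.0) + a * b
        return {i: r for i, r in C.items()}

    def outer_product(A, B):
        """Outer-product dataflow: column k of A times row k of B yields a
        partial-output matrix; all partials are merged in a later phase,
        which is where the merge hardware cost comes from."""
        A_cols = defaultdict(dict)               # column-major view of A
        for i, row in A.items():
            for k, a in row.items():
                A_cols[k][i] = a
        partials = [{i: {j: a * b for j, b in B[k].items()}
                     for i, a in col.items()}
                    for k, col in A_cols.items() if k in B]
        C = defaultdict(dict)                    # merge phase
        for P in partials:
            for i, prow in P.items():
                for j, v in prow.items():
                    C[i][j] = C[i].get(j, 0.0) + v
        return {i: r for i, r in C.items()}

    A = {0: {1: 2.0}, 1: {0: 3.0, 1: 1.0}}
    B = {0: {2: 4.0}, 1: {0: 5.0}}
    assert gustavson(A, B) == outer_product(A, B)   # same result, different dataflow

The outer-product variant defers all merging to a separate phase and therefore stresses merge bandwidth and intermediate storage, whereas the row-wise variant merges as it goes; which one wins depends on the operands, which is precisely the observation that motivates Flexagon.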
The Flexagon accelerator is based on a novel Merger-Reduction Network (MRN) that unifies the concepts of reducing and merging on the same substrate, increasing efficiency (a software sketch of this merge-and-reduce idea closes this abstract). Additionally, Flexagon includes a three-tier memory hierarchy specifically tailored to the different access characteristics of the input and output compressed matrices. Using detailed cycle-level simulation of contemporary DNN models from a variety of application domains, we show that Flexagon achieves average performance benefits of 4.59x, 1.71x, and 1.35x with respect to the state-of-the-art SIGMA-like, SpArch-like, and GAMMA-like accelerators, respectively (265%, 67%, and 18% in terms of average performance/area efficiency).
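For readers unfamiliar with the reduction dataflows discussed above, the following minimal Python sketch illustrates, under deliberately simplified assumptions (the function names and the fixed tree width are ours; this is not the RN+Ac or STIFT implementation), how the partial sums of a single output can be reduced temporally, spatially, or spatio-temporally:

    # Illustrative sketch of the three reduction dataflows named in the
    # abstract; simplified assumptions, not the thesis implementation.

    def temporal_reduce(psums):
        """Temporal dataflow: a single accumulator consumes one partial
        sum per cycle (each loop iteration models one cycle)."""
        acc = 0.0
        for p in psums:
            acc += p
        return acc

    def spatial_reduce(psums):
        """Spatial dataflow: an adder tree reduces all partial sums in
        log2(N) levels of a single pass."""
        level = list(psums)
        while len(level) > 1:
            if len(level) % 2:          # odd element forwarded upwards
                level.append(0.0)
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        return level[0]

    def spatio_temporal_reduce(psums, tree_width=4):
        """Spatio-Temporal dataflow: a fixed-width adder tree reduces one
        chunk per pass, and the per-pass results accumulate over time."""
        acc = 0.0
        for start in range(0, len(psums), tree_width):
            acc += spatial_reduce(psums[start:start + tree_width])
        return acc

    psums = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    assert temporal_reduce(psums) == spatial_reduce(psums) \
        == spatio_temporal_reduce(psums)

In the RN+Ac strategy, the accumulator that carries values across passes is extra hardware placed next to the tree; STIFT's key idea is to realize the same folding within the reduction tree itself, avoiding those extra accumulators.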
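Along the same lines, the merge-and-reduce idea behind the MRN can be sketched in software as a k-way merge of sorted sparse streams in which values are added whenever their output coordinates collide. This is a hypothetical illustration built on Python's heapq, not Flexagon's hardware algorithm:

    # Hypothetical software analogue of a Merger-Reduction Network: merge
    # sorted (coordinate, value) streams and reduce colliding coordinates.
    import heapq

    def merge_reduce(streams):
        """K-way merge of sorted (col, value) streams; equal columns are
        added on the fly, yielding one sorted, fully reduced output row."""
        out = []
        for col, val in heapq.merge(*streams):   # lazy k-way merge
            if out and out[-1][0] == col:
                out[-1][1] += val                # reduce on collision
            else:
                out.append([col, val])
        return [tuple(pair) for pair in out]

    # Three partial rows (e.g., from an outer-product phase) with overlaps.
    rows = [[(0, 1.0), (3, 2.0)], [(0, 0.5), (2, 1.0)], [(3, 1.5)]]
    print(merge_reduce(rows))                    # [(0, 1.5), (2, 1.0), (3, 3.5)]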