| Title: |
Mix-GEMM: An efficient HW-SW architecture for mixed-precision quantized deep neural networks inference on edge devices |
| Authors: |
Reggiani, Enrico; Pappalardo, Alessandro; Doblas Font, Max; Moretó Planas, Miquel; Olivieri, Mauro; Unsal, Osman Sabri; Cristal Kestelman, Adrián; Barcelona Supercomputing Center |
| Contributors: |
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors; Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors |
| Publisher Information: |
Institute of Electrical and Electronics Engineers (IEEE) |
| Publication Year: |
2023 |
| Collection: |
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge |
| Subject Terms: |
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors; Deep learning; Neural networks (Computer science); High performance computing -- Energy consumption; Performance evaluation; Training; Computer architecture; Energy efficiency; Computational efficiency; Aprenentatge profund; Xarxes neuronals (Informàtica); Càlcul intensiu (Informàtica) -- Consum d'energia |
| Description: |
Deep Neural Network (DNN) inference based on quantized narrow-precision integer data represents a promising research direction toward efficient deep learning computations on edge and mobile devices. On one side, recent progress in Quantization-Aware Training (QAT) frameworks aimed at improving the accuracy of extremely quantized DNNs allows achieving results close to Floating-Point 32 (FP32) and provides high flexibility in the selection of data sizes. Unfortunately, current Central Processing Unit (CPU) architectures and Instruction Set Architectures (ISAs) targeting resource-constrained devices present limitations on the range of data sizes supported to compute DNN kernels. This paper presents Mix-GEMM, a hardware-software co-designed architecture capable of efficiently computing quantized DNN convolutional kernels based on byte and sub-byte data sizes. Mix-GEMM accelerates General Matrix Multiplication (GEMM), the core kernel of DNNs, supporting all data size combinations from 8- to 2-bit, including mixed-precision computations, and featuring performance that scales as the computational data sizes decrease. Our experimental evaluation, performed on representative quantized Convolutional Neural Networks (CNNs), shows that a RISC-V based edge System-on-Chip (SoC) integrating Mix-GEMM achieves up to 1.3 TOPS/W in energy efficiency and up to 13.6 GOPS in throughput, gaining from 5.3× to 15.1× in performance over the OpenBLAS GEMM frameworks running on a commercial RISC-V based edge processor. By performing synthesis and Place and Route (PnR) of the enhanced SoC in Global Foundries 22nm FDX technology, we show that Mix-GEMM only accounts for 1% of the overall area consumption. 
; This research was supported by the ERDF Operational Program of Catalonia 2014-2020, with a grant from the Spanish State Research Agency [PID2019-107255GB] and with DRAC project [001-P-001723], by the grant [PID2019-107255G-C21] funded by MCIN/AEI/10.13039/501100011033, by the Generalitat de Catalunya ... |
| Document Type: |
conference object |
| File Description: |
14 p.; application/pdf |
| Language: |
English |
| Relation: |
https://ieeexplore.ieee.org/document/10071076; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107255GB-C21/ES/BSC - COMPUTACION DE ALTAS PRESTACIONES VIII/; https://hdl.handle.net/2117/386754 |
| DOI: |
10.1109/HPCA56546.2023.10071076 |
| Availability: |
https://hdl.handle.net/2117/386754; https://doi.org/10.1109/HPCA56546.2023.10071076 |
| Rights: |
Open Access |
| Accession Number: |
edsbas.BE98C38A |
| Database: |
BASE |