## BrainTTA: A 28.6 TOPS/W Compiler Programmable Transport-Triggered NN SoC

IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN - 2023

Maarten J. Molendijk<sup>1,2</sup>, Floran A.M. de Putter<sup>1</sup>, Manil Dev Gomony<sup>1</sup>, Pekka Jääskeläinen<sup>3</sup>, Henk Corporaal<sup>1</sup>

<sup>1</sup>Eindhoven University of Technology, the Netherlands <sup>2</sup>NXP Semiconductors, the Netherlands <sup>3</sup>Tampere University, Finland



#### **NN Architecture and Hardware Development Cost**



- Large variety of NN architectures
- Rapidly evolving


Source: [2]


[1] S. Hashemi et al., "Understanding the impact of precision quantization on the accuracy and energy of neural networks," DATE 2017.

[2] https://www.researchgate.net/figure/Chip-Design-and-Manufacturing-Cost-under-Different-Process-Nodes-Data-Source-from-IBS_fig1_340843129

#### **Operand Precision Scaling**

Figure: number of model parameters (log scale) versus year (2012 to 2020) for landmark networks, including AlexNet, VGG16, GoogLeNet, ResNet-50, DQN, NASNet, Transformer, BERT, ALBERT, Transformer-XL, GPT-2, and GPT-3.

Source: [3]



- Deployment on edge devices  $\rightarrow$  Quantization
- Operand width ↓
  - MAC HW cost: scales superlinearly
  - Overhead: scales (sub)linearly
- Efficient data reuse + minimized data movement

[3] https://www.researchgate.net/figure/Number-of-parameters-ie-weights-in-recent-landmark-neural-networks1-2-31-43_fig1_349044689

[4] P. C. Knag et al., "A 617-TOPS/W All-Digital Binary Neural Network Accelerator in 10-nm FinFET CMOS," IEEE J. Solid-State Circuits (JSSC), 2021.



#### **Transport-Triggered Architecture**

**RF.out**  $\rightarrow$  **ALU.in1t.add** 

nop

**LSU.out**  $\rightarrow$  **ALU.in2** 



- + Compile-time configurable  $\rightarrow$  flexible schedule
- + Exposed datapath  $\rightarrow$  RF bypassing
- + Exposed datapath  $\rightarrow$  Operand sharing
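The three-slot instruction above (`RF.out → ALU.in1t.add`, `nop`, `LSU.out → ALU.in2`) can be read as data transports that trigger an operation as a side effect. The following is a minimal sketch of that idea in Python, not the OpenASIP simulator: the `ToyALU` class, the `execute_instruction` helper, and the source values are hypothetical and exist only for illustration.

```python
class ToyALU:
    """Hypothetical function unit with an operand port (in2) and a trigger port (in1t)."""

    def __init__(self):
        self.in2 = 0   # operand register, latched by a plain move
        self.out = 0   # result register, readable by later moves

    def write(self, port, value, opcode=None):
        if port == "in2":
            self.in2 = value
        elif port == "in1t" and opcode == "add":
            # Writing the trigger port starts the operation named by the opcode.
            self.out = value + self.in2
        else:
            raise ValueError(f"unsupported move target: {port}.{opcode}")


def execute_instruction(moves, sources, alu):
    """Execute one instruction: one move per bus slot, None meaning nop."""
    # Read every source first, since all buses transport in the same cycle.
    transports = [(dst, sources[src]) for src, dst in (m for m in moves if m is not None)]
    # Apply operand writes before trigger writes so the trigger sees fresh operands.
    transports.sort(key=lambda t: t[0].split(".")[1].endswith("t"))
    for dst, value in transports:
        _, port, *opcode = dst.split(".")
        alu.write(port, value, opcode[0] if opcode else None)
    return alu.out


# The instruction from the slide, with made-up register and load values.
moves = [("RF.out", "ALU.in1t.add"), None, ("LSU.out", "ALU.in2")]
sources = {"RF.out": 5, "LSU.out": 7}
result = execute_instruction(moves, sources, ToyALU())
print(result)
```

Because the operands arrive directly from `RF.out` and `LSU.out`, no intermediate register-file write is needed in this toy model, which is the intuition behind the RF bypassing bullet.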

#### **BrainTTA: Toolchain and System**



#### **Design Flow**

- OpenASIP, retargetable [1]
- LLVM-based compiler
- ISA simulator
- HDL Database  $\rightarrow$  custom units

Figure: SoC block diagram and die layout (1730 µm × 725 µm). Labeled blocks: TTA core (with GCU and LSU), RISC-V, IMEM (4x32kB), DMEM and PMEM (16kB banks, 32x16kB), LSU arbiter, AXI interconnect, APB, QSPI, JTAG, DBG, IRQ.

#### Architecture design:

- Tech: GF 22nm FDX
- RISC-V + peripherals
- DMEM/PMEM split + banked access
- IMEM + HW loopbuffer







#### vMAC

- 8-bit MAC
  - Scalar-Vector MAC
  - Vector-Vector MAC
- Binary MAC
- Ternary MAC
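As an illustration of what the vMAC precision modes compute, here is a minimal Python sketch. The slides do not describe the hardware encoding, so everything below is an assumption: the binary mode uses the common XNOR-popcount formulation (bit 1 = +1, bit 0 = -1), and the ternary mode uses a hypothetical nonzero-mask plus sign-mask encoding. All function names are made up for this example.

```python
def popcount(x):
    return bin(x).count("1")


def mac_int8(acc, a, b):
    """Vector-vector MAC: reduce all lane products into one accumulator."""
    return acc + sum(x * y for x, y in zip(a, b))


def mac_scalar_vector(accs, scalar, vec):
    """Scalar-vector MAC: one broadcast input, one accumulator per lane."""
    return [acc + scalar * v for acc, v in zip(accs, vec)]


def mac_binary(acc, a_bits, b_bits, n):
    """Binary MAC over n lanes (assumed encoding: bit 1 = +1, bit 0 = -1)."""
    mask = (1 << n) - 1
    agree = popcount(~(a_bits ^ b_bits) & mask)  # XNOR, then popcount
    return acc + 2 * agree - n                   # agreements minus disagreements


def mac_ternary(acc, a_nz, a_neg, b_nz, b_neg):
    """Ternary MAC (assumed encoding: nz bit marks a nonzero lane, neg bit marks -1)."""
    nonzero = a_nz & b_nz                 # a product is nonzero only if both lanes are
    negative = (a_neg ^ b_neg) & nonzero  # signs differ, so the product is -1
    return acc + popcount(nonzero) - 2 * popcount(negative)


# Lane 0 is the least significant bit in the packed encodings below.
binary_result = mac_binary(0, 0b1101, 0b1011, 4)        # [+1,-1,+1,+1] . [+1,+1,-1,+1]
ternary_result = mac_ternary(0, 0b1011, 0b0010, 0b1111, 0b1000)  # [1,-1,0,1] . [1,1,1,-1]
print(binary_result, ternary_result)
```

The binary and ternary variants replace multipliers with bitwise logic and popcounts, which is one plausible reason the MAC hardware cost drops faster than the operand width.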

#### vADD

- Vector-Vector addition
- Residual support

#### vOPS

- Requantization
- Binarization
- Ternarization
- MaxPool
- Auxiliary ops
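The post-processing steps listed under vOPS can be sketched in scalar Python as follows. This is an illustrative sketch only: the fixed-point multiplier-and-shift requantization, the threshold parameters, and all function names are assumptions, since the slides name the operations without specifying their arithmetic.

```python
def requantize(acc, multiplier, shift, lo=-128, hi=127):
    """Scale a wide accumulator to 8 bits: fixed-point multiply, rounding shift, clamp.

    Assumes shift >= 1; multiplier / 2**shift approximates the real-valued scale.
    """
    scaled = (acc * multiplier + (1 << (shift - 1))) >> shift
    return max(lo, min(hi, scaled))


def binarize(acc, threshold=0):
    """Map an accumulator to {-1, +1} by comparing against a threshold."""
    return 1 if acc >= threshold else -1


def ternarize(acc, t_lo, t_hi):
    """Map an accumulator to {-1, 0, +1} using two thresholds (t_lo <= t_hi)."""
    if acc > t_hi:
        return 1
    if acc < t_lo:
        return -1
    return 0


def maxpool(window):
    """Max pooling over one window of activations."""
    return max(window)


print(requantize(1000, 77, 10), binarize(-3), ternarize(2, -5, 5), maxpool([3, 9, -1, 4]))
```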

Quantized NNs → output-stationary (OS) schedule

```
for h in [0, H - R + 1]:                    # Output feature map height
  for w in [0, W - S + 1]:                  # Output feature map width
    for m in [0, M]:                        # Output channels
      for c in [0, C]:                      # Input channels
        for r in [0, R]:                    # Kernel height
          for s in [0, S]:                  # Kernel width
            accu += in[h+r][w+s][c] * w[c][r][s][m]
      output[h][w][m] = act_function(accu)
```



| Loop nest | Dimension |
|-----------|-----------|
| `for h in [0, H - R + 1]:` | Output feature map **height** |
| `  for w in [0, W - S + 1]:` | Output feature map **width** |
| `    for m in [0, M/32]:` | *Output channels* ($v_M = 32$) |
| `      accu = bias[32*m]` | |
| `      for c in [0, C/4]:` | Input channels ($v_C = 4$) |
| `        for r in [0, R]:` | Kernel **height** |
| `          for s in [0, S]:` | Kernel width |
| `            for tm in [0, 31]:` | |
| `              for tc in [0, 3]:` | |
| `                accu += in[h+r][w+s][4*c+tc] * w[4*c+tc][r][s][32*m+tm]` | |
| `      output[h][w][m] = act_function(accu)` | |
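To check that tiling the channel loops leaves the result unchanged, the two loop nests can be written as runnable Python and compared. This is a sketch under stated assumptions: loop ranges are treated as half-open, the activation function is taken to be ReLU, one accumulator per `tm` lane is assumed, a bias is applied in both versions, and the tensor sizes are small made-up values rather than the layer from the slides.

```python
import random


def relu(x):
    return max(0, x)


def conv_reference(ifm, wgt, bias, H, W, C, M, R, S):
    """Untiled loop nest: one accumulator per output element."""
    out = [[[0] * M for _ in range(W - S + 1)] for _ in range(H - R + 1)]
    for h in range(H - R + 1):
        for w in range(W - S + 1):
            for m in range(M):
                accu = bias[m]
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            accu += ifm[h + r][w + s][c] * wgt[c][r][s][m]
                out[h][w][m] = relu(accu)
    return out


def conv_tiled(ifm, wgt, bias, H, W, C, M, R, S, v_c=4, v_m=32):
    """Tiled loop nest: the inner tm/tc loops model the vectorized dimensions."""
    assert C % v_c == 0 and M % v_m == 0
    out = [[[0] * M for _ in range(W - S + 1)] for _ in range(H - R + 1)]
    for h in range(H - R + 1):
        for w in range(W - S + 1):
            for m in range(M // v_m):
                accu = [bias[v_m * m + tm] for tm in range(v_m)]
                for c in range(C // v_c):
                    for r in range(R):
                        for s in range(S):
                            for tm in range(v_m):      # output-channel lanes
                                for tc in range(v_c):  # reduction over input channels
                                    accu[tm] += (ifm[h + r][w + s][v_c * c + tc]
                                                 * wgt[v_c * c + tc][r][s][v_m * m + tm])
                for tm in range(v_m):
                    out[h][w][v_m * m + tm] = relu(accu[tm])
    return out


# Small illustrative sizes (not the benchmark layer).
H, W, C, M, R, S = 5, 5, 8, 64, 3, 3
rng = random.Random(0)
ifm = [[[rng.randint(-8, 8) for _ in range(C)] for _ in range(W)] for _ in range(H)]
wgt = [[[[rng.randint(-8, 8) for _ in range(M)] for _ in range(S)] for _ in range(R)]
       for _ in range(C)]
bias = [rng.randint(-8, 8) for _ in range(M)]

same = conv_tiled(ifm, wgt, bias, H, W, C, M, R, S) == conv_reference(ifm, wgt, bias, H, W, C, M, R, S)
print("tiled matches reference:", same)
```

In both versions the accumulators stay in place until all `C * R * S` products for an output have been added, which is what makes the schedule output-stationary.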









Figure: example datapath with $v_C = 4$, $v_M = 2$.

- ACCU cost >> MUL cost → wider reduction tree
- IFM broadcast → multiple reduction trees

#### Post-layout Energy Consumption [GF 22nm FDX]




Operating conditions: TT corner, T=25 °C, Vdd=0.5 V. Conv params: W=H=16, M=C=128, R=S=3.
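For a sense of scale, the workload of this benchmark layer can be worked out from the convolution parameters. The sketch below rests on assumptions not stated on the slide: W and H are taken as the input feature map size with no padding (matching the `H - R + 1` loop bounds of the schedule), one MAC is counted as two operations, and the energy estimate simply divides the operation count by an efficiency figure supplied by the caller.

```python
# Conv parameters from the slide.
H = W = 16
M = C = 128
R = S = 3

# Assumption: no padding, stride 1, so the output map shrinks by the kernel size minus one.
out_h = H - R + 1
out_w = W - S + 1

macs = out_h * out_w * M * C * R * S
ops = 2 * macs  # assumption: one MAC counts as a multiply plus an add


def energy_joules(num_ops, tops_per_watt):
    """Energy for num_ops operations at a given efficiency (TOPS/W equals tera-ops per joule)."""
    return num_ops / (tops_per_watt * 1e12)


print(f"output map: {out_h} x {out_w} x {M}")
print(f"MACs: {macs:,}  ops: {ops:,}")
# Example with the 8-bit efficiency reported later in the deck (2.47 TOPS/W).
print(f"estimated energy at 2.47 TOPS/W: {energy_joules(ops, 2.47) * 1e6:.1f} uJ")
```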


#### **Comparison to SotA**

|                      | Eyeriss v2                  | XNE             | SamurAI         | XPULPNN                            | Dustin                              | This work                                              |
|----------------------|-----------------------------|-----------------|-----------------|------------------------------------|-------------------------------------|--------------------------------------------------------|
| Tech                 | 65nm                        | 22nm            | 28nm            | 22nm                               | 65nm                                | 22nm                                                   |
| Programmability      | Configurable                | ASM             | ASM             | Compiler                           | Compiler                            | Compiler                                               |
| Energy efficiency    | 252 GOPS/W (8b)<sup>1</sup> | 8.7 TOPS/W (1b) | 1.3 TOPS/W (8b) | 2.2 TOPS/W (8b)<br>6.1 TOPS/W (2b) | 606 GOPS/W (8b)<br>2304 GOPS/W (2b) | 2.5 TOPS/W (8b)<br>14.9 TOPS/W (T)<br>28.6 TOPS/W (1b) |
| Memory cap. [kB]     | 246                         | 520             | 464             | 640                                | 80                                  | 1024<sup>2</sup>                                       |
| Area eff. [GOPS/mm²] | 5.5 (8b)<sup>3</sup>        | 28.9 (1b)       | 0.6 (8b)        | 21.7 (8b)<br>70.7 (2b)             | 0.9 (8b)<br>3.46 (2b)               | 25.8 (8b)<br>103.0 (T)<br>206.0 (1b)                   |

\*After technology scaling.

<sup>1</sup>Average on AlexNet.

<sup>2</sup>Excluding instruction memory.

<sup>3</sup>Area estimated from the gate-count difference between Eyeriss v1 and Eyeriss v2.

Slide callouts over the table (partially lost in extraction): +14% in …, +39%\*

#### Programmable vs. Fixed-Function Trade-off

- Programmable architectures
- Fixed-function
  - Spatial loop unrolling
  - Fixed FM/weight buffering



#### **BrainTTA - Conclusion**

- Efficient and flexible NN inference engine:
  - Mixed-precision
  - Compile-time reconfigurable
  - Energy eff. (8-bit / ternary / binary): 2.47 / 14.9 / 28.6 TOPS/W
  - Throughput (8-bit / ternary / binary): 77 / 307 / 614 GOPS
- Superlinear energy eff. scaling
  - 8-bit  $\rightarrow$  ternary: ×5.96
  - 8-bit  $\rightarrow$  binary: ×11.4
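The scaling factors can be reproduced from the reported numbers. The short sketch below uses the rounded 8-bit efficiency from the comparison table (2.5 TOPS/W), because that value appears to be what yields ×5.96 and ×11.4; using 2.47 TOPS/W gives slightly larger ratios. The dictionary and function names are illustrative.

```python
# Values copied from the slides; the 8-bit efficiency is the rounded table entry.
efficiency_tops_per_w = {"8-bit": 2.5, "ternary": 14.9, "binary": 28.6}
throughput_gops = {"8-bit": 77, "ternary": 307, "binary": 614}


def gain(table, precision):
    """Improvement factor of a precision mode relative to the 8-bit mode."""
    return table[precision] / table["8-bit"]


for precision in ("ternary", "binary"):
    eff_gain = gain(efficiency_tops_per_w, precision)
    thr_gain = gain(throughput_gops, precision)
    print(f"8-bit -> {precision}: efficiency x{eff_gain:.2f}, throughput x{thr_gain:.2f}")
```

Throughput grows by roughly 4× and 8×, while energy efficiency grows by about 6× and 11×, so the efficiency gain outpaces the throughput gain in both reduced-precision modes.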



