SPU Architecture


  • Distributes memory into small banks near processing elements to improve throughput

  • Reduces data motion by performing computations close to on-chip memory banks

  • Eliminates energy and memory bottlenecks caused by accessing off-chip memory

Scalable Core Design

  • Can be scaled to match needs and constraints of any deployment environment

  • Targets a wide range of applications and form factors

  • Digital design can be ported to other process nodes to balance performance and cost


Dual Sparsity

10x × 10x = 100x

Our hardware can achieve multiplicative benefits in speed and efficiency when both forms of sparsity are present.

A 10x gain from each type of sparsity multiplies to a 100x gain when the two are combined.
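
To see why the two gains multiply rather than add, the sketch below counts multiply-accumulate operations for one matrix-vector product under each skipping rule. It is an illustration only, not a model of the SPU: the layer size, the `count_macs` helper, and the nonzero patterns are assumptions, chosen so that 10% of weights and 10% of activations are nonzero and their overlap is exactly 1%.

```python
def count_macs(weights, activations, skip_zero_weights, skip_zero_activations):
    """Count multiply-accumulates for y = W @ x under the given skipping rules."""
    macs = 0
    for row in weights:
        for w, x in zip(row, activations):
            if skip_zero_weights and w == 0:
                continue  # no work for a weight that is zero
            if skip_zero_activations and x == 0:
                continue  # no work for an input that is zero
            macs += 1
    return macs


N = 100  # hypothetical layer: N inputs, N outputs

# Assumed patterns: 10% of weights and 10% of activations are nonzero.
weights = [[1.0 if (i + j) % 10 == 0 else 0.0 for j in range(N)] for i in range(N)]
activations = [1.0 if j % 10 == 0 else 0.0 for j in range(N)]

dense = count_macs(weights, activations, False, False)
weight_only = count_macs(weights, activations, True, False)
activation_only = count_macs(weights, activations, False, True)
dual = count_macs(weights, activations, True, True)

print(dense, weight_only, activation_only, dual)
```

In this toy case each form of sparsity alone cuts the operation count by 10x, and together they cut it by 100x. On real models the combined figure depends on how often nonzero weights line up with nonzero activations.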

Sparse Weights

  • Supports sparsely connected models

  • Only stores and computes on weights that matter

  • 10x improvement in speed, efficiency, and memory
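
One common way to store and compute only on the weights that matter is a compressed row format that keeps each nonzero weight together with its column index. The sketch below is a generic software illustration under that assumption; the source does not describe the SPU's actual storage format, and `compress_rows` and `sparse_matvec` are hypothetical names.

```python
def compress_rows(dense_weights):
    """Keep only (column index, value) pairs for nonzero weights in each row."""
    return [[(j, w) for j, w in enumerate(row) if w != 0] for row in dense_weights]


def sparse_matvec(compressed, x):
    """Compute y = W @ x, touching only the stored nonzero weights."""
    return [sum(w * x[j] for j, w in row) for row in compressed]


dense_weights = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0]

compressed = compress_rows(dense_weights)
stored_entries = sum(len(row) for row in compressed)
y = sparse_matvec(compressed, x)

print(stored_entries, y)
```

Here 3 of 12 weights are stored and multiplied. The stored indices take space too, so the memory saving in practice is somewhat smaller than the ratio of nonzero to total weights.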

Sparse Activations

  • Supports sparse activations

  • Skips computation when a neuron outputs zero

  • 10x increase in speed and efficiency
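
Skipping computation for zero outputs can be pictured as follows: after an activation function such as ReLU, many neuron outputs are exactly zero, and every weight attached to a zero input can be ignored. The sketch below is a minimal software analogy, not the SPU's mechanism; the column-wise weight layout and the names `relu` and `matvec_skip_zero_inputs` are assumptions made for illustration.

```python
def relu(values):
    """Set negative values to zero, as a ReLU layer would."""
    return [max(0.0, v) for v in values]


def matvec_skip_zero_inputs(weight_columns, x):
    """Accumulate y = W @ x column by column, skipping columns whose input is zero.

    weight_columns[j] holds the weights from input j to every output.
    Returns the output vector and the number of columns actually processed.
    """
    y = [0.0] * len(weight_columns[0])
    columns_processed = 0
    for column, a in zip(weight_columns, x):
        if a == 0:
            continue  # this neuron output zero, so its whole column is skipped
        columns_processed += 1
        for i, w in enumerate(column):
            y[i] += w * a
    return y, columns_processed


pre_activations = [-1.0, 2.0, -3.0, 0.5]
x = relu(pre_activations)  # two of the four inputs become zero

weight_columns = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y, columns_processed = matvec_skip_zero_inputs(weight_columns, x)

print(x, y, columns_processed)
```

Only two of the four weight columns are read in this example, and the result matches what the full computation would give.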

Core Design


  • 512 kB on-chip SRAM per core

  • 5 MB effective SRAM with weight sparsity

  • 1.3 mm² single core (22 nm process)

  • AXI interface
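
The effective SRAM figure appears to follow from the physical figure and the 10x memory improvement quoted for sparse weights. The arithmetic is sketched below under that assumption; the constant and function names are hypothetical, and kB is read as 1024 bytes.

```python
SRAM_PER_CORE_KB = 512    # physical on-chip SRAM per core, from the list above
ASSUMED_MEMORY_GAIN = 10  # assumption: the 10x memory figure for sparse weights


def effective_sram_kb(physical_kb, gain):
    """Dense-equivalent weight capacity if sparse storage is `gain` times smaller."""
    return physical_kb * gain


def physical_sram_kb(cores, per_core_kb=SRAM_PER_CORE_KB):
    """Total physical SRAM, assuming each core carries its own SRAM."""
    return cores * per_core_kb


single_core_effective_kb = effective_sram_kb(SRAM_PER_CORE_KB, ASSUMED_MEMORY_GAIN)
single_core_effective_mb = single_core_effective_kb / 1024
four_core_physical_kb = physical_sram_kb(4)

print(single_core_effective_kb, single_core_effective_mb, four_core_physical_kb)
```

Under these assumptions one core holds the equivalent of 5120 kB, or 5 MB, of dense weights, and four cores provide 2048 kB of physical SRAM in total.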


4-Core Configuration
