Many neural networks are highly overparameterized. By introducing parameter sparsity during training, model size and the number of operations can be reduced by a factor of 10.
Many neural networks also perform unnecessary operations. By introducing activation sparsity during training, the number of computations can be reduced by another factor of 10.
Combined, this dual sparsity can reduce power consumption by 100x and memory requirements by 10x. To realize these gains, we've designed training utilities that make it easy to create high-sparsity models without harming performance.
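The training utilities themselves aren't shown here, but the two kinds of sparsity can be sketched in a few lines of NumPy. The function names (`prune_weights`, `kwinners`) and the 90% sparsity levels below are illustrative assumptions, not the actual API:

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_weights(w, sparsity):
    """Parameter sparsity: zero the smallest-magnitude weights,
    keeping only a (1 - sparsity) fraction of entries."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -np.inf
    return np.where(np.abs(w) > threshold, w, 0.0)

def kwinners(x, k):
    """Activation sparsity: keep only the k largest activations."""
    out = np.zeros_like(x)
    idx = np.argpartition(x, -k)[-k:]
    out[idx] = x[idx]
    return out

w = prune_weights(rng.normal(size=(256, 128)), sparsity=0.9)  # ~90% of weights are zero
x = kwinners(rng.normal(size=128), k=13)                      # ~90% of activations are zero

# Only nonzero weight/activation pairs contribute to the result, so a
# sparse-aware kernel needs roughly 1% of the dense multiply-accumulates.
y = w @ x
```

With both tensors ~90% sparse, the fraction of surviving multiply-accumulates is roughly 0.1 x 0.1 = 1%, which is where the 100x power figure comes from on hardware that can skip the zeros.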
Dual-sparsity gains cannot be realized without hardware support. We've designed the SPU to support sparse data formats, ensuring that models stay compressed in memory and that zero-skipping is exploited at runtime.
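Compressed sparse row (CSR) storage is one common way to get both properties in software; whether the SPU uses CSR specifically is an assumption here, but the principle is the same: store only the nonzeros, then skip the zeros during the multiply.

```python
import numpy as np

def to_csr(dense):
    """Compression: store only nonzero values and their column indices,
    with a row-pointer array marking where each row's nonzeros begin."""
    values, cols, rowptr = [], [], [0]
    for row in dense:
        nz = np.flatnonzero(row)
        values.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(values))
    return np.array(values), np.array(cols), np.array(rowptr)

def csr_matvec(values, cols, rowptr, x):
    """Zero-skipping matrix-vector product: multiply-accumulate
    only over the stored nonzeros of each row."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        start, end = rowptr[i], rowptr[i + 1]
        y[i] = values[start:end] @ x[cols[start:end]]
    return y

# Example: a mostly-zero 4x6 matrix compresses to just 4 stored values.
dense = np.array([[0., 0., 3., 0., 0., 0.],
                  [0., 0., 0., 0., 0., 0.],
                  [1., 0., 0., 0., 2., 0.],
                  [0., 5., 0., 0., 0., 0.]])
values, cols, rowptr = to_csr(dense)
y = csr_matvec(values, cols, rowptr, np.arange(6.0))
```

At 90% sparsity this format stores roughly a tenth of the dense footprint (values plus indices), and the matvec loop touches only those stored entries.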
Accessing off-chip memory uses significantly more power than compute operations. By moving memory on-chip, data movement is minimized and power is reduced. Breaking memory into smaller banks and placing them near each processing element further reduces data movement and eliminates memory bottlenecks.
Because of dual sparsity, large, complex models can now fit in on-chip memory, bringing new use cases to smaller form factors without sacrificing battery life.