The entire acceleration hardware industry is based around this fact. Basically every single accelerator tries to move more memory closer to significantly simpler (and wider) compute. At the extreme of this is compute-on-DRAM, certainly not a new idea, but one that has yet to materialize. Systolic architectures are also very efficient. GPUs are far less so; their main advantage is relatively good tooling and extensive programmability, not energy efficiency per se. And CPUs utterly and completely suck at high-throughput workloads, power-efficiency-wise. They do often have enough compute for e.g. lightweight deep learning, though.

Years ago it cost 8 pJ/mm to move a byte on-chip. It's probably closer to 5 pJ/mm now, but ALUs consume a fraction of this energy to do something with that byte. And once you leave the chip and hit the memory bus, things get _really_ slow and expensive.
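
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It only uses the two per-byte figures quoted above (8 pJ/mm and my guess of about 5 pJ/mm); the workload size, the distance, and the function name are made up for illustration, and it assumes energy scales linearly with bytes and distance, which is a simplification.

```python
# On-chip data movement energy, using the per-byte-per-mm figures from the text.
PJ_PER_BYTE_MM_OLD = 8.0  # the "years ago" figure
PJ_PER_BYTE_MM_NOW = 5.0  # rough current guess from the text, not a measurement


def movement_energy_pj(num_bytes, distance_mm, pj_per_byte_mm):
    """Energy in picojoules, assuming linear scaling in bytes and distance."""
    return num_bytes * distance_mm * pj_per_byte_mm


# Hypothetical workload (an assumption): 1 MiB moved 10 mm across the die.
NUM_BYTES = 1 << 20
DISTANCE_MM = 10.0

# 1 microjoule = 1e6 picojoules
old_uj = movement_energy_pj(NUM_BYTES, DISTANCE_MM, PJ_PER_BYTE_MM_OLD) / 1e6
now_uj = movement_energy_pj(NUM_BYTES, DISTANCE_MM, PJ_PER_BYTE_MM_NOW) / 1e6

print(f"at 8 pJ/mm: {old_uj:.2f} uJ")
print(f"at 5 pJ/mm: {now_uj:.2f} uJ")
```

Under those assumptions, shuffling a single mebibyte across 10 mm of silicon costs on the order of tens of microjoules, before any arithmetic is done on it. Shorten the distance and the bill shrinks proportionally, which is the whole point of putting memory next to the compute.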