So what if we build a stack/set of transistors in the same shape as a trained model? It would eliminate most of the software stack too, and should run very fast. No memory or GPU required: the chip acts as both storage and processing device, purpose-built to be a physical copy of the trained model.
This is literally what Taalas has done with chatjimmy.ai.
Try it, it's Llama 3.1 8B at 16,000 tokens per second.
chatjimmy.ai https://taalas.com/the-path-to-ubiquitous-ai/
Wow, that's incredibly fast. I like this outcome more than centralized datacenters.
But it can only run that model, so it will be outdated in a few years at best.
There’s lots of things you can do in hardware that could be done in software, but at a cost. FPGAs should have solved this long ago, but apparently the guys who own the IP want to make it as hard as possible to use it …