The only memory involved should be at the input and output of a pipeline stage that does an entire layer of an LLM. I'm of the opinion that we'll end up with effectively massive FPGAs with some stages of pipelining that have NO memory access internally, so that you get one token per clock cycle.

100 million tokens per second is currently worth about $130,000,000/day (or so ChatGPT 4.1 told me a few days ago).

I'd like to drop that cost by a factor of at least 1000.
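
For anyone who wants to check the arithmetic, here is a rough sketch in Python. It takes the two figures above at face value (the dollar amount is unverified) and assumes the "one token per clock cycle" design, so the required clock rate simply equals the token rate. The variable names are just for illustration.

```python
# Back-of-the-envelope check of the figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60

tokens_per_second = 100_000_000   # from the comment
dollars_per_day = 130_000_000     # from the comment, unverified
target_reduction = 1000           # "a factor of at least 1000"

# Total tokens produced in one day at that rate.
tokens_per_day = tokens_per_second * SECONDS_PER_DAY

# Implied price per million tokens today, and after the hoped-for reduction.
price_per_million = dollars_per_day / (tokens_per_day / 1_000_000)
target_price_per_million = price_per_million / target_reduction
target_dollars_per_day = dollars_per_day / target_reduction

# Assumption: one token leaves the pipeline every clock cycle,
# so the clock rate in Hz equals the token rate.
required_clock_mhz = tokens_per_second / 1_000_000

print(f"tokens per day:            {tokens_per_day:.3e}")
print(f"implied $/million tokens:  {price_per_million:.2f}")
print(f"target $/million tokens:   {target_price_per_million:.4f}")
print(f"target $/day:              {target_dollars_per_day:,.0f}")
print(f"required clock (MHz):      {required_clock_mhz:.0f}")
```

Under those assumptions the quoted figure works out to roughly $15 per million tokens, and a 1000x reduction would bring the same throughput down to $130,000/day.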

In theory that would be ideal, but I feel like FPGAs haven't kept up with GPUs. The latest GPUs will be at 4nm, while FPGAs will still be at 28nm. The pipelines are huge, so it would take many FPGAs to fit one LLM if everything is kept on-die. Cerebras is attempting this, but has to use a whole silicon wafer:

https://www.cerebras.ai/

We need FPGAs at the latest process node, with many GB of HBM in the package. Fast reconfigurability would also be nice to have.

I feel like the FPGA has stagnated over the last decade, as the two largest companies in this space (Altera and Xilinx) were acquired by Intel and AMD. The new owners haven't kept up the pace of innovation in this space, as it isn't their core business.

> The latest GPUs will be at 4nm, while FPGAs will still be at 28nm.

16 nm (or “14 nm”) for UltraScale+.