> Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips.

They probably have a few ideas around that. Me, personally, I'd have one main expensive chip (replaced every 10 years, or whatever), with a secondary cheap chip in front of it that gets replaced every year or so.

The secondary chip could act the way RAG does, or perhaps both chips together can act as LoRA.

Either way, 99.999% of the knowledge is static. You only need to cover the remaining 0.001%, either by retrieving it at inference time (RAG) or by applying a small weight update (LoRA), and both fit on a much smaller (thus cheaper) disposable chip.
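
To put a rough number on the LoRA side of that, here's a back-of-the-envelope sketch. The dimensions (one 8192x8192 weight matrix, rank 8) are my own illustrative assumptions, not figures from any real chip, and `lora_fraction` is just a helper name I made up.

```python
def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Size of a rank-`rank` LoRA adapter relative to the d_out x d_in base matrix."""
    base_params = d_in * d_out               # weights baked into the big chip
    adapter_params = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return adapter_params / base_params


# Assumed, illustrative dimensions: one 8192 x 8192 layer, rank-8 adapter.
frac = lora_fraction(8192, 8192, 8)
print(f"adapter is {frac:.4%} of the base matrix")
```

With those assumed numbers the adapter comes out at roughly 0.2% of the base matrix. That's bigger than my 0.001% hand-wave, but still small enough that a cheap swappable chip seems plausible.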

The better solution would be making part of the chip cluster use something like FPGA which can be reprogrammed.

Text-to-speech or diagnostic equipment, where the core model is relatively small and never changes, seems like the ideal application. You might be able to fit something in the 25-30B range on 2nm to 14A, but it would still need a way to update.

Large models are simply out of the question in my opinion. If you need 400+ different chip designs, it’ll be billions of dollars to tape out before you even make the first chip.

> The better solution would be making part of the chip cluster use something like FPGA which can be reprogrammed.

I'm not sure I follow (It's late, I am tired and I haven't had my dinner yet. That's my stupid trifecta!)

The original chip has the weights, so it's literally just a bunch of on-die (read-only) memory cells. The FPGA, while you could use it for the memory cells, would be way too expensive to use as pure memory. Typically one would hook up (read-only) storage to it, so you still need that read-only chip anyway.

The FPGA is just the compute bits, but this chip has on-die weights, not just compute.

I was proposing that they have the base weights on a primary (permanent) chip, and a secondary (replaceable) smaller chip with weights for a specific use-case, or for fine-tuning with new knowledge/updates to the model.

The matrices can be combined LoRA-style: the low-rank update on the secondary chip is added to the output of the primary chip's fixed weights, which is equivalent to pushing the prompt through up-to-date weights.
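
Roughly what I have in mind, as a toy sketch in plain Python. This assumes the standard LoRA formulation (effective weights = W + B·A); the function names, the tiny 3x3 matrices, and the "primary/secondary chip" split in the comments are all my own illustration, not how any actual chip does it.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x without ever modifying W."""
    base = matvec(W, x)                # primary chip: read-only base weights
    delta = matvec(B, matvec(A, x))    # secondary chip: low-rank update
    return [b + scale * d for b, d in zip(base, delta)]


def merged_weights(W, A, B, scale=1.0):
    """Reference only: explicitly build W + scale * B @ A."""
    rank = len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


# Toy values (made up): 3x3 base weights, rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 0.0]]          # rank x d_in
B = [[0.5], [0.0], [-0.5]]     # d_out x rank
x = [2.0, 3.0, 4.0]

y = lora_forward(W, A, B, x)
y_ref = matvec(merged_weights(W, A, B), x)
print(y, y_ref)
```

The point is that `lora_forward` never writes to `W`, so the base weights can stay in read-only cells. The two paths give the same result because (W + B·A)·x = W·x + B·(A·x).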