> The better solution would be making part of the chip cluster use something like FPGA which can be reprogrammed.
I'm not sure I follow (It's late, I am tired and I haven't had my dinner yet. That's my stupid trifecta!)
The original chip has the weights, so it's literally just a bunch of on-die (read-only) memory cells. The FPGA, while you could use it for the memory cells, would be way too expensive to use as pure memory. Typically one would hook up (read-only) storage to it, so you still need that read-only chip anyway.
The FPGA is just the compute bits, but this chip has on-die weights, not just compute.
I was proposing that they have the base weights on a primary (permanent) chip, and a secondary (replaceable) smaller chip with weights for a specific use-case, or for fine-tuning with new knowledge/updates to the model.
The matrices can be combined LoRA style: the low-rank update stored on the secondary chip is added to the base weights on the primary chip, resulting in up-to-date weights through which the prompt is pushed.
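To make the LoRA idea concrete, here is a rough sketch in plain Python. It assumes the usual LoRA formulation, where the effective weight is the base matrix plus a low-rank product, W_eff = W + B·A. The names `matmul` and `merge_lora` and the toy 3x3 values are mine, purely for illustration; nothing here says how an actual chip would do it.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w, b, a, scale=1.0):
    """Return W + scale * (B @ A), the LoRA-adjusted weight matrix."""
    delta = matmul(b, a)  # low-rank update, same shape as W
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy base weights: stand-in for what is fixed on the primary chip.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Toy rank-1 adapter: stand-in for what lives on the replaceable chip.
B = [[1.0], [0.0], [2.0]]   # 3 x 1
A = [[0.5, 0.0, 1.0]]       # 1 x 3

W_eff = merge_lora(W, B, A)
print(W_eff)
```

The point is that B and A together hold far fewer numbers than W, which is why a small secondary chip could plausibly carry them.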
I'm wondering about something different. An FPGA seems ideal for an AI chip because you can simply flash the latest model. The downsides are low density and low clock speed. It seems that you can only fit 100-300M parameters in even a very large FPGA, but that seems like it would be enough for most fine-tuning.
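As a back-of-the-envelope check on that budget: a rank-r LoRA adapter for a d_out x d_in matrix needs r * (d_in + d_out) parameters. The model shape below (hidden size 4096, 32 layers, four adapted square matrices per layer, rank 16) is a made-up example, not the spec of any particular model, and the 100M figure is just the low end of the estimate above.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical model shape, chosen only for illustration.
HIDDEN = 4096
LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 4
RANK = 16

per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
total = LAYERS * ADAPTED_MATRICES_PER_LAYER * per_matrix

FPGA_BUDGET = 100_000_000  # low end of the 100-300M estimate above

print(f"per matrix: {per_matrix:,}")
print(f"total:      {total:,}")
print(f"fits in budget: {total <= FPGA_BUDGET}")
```

Under those assumptions the adapter comes to roughly 17M parameters, so there would be room for a higher rank or more adapted matrices.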
I'm thinking of a situation where you do the initial model calculations in hardware on the Taalas chip, then hand that off to the FPGA, which does the LoRA subset of calculations in hardware and can be continuously re-tuned to keep the model up-to-date. This would probably reduce throughput (or at least increase latency), but would save tons of money by allowing you to use the chips longer.
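Roughly what I have in mind, as a toy sketch in plain Python: the fixed chip computes W·x, the reprogrammable part computes the low-rank correction B·(A·x), and the two are summed. The function names (`base_chip`, `adapter`, `forward`) and all the numbers are invented for illustration; I'm assuming the standard LoRA split and not describing how any real hardware is wired.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Stand-in for the weights fixed in silicon; never changes.
W_BASE = [[1.0, 2.0],
          [3.0, 4.0]]


def base_chip(x):
    """Fixed-function path: always uses the baked-in weights."""
    return matvec(W_BASE, x)


def adapter(x, a, b):
    """Reprogrammable path: low-rank correction B @ (A @ x)."""
    return matvec(b, matvec(a, x))


def forward(x, a, b):
    """Sum of the fixed path and the re-tunable correction."""
    return [p + q for p, q in zip(base_chip(x), adapter(x, a, b))]


x = [1.0, 2.0]

# First adapter "flashed" to the reprogrammable part (rank 1).
A1, B1 = [[1.0, 1.0]], [[0.5], [-1.0]]
y1 = forward(x, A1, B1)

# Later update: only the adapter changes, W_BASE is untouched.
A2, B2 = [[0.0, 1.0]], [[1.0], [1.0]]
y2 = forward(x, A2, B2)

print(y1, y2)
```

Keeping the correction as B·(A·x) instead of merging it into W means the base chip never has to change, at the cost of an extra hop per adapted layer, which is where the latency hit would come from.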