> any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request.
So don't use it for large requests. It's ideal when you just want to categorise things, for example "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".
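For scale, here is a rough sketch of the arithmetic behind the quoted figures. It treats the quoted 1-5 t/s as a flat throughput over the whole request, which is a simplification, and the 200-token "short request" is a made-up size for illustration, not a number from this thread.

```python
def hours_to_process(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours to push `tokens` through at a flat throughput."""
    return tokens / tokens_per_second / 3600


# 86k tokens comes from the quoted comment; 200 tokens is a hypothetical
# size for a short "categorise this" request.
for label, tokens in (("86k-token request", 86_000), ("short request", 200)):
    for rate in (1, 5):  # the quoted 1-5 t/s range
        hours = hours_to_process(tokens, rate)
        print(f"{label} at {rate} t/s: {hours:.2f} h ({hours * 60:.1f} min)")
```

At 1 t/s the large request lands just under 24 hours, which matches the "whole day" claim, while the short one stays in the range of seconds to a few minutes.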
The best use is actually for a model that "almost fits" into VRAM, so that automated offloading to system RAM is rare enough that it doesn't impact performance.
As in, when your secondary memory is fast enough, once the first 10% of the model has been processed you swap its memory out for the 50% to 60% part, and when that part is done you swap back so the 0-10% part is ready in time for the next iteration?
Sounds ambitious for such a small improvement in effective capacity. In particular, I start wondering whether the real-life slowdown would be small enough to justify that 10% increase, or whether the usable increase would be even smaller. And that's before factoring in the power/cooling cost of saturating another interface.
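A back-of-envelope way to check that worry, as a minimal sketch: assume each pass over the model is limited by VRAM bandwidth, weights are read-only (so a swap is a single copy over the link), and the copy overlaps perfectly with compute. The bandwidth numbers below are illustrative assumptions, not measurements, and `swap_stall_ratio` and `required_link_gbps` are hypothetical helper names.

```python
def swap_stall_ratio(vram_gbps: float, link_gbps: float,
                     chunk_frac: float = 0.10, first_use: float = 0.50) -> float:
    """Ratio of copy time to the time window available for the copy.

    The 50-60% chunk can start loading once the 0-10% chunk is done and
    must be resident when compute reaches 50%. Values above 1 mean the
    GPU stalls waiting for the link. Model size cancels out.
    """
    window_frac = first_use - chunk_frac      # share of the pass we can hide the copy in
    copy_time = chunk_frac / link_gbps        # seconds per GB of model
    window_time = window_frac / vram_gbps     # seconds per GB of model
    return copy_time / window_time


def required_link_gbps(vram_gbps: float, chunk_frac: float = 0.10,
                       first_use: float = 0.50) -> float:
    """Link bandwidth needed for the copy to just fit in its window."""
    return vram_gbps * chunk_frac / (first_use - chunk_frac)


# Assumed figures: ~1000 GB/s VRAM, ~32 GB/s host-to-device link.
print(f"stall ratio: {swap_stall_ratio(1000, 32):.1f}x")
print(f"link needed: {required_link_gbps(1000):.0f} GB/s")
```

Under these assumptions the link would need about a quarter of the VRAM bandwidth for the swap to stay hidden. The swap back to 0-10% has a same-sized window (60% to 100% of the pass), so the same bound applies to it.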