> The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.
Does this make sense? I'd have thought the KV cache is guaranteed to be touched 100% of the time, while in an MoE the same can't be said of the weights.
Though I suppose if you're shooting for huge context, having that allocation go into RAM makes sense, especially while it's allocated but not yet used.
KV cache is, well, a cache: it can fill up and trigger eviction. You need enough space to execute at least one fwd pass of one request at your target context length. KV cache hits reduce TTFT (time to first token) by avoiding prefill, but you don't get to skip decode.
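To make "enough space for one request at your context length" concrete, here's a back-of-envelope KV cache sizing sketch. The model shape below (80 layers, 8 KV heads with GQA, head_dim 128, fp16 cache) is a hypothetical Llama-70B-style config, not taken from the thread:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, dtype_bytes=2):
    # Per token you store one K and one V vector per layer:
    # 2 * layers * kv_heads * head_dim * bytes-per-element.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len

# Hypothetical 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, context_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # → 39.1 GiB for a single 128k-token request in fp16
```

That single-request floor is why a DDR4 pool for KV cache can be attractive at long context: tens of GiB per in-flight request adds up fast on-GPU.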
MoE is kinda related in that its per-token usage is lower than a dense model of the same total param size, but I think your mental model is a bit off.
KV cache is also eminently swappable if you have fast storage, since it mostly sees small append-only writes per token - it's not rewritten continuously like the activations. (I believe it's even better if you use cached input tokens across requests, since that portion of KV cache can then be recycled and save a single ~KV-cache sized write per request.) Accessing swapped-out cache may be slow, but it's highly preferable to not having that cache amount at all and recomputing from scratch.
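A minimal sketch of that append-only access pattern (hypothetical shapes and class, not any particular engine's API): each decode step writes exactly one new token slot per layer, and all earlier slots are only ever read, which is what makes swapping the cold prefix out to slower storage cheap.

```python
import numpy as np

class KVCache:
    """Toy per-request KV cache; one K and one V tensor per layer."""

    def __init__(self, num_layers, num_kv_heads, head_dim, max_len):
        # Pre-allocated to max_len token slots, filled left to right.
        self.k = np.zeros((num_layers, max_len, num_kv_heads, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)
        self.len = 0

    def append(self, k_t, v_t):
        # Per decode step: touch only the new slot (a small sequential write).
        # Slots [0, self.len) are read-only from here on, so they can be
        # swapped out or, with prefix caching, shared across requests.
        self.k[:, self.len] = k_t
        self.v[:, self.len] = v_t
        self.len += 1
```

Contrast this with activations, which are overwritten every forward pass and so can't be offloaded the same way.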