I have an EPYC server with 128 GB of high-throughput DRAM, plus two AMD GPUs with 16 GB of VRAM each. Until now I have run only models that fit in this memory, e.g. various medium-size Qwen and Gemma models, or gpt-oss.
But I am curious about how bigger models behave, e.g. GLM-5.1, Qwen3.5-397B-A17B, Kimi-K2.6, DeepSeek-V3.2, MiniMax-M2.7. I am also curious about how the non-quantized versions of models with around 120B parameters behave, e.g. such versions of Nemotron and Qwen. Quantization to 8 bits, or even to 4 bits, is said to have negligible effects, but I want to confirm this with my own tests.
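For scale, a rough back-of-the-envelope calculation of why such models cannot fit in my memory (weights only, ignoring KV cache and activations; the parameter counts are taken from the model names above):

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# A ~397B-parameter MoE model at different quantization levels:
fp16 = weight_bytes(397e9, 16)  # roughly 740 GiB
q8   = weight_bytes(397e9, 8)   # roughly 370 GiB
q4   = weight_bytes(397e9, 4)   # roughly 185 GiB

# Even a ~120B model unquantized at 16 bits is roughly 224 GiB,
# well above 128 GB of DRAM plus 2 x 16 GB of VRAM.
fp16_120b = weight_bytes(120e9, 16)
```

So even the most aggressive common quantization of the ~400B models, and the unquantized ~120B models, exceed what this machine can hold.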
There is no way to test big models, or non-quantized medium models, at a reasonable cost other than by reading the weights from SSDs. For some tasks it may be preferable to use a big model at a slow speed, if that means fewer attempts are needed to obtain something useful. For a coding assistant, it may be possible to batch many tasks so that they all progress simultaneously during a single pass over the SSD data.
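The batching idea can be sketched as a toy schedule: each layer's weights are read from the SSD exactly once per forward pass and applied to every in-flight request before moving on, so the SSD read cost is amortized over the whole batch. All names here are illustrative, not llama.cpp APIs:

```python
def apply_layer(weights, state):
    # placeholder for the real layer computation (matmuls etc.)
    return state + weights

def stream_batched_forward(layers, requests, read_layer_weights):
    """Toy single-pass schedule: one sequential SSD read per layer,
    shared by all pending requests."""
    reads = 0
    for layer in layers:
        weights = read_layer_weights(layer)  # one sequential SSD read
        reads += 1
        for req in requests:
            req["state"] = apply_layer(weights, req["state"])
    return reads

layers = ["blk.0", "blk.1", "blk.2"]
requests = [{"state": 0} for _ in range(8)]
n_reads = stream_batched_forward(layers, requests, lambda layer: 1)
# 3 reads serve all 8 requests, instead of 3 * 8 reads if each
# request streamed the weights independently.
```

With this schedule the per-token latency of each task is still limited by SSD bandwidth, but the aggregate throughput scales with the batch size, up to the point where compute or DRAM staging becomes the bottleneck.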
For now I am studying llama.cpp in order to determine how it can be modified to achieve the maximum performance that could be reached with SSDs.
AIUI, the main obstacle to maximizing performance with SSD offload is that existing GGUF files for MoE models are not necessarily laid out so that fetching a single MoE layer-expert can be done by reading a single sequential extent off the file. It may be that the GGUF format is already flexible enough in its layout configuration that this is doable with a simple conversion; but if not, the GGUF specification would have to be extended to allow such a layout to be configured.
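That constraint can be checked mechanically: GGUF records a byte offset and size for every tensor in its header, so given the tensor records one can test whether a group of tensors occupies a single gap-free extent that one sequential read can fetch. A minimal sketch over hypothetical (name, offset, size) records, not a real GGUF parser; note that GGUF permits alignment padding between tensors (`general.alignment`), which by itself can break contiguity:

```python
from typing import NamedTuple

class TensorInfo(NamedTuple):
    name: str     # e.g. "blk.0.ffn_up_exps.weight"
    offset: int   # byte offset within the tensor-data section
    size: int     # size in bytes

def is_contiguous(tensors):
    """True if the tensors form one gap-free byte range, i.e. a
    single sequential read can fetch them all."""
    ordered = sorted(tensors, key=lambda t: t.offset)
    return all(a.offset + a.size == b.offset
               for a, b in zip(ordered, ordered[1:]))

# Hypothetical layout of one layer's fused expert tensors:
layer0 = [
    TensorInfo("blk.0.ffn_gate_exps.weight", 0,    4096),
    TensorInfo("blk.0.ffn_up_exps.weight",   4096, 4096),
    TensorInfo("blk.0.ffn_down_exps.weight", 8704, 4096),  # 512-byte gap
]
is_contiguous(layer0)  # False: the padding breaks the extent
```

And this check only covers whole tensors; since GGUF typically fuses all experts of a layer into one 3D tensor, extracting a single expert's slice is a strided read within each tensor, which is exactly the layout problem described above.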
You are right, and that is why I do not intend to use a GGUF file but a set of files with a different layout, and why I need to make changes in llama.cpp.
If you have to come up with a custom format anyway, why not just make it a draft extension to GGUF layout definitions (something like "coalesced expert fetch" or the like) and submit it for inclusion in the standard? Then future models could be autoconverted to such a format.