I am fascinated by this and similar research (RotorQuant, etc.). It seems like by next year we'll be able to run this year's largest models on last year's hardware. :)
Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.
Just look at DeepSeek V4: this preview model uses only 8 GB for a 1M-token KV cache (the context). It's insanely efficient already. It's just that most models coming out are barely catching up with these technical breakthroughs. DeepSeek are pioneers.
Unfortunately V4 is not trained for most real-world usage; it's mainly for general world knowledge.
Maybe we can run more powerful models locally.
I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn’t let you store more model. In some sense that puts local LLM usage at a further disadvantage to inference done in a hyperscaler’s data center.
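To put rough numbers on that (everything here is made up for illustration): with a fixed chunk of VRAM left over after the weights are loaded, a smaller per-request cache mostly translates into more concurrent requests, not room for a bigger model.

```python
# Toy arithmetic (hypothetical numbers): a smaller KV cache buys concurrency,
# not extra room for model weights.
free_vram_gb = 20            # VRAM left after loading the weights (assumed)
kv_per_request_gb = 4        # per-request KV cache at fp16 (assumed)
shrink_factor = 6            # e.g. a ~6x cache quantization

print(free_vram_gb // kv_per_request_gb)                   # 5 concurrent requests at fp16
print(free_vram_gb * shrink_factor // kv_per_request_gb)   # 30 with the quantized cache
```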
The size of the KV cache (the stored context) is roughly proportional to the number of layers, the hidden dimension, and the context length. For a 400B model it could be 30-60 GB for just an 8K context window (depends on the model, etc., just a ballpark).
So shrinking that by 6x (from fp16) would be a big win for larger models. And while TurboQuant can also be applied to model weights, it won't save size over q4 compression there, though it should give better accuracy.
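For anyone who wants the back-of-envelope version, here's a minimal sketch of that ballpark. The layer count and hidden size are hypothetical (roughly what a dense 400B-class model might look like), and it ignores GQA/MLA-style tricks that real models use to shrink the cache further:

```python
def kv_cache_bytes(num_layers, hidden_dim, context_len, bytes_per_value):
    # Keys and values are each (num_layers x context_len x hidden_dim) values.
    return 2 * num_layers * hidden_dim * context_len * bytes_per_value

layers, hidden = 120, 16384                       # hypothetical 400B-class shape
fp16 = kv_cache_bytes(layers, hidden, 8192, 2)    # fp16 = 2 bytes per value

print(f"fp16 cache: {fp16 / 2**30:.0f} GiB")      # ~60 GiB for an 8K window
print(f"after ~6x:  {fp16 / 6 / 2**30:.0f} GiB")  # ~10 GiB
```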
Edits: Better context
That's my hope as well, since I tend to use low-end GPUs (e.g. an NVIDIA GeForce RTX 2060 @ 6GB). I've been looking for an image generation model that can fit on that card, for use with Ollama + a GUI on Linux. No luck yet, since money's tight and jobs are tighter :(
An Arc B580 will just about fit Flux.2 Klein (at FP8). However, you can also easily get much larger GPUs on RunPod or Vast at $0.25/hr.
I would strongly recommend exploring that option: renting an RTX 5090 for an evening of image generation costs a dollar or two and is way more fun than trying to jam big models onto little cards. Just take some time to create a reasonable, scripted deployment workflow for when you spin up a fresh instance.
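For the scripted part, even something this small (the package list and model repo are placeholders, not a recommendation) run on each fresh instance saves a lot of fiddling:

```python
# Minimal bootstrap sketch for a freshly rented GPU instance.
# Swap the placeholder packages and repo id for whatever stack you actually use.
import subprocess

subprocess.run(
    ["pip", "install", "huggingface_hub", "diffusers", "transformers", "accelerate"],
    check=True,
)

from huggingface_hub import snapshot_download  # available after the install above

snapshot_download(
    repo_id="some-org/some-image-model",       # hypothetical repo id
    local_dir="/workspace/models/image-model",
)
```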
hey what's your Venmo?
We're only a few years into this new tech getting serious research man-hours thrown at it, and some incredible optimizations have already been found in that short time. Not only has the efficiency of inference increased dramatically, the quality of tiny models has also improved significantly.
The future is bright for local AI.