Very cool. So it's not just about GPU VRAM, which I incorrectly thought. I thought you'd need 512 GB of GPU VRAM. I don't think it cost only 2400, though; 512 GB of RAM would have been more expensive back in the day. But not the mortgage-grade 200,000 I estimated myself (which assumed running 100% in VRAM; probably overkill for a single user).

You can use system RAM with something like llama.cpp, which can keep the weights in system RAM instead of VRAM. Token generation is a function of memory bandwidth: the more bandwidth, the better. I'm on 8 channels at 2400 MHz. With 12 channels I'd get 1.5x the speed at 2400 MHz. Of course DDR5 is much faster, so 12 channels at 4800 MHz would give 3x the token generation speed, or roughly 18 tk/sec. Prompt processing is all about compute, so the more CPU cores you have, the faster it can do PP.
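To make the scaling argument concrete, here's a rough back-of-the-envelope sketch. It assumes the "2400mhz" and "4800mhz" figures are transfer rates (MT/s), that each memory channel is 64 bits (8 bytes) wide, and that token generation scales linearly with theoretical peak bandwidth, which real systems won't hit exactly. The function names are made up for illustration, and the ~6 tk/sec baseline isn't stated above; it's only what the "3x, or roughly 18 tk/sec" figure implies.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Assumes each channel moves bus_bytes per transfer (64-bit channel = 8 bytes).
    Sustained real-world bandwidth is lower than this figure.
    """
    return channels * mega_transfers * bus_bytes / 1000


def scaled_tokens_per_sec(baseline_tps: float, baseline_bw: float, new_bw: float) -> float:
    """Estimate token generation speed, assuming it is purely bandwidth-bound."""
    return baseline_tps * new_bw / baseline_bw


ddr4_8ch = peak_bandwidth_gbs(8, 2400)    # the setup described above
ddr4_12ch = peak_bandwidth_gbs(12, 2400)  # same speed, 12 channels
ddr5_12ch = peak_bandwidth_gbs(12, 4800)  # 12 channels of DDR5

# Assumption: ~6 tk/sec baseline, inferred from "3x ... roughly 18 tk/sec".
baseline_tps = 6.0

for name, bw in [("8ch @ 2400", ddr4_8ch), ("12ch @ 2400", ddr4_12ch), ("12ch @ 4800", ddr5_12ch)]:
    est = scaled_tokens_per_sec(baseline_tps, ddr4_8ch, bw)
    print(f"{name}: {bw:.1f} GB/s peak, ~{est:.1f} tk/sec (x{bw / ddr4_8ch:.2f})")
```

Under those assumptions the three configurations come out at about 153.6, 230.4, and 460.8 GB/s, which matches the 1.5x and 3x ratios quoted above. This only covers token generation; prompt processing is compute-bound and won't follow the same curve.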

Well, it is about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels shown in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.

Cloud offerings are 80-200tk/sec versus single digit tk/sec.

That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive, so you're either relying on another model for a comprehensive design or hoping a one-shot with somewhat degraded quality turns out correctly.

I see. So not quite usable apart from specific use cases. Maybe in a few years we'll see new hardware players and better prices.

I think we'll see

- better hardware

- more efficient model runtime algorithms/code

- smarter/more efficient models (same capability with fewer parameters)

So ideally these will all come together and help.