Wasn't streaming models from storage into limited memory a case where it was impressive that you could make the elephant dance at all?
If you want usable speeds on local machines from very large models that haven't been quantized to death, RDMA over Thunderbolt enables that use case.
Consumer PC GPUs don't have enough RAM; enterprise GPUs that can handle the load well are obscenely expensive; and Strix Halo tops out at 128 GB of RAM and is limited on Thunderbolt ports.
The bad performance you saw was with very limited memory and very large models, so streaming weights from storage was a huge bottleneck. If you gradually increase RAM, more and more of the weights are cached and the speed improves quite a bit, at least until you're running huge contexts and most of the RAM ends up being devoted to that. Is the overall speed "usable"? That's highly subjective, but with local inference it's convenient to run 24/7 and lean on non-interactive use. Scaling out via RDMA over Thunderbolt is still there as an option; it's just not the first approach you'd try.
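For a sense of scale, here's a back-of-envelope sketch: assume decode is memory-bandwidth-bound, every weight is read once per token, and the cached portion never gets evicted. The model size and bandwidth figures are illustrative placeholders, not measurements.

    # Back-of-envelope: per-token decode speed when a fraction of the
    # weights sits in RAM and the remainder streams from SSD each token.
    # All numbers below are illustrative assumptions, not measurements.
    MODEL_GB = 120        # hypothetical large-model weight size
    RAM_BW_GBPS = 100.0   # assumed effective RAM read bandwidth
    SSD_BW_GBPS = 7.0     # assumed fast NVMe sequential read

    def tokens_per_sec(cached_fraction: float) -> float:
        # Dense decode touches every weight once per token, so per-token
        # time = cached bytes / RAM bandwidth + streamed bytes / SSD bandwidth.
        cached = MODEL_GB * cached_fraction
        streamed = MODEL_GB - cached
        return 1.0 / (cached / RAM_BW_GBPS + streamed / SSD_BW_GBPS)

    for f in (0.0, 0.5, 0.9, 0.99, 1.0):
        print(f"{f:>4.0%} of weights in RAM: {tokens_per_sec(f):.3f} tok/s")

The exact figures aren't the point; the shape is: per-token time falls linearly as more of the weights stay resident.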
> If you gradually increase RAM, more and more of the weights are cached and the speed improves quite a bit
It'll increase a lot relative to the zero-RAM baseline, but it's still complete garbage compared to fitting the whole model in RAM. Even if you fit most of it in RAM you're probably still an order of magnitude slower than fitting all of it, with most of your time spent waiting on the SSD.
If you don't care about performance, you have a lot of options.
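To put rough numbers on "most of your time spent waiting on the SSD", the same toy model (same assumed bandwidths as the sketch above, still placeholders) gives the slowdown versus fully-in-RAM and the share of each token spent on SSD reads:

    # Same illustrative assumptions: 120 GB of weights, 100 GB/s RAM,
    # 7 GB/s NVMe; decode reads every weight once per token.
    MODEL_GB = 120
    RAM_BW_GBPS = 100.0
    SSD_BW_GBPS = 7.0

    all_in_ram_s = MODEL_GB / RAM_BW_GBPS  # baseline: everything cached
    for f in (0.5, 0.8, 0.9, 0.95):
        ram_s = MODEL_GB * f / RAM_BW_GBPS
        ssd_s = MODEL_GB * (1 - f) / SSD_BW_GBPS
        slowdown = (ram_s + ssd_s) / all_in_ram_s
        ssd_share = ssd_s / (ram_s + ssd_s)
        print(f"{f:.0%} cached: {slowdown:.1f}x slower, "
              f"{ssd_share:.0%} of each token on SSD reads")

Under these assumptions, even at 90% cached the streamed slice eats over half of every token, and the slowdown approaches an order of magnitude around the half-cached mark.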