You can run plenty of models on a $10K machine, or even a lot less than that; it all depends on how long you're willing to wait for results. Streaming weights from SSD storage using mmap() is already a reality when running the largest and sparsest models. You can save even more memory by limiting KV caching at the cost of extra compute, and there may be ways to push RAM savings even higher simply by tweaking the extent to which model activations are recomputed as needed.
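For a rough sense of what mmap() weight streaming looks like, here's a minimal Python sketch; the file name and tensor shape are made-up placeholders, and real loaders (llama.cpp's mmap path, for instance) do considerably more. The key idea is that mapping the file read-only lets the OS fault pages in from SSD on first touch and evict cold ones under memory pressure, so resident RAM tracks the weights you actually touch rather than the full file size.

```python
import mmap

import numpy as np

# Minimal sketch: map a (hypothetical) raw float16 weights file read-only.
# ACCESS_READ works on both Unix and Windows. The OS pages data in from
# SSD on first touch instead of loading the whole file into RAM, and can
# evict cold pages under memory pressure.
with open("model.weights.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# View one (made-up) 4096x4096 layer as a tensor without copying; only
# the pages actually read during inference get faulted in from disk.
layer0 = np.frombuffer(mm, dtype=np.float16, count=4096 * 4096)
layer0 = layer0.reshape(4096, 4096)
```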

Yeah, there are a lot of people who advocate for really slow inference on cheap infra. That's something else that should be expressed at this fidelity.

Honestly, 0.2 tps is too slow for my use cases, although I've spoken with many people who are fine with numbers like that.

At least the people I've talked to say that if they have a very high confidence score that the model will succeed, they don't mind the wait.

Essentially: if task failure is 1 in 10, I want to monitor and retry.

If it's 1 in 1000, then I can walk away.
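To put numbers on the 1-in-10 versus 1-in-1000 distinction: with per-attempt failure probability q, attempts until success follow a geometric distribution with mean 1/(1-q), so expected total runtime barely inflates either way; the real difference is whether the run needs babysitting. A small sketch (the 1% monitoring threshold is an arbitrary assumption):

```python
# With per-attempt failure probability q, attempts-until-success is
# geometric with mean 1/(1-q), so a task that takes t hours per attempt
# costs t/(1-q) hours in expectation.
def expected_hours(task_hours: float, p_fail: float) -> float:
    return task_hours / (1.0 - p_fail)

# Made-up 1% threshold for "do I need to babysit this run?"
def needs_monitoring(p_fail: float, threshold: float = 0.01) -> bool:
    return p_fail > threshold

for q in (0.1, 0.001):
    print(q, round(expected_hours(10.0, q), 2), needs_monitoring(q))
# 0.1 11.11 True      -> plan to monitor and retry
# 0.001 10.01 False   -> walk away
```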

The reality is that most people have no real sense of what this order of magnitude actually is for a given task. So unless you have high confidence in your confidence score, slow is useless.

But sometimes you do...

If you launch enough tasks in parallel, you aren't going to care that 1 in 10 failed, as long as the other 9 are good. Just rerun the failed jobs whenever you get around to it; the infra will still be getting plenty of utilization from the rest.
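Here's a hedged sketch of that fire-and-forget pattern, with run_task standing in for whatever slow inference job is actually being dispatched:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sketch: fire off N independent tasks, keep the successes,
# and collect failures for a lazy retry later.
def launch_all(tasks, run_task, workers=10):
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        for fut in as_completed(futures):
            task = futures[fut]
            try:
                done.append((task, fut.result()))
            except Exception:
                failed.append(task)  # rerun whenever you get around to it
    return done, failed
```

Failed tasks come back in a list you can requeue on your own schedule, which is exactly the "rerun it whenever" workflow described above.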