Typical inference workloads have shifted quite a bit in the last six months or so.

Your point would have been largely correct in the first half of 2025.

Now, you're going to have a much better experience with a couple of Nvidia GPUs.

There are two reasons for this: reasoning models need a fairly high tokens-per-second rate to do anything useful, and we're now seeing small quantized and distilled reasoning models perform almost as well as the ones needing terabytes of memory.
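A rough back-of-envelope illustration of why decode speed matters so much for reasoning models (the trace length here is an assumed figure for a hard problem, not a benchmark):

```python
# Wall-clock time for one reasoning trace at different decode
# throughputs. TRACE_TOKENS is an illustrative assumption.
TRACE_TOKENS = 5_000  # assumed chain-of-thought length

for tok_per_s in (10, 50, 150):
    minutes = TRACE_TOKENS / tok_per_s / 60
    print(f"{tok_per_s:>4} tok/s -> {minutes:.1f} min per answer")
```

At 10 tok/s you wait over eight minutes for a single answer; at 150 tok/s it's well under a minute, which is roughly the difference between unusable and interactive.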