I don’t find response time to be much of a problem. I have above-average hardware, but nothing ultra fancy, and I get decent response times from something like LLAMA 3.x. Maybe I’m just content with non-instant replies, but from online models I do not get replies much faster.
> but from online models I do not get replies much faster.
My point is that raw tokens/second isn't all that matters. What actually matters is the tokens/second required to reach a correct or acceptable-quality result. In my experience, the large LLM will almost always one-shot an answer that takes many back-and-forth iterations and revisions from LLAMA 3.x. With higher-reasoning tasks, you might spend many iterations only to realize the small model isn't capable of producing an answer at all, while the large model could after a few iterations. All that wasted time could have been avoided for pennies if you'd just started with the large model.
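Here's a rough back-of-envelope sketch of that tradeoff. Every number below is a made-up assumption for illustration, not a benchmark:

```python
# Effective time to a *correct* answer, not raw tokens/sec.
# All numbers are illustrative assumptions, not measurements.

def time_to_answer(tokens_per_sec: float, tokens_per_reply: int, iterations: int) -> float:
    """Total generation time across the back-and-forth needed for a usable result."""
    return tokens_per_reply * iterations / tokens_per_sec

# Small local model: fast per token, but assume it needs ~5 revision rounds.
local = time_to_answer(tokens_per_sec=40, tokens_per_reply=500, iterations=5)

# Large hosted model: slower per token, but assume it one-shots the task.
hosted = time_to_answer(tokens_per_sec=25, tokens_per_reply=500, iterations=1)

print(f"local: {local:.0f}s, hosted: {hosted:.0f}s")  # local: 62s, hosted: 20s
```

Under those assumptions the "slower" model wins on wall-clock time, and that's before counting the human time spent writing each follow-up prompt.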
Of course, it depends on what you're actually doing.