You can still get decent stuff out of local ones.

Mostly I use them for testing tools and integrations via the API, so I'm not spending money on subscriptions. Once I see something working, I switch to a proprietary model to get the best results.
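
Roughly what that switch looks like, as a minimal sketch (assuming an OpenAI-compatible local server such as Ollama's /v1 endpoint; the URL and model names are just examples):

    from openai import OpenAI

    USE_LOCAL = True  # flip to False once the tool/integration works

    if USE_LOCAL:
        # Ollama (and llama.cpp's server) expose an OpenAI-compatible /v1 endpoint;
        # the key is ignored locally, but the client wants a non-empty string.
        client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
        model = "llama3.1"   # whatever model you have pulled locally
    else:
        client = OpenAI()    # reads OPENAI_API_KEY from the environment
        model = "gpt-4o"     # or whichever proprietary model you settle on

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize this changelog in three bullets."}],
    )
    print(resp.choices[0].message.content)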

If you're comfortable with the API, all the services provide pay-as-you-go API access that can be much cheaper. I've tried local, but the time cost of getting it to spit out something reasonable wasn't worth the literal pennies the answers from the flagship would cost.

This. The APIs are so cheap, and they're up and running right now with 10x better-quality output. Unless whatever you're doing is Totally Top Secret or completely nefarious, send your prompts to an API.

I don’t see too much time spent waiting for a response. I have above-average hardware but nothing ultra fancy, and I get decent response times from something like LLAMA 3.x. Maybe I am just happy with not-instant replies, but from online models I do not get replies much faster.

> but from online models I do not get replies much faster.

My point is that raw tokens/second isn't all that matters. What actually matters is the tokens/second required to reach a correct/acceptable-quality result. In my experience, the large LLM will almost always one-shot an answer that takes many back-and-forth iterations/revisions with LLAMA 3.x. With higher-reasoning tasks, you might spend many iterations only to realize the small model isn't capable of providing an answer at all, while the large model could have after a few. That wasted time usually would have cost only pennies if you had just started with the large model.
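
To put rough numbers on "pennies" (the per-token prices below are hypothetical placeholders, not any provider's actual rates):

    # Back-of-envelope cost of one flagship-model request, with made-up example rates.
    PRICE_IN_PER_MTOK = 3.00    # USD per million input tokens (hypothetical)
    PRICE_OUT_PER_MTOK = 15.00  # USD per million output tokens (hypothetical)

    input_tokens = 2_000   # a decent-sized prompt plus context
    output_tokens = 1_000  # a longish answer

    cost = (input_tokens * PRICE_IN_PER_MTOK
            + output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000
    print(f"${cost:.3f} per request")  # ~$0.021, i.e. about two cents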

Of course, it matters what you're actually doing.