But what is it doing for you that you couldn’t do yourself at that speed? I’m really curious and on the fence about partly going local.

I think you would use it more like email and less like text messages, so the mode of communication shifts drastically. The other part is, you don't have to run just that model; you can offload a lot of chores to smaller models.

Not a Local LLM user, but I regularly kick off meaty jobs in Claude Code then check on them 1-2hrs later.

In this case it would take 20-40 hours to accomplish the same amount of work when running locally.
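For anyone wanting to check that estimate, here is a rough sketch of the arithmetic. It assumes the job is bound purely by generation speed, and the 100 tk/sec hosted and 5 tk/sec local rates are illustrative guesses picked to match the 20x gap, not measurements.

```python
def local_hours(hosted_hours: float, hosted_tps: float, local_tps: float) -> float:
    """Scale a hosted job's wall-clock time to a slower local setup.

    Assumes the same number of tokens is generated either way, so time
    grows by the ratio of the two generation speeds.
    """
    return hosted_hours * hosted_tps / local_tps


# Hypothetical speeds: 100 tk/sec hosted, 5 tk/sec local (a 20x slowdown).
for hours in (1, 2):
    print(f"{hours}h hosted is roughly {local_hours(hours, 100, 5):.0f}h local")
```

Prompt processing, tool calls, and waiting on tests are ignored here, so the real gap could be smaller or larger.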

Run one task, while you do another? Or while you sleep / eat / rave?

While my colleagues are running 6 parallel agents at 50-100t/s each, with an actual SOTA model? Don’t you think I’d get fired after a few weeks of that?

I agree single-digit tk/sec is painfully slow, but I also doubt anyone with these local/homelab setups is using them for work. More likely they fire off a job and check back later. That said, I've had terrible results one-shotting, so you'd need to design with a faster model or have extreme patience during the discovery/design phase.

Do you work at Facebook and happen to find yourself in a token burning competition with your colleagues?

Why would you use this when your company has access to actual SOTA? I don't get it.

Here's a thought experiment for you. Let's say you can run 1000 agents at 10,000 tokens a second. Do you think you are going to be more productive than someone running at 6tk/sec with the same model?

In case it's not clear, you will be generating 10,000,000 tokens a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing predictable work and have a good workflow and a massive dataset to process, then token speed matters. If you are performing creative work like coding, it doesn't.
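To put numbers on the thought experiment, here is a minimal sketch. The agent counts and speeds come from the comment above; the 5 tokens per second review rate is a made-up placeholder for how fast one person can actually check output, so swap in your own figure.

```python
def aggregate_tps(agents: int, tps_per_agent: float) -> float:
    """Total tokens generated per second across all agents."""
    return agents * tps_per_agent


def review_seconds(generated_tps: float, review_tps: float,
                   generation_seconds: float = 1.0) -> float:
    """Seconds a human needs to review what was generated in the given time."""
    return generated_tps * generation_seconds / review_tps


REVIEW_TPS = 5.0  # hypothetical human verification rate, not a measured value

fast = aggregate_tps(1000, 10_000)  # 1000 agents at 10,000 tk/sec each
slow = aggregate_tps(1, 6)          # one agent at 6 tk/sec

print(f"fast setup: {fast:,.0f} tk/sec")
print(f"review time per second of fast output: "
      f"{review_seconds(fast, REVIEW_TPS) / 86_400:.1f} days")
print(f"review time per second of slow output: "
      f"{review_seconds(slow, REVIEW_TPS):.1f} seconds")
```

Under that assumed review rate, the slow setup already produces output about as fast as one person can verify it, which is the point being made: past that, extra generation speed only grows the unreviewed backlog.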