While my colleagues are running 6 parallel agents at 50-100 t/s each, with an actual SOTA model? Don't you think I'd get fired after a few weeks of that?
I agree single-digit tk/sec is painfully slow, but I also doubt anyone with these local/homelab setups is using them for work. More likely they fire off a job and check back later. That said, I've had terrible results one-shotting, so you'd need to design with a faster model or have extreme patience during the discovery/design phase.
Here's a thought experiment for you. Let's say you can run 1000 agents at 10,000 tokens a second each. Do you think you are going to be more productive than someone running at 6 tk/sec with the same model?
In case it's not clear, you will be generating 10,000,000 tokens a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing predictable work and have a good workflow and a massive dataset to process, then token speed matters. If you are performing creative work like coding, it doesn't.
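To put rough numbers on that thought experiment, here is a minimal Python sketch. The agent counts and per-agent rates come from the comment above; `HUMAN_REVIEW_RATE` is purely an assumed placeholder for how fast one person can verify output, not a measured figure, and the function names are made up for illustration.

```python
def aggregate_rate(agents: int, tokens_per_sec_each: float) -> float:
    """Total tokens generated per second across all agents."""
    return agents * tokens_per_sec_each


def review_seconds_per_generated_second(generated: float, reviewed: float) -> float:
    """Seconds of human review needed for each second of generation."""
    return generated / reviewed


# ASSUMPTION: one person verifies about 5 tokens/sec. Pick your own number.
HUMAN_REVIEW_RATE = 5.0

fast = aggregate_rate(1000, 10_000)  # the 1000-agent setup
slow = aggregate_rate(1, 6)          # the single 6 tk/sec setup

for label, rate in (("fast", fast), ("slow", slow)):
    backlog = review_seconds_per_generated_second(rate, HUMAN_REVIEW_RATE)
    print(f"{label}: {rate:,.0f} tok/s generated, "
          f"{backlog:,.1f} s of review per second of output")
```

Whatever review rate you plug in, the fast setup's backlog grows in proportion to its output, so under this simple model the limit is set by verification rather than generation.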
Do you work at Facebook and happen to find yourself in a token burning competition with your colleagues?
Why would you use this when your company has access to actual SOTA? I don't get it.