2000 tokens per second is absolutely insane for a model that's on par with GPT-4.1. However, throughput is only one part of the equation; the other is latency. Right now the latency looks quite high: it takes a few seconds to receive the first token on every API call. That makes it less exciting for agentic use, where many calls are made in quick succession. I wish providers focused more on this part.
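
Rough back-of-envelope for why time-to-first-token (TTFT) tends to dominate agentic workloads. All the numbers below are hypothetical assumptions for illustration, not measurements of any particular provider:

```python
def chain_latency(num_calls: int, ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Wall-clock time for a sequential chain of API calls:
    each call pays TTFT once, then streams its output tokens."""
    per_call = ttft_s + output_tokens / tokens_per_s
    return num_calls * per_call

# Agentic tool calls are often short (~100 output tokens) but chained (~20 calls).
high_tps_high_ttft = chain_latency(20, ttft_s=2.0, output_tokens=100, tokens_per_s=2000)
low_tps_low_ttft   = chain_latency(20, ttft_s=0.3, output_tokens=100, tokens_per_s=200)

print(f"2000 tok/s but 2.0 s TTFT: {high_tps_high_ttft:.0f} s total")  # ~41 s
print(f"200 tok/s but 0.3 s TTFT:  {low_tps_low_ttft:.0f} s total")    # ~16 s
```

With short outputs per call, the slower-decoding but lower-latency setup finishes the whole chain in a fraction of the time, which is the point: past a certain throughput, TTFT is what you feel.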