I currently use Cerebras for Qwen3. One of the things I like is its speed (the TPM limit is rough, though). I am curious: how fast is Qwen3 on your platform, and what quantization are you running for your models?

I'm on plane wifi right now, but I'll benchmark later today. When I benchmarked GLM-4.5, I could get 150-200 tps in the Bay Area, California; Qwen3 is probably somewhat lower, TBH. We have an open-source coding agent that includes a TPS benchmarker that works with any OpenAI-compatible API, including ours: https://github.com/synthetic-lab/octofriend

To run the TPS benchmark, just run:

    octo bench tps

All it does is ask the model to write a long story without making tool calls (although we do send the tool definitions over, to accurately benchmark differences in tool call serialization/parsing). It usually consumes a little over 1k tokens, so it's fairly cheap to run against different usage-based APIs (and only consumes a single request for subscription APIs that rate limit by request).
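
If you want to reproduce the measurement without octo, here's a minimal sketch of the same idea against any OpenAI-compatible endpoint. This isn't octo's actual implementation; the env vars and model ID are placeholders, and note that not every compatible provider supports `stream_options.include_usage`:

    // Minimal TPS sketch: stream a long completion from an OpenAI-compatible
    // endpoint and divide completion tokens by wall-clock seconds. This is
    // not octo's actual implementation, just the same basic idea.
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: process.env.OPENAI_BASE_URL, // placeholder: your provider's endpoint
      apiKey: process.env.OPENAI_API_KEY,
    });

    async function benchTps(model: string): Promise<number> {
      const start = performance.now();
      const stream = await client.chat.completions.create({
        model,
        messages: [
          { role: "user", content: "Write a long story. Do not call any tools." },
        ],
        stream: true,
        // Ask the server to append a final usage chunk for an exact token count.
        // Not every OpenAI-compatible provider supports this; if yours doesn't,
        // counting stream chunks (roughly one token each) is a crude fallback.
        stream_options: { include_usage: true },
      });

      let completionTokens = 0;
      for await (const chunk of stream) {
        if (chunk.usage) completionTokens = chunk.usage.completion_tokens;
      }
      return completionTokens / ((performance.now() - start) / 1000);
    }

    // Hypothetical model ID; use whatever your provider calls Qwen3.
    console.log(`${(await benchTps("qwen3-coder-480b")).toFixed(1)} tps`);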

Edit: forgot to add that for Qwen3, everything should be running in FP8.

Just tried benchmarking from Mexico City, where I'm at for a wedding: looks like 130 tps for Qwen3 Coder 480B here.