I currently use Cerebras for Qwen3. One of the things I like is its speed (the TPM limit is rough, though). I am curious: how fast is Qwen3 on your platform, and what quantization are you running for your models?

I'm on plane wifi right now, but I'll benchmark later today. When I benchmarked GLM-4.5, I could get 150-200 tps in the Bay Area, California; Qwen3 is probably somewhat lower, TBH. We have an open-source coding agent that includes a TPS benchmarker that works with any OpenAI-compatible API, including ours: https://github.com/synthetic-lab/octofriend

To run the TPS benchmark, just run:

    octo bench tps

All it does is ask the model to write a long story without making tool calls (although we do send the tool definitions over, to accurately benchmark differences in tool call serialization/parsing). It usually consumes a little over 1k tokens, so it's fairly cheap to run against different usage-based APIs (and only consumes a single request for subscription APIs that rate limit by request).
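
If you want to reproduce the measurement without octo, here's a minimal sketch of the same idea against any OpenAI-compatible endpoint. This isn't octo's actual implementation; the env vars and model ID are placeholders, and note that not every compatible provider supports `stream_options.include_usage`:

    // Minimal TPS sketch: stream a long completion from an OpenAI-compatible
    // endpoint and divide completion tokens by wall-clock seconds. This is
    // not octo's actual implementation, just the same basic idea.
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: process.env.OPENAI_BASE_URL, // placeholder: your provider's endpoint
      apiKey: process.env.OPENAI_API_KEY,
    });

    async function benchTps(model: string): Promise<number> {
      const start = performance.now();
      const stream = await client.chat.completions.create({
        model,
        messages: [
          { role: "user", content: "Write a long story. Do not call any tools." },
        ],
        stream: true,
        // Ask the server to append a final usage chunk for an exact token count.
        // Not every OpenAI-compatible provider supports this; if yours doesn't,
        // counting stream chunks (roughly one token each) is a crude fallback.
        stream_options: { include_usage: true },
      });

      let completionTokens = 0;
      for await (const chunk of stream) {
        if (chunk.usage) completionTokens = chunk.usage.completion_tokens;
      }
      return completionTokens / ((performance.now() - start) / 1000);
    }

    // Hypothetical model ID; use whatever your provider calls Qwen3.
    console.log(`${(await benchTps("qwen3-coder-480b")).toFixed(1)} tps`);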

Edit: forgot to add that for Qwen3, everything should be running in FP8.

Just tried benchmarking from Mexico City, where I'm at for a wedding: looks like 130 tps for Qwen3 Coder 480B here.