There are a lot of knobs they could tweak. Newer hardware and traffic prioritisation would both make a lot of sense. But they could also shorten batching windows to cut queueing time at the cost of throughput, or keep the KV cache resident in GPU memory at the expense of the number of users they can serve from each GPU node.

I think it's just routing to faster hardware:

H100 SXM: 3.35 TB/s HBM3

GB200: 8 TB/s HBM3e (per Blackwell GPU)

That is about 2.4x the memory bandwidth (8 / 3.35 ≈ 2.39), which matches the speedup they are claiming. I suspect they are just routing to GB200 (or TPU etc. equivalents).
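
For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. It assumes decoding is purely memory-bandwidth bound, so tokens/s scales with bandwidth divided by the bytes streamed per token. The 100 GB per token figure is a made-up placeholder, not a known number for any real model, and it cancels out of the ratio anyway.

```python
# Back-of-envelope only: assumes decode speed is limited by HBM bandwidth,
# i.e. tokens/s ~ bandwidth / bytes read from memory per generated token.

H100_SXM_TBPS = 3.35  # HBM3 bandwidth quoted above
GB200_TBPS = 8.0      # HBM3e bandwidth quoted above

# Hypothetical placeholder: bytes of weights + KV cache read per token.
ASSUMED_BYTES_PER_TOKEN = 100e9


def decode_tokens_per_sec(bandwidth_tbps: float, bytes_per_token: float) -> float:
    """Upper bound on tokens/s for a single sequence if bandwidth is the only limit."""
    return bandwidth_tbps * 1e12 / bytes_per_token


h100_rate = decode_tokens_per_sec(H100_SXM_TBPS, ASSUMED_BYTES_PER_TOKEN)
gb200_rate = decode_tokens_per_sec(GB200_TBPS, ASSUMED_BYTES_PER_TOKEN)
speedup = gb200_rate / h100_rate

print(f"H100:  {h100_rate:.1f} tok/s")
print(f"GB200: {gb200_rate:.1f} tok/s")
print(f"speedup: {speedup:.2f}x")
```

Whatever you plug in for bytes per token, the ratio stays at 8 / 3.35, so the speedup only depends on the two bandwidth figures. Real serving also involves compute, interconnect, and batching, so treat this as a sanity check rather than a prediction.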

FWIW, I did notice that Opus was _sometimes_ very fast recently. I put it down to a bug in Claude Code's token counting, but perhaps it was actually just occasionally getting routed to GB200s.