Except there are providers that serve both Chinese models AND Opus, on the same hardware.
Namely, Amazon Bedrock and Google Vertex.
That means normalized infrastructure costs, normalized electricity costs, and normalized hardware performance. Most likely a normalized inference software stack, too. It's about as close to a 1:1 comparison as you can get.
Both Amazon and Google serve Opus at roughly half the speed of the Chinese models. Note that they have no incentive to slow down serving of either Opus or the Chinese models! So that speed ratio tells you the ratio of active parameters between Opus and the Chinese models.
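The reasoning above can be sketched in a back-of-envelope way: if decode throughput on identical hardware is roughly memory-bandwidth bound, tokens/sec scales inversely with the active parameter bytes read per token. The throughput numbers below are purely illustrative assumptions, not measurements of any real deployment.

```python
# Back-of-envelope sketch: assumes decode is memory-bandwidth bound,
# so tokens/sec is inversely proportional to active params per token.

def implied_active_param_ratio(tps_a: float, tps_b: float) -> float:
    """If model A decodes at tps_a and model B at tps_b on the same
    hardware and stack, the implied active-parameter ratio of A
    relative to B is tps_b / tps_a under this bandwidth-bound model."""
    return tps_b / tps_a

# Hypothetical numbers: model A at 30 tok/s, model B at 60 tok/s.
ratio = implied_active_param_ratio(30.0, 60.0)
print(ratio)  # 2.0 -> model A would have ~2x the active parameters
```

This ignores batching, speculative decoding, and quantization differences, any of which could skew the ratio in practice; it's only a first-order estimate.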
Deployments like Bedrock are nowhere near SOTA operational efficiency, 1-2 orders of magnitude behind. The hardware is much closer, but pipelining, scheduling, caching, recomposition, routing, and similar optimizations blow naive end-to-end setups out of the water.
And Microsoft's Azure. It's on all three major cloud providers, which tells me they can make a profit through these cloud providers without having to pay for any hardware. They just take a small enough cut.
https://code.claude.com/docs/en/microsoft-foundry
https://www.anthropic.com/news/claude-in-microsoft-foundry
AWS and GCP both have their own custom inference chips, so a better example of hosting Opus on commodity hardware would be DigitalOcean.
> Both Amazon and Google serve Opus at roughly ~1/2 the speed of the chinese models
The responses we got were about 10x slower, not 0.5x.
x86 vs. arm64 can perform differently, and the Chinese models could be optimized for different hardware, which could produce massive differences.
These providers do not run models on CPUs, so x86 vs. Arm is irrelevant.