> In practice, tps is a reflection of vram memory bandwidth during inference.

> Comparing tps ratios, say by noting a model is roughly 2x faster or slower than another model, can tell you a lot about the active param count.

You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but not training, obviously). You just end up with an increasingly deep pipeline. So time to first token increases, but aggregate tps also increases as you add additional hardware.

That doesn't work. Think about it a bit more.

Hint: what's in the kv cache when you start processing the 2nd token?

And that's called layer parallelism (as opposed to tensor parallelism). It allows you to run larger models (pooling VRAM across GPUs) but does not allow you to run models faster.
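To make the kv-cache hint concrete, here's a toy latency model (all numbers made up for illustration): during autoregressive decode, token t+1 can't enter the pipeline until token t has finished every stage, because its attention needs token t's kv-cache entries on each stage. So splitting layers across GPUs only adds inter-GPU hops to a single query.

```python
# Toy latency model: layer (pipeline) parallelism during autoregressive decode.
# All timing constants are hypothetical, for illustration only.

NUM_LAYERS = 80
TIME_PER_LAYER = 0.5e-3  # seconds per layer, assumed

def per_token_latency(num_gpus: int, hop_overhead: float = 0.1e-3) -> float:
    """Seconds to produce one decode token with layers sharded over num_gpus.

    The next token's attention reads the previous token's kv-cache entries
    on every stage, so decode is strictly sequential per query: sharding
    layers never removes work, it only adds activation transfers between GPUs.
    """
    hops = num_gpus - 1  # inter-GPU activation handoffs per token
    return NUM_LAYERS * TIME_PER_LAYER + hops * hop_overhead

single = per_token_latency(1)
sharded = per_token_latency(8)
print(f"1 GPU : {1 / single:.1f} tok/s for one query")
print(f"8 GPUs: {1 / sharded:.1f} tok/s for one query")
```

Aggregate throughput across *many concurrent queries* does scale (the pipeline stays full), which is exactly the confusion resolved at the end of the thread.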

Tensor parallelism DOES allow you to run models faster across multiple GPUs, but you're limited by how fast you can synchronize the all-reduce. And in general, models would get the same boost on the same hardware, so the Chinese models would have the same perf multiplier as Opus.
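A toy model of that trade-off (constants are assumptions, and real transformers pay roughly two all-reduces per layer rather than the one modeled here): weight reads are split across GPUs, so the bandwidth-bound part shrinks ~linearly, while the per-layer all-reduce cost does not. Note the resulting speedup factor depends only on the hardware constants, not on which model you run, which is why the multiplier is the same across models.

```python
# Toy model of tensor-parallel decode speedup. Timing constants are
# illustrative assumptions, not measurements.

def tp_speedup(num_gpus: int,
               layer_read_time: float = 0.5e-3,   # assumed per-layer weight-read time
               allreduce_time: float = 0.05e-3,   # assumed per-layer sync cost
               num_layers: int = 80) -> float:
    """Speedup over a single GPU when each layer's weights are split num_gpus ways."""
    base = num_layers * layer_read_time
    sync = allreduce_time if num_gpus > 1 else 0.0  # no all-reduce on one GPU
    sharded = num_layers * (layer_read_time / num_gpus + sync)
    return base / sharded

for n in (1, 2, 4, 8):
    print(f"{n} GPUs: {tp_speedup(n):.2f}x")
```

The speedup saturates as the all-reduce term starts to dominate, which is one reason providers stop around 8-way tensor parallelism.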

Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8x or so.

In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.

Oh I see. I went and confused total aggregate throughput with per-query throughput there, didn't I?