I'm actually currently studying this :)

Honestly... not that dramatically. Each release is much more marginal. And the official benchmarks that get quoted don't translate very well into the real world.

4.7 regressed hard in some ways. But a confounding factor is that the Claude Code harness seems to nerf the model after a few months, probably to reduce token use.

So far 4.8 seems less verbose, but we'll see what that actually translates into in practice.