It's a bit odd that you automatically assumed I don't understand the benchmarks.

For most single issues/bugs/tickets, the quality difference wasn't noticeable. But that's like using a sledgehammer to kill a fly. I was using Fable for much more ambitious and complex tasks that require orchestration, and it was crushing it. I described it here: https://news.ycombinator.com/item?id=48505782

So yes, the benchmarks are indeed accurate: where Opus 4.8 would start strong and eventually struggle or run into obstacles, Fable would relentlessly keep working, keep accurate track of all work threads (e.g. multiple interdependent issues being worked on in parallel by subagents), and go above and beyond.

I wasn't assuming anything; I was speaking generally.

The flow you describe in that comment is rather simple, in my opinion, and with the right harness even Sonnet would drive most of it.

I judge a model by its ability to fix bugs in complex codebases and by the direction it takes on architecture. In my opinion, that's a tad more complex (and easier to measure objectively) than orchestrating tickets, no matter how complex.