Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small mergeable PR (according to my team). 20 tasks of varying difficulty, 5 attempts each, with an LLM as judge to evaluate accuracy (same outcome and quality, but allowing for acceptable variances).
Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, and GPT-5.5.
Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long-horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.
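For anyone wanting to reproduce the setup above (N tasks, several attempts each, one pass/fail verdict per attempt from the judge), here is a minimal sketch of how the scores could be rolled up into a ranking. This is not our actual harness: the function names, the `(model, task_id, passed)` tuple shape, and the choice of Python are all assumptions for illustration, and the sample verdicts are made up.

```python
from collections import defaultdict
from statistics import mean


def aggregate(verdicts):
    """Roll per-attempt judge verdicts up into one score per model.

    verdicts: iterable of (model, task_id, passed) tuples, one per attempt.
    Each task contributes its pass rate across attempts, so a task with
    more attempts does not weigh more than the others.
    """
    per_task = defaultdict(list)
    for model, task_id, passed in verdicts:
        per_task[(model, task_id)].append(1.0 if passed else 0.0)

    per_model = defaultdict(list)
    for (model, _task_id), outcomes in per_task.items():
        per_model[model].append(mean(outcomes))

    # Final score: mean of the per-task pass rates.
    return {model: mean(rates) for model, rates in per_model.items()}


def rank(scores):
    """Return model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


# Made-up verdicts for two hypothetical models, purely to show the shape.
sample = [
    ("model_a", "task_1", True),
    ("model_a", "task_1", False),
    ("model_a", "task_2", True),
    ("model_b", "task_1", False),
    ("model_b", "task_2", False),
]
scores = aggregate(sample)
ranking = rank(scores)
```

Averaging per task first, then across tasks, is one design choice among several; pooling all attempts would give the same number only when every task has the same attempt count.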
Does your team then manually decide the results by going over the PRs? I suppose you know what you're looking for now, but isn't this still quite painful?
I'm starting a repository of LLM reviews [1] with the goal of creating a catalog that is more task-oriented and less marketing-y than corporate blogs or benchmark leaderboards. You seem to have a lot of experience across a bunch of different models: if you have a chance and feel like sharing, you'd be one of the first.
[1] - https://model.reviews/ - all the user-submitted content is CC licensed and will be available for download in periodic dumps.