Hacker News

Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small mergeable PR (according to my team). 20 tasks of varying difficulty, with 5 attempts each, and an LLM as judge to evaluate accuracy (same outcome and quality, but allowing for acceptable variances).
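For anyone wanting to try a similar setup, here is a rough sketch of how the scoring could be aggregated. It is not the actual harness. It assumes the judge returns a simple pass/fail verdict per attempt, and the names `score_model` and `rank_models` and the data layout are made up for illustration.

```python
from statistics import mean

def score_model(verdicts):
    """Aggregate judge verdicts for one model.

    verdicts: dict mapping task id -> list of booleans, one per attempt
    (True means the judge accepted the attempt). Hypothetical layout.
    Returns (per-task pass rate, mean pass rate across tasks).
    """
    per_task = {
        task: mean(1.0 if ok else 0.0 for ok in attempts)
        for task, attempts in verdicts.items()
    }
    overall = mean(per_task.values())
    return per_task, overall

def rank_models(all_verdicts):
    """Return model names sorted by overall pass rate, best first.

    all_verdicts: dict mapping model name -> verdicts dict as above.
    """
    overall = {name: score_model(v)[1] for name, v in all_verdicts.items()}
    return sorted(overall, key=overall.get, reverse=True)

if __name__ == "__main__":
    # Placeholder data only; not real benchmark results.
    demo = {
        "model_a": {"task_1": [True, True, False, False, True],
                    "task_2": [False] * 5},
        "model_b": {"task_1": [True] * 5,
                    "task_2": [True, False, False, False, False]},
    }
    for name in rank_models(demo):
        print(name, round(score_model(demo[name])[1], 2))
```

Averaging per task first and then across tasks keeps every task equally weighted, even if some attempts fail to run and a task ends up with fewer than 5 verdicts.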

Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, GPT-5.5.

Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long-horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.

munksbeer 5 hours ago

Does your team then manually decide the results by going over the PRs? I suppose you know what you're looking for now, but isn't this still quite painful?

m-dot-reviews 19 hours ago

I'm starting a repository of LLM reviews [1] with the goal of creating a catalog that is more task-oriented and less marketing-y than corporate blogs or benchmark leaderboards. You seem to have a lot of experience across a bunch of different models: if you have a chance and feel like sharing, you'd be one of the first.

[1] - https://model.reviews/ - all the user-submitted content is CC licensed and will be available for download in periodic dumps.