I've been making an auction site and have been using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms etc. I was mostly using GPT 5.5 xhigh to code up the scenario, then looping over it with Opus 4.8 to check it.
Out of curiosity I asked Fable to review it all and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through, for example:
- all intermediaries were given the prices of all buyers up front
- private price information in certain auction types was actually being broadcast to everyone
- multiple contradictions in instructions
If it was any one of these things then I might have understood - but the fact that so many got past both Opus and GPT 5.5 makes me think that Fable has something special. This is a common-sense type of thing that I think you only notice when your task doesn't involve a measurable metric, but rather some sort of fuzzy real-world task.
There's clearly a problem with all these measures of performance when the difference between these models was night and day in my specific task.
Unless you're coming up with a deterministic set of criteria for evaluating these bugs and issues, every single model is going to keep telling you it finds new things and to fix them.
I'm sure you said the same "find mistakes please" thing to Opus 4.8 and GPT 5.5 when you were using $previous_amazing_latest_model, and they also found and fixed them.
Once the next "Fable"-type model comes out I'm sure it's going to find even more mistakes that the "special" Fable made.
You're using these models to make mistakes and then using upgraded versions of them to find their previous mistakes and fix them, until a new version comes along that can magically fix even more mistakes their previous versions made. There's no end to it.
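For what it's worth, a deterministic criterion for the leak-type bugs mentioned upthread can be pretty small. This is only a rough sketch, not anyone's actual harness: it assumes the swarm logs every prompt and message with an explicit recipient list, and `Message`, `find_leaks`, and the agent names are all made up for illustration. It also uses a crude substring match on the private value, which would miss paraphrased or rounded prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One logged prompt or message (hypothetical log format)."""
    sender: str
    recipients: frozenset[str]
    text: str


def find_leaks(messages, private_values):
    """Return (owner, recipient, text) for every message that shows
    an agent's private value to anyone other than that agent.

    private_values maps agent name -> the private value as a string.
    Matching is a plain substring test, so it is only a first-pass check.
    """
    leaks = []
    for msg in messages:
        for owner, secret in private_values.items():
            if secret not in msg.text:
                continue
            # Everyone who saw the value except its owner counts as a leak.
            for recipient in sorted(msg.recipients - {owner}):
                leaks.append((owner, recipient, msg.text))
    return leaks


if __name__ == "__main__":
    log = [
        Message("system", frozenset({"buyer_1"}), "Your max price is 417.25"),
        Message("system", frozenset({"broker_1", "buyer_2"}),
                "Buyer limits: buyer_1=417.25"),
    ]
    for leak in find_leaks(log, {"buyer_1": "417.25"}):
        print(leak)
```

Run against every scenario after each change, a check like this either returns an empty list or it doesn't, regardless of which model wrote or reviewed the code.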
Yes - I was thinking this - however I had already worked on it so many times with Opus and GPT that I thought they had had enough time to realise some common sense things that Fable just got and understood on the first pass. The difference seemed significant enough to comment on.
Maybe you are something special by letting those slip through in the first place?..
The point is that there's a difference in these models and everyone is looking for where the differences are. stop being an arse.
GP literally caught them?
Prompt: can you reformat your sentence to be less unkind?
This conversation is about capabilities of Fable 5 vs. older models, not about the GP's abilities.
It's just much more thorough and spins up a lot of subagents to basically do a lot more E2E testing. Not necessarily smarter; imo you could get the same result with a lesser model by procedurally prompting it, just with a lot more compute and orchestration.
I had to specifically tell Fable not to use a bunch of subagents in order to preserve my token allowance.
This seems like the exact project you should try out Codex Security for. It catches a lot of stuff:
https://chatgpt.com/codex/cloud/security/
> ... and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through
Wait... Are you telling me models everybody told me were better than coders up to just one month ago are actually making lots of mistakes?
This is shocking.