I have a feeling that such posts come from a parallel reality. In my anecdotal experience, confirmed by my (still subjective) benchmark (https://pshirshov.github.io/llm-bench-pi-oneshot/), Fable is not _that_ impressive. It performs on par with gpt-5.5 and opus 4.8, sometimes better, sometimes worse; it's definitely more expensive, and it likes to refuse to answer questions about React, saying it can't help with chemistry.
Is this fuss really grounded, or is it just some pre-IPO AGI hype?
My experience with Fable since its release matches Simon's.
I've been having it orchestrate complex implementations. I give it a parent ticket (issue) on Linear and say "look at the sub-issues on this ticket and determine which ones you can implement yourself, in which order, and determine how your implementation will need to be coordinated with what is currently being worked on by other team members". These tickets are not trivial. They have a lot of moving parts, as well as dependencies between them, both inside the same project and across projects (e.g. backend).
Fable then chooses tickets, delegates each ticket to a subagent (also Fable), which looks at Figma designs for the ticket, implements it perfectly (following repo guidelines and conventions to the letter), takes screenshots of each piece, writes detailed commit messages and PR descriptions, then posts the screenshots in them as evidence. Then it provides a summary in the form of "you'll need to make sure PR #1283 is merged first - btw there were no Figma designs for such-and-such screen but I looked at similar screens that have been implemented and adopted the pattern".
That's probably like... 20% of what it can do. It's a truly, legitimately powerful model.
Opus 4.8 could do a lot of this too, but required a lot of hand-holding, and when it came across a blocker it was likely to just stop and say "I was able to get this far, but I can't proceed."
Ok, explain one thing to me: I have a benchmark - I feed an identical prompt to multiple models. Codex produces a rough but working program. Fable produces the same - but with more bugs than Codex. Opus produces something similar to Codex but with a critical bug.
That describes all my tests with Fable.
Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?
I mean, well, yes, it is impressive. It can quickly generate a lot of garbage which sorta does look like code. The other two can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?
>> Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?
I don't care if you're hyped or not. You asked if the posts like the OP come from a "parallel reality" and I said no and described my experience. If you're getting good/better results with Codex than with Fable, you should probably continue using that, since it's cheaper and faster.
But can you bring anything measurable in support of your words? I did.