> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.
And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.
Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.
Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.
Yes, these are gut feelings. That said, I have lots of experience with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose project matters to them. :P
Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should be specific about what you want rather than as broad as "do not make mistakes". It just does not work that way.
There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.
So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...
Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.
No, relative performance between Python and Java can absolutely be measured.
Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.
> These comparisons are all gut feelings.
https://simonwillison.net/about/#disclosures
"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."
But I'm totally unbiased on my gut-feeling posts, trust me bro.
-- AI influencers.
Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed.
It is possible to check for improvements. See for yourself:
https://generative-ai.review/2026/06/claude-fable-rush-test-...
As mentioned in another HN thread, I've done qualitative side-by-side measurements of Claude Fable vs Opus 4.8 vs ChatGPT 5.5.
Anyone is able to check the output for themselves and form a judgement.
Large visible improvements for Fable over Opus 4.8 and ChatGPT 5.5.
I recently did the same to show the progress from Opus 3.4/ChatGPT o3pro one calendar year ago.
Sorry, this post gets me irrationally irritated and makes me want to shake you and shout.
That website is 95% not you, it's AI, and I feel that's causing you to way over-represent the value of it in your response here, or you're completely misunderstanding what the person you're responding to is asking. If you put all of your effort into that site, without AI, it would be infinitely more valuable and useful.
The person you responded to asked for specific things, including:
- objective, unbiased measurements, but all that page has is side by side visual comparison of outputs.
- their different generations, but all you included was the outputs
- details on the prompts and little things people are adding because they feel they need to, but you didn't include any of that
This is slop, it's the exact sort of self-confirming fluffy AI stuff that other engineers, either inexperienced or over-invested in AI, will look at briefly, skim, see quick visual validation, and nod, noting down how much better Fable must be without getting any actual data.
Sorry, it's early, and maybe this is a misplaced rant, but the person you responded to specifically asked for precise, quantitative things precisely because everything else is fluffy slop like this, and people don't even recognise they're doing it any more.
How is this meaningfully different than simonw's pelicans riding a bicycle? If anything, this seems to be of a higher caliber?
Check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice of and access to models.
Fable just got announced and I did a rushed-out article because people are curious. I released the post mere hours afterwards and it takes time to create the output, slice into videos, make a WordPress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.
If you check the links, my previous articles have all the juicy stuff that you are criticising this one, written with little preparation, for not having.
How is a side by side direct comparison NOT precise?
[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.
[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advance warning.
I did browse and check the links. This was the first link I went to: https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... as it's the main one on the page, and I saw more qualitative stuff without quantitative stuff.
I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.
I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.
Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.
There are benchmarks if you want quantitative results. Mine is qualitative, and clearly billed as such. Comparison and contrast are still possible.
This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.
It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.
In my opinion, if one cannot express themselves civilly, they should refrain from commenting.
I disagree. I wouldn't consider it unhinged. I'm clearly aware of my own frustration. It's also relatively civil, since I was able to temper it with appropriate apologies and acknowledgements. Many other people agree and support the sentiment of what I'm saying.
AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.
It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.
I found the website you ranted about interesting, comparing the quality of the visualization between the different models.
I don't think it was "a huge waste of time" or needed your rant.
You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.
What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.
It feels like hand written software will now be "bespoke"
That’s what evals are for.
And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.
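For anyone who hasn't written one, here is a minimal sketch of what such an eval could look like, assuming each task has a deterministic pass/fail check. Everything in it is hypothetical: `Task`, `run_eval`, `compare_prefixes` and `toy_model` are names made up for illustration, and `toy_model` is a hard-coded stand-in for a real LLM call, so the numbers it produces say nothing about any actual model. The `max_turns` loop is a crude version of the multi-turn case: a failed attempt is fed back into the prompt and the model gets another try.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is any callable that maps a prompt string to an output string.
Model = Callable[[str], str]


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # deterministic pass/fail check on the output


def run_eval(model: Model, tasks: List[Task], prefix: str = "",
             max_turns: int = 1) -> float:
    """Return the fraction of tasks passed, with up to max_turns attempts each."""
    passed = 0
    for task in tasks:
        prompt = prefix + task.prompt
        for _ in range(max_turns):
            output = model(prompt)
            if task.check(output):
                passed += 1
                break
            # Feed the failed attempt back in, as an agent loop would.
            prompt += f"\nPrevious attempt failed: {output!r}. Try again."
    return passed / len(tasks)


def compare_prefixes(model: Model, tasks: List[Task], prefixes: Dict[str, str],
                     max_turns: int = 1) -> Dict[str, float]:
    """Run the same tasks once per prompt prefix and report each pass rate."""
    return {name: run_eval(model, tasks, prefix, max_turns)
            for name, prefix in prefixes.items()}


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: gets 3*3 wrong until asked to retry."""
    if "2+2" in prompt:
        return "4"
    if "3*3" in prompt:
        return "9" if "Try again" in prompt else "6"
    return ""


TASKS = [
    Task("What is 2+2?", lambda out: out.strip() == "4"),
    Task("What is 3*3?", lambda out: out.strip() == "9"),
]

PREFIXES = {
    "baseline": "",
    "expert": "You are an expert software engineer. ",
}

if __name__ == "__main__":
    print(compare_prefixes(toy_model, TASKS, PREFIXES))
    print(compare_prefixes(toy_model, TASKS, PREFIXES, max_turns=2))
```

With a real model you would swap `toy_model` for an API call, use far more tasks, and repeat runs to account for sampling noise. The point is only that "does this prompt prefix help?" can be turned into a pass-rate comparison instead of a vibe.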
fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)
How many $ do you guys spend when your session runs for 30min? What's the total budget?
I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.
"Don't make mistakes" does seem dumb. It's not guidance.