I've spent enough time with this now in Claude Code (and Claude.ai and Claude Code for web) to have an opinion on Fable 5: it's a beast. I'm throwing some VERY difficult problems at it - things I've been dragging my heels on for months - and it's crunching through them very happily.

One that I'm willing to share (albeit from just a week ago) - I built a Python library last week that bundles MicroPython compiled to WASM to create a sandboxed code execution library: https://github.com/simonw/micropython-wasm

I just told Claude.ai (not even Claude Code - this was the standard Claude chat interface) running Fable 5:

  Clone simonw/micropython-wasm from GitHub
  and research how this could use a full
  Python as opposed to MicroPython

A few prompts later (I also uploaded the zip files from https://github.com/brettcannon/cpython-wasi-build/releases/t... because Claude chat can't access those files itself) I have a wheel file that bundles Python itself, compiled to WASM:

  uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
    cpython-wasm -c 'print(45 ** 56)'

Here's the transcript: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

(It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.)

> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.

There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

No, relative performance between Python and Java can absolutely be measured.

Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.

Yes, these are gut feelings. That said, I have lots of experience with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose projects matter to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should be specific about what you want rather than as broad as "do not make mistakes". It just does not work that way.

It is possible to check for improvements. See for yourself:

https://generative-ai.review/2026/06/claude-fable-rush-test-...

As mentioned in another HN thread, I've done a qualitative side-by-side comparison of Claude Fable vs Opus 4.8 vs ChatGPT 5.5.

Anyone is able to check the output for themselves and form a judgement.

Large visible improvements for Fable over Opus 4.8 and ChatGPT 5.5.

I recently did the same to show the progress from Opus 3.4/ChatGPT o3pro one calendar year ago.

Sorry, this post gets me irrationally irritated and makes me want to shake you and shout.

That website is 95% not you, it's AI, and I feel that's causing you to way over-represent the value of it in your response here, or you're completely misunderstanding what the person you're responding to is asking. If you put all of your effort into that site, without AI, it would be infinitely more valuable and useful.

The person you responded to asked for specific things, including:

- objective, unbiased measurements, but all that page has is side by side visual comparison of outputs.

- their different generations, but all you included was the outputs

- details on the prompts and little things people are adding because they feel they need to, but you didn't include any of that

This is slop, it's the exact sort of self-confirming fluffy AI stuff that other engineers, either inexperienced or over-invested in AI, will look at briefly, skim, see quick visual validation, and nod, noting down how much better Fable must be without getting any actual data.

Sorry, it's early, and maybe this is a misplaced rant, but the person you responded to specifically asked for precise, quantitative things precisely because everything else is fluffy slop like this, and people don't even recognise they're doing it any more.

Check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice and access to models.

Fable just got announced and I rushed out an article because people are curious. I released the post mere hours afterwards, and it takes time to create the output, slice it into videos, and make a WordPress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.

If you check the links, my previous articles have all the juicy stuff you are criticising me for leaving out of this one, which I wrote with little preparation.

How is a side by side direct comparison NOT precise?

[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.

[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advance warning.

I did browse and check the links. This was the first link I went to: https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... as it's the main one on the page, and I saw more qualitative stuff without quantitative stuff.

I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.

I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.

Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.

There are benchmarks if you want quantitative results. Mine is qualitative, and clearly billed as such. Comparison and contrast are still possible.

This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.

It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.

In my opinion, if one cannot express themselves civilly, they should refrain from commenting.

It feels like hand written software will now be "bespoke"

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.
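A minimal sketch of what such an eval could look like for the "does the role-play prefix help?" question raised upthread. Everything here is an assumption for illustration: `stub_model` is a hypothetical stand-in for a real model call (or a whole agent loop), and the two arithmetic tasks are toys, not a real benchmark.

```python
from typing import Callable

# Each task pairs a prompt with a checker that decides pass/fail on the reply.
Task = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], prefix: str,
              tasks: list[Task], trials: int = 1) -> float:
    """Fraction of (task, trial) runs whose reply passes the task's checker."""
    passed = 0
    for prompt, check in tasks:
        for _ in range(trials):
            reply = model(prefix + prompt)
            passed += check(reply)  # True counts as 1, False as 0
    return passed / (len(tasks) * trials)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model or agent; swap in an API client."""
    # Toy behaviour: evaluate the "a+b" expression after the last colon.
    expr = prompt.rsplit(":", 1)[-1].strip()
    a, b = expr.split("+")
    return str(int(a) + int(b))


tasks: list[Task] = [
    ("Compute: 2+3", lambda reply: reply.strip() == "5"),
    ("Compute: 10+7", lambda reply: reply.strip() == "17"),
]

# Same tasks, same checkers, only the prompt prefix differs.
baseline = pass_rate(stub_model, "", tasks)
with_role = pass_rate(stub_model, "You are an expert software engineer. ", tasks)
print(f"baseline={baseline:.2f} with_role={with_role:.2f}")
```

With a real, non-deterministic model you would raise `trials` and compare the two rates with some notion of noise in mind; the stub is deterministic, so both rates come out identical here.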

fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

How many $ do you guys spend when your session runs for 30min? What's the total budget?

I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.

Yes, exactly this. If I didn't care about price at all, I'd exclusively use this model. It functions more like an actual engineer. I'm in the midst of a DB migration, and e.g. 5.5 continually suggests stuff like "use DB X instead of DB Y for task Z because it's 30% faster", which isn't an option in reality, given we are migrating DBs. Fable jumped in, reduced allocs by literally 46x, found multiple bugs 4.8 and 5.5 created (max file system usage, correctness issues, etc), and continually suggested awesome improvements unprompted. As in, it would finish a task and then suggest we tackle this other existing problem I didn't know about in a very specific manner... this is the first model that feels like it's coming for my job.

I'm having the same experience. I'm in the process of implementing a new CRDT for realtime collaborative editing. There just aren't a lot of implementations of CRDTs kicking around online for opus or any of the other models to have good design instincts.

Fable is doing - so far - a great job. I just had one big question around how part of it should work. I had a design sketch, but with some big unknowns. I asked Fable to figure it out via reasoning and prototyping, and it did - it even, on its own initiative, wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it. And it found, and fixed, a couple bugs that I'd missed.

I'm sure its weaknesses will become apparent in time. But wow, this thing is a beast. It's the first time I'm reading the work of an LLM without spotting obvious weaknesses in its reasoning and code. I'm really impressed.
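For readers who haven't seen one, here is a rough sketch of what fuzzing a CRDT for convergence can look like. This is not the CRDT or the fuzzer described above: it uses a textbook state-based grow-only counter as a stand-in, and the names `GCounter` and `fuzz_once` are made up for the example.

```python
import random


class GCounter:
    """State-based grow-only counter: one slot per replica, merge = element-wise max."""

    def __init__(self, replica_id: int, n_replicas: int):
        self.id = replica_id
        self.slots = [0] * n_replicas

    def increment(self) -> None:
        # A replica only ever bumps its own slot.
        self.slots[self.id] += 1

    def merge(self, other: "GCounter") -> None:
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

    def value(self) -> int:
        return sum(self.slots)


def fuzz_once(seed: int, n_replicas: int = 3, steps: int = 200) -> None:
    """Apply a random mix of increments and merges, then check convergence."""
    rng = random.Random(seed)  # seeded so any failure can be replayed
    replicas = [GCounter(i, n_replicas) for i in range(n_replicas)]
    total = 0
    for _ in range(steps):
        replica = rng.choice(replicas)
        if rng.random() < 0.5:
            replica.increment()
            total += 1
        else:
            replica.merge(rng.choice(replicas))
    # Final sync: every replica merges in every other replica's state.
    for a in replicas:
        for b in replicas:
            a.merge(b)
    # Convergence: all replicas agree, and no increment was lost or duplicated.
    assert all(r.slots == replicas[0].slots for r in replicas), f"diverged, seed={seed}"
    assert replicas[0].value() == total, f"wrong total, seed={seed}"


for seed in range(100):
    fuzz_once(seed)
```

The seed in the assertion message is the useful part: a failing run can be replayed exactly. A sequence CRDT for text would need a much richer operation generator, but the shape (random ops, random delivery, assert convergence) stays the same.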

I was about to ask where you work that you’re implementing new CRDTs and then I noticed your username! Thanks for all that you do!

I work on the live collab at my company, and using AI while coding has only recently sort of “clicked” for me. We use an (I’m pretty sure) unheard of algorithm for collaborative editing, and I’ve had a long term goal of turning it into an implementation of EG Walker, but our document model is very complex and most out of the box CRDTs don’t quite fit. Maybe Fable will be what gets me over the hump.

Long shot here because I'm not knowledgeable enough about CRDTs but maybe something like DSON would help? I saw a talk about it a while ago and it might be useful.

https://blog.helsing.ai/posts/dson-a-delta-state-crdt-for-re...

https://www.youtube.com/watch?v=4QkLD7JhD_I&pp=ygUJZHNvbiBjc...

I’d be fascinated to hear more if you’re willing to share. What is special about your document model which makes existing tools like automerge a bad fit?

> wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it.

For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.

I’ll ask it for a formal proof when I get home and see how it goes.

I’ve read plenty of papers with “formal proofs of correctness” that turned out to have huge flaws. Machine verifiable proofs I trust. But I’ve personally found more bugs with fuzzing than I have via proofs.

In the real world, many of us don't have the time to create formal proofs. But our instinct in testing where edge cases may exist in code that we wrote is a type of refactoring that happens in our brains during the coding process. Hand the coding off to a machine and you have no idea where to start looking for the flaws.

Hello joseph,

I was scanning the comments and saw you mentioned CRDT. Just wanted to mention that I implemented a CRDT-flavoured sync engine for the product I'm working on a while ago, I think it was with Opus 4.6 if I'm not mistaken (or earlier) so it's not something new to Fable 5, just fyi.

> this is the first model that feels like it's coming for my job

Damn you must be good, I've been feeling this for around 2 years now

It's been obvious for at least 2 years, anyone who doesn't see the writing on the wall simply hasn't learned how to use these well or has severe exponential blindness.

"But it doesn't do well when writing my undertrained language" - yeah, fine. Yet. Reasonable code in that is probably one RAG + verification scaffold deployment around Mythos or maybe mythos+1. Just like it was for you learning it, because you knew how to _program_.

One thing I can tell you is you are either favored by Anthropic, or your version of the CLI does not exhaust limits, or there's some major bug, as two people around me (myself included) claim it took half an hour to hit the ceiling. Which makes it practically unusable, where the same workflow a day ago produced a good 5-6 hours of workload with several agents.

Monetization is coming. They'll tell companies that AI is replacing your workers, so it is still worth paying 100K/year for the license, as those AIs are not going to jump to another job, get sick, be late, complain, require free coffee and so on.

Soon the times of AI for $20/$200 a month will be long gone.

Get people hooked, tell them spending time coding is no longer needed, let their skills deteriorate, tell them they need to cough up for a licence to do their job

Forcing developers to pay for models that were built on code they scraped scot-free

A tax to do their job that developers are jumping at the chance to pay

Everybody's finally realising that node dependencies are a threat, but letting these AI companies gatekeep the industry is a bandwagon people are scrambling towards

> Forcing developers to pay for models that were built on code they scraped scot-free

That's also caused by some very smart (even brilliant) developers (you can see many of them in this very thread) choosing to be oblivious about all this and bury us all under, hoping that they'll be among the last ones to go. Writing this down I realise that they maybe aren't all that smart.

As someone noted here recently - use the frontier models as much as you can, while you can.

Thankfully, we have Chinese models we can use for a fraction of the price.

Not everyone needs a Ferrari to do the weekly shopping.

A Ferrari will likely lap you when you’re racing, though, and the market and the economy is a race. You’ll be facing a question soon, or your employer will, whether to spend a significant chunk of free cash on fable-class tokens or on literally anything else instead - wages and salaries included.

<< You’ll be facing a question soon, or your employer will

Maybe? If you talk to executives, the impression that I am getting is that they tend to be somewhat misinformed at best, which, yes, is bound to result in some really bad decisions down the road. But, and it is not a small but, the ones I did talk to ( and, amusingly, those are the ones with strong opinions ) don't seem to have a lot of, um, practical exposure to this tech beyond what they heard at the watercooler. Honestly, it is kinda infuriating. And all this before we get to how companies want to say they use AI, but also keep cost down.

Yeah, sure. In the same way I can see only Ferraris driving as taxis, company cars, transport vehicles, used by post, delivery services ...

You and your work are not that special, you are not participating in car races, and you don't need a Ferrari.

Yeah same here, Fable on "high" is producing substantially better results than Opus 4.8 on xhigh for me and my actual real-world evals today. It "feels" smarter and doesn't use nearly as many tokens running in circles. As a result I've been able to run two large refactors today without hitting the context limit danger zones - it's more expensive but also more efficient. It's been able to find some bugs that Opus missed. Pretty impressive stuff.

I keep getting this message:

> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

I'm working on an internal tool that does new business prospecting data collection, scoring, etc. This is ridiculous.

It’s unusable for me due to the refusals. I’m using claude to find patterns in health data

I do some work in laboratory automation and it was quick to refuse the first thing I asked it to do. There wasn't anything spicy in the request, just basic liquid-handling protocol implementation. Their position seems to be that they're too stupid to classify requests safely, and that seems reasonable to me. I'd guess the classifier will improve rapidly.

Have you tried locally running qwen?

Same. I'm working on a set of python and matlab scripts that deals with segmenting MRI images into brain vs skull, and it thinks that's bioterrorism.

Quite counterproductive to refuse to help on health issues too. If they detect health data, they can add a disclaimer, but not hide the information.

You miss the point - by collecting and processing medical data they would fall into a thoroughly regulated industry. Not because they may provide you with incorrect data, but because they are not allowed to process it.

There’s no way around it? Can’t you obfuscate as generic data and use keys to map to the real data?

I guess you could even turn everything into numbers, not a bad idea at all!
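A rough sketch of the "keys that map back to the real data" idea, for what it's worth. The `Pseudonymizer` class and the record fields are invented for the example, and this says nothing about whether it would satisfy any regulation or get past Fable's classifier; the remaining values can still be identifying on their own.

```python
import secrets


class Pseudonymizer:
    """Swap sensitive strings for opaque tokens; the mapping never leaves your machine."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def encode(self, value: str) -> str:
        # Reuse the same token for repeated values so joins and grouping still work.
        if value not in self._forward:
            token = "id_" + secrets.token_hex(4)
            while token in self._reverse:  # extremely unlikely collision, retry
                token = "id_" + secrets.token_hex(4)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def decode(self, token: str) -> str:
        return self._reverse[token]


# Hypothetical records: only the "patient" field is treated as sensitive here.
records = [
    {"patient": "Alice Example", "reading": 72},
    {"patient": "Bob Example", "reading": 88},
    {"patient": "Alice Example", "reading": 75},
]

mapper = Pseudonymizer()
obfuscated = [{"subject": mapper.encode(r["patient"]), "reading": r["reading"]}
              for r in records]

# `obfuscated` is what you would paste into the chat; map results back locally.
restored = [mapper.decode(row["subject"]) for row in obfuscated]
```

Random tokens are used instead of a hash of the name so that nobody can confirm a guess by hashing it themselves.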

What custom prompt do you have set up? If you tell it your occupation, does it turn helpful? There was a study showing that if you tell the models they tested that you're a patient, they refuse, but tell them you're a doctor and suddenly they turn helpful.

According to the model, it’s not the model itself that’s doing this, it’s the harness.

Assuming the model is being “truthful”, CC is just being stupid in its detection mechanism.

what prompts do you use for this?

Anthropic knows it refuses too much, they want to be very cautious to avoid any scandals. I think this is why they want to store all Fable and Mythos chats for 30 days so they can use the data to improve.

They want to be very cautious to honour the important doctrine at least until the IPO launches: we are so good we have to nerf our products.

I’m at a point where I expect everything I do will be retained indefinitely.

I’m having a really hard time believing some weak reason for a 30 day retention policy.

I wonder if it sees Healthcare companies being targeted and that's why it's freaking out; clearly they have some pretty stupid regexes in the harness to detect this sort of shit.

e: I quit the session and went back in. Set it to Fable and told it to continue the last session. It's moving along as if none of that had happened.

How weird.

I wonder if this letter has anything to do with why anything even remotely related to biology is getting flagged.

https://www.wired.com/story/openai-anthropic-letter-ai-biolo...

I asked a question for my son about how mosquitos carry malaria and Fable was like “ok now hold it right there”

Same here. It's been rushed for the IPO (in my opinion).

Or people were quitting their subscription for codex-5.5 and it was beginning to show up in their metrics.

Or development had gotten to a point where they need real world usage to tune product and refusals.

Or Fable’s arch is different enough that clusters of compute were allocated targeting a date, and here we are, ready or not.

Or…

Same. I am working on music firmware for an existing device. I can't proceed as it keeps switching to Opus.

Obviously, soon, for anything valuable, you will have to buy a "special license for biology/security/finance advice" from Anthropic.

Question is if there will be any competition in this area...

It still does make errors, yes? Because it is not usable if we need to verify everything. AI is only interesting if it can do things that humans can not do. If you can verify results because you can do it yourself, then why use AI? It will just tie up highly skilled people in verification work. Instead these people should do the actual work, and results will come quicker.

So AI is only interesting to you / your org / humans if it can do things that you can not achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?

If we can not rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.

One does not need to be able to create it themselves to evaluate if the output is correct. Consider for example that you can easily determine if a meal tastes delicious without being an expert chef, or the fact that NP problems are very difficult to solve but make for easily verifiable solutions.

Humans make mistakes too, does it mean humans are unusable? We accept as empirical fact that most production quality code has 2 - 10 bugs per 1k LoC. According to your premise, virtually all existing software is therefore unusable.

What if an LLM overall starts to make fewer mistakes than an average developer, costs less than that developer's salary and is 100x faster? For sure, the companies that leverage these with just a few senior devs doing prompting, testing and requirements analysis will outcompete other organizations.

Humans make mistakes and then learn from them. A really good expert would never deliberately copy-paste an obscure solution from the internet and then ask for forgiveness later.

AI agents do that, perhaps not always, but they still do. Now the question: would I trust AI without verifying its output?

Yeah, it makes the same old errors, being confidently wrong then sorry... I mean, it is still an LLM

AI is like a junior developer. You have to review her code carefully but she is most definitely useful.

Why is your AI a she? What's up with gendering LLMs? Reminds me of Richard Dawkins calling Claude "Claudia" and insisting it is conscious.

This is part of the training data now. She can hear you, you know...

Still does not crack my hardest nuts. Gave it one of them and it blew through my entire allowance on thinking about one question, with no apparent answer in sight!

I see a lot of people saying they are happy with weaker models, but I am the opposite, I need more strength, more intelligence!

I am quite happy that Opus 4.8 can do some medium intelligence problems. And maybe Fable 5 can do some more of those! I have a lot of problems to solve!

I also see a lot of people saying they are happy with weaker models.

At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.

The results were near useless.

The error rate is through the roof, it's constantly incorrect in its conclusions even when investigating very simple issues.

Further the models are too unreliable to even move 20 line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.

Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.

I have Qwen 3.6 27B and 35B running locally and coming from Opus it feels like talking to an imposter. Someone who pretends to be competent, but really isn’t. Results are always disappointing. Sonnet is better, but I have given up on asking it. Even for simple things I wait for my Opus limits to reset.

What kind of problems are you trying to have it solve?

The Riemann hypothesis, PvNP, and the Collatz conjecture.

Not these. I wonder if the well is poisoned there. The models know that these are "unpossible", so it might not solve them just because… Maybe some day.

I am just testing it on stuff I know intimately myself. I would probably not understand a proof of Collatz if it was dancing in front of me!

So, what kind of problems are you having it try to solve?

Sorry to belabor this but it's basically pointless saying you have nuts it can't crack without showing us the nuts.

I don’t care to share my exact problems. Mostly because GPT-5.5 hallucinates false solutions, and I would rather not have people reply with "Oh but ChatGPT solves it!", because it takes expert knowledge to debunk them. To its credit, ChatGPT will admit its very fundamental mistakes when they are pointed out. But also because no-one would really care.

I gave a high level description of the problems in a sibling thread. They are the kind of small problems which I suppose every researcher has lying around, waiting for them to think about some day. But not the big problem everyone is waiting for to be solved.

My comment was not meant to be a tease – sorry! I assumed there would be other people in a similar situation, who might relate.

Bro, you are being left behind bro, it's amazing bro...

That's a bit of a tricky point. I have had quite a lot of problems with models informing me what I am attempting is impossible. If no-one has done it, or at least it doesn't know about it being done it tends to fall back on people voicing their baseless speculations, and for just about anything you propose, you can find a person who will loudly proclaim it is impossible.

The curse of the 'use case' comes in here too. When people think that everything should have a use case, that's a lot of training data suggesting to a model that things should only be used for what someone has already thought of.

A couple of times I have had to manually code proof of concept pieces so that the model breaks out of that "unpossible" mode and actually helps me.

I can't remember if it was ChatGPT or Claude, but when I showed it how to get a MessagePort in its JavaScript executor through to the artifact/canvas, it quickly went from "That can't be done" to positively enthusiastic about the possibilities. I suspect those shenanigans will be well off the table for Fable though.

Stop dancing and share the prompt, we're dying to see it

Hey, stop asking to see my nuts! My nuts are private – okay?

(Joking aside, see sibling threads.)

Ayy lmao

Is this a joke? Seriously? These are some of the hardest problems in math, period. Hundreds if not thousands of the greatest minds in history have attempted to solve these problems. And you think that the current level of AI can blow through them? It is also a possibility that, for example, the Riemann Hypothesis is just not provable (Gödel's theorem).

No one is expecting that! I expect _kb was sarcastic/making a point.

Recently (last couple of months?) these models are becoming useful tools for mathematicians, because they can solve easier problems more quickly, meaning that one can tackle bigger challenges (but maybe not RH et al) piece by piece.

But, there are still definite limits, where one could expect an expert human to solve things, given time, but models do not. Thus, more intelligence would be nice!

if it was sarcastic then whoosh on me.

The medium ones are results where one needs to construct some object, which my intuition tells me should exist. The difficult ones are typically to show that certain objects can not be constructed.

These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.

I have some medium difficulty math problems where I have used the models for the last year and a half repeatedly. Back then they were already good at pointing out obstructions and constructing counterexamples. So that tracks. But at first glance it looks like Fable actually made real progress on one problem for the first time.

A year ago my judgement was that I had wasted my time on trying to work with the models and doing things myself would have been more productive as I would have gained intuition from the failures. Now it definitely seems to have figured out stuff that would have taken me more time than I have to spare on this problem...

Cool! Yes, we are getting there.

Being a theory builder more than a problem solver I am excited for the future.

Also excited for fully formalised mathematics to hit the mainstream!

Perhaps you should rephrase those nuts?

Got curious and ran a similar prompt with DeepSeek v4 Pro w/ OpenCode

No idea what's going on here but the agent tested a bunch of stuff. Then I asked it to build a wheel so I can run the command you noted above and it appears to pass

For those who are curious...

https://github.com/bamggm/micropython-wasm/commit/5ddebae592...

That is pretty wild, it took me a hell of a lot more coaxing and persevering to get to a similar point with eryx [0] (we spoke a bit about this before on Mastodon) using Opus. Fable seems to have a more optimistic 'sure, let's proceed as if this is possible' mindset based on your transcript. Looking forward to trying it out for some hairier problems.

[0]: https://github.com/eryx-org/eryx

Fable has been producing some really good work on my end as well. Definitely better than Opus 4.8. The only problems are the cost and constant cybersecurity refusals. A single session uses up 100% of my 5h window without finishing, and that's when it doesn't get derailed by nonsensical refusals.

Does anyone know what the architecture of Fable is? Is it harnesses? Did they solve persistent learning? What did they do?

Seems to just be a bigger model.

What can it do that Opus couldn’t?

Always hard to say for sure because I'm not sitting around running the exact same situations through both models in parallel to compare them.

It feels like you can give it a big chunky problem and leave it alone and it gets it done, with fewer questions and fewer design decisions that I wouldn't have made.

In reviewing its code I'm finding less to complain about than Opus. But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.

I did a qualitative side-by-side of Claude Fable vs Opus 4.8 vs ChatGPT 5.5

https://generative-ai.review/2026/06/claude-fable-rush-test-...

I get them to make a 3D explainer animation. You can clearly see Fable is much improved on both Opus 4.8 and ChatGPT 5.5.

Better textures. A nifty camera follow. Humans rendered better. ... See for yourselves.

Honestly, they all look good

But you said you've been working on those problems for months, so didn't you throw those same problems at Opus?

He has early access to Anthropic models, of course he will hype them up, so that they will keep sharing access to preview models with him (and more traffic to his website). It also doesn't require him to perform any rigorous analysis of model performance, just share how it feels:

> But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.

Crank up more revenue for IPO

I gave it a complete database migration of our app. Opus failed hard each time... Untyped JSONB for some rows, no proper normalisation, falling back to asking me questions in between.

Fable just did it, clean code, one timeout with a hanging bash script, fixed a couple very old very structural bugs in the codebase

How did you do this impressive amount of work and verify that it did it perfectly all in one day?

I told Claude to do it yesterday evening, checked in during my nightly break.

I am not sure it's perfect, and it will need further validation

This morning I looked at code samples & checked that all unit, integration, e2e & performance tests pass

I also generated a postgres schema diagram.

Aka I did probably 2 hours of work, rest was not me

The opus try was last month

High, extra, or max?

High.

What are some reasons to consider your project instead of Pyodide?

It's difficult to run Pyodide inside server-side Python.

I hate how the Instagram/TikTok/YouTube influencer cancer is getting into AI. With early access and all that.

It made sense for people doing proper and fair AI breakdowns waiting on an embargo, but now it's just slop I don't trust anymore.

I often get early access but didn't for this one, it's quite possible there's an NDA in an email somewhere that I missed and forgot to sign.

[dead]

[flagged]

It is already disclosed [1]:

> I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events.

[1] https://simonwillison.net/about/

HN's problem is that they/we keep upvoting him.

My disclosures are on my blog: https://simonwillison.net/about/#disclosures

[deleted]

> VERY difficult problems

Compared to what?

Did you hit your weekly limit?

How much does it cost? How much did those tasks you did cost?

So far it's all fitting into my current $100/month Claude Max subscription. I got lucky: I had 80% of my weekly allowance left and it resets tomorrow, so I'm burning tokens to try and use it all up by then.

Update: looks like I've spent $82.92 in Fable 5 API priced tokens so far today (still all included in my subscription.)

Here's a TIL on how I'm calculating spending using AgentsView: https://til.simonwillison.net/llms/agentsview-custom-model-p...
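
For anyone who wants a rough version of that arithmetic without a tool, here is a minimal sketch. It assumes made-up per-million-token prices (the `PRICES_PER_MTOK` values are hypothetical, not Fable 5's real rates) and hypothetical session token counts, and it is not how AgentsView itself works:

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one session (hypothetical structure)."""
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0


# HYPOTHETICAL prices in USD per million tokens, for illustration only.
PRICES_PER_MTOK = {"input": 10.0, "output": 50.0, "cache_read": 1.0}


def api_cost(usage: Usage, prices: dict = PRICES_PER_MTOK) -> float:
    """Return what the session would cost at API prices, in USD."""
    return (
        usage.input_tokens * prices["input"]
        + usage.output_tokens * prices["output"]
        + usage.cache_read_tokens * prices["cache_read"]
    ) / 1_000_000


# Made-up sessions; real numbers would come from your own usage logs.
sessions = [Usage(200_000, 50_000, 500_000), Usage(100_000, 20_000)]
total = sum(api_cost(u) for u in sessions)
print(f"${total:.2f}")
```

Swap in the published per-token rates and your own logged token counts to get a figure comparable to the one above.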

Seems like weekly allowance got reset back to 0%, pretty usual when they deploy new models.

Have you seen Fable randomly jump from 50% session limit to 100%? That happened to me a couple hours ago. It was preceded by a bunch of errors about failing to submit a bunch of screenshots.

I haven't noticed that, but I did notice that on a single turn of maybe a few sentences, the cache hit was somehow roughly 500K. Either that's a bug, or there are some truly massive thinking blocks or Claude Code harness system injections behind the scenes.

Nothing like that for me yet.

I'm thinking the 1M context limit bit me here. Only on Max x5.

Simon is also on Max x5

AFAICT come June 22, you won't be able to use your subscription for Fable 5?

Per the "Availability" section of the page, seems like it should come back to all plans eventually...

* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.

* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.

* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

wut in tarnation

Coding plans are a (massive) subsidy. We can debate until the cows come home whether western frontier models' API pricing rates are fair, but the coding plans are all heavy discounts below those API rates meant to draw people in and get them hooked (and, ostensibly, to be useful for hobbyists or other lower-usage cases).

It's been discussed at length (on this site, on other sites, on like every blog ever, etc) that, eventually, those subsidies will end, much as the $5-10 Ubers/Lyfts I used to take from the far north end of Chicago into the Loop in 2016 would eventually end once those companies had a footing and didn't need to hook folks.

So - yeah, I mean, a v5 model launching in a year where Anthropic has a rather deeply established market and in a year where AI costs are rising from nearly all providers (sometimes for multiple reasons) seems like exactly the thing I'd expect them to pull the subsidy plug on after a launch teaser.

(Even the open-weight models sometimes do this: for example, OpenCode Zen/Go has a rotating door of free models at any given time that eventually leave the free tier and move into the paid tier once the launch day hype/marketing dies down)

They gave everyone double usage to try it.

But, but, how does the pelican look?!

See parallel thread: https://news.ycombinator.com/item?id=48464054

Given how badly some of the models do on somewhat similar problems, I'm sure the pelican is included in the training set now. A similar problem: given an airplane outline and implementation constraints, design a paint scheme (constraints something like "it will be implemented using covering film, hence no gradients, no impossible cuts, not more than 2 colors on the engine cowl, etc."). Google Gemini is meh, but GPT models are just terrible. I don't have an Anthropic subscription at home, hence I have not tested it.

[dead]

[flagged]

AI models decompose problems down into tiny pieces that exist in their training data, so in a sense, you're correct.

Though that's also what makes humans so good at solving problems as well, it turns out.

Also, slight tangent: but I do find the "clanker" insult kind of funny. I feel like it counter-intuitively makes the models sound cooler than they are, if anything. I love clankin' shit.

The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less. And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces. That is how the first person to run CPython in WASM did that, and that is why the plagiarism machine can now do the same (only a thousand times more lame and uninspiring).

Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.

>The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less.

That may very well be true now. And in fact, this was true of more rudimentary calculations early on in computing history, where humans were definitely more efficient, particularly for more abstract mathematics. But Moore's Law comes at you fast. Even without more efficient compute, it's rather wild how much more efficient models are becoming these days just from algorithmic and training improvements.

So, maybe for now, certainly. Are you confident that will be the case in 5-10 years? And is that really your barometer for success?

>And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces.

That is certainly a limitation for now, but plenty of academic research is being done on how to address that in a more individualized way. That said, the models also have the advantage of synthesizing learnings from user interactivity back into a future release and essentially applying that globally, which is pretty neat.

There are also some cool techniques to sort of bridge the gap today, like compound engineering.

>Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.

But that's the thing: it's becoming pretty clear that the "plagiarism machine" can probably take that same problem in a prompt, having never been trained on my code, and still solve it.

In that case...maybe it doesn't feel great to have someone copy my idea. But that is certainly not plagiarism in the way you mean it. And when you put ideas out into the world, you can't be certain that someone else won't copy and remix it into something new. That's kind of how the world works already, but we're just seeing the barrier to entry decline.

> Are you confident that will be the case in 5-10 years?

Yes, I am. I am very confident that general purpose digital computers will never be more efficient than human minds in generating moderately complex code.

Why am I so confident... Well, it has been over 10 years since AlphaGo beat top Go player Lee Sedol. AlphaGo was able to beat a world-class Go player by doing several thousand orders of magnitude more computations than Lee Sedol, and it did so by spending several orders of magnitude more energy than the top human Go player. Today, over 10 years later, the top Go machines are able to beat world-class Go players much more easily, but still do so using the exact same strategy of outcomputing the humans with thousands of orders of magnitude more computations, and spending orders of magnitude more energy.

Things did not change in the past 10 years, I see no reason why it should change 10 years from now.

>Things did not change in the past 10 years, I see no reason why it should change 10 years from now.

Has it not? Why do you say that?

Also, do we still require a Deep Blue sized supercomputer for chess? :)

> The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less.

OK then - do it, faster.

> You can take comfort in the fact that a few months later some[...] developer can [solve] the same problem [using your work]

Isn't that what collaboration and sharing software is supposed to be all about?

[flagged]

On one hand, "clanker" has good steampunk vibes.

On the other hand: "Stop trying to make 'clanker' happen! It's not going to happen!"

"AI slop" caught on but "clanker" did not.

>"AI slop" caught on but "clanker" did not.

It caught on, sure, but not exactly in the way I expected. The wild popularity of "slop" as a term for AI eventually gave way to the genericization of the word "slop" to mean "content of low quality, regardless of source", and is seemingly being used as just a derogatory term for anything that people dislike (particularly by folks in left leaning communities). For example, I've seen people refer to (clearly human written) commentary from some political commentators as "slop".

Your comment kind of reinforces the idea by the fact that you have to now say "AI slop" specifically to disambiguate it. It's kind of a fascinating little turn.

"Slop" originated on /pol/ but I'm not going to try to thread the needle of the rules by trying to explain it without being offensive or triggering some filter: The first related term here: https://en.wiktionary.org/wiki/AI_slop#English

You have this backwards, as Simon could tell you. In fact, Simon coined “AI slop” to mean “low quality AI output.”

I didn't coin it myself, but I did help amplify it at the moment it started to take off.

claiming you aren't robophobic is the first sign of being a robophobe.

If you've got a real argument to make, by all means, make it. Your anger does not magically "make it so".

It's still a vote, and votes don't require reasons, and shouldn't be dismissed out of hand. There's a growing chorus of those who are fed up with rules for thee but not for me.

Automobiles are not interesting or useful because they're just using trails the horses already built.

[flagged]

I think this is a worthwhile argument, but you do it a disservice by spamming it in trollish comments

I mean yeah, in this case I fed my own open source code directly into it.

[flagged]

This looks like a toy project, not a “VERY difficult” problem like you stated.

What does that mean? Have you never worked on extremely difficult problems as a side project?

I guess my comment got lost in translation. The project OP linked in his comment is a toy project, not a difficult problem as he led others to believe.

So you could have done it in your sleep, with your hands tied behind your back. Got it.

(You may not realize it but simonw is one of the co-creators of Django, the Python web framework. If they find a Python problem difficult, it probably is.)