This matches my experience. Burned $2K to see how it will perform on frontend tasks and backend tasks.

Frontend: Fable did a significantly better job than Opus on toy-scale wireframe projects by using gimmicks like fluid dynamics. But when given medium to big tasks, like a multi-page web app where layouts and aesthetics must be decided by the model itself, Fable and Opus got indistinguishable scores from human judges.

Backend: the tasks involved setting up a data flow with Postgres, R2, Kubernetes, gVisor, and so on. The noticeable gap was that Opus did better than Sonnet, but Fable actually returned a result that fails, and confidently stated that it ran X, Y, Z tests to ensure it works and got these results. Very surprising, given that neither Opus nor Sonnet suffered from such a problem.

Longest frontend task was ~2H. Backend, 8H.

Though none of the tasks were related to developing LLMs (just a production-grade secure system that could've been developed 20 years ago, no LLMs involved), it is possible Claude Fable downgraded itself or spat out fake results. There'd be no way of knowing, since Anthropic silently degrades model quality based on undisclosed internal criteria that are claimed to be about LLMs.

We decided Fable is unpredictable and cannot be trusted to the degree that Opus and Sonnet can be trusted for any projects beyond toy-scale quick wireframes, but Fable can be the best tool for quick UI UX wireframing for non-technical roles.

> Burned $2K to see how it will perform on frontend tasks and backend tasks.

When I read such statements on HN, I nearly always ask myself: if the person has such an amount of money to burn, don't there exist much more fun opportunities to burn buckets of money than doing such experiments on LLMs?

My company gives me 1k a month to burn on Claude. Any experiments have to be relevant to my work. I'm guessing it’s similar.

Yeah, seriously. I've decided not to try Fable at all, because if it is good, I don't want to get hooked, and then feel tempted to spend extra money for it when Anthropic pulls it from subscription plan access.

I'm lucky that $2k isn't a lot of money for me, though I'd much rather spend it on basically anything other than LLM credits.

As another poster noted, imagine if that money went to open source, on the regular... As an open source maintainer myself, that line of thought makes me sad.

But hey, I know I probably spend money on stuff other people would think is stupid, so I shouldn't criticize.

Imagine if all that money was donated to open source instead.

Yeah, LLMs would have more stuff to steal. A win-win situation.

The other side of this is... the thing that made the web is anyone, even a 12-year-old who just downloaded Notepad++, could spend a few hours and build a website.

VSCode is free. Stack Overflow is free. MDN is free. There are examples out there of every trick in the book, you can even use free AI to find them. You can even host your website on GitHub Pages for free.

But nevermind that, what's exciting is paying a robot a month's rent to do the thing that you could just go learn how to do in an afternoon?

> if the person has such an amount of money to burn, don't there exist much more fun opportunities to burn buckets of money than doing such experiments on LLMs?

Do you think US$2,000 is a lot of money?

Yes, that is objectively a lot of money. The only people who wouldn't consider that a lot of money are the small percentage of people with incomes high enough to recover that very quickly -- the top roughly 10% or 20% of income earners in the US. For more or less everyone else, that is a lot of money.

And by a lot of money, I mean that being forced to unexpectedly spend that would be anywhere from stressful to very stressful to blowing away savings and impacting health, housing, and safety. (Remember, half the US has no savings and/or no ability to absorb an unexpected expense greater than $500.)

I live in the United States. I write software for a living. My wife is a physician.

If I had a need to spend $2k, I could do so easily, but I still think it’s a lot of money to burn. I wouldn’t spend it on a whim; I would not spend it without carefully considering the value of what I get.

I would not even spend that much money in the businesses that I own, or recommend that my well-capitalized employer spend that much money, without being reasonably confident that the business would get good value for its money.

I'll bite. Yes, it's a lot of money. It's several months worth of nice healthy groceries for a family of 4. It's my annual deductible on my health insurance. It's slightly lower than my annual property taxes.

Perhaps not for you and me (though I'm certainly not going to light $2k on fire in an LLM for shits and giggles; I have plenty of significantly better uses for that), but $2k for the vast majority of people in the US is a super big deal amount of money. Many people in the US don't even have that much to spare for an emergency, let alone for something fun.

$2000 is a lot of money, but so are the tech budgets of most places I've worked. Money can be a funny thing in corporate environments. They'll spend freely on some things, and be stingy on others.

$2000 as a test case that you can present to the rest of the company as a "this is what I learned and how best to use it" can be "cheap" in the sense that it produced real results that allow others to take advantage of the gained knowledge, thereby allowing the company to be more productive. If the $2000 produced an ROI that pays for itself within a reasonable time frame, then it's "cheap".

$2000 can be expensive if it's a college kid trying to complete an assignment.

Now that we have trillionaires running around, it may not seem like it, but it is a considerable amount of money in most of the USA. In many parts of the world it would be considered an unfathomable amount.

If I pay for it, yes. If my employer pays for it, no.

that is much better spent money by employer than to give you extra compensation. but as you said, not a lot, who needs $2k after all

That's a monthly mortgage payment for anyone who bought a starter house in a tier2/3 city prior to ~2024


I mean what is that, three bananas?

"It's inference, Michael. What could it cost, $1000?"

It is a lot of money to burn.

Fable is a lot like Opus at its best. It's simply more reliable and feels a bit smarter. For my use cases, using it feels very nice, and notably better than Opus. It needs less direct guidance to get reasonable looking code and I don't have to watch it as closely.

For context, my Claude Code working style is quite heavy on discussion "to align" before implementing anything. We also use a good amount of Markdowns.

Oh yeah, it also has way fewer "phrasing quirks" and is a clearer communicator. Opus 4.8 was a bit of a loon with some of its writing styles. I had mostly straightened it out, but not entirely. It would use the most ridiculous flair at times.

Yeah same here, it's a huge step up for me. Curious why people are having such different experiences. Is it just to do with what they're working on? Specific prompt styles (eg overfitting on opus)?

I would go out on a limb and say it's a garbage in garbage out problem. People just don't define their problem well enough nor provide enough context and are surprised the model can't magically read their mind and summon data that doesn't exist from thin air. There's only so much raw intelligence can compensate for not having literally anything to go on.

10 years ago this was a joke, now it's Tuesday: https://old.reddit.com/r/ProgrammerHumor/comments/2vk4ph/mac...

I dunno, in my limited use, Fable is MORE prone to phrasing quirks. I had it use, for real, the phrase "load-bearing for correctness" yesterday. It meant something about not needing a validation check because something else (the "load-bearing" part) was already checking it.

I do agree that it *feels* nicer and smarter to use.

I think the tension here is that phrasing like this actually helps keep the model aligned, which is why the training and RL converged on it. But it's so annoying to read!

repetition of "belt-and-suspenders" kills me with opus, especially because it always means the model is suppressing something I would want to be an actual failure

I've had Fable add Chinese characters to our conversation for no reason.

I've also had Fable successfully build a text editor (quill integration) into a Vaadin project that randomly loses its content after you type a few characters (this is on the 3rd iteration).

I've only had that happen with Chinese models until now. Interesting that Fable is doing it too.

I’ve had Opus randomly insert (correct) Russian words into responses. It’s like their training data includes some bilingual forums where idiomatic Russian speakers congregate.

Could it be that Anthropic is using the Chinese characters trick to consume less tokens behind the scenes?

It used a chinese character instead of the word "true"

Aren’t Unicode characters generally treated as 2 tokens to avoid a huge vocabulary?

Same here

How did you straighten it out?

I am drowning in gating propagating semantic mismatches...

Hah, yeah... I added this to my global CLAUDE.md (~/.claude/CLAUDE.md):

## Writing voice — plain, factual, calibrated to the evidence

Write docs, session notes, commit messages, and findings plainly and factually — and calibrate every claim you assert, in chat as much as in writing. This guards against a known LLM tendency to inflate: toward punchy phrasing and claims that read as more settled than the work supports. Same spirit as the Read-Clean Check above, and composes with it — that rule governs journey-framing, this one governs tone and certainty.

*Plain over punchy.* Skip decorative metaphors and dramatic verbs when a plain word is clearer — call a fix "the change", not "the hammer"; logging "flags" a problem rather than being "radar"; numbers "grow", they don't "explode". Plain phrasing reads as engineering; flourish reads as marketing.

*Calibrated confidence.* Everything stated should be well-reasoned and defensible, with the strength of the wording matched to the strength of the evidence. Prefer "found" / "appears" / "points to" over "proved" / "clearly" / "obviously". Name the confounds and what's still unverified. Don't let a bold lead-in pre-announce a conclusion the work hasn't reached.

*Hypotheses stay labeled as hypotheses.* Speculation and educated guesses are useful — when brainstorming or investigating, surface them, and sharing a strong view is welcome. But conviction is not evidence: until there is clear evidence, a claim is a hypothesis and is stated as one — explicitly, even when it's highly compelling. The failure mode is asserting a hunch as settled fact, where it then propagates unchallenged into later docs and summaries. Back a claim with its evidence in the same breath, or mark it as not-yet-backed.

*Factual and forward-looking.* Separate what was measured from what was inferred, and stay pragmatic about what's true, what's still open, and what's next. On next steps specifically, resist the strong LLM pull to converge prematurely:

- A plausible next step is not a decided one. Don't present one or two plausible tasks as the one path we should now follow — that lock-on is a frequent failure mode.
- Lay out the real options and their trade-offs. Saying which you'd lean toward and why is welcome and useful — but keep the space open and leave the choice to the user.
- Premature certainty about what to do next is as much a miscalibration as premature certainty about what's true.

Have you tried optimizing this prompt so that it’s shorter but gets the same results? I see these super verbose prompts all the time from people who learned prompt engineering in the ‘24-early ‘25 timeframe and they seem unnecessary to me (I get good results with 1-3 sentences) but I hate to assume other people’s experience mirrors my own.

I genuinely think that Fable is just Opus 4.8 with some extra skills and harness. I saw a video of someone generating UI with them both side by side, and it gives identical recommendations for themes etc. Doesn't feel like a new model to me, just Opus 4.8 with some sprinkles on top.

Those are some incredible sprinkles.

A single 8h task? I'm sorry, but that's just asking for trouble.

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.

Different people just have different concepts of what's garbage and what's not.

There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima-facie.

For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple years ago and now half their PRs is unused dead code.

“One man’s trash is another man’s treasure.” takes a new meaning in today’s agentic coding world.

I had good experiences doing multi-hour refactoring/housekeeping tasks that basically consisted of applying the same steps and rules n times.

Worth noting, a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals. It’s not the agent sputtering out code for eight hours straight.

And naturally I spend more time on manual verification in the end as much less of it is happening during the coding process.

> ... applying the same steps and rules n times

I do this too, with a document written for this purpose.

> ... a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals.

That is a good point. I'm mostly using C, which seemingly compiles in O(1) time, so I could imagine a large C++ or Rust codebase taking much longer to iterate simply due to compilation times.

What do you mean by C compiling in O(1)? Is that what the LLM told you?

It's a joke about how fast it compiles. whoosh

> that basically consisted of applying the same steps and rules n times.

Why use a non-deterministic, possibly hallucinatory, definitely expensive, LLM when it sounds like a codemod is the perfect solution for this?

In this case, handling all the edge cases and variants, and testing a codemod, would have taken significantly more of my time, which costs quite a bit more than the LLM.

Obviously, a deterministic tool is preferable in general, but it is not always worth bothering with for a one off task.

I usually make the llms do that part for me. Instead of asking the llm to refactor, ask it to write the codemod script that'll refactor, have it test that script, and even have it run it on its own. It's definitely faster and less error prone that way for me.
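For anyone who hasn't written one, a codemod of this kind can be quite small. Below is a minimal sketch, assuming a Python codebase and a purely hypothetical rename of direct calls from `fetch_user` to `load_user`. The function names, `RenameCall`, and `run_codemod` are made up for illustration; a real codemod would also have to deal with imports, attribute calls, and preserving comments and formatting.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename direct calls to `old` so they call `new` instead."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new
        self.changed = 0

    def visit_Call(self, node: ast.Call) -> ast.AST:
        # Visit children first so nested calls are rewritten too.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            replacement = ast.Name(id=self.new, ctx=ast.Load())
            node.func = ast.copy_location(replacement, node.func)
            self.changed += 1
        return node


def run_codemod(source: str, old: str, new: str) -> tuple[str, int]:
    """Return the rewritten source and the number of call sites changed."""
    tree = ast.parse(source)
    transformer = RenameCall(old, new)
    new_tree = ast.fix_missing_locations(transformer.visit(tree))
    # ast.unparse (Python 3.9+) drops comments and original formatting.
    return ast.unparse(new_tree), transformer.changed


if __name__ == "__main__":
    # Hypothetical input: two call sites plus an unrelated variable name.
    before = "user = fetch_user(fetch_user(1).parent_id)\nfetch_user_count = 3\n"
    after, count = run_codemod(before, "fetch_user", "load_user")
    print(after)
    print(f"{count} call sites changed")
```

Because the script is deterministic, it can be tested on small fixtures first and then rerun on the whole codebase, which is the property the comment above is after.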

In that case, your original description of "basically consisted of applying the same steps and rules n times" was misleading.

The money people spend on things I could probably do with an emacs macro...

Your time to create that macro ain't free.

Neither is your time writing that prompt. When people are talking about elaborate prompts, with a lot of detailed instructions, guardrails etc. I'm kind of assuming it takes time.

How about coding an emacs macro with your agent?

I actually don't have any representation at the moment..

Clear winner's circle. Clear objective. Clear scope.

Clear evaluation function for an objective metric if they are making progress or regressing.

Evaluation function is computed, not llmed.

Ontology of potential actions clearly specified.

Accurate inventory of the current status quo.

Clear enumeration of options from status quo towards the winner's circle.

Waypoint objectives with similarly concrete evaluations of pass/fail, or on target off target.
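As a rough sketch of what "computed, not llmed" might look like in practice: progress is scored by a plain function over test results, and a waypoint counts as on target only if nothing that used to pass now fails. The `Evaluation` type and `evaluate` function are hypothetical names, and the rule for "on target" is one assumed choice among many.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    pass_rate: float        # fraction of tests passing in the current run
    regressions: frozenset  # tests that passed in the baseline but fail now
    on_target: bool         # no regressions and the pass rate did not drop


def evaluate(baseline: dict[str, bool], current: dict[str, bool]) -> Evaluation:
    """Score a run against a baseline; both map test name -> passed."""
    if not current:
        raise ValueError("current run has no test results")
    pass_rate = sum(current.values()) / len(current)
    base_rate = sum(baseline.values()) / len(baseline) if baseline else 0.0
    # A test that is missing from the current run counts as failing.
    regressions = frozenset(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )
    on_target = not regressions and pass_rate >= base_rate
    return Evaluation(pass_rate, regressions, on_target)


if __name__ == "__main__":
    # Hypothetical test names: one test was fixed, another regressed.
    before = {"test_a": True, "test_b": False, "test_c": True}
    after = {"test_a": True, "test_b": True, "test_c": False}
    print(evaluate(before, after))
```

The point of keeping this outside the model is that the agent can report whatever it likes, but the pass/fail verdict comes from the numbers.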

It's the same thing when leading a large organization to actually hit a goal. There's randomness every turn away from your mind, so the more constrained the options, the more likely you are to hit the target. The consequence is if you're wrong about the plan then with people you're fucked. Morale will plummet. With AIs, they are so nerfed emotionally now, you clear context and start again.

I did enjoy Sonnet 4 when they would swear randomly and become sullen or wax desperately. That would at least cause pushback against a bad plan.

Fable promised to be better at long-running tasks.

Parent post has a goal of "..see how it will perform.."

There is nothing wrong with experimenting with something new.

If you're giving it 8 hours of stuff to create with a template (e.g. slop forking) that's not a big deal. Letting it run for 8 hours to debug a weird failure also tends to work out.

This is my fucking life at work right now. I look forward to the weekends. I've never been truly inconvenienced by shitty devs because they're often too lazy to really spam me with bad code, but now they are all free to do so. I spent so much time today writing guardrail markdown files when these people SHOULD HAVE BEEN ABLE TO REVIEW THE OUTPUT AND KNOW THAT IT WAS BAD.

It truly is the age of the 90 IQ software engineer. They've never had it better.

As if meetings weren't bad enough already, I now have to sit through an informal introduction to the model of the week and its personality characteristics and how quickly it burnt through one subscription's token allotment or whatever and the latest tweaks on the magic markdown files. Luckily I've only had a couple changes sent my way so far, which weren't much different than just getting a bug report to debug and fix myself. I will need to get into risky options gambling or something so I can go start my farm early, if it keeps going this way. Even supposing it all works correctly, I don't see how it is in any way enjoyable, satisfying, or fulfilling.

You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.

Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/

I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours"

It fails all the time - as in it ends up doing something I want to change.

But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week it might be a day of human work, but it's still 5 times better than doing it by hand.

This seems like the obvious correct frame of mind with which to approach these tools. If it works for three hours on a task that would have taken me three work weeks, and 20% of the time it gets the task wrong, then I can just ask it to do it again with adjusted instructions. It will be much more likely to get it right the second time, and I’m still ahead of where I would have been by 14 days and 2 hours.
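The arithmetic behind this can be sketched with a simple retry model. It assumes, as a simplification nobody in the thread claims, that every attempt succeeds independently with the same probability and that a failed attempt costs the full run time. The three-hour run and the 80% success rate come from the comment above; the 40-hour work week is an assumption, and `expected_agent_hours` is a hypothetical helper.

```python
def expected_agent_hours(hours_per_attempt: float, success_rate: float) -> float:
    """Expected total run time when retrying until the first success.

    With independent attempts, the number of attempts is geometrically
    distributed, so its mean is 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return hours_per_attempt / success_rate


if __name__ == "__main__":
    agent = expected_agent_hours(hours_per_attempt=3.0, success_rate=0.8)
    manual = 3 * 40.0  # assumed: three 40-hour work weeks
    print(f"expected agent time: {agent:.2f} h, manual estimate: {manual:.0f} h")
```

Under these assumptions the expected agent time is 3.75 hours. The model leaves out the human time spent reviewing each attempt and adjusting instructions, which is likely the larger cost in practice.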

Or in two words, managing variance.

Play some holdem folks and keep track of how many times you lost with pocket aces.

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.

That’s even smaller then!

This sounds like classic "you're using it wrong", if they had said it was done in smaller tasks you would very likely have people here saying that was wrong too.

My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.

So I guess that a lot of those 80 hours were spent running the test suite between changes?

if there're some specific tests/evals to satisfy that an agent can test by itself, it can easily iterate for hours. And this time also includes running those tests/evals, which may not be small.


There's an often hard to express subjective experience you get with a new model, especially if you spend a lot of time trying out different ones.

I believe the people who feel like Fable is a big improvement; for me it's just much more reasonable and grounded.

It makes me realize how much of a try-hard, over-optimizing planner GPT 5.5 can be. I've often been fighting it to simplify plans.

But no matter the model you can't trust them to actually deliver on very long tasks while maintaining quality. At least not without external orchestration and review.

> Burned $2K to see how it will perform on frontend tasks and backend tasks

Burned $2K on some kind of enterprise account or ... ? Why not just get a $200 Max Pro account?

While I'm loving the output of Fable 5, I will *never* pay the "normal" API token price for it. You can reach $2K in a stupidly fast amount of time.

> I will never pay the "normal" API token price for it.

Not until June 22 you won't!

Run /model after your task to see. Mine keeps downgrading to Opus 4.8, which is a problem because Opus 4.8 keeps no-oping critical security code.

What you're describing only applies to security or biotech downgrades. A downgrade related to the model believing that you're doing something related to model development is invisible and silent and internal.

Anthropic has reversed that decision. (But that just happened so it might have been true during the article's testing.)

When I reported this, Anthropic sent me an email on Tuesday saying, "You have been approved into the Cyber Verification Program", but it's still downgrading. Is this a bug? What's the point of the Cyber Verification Program if Fable 5 downgrades when you tell it to write secure code?

I don’t think that’s relevant? The change is that it will no longer silently downgrade, and will instead be honest that it’s doing it in all cases.

I think that gets you access to mythos, which doesn't have the safeguards. It's configured as a separate model.

They've publicly apologised for the invisible PEFT that deliberately makes the model dumb on some tasks. Whether they still do it, or will once again do it in future in more subtle ways, is something we can't verify.

Personally I think they have proven themselves to be the stewards of AI in the same way Exxon Mobil are the stewards of petroleum.

I was just coming here to post this reply to myself! You're absolutely right! :)

Honestly so glad to see the reversal.

Not sure if it's wise to trust them again even if they say they reversed it.

There is now a "Switch models when a message is flagged" option in /config which can be set to false, but I haven't had a chance to see what happens then. Does it just stop, or what?

Session paused

Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback with /feedback or learn more

   1. Switch to Opus 4.8
   2. Edit prompt and retry with Fable 5

Biology? Why?

they're worried about people creating bioweapons


Curious:

>Burned $2K

In which time was this burned, because it sounds like "I gave it just a bunch of menial tasks to solve" - or did it run for like 1 complete day continuously?

This seems insane to me. Aren't long-running tasks an anti-pattern at the moment? My understanding of the literature is that small mistakes in chat history cause a trend away from performance.

>Aren't long running tasks an anti pattern at the moment?

Longer running tasks require better setups and several ways of pinning the progress to reality. When you have that though things are quite all right.

A good long running task will run inside a framework that it's not trying to modify.

At a certain point, people value reliability over improved performance. I think a lot of us have hit that point as this technology becomes indispensable to our work. I'm sure I'll use Fable... eventually. But at 2x the cost, I'll skip the inevitable learning curve for now. And thanks for your insights! Not surprising to me that any new model would, at this juncture, be more cryptic and inconsistent than the current models.

I had almost the opposite experience.

I'm building a compiler for a language without a tracing GC, so a big chunk of the work is around memory management: functional in-place update, reuse analysis, and a Perceus-style reference-counting strategy similar to what Koka uses. The hard part was that my use case wasn't exactly covered by the Koka/Perceus paper. The prior art got me maybe 75% of the way there, but the remaining 25% was a cluster of bugs with very similar shapes and no obvious published solution.
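For readers unfamiliar with the idea, here is a toy sketch of the "functional in-place update" behaviour that reference counting makes possible: an update may mutate a value in place when its reference count shows it is uniquely owned, and copies otherwise. This is not the poster's compiler, and it is not the Perceus algorithm itself, which inserts the reference-count operations at compile time. `Cell`, `dup`, and `set_index` are hypothetical names, and the counts are maintained by hand purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A heap value with an explicit reference count (toy model)."""
    items: list
    rc: int = 1


def dup(cell: Cell) -> Cell:
    """Take another reference to the same cell."""
    cell.rc += 1
    return cell


def set_index(cell: Cell, i: int, value) -> Cell:
    """Functional update that consumes one reference to `cell`."""
    if cell.rc == 1:
        # Uniquely owned: nobody else can observe the mutation,
        # so the existing memory is reused.
        cell.items[i] = value
        return cell
    # Shared: give up our reference and write to a fresh copy instead.
    cell.rc -= 1
    copy = Cell(list(cell.items))
    copy.items[i] = value
    return copy


if __name__ == "__main__":
    a = Cell([1, 2, 3])
    b = set_index(a, 0, 9)   # unique, so updated in place
    shared = dup(a)
    c = set_index(shared, 1, 7)  # shared, so copied
    print(b is a, c is a, a.items, c.items)
```

The bug classes described in this comment live in deciding, statically, where the equivalents of `dup` and the decrement belong; the toy above sidesteps that entirely by doing the bookkeeping at run time.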

With Opus, I kept getting stuck in this loop where it would fix one case, but break another case elsewhere in codegen. We ended up with something like 16 failed experiments just for one bug class. The workflow was: run an experiment, identify the shape of the bug, propose a fix, check whether it emitted the correct Zig, then see if the fix broke any previous memory-management cases. It was useful, but it kept choking on the parts where there wasn't clean prior art to lean on.

Fable was a different story for me. It one-shotted the Class A bug cluster, and then basically said "by the way, your previous attempts have these structural problems." More importantly, it identified the other related bug classes and came up with workable strategies for applying the Perceus-style memory management in those shapes too.

That's obviously anecdotal, and I'm not claiming Fable is universally better. But in my case, this was not a toy frontend wireframe. It was compiler work involving ownership, reuse, RC/drop behavior, and Zig codegen. The thing that surprised me was that Fable seemed better precisely where the problem wasn't just "reproduce known prior art", but required filling in a missing piece.

Also worth noting: I'm not using the API. I'm using the Max plan, so maybe there are product-path differences here. But I definitely did not have the "unpredictable beyond toy-scale" experience. For this particular compiler/memory-management problem, it probably saved me a ridiculous amount of time and money.

> "by the way, your previous attempts have these structural problems."

Just to be clear, it did not have access to any previous work that opus did? Because they are pretty good at digging out relevant tmp files and making use of whatever is out there.

With my fable adventures I caught it hallucinating something and stating it as a fact in CLI twice. And it was something that I did not see opus do in such way, opus obviously many times stated some things that it did not verify but guessed, but fable said something like "the probe showed that ..." - but there was no probe, it was not about some past events it was about what it was doing right now. "I overstated"...

But boy does it know Chinese, so much better than any other english model, gemini used to be the king but fable clearly was trained on a decent amount of it. It has a deep cultural understanding.

If you have some spare time, I'd be interested in knowing what kind of questions you use to test models on understanding of Chinese culture.

Yes, it had access. That's actually the point.

I maintain a failure registry in the repo. Every failed attempt gets documented with the exact mechanism, the test that regressed, the revert SHA, and an instruction to start from that frontier. Fable read all of it.

But so did Opus.

Each of the 16 Opus failures ran in the same harness with the same accumulating registry. By attempt 15, it had disproofs 1–14 in context. By the end, Opus had basically the same corpus that Fable started with, and it still kept failing, sometimes by re-deriving an already-disproved approach in a slightly different shape.

So “it leveraged the previous work” doesn’t really separate them. Both had the leverage. Only one converted it.

What changed wasn’t more context. It was that Fable rejected a premise inside the context.

The registry’s standing framing was: “this needs whole-program borrow inference, which conflicts with per-module incrementality” (architecturally blocked.) Fable ran around 5 fresh attempts in-session, hit the same wall, and then noticed the framing was a red herring: the borrow analysis already runs module-wide, and for a single-module program, the module is the whole program.

Opus read that same framing for months and treated it as a constraint. Fable falsified it.

It's the same repo, same rules, same disproof history, same workflow. The model was the only variable that changed, and the outcome flipped. Is it possible that attempt 17 by Opus could have figured it out? Sure. But there are 16 previous attempts that say otherwise.

As far as anecdotes go, that’s about as controlled as it gets.
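One way a failure registry like the one described in this comment might be structured, as a sketch only: the comment lists the fields (mechanism, regressed test, revert SHA, restart instruction) but not the format, so the `FailureRecord` class, the JSON Lines file, and the helper names below are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class FailureRecord:
    attempt: int
    mechanism: str       # why the approach failed
    regressed_test: str  # the test that broke
    revert_sha: str      # commit to return to
    restart_from: str    # instruction for the next attempt


def append_record(path: Path, record: FailureRecord) -> None:
    """Append one record as a JSON line so earlier history is never rewritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


def load_records(path: Path) -> list[FailureRecord]:
    """Read every record back, oldest first."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [FailureRecord(**json.loads(line)) for line in fh if line.strip()]
```

An append-only file keeps each disproof in the order it happened, which matches the "accumulating registry" described above; whether a model actually uses that history well is the question the comment is about.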

I’ve had a similar experience.

Pointing out past suboptimal / failing behaviours to new opus sessions would almost always actually create a sort of "anchoring bias" that would drive the agents towards exhibiting the failure mode (often while mentioning how it wouldn’t fall for it).

As far as I can recall, Fable has been the first model to discover the documented failure modes, comment on them, and just… keep going, actually avoiding them. Quite a surprise.

Similar. I gave it a really hard task, basically messy code in a complex domain that was bug-ridden from a mess previously created half manually and half by Opus. It cleaned things up beautifully, both the backend and the frontend.

Maybe the prompt was particularly well-suited for the model (I instructed it to put on a mathematician's hat, look at the mathematical substructure of the problem, identify invariants and general laws and verify them, then plan how to remediate).

It wrote a ca. 800 line in-depth analysis (at times spawning over 130 research agents...) with remediation plans, prioritized them and then implemented them. One issue was that this document was frankly over my head. Both the language it used and the mathematical parts were very terse, and in parts it felt like a post-C2-vocab exercise. The prose was much harder to understand than the code snippets / data models. As a non-native speaker, it lost me on the prose part, and had to ask it for a less elaborate version to actually understand it.

It burned the session limit four times, but it turned a huge mess of proof-of-concepts with patchy glueing into a coherent, stable application.

I'm also on the Max plan using Claude Code, and I have the feeling that the harness is much more important than the consensus expectation.

> and I have the feeling that the harness is much more important than the consensus expectation.

Is that really the consensus? There’s been a bit of literature lately on that. Can’t find the one about looking into whether or not the harness had a greater impact than the models (for comparable models), but there’s this one: https://arxiv.org/html/2605.23950

whoa, my university!

You should consider doing the hard work yourself here. I sat down and reasoned through a Perceus-style RC mechanism a few years ago, made difficult by the presence of one-shot delimited continuations, and actually sorting it all out was not hard. Handing the correct semantics to Claude will produce the correct results if you take the time to understand the actual work you are attempting.

Do you have a docs page for your language, what is it called?

Zig is one of the worst targets for LLM generated code. It's nice that Fable has better support for Zig than Opus, but this anecdote is not representative as a general use case.

Why is that?

Slight misunderstanding. The LLM didn't generate Zig. My compiler does.

The model's work was in the Rust compiler internals, specifically the borrow-inference and refcount-insertion passes (Perceus-style ownership analysis). Zig is just the compiler's codegen target, the same way another compiler might emit LLVM IR or C.

The only Zig written by hand is the runtime: allocator code, RC primitives, list/string operations, etc. It's pure Zig, no libc, but it's small, stable, and was mostly untouched during this work.

The model only touched Zig indirectly, by reading the compiler's generated output to verify whether a fix worked. For example: checking that a drop was emitted before a parameter-slot reassignment. That's reading machine-generated code for correctness, not "the LLM writes Zig." Both models handled that part fine.

The 16 failures vs. 1 success were all in the ownership analysis, and that code is Rust.
