Sorry where are we seeing that it failed? It compiled multiple projects successfully albeit less optimized.
"It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
It does not have its own assembler and linker; these are the very last bits that Claude started automating and are still somewhat buggy. The demo video was produced with a GCC assembler and linker.
The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler. The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce."
For faffing about with a multi agent system that seems like a pretty successful experiment to me.
Source: https://www.anthropic.com/engineering/building-c-compiler
Edit: Like I think people don't realize not even 7 months ago it wasn't writing this at all.
> where are we seeing that it failed?
Anthropic said the experiment failed to produce a workable C compiler:
- I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
- The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.
(source: https://www.anthropic.com/engineering/building-c-compiler)
Software that cannot be evolved is dead software. That in some PR communications they misrepresented their own engineer's report is beside the point.
> It compiled multiple projects successfully albeit less optimized.
150,000x slower (https://github.com/harshavmb/compare-claude-compiler) is not "less optimised". It's unworkable.
> Like I think people don't realize not even 7 months ago it wasn't writing this at all.
There's no doubt that producing a C compiler that isn't workable and is effectively bricked as it cannot be evolved but still compiles some programs is great progress, but it's still a long way off from autonomously building production software. Can today's LLMs do amazing things and offer tremendous help in software development? Absolutely. Can they write production software without careful and close human supervision? Not yet. That's not disparagement, just an observation of where we are today.
This evaluation appears to be AI-written itself. It claims a 3x slowdown and a 4x slowdown combine to produce a 158000x slowdown "because there are billions of iterations" - yeah well both versions of the program had the same number of iterations.
Does anyone know how the 158000x slowdown happened? That's quite ridiculous.
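On the iteration-count point, here is a minimal sketch of the arithmetic, assuming the two slowdown factors are independent per-iteration costs that simply multiply, and that both builds run the same loop (as stated above). The function names and unit costs are made up for illustration; nothing here is taken from the linked benchmark.

```python
def total_time(iterations, cost_per_iteration):
    # Total runtime is iteration count times per-iteration cost.
    return iterations * cost_per_iteration


def slowdown(iterations, fast_cost, slow_cost):
    # Ratio of slow runtime to fast runtime for the SAME iteration count.
    return total_time(iterations, slow_cost) / total_time(iterations, fast_cost)


# Hypothetical unit costs: the fast build costs 1 unit per iteration,
# the slow build costs 3 * 4 = 12 units per iteration.
combined_factor = 3 * 4

for n in (1_000, 1_000_000, 1_000_000_000):
    # The iteration count cancels out of the ratio.
    print(n, slowdown(n, 1, combined_factor))
```

Under those assumptions the ratio stays at 12 no matter how many iterations run, so the iteration count alone cannot turn 12x into 158000x. Something else (a different algorithmic path, or a measurement issue) would have to account for the gap.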
> Can they write production software without careful and close human supervision? Not yet. That's not disparagement, just an observation of where we are today.
I never claimed they could! I just view this as a successful experiment. I don't think Anthropic was making that claim with their experiment either.
It feels reflexive to the moment to argue against that claim, but I tend to operate with a bit more nuance than "all good" or "all bad".
I think people are concerned about the large discrepancy in concrete claims in your previous comment and subsequent empirical information. You may have seen a headline or skimmed an article and missed some details, not a big deal.
The overall impression given was inaccurate and the implicit claim of a fully working end-to-end generated compiler was inaccurate. The headlines were incomplete in a way that was intentionally misleading. It was an interesting experiment and somewhat impressive but the claims were overblown. It happens.
The experiment failed to produce a workable C compiler despite (1) the job not being particularly hard and (2) the available specs and tests being of a far higher quality than those of almost any other software, not to mention the availability of other implementations that the model trained on.
You can call that a success (as it did something impressive even though it failed to produce a workable C compiler), but my point in bringing this up was to show that today's models are not yet able to produce production software without close supervision, even when uncharacteristically good specs and hand-written tests exist.
That's great and all, but that's not the point I was making and you're engaging rather uncharitably on it. So when you view it from the perspective of capability increase it's rather impressive. Note the slope of progress which this experiment was to show.
Edit: Maybe uncharitably is too strong, but we're talking past each other.
pron made this statement:
> It's 2026 and the idea that even with detailed-enough requirements you can one-shot even a workable (let alone perfect) solution also needs to die.
and brought up the failed Anthropic experiment as proof of that. Yes, you are talking past each other, but that is not pron's fault. It is your fault.
Eh fair enough!
Saying the model failed to write a competitive C compiler makes more sense.
I don't think they tried to do that though.
> today's models are not yet able to produce production software without close supervision, even when uncharacteristically good specs and hand-written tests exist.
That's a good point anyway
> Saying the model failed to write a competitive C compiler makes more sense.
Their compiler fails to compile (well, at least link) some C programs altogether, and in other cases it produces code that is 150,000x slower than a real C compiler with optimisations turned off (interestingly, the model trained on the real compiler's source code). That's not "not competitive" but "cannot be used in the real world". But even more importantly, the compiler cannot be fixed or evolved. It's bricked (at least as far as today's models' capabilities go). For any kind of software, not being able to improve or fix anything or add any new feature means it's effectively dead.
You could not use it in production even if no other C compiler existed.
While I understand both points of view, I'm leaning towards yours, because:
- John Carmack embedded a C compiler and interpreter/runtime into Quake back in the mid 1990s as a scripting language! It was that efficient that it could be used in a real time 3D shooter. That's a solo effort as a minor component of a much larger piece of software.
- I've seen university CS courses hand out "implement a C compiler" as a homework / project exercise for students. It's not particularly difficult.
Sure, a modern C compiler like GCC has to handle inline assembly, various extensions, pragmas, intrinsics, etc... but like you said, all of those are thoroughly documented and have open source implementations to reference.
Similarly, the Rust compiler is implemented in Rust and could be used as an idiomatic reference for a generic compiler framework with input handling, parsing, intermediate representations, and so forth.
> Their compiler fails to compile (well, at least link) some C programs altogether, and in other cases it produces code that is 150,000x slower than a real C compiler with optimisations turned off
I would bet that those things are also true of at least one expensive commercial C compiler.
I'd love to hear of any currently available commercial C compiler that has that level of issues. I would bet you'll be hard-pressed to find one. C compilation is quite a thoroughly solved problem. In any case, please provide an example.
> Sorry where are we seeing that it failed?
Try it yourself.
I've been using Claude to make a project over the last few weeks. It's written ~70k LOC to solve a complex problem. I've found that it can get surprisingly far in a 1-shot, but about 90% of the work I've had it do (measured in time and tokens) is cleaning up the junk it outputs in its first pass. I'm finding my Claude sessions have a rhythm like this:
1. Plan and implement some new feature.
2. Perform a code review of what you just did. Fix obvious problems. Flag bugs, issues, poor factoring, messy abstractions, etc. Make a prioritised list of things to fix (then fix them).
3. (Later) fixes:
- Write tests for the code you wrote and fix the bugs you find.
- Run the code through memory leak checks, and fix bugs.
- Do a performance analysis using benchmarks and profiling tools, and make any high priority performance improvements.
- Read the whole program, looking for ways in which the code you've just written could fit in better with the rest of the program. Fix any issues.
- In directory X is the full documentation for the library you're using. Reread it then review the code you wrote. Are there better ways we could make use of the library?
And so on.
Claude's 1-shot output is often usable, but it's consistently chock full of problems. Bugs. Memory leaks. Bad factoring. Too many globals. Poor use of surrounding code. And so on. It's able to fix many of these problems itself if you prompt it right. (Though even then the code is often still pretty bad in many ways that seem obvious to me.)
At the moment I think I'm spending tokens at about a 1:9 ratio of feature work to polish. Maybe its 1-shot output is good enough quality for you. To me it's unacceptable. Maybe a few models down the line. But it's not there yet.
The ratio is an interesting way of thinking about it. I wonder how this compares to other SWEs at various levels of experience, replacing tokens for person-hours.
Why are you quoting from their marketing blog as if it's a reliable source?
https://github.com/anthropics/claudes-c-compiler/issues/1
> Apparently compiling hello world exactly as the README says to is an unfair expectation of the software.
Yeah I think people are really underestimating what LLMs can do even without specs.
As an example, I did an exploratory attempt to add custom software over some genuinely awful Windows software for a scientific imaging station with a proprietary industrial camera. Five days later Claude and I had figured out how to USB-pcap sample images, and it's been operationalized and running smoothly for months now. 100% of the code was written by Claude, and it's all clean (I reviewed it myself). Pretty much all I did was unstick it in a few places (e.g. "hey, based on the file sizes it looks like the images are being sent as a 16-bit format").
For day to day work, I'll often identify a bug, "hey, when I shift click on this graphical component, it's not doing the right thing". I go tell Claude to write a RED (failing) integration test, then make it pass.
Zero lines of code manually written. Only occasionally do I have to intervene and rearchitect. Usually this involves me writing about ten lines of scaffold code, explaining the architectural concept, and telling it to just go.
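For anyone unfamiliar with the red-then-green flow described above, here is a rough sketch. Everything in it is hypothetical: `BuggySelection`, `FixedSelection`, and the shift-click behaviour are stand-ins for illustration, not the actual component or test from that project.

```python
class BuggySelection:
    """Hypothetical selection state for a graphical component (before the fix)."""

    def __init__(self):
        self.items = []

    def click(self, item, shift=False):
        # Bug: the shift modifier is ignored, so every click replaces the selection.
        self.items = [item]


class FixedSelection(BuggySelection):
    """The same component after the fix: shift-click extends the selection."""

    def click(self, item, shift=False):
        if shift:
            if item not in self.items:
                self.items.append(item)
        else:
            self.items = [item]


def shift_click_extends_selection(selection_cls):
    # The check is written first, against the reported behaviour.
    # It should fail ("red") on the buggy code and pass ("green") after the fix.
    selection = selection_cls()
    selection.click("a")
    selection.click("b", shift=True)
    return selection.items == ["a", "b"]


print("before fix:", shift_click_extends_selection(BuggySelection))
print("after fix:", shift_click_extends_selection(FixedSelection))
```

The point of writing the failing check first is that it pins down the bug report in executable form, so "make it pass" is an unambiguous instruction.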
People both underestimate and overestimate what LLMs can do. LLMs have shown very different results when autonomously writing a small program for personal use and autonomously writing production software that needs to be evolved for years.
By "non-workable" I think people mean that it won't compile Hello World.
GCC has only like a billion man hours in it?
Assembler and linker are not part of a compiler. They are separate tools. They are also generally much simpler.