> Sorry where are we seeing that it failed?

Try it yourself.

I've been using Claude to build a project over the last few weeks. It's written ~70k LOC to solve a complex problem. I've found that it can get surprisingly far in a 1-shot, but about 90% of the work I've had it do (measured in time and tokens) is cleaning up the junk it outputs in its first pass. I'm finding my Claude sessions have a rhythm like this:

1. Plan and implement some new feature.

2. Perform a code review of what you just did. Fix obvious problems. Flag bugs, issues, poor factoring, messy abstractions, etc. Make a prioritised list of things to fix (then fix them).

3. (Later) fixes:

- Write tests for the code you wrote and fix the bugs you find.

- Run the code through memory leak checks, and fix bugs.

- Do a performance analysis using benchmarks and profiling tools, and make any high priority performance improvements.

- Read the whole program, looking for ways in which the code you've just written could fit in better with the rest of the program. Fix any issues.

- In directory X is the full documentation for the library you're using. Reread it then review the code you wrote. Are there better ways we could make use of the library?

And so on.
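
If you wanted to script that rhythm rather than retype it, a rough sketch might look like the following. The `send_prompt` callback is hypothetical and stands in for however you actually talk to the model; the prompt strings are shortened paraphrases of the list above, not a tested recipe.

```python
from typing import Callable, List

# Follow-up prompts, paraphrased from the list above (steps 2 and 3).
POLISH_PROMPTS: List[str] = [
    "Perform a code review of what you just did. Fix obvious problems, "
    "then make a prioritised list of remaining issues and fix them.",
    "Write tests for the code you wrote and fix the bugs you find.",
    "Run the code through memory leak checks, and fix bugs.",
    "Do a performance analysis using benchmarks and profiling tools, "
    "and make any high priority performance improvements.",
    "Read the whole program and make the new code fit in better with the rest.",
    "Reread the library documentation, then review your use of the library.",
]


def run_session(feature_prompt: str, send_prompt: Callable[[str], str]) -> List[str]:
    """Send the feature prompt first, then every polish prompt in order.

    `send_prompt` is a hypothetical callback: it takes one prompt and
    returns the model's reply as a string.
    """
    replies = [send_prompt(feature_prompt)]
    for prompt in POLISH_PROMPTS:
        replies.append(send_prompt(prompt))
    return replies
```

In practice the later passes depend on what the earlier ones turn up, so a fixed list like this is only a starting point.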

Claude's 1-shot output is often usable, but it's consistently chock-full of problems. Bugs. Memory leaks. Bad factoring. Too many globals. Poor use of surrounding code. And so on. It's able to fix many of these problems itself if you prompt it right. (Though even then the code is often still pretty bad in many ways that seem obvious to me.)

At the moment I think I'm spending tokens at about a 1:9 ratio of feature work to polish. Maybe its 1-shot output is good enough quality for you. To me it's unacceptable. Maybe a few models down the line. But it's not there yet.
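
To make the arithmetic concrete: a 1:9 ratio of feature work to polish is the same thing as the "about 90%" figure above. A small sketch, with made-up token counts purely for illustration (they are not measurements from my sessions):

```python
from fractions import Fraction


def feature_to_polish_ratio(feature_tokens: int, polish_tokens: int) -> Fraction:
    """Ratio of tokens spent on feature work to tokens spent on polish."""
    return Fraction(feature_tokens, polish_tokens)


def polish_share(feature_tokens: int, polish_tokens: int) -> float:
    """Fraction of all tokens that went to polish."""
    return polish_tokens / (feature_tokens + polish_tokens)


# Illustrative numbers only: 1,000 tokens of feature work, 9,000 of cleanup.
ratio = feature_to_polish_ratio(1_000, 9_000)
share = polish_share(1_000, 9_000)
print(f"feature:polish = {ratio.numerator}:{ratio.denominator}")  # feature:polish = 1:9
print(f"polish share = {share:.0%}")  # polish share = 90%
```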

The ratio is an interesting way of thinking about it. I wonder how this compares to other SWEs at various levels of experience, substituting person-hours for tokens.