Every frontier model is already at ~80% on that one. Crazy that we were at just ~50% with GPT-4o not that long ago.

But 80% sounds far from good enough: that's a 20% error rate, unusable for autonomous tasks. Why stop at 80%? If we aim for AGI, it should score 100% on any benchmark we give it.

I'm not sure the benchmark is high enough quality that >80% of the problems are well-specified & have correct labels, tbh. (But I guess this question has been studied for these benchmarks.)

Are humans 100%?

If they're knowledgeable enough, pay attention, and are given enough time for the task, yes.

But the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse.

Actually, faster and worse is a very common characterization of a LOT of automation.

That's true.

The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (e.g. a missing semicolon).

AI does have an interesting feature, though: it tends to self-heal when given tool access and a feedback loop. The only problem is that self-healing can incorrectly "heal" errors, and then the final result will be wrong in hard-to-detect ways.
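A minimal sketch of that feedback loop, assuming pytest as the verification signal; `ask_model_for_patch` and `apply_patch` are hypothetical stubs, not any real agent's API:

```python
import subprocess

MAX_ATTEMPTS = 5

def run_tests() -> tuple[bool, str]:
    """Run the test suite; return (passed, combined output)."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def ask_model_for_patch(task: str, feedback: str | None) -> str:
    """Hypothetical stub: an LLM call that proposes a patch,
    optionally conditioned on the previous failure output."""
    raise NotImplementedError("wire up your model/API here")

def apply_patch(patch: str) -> None:
    """Hypothetical stub: apply the proposed patch to the working tree."""
    raise NotImplementedError

def self_heal(task: str) -> str:
    patch = ask_model_for_patch(task, feedback=None)
    for _ in range(MAX_ATTEMPTS):
        apply_patch(patch)
        passed, output = run_tests()  # the feedback loop
        if passed:
            # "Healed" -- but only with respect to the tests. If the
            # tests are incomplete, the fix may still be wrong in
            # hard-to-detect ways.
            return patch
        # Feed the failure output back so the model can retry.
        patch = ask_model_for_patch(task, feedback=output)
    raise RuntimeError("could not converge on a passing patch")
```

The catch is visible in the loop itself: "success" is defined entirely by the feedback signal, so a weak test suite lets the loop converge on a wrong fix.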

So the more such hidden bugs there are, the more unpredictably the automations will perform.

I still don't trust current AI for any tasks beyond data parsing/classification/translation and very strict tool usage.

I don't believe the full-assistant/clawdbot usage is safe and reliable at this time (it might be good enough by the end of the year, but then SWE-bench should be at 100%).