But 80% sounds far from good enough: that's a 20% error rate, unusable for autonomous tasks. Why stop at 80%? If we're aiming for AGI, it should score 100% on any benchmark we give it.
I'm not sure the benchmark is high enough quality that >80% of problems are well-specified & have correct labels tbh. (But I guess this question has been studied for these benchmarks)
Are humans 100%?
If they are knowledgeable enough and pay attention, yes. Also, if they are given enough time for the task.
But the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse.
Actually faster and worse is a very common characterization of a LOT of automation.
That's true.
The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (e.g. a missing semicolon).
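To put rough numbers on the "breaks at any point" concern, here is a minimal sketch of how per-step reliability compounds over a chain of steps. It assumes every step must succeed, that steps fail independently, and that a benchmark-style score can stand in for per-step success rate. All three are simplifying assumptions, and the function name is made up for illustration.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps
    that each succeed with probability `per_step_success`."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Compare a few hypothetical per-step success rates and chain lengths.
    for p in (0.80, 0.95, 0.99):
        for n in (1, 5, 10, 20):
            chance = chain_success_probability(p, n)
            print(f"per-step {p:.0%}, {n:>2} steps: {chance:.1%} end-to-end")
```

Under these assumptions, 80% per step over a 10-step task works out to roughly 11% end-to-end. Real agents can retry and recover, so this is a pessimistic bound rather than a prediction.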
AI does have an interesting feature though: it tends to self-heal in a way, when given tool access and a feedback loop. The only problem is that self-healing can "heal" errors incorrectly, and then the final result will be wrong in hard-to-detect ways.
So the more such hidden bugs there are, the more unexpectedly the automations will behave.
I still don't trust current AI for anything beyond data parsing/classification/translation and very strict tool usage.
I don't believe in the safety and reliability of full-assistant/clawdbot usage at this time (it might be good enough by the end of the year, but then SWE-bench should be at 100%).