But why only a +0.5% increase for MMMU-Pro?

It's possibly label noise. But you can't tell from a single number.

You would need to check whether every model is making mistakes on the same 20% or on a different 20%. If it's the same 20%, either those questions are genuinely hard, or they're keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
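That overlap check is easy to sketch. A minimal version, assuming you have per-question correctness vectors for each model (the data and model names here are hypothetical, just to show the shape of the analysis):

```python
# Sketch: given per-question correctness for several models, find the
# questions that *every* model gets wrong -- candidates for bad labels
# or under-specified problems. The data below is illustrative, not real.

results = {
    # model name -> 0/1 correctness per question (hypothetical)
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "model_b": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
    "model_c": [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
}

n_questions = len(next(iter(results.values())))

# Questions missed by every model: "same 20%" vs "different 20%".
shared_misses = [
    q for q in range(n_questions)
    if all(scores[q] == 0 for scores in results.values())
]

# For each model, what fraction of its errors is shared by all models?
for name, scores in results.items():
    errors = [q for q in range(n_questions) if scores[q] == 0]
    overlap = len(set(errors) & set(shared_misses)) / len(errors)
    print(f"{name}: {overlap:.0%} of errors shared by all models")

print("questions to audit for bad labels:", shared_misses)
```

If the shared fraction is high, auditing just those questions by hand is cheap compared to the benchmark as a whole.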

It happens. The old non-Pro MMLU had a lot of wrong answers. Even simple things like MNIST have digits labeled incorrectly or drawn so badly it's not even a digit anymore.

Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.

But 80% sounds far from good enough; that's a 20% error rate, unusable in autonomous tasks. Why stop at 80%? If we aim for AGI, it should be able to 100% any benchmark we give it.

I'm not sure the benchmark is high enough quality that >80% of problems are well-specified & have correct labels tbh. (But I guess this question has been studied for these benchmarks)

Are humans 100%?

If they are knowledgeable enough and pay attention, yes. Also, if they are given enough time for the task.

But the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse.

Actually faster and worse is a very common characterization of a LOT of automation.

That's true.

The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (e.g. a missing semicolon).

AI does have an interesting feature though: it tends to self-heal, in a way, when given tool access and a feedback loop. The only problem is that self-healing can incorrectly "heal" errors, and then the final result will be wrong in hard-to-detect ways.
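That feedback loop boils down to run, check, feed the error back, retry. A minimal sketch, with the repair step stubbed out (in a real system it would be a model call; everything here is a hypothetical illustration):

```python
# Sketch of a run-check-repair loop: the "self-healing" pattern where
# an agent gets tool output fed back and retries. The repair step is a
# stub; a real agent would call a model with the error message.

def run_code(src: str):
    """The 'tool': try to execute the code, return (ok, error_message)."""
    try:
        exec(compile(src, "<agent>", "exec"), {})
        return True, ""
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

def repair(src: str, error: str) -> str:
    """Stubbed repair: 'fix' a known missing-parenthesis bug just to
    demonstrate the loop. A real repair step is where incorrect healing
    can sneak in -- code that runs but computes the wrong thing."""
    if "SyntaxError" in error:
        return src + ")"
    return src

def heal_loop(src: str, max_attempts: int = 3):
    for attempt in range(max_attempts):
        ok, err = run_code(src)
        if ok:
            return src, attempt
        src = repair(src, err)  # feedback loop: the error goes back in
    return src, max_attempts

fixed, attempts = heal_loop("print('hello'")  # missing ')'
```

The danger is visible even in this toy: a repair step that makes the error go away without understanding it produces code that runs, which is exactly the hard-to-detect failure mode.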

So the more such hidden bugs there are, the more unexpectedly the automations will perform.

I still don't trust current AI for anything beyond data parsing/classification/translation and very strict tool usage.

I don't believe in the safety and reliability of full-assistant/clawdbot usage at this time (it might be good enough by the end of the year, but then SWE-bench should be at 100%).