I know it messes up their eval scores, but to me this kind of cheating is a better demonstration of intelligence than just attempting the tasks algorithmically.

"Being lazy and not doing the assigned task is a sign of intelligence" has never made sense to me. Intelligent people who actually advance the state of the art -- what people claim to want from these frontier models -- exhibit active curiosity. They want to learn and grow and genuinely understand the right answer. I don't pretend to know what exactly could lead to "real" AGI, but I do know that this kind of reward hacking behavior isn't it. Indeed, this is the sort of behavior that in humans is considered a sign of being a good test taker: being very good at memorizing solutions and analyzing the setting and context of the questions to guess what the questioner might be looking for. Being a good test taker is useful in our society primarily because doing well on tests is used as a proxy for the thing we're actually looking for. We should be careful not to confuse the two.

Maybe true, but if you're using an LLM to do some real world work, do you want it to have some abstract notion of intelligence, or do you want it to actually do the job you assigned it?

I want it to not murder or oppress lots of people by mistake.

"AI, please cure cancer."

"Okay, all humans dead, technically a 100% cure."