>overly tuning models just to specific, already published tests, rather than focusing on making them generalize.

I think you just described SATs and other standardized tests.

The SAT has a correlation with IQ of 0.82 to 0.86, and I do think IQ is very useful in judging intelligence.

https://gwern.net/doc/iq/high/smpy/2004-frey.pdf

It's a useful diagnostic when used in a battery of diagnostic tests of cognitive function, but to the point of this thread: it is notoriously not a good ranking mechanism.