I don't understand what occasional hiccups prove. The models can pass college entrance exams in advanced subjects better than 99% of the human population, and because they occasionally have a shortcoming, that means they're worse than humans somehow? Those edge cases are quickly going from 1% -> 0.01% too...

"any human can instantly grok the right answer."

When you ask a human about general world knowledge, they don't have the generality to give good answers for 90% of it. Even on very basic questions like this one, humans will trip up on many, many more than the frontier LLMs do.