Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.
That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.
> students are evaluated by teachers with more knowledge and experience than them
This starts to break down in college, where the professors are often only slightly ahead at best. (They have more knowledge and experience, but in a slightly different area, so it isn't relevant to the depth of whatever is under consideration.) Grad school is about advancing the state of the art; if you don't know more than your professor, you are doing it wrong.
> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.
How is this remotely true? You can have verifiable tasks that you can’t do yourself. Where does this idea come from??
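To make the "verifiable but can't do it yourself" point concrete, here is a minimal sketch using subset sum as the example task. The choice of task and the name `verify_subset_sum` are illustrative assumptions, not anything from the parent comment. Finding a subset that hits the target is hard in general, but checking a claimed answer takes a few lines and no cleverness:

```python
from collections import Counter


def verify_subset_sum(numbers, target, claimed_subset):
    """Check a claimed subset-sum answer without knowing how to find one.

    Hypothetical helper for illustration only.
    """
    available = Counter(numbers)
    used = Counter(claimed_subset)
    # Every claimed element must come from the input, respecting multiplicity.
    if any(used[x] > available[x] for x in used):
        return False
    # The claimed elements must add up to the target.
    return sum(claimed_subset) == target


numbers = [3, 34, 4, 12, 5, 2]
print(verify_subset_sum(numbers, 9, [4, 5]))     # valid answer
print(verify_subset_sum(numbers, 9, [3, 3, 3]))  # reuses 3, which appears only once
print(verify_subset_sum(numbers, 9, [4, 2]))     # wrong sum
```

The checker never searches for a solution; it only confirms or rejects one that someone else produced. Whether frontier-model evaluation looks more like this or more like grading an essay is the actual disagreement here.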