Whenever somebody makes a benchmark, people complain that the benchmark results are meaningless because they’re gamed. I don’t know why those same people don’t understand that grading on vibes is strictly worse.

Depends on the benchmark.

If the questions are fixed, the benchmark is trivial to game.