My issue with AGI benchmarks is that you can never tell whether you're measuring actual capability or just how much the training data overlapped with the test set.