Those who fail to study history (or live through it) are doomed to repeat it.

SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.

I don't have a solution; I'm just noticing the pattern.

That's a slightly different problem. There's no such thing as saturation for a performance benchmark like SPEC; we can always conceive of a faster processor (even if we don't know how to build one). Saturation is the problem that once you're at (or near) a 100% pass rate on a test of pass/fail questions, there's no room for the score to keep going up, and the test has lost its power to discriminate between competing options.
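A toy sketch of that loss of discriminative power, with entirely made-up ability and difficulty numbers: two hypothetical models of different true strength both max out an easy (saturated) test, while a harder test still separates them.

```python
# Saturation sketch: a question passes iff ability >= difficulty.
# All abilities and difficulties below are invented for illustration.

def score(ability: float, difficulties: list[float]) -> float:
    """Fraction of pass/fail questions this ability level passes."""
    passed = sum(1 for d in difficulties if ability >= d)
    return passed / len(difficulties)

easy_test = [0.1, 0.2, 0.3, 0.4]   # saturated: everyone passes everything
hard_test = [0.3, 0.5, 0.7, 0.9]   # still has headroom

model_a, model_b = 0.6, 0.95       # model_b is genuinely stronger

print(score(model_a, easy_test), score(model_b, easy_test))  # 1.0 1.0 -- no signal
print(score(model_a, hard_test), score(model_b, hard_test))  # 0.5 1.0 -- signal
```

Once both scores pin at 1.0, the benchmark can only tell you "both passed", which is the retire-and-replace moment in the treadmill upthread.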

However, both kinds of tests are susceptible to over-fitting: an LLM can be trained on the exact test questions, and a CPU can be designed with, e.g., branch predictors and cache sizes tuned specifically to handle a particular benchmark or workload.

Maybe OP was thinking about compilers "cracking" certain SPEC benchmarks: implementing exactly the optimization needed to boost one benchmark by a large margin, even though that optimization will probably never apply to any other code out there (usually it's so targeted, and so risky on general C/C++ code, that it's intentionally kept from firing on anything else). That happened a couple of times over the years; I know about the Intel compiler cases, for example. I can certainly see LLM providers adding tricks that help a certain class of benchmarks but don't help much with anything else.

Intel's done it again recently, this time targeting Geekbench: https://www.intel.com/content/www/us/en/support/articles/000...

Both that and the SPEC compiler shenanigans are cheating by changing the test, not just over-specializing the product being benchmarked.